CN106778070A

CN106778070A - A method for predicting the subcellular location of human proteins

Info

Publication number: CN106778070A
Application number: CN201710204499.1A
Authority: CN
Inventors: 沈红斌; 周航
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2017-05-31

Abstract

The invention discloses a method for predicting the subcellular location of a human protein from its sequence. The method improves a human protein subcellular classification algorithm by exploiting correlations among Gene Ontology (GO) features and among conserved domain features. First, the protein's sequence residue statistical features (amino acid composition features and normalized position-specific scoring matrix features), conserved domain features, and GO features are obtained from the protein sequence. Second, the CFS feature selection method extracts a feature subset from the sequence residue statistical features, similarity measures are computed for the conserved domain features and the GO features, and a weighted KNN method converts these similarities into probability information. Finally, the resulting features are integrated and classified with an SVM classifier.

Description

A method for predicting the subcellular location of human proteins

Technical Field

The invention belongs to the field of bioinformatics and relates in particular to a method for predicting the subcellular location of human proteins.

Background

Knowing the subcellular location of a protein is important for understanding its function, its interactions with other proteins, and targeted drug therapy. However, determining subcellular locations experimentally is costly and time-consuming, so tools that can predict the locations of large numbers of proteins are valuable. According to our statistics, the SWISS-PROT protein database released in February 2016 contains 550,552 proteins, of which only 10.4% have experimentally verified subcellular locations; the remaining proteins, whose locations are unknown, urgently need a reliable prediction method.

Many tools for predicting protein subcellular location already exist. Commonly used web servers include BaCeLlo, YLoc, MultiLoc, GOASVM, WoLF PSORT, CellPLoc, and HSLPred. These tools have been a great convenience to biologists in related fields.

Protein subcellular location information is often used in gene therapy and targeted drug therapy for diseases. For example, the role of the Hippo/YAP pathway in the development of pediatric hepatocellular carcinoma has been studied by examining the expression and subcellular localization of the protein YAP in tumors. An easy-to-use, high-accuracy prediction tool would therefore greatly help such laboratories in their clinical research. Our previously released web server, Hum-mPLoc2.0, was designed specifically for predicting human protein localization, and its annual usage grew from 20,000 uses in 2010 to more than 80,000 in 2015. This shows the value of further improving prediction performance with new techniques and more comprehensive and accurate annotation databases, in order to provide a better prediction service.

Computational methods for predicting protein subcellular localization generally fall into two categories: methods based on homology search and methods based on machine learning. A homology search-based method can be regarded as a nearest-neighbor predictor in which the distance between two proteins is usually measured by their sequence homology. By computing the homology between the query protein and a large number of sequences with annotated subcellular locations, the method finds the top K most similar proteins and transfers their annotations to the query protein as the classification result. This approach is fairly direct, but its performance depends heavily on whether highly similar homologous sequences with annotated subcellular locations can be found. In addition, two protein sequences can be highly similar yet have very different structures or functions, in which case the method fails.

Machine learning-based predictors are a more flexible class of models for protein subcellular location prediction. They require a training data set, from which classification rules are learned by statistical learning algorithms. The quality of the learned rules is therefore closely tied to the quality of the training data. Because the subcellular location annotations in protein databases are becoming both more numerous and more reliable, large-scale training data can be collected to train classification models more thoroughly. Another important issue is how to encode protein sequences: most algorithms take feature vectors as input, so how features are extracted from the raw protein sequence and its associated existing knowledge is critical to the final performance of the classifier. Existing machine learning tools for predicting subcellular location use the following kinds of features:

(1) Residue-based statistical features, such as pseudo-amino acid composition and position-specific scoring matrices.

(2) Features based on signal peptides and functional domains.

(3) Features based on database annotations, such as Gene Ontology (GO) features.

Because GO features are high-level abstractions of domain knowledge, they usually give higher accuracy than features extracted from the sequence when enough annotations are available. However, the large volume of annotation data brings new algorithmic challenges. For example, applying a Bernoulli event model to each GO feature, that is, binary-encoding whether the GO feature is present, often yields an extremely high-dimensional feature space. As the GO database is regularly expanded and updated, the dimensionality keeps growing along with our knowledge of proteins. High-dimensional feature vectors increase the complexity of the machine learning process, and the potential noise in the annotation databases must also be considered. Although the GO database as a whole is huge, each protein actually carries only a few GO features. According to our statistics, proteins in the SWISS-PROT database that have at least one GO feature have 6 GO annotations on average. In other words, the GO feature vector of a protein is sparse: it has thousands of dimensions but only about 6 GO annotations. Different approaches to this problem have been proposed in the field. For example, YLoc selects only those GO annotations and PROSITE patterns that are clearly associated with specific subcellular locations. This removes unnecessary features and makes the results easier to interpret, but it also loses information. WegoLoc assigns a weight to each GO feature to emphasize the useful ones.
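To make the sparsity argument concrete, the following minimal sketch binary-encodes a protein's GO annotations over a vocabulary. The vocabulary size, the identifiers, and the function name `encode_go_binary` are placeholders chosen for illustration and are not taken from the invention.

```python
def encode_go_binary(annotations, vocabulary):
    """Return a 0/1 vector marking which vocabulary GO terms the protein has."""
    present = set(annotations)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabulary of 5000 identifiers (placeholders, not real annotations).
vocabulary = [f"GO:{i:07d}" for i in range(5000)]

# A protein with six annotations, matching the average reported above.
protein_terms = ["GO:0000003", "GO:0000012", "GO:0000150",
                 "GO:0001200", "GO:0003000", "GO:0004999"]

vector = encode_go_binary(protein_terms, vocabulary)
density = sum(vector) / len(vector)  # fraction of non-zero entries
```

With six annotations in a 5000-dimensional vector, only 0.12% of the entries are non-zero, which is why a similarity measure that looks beyond exact term matches is attractive.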

Summary of the Invention

The invention provides a method for predicting the subcellular location of human proteins. Its purpose is to improve the prediction accuracy of a human protein subcellular classifier by exploiting the latent correlations among annotation features.

A method for predicting the subcellular location of a human protein from the human protein sequence, comprising the following steps:

Step 1: From the human protein sequence, extract residue statistical features for several sequence fragments: the full-length sequence, the N-terminus, and the C-terminus. These features comprise amino acid composition features and position-specific scoring matrix (PSSM) features derived from protein homology information, the latter being normalized. After the two feature types are combined, apply Correlation-based Feature Selection (CFS), a supervised feature selection algorithm, to reduce the dimensionality;

Step 2: Extract the GO features of all human proteins in the protein database and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);

Step 3: Search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of those homologs; obtain the GO features of the training-set proteins in the same way;

Step 4: Divide the three parts of a protein's GO features (BP, MF, CC) into seven parts by forming singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);

Step 5: Using the correlations among GO features, compute the correlation between two proteins separately in each of the seven parts; with parameters optimized, select the ten most correlated proteins in the training set and apply a weighted KNN method to obtain the protein's probability at each subcellular location;

Step 6: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with RPS-BLAST and compute the correlation between features from their information difference, yielding a conserved domain feature similarity matrix; then obtain the conserved domain features of the target protein with RPS-BLAST, compute the correlation between two proteins, and, with parameters optimized, select the ten most correlated proteins in the training set and apply the weighted KNN method to obtain the protein's probability at each subcellular location;

Step 7: Fuse the sequence features, the probability features from the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build SVM classifiers that predict 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.
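The amino acid composition part of Step 1 can be sketched as follows. This is an illustrative outline only: the fragment lengths `n_len=30` and `c_len=50` are assumed values picked from within the ranges the method considers, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(segment):
    """Return the 20 relative frequencies of the standard amino acids."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in segment:
        if residue in counts:  # non-standard residues are ignored
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def segment_features(sequence, n_len=30, c_len=50):
    """Concatenate composition vectors of the full, N-terminal and C-terminal parts."""
    full = amino_acid_composition(sequence)
    n_term = amino_acid_composition(sequence[:n_len])
    c_term = amino_acid_composition(sequence[-c_len:])
    return full + n_term + c_term


# A short illustrative sequence.
features = segment_features("MSAVGAATPYLHHPGDSHSGRVSFLGAQLPPEVAAMARLLGDLDRSTFRKLL")
```

Each fragment contributes 20 values, so three fragments give a 60-dimensional vector before feature selection.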

S101: From the human protein sequence, extract the amino acid composition features and the normalized PSSM features of the full-length sequence, of N-terminal fragments 10 to 60 residues long, and of C-terminal fragments 10 to 100 residues long, and reduce their dimensionality with CFS. For each fragment, the PSSM is normalized and converted into a 20-dimensional feature as follows, where $S_{i,j}$ is the score for the amino acid at position $i$ ($1 \le i \le L$) of the sequence evolving into amino acid type $j$ ($1 \le j \le 20$), and $L$ is the length of the protein sequence:

$$S^{0}_{i,j} = \frac{S_{i,j} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}}{\sqrt{\frac{1}{N-1}\sum_{u=1}^{N}\left(S_{i,u} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}\right)^{2}}}, \tag{2}$$

$S^{0}_{i,j}$ is the normalized score of the position-specific scoring matrix, and $N$ is the number of amino acid types, so $N = 20$ in formula (2).

$$\overline{S^{0}_{j}} = \frac{1}{L}\sum_{i=1}^{L} S^{0}_{i,j}, \tag{3}$$

where $\overline{S^{0}_{j}}$ is the average of the normalized scores in column $j$;

$$\overline{S_{PSSM}} = \left[\overline{S^{0}_{1}}, \overline{S^{0}_{2}}, \overline{S^{0}_{3}}, \ldots, \overline{S^{0}_{20}}\right], \tag{4}$$

$\overline{S_{PSSM}}$ is the resulting normalized PSSM feature.
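The normalization in S101 amounts to standardizing each row of the PSSM over its 20 scores and then averaging each column over the sequence positions. The sketch below illustrates this on a toy matrix. The toy values and the handling of a constant row are assumptions, and a real PSSM would come from a profile search rather than from the arithmetic pattern used here.

```python
import math


def normalize_pssm(pssm):
    """Standardize each row over its N = 20 scores, as in formula (2)."""
    normalized = []
    for row in pssm:
        n = len(row)
        mean = sum(row) / n
        std = math.sqrt(sum((s - mean) ** 2 for s in row) / (n - 1))
        # A constant row has zero spread; mapping it to zeros is an assumption.
        normalized.append([(s - mean) / std if std > 0 else 0.0 for s in row])
    return normalized


def pssm_feature(pssm):
    """Average each normalized column over the L rows to get 20 values."""
    norm = normalize_pssm(pssm)
    length = len(norm)
    return [sum(row[j] for row in norm) / length for j in range(len(norm[0]))]


# Toy 3 x 20 matrix with made-up integer scores (not a real PSSM).
toy_pssm = [[((i * 7 + j * 3) % 11) - 5 for j in range(20)] for i in range(3)]
feature = pssm_feature(toy_pssm)
```

Because every row is centered before averaging, the 20 column averages always sum to zero, so the feature captures relative preferences among amino acid types rather than absolute score levels.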

S102: Extract the GO features of all human proteins in the Swiss-Prot database and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);

S103: Search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the training-set proteins in the same way;
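One way to picture S103 is to parse BLAST hits and pool the GO terms of the retained homologs. The sketch below assumes the hits were saved in BLAST's 12-column tabular format, with the subject identifier in column 2 and the E-value in column 11. The E-value cutoff, the accessions, and the `go_annotations` mapping are hypothetical, since the source does not specify them.

```python
def parse_blast_tabular(text, max_evalue=1e-3):
    """Return subject IDs of tabular BLAST hits at or below an E-value cutoff."""
    hits = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 12:  # skip malformed lines
            continue
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.append(subject)
    return hits


def transfer_go_terms(hits, go_annotations):
    """Pool the GO terms annotated to the retained homologs."""
    terms = set()
    for accession in hits:
        terms.update(go_annotations.get(accession, ()))
    return sorted(terms)


# Made-up hit table and annotation mapping, for illustration only.
blast_output = (
    "query1\tP00001\t98.5\t200\t3\t0\t1\t200\t1\t200\t2e-50\t400\n"
    "query1\tP00002\t30.1\t80\t50\t4\t10\t90\t15\t95\t0.5\t35.0\n"
)
go_annotations = {
    "P00001": ["GO:0005634", "GO:0003677"],
    "P00002": ["GO:0005737"],
}

hits = parse_blast_tabular(blast_output)
query_terms = transfer_go_terms(hits, go_annotations)
```

Here the second hit is discarded by the assumed cutoff, so only the terms of the first homolog are transferred to the query.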

S104: Divide the three parts of a protein's GO features (BP, MF, CC) into seven parts by forming singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
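The seven parts in S104 are the non-empty combinations of the three GO aspects, which can be enumerated as below. Pooling a protein's terms by set union within a combined part is an assumption made for this sketch; the source does not say how terms are merged.

```python
from itertools import combinations

ASPECTS = ("BP", "MF", "CC")


def go_parts(aspects=ASPECTS):
    """List the singletons, pairs and triple of the GO aspects."""
    parts = []
    for size in range(1, len(aspects) + 1):
        for combo in combinations(aspects, size):
            parts.append("&".join(combo))
    return parts


def split_terms(terms_by_aspect):
    """Pool a protein's GO terms for each of the seven parts (union is assumed)."""
    result = {}
    for part in go_parts():
        pooled = set()
        for aspect in part.split("&"):
            pooled.update(terms_by_aspect.get(aspect, ()))
        result[part] = pooled
    return result


parts = go_parts()
# Placeholder term names, not real GO identifiers.
example = split_terms({"BP": {"bp_term"}, "MF": {"mf_term"}, "CC": {"cc_term"}})
```

`go_parts()` yields the parts in the same order as the text: BP, MF, CC, BP&MF, BP&CC, MF&CC, BP&MF&CC.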

S105: Using the correlations among GO features, compute the correlation between two proteins separately in each of the seven parts:

$$\mathrm{Cor}(x_i, K) = \max_{1 \le j \le m} \mathrm{Cor}(x_i, y_j), \tag{5}$$

where $\mathrm{Cor}(x_i, K)$ is the correlation, within the part under consideration, between the GO annotation $x_i$ and the $K$-th protein, taken as the largest correlation between $x_i$ and the GO annotations $y_j$ of that protein.

$$\mathrm{Sim}_{K} = \sqrt{\sum_{i=1}^{n} \mathrm{Cor}(x_i, K)^{2}}, \tag{6}$$

where $\mathrm{Sim}_{K}$ is the correlation between the $K$-th protein in the training set and the protein being predicted.

After the correlations between the query protein and all proteins in the training set have been obtained, the ten most correlated training proteins are selected and a weighted KNN method is applied to obtain the protein's probability at each subcellular location:

$$\mathrm{pro}_{a} = \frac{\sum_{j \in I_{N_a}} \mathrm{Sim}_{j} + \frac{\mathrm{num}_{a}}{\mathrm{num}}}{\sum_{i \in I_{N}} \mathrm{Sim}_{i} + 1}, \tag{7}$$

where $\mathrm{num}_{a}$ and $\mathrm{num}$ are, respectively, the number of training-set proteins at the $a$-th subcellular location and the total number of proteins in the training set, and $\mathrm{pro}_{a}$ is the probability that the query protein is at the $a$-th subcellular location.
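The computation in S105 can be sketched for a single GO part as follows. Several points are assumptions: the term-to-term similarities are held in a plain dictionary standing in for a GOSSTO matrix, identical terms are given similarity 1, and the index sets in the probability formula are read as the selected neighbors and the subset of them annotated with the location in question. The toy terms and locations are made up.

```python
import math


def term_similarity(x, y, sim_matrix):
    """Look up a symmetric term similarity; identical terms score 1 (assumed)."""
    if x == y:
        return 1.0
    return sim_matrix.get((x, y), sim_matrix.get((y, x), 0.0))


def protein_similarity(query_terms, train_terms, sim_matrix):
    """Best match per query term, combined as a Euclidean norm."""
    if not train_terms:
        return 0.0
    total = 0.0
    for x in query_terms:
        cor = max(term_similarity(x, y, sim_matrix) for y in train_terms)
        total += cor ** 2
    return math.sqrt(total)


def weighted_knn(query_terms, training_set, sim_matrix, locations, k=10):
    """Similarity-weighted vote over the k most similar training proteins."""
    scored = [(protein_similarity(query_terms, terms, sim_matrix), locs)
              for terms, locs in training_set]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    sim_total = sum(sim for sim, _ in neighbors)
    num = len(training_set)
    probabilities = {}
    for loc in locations:
        num_a = sum(1 for _, locs in training_set if loc in locs)
        sim_a = sum(sim for sim, locs in neighbors if loc in locs)
        probabilities[loc] = (sim_a + num_a / num) / (sim_total + 1)
    return probabilities


# Toy data: placeholder terms t1..t3 and two locations.
sim_matrix = {("t1", "t2"): 0.8, ("t1", "t3"): 0.1}
training_set = [({"t2"}, {"nucleus"}),
                ({"t3"}, {"cytoplasm"}),
                ({"t1"}, {"nucleus", "cytoplasm"})]
probs = weighted_knn({"t1"}, training_set, sim_matrix,
                     ["nucleus", "cytoplasm"], k=2)
```

The prior term `num_a / num` and the `+ 1` in the denominator keep the value defined even when no neighbor has any similarity to the query, in which case the estimate falls back to the class frequency in the training set.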

S106: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with RPS-BLAST, and compute the correlation between features from their information difference:

$$H(f_i^{cdd}) = -\sum_{m \in \{0,1\}} p(f_i^{cdd} = m) \times \log p(f_i^{cdd} = m), \tag{9}$$

$$S_{i,j}^{cdd} = \frac{2 \times \left(H(f_i^{cdd}) + H(f_j^{cdd}) - H(f_i^{cdd}, f_j^{cdd})\right)}{H(f_i^{cdd}) + H(f_j^{cdd})}. \tag{11}$$

Here $H(f_i^{cdd})$ is the entropy of the $i$-th CDD feature, and $p(f_i^{cdd} = 1)$ is the probability that the $i$-th CDD feature is present in the protein training set. $H(f_i^{cdd}, f_j^{cdd})$ is the joint entropy of the $i$-th and $j$-th features, and $S_{i,j}^{cdd}$ is the correlation between the $i$-th and $j$-th CDD features.

This yields the conserved domain feature similarity matrix. The conserved domain features of the target protein are then obtained with RPS-BLAST and used to compute the correlation between two proteins, and the ten most correlated proteins in the training set are selected for the weighted KNN method, giving the protein's probability at each subcellular location;
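The entropy-based correlation between two conserved domain features in S106 can be illustrated with presence/absence columns over a set of training proteins. The toy columns are invented, and returning 0 when both features are constant is an assumption. The logarithm base cancels in the ratio, so the natural logarithm is used.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of discrete outcomes."""
    total = len(values)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(values).values())


def cdd_similarity(feature_i, feature_j):
    """Compute 2 * (H(i) + H(j) - H(i, j)) / (H(i) + H(j)) for two 0/1 columns."""
    h_i, h_j = entropy(feature_i), entropy(feature_j)
    h_ij = entropy(list(zip(feature_i, feature_j)))  # joint entropy
    if h_i + h_j == 0:
        return 0.0  # both features constant; 0 is an assumed convention
    return 2 * (h_i + h_j - h_ij) / (h_i + h_j)


# Presence (1) or absence (0) of three hypothetical domains in four proteins.
domain_a = [1, 1, 0, 0]
domain_b = [1, 1, 0, 0]  # always co-occurs with domain_a
domain_c = [1, 0, 1, 0]  # statistically independent of domain_a

same = cdd_similarity(domain_a, domain_b)
independent = cdd_similarity(domain_a, domain_c)
```

Domains that always co-occur score 1 and independent domains score 0, so the similarity matrix rewards domain pairs that tend to appear in the same proteins.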

S107: Fuse the sequence features, the probability features from the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the protein's subcellular locations and its probability at each location.
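The data preparation behind S107 can be sketched as feature concatenation followed by a Binary Relevance decomposition of the labels, with one binary target per location. The SVM training itself is left out. The layout of 12 probabilities per GO part and the toy feature values are assumptions, and the function names are hypothetical.

```python
LOCATIONS = ["centrosome", "cytoplasm", "cytoskeleton", "endoplasmic reticulum",
             "endosome", "secretory pathway", "Golgi apparatus", "lysosome",
             "mitochondrion", "nucleus", "peroxisome", "cell membrane"]


def fuse_features(sequence_features, go_probabilities, cdd_probabilities):
    """Concatenate sequence features with GO and conserved-domain probabilities."""
    fused = list(sequence_features)
    for part_probabilities in go_probabilities:  # one list per GO part
        fused.extend(part_probabilities)
    fused.extend(cdd_probabilities)
    return fused


def binary_relevance_targets(label_sets, locations=LOCATIONS):
    """Turn multi-label annotations into one 0/1 target list per location."""
    return {loc: [1 if loc in labels else 0 for labels in label_sets]
            for loc in locations}


# Toy inputs: 5 selected sequence features, 7 GO parts x 12 locations, 12 CDD values.
sequence_features = [0.1] * 5
go_probabilities = [[0.0] * 12 for _ in range(7)]
cdd_probabilities = [0.0] * 12
fused = fuse_features(sequence_features, go_probabilities, cdd_probabilities)

label_sets = [{"nucleus", "cytoplasm"}, {"lysosome"}]
targets = binary_relevance_targets(label_sets)
```

Each of the 12 target lists would then train its own binary classifier on the same fused vectors, which is what lets a protein be assigned to more than one location.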

In the present invention, feature vectors are encoded using the correlation information among GO terms rather than the frequency of GO terms in annotations. GO features fall into three parts: biological process (BP), molecular function (MF), and cellular component (CC), each of which is organized hierarchically. Based on this hierarchy, many methods have been proposed for defining the semantic similarity between GO features, such as methods based on information entropy and methods based on graph theory. To our knowledge, however, few existing algorithms for predicting protein subcellular location take the correlations among GO features into account. This motivated us to use the hidden correlations among GO features to obtain a better similarity measure between two high-dimensional but sparse GO feature vectors, and we propose a new method that exploits the hidden correlations among a protein's annotation features. To handle query proteins that lack GO annotations because the GO database is incomplete, we also incorporate statistical features of protein sequence residues and peptide-based functional domain features extracted from the Conserved Domain Database (CDD). The resulting predictor is called Hum-mPLoc3.0; it is named after our earlier predictor of human protein localization but uses an entirely new feature representation.

Compared with existing methods in the field, the present invention has the following significant advantages:

(1) The model exploits the latent correlations among annotation features, which effectively improves the accuracy of human protein subcellular location prediction;

(2) Sequence residue statistical features, conserved domain features, and GO features are integrated, which effectively improves the accuracy of human protein subcellular location prediction.

Brief Description of the Drawings

Fig. 1 is a system structure diagram of the human protein prediction method of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawing.

Fig. 1 shows the system structure of the human protein prediction method of the present invention:

First, the sequence residue statistical features, conserved domain features, and GO features of a protein are obtained from its sequence. Second, the CFS feature selection method extracts a feature subset from the sequence residue statistical features, similarity measures are computed separately for the conserved domain features and the GO features, and a weighted KNN method is used to compute probability information. The resulting features are then integrated and classified with an SVM classifier. The details are as follows:

Example:

An input sequence is given as follows:

>query protein 1; example of multiple subcellular locations
MSAVGAATPYLHHPGDSHSGRVSFLGAQLPPEVAAMARLLGDLDRSTFRKLLKFVVSSLQGEDCREAVQRLGVSANLPEEQLGALLAGMHTLLQQALRLPPTSLKPDTFRDQLQELCIPQDLVGDLASVVFGSQRPLLDSVAQQQGAWLPHVADFRWRVDVAISTSALARSLQPSVLMQLKLSDGSAYRFEVPTAKFQELRYSVALVLKEMADLEKRCERRLQD

This is the query sequence; the output of software implementing the method of the present invention is as follows:

The results show that the method effectively and accurately predicts the subcellular location of this human protein.

The above embodiment does not limit the present invention in any way; any technical solution obtained by equivalent replacement or equivalent transformation falls within the scope of protection of the present invention.

Claims

1. A method for predicting the subcellular location of a human protein from the human protein sequence, characterized in that it comprises the following steps:

Step 1: From the human protein sequence, extract residue statistical features for several sequence fragments: the full-length sequence, the N-terminus, and the C-terminus. These features comprise amino acid composition features and position-specific scoring matrix (PSSM) features derived from protein homology information, the latter being normalized. After the two feature types are combined, apply Correlation-based Feature Selection (CFS), a supervised feature selection algorithm, to reduce the dimensionality;

Step 2: Extract the GO features of all human proteins in the protein database and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);

Step 3: Search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of those homologs; obtain the GO features of the training-set proteins in the same way;

Step 4: Divide the three parts of a protein's GO features (BP, MF, CC) into seven parts by forming singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);

Step 5: Using the correlations among GO features, compute the correlation between two proteins separately in each of the seven parts; with parameters optimized, select the ten most correlated proteins in the training set and apply a weighted KNN method to obtain the protein's probability at each subcellular location;

Step 6: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with RPS-BLAST and compute the correlation between features from their information difference, yielding a conserved domain feature similarity matrix; then obtain the conserved domain features of the target protein with RPS-BLAST, compute the correlation between two proteins, and, with parameters optimized, select the ten most correlated proteins in the training set and apply the weighted KNN method to obtain the protein's probability at each subcellular location;

Step 7: Fuse the sequence features, the probability features from the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build SVM classifiers that predict 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.

2. A method for predicting the subcellular location of a human protein from the human protein sequence, characterized in that it comprises the following steps:

S101: From the human protein sequence, extract the amino acid composition features and the normalized PSSM features of the full-length sequence, of N-terminal fragments 10 to 60 residues long, and of C-terminal fragments 10 to 100 residues long, and reduce their dimensionality with CFS. For each fragment, the PSSM is normalized and converted into a 20-dimensional feature as follows, where $S_{i,j}$ is the score for the amino acid at position $i$ ($1 \le i \le L$) of the sequence evolving into amino acid type $j$ ($1 \le j \le 20$), and $L$ is the length of the protein sequence:

$$S^{0}_{i,j} = \frac{S_{i,j} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}}{\sqrt{\frac{1}{N-1}\sum_{u=1}^{N}\left(S_{i,u} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}\right)^{2}}}, \tag{2}$$

$S^{0}_{i,j}$ is the normalized score of the position-specific scoring matrix, and $N$ is the number of amino acid types; in formula (2), $N = 20$,

$$\overline{S^{0}_{j}} = \frac{1}{L}\sum_{i=1}^{L} S^{0}_{i,j}, \tag{3}$$

where $\overline{S^{0}_{j}}$ is the average of the normalized scores in column $j$;

$$\overline{S_{PSSM}} = \left[\overline{S^{0}_{1}}, \overline{S^{0}_{2}}, \overline{S^{0}_{3}}, \ldots, \overline{S^{0}_{20}}\right], \tag{4}$$

$\overline{S_{PSSM}}$ is the normalized PSSM feature;

S102: Extract the GO features of all human proteins in the Swiss-Prot database and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);

S103: Search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the training-set proteins in the same way;

S104: Divide the three parts of a protein's GO features (BP, MF, CC) into seven parts by forming singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);

S105: Using the correlations among GO features, compute the correlation between two proteins separately in each of the seven parts:

$$\mathrm{Cor}(x_i, K) = \max_{1 \le j \le m} \mathrm{Cor}(x_i, y_j), \tag{5}$$

where $\mathrm{Cor}(x_i, K)$ is the correlation, within the part under consideration, between the GO annotation $x_i$ and the $K$-th protein, taken as the largest correlation between $x_i$ and the GO annotations $y_j$ of that protein,

$$\mathrm{Sim}_{K} = \sqrt{\sum_{i=1}^{n} \mathrm{Cor}(x_i, K)^{2}}, \tag{6}$$

where $\mathrm{Sim}_{K}$ is the correlation between the $K$-th protein in the training set and the protein being predicted,

After the correlations between the query protein and all proteins in the training set have been obtained, the ten most correlated training proteins are selected and a weighted KNN method is applied to obtain the protein's probability at each subcellular location:

$$\mathrm{pro}_{a} = \frac{\sum_{j \in I_{N_a}} \mathrm{Sim}_{j} + \frac{\mathrm{num}_{a}}{\mathrm{num}}}{\sum_{i \in I_{N}} \mathrm{Sim}_{i} + 1}, \tag{7}$$

where $\mathrm{num}_{a}$ and $\mathrm{num}$ are, respectively, the number of training-set proteins at the $a$-th subcellular location and the total number of proteins in the training set, and $\mathrm{pro}_{a}$ is the probability that the query protein is at the $a$-th subcellular location.

S106: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with RPS-BLAST, and compute the correlation between features from their information difference:

$$H(f_i^{cdd}) = -\sum_{m \in \{0,1\}} p(f_i^{cdd} = m) \times \log p(f_i^{cdd} = m), \tag{9}$$

$$S_{i,j}^{cdd} = \frac{2 \times \left(H(f_i^{cdd}) + H(f_j^{cdd}) - H(f_i^{cdd}, f_j^{cdd})\right)}{H(f_i^{cdd}) + H(f_j^{cdd})}. \tag{11}$$

where $H(f_i^{cdd})$ is the entropy of the $i$-th CDD feature, and $p(f_i^{cdd} = 1)$ is the probability that the $i$-th CDD feature is present in the protein training set. $H(f_i^{cdd}, f_j^{cdd})$ is the joint entropy of the $i$-th and $j$-th features, and $S_{i,j}^{cdd}$ is the correlation between the $i$-th and $j$-th CDD features,

This yields the conserved domain feature similarity matrix. The conserved domain features of the target protein are then obtained with RPS-BLAST and used to compute the correlation between two proteins, and the ten most correlated proteins in the training set are selected for the weighted KNN method, giving the protein's probability at each subcellular location;

S107: Fuse the sequence features, the probability features from the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the protein's subcellular locations and its probability at each location.