CN103955628A - Subspace fusion-based protein-vitamin binding site prediction method - Google Patents

Subspace fusion-based protein-vitamin binding site prediction method

Info

Publication number: CN103955628A (granted publication: CN103955628B)
Application number: CN201410164632.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 胡俊 (Hu Jun), 於东军 (Yu Dongjun), 何雪 (He Xue), 李阳 (Li Yang), 沈红斌 (Shen Hongbin), 杨静宇 (Yang Jingyu)
Applicant and assignee: Nanjing University of Science and Technology
Legal status: Granted; Expired - Fee Related

Classification: Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a subspace fusion-based protein-vitamin binding site prediction method, comprising: feature extraction and feature combination, in which the evolutionary information, secondary-structure information, and binding-propensity information of a protein are extracted with PSI-BLAST, PSIPRED, and a protein-vitamin binding-site propensity table respectively, and each amino acid residue of the protein sequence is converted into a vector representation by a sliding window and serial combination; applying several feature selection algorithms to the original feature space, the feature subset obtained by each selection forming one feature subspace, so that multiple feature subspaces are constructed; training one SVM classifier on each feature subspace; fusing the trained SVM classifiers by weighted averaging; and predicting the protein-vitamin binding sites of a query protein with the fused SVM predictor. The prediction method of the invention is both fast and accurate.

Description

Subspace Fusion-Based Protein-Vitamin Binding Site Prediction Method

Technical Field

The invention relates to the field of protein-vitamin interactions in bioinformatics, and in particular to a subspace fusion-based protein-vitamin binding site prediction method.

Background

Interactions between proteins and vitamins play a vital role in metabolism and are ubiquitous and indispensable in life activities. Determining protein-vitamin binding sites by biological experiments is time-consuming, expensive, and inefficient. With the rapid development of sequencing technology and the continuous advancement of structural genomics, proteomics has accumulated a large number of protein sequences whose vitamin binding sites have not been annotated. There is therefore an urgent need to develop intelligent bioinformatics methods that can predict protein-vitamin binding sites quickly and accurately directly from the protein sequence, which is also of great significance for discovering and understanding protein structure and physiological function.

At present, computational models for predicting protein-vitamin binding sites are scarce. Only one computational model specifically designed for this task has been found: VitaPred, the world's first predictor dedicated to protein-vitamin binding site localization (B. Panwar, S. Gupta, and G. P. S. Raghava, "Prediction of vitamin interacting residues in a vitamin binding protein using evolutionary information," BMC Bioinformatics, vol. 14, Feb. 7, 2013). VitaPred predicts binding between proteins and different kinds of vitamins (vitamin A, vitamin B, vitamin B6, etc.). Because different vitamin types differ from one another, VitaPred was built on four non-redundant data sets: 187 vitamin-binding proteins (without distinguishing vitamin types), 31 vitamin A-binding proteins, 141 vitamin B-binding proteins, and 71 vitamin B6-binding proteins. VitaPred extracts the evolutionary information encoded in the position-specific scoring matrix of each amino acid residue and feeds it into an SVM classification model to decide whether the residue is a protein-vitamin binding site. The VitaPred paper also tried combining other features and feature combinations with SVMs to predict protein-vitamin binding sites, but none matched the prediction accuracy and scalability of the evolutionary-information features combined with an SVM, so the VitaPred method represents the prediction approach that combines evolutionary information features with an SVM.

However, a comprehensive analysis of this only existing model shows that it applies the same method in the same feature subspace to the prediction of binding sites between proteins and every kind of vitamin, which leads to poor interpretability that remains to be overcome; moreover, its prediction accuracy is still far from what practical applications require and urgently needs to be improved.

Summary of the Invention

To overcome the above drawbacks, namely that mutually exclusive features within a single high-dimensional feature space lead to prediction accuracy far from practical requirements and to poor interpretability, the object of the invention is to propose a subspace fusion-based protein-vitamin binding site prediction method with fast prediction speed and high prediction accuracy.

To achieve the above object, the technical scheme adopted by the invention is as follows:

A subspace fusion-based protein-vitamin binding site prediction method comprises the following steps:

Step 1, feature extraction and feature combination: the evolutionary information features and secondary-structure information features of a protein are extracted with the PSI-BLAST and PSIPRED algorithms respectively, and its binding-propensity features are extracted from a protein-vitamin binding-site propensity table; these three kinds of features form the original feature space; a sliding window and serial combination are then used to convert each amino acid residue of the protein sequence into a vector representation;

Step 2, three feature selection algorithms, namely Joint Laplacian Feature Weights Learning, Fisher Score, and Laplacian Score, are each applied to the original feature space; the feature subset obtained by each selection forms one feature subspace, so that multiple feature subspaces are constructed;

Step 3, an SVM classifier is trained on each feature subspace obtained in step 2;

Step 4, the trained SVM classifiers are fused with a weighted-average classifier fusion scheme; and

Step 5, the protein-vitamin binding sites of a query protein are predicted with the fused SVM predictor.

In a further embodiment, in step 1, feature extraction and serial combination for a training protein comprise the following steps:

Step 1-1. For a protein of l amino acid residues, its position-specific scoring matrix (PSSM) is obtained with the PSI-BLAST algorithm. The PSSM is an l-row, 20-column matrix PSSM = (p_i,j), 1 ≤ i ≤ l, 1 ≤ j ≤ 20 (1), which expresses the primary-structure (i.e. evolutionary) information of the protein in matrix form,

where A, C, ..., Y denote the 20 amino acid types, and p_i,j denotes the likelihood that the i-th residue of the protein mutates into the j-th of the 20 amino acid types during evolution.

Each value in the PSSM is then normalized row by row with the logistic function of formula (2):

f(x) = 1 / (1 + e^(-x))   (2)

The normalized PSSM is given by formula (3), with entries p_i,j^normalized = f(p_i,j).

A sliding window of size W is then used to extract the feature matrix of each amino acid residue, as in formula (4).

Finally, the feature matrix of formula (4) is flattened row-first into a feature vector of dimension 20*W:

f_i = (p_i,1^normalized, p_i,2^normalized, …, p_i,20W^normalized)^T   (5)
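Step 1-1 can be sketched as follows (a minimal Python/NumPy sketch; the toy matrix sizes, the centred odd-sized window, and the zero-padding at sequence ends are assumptions, since the patent does not say how boundary residues are handled):

```python
import numpy as np

def normalize_pssm(pssm):
    """Row-by-row logistic normalization of a PSSM, formula (2)."""
    return 1.0 / (1.0 + np.exp(-pssm))

def window_features(matrix, i, w):
    """Flatten a size-w window centred on residue i (0-based) row-first,
    giving a vector of length w * matrix.shape[1], as in formulas (4)-(5).
    Positions outside the sequence are zero-padded (an assumption)."""
    l, cols = matrix.shape
    half = w // 2
    rows = []
    for j in range(i - half, i + half + 1):
        rows.append(matrix[j] if 0 <= j < l else np.zeros(cols))
    return np.concatenate(rows)

# toy PSSM for a 5-residue protein (20 columns in practice; 4 here for brevity)
pssm = np.arange(20, dtype=float).reshape(5, 4)
norm = normalize_pssm(pssm)
f2 = window_features(norm, 2, 3)   # feature vector of residue 2 with W = 3
```

The same `window_features` helper applies unchanged to the secondary-structure and binding-propensity matrices of steps 1-2 and 1-3.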

Step 1-2. For a protein of l amino acid residues, its secondary-structure probability matrix is obtained with PSIPRED. The matrix has l rows and 3 columns, as shown in formula (6),

where C, H, and E denote the three protein secondary-structure states coil, helix, and strand; s_i,1 denotes the probability that the secondary structure of the i-th residue of the protein is coil, s_i,2 the probability that it is helix, and s_i,3 the probability that it is strand.

Then, using the sliding-window extraction and row-first combination of step 1-1, a feature vector of dimension 3*W is obtained for each residue, as shown in formula (7):

f_i = (s_i,1, s_i,2, …, s_i,3W)^T   (7)

Step 1-3. For a protein of l amino acid residues, a matrix containing its binding-propensity information is obtained by looking up the protein-vitamin binding-site propensity table. The matrix has l rows and 1 column, as shown in formula (8):

(b_1, …, b_i, …, b_l)^T   (8)

where b_i denotes the propensity of the i-th residue of the protein to bind a vitamin.

Then, using the sliding-window extraction and row-first combination of step 1-1, a feature vector of dimension 1*W is obtained for each residue, as shown in formula (9):

f_i = (b_i,1, b_i,2, …, b_i,W)^T   (9)

Step 1-4. The three feature vectors obtained above are serially combined into a feature vector of length 20*W + 3*W + 1*W.
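The serial combination of step 1-4 is a plain concatenation of the three per-residue vectors; a minimal sketch (W = 15 is an arbitrary illustrative choice, not a value stated in the patent):

```python
import numpy as np

W = 15                      # hypothetical window size; the patent leaves W open
f_evo = np.zeros(20 * W)    # evolutionary feature vector, formula (5)
f_ss = np.zeros(3 * W)      # secondary-structure feature vector, formula (7)
f_bp = np.zeros(1 * W)      # binding-propensity feature vector, formula (9)
f = np.concatenate([f_evo, f_ss, f_bp])   # final residue feature, length 24*W
```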

In a further embodiment, in step 2, constructing multiple feature subspaces with the three feature selection algorithms comprises the following steps:

Step 2-1. Feature selection is performed on the original feature space produced in step 1 with the Joint Laplacian Feature Weights Learning algorithm, which comprises:

1) For the data X = [x_1, x_2, …, x_M] ∈ R^(N×M) in the original feature space, the Laplacian matrix H_(M×M) and the diagonal matrix D_(M×M) are constructed with formulas (10) and (11):

D_ii = Σ_j H_ij,  1 ≤ i ≤ M and 1 ≤ j ≤ M   (11)

where R^(N×M) denotes the size of the matrix X, i.e. X contains M elements with N-dimensional features; N is the feature dimensionality and M is the number of samples, i.e. the number of amino acid residues.

2) The generalized eigenvalue problem Hy = λDy is solved for the matrices H_(M×M) and D_(M×M) obtained above, yielding the eigenvector y corresponding to the largest eigenvalue below 1.

3) Using the eigenvector y obtained above, the weight of each feature dimension is updated according to formula (12) until convergence:

w_i^(t+1) ← (2/3)·w_i^t + (1/3)·w_i^t · (2Xy + 4εw^t)_i / (2XX^T w^t + 4εw^t (w^t)^T w^t)_i,  1 ≤ i ≤ N   (12)

where w = [w_1, w_2, …, w_i, …, w_N] holds the weight of each feature dimension, T denotes matrix transposition, t is the iteration count, and ε is a relaxation term controlling the number of zero elements in w.

4) From the weight vector w = [w_1, w_2, …, w_i, …, w_N] obtained above, the feature dimensions whose weight components w_i are greater than zero are selected; the feature subspace formed by all selected feature dimensions is output, together with the number of feature dimensions in that subspace.
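Steps 1) to 3) above can be sketched as follows. This is an illustrative Python/NumPy sketch only: formula (10) for the affinity matrix H is not reproduced in the text, so a Gaussian affinity with a mean-distance bandwidth is assumed here, the toy data and ε are invented, and only a single update step of formula (12) is shown (the patent iterates to convergence):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 30                         # feature dims, sample count
X = rng.standard_normal((N, M))       # columns are samples (residues)

# affinity H (Gaussian kernel assumed; formula (10) is not shown in the text)
# and degree matrix D per formula (11)
d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # pairwise sq. dists
H = np.exp(-d2 / d2.mean())
D = np.diag(H.sum(axis=1))

# generalized eigenproblem H y = lambda D y: since D is diagonal, solve
# eig(D^-1 H) and keep the eigenvector of the largest eigenvalue below 1
vals, vecs = np.linalg.eig(np.linalg.solve(D, H))
vals, vecs = vals.real, vecs.real
below_one = vals < 1.0 - 1e-9
y = vecs[:, below_one][:, np.argmax(vals[below_one])]

# one multiplicative update step of formula (12); iterate until convergence
eps = 1e-3                            # relaxation term controlling sparsity
w = np.full(N, 1.0 / N)               # initial feature weights
num = 2 * X @ y + 4 * eps * w
den = 2 * X @ X.T @ w + 4 * eps * w * (w @ w)
w_new = (2.0 / 3.0) * w + (1.0 / 3.0) * w * num / den
```

After convergence, the dimensions with strictly positive weight would form the selected subspace, as in step 4).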

Step 2-2. Feature selection is performed on the original feature space produced in step 1 with the Fisher Score algorithm, which comprises:

1) For a sample space containing c classes, where the i-th class has a sample set of M^(i) feature vectors with known class labels (a sample here is one amino acid residue of a protein), the mean u_n^(i) and variance (σ_n^(i))^2 of each feature dimension of each class are computed according to formulas (13) and (14):

u_n^(i) = (1/M^(i)) Σ_{j=1..M^(i)} x_jn^(i),  1 ≤ n ≤ N and 1 ≤ i ≤ c   (13)

(σ_n^(i))^2 = (1/M^(i)) Σ_{j=1..M^(i)} (x_jn^(i) - u_n^(i))^2,  1 ≤ n ≤ N and 1 ≤ i ≤ c   (14)

2) Using all the means u_n^(i) and variances (σ_n^(i))^2 computed above, the Fisher Score of each feature dimension is computed according to formula (15):

H_n = [Σ_{i=1..c} M^(i) (u_n^(i) - u_n)^2] / [Σ_{i=1..c} M^(i) (σ_n^(i))^2],  1 ≤ n ≤ N   (15)

where u_n denotes the mean of the n-th feature dimension over all data and H_n denotes the Fisher Score of the n-th feature dimension; each of the N feature dimensions has one Fisher Score.

Formula (15) thus yields a Fisher Score vector H = [H_1, H_2, …, H_n, …, H_N].

3) The values of the Fisher Score vector H = [H_1, H_2, …, H_n, …, H_N] are sorted in descending order, and the sample features corresponding to the largest Fisher Score values are selected; the feature subspace formed by all selected features is output, the number of features retained being the subspace size determined in step 2-1.
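The Fisher Score computation of formulas (13)-(15) can be sketched as follows (illustrative Python/NumPy; the toy data and the class shift on feature 0 are invented for the example):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher Score per feature dimension, formulas (13)-(15).
    X: (M, N) samples-by-features; y: (M,) class labels."""
    overall_mean = X.mean(axis=0)            # u_n over all data
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        Mc = Xc.shape[0]                     # M^(i)
        num += Mc * (Xc.mean(axis=0) - overall_mean) ** 2   # formula (13)
        den += Mc * Xc.var(axis=0)           # formula (14), 1/M^(i) variance
    return num / den                         # formula (15)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = np.repeat([0, 1], 50)
X[y == 1, 0] += 5.0          # make feature 0 strongly discriminative
scores = fisher_score(X, y)
top = np.argsort(scores)[::-1]   # descending; keep the top ones as the subspace
```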

Step 2-3. Feature selection is performed on the original feature space produced in step 1 with the Laplacian Score algorithm, which comprises:

1) For the data X = [x_1, x_2, …, x_M] ∈ R^(N×M) in the original feature space, the Laplacian matrix H_(M×M) and the diagonal matrix D_(M×M) are constructed with formulas (16) and (17):

D_ii = Σ_j H_ij,  1 ≤ i ≤ M and 1 ≤ j ≤ M   (17)

where R^(N×M) denotes the size of the matrix X, i.e. X contains M elements with N-dimensional features; N is the feature dimensionality, M is the number of samples (amino acid residues), and σ is a Gaussian parameter: formula (16) computes the kernel-space distance between two samples, i.e. two amino acid residues, and σ controls the width of the kernel.

2) Using the matrices H_(M×M) and D_(M×M) constructed above, the Laplacian Score of each feature dimension is computed according to formula (18):

L_n = [Σ_{i=1..M} Σ_{j=1..M} (x_in - x_jn)^2 H_ij] / [Σ_{i=1..M} (x_in - x̄_n)^2 D_ii],  1 ≤ n ≤ N   (18)

where x_in denotes the value of the n-th feature of the i-th sample and x̄_n denotes the mean of the n-th feature over all samples; L_n denotes the Laplacian Score of the n-th feature dimension, and each of the N feature dimensions has one. Formula (18) thus yields a Laplacian Score vector L = [L_1, L_2, …, L_n, …, L_N].

3) The values of the Laplacian Score vector L = [L_1, L_2, …, L_n, …, L_N] are sorted in descending order, and the sample features corresponding to the largest Laplacian Score values are selected; the feature subspace formed by all selected features is output, the number of features retained being the subspace size determined in step 2-1.
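The Laplacian Score of formulas (16)-(18) can be sketched as follows (illustrative Python/NumPy; the exact form of the affinity in formula (16) is not reproduced in the text, so the Gaussian H_ij = exp(-||x_i - x_j||^2 / σ^2) is assumed, and the toy data are invented):

```python
import numpy as np

def laplacian_score(X, sigma=1.0):
    """Laplacian Score per feature dimension, formulas (16)-(18).
    X: (M, N) samples-by-features; sigma: assumed Gaussian width."""
    M, N = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / sigma ** 2)        # assumed form of formula (16)
    D = H.sum(axis=1)                   # diagonal of D, formula (17)
    scores = np.empty(N)
    for n in range(N):
        f = X[:, n]
        num = ((f[:, None] - f[None, :]) ** 2 * H).sum()
        den = ((f - f.mean()) ** 2 * D).sum()
        scores[n] = num / den           # formula (18)
    return scores

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
scores = laplacian_score(X)
order = np.argsort(scores)[::-1]   # the patent keeps the largest scores
```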

In a further embodiment, in step 3, an SVM predictor is trained for each feature subspace, according to the distribution of the original samples in that subspace, with the SVC classification algorithm of LIBSVM; three different SVM predictors are thus trained on the three feature subspaces.

In a further embodiment, in step 4, the SVM predictors of the three different feature subspaces trained in step 3 are fused with a weighted-average method, which comprises:

Let ω_1 and ω_2 denote the binding-site class and the non-binding-site class respectively, and let S_1, S_2, and S_3 denote the SVM predictors of the three feature subspaces. An evaluation sample set of M_eva amino acid residues with known class labels is used to determine the weight of the SVM model of each subspace. For each sample feature x_i, the predictors S_1, S_2, and S_3 output three 2-dimensional vectors (s_1,1(x_i), s_1,2(x_i))^T, (s_2,1(x_i), s_2,2(x_i))^T, and (s_3,1(x_i), s_3,2(x_i))^T; the two elements of each vector give the degrees to which x_i belongs to ω_1 and ω_2, and they sum to 1. Over the evaluation sample set, the prediction result matrices on S_1, S_2, and S_3 are therefore:

R_i = ( s_i,1(x_1)  s_i,2(x_1); s_i,1(x_2)  s_i,2(x_2); …; s_i,1(x_M_eva)  s_i,2(x_M_eva) )^T,  i = 1, 2, 3   (19)

First, the target result matrix is constructed from the true class labels of the evaluation samples:

R_true = ( p_1  1-p_1; p_2  1-p_2; …; p_M_eva  1-p_M_eva )^T,  where p_i = 1 if y_i = ω_1 and p_i = 0 otherwise   (20)

Second, the error of the SVM classifier of each feature subspace is computed:

E_i = ||R_true - R_i||_2^2,  i = 1, 2, 3   (21)

Third, the weight of each subspace SVM predictor is constructed from its prediction error on the evaluation set:

w_i = (M_eva - E_i) / Σ_{k=1..3} (M_eva - E_k),  i = 1, 2, 3   (22)

where M_eva denotes the error obtained when every sample is misclassified.

Finally, the SVM predictors of the different subspaces are integrated with the weights computed on the evaluation sample set:

S = Σ_{i=1..3} w_i · S_i   (23)

which gives the fused SVM predictor of formula (23).
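The evaluation-set weighting of formulas (20)-(23) can be sketched as follows (illustrative Python/NumPy; the three predictor output matrices are synthetic stand-ins for the subspace SVMs, with the first made deliberately more accurate):

```python
import numpy as np

def fuse_weights(R_list, y_true):
    """Weights for subspace predictors from evaluation error, formulas (20)-(22).
    R_list: list of (M_eva, 2) probability matrices, columns (P(site), P(non-site)).
    y_true: (M_eva,) boolean, True for binding-site residues."""
    M_eva = y_true.shape[0]
    p = y_true.astype(float)
    R_true = np.stack([p, 1 - p], axis=1)               # formula (20)
    E = [np.sum((R_true - R) ** 2) for R in R_list]     # formula (21)
    w = np.array([M_eva - e for e in E])
    return w / w.sum()                                  # formula (22)

rng = np.random.default_rng(3)
y = rng.random(50) < 0.3
# three hypothetical subspace predictors on the evaluation set
R_good = np.stack([y.astype(float), 1 - y.astype(float)], axis=1) * 0.9 + 0.05
R_mid = np.clip(R_good + rng.normal(0, 0.2, R_good.shape), 0, 1)
R_bad = rng.random((50, 2))
w = fuse_weights([R_good, R_mid, R_bad], y)
fused = w[0] * R_good + w[1] * R_mid + w[2] * R_bad     # formula (23)
```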

In a further embodiment, in step 5, the protein-vitamin binding sites of the query protein are predicted with the fused SVM predictor:

For each amino acid residue of the query protein, its features in the original feature space are generated according to step 1; the three feature selection algorithms of step 2 are then applied to the original features to produce three subspace feature vectors; these are fed into the three SVM predictors S_1, S_2, and S_3 of step 3, yielding three prediction results in the form of vitamin-binding probabilities; the three results are passed through the SVM predictor integrated by the weighted-average method of step 4, which outputs the probability that the residue does or does not bind a vitamin. Finally, the threshold T that maximizes the Matthews correlation coefficient is used as the decision criterion: every residue whose binding probability is greater than or equal to T is predicted to be a binding residue, and every other residue, i.e. one whose binding probability is below T, is predicted to be a non-binding residue, where T ∈ [0, 1].
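Threshold selection by maximizing the Matthews correlation coefficient can be sketched as follows (illustrative Python/NumPy; the grid search over T and the toy labelled data are assumptions, since the patent does not describe how the maximizing T is found):

```python
import numpy as np

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if den == 0 else (tp * tn - fp * fn) / den

def best_threshold(probs, labels, grid=None):
    """Pick the T in [0, 1] maximizing MCC on a labelled evaluation set.
    probs: fused P(binding) per residue; labels: 1 = binding site."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)   # assumed search grid
    best_t, best_m = 0.5, -2.0
    for t in grid:
        pred = probs >= t
        tp = int(np.sum(pred & (labels == 1)))
        tn = int(np.sum(~pred & (labels == 0)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
t, m = best_threshold(probs, labels)   # residues with prob >= t are predicted bound
```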

It can be seen from the above technical scheme that the beneficial effects of the invention are:

1. Improved training speed, prediction speed, and prediction accuracy: the feature-selection-based subspace fusion technique builds more compact feature subspaces, effectively resolves the mutual exclusion among features, and reduces the dimensionality of the feature space, thereby improving training speed, prediction speed, and prediction accuracy;

2. Improved model interpretability: with subspace fusion, different feature subspaces are selected for predicting the binding sites of proteins with different kinds of vitamins, which better expresses the differences among these prediction problems and improves the interpretability of the model.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the principle of a subspace fusion-based protein-vitamin binding site prediction method according to an embodiment of the invention.

Detailed Description

For a better understanding of the technical content of the invention, specific embodiments are described below with reference to the accompanying drawing.

As shown in Fig. 1, in a preferred embodiment, the subspace fusion-based protein-vitamin binding site prediction method proceeds as follows. First, PSI-BLAST and PSIPRED are used to obtain the PSSM (i.e. evolutionary information) matrix and the secondary-structure probability matrix of the protein, and the binding-propensity matrix of the protein is generated from the protein-vitamin binding-site propensity table. Second, a sliding window and serial combination are used to build the feature vector of each amino acid residue from the PSSM matrix, the secondary-structure probability matrix, and the binding-propensity table. Then, three feature selection algorithms, Joint Laplacian Feature Weights Learning (algorithm 1), Fisher Score (algorithm 2), and Laplacian Score (algorithm 3), are used to build three feature subspaces whose features are non-exclusive within each space and complementary across spaces, and an SVM predictor is trained on each subspace. Finally, the SVM predictors are combined by a weighted-average ensemble into the final prediction model, which performs protein-vitamin binding site prediction.

A binding site is an amino acid residue that binds a vitamin.

The specific implementation of the above steps of this embodiment is described in detail below with reference to Fig. 1.

As an optional implementation, in step 1, feature extraction and serial combination for a training protein comprise the following steps:

Step 1-1. For a protein of l amino acid residues, its position-specific scoring matrix (PSSM) is obtained with the PSI-BLAST algorithm. The PSSM is an l-row, 20-column matrix PSSM = (p_i,j), 1 ≤ i ≤ l, 1 ≤ j ≤ 20 (1), which expresses the primary-structure (i.e. evolutionary) information of the protein in matrix form,

where A, C, ..., Y denote the 20 amino acid types, and p_i,j denotes the likelihood that the i-th residue of the protein mutates into the j-th of the 20 amino acid types (A, C, ..., Y) during evolution.

Each value in the PSSM is then normalized row by row with the logistic function of formula (2):

f(x) = 1 / (1 + e^(-x))   (2)

The normalized PSSM is given by formula (3), with entries p_i,j^normalized = f(p_i,j).

A sliding window of size W is then used to extract the feature matrix of each amino acid residue, as in formula (4).

Finally, the feature matrix of formula (4) is flattened row-first into a feature vector of dimension 20*W:

f_i = (p_i,1^normalized, p_i,2^normalized, …, p_i,20W^normalized)^T   (5)

Step 1-2. For a protein of l amino acid residues, its secondary-structure probability matrix is obtained with PSIPRED. The matrix has l rows and 3 columns, as shown in formula (6),

where C, H, and E denote the three protein secondary-structure states coil, helix, and strand; s_i,1 denotes the probability that the secondary structure of the i-th residue of the protein is coil, s_i,2 the probability that it is helix, and s_i,3 the probability that it is strand.

Then, using the sliding-window extraction and row-first combination of step 1-1, a feature vector of dimension 3*W is obtained for each residue, as shown in formula (7):

f_i = (s_i,1, s_i,2, …, s_i,3W)^T   (7)

Step 1-3. For a protein of l amino acid residues, a matrix containing its binding-propensity information is obtained by looking up the protein-vitamin binding-site propensity table. The matrix has l rows and 1 column, as shown in formula (8):

(b_1, …, b_i, …, b_l)^T   (8)

where b_i denotes the propensity of the i-th residue of the protein to bind a vitamin.

Then, using the sliding-window extraction and row-first combination of step 1-1, a feature vector of dimension 1*W is obtained for each residue, as shown in formula (9):

f_i = (b_i,1, b_i,2, …, b_i,W)^T   (9)

Step 1-4. The three feature vectors obtained above are serially combined into a feature vector of length 20*W + 3*W + 1*W.

As an optional embodiment, in step 2 the construction of multiple feature subspaces with the three feature selection algorithms comprises the following steps:

Step 2-1. Apply the Joint Laplacian Feature Weights Learning algorithm to the original feature space produced by step 1, which comprises:

1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (10) and (11) as follows:

D_{ii} = Σ_j H_{ij},  1 ≤ i ≤ M and 1 ≤ j ≤ M   (11)

where R^{N×M} gives the size of the matrix X, i.e. X has M elements each with N-dimensional features; N is the feature dimensionality and M is the number of samples, i.e. the number of amino acid residues;

2) Solve the generalized eigenvalue problem Hy = λDy for the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} obtained above, and take the eigenvector y corresponding to the largest eigenvalue below 1. (Hy = λDy always has the eigenvalue 1 with eigenvector y = [1, 1, ..., 1]^T, which is useless for feature selection; hence an eigenvalue smaller than 1 is needed, whose eigenvector is not [1, 1, ..., 1]^T.)

3) Using the eigenvector y obtained above, update the weight of each feature dimension according to formula (12) until convergence:

w_i^{t+1} ← (2/3) w_i^t + (1/3) w_i^t (2Xy + 4εw^t)_i / (2XX^T w^t + 4εw^t (w^t)^T w^t)_i,  1 ≤ i ≤ N   (12)

where w = [w_1, w_2, ..., w_i, ..., w_N] holds the weight of each feature dimension, T denotes matrix transposition, t is the iteration index, and ε is a slack term controlling the number of zero elements in w (formula (12) is iterative: t marks the t-th iteration, and w takes different values at different iterations);

4) From the weight vector w = [w_1, w_2, ..., w_i, ..., w_N] obtained above, select the sample feature dimensions corresponding to all weight components w_i greater than zero (each w_i being one component of w); finally, output the feature subspace formed by all selected feature dimensions, together with the number of feature dimensions in that subspace;
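The multiplicative weight update of formula (12) can be sketched as below (a non-authoritative sketch: the Laplacian construction of formulas (10)-(11) and the generalized eigenproblem are assumed already solved, so the eigenvector y is taken as given, and a fixed iteration count stands in for the convergence test of the text):

```python
import numpy as np

def jlfwl_weights(X, y, eps=1.0, iters=200):
    """Multiplicative update of formula (12) for Joint Laplacian Feature Weights
    Learning. X: N x M data (features x samples); y: eigenvector of Hy = lambda*Dy,
    taken as given here. A fixed iteration count replaces the convergence test
    (assumption)."""
    N = X.shape[0]
    w = np.ones(N)                                # positive initialization
    for _ in range(iters):
        num = 2.0 * X @ y + 4.0 * eps * w         # (2Xy + 4*eps*w^t)_i
        den = 2.0 * (X @ (X.T @ w)) + 4.0 * eps * w * (w @ w)  # (2XX^T w^t + 4*eps*w^t (w^t)^T w^t)_i
        w = (2.0 / 3.0) * w + (1.0 / 3.0) * w * num / den
    return w

# toy data: N=3 feature dimensions, M=2 samples, nonnegative so w stays positive
X = np.array([[1., 0.], [0., 1.], [2., 1.]])
y = np.array([1., 1.])
w = jlfwl_weights(X, y)
selected = np.flatnonzero(w > 0)                  # dimensions kept, as in sub-step 4)
print(selected.size)
```

The count of selected dimensions (`selected.size`) is what sub-step 4) passes on to steps 2-2 and 2-3.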

Step 2-2. Apply the Fisher Score algorithm to the original feature space produced by step 1, which comprises:

1) For a sample space with c classes, let M^{(i)} denote the number of samples of the i-th class; each sample is a feature vector with a class label, and a sample here means one amino acid residue of a protein. Compute, per formulas (13) and (14), the mean u_n^{(i)} and variance (σ_n^{(i)})^2 of each feature dimension of each class of data. (It is worth noting that a sample in the original sample set represents one concrete object; in this embodiment, protein-vitamin binding-site prediction, a sample is one amino acid residue of a protein, i.e. one sample is one element.)

u_n^{(i)} = (1/M^{(i)}) Σ_{j=1}^{M^{(i)}} x_{jn}^{(i)},  1 ≤ n ≤ N and 1 ≤ i ≤ c   (13)

(σ_n^{(i)})^2 = (1/M^{(i)}) Σ_{j=1}^{M^{(i)}} (x_{jn}^{(i)} - u_n^{(i)})^2,  1 ≤ n ≤ N and 1 ≤ i ≤ c   (14)

2) Using all the means and variances computed above, compute the Fisher Score of each feature dimension per formula (15):

H_n = Σ_{i=1}^{c} M^{(i)} (u_n^{(i)} - u_n)^2 / Σ_{i=1}^{c} M^{(i)} (σ_n^{(i)})^2,  1 ≤ n ≤ N   (15)

where u_n is the mean of the n-th feature dimension over all data and H_n is the Fisher Score of the n-th feature dimension; each of the N feature dimensions has one Fisher Score;

Formula (15) thus yields a Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N];

3) Sort the values of the Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N] in descending order, select the sample features corresponding to the top-ranked Fisher Scores, and output the feature subspace formed by all selected features; the number of features to keep is determined by step 2-1 (it is output in sub-step 4) of step 2-1 above);
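Formulas (13)-(15) amount to the following computation (a sketch; the toy two-class data are illustrative only):

```python
import numpy as np

def fisher_scores(X, labels):
    """Fisher Score per feature dimension, formulas (13)-(15).
    X: M x N (samples x features); labels: length-M class ids."""
    u = X.mean(axis=0)                     # overall per-feature mean u_n
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        Mc = len(Xc)                       # class size M^(i)
        num += Mc * (Xc.mean(axis=0) - u) ** 2
        den += Mc * Xc.var(axis=0)         # biased variance, 1/M^(i), as in (14)
    return num / den

# toy residues: feature 0 separates the two classes, feature 1 does not
X = np.array([[0.0, 5.0], [0.1, 6.0], [1.0, 5.5], [1.1, 4.5]])
H = fisher_scores(X, np.array([0, 0, 1, 1]))
print(H.argmax())  # 0: the discriminative dimension gets the larger score
```

Sorting `H` in descending order and keeping the top-ranked dimensions (the count coming from step 2-1) gives the second feature subspace.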

Step 2-3. Apply the Laplacian Score algorithm to the original feature space produced by step 1, which comprises:

1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (16) and (17) as follows:

D_{ii} = Σ_j H_{ij},  1 ≤ i ≤ M and 1 ≤ j ≤ M   (17)

where R^{N×M} gives the size of the matrix X, i.e. X has M elements each with N-dimensional features; N is the feature dimensionality, M is the number of samples (amino acid residues), and σ is a Gaussian parameter: formula (16) computes the kernel-space distance between two samples, i.e. two amino acid residues, with σ controlling the width of the kernel space;

2) Using the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} constructed above, compute the Laplacian Score of each feature dimension per formula (18):

L_n = Σ_{i=1}^{M} Σ_{j=1}^{M} (x_{in} - x_{jn})^2 H_{ij} / Σ_{i=1}^{M} (x_{in} - x̄_n)^2 D_{ii},  1 ≤ n ≤ N   (18)

where x_{in} is the value of the n-th feature dimension of the i-th sample and x̄_n is the mean of the n-th feature dimension over all samples; L_n is the Laplacian Score of the n-th feature dimension, each of the N feature dimensions has one, and formula (18) finally yields a Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N];

3) Sort the values of the Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N] in descending order, select the sample features corresponding to the top-ranked Laplacian Scores, and output the feature subspace formed by all selected features; the number of features to keep is determined by the aforesaid step 2-1 (it is output in sub-step 4) of step 2-1).
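A sketch of formulas (17)-(18); since formula (16) is not reproduced in the text, the usual Gaussian affinity H_ij = exp(-||x_i - x_j||^2 / σ^2) is assumed for the kernel-space distance:

```python
import numpy as np

def laplacian_scores(X, sigma=1.0):
    """Laplacian Score per feature dimension, formulas (17)-(18).
    X: M x N (samples x features). Assumption: the affinity of formula (16) is
    the Gaussian kernel H_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    H = np.exp(-d2 / sigma ** 2)
    D = H.sum(axis=1)                      # D_ii = sum_j H_ij, formula (17)
    M, N = X.shape
    L = np.empty(N)
    for n in range(N):
        xn = X[:, n]
        num = ((xn[:, None] - xn[None, :]) ** 2 * H).sum()
        den = ((xn - xn.mean()) ** 2 * D).sum()
        L[n] = num / den
    return L

X = np.array([[0.0, 1.0], [0.1, 1.2], [5.0, 3.0], [5.1, 2.9]])
L = laplacian_scores(X, sigma=2.0)
print(L.shape)  # (2,): one score per feature dimension
```

As with the Fisher Score, the top-ranked dimensions (count taken from step 2-1) form the third feature subspace.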

Since the Fisher Score and Laplacian Score algorithms cannot themselves decide how many feature dimensions to select, this embodiment relies on the algorithm of step 2-1, which determines that number autonomously.

As an optional embodiment, in step 3, according to the distribution of the aforesaid original samples in each feature subspace, one subspace SVM predictor is trained with the SVC classification algorithm of LIBSVM; three different SVM predictors are thus finally trained on the three feature subspaces.

In a further embodiment, in step 4, the SVM predictors of the three different feature subspaces trained in step 3 are fused by weighted averaging, which comprises:

Let ω_1 and ω_2 denote the binding-site class and the non-binding-site class respectively, and let S_1, S_2 and S_3 denote the SVM predictors of the three different feature subspaces. An evaluation sample set {(x_i, y_i)}, i = 1, ..., M_eva, whose amino acid residues are of known class, is used to determine the weight of the SVM model of each subspace. For the sample feature represented by each x_i, S_1, S_2 and S_3 output three 2-dimensional vectors (s_{1,1}(x_i), s_{1,2}(x_i))^T, (s_{2,1}(x_i), s_{2,2}(x_i))^T and (s_{3,1}(x_i), s_{3,2}(x_i))^T; the two elements of each 2-dimensional vector give the degree to which x_i belongs to ω_1 and ω_2, and they sum to 1. Over the evaluation sample set, the prediction result matrices on S_1, S_2 and S_3 are therefore:

R_i = ( s_{i,1}(x_1) s_{i,2}(x_1) ; s_{i,1}(x_2) s_{i,2}(x_2) ; ... ; s_{i,1}(x_{M_eva}) s_{i,2}(x_{M_eva}) )^T,  i = 1, 2, 3   (19)

First, construct the target result matrix from the true classes of the evaluation samples:

R_true = ( p_1 1-p_1 ; p_2 1-p_2 ; ... ; p_i 1-p_i ; ... ; p_{M_eva} 1-p_{M_eva} )^T,  where p_i = 1 if y_i = ω_1 and p_i = 0 otherwise   (20)

Second, compute the error of the SVM classifier of each feature subspace:

E_i = ||R_true - R_i||_2^2,  i = 1, 2, 3   (21)

Third, construct the weights of the different subspace SVM predictors from the prediction error of each subspace SVM predictor on the evaluation set:

w_i = (M_eva - E_i) / Σ_{k=1}^{3} (M_eva - E_k),  i = 1, 2, 3   (22)

where M_eva denotes the error incurred when every sample is misclassified;

Finally, integrate the SVM predictors of the different subspaces with the weights computed on the evaluation sample set:

S = Σ_{i=1}^{3} w_i · S_i   (23)

which yields the fused SVM predictor of formula (23).

In this embodiment, the evaluation sample set and the proteins to be predicted are two different sets: the classes of the residues of a protein to be predicted are unknown, whereas those of the evaluation sample set are known. The evaluation sample set is used here to determine the weight of the SVM model of each subspace, so in practical terms it is still part of the data used to build the model.
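The weighting and fusion of formulas (19)-(23) can be sketched as follows (a sketch on toy evaluation data; note that formula (22) yields positive weights only while each predictor's error E_i stays below the normalizing constant M_eva, which is assumed here):

```python
import numpy as np

def fuse_weights(R_list, R_true, M_eva):
    """Formulas (21)-(22): E_i = ||R_true - R_i||_2^2 and w_i proportional to (M_eva - E_i)."""
    E = np.array([((R_true - R) ** 2).sum() for R in R_list])
    return (M_eva - E) / (M_eva - E).sum()

def fused_predict(prob_list, w):
    """Formula (23): weighted average of the per-subspace probability outputs."""
    return sum(wi * p for wi, p in zip(w, prob_list))

# toy evaluation set of 4 residues; rows are (P(binding), P(non-binding))
R_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
R_list = [R_true * 0.9 + 0.05,            # accurate subspace predictor
          np.full((4, 2), 0.5),           # uninformative predictor
          0.65 - 0.3 * R_true]            # poor predictor
w = fuse_weights(R_list, R_true, M_eva=4)
fused = fused_predict(R_list, w)
print(w.argmax())  # 0: the most accurate predictor receives the largest weight
```

Because the weights sum to 1 and each predictor's two class probabilities sum to 1, the fused output is again a valid probability pair per residue.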

As an optional embodiment, in step 5 the fused SVM predictor performs protein-vitamin binding-site prediction on the protein to be predicted:

For each amino acid residue of the protein to be predicted, its features in the original feature space are generated per step 1; the three feature selection algorithms of step 2 then produce three subspace feature sets from those original features; the three subspace feature sets are fed to the three SVM predictors S_1, S_2 and S_3 of step 3, yielding three prediction results expressed as vitamin-binding probabilities; these three results are fed to the SVM predictor integrated by the weighted averaging of step 4, which outputs the probability that the residue does or does not bind a vitamin. Finally, the threshold T that maximizes the Matthews correlation coefficient is used as the decision criterion: every residue whose binding probability is greater than or equal to T is predicted to be a binding residue, and all other residues, i.e. those with binding probability below T, are predicted to be non-binding, where T ∈ [0, 1].
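The final thresholding step — choosing the T in [0, 1] that maximizes the Matthews correlation coefficient and calling every residue with binding probability ≥ T a binding residue — can be sketched as follows (the grid search over candidate thresholds is an assumption; the text does not say how T is searched):

```python
import numpy as np

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient of a binary confusion matrix."""
    den = float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return 0.0 if den == 0 else (tp * tn - fp * fn) / den

def best_threshold(probs, labels, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the T in [0, 1] maximizing MCC; residues with binding probability >= T
    are then predicted as binding residues, all others as non-binding."""
    scores = []
    for t in grid:
        pred = probs >= t
        tp = int(np.sum(pred & (labels == 1)))
        tn = int(np.sum(~pred & (labels == 0)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        scores.append(mcc(tp, tn, fp, fn))
    return grid[int(np.argmax(scores))]

probs = np.array([0.9, 0.8, 0.3, 0.2])     # fused binding probabilities
labels = np.array([1, 1, 0, 0])            # known classes of evaluation residues
T = best_threshold(probs, labels)
print(0.3 < T <= 0.8)  # True: any threshold in that range separates the toy classes
```

MCC is preferred here over accuracy because binding residues are a small minority class, and accuracy alone would favour trivially predicting every residue as non-binding.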

From the exemplary technical solution of the present invention above, the prediction method proposed in this embodiment is based on a protein's evolutionary information, secondary-structure information and binding-propensity information, and predicts protein-vitamin binding sites with a subspace-fusion technique built on multiple feature selection algorithms together with support vector machine (SVM) prediction. The PSI-BLAST algorithm (A. A. Schaffer et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res., vol. 29, pp. 2994-3005, 2001) generates the position-specific scoring matrix representing the protein's evolutionary information; the PSIPRED algorithm (D. T. Jones, "Protein secondary structure prediction based on position-specific scoring matrices," J Mol Biol, vol. 292, no. 2, pp. 195-202, Sep 17, 1999) extracts the protein's secondary-structure information; and the binding-propensity algorithm (D. Yu, J. Hu, J. Yang et al., "Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 4, pp. 994-1008, 2013) generates the protein's binding-propensity information. Multiple feature selection algorithms (H. Yan and J. Yang, "Joint Laplacian feature weights learning," Pattern Recognition, vol. 47, no. 3, pp. 1425-1432, 2014; C. Bishop, "Neural Networks for Pattern Recognition," Clarendon Press: Oxford, 1995) construct subspaces carrying complementary information; a weighted-average ensemble technique fuses the multiple predictors; and binding sites are finally decided by soft-classification thresholding. Compared with VitaPred, currently the only other such predictor, the method achieves higher prediction accuracy and better interpretability.

Prediction of the type-independent vitamin binding sites of protein 2ZZA_A is taken as an example below; the prediction results are shown in Table 1.

The amino acid sequence of protein 2ZZA_A is as follows:

>2ZZA_A

VIVSMIAALANNRVIGLDNKMPWHLPAELQLFKRATLGKPIVMGRNTFESIGRPLPGRLNIVLSRQTDYQPEGVTVVATLEDAVVAAGDVEELMIIGGATIYNQCLAAADRLYLTHIELTTEGDTWFPDYEQYNWQEIEHESYAADDKNPHNYRFSLLERVX

The protein has 19 vitamin binding sites in total.

First, per step 1, the PSI-BLAST algorithm, the PSIPRED algorithm and the protein-vitamin binding-site propensity table are used to extract the original features of each amino acid residue of protein 2ZZA_A. Next, the three feature selection algorithms of step 2, Joint Laplacian Feature Weights Learning (algorithm 1), Fisher Score (algorithm 2) and Laplacian Score (algorithm 3), perform subspace feature selection on those original features, forming three subspace feature sets. The three subspace feature sets are fed to the three SVM predictors S_1, S_2 and S_3 of step 3, yielding three prediction results expressed as vitamin-binding probabilities; these are fed to the SVM predictor integrated by the weighted averaging of step 4, giving the final prediction of the binding of protein 2ZZA_A to vitamins. The final prediction results are shown in Table 1:

Table 1. Comparison of the prediction results on 2ZZA_A between the method of this embodiment and the currently only existing protein-vitamin binding-site predictor

As Table 1 shows, the prediction method of this embodiment correctly predicts 15 vitamin binding sites, with 0 false-positive and 4 false-negative vitamin binding sites; the prediction results clearly outperform the only protein-vitamin binding-site predictor in the prior art.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. A person of ordinary skill in the art to which the present invention pertains may make various changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the present invention is therefore defined by the claims.

Claims (6)

1. A subspace-fusion-based protein-vitamin binding site prediction method, characterized in that it comprises the following steps:
Step 1, feature extraction and feature combination: use the PSI-BLAST algorithm and the PSIPRED algorithm respectively to extract a protein's evolutionary-information features and secondary-structure-information features, and extract the protein's binding-propensity features from a protein-vitamin binding-site propensity table, the aforesaid three kinds of features forming the original feature space; then convert each amino acid residue of the protein sequence into a vector representation using a sliding window and serial combination;
Step 2, use the feature selection algorithms, namely the Joint Laplacian Feature Weights Learning algorithm, the Fisher Score algorithm and the Laplacian Score algorithm, to perform feature selection on the original feature space several times; each feature subset obtained by a feature selection forms one feature subspace, thereby building multiple feature subspaces;
Step 3, for each feature subspace obtained in step 2, train one SVM classifier;
Step 4, fuse the multiple trained SVM classifiers by weighted-average classifier fusion; and
Step 5, perform protein-vitamin binding site prediction on the protein to be predicted with the fused SVM predictor.
2. The subspace-fusion-based protein-vitamin binding site prediction method according to claim 1, characterized in that, in said step 1, feature extraction and serial combination for a training protein comprise the following steps:
Step 1-1, for a protein composed of l amino acid residues, obtain its position-specific scoring matrix by the PSI-BLAST algorithm, this matrix having l rows and 20 columns, thereby converting the protein's primary-structure information into matrix form:
wherein A, C, ..., Y denote the 20 amino acid types, and p_{i,j} denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid types during evolution;
then normalize each value in the PSSM row by row using formula (2):
f(x) = 1 / (1 + e^{-x})   (2)
the normalized PSSM being as in formula (3):
thereafter, use a sliding window of size W to extract the feature matrix of each amino acid residue:
finally, combine the above feature matrix (4) row-first into a feature vector of dimension 20*W:
f_i = (p_{i,1}^{normalized}, p_{i,2}^{normalized}, ..., p_{i,20W}^{normalized})^T   (5)
Step 1-2, for a protein composed of l amino acid residues, obtain its secondary-structure probability matrix by PSIPRED, this matrix having l rows and 3 columns, as shown in formula (6):
wherein C, H and E denote the three protein secondary structures coil, helix and strand; s_{i,1} denotes the probability that the secondary structure of the i-th amino acid residue of the protein is coil, s_{i,2} the probability that it is helix, and s_{i,3} the probability that it is strand;
then, using the sliding-window extraction of step 1-1 and row-first combination, obtain for each amino acid residue a feature vector of dimension 3*W, as shown in formula (7):
f_i = (s_{i,1}, s_{i,2}, ..., s_{i,3W})^T   (7)
Step 1-3, for a protein composed of l amino acid residues, obtain a matrix containing its binding-propensity information by looking up the protein-vitamin binding-site propensity table, this matrix having l rows and 1 column, as shown in formula (8):
(b_1, ..., b_i, ..., b_l)^T   (8)
wherein b_i denotes the propensity of the i-th amino acid residue of the protein to bind a vitamin;
then, using the sliding-window extraction of step 1-1 and row-first combination, obtain for each amino acid residue a feature vector of dimension 1*W, as shown in formula (9):
f_i = (b_{i,1}, b_{i,2}, ..., b_{i,W})^T   (9)
Step 1-4, serially combine the three feature vectors obtained in the above steps, obtaining a feature vector of length 20*W + 3*W + 1*W.
3. The subspace-fusion-based protein-vitamin binding site prediction method according to claim 1, characterized in that, in said step 2, building multiple feature subspaces with said three feature selection algorithms comprises the following steps:
Step 2-1, perform feature selection on the original feature space produced by step 1 with the Joint Laplacian Feature Weights Learning algorithm, comprising:
1) for the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} using formulas (10) and (11) as follows:
D_{ii} = Σ_j H_{ij},  1 ≤ i ≤ M and 1 ≤ j ≤ M   (11)
wherein R^{N×M} gives the size of the matrix X: X has M elements each with N-dimensional features, N denotes the feature dimensionality and M the number of samples, i.e. the number of amino acid residues;
2) solve the generalized eigenvalue problem Hy = λDy for the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} obtained above, obtaining the eigenvector y corresponding to the largest eigenvalue below 1;
3) using the eigenvector y thus obtained, update the weight of each feature dimension according to formula (12) until convergence:
w_i^{t+1} ← (2/3) w_i^t + (1/3) w_i^t (2Xy + 4εw^t)_i / (2XX^T w^t + 4εw^t (w^t)^T w^t)_i,  1 ≤ i ≤ N   (12)
wherein w = [w_1, w_2, ..., w_i, ..., w_N] holds the weight of each feature dimension, T denotes matrix transposition, t the iteration index, and ε a slack term controlling the number of zero elements in w;
4) from the weight vector w = [w_1, w_2, ..., w_i, ..., w_N] thus obtained, select the sample feature dimensions corresponding to all weight components w_i greater than zero, finally output the feature subspace formed by all selected feature dimensions, and output therewith the number of feature dimensions in the subspace;
Step 2-2, perform feature selection on the original feature space produced by step 1 with the Fisher Score algorithm, comprising:
1) for a space with original samples of c classes, wherein the i-th class has a sample set of M^{(i)} samples, each sample being a feature vector with a class label and representing one amino acid residue of a protein, compute the mean u_n^{(i)} and variance (σ_n^{(i)})^2 of each feature dimension of each class of data according to formulas (13) and (14):
u_n^{(i)} = (1/M^{(i)}) Σ_{j=1}^{M^{(i)}} x_{jn}^{(i)},  1 ≤ n ≤ N and 1 ≤ i ≤ c   (13)
(σ_n^{(i)})^2 = (1/M^{(i)}) Σ_{j=1}^{M^{(i)}} (x_{jn}^{(i)} - u_n^{(i)})^2,  1 ≤ n ≤ N and 1 ≤ i ≤ c   (14)
2) using all the means and variances computed above, compute the Fisher Score of each feature dimension according to formula (15):
H_n = Σ_{i=1}^{c} M^{(i)} (u_n^{(i)} - u_n)^2 / Σ_{i=1}^{c} M^{(i)} (σ_n^{(i)})^2,  1 ≤ n ≤ N   (15)
wherein u_n denotes the mean of the n-th feature dimension over all data and H_n the Fisher Score of the n-th feature dimension, each of the N feature dimensions having one Fisher Score;
formula (15) yields a Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N];
3) sort each value of the Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N] from largest to smallest, then select the sample features corresponding to the top-ranked Fisher Scores and output the feature subspace formed by all selected features, the number of features kept being determined by step 2-1;
Step 2-3, perform feature selection on the original feature space produced by step 1 with the Laplacian Score algorithm, comprising:
1) for the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} using formulas (16) and (17) as follows:
D_{ii} = Σ_j H_{ij},  1 ≤ i ≤ M and 1 ≤ j ≤ M   (17)
wherein R^{N×M} gives the size of the matrix X: X has M elements each with N-dimensional features, N denotes the feature dimensionality, M the number of samples (amino acid residues), and σ a Gaussian parameter; formula (16) computes the kernel-space distance between two samples, i.e. two amino acid residues, σ controlling the width of the kernel space;
2) using the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} constructed above, compute the Laplacian Score of each feature dimension according to formula (18):
L_n = Σ_{i=1}^{M} Σ_{j=1}^{M} (x_{in} - x_{jn})^2 H_{ij} / Σ_{i=1}^{M} (x_{in} - x̄_n)^2 D_{ii},  1 ≤ n ≤ N   (18)
wherein x_{in} denotes the value of the n-th feature dimension of the i-th sample and x̄_n the mean of the n-th feature dimension over all samples; L_n denotes the Laplacian Score of the n-th feature dimension, each of the N feature dimensions having one, and formula (18) finally yields a Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N];
3) sort each value of the Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N] thus computed from largest to smallest, then select the sample features corresponding to the top-ranked Laplacian Scores and output the feature subspace formed by all selected features, the number of features kept being determined by the aforesaid step 2-1.
4. The subspace-fusion-based protein-vitamin binding site prediction method according to claim 1, characterized in that, in said step 3, according to the distribution of the aforesaid original samples in each feature subspace, one subspace SVM predictor is trained with the SVC classification algorithm of LIBSVM; finally, three different SVM predictors are trained on the three feature subspaces.
5. The subspace-fusion-based protein-vitamin binding site prediction method according to claim 1, characterized in that, in step 4, the weighted-average method is used to fuse the SVM predictors of the three different feature subspaces trained in step 3, which comprises:

Let ω_1 and ω_2 denote the binding-site class and the non-binding-site class respectively, and let S_1, S_2 and S_3 denote the three SVM predictors under the different feature subspaces. An evaluation sample set of M_eva amino acid residues, whose classes are known, is used to determine the weight of the SVM model corresponding to each subspace. For each sample feature x_i, S_1, S_2 and S_3 output three 2-dimensional vectors (s_1,1(x_i), s_1,2(x_i))^T, (s_2,1(x_i), s_2,2(x_i))^T and (s_3,1(x_i), s_3,2(x_i))^T; the two elements of each 2-dimensional vector represent the degrees to which x_i belongs to ω_1 and ω_2 respectively, and the two elements sum to 1. Therefore, for the evaluation sample set, the prediction result matrices on S_1, S_2 and S_3 are obtained respectively:

R_i = [ s_i,1(x_1) s_i,2(x_1); s_i,1(x_2) s_i,2(x_2); …; s_i,1(x_M_eva) s_i,2(x_M_eva) ]^T, i = 1, 2, 3 (19)

where each row before transposition holds the two class-membership degrees of one evaluation sample.

First, construct the target result matrix according to the true classes:

R_true = [ p_1 1−p_1; p_2 1−p_2; …; p_M_eva 1−p_M_eva ]^T, where p_i = 1 if y_i = ω_1, otherwise p_i = 0 (20)

Secondly, calculate the error of the SVM classifier under each feature subspace:

E_i = ‖R_true − R_i‖²_2, i = 1, 2, 3 (21)

Then, construct the weight of each subspace SVM predictor from its prediction error on the evaluation set:

w_i = (M_eva − E_i) / Σ_{k=1..3} (M_eva − E_k), i = 1, 2, 3 (22)

Wherein M_eva denotes the error obtained when all evaluation samples are completely misclassified;

Finally, integrate the SVM predictors of the different subspaces according to the weights calculated on the evaluation sample set:

S = Σ_{i=1..3} w_i · S_i (23)

and obtain S as the fused SVM predictor given by formula (23).
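Formulas (19)-(23) can be sketched in NumPy as follows; a minimal illustration assuming each predictor's output matrix is stored one row per evaluation sample (function names are illustrative, not from the patent):

```python
import numpy as np

def fusion_weights(R_list, y_true):
    """Weights per formulas (20)-(22): a predictor's weight grows as its
    squared error on the evaluation set shrinks.

    R_list : list of (M_eva, 2) arrays -- per-predictor rows of
             [degree of binding class, degree of non-binding class].
    y_true : (M_eva,) 0/1 labels (1 = binding class omega_1).
    """
    M_eva = len(y_true)
    # Formula (20): target matrix, row i is (p_i, 1 - p_i).
    R_true = np.column_stack([y_true, 1 - y_true]).astype(float)
    # Formula (21): squared error of each subspace predictor.
    E = np.array([np.sum((R_true - R) ** 2) for R in R_list])
    # Formula (22): normalize (M_eva - E_i) into weights summing to 1.
    return (M_eva - E) / np.sum(M_eva - E)

def fused_probability(prob_list, w):
    """Formula (23): weighted average of the subspace binding probabilities."""
    return sum(wi * p for wi, p in zip(w, prob_list))
```

A perfect predictor (E_i = 0) therefore receives the largest weight, and a predictor that outputs uninformative 0.5/0.5 degrees is down-weighted accordingly.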
6. The subspace-fusion-based protein-vitamin binding site prediction method according to claim 1, characterized in that, in step 5, the fused SVM predictor is used to perform protein-vitamin binding site prediction on the protein to be predicted:

For each amino acid residue of the protein to be predicted, generate the residue's features in the original feature space according to step 1; then apply the three feature selection algorithms described in step 2 to the residue's original features to produce three subspace feature representations; next, input the three subspace features into the corresponding three SVM predictors S_1, S_2 and S_3 of step 3 to obtain three prediction results, each given as a probability of binding the vitamin; input these three prediction results into the SVM predictor integrated by the weighted-average method of step 4, which outputs the probability that the amino acid residue binds or does not bind the vitamin; finally, make the binding decision using the threshold T that maximizes the Matthews correlation coefficient as the criterion: every amino acid residue whose binding probability is greater than or equal to T is predicted as a binding residue, and every amino acid residue whose binding probability is less than the threshold T is predicted as a non-binding residue, where T ∈ [0, 1].
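The threshold selection described above can be sketched as a simple grid search for the T that maximizes the Matthews correlation coefficient (the grid resolution is an assumption, not from the patent):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / den if den > 0 else 0.0

def best_threshold(probs, y_true, grid=None):
    """Pick T in [0, 1] maximizing MCC; residues with probability >= T are
    predicted as binding, the rest as non-binding."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    scores = [mcc(y_true, (probs >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]
```

In practice T would be fixed on held-out data with known binding annotations and then applied unchanged to new proteins.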
CN201410164632.1A 2014-04-22 2014-04-22 Subspace fusion-based protein-vitamin binding site prediction method Expired - Fee Related CN103955628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410164632.1A CN103955628B (en) 2014-04-22 2014-04-22 Subspace fusion-based protein-vitamin binding site prediction method


Publications (2)

Publication Number Publication Date
CN103955628A true CN103955628A (en) 2014-07-30
CN103955628B CN103955628B (en) 2017-03-01

Family

ID=51332903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410164632.1A Expired - Fee Related CN103955628B (en) 2014-04-22 2014-04-22 The protein vitamin binding site estimation method being merged based on subspace

Country Status (1)

Country Link
CN (1) CN103955628B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1184455A1 (en) * 1999-06-08 2002-03-06 Japan Science and Technology Corporation Method for predicting binding site structure of gene regulator and apparatus therefor
CN102760210A (en) * 2012-06-19 2012-10-31 Nanjing University of Science and Technology Changshu Research Institute Co., Ltd. Adenosine triphosphate binding site predicting method for protein
KR101278211B1 (en) * 2011-09-30 2013-07-01 Inha University Industry-Academic Cooperation Foundation Feature vector-based method for removing redundancy in a training dataset
CN103500292A (en) * 2013-09-27 2014-01-08 Nanjing University of Science and Technology Ligand-specific protein-ligand binding region prediction method
CN103617203A (en) * 2013-11-15 2014-03-05 Nanjing University of Science and Technology Query-driven protein-ligand binding site prediction method


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONG-JUN YU et al.: "Designing Template-Free Predictor for Targeting Protein-Ligand Binding Sites with Classifier Ensemble and Spatial Clustering", IEEE/ACM Transactions on Computational Biology & Bioinformatics *
DONG-JUN YU et al.: "TargetATPsite: A Template-free Method for ATP-Binding Sites Prediction with Residue Evolution Image Sparse Representation and Classifier Ensemble", Journal of Computational Chemistry *
HUI YAN et al.: "Joint Laplacian feature weights learning", Pattern Recognition *
YANAN ZHANG et al.: "Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features", BMC Bioinformatics *
CHEN JUNFEN: "Comparison of fusion methods for multiple neural network classifiers", Popular Science & Technology *
CHEN RUI et al.: "Dynamic ensemble selection algorithm based on constraint score", Application Research of Computers *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636635A (en) * 2015-01-29 2015-05-20 南京理工大学 Protein crystallization predicting method based on two-layer SVM learning mechanism
CN104636635B (en) * 2015-01-29 2018-06-12 Nanjing University of Science and Technology Protein crystallization prediction method based on a two-layer SVM learning mechanism
CN104992079B (en) * 2015-06-29 2018-07-06 Nanjing University of Science and Technology Sampling-learning-based protein-ligand binding site prediction method
CN104992079A (en) * 2015-06-29 2015-10-21 南京理工大学 Sampling learning based protein-ligand binding site prediction method
CN105808975A (en) * 2016-03-14 2016-07-27 Nanjing University of Science and Technology Multi-kernel-learning and Boosting algorithm based protein-DNA binding site prediction method
CN107169312A (en) * 2017-05-27 2017-09-15 Nankai University Low-complexity natural disordered protein prediction method
CN107169312B (en) * 2017-05-27 2020-05-08 南开大学 Low-complexity natural disordered protein prediction method
CN107273714A (en) * 2017-06-07 2017-10-20 Nanjing University of Science and Technology ATP binding site prediction method combining protein sequence and structural information
CN107463799B (en) * 2017-08-23 2020-02-14 福建师范大学福清分校 Method for identifying DNA binding protein by interactive fusion feature representation and selective integration
CN107463799A (en) * 2017-08-23 2017-12-12 Fuqing Branch of Fujian Normal University Method for identifying DNA-binding proteins by interactive fusion feature representation and selective ensemble
CN107609352A (en) * 2017-11-02 2018-01-19 Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences Prediction method of protein self-interaction
CN107609352B (en) * 2017-11-02 2020-07-28 中国科学院新疆理化技术研究所 Prediction method of protein self-interaction
CN108229102A (en) * 2017-12-19 2018-06-29 东软集团股份有限公司 Amino acid sequence feature extracting method, device, storage medium and electronic equipment
CN108229102B (en) * 2017-12-19 2020-06-12 东软集团股份有限公司 Amino acid sequence feature extraction method, device, storage medium and electronic equipment
CN108681659A (en) * 2018-04-02 2018-10-19 首都师范大学 The method for predicting protein complex based on sample data
CN108681659B (en) * 2018-04-02 2022-04-05 首都师范大学 Methods for predicting protein complexes based on sample data
CN108563922A (en) * 2018-04-04 2018-09-21 Central South University Method, system and storage medium for predicting protein-RNA binding hotspots
CN111091865A (en) * 2019-12-20 2020-05-01 东软集团股份有限公司 Method, device, equipment and storage medium for generating MoRFs prediction model
CN111091865B (en) * 2019-12-20 2023-04-07 东软集团股份有限公司 Method, device, equipment and storage medium for generating MoRFs prediction model

Also Published As

Publication number Publication date
CN103955628B (en) 2017-03-01

Similar Documents

Publication Publication Date Title
CN103955628B (en) Subspace fusion-based protein-vitamin binding site prediction method
CN112767997B (en) A method for protein secondary structure prediction based on multi-scale convolutional attention neural network
CN103093235B (en) A kind of Handwritten Numeral Recognition Method based on improving distance core principle component analysis
You et al. Detecting Protein‐Protein Interactions with a Novel Matrix‐Based Protein Sequence Representation and Support Vector Machines
CN111126488A (en) Image identification method based on double attention
CN104077499B (en) Site estimation method is bound based on the protein nucleotide for having supervision up-sampling study
CN103065158B (en) The behavior recognition methods of the ISA model based on relative gradient
JP2015052832A (en) Weight setting device and method
CN104537647A (en) Target detection method and device
CN110021340B (en) An RNA secondary structure generator and its prediction method based on convolutional neural network and planning dynamic algorithm
CN110880354B (en) A Drug-Target Interaction Prediction Method Based on Swarm Intelligence
CN109448787B (en) Protein subnuclear localization method for feature extraction and fusion based on improved PSSM
CN107145836A (en) Hyperspectral Image Classification Method Based on Stacked Boundary Discriminative Autoencoder
CN104318515B (en) High spectrum image wave band dimension reduction method based on NNIA evolution algorithms
CN104933428A (en) Human face recognition method and device based on tensor description
CN104809475A (en) Multi-labeled scene classification method based on incremental linear discriminant analysis
CN109147866A (en) Residue prediction technique is bound based on sampling and the protein-DNA of integrated study
CN105046272A (en) Image classification method based on concise unsupervised convolutional network
CN102103691A (en) Identification method for analyzing face based on principal component
CN106250913B (en) A kind of combining classifiers licence plate recognition method based on local canonical correlation analysis
CN107463799B (en) Method for identifying DNA binding protein by interactive fusion feature representation and selective integration
CN116977723A (en) Hyperspectral image classification method based on space-spectrum hybrid self-attention mechanism
CN108920900A (en) The unsupervised extreme learning machine Feature Extraction System and method of gene expression profile data
CN104463207A (en) Knowledge self-encoding network and polarization SAR image terrain classification method thereof
CN104463205B (en) Data classification method based on chaos depth wavelet network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Hu Jun

Inventor after: Wu Xuan

Inventor after: He Xue

Inventor after: Li Yang

Inventor after: Shen Hongbin

Inventor after: Yang Jingyu

Inventor before: Hu Jun

Inventor before: Wu Dongjun

Inventor before: He Xue

Inventor before: Li Yang

Inventor before: Shen Hongbin

Inventor before: Yang Jingyu

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170301

Termination date: 20200422

CF01 Termination of patent right due to non-payment of annual fee