CN109273054B - Protein subcellular interval prediction method based on relational graph - Google Patents
Protein subcellular interval prediction method based on relational graph
- Publication number
- CN109273054B (application CN201811014322.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- protein sequence
- words
- protein
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses a protein subcellular interval prediction method based on a relational graph. Taking protein sequences as the research object, the method extracts protein sequence feature information with an improved bag-of-words model based on a relation map and sends it into a classifier to predict protein subcellular intervals. On the basis of the traditional bag-of-words model, a position relation map is extracted from the protein sequence word fragments by combining the Markov hypothesis; the relation map is sent into a convolutional neural network (CNN) for depth feature extraction; the extracted depth features are fused with the bag-of-words features obtained from the traditional bag-of-words model as the final fusion feature representation of a protein sequence, which is sent into a support vector machine multi-class classifier for classification prediction. Example results show that classification using the relation map features alone achieves higher prediction accuracy than classification using the traditional bag-of-words features alone, and classification fusing the relation map features with the traditional bag-of-words features performs better still.
Description
Technical Field
The invention relates to the field of bioinformatics, and in particular to a method, implemented in a computer programming language, for extracting feature information from protein sequences with a bag-of-words model based on a relation map and sending the feature information into a support vector machine to predict protein subcellular intervals.
Background
With the rapid development of computer technology, large-scale nucleic acid and protein sequence data have been acquired, and mining effective information from these massive data by means of advanced, efficient, automated data processing techniques has become an inevitable trend. In past research, scholars at home and abroad have mainly described extracted protein sequence feature information mathematically, represented protein sequences as high-dimensional feature vectors, and then designed efficient classifiers for prediction and analysis.
At present, algorithms for protein sequence feature extraction mainly include: amino acid composition (AAC), physicochemical properties of amino acids, dipeptide and polypeptide composition, pseudo amino acid composition (PseAAC), and fusions of different features. Predicting protein subcellular intervals with a bag-of-words (BoW) model mainly comprises extracting local sequence features, constructing a dictionary by cluster analysis, constructing a word histogram, and training a classifier for prediction; finally the target sequence is converted into a feature vector and sent into the classifier for classification.
The traditional BoW model performs well in predicting protein subcellular intervals, but it assumes that the feature words are mutually independent: whether a feature word appears does not depend on other words; elements such as word order and grammar among the feature words mapped from target sequence fragments are ignored, and the target sequence is regarded merely as a set of sequence words, even though a word appearing at any position in the sequence is not chosen independently of the sequence's semantics. This approach does not consider positional relations or ordering among the sequence feature words, so the bag-of-words features are limited in representing sequence information; moreover, the performance of the clustering algorithm and the number of cluster centers strongly influence the clustering result during dictionary construction, which may leave the feature words with insufficient expressive and discriminative power. Feature extraction methods based on Markov models, on the other hand, consider only the state transition probabilities of the sequence words and cannot comprehensively reflect the overall information of the sequence.
Previous research results show that simply adopting a traditional protein sequence feature extraction algorithm such as AAC and sending the features into a classifier for localization prediction yields low accuracy.
Disclosure of Invention
The invention provides a protein subcellular interval prediction method based on a relation map, aiming at the technical problems described in the background art.
The technical scheme is as follows:
a protein subcellular interval prediction method based on a relational graph comprises the following steps:
(1) segmenting all protein sequences, namely target sequences in a data set according to a certain length to generate a plurality of sequence words, and extracting the characteristics of all the sequence words;
(2) performing cluster analysis on the characteristics of the sequence words, constructing a dictionary by using a K-means clustering algorithm, wherein the number of clustering centers is the size of the dictionary, and the characteristics of the sequence words are mapped to each clustering center in the dictionary after the cluster analysis;
(3) extracting a position relation map of the sequence words represented by the clustering center and sending the position relation map into a Convolutional Neural Network (CNN) for feature extraction;
(4) counting the number of sequence words of each protein sequence belonging to each clustering center, calculating the proportion of the number of the sequence words on each clustering center to the total number of the sequence words of the protein sequence, thereby obtaining the bag-of-word characteristics of the protein sequence, then fusing with the relational graph characteristics obtained in the step (3), and obtaining the final fusion characteristics of the protein sequence after PCA dimension reduction;
(5) and (3) sending the fusion characteristics of the protein sequence into a support vector machine multi-class classifier to predict the protein subcellular interval.
Specifically, the step (1) comprises the following specific steps:
(1-A) performing segmentation processing on the protein sequences with a sliding-window segmentation method: the characters of a protein sequence are traversed in order from the head end to the tail end with the sliding interval fixed at 1, and the segmentation length of a sequence word is determined by the sliding window size, selected as follows:
L = min(L1, L2, …, Ls), L/2 ≤ N ≤ L
wherein L1, L2, …, Ls respectively denote the lengths of the protein sequences in the data set; since the protein sequences are of unequal lengths, L is taken as the length of the shortest protein sequence in the data set; N is the sliding window size, i.e. the segmentation length of a sequence word, selected between L/2 and L;
(1-B) after segmentation, the traditional protein feature extraction algorithm AAC is used to count the amino acid composition of each sequence word to obtain the protein sequence word features; since every protein sequence is a combination of the 20 amino acid residues written as letters, each sequence word feature can be represented by a 20-dimensional vector obtained by statistical calculation.
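As an illustration of steps (1-A) and (1-B), the following minimal Python sketch segments sequences with a sliding window and computes 20-dimensional AAC features; the example sequences and the particular window choice are assumptions for demonstration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid residues

def segment(sequence: str, window: int) -> list[str]:
    """Slide a window of fixed size over the sequence with sliding interval 1."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]

def aac_feature(word: str) -> np.ndarray:
    """20-dimensional amino acid composition: frequency of each residue in the word."""
    counts = np.array([word.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / len(word)

# Hypothetical input; real data would come from a set such as SWISS-PROT.
sequences = ["MNYLPHPNSSPT", "ACDEFGHIKLMNPQRSTVWY"]
L = min(len(s) for s in sequences)       # length of the shortest sequence
N = max(L // 2, 1)                       # window size chosen between L/2 and L
words = [w for s in sequences for w in segment(s, N)]
features = np.vstack([aac_feature(w) for w in words])  # one 20-dim row per word
```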
Specifically, the step (2) comprises the following specific steps:
(2-A) clustering the sequence word features with the K-means algorithm to construct a dictionary, the core idea being to divide the sequence word features into different categories according to the principle of minimizing the within-class variance, the number of cluster centers being the size of the dictionary;
(2-B) quantitatively describing the protein sequence: each sequence word feature of the protein sequence is mapped to the cluster center closest to it in the dictionary, so that the protein sequence can be uniquely represented by a number of cluster centers; that is, any protein sequence can be defined as:
F = (x1, x2, x3, …, xn), 1 ≤ i ≤ n, n ∈ Z
wherein F is a protein sequence, xi denotes the cluster center mapped from the ith sequence word feature in the sequence F, and n is the number of sequence words obtained by segmentation.
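A minimal sketch of step (2) follows, assuming scikit-learn's KMeans is an acceptable stand-in for the K-means algorithm described above; the dictionary size m and the placeholder feature array are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((1000, 20))   # placeholder for the 20-dim sequence word features

m = 50                              # number of cluster centers = dictionary size
kmeans = KMeans(n_clusters=m, n_init=10, random_state=0).fit(features)

def to_cluster_sequence(seq_word_features: np.ndarray) -> np.ndarray:
    """Map each sequence word feature to its nearest cluster center,
    giving the representation F = (x1, x2, ..., xn) used in the text."""
    return kmeans.predict(seq_word_features)
```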
Specifically, the step (3) comprises the following specific steps:
(3-A) extraction of the relation map
For any finite state sequence X = {x1, x2, …, xT}, let x1, x2, …, xT denote the states of the sequence at the times t = 1, 2, …, T. A state sequence X satisfying the following condition is called a Markov chain:
P(X) = P(x1) · ∏t=2..T P(xt | xt-1, xt-2, …, x1)
wherein P(…|…) is a conditional probability and ∏t=2..T denotes the product of the expressions P that follow it over t from 2 to T. If it is assumed that in the state sequence X only the current state influences the future, i.e. the state of the state sequence X at time t depends only on the state at time t−1, the above equation becomes:
P(X) = P(x1) · ∏t=2..T P(xt | xt-1)
This assumption is the Markov hypothesis, and the state sequence X is then a first-order Markov chain whose state space is a countable set.
Based on the Markov hypothesis, for an arbitrary target sequence F, let x1, x2, …, xn denote the states of the sequence F at the times N = 1, 2, …, n; that is, the cluster centers x1, x2, …, xn in F respectively represent the different states at each time of a Markov chain, and N = 1, 2, …, n are the ordered time steps of those states. The state of the target sequence F at time N is related only to the N−1 states that have already appeared before it, and the state xN at any time in the sequence F satisfies the following conditional function:
xN = g(xN-1, xN-2, …, x1)
If it is assumed that in the sequence F only the current state influences the future, then for any target sequence F the state at each time is related only to the state at the previous time; that is, the state xN at any time in the target sequence F depends only on the state xN-1 at the previous time, and the above formula becomes:
xN = g(xN-1)
In a Markov process whose state space is a countable set, a first-order Markov chain means that the current state is related only to the immediately preceding state, while a k-order Markov chain means that the current state is related to the k preceding states. Thus for any cluster center xi in the target sequence F (xi denotes the class label of the ith feature word fragment in F, 1 ≤ i ≤ n), the feature word fragment mapped by xi is related only to the feature word fragments mapped by the k cluster centers appearing before it, where k is the adjacent-interval coefficient with 1 ≤ k ≤ i−1: k = 1 means the current cluster center xi is related only to the single cluster center xi-1 that appeared before it, and k = i−1 means xi is related to all i−1 cluster centers that appeared before it. Extracting this specific relation between each cluster center in the target sequence F and the cluster centers appearing before it yields the position relation map.
In the present application, the specific steps for extracting the relation map are as follows (a minimal code sketch is given after the list):
(3-A-1) traversing the protein sequence F, i.e. for (i = 1; i ≤ n; i++): for any cluster center xi in F, the cluster centers xj adjacent to it form cluster segments (xi, xj), with j taking values from i−k to i−1 in turn, where k is the adjacent-interval coefficient; all cluster segments of the sequence F are thereby obtained;
(3-A-2) the relation map obtained by the above steps is sparse: the zero elements of the map matrix far outnumber the non-zero elements, and the non-zero elements are irregularly distributed. To further improve the robustness of the CNN model, the sparse map is densified by borrowing word embedding from natural language processing; word embedding is a class of methods that convert sparse word vectors into dense vectors. The specific method adopted here is to randomly generate an m×m weight matrix whose element values are random numbers between 0 and 1, and to map the relation map matrix onto this random weight matrix, i.e. to multiply the two matrices; the output of this process is the dense map matrix. Concretely, a matrix D of size m×m is generated, whose rows and columns respectively correspond to the cluster centers, and the matrix is assigned according to the number of occurrences of each cluster segment (xi, xj), i.e. let D(xi, xj) += 1;
(3-A-3) repeating step (3-A-2) until i = n, thereby obtaining the corresponding m×m position relation matrix;
(3-A-4) representing the values of all elements in the position relation matrix as pixel points of different brightness to obtain the relation map;
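A minimal sketch of steps (3-A-1) through (3-A-4), including the word-embedding densification discussed above; the random seed, the placeholder label sequence, and the scaling of matrix elements to pixel brightness are illustrative assumptions.

```python
import numpy as np

def relation_map(labels: np.ndarray, m: int, k: int = 1,
                 densify: bool = True, seed: int = 0) -> np.ndarray:
    """Build the m-by-m position relation matrix D from a sequence of cluster
    center labels, optionally densify it, and scale it to pixel brightness."""
    D = np.zeros((m, m))
    n = len(labels)
    for i in range(1, n):                      # traverse the sequence F
        for j in range(max(0, i - k), i):      # the k preceding cluster centers
            D[labels[i], labels[j]] += 1       # count cluster segment (xi, xj)
    if densify:
        W = np.random.default_rng(seed).random((m, m))  # random m-by-m weights in (0, 1)
        D = D @ W                               # dense map matrix
    if D.max() > 0:                             # element values as pixel brightness
        D = 255 * D / D.max()
    return D.astype(np.uint8)

# Hypothetical usage: labels of one protein sequence, m = 50, k = 1.
labels = np.random.default_rng(1).integers(0, 50, size=562)
X = relation_map(labels, m=50, k=1)             # 50x50 relation map
```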
(3-B) dimensionality reduction of the relation map: the relation map extracted by the above steps exists as a two-dimensional matrix; if it is unfolded and processed directly, the data volume is large, the memory and time costs of classifier training become excessive, and overfitting occurs easily. Feature selection and feature transformation can be performed on the matrix by pooling, usually max pooling or mean pooling, but a single pooling of the matrix may lose part of its information, so here a convolutional neural network is used to extract features by convolving and pooling the matrix. Since the fully-connected layer captures the global spatial layout more comprehensively and better reflects the relational attributes of the target object, the output of the fully-connected layer is extracted directly as the feature representation of the relation map.
The relation map is sent into a convolutional neural network (CNN) for feature extraction, where the CNN consists of an input layer, convolutional layers, down-sampling layers, a fully-connected layer and an output layer:
the input layer is the relation map X;
the feature extraction is carried out on the X by formulating different window values by the convolution layer, and the relation graph obtained after the convolution is expressed as:
Ci = g(Wi * Ci-1 + bi)
where Wi denotes the weight vector of the ith-layer convolution kernels in the convolutional neural network, bi the bias of the ith layer, and Ci the feature map of the ith layer, with C0 = X when i = 0; g denotes the activation function, for which the rectified linear unit ReLU is used;
the down-sampling layer is to down-sample the atlas, and is also called a pooling layer. While the complete information of the relation atlas is kept as much as possible, the dimension reduction is carried out on the characteristic map to reduce the parameters of the subsequent layer, and the formula is expressed as follows:
Ci=max(Ci-1)
the fully-connected layer has a function similar to an artificial neural network, each neuron is connected with all neurons in the previous layer, and distributed features with category distinction in the convolutional layer or the pooling layer are mapped to high-dimensional vector output. In order to effectively relieve the overfitting phenomenon in the training process, the scheme adopts a dropout technology, abandons part of neurons according to a certain proportion in the training process, uses cross entropy as a loss function to update weights, introduces weight attenuation to regularize parameters, extracts the output of a full-connection layer as a characteristic vector, and has a vector dimension of p;
the output layer extracts the output of the full-connection layer as the characteristic vector representation of the relation map.
Preferably, the convolutional layers use the rectified linear unit ReLU as the activation function g, which makes the convergence of the model more stable and further improves its performance.
Preferably, during training of the fully-connected layer, part of the neurons are discarded at a rate of 0.5.
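The following minimal sketch, assuming PyTorch, mirrors the layer structure described above (convolution with ReLU, max-pooling down-sampling, and a dropout-regularized fully-connected layer whose output is taken as the map feature); the single convolution layer, channel count and output dimension p = 100 are illustrative assumptions. During training, a classification head with a cross-entropy loss and weight decay would sit on top; at feature-extraction time the fully-connected output itself is used.

```python
import torch
import torch.nn as nn

class RelationMapCNN(nn.Module):
    def __init__(self, m: int = 50, p: int = 100):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # convolution layer
        self.act = nn.ReLU()                                    # activation g = ReLU
        self.pool = nn.MaxPool2d(2)                             # down-sampling (max pooling)
        self.fc = nn.Linear(16 * (m // 2) * (m // 2), p)        # fully-connected layer
        self.dropout = nn.Dropout(0.5)                          # discard neurons at rate 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, m, m) relation map; returns the p-dim map feature vector
        h = self.pool(self.act(self.conv(x)))
        h = self.dropout(h.flatten(1))
        return self.fc(h)

# Hypothetical usage: one 50x50 relation map in, a 100-dim feature vector out.
model = RelationMapCNN(m=50, p=100).eval()
feature = model(torch.rand(1, 1, 50, 50))   # shape (1, 100)
```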
Specifically, the specific steps of step (4) are as follows:
(4-A) mapping the sequence word features after cluster analysis to the cluster centers in the dictionary and counting the number of sequence words of each protein sequence belonging to each cluster center, a sequence word histogram of the protein sequence is obtained with the cluster centers as the abscissa and the numbers of sequence words as the ordinate;
(4-B) the proportion of the number of sequence words at each cluster center of each protein sequence to the total number of sequence words of that protein sequence is calculated from the word histogram, thereby obtaining the bag-of-words features of the protein sequence; each protein sequence is represented as an m-dimensional bag-of-words feature vector;
(4-C) since a single kind of map feature is insufficient to express all the information of the target sequence, the traditional bag-of-words features and the positional features of the relation map extracted by the CNN are further processed by multi-feature fusion to improve the prediction accuracy: the relation map features obtained in step (3) and the bag-of-words features obtained in step (4-B) are concatenated into the final feature vector, which after PCA dimensionality reduction is sent into the classifier for classification.
An m-dimensional bag-of-words feature vector is extracted by counting word-frequency information in the word histogram under the traditional bag-of-words model, and a p-dimensional positional feature vector is extracted from the relation map by the convolutional neural network; the feature vector obtained after fusion is expressed as:
V = [vt1, vt2, vt3, …, vtm, vp1, vp2, vp3, …, vpp]
wherein V is the fused feature vector, vt the bag-of-words feature vector and vp the positional feature vector.
PCA (principal component analysis) dimensionality-reduction processing is performed on the fused feature vector to obtain the final fusion features.
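A minimal sketch of the fusion and dimensionality reduction in step (4), assuming scikit-learn's PCA; the placeholder feature arrays and the number of retained components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def bow_feature(labels: np.ndarray, m: int) -> np.ndarray:
    """m-dim bag-of-words feature: proportion of sequence words per cluster center."""
    counts = np.bincount(labels, minlength=m).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# placeholder cluster-label sequences for 317 proteins (m = 50, 562 words each)
vt = np.vstack([bow_feature(rng.integers(0, 50, 562), 50) for _ in range(317)])
vp = rng.random((317, 100))     # placeholder relation map features (p = 100)

V = np.hstack([vt, vp])                        # V = [vt1..vtm, vp1..vpp]
fused = PCA(n_components=64).fit_transform(V)  # final fusion features after PCA
```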
Specifically, the step (5) comprises the following specific steps: each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the size of the data set; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test samples are sent into the trained support vector machine for prediction, and the prediction results are counted.
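A minimal sketch of the leave-one-out evaluation in step (5), assuming scikit-learn's SVC serves as the support vector machine multi-class classifier; the kernel choice and the placeholder data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(C: np.ndarray, y: np.ndarray) -> float:
    """C: fused features, one row per protein; y: subcellular location categories."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(C):
        clf = SVC(kernel="rbf").fit(C[train_idx], y[train_idx])
        correct += int(clf.predict(C[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

# Hypothetical usage: 317 sequences, 64-dim fused features, 6 location classes.
rng = np.random.default_rng(0)
acc = loo_accuracy(rng.random((317, 64)), rng.integers(0, 6, size=317))
```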
Advantages of the Invention
Aiming at the traditional bag-of-words model's neglect of elements such as grammar and semantics, the invention extracts a position relation map from the sequence words based on the memorylessness (non-aftereffect property) of the Markov hypothesis, thereby introducing the spatial position information of the sequence words; the extracted relation map undergoes feature conversion and dimensionality reduction and is fused with the bag-of-words features as the final fusion features of the model. Compared with the traditional bag-of-words model, the proposed model reflects the distribution regularities of the sequence features more comprehensively, making the feature expression of the target sequence more representative and thereby improving classification performance.
Drawings
FIG. 1 is a flow chart of the steps for extracting a relationship map according to the present invention.
Fig. 2 is a diagram of a CNN network structure according to the present invention.
Fig. 3 is a flow chart of feature extraction using CNN according to the present invention.
FIG. 4 is a flow chart of the final fusion signature acquisition of the protein sequences of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which do not limit its scope:
taking a data set containing 317 apoptosis protein sequences obtained from a SWISS-PROT database as an example for explanation, extracting bag-of-word features of the protein sequences by using an improved bag-of-word model based on a relational graph and combining an AAC algorithm, and sending the bag-of-word features into a multi-class classifier of a support vector machine for positioning prediction. The method comprises the following specific steps:
(1) and segmenting all protein sequences in the data set to generate a plurality of sequence words, and extracting the characteristics of all the sequence words.
(1-A) The 317 protein sequences in the data set are processed with a computer programming language to obtain the length L of the shortest protein sequence in the data set; 50 is selected between L/2 and L as the sliding window size N, the sliding interval is fixed at 1, and the 317 sequences are segmented by sliding from the N-terminus to the C-terminus, yielding 206990 sequence words. For example, the first protein sequence MNYLP…HPNSSPT…MQ yields sequence words such as MNYLP…HPNS, NYLP…HPNSS and YLP…HPNSSP after sliding segmentation.
(1-B) The frequency of the 20 amino acids in each sequence word is calculated, and each sequence word is expressed as a 20-dimensional vector; these 20-dimensional vectors are the sequence word features. For example, the feature value obtained by counting the frequencies of the 20 amino acids in MNYLP…HPNS is [0.08, …, 0.1, 0.06, 0, 0.04]. The 206990 sequence words give 206990 sequence word feature values.
(2) Cluster analysis is performed on the sequence word features to obtain the dictionary, whose size is the number m of cluster centers.
(2-A) The number m of cluster centers is increased gradually from 20 up to a maximum of 500. For example, with m = 50, 50 sequence word feature values are randomly selected from the data set of 206990 sequence word feature values as the initial cluster centers for K-means clustering.
(2-B) Each protein sequence is then uniquely represented by a number of cluster centers; for example, after clustering the first protein sequence can be represented as [0, …, 26, 17, …, 9], where the number of cluster-center labels equals the total number of sequence words obtained by segmenting that protein sequence.
(3) The position relation map of the sequence words represented by the cluster centers is extracted and sent into the convolutional neural network for feature conversion and dimensionality reduction.
(3-A) The positional relations of each protein sequence F are extracted with the adjacent-interval coefficient k = 1, so that each protein sequence F is expressed as a 50×50 relation matrix. The specific flow of this step is shown in FIG. 1 and has already been described in detail in the technical scheme, so it is not repeated here.
(3-B) The relation map is sent into the convolutional neural network for feature extraction: during extraction the convolutional neural network convolves and pools the matrix to extract features, as shown in FIG. 2; the overall flow of feature extraction with the CNN is shown in FIG. 3 and has already been described in detail in the technical scheme, so it is not repeated here. The output of the fully-connected layer is extracted directly as the feature representation of the relation map; assuming the output of the final fully-connected layer is set to 100, the relation map features of each protein sequence are represented as a 100-dimensional vector.
(4) The number of sequence words of each protein sequence belonging to each cluster center is counted, and the proportion of the number of sequence words at each cluster center to the total number of sequence words of the protein sequence is calculated, thereby obtaining the bag-of-words features of the protein sequence; these are fused with the relation map features obtained in step (3), and after PCA dimensionality reduction the final fusion feature representation of the protein sequence is obtained.
(4-A) After cluster analysis the sequence word features are mapped to the 50 cluster centers in the dictionary, and the number of sequence words of each protein sequence belonging to each cluster center is counted; for example, the numbers of sequence words of the first protein sequence MNYLP…HPNSSPT…MQ belonging to the 50 cluster centers are 0, …, 26, 17, …, 9. With the cluster centers as the abscissa and the numbers of sequence words as the ordinate, the sequence word histogram of the protein sequence can be drawn from these statistics.
(4-B) The proportion of the number of sequence words at each cluster center to the total number of sequence words of each protein sequence is calculated to obtain the bag-of-words features of the protein sequences; each protein sequence is represented as a 50-dimensional vector. For example, the first protein sequence MNYLP…HPNSSPT…MQ has 562 sequence words after segmentation, and its bag-of-words feature vector is [0, …, 0.046263, 0.030249, …, 0.003559, 0.016014].
(4-C) The relation map features obtained in step (3) are fused with the traditional bag-of-words features of (4-B), and PCA dimensionality reduction is applied to give the final fusion features of the protein sequence.
(5) The fusion features are sent into the support vector machine multi-class classifier to predict the protein subcellular intervals. For the 317 protein sequences, each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the data set size of 317; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test sample is sent into the trained support vector machine for prediction, and the prediction results are counted.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined by the appended claims.
Claims (6)
1. A protein subcellular interval prediction method based on a relational graph is characterized by comprising the following steps:
(1) segmenting all protein sequences, namely target sequences in a data set according to a certain length to generate a plurality of sequence words, and extracting the characteristics of all the sequence words;
(2) performing cluster analysis on the characteristics of the sequence words, constructing a dictionary by using a K-means clustering algorithm, wherein the number of clustering centers is the size of the dictionary, and the characteristics of the sequence words are mapped to each clustering center in the dictionary after the cluster analysis; the specific steps of the step (2) are as follows:
(2-A) clustering the sequence word features with the K-means algorithm to construct a dictionary, the core idea being to divide the sequence word features into different categories according to the principle of minimizing the within-class variance, the number of cluster centers being the size of the dictionary;
(2-B) quantitatively describing the protein sequence: each sequence word feature of the protein sequence is mapped to the cluster center closest to it in the dictionary, so that the protein sequence can be uniquely represented by a number of cluster centers; that is, any protein sequence can be defined as:
F = (x1, x2, x3, …, xn), 1 ≤ i ≤ n, n ∈ Z
wherein F is a protein sequence, xi denotes the cluster center mapped from the ith sequence word feature in the sequence F, and n is the number of sequence words obtained by segmentation;
(3) extracting a position relation map of the sequence words represented by the clustering center and sending the position relation map into a Convolutional Neural Network (CNN) for feature extraction; the specific steps of the step (3) are as follows:
(3-A) extracting a relationship map:
(3-A-1) traversing the protein sequence F, i.e. for (i = 1; i ≤ n; i++): for any cluster center xi in F, the cluster centers xj adjacent to it form cluster segments (xi, xj), with j taking values from i−k to i−1 in turn, where k is the adjacent-interval coefficient; all cluster segments of the sequence F are thereby obtained;
(3-A-2) generating a matrix D of size m×m, whose rows and columns respectively correspond to the cluster centers, and assigning the matrix according to the number of occurrences of each cluster segment (xi, xj), i.e. letting D(xi, xj) += 1;
(3-A-3) repeating step (3-A-2) until i = n, thereby obtaining the corresponding m×m position relation matrix;
(3-A-4) representing the values of all elements in the position relation matrix as pixel points of different brightness to obtain the relation map;
(3-B) dimensionality reduction of the relation map: the relation map is sent into the convolutional neural network (CNN) for feature extraction, where the CNN consists of an input layer, convolutional layers, down-sampling layers, a fully-connected layer and an output layer:
the input layer is the relation map X;
the feature extraction is carried out on the X by formulating different window values by the convolution layer, and the relation graph obtained after the convolution is expressed as:
Ci = g(Wi * Ci-1 + bi)
where Wi denotes the weight vector of the ith-layer convolution kernels in the convolutional neural network, bi the bias of the ith layer, and Ci the feature map of the ith layer, with C0 = X when i = 0; g denotes the activation function, for which the rectified linear unit ReLU is used;
the down-sampling layer is used for down-sampling the atlas, the complete information of the relation atlas is kept as much as possible, and simultaneously, the dimension reduction is carried out on the characteristic map to reduce the parameters of the subsequent layer, and the formula is expressed as follows:
Ci=max(Ci-1)
the fully-connected layer adopts a dropout technology, partial neurons are abandoned according to a certain proportion in the training process, cross entropy is used as a loss function to update weights, and weight attenuation is introduced to regularize parameters;
the output layer extracts the output of the full-connection layer as the characteristic vector representation of the relation map;
(4) counting the number of sequence words of each protein sequence belonging to each clustering center, calculating the proportion of the number of the sequence words on each clustering center to the total number of the sequence words of the protein sequence, thereby obtaining the bag-of-word characteristics of the protein sequence, then fusing with the relational graph characteristics obtained in the step (3), and obtaining the final fusion characteristics of the protein sequence after PCA dimension reduction;
(5) and (3) sending the fusion characteristics of the protein sequence into a support vector machine multi-class classifier to predict the protein subcellular interval.
2. The method according to claim 1, wherein the specific steps of step (1) are:
(1-A) performing segmentation processing on the protein sequences with a sliding-window segmentation method: the characters of a protein sequence are traversed in order from the head end to the tail end with the sliding interval fixed at 1, and the segmentation length of a sequence word is determined by the sliding window size, selected as follows:
L = min(L1, L2, …, Ls), L/2 ≤ N ≤ L
wherein L1, L2, …, Ls respectively denote the lengths of the protein sequences in the data set, and L is taken as the length of the shortest protein sequence in the data set; N is the sliding window size;
(1-B) after segmentation, the traditional protein feature extraction algorithm AAC is used to count the amino acid composition information of each sequence word to obtain the protein sequence word features.
3. The method according to claim 1, characterized in that the convolutional layers use the rectified linear unit ReLU as the activation function g.
4. The method of claim 1, wherein during the training of the fully-connected layer, a fraction of neurons is discarded at a rate of 0.5.
5. The method according to claim 1, wherein the specific steps of step (4) are:
(4-A) counting the number of sequence words of each protein sequence belonging to each clustering center, and obtaining a sequence word histogram of the protein sequence by taking the clustering center as an abscissa and the number of the sequence words as an ordinate;
(4-B) calculating, from the word histogram, the proportion of the number of sequence words at each cluster center of each protein sequence to the total number of sequence words of that protein sequence, thereby obtaining the bag-of-words features of the protein sequence, wherein each protein sequence is represented as an m-dimensional bag-of-words feature vector;
(4-C) concatenating the relation map features obtained in step (3) and the bag-of-words features obtained in step (4-B) into the final feature vector, which after PCA dimensionality reduction is sent into the classifier for classification, the feature vector obtained after fusion being expressed as:
V = [vt1, vt2, vt3, …, vtm, vp1, vp2, vp3, …, vpp]
wherein vt is the bag-of-words feature vector and vp is the positional feature vector;
and performing PCA (principal component analysis) dimensionality-reduction processing on the obtained feature vector to obtain the final fusion features.
6. The method according to claim 1, wherein the specific steps of step (5) are: each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the size of the data set; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test sample is sent into the trained support vector machine for prediction, and the prediction results are counted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811014322.6A CN109273054B (en) | 2018-08-31 | 2018-08-31 | Protein subcellular interval prediction method based on relational graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109273054A CN109273054A (en) | 2019-01-25 |
CN109273054B true CN109273054B (en) | 2021-07-13 |
Family
ID=65155000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811014322.6A Expired - Fee Related CN109273054B (en) | 2018-08-31 | 2018-08-31 | Protein subcellular interval prediction method based on relational graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109273054B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110797084B (en) * | 2019-11-06 | 2021-05-25 | 吉林大学 | Deep neural network-based cerebrospinal fluid protein prediction method |
CN110827922B (en) * | 2019-11-06 | 2021-04-16 | 吉林大学 | Prediction method of amniotic fluid protein based on circulating neural network |
CN113571124B (en) * | 2020-04-29 | 2024-04-23 | 中国科学院上海药物研究所 | Method and device for predicting ligand-protein interaction |
CN112182275A (en) * | 2020-09-29 | 2021-01-05 | 神州数码信息系统有限公司 | Trademark approximate retrieval system and method based on multi-dimensional feature fusion |
CN112908418B (en) * | 2021-02-02 | 2024-06-28 | 杭州电子科技大学 | Dictionary learning-based amino acid sequence feature extraction method |
CN114360644A (en) * | 2021-12-30 | 2022-04-15 | 山东师范大学 | Method and system for predicting combination of T cell receptor and epitope |
CN116361713B (en) * | 2023-04-19 | 2023-08-29 | 湖北交通职业技术学院 | Performance detection method and system for aircraft engine |
CN117095743B (en) * | 2023-10-17 | 2024-01-05 | 山东鲁润阿胶药业有限公司 | Polypeptide spectrum matching data analysis method and system for small molecular peptide donkey-hide gelatin |
CN117672353B (en) * | 2023-12-18 | 2024-08-16 | 南京医科大学 | Space-time proteomics deep learning prediction method for protein subcellular migration |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899477A (en) * | 2015-06-18 | 2015-09-09 | 江南大学 | Protein subcellular interval prediction method using bag-of-word model |
CN105046106A (en) * | 2015-07-14 | 2015-11-11 | 南京农业大学 | Protein subcellular localization and prediction method realized by using nearest-neighbor retrieval |
CN105631480A (en) * | 2015-12-30 | 2016-06-01 | 哈尔滨工业大学 | Hyperspectral data classification method based on multi-layer convolution network and data organization and folding |
CN105760711A (en) * | 2016-02-02 | 2016-07-13 | 江南大学 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
CN106295139A (en) * | 2016-07-29 | 2017-01-04 | 汤平 | A kind of tongue body autodiagnosis health cloud service system based on degree of depth convolutional neural networks |
CN106845510A (en) * | 2016-11-07 | 2017-06-13 | 中国传媒大学 | Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features |
Non-Patent Citations (2)
Title |
---|
Application of machine learning algorithms in protein structure prediction; Xue Yanna; China Master's Theses Full-text Database, Basic Sciences; 2017-02-15; pp. 6, 14-15 *
Application of the bag-of-words model in protein subcellular localization prediction; Zhao Nan et al.; Journal of Food Science and Biotechnology; 2017-03; pp. 296-300 *
Also Published As
Publication number | Publication date |
---|---|
CN109273054A (en) | 2019-01-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210713 |