CN109273054B - Protein subcellular interval prediction method based on relational graph - Google Patents
Protein subcellular interval prediction method based on relational graph
- Publication number
- CN109273054B (application CN201811014322.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- protein sequence
- words
- protein
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses a protein subcellular interval prediction method based on a relational graph. Taking protein sequences as the research object, the method extracts protein sequence feature information with an improved bag-of-words model based on a relation map and sends it into a classifier to predict protein subcellular intervals. On the basis of the traditional bag-of-words model, a position relation map is extracted from the protein sequence word fragments by combining the Markov hypothesis; the relation map is sent into a convolutional neural network (CNN) for depth feature extraction; the extracted depth features are fused with the bag-of-words features obtained from the traditional bag-of-words model as the final fusion feature representation of a protein sequence, which is sent into a support vector machine multi-class classifier for classification prediction. Example results show that classification using the relation map features alone achieves higher prediction accuracy than classification using the traditional bag-of-words features alone, and classification fusing the relation map features with the traditional bag-of-words features performs better still.
Description
Technical Field
The invention relates to the field of bioinformatics, and in particular to a method, implemented in a computer programming language, for extracting feature information from protein sequences with a bag-of-words model based on a relation map and sending the feature information into a support vector machine to predict protein subcellular intervals.
Background
With the rapid development of computer technology, large-scale nucleic acid and protein sequence data have been acquired, and mining effective information from these massive data by means of advanced, efficient, automated data processing techniques has become an inevitable trend. In past research, scholars at home and abroad have mainly described extracted protein sequence feature information mathematically, represented protein sequences as high-dimensional feature vectors, and then designed efficient classifiers for prediction and analysis.
At present, algorithms for protein sequence feature extraction mainly include: amino acid composition (AAC), physicochemical properties of amino acids, dipeptide and polypeptide composition, pseudo amino acid composition (PseAAC), and fusions of different features. Predicting protein subcellular intervals with a bag-of-words (BoW) model mainly comprises extracting local sequence features, constructing a dictionary by cluster analysis, constructing a word histogram, and training a classifier for prediction; finally the target sequence is converted into a feature vector and sent into the classifier for classification.
The traditional BoW model performs well in predicting protein subcellular intervals, but it assumes that the feature words are mutually independent: whether a feature word appears does not depend on other words; elements such as word order and grammar among the feature words mapped from target sequence fragments are ignored, and the target sequence is regarded merely as a set of sequence words, even though a word appearing at any position in the sequence is not chosen independently of the sequence's semantics. This approach does not consider positional relations or ordering among the sequence feature words, so the bag-of-words features are limited in representing sequence information; moreover, the performance of the clustering algorithm and the number of cluster centers strongly influence the clustering result during dictionary construction, which may leave the feature words with insufficient expressive and discriminative power. Feature extraction methods based on Markov models, on the other hand, consider only the state transition probabilities of the sequence words and cannot comprehensively reflect the overall information of the sequence.
Previous research results show that simply adopting a traditional protein sequence feature extraction algorithm such as AAC and sending the features into a classifier for localization prediction yields low accuracy.
Disclosure of Invention
The invention provides a protein subcellular interval prediction method based on a relation map, aiming at the technical problems described in the background art.
The technical scheme is as follows:
a protein subcellular interval prediction method based on a relational graph comprises the following steps:
(1) segmenting all protein sequences, namely target sequences in a data set according to a certain length to generate a plurality of sequence words, and extracting the characteristics of all the sequence words;
(2) performing cluster analysis on the characteristics of the sequence words, constructing a dictionary by using a K-means clustering algorithm, wherein the number of clustering centers is the size of the dictionary, and the characteristics of the sequence words are mapped to each clustering center in the dictionary after the cluster analysis;
(3) extracting a position relation map of the sequence words represented by the clustering center and sending the position relation map into a Convolutional Neural Network (CNN) for feature extraction;
(4) counting the number of sequence words of each protein sequence belonging to each clustering center, calculating the proportion of the number of the sequence words on each clustering center to the total number of the sequence words of the protein sequence, thereby obtaining the bag-of-word characteristics of the protein sequence, then fusing with the relational graph characteristics obtained in the step (3), and obtaining the final fusion characteristics of the protein sequence after PCA dimension reduction;
(5) and (3) sending the fusion characteristics of the protein sequence into a support vector machine multi-class classifier to predict the protein subcellular interval.
Specifically, the step (1) comprises the following specific steps:
(1-A) performing segmentation processing on the protein sequences with a sliding-window segmentation method: the characters of a protein sequence are traversed in order from the head end to the tail end with the sliding interval fixed at 1, and the segmentation length of a sequence word is determined by the sliding window size, selected as follows:
L = min(L1, L2, …, Ls), L/2 ≤ N ≤ L
wherein L1, L2, …, Ls respectively denote the lengths of the protein sequences in the data set; since the protein sequences are of unequal lengths, L is taken as the length of the shortest protein sequence in the data set; N is the sliding window size, i.e. the segmentation length of a sequence word, selected between L/2 and L;
(1-B) after segmentation, the traditional protein feature extraction algorithm AAC is used to count the amino acid composition of each sequence word to obtain the protein sequence word features; since every protein sequence is a combination of the 20 amino acid residues written as letters, each sequence word feature can be represented by a 20-dimensional vector obtained by statistical calculation.
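As an illustration of steps (1-A) and (1-B), the following minimal Python sketch segments sequences with a sliding window and computes 20-dimensional AAC features; the example sequences and the particular window choice are assumptions for demonstration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid residues

def segment(sequence: str, window: int) -> list[str]:
    """Slide a window of fixed size over the sequence with sliding interval 1."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]

def aac_feature(word: str) -> np.ndarray:
    """20-dimensional amino acid composition: frequency of each residue in the word."""
    counts = np.array([word.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / len(word)

# Hypothetical input; real data would come from a set such as SWISS-PROT.
sequences = ["MNYLPHPNSSPT", "ACDEFGHIKLMNPQRSTVWY"]
L = min(len(s) for s in sequences)       # length of the shortest sequence
N = max(L // 2, 1)                       # window size chosen between L/2 and L
words = [w for s in sequences for w in segment(s, N)]
features = np.vstack([aac_feature(w) for w in words])  # one 20-dim row per word
```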
Specifically, the step (2) comprises the following specific steps:
(2-A) clustering the sequence word features with the K-means algorithm to construct a dictionary, the core idea being to divide the sequence word features into different categories according to the principle of minimizing the within-class variance, the number of cluster centers being the size of the dictionary;
(2-B) quantitatively describing the protein sequence: each sequence word feature of the protein sequence is mapped to the cluster center closest to it in the dictionary, so that the protein sequence can be uniquely represented by a number of cluster centers; that is, any protein sequence can be defined as:
F = (x1, x2, x3, …, xn), 1 ≤ i ≤ n, n ∈ Z
wherein F is a protein sequence, xi denotes the cluster center mapped from the ith sequence word feature in the sequence F, and n is the number of sequence words obtained by segmentation.
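A minimal sketch of step (2) follows, assuming scikit-learn's KMeans is an acceptable stand-in for the K-means algorithm described above; the dictionary size m and the placeholder feature array are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((1000, 20))   # placeholder for the 20-dim sequence word features

m = 50                              # number of cluster centers = dictionary size
kmeans = KMeans(n_clusters=m, n_init=10, random_state=0).fit(features)

def to_cluster_sequence(seq_word_features: np.ndarray) -> np.ndarray:
    """Map each sequence word feature to its nearest cluster center,
    giving the representation F = (x1, x2, ..., xn) used in the text."""
    return kmeans.predict(seq_word_features)
```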
Specifically, the step (3) comprises the following specific steps:
(3-A) extraction of the relation map
For any finite state sequence X = {x1, x2, …, xT}, let x1, x2, …, xT denote the states of the sequence at the times t = 1, 2, …, T. A state sequence X satisfying the following condition is called a Markov chain:
P(X) = P(x1) · ∏t=2..T P(xt | xt-1, xt-2, …, x1)
wherein P(…|…) is a conditional probability and ∏t=2..T denotes the product of the expressions P that follow it over t from 2 to T. If it is assumed that in the state sequence X only the current state influences the future, i.e. the state of the state sequence X at time t depends only on the state at time t−1, the above equation becomes:
P(X) = P(x1) · ∏t=2..T P(xt | xt-1)
This assumption is the Markov hypothesis, and the state sequence X is then a first-order Markov chain whose state space is a countable set.
Based on the Markov hypothesis, for an arbitrary target sequence F, let x1, x2, …, xn denote the states of the sequence F at the times N = 1, 2, …, n; that is, the cluster centers x1, x2, …, xn in F respectively represent the different states at each time of a Markov chain, and N = 1, 2, …, n are the ordered time steps of those states. The state of the target sequence F at time N is related only to the N−1 states that have already appeared before it, and the state xN at any time in the sequence F satisfies the following conditional function:
xN = g(xN-1, xN-2, …, x1)
If it is assumed that in the sequence F only the current state influences the future, then for any target sequence F the state at each time is related only to the state at the previous time; that is, the state xN at any time in the target sequence F depends only on the state xN-1 at the previous time, and the above formula becomes:
xN = g(xN-1)
In a Markov process whose state space is a countable set, a first-order Markov chain means that the current state is related only to the immediately preceding state, while a k-order Markov chain means that the current state is related to the k preceding states. Thus for any cluster center xi in the target sequence F (xi denotes the class label of the ith feature word fragment in F, 1 ≤ i ≤ n), the feature word fragment mapped by xi is related only to the feature word fragments mapped by the k cluster centers appearing before it, where k is the adjacent-interval coefficient with 1 ≤ k ≤ i−1: k = 1 means the current cluster center xi is related only to the single cluster center xi-1 that appeared before it, and k = i−1 means xi is related to all i−1 cluster centers that appeared before it. Extracting this specific relation between each cluster center in the target sequence F and the cluster centers appearing before it yields the position relation map.
In the present application, the specific steps for extracting the relation map are as follows (a minimal code sketch is given after the list):
(3-A-1) traversing the protein sequence F, i.e. for (i = 1; i ≤ n; i++): for any cluster center xi in F, the cluster centers xj adjacent to it form cluster segments (xi, xj), with j taking values from i−k to i−1 in turn, where k is the adjacent-interval coefficient; all cluster segments of the sequence F are thereby obtained;
(3-A-2) the relation map obtained by the above steps is sparse: the zero elements of the map matrix far outnumber the non-zero elements, and the non-zero elements are irregularly distributed. To further improve the robustness of the CNN model, the sparse map is densified by borrowing word embedding from natural language processing; word embedding is a class of methods that convert sparse word vectors into dense vectors. The specific method adopted here is to randomly generate an m×m weight matrix whose element values are random numbers between 0 and 1, and to map the relation map matrix onto this random weight matrix, i.e. to multiply the two matrices; the output of this process is the dense map matrix. Concretely, a matrix D of size m×m is generated, whose rows and columns respectively correspond to the cluster centers, and the matrix is assigned according to the number of occurrences of each cluster segment (xi, xj), i.e. let D(xi, xj) += 1;
(3-A-3) repeating step (3-A-2) until i = n, thereby obtaining the corresponding m×m position relation matrix;
(3-A-4) representing the values of all elements in the position relation matrix as pixel points of different brightness to obtain the relation map;
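A minimal sketch of steps (3-A-1) through (3-A-4), including the word-embedding densification discussed above; the random seed, the placeholder label sequence, and the scaling of matrix elements to pixel brightness are illustrative assumptions.

```python
import numpy as np

def relation_map(labels: np.ndarray, m: int, k: int = 1,
                 densify: bool = True, seed: int = 0) -> np.ndarray:
    """Build the m-by-m position relation matrix D from a sequence of cluster
    center labels, optionally densify it, and scale it to pixel brightness."""
    D = np.zeros((m, m))
    n = len(labels)
    for i in range(1, n):                      # traverse the sequence F
        for j in range(max(0, i - k), i):      # the k preceding cluster centers
            D[labels[i], labels[j]] += 1       # count cluster segment (xi, xj)
    if densify:
        W = np.random.default_rng(seed).random((m, m))  # random m-by-m weights in (0, 1)
        D = D @ W                               # dense map matrix
    if D.max() > 0:                             # element values as pixel brightness
        D = 255 * D / D.max()
    return D.astype(np.uint8)

# Hypothetical usage: labels of one protein sequence, m = 50, k = 1.
labels = np.random.default_rng(1).integers(0, 50, size=562)
X = relation_map(labels, m=50, k=1)             # 50x50 relation map
```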
(3-B) dimensionality reduction of the relation map: the relation map extracted by the above steps exists as a two-dimensional matrix; if it is unfolded and processed directly, the data volume is large, the memory and time costs of classifier training become excessive, and overfitting occurs easily. Feature selection and feature transformation can be performed on the matrix by pooling, usually max pooling or mean pooling, but a single pooling of the matrix may lose part of its information, so here a convolutional neural network is used to extract features by convolving and pooling the matrix. Since the fully-connected layer captures the global spatial layout more comprehensively and better reflects the relational attributes of the target object, the output of the fully-connected layer is extracted directly as the feature representation of the relation map.
The relation map is sent into a convolutional neural network (CNN) for feature extraction, where the CNN consists of an input layer, convolutional layers, down-sampling layers, a fully-connected layer and an output layer:
the input layer is the relation map X;
the feature extraction is carried out on the X by formulating different window values by the convolution layer, and the relation graph obtained after the convolution is expressed as:
Ci = g(Wi * Ci-1 + bi)
where Wi denotes the weight vector of the ith-layer convolution kernels in the convolutional neural network, bi the bias of the ith layer, and Ci the feature map of the ith layer, with C0 = X when i = 0; g denotes the activation function, for which the rectified linear unit ReLU is used;
the down-sampling layer is to down-sample the atlas, and is also called a pooling layer. While the complete information of the relation atlas is kept as much as possible, the dimension reduction is carried out on the characteristic map to reduce the parameters of the subsequent layer, and the formula is expressed as follows:
Ci=max(Ci-1)
the fully-connected layer has a function similar to an artificial neural network, each neuron is connected with all neurons in the previous layer, and distributed features with category distinction in the convolutional layer or the pooling layer are mapped to high-dimensional vector output. In order to effectively relieve the overfitting phenomenon in the training process, the scheme adopts a dropout technology, abandons part of neurons according to a certain proportion in the training process, uses cross entropy as a loss function to update weights, introduces weight attenuation to regularize parameters, extracts the output of a full-connection layer as a characteristic vector, and has a vector dimension of p;
the output layer extracts the output of the full-connection layer as the characteristic vector representation of the relation map.
Preferably, the convolutional layers use the rectified linear unit ReLU as the activation function g, which makes the convergence of the model more stable and further improves its performance.
Preferably, during training of the fully-connected layer, part of the neurons are discarded at a rate of 0.5.
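The following minimal sketch, assuming PyTorch, mirrors the layer structure described above (convolution with ReLU, max-pooling down-sampling, and a dropout-regularized fully-connected layer whose output is taken as the map feature); the single convolution layer, channel count and output dimension p = 100 are illustrative assumptions. During training, a classification head with a cross-entropy loss and weight decay would sit on top; at feature-extraction time the fully-connected output itself is used.

```python
import torch
import torch.nn as nn

class RelationMapCNN(nn.Module):
    def __init__(self, m: int = 50, p: int = 100):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # convolution layer
        self.act = nn.ReLU()                                    # activation g = ReLU
        self.pool = nn.MaxPool2d(2)                             # down-sampling (max pooling)
        self.fc = nn.Linear(16 * (m // 2) * (m // 2), p)        # fully-connected layer
        self.dropout = nn.Dropout(0.5)                          # discard neurons at rate 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, m, m) relation map; returns the p-dim map feature vector
        h = self.pool(self.act(self.conv(x)))
        h = self.dropout(h.flatten(1))
        return self.fc(h)

# Hypothetical usage: one 50x50 relation map in, a 100-dim feature vector out.
model = RelationMapCNN(m=50, p=100).eval()
feature = model(torch.rand(1, 1, 50, 50))   # shape (1, 100)
```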
Specifically, the specific steps of step (4) are as follows:
(4-A) mapping the sequence word features after cluster analysis to the cluster centers in the dictionary and counting the number of sequence words of each protein sequence belonging to each cluster center, a sequence word histogram of the protein sequence is obtained with the cluster centers as the abscissa and the numbers of sequence words as the ordinate;
(4-B) the proportion of the number of sequence words at each cluster center of each protein sequence to the total number of sequence words of that protein sequence is calculated from the word histogram, thereby obtaining the bag-of-words features of the protein sequence; each protein sequence is represented as an m-dimensional bag-of-words feature vector;
(4-C) since a single kind of map feature is insufficient to express all the information of the target sequence, the traditional bag-of-words features and the positional features of the relation map extracted by the CNN are further processed by multi-feature fusion to improve the prediction accuracy: the relation map features obtained in step (3) and the bag-of-words features obtained in step (4-B) are concatenated into the final feature vector, which after PCA dimensionality reduction is sent into the classifier for classification.
An m-dimensional bag-of-words feature vector is extracted by counting word-frequency information in the word histogram under the traditional bag-of-words model, and a p-dimensional positional feature vector is extracted from the relation map by the convolutional neural network; the feature vector obtained after fusion is expressed as:
V = [vt1, vt2, vt3, …, vtm, vp1, vp2, vp3, …, vpp]
wherein V is the fused feature vector, vt the bag-of-words feature vector and vp the positional feature vector.
PCA (principal component analysis) dimensionality-reduction processing is performed on the fused feature vector to obtain the final fusion features.
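A minimal sketch of the fusion and dimensionality reduction in step (4), assuming scikit-learn's PCA; the placeholder feature arrays and the number of retained components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def bow_feature(labels: np.ndarray, m: int) -> np.ndarray:
    """m-dim bag-of-words feature: proportion of sequence words per cluster center."""
    counts = np.bincount(labels, minlength=m).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# placeholder cluster-label sequences for 317 proteins (m = 50, 562 words each)
vt = np.vstack([bow_feature(rng.integers(0, 50, 562), 50) for _ in range(317)])
vp = rng.random((317, 100))     # placeholder relation map features (p = 100)

V = np.hstack([vt, vp])                        # V = [vt1..vtm, vp1..vpp]
fused = PCA(n_components=64).fit_transform(V)  # final fusion features after PCA
```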
Specifically, the step (5) comprises the following specific steps: each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the size of the data set; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test samples are sent into the trained support vector machine for prediction, and the prediction results are counted.
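A minimal sketch of the leave-one-out evaluation in step (5), assuming scikit-learn's SVC serves as the support vector machine multi-class classifier; the kernel choice and the placeholder data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(C: np.ndarray, y: np.ndarray) -> float:
    """C: fused features, one row per protein; y: subcellular location categories."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(C):
        clf = SVC(kernel="rbf").fit(C[train_idx], y[train_idx])
        correct += int(clf.predict(C[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

# Hypothetical usage: 317 sequences, 64-dim fused features, 6 location classes.
rng = np.random.default_rng(0)
acc = loo_accuracy(rng.random((317, 64)), rng.integers(0, 6, size=317))
```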
Advantages of the Invention
Aiming at the traditional bag-of-words model's neglect of elements such as grammar and semantics, the invention extracts a position relation map from the sequence words based on the memorylessness (non-aftereffect property) of the Markov hypothesis, thereby introducing the spatial position information of the sequence words; the extracted relation map undergoes feature conversion and dimensionality reduction and is fused with the bag-of-words features as the final fusion features of the model. Compared with the traditional bag-of-words model, the proposed model reflects the distribution regularities of the sequence features more comprehensively, making the feature expression of the target sequence more representative and thereby improving classification performance.
Drawings
FIG. 1 is a flow chart of the steps for extracting a relationship map according to the present invention.
Fig. 2 is a diagram of a CNN network structure according to the present invention.
Fig. 3 is a flow chart of feature extraction using CNN according to the present invention.
FIG. 4 is a flow chart of the final fusion signature acquisition of the protein sequences of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which do not limit its scope:
taking a data set containing 317 apoptosis protein sequences obtained from a SWISS-PROT database as an example for explanation, extracting bag-of-word features of the protein sequences by using an improved bag-of-word model based on a relational graph and combining an AAC algorithm, and sending the bag-of-word features into a multi-class classifier of a support vector machine for positioning prediction. The method comprises the following specific steps:
(1) and segmenting all protein sequences in the data set to generate a plurality of sequence words, and extracting the characteristics of all the sequence words.
(1-A) The 317 protein sequences in the data set are processed with a computer programming language to obtain the length L of the shortest protein sequence in the data set; 50 is selected between L/2 and L as the sliding window size N, the sliding interval is fixed at 1, and the 317 sequences are segmented by sliding from the N-terminus to the C-terminus, yielding 206990 sequence words. For example, the first protein sequence MNYLP…HPNSSPT…MQ yields sequence words such as MNYLP…HPNS, NYLP…HPNSS and YLP…HPNSSP after sliding segmentation.
(1-B) The frequency of the 20 amino acids in each sequence word is calculated, and each sequence word is expressed as a 20-dimensional vector; these 20-dimensional vectors are the sequence word features. For example, the feature value obtained by counting the frequencies of the 20 amino acids in MNYLP…HPNS is [0.08, …, 0.1, 0.06, 0, 0.04]. The 206990 sequence words give 206990 sequence word feature values.
(2) Cluster analysis is performed on the sequence word features to obtain the dictionary, whose size is the number m of cluster centers.
(2-A) The number m of cluster centers is increased gradually from 20 up to a maximum of 500. For example, with m = 50, 50 sequence word feature values are randomly selected from the data set of 206990 sequence word feature values as the initial cluster centers for K-means clustering.
(2-B) Each protein sequence is then uniquely represented by a number of cluster centers; for example, after clustering the first protein sequence can be represented as [0, …, 26, 17, …, 9], where the number of cluster-center labels equals the total number of sequence words obtained by segmenting that protein sequence.
(3) The position relation map of the sequence words represented by the cluster centers is extracted and sent into the convolutional neural network for feature conversion and dimensionality reduction.
(3-A) The positional relations of each protein sequence F are extracted with the adjacent-interval coefficient k = 1, so that each protein sequence F is expressed as a 50×50 relation matrix. The specific flow of this step is shown in FIG. 1 and has already been described in detail in the technical scheme, so it is not repeated here.
(3-B) The relation map is sent into the convolutional neural network for feature extraction: during extraction the convolutional neural network convolves and pools the matrix to extract features, as shown in FIG. 2; the overall flow of feature extraction with the CNN is shown in FIG. 3 and has already been described in detail in the technical scheme, so it is not repeated here. The output of the fully-connected layer is extracted directly as the feature representation of the relation map; assuming the output of the final fully-connected layer is set to 100, the relation map features of each protein sequence are represented as a 100-dimensional vector.
(4) The number of sequence words of each protein sequence belonging to each cluster center is counted, and the proportion of the number of sequence words at each cluster center to the total number of sequence words of the protein sequence is calculated, thereby obtaining the bag-of-words features of the protein sequence; these are fused with the relation map features obtained in step (3), and after PCA dimensionality reduction the final fusion feature representation of the protein sequence is obtained.
(4-A) After cluster analysis the sequence word features are mapped to the 50 cluster centers in the dictionary, and the number of sequence words of each protein sequence belonging to each cluster center is counted; for example, the numbers of sequence words of the first protein sequence MNYLP…HPNSSPT…MQ belonging to the 50 cluster centers are 0, …, 26, 17, …, 9. With the cluster centers as the abscissa and the numbers of sequence words as the ordinate, the sequence word histogram of the protein sequence can be drawn from these statistics.
(4-B) The proportion of the number of sequence words at each cluster center to the total number of sequence words of each protein sequence is calculated to obtain the bag-of-words features of the protein sequences; each protein sequence is represented as a 50-dimensional vector. For example, the first protein sequence MNYLP…HPNSSPT…MQ has 562 sequence words after segmentation, and its bag-of-words feature vector is [0, …, 0.046263, 0.030249, …, 0.003559, 0.016014].
(4-C) The relation map features obtained in step (3) are fused with the traditional bag-of-words features of (4-B), and PCA dimensionality reduction is applied to give the final fusion features of the protein sequence.
(5) The fusion features are sent into the support vector machine multi-class classifier to predict the protein subcellular intervals. For the 317 protein sequences, each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the data set size of 317; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test sample is sent into the trained support vector machine for prediction, and the prediction results are counted.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined by the appended claims.
Claims (6)
1. A protein subcellular interval prediction method based on a relational graph is characterized by comprising the following steps:
(1) segmenting all protein sequences, namely target sequences in a data set according to a certain length to generate a plurality of sequence words, and extracting the characteristics of all the sequence words;
(2) performing cluster analysis on the characteristics of the sequence words, constructing a dictionary by using a K-means clustering algorithm, wherein the number of clustering centers is the size of the dictionary, and the characteristics of the sequence words are mapped to each clustering center in the dictionary after the cluster analysis; the specific steps of the step (2) are as follows:
(2-A) clustering the sequence word features with the K-means algorithm to construct a dictionary, the core idea being to divide the sequence word features into different categories according to the principle of minimizing the within-class variance, the number of cluster centers being the size of the dictionary;
(2-B) quantitatively describing the protein sequence: each sequence word feature of the protein sequence is mapped to the cluster center closest to it in the dictionary, so that the protein sequence can be uniquely represented by a number of cluster centers; that is, any protein sequence can be defined as:
F = (x1, x2, x3, …, xn), 1 ≤ i ≤ n, n ∈ Z
wherein F is a protein sequence, xi denotes the cluster center mapped from the ith sequence word feature in the sequence F, and n is the number of sequence words obtained by segmentation;
(3) extracting a position relation map of the sequence words represented by the clustering center and sending the position relation map into a Convolutional Neural Network (CNN) for feature extraction; the specific steps of the step (3) are as follows:
(3-A) extracting a relationship map:
(3-A-1) traversing the protein sequence F, i.e. for (i = 1; i ≤ n; i++): for any cluster center xi in F, the cluster centers xj adjacent to it form cluster segments (xi, xj), with j taking values from i−k to i−1 in turn, where k is the adjacent-interval coefficient; all cluster segments of the sequence F are thereby obtained;
(3-A-2) generating a matrix D of size m×m, whose rows and columns respectively correspond to the cluster centers, and assigning the matrix according to the number of occurrences of each cluster segment (xi, xj), i.e. letting D(xi, xj) += 1;
(3-A-3) repeating step (3-A-2) until i = n, thereby obtaining the corresponding m×m position relation matrix;
(3-A-4) representing the values of all elements in the position relation matrix as pixel points of different brightness to obtain the relation map;
(3-B) dimensionality reduction of the relation map: the relation map is sent into the convolutional neural network (CNN) for feature extraction, where the CNN consists of an input layer, convolutional layers, down-sampling layers, a fully-connected layer and an output layer:
the input layer is the relation map X;
the feature extraction is carried out on the X by formulating different window values by the convolution layer, and the relation graph obtained after the convolution is expressed as:
Ci = g(Wi * Ci-1 + bi)
where Wi denotes the weight vector of the ith-layer convolution kernels in the convolutional neural network, bi the bias of the ith layer, and Ci the feature map of the ith layer, with C0 = X when i = 0; g denotes the activation function, for which the rectified linear unit ReLU is used;
the down-sampling layer is used for down-sampling the atlas, the complete information of the relation atlas is kept as much as possible, and simultaneously, the dimension reduction is carried out on the characteristic map to reduce the parameters of the subsequent layer, and the formula is expressed as follows:
Ci=max(Ci-1)
the fully-connected layer adopts a dropout technology, partial neurons are abandoned according to a certain proportion in the training process, cross entropy is used as a loss function to update weights, and weight attenuation is introduced to regularize parameters;
the output layer extracts the output of the full-connection layer as the characteristic vector representation of the relation map;
(4) counting the number of sequence words of each protein sequence belonging to each clustering center, calculating the proportion of the number of the sequence words on each clustering center to the total number of the sequence words of the protein sequence, thereby obtaining the bag-of-word characteristics of the protein sequence, then fusing with the relational graph characteristics obtained in the step (3), and obtaining the final fusion characteristics of the protein sequence after PCA dimension reduction;
(5) and (3) sending the fusion characteristics of the protein sequence into a support vector machine multi-class classifier to predict the protein subcellular interval.
2. The method according to claim 1, wherein the specific steps of step (1) are:
(1-A) performing segmentation processing on the protein sequences with a sliding-window segmentation method: the characters of a protein sequence are traversed in order from the head end to the tail end with the sliding interval fixed at 1, and the segmentation length of a sequence word is determined by the sliding window size, selected as follows:
L = min(L1, L2, …, Ls), L/2 ≤ N ≤ L
wherein L1, L2, …, Ls respectively denote the lengths of the protein sequences in the data set, and L is taken as the length of the shortest protein sequence in the data set; N is the sliding window size;
(1-B) after segmentation, the traditional protein feature extraction algorithm AAC is used to count the amino acid composition information of each sequence word to obtain the protein sequence word features.
3. The method according to claim 1, characterized in that the convolutional layers use the rectified linear unit ReLU as the activation function g.
4. The method of claim 1, wherein during the training of the fully-connected layer, a fraction of neurons is discarded at a rate of 0.5.
5. The method according to claim 1, wherein the specific steps of step (4) are:
(4-A) counting the number of sequence words of each protein sequence belonging to each clustering center, and obtaining a sequence word histogram of the protein sequence by taking the clustering center as an abscissa and the number of the sequence words as an ordinate;
(4-B) calculating, from the word histogram, the proportion of the number of sequence words at each cluster center of each protein sequence to the total number of sequence words of that protein sequence, thereby obtaining the bag-of-words features of the protein sequence, wherein each protein sequence is represented as an m-dimensional bag-of-words feature vector;
(4-C) concatenating the relation map features obtained in step (3) and the bag-of-words features obtained in step (4-B) into the final feature vector, which after PCA dimensionality reduction is sent into the classifier for classification, the feature vector obtained after fusion being expressed as:
V = [vt1, vt2, vt3, …, vtm, vp1, vp2, vp3, …, vpp]
wherein vt is the bag-of-words feature vector and vp is the positional feature vector;
and performing PCA (principal component analysis) dimensionality-reduction processing on the obtained feature vector to obtain the final fusion features.
6. The method according to claim 1, wherein the specific steps of step (5) are: each time only one protein sequence is selected from the data set to form the test set and the training set consists of the remaining protein sequences, so the number of tests equals the size of the data set; the training samples (Ci, yi) are sent into the support vector machine multi-class classifier, where the vector Ci denotes the fusion features of the ith group of training samples obtained by the above steps and yi denotes the subcellular location category corresponding to the protein sequence; finally the test sample is sent into the trained support vector machine for prediction, and the prediction results are counted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811014322.6A CN109273054B (en) | 2018-08-31 | 2018-08-31 | Protein subcellular interval prediction method based on relational graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109273054A CN109273054A (en) | 2019-01-25 |
CN109273054B true CN109273054B (en) | 2021-07-13 |
Family
ID=65155000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811014322.6A Expired - Fee Related CN109273054B (en) | 2018-08-31 | 2018-08-31 | Protein subcellular interval prediction method based on relational graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109273054B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110797084B (en) * | 2019-11-06 | 2021-05-25 | 吉林大学 | Deep neural network-based cerebrospinal fluid protein prediction method |
CN110827922B (en) * | 2019-11-06 | 2021-04-16 | 吉林大学 | Prediction method of amniotic fluid protein based on circulating neural network |
CN113571124B (en) * | 2020-04-29 | 2024-04-23 | 中国科学院上海药物研究所 | Method and device for predicting ligand-protein interaction |
CN112182275A (en) * | 2020-09-29 | 2021-01-05 | 神州数码信息系统有限公司 | Trademark approximate retrieval system and method based on multi-dimensional feature fusion |
CN112908418B (en) * | 2021-02-02 | 2024-06-28 | 杭州电子科技大学 | Dictionary learning-based amino acid sequence feature extraction method |
CN114360644A (en) * | 2021-12-30 | 2022-04-15 | 山东师范大学 | Method and system for predicting combination of T cell receptor and epitope |
CN116361713B (en) * | 2023-04-19 | 2023-08-29 | 湖北交通职业技术学院 | Performance detection method and system for aircraft engine |
CN117095743B (en) * | 2023-10-17 | 2024-01-05 | 山东鲁润阿胶药业有限公司 | Polypeptide spectrum matching data analysis method and system for small molecular peptide donkey-hide gelatin |
CN117672353B (en) * | 2023-12-18 | 2024-08-16 | 南京医科大学 | Space-time proteomics deep learning prediction method for protein subcellular migration |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899477A (en) * | 2015-06-18 | 2015-09-09 | 江南大学 | Protein subcellular interval prediction method using bag-of-word model |
CN105046106A (en) * | 2015-07-14 | 2015-11-11 | 南京农业大学 | Protein subcellular localization and prediction method realized by using nearest-neighbor retrieval |
CN105631480A (en) * | 2015-12-30 | 2016-06-01 | 哈尔滨工业大学 | Hyperspectral data classification method based on multi-layer convolution network and data organization and folding |
CN105760711A (en) * | 2016-02-02 | 2016-07-13 | 江南大学 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
CN106295139A (en) * | 2016-07-29 | 2017-01-04 | 汤平 | A kind of tongue body autodiagnosis health cloud service system based on degree of depth convolutional neural networks |
CN106845510A (en) * | 2016-11-07 | 2017-06-13 | 中国传媒大学 | Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features |
Non-Patent Citations (2)
Title |
---|
Application of machine learning algorithms in protein structure prediction; Xue Yanna; China Master's Theses Full-text Database, Basic Sciences; 2017-02-15; pp. 6, 14-15 *
Application of the bag-of-words model in protein subcellular localization prediction; Zhao Nan et al.; Journal of Food Science and Biotechnology; 2017-03; pp. 296-300 *
Also Published As
Publication number | Publication date |
---|---|
CN109273054A (en) | 2019-01-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210713 |