CN108197427B

CN108197427B - Protein subcellular localization method and device based on deep convolutional neural network

Info

Publication number: CN108197427B
Application number: CN201810002518.7A
Authority: CN
Inventors: 刘弘; 丛菡菡; 陈月辉; 韩延彬
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2018-01-02
Filing date: 2018-01-02
Publication date: 2020-09-04
Anticipated expiration: 2038-01-02
Also published as: CN108197427A

Abstract

The invention discloses a protein subcellular localization method and a device based on a deep convolutional neural network, wherein the method comprises the following steps: receiving sequence information of known protein subcellular positions, and establishing and storing a reference protein sequence database; carrying out feature extraction on protein sequences in a reference protein sequence database, and carrying out feature fusion on extracted feature data; taking the fused feature data as the input of a deep convolutional neural network, and training the deep convolutional neural network to obtain a deep convolutional neural network classifier; and receiving a protein sequence to be predicted, extracting features, carrying out corresponding feature fusion, inputting the features into the trained deep convolutional neural network classifier, and predicting and positioning the protein subcellular position of the deep convolutional neural network classifier. The method solves the problem of selecting the optimal characteristics in the current protein subcellular localization research, and simultaneously further improves the accuracy.

Description

Protein subcellular localization method and device based on deep convolutional neural network

Technical Field

The invention belongs to the technical field of protein subcellular localization in bioinformatics, and particularly relates to a protein subcellular localization method and device based on a deep convolutional neural network.

Background

With the development of information technology and the initiation of human genome project, bioinformatics is becoming a popular field of research in recent years, and the main objective thereof is to reveal the regularity of biological systems by analyzing and counting various biological data. In the study of thousands of biological data and scrambled gene sequences or protein sequences, many new research directions have been created, one of which is the subcellular localization of proteins.

Cells are the most basic unit in biology, but the structure is highly complex, and various organelles, namely subcellular cells, can be classified according to the position of each structure in the cell and the function of each structure. Different subcellular structures provide different places for the proteins to perform specific functions, and the proteins can only perform respective functions in the specific places to maintain normal life activities of life bodies. Therefore, accurate positioning of the subcellular location of the protein is of crucial relevance to the study of the significance and action principle of the protein in the living body.

The explosive development of human genomics and proteomics has brought about the well-blown increase of the number of protein sequences in databases, the traditional mode of mainly determining the protein subcellular location through experiments becomes unable to adapt to the current situation, and people begin to try to realize new protein subcellular localization by using a machine learning method, i.e. an unknown protein sequence is assigned to a known subcellular location through a machine learning method, and the biological characteristics of the unknown protein and the like are known by collecting the characteristic information of the known protein.

Machine learning is an artificial intelligence method, and its application in protein subcellular localization mainly comprises three steps: the establishment of a reference data set, the feature extraction of protein and the design of a classifier can be seen through the main steps, the protein subcellular localization by using machine learning is different from the traditional mode of experimental localization, and the method is more suitable for processing a large amount of disordered data and has better generalization capability. At present, machine learning has achieved good effects in protein subcellular localization, annotation is made for a large number of new protein sequences, and the problem of rapid increase of biological data is solved to a certain extent, but there are many places to be improved.

Recently, the focus of protein subcellular localization by machine learning is also the improvement of prediction accuracy, if the prediction accuracy needs to be improved, a proper classifier needs to be designed, and protein features suitable for the classifier need to be extracted, namely the design of the classifier and the feature extraction mode of the protein are two key factors influencing the protein subcellular localization. On the premise of lacking enough experimental data, it is very difficult to find a suitable protein feature extraction mode, people hope to have a classifier capable of automatically distinguishing the quality of features, and the Support Vector Machine (SVM), the K-nearest neighbor model (KNN), the Artificial Neural Network (ANN) and the like which are frequently used at present cannot do the problem.

Unlike several classifiers mentioned above, the deep learning algorithm can select a superior feature from a large number of input features for learning by performing multi-layer linear filtering and nonlinear transformation on input data. Typical deep learning structures include deep belief networks (DDBN), deep autoencoders (SAE), and Deep Convolutional Neural Networks (DCNN), among others. The deep learning algorithm was applied to the feature extraction process of images and voices at the earliest, and has been used in related fields of bioinformatics in recent years, and has shown an effect superior to the existing algorithm framework in some directions.

Deep Convolutional Neural Network (DCNN) is a typical learning model of deep learning algorithm, which is essentially input-to-output mapping, and can learn a large number of input-to-output mapping relations without any precise mathematical expression between input and output. By performing convolution operation and linear filtering on the input features, the feature data can be enhanced and the noise can be reduced. The training process of the deep convolutional neural network comprises a forward propagation stage and a backward propagation stage, wherein instructor training is performed, and neurons of each layer can share one group of weights, so that the complexity of the network can be reduced, and the training and learning of the network can be performed in a parallel mode.

Because the deep convolutional neural network has the characteristic of autonomously selecting the dominant features, the dominant features can be enhanced and noise can be removed through convolutional operation in the selection process, the problem that an appropriate feature extraction mode is autonomously selected can be solved just by applying the deep convolutional neural network to the subcellular localization of the protein, and better localization accuracy is obtained. However, the existing protein subcellular localization based on deep convolutional neural network cannot autonomously select dominant features.

In summary, the protein subcellular localization method based on machine learning in the prior art has the problems that the localization precision needs to be further improved and the dominant features cannot be selected independently, and an effective solution is not available.

Disclosure of Invention

Aiming at the defects in the prior art and solving the problems that the positioning precision needs to be further improved and the dominant features cannot be selected independently in the protein subcell positioning method based on machine learning in the prior art, the invention provides a protein subcell positioning method and device based on a deep convolutional neural network.

The first purpose of the invention is to provide a protein subcellular localization method based on a deep convolutional neural network.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method for protein subcellular localization based on a deep convolutional neural network, the method comprising:

receiving sequence information of known protein subcellular positions, and establishing and storing a reference protein sequence database;

carrying out feature extraction on protein sequences in a reference protein sequence database, and carrying out feature fusion on extracted feature data;

taking the fused feature data as the input of a deep convolutional neural network, and training the deep convolutional neural network to obtain a deep convolutional neural network classifier;

and receiving a protein sequence to be predicted, extracting features, carrying out corresponding feature fusion, inputting the features into the trained deep convolutional neural network classifier, and predicting and positioning the protein subcellular position of the deep convolutional neural network classifier.

More preferably, the sequence information of the known subcellular position of the protein includes the entire amino acid composition and the position information of the protein in each subcellular.

As a further preferable mode, in the method, three features of the protein sequence in the reference protein sequence database are extracted based on the physicochemical properties of the protein sequence;

the three characteristics of the extracted protein sequences are R-dipetide, I-PseAAC and PseAAC2 respectively; wherein, R-Dipeptide is the improvement of the characteristics of amino acid Dipeptide (Di-Dipeptide); I-PseAAC and PseAAC2 are improvements in the characteristics of amino acid pseudo amino acids (PseAAC).

As a further stepIn a preferred embodiment, the I-PseAAC feature extraction mode is to calculate the first amino acid residue R in the protein sequence₁And position information between other amino acid residues;

the PseAAC2 is characterized by enhancing the representation of the physicochemical properties of the amino acid residues in the process of representing the position information when extracting.

As a further preferred scheme, the number of layers of the deep convolutional neural network and the number of nodes contained in each hidden layer are determined in the structural design stage of the deep convolutional neural network classifier.

As a further preferred scheme, the connection weights between the layers of the deep convolutional neural network are determined in the training process of the deep convolutional neural network classifier.

As a further preferred approach, the deep convolutional neural network classifier training process includes a forward propagation stage and a backward propagation stage.

As a further preferred scheme, the deep convolutional neural network classifier training process is trained by instructors.

It is a second object of the present invention to provide a computer-readable storage medium.

a computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the process of:

A third object of the present invention is to provide a terminal device.

a terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the process of:

The invention has the beneficial effects that:

1. the deep convolutional neural network is used as a classifier, is different from the classifier already used in protein subcellular localization at present, can learn a large number of mapping relations from input to output, autonomously selects the advantage characteristics through multilayer linear filtering and nonlinear mapping, enhances the advantage characteristics in convolutional operation, and solves the problem of selecting the advantage characteristics in the current protein subcellular localization research.

2. In order to further improve the accuracy of protein subcellular localization, the method and the device adopt three improved amino acid feature extraction modes in the process of extracting the features of a protein sequence, wherein the three improved amino acid feature extraction modes are R-Dipeptide, I-PseAAC and PseAAC2, which are respectively used for improving the characteristics of Dipeptide and pseudo amino acid of the amino acid and fusing the three extracted features.

3. According to the protein subcellular localization method and device based on the deep convolutional neural network, the deep convolutional neural network has the capacity of processing large-scale complex data, compared with the scale of a reference protein sequence database adopted in various current protein subcellular localization methods, the scale of the reference protein sequence database adopted in the method is larger, and the method and device are beneficial to improving the accuracy of localization.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flow chart of the protein subcellular localization method based on deep convolutional neural network of the present invention;

FIG. 2 is a schematic representation of the organelle distribution in a eukaryotic cell of the present invention;

FIG. 3 is a schematic diagram of the deep convolutional neural network structure of the present invention;

FIG. 4 is a schematic diagram of the deep convolutional neural network convolution and pooling of the present invention;

FIG. 5 is a schematic diagram of a process for predicting the location of a protein subcellular of the invention;

FIG. 6 is a schematic diagram of the structure of the protein subcellular localization system based on deep convolutional neural network of the present invention;

FIG. 7 is a schematic diagram of the subcellular localization structure of the protein to be predicted according to the present invention.

The specific implementation mode is as follows:

the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Without conflict, the embodiments and features of the embodiments of the present application may be combined with each other to further explain the present invention in conjunction with the figures and embodiments.

Example 1:

the purpose of this example 1 is to provide a protein subcellular localization method based on deep convolutional neural network.

as shown in figure 1 of the drawings, in which,

step (1): receiving sequence information of known protein subcellular positions, and establishing and storing a reference protein sequence database;

step (2): carrying out feature extraction on protein sequences in a reference protein sequence database, and carrying out feature fusion on extracted feature data;

and (3): taking the fused feature data as the input of a deep convolutional neural network, and training the deep convolutional neural network to obtain a deep convolutional neural network classifier;

and (4): and receiving a protein sequence to be predicted, extracting features, carrying out corresponding feature fusion, inputting the features into the trained deep convolutional neural network classifier, and predicting and positioning the protein subcellular position of the deep convolutional neural network classifier.

In step (1) of the present example, the reference protein sequence database is configured and represented as follows.

As shown in FIG. 2, the human cell contains more than 20 organelles, and the proteins are mainly distributed in more than ten organelles, and the present invention uses the human protein sequence library issued by UniProtKB, which contains 9895 different protein sequences, and the positions of the protein sequences are distributed in 10 different organelles, namely cytoplasm, nucleus, cell membrane, inner membrane, secretion, cytoskeleton, cell protrusion, endoplasmic reticulum membrane, synapse and mitochondria. To avoid homology divergences, the sequence similarity between protein sequences in the reference protein sequence database is 70% or less.

In the present invention, we consider protein subcellular localization as a single-tag multi-class problem, but protein sequences in the reference protein sequence database may exist in two or more organelles, and for ease of handling, when a protein sequence belongs to two or more subcellular locations, it is considered as several different protein sequences, respectively belonging to different subcellular locations, and referred to as location-based protein sequences.

In the reference protein sequence database, the number of protein sequences based on position is:

wherein N is_locDenotes the total number of protein sequences on a position basis, m denotes that the protein sequences are at most in m different subcellular positions, t denotes that the protein sequences are in t subcellular positions, N (t) denotes the number of protein sequences in t subcellular positions.

In step (2) of this example, the protein sequence in the reference protein sequence database is subjected to feature extraction, and three features of the protein sequence in the reference protein sequence database are extracted according to the physicochemical properties of the protein sequence; the three characteristics of the extracted protein sequences are R-dipetide, I-PseAAC and PseAAC2 respectively; wherein, R-Dipeptide is the improvement of the characteristics of amino acid Dipeptide (Di-Dipeptide); I-PseAAC and PseAAC2 are improvements in the characteristics of amino acid pseudo amino acids (PseAAC).

R-Dipeptide characteristic extraction mode:

protein sequences are composed of 20 amino acid residues, and a certain protein sequence P, assuming that it contains L amino acid residues, can be expressed as:

P＝R₁,R₂,R₃,…,R_Lformula (2)

Wherein R is₁Denotes the first amino acid residue of the protein sequence P, R₂Represents the second amino acid residue of the protein sequence P, and so on.

(ii) the entire protein sequence is truncated from the first amino acid residue using a window of length 30, such that the first set of subsequences is { R }₁,R₂,…,R₃₀A second set of subsequences of { R }₂,R₃,…,R₃₁And so on, the last amino acid residue R in the subsequence_NWherein N is less than the length of all protein sequences.

All subsequences are combined to form a new protein sequence.

The frequency of occurrence of the dipeptides, i.e. the amino acid pairs, was calculated for the new protein sequence, and the amino acid pairs consisting of 20 amino acids had a combination of 20 × 20 — 400, and the feature vector can be expressed as:

V＝[f₁,f₂,…,f₄₀₀]^Tformula (3)

Wherein the content of the first and second substances,

denotes the frequency of occurrence of amino acid pairs, R_iRepresents the number of amino acid pairs, and N is the number of amino acids in the new protein sequence, i.e., the length of the protein sequence.

I-PseAAC feature extraction mode:

the positional information of the protein sequence can be expressed as follows:

wherein the content of the first and second substances,_θis a theta-level correlation factor which represents the sequence position relation of at most theta amino acid residues, and N is the number of amino acids in the protein sequence. Omega (R)_i，R_i+1) Can be expressed as:

wherein R is_iDenotes the i amino acid sequence, H₁(R_i)、H₂(R_i)、Pk₁(R_i)、Pk₂(R_i)、PI(R_i) And M (R)_i) Respectively represent the ith amino acid residue R in the protein sequence_iHydrophobicity, hydrophilicity, Pk1(-COOH), Pk2(-NH3), PI and side chain molecular weight values.

The traditional way of extracting features of PseAAC is to calculate the position information of two adjacent amino acid residues in a protein sequence, such as (R)₁,R₂)，(R₂,R₃)，(R₃,R₄) And so on. The feature extraction mode of the I-PseAAC used in the invention is to calculate the first amino acid residue R in the protein sequence₁And positional information between other amino acid residues, e.g. (R)₁,R₂)，(R₁,R₃)，(R₁,R₄) And so on.

The feature extraction method of PseAAC2 is as follows:

different from the traditional PseAAC feature extraction mode and the I-PseAAC feature extraction mode, the PseAAC2 feature extraction mode adopts a new position information representation method, and the embodiment of enhancing the physicochemical properties of amino acid residues in the representation process can be represented as follows:

Ω(R_i)＝[H₁(R_i)²+H₂(R_i)²+Pk₁(R_i)²+Pk₂(R_i)²+PI(R_i)²+M(R_i)²]formula (6)

Ω(R_i，R_j)＝Ω(R_i)*Ω(R_j) Formula (7)

Wherein R is_iDenotes the i amino acid sequence, R_jDenotes the jth amino acid sequence, H₁(R_i)、H₂(R_i)、Pk₁(R_i)、Pk₂(R_i)、PI(R_i) And M (R)_i) Respectively represent the ith amino acid residue R in the protein sequence_iHydrophobicity, hydrophilicity, Pk1(-COOH), Pk2(-NH3), PI and side chain molecular weight values.

In step (3) of this embodiment, the fused feature data is used as an input of the deep convolutional neural network, and the deep convolutional neural network is trained to obtain the deep convolutional neural network classifier.

Network structure of Deep Convolutional Neural Network (DCNN):

the deep convolutional neural network is a multilayer artificial neural network, each layer is composed of a plurality of two-dimensional planes, and each plane is composed of a plurality of independent neurons, as shown in fig. 3, wherein a layer C1 and a layer C3 are convolutional layers, namely feature extraction layers, the input of each neuron is connected with a local sensing field with a certain size of the previous layer, the local feature of a sample is extracted, and the relationship between the local feature and other local features is determined simultaneously in the process of extracting the local feature; the S2 layer and the S4 layer are pooling layers, each computing layer in the pooling layers is composed of a plurality of feature samples, each feature sample is a plane, and all neurons on the plane have uniform weights. The core ideas of the deep convolutional neural network are local receptive field, weight sharing and time and space sampling.

As shown in fig. 4, the convolutional layer reception field size is r × s, the corresponding shared weights are called filters or convolutional kernels, and different filters slide on the input layer along the spatial dimension to generate a plurality of feature maps. The spatial dimension of the signature after the convolution operation shown in fig. 4 is (m-r +1) × (n-s +1), the number of signatures is called the number of channels, the number of channels is determined by the number of filters of the convolution layer, and the sum of the filters is called the filter bank. Since the number of channels of the input layer is usually not 1, the filter is usually in the form of a 3-dimensional matrix, the filter size shown in fig. 4 is r × s × p, where p is the number of channels of the input layer, and the filter bank is a 4-dimensional matrix, where the size of the filter bank in fig. 4 is r × s × p × q, where q is the number of channels of the output layer. After passing through one convolution layer, the element of the kth feature map at spatial position (i, j) of the tth layer can be calculated as:

wherein f (-) is the activation function of element-by-element operation, s, r is the receptive field space size, p is the input layer channel number, b is the bias term, and a feature diagram is shared generallyOne value, w, is the network weight,

the element at spatial position (i + m, j + n) for the c-th feature map of layer t-1.

The pooling layer is a domain operation that down-samples the response in the domain, maintaining the number of channels in the input layer constant. At present, the most common method is to maximize pooling, and as shown in fig. 4, pooling the kth feature map of the t-th layer can be expressed as:

wherein the area of size u × v is called pooling domain or pooling receptive field,

for the element of the kth profile of the t slice at spatial position (i × u + m, j × v + n),

for the element of the kth profile of the t +1 layer at spatial position (i, j), the neighboring pooled receptive fields represented in equation (9) are non-overlapping.

Optimization algorithm of Deep Convolutional Neural Network (DCNN):

assume use of (x)⁽ⁱ⁾,y⁽ⁱ⁾) Represents a training sample, where x⁽ⁱ⁾Representing input, i.e. features, y⁽ⁱ⁾Representing the output, i.e., the true value, the training set containing m samples can be represented as { (x)⁽¹⁾，x⁽¹⁾)，(x⁽²⁾，x⁽²⁾)，...，(x^(m)，y^(m))}. The optimization of the Deep Convolutional Neural Network (DCNN) is to optimize network parameters on the premise of determining the network structure, so that a DCNN model h_{w,b}When x is⁽ⁱ⁾As input, its output

In forms and compositions as close to y as possible⁽ⁱ⁾. Defining metricsDCNN model h_{w,b}Output value

And true value y⁽ⁱ⁾The difference between is the loss function l (h)_{w,b}(x⁽ⁱ⁾),y⁽ⁱ⁾) The optimization process of DCNN adjusts { w, b } to minimize the loss function of the model, i.e.:

wherein J (w, b) is the loss function of the whole model, m is the number of samples in the training set,

is a 2 norm regularization term of the weight, and lambda is a hyper-parameter controlling the relative importance of the error term and regularization.

Compared with the loss function on the training set, the process of optimizing the model needs to consider the loss function on the prediction set, i.e. the generalization capability of the model, and when training the DCNN, since m is usually very large, it is often replaced by an objective function of a random small-batch approximation formula, that is:

where n is the size of the random mini-batch. Equation (11) is a highly nonlinear and non-convex optimization problem, and the currently used DCNN optimization algorithm is a gradient-based method. The invention uses an adaptive gradient algorithm (AdaGrad), and when the ith sample is input, the updating strategy is as follows:

wherein the content of the first and second substances,

gradient information representing j iterations, α is global learning rate, W_tIs the weight of the t-th layer. AdaGrad isAn optimization algorithm for parameter-by-parameter adaptive learning rate.

As shown in fig. 5, in step (4) of this example, the prediction of the location of protein subcellular cells includes:

step (4-1): reading in all amino acid information of a protein sequence to be predicted;

step (4-2): extracting the characteristics of the protein sequence and fusing the extracted characteristics;

step (4-3): and inputting the fusion features into a deep convolutional neural network classifier to predict the positions of the protein subcells.

As shown in fig. 6, the protein subcellular localization system based on deep convolutional neural network of the present invention comprises:

the protein sequence information storage and reading unit is used for storing a reference protein sequence database and all amino acid information of a protein sequence to be predicted, reading part of information when needed, and mainly adopting a format file storage mode;

the system comprises a protein sequence feature extraction and feature fusion unit, a deep convolution neural network classifier and a feature fusion unit, wherein the protein sequence feature extraction and feature fusion unit is used for extracting the physical and chemical features of a protein sequence and fusing the features, and the unit obtains input data of the deep convolution neural network classifier and mainly comprises an R-watermark feature extraction unit, an I-PseAAC feature extraction unit, a PseAAC2 feature extraction unit and a feature fusion unit;

and the deep convolutional neural network classifier construction unit is used for constructing a proper deep convolutional neural network classifier, and mainly comprises the determination of the basic structure of the deep convolutional neural network and the training of the classifier. Determining the number of hidden layer layers and the number of nodes contained in each hidden layer, and training a classifier, namely determining each connection weight of the deep convolutional neural network through a reference protein sequence database;

the protein subcellular localization unit is used for predicting a new protein sequence, the main structure of which is shown in figure 7, and the protein subcellular localization unit is helpful for annotating the new protein sequence and further understanding the biochemical characteristics of the new protein sequence.

The protein subcellular localization unit comprises:

the protein sequence feature reading module is used for reading out the feature of the new protein sequence after fusion from a storage file for subsequent positioning;

protein sequence prediction module: the method is used for inputting the fusion characteristics of the protein sequence into a deep convolution neural network classifier to position the protein subcellular;

the positioning result display and storage module: it is used to display and store the localization results of the protein sequences.

Example 2:

the object of this embodiment 2 is to provide a computer-readable storage medium.

Example 3:

the purpose of this embodiment 3 is to provide a terminal device.

These computer-executable instructions, when executed in a device, cause the device to perform methods or processes described in accordance with various embodiments of the present disclosure.

In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry can execute computer-readable program instructions to implement aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

It should be noted that although several modules or sub-modules of the device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.

The invention has the beneficial effects that:

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A protein subcellular localization method based on a deep convolutional neural network is characterized by comprising the following steps:

receiving a protein sequence to be predicted, extracting features, carrying out corresponding feature fusion, inputting the features into a trained deep convolutional neural network classifier, and predicting and positioning the protein subcellular position of the deep convolutional neural network classifier;

in the method, three characteristics of a protein sequence in a reference protein sequence database are extracted according to the physicochemical properties of the protein sequence;

the three characteristics of the extracted protein sequences are R-dipetide, I-PseAAC and PseAAC2 respectively; wherein, R-Dipeptide is the improvement of the characteristics of amino acid Dipeptide Di-Dipeptide; I-PseAAC and PseAAC2 are the improvement of the characteristics of amino acid pseudo amino acid PseAAC;

where a protein sequence belongs to two or more subcellular locations, it is considered to be several different protein sequences, each belonging to a different subcellular location, and is referred to as a location-based protein sequence;

the extraction mode of the R-dipetide characteristics is as follows:

P＝R₁,R₂,R₃,…,R_Lformula (1)

Wherein R is₁Denotes the first amino acid residue of the protein sequence P, R₂Denotes the second amino acid residue of the protein sequence P, R_LRepresents the L-th amino acid residue of the protein sequence P;

(ii) the entire protein sequence is truncated from the first amino acid residue using a window of length 30, such that the first set of subsequences is { R }₁,R₂,…,R₃₀A second set of subsequences of { R }₂,R₃,…,R₃₁And so on, the last amino acid residue R in the subsequence_NWherein N is less than the length of all protein sequences;

combining all subsequences to form a new protein sequence;

V＝[f₁,f₂,…，f₄₀₀]^Tformula (2)

Wherein ^ n_i＝∑ⁿ _i＝1R_i/(N₁-1), i ═ 1, 2.., 400, representing the frequency of occurrence of amino acid pairs, R_iDenotes the number of occurrences of amino acid pairs, N₁The number of amino acids in the new protein sequence, namely the length of the protein sequence;

the extraction mode of the I-PseAAC features is as follows:

the positional information of the protein sequence can be expressed as follows:

wherein the content of the first and second substances,_θis a theta-level correlation factor which represents the sequence position relationship of at most theta amino acid residues, N₂The number of amino acids in the protein sequence; omega (R)_E，R_E+1) Can be expressed as:

wherein R is_EDenotes the amino acid sequence E, H₁(R_e)、H₂(R_e)、Pk₁(R_e)、Pk₂(R_e)、PI(R_e) And M (R)_e) Respectively represent the e-th amino acid residue R in the protein sequence_eThe hydrophobicity value, hydrophilicity value, Pk1(-COOH), Pk2(-NH3), PI and side chain molecular weight value;

the extraction mode of the features of PseAAC2 is as follows:

the PseAAC2 feature extraction mode adopts a novel position information representation method, and the representation of enhancing the physicochemical properties of amino acid residues in the representation process can be represented as follows:

Ω(R_E)＝[H₁(R_e)²+H₂(R_e)²+Pk₁(R_e)+Pk₂(R_e)²+PI(R_e)²+M(R_e)²]formula (5)

Ω(R_E，R_j)＝Ω(R_E)*Ω(R_j) Formula (6)

Wherein R is_EDenotes the amino acid sequence of E, R_jDenotes the jth amino acid sequence, H₁(R_e)、H₂(R_e)、Pk₁(R_e）、Pk₂(R_e)、PI(R_e) And M (R)_e) Respectively represent the e-th amino acid residue R in the protein sequence_eHydrophobicity, hydrophilicity, Pk1(-COOH), Pk2(-NH3), PI and side chain molecular weight values.

2. The method of claim 1, wherein the sequence information for the subcellular location of the known protein comprises the total amino acid composition and the location information for the protein in each subcellular.

3. The method of claim 1, wherein the I-PseAAC is characterized by calculating the first amino acid residue R in the protein sequence₁And position information between other amino acid residues;

4. The method of claim 1, wherein the number of layers of the deep convolutional neural network and the number of nodes contained in each hidden layer are determined at a deep convolutional neural network classifier structure design stage.

5. The method of claim 4, in which join weights between layers of the deep convolutional neural network are determined during deep convolutional neural network classifier training.

6. The method of claim 5, in which the deep convolutional neural network classifier training process comprises a forward propagation stage and a backward propagation stage.

7. The method of claim 6, wherein the deep convolutional neural network classifier training process is instructive trained.

8. A computer-readable storage medium having stored thereon a plurality of instructions, characterized in that said instructions are adapted to be loaded by a processor of a terminal device and to perform the method according to any one of claims 1-7.

9. A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; a computer-readable storage medium for storing a plurality of instructions for performing the method of any of claims 1-7.