CN115966249B - protein-ATP binding site prediction method and device based on fractional order neural network - Google Patents

protein-ATP binding site prediction method and device based on fractional order neural network

Info

Publication number
CN115966249B
Authority
CN
China
Prior art keywords
protein
prediction
fractional
training set
neural network
Prior art date
Legal status
Active
Application number
CN202310115169.0A
Other languages
Chinese (zh)
Other versions
CN115966249A (en)
Inventor
王艺舒 (Wang Yishu)
陈晓敏 (Chen Xiaomin)
郭梦瑶 (Guo Mengyao)
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing (USTB)
Priority to CN202310115169.0A
Publication of CN115966249A
Application granted
Publication of CN115966249B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a protein-ATP binding site prediction method and device based on a fractional order neural network, and relates to the technical field of protein-ligand binding site prediction. The method comprises the following steps: the features required by the model are extracted from the digitized information of the protein and integrated into a feature matrix used as the input. Then, the parameter updating process of the back propagation of the convolutional neural network is modified into a fractional-order gradient iteration; test data show that the prediction performance of the convolutional neural network modified with the fractional-order gradient is superior to that of existing machine learning and integer-order deep learning models. By combining deep learning with fractional differentiation, a protein-ATP binding site prediction method with improved accuracy is provided. The invention focuses on adding the fractional-order gradient under the Caputo definition to the fully connected layer of the single-start predictor, improving the performance of the predictor while ensuring convergence and the validity of the chain rule.

Description

protein-ATP binding site prediction method and device based on fractional order neural network
Technical Field
The invention relates to the technical field of protein-ligand binding site prediction, in particular to a protein-ATP binding site prediction method and device based on fractional order neural network.
Background
Protein is an important substance constituting life, and research on it has never stopped. Initially, the composition of proteins was an elusive problem; today, with the rapid development of computer technology, scientists have determined more and more primary structures of proteins with the aid of computers and have built specialized databases for querying and use, for example the PDB protein database [H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000). The Protein Data Bank. Nucleic Acids Research, 28:235-242]. However, obtaining other information about proteins, such as the tertiary structure and the binding sites with other substances, is not an easy matter.
The prediction of protein-ligand interaction sites is of great significance for the determination of drug targeting sites, and the determination of protein structures and of binding sites with other compounds is important for promoting drug action and improving the rate and efficiency of in vivo biochemical reactions, such as enzymatic reactions and ATP binding. Protein-ligand interactions are critical for various biological processes such as membrane transport, cell movement, muscle contraction, signal transduction, and the transcription and replication of DNA [Liu Guixia, Pei Zhiyao, Song Jiazhi. Protein-ATP binding site prediction based on deep learning [J]. Journal of Jilin University (Engineering Edition), 2022, 52(01): 187-194]. In the process of drug discovery, protein-ligand interaction is an important basis for determining drug targeting sites and has guiding significance for developing new drugs for cancer, diabetes, Alzheimer's disease and other diseases. Thus, accurate recognition of protein binding sites is of great importance both for protein function annotation and for the determination of drug targets.
Among these ligands, ATP (adenosine triphosphate) is a small-molecule compound that can function as a coenzyme in cells and also plays an important role in various metabolic processes [Hu Jun, Li Yang, Zhang Yang, et al. ATPbind: Accurate protein-ATP binding site prediction by combining sequence-profiling and structure-based comparisons [J]. Journal of Chemical Information & Modeling, 2018, 58:501-510]. ATP binding sites are important drug targets for antibacterial and anticancer chemotherapy. However, identification of protein-ligand binding sites by wet-laboratory experimental techniques is generally costly and time-consuming; as of June 2019, 7055 proteins in the Protein Data Bank (PDB) were labeled as ATP-binding, accounting for about 4.62% of all records [Song Jiazhi, Liang Yanchun, Liu Guixia, et al. A Novel Prediction Method for ATP-Binding Sites From Protein Primary Sequences Based on Fusion of Deep Convolutional Neural Network and Ensemble Learning [J]. IEEE Access, 2020, 8:21485-21495], and the number of known ATP-binding proteins is far from adequate in the face of the large-scale protein sequences of the post-genomic era. At present, machine learning and related algorithms are developing rapidly, computational methods for determining binding sites on proteins keep appearing, and bioinformatics continues to advance; however, traditional computational methods suffer from relatively low accuracy and a high false-positive rate in their prediction results [Hong Jiajun. Research on protein function prediction and drug target discovery based on deep learning [D]. Hangzhou: Zhejiang University, 2020]. To reveal the intrinsic mechanism of protein-ligand interactions, a large amount of wet-laboratory work has been performed, and thousands of protein-ligand interaction structural complexes have been deposited in the PDB. However, identification of protein-ligand binding sites by wet-laboratory experimental techniques is often costly and time-consuming. Because of the importance of protein-ligand interactions and the difficulty of experimentally identifying binding sites, developing efficient, automated computational methods to rapidly predict protein-ligand binding sites has become an increasingly important issue in bioinformatics, especially when faced with the large-scale protein sequences of the post-genomic era.
AI techniques, such as the well-known machine learning and deep learning methods, can be used for the determination of protein-ligand interaction sites and greatly improve the experimental rate (compared to wet laboratories); they are good methods that can be selected and continue to be explored at present. Training and validating a model on a suitable data set greatly reduces the number of wet experiments required and the experimental cost. However, these methods still suffer from limited prediction accuracy and a high misprediction rate, and how to improve the prediction accuracy while further reducing the time cost remains a valuable problem.
In biomedicine, understanding the interaction of proteins with ATP facilitates protein function annotation and drug development. Accurate identification of protein-ATP binding residues is an important but challenging task for gaining knowledge of protein-ATP interactions, especially when only protein sequence information is provided. With the development of deep learning algorithms, convolutional neural networks (CNNs) have been widely used in a variety of bioinformatics fields. However, convolutional neural networks usually improve classifier performance only by increasing the depth of the stacked convolutional layers; on the other hand, for the gradient algorithm used in a convolutional neural network, the traditional integer-order gradient descent may converge slowly or suffer from gradient explosion, and may fail to converge stably to the true extreme point of the objective function.
Disclosure of Invention
Aiming at the problems of the convolutional neural network models applied to the existing protein-ATP prediction task, such as a low convergence rate, prediction performance in need of improvement, and unbalanced data distribution, the invention provides a protein-ATP binding site prediction method and device based on a fractional order neural network.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, a method for predicting a protein-ATP binding site based on fractional order neural networks is provided, the method being applied to an electronic device, comprising the steps of:
s1: constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting characteristics of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the characteristics into a characteristic matrix;
s2: using the weighted cross entropy as a loss function of the prediction model, and based on the loss function, adjusting the prediction iterative algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iterative algorithm;
s3: constructing a fractional derivative based on the Caputo definition, and modifying the adjusted prediction iteration algorithm based on the fractional derivative;
S4: replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iteration algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
Alternatively, the training set is the untreated original protein sequence ATP-227.
Optionally, in S1, an initial prediction model is constructed, a training set is obtained based on a PDB protein database, features of target residues and residues adjacent to the target residues in the training set are collected through a sliding window technology, and the features are integrated into a feature matrix, including:
s11: acquiring a training set based on a PDB protein database, and determining the size of a sliding window; the sliding window comprises target residues, and adjacent residues of the target residues are respectively arranged at the left side and the right side of the target residues;
s12: operating psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, and inputting a training set to obtain a PSSM matrix of the training set;
s13: acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;
S14: performing One-hot coding on amino acids in a training set to obtain One-hot coding vectors of each amino acid; wherein, the coding mode is one-hot coding according to the dipole and the amino acid classification mode of the coil side chain;
s15: and extracting features of the PSSM matrix, the protein secondary structure vector and the One-hot coding vector of each amino acid through a sliding window to obtain the features of the target residues and the adjacent residues of the target residues in the training set, and integrating the features into a feature matrix.
Optionally, in S2, using weighted cross entropy as a loss function of the prediction model, adjusting a prediction iteration algorithm of each amino acid class by giving different weights based on the loss function, to obtain an adjusted prediction iteration algorithm, including:
the cross entropy of the i-th sample is defined as shown in the following equation (1):

$$CE_i=-\sum_{j} y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (1)$$

wherein $y_{ij}\in\{0,1\}$; if the i-th sample belongs to the p-th class, then $y_{ip}=1$, and $\hat{y}_{ip}$ represents the prediction probability that the i-th sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\,y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (2)$$

wherein $w_j$ is the weight of each class, $y_{ij}$ is the One-hot encoded value, $N$ represents the number of samples, and $\hat{y}_{ij}$ represents the corresponding prediction probability.
Optionally, in S3, the fractional derivative under the Caputo definition is given by the following formula (3):

$${}^{C}_{t_0}\!D^{\alpha}_{t}\,f(t)=\frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,d\tau \qquad (3)$$

wherein $f(t)$ is the objective function, $\alpha$ is the order with $0<\alpha<1$ and $m-1<\alpha<m$ ($m$ a positive integer), $\Gamma(\cdot)$ is the gamma function, $t_0$ is the initial value, $f^{(m)}$ denotes the m-th order derivative of $f$, and $\tau$ is the integration variable.
Optionally, modifying the adjusted prediction iteration algorithm based on the fractional derivative in step S3 includes:

the fractional gradient method is shown in the following formula (4):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_0}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (4)$$

where $\mu$ is the iteration step or learning rate, $k$ is the iteration index, and $x_0$ denotes the initial iterate (the value at step 0);

replacing $x_0$ in equation (4) with $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_{k-1}}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (5)$$

substituting the above equation (5) into equation (3) and simplifying gives the modified prediction iterative algorithm of the following equation (6):

$$x_{k+1}=x_k-\frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\bigl|x_k-x_{k-1}\bigr|^{\,1-\alpha} \qquad (6)$$

the prediction iterative algorithm of the above formula (6) converges to the true extreme point $x^{*}$.
Optionally, in step S4, replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm, and constructing a new prediction model, including:
replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm to construct the fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer adopts a mixture of fractional order and integer order; the fully connected layer involves two types of gradients: the transfer gradient connecting the nodes between two layers, and the update gradient for the in-layer parameters.
In one aspect, there is provided a fractional neural network-based protein-ATP binding site prediction apparatus for use in an electronic device, the apparatus comprising:
the feature extraction module is used for constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting features of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the features into a feature matrix;
the function modification module is used for utilizing the weighted cross entropy as a loss function of the prediction model, and based on the loss function, adjusting the prediction iteration algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iteration algorithm;
the algorithm modification module is used for constructing a fractional derivative defined based on the Caputo and modifying the adjusted prediction iterative algorithm based on the fractional derivative;
the result output module is used for replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iteration algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
Alternatively, the training set is the untreated original protein sequence ATP-227.
Optionally, the feature extraction module is further configured to obtain a training set based on the PDB protein database, determine a sliding window size, and include a target residue in the sliding window, where adjacent residues of the target residue are respectively located on left and right sides of the target residue;
operating psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, and inputting a training set to obtain a PSSM matrix of the training set;
acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;
performing One-hot coding on amino acids in a training set to obtain One-hot coding vectors of each amino acid; wherein, the coding mode is one-hot coding according to the dipole and the amino acid classification mode of the coil side chain;
and extracting features of the PSSM matrix, the protein secondary structure vector and the One-hot coding vector of each amino acid through a sliding window to obtain the features of the target residues and the adjacent residues of the target residues in the training set, and integrating the features into a feature matrix.
In one aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement a fractional-order-neural-network-based protein-ATP binding site prediction method as described above.
In one aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a fractional neural network-based protein-ATP binding site prediction method as described above is provided.
The technical scheme provided by the embodiment of the invention has at least the following beneficial effects:
in this scheme, a method combining deep learning with fractional differentiation is provided for predicting protein-ATP binding sites, and the accuracy is improved. The invention focuses on adding the fractional-order gradient under the Caputo definition to the fully connected layer of the single-start predictor, improving the performance of the predictor while ensuring convergence and the validity of the chain rule.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for predicting protein-ATP binding sites based on fractional order neural networks according to an embodiment of the invention;
FIG. 2 is a flow chart of a method for predicting protein-ATP binding sites based on fractional order neural networks according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a forward propagation algorithm of a protein-ATP binding site prediction method based on fractional order neural networks according to an embodiment of the present invention;
FIG. 4 is a graph showing an update procedure of a protein-ATP binding site prediction method based on fractional order neural networks according to an embodiment of the present invention;
FIG. 5 is a graph of predicted outcome of a fractional neural network-based protein-ATP binding site prediction method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a protein-ATP binding site predicting device based on fractional order neural networks according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Description of the embodiments
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a protein-ATP binding site prediction method based on a fractional order neural network, which can be realized by an electronic device, wherein the electronic device may be a terminal or a server. A flowchart of the fractional-order-neural-network-based protein-ATP binding site prediction method is shown in FIG. 1; the process flow of the method may include the following steps:
S101: constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting characteristics of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the characteristics into a characteristic matrix;
s102: using the weighted cross entropy as a loss function of the prediction model, and based on the loss function, adjusting the prediction iterative algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iterative algorithm;
s103: constructing a fractional derivative based on the Caputo definition, and modifying the adjusted prediction iteration algorithm based on the fractional derivative;
s104: replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iteration algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
Alternatively, the training set is the untreated original protein sequence ATP-227.
Optionally, in S101, an initial prediction model is constructed, a training set is obtained based on a PDB protein database, features of target residues and residues adjacent to the target residues in the training set are collected through a sliding window technology, and the features are integrated into a feature matrix, including:
S111: acquiring a training set based on a PDB protein database, and determining the size of a sliding window; the sliding window comprises target residues, and adjacent residues of the target residues are respectively arranged at the left side and the right side of the target residues;
s112: operating psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, and inputting a training set to obtain a PSSM matrix of the training set;
s113: acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;
s114: performing One-hot coding on amino acids in a training set to obtain One-hot coding vectors of each amino acid; wherein, the coding mode is one-hot coding according to the dipole and the amino acid classification mode of the coil side chain;
s115: and extracting features of the PSSM matrix, the protein secondary structure vector and the One-hot coding vector of each amino acid through a sliding window to obtain the features of the target residues and the adjacent residues of the target residues in the training set, and integrating the features into a feature matrix.
Optionally, in S102, using weighted cross entropy as a loss function of the prediction model, based on the loss function, adjusting a prediction iteration algorithm of each amino acid class by giving different weights, to obtain an adjusted prediction iteration algorithm, including:
The cross entropy of the i-th sample is defined as shown in the following equation (1):

$$CE_i=-\sum_{j} y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (1)$$

wherein $y_{ij}\in\{0,1\}$; if the i-th sample belongs to the p-th class, then $y_{ip}=1$, and $\hat{y}_{ip}$ represents the prediction probability that the i-th sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\,y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (2)$$

wherein $w_j$ is the weight of each class, $y_{ij}$ is the One-hot encoded value, $N$ represents the number of samples, and $\hat{y}_{ij}$ represents the corresponding prediction probability.
optionally, in S103, the fractional derivative under the Caputo definition is given by the following formula (3):

$${}^{C}_{t_0}\!D^{\alpha}_{t}\,f(t)=\frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,d\tau \qquad (3)$$

wherein $f(t)$ is the objective function, $\alpha$ is the order with $0<\alpha<1$ and $m-1<\alpha<m$ ($m$ a positive integer), $\Gamma(\cdot)$ is the gamma function, $t_0$ is the initial value, $f^{(m)}$ denotes the m-th order derivative of $f$, and $\tau$ is the integration variable.
Optionally, modifying the adjusted prediction iteration algorithm based on the fractional derivative in step S103 includes:

the fractional gradient method is shown in the following formula (4):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_0}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (4)$$

wherein $\mu$ is the iteration step length or learning rate, $k$ is the iteration index, and $x_0$ denotes the initial iterate;

replacing $x_0$ in equation (4) with $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_{k-1}}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (5)$$

substituting the above equation (5) into equation (3) and simplifying gives the modified prediction iterative algorithm of the following equation (6):

$$x_{k+1}=x_k-\frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\bigl|x_k-x_{k-1}\bigr|^{\,1-\alpha} \qquad (6)$$

the prediction iterative algorithm of the above formula (6) converges to the true extreme point $x^{*}$.
Optionally, in step S104, a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model is replaced by a modified prediction iterative algorithm, so as to construct a new prediction model, which includes:
replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm to construct the fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer adopts a mixture of fractional order and integer order; the fully connected layer involves two types of gradients: the transfer gradient connecting the nodes between two layers, and the update gradient for the in-layer parameters.
In the embodiment of the invention, a protein-ATP binding site prediction method is provided by combining a deep learning method and fractional differential, and the accuracy is improved. Firstly, data sets ATP-227 and ATP-14 are selected as training sets and test sets, and the required characteristics of the model are extracted from the digitized information of the protein and integrated into a characteristic matrix to be used as input. Then, the parameter updating process of the back propagation process of the convolutional neural network is modified into fractional gradient iteration, and test data show that the prediction effect of the convolutional neural network modified by fractional gradient is superior to that of the prior machine learning and integer-order deep learning model. The invention focuses on adding the fractional order gradient defined by Caputo to the full connection layer of the single-start predictor, and improving the performance of the predictor on the premise of ensuring convergence and chain rule.
The embodiment of the invention provides a protein-ATP binding site prediction method based on fractional order neural networks,
the method may be implemented by an electronic device, which may be a terminal or a server. A flowchart of the fractional-order-neural-network-based protein-ATP binding site prediction method is shown in FIG. 2; the process flow of the method may include the following steps:
s201: acquiring a training set based on a PDB protein database, and determining the size of a sliding window; the sliding window comprises target residues, and adjacent residues of the target residues are respectively arranged at the left side and the right side of the target residues;
in one possible embodiment, the training set is the untreated original protein sequence ATP-227. The present invention utilizes two commonly used classical data sets in protein-ATP binding site prediction, in which the untreated original protein sequence is selected: ATP-227 and ATP-14. ATP-227 is 227 protein chains bound to ATP, which were published in the PDB protein database 3/10/2010. Together, these 227 chains contain 3393 ATP-binding residues, and 80409 non-ATP-binding residues. Meanwhile, 14 protein chains are selected from ATP-17 (the other three protein sequences cannot find the corresponding fasta file in the PDB database according to the protein ID), and the protein chains are named as ATP-14, and as an independent test set, the similarity of any one chain of ATP-14 and ATP-227 is ensured to be less than 41 percent. Fasta sequence files of the dataset were downloaded in bulk from the PDB protein database, ATP-227 as training set and ATP-14 as test set.
In one possible embodiment, the number of amino acids in each protein sequence is large and the ratio of non-binding to binding residues is high; studies have shown that the binding properties of a target residue are affected by its neighboring residues, so the sliding window technique is used to collect the characteristics of the target residue and its neighboring residues. A sliding window of size L contains the target residue and (L-1)/2 adjacent residues on each of its left and right sides. In this embodiment, L = 15 was finally selected by comparing the performance of different window sizes; that is, one sliding window takes a value such as 000000010000000 (15 positions, with the central position corresponding to the target residue).
s202: and running psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, inputting a training set, and obtaining a PSSM matrix of the training set.
In a possible implementation, the PSSM file also contains other information; in this embodiment, only the first 20 columns are retained.
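As an illustration of this step, a minimal Python sketch is given below: it runs PSI-BLAST against Swiss-Prot and keeps only the first 20 log-odds columns of the resulting ASCII PSSM. The database name, the number of iterations and the exact column layout of the PSSM file are assumptions rather than values fixed by this embodiment.

```python
import subprocess
import numpy as np

def compute_pssm(fasta_path, pssm_path, db="swissprot", iterations=3):
    # Run PSI-BLAST and export the ASCII PSSM (database name and iteration
    # count are assumptions, not fixed by the patent).
    subprocess.run(
        ["psiblast", "-query", fasta_path, "-db", db,
         "-num_iterations", str(iterations), "-out_ascii_pssm", pssm_path],
        check=True,
    )

def read_pssm_first20(pssm_path):
    # Keep only the first 20 log-odds columns, as described in this embodiment.
    rows = []
    with open(pssm_path) as fh:
        for line in fh:
            parts = line.split()
            # Data rows start with a numeric position index followed by the residue letter.
            if len(parts) >= 22 and parts[0].isdigit():
                rows.append([float(v) for v in parts[2:22]])
    return np.array(rows)  # shape: (sequence length, 20)
```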
S203: and obtaining a protein secondary structure in the training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector.
In a possible embodiment, the present invention selects the 3-state secondary structure representation, i.e., coil (C), helix (H) and strand (E), for the protein secondary structure, obtained by running PSIPRED 4.02 in the BLAST environment. Solvent accessibility is obtained using ASAquick. The extraction of the above three features is based on the fasta sequence files.
S204: performing One-hot coding on amino acids in a training set to obtain One-hot coding vectors of each amino acid; wherein the coding mode is one-hot coding according to the amino acid classification modes of dipoles and coil side chains.
In one possible embodiment, there are a number of ways of classifying amino acids for One-hot encoding; here they are encoded according to the dipoles and roll side chains, and each amino acid is represented by a 1×7 vector. For example, alanine (Ala) belongs to the first class, so its One-hot encoding is [0,0,0,0,0,0,1]; tyrosine (Tyr) belongs to the fourth class, so its One-hot encoding is [0,0,0,1,0,0,0], as sketched below.
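A minimal sketch of this encoding is given below; only alanine (class 1) and tyrosine (class 4) are fixed by the example above, and the class membership of the remaining amino acids in the dictionary is purely illustrative.

```python
# Hypothetical grouping of the 20 amino acids into the 7 classes;
# only Ala (class 1) and Tyr (class 4) are fixed by the text above.
AA_CLASS = {
    "A": 1, "G": 1, "V": 1,
    "I": 2, "L": 2, "F": 2, "P": 2,
    "M": 3, "T": 3, "S": 3,
    "Y": 4, "H": 4, "N": 4, "Q": 4, "W": 4,
    "R": 5, "K": 5,
    "D": 6, "E": 6,
    "C": 7,
}

def one_hot_7(residue):
    vec = [0] * 7
    # Class 1 is encoded at the last position, e.g. Ala -> [0,0,0,0,0,0,1],
    # and class 4 at the fourth position, e.g. Tyr -> [0,0,0,1,0,0,0].
    vec[7 - AA_CLASS[residue]] = 1
    return vec
```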
S205: and extracting features of the PSSM matrix, the protein secondary structure vector and the One-hot coding vector of each amino acid through a sliding window to obtain a target residue in a training set and features of the target residue, and integrating the features into a feature matrix.
In a possible implementation, the feature extraction is performed through a sliding window, so that in this embodiment, a PSSM matrix of 15×20, a protein secondary structure vector of 15×3, a solvent accessibility vector of 15×1, and an One-hot encoding vector of 15×7 are obtained. In this embodiment, the data sets ATP-227 and ATP-14 are used as training and testing sets, and the required features of the model are extracted from the digitized information of the protein and integrated into a feature matrix as input of a new predictive model.
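The assembly of the window feature matrix described above might be sketched as follows; the concatenation order of the feature blocks and the zero-padding at the sequence ends are assumptions.

```python
import numpy as np

def window_features(pssm, ss3, asa, onehot7, center, L=15):
    """Stack per-residue features (20 + 3 + 1 + 7 = 31 columns) for a window of
    length L centered on the target residue. Expected shapes: pssm (n, 20),
    ss3 (n, 3), asa (n, 1), onehot7 (n, 7). Positions outside the sequence are
    zero-padded (the padding scheme is an assumption)."""
    n = pssm.shape[0]
    half = (L - 1) // 2
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < n:
            rows.append(np.concatenate([pssm[pos], ss3[pos], asa[pos], onehot7[pos]]))
        else:
            rows.append(np.zeros(31))
    return np.stack(rows)  # shape: (15, 31)
```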
S206: using weighted cross entropy as a loss function of a prediction model, and based on the loss function, adjusting a prediction iteration algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iteration algorithm;
in a possible embodiment, the invention solves the data imbalance problem by modifying the loss function, i.e., the cross entropy: weighted cross entropy is used as the loss function, and the prediction for each class is adjusted by assigning different weights, as follows:
the cross entropy of the i-th sample is defined as shown in the following equation (1):

$$CE_i=-\sum_{j} y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (1)$$

wherein $y_{ij}\in\{0,1\}$; if the i-th sample belongs to the p-th class, then $y_{ip}=1$, and $\hat{y}_{ip}$ represents the prediction probability that the i-th sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\,y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (2)$$

wherein $w_j$ is the weight of each class and $y_{ij}$ is the One-hot encoded value.
In a possible embodiment, the invention uses weighted cross entropy as the loss function and adjusts the prediction of each class by giving different weights, so as to solve the unbalanced learning problem. Class weights are calculated by Scikit-learn; the balanced class weights are determined by the following formula:

$$w_j=\frac{n_{\text{samples}}}{n_{\text{classes}}\cdot \mathrm{bincount}(y)_j}$$

wherein $\mathrm{bincount}(y)_j$ represents the number of samples in each class, $n_{\text{classes}}$ represents the number of categories (here $n_{\text{classes}}=2$), and $n_{\text{samples}}$ is the total number of samples; bincount(y) is a function of the numpy library in Python that gives the number of occurrences of each element value in y. We choose the threshold that maximizes the MCC value.
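A minimal sketch of this weighting scheme is given below, assuming Scikit-learn's compute_class_weight and a plain NumPy form of formula (2); the toy label vector is illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy residue labels: 1 = ATP-binding residue, 0 = non-binding residue.
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

# Balanced weights, equivalent to n_samples / (n_classes * np.bincount(y)).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

def weighted_cross_entropy(y_true_onehot, y_pred_prob, w, eps=1e-12):
    # Formula (2): average over samples of -sum_j w_j * y_ij * log(p_ij).
    return -np.mean(np.sum(w * y_true_onehot * np.log(y_pred_prob + eps), axis=1))
```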
S207: constructing fractional derivatives defined based on Caputo, and modifying the adjusted prediction iteration algorithm based on the fractional derivatives;
in a possible embodiment, the invention chooses to study the fractional gradient under this definition, since the fractional derivative of the Caputo definition has very good properties, i.e. the derivative of the constant is equal to 0.
The fractional derivative under the Caputo definition is given by the following equation (3):

$${}^{C}_{t_0}\!D^{\alpha}_{t}\,f(t)=\frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,d\tau \qquad (3)$$

wherein $f(t)$ is the objective function, $\alpha$ is the order with $0<\alpha<1$ and $m-1<\alpha<m$ ($m$ a positive integer), $\Gamma(\cdot)$ is the gamma function, $t_0$ is the initial value, $f^{(m)}$ denotes the m-th order derivative of $f$, and $\tau$ is the integration variable.
In a possible implementation, let $f(x)$ be a smooth convex function and $x^{*}$ be the unique extreme point of $f(x)$. Each iteration step of the conventional integer-order gradient method is:

$$x_{k+1}=x_k-\mu\,f'(x_k)$$

where $\mu$ is the iteration step or learning rate, $k$ is the iteration index, and $x_k$ denotes the iterate at step $k$ (with $x_0$ the initial value). The fractional-order gradient method can be written as:

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_0}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (4)$$

In a possible embodiment, if the fractional derivative is applied directly, the fractional gradient method above cannot converge to the true extreme point $x^{*}$ of $f(x)$, but only to an extreme point in the sense of the Caputo fractional derivative, which depends on the initial value $x_0$ and on the order $\alpha$ and in most cases is not equal to $x^{*}$.

To ensure that the algorithm converges to the true extreme point, another fractional gradient method is considered in the subsequent iteration process, i.e., $x_0$ is replaced with $x_{k-1}$: replacing $x_0$ in equation (4) with $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_{k-1}}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (5)$$

wherein $0<\alpha<1$.

Substituting the above equation (5) into equation (3) yields:

$$x_{k+1}=x_k-\mu\sum_{i=1}^{\infty}\frac{f^{(i)}(x_{k-1})}{\Gamma(i+1-\alpha)}\,\bigl(x_k-x_{k-1}\bigr)^{\,i-\alpha}$$

When only the first term is retained and its absolute value is introduced, the fractional gradient method for $0<\alpha<2$ is reduced to the modified iterative algorithm of the following equation (6):

$$x_{k+1}=x_k-\frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\bigl|x_k-x_{k-1}\bigr|^{\,1-\alpha} \qquad (6)$$

The iterative algorithm of the above formula (6) converges to the true extreme point $x^{*}$.
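A toy sketch of the modified iteration of formula (6) on a one-dimensional convex function is given below; the order α, the step size μ and the handling of the very first step are assumptions.

```python
import math

def frac_gradient_descent(grad, x0, alpha=0.9, mu=0.1, steps=100, eps=1e-12):
    # Modified fractional gradient iteration of formula (6):
    #   x_{k+1} = x_k - mu / Gamma(2 - alpha) * f'(x_k) * |x_k - x_{k-1}|^(1 - alpha)
    x_prev, x = x0, x0 - mu * grad(x0)  # first step: plain gradient step (assumption)
    for _ in range(steps):
        step = mu / math.gamma(2 - alpha) * grad(x) * (abs(x - x_prev) + eps) ** (1 - alpha)
        x_prev, x = x, x - step
    return x

# Toy check on f(x) = (x - 3)^2, whose true extreme point is x* = 3.
print(frac_gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```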
S208: replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iteration algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
In one possible embodiment, a fully connected layer of the convolutional neural network is constructed, wherein the back-propagation gradient of the fully connected layer adopts a mixture of fractional order and integer order to ensure that the chain rule holds. Two types of gradients are involved: one is the transfer gradient connecting the nodes between two layers, and the other is the update gradient for the in-layer parameters.

In one possible implementation, a schematic diagram of the forward propagation algorithm is shown in FIG. 3. Let $y_j^{(l)}$ denote the output of the $j$-th node in the $l$-th layer:

$$y_j^{(l)}=\sigma\!\left(z_j^{(l)}\right),\qquad z_j^{(l)}=\sum_i w_{ji}^{(l)}\,y_i^{(l-1)}+b_j^{(l)}$$

Here $w_{ji}^{(l)}$ denotes the weights of the $l$-th layer, $b_j^{(l)}$ denotes the bias, $y_i^{(l-1)}$ denotes the output of the previous layer, and the function $\sigma(\cdot)$ is the activation function.

To ensure that the chain rule holds, the propagation gradient is still an integer-order gradient:

$$\frac{\partial E}{\partial y_i^{(l-1)}}=\sum_j \frac{\partial E}{\partial y_j^{(l)}}\,\sigma'\!\bigl(z_j^{(l)}\bigr)\,w_{ji}^{(l)}$$

but in updating the in-layer parameters we use the fractional-order update of formula (6):

$$w_{k+1}=w_k-\frac{\mu}{\Gamma(2-\alpha)}\,\frac{\partial E}{\partial w}\bigg|_{w=w_k}\,\bigl|w_k-w_{k-1}\bigr|^{\,1-\alpha}$$

where $E$ denotes the loss function. The update process is shown in FIG. 4.
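The mixed integer-order/fractional-order treatment of the fully connected layer described above might be sketched as follows; the ReLU activation, the initialisation and the small epsilon that keeps the very first fractional step from vanishing are assumptions rather than details fixed by the embodiment.

```python
import numpy as np
from math import gamma

class FracDense:
    """Fully connected layer: the integer-order gradient is passed back to the
    previous layer (so the chain rule holds), while the layer's own parameters
    are updated with the Caputo-type fractional rule of formula (6)."""

    def __init__(self, n_in, n_out, alpha=0.9, mu=0.01, eps=1e-8):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.b = np.zeros(n_out)
        self.W_prev, self.b_prev = self.W.copy(), self.b.copy()
        self.alpha, self.mu, self.eps = alpha, mu, eps

    def forward(self, x):
        self.x = x
        self.z = self.W @ x + self.b
        return np.maximum(self.z, 0.0)        # ReLU activation (assumption)

    def backward(self, grad_out):
        grad_z = grad_out * (self.z > 0)      # integer-order chain rule
        grad_W = np.outer(grad_z, self.x)
        grad_b = grad_z
        grad_in = self.W.T @ grad_z           # integer-order gradient passed back
        self._frac_update(grad_W, grad_b)
        return grad_in

    def _frac_update(self, grad_W, grad_b):
        # Fractional update of formula (6); eps avoids a zero step when the
        # current and previous parameters coincide (e.g. at the first step).
        c = self.mu / gamma(2 - self.alpha)
        step_W = c * grad_W * (np.abs(self.W - self.W_prev) + self.eps) ** (1 - self.alpha)
        step_b = c * grad_b * (np.abs(self.b - self.b_prev) + self.eps) ** (1 - self.alpha)
        self.W_prev, self.b_prev = self.W.copy(), self.b.copy()
        self.W -= step_W
        self.b -= step_b
```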
In a possible implementation, ATP-17 is used as the test set to evaluate the model. The model outputs a one-dimensional matrix of prediction probabilities for each site of a protein sequence; according to the criterion of maximizing the MCC, the threshold is set to 0.80, i.e., when the prediction probability of a site is greater than 0.8 it is judged to be a binding site and denoted by 1, and otherwise by 0. We performed 15 repeated experiments on the test set, selected accuracy (Acc), sensitivity (Sen), specificity (Spe) and the Matthews correlation coefficient (MCC) as evaluation indexes, and compared with the traditional convolutional neural network; the averages over the repeated experiments are given in the following table:
Table 1 Evaluation index table (the table values are provided as an image in the original document)
Then, compared with several protein-ATP binding site predictors that perform well in the prior art, namely NsitePred, TargetATPsite, TargetS and ATPseq, the prediction results on ATP-17 are shown in the following table:
Table 2 Comparison of the results of the existing predictors with the predictor of the present invention (the table values are provided as an image in the original document)
The predicted result of the protein 2YAA sequence is shown in FIG. 5. The invention can accurately predict the binding site.
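A minimal sketch of the evaluation step described above, thresholding the per-residue probabilities at 0.80 and computing Acc, Sen, Spe and MCC with scikit-learn, is given below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def evaluate(y_true, y_prob, threshold=0.80):
    # A residue is predicted as an ATP-binding site when its probability exceeds the threshold.
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on binding residues)
    spe = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    mcc = matthews_corrcoef(y_true, y_pred)
    return acc, sen, spe, mcc
```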
In the embodiment of the invention, a protein-ATP binding site prediction method is provided by combining a deep learning method and fractional differential, and the accuracy is improved. Firstly, data sets ATP-227 and ATP-14 are selected as training sets and test sets, and the required characteristics of the model are extracted from the digitized information of the protein and integrated into a characteristic matrix to be used as input. Then, the parameter updating process of the back propagation process of the convolutional neural network is modified into fractional gradient iteration, and test data show that the prediction effect of the convolutional neural network modified by fractional gradient is superior to that of the prior machine learning and integer-order deep learning model. The invention focuses on adding the fractional order gradient defined by Caputo to the full connection layer of the single-start predictor, and improving the performance of the predictor on the premise of ensuring convergence and chain rule.
FIG. 6 is a block diagram illustrating a fractional neural network based protein-ATP binding site predicting device according to an example embodiment. Referring to fig. 6, the apparatus 300 includes:
The feature extraction module 310 is configured to construct an initial prediction model, acquire a training set based on a PDB protein database, collect features of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrate the features into a feature matrix;
the function modification module 320 is configured to use the weighted cross entropy as a loss function of the prediction model, and adjust a prediction iteration algorithm of each amino acid type by giving different weights based on the loss function, so as to obtain an adjusted prediction iteration algorithm;
the algorithm modification module 330 is configured to construct a fractional derivative defined based on the Caputo, and modify the adjusted prediction iterative algorithm based on the fractional derivative;
the result output module 340 is configured to replace a parameter update process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
Alternatively, the training set is the untreated original protein sequence ATP-227.
Optionally, the feature extraction module 310 is further configured to obtain a training set based on the PDB protein database, and determine a sliding window size; the sliding window comprises target residues, and adjacent residues of the target residues are respectively arranged at the left side and the right side of the target residues;
Operating psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, and inputting a training set to obtain a PSSM matrix of the training set;
acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;
performing One-hot coding on amino acids in a training set to obtain One-hot coding vectors of each amino acid; wherein, the coding mode is one-hot coding according to the dipole and the amino acid classification mode of the coil side chain;
and extracting features of the PSSM matrix, the protein secondary structure vector and the One-hot coding vector of each amino acid through a sliding window to obtain the features of the target residues and the adjacent residues of the target residues in the training set, and integrating the features into a feature matrix.
Optionally, the function modifying module 320 is configured to define the cross entropy of the i-th sample as shown in the following formula (1):

$$CE_i=-\sum_{j} y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (1)$$

wherein $y_{ij}\in\{0,1\}$; if the i-th sample belongs to the p-th class, then $y_{ip}=1$, and $\hat{y}_{ip}$ represents the prediction probability that the i-th sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\,y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (2)$$

wherein $w_j$ is the weight of each class and $y_{ij}$ is the One-hot encoded value.
Optionally, the algorithm modification module 330 is configured to use the fractional derivative under the Caputo definition as shown in the following formula (3):

$${}^{C}_{t_0}\!D^{\alpha}_{t}\,f(t)=\frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,d\tau \qquad (3)$$

wherein $f(t)$ is the objective function, $\alpha$ is the order, $m-1<\alpha<m$ with $m$ a positive integer, $\Gamma(\cdot)$ is the gamma function, and $t_0$ is the initial value.
Optionally, the algorithm modification module 330 is configured to modify the iterative algorithm so that it converges to the true extreme point, as follows:

the fractional gradient method is shown in the following formula (4):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_0}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (4)$$

wherein $\mu$ is the iteration step length or learning rate, $k$ is the iteration index, and $x_0$ denotes the initial iterate;

replacing $x_0$ in equation (4) with $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_{k-1}}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (5)$$

substituting the above equation (5) into equation (3) and simplifying gives the modified iterative algorithm of the following equation (6):

$$x_{k+1}=x_k-\frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\bigl|x_k-x_{k-1}\bigr|^{\,1-\alpha} \qquad (6)$$

The iterative algorithm of the above formula (6) converges to the true extreme point $x^{*}$.
Optionally, the result output module 340 is configured to replace the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm and to construct the fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer adopts a mixture of fractional order and integer order; the fully connected layer involves two types of gradients: the transfer gradient connecting the nodes between two layers, and the update gradient for the in-layer parameters.
In the embodiment of the invention, a protein-ATP binding site prediction method is provided by combining a deep learning method and fractional differential, and the accuracy is improved. Firstly, data sets ATP-227 and ATP-14 are selected as training sets and test sets, and the required characteristics of the model are extracted from the digitized information of the protein and integrated into a characteristic matrix to be used as input. Then, the parameter updating process of the back propagation process of the convolutional neural network is modified into fractional gradient iteration, and test data show that the prediction effect of the convolutional neural network modified by fractional gradient is superior to that of the prior machine learning and integer-order deep learning model. The invention focuses on adding the fractional order gradient defined by Caputo to the full connection layer of the single-start predictor, and improving the performance of the predictor on the premise of ensuring convergence and chain rule.
Fig. 7 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention, where the electronic device 400 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 401 and one or more memories 402, where at least one instruction is stored in the memories 402, and the at least one instruction is loaded and executed by the processors 401 to implement the following steps of a fractional-order neural network-based protein-ATP binding site prediction method:
S1: constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting characteristics of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the characteristics into a characteristic matrix;
s2: using the weighted cross entropy as a loss function of the prediction model, and based on the loss function, adjusting the prediction iterative algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iterative algorithm;
s3: constructing a fractional derivative based on the Caputo definition, and modifying the adjusted prediction iteration algorithm based on the fractional derivative;
s4: replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iteration algorithm to construct a new prediction model; inputting the feature matrix into a new prediction model, outputting a prediction result, and finishing the prediction of the protein-ATP binding site based on the fractional order neural network.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the above fractional-order-neural-network-based protein-ATP binding site prediction method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (2)

1. A fractional neural network-based protein-ATP binding site prediction method, the method steps comprising:
s1: constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting characteristics of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the characteristics into a characteristic matrix;
s2: using weighted cross entropy as a loss function of the initial prediction model, and based on the loss function, adjusting a prediction iteration algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iteration algorithm;
S3: constructing fractional derivatives defined based on Caputo, and modifying the adjusted prediction iteration algorithm based on the fractional derivatives;
s4: replacing a parameter updating process of a backward propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; inputting the feature matrix into the new prediction model, outputting a prediction result, and finishing protein-ATP binding site prediction based on a fractional order neural network;
in the step S1, the training set is an unprocessed original protein sequence ATP-227;
in the step S1, an initial prediction model is constructed, a training set is obtained based on a PDB protein database, features of target residues and adjacent residues of the target residues in the training set are collected through a sliding window technology, and the features are integrated into a feature matrix, wherein the method comprises the following steps:
s11: acquiring a training set based on a PDB protein database, and determining the size of a sliding window; the sliding window comprises target residues, and adjacent residues of the target residues are respectively arranged at the left side and the right side of the target residues;
s12: operating psi-blast in the annotated protein sequence Swissprot database through a search tool blast based on a local alignment algorithm, and inputting a training set to obtain a PSSM matrix of the training set;
S13: acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;
s14: carrying out one-hot coding on amino acids in a training set to obtain one-hot coding vectors of each amino acid; wherein, the coding mode is one-hot coding according to the dipole and the amino acid classification mode of the coil side chain;
s15: performing feature extraction on a PSSM matrix, a protein secondary structure vector and a one-hot coding vector of each amino acid through a sliding window to obtain features of target residues in a training set and adjacent residues of the target residues, and integrating the features into a feature matrix;
in the step S2, a weighted cross entropy is used as a loss function of the initial prediction model, and based on the loss function, a prediction iteration algorithm of each amino acid type is adjusted by giving different weights, so as to obtain an adjusted prediction iteration algorithm, which includes:
the cross entropy of the i-th sample is defined as shown in the following equation (1):

$$CE_i=-\sum_{j} y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (1)$$

wherein $CE_i$ represents the cross entropy, $y_{ij}\in\{0,1\}$, $j=1,2,\dots$; if the i-th sample belongs to the p-th class, then $y_{ip}=1$, and $\hat{y}_{ip}$ represents the prediction probability that the i-th sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\,y_{ij}\,\log\bigl(\hat{y}_{ij}\bigr) \qquad (2)$$

wherein $w_j$ denotes the weight of each class, $y_{ij}$, $j=1,2,\dots,7$, is the one-hot encoded value, $N$ represents the number of samples, and $\hat{y}_{ij}$ represents the prediction probability of the i-th sample;
in S3, the fractional derivative under the Caputo definition is represented by the following formula (3):

$${}^{C}_{t_0}\!D^{\alpha}_{t}\,f(t)=\frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,d\tau \qquad (3)$$

wherein $f(t)$ is the objective function, $\alpha$ is the order with $0<\alpha<1$ and $m-1<\alpha<m$ ($m$ a positive integer), $\Gamma(\cdot)$ is the gamma function, $t_0$ is the initial value, $f^{(m)}$ denotes the m-th order derivative of $f$, and $\tau$ is the integration variable;
in step S3, modifying the adjusted prediction iteration algorithm based on the fractional derivative includes:

the fractional gradient method is shown in the following formula (4):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_0}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (4)$$

wherein $\mu$ is the iteration step length, $k$ is the iteration index, and $x_0$ denotes the initial iterate;

replacing $x_0$ in equation (4) with $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1}=x_k-\mu\;{}^{C}_{x_{k-1}}\!D^{\alpha}_{x}\,f(x)\Big|_{x=x_k} \qquad (5)$$

substituting the above equation (5) into equation (3) and simplifying gives the modified prediction iterative algorithm of the following equation (6):

$$x_{k+1}=x_k-\frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\bigl|x_k-x_{k-1}\bigr|^{\,1-\alpha} \qquad (6)$$

the prediction iterative algorithm of the formula (6) converges to the true extreme point $x^{*}$;
in the step S4, a parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model is replaced by a modified prediction iterative algorithm, and a new prediction model is constructed, including:
replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm, so as to construct the fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer uses a mixture of fractional order and integer order; the fully connected layer involves two types of gradients: the transfer gradient, which connects the nodes between two adjacent layers, and the update gradient, which updates the layer parameters.
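To make the mixed fractional/integer-order update concrete, here is a hypothetical PyTorch sketch: the transfer gradient between layers is still computed by ordinary (integer-order) autograd, while the update gradient of the fully connected layer's parameters follows the fractional rule of equation (6), and all remaining parameters are updated with plain SGD; the layer sizes, α and the learning rate are illustrative and not taken from the patent.

import math
import torch
import torch.nn as nn

class FractionalSGD:
    """Applies equation (6) to the parameters of selected (fully connected) layers
    and ordinary SGD to everything else.  The transfer gradient between layers is
    left to autograd; only the parameter-update gradient is made fractional."""
    def __init__(self, frac_params, other_params, lr=0.01, alpha=0.7):
        self.frac_params = list(frac_params)
        self.other_params = list(other_params)
        self.lr, self.alpha = lr, alpha
        self.coef = lr / math.gamma(2 - alpha)
        # previous parameter values, needed for |w_k - w_{k-1}|^(1 - alpha)
        self.prev = [p.detach().clone() + 1e-3 for p in self.frac_params]

    @torch.no_grad()
    def step(self):
        for p, p_prev in zip(self.frac_params, self.prev):
            new_prev = p.detach().clone()
            p -= self.coef * p.grad * (p - p_prev).abs().pow(1 - self.alpha)
            p_prev.copy_(new_prev)
        for p in self.other_params:
            p -= self.lr * p.grad

    def zero_grad(self):
        for p in self.frac_params + self.other_params:
            if p.grad is not None:
                p.grad.zero_()

# toy model: convolutional feature extractor + fully connected classifier head
model = nn.Sequential(nn.Conv1d(30, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 17, 2))
fc = model[3]
opt = FractionalSGD(fc.parameters(),
                    [p for m in model[:3] for p in m.parameters()],
                    lr=0.01, alpha=0.7)
x, y = torch.randn(8, 30, 17), torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()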
2. A fractional order neural network-based protein-ATP binding site prediction device, wherein the device is applied to the method of claim 1, the device comprising:
the feature extraction module is used for constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting features of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the features into a feature matrix;
the function modification module is configured to use weighted cross entropy as the loss function of the initial prediction model and, based on this loss function, to adjust the prediction iterative algorithm by assigning a different weight to each amino acid class, so as to obtain the adjusted prediction iterative algorithm;
The algorithm modification module is used for constructing fractional derivatives defined based on Caputo and modifying the adjusted prediction iterative algorithm based on the fractional derivatives;
the result output module is used for replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; inputting the feature matrix into the new prediction model, outputting a prediction result, and finishing protein-ATP binding site prediction based on a fractional order neural network;
the training set is the unprocessed original protein sequence data set ATP-227;
the feature extraction module is further configured to acquire the training set based on the PDB protein database and to determine the size of the sliding window, wherein the sliding window contains a target residue, with adjacent residues of the target residue arranged on its left and right sides;
to run psi-blast, via the search tool blast based on a local alignment algorithm, against the annotated protein sequence database Swissprot with the training set as input, to obtain the PSSM matrix of the training set;
to acquire the protein secondary structures in the training set and represent them with the 3-state secondary structure representation method to obtain protein secondary structure vectors;
to perform one-hot encoding on the amino acids in the training set to obtain a one-hot encoding vector for each amino acid, wherein the encoding follows an amino acid classification based on the dipoles and side chains of the amino acids;
to perform feature extraction on the PSSM matrix, the protein secondary structure vectors and the one-hot encoding vector of each amino acid through the sliding window to obtain the features of the target residues and of the adjacent residues of the target residues in the training set, and to integrate the features into a feature matrix;
the function modifying module is further configured to define the cross entropy of the ith sample as shown in the following formula (1):
$$CE_i = -\sum_{j} y_{ij}\,\log\left(p_{ij}\right) \qquad (1)$$

wherein $CE_i$ represents the cross entropy of the ith sample; $y_{ij}$, j = 1, 2, …, is the one-hot indicator: if the ith sample belongs to the p-th class, then $y_{ij} = \delta_{jp}$ (equal to 1 for j = p and 0 otherwise); and $p_{ij}$ represents the prediction probability that the ith sample belongs to the j-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} w_j\, y_{ij}\,\log\left(p_{ij}\right) \qquad (2)$$

wherein $w_j$ denotes the weight of each class; $y_{ij}$, j = 1, 2, …, 7, is the one-hot encoded value; N represents the number of samples; and $p_{ij}$ represents the prediction probability of the ith sample;
the algorithm modification module is further configured to construct the fractional derivative defined by Caputo as shown in the following formula (3):

$${}^{C}_{t_0}D^{\alpha}_{t}\,f(t) = \frac{1}{\Gamma(m-\alpha)}\int_{t_0}^{t}\frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\,\mathrm{d}\tau \qquad (3)$$

wherein f(t) is the objective function; α is the order with 0 < α < 1; m − 1 < α < m with m a positive integer (so that m = 1 here); Γ(·) is the gamma function; $t_0$ is the initial value; $f^{(m)}$ denotes the m-th order derivative of f; and τ is the integration variable;
the algorithm modification module is further configured to modify the adjusted prediction iterative algorithm so that it converges to the true extreme point, which includes:
the fractional gradient method is shown in the following formula (4):

$$x_{k+1} = x_k - \mu\,{}^{C}_{x_0}D^{\alpha}_{x_k}\,f(x), \qquad k = 0, 1, \dots, K \qquad (4)$$

wherein μ is the iteration step length, K is the iteration number, $x_k$ is the value of the parameter at the k-th iteration, and $x_0$ is the initial value serving as the lower limit of the Caputo derivative;

replacing the lower limit $x_0$ in equation (4) with the previous iterate $x_{k-1}$ gives the modified fractional gradient method of the following equation (5):

$$x_{k+1} = x_k - \mu\,{}^{C}_{x_{k-1}}D^{\alpha}_{x_k}\,f(x) \qquad (5)$$

substituting the Caputo definition of formula (3) into equation (5) and simplifying gives the modified prediction iterative algorithm of equation (6):

$$x_{k+1} = x_k - \frac{\mu}{\Gamma(2-\alpha)}\,f'(x_k)\,\left(x_k - x_{k-1}\right)^{1-\alpha} \qquad (6)$$

the prediction iterative algorithm of equation (6) converges to the true extreme point $x^{*}$;
the result output module is further configured to replace the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm, so as to construct the fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer uses a mixture of fractional order and integer order; the fully connected layer involves two types of gradients: the transfer gradient, which connects the nodes between two adjacent layers, and the update gradient, which updates the layer parameters.
CN202310115169.0A 2023-02-15 2023-02-15 protein-ATP binding site prediction method and device based on fractional order neural network Active CN115966249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310115169.0A CN115966249B (en) 2023-02-15 2023-02-15 protein-ATP binding site prediction method and device based on fractional order neural network

Publications (2)

Publication Number Publication Date
CN115966249A CN115966249A (en) 2023-04-14
CN115966249B (en) 2023-05-26

Family

ID=85888059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310115169.0A Active CN115966249B (en) 2023-02-15 2023-02-15 protein-ATP binding site prediction method and device based on fractional order neural network

Country Status (1)

Country Link
CN (1) CN115966249B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020167667A1 (en) * 2019-02-11 2020-08-20 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide analysis
CN112214222A (en) * 2020-10-27 2021-01-12 华中科技大学 Sequential structure for realizing feedforward neural network in COStream and compiling method thereof
CN114882945A (en) * 2022-07-11 2022-08-09 鲁东大学 Ensemble learning-based RNA-protein binding site prediction method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970090B (en) * 2019-11-18 2021-06-29 华中科技大学 Method for judging similarity between polypeptide to be processed and positive data set peptide fragment
US20210174903A1 (en) * 2019-12-10 2021-06-10 Protein Evolution Inc. Enhanced protein structure prediction using protein homolog discovery and constrained distograms
CN112767997B (en) * 2021-02-04 2023-04-25 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113593631B (en) * 2021-08-09 2022-11-29 山东大学 Method and system for predicting protein-polypeptide binding site

Also Published As

Publication number Publication date
CN115966249A (en) 2023-04-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant