CN115966249A

CN115966249A - Fractional order neural network-based protein-ATP binding site prediction method and device

Info

Publication number: CN115966249A
Application number: CN202310115169.0A
Authority: CN
Inventors: 王艺舒; 陈晓敏; 郭梦瑶
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2023-02-15
Filing date: 2023-02-15
Publication date: 2023-04-14
Anticipated expiration: 2043-02-15
Also published as: CN115966249B

Abstract

The invention provides a fractional order neural network-based protein-ATP binding site prediction method and device, and relates to the technical field of protein-ligand binding site prediction. The method comprises the following steps: the features required by the model are extracted from the digitized information of the protein and integrated into a feature matrix as input; a convolutional neural network is then selected, and the parameter updating step of its back propagation process is modified into a fractional order gradient iteration. Test data show that the prediction performance of the fractional-order-modified convolutional neural network is superior to that of existing machine learning models and integer order deep learning models. By combining a deep learning method with fractional order differentiation, the invention provides a protein-ATP binding site prediction method with improved accuracy. The invention is characterized in that a fractional order gradient under the Caputo definition is added to the fully connected layer of the single-start predictor, improving the performance of the predictor while preserving convergence and the chain rule.

Description

Fractional order neural network-based protein-ATP binding site prediction method and device

Technical Field

The invention relates to the technical field of prediction of protein-ligand binding sites, in particular to a fractional order neural network-based protein-ATP binding site prediction method and device.

Background

As an important substance constituting life, protein has been studied without interruption. Initially, even the composition of proteins was an elusive problem. Today, with the rapid development of computer technology, scientists use computers to determine the primary structure of more and more proteins and build specialized databases for querying and use, for example the PDB protein database [H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000) The Protein Data Bank. Nucleic Acids Research, 28: 235-242]. However, determining other information about a protein, such as its tertiary structure and its binding sites for other substances, is not easy.

The prediction of protein-ligand interaction sites is important for determining the sites at which drugs act on their targets. Determining protein structures and their binding sites with other compounds helps drugs take effect and improves the rate and efficiency of in vivo biochemical reactions such as enzymatic reactions and ATP binding. Protein-ligand interactions are critical for various biological processes, such as membrane trafficking, cell motility, muscle contraction, signal transduction, and the transcription and replication of DNA [ Liugui nephelia, bright jelly, songzhi. 187-194]. In drug discovery, protein-ligand interaction is an important basis for determining the site of action of a drug on its target, and it guides the research and development of new drugs for diseases such as cancer, diabetes and Alzheimer's disease. Therefore, accurate identification of protein binding sites is of great importance for the functional annotation of proteins and for the determination of drug targets.

Among these ligands, ATP (adenosine triphosphate) is a nucleoside triphosphate, a small-molecule compound that functions as a coenzyme in cells and plays an important role in various metabolic processes [Hu Jun, Li Yang, Zhang Yang, et al. Protein-ATP binding site prediction by combining sequence-profiling and structure-based comparisons [J]. Journal of Chemical Information & Modeling, 2018, 58: 501-510]. ATP binding sites are important drug targets for antibacterial and anticancer chemotherapy. However, identifying protein-ligand binding sites by wet laboratory experimental techniques is often costly and time consuming. As of June 2019, 7055 proteins in the Protein Data Bank (PDB) were labeled as ATP binding, accounting for approximately 4.62% of all records [Liang Yanchun, Liu Guixia, et al. A Novel Prediction Method for ATP-Binding Sites From Protein Sequences Based on Fusion of Deep ... IEEE Access, 2020, 8: 21485-21495], and the number of known ATP-binding proteins is far from sufficient in the face of the large-scale protein sequences of the post-genomic era. Today, algorithms such as machine learning are developing rapidly, and computational methods for determining binding sites on proteins continue to advance along with bioinformatics; however, conventional computational methods suffer from low accuracy and a high false positive rate in their predictions [ honjiajun. Zhejiang university, 2020]. To reveal the intrinsic mechanism of protein-ligand interaction, a great deal of wet laboratory work has been undertaken, and thousands of protein-ligand complex structures have been deposited in the PDB.

Because of the importance of protein-ligand interactions and the difficulty of identifying binding sites experimentally, developing efficient, automated computational methods to rapidly predict protein-ligand binding sites has become an increasingly important issue in bioinformatics, particularly when faced with the large-scale protein sequences of the post-genomic era.

AI techniques such as machine learning and deep learning can be used to determine protein-ligand interaction sites and are much faster than wet laboratory experiments, which makes them a good option that can be adopted now and explored further. Training and validating a model on a suitable data set greatly reduces the number of wet experiments and the experimental cost. However, these methods still have problems: the prediction accuracy is not good enough and the rate of wrong predictions is high. How to improve the prediction accuracy and further reduce the time cost is therefore a valuable research problem.

In biomedicine, understanding the interaction of proteins with ATP is helpful for protein functional annotation and drug development. Accurate identification of protein-ATP binding residues is an important but challenging task for gaining knowledge of protein-ATP interactions, especially when only protein sequence information is available. With the development of deep learning algorithms, convolutional neural networks (CNNs) have been widely used in various fields of bioinformatics. However, on the one hand, a convolutional neural network can usually improve classifier performance only by stacking convolutional layers deeper and deeper; on the other hand, the gradient algorithm in a convolutional neural network may suffer from gradient explosion and may fail to converge to the true extreme point of the objective function.

Disclosure of Invention

Aiming at the problems in the prior art that convolutional neural network models applied to protein-ATP prediction converge slowly, leave room for improvement in prediction performance, and face imbalanced data distribution, the invention provides a fractional order neural network-based protein-ATP binding site prediction method and device.

In order to solve the technical problems, the invention provides the following technical scheme:

In one aspect, a fractional order neural network-based protein-ATP binding site prediction method is provided, which is applied to an electronic device and includes the following steps:

S1: constructing an initial prediction model, acquiring a training set based on the PDB protein database, collecting the features of the target residues and of the adjacent residues of the target residues in the training set by a sliding window technique, and integrating the features into a feature matrix;

S2: using the weighted cross entropy as the loss function of the prediction model, and adjusting the prediction iterative algorithm of each amino acid type by giving different weights on the basis of the loss function to obtain an adjusted prediction iterative algorithm;

S3: constructing a fractional order derivative based on the Caputo definition, and modifying the adjusted prediction iterative algorithm based on the fractional order derivative;

S4: replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm to construct a new prediction model; and inputting the feature matrix into the new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

Optionally, the training set is the unprocessed raw protein sequence set ATP-227.

Optionally, in S1, constructing an initial prediction model, obtaining a training set based on the PDB protein database, collecting features of target residues and adjacent residues of the target residues in the training set by a sliding window technique, and integrating the features into a feature matrix, including:

S11: acquiring a training set based on the PDB protein database, and determining the size of the sliding window; the sliding window comprises a target residue and the adjacent residues on the left and right sides of the target residue;

S12: running PSI-BLAST, a search tool of the BLAST family based on a local alignment algorithm, against the annotated protein sequence database Swiss-Prot with the training set as input, to obtain the PSSM matrix of the training set;

S13: acquiring the protein secondary structure in the training set, and expressing the protein secondary structure by a 3-state secondary structure representation to obtain a protein secondary structure vector;

S14: carrying out one-hot coding on the amino acids in the training set to obtain a one-hot coding vector for each amino acid, wherein the amino acids are one-hot coded according to their classification by the dipoles and volumes of their side chains;

S15: performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot coding vector of each amino acid through the sliding window to obtain the features of the target residues and of their adjacent residues in the training set, and integrating the features into a feature matrix.

Optionally, in S2, the obtaining of the adjusted prediction iterative algorithm by using the weighted cross entropy as a loss function of the prediction model and adjusting the prediction iterative algorithm of each amino acid type by giving different weights based on the loss function includes:

defining the cross entropy of the ith sample as shown in the following formula (1):

$$H_i = -\sum_{p=1}^{C} y_{ip}\,\log \hat{y}_{ip} \tag{1}$$

wherein $C$ is the number of classes and $y_{ip} \in \{0, 1\}$; if the ith sample belongs to the p-th class, then $y_{ip} = 1$, otherwise $y_{ip} = 0$; $\hat{y}_{ip}$ represents the prediction probability that the ith sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following formula (2):

$$H_w = -\frac{1}{N}\sum_{i=1}^{N}\sum_{p=1}^{C} w_p\, y_{ip}\,\log \hat{y}_{ip} \tag{2}$$

wherein $w_p$ is the weight of each class, $y_{ip}$ is the value after one-hot coding, and $N$ represents the number of samples.

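Optionally, in S3, the fractional order derivative under the Caputo definition is given by the following formula (3):

$${}^{C}_{t_0}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(m-\alpha)} \int_{t_0}^{t} \frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\, d\tau \tag{3}$$

wherein $f(t)$ is the target function; $\alpha$ is the order, with $m-1 < \alpha < m$ for a positive integer $m$ (so $m = 1$ when $0 < \alpha < 1$); $\Gamma(\cdot)$ is the gamma function; $t_0$ is the initial value; $f^{(m)}$ denotes the m-th order derivative of $f$; and $\tau$ is the integration variable.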

Optionally, modifying the adjusted prediction iteration algorithm based on the fractional derivative in step S3 includes:

the fractional order gradient method is shown in the following formula (4):

$$x_{K+1} = x_K - \mu\, {}^{C}_{x_0}D^{\alpha}_{x_K} f(x) \tag{4}$$

where $\mu$ is the iteration step size or learning rate, $K$ is the number of iterations, and $x_K$ denotes the iterate at the K-th step starting from the initial value $x_0$;

replacing the lower terminal $x_0$ in formula (4) with $x_{K-1}$, a modified fractional order gradient method is obtained as shown in the following formula (5):

$$x_{K+1} = x_K - \mu\, {}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) \tag{5}$$

combining formula (5) with formula (3) and simplifying yields the modified prediction iterative algorithm shown in the following formula (6):

$$x_{K+1} = x_K - \mu\, \frac{f^{(1)}(x_{K-1})}{\Gamma(2-\alpha)}\, \lvert x_K - x_{K-1} \rvert^{1-\alpha} \tag{6}$$

the prediction iterative algorithm of formula (6) converges, and it converges to the true extreme point $x^{*}$.

Optionally, in step S4, replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model, including:

replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm, and constructing the fully connected layer of the convolutional neural network in the new prediction model, wherein back propagation through the fully connected layer uses a mixture of fractional order and integer order gradients. Two types of gradient pass through the fully connected layer: the transfer gradient, which links the nodes between two layers, and the update gradient, which updates the parameters within a layer.

In one aspect, there is provided a fractional order neural network-based protein-ATP binding site prediction apparatus, which is applied to an electronic device, the apparatus including:

the characteristic extraction module is used for constructing an initial prediction model, acquiring a training set based on a PDB protein database, collecting the characteristics of target residues and adjacent residues of the target residues in the training set through a sliding window technology, and integrating the characteristics into a characteristic matrix;

the function modification module is used for utilizing the weighted cross entropy as a loss function of the prediction model, and adjusting the prediction iterative algorithm of each amino acid type by giving different weights on the basis of the loss function to obtain an adjusted prediction iterative algorithm;

the algorithm modification module is used for constructing a fractional derivative defined based on Caputo and modifying the adjusted prediction iteration algorithm based on the fractional derivative;

the result output module is used for replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm to construct a new prediction model; and inputting the characteristic matrix into a new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

Optionally, the feature extraction module is further configured to obtain a training set based on the PDB protein database, and determine the size of the sliding window, where the sliding window includes a target residue, and adjacent residues of the target residue are respectively located on the left and right sides of the target residue;

running PSI-BLAST, a search tool of the BLAST family based on a local alignment algorithm, against the annotated protein sequence database Swiss-Prot with the training set as input, to obtain the PSSM matrix of the training set;

acquiring the protein secondary structure in the training set, and expressing the protein secondary structure by a 3-state secondary structure representation to obtain a protein secondary structure vector;

carrying out one-hot coding on the amino acids in the training set to obtain a one-hot coding vector for each amino acid, wherein the amino acids are one-hot coded according to their classification by the dipoles and volumes of their side chains;

performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot coding vector of each amino acid through the sliding window to obtain the features of the target residues and of their adjacent residues in the training set, and integrating the features into a feature matrix.

In one aspect, an electronic device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the above-mentioned fractional order neural network-based protein-ATP binding site prediction method.

In one aspect, a computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to implement a fractional order neural network-based protein-ATP binding site prediction method as described above is provided.

The technical scheme of the embodiment of the invention at least has the following beneficial effects:

In the scheme, a protein-ATP binding site prediction method is provided by combining a deep learning method with fractional order differentiation, and the accuracy is improved. The invention is characterized in that a fractional order gradient under the Caputo definition is added to the fully connected layer of the single-start predictor, improving the performance of the predictor while preserving convergence and the chain rule.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a fractional order neural network-based protein-ATP binding site prediction method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of a fractional order neural network-based protein-ATP binding site prediction method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a forward propagation algorithm of a fractional order neural network-based protein-ATP binding site prediction method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an updated process of a fractional order neural network-based protein-ATP binding site prediction method according to an embodiment of the present invention;

FIG. 5 is a diagram of the result of a fractional order neural network-based protein-ATP binding site prediction method according to an embodiment of the present invention;

FIG. 6 is a block diagram of a fractional order neural network-based protein-ATP binding site prediction apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed description of the preferred embodiments

To make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a protein-ATP binding site prediction method based on a fractional order neural network, which can be realized by an electronic device, wherein the electronic device can be a terminal or a server. As shown in fig. 1, the flowchart of the fractional order neural network-based protein-ATP binding site prediction method combining multi-scale convolution and self-attention coding may include the following steps:

S101: constructing an initial prediction model, acquiring a training set based on the PDB protein database, collecting the features of the target residues and of the adjacent residues of the target residues in the training set by a sliding window technique, and integrating the features into a feature matrix;

S102: using the weighted cross entropy as the loss function of the prediction model, and adjusting the prediction iterative algorithm of each amino acid type by giving different weights on the basis of the loss function to obtain an adjusted prediction iterative algorithm;

S103: constructing a fractional order derivative based on the Caputo definition, and modifying the adjusted prediction iterative algorithm based on the fractional order derivative;

S104: replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm to construct a new prediction model; inputting the feature matrix into the new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

Optionally, in S101, constructing an initial prediction model, obtaining a training set based on the PDB protein database, collecting features of a target residue and adjacent residues of the target residue in the training set by a sliding window technique, and integrating the features into a feature matrix, including:

S111: acquiring a training set based on the PDB protein database, and determining the size of the sliding window; the sliding window comprises a target residue and the adjacent residues on the left and right sides of the target residue;

S112: running PSI-BLAST, a search tool of the BLAST family based on a local alignment algorithm, against the annotated protein sequence database Swiss-Prot with the training set as input, to obtain the PSSM matrix of the training set;

S113: acquiring the protein secondary structure in the training set, and expressing the protein secondary structure by a 3-state secondary structure representation to obtain a protein secondary structure vector;

S114: carrying out one-hot coding on the amino acids in the training set to obtain a one-hot coding vector for each amino acid, wherein the amino acids are one-hot coded according to their classification by the dipoles and volumes of their side chains;

S115: performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot coding vector of each amino acid through the sliding window to obtain the features of the target residues and of their adjacent residues in the training set, and integrating the features into a feature matrix.

Optionally, in S102, the obtaining an adjusted prediction iterative algorithm by adjusting the prediction iterative algorithm for each amino acid type by giving different weights based on the loss function using the weighted cross entropy as the loss function of the prediction model includes:

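defining the cross entropy of the ith sample as shown in the following formula (1):

$$H_i = -\sum_{p=1}^{C} y_{ip}\,\log \hat{y}_{ip} \tag{1}$$

wherein $C$ is the number of classes and $y_{ip} \in \{0, 1\}$; if the ith sample belongs to the p-th class, then $y_{ip} = 1$, otherwise $y_{ip} = 0$; $\hat{y}_{ip}$ represents the prediction probability of the ith sample belonging to the p-th class;

the weighted cross entropy is defined as shown in the following formula (2):

$$H_w = -\frac{1}{N}\sum_{i=1}^{N}\sum_{p=1}^{C} w_p\, y_{ip}\,\log \hat{y}_{ip} \tag{2}$$

wherein $w_p$ is the weight of each class, $y_{ip}$ is the value after one-hot coding, and $N$ represents the number of samples.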

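Optionally, in S103, the fractional order derivative under the Caputo definition is given by the following formula (3):

$${}^{C}_{t_0}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(m-\alpha)} \int_{t_0}^{t} \frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\, d\tau \tag{3}$$

wherein $f(t)$ is the target function, $\alpha$ is the order with $m-1 < \alpha < m$ for a positive integer $m$, $\Gamma(\cdot)$ is the gamma function, $t_0$ is the initial value, $f^{(m)}$ denotes the m-th order derivative of $f$, and $\tau$ is the integration variable.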

Optionally, modifying the adjusted prediction iteration algorithm based on the fractional derivative in step S103 includes:

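the fractional order gradient method is shown in the following formula (4):

$$x_{K+1} = x_K - \mu\, {}^{C}_{x_0}D^{\alpha}_{x_K} f(x) \tag{4}$$

wherein $\mu$ is the iteration step size or learning rate, and $K$ is the number of iterations;

replacing the lower terminal $x_0$ in formula (4) with $x_{K-1}$, a modified fractional order gradient method is obtained as shown in the following formula (5):

$$x_{K+1} = x_K - \mu\, {}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) \tag{5}$$

combining formula (5) with formula (3) and simplifying yields the modified prediction iterative algorithm shown in the following formula (6):

$$x_{K+1} = x_K - \mu\, \frac{f^{(1)}(x_{K-1})}{\Gamma(2-\alpha)}\, \lvert x_K - x_{K-1} \rvert^{1-\alpha} \tag{6}$$

the prediction iterative algorithm of formula (6) converges, and it converges to the true extreme point $x^{*}$.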
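The iteration described for formula (6) can be illustrated with a short Python sketch. It assumes the first-term form x_{K+1} = x_K - μ f'(x_{K-1}) / Γ(2-α) · |x_K - x_{K-1}|^(1-α); the function name `fractional_gradient_descent`, the small constant `delta` (added so the step cannot stall when two iterates coincide) and the quadratic test function are illustrative assumptions, not part of the described method.

```python
import math


def fractional_gradient_descent(grad, x0, x1, alpha=0.9, mu=0.1, steps=300, delta=1e-8):
    """Iterate x_{K+1} = x_K - mu * grad(x_{K-1}) / Gamma(2 - alpha) * (|x_K - x_{K-1}| + delta)^(1 - alpha).

    grad   : first derivative of the objective f
    x0, x1 : two starting iterates (x_{K-1} and x_K for the first step)
    delta  : assumed small constant that keeps the step from vanishing exactly
    """
    scale = mu / math.gamma(2.0 - alpha)
    x_prev, x_curr = x0, x1
    for _ in range(steps):
        step_factor = (abs(x_curr - x_prev) + delta) ** (1.0 - alpha)
        x_next = x_curr - scale * grad(x_prev) * step_factor
        x_prev, x_curr = x_curr, x_next
    return x_curr


# Hypothetical convex objective f(x) = (x - 3)^2 with true extreme point x* = 3.
x_star = fractional_gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, x1=0.5)
print(x_star)
```

Because the lower terminal moves with the iteration, the fractional factor shrinks together with the step, so the fixed point is where the ordinary first derivative vanishes.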

Optionally, in step S104, replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model, including:

replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with the modified prediction iterative algorithm, and constructing the fully connected layer of the convolutional neural network in the new prediction model, wherein back propagation through the fully connected layer uses a mixture of fractional order and integer order gradients. Two types of gradient pass through the fully connected layer: the transfer gradient, which links the nodes between two layers, and the update gradient, which updates the parameters within a layer.
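One way to picture the mixed scheme is the following minimal Python sketch of a single bias-free fully connected layer. It assumes that the transfer gradient is computed with the ordinary integer order chain rule and that only the update gradient carries the fractional factor |W_K - W_{K-1}|^(1-α) / Γ(2-α). The class name `FractionalDense`, the constant `delta`, and the use of the current-step gradient in the update are assumptions made for illustration.

```python
import math


class FractionalDense:
    """Bias-free fully connected layer whose weight update uses a fractional order factor."""

    def __init__(self, weights, alpha=0.9, lr=0.1, delta=1e-8):
        self.w = [list(row) for row in weights]       # W_K, shape (n_out, n_in)
        self.w_prev = [list(row) for row in weights]  # W_{K-1}; equals W_K before the first update
        self.alpha, self.lr, self.delta = alpha, lr, delta
        self.x = None

    def forward(self, x):
        self.x = list(x)
        return [sum(w_ji * x_i for w_ji, x_i in zip(row, self.x)) for row in self.w]

    def backward(self, grad_out):
        n_out, n_in = len(self.w), len(self.x)
        # Transfer gradient: integer order chain rule, handed to the previous layer.
        grad_in = [sum(self.w[j][i] * grad_out[j] for j in range(n_out)) for i in range(n_in)]
        # Update gradient: integer order dL/dW scaled element-wise by the fractional factor.
        scale = 1.0 / math.gamma(2.0 - self.alpha)
        new_w = []
        for j in range(n_out):
            new_row = []
            for i in range(n_in):
                step = abs(self.w[j][i] - self.w_prev[j][i]) + self.delta
                frac_grad = grad_out[j] * self.x[i] * scale * step ** (1.0 - self.alpha)
                new_row.append(self.w[j][i] - self.lr * frac_grad)
            new_w.append(new_row)
        self.w_prev, self.w = self.w, new_w
        return grad_in
```

With `alpha=1.0` the fractional factor equals 1 and the layer reduces to plain gradient descent, which gives a simple sanity check for the sketch.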

In the embodiment of the invention, a protein-ATP binding site prediction method is provided by combining a deep learning method with fractional order differentiation, and the accuracy is improved. First, the data sets ATP-227 and ATP-14 are selected as the training set and the test set, and the features required by the model are extracted from the digitized information of the protein and integrated into a feature matrix as input. Then a convolutional neural network is selected, and the parameter updating step of its back propagation process is modified into a fractional order gradient iteration; test data show that the prediction performance of the fractional-order-modified convolutional neural network is superior to that of existing machine learning models and integer order deep learning models. The invention is characterized in that a fractional order gradient under the Caputo definition is added to the fully connected layer of the single-start predictor, improving the performance of the predictor while preserving convergence and the chain rule.

The embodiment of the invention provides a protein-ATP binding site prediction method based on a fractional order neural network. The method may be implemented by an electronic device, which may be a terminal or a server. As shown in FIG. 2, which is a flowchart of the fractional order neural network-based protein-ATP binding site prediction method combining multi-scale convolution and self-attention coding, the processing flow of the method may include the following steps:

S201: acquiring a training set based on the PDB protein database, and determining the size of the sliding window; the sliding window comprises a target residue and the adjacent residues on the left and right sides of the target residue;

In one possible embodiment, the training set is the unprocessed raw protein sequence set ATP-227. The invention uses two classical data sets that are common in protein-ATP binding site prediction, ATP-227 and ATP-14, and takes their original protein sequences without preprocessing. ATP-227 consists of 227 ATP-binding protein chains published in the PDB protein database before March 10, 2010. The 227 chains contain a total of 3393 ATP-binding residues and 80409 non-ATP-binding residues. Meanwhile, 14 protein chains are selected from ATP-17 (for the other chains, the corresponding FASTA files could not be found in the PDB database by protein ID) and named ATP-14 as an independent test set; the similarity between any chain in ATP-14 and the chains in ATP-227 is guaranteed to be less than 41%. The FASTA sequence files of the data sets are downloaded in batches from the PDB protein database, with ATP-227 as the training set and ATP-14 as the test set.

In one possible embodiment, the features of the target residue and its adjacent residues are collected using a sliding window technique, because each protein sequence contains a large number of amino acids, the ratio of non-binding to binding residues is high, and studies show that the binding properties of a target residue are affected by its adjacent residues. A sliding window of size L comprises the target residue and (L-1)/2 adjacent residues on each of its left and right sides. By comparing the performance of different window sizes, L = 15 is finally selected in this embodiment. That is, one sliding window can be written as 000000010000000, in which the central 1 marks the target residue and the seven 0s on each side mark its adjacent residues.
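As a minimal sketch of this step, the following Python function collects, for every residue, the feature rows of the residue and of its (L-1)/2 neighbors on each side. The function name `sliding_windows` is hypothetical, and zero-filling the positions that fall outside the chain is an assumption, since the text does not state how chain ends are handled.

```python
def sliding_windows(per_residue_features, window_size=15):
    """Return one window of feature rows per residue, with the target residue centered."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the target residue is centered")
    half = (window_size - 1) // 2
    n_features = len(per_residue_features[0])
    pad_row = [0.0] * n_features  # assumption: positions beyond the chain ends are zero-filled
    padded = [pad_row] * half + list(per_residue_features) + [pad_row] * half
    return [padded[i:i + window_size] for i in range(len(per_residue_features))]


# Toy chain of 5 residues with a single feature each, window size 5.
toy = [[float(i)] for i in range(1, 6)]
windows = sliding_windows(toy, window_size=5)
print(windows[0])  # [[0.0], [0.0], [1.0], [2.0], [3.0]]
```

Each window has the same length as the chain-independent window size, so every residue, including those near the chain ends, yields an input of identical shape.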

S202: running PSI-BLAST, a search tool of the BLAST family based on a local alignment algorithm, against the annotated protein sequence database Swiss-Prot with the training set as input, to obtain the PSSM matrix of the training set.

In a possible implementation, the PSSM output also contains other information; in this embodiment, only the first 20 columns are retained.

S203: acquiring the protein secondary structure in the training set, and representing the protein secondary structure by a 3-state secondary structure representation to obtain a protein secondary structure vector.

In one possible embodiment, for the protein secondary structure, the invention selects the 3-state secondary structure representation, namely coil (C), helix (H) and strand (E), obtained by running PSIPRED 4.02 in the BLAST environment. Solvent accessibility is obtained using ASAquick. The extraction of the above three features is based on the FASTA sequence files.

S204: carrying out one-hot coding on the amino acids in the training set to obtain a one-hot coding vector for each amino acid, wherein the amino acids are one-hot coded according to their classification by the dipoles and volumes of their side chains.
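A possible encoding for this step is sketched below in Python. The text does not list the classes, so the grouping used here is an assumption: the seven-class dipole and side-chain-volume grouping that is commonly used for amino acid classification, which matches the seven-dimensional vector mentioned later in this embodiment. The names `SEVEN_CLASSES`, `one_hot_residue` and `one_hot_sequence` are hypothetical.

```python
# Assumed grouping (not listed in the source): a commonly used seven-class
# classification of the 20 amino acids by side-chain dipole and volume.
SEVEN_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(SEVEN_CLASSES) for aa in group}


def one_hot_residue(aa):
    """Return the 7-dimensional one-hot vector of a residue; unknown letters map to all zeros."""
    vec = [0] * len(SEVEN_CLASSES)
    idx = AA_TO_CLASS.get(aa.upper())
    if idx is not None:
        vec[idx] = 1
    return vec


def one_hot_sequence(seq):
    """Encode a whole sequence as a list of 7-dimensional one-hot rows."""
    return [one_hot_residue(aa) for aa in seq]


print(one_hot_sequence("ACK"))
# [[1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0]]
```

Mapping non-standard letters such as X to an all-zero row is a design choice of this sketch; another grouping or another treatment of unknown residues would only change the lookup table.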

S205: performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot coding vector of each amino acid through the sliding window to obtain the features of the target residues and of their adjacent residues in the training set, and integrating the features into a feature matrix.

In one possible embodiment, the features are extracted through the sliding window, which in this embodiment yields a 15 x 20 PSSM matrix, a 15 x 3 protein secondary structure vector, a 15 x 1 solvent accessibility vector and a 15 x 7 one-hot coding vector. In this embodiment, the data sets ATP-227 and ATP-14 are used as the training set and the test set, and the features required by the model are extracted from the digitized information of the protein and integrated into a feature matrix as the input of the new prediction model.
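The integration of the four feature blocks can be sketched as a column-wise concatenation, giving a 15 x 31 matrix per target residue (20 + 3 + 1 + 7 columns). The function name `build_feature_matrix` and the column order are assumptions; the text only states the block sizes.

```python
def build_feature_matrix(pssm_win, ss_win, asa_win, onehot_win):
    """Concatenate the per-residue rows of one window column-wise into a single matrix."""
    lengths = {len(pssm_win), len(ss_win), len(asa_win), len(onehot_win)}
    if len(lengths) != 1:
        raise ValueError("all feature blocks must cover the same window length")
    return [
        list(p) + list(s) + list(a) + list(o)
        for p, s, a, o in zip(pssm_win, ss_win, asa_win, onehot_win)
    ]


# Dummy window of length 15 with the block widths given in the text.
window = build_feature_matrix(
    [[0.0] * 20 for _ in range(15)],  # PSSM block, 15 x 20
    [[0.0] * 3 for _ in range(15)],   # secondary structure block, 15 x 3
    [[0.0] * 1 for _ in range(15)],   # solvent accessibility block, 15 x 1
    [[0] * 7 for _ in range(15)],     # one-hot class block, 15 x 7
)
print(len(window), len(window[0]))  # 15 31
```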

S206: using the weighted cross entropy as a loss function of a prediction model, and adjusting a prediction iterative algorithm of each amino acid type by giving different weights based on the loss function to obtain an adjusted prediction iterative algorithm;

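In one possible embodiment, the invention addresses the data imbalance problem by modifying the loss function, namely the cross entropy. The weighted cross entropy is used as the loss function, and the prediction for each class is adjusted by assigning different weights. The cross entropy of the ith sample is defined as shown in the following formula (1):

$$H_i = -\sum_{p=1}^{C} y_{ip}\,\log \hat{y}_{ip} \tag{1}$$

wherein $C$ is the number of classes; if the ith sample belongs to the p-th class, then $y_{ip} = 1$, otherwise $y_{ip} = 0$; $\hat{y}_{ip}$ represents the prediction probability that the ith sample belongs to the p-th class;

the weighted cross entropy is defined as shown in the following formula (2):

$$H_w = -\frac{1}{N}\sum_{i=1}^{N}\sum_{p=1}^{C} w_p\, y_{ip}\,\log \hat{y}_{ip} \tag{2}$$

wherein $w_p$ is the weight of each class, $y_{ip}$ is the one-hot coded value, and $N$ is the number of samples.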
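The weighted cross entropy can be sketched in a few lines of Python. The sketch assumes that the loss is averaged over the N samples and that labels are given as class indices, so that the one-hot coded value selects exactly one term per sample; the function name `weighted_cross_entropy` and the clipping constant `eps` are illustrative assumptions.

```python
import math


def weighted_cross_entropy(y_true, y_prob, class_weights, eps=1e-12):
    """Average of -w_p * log(predicted probability of the true class p) over all samples."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0)  # clip to avoid log(0)
        total += -class_weights[label] * math.log(p)
    return total / len(y_true)


# Two samples, two classes (0 = non-binding, 1 = binding), larger weight on the rare class.
loss = weighted_cross_entropy([0, 1], [[0.9, 0.1], [0.4, 0.6]], class_weights=[0.5, 12.0])
print(loss)
```

Raising the weight of the minority class makes a misclassified binding residue cost more than a misclassified non-binding residue, which is how the weighting counteracts the class imbalance.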

In one possible implementation, the invention addresses the imbalanced learning problem by using the weighted cross entropy as the loss function and adjusting the prediction for each class by giving different weights. The class weights are calculated with Scikit-learn; the balanced class weights are determined by the formula:

$$w = \frac{n_{\mathrm{samples}}}{n_{\mathrm{classes}} \cdot \mathrm{bincount}(y)}$$

wherein $n_{\mathrm{samples}}$ indicates the total number of samples and $n_{\mathrm{classes}}$ indicates the number of classes; in this document $n_{\mathrm{classes}} = 2$. bincount(y) is a function of the NumPy library in Python that gives the number of occurrences of each element in y. We choose the classification threshold that maximizes the MCC value.
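The weighting described above can be illustrated with a short NumPy sketch. It is a sketch under assumptions rather than the implementation of the invention: labels are assumed to be integers 0..P-1 with every class present, the loss is averaged over samples, and the function names are hypothetical. The weight formula mirrors the "balanced" mode of scikit-learn's `compute_class_weight`.

```python
import numpy as np


def balanced_class_weights(y):
    """n_samples / (n_classes * bincount(y)), as in scikit-learn's 'balanced' mode."""
    y = np.asarray(y)
    counts = np.bincount(y)  # occurrences of each label 0..P-1 (assumed non-zero)
    return len(y) / (len(counts) * counts)


def weighted_cross_entropy(y_true, y_prob, class_weights, eps=1e-12):
    """Mean weighted cross entropy over N samples (averaging is an assumption).

    y_true: (N,) integer labels; y_prob: (N, P) predicted class probabilities.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Probability assigned to the true class; clipped to avoid log(0).
    p_true = np.clip(y_prob[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(np.mean(-class_weights[y_true] * np.log(p_true)))
```

With three non-binding residues and one binding residue, for example, the binding class receives weight 2.0 and the non-binding class about 0.67, so errors on the rare class contribute more to the loss.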

S207: constructing a fractional order derivative defined based on Caputo, and modifying the adjusted prediction iterative algorithm based on the fractional order derivative;

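In one possible embodiment, the invention studies the fractional order gradient under the Caputo definition, since the Caputo fractional derivative has the convenient property that the derivative of a constant equals 0.

The fractional derivative defined by Caputo is given by equation (3):

$${}^{C}_{t_0}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(m-\alpha)} \int_{t_0}^{t} \frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\, d\tau \tag{3}$$

wherein $f(t)$ is the objective function, $\alpha$ is the order, $m-1 < \alpha < m$, $m$ is a positive integer, $\Gamma(\cdot)$ is the gamma function, and $t_0$ is the initial value.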
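As a numerical illustration of the Caputo definition for the case 0 < α < 1 (m = 1), the following sketch approximates the integral with a midpoint rule. The function name `caputo_derivative` and the change of variable used to remove the endpoint singularity are choices made for this example only.

```python
import math


def caputo_derivative(df, t, t0=0.0, alpha=0.5, n=2000):
    """Approximate the Caputo derivative of order 0 < alpha < 1 at t >= t0.

    df is the ordinary first derivative f'. Substituting
    u = (t - tau)**(1 - alpha) turns the singular integral into
    (1 / Gamma(2 - alpha)) * integral_0^U f'(t - u**(1/(1-alpha))) du,
    with U = (t - t0)**(1 - alpha), evaluated here by the midpoint rule.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only covers 0 < alpha < 1")
    upper = (t - t0) ** (1.0 - alpha)
    h = upper / n
    total = 0.0
    for j in range(n):
        u = (j + 0.5) * h  # midpoint of the j-th sub-interval
        total += df(t - u ** (1.0 / (1.0 - alpha)))
    return total * h / math.gamma(2.0 - alpha)
```

For f(t) = t² with t₀ = 0 the exact value is Γ(3)/Γ(3 − α) · t^(2 − α), which the sketch can be checked against; for a constant function f' is identically zero, so the result is 0, in line with the property noted above.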

In a possible embodiment, let $f(x)$ be a smooth convex function with a unique extreme point $x^*$. Each iteration step of the conventional integer order gradient method is:

$$x_{K+1} = x_K - \mu f^{(1)}(x_K)$$

where $\mu$ is the iteration step size or learning rate, $K$ is the number of iterations, and $x_K$ denotes the iterate at the $K$-th step. With $x_0$ denoting the initial value, the fractional order gradient method can be written as:

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_0}D^{\alpha}_{x_K} f(x) \tag{4}$$

In a possible embodiment, if the fractional derivative is applied directly, the fractional order gradient method cannot converge to the true extreme point $x^*$ of $f(x)$, but only to an extreme point defined by the Caputo fractional derivative; this extreme point depends on the initial value $x_0$ and the order $\alpha$, and in most cases is not equal to $x^*$.

To ensure that the algorithm converges to the true extreme point, another fractional order gradient method is considered, in which $x_0$ in formula (4) is replaced by $x_{K-1}$ during the iteration. The modified fractional order gradient method is then obtained as shown in equation (5) below:

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) \tag{5}$$

wherein $0 < \alpha < 1$.

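Substituting the above equation (5) into equation (3) and expanding $f$ in a Taylor series about $x_{K-1}$ yields:

$${}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) = \sum_{i=0}^{\infty} \frac{f^{(i+1)}(x_{K-1})}{\Gamma(i+2-\alpha)}\,(x_K - x_{K-1})^{i+1-\alpha}$$

When only the first term ($i = 0$) is retained and the absolute value of $x_K - x_{K-1}$ is introduced, the fractional order gradient method for $0 < \alpha < 2$ simplifies to the modified iterative algorithm of equation (6) below:

$$x_{K+1} = x_K - \mu\,\frac{f^{(1)}(x_{K-1})}{\Gamma(2-\alpha)}\,\left|x_K - x_{K-1}\right|^{1-\alpha} \tag{6}$$

The iterative algorithm of equation (6) converges, and the point to which it converges is the true extreme point $x^*$.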
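The iteration of formula (6) can be tried on a one-dimensional convex function. The sketch below is illustrative only: the function name, the quadratic test function, the step size, the order and the two starting points are all assumptions chosen for the example.

```python
import math


def fractional_gradient_descent(grad, x0, x1, mu=0.1, alpha=0.9, steps=500):
    """Iterate x_{K+1} = x_K - mu * grad(x_{K-1}) / Gamma(2 - alpha)
    * |x_K - x_{K-1}|**(1 - alpha).

    x0 and x1 are two distinct starting points; they are needed because
    each step depends on the previous increment.
    """
    x_prev, x_curr = x0, x1
    scale = mu / math.gamma(2.0 - alpha)
    for _ in range(steps):
        step = scale * grad(x_prev) * abs(x_curr - x_prev) ** (1.0 - alpha)
        x_prev, x_curr = x_curr, x_curr - step
    return x_curr


# Example: f(x) = (x - 3)^2 has its unique extreme point at x* = 3.
x_star = fractional_gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, x1=0.5)
```

Because the step length is proportional to a power of the previous increment, the iteration stalls if two consecutive iterates coincide exactly; practical implementations often add a small positive constant inside the absolute value to avoid this.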

S208: replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; inputting the characteristic matrix into a new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

In one possible embodiment, a fully-connected layer of a convolutional neural network is constructed, wherein the back-propagation gradient of the fully-connected layer employs a mixture of fractional and integer order to ensure that the chain rule holds. Two types of gradients are set through the layers, one is a transitive gradient connecting the nodes between the two layers, and the other is an update gradient for intra-layer parameters.

In one possible embodiment, a schematic diagram of the forward propagation algorithm is shown in Fig. 3, where $x_k^{[l]}$ denotes the output of the $k$-th node in the $l$-th layer:

$$x_k^{[l]} = \sigma\!\left(w_k^{[l]}\, x^{[l-1]} + b_k^{[l]}\right)$$

Here $w^{[l]}$ represents the weights of the $l$-th layer, $b^{[l]}$ represents the bias, $x^{[l-1]}$ represents the output of the previous layer, and the function $\sigma(\cdot)$ is the activation function.

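To ensure that the chain rule holds, the transfer gradient remains an integer order gradient:

$$\frac{\partial L}{\partial x^{[l-1]}} = \frac{\partial L}{\partial x^{[l]}}\,\frac{\partial x^{[l]}}{\partial x^{[l-1]}}$$

whereas the update gradient of the parameters uses the fractional order step of formula (6):

$$w^{[l]}_{(K+1)} = w^{[l]}_{(K)} - \mu\,\frac{\partial L}{\partial w^{[l]}}\,\frac{\left|w^{[l]}_{(K)} - w^{[l]}_{(K-1)}\right|^{1-\alpha}}{\Gamma(2-\alpha)}$$

where $L$ is the loss, $x^{[l]}$ is the output of the $l$-th layer, and $w^{[l]}_{(K)}$ denotes the weights of the $l$-th layer at iteration $K$; the bias is updated in the same way.

The update process is shown in Fig. 4.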
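The mixed scheme can be sketched for a single fully connected layer as follows. This is a hedged illustration, not the implementation of the invention: the sigmoid activation, the small constant `delta`, the function name `fc_backward_mixed`, and the choice to evaluate the integer order gradient at the current parameters are all assumptions of the example.

```python
import math
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fc_backward_mixed(x_in, w, b, w_prev, b_prev, grad_out,
                      mu=0.1, alpha=0.9, delta=1e-8):
    """One backward pass through a sigmoid fully connected layer.

    x_in: (n_in,), w: (n_out, n_in), b: (n_out,), grad_out: dL/d(output).
    Returns (transfer gradient dL/dx_in, updated w, updated b).
    """
    out = sigmoid(w @ x_in + b)
    grad_z = grad_out * out * (1.0 - out)  # integer order chain rule
    grad_x = w.T @ grad_z                  # transfer gradient, kept integer order
    grad_w = np.outer(grad_z, x_in)        # integer order dL/dw
    scale = mu / math.gamma(2.0 - alpha)
    # Fractional order update gradient: each parameter's gradient is scaled by
    # |theta_K - theta_{K-1}|**(1 - alpha); delta (assumed) avoids a frozen step.
    w_new = w - scale * grad_w * (np.abs(w - w_prev) + delta) ** (1.0 - alpha)
    b_new = b - scale * grad_z * (np.abs(b - b_prev) + delta) ** (1.0 - alpha)
    return grad_x, w_new, b_new
```

Only the parameter update is fractional; the gradient handed to the previous layer is the ordinary one, so the chain rule between layers is unaffected. With `alpha=1.0` the update reduces to plain gradient descent.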

In one possible embodiment, the model is tested using ATP-17 as the test set, and the model outputs a one-dimensional vector of prediction probabilities, one for each site of the protein sequence. Based on the criterion of maximizing the MCC, the threshold is set to 0.80: when the prediction probability of a site is greater than 0.8, the site is judged to be a binding site and is represented by "1"; otherwise it is represented by "0". We performed 15 replicate experiments on the test set and selected accuracy (Acc), sensitivity (Sen), specificity (Spe) and the Matthews correlation coefficient (MCC) as evaluation indexes. The model was compared with the conventional convolutional neural network, and the average values over the replicate experiments are shown in the following table:
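The four evaluation indexes and the MCC-based threshold choice can be sketched as below. The function names and the candidate threshold grid are assumptions made for illustration; the metric definitions are the standard ones for binary classification.

```python
import math
import numpy as np


def binary_metrics(y_true, y_pred):
    """Return Acc, Sen, Spe and MCC for binary labels (1 = binding site)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn) if tp + fn else 0.0,
        "Spe": tn / (tn + fp) if tn + fp else 0.0,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def best_threshold(y_true, y_prob, candidates=None):
    """Pick the candidate threshold with the highest MCC (first one on ties)."""
    if candidates is None:
        candidates = np.arange(0.05, 1.0, 0.05)  # assumed search grid
    y_prob = np.asarray(y_prob)
    scores = [binary_metrics(y_true, (y_prob > t).astype(int))["MCC"]
              for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```

A site is labelled "1" only when its predicted probability is strictly greater than the threshold, matching the rule stated above.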

TABLE 1 evaluation index Table


Then, the prediction results on ATP-17 are compared with those of several existing protein-ATP binding site predictors that perform well in the prior art, namely NsitePred, TargetATPsite, TargetS and ATPseq, as shown in the following table:

TABLE 2 comparison of results of the prior predictor and the predictor of the present invention

The results of the prediction of the protein 2YAA sequence are shown in FIG. 5. The invention can predict the binding site more accurately.

FIG. 6 is a block diagram of a fractional order neural net-based protein-ATP binding site prediction device, according to an exemplary embodiment. Referring to fig. 6, the apparatus 300 includes:

the feature extraction module 310 is configured to construct an initial prediction model, obtain a training set based on the PDB protein database, collect features of target residues and adjacent residues of the target residues in the training set by using a sliding window technique, and integrate the features into a feature matrix;

a function modifying module 320, configured to adjust the prediction iterative algorithm for each amino acid type by giving different weights based on the loss function by using the weighted cross entropy as a loss function of the prediction model, to obtain an adjusted prediction iterative algorithm;

the algorithm modifying module 330 is configured to construct a fractional derivative defined based on Caputo, and modify the adjusted prediction iteration algorithm based on the fractional derivative;

a result output module 340, configured to replace a parameter update process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm, so as to construct a new prediction model; and inputting the characteristic matrix into a new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

Optionally, the feature extraction module 310 is further configured to obtain a training set based on the PDB protein database, and determine the size of the sliding window; the sliding window comprises a target residue, and adjacent residues of the target residue are arranged on the left side and the right side of the target residue, respectively;

running PSI-BLAST against the annotated protein sequence database Swiss-Prot using the BLAST local alignment search tool, with the training set as input, to obtain a PSSM matrix of the training set;

performing one-hot encoding on the amino acids in the training set to obtain a one-hot encoding vector for each amino acid; wherein the amino acids are one-hot encoded according to their classification by side-chain dipole and volume;

performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot encoding vector of each amino acid through a sliding window to obtain the target residues in the training set and the features of the target residues, and integrating the features into a feature matrix.
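The PSI-BLAST step can be driven from Python as sketched below. The flags used are standard BLAST+ `psiblast` options; the database name, the number of iterations, the E-value cutoff and the helper names are assumptions of this example, since the text does not specify them.

```python
import subprocess
from pathlib import Path


def psiblast_command(fasta_path, db="swissprot", iterations=3, evalue=0.001):
    """Build a psiblast call that writes an ASCII PSSM next to the FASTA file.

    db, iterations and evalue are assumed defaults, not values from the text.
    """
    fasta_path = Path(fasta_path)
    pssm_path = fasta_path.with_suffix(".pssm")
    cmd = [
        "psiblast",
        "-query", str(fasta_path),
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", str(pssm_path),
    ]
    return cmd, pssm_path


def run_psiblast(fasta_path, **kwargs):
    """Run psiblast; requires BLAST+ and a formatted database to be installed."""
    cmd, pssm_path = psiblast_command(fasta_path, **kwargs)
    subprocess.run(cmd, check=True)
    return pssm_path
```

Building the command separately from running it keeps the sketch testable on machines where BLAST+ is not installed.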

Optionally, the function modifying module 320 is configured to define the cross entropy of the ith sample as shown in the following formula (1):

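$$L_i = -\sum_{p=1}^{P} y_{ip}\,\log \hat{y}_{ip} \tag{1}$$

wherein $P$ is the number of classes, $y_{ip} = 1$ if the $i$-th sample belongs to the $p$-th class and $y_{ip} = 0$ otherwise, and $\hat{y}_{ip}$ is the predicted probability that the $i$-th sample belongs to the $p$-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L_i^{w} = -\sum_{p=1}^{P} w_p\, y_{ip}\,\log \hat{y}_{ip} \tag{2}$$

wherein $w_p$ is the weight of each class and $y_{ip}$ is the one-hot encoded value.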

Optionally, the algorithm modifying module 330 is configured to define the fractional derivative according to the following equation (3):

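$${}^{C}_{t_0}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(m-\alpha)} \int_{t_0}^{t} \frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\, d\tau \tag{3}$$

wherein $f(t)$ is the objective function, $\alpha$ is the order, $m-1 < \alpha < m$, $m$ is a positive integer, $\Gamma(\cdot)$ is the gamma function, and $t_0$ is the initial value.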

Optionally, the algorithm modifying module 330 is configured to modify the iterative algorithm to converge the iterative algorithm to a true extreme point, and includes:

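the fractional order gradient method is shown in the following formula (4):

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_0}D^{\alpha}_{x_K} f(x) \tag{4}$$

where $\mu$ is the iteration step size or learning rate, $K$ is the number of iterations, $x_K$ denotes the iterate at the $K$-th step, and $x_0$ is the initial value;

replacing $x_0$ in formula (4) with $x_{K-1}$ to obtain the modified fractional order gradient method shown in the following formula (5):

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) \tag{5}$$

substituting the above equation (5) into equation (3) and simplifying to obtain the modified iterative algorithm shown in the following equation (6):

$$x_{K+1} = x_K - \mu\,\frac{f^{(1)}(x_{K-1})}{\Gamma(2-\alpha)}\,\left|x_K - x_{K-1}\right|^{1-\alpha} \tag{6}$$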

Optionally, the result output module 340 is configured to replace a parameter update process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a fully connected layer of the convolutional neural network in the new prediction model, where a back propagation gradient of the fully connected layer is a mixture of a fractional order and an integer order; wherein the fully-connected layer comprises two types of gradient-passing layers, the two types of gradient-passing layers comprising: the transfer gradient connecting the nodes between the two layers, and the update gradient.

In the embodiment of the invention, a protein-ATP binding site prediction method is provided by combining a deep learning method with fractional differentiation, and the accuracy is improved. First, the data sets ATP-227 and ATP-17 are selected as the training set and the test set, respectively, and the features required by the model are extracted from the digitized information of the protein and integrated into a feature matrix as input. Then, a convolutional neural network is selected and the parameter updating process of its back propagation is modified into a fractional order gradient iteration; test data show that the prediction performance of the convolutional neural network modified by the fractional order is superior to that of existing machine learning models and integer order deep learning models. The invention is characterized in that the fractional order gradient defined by Caputo is added to the fully connected layer of the single-start predictor, and the performance of the predictor is improved while convergence and the chain rule are preserved.

Fig. 7 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the following steps of the fractional-order neural network-based protein-ATP binding site prediction method:

S2: using the weighted cross entropy as a loss function of a prediction model, and adjusting a prediction iterative algorithm of each amino acid type by giving different weights on the basis of the loss function to obtain an adjusted prediction iterative algorithm;

S3: constructing a fractional order derivative defined based on Caputo, and modifying the adjusted prediction iteration algorithm based on the fractional order derivative;

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions that are executable by a processor in a terminal to perform the fractional order neural network-based protein-ATP binding site prediction method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A fractional order neural network-based protein-ATP binding site prediction method, the method steps comprising:

S2: using the weighted cross entropy as a loss function of the initial prediction model, and based on the loss function, adjusting the prediction iterative algorithm of each amino acid type by giving different weights to obtain an adjusted prediction iterative algorithm;

S4: replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; and inputting the feature matrix into the new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

2. The method of claim 1, wherein in S1, the training set is an unprocessed original protein sequence ATP-227.

3. The method of claim 1, wherein in S1, constructing an initial prediction model, obtaining a training set based on the PDB protein database, collecting features of target residues and adjacent residues of the target residues in the training set by a sliding window technique, and integrating the features into a feature matrix, comprises:

S13: acquiring a protein secondary structure in the training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;

S15: performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot encoding vector of each amino acid through a sliding window to obtain the target residues in the training set and the features of the target residues, and integrating the features into a feature matrix.

4. The method according to claim 3, wherein in S2, the adjusted prediction iterative algorithm is obtained by using weighted cross entropy as a loss function of the initial prediction model and adjusting the prediction iterative algorithm of each amino acid type by assigning different weights based on the loss function, and the method comprises:

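$$L_i = -\sum_{p=1}^{P} y_{ip}\,\log \hat{y}_{ip} \tag{1}$$

wherein $P$ is the number of classes, $y_{ip} = 1$ if the $i$-th sample belongs to the $p$-th class and $y_{ip} = 0$ otherwise, and $\hat{y}_{ip}$ is the predicted probability that the $i$-th sample belongs to the $p$-th class;

the weighted cross entropy is defined as shown in the following equation (2):

$$L_i^{w} = -\sum_{p=1}^{P} w_p\, y_{ip}\,\log \hat{y}_{ip} \tag{2}$$

wherein $w_p$ is the weight of each class and $y_{ip}$ is the one-hot encoded value.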

5. The method according to claim 4, wherein in S3, the fractional derivative defined by Caputo is as the following formula (3):

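$${}^{C}_{t_0}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(m-\alpha)} \int_{t_0}^{t} \frac{f^{(m)}(\tau)}{(t-\tau)^{\alpha-m+1}}\, d\tau \tag{3}$$

wherein $f(t)$ is an objective function, $\alpha$ is the order, $0 < \alpha < 1$, $m-1 < \alpha < m$, $m$ is a positive integer, $\Gamma(\cdot)$ is the gamma function, and $t_0$ is the initial value.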

6. The method of claim 5, wherein modifying the adjusted prediction iteration algorithm based on the fractional order derivative in step S3 comprises:

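the fractional order gradient method is shown in the following formula (4):

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_0}D^{\alpha}_{x_K} f(x) \tag{4}$$

wherein $\mu$ is the iteration step size, $K$ is the number of iterations, $x_K$ denotes the iterate at the $K$-th step, and $x_0$ is the initial value;

replacing $x_0$ in formula (4) with $x_{K-1}$ to obtain the modified fractional order gradient method shown in the following formula (5):

$$x_{K+1} = x_K - \mu\; {}^{C}_{x_{K-1}}D^{\alpha}_{x_K} f(x) \tag{5}$$

substituting formula (5) into formula (3) and simplifying to obtain the modified iterative algorithm shown in the following formula (6):

$$x_{K+1} = x_K - \mu\,\frac{f^{(1)}(x_{K-1})}{\Gamma(2-\alpha)}\,\left|x_K - x_{K-1}\right|^{1-\alpha} \tag{6}$$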

7. The method according to claim 1, wherein in step S4, the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model is replaced with a modified prediction iterative algorithm to construct a new prediction model, which includes:

replacing a parameter updating process of a back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a fully connected layer of the convolutional neural network in the new prediction model, wherein the back propagation gradient of the fully connected layer adopts a mixture of fractional order and integer order; wherein the fully-connected layer comprises two types of gradient-passing layers, the two types of gradient-passing layers comprising: the transfer gradient connecting the nodes between the two layers, and the update gradient.

8. A fractional order neural network-based protein-ATP binding site prediction device, for use in the method of any one of claims 1-7, the device comprising:

the function modification module is used for utilizing the weighted cross entropy as a loss function of the initial prediction model, and adjusting the prediction iterative algorithm of each amino acid type by giving different weights based on the loss function to obtain an adjusted prediction iterative algorithm;

the result output module is used for replacing the parameter updating process of the back propagation process of the convolutional neural network in the initial prediction model with a modified prediction iterative algorithm to construct a new prediction model; inputting the characteristic matrix into the new prediction model, outputting a prediction result, and completing the protein-ATP binding site prediction based on the fractional order neural network.

9. The apparatus of claim 8, wherein the training set is an unprocessed raw protein sequence ATP-227.

10. The apparatus of claim 9, wherein the feature extraction module is further configured to obtain a training set based on a PDB protein database and determine a sliding window size, the sliding window including a target residue, with adjacent residues of the target residue located on the left and right sides of the target residue;

acquiring a protein secondary structure in a training set, and representing the protein secondary structure by a 3-state secondary structure representation method to obtain a protein secondary structure vector;

performing feature extraction on the PSSM matrix, the protein secondary structure vector and the one-hot encoding vector of each amino acid through a sliding window to obtain the target residues in the training set and the features of the target residues, and integrating the features into a feature matrix.