CN116705146A - Multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration - Google Patents

Info

Publication number
CN116705146A
Authority
CN
China
Prior art keywords
enzyme
feature
sequence
layer
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310445726.5A
Other languages
Chinese (zh)
Inventor
邓赵红
于管青
吴敬
未志胜
王蕾
王士同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202310445726.5A
Publication of CN116705146A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the field of intelligent cell biological recognition, and particularly relates to a multi-view enzyme function prediction method taking both molecular structure and sequence mining into consideration. The method comprises 4 stages: initial enzyme feature construction, deep enzyme sequence feature construction, deep enzyme structure feature construction, and model training and prediction based on a TSK fuzzy system. The method treats the structural features and the sequence features of the enzyme as two different views. By constructing a new multi-view deep network, it extracts, crosses and discriminates information from the different modalities and mines the complementary and consistent information among multi-view enzyme data; the multi-view features are then used to train a multi-view TSK fuzzy system model, which produces the final enzyme function prediction. Because the method considers both the sequence features and the structural features of the enzyme, the information used for prediction is more complete, and both kinds of features can be relearned in a new network through the TSK fuzzy system.

Description

Multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration
Technical Field
The invention belongs to the field of intelligent cell biological recognition, and particularly relates to a multi-view enzyme function prediction method taking both molecular structure and sequence mining into consideration.
Background
Enzymes are proteins or RNAs produced by living cells that show a high degree of specificity and catalytic efficiency toward their substrates; they catalyze the biochemical reactions of those substrates. Studying the functional classification of enzymes plays an important role in research on the application of enzymes in production, daily life and disease diagnosis. The Enzyme Commission (EC) numbering system generally divides enzymes into 6 major classes according to their catalytic properties: oxidoreductases (EC 1), transferases (EC 2), hydrolases (EC 3), lyases (EC 4), isomerases (EC 5) and ligases (EC 6). The increasing number of enzyme sequences in databases poses a serious challenge to the functional classification of enzymes. Because a large number of enzymes have unknown functions, and determining enzyme properties by biological experiments is time-consuming and expensive, there is a need for efficient, low-cost techniques for predicting enzyme function.
In recent years, predicting enzyme classes with computational models has attracted increasing attention. Some studies have achieved encouraging results and provide important tools for enzyme function annotation and enzyme-related drug design. With the continuous development of bioinformatics and deep learning, a number of effective feature extraction methods and machine learning methods based on enzyme sequence information have been proposed to predict enzyme classes; most of these methods employ support vector machines (SVMs), random forests (RFs), k-nearest neighbors (KNN) and the like. However, the existing methods still have the following disadvantages:
(1) Most methods use only the sequence information of enzymes. Features are usually extracted with schemes such as one-hot encoding and position-specific scoring matrices, and classification is then performed with traditional machine learning methods such as support vector machines, random forests and KNN. In the enzyme function classification task, these traditional feature extraction and classification methods have limited learning ability, and new feature extraction and classification techniques need to be developed;
(2) Current network models that classify enzymes from their three-dimensional structure information mostly process vectors, matrices and similar representations, and ignore more complex representations such as the graph structure of the enzyme, so the structural information of the enzyme is learned insufficiently;
(3) Although a few methods predict enzyme function by considering both the sequence and the structure information of the enzyme, they cannot fully exploit the commonalities and distinctive characteristics of the data under multiple views, such as sequence and structure, and lack the ability to fuse multi-modal features efficiently. Therefore, developing an enzyme function prediction method that makes full use of multi-modal information, such as the sequence features and structural features of an enzyme, remains a challenging and valuable task.
Disclosure of Invention
Existing enzyme function classification methods are mostly based on sequence features alone or on structure features alone. The invention provides a new method that predicts enzyme function by considering both the structural features and the sequence features of the enzyme, learning deep sequence and structural features, and classifying with a multi-view TSK fuzzy system. The method treats the structural features and the sequence features of the enzyme as two different views, extracts, crosses and discriminates information from the different modalities by constructing a new multi-view deep network, and mines the complementary and consistent information among multi-view enzyme data.
The technical scheme of the invention is as follows:
A multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration comprises 4 stages: initial enzyme feature construction, deep enzyme sequence feature construction, deep enzyme structure feature construction, and model training and prediction based on a TSK fuzzy system, as follows:
First stage: initial feature construction. This stage comprises 2 steps: construction of the initial sequence features of the enzyme and construction of the initial structural features of the enzyme. The specific steps are as follows:
First step: initial sequence feature extraction
In this work, initial features are extracted from the amino acid sequence of the enzyme using the BioVec method. BioVec is a method for biological sequence representation and feature extraction. It treats a sequence as a long sentence and processes it with natural language processing techniques: each biological sequence is embedded in an n-dimensional vector, learned by a neural network, that characterizes the biophysical and biochemical properties of the sequence. The embeddings are trained on a corpus of 3-gram sequences with the Skip-gram neural network model of Word2Vec, which maximizes the probability of observing the context words of each word; the model only needs to be trained once. Finally, the amino acid sequence of each enzyme is represented by BioVec as a matrix of size 3 × 100 for the later deep sequence feature extraction.
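The 3-gram tokenization that precedes the embedding lookup can be sketched as follows. This is a minimal illustration, assuming the ProtVec-style scheme in which a sequence is read in three frames (offsets 0, 1 and 2), which would be consistent with the 3 rows of the 3 × 100 representation; the function name `three_gram_frames` and the toy sequence are hypothetical, and the trained 100-dimensional embedding itself is not shown.

```python
def three_gram_frames(sequence, n=3):
    """Split a sequence into n reading frames of non-overlapping n-grams."""
    frames = []
    for offset in range(n):
        # Walk the sequence in steps of n, starting at this frame's offset,
        # and drop any trailing fragment shorter than n.
        grams = [sequence[i:i + n]
                 for i in range(offset, len(sequence) - n + 1, n)]
        frames.append(grams)
    return frames


# Toy amino acid sequence (hypothetical example input).
frames = three_gram_frames("MKVLAAGIV")
for offset, grams in enumerate(frames):
    print(offset, grams)
```

In such a scheme each frame would be mapped to one 100-dimensional vector, for example by summing the embeddings of its 3-grams; that aggregation step is an assumption and is left out here.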
Second step: initial structural feature extraction
The amino acids in each chain are extracted in order from the PDB file of each enzyme, and the three-dimensional coordinates (x, y, z) of the alpha-carbon (CA) atom are extracted from each amino acid. Each amino acid residue is also represented by a 21-dimensional one-hot code (20 standard amino acids plus one slot for all others), so each amino acid residue is ultimately represented by a 24-dimensional vector. A matrix of dimension (n×24) is thus obtained to represent an enzyme of length n. The default amino acid sequence length n is 1000; if the sequence is shorter than 1000, it is padded with 0. The (n×24) matrix is used as the initial structural feature from which the point cloud network later learns deep structural features.
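The construction of the (n×24) matrix can be sketched as below. This is an illustrative sketch only: the column order (coordinates first, then the one-hot code), the truncation of chains longer than n, and all names (`AMINO_ACIDS`, `residue_row`, `build_structure_matrix`) are assumptions, and parsing of the PDB file is replaced by a hand-written list of residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
MAX_LEN = 1000                        # default sequence length n from the text


def residue_row(letter, xyz):
    """Return one 24-dimensional row: 3 coordinates + 21-dim one-hot code."""
    one_hot = [0.0] * 21
    index = AMINO_ACIDS.find(letter)
    one_hot[index if index >= 0 else 20] = 1.0  # slot 20 = "other"
    return [float(c) for c in xyz] + one_hot


def build_structure_matrix(residues, max_len=MAX_LEN):
    """residues: list of (one_letter_code, (x, y, z)) in chain order."""
    # Assumption: chains longer than max_len are truncated.
    rows = [residue_row(letter, xyz) for letter, xyz in residues[:max_len]]
    # Pad short chains with all-zero rows up to max_len.
    rows += [[0.0] * 24 for _ in range(max_len - len(rows))]
    return rows


# Two toy residues; "X" stands for a non-standard residue.
matrix = build_structure_matrix([("M", (1.0, 2.0, 3.0)), ("X", (4.0, 5.0, 6.0))])
print(len(matrix), len(matrix[0]))
```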
Second stage: deep enzyme sequence feature construction. This stage comprises 4 steps: SMOTE data oversampling, BBA residual block processing, Bio-CS attention block processing, and fully connected block processing.
The specific steps are as follows:
Third step: SMOTE data oversampling.
Because the 6 enzyme classes are unevenly distributed in the PDB data set used, the SMOTE oversampling method is applied to the imbalanced data after it has been represented by BioVec. SMOTE is an oversampling technique that synthesizes minority class samples and is an improvement on random oversampling. Random oversampling adds minority class samples by simply copying existing ones, which tends to cause overfitting: the information learned by the model is too specific and not general enough. SMOTE instead interpolates between neighboring minority class samples. It therefore increases the number of minority class samples by constructing new samples in the neighborhood of existing ones, which helps the classifier generalize [23]. The flow of the SMOTE algorithm is as follows:
(1) For each sample x in the minority class, compute its Euclidean distance to all other samples in the minority class and obtain its k nearest neighbors.
(2) Set a sampling ratio according to the class imbalance ratio to determine the sampling rate N. For each minority class sample x, randomly select several samples from its k nearest neighbors; denote a selected neighbor by x0.
(3) For each randomly selected neighbor x0, construct a new sample x_new from the original sample x according to equation (1):
x_new = x + rand(0,1) × (x0 - x)    (1)
The SMOTE-processed data are then fed into the BBA residual module for subsequent feature extraction.
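The three SMOTE steps and equation (1) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation used in the invention: the function name `smote`, its parameters (`k`, `n_new_per_sample`, `seed`) and the toy data are hypothetical, and the sampling rate is given directly as a number of new samples per minority sample.

```python
import math
import random


def smote(minority, k=5, n_new_per_sample=1, seed=0):
    """Generate synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority neighbors of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 2: draw one neighbor x0 at random per requested new sample.
        for _ in range(n_new_per_sample):
            x0 = rng.choice(neighbors)
            # Step 3: x_new = x + rand(0, 1) * (x0 - x).
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x0)])
    return synthetic


# Toy minority class with four 2-D samples (needs at least two samples).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, k=2, n_new_per_sample=2)
print(len(new_samples))
```

Every synthetic sample lies on the line segment between a minority sample and one of its neighbors, which is why SMOTE adds variety without leaving the region occupied by the minority class.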
Fourth step: extracting deep sequence features with the BBCNet neural network.
4.1 BBA residual module processing
The BBA residual module mainly consists of two bidirectional long short-term memory (Bi-LSTM) layers and an additional attention layer implemented with the Keras self-attention package. The BBA residual module can be regarded as an improved version of the conventional BBA module with a residual mechanism introduced. The design idea of the Bi-LSTM in the BBA residual module is that the features obtained at any time step carry information from both the past and the future. The model is divided into 2 independent LSTMs; the input sequence is fed to the 2 LSTM networks in forward and reverse order for feature extraction, and the 2 output vectors (that is, the extracted feature vectors) are concatenated to form the final feature representation of each word. Layers are usually stacked sequentially, but as the number of network layers increases, the training loss gradually decreases and tends to saturate, and when the network depth is increased further, the training loss increases instead. Therefore, to avoid this network degradation and to make better use of local features, the outputs of the two Bi-LSTM layers are first concatenated to fuse past and future information, and the fused features are then fused with the features obtained after self-attention. This skip connection avoids the loss of local features, and the two fusions together form the BBA residual module. The module contains two Bi-LSTM layers, in which the output of each direction of each layer has 128 nodes with the hyperbolic tangent activation function; the third layer is the sequence self-attention layer, which also uses the hyperbolic tangent activation function.
4.2 Bio-CS attention module processing.
To capture the relations among features of different dimensions, a Bio-CS attention module is provided so that the model can automatically learn the importance of the features in different channels.
The structure of the Bio-CS attention module is shown in FIG. 2 (b). The Bio-CS module first applies global average pooling to the feature map obtained by convolution to obtain channel-level global features; it then activates the global features and learns the relations among the channels to obtain the weight of each channel; finally it multiplies the weights by the original feature map to obtain the final features. In essence, the Bio-CS module is an attention mechanism over the channel dimension that lets the model focus on the most informative channel features while suppressing unimportant ones. In this work, a channel Bio-CS block and a spatial Bio-CS block are used to obtain new features along the channel and spatial directions, and the two resulting features are combined to obtain feature information that fuses both directions.
(1) Channel Bio-CS Block
The sequence features $U \in \mathbb{R}^{D \times C}$ produced by the BBA residual module are taken as input, where D is the spatial dimension and C is the channel dimension. A squeeze transformation $F_s$, implemented by global average pooling, yields a statistic z that compresses the global spatial information into a channel descriptor. The c-th element of z is calculated as follows:
$$z_c = F_s(u_c) = \frac{1}{D} \sum_{i=1}^{D} u_c(i)$$
In the channel block, $z_{avg}$ is obtained through global average pooling, which can be expressed as:
$$C_{avg}: \; z_{avg} = \mathrm{AvgPool}(U)$$
Then, to capture the dependencies among the channel features, two fully connected layers are used for nonlinear parameterization. The two fully connected layers form a bottleneck structure, which reduces model complexity and improves generalization. The first fully connected layer uses a ReLU activation function and the second uses a sigmoid activation function:
$$s = F_{ex}(z, W) = \delta(g(z, W)) = \delta(W_2 \, \sigma(W_1 z))$$
where $\sigma$ denotes the ReLU activation function, $\delta$ denotes the sigmoid activation function, s denotes the channel weights, $W_1 \in \mathbb{R}^{C \times C}$ and $W_2 \in \mathbb{R}^{C \times C}$. To prevent overfitting, a dropout layer is added after each of the two fully connected layers. Experiments show that the two dropout layers improve the accuracy of the network and enhance the stability of the model. The new features are obtained as the product of the channel weights s and the initial input u:
$$F_{scale}(u_c, s_c) = s_c \cdot u_c$$
where $F_{scale}(u_c, s_c)$ denotes the channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{C}$. The excitation operator thus maps an input-specific descriptor to a set of channel weights, which can be seen as a self-attention function in the channel direction.
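The channel block follows the squeeze, excitation and scaling pattern, which can be sketched with plain lists as below. This is a hedged illustration rather than the network described here: the weights are random instead of trained, the dropout layers are omitted because they only act during training, and the names `matvec` and `channel_attention` are hypothetical.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_attention(U, W1, W2):
    """U: D x C feature map given as D rows with C channel values each."""
    D, C = len(U), len(U[0])
    # Squeeze: global average pooling over the spatial dimension D.
    z = [sum(U[i][c] for i in range(D)) / D for c in range(C)]
    # Excitation: s = sigmoid(W2 * relu(W1 * z)).
    hidden = [max(0.0, h) for h in matvec(W1, z)]
    s = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W2, hidden)]
    # Scale: multiply every channel c of U by its weight s_c.
    scaled = [[U[i][c] * s[c] for c in range(C)] for i in range(D)]
    return scaled, s


rng = random.Random(0)
D, C = 4, 3
U = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(D)]
W1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C)]  # C x C, as in the text
W2 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C)]
scaled, s = channel_attention(U, W1, W2)
print(s)
```

Because the sigmoid keeps every weight strictly between 0 and 1, the block can only attenuate channels relative to one another; it never changes the shape of the feature map.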
(2) Spatial Bio-CS Block
As in the channel direction, the spatial block takes the sequence features $U \in \mathbb{R}^{D \times C}$ produced by the BBA residual module as input. A one-dimensional convolution adds one layer to the network depth and introduces a spatial weight scalar s without changing the size of the feature map. The final output is:
(3) Merging attention modules
The channel-direction features and the spatial-direction features are added, and a skip connection is then applied to obtain X, which ensures that features are reused and combines the channel and spatial features more effectively:
4.3 Fully connected module processing.
The feature X produced by the Bio-CS attention module is processed by global average pooling and the fully connected module (using the Softmax activation function), and the result is taken as the final deep sequence feature.
Third stage: deep enzyme structural feature construction.
Fifth step: extracting deep structural features with the PointNet++ point cloud network.
The CA (alpha-carbon) atom of each amino acid extracted from the PDB file is taken as a point in the point cloud; the three-dimensional coordinates (x, y, z) of the CA atom are the coordinate features of the point, and the one-hot code of the amino acid residue to which the CA atom belongs is the sequence feature of the point. Thus, the inputs to the PointNet++ network are a point set of size N × 3 and a sequence feature matrix of size N × 21, where N is the number of points in the point cloud. PointNet++ consists of three parts: the Sampling & Grouping & PointNet hierarchy, the fully connected layers and the Softmax layer. As shown in FIG. 3, the Sampling & Grouping & PointNet hierarchy is composed of several set abstraction units. The input of each set abstraction level is an N × (d+C) matrix and the output is an N' × (d+C') matrix, where N is the number of input points, d=3 is the coordinate dimension of the points, C is the feature dimension of the input points, N' is the number of output points, and C' is the feature dimension of the output points. Each set abstraction unit comprises 3 parts: a Sampling layer, a Grouping layer and a PointNet layer. The Sampling layer samples the input points and selects several center points from them; the Grouping layer uses the center points obtained by the Sampling layer to divide the point set into several regions; the PointNet layer encodes each region to obtain a new feature vector. The specific steps of this stage are as follows:
5.1 Sampling layer.
The points are sampled using Farthest Point Sampling (FPS), which selects N' points out of N points and covers the entire point set better than random sampling. The FPS algorithm is as follows: first, a point $x_0$ is randomly selected from a point set S of N points; then, using the distance formula, the point $x_1$ farthest from $x_0$ is selected; next, among the remaining points (excluding $x_0$ and $x_1$), the point $x_2$ farthest from $x_1$ is selected; and so on until N' sampling points are found.
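A compact sketch of farthest point sampling is given below. Note the assumption: it follows the usual formulation of FPS, in which "farthest" is measured as the distance to the nearest point already selected, rather than only to the most recently selected point; the function name `farthest_point_sampling` and the toy coordinates are hypothetical.

```python
import math
import random


def farthest_point_sampling(points, n_samples, seed=0):
    """Return indices of n_samples points chosen by farthest point sampling."""
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]  # random starting point x0
    # min_dist[i] = distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < n_samples:
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy point cloud: four points near the origin and one distant outlier.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (0, 0, 1)]
chosen = farthest_point_sampling(points, n_samples=3)
print(chosen)
```

Tracking the distance to the whole selected set prevents the sampler from bouncing back and forth between two extreme points, which is what gives FPS its even coverage.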
5.2 Grouping layer.
The input to this layer is a point set S of size N × (d+C) and a coordinate matrix of the N' sampling points of size N' × 3, where N is the number of points, d=3 is the coordinate dimension and C is the feature dimension. The layer uses the ball query method: each of the N' sampling points extracted by the Sampling layer is taken as a centroid among the N input points, and K points are found within a sphere of radius R around it. These K points form a local region, so N' local regions are generated in total. The output of this layer is the coordinates of the N' sampling points and a feature matrix of size N' × K × (d+C).
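The ball query can be sketched as follows. This is an illustration under assumptions: the source does not say how regions with fewer than K points inside the sphere are handled, so the sketch pads them by repeating the first hit; the function name `ball_query` and the toy values are hypothetical.

```python
import math


def ball_query(points, centroids, radius, k):
    """Group up to k point indices within `radius` of each centroid."""
    groups = []
    for center in centroids:
        inside = [i for i, p in enumerate(points)
                  if math.dist(p, center) <= radius]
        # Keep at most k points per region.
        group = inside[:k]
        # Assumption: pad short groups by repeating the first hit so that
        # every region has exactly k entries.
        if group:
            group += [group[0]] * (k - len(group))
        groups.append(group)
    return groups


points = [(0, 0, 0), (0.5, 0, 0), (5, 5, 5), (5.2, 5, 5), (9, 9, 9)]
centroids = [points[0], points[2]]  # e.g. the output of the Sampling layer
groups = ball_query(points, centroids, radius=1.0, k=3)
print(groups)
```

Since each centroid is itself one of the input points, every region contains at least its own center, so the padding step always has a point to repeat.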
5.3 PointNet layer.
The input to this layer is N' × K × (d+C) and the output is N' × (d+C'), where C' is the new feature dimension. First, before being input to the network, the coordinates of the points in each region are converted to coordinates relative to the center point of the region, which captures point-to-point relations better. The unordered point set is then encoded by a multi-layer perceptron (MLP) network.
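The two operations of this layer can be sketched as below. The sketch makes two assumptions that the text does not spell out: the per-point encodings are aggregated by element-wise max pooling, which is the symmetric function PointNet conventionally uses for unordered sets, and the MLP is replaced by a fixed toy function. All names (`to_relative`, `encode_region`, `toy_mlp`) are hypothetical.

```python
def to_relative(group_xyz, center_xyz):
    """Shift each point so that the region's center point is the origin."""
    return [[p - c for p, c in zip(point, center_xyz)] for point in group_xyz]


def encode_region(point_features, mlp):
    """Apply a shared per-point function, then max-pool over the K points."""
    encoded = [mlp(f) for f in point_features]
    # Element-wise max makes the result independent of point order.
    return [max(column) for column in zip(*encoded)]


def toy_mlp(feature):
    # Hypothetical stand-in for a trained MLP: a fixed ReLU projection.
    x, y, z = feature
    return [max(0.0, x + y), max(0.0, y - z), max(0.0, z)]


# One local region of K = 3 points, centered on its first point.
region = [(1.0, 2.0, 3.0), (2.0, 2.0, 2.0), (0.0, 1.0, 5.0)]
relative = to_relative(region, center_xyz=(1.0, 2.0, 3.0))
region_feature = encode_region(relative, toy_mlp)
print(region_feature)
```

Applying the same function to every point and pooling with a symmetric operation is what lets the layer encode a set whose points have no meaningful order.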
5.4 Final structural feature extraction.
After the Sampling & Grouping & PointNet layers, the PointNet++ network obtains the feature $F \in \mathbb{R}^{N' \times (d+C')}$. Feature F is then integrated into a new vector F' by two fully connected layers with the ReLU activation function, and finally the output of the Softmax layer is used as the deep structural feature of the enzyme. Through end-to-end learning, the PointNet++ network can extract features from enzyme structure data effectively.
Fourth stage: model training and prediction based on TSK fuzzy system.
The foregoing steps yield deep features based on enzyme sequence data and deep features based on enzyme structure data. How to use these multi-view features to predict enzyme function effectively is also a challenging task. Conventional approaches typically use simple late fusion, for example concatenating the features of the different views. Although such an approach is workable, it is difficult for it to achieve effective cooperation among the features of different views. To solve this problem, a rule-based multi-view fuzzy system classifier is introduced to learn fully from the obtained multi-view deep features and so achieve more effective classification.
Compared with a single-view classifier, a multi-view classifier can mine more of the commonalities and distinctive characteristics among the features of different views and obtain better predictions. Multi-view classification techniques are now widely studied, and researchers have proposed a variety of efficient algorithms. Among them, fuzzy sets and fuzzy logic systems are increasingly applied to multi-view classification. Multi-view fuzzy system classification is a distinctive multi-view classification approach: it learns effectively from multi-view data and is transparent and easy to interpret, and it has therefore shown advantages in various modeling tasks. For example, TSK-FS-CVH, a representative multi-view fuzzy system classification method, has been applied effectively to circRNA binding protein site prediction. Here, another multi-view fuzzy classifier, MV-TSK-FS, is introduced to build a classifier on the multi-view deep features of enzymes. MV-TSK-FS was developed from the classical TSK-FS. In addition to the interpretability and data-driven learning ability of TSK-FS, MV-TSK-FS has efficient multi-view collaborative learning ability and can fully exploit the consistency and complementarity among the multi-view features of enzymes. On the one hand, MV-TSK-FS can make full use of the differences between views and exploit the complementary information in the data of different views more comprehensively, which strengthens the generalization ability of the model; on the other hand, MV-TSK-FS can also make full use of the consistency between views to guide and constrain the learning of the classification model. The MV-TSK-FS method is therefore well suited to learning from multi-view enzyme data to classify enzyme function.
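For orientation, the inference step of a classical single-view, first-order TSK fuzzy system can be sketched as below. This is not MV-TSK-FS and not the trained model of the invention: Gaussian membership functions, the rule parameters and the name `tsk_predict` are all assumptions chosen for illustration, and in a classifier one such output would typically be computed per class.

```python
import math


def tsk_predict(x, rules):
    """First-order TSK output: firing-strength-weighted mean of linear models.

    Each rule is (centers, widths, coeffs), with coeffs = [p0, p1, ..., pd].
    """
    strengths, outputs = [], []
    for centers, widths, coeffs in rules:
        # IF part: product of Gaussian memberships across input dimensions.
        mu = math.prod(math.exp(-((xi - c) ** 2) / (2.0 * w ** 2))
                       for xi, c, w in zip(x, centers, widths))
        # THEN part: linear function of the input.
        f = coeffs[0] + sum(p * xi for p, xi in zip(coeffs[1:], x))
        strengths.append(mu)
        outputs.append(f)
    total = sum(strengths)
    return sum(mu * f for mu, f in zip(strengths, outputs)) / total


# Two hypothetical rules: one centered at (0, 0) voting +1, one at (4, 4) voting -1.
rules = [
    ([0.0, 0.0], [1.0, 1.0], [1.0, 0.0, 0.0]),
    ([4.0, 4.0], [1.0, 1.0], [-1.0, 0.0, 0.0]),
]
print(tsk_predict([0.0, 0.0], rules), tsk_predict([4.0, 4.0], rules))
```

Each rule reads as "IF the input is near this center THEN use this linear model", which is the source of the transparency and interpretability attributed to TSK fuzzy systems.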
The specific steps of this stage are as follows:
Sixth step: perform 5-fold cross-validation with the TSK fuzzy system, training on the deep sequence features and the deep structural features separately to learn the independent information of each view.
Seventh step: perform 5-fold cross-validation with the multi-view TSK fuzzy system, retraining on the deep sequence feature $F_{s3}$ and the deep structural feature $F_{t2}$; adjust the importance of the different views using information entropy according to the previously learned information, and classify the test samples.
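The 5-fold protocol used in both steps can be sketched as an index split. This is a generic sketch, assuming a shuffled, unstratified split; the function name `k_fold_indices`, the seed and the sample count are hypothetical, and the entropy-based view weighting is not shown because the text does not give its formula.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample is used for testing exactly once, and the reported metric is the average over the k test folds.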
The beneficial effects of the invention are as follows:
(1) Most existing methods use only the sequence features of enzymes or only the structural features of enzymes. The present method considers both the sequence features and the structural features of the enzyme, so the information used for prediction is more complete.
(2) In the feature extraction process, existing methods have difficulty taking enzyme features from two different views into account, whereas in the present method the sequence and structural features of the enzyme can be relearned in a new network through the TSK fuzzy system.
Drawings
FIG. 1 is a block diagram of an algorithmic approach of the present invention;
FIG. 2 (a) is a graph of accuracy results of a model for five-fold cross-validation on a PDB database;
FIG. 2 (b) is a graph of the precision results of a model for five-fold cross-validation on a PDB database;
FIG. 2 (c) is a graph of recall results for a model with five-fold cross-validation on a PDB database;
FIG. 2 (d) is a graph of F1 score results for a model with five-fold cross-validation on a PDB database;
FIG. 3 (a) is a BBCNet neural network;
FIG. 3 (b) NBBA is a sequential network with Bio-CS attention module, without BBA residual module;
FIG. 3 (c) NBCS is a network with BBA residual modules, without Bio-CS attention modules;
FIG. 3 (d) NBBA-NBCS is a network with neither BBA residual module nor Bio-CS attention module.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
As shown in FIG. 1, the invention implements a model that predicts enzyme function by combining the sequence features and structural features of enzymes. First, the model constructs the initial enzyme features from the PDB file. Second, the model uses the BBCNet neural network to extract deep features from the sequence information of the enzyme. Next, deep features are extracted from the structural information of the enzyme by the PointNet++ network. Finally, the deep features are trained and classified with a TSK classification system based on fuzzy rules.
Example 1
Performance evaluation of the methods herein was performed using a 5-fold crossover experiment, ultimately producing the final results in an average manner. To evaluate the performance of the method, the proposed method is compared with several representative methods already available. Wherein, ABLE is a method for classifying enzymes by using only sequence information of enzymes and a attention-based bidirectional LSTM model proposed in 2020; enzyNet is a method for classifying enzymes by using a three-dimensional convolution network by using only the structural information of the enzymes, which is proposed in 2017; deep fri is sequence information and structure information of bound enzymes proposed in 2019, and uses a graph rolling network to classify enzymes by using a contact graph of the enzymes. The various indices are shown in table 1.
It can be seen that Accuracy, precision, recall and F1-score for the method reached 0.9161,0.9387,0.8544,0.8946, respectively, which are the best of the four methods, respectively, as can be seen. This is due to: 1) Compared with an ABLE method which only utilizes sequence information and an EnzyNet method which only utilizes structure information, the multi-view method combines the sequence and the structure characteristics of the enzyme, and the information of the enzyme can be utilized more comprehensively, so that a better prediction result is obtained. 2) Compared with EnzyNet method which only uses the structure information, the method extracts the sequence information and simultaneously digs the structure information more fully, thereby obtaining better performance. 3) Compared with the deep fri method, although both methods consider sequence and structural information, deeper excavation is performed on two types of information, and in particular, the adopted multi-view classification technology further improves the method performance.
Table 1 Performance comparison of different prediction methods in predicting the six classes of enzyme function
Example 2
The proposed method is compared experimentally with a version using only sequence information (view 1) and a version using only structural information (view 2). Specifically, BBCNet serves as the sequence-based method for view 1, and PointNet++ serves as the structure-based method for view 2. The experiments were performed on the same data set and in the same experimental environment using 5-fold cross-validation, and the experimental results are shown in Table 2 and FIG. 2.
Table 2 Experimental results for evaluating the effectiveness of the multi-view learning mechanism
The average values of the 5-fold cross-validation experiments obtained by the three methods are shown in Table 2 and FIG. 2, from which it can be seen that the performance of the method that comprehensively uses the sequence and structural information is clearly improved compared with the methods using only the sequence information (view 1) or only the structural information (view 2). This also shows that the multi-view learning mechanism employed by the method is effective.
From FIG. 2, it can be seen that in the 5-fold cross-validation results for accuracy, precision, F1 score and recall, the gray bars (the proposed method) are significantly higher than the green (BBCNet) and red (PointNet++) bars.
The ablation experiment shows that, by mining the complementary and consistent information among views, multi-view learning is better suited to complex data scenarios than single-view learning.
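One common way of using information entropy to balance views can be sketched as follows. It is offered only as an illustration: the source states that view importance is adjusted using information entropy but does not give the formula, so the objective below, the parameter `lam`, the function names and the toy numbers are all assumptions. Minimizing Σ_k w_k·J_k + λ·Σ_k w_k·ln w_k subject to Σ_k w_k = 1, where J_k is the training loss of view k, yields the closed form w_k = exp(−J_k/λ) / Σ_j exp(−J_j/λ).

```python
import math


def entropy_regularized_view_weights(view_losses, lam=1.0):
    """Closed-form view weights w_k proportional to exp(-J_k / lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    shift = min(view_losses)  # subtracting the minimum keeps exp() stable
    scores = [math.exp(-(loss - shift) / lam) for loss in view_losses]
    total = sum(scores)
    return [score / total for score in scores]


def fuse_view_outputs(view_probs, weights):
    """Weighted sum of the per-view class probability vectors."""
    n_classes = len(view_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, view_probs))
            for c in range(n_classes)]


# Toy example: view 1 (sequence) has a lower loss than view 2 (structure).
weights = entropy_regularized_view_weights([0.2, 0.5], lam=0.5)
fused = fuse_view_outputs([[0.7, 0.3], [0.4, 0.6]], weights)
```

In this sketch a larger `lam` pushes the weights toward equal importance, while a smaller `lam` concentrates the weight on the view with the lowest loss.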
Example 3
Ablation experiments were used to verify the effectiveness of the BBA residual module and the Bio-CS attention module. FIG. 3 shows several network structures designed for the ablation experiments on the enzyme sequence deep feature extraction network.
From the experimental results in Table 3, it can be seen that the NBBA network (without the BBA residual module) and the NBCS network (without the Bio-CS attention module) each achieve better accuracy, precision, F1 score and recall than the NBBA-NBCS network, which has neither module. This shows that the BBA residual module and the Bio-CS attention module can each effectively improve the performance of the network. Further, the proposed BBCNet network, which has both the BBA residual module and the Bio-CS attention module, gave the best results. The experiment therefore shows that combining the BBA residual module and the Bio-CS attention module greatly improves the effectiveness of the constructed deep sequence feature extraction network.
Table 3 Results of the ablation experiments on the BBA residual module and the Bio-CS attention module

Claims (8)

1. A multi-view enzyme function prediction method taking the molecular structure and sequence mining into consideration comprises the following steps:
First step: initial features are extracted from the amino acid sequences of the enzymes using the BioVec biological sequence processing method, and the amino acid sequence of each enzyme is expressed as a vector that serves as the initial sequence feature F_s1 of the enzyme;
Second step: the amino acid sequence is extracted in order from the PDB file of each enzyme and one-hot encoded, the three-dimensional coordinates (x, y, z) of the carbon atoms of the amino acids are extracted, and the resulting matrix of dimension (n × 24) is used as the initial structural feature F_t1 of the enzyme;
Third step: the initial sequence feature F_s1 of the enzyme is oversampled using SMOTE to obtain the feature F_s2;
Fourth step: based on the feature F_s2, the deep sequence feature F_s3 is extracted by the BBCNet neural network, which comprises four modules: a BBA residual module, a Bio-CS attention module and a fully connected module;
Fifth step: from the initial structural feature F_t1 of the enzyme, the deep structural feature F_t2 is extracted using the PointNet++ point cloud network, which comprises a Sampling layer, a Grouping layer, a PointNet layer and a final structural feature extraction module;
in the Sampling layer, farthest point sampling (FPS) is used to select N′ points from the N input points; compared with random sampling, this covers the whole point set better; the FPS algorithm is as follows: first, a point x_0 is randomly selected from a point set S having N points; then, the point x_1 farthest from x_0 is selected using a distance formula; next, among the remaining points excluding x_0 and x_1, the point x_2 farthest from x_1 is found; and so on until N′ sampling points are found;
in the Grouping layer, the ball query method is used: among the N input points, the N′ sampling points extracted by the Sampling layer are taken as centroids, K points are found within a sphere of radius R around each centroid, these K points form a local region, and N′ local regions are finally generated; the output of the layer comprises the coordinates of the N′ sampling points and a feature matrix of size N′ × K × (d + C);
in the PointNet layer, before a region is input to the network, the coordinates of the points in the region are converted into coordinates relative to the central point, so that the relations between the points can be captured better; the unordered point set is then encoded by a multi-layer perceptron (MLP) network;
Sixth step: 5-fold cross-validation is performed using the TSK fuzzy system, the deep sequence feature F_s3 and the deep structural feature F_t2 are trained separately, and the independent information of each view is learned;
Seventh step: 5-fold cross-validation is performed using the multi-view TSK fuzzy system, the deep sequence feature F_s3 and the deep structural feature F_t2 are retrained, the importance of the different views is adjusted using information entropy according to the previously learned information, and the samples are classified and tested.
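The Sampling and Grouping layers of the fifth step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the implementation of the invention: the sketch uses the common FPS variant that picks the point whose distance to the nearest already-selected point is largest, whereas the claim words the step relative to the previously chosen point; the ball query returns at most K indices without padding; and all function names and toy points are hypothetical.

```python
import math
import random


def farthest_point_sampling(points, n_samples, seed=0):
    """Return indices of n_samples points chosen by farthest point sampling."""
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    first = random.Random(seed).randrange(len(points))
    selected = [first]
    # min_dist[i]: distance from point i to its nearest selected point;
    # selected points are marked with -1 so they are never chosen again.
    min_dist = [math.dist(p, points[first]) for p in points]
    min_dist[first] = -1.0
    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return selected


def ball_query(points, centroid_indices, radius, k):
    """For each centroid, return up to k indices of points within radius."""
    groups = []
    for c in centroid_indices:
        inside = [i for i, p in enumerate(points)
                  if math.dist(p, points[c]) <= radius]
        groups.append(inside[:k])
    return groups


def to_relative(points, centroid_index, group):
    """Express the points of one local region relative to its centroid."""
    cx, cy, cz = points[centroid_index]
    return [(points[i][0] - cx, points[i][1] - cy, points[i][2] - cz)
            for i in group]


# Toy point set with two well-separated clusters along the x axis.
toy_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
              (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
centroids = farthest_point_sampling(toy_points, n_samples=2)
regions = ball_query(toy_points, centroids, radius=1.5, k=8)
```

With two samples on this toy set, the sketch picks one centroid from each cluster whatever the random starting point, which is the coverage property that motivates FPS over random sampling.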
2. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 1, wherein: in the fourth step, the BBA residual module comprises 2 Bi-LSTM layers and 1 self-attention layer; to make better use of local features, the two Bi-LSTM layers are first joined by a concatenation operation that fuses past and future information, and the fused features are then fused with the features obtained after self-attention, so that this skip connection avoids the loss of local features; these two fusions together form the proposed BBA residual module; the unidirectional output of each Bi-LSTM layer comprises 128 nodes with a hyperbolic tangent activation function; the self-attention layer also uses a hyperbolic tangent activation function.
3. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 1 or 2, wherein: in the Bio-CS attention module of the fourth step, u_c is the initial embedding matrix, denoted U ∈ R^(D×C); the channel Bio-CS module performs self-attention feature extraction on the channel-level features to obtain the channel-direction features; the spatial Bio-CS module performs self-attention feature extraction on the channel-level features to obtain the spatial-direction features; finally, the channel-direction features and the spatial-direction features are added and a skip connection is applied to obtain X, which ensures the reusability of the features and combines the channel-direction and spatial-direction features more effectively.
4. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 1 or 2, wherein: in the fully connected module of the fourth step, the feature X output by the Bio-CS attention module is processed by global average pooling and the fully connected module (with a Softmax activation function) and is then used as the final deep sequence feature.
5. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 3, wherein: in the fully connected module of the fourth step, the feature X output by the Bio-CS attention module is processed by global average pooling and the fully connected module (with a Softmax activation function) and is then used as the final deep sequence feature.
6. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 1, 2 or 5, wherein: in the final structural feature extraction of the fifth step, after the Sampling, Grouping and PointNet layers of the PointNet++ network yield the feature F ∈ R^(N′×(d+C′)), the feature F is integrated into a new vector F′ using two fully connected layers with the ReLU activation function, and finally the output of the Softmax layer is used as the deep structural feature of the enzyme.
7. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 3, wherein: in the final structural feature extraction of the fifth step, after the Sampling, Grouping and PointNet layers of the PointNet++ network yield the feature F ∈ R^(N′×(d+C′)), the feature F is integrated into a new vector F′ using two fully connected layers with the ReLU activation function, and finally the output of the Softmax layer is used as the deep structural feature of the enzyme.
8. The multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration according to claim 4, wherein: in the final structural feature extraction of the fifth step, after the Sampling, Grouping and PointNet layers of the PointNet++ network yield the feature F ∈ R^(N′×(d+C′)), the feature F is integrated into a new vector F′ using two fully connected layers with the ReLU activation function, and finally the output of the Softmax layer is used as the deep structural feature of the enzyme.
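The final structural feature extraction recited in claims 6 to 8 can be illustrated with the following minimal sketch. It is not the implementation of the invention: flattening the N′ × (d + C′) feature before the first fully connected layer, treating the Softmax layer as a further linear projection followed by softmax, the function names and the toy weights are all assumptions made for illustration.

```python
import math


def dense(vector, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def softmax(vector):
    peak = max(vector)  # subtracting the maximum keeps exp() stable
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def structure_head(feature_matrix, layers):
    """Flatten F, apply the ReLU layers, then the final softmax layer.

    layers is a list of (weights, biases); every entry but the last is a
    ReLU layer and the last one feeds the softmax.
    """
    hidden = [v for row in feature_matrix for v in row]  # assumed flattening
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))


# Toy 2 x 2 feature and hand-picked weights (hypothetical values).
toy_feature_matrix = [[1.0, 2.0], [0.0, -1.0]]
toy_layers = [
    ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]),  # FC + ReLU
    ([[1.0, -1.0], [1.0, 1.0]], [0.0, 0.0]),                      # FC + ReLU
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),                       # softmax layer
]
toy_probs = structure_head(toy_feature_matrix, toy_layers)
```

In a trained network the weights would be learned and the output dimension would match the number of enzyme classes; the toy values here only make the data flow easy to follow.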

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310445726.5A CN116705146A (en) 2023-04-24 2023-04-24 Multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration

Publications (1)

Publication Number Publication Date
CN116705146A true CN116705146A (en) 2023-09-05

Family

ID=87824626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310445726.5A Pending CN116705146A (en) 2023-04-24 2023-04-24 Multi-view enzyme function prediction method taking molecular structure and sequence mining into consideration

Country Status (1)

Country Link
CN (1) CN116705146A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913383A (en) * 2023-09-13 2023-10-20 鲁东大学 T cell receptor sequence classification method based on multiple modes
CN116913383B (en) * 2023-09-13 2023-11-28 鲁东大学 T cell receptor sequence classification method based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination