CN118098372A - Virulence factor identification method and system based on self-attention coding and pooling mechanism - Google Patents

Virulence factor identification method and system based on self-attention coding and pooling mechanism

Info

Publication number
CN118098372A
Authority
CN
China
Prior art keywords
self-attention
amino acid sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410489566.9A
Other languages
Chinese (zh)
Other versions
CN118098372B (en)
Inventor
Li Guanghui
Bai Peihao
Chen Jiao
Zhang Liming
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202410489566.9A priority Critical patent/CN118098372B/en
Publication of CN118098372A publication Critical patent/CN118098372A/en
Application granted granted Critical
Publication of CN118098372B publication Critical patent/CN118098372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a virulence factor identification method and system based on a self-attention coding and pooling mechanism. In this technical scheme, the initial features of the amino acid sequence are extracted by a pre-training language model, which overcomes the reliance of traditional machine learning methods on manual features and their limited expressive capacity; the encoder of a Transformer model based on the self-attention mechanism models the dependency relationships within long sequences, effectively improving the accuracy of virulence factor identification.

Description

Virulence factor identification method and system based on self-attention coding and pooling mechanism
Technical Field
The invention belongs to the technical field of deep learning and structural biology, and particularly relates to a virulence factor identification method and system based on a self-attention coding and pooling mechanism.
Background
The pathogenicity of a bacterium is determined by the virulence factors it encodes. As bacterial drug resistance continues to increase, researchers have turned to the virulence factors of pathogenic bacteria to treat infectious diseases caused by drug-resistant strains. Virulence factors can serve as potential drug targets specifically for treating bacterial infection: by designing specific anti-virulence drugs, the virulence of pathogenic bacteria can be suppressed without killing the bacteria or inhibiting their growth, thereby avoiding the high evolutionary pressure that gives rise to drug resistance. Meanwhile, with the rapid progress of whole-genome sequencing technology, a large amount of pathogenic bacteria genome data has accumulated, so making full use of strain sequencing information to identify virulence factors helps save substantial biological experiment cost and improves the development efficiency of anti-virulence drugs. In summary, identifying and studying the virulence factors of pathogenic bacteria not only clarifies their pathogenic mechanisms but also provides ideal candidate targets for anti-virulence strategies for treating pathogenic bacterial infection.
Existing methods for identifying virulence factors fall mainly into two types. The first is based on sequence similarity comparison: a candidate pathogenic bacterium gene sequence is compared with known virulence factor sequences using tools such as BLAST, Diamond or Bowtie, and the resulting similarity is used to judge virulence. The second is based on machine learning: sequence features are acquired in advance and, after feature processing, a machine learning algorithm predicts virulence factors. Sequence alignment-based methods determine genes of potential virulence factors through whole-genome association studies or homology searches against virulence factor databases. However, with the emergence of antibiotic resistance, the features and mechanisms of virulence factors have evolved continuously; algorithms based on sequence similarity alignment can identify only conserved virulence factors and struggle to identify novel virulence factors that are evolutionarily distant from known virulence proteins. For this reason, researchers have proposed recognition methods based on machine learning. To extract virulence factor sequence features effectively, machine learning algorithms typically combine a variety of predefined sequence features, such as frequency components, physicochemical properties, protein functional domains and position-specific scoring matrices (Position Specific Scoring Matrix, PSSM). However, machine learning-based identification methods rely on manually extracted features and have limited ability to express the high-level features of virulence factors. In recent years, researchers have therefore proposed identifying virulence factors with deep learning models, mainly using convolutional neural networks and recurrent neural networks for feature learning of virulence factor sequences; yet convolutional neural networks have difficulty modeling long sequences, and recurrent neural networks suffer from the long-range dependency problem. Considering that the self-attention-based Transformer model can model the dependency relationships within long sequences well, it is necessary to design a model based on the self-attention mechanism for feature learning of virulence factor sequences, generating features with virulence-discriminating capability and providing a method for research on pathogenic bacteria virulence factor identification.
Disclosure of Invention
The technical scheme of the invention provides a virulence factor identification method and system based on a self-attention coding and pooling mechanism. It uses the initial amino acid sequence features from the protein pre-training language model ESM-2, which overcomes the reliance of traditional machine learning methods on manual features and their limited expressive capacity; it introduces the encoder of a Transformer model based on the self-attention mechanism to model the dependency relationships within long sequences and generate features with virulence-discriminating capability; and it adopts a self-attention pooling layer to adaptively aggregate the resulting amino acid feature vector sequence into a sequence-level feature representation, thereby greatly improving the accuracy of virulence factor identification.
The invention provides the following technical scheme:
in one aspect, a method for identifying virulence factors based on self-attention coding and pooling mechanisms includes the steps of:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with a self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
Step S4: inputting the sequence features into a multi-layer perceptron to identify whether the sequence is a virulence factor;
before identification, the positive and negative virulence factor samples are processed by step S1, and the serially connected encoder of the self-attention-based Transformer model, self-attention pooling layer and multi-layer perceptron are trained as a whole.
First, the problem of reliance on manual features and limited expressive capacity is solved by using the protein pre-training language model ESM-2 to automatically learn the initial features of the amino acid sequence of virulence factors; second, the encoder of a Transformer model based on the self-attention mechanism is introduced to model the dependency relationships within long sequences and generate features with virulence-discriminating capability; meanwhile, a self-attention pooling layer adaptively aggregates the resulting amino acid feature vector sequence into a sequence-level feature representation.
Further, in step S1, the protein pre-training language model ESM-2 consists of a plurality of sequentially stacked Transformer encoder layers; during pre-training its input is the amino acid sequence of a virulence factor and its output is the initial features of that amino acid sequence.
Each initial amino acid feature is a 1280-dimensional vector. Assuming a virulence factor consists of n amino acids, the initial feature vector of each amino acid is denoted x, and the initial features of all amino acids constitute the feature matrix X (of size n × 1280).
Further, the positive samples of virulence factors were collected from the three databases Victors, PATRIC and VFDB, and the negative samples were collected using the PBVF method.
Further, in step S3, the self-attention pooling layer adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
Further, the encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks; each block comprises a multi-head self-attention layer and a fully connected feed-forward network layer in series, and each of these two layers is followed by the two operations of residual connection and layer normalization.
Each block operates as follows.
First, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are three projection matrices, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), and $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors.
The self-attention mechanism mixes features between different elements of X within a linear projection space.
Then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces, as shown below:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors.
Next, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization.
Layer normalization normalizes each amino acid feature vector in the feature matrix separately: from each component of the vector the vector's mean is subtracted, and the result is divided by the vector's standard deviation.
The result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector; $W_1, W_2$ are transformation parameters; $b_1, b_2$ are bias terms; and $\sigma(\cdot)$ is the activation function.
Finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
Each block learns the initial characteristics of the amino acids of the virulence factor so that the characteristics of each amino acid can aggregate the characteristic information of all amino acids in the entire virulence factor sequence.
Further, the multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $b_l$ is the bias term; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
Further, the hidden layers of the multi-layer perceptron use the ReLU activation function, and the output layer uses a Sigmoid function to output the virulence probability of the sequence; if the probability exceeds 0.5, the sequence is predicted to be a virulence factor.
Further, overall training follows a deep learning procedure with random-seed initialization. During training, the input is the initial features of the amino acid sequences of the known virulence factor training data and the output is the virulence probability; the loss function is the cross entropy between the known virulence factor labels and the predicted probabilities, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
Iteratively updating the overall model parameters with the AdamW optimizer means iteratively updating all parameters in the serially connected encoder of the self-attention-based Transformer model, the self-attention pooling layer and the multi-layer perceptron.
In a second aspect, an identification system adopting the virulence factor identification method based on the self-attention coding and pooling mechanism includes:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
In a third aspect, a computer readable storage medium stores a computer program that is invoked by a processor to perform the steps of the above virulence factor identification method based on the self-attention coding and pooling mechanism.
Advantageous effects
The technical scheme of the invention provides a virulence factor identification method (SAEP-VF) and system based on a self-attention coding and pooling mechanism. The method models the dependency relationships within long sequences through a Transformer model with self-attention to generate features with virulence-discriminating capability; before that, the protein pre-training language model ESM-2 automatically learns the initial amino acid features of the virulence factor sequence, solving the problem of reliance on manual features and limited expressive capacity.
Compared with existing virulence factor identification methods, the method can model the dependency relationships within long sequences and automatically learn the amino acid sequence features of virulence factors, explicitly capturing the associations among these features, so prediction accuracy can be greatly improved; meanwhile, on an independent test dataset, the prediction performance of the disclosed SAEP-VF method is on the whole superior to the other comparison methods.
Drawings
FIG. 1 is a schematic flow chart of a method according to an example of the invention;
FIG. 2 is a graph comparing the ROC curves of the method of the present invention with other methods on the independent test dataset.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and examples:
Example 1:
as shown in fig. 1, the virulence factor identification method based on the self-attention coding and pooling mechanism provided in this embodiment includes the following steps:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
In step S1, the protein pre-training language model ESM-2 consists of 33 sequentially stacked Transformer encoder layers; its input is the virulence factor sequence and its output is the initial features of the amino acids of that sequence. Each initial amino acid feature is a 1280-dimensional vector. Assuming a virulence factor consists of n amino acids, the initial feature vector of each amino acid is denoted x, and the initial features of all amino acids constitute the feature matrix X (of size n × 1280).
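By way of illustration only, the following is a minimal sketch of this feature-extraction step using the publicly released fair-esm package; the checkpoint name esm2_t33_650M_UR50D matches the 33-layer, 1280-dimensional configuration described above, but the exact checkpoint and the example sequence are assumptions, not taken from the patent.

```python
# Sketch of step S1: extracting per-residue ESM-2 features.
# Assumes the fair-esm package (pip install fair-esm); the checkpoint
# esm2_t33_650M_UR50D (33 layers, 1280-dim) matches the description above.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# A hypothetical virulence factor sequence (placeholder, not from the patent).
data = [("vf_example", "MKTLLLTLVVVTIVCLDLGYT")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]   # shape: (batch, seq_len + 2, 1280)

# Drop the BOS/EOS tokens: keep one 1280-dim vector per amino acid.
n = len(data[0][1])
X = reps[0, 1 : n + 1]              # feature matrix X, shape (n, 1280)
```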
In this example, 1,000 positive sample virulence factors were collected from the three databases Victors, PATRIC and VFDB, and 1,000 negative samples were collected according to the PBVF method in order to construct a balanced dataset.
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
The encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks; each block comprises a multi-head self-attention layer and a fully connected feed-forward network layer in series, and each of these two layers is followed by the two operations of residual connection and layer normalization.
Each block operates as follows.
First, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are three projection matrices, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), and $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors.
The self-attention mechanism mixes features between different elements of X within a linear projection space.
Then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces, as shown below:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors.
Next, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization.
Layer normalization normalizes each amino acid feature vector in the feature matrix separately: from each component of the vector the vector's mean is subtracted, and the result is divided by the vector's standard deviation.
The result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector; $W_1, W_2$ are transformation parameters; $b_1, b_2$ are bias terms; and $\sigma(\cdot)$ is the activation function, for which ReLU is used in this example.
Finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
Each block learns the initial characteristics of the amino acids of the virulence factor so that the characteristics of each amino acid can aggregate the characteristic information of all amino acids in the entire virulence factor sequence.
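For concreteness, a PyTorch sketch of one such block is given below; it follows the post-norm structure just described (multi-head self-attention, then a fully connected feed-forward network, each followed by residual connection and layer normalization), while the head count, feed-forward width and block count N are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One block of the self-attention encoder in step S2 (a sketch;
    head count and FFN width are assumed, not specified by the patent)."""
    def __init__(self, d_model: int = 1280, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.mha(x, x, x)
        h = self.norm1(x + attn_out)
        # Position-wise feed-forward network, then residual + layer norm.
        return self.norm2(h + self.ffn(h))

# Usage: encode a batch of (n, 1280) amino acid feature matrices X.
blocks = nn.Sequential(*[EncoderBlock() for _ in range(2)])  # N = 2 assumed
Z = blocks(torch.randn(1, 60, 1280))  # coded feature vector sequence Z
```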
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with the self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
In step S3, the self-attention pooling layer adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
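A minimal PyTorch sketch of this pooling step, implementing C = softmax(ZW_c)^T Z as reconstructed above, is shown below; the 1280-dimensional feature width is carried over from step S1.

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    """Step S3: aggregate the coded vectors Z (n, d) into one vector C (d,)."""
    def __init__(self, d_model: int = 1280):
        super().__init__()
        # W_c: the learnable projection vector of the pooling layer.
        self.w_c = nn.Linear(d_model, 1, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # a = softmax(Z W_c): one weight per amino acid, summing to 1.
        a = torch.softmax(self.w_c(z), dim=-2)   # (batch, n, 1)
        # C = a^T Z: weighted average over the sequence dimension.
        return (a * z).sum(dim=-2)               # (batch, d)

pool = SelfAttentionPooling()
C = pool(torch.randn(1, 60, 1280))  # sequence-level feature, shape (1, 1280)
```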
The multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $b_l$ is the bias term; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
The hidden layers use the ReLU activation function, and the output layer uses a Sigmoid function to output the virulence probability of the sequence; if the probability exceeds 0.5, the sequence is predicted to be a virulence factor.
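The classification head might be sketched as follows; the ReLU hidden activation, Sigmoid output and 0.5 threshold come from the description above, while the single hidden layer and its width of 256 are assumptions.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Step S4: map the sequence feature C to a virulence probability.
    Hidden width 256 is assumed; ReLU/Sigmoid follow the description."""
    def __init__(self, d_model: int = 1280, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        return self.net(c).squeeze(-1)   # probability in (0, 1)

head = MLPHead()
prob = head(torch.randn(1, 1280))
is_virulence_factor = bool(prob.item() > 0.5)  # 0.5 threshold per the patent
```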
Overall training follows a deep learning procedure with random-seed initialization (the random seed is usually set to 42). During training, the input is the initial features of the amino acid sequences of the known virulence factor training data and the output is the virulence probability; the loss between the known virulence factor labels and the predicted probabilities is computed with the cross entropy loss function, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
Iteratively updating the overall model parameters with the AdamW optimizer means iteratively updating all parameters in the serially connected encoder of the self-attention-based Transformer model, the self-attention pooling layer and the multi-layer perceptron.
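Putting the pieces together, the end-to-end training described above might look like the sketch below, reusing the blocks, pool and head objects from the preceding sketches; the learning rate, batch contents and epoch count are assumptions, while the seed of 42, the binary cross-entropy loss and the AdamW optimizer follow the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # random seed per the description above

# blocks, pool, head: the modules sketched earlier, trained jointly end to end.
model = nn.Sequential(blocks, pool, head)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr assumed
criterion = nn.BCELoss()  # binary cross entropy, matching the loss above

# X_batch: precomputed ESM-2 features (batch, n, 1280); y_batch: 0/1 labels.
X_batch = torch.randn(8, 60, 1280)   # placeholder batch, not patent data
y_batch = torch.randint(0, 2, (8,)).float()

for epoch in range(10):  # epoch count assumed
    optimizer.zero_grad()
    probs = model(X_batch)
    loss = criterion(probs, y_batch)
    loss.backward()
    optimizer.step()
```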
In summary, the present invention provides a virulence factor identification method based on a self-attention coding and pooling mechanism, which models the dependency relationships within long sequences through a Transformer model with self-attention to generate features with virulence-discriminating capability; before that, the protein pre-training language model ESM-2 automatically learns the initial amino acid features of the virulence factor sequences, solving the problem of reliance on manual features and limited expressive capacity and further improving prediction accuracy.
Validity verification:
To verify the effectiveness of the method, in this embodiment the collected 1,000 positive and 1,000 negative samples were divided into a training set, a validation set and a test set in the ratio 8:1:1. During training, the model was first trained on the training set while being validated on the validation set to learn the parameters effectively; the optimal model parameters obtained on the validation set were then used for performance testing on the test set. Table 1 compares the performance of the method of this embodiment on the independent test dataset: the prediction accuracy of the SAEP-VF method (the technical scheme of the invention) is better than that of the other four comparison methods. The ROC curve comparison in FIG. 2 shows that the ROC curve of the SAEP-VF method lies mostly above those of the other four comparison methods, indicating that the SAEP-VF method achieves reliable identification performance with high credibility and practical applicability.
Table 1 Comparison of performance with other methods on the independent test dataset

Method | AUROC | AUPR | F1 score | Accuracy | Recall | Specificity | Precision
CNN | 0.9093 | 0.9177 | 0.8426 | 0.8600 | 0.7979 | 0.9151 | 0.8929
BLSTM | 0.8997 | 0.9298 | 0.8436 | 0.8350 | 0.8396 | 0.8297 | 0.8476
LSTM | 0.9171 | 0.9271 | 0.8638 | 0.8550 | 0.8679 | 0.8404 | 0.8598
GRU | 0.8642 | 0.8183 | 0.8430 | 0.8250 | 0.8617 | 0.7924 | 0.7864
SAEP-VF | 0.9207 | 0.9259 | 0.8611 | 0.8650 | 0.8829 | 0.8490 | 0.8383
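For reference, the 8:1:1 division described above might be produced as in the following sketch; the stratified shuffling and the random_state value are assumptions, and the sequence list is a placeholder.

```python
# A sketch of the 8:1:1 train/validation/test split (stratification assumed).
from sklearn.model_selection import train_test_split

seqs = [f"SEQ{i}" for i in range(2000)]   # placeholder sequences
labels = [1] * 1000 + [0] * 1000          # 1,000 positives, 1,000 negatives

# First split off 80% for training, then halve the remainder into 10% + 10%.
train_x, rest_x, train_y, rest_y = train_test_split(
    seqs, labels, test_size=0.2, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```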
Example 2:
the present embodiment provides a system employing a virulence factor identification method based on a self-attention coding and pooling mechanism, comprising:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
Example 3:
A computer readable storage medium storing a computer program, the computer program being invoked by a processor to perform:
the steps of the virulence factor identification method based on the self-attention coding and pooling mechanism.
For a specific implementation of each step, please refer to the description of the foregoing method.
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller of any of the foregoing embodiments, for example a hard disk or memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card or a flash card provided on the controller. Further, the readable storage medium may include both an internal storage unit and an external storage device of the controller. The readable storage medium is used to store the computer program and other programs and data required by the controller, and may also be used to temporarily store data that has been output or is to be output.
Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It should be emphasized that the examples described herein are illustrative rather than limiting; the invention is not limited to the examples described in the specific embodiments, and other embodiments obtained by those skilled in the art according to the technical solutions of the invention, whether by modification or substitution, likewise fall within the scope of the invention so long as they do not depart from its spirit and scope.

Claims (10)

1. A method for identifying virulence factors based on a self-attention coding and pooling mechanism, comprising the steps of:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with a self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
Step S4: inputting the sequence features into a multi-layer perceptron to identify whether the sequence is a virulence factor;
before identification, the positive and negative virulence factor samples are processed by step S1, and the serially connected encoder of the self-attention-based Transformer model, self-attention pooling layer and multi-layer perceptron are trained as a whole.
2. The method according to claim 1, wherein the protein pre-training language model ESM-2 in step S1 consists of a plurality of sequentially stacked Transformer encoder layers, the input during pre-training being the amino acid sequence of a virulence factor and the output being the initial features of that amino acid sequence.
3. The method of claim 1, wherein the positive samples of virulence factors are collected from the three databases Victors, PATRIC and VFDB, and the negative samples are collected using the PBVF method.
4. The method according to claim 1, wherein the self-attention pooling layer in step S3 adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
5. The method according to claim 1, wherein the encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks, each block comprising a multi-head self-attention layer and a fully connected feed-forward network layer in series, each of which is followed by the two operations of residual connection and layer normalization;
each block operates as follows:
first, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are the three projection matrices of the self-attention mechanism, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors, and $\mathsf{T}$ denotes matrix transposition;
then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors;
then, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization;
the result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector, $W_1, W_2$ are transformation parameters, $b_1, b_2$ are bias terms, and $\sigma(\cdot)$ is the activation function;
finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
6. The method according to claim 1, wherein the multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $h^{(l)}$ denotes the output of the l-th layer, with $h^{(0)} = C$; $b_l$ denotes the bias term of the l-th layer; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
7. The method of claim 6, wherein the hidden layers of the multi-layer perceptron use the ReLU activation function and the output layer uses a Sigmoid function to output the virulence probability of the sequence; a probability greater than 0.5 indicates that the sequence is predicted to be a virulence factor.
8. The method of any one of claims 1-7, wherein overall training follows a deep learning procedure with random-seed initialization; during training, the input of the known training data is the initial features of the amino acid sequence of the virulence factor and the output is the virulence probability; the loss function is the cross entropy between the known virulence factor labels and the predicted probabilities, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
9. An identification system employing the virulence factor identification method based on a self-attention coding and pooling mechanism according to any one of claims 1 to 8, comprising:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
10. A computer-readable storage medium, characterized in that it stores a computer program which is invoked by a processor to perform the virulence factor identification method based on the self-attention coding and pooling mechanism according to any one of claims 1 to 8.
CN202410489566.9A 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism Active CN118098372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410489566.9A CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410489566.9A CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Publications (2)

Publication Number Publication Date
CN118098372A true CN118098372A (en) 2024-05-28
CN118098372B CN118098372B (en) 2024-07-02

Family

ID=91164004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410489566.9A Active CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Country Status (1)

Country Link
CN (1) CN118098372B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
KR20200126715A (en) * 2019-04-30 2020-11-09 주식회사 엘지화학 Protein Toxicity Prediction System and Method Using Artificial Neural Network
CN113936740A (en) * 2021-11-01 2022-01-14 安徽医科大学 High-throughput detection method and system for pathogenic bacteria virulence factor in environmental sample
CN115171792A (en) * 2022-06-30 2022-10-11 湖南大学 Hybrid prediction method of virulence factor and antibiotic resistance gene
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院系统工程研究院 Feature fusion modulation identification method based on Transformer
CN115547414A (en) * 2022-10-25 2022-12-30 黑龙江金域医学检验实验室有限公司 Determination method and device of potential virulence factor, computer equipment and storage medium
WO2023040148A1 (en) * 2021-09-16 2023-03-23 平安科技(深圳)有限公司 Rna base unpaired probability prediction method and apparatus, storage medium, and device
JP2023062080A (en) * 2022-06-21 2023-05-02 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, apparatus, and medium for determining and training atomic coordinates in amino acid
CN116230113A (en) * 2023-01-13 2023-06-06 大连大学 Compound-protein interaction prediction method fusing multi-view information
WO2023109714A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium
CN116758978A (en) * 2023-06-15 2023-09-15 西北工业大学 Controllable attribute totally new active small molecule design method based on protein structure
WO2024072980A1 (en) * 2022-09-29 2024-04-04 Biomap Intelligence Technology Sg Pte. Ltd. Protein structure prediction
CN117831609A (en) * 2024-01-15 2024-04-05 电子科技大学 Protein secondary structure prediction method and device and computer device

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200126715A (en) * 2019-04-30 2020-11-09 주식회사 엘지화학 Protein Toxicity Prediction System and Method Using Artificial Neural Network
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
WO2023040148A1 (en) * 2021-09-16 2023-03-23 平安科技(深圳)有限公司 Rna base unpaired probability prediction method and apparatus, storage medium, and device
CN113936740A (en) * 2021-11-01 2022-01-14 安徽医科大学 High-throughput detection method and system for pathogenic bacteria virulence factor in environmental sample
WO2023109714A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium
JP2023062080A (en) * 2022-06-21 2023-05-02 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, apparatus, and medium for determining and training atomic coordinates in amino acid
CN115171792A (en) * 2022-06-30 2022-10-11 湖南大学 Hybrid prediction method of virulence factor and antibiotic resistance gene
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院系统工程研究院 Feature fusion modulation identification method based on Transformer
WO2024072980A1 (en) * 2022-09-29 2024-04-04 Biomap Intelligence Technology Sg Pte. Ltd. Protein structure prediction
CN115547414A (en) * 2022-10-25 2022-12-30 黑龙江金域医学检验实验室有限公司 Determination method and device of potential virulence factor, computer equipment and storage medium
CN116230113A (en) * 2023-01-13 2023-06-06 大连大学 Compound-protein interaction prediction method fusing multi-view information
CN116758978A (en) * 2023-06-15 2023-09-15 西北工业大学 Controllable attribute totally new active small molecule design method based on protein structure
CN117831609A (en) * 2024-01-15 2024-04-05 电子科技大学 Protein secondary structure prediction method and device and computer device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guanghui Li et al.: "Identifying virulence factors using graph transformer autoencoder with ESMFold-predicted structures", Computers in Biology and Medicine, vol. 170, 30 January 2024 (2024-01-30), pages 1-13 *

Also Published As

Publication number Publication date
CN118098372B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Lu et al. Deep fuzzy hashing network for efficient image retrieval
Zhang et al. Integrating feature selection and feature extraction methods with deep learning to predict clinical outcome of breast cancer
CN110688502B (en) Image retrieval method and storage medium based on depth hash and quantization
CN111667884A (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN107609352B (en) Prediction method of protein self-interaction
Wei et al. Projected residual vector quantization for ANN search
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114420310A (en) Medicine ATCCode prediction method based on graph transformation network
Kowalski et al. Determining significance of input neurons for probabilistic neural network by sensitivity analysis procedure
Yan et al. A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data
CN114023376A (en) RNA-protein binding site prediction method and system based on self-attention mechanism
CN113642613A (en) Medical disease characteristic selection method based on improved goblet sea squirt group algorithm
CN116612810A (en) Medicine target interaction prediction method based on interaction inference network
Zhang et al. protein2vec: predicting protein-protein interactions based on LSTM
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
CN118098372B (en) Virulence factor identification method and system based on self-attention coding and pooling mechanism
CN109920478B (en) Microorganism-disease relation prediction method based on similarity and low-rank matrix filling
CN115240775B (en) Cas protein prediction method based on stacking integrated learning strategy
CN111984800B (en) Hash cross-modal information retrieval method based on dictionary pair learning
CN115691817A (en) LncRNA-disease association prediction method based on fusion neural network
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
Okamura et al. Lcnme: Label correction using network prediction based on memorization effects for cross-modal retrieval with noisy labels
Sohail et al. Selection of optimal texture descriptors for retrieving ultrasound medical images
Wójcik Random projection in deep neural networks
Badea et al. Sparse factorizations of gene expression data guided by binding data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant