CN118098372A - Virulence factor identification method and system based on self-attention coding and pooling mechanism - Google Patents

Virulence factor identification method and system based on self-attention coding and pooling mechanism

Info

Publication number
CN118098372A
Authority
CN
China
Prior art keywords
self-attention
amino acid sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410489566.9A
Other languages
Chinese (zh)
Other versions
CN118098372B (en)
Inventor
Li Guanghui
Bai Peihao
Chen Jiao
Zhang Liming
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202410489566.9A priority Critical patent/CN118098372B/en
Publication of CN118098372A publication Critical patent/CN118098372A/en
Application granted granted Critical
Publication of CN118098372B publication Critical patent/CN118098372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a virulence factor identification method and system based on a self-attention coding and pooling mechanism. In this technical scheme, the initial features of the amino acid sequence are extracted by a pre-training language model, which overcomes the reliance of traditional machine learning methods on manual features and their limited expressive capacity; the encoder of a Transformer model based on the self-attention mechanism models the dependency relationships within long sequences, effectively improving the accuracy of virulence factor identification.

Description

Virulence factor identification method and system based on self-attention coding and pooling mechanism
Technical Field
The invention belongs to the technical field of deep learning and structural biology, and particularly relates to a virulence factor identification method and system based on a self-attention coding and pooling mechanism.
Background
The pathogenicity of a bacterium is determined by the virulence factors it encodes. As bacterial drug resistance continues to increase, researchers have turned to the virulence factors of pathogenic bacteria to treat infectious diseases caused by drug-resistant strains. Virulence factors can serve as potential drug targets specifically for treating bacterial infection: by designing specific anti-virulence drugs, the virulence of pathogenic bacteria can be suppressed without killing the bacteria or inhibiting their growth, thereby avoiding the high evolutionary pressure that gives rise to drug resistance. Meanwhile, with the rapid progress of whole-genome sequencing technology, a large amount of pathogenic bacteria genome data has accumulated, so making full use of strain sequencing information to identify virulence factors helps save substantial biological experiment cost and improves the development efficiency of anti-virulence drugs. In summary, identifying and studying the virulence factors of pathogenic bacteria not only clarifies their pathogenic mechanisms but also provides ideal candidate targets for anti-virulence strategies for treating pathogenic bacterial infection.
Existing methods for identifying virulence factors fall mainly into two types. The first is based on sequence similarity comparison: a candidate pathogenic bacterium gene sequence is compared with known virulence factor sequences using tools such as BLAST, Diamond or Bowtie, and the resulting similarity is used to judge virulence. The second is based on machine learning: sequence features are acquired in advance and, after feature processing, a machine learning algorithm predicts virulence factors. Sequence alignment-based methods determine genes of potential virulence factors through whole-genome association studies or homology searches against virulence factor databases. However, with the emergence of antibiotic resistance, the features and mechanisms of virulence factors have evolved continuously; algorithms based on sequence similarity alignment can identify only conserved virulence factors and struggle to identify novel virulence factors that are evolutionarily distant from known virulence proteins. For this reason, researchers have proposed recognition methods based on machine learning. To extract virulence factor sequence features effectively, machine learning algorithms typically combine a variety of predefined sequence features, such as frequency components, physicochemical properties, protein functional domains and position-specific scoring matrices (Position Specific Scoring Matrix, PSSM). However, machine learning-based identification methods rely on manually extracted features and have limited ability to express the high-level features of virulence factors. In recent years, researchers have therefore proposed identifying virulence factors with deep learning models, mainly using convolutional neural networks and recurrent neural networks for feature learning of virulence factor sequences; yet convolutional neural networks have difficulty modeling long sequences, and recurrent neural networks suffer from the long-range dependency problem. Considering that the self-attention-based Transformer model can model the dependency relationships within long sequences well, it is necessary to design a model based on the self-attention mechanism for feature learning of virulence factor sequences, generating features with virulence-discriminating capability and providing a method for research on pathogenic bacteria virulence factor identification.
Disclosure of Invention
The technical scheme of the invention provides a virulence factor identification method and system based on a self-attention coding and pooling mechanism. It uses the initial amino acid sequence features from the protein pre-training language model ESM-2, which overcomes the reliance of traditional machine learning methods on manual features and their limited expressive capacity; it introduces the encoder of a Transformer model based on the self-attention mechanism to model the dependency relationships within long sequences and generate features with virulence-discriminating capability; and it adopts a self-attention pooling layer to adaptively aggregate the resulting amino acid feature vector sequence into a sequence-level feature representation, thereby greatly improving the accuracy of virulence factor identification.
The invention provides the following technical scheme:
in one aspect, a method for identifying virulence factors based on self-attention coding and pooling mechanisms includes the steps of:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with a self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
Step S4: inputting the sequence features into a multi-layer perceptron to identify whether the sequence is a virulence factor;
before identification, the positive and negative virulence factor samples are processed by step S1, and the serially connected encoder of the self-attention-based Transformer model, self-attention pooling layer and multi-layer perceptron are trained as a whole.
First, the problem of reliance on manual features and limited expressive capacity is solved by using the protein pre-training language model ESM-2 to automatically learn the initial features of the amino acid sequence of virulence factors; second, the encoder of a Transformer model based on the self-attention mechanism is introduced to model the dependency relationships within long sequences and generate features with virulence-discriminating capability; meanwhile, a self-attention pooling layer adaptively aggregates the resulting amino acid feature vector sequence into a sequence-level feature representation.
Further, in step S1, the protein pre-training language model ESM-2 consists of a plurality of sequentially stacked Transformer encoder layers; during pre-training its input is the amino acid sequence of a virulence factor and its output is the initial features of that amino acid sequence.
Each initial amino acid feature is a 1280-dimensional vector. Assuming a virulence factor consists of n amino acids, the initial feature vector of each amino acid is denoted x, and the initial features of all amino acids constitute the feature matrix X (of size n × 1280).
Further, the positive samples of virulence factors were collected from the three databases Victors, PATRIC and VFDB, and the negative samples were collected using the PBVF method.
Further, in step S3, the self-attention pooling layer adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
Further, the encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks; each block comprises a multi-head self-attention layer and a fully connected feed-forward network layer in series, and each of these two layers is followed by the two operations of residual connection and layer normalization.
Each block operates as follows.
First, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are three projection matrices, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), and $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors.
The self-attention mechanism mixes features between different elements of X within a linear projection space.
Then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces, as shown below:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors.
Next, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization.
Layer normalization normalizes each amino acid feature vector in the feature matrix separately: from each component of the vector the vector's mean is subtracted, and the result is divided by the vector's standard deviation.
The result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector; $W_1, W_2$ are transformation parameters; $b_1, b_2$ are bias terms; and $\sigma(\cdot)$ is the activation function.
Finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
Each block learns the initial characteristics of the amino acids of the virulence factor so that the characteristics of each amino acid can aggregate the characteristic information of all amino acids in the entire virulence factor sequence.
Further, the multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $b_l$ is the bias term; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
Further, the hidden layers of the multi-layer perceptron use the ReLU activation function, and the output layer uses a Sigmoid function to output the virulence probability of the sequence; if the probability exceeds 0.5, the sequence is predicted to be a virulence factor.
Further, overall training follows a deep learning procedure with random-seed initialization. During training, the input is the initial features of the amino acid sequences of the known virulence factor training data and the output is the virulence probability; the loss function is the cross entropy between the known virulence factor labels and the predicted probabilities, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
Iteratively updating the overall model parameters with the AdamW optimizer means iteratively updating all parameters in the serially connected encoder of the self-attention-based Transformer model, the self-attention pooling layer and the multi-layer perceptron.
In a second aspect, an identification system adopting the virulence factor identification method based on the self-attention coding and pooling mechanism includes:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
In a third aspect, a computer readable storage medium stores a computer program that is invoked by a processor to perform the steps of the above virulence factor identification method based on the self-attention coding and pooling mechanism.
Advantageous effects
The technical scheme of the invention provides a virulence factor identification method (SAEP-VF) and system based on a self-attention coding and pooling mechanism. The method models the dependency relationships within long sequences through a Transformer model with self-attention to generate features with virulence-discriminating capability; before that, the protein pre-training language model ESM-2 automatically learns the initial amino acid features of the virulence factor sequence, solving the problem of reliance on manual features and limited expressive capacity.
Compared with existing virulence factor identification methods, the method can model the dependency relationships within long sequences and automatically learn the amino acid sequence features of virulence factors, explicitly capturing the associations among these features, so prediction accuracy can be greatly improved; meanwhile, on an independent test dataset, the prediction performance of the disclosed SAEP-VF method is on the whole superior to the other comparison methods.
Drawings
FIG. 1 is a schematic flow chart of a method according to an example of the invention;
FIG. 2 is a graph comparing the ROC curves of the method of the present invention with other methods on the independent test dataset.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and examples:
Example 1:
as shown in fig. 1, the virulence factor identification method based on the self-attention coding and pooling mechanism provided in this embodiment includes the following steps:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
In step S1, the protein pre-training language model ESM-2 consists of 33 sequentially stacked Transformer encoder layers; its input is the virulence factor sequence and its output is the initial features of the amino acids of that sequence. Each initial amino acid feature is a 1280-dimensional vector. Assuming a virulence factor consists of n amino acids, the initial feature vector of each amino acid is denoted x, and the initial features of all amino acids constitute the feature matrix X (of size n × 1280).
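By way of illustration only, the following is a minimal sketch of this feature-extraction step using the publicly released fair-esm package; the checkpoint name esm2_t33_650M_UR50D matches the 33-layer, 1280-dimensional configuration described above, but the exact checkpoint and the example sequence are assumptions, not taken from the patent.

```python
# Sketch of step S1: extracting per-residue ESM-2 features.
# Assumes the fair-esm package (pip install fair-esm); the checkpoint
# esm2_t33_650M_UR50D (33 layers, 1280-dim) matches the description above.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# A hypothetical virulence factor sequence (placeholder, not from the patent).
data = [("vf_example", "MKTLLLTLVVVTIVCLDLGYT")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]   # shape: (batch, seq_len + 2, 1280)

# Drop the BOS/EOS tokens: keep one 1280-dim vector per amino acid.
n = len(data[0][1])
X = reps[0, 1 : n + 1]              # feature matrix X, shape (n, 1280)
```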
In this example, 1,000 positive sample virulence factors were collected from the three databases Victors, PATRIC and VFDB, and 1,000 negative samples were collected according to the PBVF method in order to construct a balanced dataset.
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
The encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks; each block comprises a multi-head self-attention layer and a fully connected feed-forward network layer in series, and each of these two layers is followed by the two operations of residual connection and layer normalization.
Each block operates as follows.
First, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are three projection matrices, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), and $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors.
The self-attention mechanism mixes features between different elements of X within a linear projection space.
Then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces, as shown below:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors.
Next, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization.
Layer normalization normalizes each amino acid feature vector in the feature matrix separately: from each component of the vector the vector's mean is subtracted, and the result is divided by the vector's standard deviation.
The result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector; $W_1, W_2$ are transformation parameters; $b_1, b_2$ are bias terms; and $\sigma(\cdot)$ is the activation function, for which ReLU is used in this example.
Finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
Each block learns the initial characteristics of the amino acids of the virulence factor so that the characteristics of each amino acid can aggregate the characteristic information of all amino acids in the entire virulence factor sequence.
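For concreteness, a PyTorch sketch of one such block is given below; it follows the post-norm structure just described (multi-head self-attention, then a fully connected feed-forward network, each followed by residual connection and layer normalization), while the head count, feed-forward width and block count N are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One block of the self-attention encoder in step S2 (a sketch;
    head count and FFN width are assumed, not specified by the patent)."""
    def __init__(self, d_model: int = 1280, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.mha(x, x, x)
        h = self.norm1(x + attn_out)
        # Position-wise feed-forward network, then residual + layer norm.
        return self.norm2(h + self.ffn(h))

# Usage: encode a batch of (n, 1280) amino acid feature matrices X.
blocks = nn.Sequential(*[EncoderBlock() for _ in range(2)])  # N = 2 assumed
Z = blocks(torch.randn(1, 60, 1280))  # coded feature vector sequence Z
```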
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with the self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
In step S3, the self-attention pooling layer adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
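A minimal PyTorch sketch of this pooling step, implementing C = softmax(ZW_c)^T Z as reconstructed above, is shown below; the 1280-dimensional feature width is carried over from step S1.

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    """Step S3: aggregate the coded vectors Z (n, d) into one vector C (d,)."""
    def __init__(self, d_model: int = 1280):
        super().__init__()
        # W_c: the learnable projection vector of the pooling layer.
        self.w_c = nn.Linear(d_model, 1, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # a = softmax(Z W_c): one weight per amino acid, summing to 1.
        a = torch.softmax(self.w_c(z), dim=-2)   # (batch, n, 1)
        # C = a^T Z: weighted average over the sequence dimension.
        return (a * z).sum(dim=-2)               # (batch, d)

pool = SelfAttentionPooling()
C = pool(torch.randn(1, 60, 1280))  # sequence-level feature, shape (1, 1280)
```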
The multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $b_l$ is the bias term; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
The hidden layers use the ReLU activation function, and the output layer uses a Sigmoid function to output the virulence probability of the sequence; if the probability exceeds 0.5, the sequence is predicted to be a virulence factor.
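The classification head might be sketched as follows; the ReLU hidden activation, Sigmoid output and 0.5 threshold come from the description above, while the single hidden layer and its width of 256 are assumptions.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Step S4: map the sequence feature C to a virulence probability.
    Hidden width 256 is assumed; ReLU/Sigmoid follow the description."""
    def __init__(self, d_model: int = 1280, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        return self.net(c).squeeze(-1)   # probability in (0, 1)

head = MLPHead()
prob = head(torch.randn(1, 1280))
is_virulence_factor = bool(prob.item() > 0.5)  # 0.5 threshold per the patent
```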
Overall training follows a deep learning procedure with random-seed initialization (the random seed is usually set to 42). During training, the input is the initial features of the amino acid sequences of the known virulence factor training data and the output is the virulence probability; the loss between the known virulence factor labels and the predicted probabilities is computed with the cross entropy loss function, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
Iteratively updating the overall model parameters with the AdamW optimizer means iteratively updating all parameters in the serially connected encoder of the self-attention-based Transformer model, the self-attention pooling layer and the multi-layer perceptron.
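Putting the pieces together, the end-to-end training described above might look like the sketch below, reusing the blocks, pool and head objects from the preceding sketches; the learning rate, batch contents and epoch count are assumptions, while the seed of 42, the binary cross-entropy loss and the AdamW optimizer follow the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # random seed per the description above

# blocks, pool, head: the modules sketched earlier, trained jointly end to end.
model = nn.Sequential(blocks, pool, head)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr assumed
criterion = nn.BCELoss()  # binary cross entropy, matching the loss above

# X_batch: precomputed ESM-2 features (batch, n, 1280); y_batch: 0/1 labels.
X_batch = torch.randn(8, 60, 1280)   # placeholder batch, not patent data
y_batch = torch.randint(0, 2, (8,)).float()

for epoch in range(10):  # epoch count assumed
    optimizer.zero_grad()
    probs = model(X_batch)
    loss = criterion(probs, y_batch)
    loss.backward()
    optimizer.step()
```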
In summary, the present invention provides a virulence factor identification method based on a self-attention coding and pooling mechanism, which models the dependency relationships within long sequences through a Transformer model with self-attention to generate features with virulence-discriminating capability; before that, the protein pre-training language model ESM-2 automatically learns the initial amino acid features of the virulence factor sequences, solving the problem of reliance on manual features and limited expressive capacity and further improving prediction accuracy.
Validity verification:
To verify the effectiveness of the method, in this embodiment the collected 1,000 positive and 1,000 negative samples were divided into a training set, a validation set and a test set in the ratio 8:1:1. During training, the model was first trained on the training set while being validated on the validation set to learn the parameters effectively; the optimal model parameters obtained on the validation set were then used for performance testing on the test set. Table 1 compares the performance of the method of this embodiment on the independent test dataset: the prediction accuracy of the SAEP-VF method (the technical scheme of the invention) is better than that of the other four comparison methods. The ROC curve comparison in FIG. 2 shows that the ROC curve of the SAEP-VF method lies mostly above those of the other four comparison methods, indicating that the SAEP-VF method achieves reliable identification performance with high credibility and practical applicability.
Table 1 Comparison of performance with other methods on the independent test dataset

Method | AUROC | AUPR | F1 score | Accuracy | Recall | Specificity | Precision
CNN | 0.9093 | 0.9177 | 0.8426 | 0.8600 | 0.7979 | 0.9151 | 0.8929
BLSTM | 0.8997 | 0.9298 | 0.8436 | 0.8350 | 0.8396 | 0.8297 | 0.8476
LSTM | 0.9171 | 0.9271 | 0.8638 | 0.8550 | 0.8679 | 0.8404 | 0.8598
GRU | 0.8642 | 0.8183 | 0.8430 | 0.8250 | 0.8617 | 0.7924 | 0.7864
SAEP-VF | 0.9207 | 0.9259 | 0.8611 | 0.8650 | 0.8829 | 0.8490 | 0.8383
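For reference, the 8:1:1 division described above might be produced as in the following sketch; the stratified shuffling and the random_state value are assumptions, and the sequence list is a placeholder.

```python
# A sketch of the 8:1:1 train/validation/test split (stratification assumed).
from sklearn.model_selection import train_test_split

seqs = [f"SEQ{i}" for i in range(2000)]   # placeholder sequences
labels = [1] * 1000 + [0] * 1000          # 1,000 positives, 1,000 negatives

# First split off 80% for training, then halve the remainder into 10% + 10%.
train_x, rest_x, train_y, rest_y = train_test_split(
    seqs, labels, test_size=0.2, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```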
Example 2:
the present embodiment provides a system employing a virulence factor identification method based on a self-attention coding and pooling mechanism, comprising:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
Example 3:
A computer readable storage medium storing a computer program, the computer program being invoked by a processor to perform:
the steps of the virulence factor identification method based on the self-attention coding and pooling mechanism.
For a specific implementation of each step, please refer to the description of the foregoing method.
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller of any of the foregoing embodiments, for example a hard disk or memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card or a flash card provided on the controller. Further, the readable storage medium may include both an internal storage unit and an external storage device of the controller. The readable storage medium is used to store the computer program and other programs and data required by the controller, and may also be used to temporarily store data that has been output or is to be output.
Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It should be emphasized that the examples described herein are illustrative rather than limiting; the invention is not limited to the examples described in the specific embodiments, and other embodiments obtained by those skilled in the art according to the technical solutions of the invention, whether by modification or substitution, likewise fall within the scope of the invention so long as they do not depart from its spirit and scope.

Claims (10)

1. A method for identifying virulence factors based on a self-attention coding and pooling mechanism, comprising the steps of:
Step S1: extracting the initial features of the amino acid sequence of a virulence factor using the protein pre-training language model ESM-2;
Step S2: encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
Step S3: adaptively aggregating the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor with a self-attention pooling layer to obtain a sequence-level feature representation with virulence-discriminating capability;
Step S4: inputting the sequence features into a multi-layer perceptron to identify whether the sequence is a virulence factor;
before identification, the positive and negative virulence factor samples are processed by step S1, and the serially connected encoder of the self-attention-based Transformer model, self-attention pooling layer and multi-layer perceptron are trained as a whole.
2. The method according to claim 1, wherein the protein pre-training language model ESM-2 in step S1 consists of a plurality of sequentially stacked Transformer encoder layers, the input during pre-training being the amino acid sequence of a virulence factor and the output being the initial features of that amino acid sequence.
3. The method of claim 1, wherein the positive samples of virulence factors are collected from the three databases Victors, PATRIC and VFDB, and the negative samples are collected using the PBVF method.
4. The method according to claim 1, wherein the self-attention pooling layer in step S3 adaptively aggregates the coded feature vector sequence obtained in step S2 into a sequence-level feature representation, defined as:

$C = \mathrm{softmax}(Z W_c)^{\mathsf{T}} Z$

where $W_c$ is the projection vector of the self-attention pooling layer, $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $Z$ is the coded feature vector sequence of the initial features of the amino acid sequence of the virulence factor, and $\mathsf{T}$ denotes matrix transposition.
5. The method according to claim 1, wherein the encoder of the self-attention-based Transformer model in step S2 consists of N identical stacked blocks, each block comprising a multi-head self-attention layer and a fully connected feed-forward network layer in series, each of which is followed by the two operations of residual connection and layer normalization;
each block operates as follows:
first, the self-attention mechanism of the Transformer model is used to aggregate feature information between different amino acids:

$Q = X W_q,\quad K = X W_k,\quad V = X W_v$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V$

where $Q$ is the query matrix of the amino acid sequence, $K$ is the key matrix, $V$ is the value matrix, $W_q, W_k, W_v$ are the three projection matrices of the self-attention mechanism, $W_q$ and $W_k$ have the same dimension $d_k$, $X$ is the initial feature matrix of the amino acid sequence of the virulence factor, $\mathrm{softmax}(\cdot)$ is the normalized exponential function that maps the self-attention weights into the interval (0, 1), $\mathrm{Attention}(Q, K, V)$ is the result of self-attention applied to the amino acid feature vectors, and $\mathsf{T}$ denotes matrix transposition;
then, multi-head self-attention is used to mix different feature subspaces in several different projection spaces:

$\mathrm{head}_i = \mathrm{Attention}(X W_q^i, X W_k^i, X W_v^i),\quad i \in \{1, \dots, M\}$
$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W_o$

where $M$ is the number of heads, i.e. the number of self-attention modules; the $M$ different projection matrices $W_q^i, W_k^i, W_v^i$ yield $M$ query matrices $Q_i$, key matrices $K_i$ and value matrices $V_i$; $W_o$ is the output projection matrix that maps the dimension of the multi-head aggregated features back to that of $X$; $\mathrm{head}_i$ is the result of the i-th head's self-attention over the amino acid sequence feature vectors; $\mathrm{Attention}(\cdot)$ is the self-attention operation defined above; $\mathrm{Concat}$ splices the outputs of heads 1 through M in order; and $\mathrm{MultiHead}(X)$ is the result of M-head self-attention over the amino acid sequence feature vectors;
then, after the multi-head self-attention layer, residual connection and layer normalization are applied:

$H = \mathrm{norm}(X + \mathrm{MultiHead}(X))$

where $\mathrm{norm}$ denotes the layer normalization operation and $H$ is the result after residual connection and layer normalization;
the result then enters a fully connected feed-forward network, which applies the same two-layer linear transformation with an activation function to each amino acid feature vector in the sequence independently:

$\mathrm{FFN}(h) = W_2\,\sigma(W_1 h + b_1) + b_2$

where $h$ is one row of $H$, i.e. one amino acid feature vector, $W_1, W_2$ are transformation parameters, $b_1, b_2$ are bias terms, and $\sigma(\cdot)$ is the activation function;
finally, after the feed-forward network, residual connection and layer normalization are applied again:

$Z = \mathrm{norm}(H + \mathrm{FFN}(H))$

where $Z$ is the coded feature vector sequence of the amino acid sequence and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network.
6. The method according to claim 1, wherein the multi-layer perceptron in step S4 is defined as:

$h^{(l)} = \sigma_l\!\left(W_l\, h^{(l-1)} + b_l\right),\quad h^{(0)} = C$

where $W_l$ is the transformation parameter from layer (l−1) to layer l; $h^{(l)}$ denotes the output of the l-th layer, with $h^{(0)} = C$; $b_l$ denotes the bias term of the l-th layer; $\sigma_l$ is the activation function of the l-th layer; and $C$ is the sequence-level feature representation obtained as a weighted average of the coded feature vector sequence.
7. The method of claim 6, wherein the hidden layers of the multi-layer perceptron use the ReLU activation function and the output layer uses a Sigmoid function to output the virulence probability of the sequence; a probability greater than 0.5 indicates that the sequence is predicted to be a virulence factor.
8. The method of any one of claims 1-7, wherein overall training follows a deep learning procedure with random-seed initialization; during training, the input of the known training data is the initial features of the amino acid sequence of the virulence factor and the output is the virulence probability; the loss function is the cross entropy between the known virulence factor labels and the predicted probabilities, and the overall model parameters are updated iteratively with the AdamW optimizer:

$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1-y_j)\log(1-\hat{y}_j)\right]$

where $L$ is the loss value, $y_j$ is the label of the known j-th amino acid sequence, $\hat{y}_j$ is the predicted virulence probability of the j-th sequence, and $m$ is the number of virulence factor amino acid sequences in the training set.
9. An identification system employing the virulence factor identification method based on a self-attention coding and pooling mechanism according to any one of claims 1 to 8, comprising:
An amino acid sequence initial feature calculation module: used for extracting the initial features of the amino acid sequence of a virulence factor with the protein pre-training language model ESM-2;
An amino acid sequence initial feature coding module: used for encoding the initial features of the amino acid sequence of the virulence factor with the encoder of a Transformer model based on the self-attention mechanism to obtain a coded feature vector sequence;
An amino acid feature vector sequence aggregation module: used for adaptively aggregating the coded feature vector sequence of the amino acid sequence with the self-attention pooling layer to obtain a sequence-level feature representation;
A virulence factor identification module: used for identifying, with the multi-layer perceptron, whether the amino acid sequence to be queried is a virulence factor.
10. A computer-readable storage medium, characterized in that it stores a computer program which is invoked by a processor to perform the virulence factor identification method based on the self-attention coding and pooling mechanism according to any one of claims 1 to 8.
CN202410489566.9A 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism Active CN118098372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410489566.9A CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410489566.9A CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Publications (2)

Publication Number Publication Date
CN118098372A true CN118098372A (en) 2024-05-28
CN118098372B CN118098372B (en) 2024-07-02

Family

ID=91164004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410489566.9A Active CN118098372B (en) 2024-04-23 2024-04-23 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Country Status (1)

Country Link
CN (1) CN118098372B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
KR20200126715A (en) * 2019-04-30 2020-11-09 주식회사 엘지화학 Protein Toxicity Prediction System and Method Using Artificial Neural Network
CN113936740A (en) * 2021-11-01 2022-01-14 安徽医科大学 High-throughput detection method and system for pathogenic bacteria virulence factor in environmental sample
CN115171792A (en) * 2022-06-30 2022-10-11 湖南大学 Hybrid prediction method of virulence factor and antibiotic resistance gene
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院系统工程研究院 Feature fusion modulation identification method based on Transformer
CN115547414A (en) * 2022-10-25 2022-12-30 黑龙江金域医学检验实验室有限公司 Determination method and device of potential virulence factor, computer equipment and storage medium
WO2023040148A1 (en) * 2021-09-16 2023-03-23 平安科技(深圳)有限公司 Rna base unpaired probability prediction method and apparatus, storage medium, and device
JP2023062080A (en) * 2022-06-21 2023-05-02 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, apparatus, and medium for determining and training atomic coordinates in amino acid
CN116230113A (en) * 2023-01-13 2023-06-06 大连大学 Compound-protein interaction prediction method fusing multi-view information
WO2023109714A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium
CN116758978A (en) * 2023-06-15 2023-09-15 西北工业大学 Controllable attribute totally new active small molecule design method based on protein structure
WO2024072980A1 (en) * 2022-09-29 2024-04-04 Biomap Intelligence Technology Sg Pte. Ltd. Protein structure prediction
CN117831609A (en) * 2024-01-15 2024-04-05 电子科技大学 Protein secondary structure prediction method and device and computer device

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200126715A (en) * 2019-04-30 2020-11-09 주식회사 엘지화학 Protein Toxicity Prediction System and Method Using Artificial Neural Network
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
WO2023040148A1 (en) * 2021-09-16 2023-03-23 平安科技(深圳)有限公司 Rna base unpaired probability prediction method and apparatus, storage medium, and device
CN113936740A (en) * 2021-11-01 2022-01-14 安徽医科大学 High-throughput detection method and system for pathogenic bacteria virulence factor in environmental sample
WO2023109714A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium
JP2023062080A (en) * 2022-06-21 2023-05-02 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, apparatus, and medium for determining and training atomic coordinates in amino acid
CN115171792A (en) * 2022-06-30 2022-10-11 湖南大学 Hybrid prediction method of virulence factor and antibiotic resistance gene
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院系统工程研究院 Feature fusion modulation identification method based on Transformer
WO2024072980A1 (en) * 2022-09-29 2024-04-04 Biomap Intelligence Technology Sg Pte. Ltd. Protein structure prediction
CN115547414A (en) * 2022-10-25 2022-12-30 黑龙江金域医学检验实验室有限公司 Determination method and device of potential virulence factor, computer equipment and storage medium
CN116230113A (en) * 2023-01-13 2023-06-06 大连大学 Compound-protein interaction prediction method fusing multi-view information
CN116758978A (en) * 2023-06-15 2023-09-15 西北工业大学 Controllable attribute totally new active small molecule design method based on protein structure
CN117831609A (en) * 2024-01-15 2024-04-05 电子科技大学 Protein secondary structure prediction method and device and computer device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guanghui Li et al.: "Identifying virulence factors using graph transformer autoencoder with ESMFold-predicted structures", Computers in Biology and Medicine, vol. 170, 30 January 2024 (2024-01-30), pages 1-13 *

Also Published As

Publication number Publication date
CN118098372B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Lu et al. Deep fuzzy hashing network for efficient image retrieval
Zhang et al. Integrating feature selection and feature extraction methods with deep learning to predict clinical outcome of breast cancer
CN110688502B (en) Image retrieval method and storage medium based on depth hash and quantization
CN111667884A (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN107609352B (en) Prediction method of protein self-interaction
Wei et al. Projected residual vector quantization for ANN search
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114420310A (en) Medicine ATCCode prediction method based on graph transformation network
Kowalski et al. Determining significance of input neurons for probabilistic neural network by sensitivity analysis procedure
Yan et al. A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data
CN114023376A (en) RNA-protein binding site prediction method and system based on self-attention mechanism
CN113642613A (en) Medical disease characteristic selection method based on improved goblet sea squirt group algorithm
CN116612810A (en) Medicine target interaction prediction method based on interaction inference network
Zhang et al. protein2vec: predicting protein-protein interactions based on LSTM
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
CN118098372B (en) Virulence factor identification method and system based on self-attention coding and pooling mechanism
CN109920478B (en) Microorganism-disease relation prediction method based on similarity and low-rank matrix filling
CN115240775B (en) Cas protein prediction method based on stacking integrated learning strategy
CN111984800B (en) Hash cross-modal information retrieval method based on dictionary pair learning
CN115691817A (en) LncRNA-disease association prediction method based on fusion neural network
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
Okamura et al. Lcnme: Label correction using network prediction based on memorization effects for cross-modal retrieval with noisy labels
Sohail et al. Selection of optimal texture descriptors for retrieving ultrasound medical images
Wójcik Random projection in deep neural networks
Badea et al. Sparse factorizations of gene expression data guided by binding data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant