CN114566216B - Attention mechanism-based splice site prediction and interpretation method - Google Patents

Attention mechanism-based splice site prediction and interpretation method

Info

Publication number
CN114566216B
CN114566216B · Application CN202210178010.9A
Authority
CN
China
Prior art keywords
splice site
model
attention
species
neural network
Prior art date
Legal status
Active
Application number
CN202210178010.9A
Other languages
Chinese (zh)
Other versions
CN114566216A (en)
Inventor
张艳菊
许峻玮
齐王璟
王荣兴
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202210178010.9A
Publication of CN114566216A
Application granted
Publication of CN114566216B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a splice site prediction and interpretation method based on an attention mechanism. It provides a convolutional neural network model combined with an attention mechanism that accurately identifies splice sites, together with a visual weight interpretation analysis method based on this model, so that effective prediction models are established for five species. Results on the independent test sets of 10 datasets covering the five species demonstrate that the model of the present invention is more robust, better performing and more generalizable than existing models. To investigate why the convolutional neural network model combined with the attention mechanism achieves this better performance, the invention then adopts a gradient-weighted class activation mapping visualization technique to obtain the positional weight distribution of the model for each sample, and finally verifies that the model can automatically attend to and capture the effective features of the samples. The invention can improve prediction accuracy and carry out interpretive analysis of splice site sequences.

Description

Attention mechanism-based splice site prediction and interpretation method
Technical Field
The invention relates to the technical field of splice site recognition and prediction in genes, and in particular to a method for predicting and interpreting splice sites based on an attention mechanism.
Background
Splicing is a critical step in the expression of genetic information from a cell into a protein, so the correct recognition of splice sites is particularly important. Studies have shown that investigating splice sites not only helps researchers understand the splicing mechanism during the conversion from DNA to RNA, but also helps to deduce the constitutive structure of the transcript. Recent studies have shown that different splicing patterns of genes are associated with complex diseases such as lung cancer and depression. Studies combining splice sites with clinical disease have also analyzed the relationship between splicing events and the mechanisms of disease formation and occurrence.
Currently, in research on splice sites, researchers build models and predict successfully by extracting partial bases upstream and downstream of splice sites as data sets, then extracting features and learning the intrinsic information of sample sequences with machine learning algorithms. For example, Pertea et al. employed a decision tree algorithm, enhanced by a Markov model, to develop GeneSplicer, which captures information around splice sites. Degroeve et al. employed a linear support vector machine algorithm to construct the linear model SpliceMachine, which obtains effective information from a high-dimensional feature representation to predict splice sites. Baten et al. used the MM1 feature extraction method to extract features from splice site sequences and fed them into an SVM to distinguish true from false splice sites. These methods have an obvious drawback: researchers must manually acquire and then select features. Feature extraction by researchers relies on existing knowledge of splice sites, which to some extent limits the scope of features a model can learn, and such models may ignore feature information that is present in a sample sequence but not yet recognized. In recent years, researchers have introduced deep learning techniques to predict splice sites. For example, Du et al. constructed a deep model based on convolutional neural networks that predicts splice sites for both human and Caenorhabditis elegans datasets. Zuallaert et al. constructed the CNN-based SpliceRover model to predict splice sites and explained five assumptions made by the authors through algorithmic analysis.
Although the above methods achieve good performance, researchers still need to pursue better predictive performance. Furthermore, while deep learning techniques achieve high performance, it is often difficult for researchers to explain how deep learning affects the performance of a model.
Disclosure of Invention
Aiming at the problems of existing splice site identification methods, the invention provides a splice site prediction and interpretation method based on an attention mechanism. The method constructs a high-performance splice site prediction model based on a convolutional neural network combined with an attention mechanism, adopts a visualization technique to analyze the weights of different positions of a splice site sequence and conduct interpretive research, and finally analyzes whether the method can provide cross-species generalization capability while greatly improving model performance.
The technical solution for achieving the aim of the invention is as follows:
an attention-based method for splice site prediction and interpretation comprising the steps of:
1) Collecting splice site data sets of five species, and dividing the collected splice site data into positive and negative samples, wherein the positive and negative samples are further divided into a training set, a validation set and a test set;
2) Dividing the five-species splice site data obtained in step 1) into 10 sample data sets, since each species has both donor splice site samples and acceptor splice site samples, and converting the base sequences of the 10 sample data sets into one-hot codes;
3) Simulating the complex relations in the data with a multi-level nonlinear function and constructing a convolutional neural network model, whose expression is:

$$\text{Label of class} = f_{fcn}\big(f_{conv2}(f_{conv1}(\text{Sequence nucleotide signal}))\big)$$

where Label of class represents the final classification of the convolutional neural network model, Sequence nucleotide signal represents the input feature code corresponding to the base sequence, $f_{conv1}$ represents the first convolutional layer, $f_{conv2}$ represents the second convolutional layer, and $f_{fcn}$ represents passing the intermediate result of the input features, after convolution and the other steps, into the fully connected layer;
in the convolutional neural network model, the weights of the filter window connected to each neuron are fixed and the shared filter weights are slid according to translational invariance; the convolutional layer is composed of a group of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel, and the dot product result $z_{1,(i,j,k)}$ of the first convolutional layer is expressed as:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

where i, j and c represent the row, column and channel of the convolutional layer output, respectively, k indexes the filters of the current layer, and $b_{1,(k,1)}$ represents the bias value used in the convolution operation of filter k;
the convolutional layer output based on the three channels is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the rows, columns and channels of the input to the convolutional layer, respectively, l, m and n represent the rows, columns and channels of the filter, respectively, and k indexes the filter used by the current convolutional layer;
4) Building on the preliminary feature learning of the input feature codes in step 3), a Convolutional Block Attention Module (CBAM) is adopted to perform attention learning on the result of step 3), acquiring the key positions of the feature maps from two parts, channel attention and spatial attention. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$ is the final output of the attention module CBAM;
5) Based on the convolutional neural network model constructed in step 3) and the attention mechanism CBAM in step 4), constructing an attention-based convolutional neural network model, training it with the training set and validation set divided in step 1) and validating the model's output during the training process; each training run performs 30 iterations, and back-propagation updates use a cross-entropy loss function. With the predicted probabilities for the two classes being p and 1−p, the cross-entropy loss L is expressed as:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$$

where $L_i$ represents the loss of sample i, N represents the total number of samples, $y_i$ is the label of sample i (1 for the positive class, 0 for the negative class), and $p_i$ represents the probability that sample i is predicted to be positive;
6) Inputting the test set data into the model trained in step 5), obtaining the model's predictions and constructing a confusion matrix, and finally evaluating the performance of the donor and acceptor splice site models for the five species in terms of Accuracy (Acc), Specificity (Sp), Sensitivity (Sn), F-score ($F_1$) and the area under the receiver operating characteristic curve (AUC), specifically:

$$Acc = \frac{TP+TN}{TP+TN+FP+FN},\quad Sp = \frac{TN}{TN+FP},\quad Sn = \frac{TP}{TP+FN},\quad F_1 = \frac{2TP}{2TP+FP+FN}$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively;
7) Performing interpretive analysis: a gradient-weighted class activation mapping (Grad-CAM) visualization technique is adopted to interpret the internal representation and decision results of the convolutional neural network model. Grad-CAM first calculates the gradient of the model's score for class c with respect to the convolutional layer, and then, from the obtained gradient information, averages the gradient values on each channel (i.e. global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ represents the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c;
the Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps. A visualization technique is used to view the weight scores of different positions of the splice site sequences, finally obtaining heat maps and weight plots of the prediction scores at different positions;
8) Generalization analysis: cross-species interpretation analysis and analysis of common cross-species splice site rules are obtained from the interpretation results of the different species and the comparison of model performance.
In step 1), the splice site datasets of the five species include those of human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), fruit fly (Drosophila melanogaster) and nematode (Caenorhabditis elegans).
Compared with the prior art, the splice site prediction and interpretation method based on the attention mechanism has the following advantages:
1. The method is based on a convolutional neural network and innovatively introduces an attention mechanism and the Grad-CAM technique to process gene base sequences.
2. The method can achieve more excellent performance than the existing method on a plurality of different species;
3. The model trained by the method on the human species can be applied to Arabidopsis thaliana, japonica rice, Drosophila melanogaster and Caenorhabditis elegans, showing stronger generalization and robustness;
4. The method provides visual interpretability, which not only delivers better performance for researchers but also shows why the model is able to achieve that better performance.
Drawings
FIG. 1 is a general framework diagram of an attention-based splice site prediction and interpretation method according to an embodiment of the present invention;
FIG. 2 is a diagram of the splicing process in the examples.
FIG. 3 is a sample view of donor splice sites in an example;
FIG. 4 is a sample view of acceptor splice sites in an example;
Detailed Description
The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.
Examples:
In this example, splice site sequence features are learned by a convolutional neural network combined with an attention mechanism; the overall framework is shown in FIG. 1. The method section first introduces the data processing and sequence coding methods, then the overall framework of the convolutional neural network combined with the attention mechanism, and finally the Grad-CAM visualization technique used for interpretive analysis of the splice site prediction model.
Referring to FIG. 1, the attention-based splice site prediction and interpretation method comprises the following steps:
1) Data collection: as shown in FIG. 2, splice site information is derived from genomic DNA sequences, and the splicing operation occurs during transcription of the DNA sequence into mRNA. The data set of this example covers five species, namely human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), Drosophila melanogaster and Caenorhabditis elegans, as shown in FIG. 1. In each species the samples are further divided into donor splice sites (FIG. 3) and acceptor splice sites (FIG. 4).
TABLE 1 number of samples of donor and acceptor splice sites per species
As shown in Table 1, there are 10 data sets, each with equal numbers of positive and negative samples. The processed data set of each species is divided into a training set, a validation set and a test set.
2) Data processing: each sample has a sequence length of 602, comprising 300 bases upstream of the splice site, 300 bases downstream, and the two bases of the splice site itself (GT for a donor splice site or AG for an acceptor splice site). Finally, each sample sequence is one-hot encoded to obtain a vector representation of dimension [602, 4].
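As an illustration of this encoding step, the following Python sketch converts a base sequence into a [length, 4] one-hot matrix. It is a minimal example rather than the patented implementation; in particular, the channel order A, C, G, T and the all-zero handling of unknown bases are assumptions.

```python
import numpy as np

# Assumed channel order; the patent does not specify which column maps to which base.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Convert a base sequence (e.g. 602 nt around a splice site) into a [len, 4] one-hot matrix.

    Unknown bases such as 'N' are left as all-zero rows.
    """
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_TO_INDEX.get(base)
        if index is not None:
            encoding[position, index] = 1.0
    return encoding

# Example: a toy donor-site-like fragment (real samples are 602 nt long).
print(one_hot_encode("AAGGTAAGT").shape)  # -> (9, 4)
```

Stacking such matrices for the 602-nucleotide windows yields the [batch, 602, 4] tensors consumed by the network described next.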
3) Model construction: this embodiment introduces convolutional neural network technology and simulates the complex relations in the data with a multi-level nonlinear function. A convolutional neural network typically comprises an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. The input layer feeds the numerical representation of the splice site base sequence into the network. Equation (1) gives a simple neural network representation:

$$z_1 = g(\omega_1 a_0 + b_1) \tag{1}$$

where $z_1$ is the output of the current layer, $a_0$ is the input vector of the input layer, $\omega_1$ is the weight of the current layer, and $b_1$ is a bias term.
The convolutional layer is composed of a set of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

Taking channel i as an example, $z_{(1,i)}$ is:

$$z_{(1,i)} = x_i * \omega_{1,c(i)} + b_{1,i}$$

The output result is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the final output row, column and channel, respectively; l, m and n represent the rows, columns and channels of the filter; and k denotes the filter used by the current layer.
Each convolutional layer has hyper-parameter settings such as the number of filters, convolution kernel size, stride and activation function, and filters with different weights usually learn different effective features. ReLU can be chosen as the activation function; its main benefit is that its gradient does not saturate, which speeds up the convergence of stochastic gradient descent compared with other activation functions:

$$g(x) = \begin{cases} x, & x > b \\ 0, & x \le b \end{cases}$$

where b is the activation threshold, x is the input, and g(x) is the activation function's output value.
The pooling layer may perform average pooling, minimum pooling or maximum pooling; it mainly aggregates the spatial information of the feature maps, reducing the size of the vectors passed through the network. Max pooling is typically used to avoid overfitting and to help abstract the features learned in the earlier layers. The max-pooling layer is a common pooling technique that takes the maximum signal over non-overlapping windows for the further representation:

$$Z_k = \max(Y_{1,k}, \ldots, Y_{n,k})$$

where n is the max-pooling window size and k is the motif identifier.
After multiple convolution and pooling operations, there may be one or more fully connected layers, whose weights are no longer shared. The fully connected layer integrates the class-discriminative local information from the convolutional and pooling layers, and finally outputs the classification result through a softmax function:

$$f_i(z) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

where $f_i(z)$ represents the prediction score of the i-th class.
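To make the architecture described above concrete, the following PyTorch sketch assembles a two-convolutional-layer network with max pooling and a fully connected classifier for the [602, 4] one-hot inputs, trained with the cross-entropy loss mentioned in step 5) of the disclosure. The class name SpliceSiteCNN and all hyper-parameter values (filter count, kernel size, learning rate) are assumptions for illustration, not the patented configuration.

```python
import torch
import torch.nn as nn

class SpliceSiteCNN(nn.Module):
    """Two 1-D convolutional layers + max pooling + fully connected classifier (sketch)."""

    def __init__(self, seq_len: int = 602, n_filters: int = 64, kernel_size: int = 9):
        super().__init__()
        self.conv1 = nn.Conv1d(4, n_filters, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(n_filters, n_filters, kernel_size, padding=kernel_size // 2)
        self.pool = nn.MaxPool1d(2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(n_filters * (seq_len // 2), 2)  # 2 classes: true / false splice site

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, 602, 4] one-hot -> Conv1d expects [batch, channels, length]
        x = x.transpose(1, 2)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        return self.fc(x.flatten(1))  # logits; softmax is applied inside the loss

model = SpliceSiteCNN()
criterion = nn.CrossEntropyLoss()           # cross-entropy loss as in step 5) of the disclosure
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data standing in for an encoded mini-batch.
inputs, labels = torch.rand(8, 602, 4), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```

In the full model, the CBAM attention module described next would be inserted after the convolutional stages, and training would run for 30 iterations with validation after each.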
Meanwhile, this embodiment adopts the Convolutional Block Attention Module (CBAM), an attention module that combines channel information and spatial information and achieves better results than attention mechanisms that focus only on channel information. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$, obtained by processing $F_1$ with the spatial attention module, is the final output of the CBAM module.
Each channel of a feature map can be regarded as a feature detector. The channel attention module first aggregates the spatial information of the feature map using pooling operations to generate two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, representing the average-pooled feature and the max-pooled feature, respectively. The spatial dimension of the input feature map is then compressed by a shared network consisting of a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\mathbb{R}^{C/r\times 1\times 1}$, where r is the reduction ratio. After the shared network is applied to each descriptor, the outputs are merged by element-wise summation to produce the channel attention vector:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

where σ is the sigmoid function, $W_0 \in \mathbb{R}^{C/r\times C}$ and $W_1 \in \mathbb{R}^{C\times C/r}$; the two inputs share the MLP weights $W_0$ and $W_1$.
Unlike channel attention, spatial attention focuses on "where" the informative part is, which is complementary to channel attention. The spatial attention module takes the output feature vector of the channel attention module as input, concatenates the average-pooling and max-pooling results and convolves them, and then generates the spatial attention map through a sigmoid function; this map is multiplied with the module input, and the resulting two-dimensional feature map encodes the positions to be emphasized or suppressed:

$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where σ is the sigmoid function and $f^{7\times 7}$ denotes a convolution operation with a filter size of 7 × 7.
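The two attention sub-modules above can be sketched in PyTorch as follows. This is an illustrative re-implementation of the standard CBAM equations quoted in the text (the reduction ratio value, the module names and the toy feature-map shape are assumptions), not the patent's own code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # W1 (shared for both descriptors)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, C, H, W]
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)  # [batch, C, 1, 1]

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(conv7x7([AvgPool(F); MaxPool(F)]))."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))  # [batch, 1, H, W]

class CBAM(nn.Module):
    """F1 = Mc(F) * F, then F2 = Ms(F1) * F1."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, reduction)
        self.spatial_attention = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.channel_attention(x) * x
        return self.spatial_attention(f1) * f1

# Example: refine a 64-channel intermediate feature map (H = 1 for 1-D sequence features).
features = torch.rand(8, 64, 1, 301)
print(CBAM(64)(features).shape)  # torch.Size([8, 64, 1, 301])
```

In the full model such a module is placed after the convolutional layers so that the feature maps are re-weighted before the fully connected classifier.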
4) In the performance evaluation procedure, donor splice site prediction models and acceptor splice site prediction models were constructed on the 10 datasets of the five species, respectively. Tables 2 and 3 report Accuracy (Acc), Specificity (Sp), Sensitivity (Sn), F-score ($F_1$) and the area under the receiver operating characteristic curve (AUC) to evaluate the performance at the donor and acceptor splice sites of the five species.
TABLE 2 Performance of the five species at the donor splice site (Donor Splice Site, DoSS)
TABLE 3 Performance of the five species at the acceptor splice site (Acceptor Splice Site, AcSS)
This example compares the accuracy on the test set with that of different existing tools; the results are shown in Table 4. The present tool is superior to existing tools in both the acceptor and donor splice site prediction models for the four species Arabidopsis thaliana, japonica rice, Drosophila melanogaster and Caenorhabditis elegans. In the Drosophila melanogaster donor splice site model, the present tool is 3.68% higher than the best currently available tool. Although the accuracy of the present tool is 1.1% lower than Splice2Deep on the human acceptor and donor splice site prediction models, its AUC of 98.96 on the human acceptor splice site model is higher than that of Splice2Deep, and its AUC of 99.14 on the human donor splice site model is higher than Splice2Deep's 99.10. Accuracy divides predictions into positive and negative at a single default threshold of 0.5, whereas AUC evaluates the performance of the model across multiple thresholds. Intuitively, models with higher AUC values have better robustness and stability.
Table 4 Comparison of the accuracy (%) of the different tools on the five species. N/A indicates that the tool did not train a model for that species.
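As an illustration of how the confusion-matrix metrics reported above are computed from the test-set predictions, the following sketch derives Acc, Sp, Sn and F1 from TP/TN/FP/FN counts. The example labels are made up and do not correspond to any reported result.

```python
def confusion_metrics(y_true, y_pred):
    """Compute Acc, Sp, Sn and F1 from binary labels (1 = true splice site, 0 = false)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sp": tn / (tn + fp),
        "Sn": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Toy example (not real data): 6 test samples.
print(confusion_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

The AUC additionally requires the predicted probabilities rather than hard labels and can be obtained, for example, with sklearn.metrics.roc_auc_score.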
5) Model interpretability: this example adopts the gradient-weighted class activation mapping (Grad-CAM) visualization technique, which interprets the internal representation of the model and its decision results without modifying the model's structure or parameters. Grad-CAM uses gradient information to compute a heat map of the convolutional neural network over the input; the heat map highlights the input features related to the model's output, and the stronger the correlation, the more prominent the corresponding features in the saliency map. Intuitively, a gradient represents the amount by which a parameter is updated during training, and the larger the gradient at a data point, the more sensitive the output is to changes at that point and the stronger the correlation between that point and the output.
Grad-CAM calculates the gradient of the model's score for class c with respect to the convolutional layer, and then averages the gradient values on each channel (similar to global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ is the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c.
The Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps.
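A minimal Grad-CAM sketch in PyTorch is given below: it weights each feature map of a chosen 1-D convolutional layer by its channel-averaged gradient, sums, and applies ReLU, yielding a per-position weight map that can be plotted as a heat map. It assumes a model such as the SpliceSiteCNN sketch above and is illustrative only, not the patent's implementation.

```python
import torch

def grad_cam_1d(model, conv_layer, x, target_class):
    """Grad-CAM for a 1-D CNN: weight each feature map by its average gradient, sum, ReLU."""
    activations, gradients = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    score = model(x)[:, target_class].sum()   # S^c, the class-c score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]             # [batch, filters, length]
    weights = grads.mean(dim=2, keepdim=True)               # global average pooling of gradients
    cam = torch.relu((weights * acts).mean(dim=1))          # weighted sum (averaged), then ReLU
    return cam / (cam.max(dim=1, keepdim=True).values + 1e-8)  # normalize per sample

# Example with the SpliceSiteCNN sketch defined earlier (assumed):
# model = SpliceSiteCNN()
# heat = grad_cam_1d(model, model.conv2, torch.rand(1, 602, 4), target_class=1)
# heat then holds a [1, 602] positional weight map for plotting as a heat map.
```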

Claims (2)

1. A method for splice site prediction and interpretation based on an attention mechanism, comprising the steps of:
1) Collecting splice site data sets of five species, and dividing the collected splice site data into positive and negative samples, wherein the positive and negative samples are further divided into a training set, a validation set and a test set;
2) Dividing the five-species splice site data obtained in step 1) into 10 sample data sets, since each species has both donor splice site samples and acceptor splice site samples, and converting the base sequences of the 10 sample data sets into one-hot codes;
3) Simulating the complex relations in the data with a multi-level nonlinear function and constructing a convolutional neural network model, whose expression is:

$$\text{Label of class} = f_{fcn}\big(f_{conv2}(f_{conv1}(\text{Sequence nucleotide signal}))\big)$$

where Label of class represents the final classification of the convolutional neural network model, Sequence nucleotide signal represents the input feature code corresponding to the base sequence, $f_{conv1}$ represents the first convolutional layer, $f_{conv2}$ represents the second convolutional layer, and $f_{fcn}$ represents passing the intermediate result of the input features, after convolution and the other steps, into the fully connected layer;
in the convolutional neural network model, the weights of the filter window connected to each neuron are fixed and the shared filter weights are slid according to translational invariance; the convolutional layer is composed of a group of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel, and the dot product result $z_{1,(i,j,k)}$ of the first convolutional layer is expressed as:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

where i, j and c represent the row, column and channel of the convolutional layer output, respectively, k indexes the filters of the current layer, and $b_{1,(k,1)}$ represents the bias value used in the convolution operation of filter k;
the convolutional layer output based on the three channels is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the rows, columns and channels of the input to the convolutional layer, respectively, l, m and n represent the rows, columns and channels of the filter, respectively, and k indexes the filter used by the current convolutional layer;
4) Building on the preliminary feature learning of the input feature codes in step 3), a Convolutional Block Attention Module (CBAM) is adopted to perform attention learning on the result of step 3), acquiring the key positions of the feature maps from two parts, channel attention and spatial attention. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$ is the final output of the attention module CBAM;
5) Based on the convolutional neural network model constructed in step 3) and the attention mechanism CBAM in step 4), constructing an attention-based convolutional neural network model, training it with the training set and validation set divided in step 1) and validating the model's output during the training process; each training run performs 30 iterations, and back-propagation updates use a cross-entropy loss function. With the predicted probabilities for the two classes being p and 1−p, the cross-entropy loss L is expressed as:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$$

where $L_i$ represents the loss of sample i, N represents the total number of samples, $y_i$ is the label of sample i (1 for the positive class, 0 for the negative class), and $p_i$ represents the probability that sample i is predicted to be positive;
6) Inputting the test set data into the model trained in step 5), obtaining the model's predictions and constructing a confusion matrix, and finally evaluating the performance of the donor and acceptor splice site models for the five species in terms of the accuracy Acc, the specificity Sp, the sensitivity Sn, the F-score $F_1$ and the area under the receiver operating characteristic curve AUC, specifically:

$$Acc = \frac{TP+TN}{TP+TN+FP+FN},\quad Sp = \frac{TN}{TN+FP},\quad Sn = \frac{TP}{TP+FN},\quad F_1 = \frac{2TP}{2TP+FP+FN}$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively;
7) Performing interpretive analysis: a gradient-weighted class activation mapping (Grad-CAM) visualization technique is adopted to interpret the internal representation and decision results of the convolutional neural network model. Grad-CAM first calculates the gradient of the model's score for class c with respect to the convolutional layer, and then, from the obtained gradient information, averages the gradient values on each channel (i.e. global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ represents the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c;
the Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps. A visualization technique is used to view the weight scores of different positions of the splice site sequences, finally obtaining heat maps and weight plots of the prediction scores at different positions;
8) Generalization analysis: cross-species interpretation analysis and analysis of common cross-species splice site rules are obtained from the interpretation results of the different species and the comparison of model performance.
2. The method of claim 1, wherein in step 1) the splice site data sets of the five species comprise splice site data sets of human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), Drosophila melanogaster and Caenorhabditis elegans.
CN202210178010.9A 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method Active CN114566216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210178010.9A CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210178010.9A CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Publications (2)

Publication Number Publication Date
CN114566216A CN114566216A (en) 2022-05-31
CN114566216B true CN114566216B (en) 2024-04-02

Family

ID=81716301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210178010.9A Active CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Country Status (1)

Country Link
CN (1) CN114566216B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111798921A (en) * 2020-06-22 2020-10-20 武汉大学 RNA binding protein prediction method and device based on multi-scale attention convolution neural network
CN112395442A (en) * 2020-10-12 2021-02-23 杭州电子科技大学 Automatic identification and content filtering method for popular pictures on mobile internet
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113178227A (en) * 2021-04-30 2021-07-27 西安交通大学 Method, system, device and storage medium for identifying multiomic fusion splice sites

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gene splice site prediction based on convolutional neural network (基于卷积神经网络的基因剪接位点预测); 李国斌, 杜秀全, 李新路, 吴志泽; Journal of Yancheng Institute of Technology (Natural Science Edition); 2020-06-20 (No. 02); full text *

Also Published As

Publication number Publication date
CN114566216A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN110728224B (en) Remote sensing image classification method based on attention mechanism depth Contourlet network
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN108985360B (en) Hyperspectral classification method based on extended morphology and active learning
CN112801942A (en) Citrus huanglongbing image identification method based on attention mechanism
CN112489723B (en) DNA binding protein prediction method based on local evolution information
CN114566216B (en) Attention mechanism-based splice site prediction and interpretation method
CN115659174A (en) Multi-sensor fault diagnosis method, medium and equipment based on graph regularization CNN-BilSTM
JP7490168B1 (en) Method, device, equipment, and medium for mining biosynthetic pathways of marine nutrients
CN113887342A (en) Equipment fault diagnosis method based on multi-source signals and deep learning
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
CN115206423A (en) Label guidance-based protein action relation prediction method
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114783526A (en) Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder
Zhou et al. schicsc: A novel single-cell hi-c clustering framework by contact-weight-based smoothing and feature fusion
CN112613391B (en) Hyperspectral image waveband selection method based on reverse learning binary rice breeding algorithm
CN112270950B (en) Network enhancement and graph regularization-based fusion network drug target relation prediction method
Bai et al. A unified deep learning model for protein structure prediction
Arowolo et al. Enhanced dimensionality reduction methods for classifying malaria vector dataset using decision tree
Ma et al. Kernel soft-neighborhood network fusion for MiRNA-disease interaction prediction
CN114566215B (en) Double-end paired splice site prediction method
CN114758721B (en) Deep learning-based transcription factor binding site positioning method
CN113192562B (en) Pathogenic gene identification method and system fusing multi-scale module structure information
CN115691817A (en) LncRNA-disease association prediction method based on fusion neural network
Cudic et al. Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs
Sun et al. MOBS-TD: Multi-Objective Band Selection with Ideal Solution Optimization Strategy for Hyperspectral Target Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant