CN114566216B - Attention mechanism-based splice site prediction and interpretation method - Google Patents

Attention mechanism-based splice site prediction and interpretation method

Info

Publication number
CN114566216B
CN114566216B · Application CN202210178010.9A
Authority
CN
China
Prior art keywords
splice site
model
attention
species
neural network
Prior art date
Legal status
Active
Application number
CN202210178010.9A
Other languages
Chinese (zh)
Other versions
CN114566216A (en)
Inventor
张艳菊
许峻玮
齐王璟
王荣兴
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202210178010.9A
Publication of CN114566216A
Application granted
Publication of CN114566216B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a splice site prediction and interpretation method based on an attention mechanism. It provides a convolutional neural network model combined with an attention mechanism that accurately identifies splice sites, together with a visual weight interpretation analysis method based on this model, so that effective prediction models are established for five species. Results on the independent test sets of 10 datasets covering the five species demonstrate that the model of the present invention is more robust, better performing and more generalizable than existing models. To investigate why the convolutional neural network model combined with the attention mechanism achieves this better performance, the invention then adopts a gradient-weighted class activation mapping visualization technique to obtain the positional weight distribution of the model for each sample, and finally verifies that the model can automatically attend to and capture the effective features of the samples. The invention can improve prediction accuracy and carry out interpretive analysis of splice site sequences.

Description

Attention mechanism-based splice site prediction and interpretation method
Technical Field
The invention relates to the technical field of splice site recognition and prediction in genes, and in particular to a method for predicting and interpreting splice sites based on an attention mechanism.
Background
Splicing is a critical step in the expression of genetic information from a cell into a protein, so the correct recognition of splice sites is particularly important. Studies have shown that investigating splice sites not only helps researchers understand the splicing mechanism during the conversion from DNA to RNA, but also helps to deduce the constitutive structure of the transcript. Recent studies have shown that different splicing patterns of genes are associated with complex diseases such as lung cancer and depression. Studies combining splice sites with clinical disease have also analyzed the relationship between splicing events and the mechanisms of disease formation and occurrence.
Currently, in research on splice sites, researchers build models and predict successfully by extracting partial bases upstream and downstream of splice sites as data sets, then extracting features and learning the intrinsic information of sample sequences with machine learning algorithms. For example, Pertea et al. employed a decision tree algorithm, enhanced by a Markov model, to develop GeneSplicer, which captures information around splice sites. Degroeve et al. employed a linear support vector machine algorithm to construct the linear model SpliceMachine, which obtains effective information from a high-dimensional feature representation to predict splice sites. Baten et al. used the MM1 feature extraction method to extract features from splice site sequences and fed them into an SVM to distinguish true from false splice sites. These methods have an obvious drawback: researchers must manually acquire and then select features. Feature extraction by researchers relies on existing knowledge of splice sites, which to some extent limits the scope of features a model can learn, and such models may ignore feature information that is present in a sample sequence but not yet recognized. In recent years, researchers have introduced deep learning techniques to predict splice sites. For example, Du et al. constructed a deep model based on convolutional neural networks that predicts splice sites for both human and Caenorhabditis elegans datasets. Zuallaert et al. constructed the CNN-based SpliceRover model to predict splice sites and explained five assumptions made by the authors through algorithmic analysis.
Although the above methods achieve good performance, researchers still need to pursue better predictive performance. Furthermore, while deep learning techniques achieve high performance, it is often difficult for researchers to explain how deep learning affects the performance of a model.
Disclosure of Invention
Aiming at the problems of existing splice site identification methods, the invention provides a splice site prediction and interpretation method based on an attention mechanism. The method constructs a high-performance splice site prediction model based on a convolutional neural network combined with an attention mechanism, adopts a visualization technique to analyze the weights of different positions of a splice site sequence and conduct interpretive research, and finally analyzes whether the method can provide cross-species generalization capability while greatly improving model performance.
The technical solution for achieving the aim of the invention is as follows:
an attention-based method for splice site prediction and interpretation comprising the steps of:
1) Collecting splice site data sets of five species, and dividing the collected splice site data into positive and negative samples, wherein the positive and negative samples are further divided into a training set, a validation set and a test set;
2) Dividing the five-species splice site data obtained in step 1) into 10 sample data sets, since each species has both donor splice site samples and acceptor splice site samples, and converting the base sequences of the 10 sample data sets into one-hot codes;
3) Simulating the complex relations in the data with a multi-level nonlinear function and constructing a convolutional neural network model, whose expression is:

$$\text{Label of class} = f_{fcn}\big(f_{conv2}(f_{conv1}(\text{Sequence nucleotide signal}))\big)$$

where Label of class represents the final classification of the convolutional neural network model, Sequence nucleotide signal represents the input feature code corresponding to the base sequence, $f_{conv1}$ represents the first convolutional layer, $f_{conv2}$ represents the second convolutional layer, and $f_{fcn}$ represents passing the intermediate result of the input features, after convolution and the other steps, into the fully connected layer;
in the convolutional neural network model, the weights of the filter window connected to each neuron are fixed and the shared filter weights are slid according to translational invariance; the convolutional layer is composed of a group of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel, and the dot product result $z_{1,(i,j,k)}$ of the first convolutional layer is expressed as:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

where i, j and c represent the row, column and channel of the convolutional layer output, respectively, k indexes the filters of the current layer, and $b_{1,(k,1)}$ represents the bias value used in the convolution operation of filter k;
the convolutional layer output based on the three channels is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the rows, columns and channels of the input to the convolutional layer, respectively, l, m and n represent the rows, columns and channels of the filter, respectively, and k indexes the filter used by the current convolutional layer;
4) Building on the preliminary feature learning of the input feature codes in step 3), a Convolutional Block Attention Module (CBAM) is adopted to perform attention learning on the result of step 3), acquiring the key positions of the feature maps from two parts, channel attention and spatial attention. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$ is the final output of the attention module CBAM;
5) Based on the convolutional neural network model constructed in step 3) and the attention mechanism CBAM in step 4), constructing an attention-based convolutional neural network model, training it with the training set and validation set divided in step 1) and validating the model's output during the training process; each training run performs 30 iterations, and back-propagation updates use a cross-entropy loss function. With the predicted probabilities for the two classes being p and 1−p, the cross-entropy loss L is expressed as:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$$

where $L_i$ represents the loss of sample i, N represents the total number of samples, $y_i$ is the label of sample i (1 for the positive class, 0 for the negative class), and $p_i$ represents the probability that sample i is predicted to be positive;
6) Inputting the test set data into the model trained in step 5), obtaining the model's predictions and constructing a confusion matrix, and finally evaluating the performance of the donor and acceptor splice site models for the five species in terms of Accuracy (Acc), Specificity (Sp), Sensitivity (Sn), F-score ($F_1$) and the area under the receiver operating characteristic curve (AUC), specifically:

$$Acc = \frac{TP+TN}{TP+TN+FP+FN},\quad Sp = \frac{TN}{TN+FP},\quad Sn = \frac{TP}{TP+FN},\quad F_1 = \frac{2TP}{2TP+FP+FN}$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively;
7) Performing interpretive analysis: a gradient-weighted class activation mapping (Grad-CAM) visualization technique is adopted to interpret the internal representation and decision results of the convolutional neural network model. Grad-CAM first calculates the gradient of the model's score for class c with respect to the convolutional layer, and then, from the obtained gradient information, averages the gradient values on each channel (i.e. global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ represents the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c;
the Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps. A visualization technique is used to view the weight scores of different positions of the splice site sequences, finally obtaining heat maps and weight plots of the prediction scores at different positions;
8) Generalization analysis: cross-species interpretation analysis and analysis of common cross-species splice site rules are obtained from the interpretation results of the different species and the comparison of model performance.
In step 1), the splice site datasets of the five species include those of human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), fruit fly (Drosophila melanogaster) and nematode (Caenorhabditis elegans).
Compared with the prior art, the splice site prediction and interpretation method based on the attention mechanism has the following advantages:
1. The method is based on a convolutional neural network and innovatively introduces an attention mechanism and the Grad-CAM technique to process gene base sequences.
2. The method can achieve more excellent performance than the existing method on a plurality of different species;
3. The model trained by the method on the human species can be applied to Arabidopsis thaliana, japonica rice, Drosophila melanogaster and Caenorhabditis elegans, showing stronger generalization and robustness;
4. The method provides visual interpretability, which not only delivers better performance for researchers but also shows why the model is able to achieve that better performance.
Drawings
FIG. 1 is a general framework diagram of an attention-based splice site prediction and interpretation method according to an embodiment of the present invention;
FIG. 2 is a diagram of the splicing process in the examples.
FIG. 3 is a sample view of donor splice sites in an example;
FIG. 4 is a sample view of acceptor splice sites in an example;
Detailed Description
The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.
Examples:
In this example, splice site sequence features are learned by a convolutional neural network combined with an attention mechanism; the overall framework is shown in FIG. 1. The method section first introduces the data processing and sequence coding methods, then the overall framework of the convolutional neural network combined with the attention mechanism, and finally the Grad-CAM visualization technique used for interpretive analysis of the splice site prediction model.
Referring to FIG. 1, the attention-based splice site prediction and interpretation method comprises the following steps:
1) Data collection: as shown in FIG. 2, splice site information is derived from genomic DNA sequences, and the splicing operation occurs during transcription of the DNA sequence into mRNA. The data set of this example covers five species, namely human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), Drosophila melanogaster and Caenorhabditis elegans, as shown in FIG. 1. In each species the samples are further divided into donor splice sites (FIG. 3) and acceptor splice sites (FIG. 4).
TABLE 1 number of samples of donor and acceptor splice sites per species
As shown in Table 1, there are 10 data sets, each with equal numbers of positive and negative samples. The processed data set of each species is divided into a training set, a validation set and a test set.
2) Data processing: each sample has a sequence length of 602, comprising 300 bases upstream of the splice site, 300 bases downstream, and the two bases of the splice site itself (GT for a donor splice site or AG for an acceptor splice site). Finally, each sample sequence is one-hot encoded to obtain a vector representation of dimension [602, 4].
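As an illustration of this encoding step, the following Python sketch converts a base sequence into a [length, 4] one-hot matrix. It is a minimal example rather than the patented implementation; in particular, the channel order A, C, G, T and the all-zero handling of unknown bases are assumptions.

```python
import numpy as np

# Assumed channel order; the patent does not specify which column maps to which base.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Convert a base sequence (e.g. 602 nt around a splice site) into a [len, 4] one-hot matrix.

    Unknown bases such as 'N' are left as all-zero rows.
    """
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_TO_INDEX.get(base)
        if index is not None:
            encoding[position, index] = 1.0
    return encoding

# Example: a toy donor-site-like fragment (real samples are 602 nt long).
print(one_hot_encode("AAGGTAAGT").shape)  # -> (9, 4)
```

Stacking such matrices for the 602-nucleotide windows yields the [batch, 602, 4] tensors consumed by the network described next.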
3) Model construction: this embodiment introduces convolutional neural network technology and simulates the complex relations in the data with a multi-level nonlinear function. A convolutional neural network typically comprises an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. The input layer feeds the numerical representation of the splice site base sequence into the network. Equation (1) gives a simple neural network representation:

$$z_1 = g(\omega_1 a_0 + b_1) \tag{1}$$

where $z_1$ is the output of the current layer, $a_0$ is the input vector of the input layer, $\omega_1$ is the weight of the current layer, and $b_1$ is a bias term.
The convolutional layer is composed of a set of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

Taking channel i as an example, $z_{(1,i)}$ is:

$$z_{(1,i)} = x_i * \omega_{1,c(i)} + b_{1,i}$$

The output result is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the final output row, column and channel, respectively; l, m and n represent the rows, columns and channels of the filter; and k denotes the filter used by the current layer.
Each convolutional layer has hyper-parameter settings such as the number of filters, convolution kernel size, stride and activation function, and filters with different weights usually learn different effective features. ReLU can be chosen as the activation function; its main benefit is that its gradient does not saturate, which speeds up the convergence of stochastic gradient descent compared with other activation functions:

$$g(x) = \begin{cases} x, & x > b \\ 0, & x \le b \end{cases}$$

where b is the activation threshold, x is the input, and g(x) is the activation function's output value.
The pooling layer may perform average pooling, minimum pooling or maximum pooling; it mainly aggregates the spatial information of the feature maps, reducing the size of the vectors passed through the network. Max pooling is typically used to avoid overfitting and to help abstract the features learned in the earlier layers. The max-pooling layer is a common pooling technique that takes the maximum signal over non-overlapping windows for the further representation:

$$Z_k = \max(Y_{1,k}, \ldots, Y_{n,k})$$

where n is the max-pooling window size and k is the motif identifier.
After multiple convolution and pooling operations, there may be one or more fully connected layers, whose weights are no longer shared. The fully connected layer integrates the class-discriminative local information from the convolutional and pooling layers, and finally outputs the classification result through a softmax function:

$$f_i(z) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

where $f_i(z)$ represents the prediction score of the i-th class.
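To make the architecture described above concrete, the following PyTorch sketch assembles a two-convolutional-layer network with max pooling and a fully connected classifier for the [602, 4] one-hot inputs, trained with the cross-entropy loss mentioned in step 5) of the disclosure. The class name SpliceSiteCNN and all hyper-parameter values (filter count, kernel size, learning rate) are assumptions for illustration, not the patented configuration.

```python
import torch
import torch.nn as nn

class SpliceSiteCNN(nn.Module):
    """Two 1-D convolutional layers + max pooling + fully connected classifier (sketch)."""

    def __init__(self, seq_len: int = 602, n_filters: int = 64, kernel_size: int = 9):
        super().__init__()
        self.conv1 = nn.Conv1d(4, n_filters, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(n_filters, n_filters, kernel_size, padding=kernel_size // 2)
        self.pool = nn.MaxPool1d(2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(n_filters * (seq_len // 2), 2)  # 2 classes: true / false splice site

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, 602, 4] one-hot -> Conv1d expects [batch, channels, length]
        x = x.transpose(1, 2)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        return self.fc(x.flatten(1))  # logits; softmax is applied inside the loss

model = SpliceSiteCNN()
criterion = nn.CrossEntropyLoss()           # cross-entropy loss as in step 5) of the disclosure
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data standing in for an encoded mini-batch.
inputs, labels = torch.rand(8, 602, 4), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```

In the full model, the CBAM attention module described next would be inserted after the convolutional stages, and training would run for 30 iterations with validation after each.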
Meanwhile, this embodiment adopts the Convolutional Block Attention Module (CBAM), an attention module that combines channel information and spatial information and achieves better results than attention mechanisms that focus only on channel information. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$, obtained by processing $F_1$ with the spatial attention module, is the final output of the CBAM module.
Each channel of a feature map can be regarded as a feature detector. The channel attention module first aggregates the spatial information of the feature map using pooling operations to generate two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, representing the average-pooled feature and the max-pooled feature, respectively. The spatial dimension of the input feature map is then compressed by a shared network consisting of a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\mathbb{R}^{C/r\times 1\times 1}$, where r is the reduction ratio. After the shared network is applied to each descriptor, the outputs are merged by element-wise summation to produce the channel attention vector:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

where σ is the sigmoid function, $W_0 \in \mathbb{R}^{C/r\times C}$ and $W_1 \in \mathbb{R}^{C\times C/r}$; the two inputs share the MLP weights $W_0$ and $W_1$.
Unlike channel attention, spatial attention focuses on "where" the informative part is, which is complementary to channel attention. The spatial attention module takes the output feature vector of the channel attention module as input, concatenates the average-pooling and max-pooling results and convolves them, and then generates the spatial attention map through a sigmoid function; this map is multiplied with the module input, and the resulting two-dimensional feature map encodes the positions to be emphasized or suppressed:

$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where σ is the sigmoid function and $f^{7\times 7}$ denotes a convolution operation with a filter size of 7 × 7.
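The two attention sub-modules above can be sketched in PyTorch as follows. This is an illustrative re-implementation of the standard CBAM equations quoted in the text (the reduction ratio value, the module names and the toy feature-map shape are assumptions), not the patent's own code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # W1 (shared for both descriptors)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, C, H, W]
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)  # [batch, C, 1, 1]

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(conv7x7([AvgPool(F); MaxPool(F)]))."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))  # [batch, 1, H, W]

class CBAM(nn.Module):
    """F1 = Mc(F) * F, then F2 = Ms(F1) * F1."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, reduction)
        self.spatial_attention = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.channel_attention(x) * x
        return self.spatial_attention(f1) * f1

# Example: refine a 64-channel intermediate feature map (H = 1 for 1-D sequence features).
features = torch.rand(8, 64, 1, 301)
print(CBAM(64)(features).shape)  # torch.Size([8, 64, 1, 301])
```

In the full model such a module is placed after the convolutional layers so that the feature maps are re-weighted before the fully connected classifier.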
4) In the performance evaluation procedure, donor splice site prediction models and acceptor splice site prediction models were constructed on the 10 datasets of the five species, respectively. Tables 2 and 3 report Accuracy (Acc), Specificity (Sp), Sensitivity (Sn), F-score ($F_1$) and the area under the receiver operating characteristic curve (AUC) to evaluate the performance at the donor and acceptor splice sites of the five species.
TABLE 2 Performance of the five species at the donor splice site (Donor Splice Site, DoSS)
TABLE 3 Performance of the five species at the acceptor splice site (Acceptor Splice Site, AcSS)
This example compares the accuracy on the test set with that of different existing tools; the results are shown in Table 4. The present tool is superior to existing tools in both the acceptor and donor splice site prediction models for the four species Arabidopsis thaliana, japonica rice, Drosophila melanogaster and Caenorhabditis elegans. In the Drosophila melanogaster donor splice site model, the present tool is 3.68% higher than the best currently available tool. Although the accuracy of the present tool is 1.1% lower than Splice2Deep on the human acceptor and donor splice site prediction models, its AUC of 98.96 on the human acceptor splice site model is higher than that of Splice2Deep, and its AUC of 99.14 on the human donor splice site model is higher than Splice2Deep's 99.10. Accuracy divides predictions into positive and negative at a single default threshold of 0.5, whereas AUC evaluates the performance of the model across multiple thresholds. Intuitively, models with higher AUC values have better robustness and stability.
Table 4 Comparison of the accuracy (%) of the different tools on the five species. N/A indicates that the tool did not train a model for that species.
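As an illustration of how the confusion-matrix metrics reported above are computed from the test-set predictions, the following sketch derives Acc, Sp, Sn and F1 from TP/TN/FP/FN counts. The example labels are made up and do not correspond to any reported result.

```python
def confusion_metrics(y_true, y_pred):
    """Compute Acc, Sp, Sn and F1 from binary labels (1 = true splice site, 0 = false)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sp": tn / (tn + fp),
        "Sn": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Toy example (not real data): 6 test samples.
print(confusion_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

The AUC additionally requires the predicted probabilities rather than hard labels and can be obtained, for example, with sklearn.metrics.roc_auc_score.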
5) Model interpretability: this example adopts the gradient-weighted class activation mapping (Grad-CAM) visualization technique, which interprets the internal representation of the model and its decision results without modifying the model's structure or parameters. Grad-CAM uses gradient information to compute a heat map of the convolutional neural network over the input; the heat map highlights the input features related to the model's output, and the stronger the correlation, the more prominent the corresponding features in the saliency map. Intuitively, a gradient represents the amount by which a parameter is updated during training, and the larger the gradient at a data point, the more sensitive the output is to changes at that point and the stronger the correlation between that point and the output.
Grad-CAM calculates the gradient of the model's score for class c with respect to the convolutional layer, and then averages the gradient values on each channel (similar to global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ is the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c.
The Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps.
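A minimal Grad-CAM sketch in PyTorch is given below: it weights each feature map of a chosen 1-D convolutional layer by its channel-averaged gradient, sums, and applies ReLU, yielding a per-position weight map that can be plotted as a heat map. It assumes a model such as the SpliceSiteCNN sketch above and is illustrative only, not the patent's implementation.

```python
import torch

def grad_cam_1d(model, conv_layer, x, target_class):
    """Grad-CAM for a 1-D CNN: weight each feature map by its average gradient, sum, ReLU."""
    activations, gradients = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    score = model(x)[:, target_class].sum()   # S^c, the class-c score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]             # [batch, filters, length]
    weights = grads.mean(dim=2, keepdim=True)               # global average pooling of gradients
    cam = torch.relu((weights * acts).mean(dim=1))          # weighted sum (averaged), then ReLU
    return cam / (cam.max(dim=1, keepdim=True).values + 1e-8)  # normalize per sample

# Example with the SpliceSiteCNN sketch defined earlier (assumed):
# model = SpliceSiteCNN()
# heat = grad_cam_1d(model, model.conv2, torch.rand(1, 602, 4), target_class=1)
# heat then holds a [1, 602] positional weight map for plotting as a heat map.
```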

Claims (2)

1. A method for splice site prediction and interpretation based on an attention mechanism, comprising the steps of:
1) Collecting splice site data sets of five species, and dividing the collected splice site data into positive and negative samples, wherein the positive and negative samples are further divided into a training set, a validation set and a test set;
2) Dividing the five-species splice site data obtained in step 1) into 10 sample data sets, since each species has both donor splice site samples and acceptor splice site samples, and converting the base sequences of the 10 sample data sets into one-hot codes;
3) Simulating the complex relations in the data with a multi-level nonlinear function and constructing a convolutional neural network model, whose expression is:

$$\text{Label of class} = f_{fcn}\big(f_{conv2}(f_{conv1}(\text{Sequence nucleotide signal}))\big)$$

where Label of class represents the final classification of the convolutional neural network model, Sequence nucleotide signal represents the input feature code corresponding to the base sequence, $f_{conv1}$ represents the first convolutional layer, $f_{conv2}$ represents the second convolutional layer, and $f_{fcn}$ represents passing the intermediate result of the input features, after convolution and the other steps, into the fully connected layer;
in the convolutional neural network model, the weights of the filter window connected to each neuron are fixed and the shared filter weights are slid according to translational invariance; the convolutional layer is composed of a group of filters, and each sliding filter performs a dot product operation with the input vector. For the input x, there is a filter $\omega_{1,c}$ on each channel, and the dot product result $z_{1,(i,j,k)}$ of the first convolutional layer is expressed as:

$$z_{1,(i,j,k)} = (x * \omega_{1,c})_{i,j,k} + b_{1,(k,1)}$$

where i, j and c represent the row, column and channel of the convolutional layer output, respectively, k indexes the filters of the current layer, and $b_{1,(k,1)}$ represents the bias value used in the convolution operation of filter k;
the convolutional layer output based on the three channels is:

$$z_{1,(i,j,k)} = \sum_{l}\sum_{m}\sum_{n} \omega_{1,(l,m,n,k)}\, x_{(i+l-1,\; j+m-1,\; n)} + b_{1,(k,1)}$$

where i, j and c represent the rows, columns and channels of the input to the convolutional layer, respectively, l, m and n represent the rows, columns and channels of the filter, respectively, and k indexes the filter used by the current convolutional layer;
4) Building on the preliminary feature learning of the input feature codes in step 3), a Convolutional Block Attention Module (CBAM) is adopted to perform attention learning on the result of step 3), acquiring the key positions of the feature maps from two parts, channel attention and spatial attention. Given an intermediate feature map $F \in \mathbb{R}^{C\times H\times W}$ as input, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map, formulated as follows:

$$F_1 = M_c(F) \otimes F, \qquad F_2 = M_s(F_1) \otimes F_1$$

where $\otimes$ denotes element-wise multiplication, in which the attention values are broadcast accordingly; $F_1$ is the output of applying the channel attention module to the feature map F, and $F_2$ is the final output of the attention module CBAM;
5) Based on the convolutional neural network model constructed in step 3) and the attention mechanism CBAM in step 4), constructing an attention-based convolutional neural network model, training it with the training set and validation set divided in step 1) and validating the model's output during the training process; each training run performs 30 iterations, and back-propagation updates use a cross-entropy loss function. With the predicted probabilities for the two classes being p and 1−p, the cross-entropy loss L is expressed as:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$$

where $L_i$ represents the loss of sample i, N represents the total number of samples, $y_i$ is the label of sample i (1 for the positive class, 0 for the negative class), and $p_i$ represents the probability that sample i is predicted to be positive;
6) Inputting the test set data into the model trained in step 5), obtaining the model's predictions and constructing a confusion matrix, and finally evaluating the performance of the donor and acceptor splice site models for the five species in terms of the accuracy Acc, the specificity Sp, the sensitivity Sn, the F-score $F_1$ and the area under the receiver operating characteristic curve AUC, specifically:

$$Acc = \frac{TP+TN}{TP+TN+FP+FN},\quad Sp = \frac{TN}{TN+FP},\quad Sn = \frac{TP}{TP+FN},\quad F_1 = \frac{2TP}{2TP+FP+FN}$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively;
7) Performing interpretive analysis: a gradient-weighted class activation mapping (Grad-CAM) visualization technique is adopted to interpret the internal representation and decision results of the convolutional neural network model. Grad-CAM first calculates the gradient of the model's score for class c with respect to the convolutional layer, and then, from the obtained gradient information, averages the gradient values on each channel (i.e. global average pooling) to obtain the weight of each feature map; with a feature map of size $c_1 \times c_2$, the weight calculation formula is:

$$w_i^c = \frac{1}{Z}\sum_{k=1}^{c_1}\sum_{j=1}^{c_2}\frac{\partial S^c}{\partial A_{kj}^i}$$

where $w_i^c$ represents the weight of the i-th feature map for class c, Z represents the size of the feature map ($Z = c_1 \times c_2$), $A_{kj}^i$ represents the pixel value at the k-th row and j-th column of the i-th feature map, and $S^c$ is the classification score for class c;
the Grad-CAM result is calculated by weighted summation and averaging of the feature maps followed by the ReLU activation function:

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i^c A^i\right)$$

where $L_{Grad\text{-}CAM}^c$ represents the class activation mapping result for class c and n is the number of feature maps. A visualization technique is used to view the weight scores of different positions of the splice site sequences, finally obtaining heat maps and weight plots of the prediction scores at different positions;
8) Generalization analysis: cross-species interpretation analysis and analysis of common cross-species splice site rules are obtained from the interpretation results of the different species and the comparison of model performance.
2. The method of claim 1, wherein in step 1) the splice site data sets of the five species comprise splice site data sets of human (Homo sapiens), Arabidopsis thaliana, japonica rice (Oryza sativa Japonica), Drosophila melanogaster and Caenorhabditis elegans.
CN202210178010.9A 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method Active CN114566216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210178010.9A CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210178010.9A CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Publications (2)

Publication Number Publication Date
CN114566216A CN114566216A (en) 2022-05-31
CN114566216B true CN114566216B (en) 2024-04-02

Family

ID=81716301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210178010.9A Active CN114566216B (en) 2022-02-25 2022-02-25 Attention mechanism-based splice site prediction and interpretation method

Country Status (1)

Country Link
CN (1) CN114566216B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111798921A (en) * 2020-06-22 2020-10-20 武汉大学 RNA binding protein prediction method and device based on multi-scale attention convolution neural network
CN112395442A (en) * 2020-10-12 2021-02-23 杭州电子科技大学 Automatic identification and content filtering method for popular pictures on mobile internet
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113178227A (en) * 2021-04-30 2021-07-27 西安交通大学 Method, system, device and storage medium for identifying multiomic fusion splice sites

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gene splice site prediction based on convolutional neural network (基于卷积神经网络的基因剪接位点预测); 李国斌, 杜秀全, 李新路, 吴志泽; Journal of Yancheng Institute of Technology (Natural Science Edition); 2020-06-20 (No. 02); full text *

Also Published As

Publication number Publication date
CN114566216A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN110728224B (en) Remote sensing image classification method based on attention mechanism depth Contourlet network
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN108985360B (en) Hyperspectral classification method based on extended morphology and active learning
CN112801942A (en) Citrus huanglongbing image identification method based on attention mechanism
CN112489723B (en) DNA binding protein prediction method based on local evolution information
CN114566216B (en) Attention mechanism-based splice site prediction and interpretation method
CN115659174A (en) Multi-sensor fault diagnosis method, medium and equipment based on graph regularization CNN-BilSTM
JP7490168B1 (en) Method, device, equipment, and medium for mining biosynthetic pathways of marine nutrients
CN113887342A (en) Equipment fault diagnosis method based on multi-source signals and deep learning
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
CN115206423A (en) Label guidance-based protein action relation prediction method
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114783526A (en) Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder
Zhou et al. schicsc: A novel single-cell hi-c clustering framework by contact-weight-based smoothing and feature fusion
CN112613391B (en) Hyperspectral image waveband selection method based on reverse learning binary rice breeding algorithm
CN112270950B (en) Network enhancement and graph regularization-based fusion network drug target relation prediction method
Bai et al. A unified deep learning model for protein structure prediction
Arowolo et al. Enhanced dimensionality reduction methods for classifying malaria vector dataset using decision tree
Ma et al. Kernel soft-neighborhood network fusion for MiRNA-disease interaction prediction
CN114566215B (en) Double-end paired splice site prediction method
CN114758721B (en) Deep learning-based transcription factor binding site positioning method
CN113192562B (en) Pathogenic gene identification method and system fusing multi-scale module structure information
CN115691817A (en) LncRNA-disease association prediction method based on fusion neural network
Cudic et al. Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs
Sun et al. MOBS-TD: Multi-Objective Band Selection with Ideal Solution Optimization Strategy for Hyperspectral Target Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant