CN116884495B - Diffusion model-based long tail chromatin state prediction method - Google Patents
- Publication number
- CN116884495B (application CN202310991350.8A)
- Authority
- CN
- China
- Prior art keywords
- dna sequence
- chromatin state
- noise
- chromatin
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a long-tail chromatin state prediction method based on a diffusion model, comprising the following steps: S1, obtaining an original DNA sequence and processing it to obtain DNA encoding data; S2, constructing a DNA sequence diffusion model based on the DNA encoding data; S3, using a UNet-based noise predictor, performing the reverse process of the conditional DNA sequence diffusion model to obtain a dataset balanced across chromatin state classes; S4, constructing a chromatin state prediction model from the balanced dataset using the back-propagation algorithm. The invention uses the DNA sequence diffusion model to generate DNA sequences of tail-class chromatin states from noise, thereby balancing the samples; a chromatin state prediction model is then trained on the class-balanced dataset, and this model can effectively capture the grammar rules of genes and thus predict chromatin states accurately.
Description
Technical Field
The invention belongs to the technical field of chromatin state prediction, and particularly relates to a long tail chromatin state prediction method based on a diffusion model.
Background
Chromatin state refers to the distinct structural and functional states that chromatin adopts in different cell types. It has attracted growing interest because it reflects, among other things, the function and status of cells. Epigenetic modification of DNA sequences is a major factor in determining chromatin state. For example, the prior art defines 15 chromatin states with different biological roles by mapping 9 chromatin marks. Similarly, the prior art defines 18 chromatin states from 6 histone marks using ChIP-seq data. These studies indicate that chromatin states follow a long-tailed distribution, with some states far more abundant than others. For example, the number of enhancers is significantly greater than the number of insulators. Although genomic assays such as ChIP-seq can reveal chromatin state, they require expensive and time-consuming experiments. Computational methods for long-tail chromatin state prediction are therefore urgently needed.
Many efforts have been made to predict chromatin state with deep learning algorithms. DeepSEA is a pioneering work that builds a CNN to predict 919 chromatin features from DNA sequences. Following DeepSEA, numerous researchers have made valuable explorations and breakthroughs in improving chromatin state prediction, mainly focusing on model architecture. The prior art proposes a simple and efficient method consisting of a single CNN layer, a BiLSTM layer and a fully connected layer, in which the CNN learns motif information and the BiLSTM learns regulatory grammar. The prior art also proposes a hybrid DNN model, DeepFormer, which combines a CNN with flow attention to achieve accurate chromatin feature prediction with a limited number of parameters. The prior art further enlarges the receptive field by integrating dilated convolution, without reducing spatial resolution, for better performance. However, these methods often ignore the long-tail problem among chromatin states. In particular, some methods balance samples by shuffling positive samples to produce negative samples, which introduces bias in practice. Other methods predict long-tailed chromatin states directly, leaving head and tail classes imbalanced.
Long-tail learning aims to train a well-performing model from a large number of samples that follow a long-tailed class distribution. In practice, however, the trained model is often biased toward the head classes and performs poorly on the tail classes.
The chromatin state prediction methods widely used in the prior art still have the following disadvantages:
First, existing chromatin state prediction methods usually ignore the long-tailed distribution of chromatin states and struggle to predict head-class and tail-class chromatin states simultaneously, which limits their practicality.
Second, numerous studies have shown that genes have their own grammar rules: a large number of motifs are the "phrases" that make up the gene language, and parsing the grammar rules of genes is the first step in parsing chromatin states and inferring gene function. However, existing chromatin state prediction methods struggle to capture the relative positions of motifs and the long-range dependencies between them, and therefore cannot accurately parse the gene grammar or characterize chromatin states.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a long-tail chromatin state prediction method based on a diffusion model, so as to solve the problem that class-imbalanced data limits chromatin state prediction in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A long-tail chromatin state prediction method based on a diffusion model, comprising the steps of:
S1, acquiring an original DNA sequence and processing it to obtain DNA encoding data;
S2, constructing a DNA sequence diffusion model based on the DNA encoding data;
S3, using a UNet-based noise predictor, performing the reverse process of the conditional DNA sequence diffusion model to obtain a dataset balanced across chromatin state classes;
S4, constructing a chromatin state prediction model from the balanced dataset using the back-propagation algorithm.
Further, step S1 includes:
obtaining the chromatin states corresponding to the original DNA sequences, and extending or truncating the left and right ends of the original DNA sequences, which have different lengths, to obtain DNA sequences of length L;
converting each DNA sequence of length L into an L×4 encoding matrix by one-hot encoding.
Further, the DNA sequence diffusion model in step S2 includes a forward process and a backward process;
the forward process includes:
given the state at the previous diffusion step, obtaining the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$
where $\mathcal{N}(\cdot)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise, and $x_0$ (t=0) is the one-hot encoded L×4 matrix; $\beta_t$ is a hyperparameter and $I$ is the identity matrix; $\epsilon_{t-1}$ is the base noise drawn at the (t-1)-th sampling; $\alpha_t = 1-\beta_t$, and $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$ is the weight, where $\alpha_t$ and $\alpha_i$ are the values of the hyperparameter $\alpha$ at diffusion steps t and i; $\epsilon$ is the new Gaussian noise obtained by merging the t Gaussian distributions with different variances, i.e. the noise at diffusion step t.
Further, the backward process includes:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise, and conditioned on cell type and chromatin state, obtaining the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise:
$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$
where $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $p(x_{t-1} \mid x_t, c)$; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
using the fixed variance $\beta_t I$ and fitting the mean $\mu(x_t, c)$ with a UNet neural network to realize noise prediction at diffusion step t, the loss function of the UNet neural network being $L_{DM}$:
$$L_{DM} = \mathbb{E}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right]$$
where $\epsilon_\theta$ is the noise predicted by the neural network with parameters $\theta$, i.e. the UNet-based noise predictor; $\mathbb{E}$ denotes the expectation; and $\theta$ denotes the parameters of the UNet neural network.
Further, step S3 includes:
S3.1, sampling an L×4 noisy DNA sequence from the standard normal distribution, and iteratively denoising it with the DNA sequence diffusion model until t=0;
S3.2, using the noise predictor to predict the noise in the DNA sequence from the L×4 noisy DNA sequence, the condition c (cell type and chromatin state) and the diffusion step t;
S3.3, subtracting the noise predicted by the noise predictor from the L×4 noisy DNA sequence;
S3.4, repeating steps S3.2 and S3.3 until t=0, generating a DNA sequence with the specified cell type and chromatin state;
S3.5, repeating steps S3.2, S3.3 and S3.4 until a dataset balanced across chromatin state classes is obtained.
Further, the chromatin state prediction model includes:
a motif-aware convolution module for extracting DNA sequence motifs, comprising a 3-layer convolutional network in which each layer comprises a convolution layer, a ReLU activation layer, a dropout layer and a max-pooling layer Maxpool, computed as follows:
$s^{(l1)} = \mathrm{Conv}(x^{(l)})$
$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}), 0.2)$
$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$
where $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the convolution layer in the l-th convolutional network, the input being drawn from the balanced dataset; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max-pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is the dropout function used to prevent overfitting, with a rate of 0.2; Maxpool() is the max-pooling layer;
a dilated convolution module for learning the DNA sequence grammar;
a self-attention module for capturing correlations within the DNA sequence grammar;
and a classification module for predicting and outputting the chromatin state of each DNA sequence.
Further, the dilated convolution module comprises a 3-layer dilated convolutional network in which each layer comprises a dilated convolution layer, a ReLU activation layer and a dropout layer, computed as follows:
$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$
$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}), 0.2)$
where $z^{(l1)}$ is the output of the dilated convolution layer in the l-th dilated convolutional network, and $z^{(l2)}$ is the output of the dropout layer.
Further, the self-attention module comprises two Transformer encoder layers, computed as follows:
$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$
$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$
where $h^{(l1)}$ is the output of the multi-head self-attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is a feed-forward neural network, and $h^{(l2)}$ is the output of the feed-forward sub-layer.
Further, the classification module is computed as follows:
$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$
where y is the predicted chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
Further, the loss function of the DNA sequence diffusion model is:
where $L_D$ is the re-weighted conventional softmax cross-entropy loss function; C is the number of classes; $p_j$ is the predicted probability of class j; $y_j$ is the true class label; w is the weight;
wherein:
where μ is a manually selected empirical value.
The long-tail chromatin state prediction method based on a diffusion model provided by the invention has the following beneficial effects:
The invention uses a DNA sequence diffusion model to generate DNA sequences of tail-class chromatin states from noise, thereby balancing the samples; a chromatin state prediction model is then trained on the class-balanced dataset, and this model can effectively capture the grammar rules of genes and thus predict chromatin states accurately.
In training the DNA sequence diffusion model, the invention proposes a balanced loss that reduces the influence of the deviation between real and generated samples by increasing the penalty on generated samples.
The sample balancing method provides a simple, general and model-independent solution for long-tail chromatin state prediction; in addition, the chromatin state prediction model comprises neural network operators such as convolution, dilated convolution and self-attention, and can effectively learn the grammar rules of genes, thereby classifying chromatin states accurately.
Drawings
FIG. 1 is a flow chart of a long tail chromatin state prediction method based on a diffusion model according to the invention.
FIG. 2 is a block diagram of a long tail chromatin state prediction method based on a diffusion model according to the invention.
Detailed Description
The following description of specific embodiments is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments: to those skilled in the art, all variations within the spirit and scope of the invention as defined in the appended claims, and all inventions that make use of the inventive concept, are protected.
Example 1
This embodiment provides a long-tail chromatin state prediction method based on a diffusion model. The method generates pseudo-samples for the tail classes with a DNA sequence diffusion model and uses them to balance the data, thereby removing the limitation that imbalanced data places on chromatin state prediction. Because the samples generated by the DNA sequence diffusion model may deviate from the real samples, a balanced loss function is also provided, which reduces the influence of this deviation by increasing the penalty on generated samples. Referring to FIG. 1, the method specifically includes the following steps:
Step S1, obtaining an original DNA sequence and processing it to obtain DNA encoding data, which specifically comprises:
obtaining the chromatin states corresponding to the original DNA sequences, and extending or truncating the left and right ends of the original DNA sequences, which have different lengths, to obtain DNA sequences of length L;
converting each DNA sequence of length L into an L×4 encoding matrix by one-hot encoding.
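Step S1 can be illustrated with a minimal Python sketch. The helper names `fit_length` and `one_hot` are hypothetical, and two details are assumptions rather than statements from the patent: short sequences are padded with `N` (the text speaks of extending the ends, which in practice may mean taking flanking genomic sequence instead), and an unknown base is encoded as an all-zero row.

```python
# Hypothetical helpers illustrating step S1; column order A, C, G, T is an assumption.
BASES = "ACGT"

def fit_length(seq, length, pad="N"):
    """Extend or truncate both ends symmetrically so the sequence has exactly `length` bases."""
    seq = seq.upper()
    extra = len(seq) - length
    if extra > 0:                      # too long: cut evenly from the left and right ends
        left = extra // 2
        return seq[left:left + length]
    missing = -extra                   # too short: pad evenly on the left and right ends
    left = missing // 2
    return pad * left + seq + pad * (missing - left)

def one_hot(seq):
    """Return an L x 4 matrix; bases outside ACGT (e.g. N) become an all-zero row."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

matrix = one_hot(fit_length("ACGTAC", 4))   # the 6-mer is trimmed to "CGTA"
```

Symmetric trimming keeps the centre of the original sequence, which is one reasonable reading of "the left and right ends"; other conventions are equally possible.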
Step S2, constructing a DNA sequence diffusion model based on the DNA encoding data: the processed DNA encoding data are fed into the diffusion model for training to obtain the DNA sequence diffusion model;
the DNA sequence diffusion model comprises a forward process and a backward process. The forward process is a parameterized Markov chain: a noising process that gradually adds Gaussian noise to the data until it becomes random noise. The backward process is a denoising process that gradually recovers the data through a noise predictor;
the forward process gradually adds Gaussian noise to the original data until the data become pure noise, and specifically includes:
given the state at the previous diffusion step, obtaining the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$
where $\mathcal{N}(\cdot)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise, and $x_0$ (t=0) is the one-hot encoded L×4 matrix; $\beta_t$ is a hyperparameter, a constant with a value between 0 and 1; $I$ is the identity matrix; $\epsilon_{t-1}$ is the base noise drawn at the (t-1)-th sampling; $\alpha_t = 1-\beta_t$, and $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$ is the weight, where $\alpha_t$ and $\alpha_i$ are the values of the hyperparameter $\alpha$ at diffusion steps t and i; $\epsilon$ is the new Gaussian noise obtained by merging the t Gaussian distributions with different variances, i.e. the noise at diffusion step t; $q(x_0)$ is the true data distribution of DNA sequences, and $x_0$ is a real DNA sequence sampled from $q(x_0)$.
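The forward process can be sketched in plain Python. The sketch assumes the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{i \le t}(1-\beta_i)$, which lets $x_t$ be drawn from $x_0$ in one step. The linear schedule, its end points, the step count of 1000 and all function names are illustrative assumptions; the text only requires each $\beta_t$ to lie between 0 and 1.

```python
import math
import random

def linear_betas(steps, start=1e-4, end=0.02):
    """Linear beta schedule (an assumed choice, not specified by the text)."""
    if steps == 1:
        return [start]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t from x_0 in one shot; t is 1-indexed. Returns (x_t, eps)."""
    scale = math.sqrt(abar[t - 1])
    sigma = math.sqrt(1.0 - abar[t - 1])
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    xt = [[scale * v + sigma * e for v, e in zip(row, erow)]
          for row, erow in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [[1, 0, 0, 0], [0, 0, 1, 0]]        # one-hot rows for the 2-mer "AG"
xt, eps = q_sample(x0, 500, abar, rng)   # noisy version of x0 at step t = 500
```

Under this schedule $\bar{\alpha}_t$ shrinks toward zero as t grows, so the last steps are almost pure Gaussian noise, which matches the description of the forward process.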
The backward process learns the data distribution p(x) by gradually denoising from a Gaussian distribution, which is equivalent to learning the reverse of a Markov chain of length T. By adding a "condition" to the reverse process, a generative model is built that can generate sequences of different chromatin states in different cell types; this generative model is defined as $p(x_{t-1} \mid x_t, c)$. Specifically:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise, and conditioned on cell type and chromatin state, obtaining the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise:
$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$
where $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $p(x_{t-1} \mid x_t, c)$; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
specifically, the fixed variance $\beta_t I$ is used and the mean $\mu(x_t, c)$ is fitted with a UNet neural network to realize noise prediction at diffusion step t, the loss function of the UNet neural network being $L_{DM}$:
$$L_{DM} = \mathbb{E}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right]$$
where $\epsilon_\theta$ is the noise predicted by the neural network with parameters $\theta$, i.e. the UNet-based noise predictor; $\mathbb{E}$ denotes the expectation; and $\theta$ denotes the parameters of the UNet neural network.
Because the real sequences and the generated sequences may deviate from each other, this embodiment proposes a balanced loss that aims to reduce the influence of the deviation between real and generated DNA sequences. It is realized with a re-weighted conventional softmax cross-entropy loss function, which serves as the loss function of the DNA sequence diffusion model. The balanced loss function is:
where $L_D$ is the re-weighted conventional softmax cross-entropy loss function; C is the number of classes; $p_j$ is the predicted probability of class j; $y_j$ is the true class label; w is the weight;
wherein:
where μ is a manually selected empirical value.
The loss function of the DNA sequence diffusion model is optimized with the AdamW optimization algorithm, and training stops when the loss on the validation set reaches its minimum.
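The idea of a re-weighted softmax cross-entropy can be sketched as follows. The exact definition of the weight w is not reproduced in the text, so the rule used here is an assumption made only for illustration: real samples get weight 1 and generated samples get weight μ, with μ > 1 so that generated samples are penalized more. The function names and the value μ = 2.0 are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def balanced_loss(logits, label, is_generated, mu=2.0):
    """Cross-entropy for one sample, scaled by w.

    Assumed weighting (not taken from the patent): w = 1 for a real sample,
    w = mu for a sample produced by the diffusion model.
    """
    w = mu if is_generated else 1.0
    p = softmax(logits)
    # With a one-hot label y, -sum_j y_j * log(p_j) reduces to -log(p[label]).
    return -w * math.log(p[label])

real = balanced_loss([2.0, 0.5, 0.1], label=0, is_generated=False)
fake = balanced_loss([2.0, 0.5, 0.1], label=0, is_generated=True)
```

With identical logits and labels, the generated sample contributes μ times the loss of the real one, so a misclassified pseudo-sample is penalized more heavily than a misclassified real sample.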
Step S3, using a UNet-based noise predictor, performing the reverse process of the conditional DNA sequence diffusion model to obtain a dataset balanced across chromatin state classes, which specifically comprises:
Step S3.1, sampling an L×4 noisy DNA sequence from the standard normal distribution, and iteratively denoising it with the DNA sequence diffusion model until t=0;
Step S3.2, using the noise predictor to predict the noise in the DNA sequence from the L×4 noisy DNA sequence, the condition c (cell type and chromatin state) and the diffusion step t;
Step S3.3, subtracting the noise predicted by the noise predictor from the L×4 noisy DNA sequence;
Step S3.4, repeating step S3.2 and step S3.3 until t=0, generating a DNA sequence with the specified cell type and chromatin state;
Step S3.5, repeating step S3.2, step S3.3 and step S3.4 until a dataset balanced across chromatin state classes is obtained.
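Steps S3.1 to S3.4 can be sketched as a sampling loop. The patent only says that the predicted noise is subtracted at each step; the exact update below is the standard DDPM ancestral step, $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \sqrt{\beta_t}\,z$, which is assumed here. The trained UNet is replaced by a dummy predictor, the final argmax discretisation is an assumption, and the condition values are purely illustrative.

```python
import math
import random

def sample_sequence(predict_noise, cond, length, betas, rng):
    """Generate one sequence by denoising from pure noise down to t = 0."""
    abar, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        abar.append(prod)
    # S3.1: start from an L x 4 matrix of standard normal noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(length)]
    for t in range(len(betas), 0, -1):
        beta, a_bar = betas[t - 1], abar[t - 1]
        eps_hat = predict_noise(x, cond, t)            # S3.2: predict the noise
        coef = beta / math.sqrt(1.0 - a_bar)
        # S3.3: remove the predicted noise; fresh noise is added except at the last step.
        x = [[(v - coef * e) / math.sqrt(1.0 - beta)
              + (math.sqrt(beta) * rng.gauss(0.0, 1.0) if t > 1 else 0.0)
              for v, e in zip(row, erow)]
             for row, erow in zip(x, eps_hat)]
    # Assumed discretisation: the largest channel in each row picks the base.
    return "".join("ACGT"[max(range(4), key=row.__getitem__)] for row in x)

def dummy_predictor(x, cond, t):
    """Stand-in for the trained UNet noise predictor: always predicts zero noise."""
    return [[0.0] * 4 for _ in x]

rng = random.Random(0)
betas = [1e-4 + (0.02 - 1e-4) * i / 49 for i in range(50)]
cond = {"cell_type": "K562", "state": "insulator"}    # illustrative condition c
seq = sample_sequence(dummy_predictor, cond, 8, betas, rng)
```

For step S3.5, this function would be called repeatedly for each tail class until its sample count matches that of the head classes.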
Step S4, training a chromatin state prediction model on the balanced dataset with the back-propagation algorithm; the chromatin state prediction model takes the DNA sequences in the balanced dataset as input and specifically comprises:
a motif-aware convolution module for extracting DNA sequence motifs, comprising a 3-layer convolutional network in which each layer comprises a convolution layer, a ReLU activation layer, a dropout layer and a max-pooling layer Maxpool, computed as follows:
$s^{(l1)} = \mathrm{Conv}(x^{(l)})$
$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}), 0.2)$
$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$
where $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the convolution layer in the l-th convolutional network, the input being drawn from the balanced dataset; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max-pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is the dropout function used to prevent overfitting, with a rate of 0.2; Maxpool() is the max-pooling layer.
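One layer of the motif-aware convolution module (Conv, then ReLU, then Dropout, then Maxpool) can be sketched in plain Python. Kernel width, pooling size and the hand-written "TAT" filter are illustrative assumptions; in the actual model the filters are learned and a deep learning framework would be used.

```python
import random

def conv1d(x, kernels):
    """Valid 1-D convolution. x: L x C_in; kernels: list of K x C_in filters."""
    k = len(kernels[0])
    return [[sum(x[i + d][c] * ker[d][c] for d in range(k) for c in range(len(x[0])))
             for ker in kernels]
            for i in range(len(x) - k + 1)]

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def dropout(x, rate, rng, training):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return x
    return [[0.0 if rng.random() < rate else v / (1.0 - rate) for v in row] for row in x]

def maxpool(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    return [[max(x[i + d][c] for d in range(size)) for c in range(len(x[0]))]
            for i in range(0, len(x) - size + 1, size)]

def conv_block(x, kernels, rng, training=False, pool=2):
    s1 = conv1d(x, kernels)                      # s^(l1) = Conv(x^(l))
    s2 = dropout(relu(s1), 0.2, rng, training)   # s^(l2) = Dropout(ReLU(s^(l1)), 0.2)
    return maxpool(s2, pool)                     # s^(l3) = Maxpool(s^(l2))

# A hand-written filter that responds most strongly to the 3-mer "TAT".
tat = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
onehot = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
x = [onehot[b] for b in "GCTATAGC"]
out = conv_block(x, [tat], random.Random(0))     # inference mode: dropout is inactive
```

The filter scores 3 where the window is exactly "TAT" and less elsewhere, and max pooling keeps the strongest response in each pair of positions. This is the sense in which a convolution filter acts as a motif detector.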
A dilated convolution module for learning the DNA sequence grammar, comprising a 3-layer dilated convolutional network in which each layer comprises a dilated convolution layer, a ReLU activation layer and a dropout layer, computed as follows:
$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$
$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}), 0.2)$
where $z^{(l1)}$ is the output of the dilated convolution layer in the l-th dilated convolutional network, and $z^{(l2)}$ is the output of the dropout layer.
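The benefit of dilation can be shown with a short sketch: a dilated filter reads positions that are `dilation` steps apart, so stacked layers see a much wider window without any pooling. The single-channel setting, kernel size 3 and dilation rates 1, 2, 4 are assumptions chosen for illustration; the patent does not state the rates it uses.

```python
def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D dilated convolution on one channel; taps are `dilation` positions apart."""
    span = (len(kernel) - 1) * dilation
    return [sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(x) - span)]

def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output after stacking the given layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = [1, 0, 0, 0, 1, 0, 0, 0, 1]
print(dilated_conv1d(signal, [1, 1, 1], dilation=4))   # taps at 0, 4, 8 -> [3]
print(receptive_field(3, [1, 2, 4]))                   # 15 positions
print(receptive_field(3, [1, 1, 1]))                   # 7 positions without dilation
```

Three width-3 layers cover 15 input positions with dilations 1, 2, 4, compared with 7 without dilation, which is how the module can relate motifs that are far apart in the sequence.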
A self-attention module for capturing correlations within the DNA sequence grammar, comprising two Transformer encoder layers, computed as follows:
$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$
$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$
where $h^{(l1)}$ is the output of the multi-head self-attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is a feed-forward neural network, and $h^{(l2)}$ is the output of the feed-forward sub-layer.
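The residual-plus-normalization structure of one encoder layer can be sketched in plain Python. To stay short, this sketch makes strong simplifications that are not part of the patent: a single attention head with identity query, key and value projections stands in for MultiHead(), the feed-forward network is a bare ReLU without learned weights, and layer normalization has no learned scale or shift.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one position's feature vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def self_attention(z):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(z[0])
    out = []
    for q in z:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in z]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, z)) for c in range(d)])
    return out

def ffn(row):
    """Toy position-wise feed-forward network: ReLU only, no learned weights."""
    return [max(0.0, v) for v in row]

def encoder_layer(z):
    attn = self_attention(z)
    # h^(l1) = LayerNorm(z^(l2) + MultiHead(z^(l2)))
    h1 = [layer_norm([a + b for a, b in zip(zr, ar)]) for zr, ar in zip(z, attn)]
    # h^(l2) = LayerNorm(h^(l1) + FFN(h^(l1)))
    return [layer_norm([a + b for a, b in zip(hr, ffn(hr))]) for hr in h1]

z = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [2.0, -1.0, 0.0, 1.0]]
h = encoder_layer(encoder_layer(z))   # two stacked layers, as in the text
```

Every output position is a weighted mix of all input positions, so a feature at one end of the sequence can influence the representation at the other end in a single layer.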
A classification module for predicting and outputting the chromatin state of each DNA sequence.
It comprises one fully connected neural network layer and an activation function, computed as follows:
$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$
where y is the predicted chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
Further, Experiments 1 and 2 were performed to compare the method of the invention with other methods in the prior art.
Experiment 1: the long-tail chromatin state prediction method based on a diffusion model provided by the invention clearly improves prediction accuracy.
Table 1 compares the proposed long-tail chromatin state prediction method with three comparison methods, DeepSEA (Method 1), DanQ (Method 2) and Sei (Method 3), on the ChromHMM dataset.
Table 1. Accuracy of chromatin state prediction
Method | Raw data | Data balanced by the diffusion model |
Method 1 | 0.657 | 0.671 |
Method 2 | 0.667 | 0.683 |
Method 3 | 0.654 | 0.676 |
The invention | 0.676 | 0.691 |
The main observations from Table 1 are as follows:
(1) Balancing the chromatin state data with the diffusion model of the invention improves the performance of all four methods. This shows that the diffusion-model-based data balancing method is model-independent and that the strategy can be widely adopted by different models.
(2) The chromatin state prediction model provided by the invention outperforms the other three methods. This shows that the proposed method captures chromatin features more effectively and therefore predicts chromatin states more accurately.
Experiment 2: the balanced loss provided by the invention effectively reduces the influence of the deviation between real and generated samples.
Table 2 compares the four methods with and without the balanced loss proposed by the invention.
Table 2. Accuracy of chromatin state prediction
Method | Without balanced loss | With balanced loss |
Method 1 | 0.671 | 0.706 |
Method 2 | 0.683 | 0.719 |
Method 3 | 0.676 | 0.702 |
The invention | 0.691 | 0.732 |
The main observations from Table 2 are as follows:
(1) The balanced loss provided by the invention improves the performance of all four methods. This shows that the balanced loss is model-independent. Combined with the balanced loss, the diffusion-model-based sample balancing method of the invention can be widely adopted by different models.
(2) The proposed chromatin state prediction method outperforms the comparison methods.
In summary, the invention provides a diffusion-model-based framework that generates pseudo-samples of different chromatin states in different cells to balance the classes, thereby solving the long-tail problem in chromatin state prediction. It also provides a balanced loss that mitigates the influence of the deviation between real samples and pseudo-samples by increasing the penalty on pseudo-samples. The chromatin state prediction model effectively captures the motifs in DNA sequences, learns the grammar rules of genes, and thus predicts chromatin states more accurately. In addition, the invention supports parallel operation on multiple GPUs and can be used for very large-scale chromatin state analysis.
Although specific embodiments of the invention have been described in detail with reference to the accompanying drawings, they should not be construed as limiting the scope of protection of the present patent. Various modifications and variations that those skilled in the art can make without creative effort remain within the scope of the patent as described in the claims.
Claims (7)
1. A long tail chromatin state prediction method based on a diffusion model, characterized by comprising the following steps:
S1, acquiring an original DNA sequence, and processing the original DNA sequence to obtain DNA coding data;
S2, constructing a DNA sequence diffusion model based on the DNA coding data;
S3, performing the reverse process of the conditional DNA sequence diffusion model in combination with a UNet-based noise predictor to obtain a data set balanced across the different chromatin state categories;
S4, constructing a chromatin state prediction model by adopting a back propagation algorithm based on the balanced data set;
the step S1 includes:
obtaining the chromatin states corresponding to the original DNA sequences, and extending or truncating the left and right ends of the original DNA sequences of different lengths to obtain DNA sequences of length L;
converting the DNA sequences of length L into L × 4 coding matrix data by one-hot encoding;
the DNA sequence diffusion model in the step S2 comprises a forward process and a backward process;
the forward process includes:
given the state at the previous diffusion step, predicting the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
$$x_t = \sqrt{a_t}\,x_{t-1} + \sqrt{1-a_t}\,\epsilon_{t-1} = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$$
wherein $\mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise, and, when t = 0, $x_0$ is the L × 4 matrix data after one-hot encoding; $\beta_t$ is a hyperparameter and $I$ is the identity matrix; $\epsilon_{t-1}$ is the base noise obtained from the (t-1)-th sampling; $a_t = 1-\beta_t$ and $\bar{a}_t = \prod_{i=1}^{t} a_i$ are the weights, $a_t$ being the value of $a$ at diffusion step t, $a_i$ the value of $a$ at diffusion step i, and $a$ a hyperparameter; $\epsilon$ is the new Gaussian distribution obtained by adding t Gaussian distributions with different variances, namely the noise at diffusion step t;
the backward process includes:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise, and under the condition of cell type and chromatin state, predicting the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise:
$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$
wherein $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $\mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$; $c$ is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
using the fixed variance $\beta_t I$ and fitting the mean $\mu(x_t, c)$ with a UNet neural network to realize noise prediction at diffusion step t, the loss function of the UNet neural network being $L_{DM}$:
$$L_{DM} = \mathbb{E}_{x_0, c, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right]$$
wherein $\epsilon_\theta$ is the noise predicted by the neural network under the parameter $\theta$, namely the noise predictor based on the UNet neural network; $\mathbb{E}$ denotes the expectation; $\theta$ is the parameter of the UNet neural network.
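The encoding and forward-noising steps of claim 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: padding with 'N' (all-zero rows), centering the truncation window, the linear beta schedule, and all function names are assumptions made here. The sampling uses the standard closed form $x_t = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$ with $\bar{a}_t = \prod_{i=1}^{t} a_i$, and the loss is the mean squared error between true and predicted noise.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq, length):
    """Bring `seq` to `length` at both ends, then one-hot encode it as (length, 4)."""
    seq = seq.upper()
    if len(seq) >= length:
        start = (len(seq) - length) // 2          # truncate both ends
        seq = seq[start:start + length]
    else:
        left = (length - len(seq)) // 2           # assumed: pad both ends with 'N'
        seq = "N" * left + seq + "N" * (length - len(seq) - left)
    x = np.zeros((length, 4))
    for i, base in enumerate(seq):
        if base in BASES:                         # 'N' stays an all-zero row
            x[i, BASES.index(base)] = 1.0
    return x

def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; returns (beta_t, a_t, a-bar_t) for t = 1..steps."""
    betas = np.linspace(beta_start, beta_end, steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 (t is 1-based); returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return x_t, eps

def noise_loss(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
x0 = one_hot("ACGTAC", 8)                         # toy sequence, L = 8
betas, alphas, alpha_bar = make_schedule(1000)
x_t, eps = forward_noise(x0, 500, alpha_bar, rng)
# A real noise predictor would be a trained UNet; zeros stand in for it here.
print(x_t.shape, noise_loss(eps, np.zeros_like(eps)))
```

During training, `eps_pred` would come from the UNet evaluated on `x_t`, the step `t`, and the condition `c`; the zero array above only exercises the loss function.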
2. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the step S3 comprises:
S3.1, generating an L × 4 noisy DNA sequence from the standard normal distribution, and iterating the L × 4 noisy DNA sequence through the DNA sequence diffusion model until t = 0;
S3.2, predicting the noise of the DNA sequence with the noise predictor according to the L × 4 noisy DNA sequence, the condition c (cell type and chromatin state) and the diffusion step t;
S3.3, subtracting the noise value predicted by the current noise predictor from the L × 4 noisy DNA sequence;
S3.4, repeating steps S3.2 and S3.3 until t = 0, generating a DNA sequence with a specific cell type and chromatin state;
S3.5, repeating steps S3.2, S3.3 and S3.4 until a data set balanced across the different chromatin state categories is obtained.
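Steps S3.1 to S3.5 can be sketched as a sampling loop. The sketch below is an assumption-laden illustration: it uses the standard DDPM update with fixed variance $\beta_t I$ to "subtract" the predicted noise, decodes the result by taking the arg-max base at each position, and replaces the trained UNet with a hypothetical placeholder `dummy_predictor`. The condition labels and the `balance` helper are likewise invented for the example.

```python
import numpy as np

def sample_sequence(noise_predictor, cond, length, betas, rng):
    """Generate one DNA string conditioned on `cond` by reverse diffusion."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal((length, 4))              # S3.1: start from N(0, I)
    for t in range(len(betas), 0, -1):                # t = T, ..., 1
        eps_hat = noise_predictor(x, t, cond)         # S3.2: predict the noise
        # S3.3: remove the predicted noise (standard DDPM posterior mean).
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat)
        mean = mean / np.sqrt(alphas[t - 1])
        noise = rng.standard_normal(x.shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[t - 1]) * noise      # fixed variance beta_t * I
    # S3.4: decode each position to its highest-scoring base.
    return "".join("ACGT"[i] for i in x.argmax(axis=1))

def balance(counts, generate):
    """S3.5: generate pseudo samples until every class matches the largest one."""
    target = max(counts.values())
    return {c: [generate(c) for _ in range(target - n)] for c, n in counts.items()}

def dummy_predictor(x, t, cond):
    """Hypothetical stand-in for the trained UNet noise predictor."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)                   # assumed schedule, T = 50
counts = {("cell_A", "state_1"): 3, ("cell_A", "state_2"): 1}   # toy class sizes
pseudo = balance(
    counts,
    lambda cond: sample_sequence(dummy_predictor, cond, 16, betas, rng),
)
print({cond: len(seqs) for cond, seqs in pseudo.items()})
```

With the toy counts above, two pseudo sequences are generated for the rarer class and none for the larger one, so both classes end up with three samples.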
3. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the chromatin state prediction model comprises:
a motif-aware convolution module for extracting DNA sequence motifs, comprising a 3-layer convolution network, each layer of which comprises a convolution layer, a ReLU activation layer, a dropout layer and a max-pooling layer Maxpool, the calculation process being as follows:
$s^{(l1)} = \mathrm{Conv}(x^{(l)})$
$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}), 0.2)$
$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$
wherein $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the l-th convolution layer, the input of the network being the balanced data set; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max-pooling layer Maxpool; Conv() is the convolution operation; ReLU() is the activation function; Dropout() is the dropout function for preventing overfitting, with a dropout rate of 0.2; Maxpool() is the max-pooling operation;
a dilated (hole) convolution module for learning the grammar of the DNA sequence;
a self-attention module for capturing the correlations within the DNA sequence grammar;
and a classification module for constructing a chromatin state for each DNA sequence and predicting the output chromatin state.
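One way to read the Conv, ReLU, Dropout, Maxpool recipe of claim 3 is sketched below in NumPy. The kernel size (8), channel widths (16, 32, 64), pooling size (2) and random weights are assumptions for illustration; the claim only fixes the layer order, the three-layer depth and the dropout rate of 0.2.

```python
import numpy as np

def conv1d(x, w, b):
    """'Valid' 1D convolution. x: (L, C_in), w: (K, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.einsum("lkc,kco->lo", windows, w) + b   # (L-K+1, C_out)

def dropout(x, rate, rng=None):
    """Inverted dropout; acts as the identity when rng is None (inference)."""
    if rng is None:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def maxpool1d(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    n = x.shape[0] // size
    return x[:n * size].reshape(n, size, x.shape[1]).max(axis=1)

def motif_layer(x, w, b, pool=2, rate=0.2, rng=None):
    s1 = conv1d(x, w, b)                              # s(l1) = Conv(x(l))
    s2 = dropout(np.maximum(s1, 0.0), rate, rng)      # s(l2) = Dropout(ReLU(s(l1)), 0.2)
    return maxpool1d(s2, pool)                        # s(l3) = Maxpool(s(l2))

rng = np.random.default_rng(0)
x = np.eye(4)[rng.integers(0, 4, size=200)]           # random one-hot sequence, L = 200
channels = [4, 16, 32, 64]                            # assumed channel widths
for c_in, c_out in zip(channels[:-1], channels[1:]):
    w = rng.standard_normal((8, c_in, c_out)) * 0.1   # assumed kernel size 8
    x = motif_layer(x, w, np.zeros(c_out))
print(x.shape)
```

With these assumed sizes, the sequence length shrinks from 200 to 96, 44 and finally 18 positions while the channel count grows to 64.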
4. The diffusion model-based long tail chromatin state prediction method according to claim 3, wherein the hole (dilated) convolution module comprises a 3-layer hole convolution network, each layer of which comprises a hole convolution layer, a ReLU activation layer and a dropout layer, the calculation process being as follows:
$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$
$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}), 0.2)$
wherein $z^{(l1)}$ is the output of the l-th hole convolution layer, dConv() is the hole convolution operation, and $z^{(l2)}$ is the output of the dropout layer.
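The effect of stacking hole (dilated) convolutions can be illustrated with a small NumPy sketch. The kernel size of 3 and the dilation rates 1, 2, 4 are assumptions, since the claim does not state them; dropout is omitted because it is the identity at inference time. Feeding a single impulse through the stack shows how far one input position spreads, which matches the receptive field $1 + \sum_i (K-1)\,d_i$.

```python
import numpy as np

def dilated_conv1d(x, w, b, dilation):
    """'Same'-padded dilated convolution. x: (L, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    span = dilation * (k - 1)
    left = span // 2
    padded = np.pad(x, ((left, span - left), (0, 0)))
    # Tap j reads the input shifted by j * dilation positions.
    taps = np.stack(
        [padded[j * dilation: j * dilation + x.shape[0]] for j in range(k)],
        axis=1,
    )                                                  # (L, K, C_in)
    return np.einsum("lkc,kco->lo", taps, w) + b

def receptive_field(kernel, dilations):
    """Input positions seen by one output of the stacked dilated layers."""
    return 1 + sum((kernel - 1) * d for d in dilations)

dilations = (1, 2, 4)                                  # assumed dilation rates
z = np.zeros((21, 1))
z[10, 0] = 1.0                                         # single impulse in the middle
w = np.ones((3, 1, 1))                                 # assumed kernel size 3
for d in dilations:
    z = np.maximum(dilated_conv1d(z, w, np.zeros(1), d), 0.0)   # dConv + ReLU
print(int(np.count_nonzero(z)), receptive_field(3, dilations))
```

Three layers with only three taps each already cover 15 positions, which is why dilation is a cheap way to relate motifs that lie far apart in the sequence.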
5. The diffusion model-based long tail chromatin state prediction method according to claim 4, wherein the self-attention module comprises two Transformer encoder layers, the calculation process being as follows:
$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$
$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$
wherein $h^{(l1)}$ is the output of the self-attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network; $h^{(l2)}$ is the output of the feed-forward neural network.
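The two equations of claim 5 describe a post-norm Transformer encoder layer. The NumPy sketch below follows them term by term under several simplifying assumptions that are not in the claim: biases and the learnable scale and shift of LayerNorm are omitted, the FFN is two linear maps with a ReLU in between, and the model width (64), head count (4), hidden width (128) and random weights are chosen only for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over its features (no learnable scale or shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(z, wq, wk, wv, wo, heads):
    """Multi-head scaled dot-product self-attention on z: (L, d)."""
    length, d = z.shape
    dh = d // heads

    def split(m):                                      # (L, d) -> (heads, L, dh)
        return m.reshape(length, heads, dh).transpose(1, 0, 2)

    q, k, v = split(z @ wq), split(z @ wk), split(z @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))      # (heads, L, L)
    merged = (attn @ v).transpose(1, 0, 2).reshape(length, d)
    return merged @ wo

def encoder_layer(z, p, heads):
    h1 = layer_norm(z + multi_head(z, p["wq"], p["wk"], p["wv"], p["wo"], heads))
    ffn = np.maximum(h1 @ p["w1"], 0.0) @ p["w2"]      # assumed two-layer ReLU FFN
    return layer_norm(h1 + ffn)                        # h(l2)

def init_params(d, hidden, rng):
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
              "w1": (d, hidden), "w2": (hidden, d)}
    return {name: rng.standard_normal(s) * 0.1 for name, s in shapes.items()}

rng = np.random.default_rng(0)
h = rng.standard_normal((18, 64))                      # toy input: 18 positions, 64 features
for _ in range(2):                                     # two encoder layers, as in claim 5
    h = encoder_layer(h, init_params(64, 128, rng), heads=4)
print(h.shape)
```

The layer keeps the input shape, so the two encoder layers can be stacked directly on the output of the dilated convolution module.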
6. The diffusion model-based long tail chromatin state prediction method according to claim 5, wherein the classification module calculates:
$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$
wherein y is the predicted output chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
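The classification step can be sketched as a fully connected layer followed by an activation. The claim does not say which activation is used or how the per-position features are reduced to one vector, so the sketch below assumes softmax and mean pooling over positions; the number of chromatin states (5) and the random weights are also placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(h, w, b):
    """y = Activation(MLP(h)) with assumed mean pooling and softmax activation."""
    pooled = h.mean(axis=0)            # assumed: average the features over positions
    return softmax(pooled @ w + b)     # MLP() reduced to one fully connected layer

rng = np.random.default_rng(0)
h = rng.standard_normal((18, 64))      # stand-in for the self-attention output
w = rng.standard_normal((64, 5)) * 0.1 # 5 chromatin states (hypothetical)
y = classify(h, w, np.zeros(5))
print(int(y.argmax()), float(y.sum()))
```

For multi-label chromatin annotations a sigmoid would replace the softmax; the claim leaves that choice open.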
7. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the loss function of the DNA sequence diffusion model is:
wherein $L_D$ is the conventional softmax cross-entropy loss function after re-weighting; C is the number of categories; $p_j$ is the predicted probability of category j; $y_j$ is the true category label; w is the weight;
wherein:
where μ is an empirical value selected manually.
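A re-weighted softmax cross-entropy of the kind described in claim 7 can be sketched as follows. The formula that derives w from μ is not reproduced in this text, so the sketch does not guess it: the weights are passed in by the caller as one value per sample, and the numbers in the example are hypothetical. The only assumption about the loss itself is the usual form, the weighted mean of $-\log p_j$ for the true category j.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, weights):
    """Mean of weights[n] * (-log softmax(logits[n])[labels[n]]) over samples n."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_p[np.arange(len(labels)), labels]                # log p of the true class
    return float(-(np.asarray(weights, dtype=float) * picked).mean())

logits = np.zeros((2, 4))              # two samples, four categories, uniform prediction
labels = [0, 3]
plain = weighted_cross_entropy(logits, labels, [1.0, 1.0])
# Hypothetical weights: the second sample (say, a pseudo sample) is penalized 3x.
reweighted = weighted_cross_entropy(logits, labels, [1.0, 3.0])
print(plain, reweighted)
```

With uniform predictions over four categories the unweighted loss equals log 4, and tripling the weight of one of the two samples doubles the mean loss.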
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310991350.8A CN116884495B (en) | 2023-08-07 | 2023-08-07 | Diffusion model-based long tail chromatin state prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116884495A (en) | 2023-10-13 |
CN116884495B (en) | 2024-03-08 |
Family
ID=88264587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310991350.8A Active CN116884495B (en) | 2023-08-07 | 2023-08-07 | Diffusion model-based long tail chromatin state prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116884495B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111312329A (en) * | 2020-02-25 | 2020-06-19 | 成都信息工程大学 | Transcription factor binding site prediction method based on deep convolution automatic encoder |
CN114023300A (en) * | 2021-11-03 | 2022-02-08 | 四川大学 | Chinese speech synthesis method based on diffusion probability model |
WO2022189771A1 (en) * | 2021-03-11 | 2022-09-15 | Oxford University Innovation Limited | Generating neural network models, classifying physiological data, and classifying patients into clinical classifications |
CN115831217A (en) * | 2022-11-23 | 2023-03-21 | 四川大学 | Chromatin topological association domain boundary prediction method based on multi-modal fusion |
CN116153404A (en) * | 2023-02-28 | 2023-05-23 | 成都信息工程大学 | Single-cell ATAC-seq data analysis method |
CN116312765A (en) * | 2023-02-15 | 2023-06-23 | 成都信息工程大学 | Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer |
CN116416491A (en) * | 2023-03-14 | 2023-07-11 | 福建福清核电有限公司 | Remote sensing pseudo sample generation method based on lightweight diffusion model |
Non-Patent Citations (1)
Title |
---|
程哲; 白茜; 张浩; 王世普; 梁宇. Improving Hi-C data resolution using deep convolutional neural networks. 计算机科学 (Computer Science). 2020, (Issue S1), full text. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||