Diffusion model-based long tail chromatin state prediction method
Technical Field
The invention belongs to the technical field of chromatin state prediction, and particularly relates to a long tail chromatin state prediction method based on a diffusion model.
Background
Chromatin state refers to the different structural and functional states that chromatin adopts in different cell types. Chromatin states have attracted increasing interest because they reflect a wide range of cellular functions and statuses. Epigenetic modification of DNA sequences is a major factor in determining chromatin state. For example, the prior art defines 15 chromatin states with different biological roles by mapping 9 chromatin marks; similarly, the prior art defines 18 chromatin states from 6 histone marks using ChIP-seq data. These studies indicate that chromatin states follow a long-tailed distribution, with some states far more abundant than others; for example, the number of enhancers is significantly greater than the number of insulators. While genomic assays such as ChIP-seq can reveal chromatin states, they require expensive and time-consuming experiments. Therefore, there is an urgent need for computational methods for long-tail chromatin state prediction.
Currently, many efforts have been made to predict chromatin states with deep learning algorithms. DeepSEA is a pioneering work that constructs a CNN to predict 919 chromatin features from DNA sequences. Following DeepSEA, numerous researchers have made valuable explorations and breakthroughs in improving the performance of chromatin state prediction algorithms, with the primary focus on model architecture. The prior art proposes a simple and efficient method consisting of a single CNN layer, a BiLSTM layer and a fully connected layer; specifically, the CNN is used to learn motif information and the BiLSTM is used to learn regulatory grammar. The prior art also proposes a hybrid DNN model, Deepformer, that uses CNN and attention to achieve accurate chromatin feature prediction with limited parameters. The prior art further expands the receptive field by integrating dilated convolution without reducing spatial resolution, for better performance. However, these methods often ignore the long-tail problem among chromatin states; in particular, some methods achieve sample balance by shuffling positive samples to produce negative samples, which introduces bias in practice, while other methods directly predict long-tailed chromatin states, resulting in an imbalance between the head and tail classes.
Long-tail learning aims to train a well-performing model from samples that follow a long-tailed class distribution. However, in practice, the trained models are often biased toward the head classes, resulting in poor performance on the tail classes.
The methods of predictive analysis of chromatin states widely used in the prior art still have some disadvantages:
first, existing chromatin state prediction methods usually ignore the long-tailed distribution of chromatin states, and it is difficult for them to simultaneously predict the chromatin states of the head classes and the tail classes, so the methods have certain limitations in practicability.
Second, numerous studies have shown that genes have their own grammatical rules: motifs are the "phrases" that make up the gene language, and parsing the grammatical rules of genes is the primary step in parsing chromatin states and inferring gene function. However, existing chromatin state prediction methods struggle to effectively capture the relative positions of motifs and the long-distance dependencies between them, and thus cannot accurately analyze the gene grammar or characterize chromatin states.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide a long-tail chromatin state prediction method based on a diffusion model, so as to solve the limitation that class-imbalanced data imposes on chromatin state prediction.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a long tail chromatin state prediction method based on a diffusion model, comprising the steps of:
s1, acquiring an original DNA sequence, and processing the original DNA sequence to obtain DNA coding data;
s2, constructing a DNA sequence diffusion model based on the DNA coding data;
s3, combining a noise predictor of UNet, and performing a reverse process of a conditional DNA sequence diffusion model to obtain balanced data sets with different chromatin state categories;
s4, constructing a chromatin state prediction model by adopting a back propagation algorithm based on the balance data set.
Further, step S1 includes:
obtaining the chromatin states corresponding to the original DNA sequences, and extending or truncating both ends of the obtained original DNA sequences of different lengths to obtain DNA sequences of length L;
the DNA sequence with the length L is converted into coding matrix data of L multiplied by 4 by adopting a single-heat coding method.
Further, the DNA sequence diffusion model in step S2 includes a forward process and a backward process;
the forward procedure includes:
given the state at the previous diffusion step, the probability distribution q(x_t | x_{t-1}) of the state at the current diffusion step is predicted:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)

equivalently, x_t = √(a_t)·x_{t-1} + √(1-a_t)·ε_{t-1} = √(ā_t)·x_0 + √(1-ā_t)·ε

wherein N(x_t; √(1-β_t)·x_{t-1}, β_t·I) denotes a Gaussian distribution with mean √(1-β_t)·x_{t-1} and variance β_t·I; x_t is the vector of each DNA sequence after the t-th addition of noise, x_{t-1} is the vector of each DNA sequence after the (t-1)-th addition of noise, and when t=0, x_0 is the one-hot encoded L×4 matrix data; β_t is a hyperparameter and I is the identity matrix; ε_{t-1} is the base noise obtained at the (t-1)-th sampling; a_t = 1-β_t and ā_t = ∏_{i=1..t} a_i serve as weights, where a_t is the value of a at diffusion step t and a_i is the value of a at diffusion step i; ε is the new Gaussian noise obtained by combining the t Gaussian noises with different variances, namely the noise at diffusion step t.
Further, the backward procedure includes:
given the vector x_t of each DNA sequence after the t-th addition of noise, the probability distribution p(x_{t-1} | x_t, c) of the vector x_{t-1} of each DNA sequence after the (t-1)-th addition of noise, under the condition of cell type and chromatin state, is:

p(x_{t-1} | x_t, c) = N(x_{t-1}; μ(x_t, c), β_t·I)

wherein μ(x_t, c) and β_t·I are the mean and variance of p(x_{t-1} | x_t, c), respectively; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;

a fixed variance β_t·I is used, and a UNet neural network is used to fit the mean μ(x_t, c), realizing noise prediction at diffusion step t; the loss function of the UNet neural network is L_DM:

L_DM = E_{x_0, c, ε, t}[ ‖ε - ε_θ(x_t, c, t)‖² ]

wherein ε_θ is the noise predicted by the neural network with parameters θ, namely the noise predictor based on the UNet neural network; E_{x_0, c, ε, t}[·] denotes the expectation, and θ is the parameter of the UNet neural network.
Further, step S3 includes:
s3.1, generating an L×4 noisy DNA sequence from a standard normal distribution, the L×4 noisy DNA sequence being iterated on the basis of the DNA sequence diffusion model until t=0;
s3.2, predicting the noise value of the output DNA sequence with the noise predictor, based on the L×4 noisy DNA sequence, the cell type and chromatin state condition c, and the diffusion step t;
s3.3, subtracting the noise value predicted by the current noise predictor from the L×4 noisy DNA sequence;
s3.4, repeating steps S3.2 and S3.3 until t=0, generating a DNA sequence having the specified cell type and chromatin state;
s3.5, repeating steps S3.2, S3.3 and S3.4 until a dataset balanced across the different chromatin state categories is obtained.
Further, the chromatin state prediction model includes:
the motif-aware convolution module is used for extracting DNA sequence motifs and comprises a 3-layer convolution network, where each layer comprises a convolution layer, a ReLU activation layer, a dropout layer and a max pooling layer Maxpool, and the calculation process is as follows:

s^(l1) = Conv(x^(l))
s^(l2) = Dropout(ReLU(s^(l1)), 0.2)
s^(l3) = Maxpool(s^(l2))

wherein x^(l) and s^(l1) are the input and the output of the convolution layer of the l-th convolution network, respectively, and the input of the first layer is the balanced dataset; s^(l2) is the output of the dropout layer; s^(l3) is the output of the max pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is a regularization operation for preventing overfitting, with a dropout rate of 0.2; Maxpool() is the max pooling operation;
the dilated convolution module is used for learning the grammar of the DNA sequence;
a self-attention module for capturing the correlation inside the DNA sequence grammar;
and a classification module for predicting and outputting the chromatin state of each DNA sequence.
Further, the dilated convolution module comprises a 3-layer dilated convolution network, where each layer comprises a dilated convolution layer, a ReLU activation layer and a dropout layer, and the calculation process is as follows:

z^(l1) = dConv(s^(l3))
z^(l2) = Dropout(ReLU(z^(l1)), 0.2)

wherein z^(l1) is the output of the dilated convolution layer of the l-th dilated convolution network, z^(l2) is the output of the dropout layer, and dConv() is the dilated convolution operation.
Further, the self-attention module comprises two Transformer encoder layers, and the calculation process is as follows:

h^(l1) = LayerNorm(z^(l2) + MultiHead(z^(l2)))
h^(l2) = LayerNorm(h^(l1) + FFN(h^(l1)))

wherein h^(l1) is the output of the multi-head self-attention sub-layer of the Transformer encoder layer (with residual connection and layer normalization); LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network, and h^(l2) is the output of the feed-forward sub-layer.
Further, the calculation process of the classification module is as follows:

y = Activation(MLP(h^(l2)))

wherein y is the predicted output chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
Further, the loss function of the DNA sequence diffusion model is the balanced loss:

L_D = - Σ_{j=1..C} w · y_j · log(p_j)

wherein L_D is the re-weighted traditional softmax cross-entropy loss function; C is the number of categories; p_j is the predicted probability of class j, and y_j is the true class label; w is the weight, which increases the penalty on generated samples and is determined by μ, where μ is an empirical value selected manually.
The long-tail chromatin state prediction method based on the diffusion model provided by the invention has the following beneficial effects:
the invention utilizes a DNA sequence diffusion model to generate a DNA sequence of tail type chromatin state from noise, thereby realizing sample balance; then, a chromatin state prediction model is trained using the class sample balanced dataset, which can effectively capture the gene-based grammar rules, thereby accurately predicting chromatin state.
In training the DNA sequence diffusion model, the invention provides a balanced loss, which reduces the influence caused by the deviation between real samples and generated samples by increasing the penalty on the generated samples.
The sample balancing method provides a simple, universal and model-independent solution for long-tail chromatin state prediction; in addition, the chromatin state prediction model comprises neural network operators such as convolution, dilated convolution and self-attention, and can effectively learn the grammar rules of genes, thereby realizing accurate classification of chromatin states.
Drawings
FIG. 1 is a flow chart of a long tail chromatin state prediction method based on a diffusion model according to the invention.
FIG. 2 is a block diagram of a long tail chromatin state prediction method based on a diffusion model according to the invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the present invention, but it should be understood that the present invention is not limited to the scope of these embodiments; for those skilled in the art, all inventions that make use of the inventive concept fall within the spirit and scope of the present invention as defined in the appended claims.
Example 1
This embodiment provides a long-tail chromatin state prediction method based on a diffusion model, which generates long-tail pseudo samples through a DNA sequence diffusion model and performs data balancing, so as to overcome the limitation of class-imbalanced data on chromatin state prediction. Since there may be a deviation between the samples generated by the DNA sequence diffusion model and the real samples, a balanced loss function is provided, which reduces the influence caused by the deviation between real samples and generated samples by increasing the penalty on the generated samples. Referring to fig. 1, the method specifically comprises the following steps:
step S1, an original DNA sequence is obtained, and the original DNA sequence is processed to obtain DNA coding data, which specifically comprises the following steps:
obtaining the chromatin states corresponding to the original DNA sequences, and extending or truncating both ends of the obtained original DNA sequences of different lengths to obtain DNA sequences of length L;
the DNA sequence with the length L is converted into coding matrix data of L multiplied by 4 by adopting a single-heat coding method.
S2, constructing a DNA sequence diffusion model based on the DNA coding data: the processed DNA coding data are fed into the diffusion model for training to obtain the DNA sequence diffusion model;
the DNA sequence diffusion model comprises a forward process and a backward process, wherein the forward process is a parameterized Markov chain that gradually adds Gaussian noise to the data until the data becomes random noise, and the backward process is a denoising process that gradually recovers the data through a noise predictor;
the forward process is to gradually add gaussian noise to the original data until the data becomes pure noise, which specifically includes:
given the state at the previous diffusion step, the probability distribution q(x_t | x_{t-1}) of the state at the current diffusion step is predicted:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)

equivalently, x_t = √(a_t)·x_{t-1} + √(1-a_t)·ε_{t-1} = √(ā_t)·x_0 + √(1-ā_t)·ε

wherein N(x_t; √(1-β_t)·x_{t-1}, β_t·I) denotes a Gaussian distribution with mean √(1-β_t)·x_{t-1} and variance β_t·I; x_t is the vector of each DNA sequence after the t-th addition of noise, x_{t-1} is the vector of each DNA sequence after the (t-1)-th addition of noise, and when t=0, x_0 is the one-hot encoded L×4 matrix data; β_t is a hyperparameter, a constant taking a value between 0 and 1; I is the identity matrix; ε_{t-1} is the base noise obtained at the (t-1)-th sampling; a_t = 1-β_t and ā_t = ∏_{i=1..t} a_i serve as weights, where a_t is the value of a at diffusion step t and a_i is the value of a at diffusion step i; ε is the new Gaussian noise obtained by combining the t Gaussian noises with different variances, namely the noise at diffusion step t; q(x_0) is the true data distribution of DNA sequences, and x_0 is a true DNA sequence sampled from q(x_0).
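As an illustration, the closed-form forward sampling x_t = √(ā_t)·x_0 + √(1-ā_t)·ε can be sketched in PyTorch as follows; the linear β schedule and the number of diffusion steps T are assumptions of the sketch, not values prescribed by the invention.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t, a small constant in (0, 1)
alphas = 1.0 - betas                         # a_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # a_bar_t = prod_{i<=t} a_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) for a batch of one-hot DNA sequences of shape (B, L, 4)."""
    a_bar = alpha_bars.to(x0.device)[t].view(-1, 1, 1)      # broadcast over (L, 4)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps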
The backward process learns the data distribution p(x) by gradually denoising from a Gaussian distribution, which is equivalent to learning the inverse process of a Markov chain of length T; a conditional generative model is built by adding "conditions" to this inverse process, so that sequences of different chromatin states can be generated in different cell types, and the conditional model is defined as p(x_{t-1} | x_t, c), specifically:
given the vector x_t of each DNA sequence after the t-th addition of noise, the probability distribution p(x_{t-1} | x_t, c) of the vector x_{t-1} of each DNA sequence after the (t-1)-th addition of noise, under the condition of cell type and chromatin state, is:

p(x_{t-1} | x_t, c) = N(x_{t-1}; μ(x_t, c), β_t·I)

wherein μ(x_t, c) and β_t·I are the mean and variance of p(x_{t-1} | x_t, c), respectively; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;

specifically, a fixed variance β_t·I is adopted, and a UNet neural network is used to fit the mean μ(x_t, c), realizing noise prediction at diffusion step t; the loss function of the UNet neural network is L_DM:

L_DM = E_{x_0, c, ε, t}[ ‖ε - ε_θ(x_t, c, t)‖² ]

wherein ε_θ is the noise predicted by the neural network with parameters θ, namely the noise predictor based on the UNet neural network; E_{x_0, c, ε, t}[·] denotes the expectation, and θ is the parameter of the UNet neural network.
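A minimal sketch of the noise-prediction objective L_DM follows; here noise_predictor stands for the conditional UNet described above, and its architecture and the exact interface for the condition c (cell type and chromatin state) are assumptions made for illustration.

import torch
import torch.nn.functional as F

def diffusion_loss(noise_predictor, x0, c, alpha_bars):
    """L_DM = E[ || eps - eps_theta(x_t, c, t) ||^2 ] over a batch x0 of shape (B, L, 4)."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)  # random diffusion step
    eps = torch.randn_like(x0)                                     # ground-truth noise
    a_bar = alpha_bars.to(x0.device)[t].view(-1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps           # forward sample x_t
    eps_pred = noise_predictor(x_t, t, c)                          # eps_theta(x_t, c, t)
    return F.mse_loss(eps_pred, eps)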
In view of the possible deviation between real sequences and generated sequences, this embodiment proposes a balanced loss, which aims to reduce the influence of the deviation between real DNA sequences and generated DNA sequences and is realized with a re-weighted traditional softmax cross-entropy loss function; this function serves as the loss function of the DNA sequence diffusion model, namely the balanced loss function:

L_D = - Σ_{j=1..C} w · y_j · log(p_j)

wherein L_D is the re-weighted traditional softmax cross-entropy loss function; C is the number of categories; p_j is the predicted probability of class j, and y_j is the true class label; w is the weight, which increases the penalty on generated samples and is determined by μ, where μ is an empirical value selected manually.

The loss functions of the DNA sequence diffusion model are optimized with the AdamW optimization algorithm, and training stops when the loss on the validation set reaches its minimum.
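As an illustration, the sketch below shows one possible form of the re-weighted softmax cross-entropy described above. The specific weighting rule, a factor μ applied to generated samples and 1.0 to real samples, is an assumption made for the sketch; the text above only specifies that the weight increases the penalty on generated samples and that μ is chosen empirically.

import torch
import torch.nn.functional as F

def balanced_loss(logits, labels, is_generated, mu: float = 2.0):
    """Re-weighted softmax cross-entropy; logits (B, C), labels (B,), is_generated bool (B,)."""
    log_p = F.log_softmax(logits, dim=-1)
    nll = -log_p.gather(1, labels.unsqueeze(1)).squeeze(1)   # -y_j * log(p_j) per sample
    # per the description, the weight on generated samples increases their penalty;
    # the value mu = 2.0 is arbitrary and would be tuned empirically
    w = torch.where(is_generated, torch.full_like(nll, mu), torch.ones_like(nll))
    return (w * nll).mean()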
Step S3, combining a noise predictor of UNet, and performing a reverse process of a conditional DNA sequence diffusion model to obtain balanced data sets with different chromatin state categories, wherein the method specifically comprises the following steps:
step S3.1, generating an L×4 noisy DNA sequence from a standard normal distribution, the L×4 noisy DNA sequence being iterated on the basis of the DNA sequence diffusion model until t=0;
step S3.2, predicting the noise value of the output DNA sequence with the noise predictor, based on the L×4 noisy DNA sequence, the cell type and chromatin state condition c, and the diffusion step t;
step S3.3, subtracting the noise value predicted by the current noise predictor from the L×4 noisy DNA sequence;
step S3.4, repeating steps S3.2 and S3.3 until t=0, generating a DNA sequence having the specified cell type and chromatin state;
step S3.5, repeating steps S3.2, S3.3 and S3.4 until a dataset balanced across the different chromatin state categories is obtained, as illustrated by the sketch following these steps.
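A minimal sketch of this conditional reverse (sampling) process is given below, following the standard DDPM update; the noise-predictor interface is the one assumed earlier, and the choice of √β_t for the standard deviation of the added noise is a common convention rather than something specified above.

import torch

@torch.no_grad()
def generate_sequences(noise_predictor, c, L, betas, alphas, alpha_bars, n=1):
    """Generate n conditional DNA sequences of shape (n, L, 4) for condition c."""
    x = torch.randn(n, L, 4)                               # S3.1: start from N(0, I)
    for t in reversed(range(len(betas))):                  # iterate down to t = 0
        t_batch = torch.full((n,), t, dtype=torch.long)
        eps_pred = noise_predictor(x, t_batch, c)          # S3.2: predict the noise
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()    # S3.3: remove the predicted noise
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean                                       # S3.4: final denoised sequence at t = 0
    return x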
Step S4, training a chromatin state prediction model based on the balanced dataset using a back propagation algorithm, wherein the DNA sequences in the balanced dataset are the input of the chromatin state prediction model; the model specifically comprises:
the motif-aware convolution module is used for extracting DNA sequence motifs and comprises a 3-layer convolution network, where each layer comprises a convolution layer, a ReLU activation layer, a dropout layer and a max pooling layer Maxpool, and the calculation process is as follows:

s^(l1) = Conv(x^(l))
s^(l2) = Dropout(ReLU(s^(l1)), 0.2)
s^(l3) = Maxpool(s^(l2))

wherein x^(l) and s^(l1) are the input and the output of the convolution layer of the l-th convolution network, respectively, and the input of the first layer is the balanced dataset; s^(l2) is the output of the dropout layer; s^(l3) is the output of the max pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is a regularization operation for preventing overfitting, with a dropout rate of 0.2; Maxpool() is the max pooling operation;
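A minimal PyTorch sketch of such a motif-aware convolution module is shown below; the channel widths, kernel size and pooling width are assumptions made for the sketch.

import torch.nn as nn

class MotifConvModule(nn.Module):
    """Three blocks of Conv -> ReLU -> Dropout(0.2) -> Maxpool over (B, 4, L) inputs."""
    def __init__(self, channels=(4, 128, 256, 512), kernel_size=9, pool=4):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size, padding=kernel_size // 2),
                       nn.ReLU(),
                       nn.Dropout(0.2),
                       nn.MaxPool1d(pool)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: one-hot DNA sequences, shape (B, 4, L)
        return self.net(x)         # shape (B, channels[-1], L / pool**3)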
the hole convolution module is used for learning DNA sequence grammar, and comprises 3 layers of hole convolution networks, wherein each layer of hole convolution network comprises a hole convolution layer, a ReLU activation layer and a dropout layer, and the calculation process is as follows:
z (l1) =dConv(s (l3) )
z (l2) =Dropout(ReLU(z (l1) ),0.2)
wherein z is (l1) For the output of the first hole convolution network, z (l2) Is the output of the dropout layer.
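A minimal PyTorch sketch of the dilated convolution module follows; the exponentially growing dilation rates and the fixed channel width are assumptions made for the sketch.

import torch.nn as nn

class DilatedConvModule(nn.Module):
    """Three blocks of dilated Conv -> ReLU -> Dropout(0.2), preserving sequence length."""
    def __init__(self, channels=512, kernel_size=3):
        super().__init__()
        blocks = []
        for dilation in (1, 2, 4):                 # widen the receptive field step by step
            blocks += [nn.Conv1d(channels, channels, kernel_size,
                                 dilation=dilation,
                                 padding=dilation * (kernel_size // 2)),
                       nn.ReLU(),
                       nn.Dropout(0.2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, s):          # s: output of the motif-aware module, shape (B, C, L')
        return self.net(s)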
The self-attention module is used for capturing the correlations within the DNA sequence grammar and comprises two Transformer encoder layers, and the calculation process is as follows:

h^(l1) = LayerNorm(z^(l2) + MultiHead(z^(l2)))
h^(l2) = LayerNorm(h^(l1) + FFN(h^(l1)))

wherein h^(l1) is the output of the multi-head self-attention sub-layer of the Transformer encoder layer (with residual connection and layer normalization); LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network, and h^(l2) is the output of the feed-forward sub-layer.
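A minimal PyTorch sketch of the self-attention module, built from two standard post-norm Transformer encoder layers (matching h = LayerNorm(z + MultiHead(z)) followed by LayerNorm(h + FFN(h))), is shown below; the head count, feed-forward width and dropout rate are assumptions made for the sketch.

import torch.nn as nn

class SelfAttentionModule(nn.Module):
    """Two Transformer encoder layers applied along the sequence dimension."""
    def __init__(self, d_model=512, n_heads=8, ffn_dim=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=ffn_dim,
                                           dropout=0.2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, z):          # z: (B, C, L') from the dilated convolution module
        z = z.transpose(1, 2)      # -> (B, L', C) for batch-first attention
        return self.encoder(z)     # (B, L', C)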
And a classification module for predicting and outputting the chromatin state of each DNA sequence.
The classification module comprises a fully connected neural network layer and an activation function, and its calculation process is as follows:

y = Activation(MLP(h^(l2)))

wherein y is the predicted output chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
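A minimal sketch of the classification head and the assembled prediction model, reusing the module sketches above, is given below; mean pooling over sequence positions and a softmax output activation are assumptions made for the sketch.

import torch
import torch.nn as nn

class ChromatinStatePredictor(nn.Module):
    """Motif-aware convolution -> dilated convolution -> self-attention -> classifier."""
    def __init__(self, n_states, d_model=512):
        super().__init__()
        self.motif = MotifConvModule()                         # sketched above
        self.dilated = DilatedConvModule(channels=d_model)     # sketched above
        self.attention = SelfAttentionModule(d_model=d_model)  # sketched above
        self.classifier = nn.Linear(d_model, n_states)         # MLP()

    def forward(self, x):                                      # x: one-hot DNA, (B, 4, L)
        h = self.attention(self.dilated(self.motif(x)))        # (B, L', d_model)
        h = h.mean(dim=1)                                      # pool over sequence positions
        # y = Activation(MLP(h)); softmax shown here, raw logits would be passed
        # to the cross-entropy loss during training
        return torch.softmax(self.classifier(h), dim=-1)

This ordering mirrors the description above: convolution extracts motifs, dilated convolution widens the receptive field to learn sequence grammar, and self-attention captures long-distance dependencies before classification.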
Further, experiments one and two are performed, comparing the method of the present invention with other methods of the prior art;
experiment one: the long tail chromatin prediction method based on the diffusion model provided by the invention obviously improves the prediction accuracy.
Table 1 summarizes the proposed long-tail chromatin state prediction method and three comparison methods: DeepSEA (method one), DanQ (method two) and Sei (method three); the comparison results on the ChromHMM dataset are shown in Table 1.
TABLE 1 Accuracy of chromatin state prediction

Method           Raw data    Data balancing based on the diffusion model
Method one       0.657       0.671
Method two       0.667       0.683
Method three     0.654       0.676
The invention    0.676       0.691
The main observations from table 1 are as follows:
(1) The diffusion model-based method of the invention realizes data balance across chromatin states, and the performance of all four methods is improved. This shows that the diffusion model-based data balancing method is model-independent, and the strategy can be widely adopted by different models.
(2) The chromatin state prediction model provided by the invention outperforms the other three methods. This shows that the proposed method can capture chromatin features more effectively, thereby realizing better chromatin state prediction.
Experiment II: the equalization loss provided by the invention can effectively reduce the influence caused by the deviation between the real sample and the generated sample
Table 2 summarizes the results of the comparison of the equalization loss proposed by the present invention among the four methods.
TABLE 2 Accuracy of chromatin state prediction

Method           Without equalization loss    With equalization loss
Method one       0.671                        0.706
Method two       0.683                        0.719
Method three     0.676                        0.702
The invention    0.691                        0.732
The main observations from table 2 are as follows:
(1) The balanced loss (equalization loss) provided by the invention improves the performance of all four methods. This shows that the proposed loss is model-independent. The diffusion model-based sample balancing method provided by the invention, combined with the balanced loss strategy, can be widely adopted by different models.
(2) The proposed chromatin state prediction method is superior to the comparison method.
In summary, the invention provides a diffusion model-based framework that can generate pseudo samples of different chromatin states in different cells to realize class sample balance, thereby solving the long-tail problem in chromatin state prediction; a balanced loss is provided, in which the influence caused by the deviation between real samples and pseudo samples is alleviated by increasing the penalty on the pseudo samples; the chromatin state prediction model effectively captures the motifs in the DNA sequence, thereby learning the grammar rules of genes and predicting chromatin states more accurately; in addition, the invention supports parallel operation on multiple GPUs and can be used for analyzing chromatin states at very large scale.
Although specific embodiments of the invention have been described in detail with reference to the accompanying drawings, it should not be construed as limiting the scope of protection of the present patent. Various modifications and variations which may be made by those skilled in the art without the creative effort are within the scope of the patent described in the claims.