CN116884495B - Diffusion model-based long tail chromatin state prediction method

Info

Publication number
CN116884495B
Authority
CN
China
Prior art keywords
dna sequence
chromatin state
noise
chromatin
layer
Legal status
Active
Application number
CN202310991350.8A
Other languages
Chinese (zh)
Other versions
CN116884495A (en)
Inventor
张永清
刘宇航
牛颢
龙树全
丁春利
杨显华
邹权
龚美琴
朱桂全
王紫轩
袁豪
吕嘉珩
Current Assignee
SICHUAN INSTITUTE OF COMPUTER SCIENCES
Chengdu University of Information Technology
Original Assignee
SICHUAN INSTITUTE OF COMPUTER SCIENCES
Chengdu University of Information Technology
Application filed by SICHUAN INSTITUTE OF COMPUTER SCIENCES and Chengdu University of Information Technology
Priority to CN202310991350.8A
Publication of CN116884495A
Application granted
Publication of CN116884495B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a long-tail chromatin state prediction method based on a diffusion model, comprising the following steps: S1, obtaining an original DNA sequence and processing it to obtain DNA encoding data; S2, constructing a DNA sequence diffusion model based on the DNA encoding data; S3, performing the reverse process of the conditional DNA sequence diffusion model with a UNet-based noise predictor to obtain a dataset balanced across the different chromatin state categories; S4, constructing a chromatin state prediction model on the balanced dataset using the back-propagation algorithm. The invention uses the DNA sequence diffusion model to generate DNA sequences of tail-class chromatin states from noise, thereby balancing the samples; the chromatin state prediction model is then trained on the class-balanced dataset and can effectively capture the grammar rules of genes, thereby predicting chromatin states accurately.

Description

Diffusion model-based long tail chromatin state prediction method
Technical Field
The invention belongs to the technical field of chromatin state prediction, and particularly relates to a long tail chromatin state prediction method based on a diffusion model.
Background
Chromatin state refers to the different structural and functional states of chromatin in different cell types. Chromatin states attract increasing interest because they reflect a wide range of cellular functions and conditions. Epigenetic modification of DNA sequences is a major factor in determining chromatin state. For example, the prior art defines 15 chromatin states with different biological roles by mapping 9 chromatin marks. Similarly, the prior art defines 18 chromatin states from 6 histone marks using ChIP-seq data. These studies indicate that chromatin states follow a long-tailed distribution, with some states far more abundant than others. For example, the number of enhancers is significantly greater than the number of insulators. Although genomic assays such as ChIP-seq can reveal chromatin states, they require expensive and time-consuming experiments. Therefore, computational methods for long-tail chromatin state prediction are urgently needed.
Many efforts have been made to predict chromatin states with deep learning algorithms. DeepSEA is a pioneering work that builds a CNN to predict 919 chromatin features from DNA sequences. Following DeepSEA, numerous researchers have made valuable explorations and breakthroughs in improving the performance of chromatin state prediction algorithms, with most of the work focused on model architecture. The prior art proposes a simple and efficient method consisting of a single CNN layer, a BiLSTM layer and a fully connected layer, in which the CNN learns motif information and the BiLSTM learns regulatory grammar. The prior art also proposes a hybrid DNN model, DeepFormer, that uses a CNN and flow attention to achieve accurate chromatin feature prediction with a limited number of parameters. The prior art also enlarges the receptive field by integrating dilated convolution without reducing spatial resolution, for better performance. However, these methods often ignore the long-tail problem among chromatin states. In particular, some methods balance samples by shuffling positive samples to produce negative samples, which introduces bias in practice. Other methods predict long-tailed chromatin states directly, leaving an imbalance between head and tail classes.
Long-tail learning aims to train a well-performing model from a large number of samples that follow a long-tailed class distribution. In practice, however, trained models are often biased toward the head classes and perform poorly on the tail classes.
The chromatin state prediction methods widely used in the prior art still have the following disadvantages:
First, existing chromatin state prediction methods usually ignore the long-tailed distribution of chromatin states and have difficulty predicting head-class and tail-class chromatin states at the same time, which limits their practicality.
Second, numerous studies have shown that genes have their own grammar rules: a large number of motifs are the "phrases" that make up the gene language, and parsing the grammar rules of genes is the first step in parsing chromatin states and inferring gene function. However, existing chromatin state prediction methods have difficulty capturing the relative positions of motifs and the long-range dependencies between them, and thus cannot accurately parse the gene grammar or characterize chromatin states.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a long-tail chromatin state prediction method based on a diffusion model, so as to remove the limitation that class-imbalanced data places on chromatin state prediction.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A long-tail chromatin state prediction method based on a diffusion model, comprising the following steps:
S1, acquiring an original DNA sequence, and processing the original DNA sequence to obtain DNA encoding data;
S2, constructing a DNA sequence diffusion model based on the DNA encoding data;
S3, performing the reverse process of the conditional DNA sequence diffusion model with a UNet-based noise predictor to obtain a dataset balanced across the different chromatin state categories;
S4, constructing a chromatin state prediction model on the balanced dataset using the back-propagation algorithm.
Further, step S1 includes:
obtaining the chromatin state corresponding to each original DNA sequence, and extending or truncating the left and right ends of the original DNA sequences, which differ in length, to obtain DNA sequences of length L;
converting each DNA sequence of length L into an L×4 encoding matrix by one-hot encoding.
Further, the DNA sequence diffusion model in step S2 includes a forward process and a backward process;
the forward process includes:
given the state at the previous diffusion step, the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

$$x_t = \sqrt{a_t}\,x_{t-1} + \sqrt{1-a_t}\,\epsilon_{t-1} = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$$

where $\mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, and $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise; when t = 0, $x_0$ is the L×4 matrix obtained by one-hot encoding; $\beta_t$ is a hyperparameter and I is the identity matrix; $\epsilon_{t-1}$ is the base noise drawn at the (t-1)-th sampling; $a_t = 1-\beta_t$ and $\bar{a}_t = \prod_{i=1}^{t} a_i$ is the weight, where $a_t$ is the value of a at diffusion step t, $a_i$ is the value of a at diffusion step i, and a is a hyperparameter; $\epsilon$ is the new Gaussian noise obtained by merging t Gaussian distributions with different variances, i.e. the noise at diffusion step t.
Further, the backward process includes:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise and the condition of cell type and chromatin state, the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise is:

$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$

where $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $\mathcal{N}(x_{t-1};\ \mu(x_t, c),\ \beta_t I)$; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
using a fixed variance $\beta_t I$, the mean $\mu(x_t, c)$ is fitted with a UNet neural network, which realizes noise prediction at diffusion step t; the loss function of the UNet neural network is $L_{DM}$:

$$L_{DM} = \mathbb{E}_{x_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right]$$

where $\epsilon_\theta$ is the noise predicted by the neural network with parameters θ, i.e. the UNet-based noise predictor; $\mathbb{E}$ denotes the expectation; θ denotes the parameters of the UNet neural network.
Further, step S3 includes:
S3.1, generating an L×4 noisy DNA sequence from a standard normal distribution, and iterating on it with the DNA sequence diffusion model until t = 0;
S3.2, using the noise predictor to predict the noise in the DNA sequence from the L×4 noisy DNA sequence, the cell type and chromatin state condition c, and the diffusion step t;
S3.3, subtracting the noise currently predicted by the noise predictor from the L×4 noisy DNA sequence;
S3.4, repeating steps S3.2 and S3.3 until t = 0, generating a DNA sequence with a specific cell type and chromatin state;
S3.5, repeating steps S3.2, S3.3 and S3.4 until a dataset balanced across the different chromatin state categories is obtained.
Further, the chromatin state prediction model includes:
a motif-aware convolution module, which extracts DNA sequence motifs and comprises 3 convolutional networks, each consisting of a convolution layer, a ReLU activation layer, a dropout layer and a max-pooling layer (Maxpool), computed as follows:

$$s^{(l1)} = \mathrm{Conv}(x^{(l)})$$
$$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}), 0.2)$$
$$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$$

where $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the convolution layer in the l-th convolutional network, the initial input being the balanced dataset; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max-pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is the function used to prevent overfitting, with a dropout rate of 0.2; Maxpool() is the max-pooling layer;
a dilated (hole) convolution module, which learns the grammar of the DNA sequence;
a self-attention module, which captures the correlations within the DNA sequence grammar;
and a classification module, which determines the chromatin state of each DNA sequence and outputs the predicted chromatin state.
Further, the dilated convolution module comprises 3 dilated convolutional networks, each consisting of a dilated convolution layer, a ReLU activation layer and a dropout layer, computed as follows:

$$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$$
$$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}), 0.2)$$

where $z^{(l1)}$ is the output of the dilated convolution layer in the l-th dilated convolutional network and $z^{(l2)}$ is the output of the dropout layer.
Further, the self-attention module comprises two Transformer encoder layers, computed as follows:

$$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$$
$$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$$

where $h^{(l1)}$ is the output of the attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network, and $h^{(l2)}$ is the output of the feed-forward sub-layer.
Further, the calculation process of the classification module is as follows:

$$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$$

where y is the predicted chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
Further, the loss function of the DNA sequence diffusion model is:

$$L_D = -w \sum_{j=1}^{C} y_j \log(p_j)$$

where $L_D$ is the re-weighted conventional softmax cross-entropy loss; C is the number of categories; $p_j$ is the predicted probability of class j; $y_j$ is the true class label; w is the weight;
wherein:
where μ is an empirical value selected manually.
The long-tail chromatin state prediction method based on the diffusion model provided by the invention has the following beneficial effects:
The invention uses a DNA sequence diffusion model to generate DNA sequences of tail-class chromatin states from noise, thereby balancing the samples; the chromatin state prediction model is then trained on the class-balanced dataset and can effectively capture the grammar rules of genes, thereby predicting chromatin states accurately.
For training the DNA sequence diffusion model, the invention provides a balanced loss, which reduces the influence of the deviation between real and generated samples by increasing the penalty on generated samples.
The sample balancing method provides a simple, general and model-independent solution for long-tail chromatin state prediction; in addition, the chromatin state prediction model combines neural network operators such as convolution, dilated convolution and self-attention, and can effectively learn the grammar rules of genes, thereby classifying chromatin states accurately.
Drawings
FIG. 1 is a flow chart of a long tail chromatin state prediction method based on a diffusion model according to the invention.
FIG. 2 is a block diagram of a long tail chromatin state prediction method based on a diffusion model according to the invention.
Detailed Description
The following description of specific embodiments is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments: to those skilled in the art, all variations that fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept, are protected.
Example 1
This embodiment provides a long-tail chromatin state prediction method based on a diffusion model. It generates pseudo samples of the long-tail classes with a DNA sequence diffusion model and balances the data, thereby removing the limitation that imbalanced data places on chromatin state prediction. Because the samples generated by the DNA sequence diffusion model may deviate from real samples, a balanced loss function is also provided, which reduces the influence of this deviation by increasing the penalty on generated samples. Referring to FIG. 1, the method comprises the following steps:
Step S1, an original DNA sequence is obtained and processed to obtain DNA encoding data, specifically:
obtaining the chromatin state corresponding to each original DNA sequence, and extending or truncating the left and right ends of the original DNA sequences, which differ in length, to obtain DNA sequences of length L;
converting each DNA sequence of length L into an L×4 encoding matrix by one-hot encoding.
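Step S1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not say how a short sequence is extended, so the sketch pads with the symbol "N" (encoded as an all-zero row), trims symmetrically around the centre, and uses the column order A, C, G, T; the function names `fit_length` and `one_hot` are hypothetical.

```python
# Sketch of step S1: fix the sequence length, then one-hot encode.
# Assumptions (not from the patent): padding symbol "N", A/C/G/T column order,
# and symmetric extension/truncation around the sequence centre.
BASES = "ACGT"

def fit_length(seq: str, length: int, pad: str = "N") -> str:
    """Extend or truncate both ends of `seq` so it has exactly `length` bases."""
    seq = seq.upper()
    diff = length - len(seq)
    if diff >= 0:
        left = diff // 2
        return pad * left + seq + pad * (diff - left)
    cut = -diff
    left = cut // 2
    return seq[left:len(seq) - (cut - left)]

def one_hot(seq: str) -> list[list[int]]:
    """Return an L x 4 matrix; unknown bases (e.g. N) become all-zero rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

x0 = one_hot(fit_length("ACGTAC", 4))   # 6 bases trimmed to the central 4
```

In a real pipeline the extension would more plausibly use the flanking genomic sequence rather than padding; that choice is left open here.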
Step S2, constructing a DNA sequence diffusion model based on the DNA encoding data: the processed DNA encoding data are fed into a diffusion model for training to obtain the DNA sequence diffusion model;
the DNA sequence diffusion model comprises a forward process and a backward process. The forward process is a parameterized Markov chain, a blurring process that gradually adds Gaussian noise to the data until the data become random noise; the backward process is a denoising process that gradually recovers the data through a noise predictor;
the forward process gradually adds Gaussian noise to the original data until the data become pure noise, specifically:
given the state at the previous diffusion step, the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

$$x_t = \sqrt{a_t}\,x_{t-1} + \sqrt{1-a_t}\,\epsilon_{t-1} = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$$

where $\mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, and $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise; when t = 0, $x_0$ is the L×4 matrix obtained by one-hot encoding; $\beta_t$ is a hyperparameter, a constant with a value between 0 and 1; I is the identity matrix; $\epsilon_{t-1}$ is the base noise drawn at the (t-1)-th sampling; $a_t = 1-\beta_t$ and $\bar{a}_t = \prod_{i=1}^{t} a_i$ is the weight, where $a_t$ is the value of a at diffusion step t, $a_i$ is the value of a at diffusion step i, and a is a hyperparameter; $\epsilon$ is the new Gaussian noise obtained by merging t Gaussian distributions with different variances, i.e. the noise at diffusion step t; $q(x_0)$ is the true data distribution of DNA sequences, and $x_0$ is a real DNA sequence sampled from $q(x_0)$.
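The forward process can be illustrated with a short NumPy sketch that draws $x_t$ directly from $x_0$ through the closed form $x_t = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$. The number of steps (T = 1000) and the linear schedule for $\beta_t$ are assumptions chosen for illustration; the patent only requires each $\beta_t$ to lie between 0 and 1. The function name `q_sample` is hypothetical.

```python
import numpy as np

# Assumed settings (not given in the patent): T = 1000 steps, linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_t, each in (0, 1)
alphas = 1.0 - betas                     # a_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # bar{a}_t = product of a_i for i <= t

def q_sample(x0, t, rng):
    """Draw x_t from x_0 with the closed form of the forward process.

    t is a 1-based diffusion step (1..T); returns (x_t, eps).
    """
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = np.eye(4)[rng.integers(0, 4, size=200)]   # random one-hot sequence, L = 200
x_early, _ = q_sample(x0, 1, rng)              # still very close to x0
x_late, _ = q_sample(x0, T, rng)               # essentially pure Gaussian noise
```

With this schedule $\bar{a}_T$ is close to zero, so the signal term has almost vanished at the last step, which is what allows the reverse process to start from a standard normal sample.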
The backward process learns the data distribution p(x) by gradually denoising from a Gaussian distribution, which is equivalent to learning the reverse of a Markov chain of length T. A general model is built by adding a "condition" to the reverse process, so that sequences with different chromatin states in different cell types can be generated; this general model is defined as $p(x_{t-1} \mid x_t, c)$, specifically:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise and the condition of cell type and chromatin state, the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise is:

$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$

where $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $\mathcal{N}(x_{t-1};\ \mu(x_t, c),\ \beta_t I)$; c is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
specifically, using a fixed variance $\beta_t I$, the mean $\mu(x_t, c)$ is fitted with a UNet neural network, which realizes noise prediction at diffusion step t; the loss function of the UNet neural network is $L_{DM}$:

$$L_{DM} = \mathbb{E}_{x_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right]$$

where $\epsilon_\theta$ is the noise predicted by the neural network with parameters θ, i.e. the UNet-based noise predictor; $\mathbb{E}$ denotes the expectation; θ denotes the parameters of the UNet neural network.
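For reference, the link between "fitting the mean" and "predicting the noise" can be written out explicitly. The relation below is the standard DDPM parameterization and is an assumption here, since the patent states only that the UNet fits $\mu(x_t, c)$ through noise prediction with the variance fixed at $\beta_t I$; the symbols are those defined above.

```latex
\mu(x_t, c) = \frac{1}{\sqrt{a_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{a}_t}}\,\epsilon_\theta(x_t, t, c) \right),
\qquad
x_{t-1} = \mu(x_t, c) + \sqrt{\beta_t}\, z, \quad z \sim \mathcal{N}(0, I)
```

Under this parameterization, a network that predicts the noise well also determines the reverse-step mean, which is why minimizing $L_{DM}$ is sufficient to train the backward process.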
Because the generated sequences may deviate from the real sequences, this embodiment proposes a balanced loss that aims to reduce the influence of the deviation between real and generated DNA sequences. It is implemented as a re-weighted conventional softmax cross-entropy loss, which serves as the loss function of the DNA sequence diffusion model; the balanced loss function is:

$$L_D = -w \sum_{j=1}^{C} y_j \log(p_j)$$

where $L_D$ is the re-weighted conventional softmax cross-entropy loss; C is the number of categories; $p_j$ is the predicted probability of class j; $y_j$ is the true class label; w is the weight;
wherein:
where μ is an empirical value selected manually.
The loss function of the DNA sequence diffusion model is optimized with the AdamW optimization algorithm, and training stops when the loss on the validation set reaches its minimum.
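The idea of a re-weighted softmax cross-entropy can be sketched as follows. The exact rule for the weight w is not legible in this text, so the sketch ASSUMES the simplest rule consistent with the description: real samples get weight 1 and generated samples get the hand-picked value μ (greater than 1, so that errors on generated samples are penalized more). The function name `balanced_loss` and the default μ = 2.0 are hypothetical.

```python
import math

def balanced_loss(probs, label, is_generated, mu=2.0):
    """Re-weighted softmax cross-entropy for one sample.

    probs        : predicted class probabilities p_j (sum to 1)
    label        : index of the true class (y_j = 1 only at this index)
    is_generated : True if the sample came from the diffusion model
    mu           : hand-picked weight; ASSUMED to apply to generated samples only
    """
    w = mu if is_generated else 1.0      # hypothetical weighting rule
    return -w * math.log(probs[label])

real_loss = balanced_loss([0.5, 0.5], label=0, is_generated=False)
fake_loss = balanced_loss([0.5, 0.5], label=0, is_generated=True)
```

With one-hot labels the sum over classes collapses to the single term for the true class, which is why only `probs[label]` appears in the code.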
Step S3, performing the reverse process of the conditional DNA sequence diffusion model with a UNet-based noise predictor to obtain a dataset balanced across the different chromatin state categories, specifically:
Step S3.1, generating an L×4 noisy DNA sequence from a standard normal distribution, and iterating on it with the DNA sequence diffusion model until t = 0;
Step S3.2, using the noise predictor to predict the noise in the DNA sequence from the L×4 noisy DNA sequence, the cell type and chromatin state condition c, and the diffusion step t;
Step S3.3, subtracting the noise currently predicted by the noise predictor from the L×4 noisy DNA sequence;
Step S3.4, repeating step S3.2 and step S3.3 until t = 0, generating a DNA sequence with a specific cell type and chromatin state;
Step S3.5, repeating step S3.2, step S3.3 and step S3.4 until a dataset balanced across the different chromatin state categories is obtained.
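Steps S3.1 to S3.5 can be sketched as a sampling loop. The sketch assumes the standard DDPM update (scaled noise subtraction followed by fresh noise of variance $\beta_t$), a 50-step linear schedule, and argmax decoding of the final L×4 matrix into bases; none of these details are fixed by the patent. `dummy_eps_model` is a placeholder for the trained UNet, so the sequence it produces is random, and the condition tuple and class counts are illustrative values only.

```python
import numpy as np

def sample_sequence(eps_model, cond, length, betas, rng):
    """Ancestral sampling for steps S3.1-S3.4 with a pluggable noise predictor."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((length, 4))             # S3.1: start from pure noise
    for t in range(len(betas), 0, -1):               # t = T, ..., 1
        eps_hat = eps_model(x, t, cond)              # S3.2: predict the noise
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])   # S3.3: remove it
        if t > 1:                                    # fresh noise except at the last step
            x = x + np.sqrt(betas[t - 1]) * rng.standard_normal(x.shape)
    return "".join("ACGT"[i] for i in x.argmax(axis=1))     # S3.4: decode to bases

def samples_needed(class_counts):
    """S3.5: number of sequences to generate per class to match the largest class."""
    target = max(class_counts.values())
    return {c: target - n for c, n in class_counts.items()}

def dummy_eps_model(x, t, cond):
    """Placeholder for the trained UNet; returns zeros so the sketch runs."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
seq = sample_sequence(dummy_eps_model, cond=("K562", "Insulator"),
                      length=12, betas=betas, rng=rng)
todo = samples_needed({"Enhancer": 900, "Insulator": 100})
```

Calling `sample_sequence` once per missing sample, with the condition set to the under-represented class, yields the balanced dataset that step S4 trains on.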
Step S4, training on the balanced dataset with the back-propagation algorithm to construct a chromatin state prediction model, which takes the DNA sequences in the balanced dataset as input and specifically comprises:
a motif-aware convolution module, which extracts DNA sequence motifs and comprises 3 convolutional networks, each consisting of a convolution layer, a ReLU activation layer, a dropout layer and a max-pooling layer (Maxpool), computed as follows:

$$s^{(l1)} = \mathrm{Conv}(x^{(l)})$$
$$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}), 0.2)$$
$$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$$

where $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the convolution layer in the l-th convolutional network, the initial input being the balanced dataset; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max-pooling layer Maxpool; Conv() is the convolution operation, ReLU() is the activation function, Dropout() is the function used to prevent overfitting, with a dropout rate of 0.2; Maxpool() is the max-pooling layer.
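One convolutional network of the motif-aware module (Conv, then ReLU, then Dropout with rate 0.2, then Maxpool) can be sketched in NumPy. The kernel width (8), the number of filters (16), the pooling size (2) and the absence of padding are assumptions, since the patent does not give them; the function names are hypothetical.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution. x: (L, C_in), w: (K, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])  # (L-K+1, K, C_in)
    return np.einsum("lkc,kco->lo", windows, w) + b

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries and rescale the rest."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def maxpool1d(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    n = x.shape[0] // size
    return x[:n * size].reshape(n, size, -1).max(axis=1)

def motif_block(x, w, b, rng, training=True):
    s1 = conv1d(x, w, b)                                   # s^(l1) = Conv(x^(l))
    s2 = dropout(np.maximum(s1, 0.0), 0.2, rng, training)  # s^(l2) = Dropout(ReLU(.), 0.2)
    return maxpool1d(s2, 2)                                # s^(l3) = Maxpool(s^(l2))

rng = np.random.default_rng(0)
x = np.eye(4)[rng.integers(0, 4, size=100)]          # one-hot input, L = 100
w = rng.standard_normal((8, 4, 16)) * 0.1            # 16 filters of width 8 (assumed)
out = motif_block(x, w, np.zeros(16), rng, training=False)
```

Each filter acts as a position-weight-matrix-like motif scanner over the one-hot sequence, and the pooling step keeps the strongest match in each window.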
a dilated (hole) convolution module, which learns the DNA sequence grammar and comprises 3 dilated convolutional networks, each consisting of a dilated convolution layer, a ReLU activation layer and a dropout layer, computed as follows:

$$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$$
$$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}), 0.2)$$

where $z^{(l1)}$ is the output of the dilated convolution layer in the l-th dilated convolutional network and $z^{(l2)}$ is the output of the dropout layer.
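The benefit of dilation can be made concrete by computing the receptive field of a stack of stride-1 convolutions, which grows by (kernel size - 1) × dilation per layer. The kernel size 3 and the dilation rates 1, 2, 4 below are illustrative assumptions; the patent specifies only that there are three dilated layers.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

plain = receptive_field(3, [1, 1, 1])      # three ordinary layers
dilated = receptive_field(3, [1, 2, 4])    # three dilated layers, same parameter count
```

Here the dilated stack sees 15 positions against 7 for the ordinary stack, with no pooling and therefore no loss of resolution, which is what lets the module relate motifs that lie far apart.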
a self-attention module, which captures the correlations within the DNA sequence grammar and comprises two Transformer encoder layers, computed as follows:

$$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$$
$$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$$

where $h^{(l1)}$ is the output of the attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network, and $h^{(l2)}$ is the output of the feed-forward sub-layer.
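The two equations of one encoder layer can be sketched in NumPy as follows. The sketch makes several simplifying assumptions that are not in the patent: 4 heads, model width 16, an FFN that expands to 4 times the width with a ReLU, layer normalization without learnable gain and bias, and random untrained weights.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(z, wq, wk, wv, wo, heads):
    """Multi-head self-attention over z of shape (L, d)."""
    L, d = z.shape
    dh = d // heads
    split = lambda m: m.reshape(L, heads, dh).transpose(1, 0, 2)   # (heads, L, dh)
    q, k, v = split(z @ wq), split(z @ wk), split(z @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))         # (heads, L, L)
    merged = (attn @ v).transpose(1, 0, 2).reshape(L, d)
    return merged @ wo

def encoder_layer(z, p, heads=4):
    h1 = layer_norm(z + multi_head(z, p["wq"], p["wk"], p["wv"], p["wo"], heads))
    ffn = np.maximum(h1 @ p["w1"], 0.0) @ p["w2"]                  # two-layer ReLU FFN
    return layer_norm(h1 + ffn)                                    # h^(l2)

rng = np.random.default_rng(0)
d = 16
p = {k: rng.standard_normal(s) * 0.1 for k, s in
     [("wq", (d, d)), ("wk", (d, d)), ("wv", (d, d)), ("wo", (d, d)),
      ("w1", (d, 4 * d)), ("w2", (4 * d, d))]}
h = encoder_layer(rng.standard_normal((46, d)), p)
```

Because every position attends to every other position, the attention matrix of shape (L, L) per head is where pairwise dependencies between motif features are modelled.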
and a classification module, which determines the chromatin state of each DNA sequence and outputs the predicted chromatin state.
The classification module consists of one fully connected layer and an activation function, and is computed as follows:

$$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$$

where y is the predicted chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
Further, two experiments are performed to compare the method of the invention with other prior-art methods.
Experiment one: the long-tail chromatin state prediction method based on the diffusion model provided by the invention clearly improves prediction accuracy.
Table 1 summarizes the proposed long-tail chromatin state prediction method and three comparison methods, DeepSEA (method one), DanQ (method two) and Sei (method three); the comparison results on the ChromHMM dataset are shown in Table 1.
Table 1. Accuracy of chromatin state prediction
Method           Raw data    Data balanced by the diffusion model
Method one       0.657       0.671
Method two       0.667       0.683
Method three     0.654       0.676
The invention    0.676       0.691
The main observations from Table 1 are as follows:
(1) Balancing the chromatin state data with the diffusion model of the invention improves the performance of all four methods. This shows that the diffusion-model-based data balancing method is model-independent and can be widely adopted by different models.
(2) The chromatin state prediction model provided by the invention outperforms the other three methods. This shows that the method provided by the invention captures chromatin features more effectively and therefore predicts chromatin states more accurately.
Experiment two: the balanced loss provided by the invention effectively reduces the influence of the deviation between real samples and generated samples.
Table 2 summarizes the comparison results for the balanced loss proposed by the invention across the four methods.
Table 2. Accuracy of chromatin state prediction
Method           Without balanced loss    With balanced loss
Method one       0.671                    0.706
Method two       0.683                    0.719
Method three     0.676                    0.702
The invention    0.691                    0.732
The main observations from Table 2 are as follows:
(1) The balanced loss provided by the invention improves the performance of all four methods. This shows that the balanced loss is model-independent. The diffusion-model-based sample balancing method, combined with the balanced loss strategy, can be widely adopted by different models.
(2) The proposed chromatin state prediction method outperforms the comparison methods.
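The size of each improvement can be read off the two tables with a little arithmetic. The snippet below copies the accuracies from Tables 1 and 2 (method one is DeepSEA, method two is DanQ, method three is Sei) and computes the absolute gain from data balancing and the further gain from the balanced loss; the dictionary and variable names are illustrative.

```python
# Accuracies copied from Tables 1 and 2.
raw       = {"DeepSEA": 0.657, "DanQ": 0.667, "Sei": 0.654, "Proposed": 0.676}
balanced  = {"DeepSEA": 0.671, "DanQ": 0.683, "Sei": 0.676, "Proposed": 0.691}
with_loss = {"DeepSEA": 0.706, "DanQ": 0.719, "Sei": 0.702, "Proposed": 0.732}

# Absolute accuracy gain from diffusion-based balancing (Table 1).
gain_balance = {m: round(balanced[m] - raw[m], 3) for m in raw}
# Further gain from adding the balanced loss on top of balancing (Table 2).
gain_loss = {m: round(with_loss[m] - balanced[m], 3) for m in raw}
# Combined gain over the raw-data baseline.
gain_total = {m: round(with_loss[m] - raw[m], 3) for m in raw}
```

Balancing alone adds between 0.014 and 0.022 in accuracy depending on the method, while the balanced loss adds a further 0.026 to 0.041, so in these results the loss contributes the larger share of the improvement.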
In summary, the invention provides a diffusion-model-based framework that generates pseudo samples of different chromatin states in different cell types to balance the class samples, thereby solving the long-tail problem in chromatin state prediction. It also provides a balanced loss, which mitigates the influence of the deviation between real samples and pseudo samples by increasing the penalty on pseudo samples. The chromatin state prediction model effectively captures the motifs in DNA sequences and thus learns the grammar rules of genes, so that chromatin states are predicted more accurately. In addition, the invention supports parallel operation on multiple GPUs and can be used for ultra-large-scale chromatin state analysis.
Although specific embodiments of the invention have been described in detail with reference to the accompanying drawings, they should not be construed as limiting the scope of protection of this patent. Modifications and variations that those skilled in the art can make without creative effort remain within the scope of protection described in the claims.

Claims (7)

1. A long tail chromatin state prediction method based on a diffusion model is characterized by comprising the following steps:
S1, acquiring an original DNA sequence, and processing the original DNA sequence to obtain DNA coding data;
S2, constructing a DNA sequence diffusion model based on the DNA coding data;
S3, performing the reverse process of the conditional DNA sequence diffusion model in combination with a UNet-based noise predictor to obtain a balanced data set covering the different chromatin state categories;
S4, constructing a chromatin state prediction model by adopting a back propagation algorithm based on the balanced data set;
the step S1 includes:
obtaining the chromatin states corresponding to the original DNA sequences, and padding or truncating the left and right ends of the obtained original DNA sequences of different lengths to obtain DNA sequences of length L;
converting each DNA sequence of length L into L×4 coding matrix data by adopting a one-hot encoding method;
the DNA sequence diffusion model in the step S2 comprises a forward process and a backward process;
the forward process includes:
given the state at the previous diffusion step, predicting the probability distribution $q(x_t \mid x_{t-1})$ of the state at the current diffusion step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
$$x_t = \sqrt{a_t}\,x_{t-1} + \sqrt{1-a_t}\,\epsilon_{t-1} = \sqrt{\bar{a}_t}\,x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$$
wherein $\mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t I$; $x_t$ is the vector of each DNA sequence after the t-th addition of noise, and $x_{t-1}$ is the vector of each DNA sequence after the (t-1)-th addition of noise; when t=0, $x_0$ is the one-hot encoded L×4 matrix data; $\beta_t$ is a hyperparameter and $I$ is the identity matrix; $\epsilon_{t-1}$ is the base noise obtained from the (t-1)-th sampling; $a_t = 1-\beta_t$, and $\bar{a}_t = \prod_{i=1}^{t} a_i$ serves as the weight, wherein $a_t$ is the value of $a$ at diffusion step t, $a_i$ is the value of $a$ at diffusion step i, and $a$ is a hyperparameter; $\epsilon$ is the new Gaussian noise obtained by adding t Gaussian distributions with different variances, namely the noise at diffusion step t;
the backward process comprises:
given the vector $x_t$ of each DNA sequence after the t-th addition of noise, and under the condition of cell type and chromatin state, predicting the probability distribution $p(x_{t-1} \mid x_t, c)$ of the vector $x_{t-1}$ of each DNA sequence after the (t-1)-th addition of noise:
$$p(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu(x_t, c),\ \beta_t I\right)$$
wherein $\mu(x_t, c)$ and $\beta_t I$ are respectively the mean and variance of $p(x_{t-1} \mid x_t, c)$; $c$ is the condition, namely the cell type and chromatin state corresponding to the current DNA sequence;
using the fixed variance $\beta_t I$ and fitting the mean $\mu(x_t, c)$ with a UNet neural network, thereby realizing noise prediction at diffusion step t, wherein the loss function of the UNet neural network is $L_{DM}$:
$$L_{DM} = \mathbb{E}_{x_0,\,c,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, c, t) \right\rVert^2\right]$$
wherein $\epsilon_\theta$ is the noise predicted by the neural network under the parameter $\theta$, namely the noise predictor based on the UNet neural network; $\mathbb{E}$ denotes the expectation; and $\theta$ is the parameter of the UNet neural network.
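By way of non-limiting illustration, the encoding of step S1 and the forward process of claim 1 can be sketched in plain Python as follows. The sketch relies on the standard closed form x_t = sqrt(abar_t)·x_0 + sqrt(1 − abar_t)·ε, where abar_t is the cumulative product of a_i = 1 − β_i. The padding symbol `N`, the linear β schedule from 1e-4 to 0.02 over 1000 steps, and all function names are assumptions made for the example and are not specified by the claim; the loss is averaged over matrix entries, which differs from the squared norm only by a constant factor.

```python
import math
import random

BASES = "ACGT"

def fit_length(seq, length, pad="N"):
    """Pad or truncate both ends so the sequence has exactly `length` bases.
    The pad symbol "N" is an assumption; the claim does not name one."""
    extra = len(seq) - length
    if extra > 0:                        # too long: trim both ends
        left = extra // 2
        return seq[left:left + length]
    short = -extra                       # too short: pad both ends
    left = short // 2
    return pad * left + seq + pad * (short - left)

def one_hot(seq):
    """Encode a DNA string as an L x 4 matrix; unknown bases give all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def cumulative_alphas(betas):
    """abar_t = product over i <= t of (1 - beta_i)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t directly from x_0 and return it together with the noise used."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    s, n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [[s * v + n * e for v, e in zip(rv, re)] for rv, re in zip(x0, eps)]
    return xt, eps

def noise_mse(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    sq = [(a - b) ** 2 for ra, rb in zip(eps, eps_pred) for a, b in zip(ra, rb)]
    return sum(sq) / len(sq)

rng = random.Random(0)
x0 = one_hot(fit_length("ACGTAC", 8))              # L = 8
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]  # assumed schedule
abar = cumulative_alphas(betas)
xt, eps = q_sample(x0, 500, abar, rng)             # noisy sequence at step 500
```

In training, `eps` would be compared with the output of the UNet noise predictor evaluated on `xt`, the condition c and the step t; the predictor itself is outside the scope of this sketch.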
2. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the step S3 comprises:
S3.1, generating an L×4 noisy DNA sequence from a standard normal distribution, and iterating on the L×4 noisy DNA sequence based on the DNA sequence diffusion model until t=0;
S3.2, predicting a noise prediction value of the DNA sequence with the noise predictor according to the L×4 noisy DNA sequence, the condition c of cell type and chromatin state, and the diffusion step t;
S3.3, subtracting the noise prediction value output by the current noise predictor from the L×4 noisy DNA sequence;
S3.4, repeating steps S3.2 and S3.3 until t=0, thereby generating a DNA sequence having a specific cell type and chromatin state;
S3.5, repeating steps S3.2, S3.3 and S3.4 until balanced data sets with different chromatin state categories are obtained.
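By way of non-limiting illustration, steps S3.1 to S3.5 can be sketched as the loop below. The update rule used in the loop is the standard DDPM reverse step, which is one common way to realize "subtracting the predicted noise"; the claim does not fix the exact coefficients. The `dummy_noise_predictor` is a hypothetical stand-in for the trained UNet, and the schedule, the argmax decoding to bases, and the condition labels are likewise assumptions.

```python
import math
import random

def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule and its cumulative products abar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / max(steps - 1, 1)
             for i in range(steps)]
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    return betas, abar

def dummy_noise_predictor(x, cond, t):
    """Hypothetical placeholder for the trained UNet: always predicts zero noise."""
    return [[0.0] * len(row) for row in x]

def sample_sequence(length, cond, steps, predictor, rng):
    betas, abar = make_schedule(steps)
    # S3.1: start from an L x 4 matrix of standard normal noise
    x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(length)]
    for t in reversed(range(steps)):
        eps_hat = predictor(x, cond, t)                # S3.2: predict the noise
        coef = betas[t] / math.sqrt(1.0 - abar[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no fresh noise at the last step
        # S3.3: remove the predicted noise, then add the step's sampling noise
        x = [[scale * (v - coef * e) + sigma * rng.gauss(0.0, 1.0)
              for v, e in zip(rv, re)] for rv, re in zip(x, eps_hat)]
    # S3.4: decode each row to its highest-scoring base
    return "".join("ACGT"[max(range(4), key=row.__getitem__)] for row in x)

def balance(counts, generate):
    """S3.5: generate pseudo samples until every class reaches the largest class size."""
    target = max(counts.values())
    return {state: [generate(state) for _ in range(target - n)]
            for state, n in counts.items()}

rng = random.Random(0)
counts = {"state_1": 5, "state_2": 2}                  # hypothetical class sizes
pseudo = balance(counts, lambda s: sample_sequence(
    10, ("cell_A", s), 20, dummy_noise_predictor, rng))
```

With an untrained placeholder predictor the generated sequences are random; the sketch only shows the control flow of the sampling and balancing steps.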
3. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the chromatin state prediction model comprises:
a motif-aware convolution module for extracting DNA sequence motifs, comprising a 3-layer convolutional network, wherein each layer of the convolutional network comprises a convolution layer, a ReLU activation layer, a dropout layer and a max pooling layer Maxpool, and the calculation process is as follows:
$$s^{(l1)} = \mathrm{Conv}(x^{(l)})$$
$$s^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(s^{(l1)}),\ 0.2)$$
$$s^{(l3)} = \mathrm{Maxpool}(s^{(l2)})$$
wherein $x^{(l)}$ and $s^{(l1)}$ are respectively the input and output of the convolution layer in the l-th layer of the convolutional network, the input being the balanced data set; $s^{(l2)}$ is the output of the dropout layer; $s^{(l3)}$ is the output of the max pooling layer Maxpool; Conv() is the convolution operation; ReLU() is the activation function; Dropout() is a function for preventing overfitting, with a drop rate of 0.2; Maxpool() is the max pooling operation;
a hole (dilated) convolution module for learning the grammar of the DNA sequence;
a self-attention module for capturing the correlation inside the DNA sequence grammar;
and a classification module for constructing a chromatin state for each DNA sequence and predicting an output chromatin state.
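By way of non-limiting illustration, one layer of the motif-aware convolution module (Conv, ReLU, Maxpool) can be sketched in plain Python as below. Dropout is omitted because it acts as the identity at inference time. The kernel is set by hand to the one-hot pattern of the motif TATA purely to show how a convolution filter scores motif matches; in the claimed model the kernels are learned, and the kernel size, pooling size and function names here are assumptions.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an L x 4 matrix."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d(x, kernels):
    """'Valid' 1D convolution over an L x 4 input.
    Each kernel is a k x 4 matrix; the output has shape (L - k + 1) x n_kernels."""
    k = len(kernels[0])
    return [[sum(x[i + j][c] * w[j][c] for j in range(k) for c in range(4))
             for w in kernels]
            for i in range(len(x) - k + 1)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def maxpool(m, size):
    """Non-overlapping max pooling along the sequence axis."""
    return [[max(col) for col in zip(*m[i:i + size])]
            for i in range(0, len(m) - size + 1, size)]

x = one_hot("GGTATAGG")
motif_kernel = one_hot("TATA")          # hand-set filter: counts matching positions
scores = relu(conv1d(x, [motif_kernel]))
pooled = maxpool(scores, 2)
print(scores)   # [[2.0], [0.0], [4.0], [0.0], [2.0]]
print(pooled)   # [[2.0], [4.0]]
```

The score peaks at 4.0 where the window exactly covers TATA, and max pooling keeps that peak while halving the sequence length.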
4. The diffusion model-based long tail chromatin state prediction method according to claim 3, wherein the hole convolution module comprises a 3-layer hole convolution network, each layer of hole convolution network comprises a hole convolution layer, a ReLU activation layer and a dropout layer, and the calculation process is as follows:
$$z^{(l1)} = \mathrm{dConv}(s^{(l3)})$$
$$z^{(l2)} = \mathrm{Dropout}(\mathrm{ReLU}(z^{(l1)}),\ 0.2)$$
wherein $z^{(l1)}$ is the output of the hole convolution layer in the l-th layer of the hole convolution network, and $z^{(l2)}$ is the output of the dropout layer.
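By way of non-limiting illustration, the effect of hole (dilated) convolution can be sketched as follows: the kernel taps are spaced `dilation` positions apart, so stacked layers cover a wider span of the sequence without extra parameters. The single-channel form, the kernel size of 3 and the dilation rates 1, 2, 4 are assumptions chosen for the example; the claim does not specify them.

```python
def dilated_conv1d(x, weights, dilation):
    """Single-channel 'valid' dilated convolution: taps are `dilation` positions apart."""
    k = len(weights)
    span = (k - 1) * dilation
    return [sum(weights[j] * x[i + j * dilation] for j in range(k))
            for i in range(len(x) - span)]

def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output of the stacked layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

signal = [1, 2, 3, 4, 5, 6, 7]
print(dilated_conv1d(signal, [1, 0, -1], 2))   # [-4, -4, -4]
print(receptive_field(3, [1, 1, 1]))           # 7  (ordinary convolution)
print(receptive_field(3, [1, 2, 4]))           # 15 (dilated convolution)
```

With the same three layers and kernel size, the dilated stack sees 15 positions instead of 7, which is why such layers suit longer-range dependencies between motifs.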
5. The diffusion model-based long tail chromatin state prediction method according to claim 4, wherein the self-attention module comprises two Transformer encoder layers, and the calculation process is as follows:
$$h^{(l1)} = \mathrm{LayerNorm}(z^{(l2)} + \mathrm{MultiHead}(z^{(l2)}))$$
$$h^{(l2)} = \mathrm{LayerNorm}(h^{(l1)} + \mathrm{FFN}(h^{(l1)}))$$
wherein $h^{(l1)}$ is the output of the multi-head self-attention sub-layer of the l-th Transformer encoder layer; LayerNorm() is layer normalization; MultiHead() is the multi-head self-attention mechanism; FFN() is the feed-forward neural network; and $h^{(l2)}$ is the output of the feed-forward sub-layer.
6. The diffusion model-based long tail chromatin state prediction method according to claim 5, wherein the classification module calculates:
$$y = \mathrm{Activation}(\mathrm{MLP}(h^{(l2)}))$$
wherein $y$ is the predicted chromatin state, Activation() is the activation function of the classification module, and MLP() is the fully connected layer.
7. The diffusion model-based long tail chromatin state prediction method according to claim 1, wherein the loss function of the DNA sequence diffusion model is:
wherein $L_D$ is the conventional softmax cross-entropy loss function after re-weighting; c is the category; $p_j$ is the probability of category j; $y_j$ is the true category label; w is the weight;
wherein:
where μ is an empirical value selected manually.
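By way of non-limiting illustration, a re-weighted softmax cross-entropy loss of the kind recited in claim 7 can be sketched as below. The weighting rule used here (w = 1 for a real sample and w = μ for a pseudo sample) is an assumption inferred from the statement that the penalty on pseudo samples is increased; the exact definition of w is not reproduced in the text above. The value μ = 2.0 and the function names are likewise hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, label, is_pseudo, mu=2.0):
    """Cross-entropy for one sample, scaled by an assumed weight:
    w = 1 for real samples, w = mu for generated (pseudo) samples."""
    w = mu if is_pseudo else 1.0
    p = softmax(logits)
    return -w * math.log(p[label])

real_loss = weighted_cross_entropy([0.0, 0.0], 0, is_pseudo=False)
pseudo_loss = weighted_cross_entropy([0.0, 0.0], 0, is_pseudo=True)
```

For identical predictions the pseudo sample contributes μ times the loss of the real sample, so errors on generated sequences are penalized more heavily under this assumed rule.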
CN202310991350.8A 2023-08-07 2023-08-07 Diffusion model-based long tail chromatin state prediction method Active CN116884495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310991350.8A CN116884495B (en) 2023-08-07 2023-08-07 Diffusion model-based long tail chromatin state prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310991350.8A CN116884495B (en) 2023-08-07 2023-08-07 Diffusion model-based long tail chromatin state prediction method

Publications (2)

Publication Number Publication Date
CN116884495A CN116884495A (en) 2023-10-13
CN116884495B true CN116884495B (en) 2024-03-08

Family

ID=88264587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310991350.8A Active CN116884495B (en) 2023-08-07 2023-08-07 Diffusion model-based long tail chromatin state prediction method

Country Status (1)

Country Link
CN (1) CN116884495B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312329A (en) * 2020-02-25 2020-06-19 成都信息工程大学 Transcription factor binding site prediction method based on deep convolution automatic encoder
CN114023300A (en) * 2021-11-03 2022-02-08 四川大学 Chinese speech synthesis method based on diffusion probability model
WO2022189771A1 (en) * 2021-03-11 2022-09-15 Oxford University Innovation Limited Generating neural network models, classifying physiological data, and classifying patients into clinical classifications
CN115831217A (en) * 2022-11-23 2023-03-21 四川大学 Chromatin topological association domain boundary prediction method based on multi-modal fusion
CN116153404A (en) * 2023-02-28 2023-05-23 成都信息工程大学 Single-cell ATAC-seq data analysis method
CN116312765A (en) * 2023-02-15 2023-06-23 成都信息工程大学 Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer
CN116416491A (en) * 2023-03-14 2023-07-11 福建福清核电有限公司 Remote sensing pseudo sample generation method based on lightweight diffusion model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312329A (en) * 2020-02-25 2020-06-19 成都信息工程大学 Transcription factor binding site prediction method based on deep convolution automatic encoder
WO2022189771A1 (en) * 2021-03-11 2022-09-15 Oxford University Innovation Limited Generating neural network models, classifying physiological data, and classifying patients into clinical classifications
CN114023300A (en) * 2021-11-03 2022-02-08 四川大学 Chinese speech synthesis method based on diffusion probability model
CN115831217A (en) * 2022-11-23 2023-03-21 四川大学 Chromatin topological association domain boundary prediction method based on multi-modal fusion
CN116312765A (en) * 2023-02-15 2023-06-23 成都信息工程大学 Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer
CN116153404A (en) * 2023-02-28 2023-05-23 成都信息工程大学 Single-cell ATAC-seq data analysis method
CN116416491A (en) * 2023-03-14 2023-07-11 福建福清核电有限公司 Remote sensing pseudo sample generation method based on lightweight diffusion model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程哲; 白茜; 张浩; 王世普; 梁宇. Improving Hi-C data resolution using deep convolutional neural networks. 计算机科学 (Computer Science). 2020, (Issue S1), full text. *

Also Published As

Publication number Publication date
CN116884495A (en) 2023-10-13

Similar Documents

Publication Publication Date Title
Zhang et al. NAS-AMR: Neural architecture search-based automatic modulation recognition for integrated sensing and communication systems
CN113852432B (en) Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN111564179B (en) Species biology classification method and system based on triple neural network
CN113255832B (en) Method for identifying long tail distribution of double-branch multi-center
CN114596726B (en) Parking berth prediction method based on interpretable space-time attention mechanism
CN112183742A (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN112686372A (en) Product performance prediction method based on depth residual GRU neural network
CN111355633A (en) Mobile phone internet traffic prediction method in competition venue based on PSO-DELM algorithm
CN111797979A (en) Vibration transmission system based on LSTM model
CN117251705A (en) Daily natural gas load prediction method
CN113343796B (en) Knowledge distillation-based radar signal modulation mode identification method
CN118337576A (en) Lightweight automatic modulation identification method based on multichannel fusion
CN116884495B (en) Diffusion model-based long tail chromatin state prediction method
CN114821184B (en) Long-tail image classification method and system based on balanced complementary entropy
CN116243248A (en) Multi-component interference signal identification method based on multi-label classification network
CN115047422A (en) Radar target identification method based on multi-scale mixed hole convolution
CN113132482B (en) Distributed message system parameter adaptive optimization method based on reinforcement learning
CN111476408B (en) Power communication equipment state prediction method and system
CN113139464A (en) Power grid fault detection method
CN116913390B (en) Gene regulation network prediction method based on multi-view attention network
CN114386602B (en) HTM predictive analysis method for multi-path server load data
CN113627556B (en) Method and device for realizing image classification, electronic equipment and storage medium
CN116908808B (en) RTN-based high-resolution one-dimensional image target recognition method
Wang et al. Exploring quantization in few-shot learning
CN117584792B (en) Online prediction method and system for charging power of electric vehicle charging station

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant