CN116978462A - Method for generating non-natural promoter based on diffusion model - Google Patents


Publication number
CN116978462A
CN116978462A CN202310954854.2A CN202310954854A CN116978462A CN 116978462 A CN116978462 A CN 116978462A CN 202310954854 A CN202310954854 A CN 202310954854A CN 116978462 A CN116978462 A CN 116978462A
Authority
CN
China
Prior art keywords
promoters
promoter
diffusion model
natural
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310954854.2A
Other languages
Chinese (zh)
Inventor
周景文
王兴隆
徐康杰
谭亚梦
赵欣怡
陈坚
曾伟主
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-07-31
Filing date
2023-07-31
Publication date
2023-10-31
Application filed by Jiangnan University
Priority to CN202310954854.2A
Publication of CN116978462A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Physiology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for generating non-natural promoters based on a diffusion model and belongs to the field of bioinformatics technology. To generate promoters, a deep learning network based on a diffusion model is established. The application also classifies the generated promoters as true or false and analyzes their functional regions. The results show that more than 40% of the generated promoters are true promoters and that the sequences have pronounced -35 and -10 functional regions, which indicates relatively high credibility.

Description

A method for generating non-natural promoters based on a diffusion model

Technical field

The invention relates to a method for generating non-natural promoters based on a diffusion model and belongs to the field of bioinformatics technology.

Background

Promoter design supports the construction of metabolic engineering networks for the de novo synthesis of chemicals, drugs and other raw materials in microorganisms. The main function of a promoter is to initiate gene transcription and, downstream, translation, so it directly affects the expression level of the target gene. Recent research reports a Pearson correlation coefficient as high as 0.8 between the transcription level and the translation level of promoter-driven genes, so regulating promoters allows precise control of protein expression. In earlier work, researchers tried several approaches to obtain non-natural promoters, including directed evolution, in which mutation sites are introduced at random into a target promoter, and rational design, in which only a small segment within the conserved or non-conserved region of the promoter is mutated.

Although some progress has been made in screening non-natural promoters, the promoter libraries constructed so far remain small. A typical promoter is about 50 bases long, which gives $4^{50}$ possible sequences, far too many to verify by experiment alone. Eukaryotic promoters are much longer than 50 bases, which makes experimental screening harder still. Developing computer-aided methods for promoter generation is therefore important and will facilitate promoter screening.
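The size of this search space can be checked with a line of arithmetic: with four possible bases at each of 50 positions there are 4^50 candidate sequences. The short sketch below only illustrates that count; the variable names are illustrative and not taken from the patent.

```python
# Number of distinct DNA sequences of length 50 over the alphabet {A, C, G, T}.
ALPHABET_SIZE = 4
PROMOTER_LENGTH = 50

sequence_space = ALPHABET_SIZE ** PROMOTER_LENGTH
print(f"{sequence_space:.3e}")  # roughly 1.268e+30 candidate sequences
```

Even at a million constructs tested per day, exhaustive experimental screening of a space this large is out of reach, which is the motivation for a generative model.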

In 2020, Wang et al. proposed de novo promoter design with a generative adversarial network (GAN): promoter sequences are converted into one-dimensional arrays for learning, and the adversarial game between generator and discriminator then yields new artificial sequences that follow a distribution similar to that of natural biomolecules. However, GAN training is highly unstable because training an optimal discriminator conflicts with minimizing the generator loss, and the diversity of promoters produced by adversarial learning is limited, so the approach does not extend easily to modeling complex multimodal distributions. For these reasons, a new method for generating non-natural promoters is needed.

Summary of the invention

To address the training instability of current de novo promoter design with generative adversarial networks, the present invention provides a method for generating non-natural promoters based on a diffusion model. The method includes:

Step S1: Construct a diffusion model for generating non-natural promoters. The model is built on UNet, a convolutional neural network architecture. The encoder of the UNet uses convolutional layers; the decoder (called the "non-coding region" in the original text) restores the image size by upsampling; standard UNet skip connections pass features from the encoder to the decoder; and a self-attention mechanism is introduced in both the encoder and the decoder;

Step S2: Train the diffusion model for generating non-natural promoters, using promoters from a public dataset as training data;

Step S3: Generate new promoters with the trained diffusion model.

Optionally, step S2 includes:

digitizing the gene sequences of the promoters in the public dataset;

training the diffusion model for generating non-natural promoters on the digitized promoter sequences, computing the loss value during training, performing promoter identification and conservation evaluation on the output samples, and saving the model parameters once training is complete;

Promoter identification uses the deep-learning-based PromoR module to classify each generated sequence as true or false and computes the proportion of true promoters among all generated promoters;

Promoter conservation is evaluated by aligning the generated promoters and inspecting the sequences of the -35 and -10 regions. When the sequence logo shows TT in the -35 region and TATAAT in the -10 region, the generated promoters are considered to have the characteristics of natural promoters.

Optionally, the tool used to inspect the -35 and -10 region sequences is MetaLogo.

Optionally, digitizing the gene sequences of the promoters in the public dataset includes:

extracting features by one-hot encoding, which converts a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50;

After conversion, the bases A, T, C and G are represented as [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.

Optionally, the method further includes setting a threshold for the proportion of true promoters among all generated promoters.

This application also provides the use of the above method for generating non-natural promoters based on a diffusion model in the field of chemicals and pharmaceuticals.

The beneficial effects of the invention are as follows:

The gene sequences are digitized by computer encoding, and public datasets are collected to build the training dataset. The digitized sequences serve as input, a diffusion model learns their features, and the quality of the generated samples is evaluated before the model is used to generate non-natural promoters. Compared with existing techniques that design promoters with generative adversarial networks, the generative model used here trains more stably and avoids the training instability of GANs. It can identify key short regions of a sequence and, at the same time, the relationship between those regions and the full-length sequence. The diffusion model also favors the stable generation of more diverse promoters, which helps in discovering new promoters.

Description of the drawings

To explain the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a flow chart of promoter generation based on a diffusion model, as provided in one embodiment of the invention.

Figure 2 is a sequence logo of the promoters generated by the diffusion model of the invention.

Detailed description

To make the purpose, technical solutions and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.

PyTorch: the Python version of Torch, a neural network framework open-sourced by Facebook and designed for GPU-accelerated deep neural network (DNN) programming. Torch is a classic tensor library for operating on multi-dimensional array data and is widely used in machine learning and other math-intensive applications. (A tensor is a container for numbers in a machine learning program, essentially an array of arbitrary dimensionality; its dimensions are usually called axes, and the number of axes is called its order.) Unlike the static computation graph of TensorFlow 1.x, PyTorch builds its computation graph dynamically, so the graph can change at run time as the computation requires.

Example 1:

This embodiment provides a method for generating non-natural promoters based on a diffusion model. The method trains a diffusion model on promoter gene sequences and generates new promoters. Referring to Figure 1, the method includes:

Step S1: Standardization of the input gene sequences.

Construct the training set: the promoters in the dataset reported by Thomason (E. coli promoters) are used as training data. The training set contains 11,884 promoters in total;

Step S2: Digitization of the input promoter gene sequences.

Features are extracted by one-hot encoding, which converts a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50. After conversion, the bases A, T, C and G are represented as [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.

This step digitizes the gene sequence. For example, the promoter

CCGCTCAAATATTGTTAAATTGCCGGTTTTGTATCAACTACTCACCCGGG is converted to: [[0 1 0 0][0 1 0 0][0 0 1 0][0 1 0 0][0 0 0 1][0 1 0 0][1 0 0 0][1 0 0 0][1 0 0 0][0 0 0 1][1 0 0 0][0 0 0 1][0 0 0 1][0 0 1 0][0 0 0 1][0 0 0 1][1 0 0 0][1 0 0 0][1 0 0 0][0 0 0 1][0 0 0 1][0 0 1 0][0 1 0 0][0 1 0 0][0 0 1 0][0 0 1 0][0 0 0 1][0 0 0 1][0 0 0 1][0 0 0 1][0 0 1 0][0 0 0 1][1 0 0 0][0 0 0 1][0 1 0 0][1 0 0 0][1 0 0 0][0 1 0 0][0 0 0 1][1 0 0 0][0 1 0 0][0 0 0 1][0 1 0 0][1 0 0 0][0 1 0 0][0 1 0 0][0 1 0 0][0 0 1 0][0 0 1 0][0 0 1 0]].
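The encoding described in step S2 can be sketched in a few lines of plain Python. The mapping of A, C, G and T to vector positions 0 to 3 comes from the text; the function names and the nested-list representation of the 1 x 4 x 50 layout are assumptions made for illustration, and the patent does not say how the authors implemented this step.

```python
# Position of the "1" for each base, as stated in the text:
# A -> [1 0 0 0], C -> [0 1 0 0], G -> [0 0 1 0], T -> [0 0 0 1].
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_rows(seq):
    """Return one 4-element one-hot row per base (an L x 4 layout)."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        row[BASE_INDEX[base]] = 1  # raises KeyError on symbols other than A/C/G/T
        rows.append(row)
    return rows


def to_channel_first(rows):
    """Transpose L x 4 rows into a 1 x 4 x L nested list (channel, length 4, width L)."""
    return [[[row[k] for row in rows] for k in range(4)]]


promoter = "CCGCTCAAATATTGTTAAATTGCCGGTTTTGTATCAACTACTCACCCGGG"
rows = one_hot_rows(promoter)         # 50 rows of 4 values
image_like = to_channel_first(rows)   # 1 x 4 x 50
```

The row-per-base form matches the worked example in the text, while the transposed form matches the stated 1-channel, length-4, width-50 input shape.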

Step S3: Construct a diffusion model for learning sequence features, specifically:

S3.1: Convert the data into tensors that PyTorch can process;

This mainly involves converting NumPy arrays into tensors. The conversion described at

https://blog.csdn.net/weixin_43728604/article/details/102679016 can be used.

S3.2: Build the deep learning network with PyTorch. The network is based on UNet, a convolutional neural network architecture. The encoder of the UNet uses convolutional layers; the decoder (called the "non-coding region" in the original text) restores the image size by upsampling. Standard UNet skip connections pass features from the encoder to the decoder, and a self-attention mechanism is introduced in both the encoder and the decoder;

Diffusion model: assuming the diffusion process is a Markov chain, Gaussian noise is added to the original data step by step. Each step adds Gaussian noise in the transition $X_{t-1} \to X_t$,

where $X_{t-1}$ denotes the data at time $t-1$ and $X_t$ denotes the data at time $t$ after Gaussian noise has been added.

The reverse diffusion process recovers the original data from Gaussian noise. Assuming the reverse process is also a Markov chain, the task is $X_T \to X_0$, where $X_T$ denotes the data at time $T$ after Gaussian noise has been added and $X_0$ denotes the original data recovered from the data at time $T$.
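The forward (noising) chain can be sketched as follows. The patent states only that Gaussian noise is added in a Markov chain; the specific update rule, the linear schedule and its endpoint values below follow the common DDPM formulation and are assumptions, not details taken from the patent.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (endpoint values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_step(x_prev, beta, rng):
    """One Markov step X_{t-1} -> X_t: shrink the signal, then add Gaussian noise."""
    keep, noise_scale = math.sqrt(1.0 - beta), math.sqrt(beta)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]


def noise_to_step(x0, t, alpha_bars, rng):
    """Jump directly from X_0 to X_t (t is 1-based); also return the noise used."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * e for v, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.0, 1.0, 0.0, 0.0]                 # one-hot column for base C
x1 = forward_step(x0, betas[0], rng)      # slightly noised
x_T, eps = noise_to_step(x0, 1000, alpha_bars, rng)  # almost pure noise
```

In this formulation the network is trained to predict `eps` from `x_t` and `t`, and the reverse chain $X_T \to X_0$ repeatedly removes the predicted noise.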

UNet: a typical encoder-decoder structure. The encoder mainly performs feature extraction, and the image size shrinks progressively along it. The decoder (the right-hand side of the usual U-shaped diagram) performs upsampling and uses skip connections to the corresponding convolutional layers to restore the image to a size close to that of the original.
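The shape bookkeeping behind downsampling, upsampling and a skip connection can be illustrated without any learned weights. The sketch below is a simplification under stated assumptions: it treats the four base rows as channels, uses average pooling in place of a strided convolution, and uses nearest-neighbour upsampling. It is not the network from the patent, whose layer details are not disclosed.

```python
def downsample(feature, factor=2):
    """Average-pool each channel along the width (stand-in for a strided convolution)."""
    return [
        [sum(ch[i:i + factor]) / factor for i in range(0, len(ch) - factor + 1, factor)]
        for ch in feature
    ]


def upsample(feature, factor=2):
    """Nearest-neighbour upsampling along the width."""
    return [[v for v in ch for _ in range(factor)] for ch in feature]


def skip_concat(decoder_feature, encoder_feature):
    """Skip connection: stack encoder channels alongside decoder channels."""
    assert len(decoder_feature[0]) == len(encoder_feature[0]), "widths must match"
    return decoder_feature + encoder_feature


# A 4 x 50 input (4 base channels, 50 positions) filled with dummy values.
x = [[float((c + p) % 2) for p in range(50)] for c in range(4)]
encoded = downsample(x)            # 4 x 25
restored = upsample(encoded)       # 4 x 50
merged = skip_concat(restored, x)  # 8 x 50
```

The concatenated tensor carries both the coarse features from the bottleneck and the fine-grained encoder features, which is what lets the decoder recover position-specific detail such as short motifs.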

Convolutional layer: extracts features from the digitized promoter, which is treated as image information. The formula is as follows: (the formula appears only as an image in the original publication and is not reproduced in this text).

Activation layer: the activation layer uses the ReLU function, a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged, i.e. ReLU(x) = max(0, x).

Self-attention layer: the self-attention mechanism lets the inputs interact with one another and determine which of them deserve more attention. The output aggregates these interactions, weighted by the attention scores.
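A minimal sketch of self-attention over sequence positions is given below. It uses scaled dot-product attention with the queries, keys and values all equal to the input, which is an assumption made to keep the example short; a trained layer would apply learned projections first, and the patent does not specify its attention variant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    `tokens` is a list of equal-length feature vectors, one per sequence position.
    Returns (outputs, weights); weights[i][j] is how much position i attends to j.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Three positions encoded as one-hot bases T, A, T.
tokens = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy input the first T attends more strongly to the other T than to the A, which shows how attention can relate a short region to distant positions in the full-length sequence.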

S3.3: During model training, compute the loss value and perform promoter identification and conservation evaluation on the output samples.

Promoter identification uses the deep-learning-based PromoR module (bioRxiv, doi: https://doi.org/10.1101/2023.03.05.531155): each generated sequence is classified as true or false, and the proportion of true promoters among all generated promoters is computed.

Promoter conservation is evaluated by aligning the generated promoters and inspecting the sequences of the -35 and -10 regions with the tool MetaLogo (http://metalogo.omicsnet.org/analysis). When the sequence logo shows TT in the -35 region and TATAAT in the -10 region, the generated promoters are considered to have the characteristics of natural promoters.
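The conservation check can be approximated offline by counting bases per position and reading off the consensus, which is the information a sequence logo visualizes. The sketch below is not MetaLogo and performs no alignment: it assumes equal-length, already aligned sequences, and the example sequences are made up for illustration.

```python
from collections import Counter


def position_frequencies(sequences):
    """Per-position base frequencies for equal-length sequences (no alignment step)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    table = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sequences)
        table.append({b: counts.get(b, 0) / len(sequences) for b in "ACGT"})
    return table


def consensus(table):
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: column[b]) for column in table)


def has_motif(consensus_seq, motif):
    """True if the motif occurs anywhere in the consensus string."""
    return motif in consensus_seq


# Toy sequences (made up for illustration, not taken from the patent).
generated = [
    "GCTTGACATATAATGC",
    "ACTTGTCATATAATGA",
    "GGTTGACTTATAATCC",
]
table = position_frequencies(generated)
cons = consensus(table)
looks_natural = has_motif(cons, "TATAAT") and has_motif(cons, "TT")
```

A stricter check would require the motifs at their expected offsets relative to the transcription start site rather than anywhere in the consensus.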

The loss is computed with the mean squared error loss (MSELoss), which compares the model output with the ground truth, and the parameters are optimized with the AdamW optimizer.
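The two ingredients named here can be written out in plain Python to show what they compute. This is a sketch of the mathematics on a one-parameter toy problem, not the PyTorch `MSELoss` or `AdamW` code used by the authors; the hyperparameter values and the toy data are assumptions.

```python
import math


def mse_loss(predicted, target):
    """Mean squared error between two equally sized lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (decoupled weight decay)."""
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1 - betas[1] ** state["t"])   # bias-corrected second moment
    param -= lr * weight_decay * param                   # decay applied directly to the weight
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param


# Toy problem: learn a scale w so that w * x matches the target values.
inputs, targets = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
initial_loss = mse_loss([w * x for x in inputs], targets)
for _ in range(500):
    # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
    w = adamw_step(w, grad, state, lr=0.05)
final_loss = mse_loss([w * x for x in inputs], targets)
```

In a diffusion model the same loss is typically applied between the noise the network predicts and the noise that was actually added, although the patent only says that the output is compared with the true result.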

S3.4: When the loss value levels off during training, evaluate the generated samples. If the promoter identification results show a high proportion of true promoters and the sequences have pronounced conserved regions, save the training parameters of that training epoch;
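The checkpoint decision in S3.4 can be expressed as a small predicate. In the sketch below the classifier is a hypothetical stand-in (a real run would call PromoR, whose interface is not described in the patent), and the 0.4 default threshold is an illustrative value chosen to echo the reported result of more than 40% true promoters.

```python
def true_promoter_fraction(sequences, is_promoter):
    """Fraction of generated sequences that the classifier calls true promoters."""
    if not sequences:
        return 0.0
    return sum(1 for s in sequences if is_promoter(s)) / len(sequences)


def should_save_checkpoint(sequences, is_promoter, conserved_ok, min_fraction=0.4):
    """Save only if enough sequences pass the classifier and the motifs are conserved."""
    return conserved_ok and true_promoter_fraction(sequences, is_promoter) >= min_fraction


def toy_classifier(seq):
    """Hypothetical stand-in for a real promoter classifier such as PromoR."""
    return "TATAAT" in seq


# Made-up batch of generated sequences, for illustration only.
batch = [
    "GCTTGACATATAATGC",
    "ACGTACGTACGTACGT",
    "TTGACATTTATAATAA",
    "CCCCCCCCCCCCCCCC",
    "GGGGTTTTAAAACCCC",
]
fraction = true_promoter_fraction(batch, toy_classifier)
save_now = should_save_checkpoint(batch, toy_classifier, conserved_ok=True)
```

Passing the classifier in as a function keeps the decision logic independent of whichever promoter predictor is actually used.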

Step S4: Using the parameters saved after training, generate non-natural promoters with the diffusion model.
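The reverse process produces real-valued arrays, so a final step is needed to turn each one back into a base string. The patent does not describe this step; one simple option, sketched below as an assumption, is to pick the highest-scoring channel at each position, using the same A, C, G, T channel order as the one-hot encoding. The sample values are made up.

```python
BASES = "ACGT"  # channel order from the one-hot scheme: A, C, G, T


def decode_scores(scores):
    """Turn a 4 x L array of real-valued channel scores into a base string.

    scores[k][p] is the score of base BASES[k] at position p; the base with
    the highest score wins at each position.
    """
    length = len(scores[0])
    return "".join(
        BASES[max(range(4), key=lambda k: scores[k][p])] for p in range(length)
    )


# Made-up denoised output for a 6-base sequence (values are illustrative only).
sample = [
    [0.1, 0.9, 0.0, 0.8, 0.7, 0.2],   # A
    [0.0, 0.0, 0.1, 0.1, 0.1, 0.1],   # C
    [0.2, 0.0, 0.1, 0.0, 0.1, 0.0],   # G
    [0.9, 0.1, 0.8, 0.2, 0.2, 0.9],   # T
]
sequence = decode_scores(sample)
```

The decoded strings are what would then be passed to the promoter classifier and the conservation analysis.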

To demonstrate the advantages of the generative model constructed in the invention, the generated promoters were classified as true or false with a deep-learning-based classifier and their sequence features were analyzed. The results are shown in Figure 2: more than 40% of the generated promoters were classified as true, and the promoters have pronounced -35 and -10 regions, which indicates that the generated promoters have relatively high credibility.

The method provided in this application for generating non-natural promoters based on a diffusion model digitizes gene sequences by computer encoding, takes the digitized sequences as input, learns their features with a diffusion model, and evaluates the quality of the generated samples before using the model to generate non-natural promoters. Compared with existing techniques that design promoters with generative adversarial networks, the generative model used here trains more stably and avoids the training instability of GANs. It can identify key short regions of a sequence and, at the same time, the relationship between those regions and the full-length sequence. The diffusion model also favors the stable generation of more diverse promoters, which helps in discovering new promoters.

Some steps in the embodiments of the invention can be implemented in software, and the corresponding software programs can be stored on a readable storage medium such as an optical disc or a hard disk.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (6)

1. A method for generating non-natural promoters based on a diffusion model, characterized in that the method includes:
Step S1: Construct a diffusion model for generating non-natural promoters. The model is built on UNet, a convolutional neural network architecture. The encoder of the UNet uses convolutional layers; the decoder (called the "non-coding region" in the original text) restores the image size by upsampling; standard UNet skip connections pass features from the encoder to the decoder; and a self-attention mechanism is introduced in both the encoder and the decoder;
Step S2: Train the diffusion model for generating non-natural promoters, using promoters from a public dataset as training data;
Step S3: Generate new promoters with the trained diffusion model.

2. The method according to claim 1, characterized in that step S2 includes:
digitizing the gene sequences of the promoters in the public dataset;
training the diffusion model for generating non-natural promoters on the digitized promoter sequences, computing the loss value during training, performing promoter identification and conservation evaluation on the output samples, and saving the model parameters once training is complete;
wherein promoter identification uses the deep-learning-based PromoR module to classify each generated sequence as true or false and computes the proportion of true promoters among all generated promoters;
and promoter conservation is evaluated by aligning the generated promoters and inspecting the sequences of the -35 and -10 regions; when the sequence logo shows TT in the -35 region and TATAAT in the -10 region, the generated promoters are considered to have the characteristics of natural promoters.

3. The method according to claim 2, characterized in that the tool used to inspect the -35 and -10 region sequences is MetaLogo.

4. The method according to claim 3, characterized in that digitizing the gene sequences of the promoters in the public dataset includes:
extracting features by one-hot encoding, which converts a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50;
after conversion, the bases A, T, C and G are represented as [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.

5. The method according to claim 4, characterized in that the method further includes setting a threshold for the proportion of true promoters among all generated promoters.

6. Use of the method for generating non-natural promoters based on a diffusion model according to any one of claims 1 to 5 in the field of chemicals and pharmaceuticals.
CN202310954854.2A 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model Pending CN116978462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310954854.2A CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310954854.2A CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Publications (1)

Publication Number Publication Date
CN116978462A (en) 2023-10-31

Family

ID=88476400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310954854.2A Pending CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Country Status (1)

Country Link
CN (1) CN116978462A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038993A (en) * 2024-04-11 2024-05-14 云南师范大学 Protein sequence diffusion generation method based on generation countermeasure network drive
CN120048352A (en) * 2025-04-23 2025-05-27 电子科技大学长三角研究院(衢州) A method for generating yeast core promoter sequences based on potential diffusion model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination