CN116978462A - Method for generating non-natural promoter based on diffusion model - Google Patents

Method for generating non-natural promoter based on diffusion model

Info

Publication number
CN116978462A
Authority
CN
China
Prior art keywords
promoter
generating
diffusion model
promoters
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310954854.2A
Other languages
Chinese (zh)
Inventor
周景文
王兴隆
徐康杰
谭亚梦
赵欣怡
陈坚
曾伟主
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-07-31
Filing date
2023-07-31
Publication date
2023-10-31
Application filed by Jiangnan University
Priority to CN202310954854.2A
Publication of CN116978462A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Physiology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a method for generating a non-natural promoter based on a diffusion model, and belongs to the technical field of bioinformatics. To generate promoters, a deep learning network based on a diffusion model is established. The application also performs true/false discrimination and functional-region analysis on the generated promoters. The results show that more than 40% of the generated promoters are true promoters and that the sequences have clear -35 and -10 functional regions, indicating high credibility.

Description

Method for generating non-natural promoter based on diffusion model
Technical Field
The application relates to a method for generating a non-natural promoter based on a diffusion model, and belongs to the technical field of bioinformatics.
Background
Promoter design can assist in constructing metabolic engineering networks for the de novo synthesis of chemicals, pharmaceuticals and other raw materials in microorganisms. A promoter mainly drives gene transcription and translation and directly influences the expression level of a target gene. Recent studies show that the Pearson correlation coefficient between the transcription level and the translation level of promoter-driven genes reaches 0.8, so protein expression can be regulated precisely by tuning the promoter. In previous studies, researchers have tried different methods for mining non-natural promoters, including directed evolution, in which mutation sites are introduced randomly into the target promoter, and rational design, in which only a small segment of the conserved or non-conserved regions of the promoter is mutated.
Although some progress has been made in screening non-natural promoters, the promoter libraries constructed so far are still small. A typical promoter is 50 bases long and therefore has 4^50 possible compositions, far too many to verify by experiment alone. Eukaryotic promoters are much longer than 50 bases, which makes experimental screening even harder. It is therefore extremely important to develop a computational method that assists promoter generation and thereby facilitates promoter screening.
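As a quick check of the scale involved, and assuming each of the 50 positions can independently be any of the four bases, the number of possible sequences is 4^50, roughly 1.27 × 10^30. The variable names in this sketch are illustrative only.

```python
# Number of distinct DNA sequences of a given length over the alphabet A, T, C, G.
ALPHABET_SIZE = 4
PROMOTER_LENGTH = 50  # promoter length stated in the text

search_space = ALPHABET_SIZE ** PROMOTER_LENGTH
print(search_space)           # exact integer count
print(f"{search_space:.2e}")  # the same value in scientific notation
```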
In 2020, Wang et al. proposed using a generative adversarial network (GAN) for the de novo design of promoters: promoter sequences are converted into one-dimensional arrays for learning, and the adversarial game between a generator and a discriminator then produces new artificial sequences whose distribution resembles that of natural biomolecules. However, GAN training is highly unstable because of the conflict between training an optimal discriminator and minimizing the generator loss, and the diversity of the promoters produced by adversarial learning is limited to some extent, so GANs do not extend easily to modeling complex multi-modal distributions. For these reasons, a new method for generating non-natural promoters is needed.
Disclosure of Invention
In order to solve the instability of the de novo promoter design currently implemented with generative adversarial networks, the present application provides a method for generating non-natural promoters based on a diffusion model, comprising:
step S1: constructing a diffusion model for generating the non-natural promoter, wherein the diffusion model is based on a UNet convolutional neural network; convolutional layers are used to build the encoder of the UNet; the decoder restores the image size by upsampling; standard UNet skip connections transfer features between the encoder and the decoder; and self-attention mechanisms are introduced into both the encoder and the decoder;
step S2: training the diffusion model for generating the non-natural promoter by using the promoters in a public dataset as training data;
step S3: generating new promoters by using the trained diffusion model for generating the non-natural promoter.
Optionally, the step S2 includes:
digitizing the gene sequence of the promoter in the public dataset;
training the diffusion model for generating the non-natural promoter with the digitized promoter gene sequences, calculating a loss value during training, performing promoter identification and conservation assessment on the output samples, and saving the model parameters after training is completed;
the promoter identification uses the deep-learning-based Promo R module to classify each generated sequence as a true or false promoter, and calculates the proportion of true promoters among all generated promoters;
the promoter conservation is evaluated by aligning the generated promoter sequences and observing the sequences of the -35 region and the -10 region; when the motifs of the -35 region and the -10 region are TT and TATAAT respectively, the generated promoter is considered to have the characteristics of a natural promoter.
Optionally, MetaLogo is used as the tool for observing the sequences of the -35 and -10 regions.
Optionally, the digitizing the gene sequence of the promoter in the public dataset comprises:
extracting features by one-hot encoding, and converting a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50;
the encoded bases A, T, C and G are [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.
Optionally, the method further comprises setting a threshold for the proportion of true promoters among all generated promoters.
The application also provides use of the method for generating the non-natural promoter based on the diffusion model in the fields of chemicals and pharmaceuticals.
The application has the beneficial effects that:
Gene sequences are digitized by computer encoding, and public datasets are collected to build the training dataset. The application takes the digitized genes as input, uses a diffusion model to learn their features, evaluates the quality of the generated samples, and then uses the model to generate non-natural promoters. The method does not suffer from unstable training. In addition, the diffusion model is better suited to stably generating promoters with higher diversity, which helps in mining new promoters.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of generating promoters based on a diffusion model provided in one embodiment of the present application.
FIG. 2 is a sequence identification diagram of a promoter generated by using a diffusion model according to the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
PyTorch: the Python version of Torch, a neural network framework open-sourced by Facebook and designed for GPU-accelerated deep neural network (DNN) programming. Torch is a classic tensor library for operating on multidimensional matrix data and is widely used in machine learning and other mathematically intensive applications. (A tensor is the data container of a machine learning program: essentially an array of arbitrary dimension, whose dimensions are usually called axes and whose number of axes is called its order.) Unlike the static computational graph of TensorFlow, the computational graph of PyTorch is dynamic and can change in real time according to computational needs.
Embodiment one:
This embodiment provides a method for generating a non-natural promoter based on a diffusion model, which trains a diffusion model on promoter gene sequences and generates new promoters. Referring to FIG. 1, the method includes:
Step S1: normalization of the input gene sequences.
Building the training set: the promoters in the dataset reported by Thomason (E. coli promoters) are used as training data. The training set contains 11884 promoters in total.
Step S2: digitization of the gene sequences of the input promoters.
Features are extracted by one-hot encoding, converting a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50. The encoded bases A, T, C and G are [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.
By this step the gene sequence can be digitized. For example, the promoter
CCGCTCAAATATTGTTAAATTGCCGGTTTTGTATCAACTACTCACCCGGG
is converted into:
[[0 1 0 0] [0 1 0 0] [0 0 1 0] [0 1 0 0] [0 0 0 1] [0 1 0 0] [1 0 0 0] [1 0 0 0] [1 0 0 0] [0 0 0 1]
 [1 0 0 0] [0 0 0 1] [0 0 0 1] [0 0 1 0] [0 0 0 1] [0 0 0 1] [1 0 0 0] [1 0 0 0] [1 0 0 0] [0 0 0 1]
 [0 0 0 1] [0 0 1 0] [0 1 0 0] [0 1 0 0] [0 0 1 0] [0 0 1 0] [0 0 0 1] [0 0 0 1] [0 0 0 1] [0 0 0 1]
 [0 0 1 0] [0 0 0 1] [1 0 0 0] [0 0 0 1] [0 1 0 0] [1 0 0 0] [1 0 0 0] [0 1 0 0] [0 0 0 1] [1 0 0 0]
 [0 1 0 0] [0 0 0 1] [0 1 0 0] [1 0 0 0] [0 1 0 0] [0 1 0 0] [0 1 0 0] [0 0 1 0] [0 0 1 0] [0 0 1 0]].
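The encoding above can be sketched in plain Python. The base-to-vector table and the example promoter come from the text; the function names, and the reading of "1 channel, length 4, width 50" as a channel × 4 × 50 nested list, are assumptions made for illustration.

```python
# One-hot codes for A, T, C, G as listed in the text.
BASE_TO_ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 0, 0, 1],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
}

def one_hot_encode(sequence):
    """Return one row of four values per base (len(sequence) rows of 4)."""
    return [list(BASE_TO_ONE_HOT[base]) for base in sequence.upper()]

def to_channel_first(rows):
    """Rearrange len x 4 rows into a 1 x 4 x len nested list (assumed layout)."""
    return [[[row[k] for row in rows] for k in range(4)]]

promoter = "CCGCTCAAATATTGTTAAATTGCCGGTTTTGTATCAACTACTCACCCGGG"
encoded = one_hot_encode(promoter)
image = to_channel_first(encoded)

print(encoded[:3])                                  # rows for C, C, G
print(len(image), len(image[0]), len(image[0][0]))  # channel, length, width
```

A base outside A, T, C, G raises a `KeyError` in this sketch; how the original pipeline treats ambiguous bases is not stated in the text.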
Step S3: a diffusion model is constructed for sequence feature learning, specifically as follows:
S3.1: converting the data into tensors that PyTorch can process;
This mainly involves converting NumPy arrays into tensors; the conversion can follow the approach described at https:// blog.
S3.2: constructing a deep learning network in PyTorch. The network is based mainly on a UNet convolutional neural network: convolutional layers are used to build the encoder of the UNet, and the decoder restores the image size by upsampling. Standard UNet skip connections transfer features between the encoder and the decoder, and self-attention mechanisms are introduced into both the encoder and the decoder;
Diffusion model: the forward diffusion process is assumed to be a Markov chain in which Gaussian noise is added to the original data step by step; each noising step goes from X_{t-1} to X_t,
where X_{t-1} denotes the data at time t-1 and X_t denotes the data at time t after Gaussian noise has been added.
The reverse diffusion process recovers the original data from Gaussian noise. Assuming that the reverse process is also a Markov chain, its task is X_T → X_0, where X_T denotes the data at time T after Gaussian noise has been added and X_0 denotes the original data recovered from X_T.
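The forward chain X_{t-1} → X_t can be sketched as follows. The text does not give the noising formula or the noise schedule, so this sketch assumes the widely used form X_t = sqrt(1 - beta_t) · X_{t-1} + sqrt(beta_t) · noise with a linear schedule over 1000 steps; the function names and schedule values are illustrative, not taken from the application. The reverse process requires the trained network and is not shown.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta_t, rng):
    """One Markov step X_{t-1} -> X_t: shrink the signal, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta_t)
    noise_scale = math.sqrt(beta_t)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_diffusion(x0, betas, rng):
    """Apply every step in order and return the whole chain X_0, X_1, ..., X_T."""
    chain = [list(x0)]
    for beta_t in betas:
        chain.append(forward_step(chain[-1], beta_t, rng))
    return chain

rng = random.Random(0)
x0 = [0.0, 1.0, 0.0, 0.0]  # one-hot code of base C
betas = linear_beta_schedule(1000)
chain = forward_diffusion(x0, betas, rng)

# Fraction of the original signal left in X_T: the product of sqrt(1 - beta_t).
signal_kept = math.prod(math.sqrt(1.0 - b) for b in betas)
print(len(chain), signal_kept)
```

Under these assumed settings almost none of the original signal survives to X_T, which is why X_T can be treated as approximately pure Gaussian noise when sampling starts.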
UNet: a typical encoder-decoder architecture. The encoder mainly performs feature extraction, with the image size decreasing at each stage. The decoder corresponds to the upsampling process: through skip connections that bring in information from the different convolution layers, it restores the image to a size close to that of the original.
Convolution layer: feature extraction is performed on the digitized promoters based on the image information. The formula is as follows:
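The formula itself does not survive in this text. For reference, a standard two-dimensional convolution as implemented in deep learning frameworks (strictly a cross-correlation) can be written as below; this is an assumed textbook form, not necessarily the exact expression used in the application. Here X is the digitized promoter, K is a kernel of size k_h × k_w, b is a bias term, and Y is the resulting feature map.

```latex
Y(i, j) = b + \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} K(m, n)\, X(i + m,\; j + n)
```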
Activation layer: the activation layer uses the ReLU function, a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged.
Self-attention layer: the self-attention mechanism lets the inputs interact with one another to determine which of them deserve more attention; the output aggregates these interactions, weighted by the attention scores.
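A minimal sketch of these two layers in plain Python follows. The self-attention shown is the common scaled dot-product form with identity projections; a trained layer would use learned query, key and value weights, which the text does not specify, so that part is an assumption. All function names are illustrative.

```python
import math

def relu(values):
    """Piecewise linear activation: negatives become 0, positives pass through."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Learned query/key/value matrices are omitted to keep the sketch short.
    """
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        weights = softmax([dot(query, key) / scale for key in tokens])
        mixed = [sum(w * tok[d] for w, tok in zip(weights, tokens))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs

print(relu([-2.0, -0.5, 0.0, 1.5]))

# Three sequence positions holding the one-hot codes of A, T, A.
positions = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
attended = self_attention(positions)
print(attended[0])
```

In the example, the first position attends more strongly to itself and to the third position (both A) than to the second (T), so its output stays closer to the code for A.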
S3.3: calculating a loss value during model training, and performing promoter identification and conservation assessment on the output samples.
Promoter identification uses the deep-learning-based Promo R module (bioRxiv, doi: https://doi.org/10.1101/2023.03.05.531155): each generated sequence is classified as a true or false promoter, and the proportion of true promoters among all generated promoters is calculated.
Promoter conservation is assessed mainly by aligning the generated promoter sequences and observing the sequences of the -35 and -10 regions, using the tool MetaLogo (http:// MetaLogo omicsnet. When the motifs of the -35 and -10 regions are TT and TATAAT respectively, the generated promoter is considered to have the characteristics of a natural promoter.
The loss value is calculated with the mean squared error loss (MSELoss), which compares the output with the ground truth, and the parameters are optimized with the AdamW optimizer.
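The mean squared error itself is simple to state in code. The sketch below is a plain-Python illustration with made-up numbers; in PyTorch the corresponding classes are `torch.nn.MSELoss` and `torch.optim.AdamW`, and which two quantities the application's training loop actually compares (for example predicted noise versus added noise) is not specified in the text.

```python
def mse_loss(predicted, target):
    """Mean of the squared element-wise differences between two equal-length lists."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

model_output = [0.9, 0.1, 0.0, 0.2]  # hypothetical network output
reference = [1.0, 0.0, 0.0, 0.0]     # hypothetical training target
print(mse_loss(model_output, reference))
```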
S3.4: when the loss value decreases only slowly during training, the generated samples are evaluated; when the promoter identification results show a relatively high proportion of true promoters and the sequences have clearly conserved regions, the parameters of that training epoch are saved.
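The checkpoint decision in S3.4 can be sketched as a small helper. The classifier verdicts below are invented for illustration, and the default threshold of 0.4 is an assumption loosely based on the "more than 40%" figure reported in the results; the text does not state the threshold actually used.

```python
def true_promoter_ratio(labels):
    """Fraction of generated sequences that the classifier marked as true promoters."""
    if not labels:
        return 0.0
    return sum(1 for is_true in labels if is_true) / len(labels)

def should_save_checkpoint(labels, has_conserved_regions, ratio_threshold=0.4):
    """Save only when the ratio is high enough and the -35/-10 motifs are visible."""
    return true_promoter_ratio(labels) >= ratio_threshold and has_conserved_regions

# Hypothetical classifier verdicts for ten generated sequences.
verdicts = [True, False, True, True, False, False, True, False, True, False]
print(true_promoter_ratio(verdicts))
print(should_save_checkpoint(verdicts, has_conserved_regions=True))
```

Here `has_conserved_regions` stands in for the visual inspection of the sequence logo, which the text describes as a manual observation step.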
Step S4: generating non-natural promoters with the diffusion model, using the parameters saved after training.
To illustrate the advantages of the generative model constructed by the application, the generated promoters were subjected to deep-learning-based true/false promoter discrimination and to sequence feature analysis. The results are shown in FIG. 2: more than 40% of the generated promoters are true promoters, and the promoters have clear -35 and -10 regions, indicating that the generated promoters are highly credible.
In the method for generating a non-natural promoter based on a diffusion model provided by this application, gene sequences are digitized by computer encoding, the digitized genes are used as input, a diffusion model learns their features, and the quality of the generated samples is evaluated before the model is used to generate non-natural promoters. Compared with the existing technique of designing promoters with a generative adversarial network, the method trains more stably, can effectively identify short key regions of the sequence, and can also capture the association between those short regions and the full-length sequence. In addition, the diffusion model is better suited to stably generating promoters with higher diversity, which helps in mining new promoters.
Some steps in the embodiments of the present application may be implemented by using software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (6)

1. A method of generating a non-natural promoter based on a diffusion model, the method comprising:
step S1: constructing a diffusion model for generating the non-natural promoter, wherein the diffusion model is based on a UNet convolutional neural network; convolutional layers are used to build the encoder of the UNet; the decoder restores the image size by upsampling; standard UNet skip connections transfer features between the encoder and the decoder; and self-attention mechanisms are introduced into both the encoder and the decoder;
step S2: training the diffusion model for generating the non-natural promoter by using the promoters in a public dataset as training data;
step S3: generating new promoters by using the trained diffusion model for generating the non-natural promoter.
2. The method according to claim 1, wherein the step S2 comprises:
digitizing the gene sequence of the promoter in the public dataset;
training the diffusion model for generating the non-natural promoter with the digitized promoter gene sequences, calculating a loss value during training, performing promoter identification and conservation assessment on the output samples, and saving the model parameters after training is completed;
the promoter identification uses the deep-learning-based Promo R module to classify each generated sequence as a true or false promoter, and calculates the proportion of true promoters among all generated promoters;
the promoter conservation is evaluated by aligning the generated promoter sequences and observing the sequences of the -35 region and the -10 region; when the motifs of the -35 region and the -10 region are TT and TATAAT respectively, the generated promoter is considered to have the characteristics of a natural promoter.
3. The method of claim 2, wherein MetaLogo is used as the tool for observing the sequences of the -35 and -10 regions.
4. The method of claim 3, wherein digitizing the gene sequence of the promoter in the public dataset comprises:
extracting features by one-hot encoding, and converting a sequence of 50 bases into an array with 1 channel, a length of 4 and a width of 50;
the encoded bases A, T, C and G are [1 0 0 0], [0 0 0 1], [0 1 0 0] and [0 0 1 0], respectively.
5. The method of claim 4, further comprising setting a threshold for the proportion of true promoters among all generated promoters.
6. Use of the method for generating non-natural promoters based on a diffusion model according to any one of claims 1-5 in the fields of chemicals and pharmaceuticals.
CN202310954854.2A 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model Pending CN116978462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310954854.2A CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310954854.2A CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Publications (1)

Publication Number Publication Date
CN116978462A true CN116978462A (en) 2023-10-31

Family

ID=88476400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310954854.2A Pending CN116978462A (en) 2023-07-31 2023-07-31 Method for generating non-natural promoter based on diffusion model

Country Status (1)

Country Link
CN (1) CN116978462A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038993A (en) * 2024-04-11 2024-05-14 云南师范大学 Protein sequence diffusion generation method based on generation countermeasure network drive

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination