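WO2024060183A1 - Enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment - Google Patents

Enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment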

Info

Publication number
WO2024060183A1
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
sequence
acid sequences
enzyme
acid sequence
Prior art date
Application number
PCT/CN2022/120790
Other languages
English (en)
French (fr)
Inventor
余函
张洋铭
罗小舟
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2024060183A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention belongs to the field of biomedicine technology. Specifically, it relates to an enzyme sequence generation method based on multiple sequence alignment, a generation device, a computer-readable storage medium, and a computer device.
  • Enzymes have important applications in the fields of biocatalysis and chemical industry.
  • the limited number of natural enzymes limits their industrial application in real-world downstream scenarios.
  • the function of an enzyme is determined by its structure, and the structure of an enzyme is essentially determined by its primary sequence. Therefore, to explore the functional space of enzymes more effectively, the sequence space of natural enzymes needs to be broadened.
  • computational methods for modifying enzymes have become another important line of research.
  • a representative example is the enzyme sequence generation method based on generative adversarial networks, which has been shown to broaden the space of functional enzyme sequences effectively. However, this method still performs poorly when samples are scarce, that is, in few-sequence generation: key sites, for example, can be lost, which lowers the proportion of enzymatically active sequences among the generated amino acid sequences.
  • An enzyme sequence generation method based on multiple sequence alignment includes:
  • the amino acid sequence generation model is used to generate multiple extended amino acid sequences of the target enzyme.
  • the method of screening out several similar amino acid sequences similar to the complete amino acid sequence of the target enzyme from the sequence database is:
  • a local alignment search tool is used to screen out several similar amino acid sequences from the sequence database, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than the first threshold and the similarity is greater than the second threshold.
  • methods for training a pre-built generative adversarial network model using several aligned amino acid sequences as training samples include:
  • the first threshold is 90% and the second threshold is 70%.
  • the enzyme sequence generating device includes:
  • a sequence screening unit is used to screen out several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme from the sequence database;
  • a multi-sequence comparison unit is used to perform multi-sequence comparison processing on the complete amino acid sequence and several similar amino acid sequences to obtain several aligned amino acid sequences, wherein each aligned amino acid sequence has the same length;
  • the model training unit is used to train a pre-built generative adversarial network model using several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
  • a sequence generation unit is used to generate multiple extended amino acid sequences of the target enzyme using the amino acid sequence generation model.
  • sequence screening unit is also used to:
  • a local alignment search tool is used to screen out several similar amino acid sequences from the sequence database, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than the first threshold and the similarity is greater than the second threshold.
  • model training unit includes:
  • the encoding submodule uses different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences, and converts each aligned amino acid sequence into a digital code string;
  • the training submodule is used to train a pre-built generative adversarial network model using a number of digital code strings corresponding to a number of aligned amino acid sequences as training samples.
  • This application also discloses a computer-readable storage medium that stores an enzyme sequence generation program based on multiple sequence alignment.
  • when the enzyme sequence generation program based on multiple sequence alignment is executed by a processor, the above-mentioned enzyme sequence generation method based on multiple sequence alignment is implemented.
  • the application also discloses a computer device, which includes a computer-readable storage medium, a processor, and an enzyme sequence generation program based on multiple sequence alignment stored in the computer-readable storage medium; when the program is executed by the processor, the above-mentioned enzyme sequence generation method based on multiple sequence alignment is implemented.
  • the invention discloses an enzyme sequence generation method and device based on multiple sequence alignment. Compared with the existing technology, it has the following technical effects:
  • the model can fully learn and retain key site information in the amino acid sequence, so that the proportion of new amino acid sequences generated by the model with enzymatic activity is higher.
  • Figure 1 is a flow chart of an enzyme sequence generation method based on multiple sequence alignment according to Embodiment 1 of the present invention
  • Figure 2 is a schematic diagram of the amino acid sequence before and after multiple sequence alignment processing according to Embodiment 1 of the present invention
  • Figure 3 is a schematic diagram of an enzyme sequence generation device based on multiple sequence alignment according to Embodiment 2 of the present invention.
  • Figure 4 is a schematic diagram of computer equipment according to Embodiment 4 of the present invention.
  • the inventive concept of the present application: when a generative adversarial network is used to generate enzyme sequences in the prior art, the small number of enzyme sequence samples prevents the network from effectively learning the key-site information in the enzyme sequences, so the regenerated enzyme sequences tend to lose key sites and the proportion of sequences with enzyme activity is low.
  • the enzyme sequence generation method based on multiple sequence alignment provided by this application first screens, from the sequence database, multiple similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme, then performs multiple sequence alignment, trains the generative adversarial network model with the aligned amino acid sequences, and finally uses the trained model to generate new amino acid sequences.
  • Example 1 provides a method for generating enzyme sequences based on multiple sequence alignment, which includes the following steps:
  • Step S10 Screen out several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme from the sequence database;
  • Step S20 Perform multiple sequence alignment processing on the complete amino acid sequence and several similar amino acid sequences to obtain several aligned amino acid sequences, wherein each aligned amino acid sequence has the same length;
  • Step S30 using a plurality of aligned amino acid sequences as training samples to train a pre-built generative adversarial network model to obtain an amino acid sequence generation model;
  • Step S40 Use the amino acid sequence generation model to generate multiple extended amino acid sequences of the target enzyme.
  • the main purpose in step S10 is to increase the number of sequence samples.
  • a local alignment search tool (Basic Local Alignment Search Tool, BLAST) is used to screen out several similar amino acid sequences from the sequence database, where the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than the first threshold and the similarity is greater than the second threshold.
  • the first threshold is 90% and the second threshold is 70%, so that natural amino acid sequences similar to the complete amino acid sequence of the target enzyme can be screened out, thereby increasing the number of samples.
  • the sequence database can be Uniprot sequence database, etc.
  • the function of an enzyme is mainly reflected by some key amino acids in the amino acid sequence
  • for example, different highly thermostable enzymes all share the same key amino acids (key sites); that is, both the type of each key amino acid and its position in the sequence are the same.
  • because the amino acid sequences of different enzymes differ in length, the positions of key amino acids differ between sequences; that is, the key sites are not aligned.
  • for example, the first amino acid sequence has length 10 and its 5th amino acid is the key site G, while the second amino acid sequence has length 20 and its 10th amino acid is the key site G; the key site G is therefore not aligned column-wise.
  • Figure 2 shows the amino acid sequences of polyethylene terephthalate hydrolase (PETase) before and after multiple sequence alignment.
  • the aligned sequences effectively line up the key sites.
  • multiple sequence alignment software such as MEGA can be used to perform the multiple sequence alignment described above.
  • after the aligned amino acid sequences are obtained, different numbers are used to represent the different amino acid types and the padding character in the aligned amino acid sequences, and each aligned amino acid sequence is converted into a digital code string.
  • the digital code strings can be recognized by the model, and the digital code strings corresponding to the aligned amino acid sequences are used as training samples to train the pre-built generative adversarial network model.
  • the aligned amino acid sequences contain 20 different natural amino acids plus the padding character. The numbers 0, 1, 2, ..., 19, 20 can therefore represent the 20 amino acids and the padding character, and each aligned amino acid sequence is converted into a digital code string made up of these numbers.
  • the pre-built generative adversarial network model includes a generator and a discriminator. Random noise is input to the generator, the generator outputs generated data, and some data are selected from the training samples as real data; the generated data and the real data are jointly input into the discriminator, and the discriminator outputs a discrimination result; the network parameters of the generator and the discriminator are adjusted according to the discrimination result to complete one round of training; these training steps are repeated until predetermined training conditions are met, yielding the amino acid sequence generation model.
  • the generative adversarial network model adopts the WGAN-GP network.
  • the trained amino acid sequence generation model is used to generate new extended amino acid sequences in batches. Because the model can fully learn the key-site information of natural amino acid sequences, the new extended amino acid sequences it generates also retain the key-site information fairly completely. The extended amino acid sequences thus differ from the complete amino acid sequence of the target enzyme, while the enzymes synthesized from them have the same function as the target enzyme; that is, the extended amino acid sequences have enzymatic activity, which increases the proportion of enzymatically active sequences produced by the model.
  • the applicant verified the method of the first embodiment from two aspects: computer simulation and experimental verification.
  • the enzyme sequence generation method based on multiple sequence alignment disclosed in the first embodiment screens similar natural amino acid sequences and performs multiple sequence alignment, so that the model can fully learn and retain the key-site information in the amino acid sequences; a higher proportion of the new amino acid sequences generated by the model therefore have enzymatic activity.
  • the number of amino acid sequences is the same before and after multiple sequence alignment: the number of aligned amino acid sequences equals the number of complete amino acid sequences plus the number of similar amino acid sequences. In the prior art, the amino acid sequences before multiple sequence alignment are input directly into the model, and the proportion of generated sequences with enzymatic activity is low. In this embodiment, the amino acid sequences after multiple sequence alignment better retain the key-site information, so the proportion of enzymatically active sequences generated when they are input into the model is higher.
  • the second embodiment also discloses an enzyme sequence generation device based on multiple sequence alignment.
  • the enzyme sequence generation device includes a sequence screening unit 100, a multiple sequence alignment unit 200, a model training unit 300, and a sequence generation unit 400.
  • the sequence screening unit 100 is used to screen out several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme from the sequence database; the multiple sequence alignment unit 200 is used to perform multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, where each aligned amino acid sequence has the same length; the model training unit 300 is used to train the pre-built generative adversarial network model with the aligned amino acid sequences as training samples to obtain an amino acid sequence generation model; the sequence generation unit 400 uses the amino acid sequence generation model to generate multiple extended amino acid sequences of the target enzyme.
  • the sequence screening unit 100 is also used to screen out several similar amino acid sequences from the sequence database with a local alignment search tool, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than the first threshold and the similarity is greater than the second threshold.
  • the model training unit 300 includes a coding sub-module and a training sub-module.
  • the coding submodule uses different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences, and converts each aligned amino acid sequence into a digital code string; the training submodule uses the digital code strings corresponding to the aligned amino acid sequences as training samples to train the pre-built generative adversarial network model.
  • the third embodiment also discloses a computer-readable storage medium.
  • the computer-readable storage medium stores an enzyme sequence generation program based on multiple sequence alignment; when the program is executed by a processor, the enzyme sequence generation method based on multiple sequence alignment of Embodiment 1 is implemented.
  • the fourth embodiment also discloses a computer device.
  • the terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11.
  • the processor 12 reads the corresponding computer program from the computer-readable storage medium and then runs it, forming a request processing device at the logical level.
  • one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
  • the computer-readable storage medium 11 stores an enzyme sequence generation program based on multiple sequence alignment. When the program is executed by a processor, the above-mentioned enzyme sequence generation method based on multiple sequence alignment is implemented.
  • Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Epidemiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

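An enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment. The enzyme sequence generation method comprises: screening, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of a target enzyme (S10); performing multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, all of which have the same length (S20); training a pre-built generative adversarial network model with the aligned amino acid sequences as training samples to obtain an amino acid sequence generation model (S30); and using the amino acid sequence generation model to generate multiple extended amino acid sequences of the target enzyme (S40). By screening similar natural amino acid sequences and performing multiple sequence alignment, the model can fully learn and retain the key-site information in the amino acid sequences, so that a higher proportion of the new amino acid sequences generated by the model have enzymatic activity.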

Description

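Enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment
Technical Field
The present invention belongs to the field of biomedical technology and, specifically, relates to an enzyme sequence generation method based on multiple sequence alignment, a generation device, a computer-readable storage medium, and a computer device.
Background
Enzymes have important applications in biocatalysis and the chemical industry, but the limited number of naturally occurring enzymes restricts their industrial use in real downstream scenarios. It is well known that the function of an enzyme is determined by its structure, and that the structure is in essence determined by its primary sequence; to explore the functional space of enzymes more effectively, the sequence space of natural enzymes therefore needs to be broadened. In addition to traditional experimental methods for modifying enzymes, such as directed evolution and rational design, the development of machine learning and deep learning has made computational enzyme modification another important line of research. A representative example is enzyme sequence generation based on generative adversarial networks, which has been shown to broaden the space of functional enzyme sequences effectively. However, this approach still performs poorly when samples are scarce, that is, in few-sequence generation: key sites, for example, can be lost, so that the proportion of enzymatically active sequences among the generated amino acid sequences is low.
Summary of the Invention
(1) Technical problem to be solved by the invention
How to increase the proportion of enzymatically active sequences among the generated enzyme amino acid sequences.
(2) Technical solution adopted by the invention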
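An enzyme sequence generation method based on multiple sequence alignment, the enzyme sequence generation method comprising:
screening, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of a target enzyme;
performing multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, wherein all aligned amino acid sequences have the same length;
training a pre-built generative adversarial network model using the several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
generating multiple extended amino acid sequences of the target enzyme using the amino acid sequence generation model.
Optionally, the several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme are screened from the sequence database as follows:
a local alignment search tool is used to screen several similar amino acid sequences from the sequence database, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold.
Optionally, training the pre-built generative adversarial network model using the several aligned amino acid sequences as training samples comprises:
using different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences, and converting each aligned amino acid sequence into a digital code string;
training the pre-built generative adversarial network model using the digital code strings corresponding to the several aligned amino acid sequences as training samples.
Optionally, each digital code string has 21 different numbers.
Optionally, the first threshold is 90% and the second threshold is 70%.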
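The present application also discloses an enzyme sequence generation device based on multiple sequence alignment, the enzyme sequence generation device comprising:
a sequence screening unit, used to screen, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of a target enzyme;
a multiple sequence alignment unit, used to perform multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, wherein all aligned amino acid sequences have the same length;
a model training unit, used to train a pre-built generative adversarial network model using the several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
a sequence generation unit, used to generate multiple extended amino acid sequences of the target enzyme using the amino acid sequence generation model.
Optionally, the sequence screening unit is further used to:
screen several similar amino acid sequences from the sequence database with a local alignment search tool, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold.
Optionally, the model training unit comprises:
an encoding submodule, used to represent the different amino acid types and the padding character in the aligned amino acid sequences with different numbers, and to convert each aligned amino acid sequence into a digital code string;
a training submodule, used to train the pre-built generative adversarial network model using the digital code strings corresponding to the several aligned amino acid sequences as training samples.
The present application also discloses a computer-readable storage medium storing an enzyme sequence generation program based on multiple sequence alignment; when the program is executed by a processor, the above enzyme sequence generation method based on multiple sequence alignment is implemented.
The present application also discloses a computer device comprising a computer-readable storage medium, a processor, and an enzyme sequence generation program based on multiple sequence alignment stored in the computer-readable storage medium; when the program is executed by the processor, the above enzyme sequence generation method based on multiple sequence alignment is implemented.
(3) Beneficial effects
Compared with the prior art, the enzyme sequence generation method and generation device based on multiple sequence alignment disclosed by the invention have the following technical effects:
By screening similar natural amino acid sequences and performing multiple sequence alignment, the model can fully learn and retain the key-site information in the amino acid sequences, so that a higher proportion of the new amino acid sequences generated by the model have enzymatic activity.
Brief Description of the Drawings
Figure 1 is a flow chart of the enzyme sequence generation method based on multiple sequence alignment according to Embodiment 1 of the present invention;
Figure 2 is a schematic diagram of the amino acid sequences before and after multiple sequence alignment according to Embodiment 1 of the present invention;
Figure 3 is a schematic diagram of the enzyme sequence generation device based on multiple sequence alignment according to Embodiment 2 of the present invention;
Figure 4 is a schematic diagram of the computer device according to Embodiment 4 of the present invention.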
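Detailed Description
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Before the embodiments of the present application are described in detail, the inventive concept is briefly outlined. When a generative adversarial network is used to generate enzyme sequences in the prior art, the small number of enzyme sequence samples prevents the network from effectively learning the key-site information in the enzyme sequences, so the regenerated enzyme sequences tend to lose key sites and the proportion of enzymatically active sequences is low. The enzyme sequence generation method based on multiple sequence alignment provided by the present application first screens, from a sequence database, multiple similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme, then performs multiple sequence alignment, trains the generative adversarial network model with the aligned amino acid sequences, and finally uses the trained model to generate new amino acid sequences. Because the number of sequence samples is increased and multiple sequence alignment places the key sites at the same positions, the model learns the key-site information more easily and retains it during learning; the amino acid sequences generated by the trained model therefore also carry the key-site information, which increases the proportion of enzymatically active sequences.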
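Specifically, as shown in Figure 1, Embodiment 1 provides an enzyme sequence generation method based on multiple sequence alignment that includes the following steps:
Step S10: screen, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme;
Step S20: perform multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, wherein all aligned amino acid sequences have the same length;
Step S30: train a pre-built generative adversarial network model using the several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
Step S40: use the amino acid sequence generation model to generate multiple extended amino acid sequences of the target enzyme.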
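Specifically, the main purpose of step S10 is to increase the number of sequence samples. For example, once the complete amino acid sequence of the target enzyme of interest has been determined, the Basic Local Alignment Search Tool (BLAST) is used to screen several similar amino acid sequences from the sequence database, where the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold. For example, the first threshold is 90% and the second threshold is 70%; this screens out natural amino acid sequences similar to the complete amino acid sequence of the target enzyme and increases the number of samples. The sequence database may be, for example, the UniProt sequence database.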
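The screening criterion of step S10 can be sketched as a filter over BLAST hits. The sketch below is illustrative only: it assumes the hits have already been exported as tab-separated lines with three columns (subject ID, percent identity, query coverage), and it reads "similarity" as percent identity. Both the column layout and the name `screen_hits` are assumptions made here, not details given in the application.

```python
from typing import Iterable, List

COVERAGE_THRESHOLD = 90.0    # first threshold from the text, in percent
SIMILARITY_THRESHOLD = 70.0  # second threshold from the text, in percent


def screen_hits(lines: Iterable[str],
                min_coverage: float = COVERAGE_THRESHOLD,
                min_similarity: float = SIMILARITY_THRESHOLD) -> List[str]:
    """Return the IDs of hits whose coverage and similarity both exceed the thresholds.

    Assumed (hypothetical) line layout:
        subject_id <TAB> percent_identity <TAB> query_coverage
    """
    kept: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, identity, coverage = line.split("\t")[:3]
        # Strict comparisons, matching "greater than" in the description.
        if float(coverage) > min_coverage and float(identity) > min_similarity:
            if subject_id not in kept:  # keep each subject once
                kept.append(subject_id)
    return kept


# Toy hits, invented for illustration.
hits = [
    "seqA\t85.2\t97",
    "seqB\t65.0\t99",  # similarity too low
    "seqC\t92.1\t80",  # coverage too low
    "seqD\t70.0\t95",  # similarity equals the threshold, so it is rejected
]
print(screen_hits(hits))
```

Only `seqA` passes both strict comparisons in this toy input; how ties at exactly 90% or 70% should be treated in practice is a design choice the text leaves to the reader.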
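Further, the function of an enzyme is mainly expressed through certain key amino acids in its amino acid sequence. For example, different highly thermostable enzymes all share the same key amino acids (key sites); that is, the type of each key amino acid and its position in the sequence are the same. On the other hand, because the amino acid sequences of different enzymes differ in length, the position of a key amino acid differs from sequence to sequence, that is, the key sites are not aligned. For example, if the first amino acid sequence has length 10 with the key site G as its 5th amino acid, and the second has length 20 with the key site G as its 10th amino acid, the key site G is not aligned column-wise. Multiple sequence alignment therefore inserts padding characters into the amino acid sequences so that as many key sites as possible are aligned and all aligned sequences have the same length. The key sites then fall in the same column, which helps the model recognize and retain key-site information during subsequent training. For example, Figure 2 shows the amino acid sequences of polyethylene terephthalate hydrolase (PETase) before and after multiple sequence alignment; the aligned sequences effectively line up the key sites. Multiple sequence alignment software such as MEGA can be used to perform this alignment.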
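The length-10 and length-20 example above can be made concrete with a toy alignment. The sketch below does not run an alignment algorithm; the aligned strings are written by hand, `-` is assumed as the padding character, and the helper names are hypothetical. It only checks the two properties the description relies on: equal length after alignment and key residues sharing a column.

```python
from typing import Dict, List

GAP = "-"  # assumed padding character; the application does not name one


def check_equal_length(aligned: List[str]) -> int:
    """Return the common length of the aligned sequences, or raise if they differ."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError(f"aligned sequences differ in length: {sorted(lengths)}")
    return lengths.pop()


def shared_columns(aligned: List[str]) -> Dict[int, str]:
    """Map 0-based column index to residue for columns where every
    sequence carries the same non-gap residue."""
    check_equal_length(aligned)
    shared: Dict[int, str] = {}
    for col, residues in enumerate(zip(*aligned)):
        first = residues[0]
        if first != GAP and all(r == first for r in residues):
            shared[col] = first
    return shared


# Unaligned toy sequences: G is the 5th residue of the first sequence
# and the 10th residue of the second.
raw = ["AAAAGAAAAA", "CCCCCCCCCGCCCCCCCCCC"]

# Hand-made alignment (illustrative, not the output of MEGA or any tool).
aligned = ["-----AAAAGAAAAA-----", "CCCCCCCCCGCCCCCCCCCC"]

print(check_equal_length(aligned))
print(shared_columns(aligned))
```

After padding, both G residues sit in the same column (0-based index 9), which is the situation the description says makes key sites easier for the model to pick up.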
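Further, after the aligned amino acid sequences are obtained, different numbers are used to represent the different amino acid types and the padding character in the aligned sequences, and each aligned amino acid sequence is converted into a digital code string that the model can recognize. The digital code strings corresponding to the aligned amino acid sequences are then used as training samples to train the pre-built generative adversarial network model. Because the aligned amino acid sequences contain 20 different natural amino acids plus the padding character, the numbers 0, 1, 2, ..., 19, 20 can be used to represent the 20 amino acids and the padding character, converting each aligned amino acid sequence into a digital code string made up of these numbers.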
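A minimal sketch of this encoding follows. The description fixes only the range 0 to 20; which number goes with which symbol is not stated, so the alphabetical assignment below (one-letter amino acid codes as 0 to 19, padding character `-` as 20) and the name `encode` are assumptions for illustration.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
GAP = "-"                             # assumed padding character

# Assumed mapping: residues -> 0..19 in the order above, padding -> 20.
SYMBOL_TO_CODE = {symbol: code for code, symbol in enumerate(AMINO_ACIDS + GAP)}


def encode(aligned_sequence: str) -> List[int]:
    """Convert one aligned amino acid sequence into a list of integer codes."""
    try:
        return [SYMBOL_TO_CODE[symbol] for symbol in aligned_sequence]
    except KeyError as err:
        raise ValueError(f"unsupported symbol in sequence: {err.args[0]!r}") from None


print(encode("--MG"))  # [20, 20, 10, 5]
```

Because every aligned sequence has the same length, the encoded samples all have the same length as well, so they can be stacked into a fixed-size training array.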
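In step S30, the pre-built generative adversarial network model includes a generator and a discriminator. Random noise is input to the generator, which outputs generated data, and part of the training samples is selected as real data; the generated data and the real data are input together into the discriminator, which outputs a discrimination result; the network parameters of the generator and the discriminator are adjusted according to the discrimination result to complete one round of training; these training steps are repeated until predetermined training conditions are met, yielding the amino acid sequence generation model. For example, the generative adversarial network model is a WGAN-GP network.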
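For orientation, the objective usually associated with WGAN-GP can be written out as follows. This is the standard formulation from the WGAN-GP literature, not something spelled out in the application; the symbols $D$ (discriminator, or critic), $G$ (generator), $P_r$ (distribution of real encoded sequences), $P_g$ (distribution of generated data), $z$ (random noise) and the penalty weight $\lambda$ are notation introduced here, and the application gives no value for $\lambda$.

```latex
% Discriminator (critic) loss with gradient penalty
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]

% Generator loss
L_G = - \mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]
```

In terms of the training round described above, the discrimination result corresponds to the values $D(x)$ and $D(\tilde{x})$, and adjusting the network parameters corresponds to gradient steps that lower $L_D$ for the discriminator and $L_G$ for the generator.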
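Finally, the trained amino acid sequence generation model is used to generate new extended amino acid sequences in batches. Because the model can fully learn the key-site information of natural amino acid sequences, the new extended amino acid sequences it generates also retain the key-site information fairly completely. The extended amino acid sequences thus differ from the complete amino acid sequence of the target enzyme, while the enzymes synthesized from them have the same function as the target enzyme; that is, the extended amino acid sequences have enzymatic activity, which increases the proportion of enzymatically active sequences produced by the model.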
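The application does not describe how generator output is turned back into amino acid letters. One plausible post-processing step is sketched below under explicit assumptions: the generator is taken to emit, for each aligned position, 21 scores (one per symbol); the highest-scoring symbol is chosen; and padding characters are dropped to obtain a plain sequence. The symbol order, the argmax rule, the gap stripping and the names `decode` and `one_hot` are all hypothetical.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"                    # assumed padding character
SYMBOLS = AMINO_ACIDS + GAP  # assumed order: index 0..19 residues, 20 padding


def decode(scores: Sequence[Sequence[float]], strip_gaps: bool = True) -> str:
    """Pick the highest-scoring symbol at each position and join them.

    `scores` holds one row of 21 numbers per aligned position (an assumed
    output format, not one stated in the application).
    """
    chars: List[str] = []
    for position in scores:
        if len(position) != len(SYMBOLS):
            raise ValueError("each position needs exactly 21 scores")
        best = max(range(len(SYMBOLS)), key=lambda i: position[i])
        chars.append(SYMBOLS[best])
    sequence = "".join(chars)
    return sequence.replace(GAP, "") if strip_gaps else sequence


def one_hot(code: int) -> List[float]:
    """Toy helper that builds a score row with a single winning symbol."""
    return [1.0 if i == code else 0.0 for i in range(len(SYMBOLS))]


toy_output = [one_hot(20), one_hot(10), one_hot(5)]  # padding, M, G
print(decode(toy_output, strip_gaps=False))  # -MG
print(decode(toy_output))                    # MG
```

Keeping the gapped form (`strip_gaps=False`) is useful when generated sequences are compared column by column against the training alignment; the ungapped form is what would be synthesized.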
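Further, the applicant verified the method of Embodiment 1 both by computer simulation and by experiment. With cytidine deaminase as the target enzyme, we compared in silico a generative model whose input was the complete amino acid sequences of the enzymes directly with one whose input was the amino acid sequences obtained by multiple sequence alignment. Analysis of the key sites in the new amino acid sequences output by the two models showed that the latter conserves key sites more effectively. Experiments further demonstrated that the latter effectively increases the proportion of active sequences.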
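The application does not say how the key sites of the generated sequences were analysed. One simple way such a comparison could be scored is sketched below: given known key sites as alignment columns, count the fraction of generated (still gapped) sequences that keep every key residue. The function names, the all-sites criterion and the toy data are assumptions for illustration and do not reproduce the applicant's analysis or results.

```python
from typing import Dict, List


def retains_key_sites(sequence: str, key_sites: Dict[int, str]) -> bool:
    """True if the sequence has the expected residue at every key column (0-based)."""
    return all(col < len(sequence) and sequence[col] == residue
               for col, residue in key_sites.items())


def retention_rate(sequences: List[str], key_sites: Dict[int, str]) -> float:
    """Fraction of sequences that retain all key sites (0.0 for an empty list)."""
    if not sequences:
        return 0.0
    kept = sum(retains_key_sites(seq, key_sites) for seq in sequences)
    return kept / len(sequences)


# Invented toy data: two key columns and four generated sequences.
key_sites = {2: "G", 5: "H"}
generated = ["AAGAAH", "AAGAAA", "A-GCCH", "AAAAAH"]
print(retention_rate(generated, key_sites))  # 0.5
```

Running the same function on the outputs of two differently trained models would give two comparable numbers; a stricter or looser criterion (for example, per-site retention) is equally possible.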
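The enzyme sequence generation method based on multiple sequence alignment disclosed in Embodiment 1 screens similar natural amino acid sequences and performs multiple sequence alignment, so that the model can fully learn and retain the key-site information in the amino acid sequences; a higher proportion of the new amino acid sequences generated by the model therefore have enzymatic activity. In addition, the number of amino acid sequences is the same before and after multiple sequence alignment: the number of aligned amino acid sequences equals the number of complete amino acid sequences plus the number of similar amino acid sequences. In the prior art, the amino acid sequences before multiple sequence alignment are fed directly into the model, and the proportion of generated sequences with enzymatic activity is low. In Embodiment 1, the amino acid sequences after multiple sequence alignment better preserve the key-site information, so the proportion of enzymatically active sequences generated when they are fed into the model is higher.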
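Further, as shown in Figure 3, Embodiment 2 discloses an enzyme sequence generation device based on multiple sequence alignment. The device includes a sequence screening unit 100, a multiple sequence alignment unit 200, a model training unit 300, and a sequence generation unit 400. The sequence screening unit 100 is used to screen, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme; the multiple sequence alignment unit 200 is used to perform multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, all of which have the same length; the model training unit 300 is used to train the pre-built generative adversarial network model with the aligned amino acid sequences as training samples to obtain an amino acid sequence generation model; the sequence generation unit 400 uses the amino acid sequence generation model to generate multiple extended amino acid sequences of the target enzyme.
Further, the sequence screening unit 100 is also used to screen several similar amino acid sequences from the sequence database with a local alignment search tool, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than the first threshold and the similarity is greater than the second threshold.
Further, the model training unit 300 includes an encoding submodule and a training submodule. The encoding submodule uses different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences and converts each aligned amino acid sequence into a digital code string; the training submodule uses the digital code strings corresponding to the aligned amino acid sequences as training samples to train the pre-built generative adversarial network model. For a more detailed account of how the sequence screening unit 100, the multiple sequence alignment unit 200, the model training unit 300 and the sequence generation unit 400 work, refer to the description of Embodiment 1; it is not repeated here.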
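Embodiment 3 discloses a computer-readable storage medium that stores an enzyme sequence generation program based on multiple sequence alignment; when the program is executed by a processor, the enzyme sequence generation method based on multiple sequence alignment of Embodiment 1 is implemented.
Embodiment 4 discloses a computer device. At the hardware level, as shown in Figure 4, the terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing device at the logical level. Of course, in addition to software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device. The computer-readable storage medium 11 stores an enzyme sequence generation program based on multiple sequence alignment; when the program is executed by the processor, the above enzyme sequence generation method based on multiple sequence alignment is implemented.
Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents, and that such modifications and refinements also fall within the scope of protection of the invention.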

Claims (13)

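  1. An enzyme sequence generation method based on multiple sequence alignment, wherein the enzyme sequence generation method comprises:
    screening, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of a target enzyme;
    performing multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, wherein all aligned amino acid sequences have the same length;
    training a pre-built generative adversarial network model using the several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
    generating multiple extended amino acid sequences of the target enzyme using the amino acid sequence generation model.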
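  2. The enzyme sequence generation method based on multiple sequence alignment according to claim 1, wherein the several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme are screened from the sequence database as follows:
    a local alignment search tool is used to screen several similar amino acid sequences from the sequence database, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold.
  3. The enzyme sequence generation method based on multiple sequence alignment according to claim 1, wherein training the pre-built generative adversarial network model using the several aligned amino acid sequences as training samples comprises:
    using different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences, and converting each aligned amino acid sequence into a digital code string;
    training the pre-built generative adversarial network model using the digital code strings corresponding to the several aligned amino acid sequences as training samples.
  4. The enzyme sequence generation method based on multiple sequence alignment according to claim 3, wherein each digital code string has 21 different numbers.
  5. The enzyme sequence generation method based on multiple sequence alignment according to claim 2, wherein the first threshold is 90% and the second threshold is 70%.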
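  6. An enzyme sequence generation device based on multiple sequence alignment, wherein the enzyme sequence generation device comprises:
    a sequence screening unit, used to screen, from a sequence database, several similar amino acid sequences that are similar to the complete amino acid sequence of a target enzyme;
    a multiple sequence alignment unit, used to perform multiple sequence alignment on the complete amino acid sequence and the several similar amino acid sequences to obtain several aligned amino acid sequences, wherein all aligned amino acid sequences have the same length;
    a model training unit, used to train a pre-built generative adversarial network model using the several aligned amino acid sequences as training samples to obtain an amino acid sequence generation model;
    a sequence generation unit, used to generate multiple extended amino acid sequences of the target enzyme using the amino acid sequence generation model.
  7. The enzyme sequence generation device based on multiple sequence alignment according to claim 6, wherein the sequence screening unit is further used to:
    screen several similar amino acid sequences from the sequence database with a local alignment search tool, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold.
  8. The enzyme sequence generation device based on multiple sequence alignment according to claim 6, wherein the model training unit comprises:
    an encoding submodule, used to represent the different amino acid types and the padding character in the aligned amino acid sequences with different numbers, and to convert each aligned amino acid sequence into a digital code string;
    a training submodule, used to train the pre-built generative adversarial network model using the digital code strings corresponding to the several aligned amino acid sequences as training samples.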
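  9. A computer-readable storage medium, wherein the computer-readable storage medium stores an enzyme sequence generation program based on multiple sequence alignment, and the program, when executed by a processor, implements the enzyme sequence generation method based on multiple sequence alignment of claim 1.
  10. The computer-readable storage medium according to claim 9, wherein the several similar amino acid sequences that are similar to the complete amino acid sequence of the target enzyme are screened from the sequence database as follows:
    a local alignment search tool is used to screen several similar amino acid sequences from the sequence database, wherein the coverage between each similar amino acid sequence and the complete amino acid sequence is greater than a first threshold and the similarity is greater than a second threshold.
  11. The computer-readable storage medium according to claim 9, wherein training the pre-built generative adversarial network model using the several aligned amino acid sequences as training samples comprises:
    using different numbers to represent the different amino acid types and the padding character in the aligned amino acid sequences, and converting each aligned amino acid sequence into a digital code string;
    training the pre-built generative adversarial network model using the digital code strings corresponding to the several aligned amino acid sequences as training samples.
  12. The computer-readable storage medium according to claim 11, wherein each digital code string has 21 different numbers.
  13. The computer-readable storage medium according to claim 10, wherein the first threshold is 90% and the second threshold is 70%.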
PCT/CN2022/120790 2022-09-21 2022-09-23 Enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment WO2024060183A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211156880.2 2022-09-21
CN202211156880.2A CN115472224A (zh) 2022-09-21 2022-09-21 Enzyme sequence generation method, device, medium and equipment based on multiple sequence alignment

Publications (1)

Publication Number Publication Date
WO2024060183A1 (zh) 2024-03-28

Family

ID=84335675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120790 WO2024060183A1 (zh) 2022-09-21 2022-09-23 Enzyme sequence generation method, generation device and storage medium based on multiple sequence alignment

Country Status (2)

Country Link
CN (1) CN115472224A (zh)
WO (1) WO2024060183A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
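JP2011165142A (ja) * 2010-02-15 2011-08-25 Kinki Univ Data search method for similar amino acid sequences and base sequences using Hamming length
CN109817275A (zh) * 2018-12-26 2019-05-28 东软集团股份有限公司 Protein function prediction model generation, protein function prediction method and device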
US20220122689A1 (en) * 2020-10-15 2022-04-21 Salesforce.Com, Inc. Systems and methods for alignment-based pre-training of protein prediction models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REPECKA DONATAS, JAUNISKIS VYKINTAS, KARPUS LAURYNAS, REMBEZA ELZBIETA, ZRIMEC JAN, POVILONIENE SIMONA, ROKAITIS IRMANTAS, LAURYNE: "Expanding functional protein sequence space using generative adversarial networks", BIORXIV, COLD SPRING HARBOR LABORATORY, 2 October 2019 (2019-10-02), pages 1 - 17, XP055836640, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/789719v1.full.pdf> [retrieved on 20210901], DOI: 10.1101/789719 *

Also Published As

Publication number Publication date
CN115472224A (zh) 2022-12-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22959190

Country of ref document: EP

Kind code of ref document: A1