WO2023098152A1 - Construction method and system for microbial gene database - Google Patents

Construction method and system for microbial gene database

Info

Publication number
WO2023098152A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
genes
representative
module
genomes
Application number
PCT/CN2022/113690
Other languages
French (fr)
Chinese (zh)
Inventor
徐晓强
夏炎
王晓凯
谢海亮
Original Assignee
深圳零一生命科技有限责任公司
Application filed by 深圳零一生命科技有限责任公司
Priority to CN202280004306.7A (CN116802740A)
Publication of WO2023098152A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations

Definitions

  • the invention belongs to the technical field of gene database construction, and in particular relates to a method and system for constructing a microbial gene database.
  • probiotics can help restore the balance of intestinal microecology, and have been widely used in dietary supplements.
  • due to the wide variety of probiotics, different countries have issued policies regulating which types of probiotics are edible.
  • microbial identification based on metagenomic sequencing relies on a reference gene set: sequencing reads are aligned to the reference gene set to determine the types and gene content of the microorganisms in a sample. Reference gene sets therefore exist for different species and regions.
  • the analysis of the target probiotics in the human intestine also requires the use of reference gene sets.
  • Integrated Gene Catalog (IGC)
  • Metagenomic Phylogenetic Analysis (MetaPhlAn) is a species annotation tool that profiles the composition of microbial communities from next-generation sequencing data. Although MetaPhlAn is regularly updated, it has the following limitations: (1) it obtains relative abundance by aligning reads to marker genes, which gives fewer false positives than other strategies but a low read utilization rate; (2) it detects fewer species, since only species in its database can be detected; (3) it annotates only to the species level, and strain-level results require the companion StrainPhlAn tool.
  • the two most widely used methods are not suitable for the analysis of target probiotics.
  • the traditional approach of building a reference database directly from probiotic genomes introduces a large amount of repeated information, which lowers efficiency; in addition, because microbial genomes share many segments, using whole genomes directly as the reference also reduces the accuracy of the detection results.
  • a first aspect of the present invention provides a method for constructing a microbial gene database, comprising the following steps:
  • step S2: performing gene prediction on the genome data obtained in step S1, and obtaining a gene annotation file;
  • step S3: using the gene annotation file obtained in step S2 to obtain the representative genes of each target microorganism;
  • the target microorganism can be any microorganism, including but not limited to bacteria, fungi, and viruses, all of which are applicable to the method of the present invention.
  • the target microorganism is a bacterium, and in some more specific embodiments of the present invention, the target microorganism is a food-usable bacterium.
  • in step S1, the genome data of each target microorganism in the target microorganism combination may be genome data stored in a commercial or non-commercial database, or genome data obtained by high-throughput sequencing.
  • the genome data is downloaded from the NCBI database. Specifically, the species name and taxonomic number of the target microorganism in NCBI are obtained first; the genomes of that species in NCBI are then retrieved by species name.
  • the genome data is obtained by sequencing with next-generation sequencing technology.
  • the method also includes filtering out genomes whose number of assembled long-sequence fragments (scaffolds) is ≥ 100, so that every retained genome of each target microorganism has fewer than 100 long-sequence fragments.
  • scaffolds: assembled long-sequence fragments
  • in step S2, any software, program or algorithm capable of gene prediction can be used to complete the gene prediction.
  • Prokka software is used to perform gene prediction on the genome data.
  • in step S3, for target microorganism n in the target microorganism combination, where target microorganism n denotes the nth target microorganism in the combination and 1 ≤ n ≤ N, the representative genes of target microorganism n are obtained according to its number of genomes M:
  • the following standard is used to determine whether the genome deviates from the overall population: if a certain genome is eliminated, the number of common genes in the remaining genome increases by more than 30% compared with that before the elimination, such as 30%, 35%, 40%, 50%, the genome deviates from the overall population.
  • M ≥ MB, where MB is the preset value at which the common genes need to be re-determined
  • each genome either contains the genes of a given gene combination or does not, so each of the M genomes has two possible states and there are 2^M combinations; removing the empty set (no genome contains the genes of the combination) leaves 2^M - 1 combinations, which is the same as the calculation result above. Therefore, as long as the principle is unchanged, the way it is explained or understood does not affect the number of combinations.
  • the common genes are re-determined according to the following steps, that is, it is determined whether the common genes need to be corrected:
  • the representative genes further include the top Y genes in descending order of occurrence in the genome among the remaining genes except the common genes.
  • the remaining genes need to be included by genomic frequency, where 50 ⁇ X ⁇ 100.
  • the common genes here can also be the above-mentioned revised common genes in a broad sense, so that the representative genes can more truly represent the target microorganisms.
  • before step S4, the method further includes a step of filtering the representative genes obtained in step S3: genes whose sequence length is less than 200 are removed.
  • the genes are aligned to the nucleic acid sequence database with BLAST+ (v2.11.0), a search tool based on a local alignment algorithm, using an E-value threshold of 1e-5.
  • after step S4 and before step S5, the method further includes a step of filtering the comparison results: results below the preset coverage threshold and/or below the preset identity threshold are removed.
  • the preset coverage threshold is 80%; the preset identity threshold is 65%.
  • before step S4 or after step S5, a step of gene de-redundancy is further included.
  • when de-redundancy is performed before step S4, it is applied to the representative genes of each target microorganism.
  • when de-redundancy is performed after step S5, it is applied to all retained genes.
  • any software, program or algorithm capable of realizing the de-redundancy function can be used, for example, any software, program or algorithm that can realize de-redundancy based on the principle of sequence similarity.
  • CD-HIT (v4.8.1) software is used for de-redundancy.
  • the following steps are used for de-redundancy:
  • de-redundancy is performed separately for each species: all genes in any sequence cluster containing more than one gene are removed, and the remaining genes are the uniquely aligned single-copy genes of that species;
  • the above de-redundancy steps are repeated for each newly added species.
  • a second aspect of the present invention provides a system for constructing a microbial gene database, comprising the following modules:
  • the genome data acquisition storage module is used to acquire and store the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N ≥ 1;
  • the gene prediction module is connected with the genome data acquisition storage module, and is used to perform gene prediction on the genome data acquired in the genome data acquisition module, and obtain and output gene annotation files containing sequences and species annotations;
  • a representative gene acquisition module connected to the gene prediction module, is used to receive the gene annotation file output by the gene prediction module, and use the gene annotation file to obtain the representative gene of each target microorganism and output it;
  • a nucleic acid sequence database storage module configured to receive and store a nucleic acid sequence database
  • a gene comparison module, connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, is used to receive the representative genes output by the representative gene acquisition module, compare each of the representative genes to the nucleic acid sequence database, and obtain and output the comparison results;
  • the gene verification module, connected to the gene comparison module, is used to verify whether the annotated species of a gene is the same as its source species: for the comparison result of each gene, the annotated species of the gene is obtained, and if the annotated species is the same as the source species, the gene is kept; the gene verification module also outputs all kept genes to construct the microbial gene database.
  • the construction system also includes: a gene de-redundancy module;
  • the gene de-redundancy module is connected to the gene verification module for receiving the retained genes output by the gene verification module, and performing de-redundancy to the retained genes in each target microorganism;
  • the gene de-redundancy module is connected to the representative gene acquisition module for receiving the representative genes output by the representative gene acquisition module, and performing de-redundancy on the representative genes of each target microorganism.
  • a gene filter module is further included, connected to the representative gene acquisition module and the gene comparison module respectively, which receives and filters the representative genes output by the representative gene acquisition module: genes whose sequence length is less than 200 are removed, and the filtered representative genes are then output to the gene comparison module.
  • a comparison result filtering module is further included, connected to the gene comparison module and the gene verification module respectively, which receives and filters the comparison results output by the gene comparison module: comparison results below the preset coverage threshold and/or below the preset identity threshold are removed.
  • the present invention has the following beneficial effects:
  • by integrating multiple kinds of information from microbial genomes and cross-validating them, the microbial gene database construction method of the present invention can establish a modular, non-redundant gene database that covers microorganisms at the species level, supports fast retrieval, and gives accurate identification.
  • the representative genes of the target microorganisms are first obtained, and then the source-annotation is verified by the NT library, so that the comparison results are more reliable and the classification information is more accurate.
  • the microbial gene database construction system of the present invention is composed of separate modules that are independent yet interconnected, so that modules can easily be added or removed and the modules cooperate to complete database construction.
  • the target probiotics can be quickly located through gene sequencing data, and the comparison time is shorter.
  • the data of the existing microbial genome database can be quickly updated, and new microbial genome information can also be quickly added to the database.
  • the microbial database constructed by the present invention can be used to assist high-throughput sequencing technology to more accurately detect the types and contents of probiotics.
  • Fig. 1 shows a schematic diagram of construction system #1 of Embodiment 1 of the present invention.
  • Fig. 2 shows a schematic diagram of construction system #8 of Embodiment 4 of the present invention.
  • Fig. 3 shows the combinations of genes of a probiotic in Example 6 of the present invention according to genome sources.
  • Numerical ranges in this application are approximations and therefore may include values outside the range unless otherwise indicated. Numerical ranges include all values from the lower value to the upper value in increments of 1 unit provided that there is a separation of at least 2 units between any lower value and any higher value.
  • the experimental methods not specifically described in the following examples are conventional methods.
  • the instruments and equipment used in the following examples, unless otherwise specified, are routine laboratory instruments and equipment; the test materials used in the following examples, unless otherwise specified, were purchased from conventional biochemical reagent stores.
  • construction system #1 comprises the following modules:
  • the genome data acquisition storage module is used to acquire and store the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N ≥ 1;
  • the gene prediction module is connected with the genome data acquisition storage module, and is used to perform gene prediction on the genome data obtained in the genome data acquisition module, and obtain and output the gene annotation file containing the sequence and annotation;
  • a nucleic acid sequence database storage module configured to receive and store a nucleic acid sequence database
  • the gene comparison module is connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, and is used to receive the representative genes output by the representative gene acquisition module, use gene comparison software to compare each of the representative genes to the nucleic acid sequence database, and obtain and output the comparison results;
  • the gene verification module, connected to the gene comparison module, is used to verify whether the annotated species of a gene is the same as its source species: for the comparison result of each gene, the annotated species of the gene is obtained, and if the annotated species is the same as the source species, the gene is kept; the gene verification module also outputs all kept genes to construct the microbial gene database.
  • This embodiment upgrades the construction system #1 of Example 1 to obtain the construction system #2.
  • the improvement is the addition of a gene de-redundancy module, which is connected to the gene verification module and is used to receive the retained genes output by the gene verification module, remove redundancy from the retained genes with gene de-redundancy software, extract the single-copy aligned genes, and obtain a non-redundant microbial gene database.
  • de-redundancy is performed separately for each species: all genes in any sequence cluster containing more than one gene are removed, and the remaining genes are the uniquely aligned single-copy genes of that species;
  • This example upgrades the construction system #1 of Example 1 or the construction system #2 of Example 2 to obtain construction system #3 and construction system #4.
  • the improvement is that, between the representative gene acquisition module and the gene comparison module, a gene filter module is further included, which is connected to both modules and is used to receive and filter the representative genes output by the representative gene acquisition module: genes whose sequence length is less than 200 are removed, and the filtered representative genes are then output to the gene comparison module.
  • This embodiment upgrades construction system #1 of Embodiment 1, construction system #2 of Embodiment 2, and construction systems #3 and #4 of Embodiment 3 to obtain construction systems #5, #6, #7 and #8. The improvement is that, between the gene comparison module and the gene verification module, a comparison result filtering module is further included, which is connected to both modules and is used to receive and filter the comparison results output by the gene comparison module: comparison results below the preset coverage threshold and/or below the preset identity threshold are removed.
  • the upgraded construction system #8 is shown in Fig. 2.
  • Example 5 The method for constructing the representative gene of probiotic Lactobacillus casei
  • Lactobacillus casei was selected as the target probiotic, and its species name (Organism Name) and taxonomic number (Taxid) in the National Center for Biotechnology Information (NCBI) of the United States were obtained: Lactobacillus casei and 1582, respectively.
  • the accession numbers of the genomes are GCA_000309565 (genome 1), GCA_000829055 (genome 2), GCA_002091975 (genome 3), GCA_002192215 (genome 4), GCA_011754305 (genome 5) and GCA_012932835 (genome 6); the genome download paths were obtained and the genome data downloaded.
  • the number of the number is 1, which indicates the number of genomes from which it is derived.
  • the genes in Genome 1 are only from Genome 6, the genes in Genome 3 are only from Genome 4, the genes in Genome 12 are only from Genome 1, Genome 3, and Genome 6, and the genes in Genome 13 are only from Genome 4. in the entire genome.
  • the source genomes include Genome 1, Genome 3 and Genome 6. This gene combination is used as a new subgroup, and its common genes are extracted, that is, 2253 genes are common genes.
  • the corrected common genes were filtered, that is, the genes whose length was less than 200 were filtered, and 3727 genes remained.
  • the genes were aligned to the nucleic acid sequence database (NT library) with an E-value threshold of 1e-5 to obtain the comparison results.
  • the annotated species of a gene is determined as follows: first, the comparison results are filtered with a coverage threshold of 80% and an identity threshold of 65%; then, for each gene, the top 10% of results sorted by identity are selected, and if more than 50% of these results have an identity greater than or equal to 95% and are annotated as the same species S, the gene is considered to be annotated as species S. Genes whose annotated species is not the source species are then removed, and genes whose annotated species is the same as the source species are retained.
  • Example 6 Another method of constructing the representative gene of probiotic Lactobacillus casei
  • This example is adjusted according to Example 5.
  • Example 7 Method for constructing representative genes of the probiotic Staphylococcus carnosus
  • Staphylococcus carnosus was selected as the target probiotic, and its species name (Organism Name) and taxonomic number (Taxid) in the National Center for Biotechnology Information (NCBI) of the United States were obtained: Staphylococcus carnosus and 1281, respectively.
  • in the gene library established by the method of the present invention, most target microorganisms have ≥ 500 uniquely aligned single-copy representative genes, while some target microorganisms (such as Bifidobacterium animalis and Bifidobacterium infantis) have < 200. To make the comparison results more accurate, the inventors then incorporated the 200 genes with the highest genome occurrence rate among the remaining genes of these two target microorganisms into the representative genes, so that the numbers of uniquely aligned single-copy representative genes reached 274 and 210, respectively. The larger the number of representative genes, the more accurate the comparison results; the fewer the qualified genes, the higher the comparison efficiency.
  • the probiotic database constructed in this example contains only the sequences of the target probiotic species. Compared with MetaPhlAn and IGC comparisons, the time required for comparison is significantly shorter. The comparison times are shown in Table 5 below.
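The species-annotation rule quoted above (coverage 80%, identity 65%, top 10% of results, more than 50% at identity of at least 95%) can be illustrated with a minimal Python sketch. The `Hit` record, the function names, rounding the "top 10%" up to at least one result, and dividing by the size of that top group are all assumptions made for illustration; the patent does not specify them.

```python
import math
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One comparison result for a gene (hypothetical record layout)."""
    species: str     # species annotation of the database sequence
    identity: float  # percent identity
    coverage: float  # percent of the gene covered by the alignment


def annotate_species(hits, min_cov=80.0, min_ident=65.0,
                     top_fraction=0.10, strong_ident=95.0, majority=0.50):
    """Return the annotated species S for one gene, or None if no call is made."""
    # Step 1: drop results below the coverage or identity threshold.
    passed = [h for h in hits if h.coverage >= min_cov and h.identity >= min_ident]
    if not passed:
        return None
    # Step 2: keep the top 10% by identity (assumption: round up, at least one).
    passed.sort(key=lambda h: h.identity, reverse=True)
    top = passed[:max(1, math.ceil(len(passed) * top_fraction))]
    # Step 3: more than 50% of the top results must have identity >= 95%
    # and share one species annotation.
    counts = Counter(h.species for h in top if h.identity >= strong_ident)
    if not counts:
        return None
    species, n = counts.most_common(1)[0]
    return species if n / len(top) > majority else None


def keep_gene(hits, source_species):
    """A gene is retained only when its annotated species equals its source species."""
    return annotate_species(hits) == source_species
```

For example, `keep_gene(hits, "Lactobacillus casei")` would be called once per representative gene, with `hits` built from that gene's comparison results.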

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention belongs to the technical field of gene database construction. Disclosed are a construction method and system for a microbial gene database. The method comprises the following steps: acquiring target microbial genome data, and performing gene prediction on the acquired genome data to obtain a gene annotation file, which includes sequences and species annotations; obtaining representative genes of each target microorganism; comparing each of the representative genes to a nucleic acid sequence database, so as to obtain comparison results; filtering the comparison results to obtain information of annotated species of the genes, and retaining the genes, the annotated species of which are the same as an origin species, so as to construct a microbial gene database. By constructing a microbial gene database using the construction method of the present invention, the database can be updated on the basis of a change in a target microorganism, such that the real-time performance is greater; and a microbial database that is constructed by using the present invention only includes gene sequences of the target microorganism, such that the time required for comparison is shorter.

Description

A method and system for constructing a microbial gene database
Technical field
The invention belongs to the technical field of gene database construction, and in particular relates to a method and system for constructing a microbial gene database.
Background art
In recent years, as research on the human microbiome has deepened, scientists have found that gut microbes play a major role in promoting human health, and that some current sub-health problems arise because the balance of the intestinal microecology has been disrupted. Probiotics, a class of microorganisms beneficial to the human body, can help restore this balance and are now widely used in dietary supplements. However, because probiotics are so varied, different countries have issued policies regulating which probiotic species may be consumed.
Traditionally, microorganisms are studied by culturing them and then observing their biochemical phenotypes, which takes dozens of days. For identifying microbial species, metagenomics, developed in recent years, extracts DNA directly from a sample for whole-genome sequencing; by analyzing and interpreting the sequencing results, it is possible to study the community structure, species classification, phylogeny, gene function and metabolic networks of the microorganisms in an environment. With the development of high-throughput sequencing, at least several hundred samples can now be tested simultaneously in a single run; and because no culturing is required, detection and analysis time is greatly shortened.
However, microbial identification based on metagenomic sequencing relies on a reference gene set: sequencing reads are aligned to the reference gene set to determine the types and gene content of the microorganisms in a sample. Reference gene sets therefore exist for different species and regions. Analyzing target probiotics in the human gut also requires a reference gene set, and two approaches are common: the Integrated Gene Catalog (IGC) or the Metagenomic Phylogenetic Analysis (MetaPhlAn) gene library.
The Integrated Gene Catalog (IGC), published in 2014, comprises 1,267 gut metagenomes and 9,879,896 genes. IGC has the following problems: (1) it contains many genes and annotates many microbial species, so alignment takes a very long time and efficiency is low; (2) its gene annotation has not been updated for a long time, so accuracy is low; (3) the public gene annotation reaches only the genus level, so target probiotics cannot be analyzed.
Metagenomic Phylogenetic Analysis (MetaPhlAn) is a species annotation tool that profiles the composition of microbial communities from next-generation sequencing data. Although MetaPhlAn is regularly updated, it has the following limitations: (1) it obtains relative abundance by aligning reads to marker genes, which gives fewer false positives than other strategies but a low read utilization rate; (2) it detects fewer species, since only species in its database can be detected; (3) it annotates only to the species level, and strain-level results require the companion StrainPhlAn tool.
Therefore, neither of the two most widely used methods is suitable for analyzing target probiotics. However, the traditional approach of building a reference database directly from probiotic genomes introduces a large amount of repeated information, which lowers efficiency; in addition, because microbial genomes share many segments, using whole genomes directly as the reference also reduces the accuracy of the detection results.
To solve at least one of the above technical problems, the present invention adopts the following technical solutions:
A first aspect of the present invention provides a method for constructing a microbial gene database, comprising the following steps:
S1, obtaining the genome data of each target microorganism in a target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N≥1;
S2, performing gene prediction on the genome data obtained in step S1 to obtain a gene annotation file;
S3, using the gene annotation file obtained in step S2 to obtain the representative genes of each target microorganism;
S4, comparing each of the representative genes to a nucleic acid sequence database to obtain comparison results;
S5, for the comparison result of each gene, obtaining the annotated species of the gene, and retaining the gene if the annotated species is the same as the source species;
S6, using all retained genes to form the microbial gene database.
In the present invention, the target microorganism can be any microorganism, including but not limited to bacteria, fungi and viruses, all of which are applicable to the method of the present invention. In some specific embodiments of the present invention, the target microorganism is a bacterium, and in some more specific embodiments, the target microorganism is a bacterium that can be used in food.
In some embodiments of the present invention, in step S1, the genome data of each target microorganism in the target microorganism combination may be genome data stored in a commercial or non-commercial database, or genome data obtained by high-throughput sequencing. In some specific embodiments of the present invention, the genome data is downloaded from the NCBI database. Specifically, the species name and taxonomic number of the target microorganism in NCBI are obtained first; the genomes of that species in NCBI are then retrieved by species name. In another specific embodiment of the present invention, the genome data is obtained by sequencing with next-generation sequencing technology.
在本发明的一些优选实施方案中,还包括过滤掉组装成长序列片段(Scaffolds)数目≥100的基因组,使得获得的每种目标微生物的各基因组中的长序列片段数目均小于100。In some preferred embodiments of the present invention, it also includes filtering out genomes with assembled long-sequence fragments (Scaffolds) number ≥ 100, so that the number of long-sequence fragments in each genome of each target microorganism obtained is less than 100.
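As a non-limiting illustration, the scaffold-count filter described above may be sketched as follows. The `Assembly` record and its field names are hypothetical and are introduced only for this sketch; how the scaffold count of each assembly is obtained is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembly:
    accession: str       # hypothetical field: assembly accession
    scaffold_count: int  # hypothetical field: number of scaffolds in the assembly


def filter_assemblies(assemblies, max_scaffolds=100):
    """Keep only assemblies with fewer than `max_scaffolds` scaffolds."""
    return [a for a in assemblies if a.scaffold_count < max_scaffolds]
```

The threshold is a parameter because other embodiments may use a different cut-off.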
In some embodiments of the present invention, in step S2, any software, program or algorithm capable of gene prediction may be used to perform the gene prediction. In some specific embodiments of the present invention, Prokka is used to perform gene prediction on the genome data.
In some embodiments of the present invention, in step S3, for target microorganism n in the target microorganism combination, where target microorganism n denotes the n-th target microorganism in the combination, 1≤n≤N, and M is the number of genomes of target microorganism n, the representative genes of target microorganism n are obtained according to the value of M:
(1) if M=1, all genes of the genome of target microorganism n are representative genes;
(2) if M≥2, the genes shared by all genomes are the representative genes.
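The two cases above may be sketched as follows. This is a minimal sketch that assumes each genome has already been reduced to a set of gene identifiers that are comparable across genomes (for example gene names taken from the annotation files); how genes are matched between genomes is an assumption of the sketch.

```python
def representative_genes(genomes):
    """genomes: list of sets of gene identifiers, one set per genome."""
    if not genomes:
        raise ValueError("at least one genome is required")
    if len(genomes) == 1:
        # Case (1): M = 1, every gene of the single genome is representative.
        return set(genomes[0])
    # Case (2): M >= 2, the representative genes are those shared by all genomes.
    return set.intersection(*map(set, genomes))
```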
In some embodiments of the present invention, further, for case (2) above, if M≥MA, it is determined whether any genome deviates from the population. If so, the deviating genome is removed and the remaining genomes are checked again; this is repeated until no remaining genome deviates from the population or the number of remaining genomes M<MA. The shared genes of the remaining genomes are then extracted as the corrected shared genes of all genomes and used as the representative genes of target microorganism n. Here MA is a preset value at or above which genomes are checked for deviation, MA≥3, for example MA=3, 4, 5, 6, 7, 8, 9, 10 or greater.
In some embodiments of the present invention, whether a genome deviates from the population is judged by the following criterion: if, after a genome is removed, the number of genes shared by the remaining genomes increases by 30% or more (for example by 30%, 35%, 40% or 50%) compared with before the removal, that genome deviates from the population.
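The iterative removal of deviating genomes may be sketched as follows. The function names are hypothetical. The sketch also assumes that, when several genomes meet the criterion in the same round, the one whose removal gives the largest relative increase is removed first; the text above does not prescribe an order.

```python
import math


def shared_gene_count(gene_sets):
    """Number of genes present in every set (0 for an empty list)."""
    return len(set.intersection(*gene_sets)) if gene_sets else 0


def drop_deviating_genomes(genomes, ma=3, threshold=0.30):
    """genomes: dict mapping genome id -> set of gene identifiers."""
    remaining = dict(genomes)
    while len(remaining) >= ma:
        base = shared_gene_count(list(remaining.values()))
        worst, worst_gain = None, threshold
        for gid in sorted(remaining):
            others = [g for k, g in remaining.items() if k != gid]
            after = shared_gene_count(others)
            # Relative increase of the shared-gene number after removing gid.
            if base:
                gain = (after - base) / base
            else:
                gain = math.inf if after else 0.0
            if gain >= worst_gain and (worst is None or gain > worst_gain):
                worst, worst_gain = gid, gain
        if worst is None:
            break  # no remaining genome deviates from the population
        del remaining[worst]
    return remaining
```

The shared genes of the genomes returned by `drop_deviating_genomes` then serve as the corrected shared genes.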
In some embodiments of the present invention, when the number of genomes M, with or without deviating genomes removed, satisfies M≥MB, where MB is a preset value at or above which the shared genes need to be re-determined, MB≥3, for example MB=3, 4, 5, 6, 7, 8, 9, 10 or greater, the shared genes are further re-determined according to the following steps, that is, it is determined whether the shared genes need to be corrected:
S31, according to the source genomes of each gene in the M genomes of target microorganism n, m gene combinations are formed, where $m = 2^M - 1$. That is, a gene comes either from exactly 1 genome, giving $\binom{M}{1}$ combinations; or from exactly 2 of the genomes, giving $\binom{M}{2}$ combinations; ...; or from exactly M-1 of the genomes, giving $\binom{M}{M-1}$ combinations; or from all M genomes, giving $\binom{M}{M}$ combination; so there are $\sum_{k=1}^{M}\binom{M}{k} = 2^M - 1$ combinations in total. Put another way, for a given gene combination each genome either contains the genes of that combination or does not, so each genome has 2 possible states and there are $2^M$ combinations; removing the empty set (no genome contains genes of the combination) leaves $2^M - 1$ combinations, the same as the result above. Therefore, as long as the principle is unchanged, the number of combinations does not depend on how it is explained or understood.
For example, if target microorganism n has 4 genomes, that is, M=4, there are $2^4 - 1 = 15$ possible source-genome patterns for the genes in the 4 genomes of target microorganism n, as shown in the table below:
Gene combination No.    Number of source genomes (out of genomes 1 to 4)
1 to 4                  1
5 to 10                 2
11 to 14                3
15                      4
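Step S31 may be sketched as follows: each gene is assigned to the gene combination given by the set of genomes in which it occurs. The sketch encodes a combination as a string of 0s and 1s whose i-th character refers to genome i, mirroring the notation used in Table 1 of Example 5; the function names and the use of gene identifiers comparable across genomes are assumptions of the sketch.

```python
from collections import defaultdict


def gene_combinations(genomes):
    """Group genes by source-genome pattern.

    genomes: list of sets of gene identifiers (genome 1 first).
    Returns {pattern: set of genes}; the i-th character of a pattern
    is "1" if the genes occur in genome i and "0" otherwise.
    """
    combos = defaultdict(set)
    for gene in set().union(*genomes):
        pattern = "".join("1" if gene in g else "0" for g in genomes)
        combos[pattern].add(gene)
    return dict(combos)


def max_combinations(m):
    """Number of non-empty subsets of m genomes, i.e. 2^M - 1."""
    return 2 ** m - 1
```

Only patterns that actually contain genes appear in the result, so the number of keys is at most `max_combinations(len(genomes))`.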
S32, count the number of genes in each gene combination, sort the gene numbers in descending order, and take the gene number Q at the S-th position, where 2≤S≤5, for example S=2, 3, 4 or 5.
S33, determine whether the number of genes in the gene combination derived from all M genomes is less than Q:
① if the number of genes in the gene combination derived from all M genomes is not less than Q, the shared genes of the M genomes are extracted directly, that is, no correction is needed;
② if the number of genes in the gene combination derived from all M genomes is less than Q, the shared genes are corrected according to the following steps:
S331, take the source genomes of the gene combination with the largest number of genes as a subgroup, and extract the shared genes of the subgroup;
S332, remove the genomes contained in the subgroup; if the number of remaining genomes is <MB, extract the shared genes of the remaining genomes (in particular, if only 1 genome remains, extract all genes of that genome as shared genes); if the number of remaining genomes is ≥MB, repeat steps S31-S33 to extract representative genes again;
S34, merge all shared genes together as the corrected shared genes of all genomes, and further use them as the representative genes of target microorganism n.
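Steps S31 to S34 may be sketched as the loop below. This is one possible reading of the steps and not a definitive implementation. The function name is hypothetical; the sketch assumes that the "shared genes of the subgroup" are all genes present in every genome of the subgroup, and that ties for the largest combination are broken arbitrarily.

```python
from collections import defaultdict


def corrected_shared_genes(genomes, s=2, mb=3):
    """genomes: dict mapping genome id -> set of gene identifiers."""
    remaining = dict(genomes)
    result = set()
    while remaining:
        ids = sorted(remaining)
        shared_all = set.intersection(*(remaining[g] for g in ids))
        if len(ids) < mb:
            # Fewer than MB genomes left: plain shared genes
            # (all genes when a single genome remains).
            result |= shared_all
            break
        # S31: group genes by the set of genomes they occur in.
        combos = defaultdict(set)
        for gene in set().union(*remaining.values()):
            key = frozenset(g for g in ids if gene in remaining[g])
            combos[key].add(gene)
        # S32: gene number Q at the S-th position, largest first.
        sizes = sorted((len(v) for v in combos.values()), reverse=True) or [0]
        q = sizes[min(s, len(sizes)) - 1]
        # S33: compare the all-genome combination with Q.
        if len(combos.get(frozenset(ids), ())) >= q:
            result |= shared_all  # case 1: no correction needed
            break
        # S331: the largest combination defines the subgroup.
        subgroup = max(combos, key=lambda k: len(combos[k]))
        result |= set.intersection(*(remaining[g] for g in subgroup))
        # S332: drop the subgroup and continue with the remaining genomes.
        for g in subgroup:
            del remaining[g]
    # S34: the union of all extracted shared genes.
    return result
```

The loop terminates because, whenever correction is needed, the largest combination has more genes than the all-genome combination and therefore covers only part of the remaining genomes.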
In other embodiments of the present invention, the shared genes are re-determined according to the following steps, that is, it is determined whether the shared genes need to be corrected:
Remove each genome in turn to obtain M subgroups of M-1 genomes each. If any subgroup has more shared genes than the M genomes, remove one further genome from the subgroup with the most shared genes to obtain M-1 sub-subgroups; if any sub-subgroup has more shared genes than the subgroup, treat the sub-subgroup in the same way, and so on, until a genome combination is reached for which removing any genome no longer increases the number of shared genes. The shared genes of that genome combination are taken as the corrected shared genes. It is worth noting that the shared genes re-determined by these steps are the same as the result obtained above. Hence, any steps that implement the concept of the present invention fall within its scope of protection, whichever steps are used.
In some embodiments of the present invention, the representative genes further include, among the genes other than the shared genes, the top Y genes ranked by genome occurrence rate in descending order. Here the genome occurrence rate of a gene is the percentage of all genomes in which the gene occurs, and 100≤Y≤300, for example Y=100, 120, 150, 180, 200, 250 or 300. In some preferred embodiments of the present invention, the remaining genes are added by genome occurrence rate only when the number of representative genes is less than X, where 50≤X≤100. The shared genes here may be, in the narrow sense, the genes shared by all genomes, or, in the broad sense, the corrected shared genes described above, so that the representative genes represent the target microorganism more faithfully.
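Supplementing the shared genes with frequently occurring genes may be sketched as follows. The function name and the alphabetical tie-break between genes with equal occurrence rates are assumptions of the sketch.

```python
from collections import Counter


def add_frequent_genes(genomes, shared, y=100, x=None):
    """Add the top-y non-shared genes by genome occurrence rate.

    genomes: list of sets of gene identifiers; shared: set of shared genes.
    If x is given, genes are added only when fewer than x shared genes exist.
    """
    if x is not None and len(shared) >= x:
        return set(shared)
    seen = Counter(g for genome in genomes for g in genome if g not in shared)
    # Occurrence rate = seen[g] / len(genomes); ranking by the count is equivalent.
    ranked = sorted(seen, key=lambda g: (-seen[g], g))
    return set(shared) | set(ranked[:y])
```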
In some embodiments of the present invention, before step S4, the method further includes a step of filtering the representative genes obtained in step S3: genes with a sequence length of less than 200 are filtered out. In some specific embodiments of the present invention, the genes are aligned against the nucleic acid sequence database using BLAST+ (v2.11.0), a search tool based on a local alignment algorithm, with an E-value threshold of 1e-5.
In some embodiments of the present invention, after step S4 and before step S5, the method further includes a step of filtering the alignment results: alignment results below a preset coverage threshold and/or below a preset identity threshold are removed. In some specific embodiments of the present invention, the preset coverage threshold is 80% and the preset identity threshold is 65%.
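The filtering of alignment results may be sketched as follows. The `Hit` record and its field names are hypothetical, and the sketch assumes that coverage and identity have already been computed as percentages for each alignment result; how coverage is derived from the aligner output is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    gene: str        # query gene identifier
    species: str     # species annotated for the database sequence
    identity: float  # percent identity of the alignment
    coverage: float  # percent coverage of the alignment


def filter_hits(hits, min_coverage=80.0, min_identity=65.0):
    """Drop alignment results below the coverage or the identity threshold."""
    return [
        h for h in hits
        if h.coverage >= min_coverage and h.identity >= min_identity
    ]
```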
In some embodiments of the present invention, in step S5, the annotated species of each gene is obtained as follows: the alignment results are sorted by identity and the top a% are selected; if b% or more of the selected alignment results are annotated to the same species with an identity of not less than c%, that species is the annotated species of the gene, where a=5~20, b=40~60 and c=90~98. In some specific embodiments of the present invention, a=10, b=50 and c=95.
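The rule for the annotated species, together with the source-versus-annotation check of step S5, may be sketched as follows. The function names are hypothetical. The sketch assumes that the top a% is rounded up to at least one alignment result and that only results with identity of at least c% count toward the b% share; both are readings of the rule above rather than stated details.

```python
import math
from collections import Counter


def annotated_species(hits, a=10, b=50, c=95):
    """hits: list of (species, percent identity) pairs for one gene.

    Returns the annotated species, or None if no species qualifies.
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    top = ranked[:max(1, math.ceil(len(ranked) * a / 100))]
    votes = Counter(species for species, identity in top if identity >= c)
    if not votes:
        return None
    species, n = votes.most_common(1)[0]
    return species if n / len(top) * 100 >= b else None


def keep_gene(hits, source_species, **kwargs):
    """Step S5: retain the gene only if annotated and source species agree."""
    return annotated_species(hits, **kwargs) == source_species
```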
In some embodiments of the present invention, a step of removing redundant genes (de-redundancy) is further included before step S4 or after step S5. Optionally, if de-redundancy is performed before step S4, it is applied to the representative genes of each target microorganism. Optionally, if de-redundancy is performed after step S5, it is applied to all retained genes.
In some embodiments of the present invention, de-redundancy may be performed with any software, program or algorithm capable of de-redundancy, for example any software, program or algorithm that removes redundancy based on sequence similarity. In some specific embodiments of the present invention, CD-HIT (v4.8.1) is used for de-redundancy. In some specific embodiments of the present invention, de-redundancy is performed by the following steps:
For each species, de-redundancy is performed separately: all genes in sequence clusters containing more than 1 gene are filtered out, and the remaining genes are the uniquely aligned single-copy genes of that species;
The de-redundant genes of all species are merged, and all genes in sequence clusters containing more than 1 gene are likewise filtered out.
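The two-stage de-redundancy above may be sketched as follows. The clustering itself is left to a caller-supplied function: `cluster_fn` is a hypothetical callable that takes a list of gene identifiers and returns the sequence clusters as lists of identifiers, for example by wrapping a sequence-similarity clustering tool. The function names are likewise assumptions of the sketch.

```python
def single_copy_genes(clusters):
    """Keep only genes that are alone in their sequence cluster.

    clusters: iterable of lists of gene ids, one list per sequence cluster.
    """
    return [members[0] for members in clusters if len(members) == 1]


def deduplicate(genes_by_species, cluster_fn):
    """genes_by_species: {species: [gene ids]}; cluster_fn: see lead-in."""
    merged = []
    # Stage 1: per species, drop every gene in a cluster of more than 1 gene.
    for species in sorted(genes_by_species):
        merged.extend(single_copy_genes(cluster_fn(genes_by_species[species])))
    # Stage 2: cluster the merged genes and apply the same filter.
    return single_copy_genes(cluster_fn(merged))
```

Note that a cluster with more than 1 gene is removed entirely rather than reduced to a single representative.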
In some embodiments of the present invention, if the database is updated, the above de-redundancy steps are repeated for each newly added species.
A second aspect of the present invention provides a system for constructing a microbial gene database, comprising the following modules:
a genome data acquisition and storage module, for acquiring and storing the genome data of each target microorganism in a target microorganism combination, wherein the target microorganism combination includes N target microorganisms, N≥1;
a gene prediction module, connected to the genome data acquisition and storage module, for performing gene prediction on the genome data acquired by the genome data acquisition and storage module, and obtaining and outputting gene annotation files containing sequences and species annotations;
a representative gene acquisition module, connected to the gene prediction module, for receiving the gene annotation files output by the gene prediction module, and using the gene annotation files to obtain and output the representative genes of each target microorganism;
a nucleic acid sequence database storage module, for receiving and storing a nucleic acid sequence database;
a gene comparison module, connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, for receiving the representative genes output by the representative gene acquisition module, aligning each of the representative genes against the nucleic acid sequence database, and obtaining and outputting alignment results;
a gene verification module, connected to the gene comparison module, for verifying whether the annotated species of a gene is the same as its source species: for the alignment results of each gene, the annotated species of the gene is determined, and the gene is retained if the annotated species is the same as the source species; the gene verification module is further used to output all retained genes to construct the microbial gene database.
Further, the construction system also includes a gene de-redundancy module;
optionally, the gene de-redundancy module is connected to the gene verification module, for receiving the retained genes output by the gene verification module and removing redundancy from the retained genes of each target microorganism;
optionally, the gene de-redundancy module is connected to the representative gene acquisition module, for receiving the representative genes output by the representative gene acquisition module and removing redundancy from the representative genes of each target microorganism.
In some embodiments of the present invention, a gene filtering module is further included between the representative gene acquisition module and the gene comparison module and is connected to both, for receiving the representative genes output by the representative gene acquisition module and filtering them: genes with a sequence length of less than 200 are filtered out, and the filtered representative genes are then output to the gene comparison module.
In some embodiments of the present invention, a comparison result filtering module is further included between the gene comparison module and the gene verification module and is connected to both, for receiving the alignment results output by the gene comparison module and filtering them: alignment results below a preset coverage threshold and/or below a preset identity threshold are removed.
In the present invention, all modules of the construction system according to the second aspect can perform functions identical or corresponding to the respective steps of the method according to the first aspect, which are not repeated here.
Beneficial effects of the present invention
Compared with the prior art, the present invention has the following beneficial effects:
By integrating multiple kinds of information on microbial genomes and applying cross-validation, the method for constructing a microbial gene database of the present invention can establish a modular, non-redundant gene database that covers microorganisms at the species level, is fast to search and gives accurate qualitative results.
The method for constructing a microbial gene database of the present invention first obtains the representative genes of the target microorganisms and then verifies source against annotation using the NT database, so that the alignment results are more reliable and the taxonomic information is more accurate.
The system for constructing a microbial gene database of the present invention is composed of separate modules that are independent of yet connected to one another, so that modules can easily be added or removed and the database can be constructed through the cooperation of the modules.
With the microbial database constructed by the present invention, a simple search index is sufficient to locate target probiotics quickly from gene sequencing data, with a shorter alignment time. The database is also convenient to update: the data of an existing microbial genome database can be updated quickly, and information on new microbial genomes can be added quickly.
The microbial database constructed by the present invention can be used to help high-throughput sequencing detect the species and abundance of probiotics more accurately.
Description of the drawings
Figure 1 shows a schematic diagram of construction system #1 of Example 1 of the present invention.
Figure 2 shows a schematic diagram of construction system #8 of Example 4 of the present invention.
Figure 3 shows the combinations of the genes of a probiotic by source genome in Example 6 of the present invention.
Detailed description of the embodiments
Unless otherwise stated, implied by the context, or customary in the art, all parts and percentages in this application are by weight, and the test and characterization methods used are current as of the filing date of this application. Where applicable, the contents of any patent, patent application or publication referred to in this application are incorporated herein by reference in their entirety, and their equivalent family patents are also incorporated by reference, in particular with respect to the definitions of terms in the art disclosed in those documents. If the definition of a specific term disclosed in the prior art is inconsistent with any definition provided in this application, the definition provided in this application shall prevail.
Numerical ranges in this application are approximations and may therefore include values outside the range unless otherwise indicated. A numerical range includes all values from the lower limit to the upper limit in increments of 1 unit, provided that any lower value and any higher value are separated by at least 2 units.
In order to make the technical problems solved by the present invention, its technical solutions and its beneficial effects clearer, the present invention is described in further detail below with reference to examples.
Examples
The following examples are provided to demonstrate preferred embodiments of the present invention. Those skilled in the art will appreciate that the techniques disclosed in the examples below represent techniques found by the inventors to work in the practice of the invention, and can therefore be regarded as preferred modes for its practice. However, those skilled in the art should understand from this specification that many modifications can be made to the specific embodiments disclosed herein while still obtaining the same or similar results, without departing from the spirit or scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which the present invention belongs. The publications cited herein and the materials they cite are incorporated by reference.
Those skilled in the art will recognize, or be able to ascertain by routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be covered by the claims.
Experimental methods not specifically described in the following examples are conventional methods. Unless otherwise specified, the instruments and equipment used in the following examples are conventional laboratory instruments and equipment, and the test materials used were purchased from conventional biochemical reagent suppliers.
Example 1: Microbial gene database construction system
As shown in Figure 1, this example provides a system for constructing a microbial gene database, namely construction system #1, which includes the following modules:
a genome data acquisition and storage module, for acquiring and storing the genome data of each target microorganism in a target microorganism combination, wherein the target microorganism combination includes N target microorganisms, N≥1;
a gene prediction module, connected to the genome data acquisition and storage module, for performing gene prediction on the genome data acquired by the genome data acquisition and storage module, and obtaining and outputting gene annotation files containing sequences and annotations;
a representative gene acquisition module, connected to the gene prediction module, for receiving the gene annotation files output by the gene prediction module, and using the gene annotation files to obtain and output the representative genes of each target microorganism;
a nucleic acid sequence database storage module, for receiving and storing a nucleic acid sequence database;
a gene comparison module, connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, for receiving the representative genes output by the representative gene acquisition module, using gene alignment software to align each of the representative genes against the nucleic acid sequence database, and obtaining and outputting alignment results;
a gene verification module, connected to the gene comparison module, for verifying whether the annotated species of a gene is the same as its source species: for the alignment results of each gene, the annotated species of the gene is determined, and the gene is retained if the annotated species is the same as the source species; the gene verification module is further used to output all retained genes to construct the microbial gene database.
Example 2: Upgraded microbial gene database construction system
This example upgrades construction system #1 of Example 1 to obtain construction system #2. The improvement is that a gene de-redundancy module is further included, connected to the gene verification module, for receiving the retained genes output by the gene verification module, removing redundancy from the retained genes using gene de-redundancy software, extracting the single-copy aligned genes, and obtaining a non-redundant microbial gene database.
The steps for extracting the single-copy aligned genes are as follows:
For each species, de-redundancy is performed separately: all genes in sequence clusters containing more than 1 gene are filtered out, and the remaining genes are the uniquely aligned single-copy genes of that species;
The de-redundant genes of all species are merged, and all genes in sequence clusters containing more than 1 gene are likewise filtered out.
Example 3: Upgraded microbial gene database construction system
This example upgrades construction system #1 of Example 1 and construction system #2 of Example 2 to obtain construction system #3 and construction system #4, respectively. The improvement is that a gene filtering module is further included between the representative gene acquisition module and the gene comparison module and is connected to both, for receiving the representative genes output by the representative gene acquisition module and filtering them: genes with a sequence length of less than 200 are filtered out, and the filtered representative genes are then output to the gene comparison module.
Example 4: Upgraded microbial gene database construction system
This example upgrades construction system #1 of Example 1, construction system #2 of Example 2, and construction systems #3 and #4 of Example 3 to obtain construction systems #5, #6, #7 and #8, respectively. The improvement is that a comparison result filtering module is further included between the gene comparison module and the gene verification module and is connected to both, for receiving the alignment results output by the gene comparison module and filtering them: alignment results below a preset coverage threshold and/or below a preset identity threshold are removed.
The upgraded construction system #8 is shown in Figure 2.
Example 5: Method for constructing the representative genes of the probiotic Lactobacillus casei
1. Target probiotic and genome sequences
In this example, Lactobacillus casei was selected as the target probiotic, and its species name (Organism Name) and taxonomy ID (Taxid) at the National Center for Biotechnology Information (NCBI) were obtained: Lactobacillus casei and 1582, respectively.
According to the species name, the genomes at Complete or Scaffold level in NCBI were obtained, 27 in total. Genomes assembled into too many scaffolds (≥200) were filtered out (21 in total), leaving 6 genomes for the species, with the accession numbers GCA_000309565 (genome 1), GCA_000829055 (genome 2), GCA_002091975 (genome 3), GCA_002192215 (genome 4), GCA_011754305 (genome 5) and GCA_012932835 (genome 6). The genome download paths were then obtained and the genome data downloaded.
2. Gene prediction
Gene prediction was performed on each genome with Prokka (v1.14.6), yielding gene annotation files that contain sequences and annotations.
3. Obtaining representative genes
First, MA=3 was selected, and whether a genome deviates from the population was judged by the following criterion: after the genome is removed, the number of genes shared by the remaining genomes increases by more than 50% relative to the number before removal. No genome was found to deviate from the population, so all 6 genomes were retained.
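The deviation check can be sketched as follows. This is a minimal sketch, not the patented implementation: it assumes each genome has already been reduced to a set of gene identifiers that are comparable across genomes (the text does not say how genes are matched between genomes), and it removes one genome per round, the one whose removal gives the largest gain. The function and parameter names are hypothetical.

```python
def shared_genes(genomes):
    """Intersection of the gene sets of all given genomes (dict: name -> set)."""
    sets = list(genomes.values())
    return set.intersection(*sets) if sets else set()

def remove_outliers(genomes, ma=3, min_increase=0.5):
    """Drop genomes whose removal raises the shared-gene count by more than min_increase.

    Repeats until no genome deviates or fewer than `ma` genomes remain.
    """
    kept = dict(genomes)
    while len(kept) >= ma:
        base = len(shared_genes(kept))
        # Gain in shared genes obtained by leaving each genome out in turn.
        gains = {
            name: len(shared_genes({k: v for k, v in kept.items() if k != name})) - base
            for name in kept
        }
        worst = max(gains, key=gains.get)
        if gains[worst] <= min_increase * base:
            break  # no genome deviates from the population
        del kept[worst]
    return kept
```

With three toy genomes where one shares only a single gene with the other two, the function keeps the two similar genomes and drops the third.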
MB=3 was selected. The 6 genomes contain 7436 genes in total, and the genes fall into 63 possible gene combinations according to their source genomes. The number of genes in each gene combination is shown in Table 1 and Figure 3 (only combinations whose gene count exceeds 1% of the total are shown):
Table 1 Gene combinations and gene numbers of the probiotic Lactobacillus casei
Gene combination No. | Genome combination | Number of genes
1 | 000001 | 302
2 | 000010 | 258
3 | 000100 | 845
4 | 001000 | 571
5 | 001001 | 107
6 | 010000 | 306
7 | 010010 | 510
8 | 010100 | 112
9 | 010110 | 1577
10 | 100000 | 296
11 | 100001 | 90
12 | 101001 | 1907
13 | 111111 | 289
In the second column, a 1 at position k indicates that the genes come from genome k. For example, the genes in gene combination 1 come only from genome 6, those in gene combination 3 come only from genome 4, those in gene combination 12 come only from genomes 1, 3 and 6, and those in gene combination 13 come from all genomes.
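The genome combination codes in Table 1 behave like presence/absence bit strings, which can be illustrated with a short Python sketch. The function names are hypothetical. The number of non-empty combinations of M genomes is 2^M − 1, which agrees with the 63 combinations stated for 6 genomes.

```python
from collections import Counter

def combination_code(present_in, n_genomes):
    """Encode the source genomes of a gene, e.g. {1, 3, 6} of 6 -> '101001'."""
    return "".join("1" if i in present_in else "0" for i in range(1, n_genomes + 1))

def sources_from_code(code):
    """Decode a combination code back to genome numbers, e.g. '101001' -> [1, 3, 6]."""
    return [i for i, bit in enumerate(code, start=1) if bit == "1"]

def count_combinations(gene_sources, n_genomes):
    """Count genes per combination code (gene_sources: gene id -> set of genome numbers)."""
    return Counter(combination_code(src, n_genomes) for src in gene_sources.values())

# Number of possible non-empty combinations for 6 genomes.
n_possible = 2 ** 6 - 1
```

Keeping the code as a string rather than an integer preserves the leading zeros shown in the table.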
The number of genes in each gene combination was counted, and the counts were sorted in descending order to obtain the count Q at the 2nd position: Q=1577, the number of genes in gene combination 9.
The gene combination derived from all 6 genomes contains 289 genes, which is less than Q. Therefore:
The gene combination with the most genes (combination 12) was selected. Its source genomes, genome 1, genome 3 and genome 6, were taken as a new subgroup, and the shared genes of the subgroup were extracted, giving 2253 shared genes.
The genomes in the subgroup were then removed. Three genomes remained, so the shared genes of the remaining genomes (corresponding to gene combination 9) were extracted, giving 1880 shared genes.
The shared genes obtained in the two rounds were merged, giving 3844 genes in total (2253 + 1880, less the 289 genes shared by all 6 genomes, which appear in both sets). These are the corrected shared genes of Lactobacillus casei, far more than the number obtained by directly extracting the genes shared by all genomes.
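The subgroup procedure above can be sketched in Python as follows. This is one possible reading of the steps, not the authors' code: genes are assumed to be given as a mapping from gene id to the set of genome numbers in which they occur, and when the procedure is repeated on the remaining genomes, each gene's sources are restricted to those remaining genomes. All names are hypothetical.

```python
from collections import Counter

def corrected_shared_genes(gene_sources, genomes, mb=3, s=2):
    """Return the merged ('corrected') shared genes for a set of genomes."""
    genomes = set(genomes)
    merged = set()
    while genomes:
        # Restrict every gene's sources to the genomes still under consideration.
        local = {g: frozenset(src & genomes) for g, src in gene_sources.items() if src & genomes}
        shared_all = {g for g, src in local.items() if src == genomes}
        if len(genomes) < mb:
            merged |= shared_all          # too few genomes left: take their shared genes
            break
        counts = Counter(local.values())  # number of genes per combination
        ranked = sorted(counts.values(), reverse=True)
        q = ranked[s - 1] if len(ranked) >= s else 0
        if counts.get(frozenset(genomes), 0) >= q:
            merged |= shared_all          # the all-genome combination is large enough
            break
        subgroup = set(max(counts, key=counts.get))  # sources of the largest combination
        merged |= {g for g, src in local.items() if subgroup <= src}
        genomes -= subgroup               # continue with the remaining genomes
    return merged
```

For the figures reported above, the merge step corresponds to 2253 + 1880 − 289 = 3844, because the 289 genes present in all 6 genomes belong to both shared-gene sets.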
4. Gene filtering
The corrected shared genes were first filtered by removing genes shorter than 200 in length; 3727 genes remained.
5. Gene verification
The genes were aligned to the nucleotide sequence database (NT database) with BLAST+ (v2.11.0), a search tool based on a local alignment algorithm, using an e-value threshold of 1e-5, to obtain alignment results. The annotated species of each gene was determined from the alignment results as follows: first, the alignment results were filtered with a coverage threshold of 80% and an identity threshold of 65%; then, for each gene, the alignment results were sorted by identity and the top 10% were selected. If more than 50% of these results had an identity of at least 95% and were annotated as the same species S, the gene was annotated as species S. Genes whose annotated species was not the source species were then filtered out, and genes whose annotated species matched the source species were retained.
After this step, 1184 genes remained.
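The species-assignment rule can be expressed as a small Python function. This is a sketch under assumptions: the `Hit` record is a hypothetical, already-parsed view of one alignment result (it is not a BLAST+ output format), the top 10% is rounded up to at least one hit, and "more than 50%" is taken as a strict majority.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """Hypothetical parsed alignment result for one query gene."""
    species: str
    identity: float  # percent identity
    coverage: float  # percent of the query covered by the alignment

def annotated_species(hits, min_cov=80.0, min_id=65.0, top_percent=10, strict_id=95.0):
    """Return the species assigned to a gene, or None if no species qualifies."""
    kept = [h for h in hits if h.coverage >= min_cov and h.identity >= min_id]
    if not kept:
        return None
    kept.sort(key=lambda h: h.identity, reverse=True)
    n_top = max(1, (len(kept) * top_percent + 99) // 100)  # ceiling of the top 10%
    top = kept[:n_top]
    votes = Counter(h.species for h in top if h.identity >= strict_id)
    if not votes:
        return None
    species, n = votes.most_common(1)[0]
    return species if 2 * n > len(top) else None  # strict majority of the top hits

def keep_gene(hits, source_species):
    """A gene is retained only if its annotated species equals its source species."""
    return annotated_species(hits) == source_species
```

Using integer arithmetic for the top-10% cut avoids floating-point rounding surprises when the number of hits is a multiple of ten.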
6. Gene de-redundancy
De-redundancy analysis was performed on the filtered genes with CD-HIT (v4.8.1).
In this step, all genes belonging to sequence clusters that contain more than one gene were filtered out. All remaining genes are the uniquely aligned single-copy representative genes of Lactobacillus casei, 1166 genes in total.
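Keeping only genes that form a cluster on their own can be sketched as below. The parser assumes the usual layout of a CD-HIT `.clstr` file (a `>Cluster N` header followed by one member line per sequence, with the sequence name between `>` and `...`); the example text and gene names are invented for illustration.

```python
import re

def singleton_genes(clstr_text):
    """Return the genes that are the only member of their cluster."""
    clusters, current = [], None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            current = []               # start a new cluster
            clusters.append(current)
        elif line.strip() and current is not None:
            m = re.search(r">(.+?)\.\.\.", line)  # sequence name on a member line
            if m:
                current.append(m.group(1))
    return [members[0] for members in clusters if len(members) == 1]

# Invented example in .clstr layout: cluster 0 has two members, cluster 1 has one.
example = (
    ">Cluster 0\n"
    "0\t1200nt, >gene_a... *\n"
    "1\t1198nt, >gene_b... at +/99.50%\n"
    ">Cluster 1\n"
    "0\t900nt, >gene_c... *\n"
)
single_copy = singleton_genes(example)
```

Note that this drops every member of a multi-gene cluster, including its representative, as described in the step above.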
Through the above steps, a larger number of representative genes is obtained, which makes the alignment results more accurate.
Example 6 Another method for constructing the representative genes of the probiotic Lactobacillus casei
This example modifies Example 5: the genes obtained in step 2 were first filtered and de-redundified using the methods of step 4 and step 6, and the representative genes were then obtained and verified. The same 1166 uniquely aligned single-copy representative genes were obtained.
Example 7 Method for constructing the representative genes of the probiotic Staphylococcus carnosus
In this example, Staphylococcus carnosus was selected as the target probiotic. Its species name (Organism Name) and taxonomy ID (Taxid) in the US National Center for Biotechnology Information (NCBI) were obtained: Staphylococcus carnosus and 1281, respectively.
The Complete- or Scaffold-level genomes in NCBI were retrieved, 11 in total. Genomes assembled into too many scaffolds (≥200) were filtered out (8 in total), leaving 3 genomes for the species. Their accession numbers are GCA_000009405 (genome 1), GCA_001701005 (genome 2) and GCA_003970565 (genome 3). The genome download paths were then obtained and the genome data were downloaded. According to the source genomes of the genes, there are 7 possible gene combinations; the number of genes in each is shown in Table 2 (combinations not listed contain 0 genes).
Table 2 Gene combinations and gene numbers of the probiotic Staphylococcus carnosus
Gene combination No. | Genome combination | Number of genes
1 | 001 | 2323
2 | 010 | 373
3 | 100 | 191
4 | 110 | 2270
5 | 111 | 30
The 3 genomes share 30 genes. MA=3 was selected, and whether a genome deviates from the population was judged by the following criterion: after the genome is removed, the number of genes shared by the remaining genomes increases by more than 50% relative to the number before removal. Genome 3 was found to deviate from the population, so 2 genomes were retained. The genes shared by genome 1 and genome 2 are those of combination 4 and combination 5 (2270 + 30), so the corrected shared genes of Staphylococcus carnosus total 2300.
The filtering, verification and de-redundancy steps follow Example 5 and are not repeated here. Finally, 1842 uniquely aligned single-copy representative genes were obtained.
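The shared-gene counts in this example can be recomputed directly from the combination counts in Table 2. The sketch below uses the table values as given; the function name is hypothetical.

```python
def shared_count(combo_counts, genome_ids):
    """Number of genes present in every genome listed in genome_ids (1-based positions)."""
    return sum(
        n for code, n in combo_counts.items()
        if all(code[i - 1] == "1" for i in genome_ids)
    )

# Combination code -> number of genes, from Table 2.
table2 = {"001": 2323, "010": 373, "100": 191, "110": 2270, "111": 30}

all_three = shared_count(table2, [1, 2, 3])   # genes shared by all three genomes
without_g1 = shared_count(table2, [2, 3])     # genome 1 removed
without_g2 = shared_count(table2, [1, 3])     # genome 2 removed
without_g3 = shared_count(table2, [1, 2])     # genome 3 removed

# Only removing genome 3 raises the shared-gene count by more than 50%.
deviating = [
    name for name, n in (("genome 1", without_g1), ("genome 2", without_g2), ("genome 3", without_g3))
    if n > 1.5 * all_three
]
```

The sum of all counts in the table is 5187, which matches the total gene number listed for Staphylococcus carnosus in Table 4.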
Example 8 Gene database of multiple probiotics
Using the same method, the uniquely aligned single-copy representative genes of all probiotics in Table 3 were obtained, and a gene database was constructed.
Table 3 List of target probiotics
Figure 302391dest_path_image010
The gene information of the above probiotics is shown in Table 4:
Table 4 Gene information of the non-redundant gene database
No. | Species/subspecies name | Genes | Uniquely aligned genes (length ≥200) | Uniquely aligned single-copy genes (length ≥200) | Representative genes | Uniquely aligned representative genes (length ≥200) | Uniquely aligned single-copy representative genes (length ≥200)
1 | Bifidobacterium adolescentis | 5937 | 2952 | 2605 | 1169 | 1100 | 1051
2 | Bifidobacterium animalis | 2438 | 1160 | 326 | 1324 | 853 | 74
3 | Bifidobacterium bifidum | 5211 | 2359 | 1689 | 1302 | 1231 | 968
4 | Bifidobacterium breve | 8098 | 4544 | 3029 | 1300 | 1152 | 762
5 | Bifidobacterium infantis | 5995 | 1623 | 825 | 1012 | 346 | 10
6 | Lactobacillus acidophilus | 2459 | 1992 | 1653 | 1550 | 1471 | 1273
7 | Lactobacillus casei | 7436 | 2462 | 2186 | 3844 | 1184 | 1166
8 | Lactobacillus plantarum | 21705 | 9691 | 6859 | 1754 | 1650 | 1153
9 | Lactobacillus reuteri | 10062 | 4438 | 3561 | 1096 | 1064 | 965
10 | Lactobacillus rhamnosus | 7632 | 3337 | 2527 | 1907 | 1780 | 1481
11 | Lactobacillus sakei | 5173 | 3259 | 2444 | 1360 | 1316 | 1088
12 | Streptococcus thermophilus | 6318 | 3240 | 2356 | 1105 | 1061 | 893
13 | Propionibacterium acidipropionici | 4817 | 4063 | 3712 | 2504 | 2454 | 2387
14 | Lactococcus lactis subsp. cremoris | 6714 | 2640 | 2108 | 1440 | 1010 | 902
15 | Pediococcus acidilactici | 4742 | 3012 | 2455 | 1309 | 1272 | 1098
16 | Pediococcus pentosaceus | 5057 | 2636 | 2083 | 1329 | 1252 | 1030
17 | Staphylococcus carnosus | 5187 | 2256 | 2145 | 2300 | 1940 | 1842
18 | Staphylococcus vitulinus | 2539 | 2340 | 2331 | 2539 | 2340 | 2331
19 | Staphylococcus xylosus | 6834 | 3057 | 2392 | 1461 | 1380 | 1351
20 | Bacillus coagulans | 6862 | 4402 | 3524 | 1820 | 1750 | 1675
As the table above shows, in the gene database built by the method of the present invention, most target microorganisms have ≥500 uniquely aligned single-copy representative genes, but some (such as Bifidobacterium animalis and Bifidobacterium infantis) have ≤200. To make the alignment results more accurate, the inventors randomly incorporated into the representative genes of these two target microorganisms the 200 remaining genes with the highest genome occurrence rates, bringing their numbers of uniquely aligned single-copy representative genes to 274 and 210, respectively. A larger number of representative genes makes the alignment results more accurate, while a smaller number of qualifying genes makes alignment more efficient.
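The top-up step can be sketched as a ranking by genome occurrence rate. This is an illustrative sketch only: the occurrence rate is assumed to be the fraction of a species' genomes that contain the gene, ties are broken by gene name so that the result is deterministic (the text describes the inclusion as random), and the function name and inputs are hypothetical.

```python
def top_up(representative, occurrence, y=200):
    """Add the y non-representative genes with the highest genome occurrence rate.

    representative: iterable of gene ids already selected
    occurrence: dict mapping gene id -> fraction of genomes containing the gene
    """
    rep = set(representative)
    candidates = sorted(
        (g for g in occurrence if g not in rep),
        key=lambda g: (-occurrence[g], g),  # highest rate first, then by name
    )
    return rep | set(candidates[:y])
```

With y=200 this accounts for the reported totals: 74 + 200 = 274 for Bifidobacterium animalis and 10 + 200 = 210 for Bifidobacterium infantis.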
The probiotic database constructed in this example contains only sequences of the target probiotic species. Compared with alignment against MetaPhlAn and IGC, the alignment time is markedly shorter (for the samples in Table 5, roughly 2.3 to 2.6 times shorter than with MetaPhlAn and 5.6 to 6.2 times shorter than with IGC). The alignment times are shown in Table 5 below.
Table 5 Alignment time required with different databases
Sample | Number of bases | MetaPhlAn alignment time | IGC alignment time | Alignment time with this database
ERR1190551 | 5.43G | 19m46.927s | 48m39.494s | 8m37.203s
ERR1190552 | 5.30G | 19m8.401s | 49m9.145s | 8m6.807s
ERR1190553 | 4.47G | 16m28.369s | 39m53.330s | 6m23.361s
ERR1190554 | 5.09G | 18m32.386s | 45m0.594s | 7m26.207s
ERR1190555 | 5.07G | 19m0.234s | 41m59.191s | 7m20.326s
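The relative speed-up implied by Table 5 can be recomputed with a few lines of Python. The timing strings are copied from the table; the helper names are hypothetical.

```python
import re

def to_seconds(t):
    """Convert a timing string of the form 'XmY.ZZZs' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognised time format: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Sample -> (MetaPhlAn time, IGC time, time with this database), from Table 5.
table5 = {
    "ERR1190551": ("19m46.927s", "48m39.494s", "8m37.203s"),
    "ERR1190552": ("19m8.401s", "49m9.145s", "8m6.807s"),
    "ERR1190553": ("16m28.369s", "39m53.330s", "6m23.361s"),
    "ERR1190554": ("18m32.386s", "45m0.594s", "7m26.207s"),
    "ERR1190555": ("19m0.234s", "41m59.191s", "7m20.326s"),
}

def speedups(row):
    """Return (MetaPhlAn time / this time, IGC time / this time) for one sample."""
    metaphlan, igc, ours = (to_seconds(t) for t in row)
    return metaphlan / ours, igc / ours

for sample, row in table5.items():
    vs_metaphlan, vs_igc = speedups(row)
    print(f"{sample}: {vs_metaphlan:.2f}x vs MetaPhlAn, {vs_igc:.2f}x vs IGC")
```

The ratios compare wall-clock alignment times only; the table does not report hardware or thread counts, so they should be read as indicative.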
All documents mentioned in the present invention are incorporated by reference in this application, as if each document were individually incorporated by reference. It should further be understood that, after reading the above teachings of the present invention, those skilled in the art may make various changes or modifications to the present invention, and such equivalents likewise fall within the scope defined by the claims appended to this application.

Claims (10)

  1. A method for constructing a microbial gene database, characterized by comprising the following steps:
    S1, obtaining genome data of each target microorganism in a target microorganism combination, wherein the target microorganism combination comprises N target microorganisms, N≥1;
    S2, performing gene prediction on the genome data obtained in step S1 to obtain gene annotation files;
    S3, using the gene annotation files obtained in step S2 to obtain the representative genes of each target microorganism;
    S4, aligning each of the representative genes to a nucleic acid sequence database to obtain alignment results;
    S5, for the alignment results of each gene, obtaining the annotated species of the gene, and retaining the gene if the annotated species is the same as the source species;
    S6, constructing the microbial gene database from all retained genes.
  2. The method for constructing a microbial gene database according to claim 1, characterized by further comprising a step of removing redundancy from the genes before step S4 or after step S5.
  3. The method for constructing a microbial gene database according to claim 1 or 2, characterized in that, in step S3, for a target microorganism n in the target microorganism combination, where 1≤n≤N and the number of genomes of the target microorganism n is M, the representative genes of the target microorganism n are obtained according to the value of M:
    (1) if M=1, all genes of the genome of the target microorganism n are the representative genes;
    (2) if M≥2, the genes shared by all genomes are the representative genes.
  4. The method for constructing a microbial gene database according to claim 3, characterized in that, in case (2), if M≥MA, it is judged whether any genome deviates from the population; if so, the deviating genome is removed, and it is then judged whether any of the remaining genomes deviates from the population; if so, the deviating genome is again removed, until no remaining genome deviates from the population or the number of remaining genomes M<MA; the shared genes of the remaining genomes are then extracted as the corrected shared genes of all genomes and used as the representative genes of the target microorganism n, wherein MA is the preset number of genomes at or above which it must be judged whether a genome deviates from the population, and MA≥3.
  5. The method for constructing a microbial gene database according to claim 3, characterized in that, if M≥MB, the shared genes are further re-determined according to the following steps:
    S31, forming m gene combinations according to the source genomes of each gene in the M genomes of the target microorganism n, wherein m = [formula shown as an image in the original: Figure 376545dest_path_image001];
    S32, counting the number of genes in each gene combination, sorting the counts in descending order, and obtaining the count Q at the S-th position;
    S33, judging whether the number of genes in the gene combination derived from all M genomes is less than Q:
    ① if the number of genes in the gene combination derived from all M genomes is not less than Q, directly extracting the shared genes of the M genomes; ② if the number of genes in the gene combination derived from all M genomes is less than Q, then:
    S331, selecting the source genomes of the gene combination with the most genes as a subgroup, and extracting the shared genes of the subgroup;
    S332, removing the genomes of the subgroup of S331; if the number of remaining genomes <MB, extracting the shared genes of the remaining genomes; if the number of remaining genomes ≥MB, repeating steps S31-S33 to extract shared genes again;
    S34, merging all shared genes obtained in step S33 as the corrected shared genes of all genomes, which are further used as the representative genes of the target microorganism n,
    wherein MB is the preset number of genomes at or above which the shared genes must be re-determined, MB≥3, and 2≤S≤5.
  6. The method for constructing a microbial gene database according to any one of claims 3-5, characterized in that, in case (2), the representative genes further include the top Y genes, ranked in descending order of genome occurrence rate, among the genes remaining after the shared genes are excluded, wherein 100≤Y≤300.
  7. A system for constructing a microbial gene database, characterized by comprising the following modules:
    a genome data acquisition and storage module, configured to acquire and store genome data of each target microorganism in a target microorganism combination, wherein the target microorganism combination comprises N target microorganisms, N≥1;
    a gene prediction module, connected to the genome data acquisition and storage module and configured to perform gene prediction on the genome data acquired by the genome data acquisition and storage module, and to obtain and output gene annotation files containing sequences and species annotations;
    a representative gene acquisition module, connected to the gene prediction module and configured to receive the gene annotation files output by the gene prediction module, and to use the gene annotation files to obtain and output the representative genes of each target microorganism;
    a nucleic acid sequence database storage module, configured to receive and store a nucleic acid sequence database;
    a gene alignment module, connected to the representative gene acquisition module and to the nucleic acid sequence database storage module, and configured to receive the representative genes output by the representative gene acquisition module, align each of the representative genes to the nucleic acid sequence database, and obtain and output alignment results;
    a gene verification module, connected to the gene alignment module and configured to verify whether the annotated species of a gene is the same as its source species: for the alignment results of each gene, the annotated species of the gene is obtained, and the gene is retained if the annotated species is the same as the source species; the gene verification module is further configured to output all retained genes to construct the microbial gene database.
  8. The system for constructing a microbial gene database according to claim 7, characterized by further comprising:
    a gene de-redundancy module, connected to the gene verification module and configured to receive the retained genes output by the gene verification module and to remove redundancy from the retained genes of each target microorganism; or
    a gene de-redundancy module, connected to the representative gene acquisition module and configured to receive the representative genes output by the representative gene acquisition module and to remove redundancy from the representative genes of each target microorganism.
  9. The system for constructing a microbial gene database according to claim 7 or 8, characterized in that a gene filtering module is further included between the representative gene acquisition module and the gene alignment module; it is connected to both modules and is configured to receive and filter the representative genes output by the representative gene acquisition module by filtering out genes with a sequence length of less than 200, and then to output the filtered representative genes to the gene alignment module.
  10. The system for constructing a microbial gene database according to claim 7 or 8, characterized in that an alignment result filtering module is further included between the gene alignment module and the gene verification module; it is connected to both modules and is configured to receive and filter the alignment results output by the gene alignment module by removing alignment results that fall below a preset coverage threshold and/or below a preset identity threshold.
PCT/CN2022/113690 2021-11-30 2022-08-19 Construction method and system for microbial gene database WO2023098152A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280004306.7A CN116802740A (en) 2021-11-30 2022-08-19 Construction method and system of microbial gene database

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111443169.0A CN114121167B (en) 2021-11-30 2021-11-30 Construction method and system of microbial gene database
CN202111443169.0 2021-11-30

Publications (1)

Publication Number Publication Date
WO2023098152A1 true WO2023098152A1 (en) 2023-06-08

Family

ID=80368491

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113690 WO2023098152A1 (en) 2021-11-30 2022-08-19 Construction method and system for microbial gene database

Country Status (2)

Country Link
CN (2) CN114121167B (en)
WO (1) WO2023098152A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114121167B (en) * 2021-11-30 2022-07-01 深圳零一生命科技有限责任公司 Construction method and system of microbial gene database
CN115732036B (en) * 2022-12-06 2023-11-28 云舟生物科技(广州)股份有限公司 Method for adjusting transcript base stock, computer storage medium and electronic device
CN117059179A (en) * 2023-08-30 2023-11-14 北京星云医学检验实验室有限公司 Biological information database annotation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804875A (en) * 2018-06-21 2018-11-13 中国科学院北京基因组研究所 A method of analyzing micropopulation body function using macro genomic data
CN110473594A (en) * 2019-08-22 2019-11-19 广州微远基因科技有限公司 Pathogenic microorganism genome database and its method for building up
CN112599198A (en) * 2020-12-29 2021-04-02 上海派森诺生物科技股份有限公司 Microorganism species and functional composition analysis method for metagenome sequencing data
CN112992277A (en) * 2021-03-18 2021-06-18 南京先声医学检验有限公司 Construction method and application of microbial genome database
US20210202040A1 (en) * 2018-09-05 2021-07-01 Chunlab, Inc. Method for identifying and classifying sample microorganisms
CN114121167A (en) * 2021-11-30 2022-03-01 深圳零一生命科技有限责任公司 Construction method and system of microbial gene database

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506616B (en) * 2017-09-07 2021-04-13 海南省农业科学院植物保护研究所 Elephant's ear bean root transcriptome database, fusion protein, soaking system and silencing system
CN111161794B (en) * 2018-12-30 2024-03-22 深圳碳云智能数字生命健康管理有限公司 Intestinal microorganism sequencing data processing method, device, storage medium and processor
CN110277139B (en) * 2019-06-18 2023-03-21 江苏省产品质量监督检验研究院 Microorganism limit checking system and method based on Internet
CN111261231A (en) * 2019-12-03 2020-06-09 康美华大基因技术有限公司 Construction method, analysis method and device of intestinal flora metagenome database
CN111462821B (en) * 2020-04-10 2022-02-22 广州微远医疗器械有限公司 Pathogenic microorganism analysis and identification system and application
CN113689912A (en) * 2020-12-14 2021-11-23 广东美格基因科技有限公司 Method and system for correcting microbial contrast result based on metagenome sequencing
CN112837745B (en) * 2021-01-15 2023-11-21 广州微远基因科技有限公司 Pathogenic microorganism virulence gene association model and establishment method and application thereof
CN112885412B (en) * 2021-02-25 2023-03-28 深圳华大基因科技服务有限公司 Genome annotation method, apparatus, visualization platform and storage medium

Also Published As

Publication number Publication date
CN114121167A (en) 2022-03-01
CN114121167B (en) 2022-07-01
CN116802740A (en) 2023-09-22

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280004306.7

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22899972

Country of ref document: EP

Kind code of ref document: A1