WO2023098152A1

WO2023098152A1 - Construction method and system for microbial gene database

Info

Publication number: WO2023098152A1
Application number: PCT/CN2022/113690
Authority: WO
Inventors: 徐晓强; 夏炎; 王晓凯; 谢海亮
Original assignee: 深圳零一生命科技有限责任公司
Priority date: 2021-11-30
Filing date: 2022-08-19
Publication date: 2023-06-08
Also published as: CN114121167A; CN114121167B; CN116802740A

Abstract

The present invention belongs to the technical field of gene database construction. Disclosed are a construction method and system for a microbial gene database. The method comprises the following steps: acquiring target microbial genome data, and performing gene prediction on the acquired genome data to obtain a gene annotation file, which includes sequences and species annotations; obtaining representative genes of each target microorganism; comparing each of the representative genes to a nucleic acid sequence database, so as to obtain comparison results; filtering the comparison results to obtain information of annotated species of the genes, and retaining the genes, the annotated species of which are the same as an origin species, so as to construct a microbial gene database. By constructing a microbial gene database using the construction method of the present invention, the database can be updated on the basis of a change in a target microorganism, such that the real-time performance is greater; and a microbial database that is constructed by using the present invention only includes gene sequences of the target microorganism, such that the time required for comparison is shorter.

Description

A method and system for constructing a microbial gene database

Technical field

The invention belongs to the technical field of gene database construction, and in particular relates to a method and system for constructing a microbial gene database.

Background art

In recent years, as human microbiome research has deepened, scientists have discovered that intestinal microbes play a significant role in promoting human health, and that some current sub-health problems are caused by a breakdown in the balance of the intestinal microecology. As a type of microorganism beneficial to the human body, probiotics can help restore the balance of the intestinal microecology and have been widely used in dietary supplements. However, because probiotics are so varied, different countries have issued policies regulating which types of probiotics are edible.

Traditional research on microorganisms is done by cultivating microorganisms and then observing biochemical phenotypes, which takes dozens of days to complete. For the identification of microbial species, the metagenomics technology developed in recent years can directly extract sample DNA for whole-genome sequencing. By analyzing and interpreting these DNA sequencing results, it has become possible to analyze the community structure, species classification, phylogenetic evolution, gene function and metabolic network of microorganisms in the environment. With the development of high-throughput sequencing technology, it is now possible to test hundreds of samples or more at a time; at the same time, since no cultivation is required, the detection and analysis time is greatly shortened.

However, microbial identification analysis based on metagenomic sequencing relies on a reference gene set: the sequencing reads are aligned to the reference gene set to determine the types and gene content of the microorganisms in the sample. Therefore, microbial reference gene sets exist for different species and regions. The analysis of target probiotics in the human intestine also requires a reference gene set. Usually, one of two is used: the Integrated Gene Catalog (IGC) or the gene library of Metagenomic Phylogenetic Analysis (MetaPhlAn).

The Integrated Gene Catalog (IGC), published in 2014, was built from 1267 gut metagenomes and contains 9879896 genes. IGC has the following problems: (1) the number of genes is large and many types of microorganisms are annotated, so the comparison time is very long and the efficiency is low; (2) the gene annotation information has not been updated for a long time, so the accuracy is low; (3) the public gene annotation information is only at the genus level, so the target probiotics cannot be analyzed.

Metagenomic Phylogenetic Analysis (MetaPhlAn) is a species annotation tool that analyzes the composition of microbial communities from next-generation sequencing data. Although MetaPhlAn has been continuously updated, it also has the following limitations: (1) relative abundance is obtained by aligning sequences to marker genes; compared with other strategies, false positives are lower, but read utilization is also low; (2) fewer species are detected, and only species in the database can be detected; (3) annotation is only at the species level, and strain-level results require the supporting StrainPhlAn tool.

Therefore, the two most widely used methods are not suitable for the analysis of target probiotics. However, the traditional approach of directly using the genomes of probiotics as a reference database introduces a large amount of repetitive information, resulting in low efficiency; in addition, since microbial genomes share many common segments, directly using whole genomes as reference genomes also affects the accuracy of the test results.

In order to solve at least one of the above-mentioned technical problems, the technical scheme adopted in the present invention is as follows:

A first aspect of the present invention provides a method for constructing a microbial gene database, comprising the following steps:

S1, obtaining the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N≥1;

S2, performing gene prediction on the genome data obtained in step S1, and obtaining a gene annotation file;

S3, using the gene annotation file obtained in step S2 to obtain the representative gene of each target microorganism;

S4, comparing each gene in the representative gene to a nucleic acid sequence database to obtain a comparison result;

S5, for the comparison result of each gene, obtain the annotated species of the gene, if the annotated species is the same as the source species, keep the gene;

S6, using all the retained genes to form the microbial gene database.

In the present invention, the target microorganism can be any microorganism, including but not limited to bacteria, fungi, and viruses, all of which are applicable to the method of the present invention. In some specific embodiments of the present invention, the target microorganism is a bacterium, and in some more specific embodiments of the present invention, the target microorganism is a food-usable bacterium.

In some embodiments of the present invention, in step S1, the genome data of each target microorganism in the target microorganism combination may be genome data stored in commercial or non-commercial databases, or genome data obtained by high-throughput sequencing. In some embodiments of the present invention, the genome data is downloaded from the NCBI database. Specifically, the species name and taxonomic number of the target microorganism in NCBI are obtained first; then, according to the species name, the genomes of the species are obtained from NCBI. In another specific embodiment of the present invention, the genome data is obtained by next-generation sequencing.

In some preferred embodiments of the present invention, it also includes filtering out genomes with assembled long-sequence fragments (Scaffolds) number ≥ 100, so that the number of long-sequence fragments in each genome of each target microorganism obtained is less than 100.
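As a non-limiting illustration, the scaffold filter described above could be sketched as follows. The function name, the dictionary layout and the genome labels are assumptions made for this example only; the threshold of 100 is the one stated above.

```python
def filter_by_scaffold_count(scaffold_counts, max_scaffolds=100):
    """Keep genomes whose number of scaffolds is strictly below max_scaffolds.

    scaffold_counts maps a (hypothetical) genome label to its scaffold count.
    """
    return {acc: n for acc, n in scaffold_counts.items() if n < max_scaffolds}


# Hypothetical scaffold counts for four genomes of one target microorganism.
counts = {"genome_a": 1, "genome_b": 99, "genome_c": 100, "genome_d": 350}
kept = filter_by_scaffold_count(counts)
print(sorted(kept))
```

Genomes with exactly 100 scaffolds are dropped, matching the "≥ 100" filter condition.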

In some embodiments of the present invention, in step S2, any software, program or algorithm capable of realizing gene prediction function can be used to complete the gene prediction. In some specific embodiments of the present invention, Prokka software is used to perform gene prediction on genome data.

In some embodiments of the present invention, in step S3, for target microorganism n in the target microorganism combination, where target microorganism n denotes the nth target microorganism in the combination, 1≤n≤N, and M denotes the number of genomes of target microorganism n, the representative genes of target microorganism n are obtained according to the value of M:

(1) If M=1, all the genes in the genome of the target microorganism n are representative genes;

(2) If M≥2, the common gene of all genomes is the representative gene.
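As a non-limiting illustration, the two cases above can be sketched with set operations. This sketch assumes that genes from different genomes have already been mapped to shared identifiers (for example, a common gene name), which the text does not specify; the function name is likewise hypothetical.

```python
def representative_genes(genomes):
    """Return the representative genes of one target microorganism.

    genomes: list of sets of gene identifiers, one set per genome.
    M = 1  -> all genes of the single genome.
    M >= 2 -> genes common to all genomes (set intersection).
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    if len(genomes) == 1:
        return set(genomes[0])
    return set.intersection(*map(set, genomes))


print(representative_genes([{"geneA", "geneB"}]))
print(representative_genes([{"geneA", "geneB", "geneC"}, {"geneB", "geneC"}, {"geneC", "geneD"}]))
```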

In some embodiments of the present invention, further, for the above-mentioned case (2), if M≥MA, it is judged whether any genome deviates from the overall population; if so, that genome is eliminated, and the remaining genomes are judged again in the same way. Deviating genomes are eliminated in this manner until none of the remaining genomes deviates from the overall population or the number of remaining genomes M<MA. The common genes of the remaining genomes are then extracted as the corrected common genes of all genomes and used as the representative genes of target microorganism n, wherein MA is a preset number of genomes at or above which deviation from the overall population needs to be judged, MA≥3, for example, MA=3, 4, 5, 6, 7, 8, 9, 10 or more.

In some embodiments of the present invention, the following standard is used to determine whether a genome deviates from the overall population: if, after a certain genome is eliminated, the number of common genes of the remaining genomes increases by more than a threshold of at least 30% (for example, 30%, 35%, 40% or 50%) compared with that before the elimination, the genome deviates from the overall population.
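As a non-limiting illustration, the iterative elimination of deviating genomes could be sketched as follows. The text does not say which genome to eliminate first when several qualify; this sketch assumes the one whose elimination yields the most common genes. All names are hypothetical, and genes are assumed to carry identifiers shared across genomes.

```python
def common_genes(genomes):
    """Genes present in every genome; genomes maps name -> set of gene ids."""
    return set.intersection(*genomes.values())


def remove_deviating_genomes(genomes, ma=3, threshold=0.30):
    """Eliminate genomes whose removal raises the common-gene count by more than threshold."""
    genomes = dict(genomes)
    while len(genomes) >= ma:
        before = len(common_genes(genomes))
        best_name, best_after = None, before
        for name in genomes:
            rest = {k: v for k, v in genomes.items() if k != name}
            after = len(common_genes(rest))
            # Deviation criterion: relative increase above the threshold.
            if after > before * (1 + threshold) and after > best_after:
                best_name, best_after = name, after
        if best_name is None:
            break  # no remaining genome deviates from the population
        del genomes[best_name]
    return genomes


example = {
    "g1": set(range(1, 11)),
    "g2": set(range(1, 11)),
    "g3": set(range(1, 12)),
    "g4": {1, 2, 20, 21},  # shares only two genes with the others
}
retained = remove_deviating_genomes(example)
print(sorted(retained), len(common_genes(retained)))
```

In this toy input, eliminating g4 raises the number of common genes from 2 to 10, so g4 is treated as deviating; no further elimination qualifies.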

In some embodiments of the present invention, when the number of genomes M, with or without deviating genomes having been eliminated, satisfies M≥MB, wherein MB is a preset number of genomes at or above which the common genes need to be re-determined, MB≥3, for example MB=3, 4, 5, 6, 7, 8, 9, 10 or more, the common genes are further re-determined according to the following steps, that is, it is determined whether the common genes need to be corrected:

S31, forming m gene combinations according to the source genomes of each gene in the M genomes of the target microorganism n, wherein $m = \sum_{k=1}^{M} \binom{M}{k} = 2^M - 1$. That is to say, a gene is either derived from only 1 genome, giving $\binom{M}{1}$ combinations; or from only 2 of the genomes, giving $\binom{M}{2}$ combinations; ...; or from only M-1 genomes, giving $\binom{M}{M-1}$ combinations; or from all M genomes, giving $\binom{M}{M}$ combination; so there are $\sum_{k=1}^{M} \binom{M}{k} = 2^M - 1$ combinations in total. In other words, for a gene combination, each genome either contains the genes in the gene combination or does not contain them, that is, each genome has two situations, giving $2^M$ combinations; removing the empty set (no genome contains genes from this combination) leaves $2^M - 1$ combinations, which is the same as the calculation result above. Therefore, under the condition that the principle remains the same, no matter how it is explained or understood, it does not affect the number of combinations.

For example, if there are 4 genomes of target microorganism n, that is, M=4, then there are $2^4 - 1 = 15$ source-genome situations for the genes in the 4 genomes of target microorganism n, as shown in the table below:

Gene combination No.	Genome 1	Genome 2	Genome 3	Genome 4
1	√	-	-	-
2	-	√	-	-
3	-	-	√	-
4	-	-	-	√
5	√	√	-	-
6	√	-	√	-
7	√	-	-	√
8	-	√	√	-
9	-	√	-	√
10	-	-	√	√
11	√	√	√	-
12	√	√	-	√
13	√	-	√	√
14	-	√	√	√
15	√	√	√	√
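As a non-limiting illustration, the gene combinations of step S31 can be enumerated programmatically. The sketch below uses Python's standard `itertools.combinations`; the function name is hypothetical. For M=4 it yields the 15 combinations in the same order as the table above.

```python
from itertools import combinations


def source_combinations(m):
    """Enumerate every non-empty subset of genomes 1..m, smallest subsets first."""
    combos = []
    for k in range(1, m + 1):
        combos.extend(combinations(range(1, m + 1), k))
    return combos


combos = source_combinations(4)
print(len(combos))  # 2**4 - 1 combinations
for number, genomes in enumerate(combos, start=1):
    print(number, genomes)
```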

S32, counting the number of genes in each gene combination, sorting the gene combinations by number of genes in descending order, and taking the number of genes of the combination ranked S-th as Q, where 2≤S≤5, for example, S=2, 3, 4 or 5.

S33, judging whether the number of genes derived from the gene combination of M genomes is less than Q:

① If the number of genes derived from the gene combination of M genomes is not less than Q, then directly extract the common genes of M genomes, that is, no correction is required;

② If the number of genes derived from the gene combination of M genomes is less than Q, the common genes need to be corrected according to the following steps:

S331, selecting the source genome of the gene combination with the largest number of genes as a subgroup, and extracting the common genes of the subgroup;

S332, eliminating the genomes contained in the subgroup; if the number of remaining genomes is less than MB, extracting the common genes of the remaining genomes (in particular, if only 1 genome remains, all the genes of that genome are extracted as common genes); if the number of remaining genomes is ≥ MB, repeating steps S31-S33 to extract representative genes again;

S34, merging all the shared genes together as the shared genes of all genome corrections, and further serving as the representative gene of the target microorganism n.
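As a non-limiting illustration, steps S31 to S34 could be sketched as a recursive function. This is one possible reading of the steps, not a definitive implementation: the function and parameter names are hypothetical, genes are assumed to carry identifiers shared across genomes, every genome is assumed to contain at least one gene, and ties between equally large combinations are broken arbitrarily.

```python
from collections import Counter


def corrected_common_genes(genomes, s=2, mb=3):
    """genomes: dict mapping genome name -> set of gene ids."""
    names = set(genomes)
    if len(names) == 1:
        return set(next(iter(genomes.values())))  # single genome: all genes
    common = set.intersection(*genomes.values())
    if len(names) < mb:
        return common
    # S31: group genes by the exact set of genomes they occur in.
    sources = {}
    for name, genes in genomes.items():
        for gene in genes:
            sources.setdefault(gene, set()).add(name)
    counts = Counter(frozenset(src) for src in sources.values())
    # S32: Q is the gene count of the combination ranked S-th.
    ranked = counts.most_common()
    q = ranked[min(s, len(ranked)) - 1][1]
    # S33: no correction if the all-genome combination has at least Q genes.
    if counts[frozenset(names)] >= q:
        return common
    # S331: the source genomes of the largest combination form a subgroup.
    subgroup = ranked[0][0]
    result = set.intersection(*(genomes[n] for n in subgroup))
    # S332: repeat on the remaining genomes; S34: merge the results.
    rest = {n: g for n, g in genomes.items() if n not in subgroup}
    return result | corrected_common_genes(rest, s, mb)


a = {f"a{i}" for i in range(10)}
b = {f"b{i}" for i in range(8)}
c = {"c0", "c1"}
toy = {"g1": a | c, "g3": a | c, "g6": a | c, "g2": b | c, "g4": b | c, "g5": b | c}
corrected = corrected_common_genes(toy)
print(len(set.intersection(*toy.values())), len(corrected))
```

In this toy input the six genomes split into two lineages, so the plain intersection contains only 2 genes while the corrected common genes contain 20.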

In other embodiments of the present invention, the common genes are re-determined according to the following steps, that is, whether it is necessary to correct the common genes is determined:

Eliminate any one genome to obtain M subgroups, each containing M-1 genomes. If the number of common genes of any subgroup is greater than the number of common genes of the M genomes, take the subgroup with the largest number of common genes and again eliminate any one genome from it to obtain M-1 sub-subgroups. If the number of common genes of any sub-subgroup is greater than that of the subgroup, treat the sub-subgroup in the same way, and so on, until a genome combination is obtained from which no genome can be eliminated to yield a new genome combination with more common genes than before the elimination. The common genes of this genome combination are used as the corrected common genes. It is worth noting that the common genes re-determined by these steps are the same as the previous result. Therefore, as long as the concept of the present invention can be realized, no matter what steps are used, it should fall within the protection scope of the present invention.

In some embodiments of the present invention, the representative genes further include the top Y genes in descending order of occurrence in the genome among the remaining genes except the common genes. Among them, the genome occurrence rate refers to the percentage of the gene appearing in all genomes, 100≤Y≤300, for example, Y=100, 120, 150, 180, 200, 250, 300. In some preferred embodiments of the present invention, only when the number of representative genes is less than X, the remaining genes need to be included by genomic frequency, where 50≤X≤100. In addition to the common genes of all genomes in the narrow sense, the common genes here can also be the above-mentioned revised common genes in a broad sense, so that the representative genes can more truly represent the target microorganisms.
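As a non-limiting illustration, supplementing the common genes with the top Y remaining genes by genome occurrence rate could be sketched as follows. The function name is hypothetical, genes are assumed to be string identifiers shared across genomes, and ties in occurrence rate are broken alphabetically, which the text does not specify.

```python
def extend_representatives(genomes, common, x=100, y=100):
    """Add the top-y non-common genes by occurrence rate when fewer than x common genes exist.

    genomes: list of sets of gene ids; common: set of common gene ids.
    """
    if len(common) >= x:
        return set(common)
    total = len(genomes)
    occurrence = {}
    for genes in genomes:
        for gene in genes - common:
            occurrence[gene] = occurrence.get(gene, 0) + 1
    # Occurrence rate = fraction of genomes containing the gene; highest first.
    ranked = sorted(occurrence, key=lambda g: (-occurrence[g] / total, g))
    return set(common) | set(ranked[:y])


toy_genomes = [{"c", "p", "q"}, {"c", "p"}, {"c", "q", "r"}]
extended = extend_representatives(toy_genomes, {"c"}, x=2, y=2)
print(sorted(extended))
```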

In some embodiments of the present invention, before step S4, the method further includes a step of filtering the representative genes obtained in step S3: filtering out genes whose sequence length is less than 200. In some specific embodiments of the present invention, the genes are compared to the nucleic acid sequence database using BLAST+ (v2.11.0), a search tool based on a local alignment algorithm, with an e-value threshold of 1e-5.

In some embodiments of the present invention, after step S4 and before step S5, the method further includes a step of filtering the comparison results: removing the comparison results that are below the preset coverage threshold and/or below the preset identity threshold. In some specific embodiments of the present invention, the preset coverage threshold is 80%; the preset identity threshold is 65%.
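As a non-limiting illustration, the filtering of comparison results could be sketched as follows. The `Hit` record and its field names are assumptions made for this example (the comparison results are assumed to have been parsed already), and this sketch removes a result if it fails either threshold.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str        # query gene identifier
    species: str     # species annotation of the database sequence
    identity: float  # percent identity
    coverage: float  # percent coverage


def filter_hits(hits, min_coverage=80.0, min_identity=65.0):
    """Remove comparison results below the coverage or identity threshold."""
    return [h for h in hits if h.coverage >= min_coverage and h.identity >= min_identity]


hits = [
    Hit("gene1", "Lactobacillus casei", 99.0, 100.0),
    Hit("gene1", "Lactobacillus casei", 60.0, 95.0),   # identity too low
    Hit("gene1", "Lactobacillus casei", 97.0, 50.0),   # coverage too low
]
kept_hits = filter_hits(hits)
print(len(kept_hits))
```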

In some embodiments of the present invention, in step S5, for each gene, the step of obtaining its annotated species is: sorting by identity and selecting the first a% of the comparison results, if more than b% of the selected comparison results are annotated to the same species and the identity is not less than c%, then the species is the annotated species of the gene, where a=5~20, b=40~60, c=90~98. In some embodiments of the invention, a=10, b=50, c=95.
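As a non-limiting illustration, the species annotation rule with a=10, b=50 and c=95 could be sketched as follows. The sketch makes two assumptions not stated in the text: the number of selected results is rounded up to at least one, and only results with identity not less than c% count towards the share of a species. All names are hypothetical.

```python
import math
from collections import Counter


def annotate_species(hits, a=10, b=50, c=95):
    """hits: list of (species, identity_percent) tuples for one gene."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    top = ranked[:max(1, math.ceil(len(ranked) * a / 100))]
    # Count, per species, the selected results with identity >= c.
    votes = Counter(species for species, identity in top if identity >= c)
    if not votes:
        return None
    species, n = votes.most_common(1)[0]
    return species if n / len(top) > b / 100 else None


def keep_gene(hits, source_species, **thresholds):
    """Step S5: keep the gene only if its annotated species equals its source species."""
    annotated = annotate_species(hits, **thresholds)
    return annotated is not None and annotated == source_species


gene_hits = [("Lactobacillus casei", 99.5), ("Lactobacillus casei", 98.0)]
gene_hits += [("Lactobacillus paracasei", 90.0)] * 18
print(annotate_species(gene_hits), keep_gene(gene_hits, "Lactobacillus casei"))
```

With 20 results, the top 10% are the two best results; both are annotated to the same species with identity above 95%, so the gene is kept.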

In some embodiments of the present invention, before step S4 or after step S5, a step of de-redundancy of genes is further included. Optionally, if gene de-redundancy is performed before step S4, de-redundancy is performed on representative genes of each target microorganism. Optionally, if gene de-redundancy is performed after step S5, de-redundancy is performed on all retained genes.

In some embodiments of the present invention, any software, program or algorithm capable of realizing the de-redundancy function can be used, for example, any software, program or algorithm that can realize de-redundancy based on the principle of sequence similarity. In some embodiments of the invention, CD-HIT (v4.8.1) software is used for de-redundancy. In some embodiments of the present invention, the following steps are used for de-redundancy:

For each species, de-redundancy is performed separately: all genes belonging to sequence clusters that contain more than one gene are filtered out, and all remaining genes are the single-copy comparison genes unique to this species;

Merge the de-redundant genes of all species, and again filter out all genes belonging to sequence clusters that contain more than one gene.

In some embodiments of the invention, if the database is updated, the above de-redundancy steps are repeated for each newly added species.
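As a non-limiting illustration, the two-stage de-redundancy could be sketched as follows. Grouping genes with identical sequences is used here only as a simplified stand-in for similarity-based clustering such as that performed by CD-HIT; the function names are hypothetical, and gene identifiers are assumed to be unique across species.

```python
from collections import defaultdict


def cluster_by_sequence(genes):
    """Stand-in clustering: group gene ids that have identical sequences."""
    clusters = defaultdict(list)
    for gene_id, sequence in genes.items():
        clusters[sequence].append(gene_id)
    return list(clusters.values())


def drop_multi_member_clusters(genes, cluster=cluster_by_sequence):
    """Keep only genes that are alone in their cluster."""
    keep = {members[0] for members in cluster(genes) if len(members) == 1}
    return {g: s for g, s in genes.items() if g in keep}


def build_nonredundant(genes_by_species):
    """Stage 1: de-redundancy per species. Stage 2: de-redundancy on the merged set."""
    merged = {}
    for genes in genes_by_species.values():
        merged.update(drop_multi_member_clusters(genes))
    return drop_multi_member_clusters(merged)


toy_db = {
    "species_A": {"a1": "AAA", "a2": "AAA", "a3": "CCC"},
    "species_B": {"b1": "CCC", "b2": "GGG"},
}
nonredundant = build_nonredundant(toy_db)
print(sorted(nonredundant))
```

In this toy input, a1 and a2 are removed within species_A, and a3 and b1 are removed after merging because they cluster together across species.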

A second aspect of the present invention provides a system for constructing a microbial gene database, comprising the following modules:

The genome data acquisition storage module is used to acquire and store the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N≥1;

The gene prediction module is connected with the genome data acquisition storage module, and is used to perform gene prediction on the genome data acquired in the genome data acquisition module, and obtain and output gene annotation files containing sequences and species annotations;

A representative gene acquisition module, connected to the gene prediction module, is used to receive the gene annotation file output by the gene prediction module, and use the gene annotation file to obtain the representative gene of each target microorganism and output it;

A nucleic acid sequence database storage module, configured to receive and store a nucleic acid sequence database;

A gene comparison module, connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, for receiving the representative genes output by the representative gene acquisition module, comparing each gene in the representative genes to the nucleic acid sequence database, and obtaining and outputting the comparison results;

A gene verification module, connected to the gene comparison module, used to verify whether the annotated species of a gene is the same as its source species: for the comparison result of each gene, the annotated species of the gene is obtained, and if the annotated species is the same as the source species, the gene is kept; the gene verification module is also used to output all the kept genes to construct the microbial gene database.

Further, the construction system also includes: a gene de-redundancy module;

Optionally, the gene de-redundancy module is connected to the gene verification module for receiving the retained genes output by the gene verification module, and performing de-redundancy to the retained genes in each target microorganism;

Optionally, the gene de-redundancy module is connected to the representative gene acquisition module for receiving the representative genes output by the representative gene acquisition module, and performing de-redundancy on the representative genes of each target microorganism.

In some embodiments of the present invention, a gene filter module is further included between the representative gene acquisition module and the gene comparison module and is connected to both. It is used to receive and filter the representative genes output by the representative gene acquisition module: genes whose sequence length is less than 200 are filtered out, and the filtered representative genes are then output to the gene comparison module.

In some embodiments of the present invention, a comparison result filtering module is further included between the gene comparison module and the gene verification module and is connected to both. It is used to receive and filter the comparison results output by the gene comparison module: comparison results lower than the preset coverage threshold and/or lower than the preset identity threshold are removed.

In the present invention, all the modules in the construction system described in the second aspect of the present invention can realize the same or corresponding functions of the corresponding steps in the method described in the first aspect of the present invention, which will not be repeated here.

Beneficial effects of the present invention

Compared with the prior art, the present invention has the following beneficial effects:

The microbial gene database construction method of the present invention integrates multiple kinds of information on microbial genomes with cross-validation, and can thereby establish a modular gene database that covers microorganisms at the species level, supports fast retrieval and accurate identification, and is non-redundant.

In the method for constructing the microbial gene database of the present invention, the representative genes of the target microorganisms are first obtained, and then the source-annotation is verified by the NT library, so that the comparison results are more reliable and the classification information is more accurate.

The microbial gene database construction system of the present invention is composed of different modules that are independent of yet related to each other, so that modules can be conveniently added or deleted, and database construction is completed through cooperation between the modules.

Utilizing the microbial database constructed by the present invention, by establishing a simple search index, the target probiotics can be quickly located through gene sequencing data, and the comparison time is shorter. At the same time, taking into account the need for convenient update and iteration of the database, the data of the existing microbial genome database can be quickly updated, and new microbial genome information can also be quickly added to the database.

The microbial database constructed by the present invention can be used to assist high-throughput sequencing technology to more accurately detect the types and contents of probiotics.

Description of drawings

FIG. 1 shows a schematic diagram of a construction system #1 of Embodiment 1 of the present invention.

Fig. 2 shows a schematic diagram of construction system #8 of Embodiment 4 of the present invention.

Fig. 3 shows the combinations of genes of a probiotic in Example 6 of the present invention according to genome sources.

Detailed description of the embodiments

Unless otherwise stated, implied from the context, or customary in the art, all parts and percentages in this application are by weight and the testing and characterization methods used are current as of the filing date of this application. Where applicable, the contents of any patent, patent application, or publication referred to in this application are hereby incorporated by reference, and equivalent patent families are also incorporated by reference, especially the definitions of related terms in the art disclosed by these documents. If the definition of a specific term disclosed in the prior art is inconsistent with any definition provided in the present application, the definition of the term provided in the present application shall prevail.

Numerical ranges in this application are approximations and therefore may include values outside the range unless otherwise indicated. Numerical ranges include all values from the lower value to the upper value in increments of 1 unit provided that there is a separation of at least 2 units between any lower value and any higher value.

In order to make the technical problems, technical solutions and beneficial effects solved by the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments.

Example

The following examples are used herein to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventors to be employed in the practice of the invention, and thus can be considered preferred modes for its practice. However, those skilled in the art should understand from this specification that many modifications can be made to the specific embodiments disclosed herein, and the same or similar results can still be obtained without departing from the spirit or scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this invention belongs, and the disclosures cited herein and their cited materials are all incorporated by reference.

Those skilled in the art will recognize, or be able to ascertain through routine experimentation, many equivalents to the specific embodiments of the invention described herein. These equivalents are to be covered by the claims.

The experimental methods not specifically described in the following examples are conventional methods. The instruments and equipment used in the following examples, unless otherwise specified, are routine laboratory instruments and equipment; the test materials used in the following examples, unless otherwise specified, were purchased from conventional biochemical reagent stores.

Example 1 Microbial gene database construction system

As shown in Figure 1, this embodiment provides a construction system for a microbial gene database, namely construction system #1, which comprises the following modules:

The gene prediction module is connected with the genome data acquisition storage module, and is used to perform gene prediction on the genome data obtained in the genome data acquisition module, and obtain and output the gene annotation file containing the sequence and annotation;

The representative gene acquisition module is connected with the gene prediction module, and is used to receive the gene annotation file output by the gene prediction module, and to use the gene annotation file to obtain and output the representative genes of each target microorganism;

The gene comparison module is connected with the representative gene acquisition module and the nucleic acid sequence database module respectively, and is used to receive the representative genes output by the representative gene acquisition module, to compare each gene in the representative genes to the nucleic acid sequence database using gene comparison software, and to obtain and output the comparison results;

Example 2 Upgraded microbial gene database construction system

This embodiment upgrades construction system #1 of Example 1 to obtain construction system #2. The improvement is that a gene de-redundancy module is further included, which is connected to the gene verification module and is used to receive the retained genes output by the gene verification module, to de-redundant the retained genes using gene de-redundancy software, and to extract single-copy comparison genes, so as to obtain a non-redundant microbial gene database.

The steps of extracting single-copy comparison genes are as follows:

Example 3 Upgraded microbial gene database construction system

This example upgrades construction system #1 of Example 1 or construction system #2 of Example 2 to obtain construction system #3 and construction system #4, respectively. The improvement is that, between the representative gene acquisition module and the gene comparison module, a gene filter module is further included, which is connected with the representative gene acquisition module and the gene comparison module respectively, and is used to receive and filter the representative genes output by the representative gene acquisition module: genes whose sequence length is less than 200 are filtered out, and the filtered representative genes are then output to the gene comparison module.

Example 4 Upgraded microbial gene database construction system

This embodiment upgrades construction system #1 of Example 1, construction system #2 of Example 2, and construction system #3 and construction system #4 of Example 3 to obtain construction system #5, construction system #6, construction system #7 and construction system #8, respectively. The improvement is that, between the gene comparison module and the gene verification module, a comparison result filtering module is further included, which is connected with the gene comparison module and the gene verification module respectively, and is used to receive and filter the comparison results output by the gene comparison module: comparison results lower than the preset coverage threshold and/or lower than the preset identity threshold are removed.

The upgraded construction system #8 is shown in Figure 2.

Example 5 Method for constructing the representative genes of the probiotic Lactobacillus casei

1. Target probiotics and genome sequences

In this example, Lactobacillus casei was selected as the target probiotic, and the species name (Organism Name) and taxonomic number (Taxid) of the target probiotic in the US National Center for Biotechnology Information (NCBI) were obtained, which were Lactobacillus casei and 1582, respectively.

According to the species name, a total of 27 genomes at the Complete or Scaffold level were obtained from NCBI. Genomes with too many (≥200) assembled long-sequence fragments (Scaffolds) were filtered out (21 genomes in total), leaving 6 genomes of the species. The accession numbers of the genomes are: GCA_000309565 (genome 1), GCA_000829055 (genome 2), GCA_002091975 (genome 3), GCA_002192215 (genome 4), GCA_011754305 (genome 5) and GCA_012932835 (genome 6). The genome download paths were then obtained and the genomic data were downloaded.

2. Gene prediction

Gene predictions were performed for each genome using Prokka (v1.14.6) software, and gene annotation files containing sequences and annotations were obtained.

3. Obtaining representative genes

First, select MA=3, and judge whether a certain genome deviates from the whole according to the following criteria: After deleting the genome, the number of common genes in the remaining genomes increases by more than 50% compared with that before the deletion. It was found that no genome deviated from the population, and all 6 genomes were retained.

Next, MB=3 was selected. The 6 genomes contain a total of 7436 genes, and there are 63 gene combinations according to the source genomes of the genes. The number of genes in each gene combination is shown in Table 1 and Figure 3 (only gene combinations whose gene count exceeds 1% of the total are listed):

Table 1 Gene combinations and gene numbers of the probiotic Lactobacillus casei

Gene combination number	Genome combination	Number of genes
1	000001	302
2	000010	258
3	000100	845
4	001000	571
5	001001	107
6	010000	306
7	010010	510
8	010100	112
9	010110	1577
10	100000	296
11	100001	90
12	101001	1907
13	111111	289

In the second column, the positions of the digit 1 indicate the genomes from which the genes are derived. For example, the genes in gene combination 1 are found only in genome 6, the genes in gene combination 3 are found only in genome 4, the genes in gene combination 12 are found only in genome 1, genome 3 and genome 6, and the genes in gene combination 13 are found in all genomes.
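The encoding of Table 1 can be illustrated with a short sketch. It assumes a hypothetical mapping `gene_sources` from each gene to the set of genome numbers (1-based) in which it occurs; the function names and the toy genes are illustrative and do not come from the original text.

```python
from collections import Counter


def combination_key(source_genomes, n_genomes):
    """Encode the source genomes of a gene as a string of 0/1 digits, genome 1 first."""
    return "".join("1" if g in source_genomes else "0" for g in range(1, n_genomes + 1))


def count_combinations(gene_sources, n_genomes):
    """Count how many genes fall into each genome combination."""
    return Counter(combination_key(src, n_genomes) for src in gene_sources.values())


# Hypothetical toy genes, for illustration only.
gene_sources = {
    "geneA": {6},
    "geneB": {1, 3, 6},
    "geneC": {1, 3, 6},
    "geneD": {1, 2, 3, 4, 5, 6},
}
counts = count_combinations(gene_sources, 6)
print(counts)

# With 6 genomes there are 2**6 - 1 = 63 possible non-empty combinations,
# which matches the 63 gene combinations mentioned in the text.
print(2 ** 6 - 1)
```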

The number of genes in each gene combination was counted, and the counts were sorted in descending order to obtain the gene count Q at the second position: Q=1577, which is the number of genes in gene combination 9.

The number of genes in the gene combination derived from all 6 genomes (gene combination 13) is 289, which is less than Q. Therefore:

The gene combination with the largest number of genes (combination 12) was selected. Its source genomes are genome 1, genome 3 and genome 6. These genomes were taken as a new subgroup and their common genes were extracted, giving 2253 common genes.

The genomes contained in the subgroup were then excluded, leaving 3 genomes (genomes 2, 4 and 5, corresponding to gene combination 9). The common genes of these remaining genomes were extracted, giving 1880 common genes.

Merging the common genes obtained in the two extractions gives a total of 3844 genes as the corrected common genes of Lactobacillus casei, far more than the number of common genes extracted directly from all genomes (289).
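The subgroup procedure used in this step can be sketched as follows. This is one illustrative reading, not the inventors' code: the `gene_sources` mapping (gene to the set of genomes containing it), the function names, the toy data and the handling of the case with fewer than S combinations are all assumptions. The reported totals are consistent with a set union, since 2253 + 1880 - 289 = 3844, which suggests that genes present in all six genomes are counted only once; the sketch merges sets accordingly.

```python
from collections import Counter


def common_genes(gene_sources, genomes):
    """Genes that occur in every genome of `genomes`."""
    genomes = frozenset(genomes)
    return {gene for gene, src in gene_sources.items() if genomes <= src}


def corrected_common_genes(gene_sources, genomes, mb=3, s=2):
    """Re-determine common genes by repeatedly splitting off the dominant subgroup."""
    genomes = frozenset(genomes)
    merged = set()
    while genomes:
        if len(genomes) < mb:
            merged |= common_genes(gene_sources, genomes)
            break
        # Group genes by their source genomes, restricted to the genomes still in play.
        combos = Counter()
        for src in gene_sources.values():
            key = src & genomes
            if key:
                combos[key] += 1
        ranked = combos.most_common()
        # Q is the gene count at position S; assumed 0 if fewer than S combinations exist.
        q = ranked[s - 1][1] if len(ranked) >= s else 0
        if combos[genomes] >= q:
            merged |= common_genes(gene_sources, genomes)
            break
        # Otherwise the genomes of the largest combination form a subgroup.
        subgroup = ranked[0][0]
        merged |= common_genes(gene_sources, subgroup)
        genomes -= subgroup
    return merged


# Hypothetical toy data: 5 genes in genomes {1,3,6}, 4 in {2,4,5}, 1 in all, 2 only in genome 4.
gene_sources = {}
for prefix, n, src in [("a", 5, {1, 3, 6}), ("b", 4, {2, 4, 5}),
                       ("c", 1, {1, 2, 3, 4, 5, 6}), ("d", 2, {4})]:
    for i in range(n):
        gene_sources[f"{prefix}{i}"] = frozenset(src)

result = corrected_common_genes(gene_sources, range(1, 7))
print(len(result))  # 6 genes common to {1,3,6} plus 5 common to {2,4,5}, minus 1 counted twice
```

In the toy data only one gene is common to all six genomes, whereas the corrected set contains ten, which mirrors the increase from 289 to 3844 reported above.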

4. Gene filtering

First, the corrected common genes were filtered: genes with a length of less than 200 were removed, and 3727 genes remained.

5. Gene verification

Using BLAST+ (v2.11.0), a search tool based on a local alignment algorithm, the genes were compared to the nucleic acid sequence database (NT library) with an e-value threshold of 1e-5 to obtain the comparison results. From the comparison results, the annotated species of each gene was determined as follows: first, the comparison results were filtered with a coverage threshold of 80% and an identity threshold of 65%; then, for each gene, the top 10% of its comparison results sorted by identity were selected, and if more than 50% of these results had an identity greater than or equal to 95% and were annotated as the same species S, the gene was considered to be annotated as species S. Genes whose annotated species differed from the source species were then filtered out, and genes whose annotated species was the same as the source species were retained.

After this step, 1184 genes remained.
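The species annotation rule of this step can be sketched as below. The thresholds (80% coverage, 65% identity, top 10%, more than 50%, identity ≥ 95%) come from the text; the `Hit` record, the function name and the rounding of "top 10%" (rounded up, at least one result) are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    species: str     # species annotation of the database sequence
    identity: float  # percent identity of the alignment
    coverage: float  # percent coverage of the alignment


def annotate_species(hits, min_coverage=80.0, min_identity=65.0,
                     top_percent=10, vote_percent=50, high_identity=95.0):
    """Return the annotated species of a gene, or None if the rule is not met."""
    kept = [h for h in hits if h.coverage >= min_coverage and h.identity >= min_identity]
    if not kept:
        return None
    kept.sort(key=lambda h: h.identity, reverse=True)
    # Assumption: the top 10% is rounded up and contains at least one result.
    n_top = max(1, (len(kept) * top_percent + 99) // 100)
    top = kept[:n_top]
    votes = Counter(h.species for h in top if h.identity >= high_identity)
    if not votes:
        return None
    species, n = votes.most_common(1)[0]
    # More than 50% of the top results must support the same species.
    return species if n * 100 > len(top) * vote_percent else None


def keep_gene(hits, source_species):
    """A gene is retained only if its annotated species equals its source species."""
    return annotate_species(hits) == source_species


# Hypothetical alignment results for one gene.
hits = [Hit("Lactobacillus casei", 99.0, 95.0)] * 2 + [Hit("Lactobacillus paracasei", 90.0, 90.0)] * 18
print(annotate_species(hits))
print(keep_gene(hits, "Lactobacillus casei"))
```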

6. Gene de-redundancy

The filtered genes were subjected to de-redundancy analysis using CD-HIT (v4.8.1).

In this step, all genes belonging to sequence clusters that contain more than one gene were filtered out. The remaining genes are the uniquely aligned single-copy representative genes of Lactobacillus casei, 1166 genes in total.
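The single-copy selection can be sketched by reading a CD-HIT cluster report and keeping only the clusters with a single member. The sketch assumes the usual `.clstr` text layout (a `>Cluster N` header followed by one line per member, with the sequence name between `>` and `...`); the function name and the example report are hypothetical.

```python
import re


def single_copy_genes(clstr_text):
    """Return the names of genes that are the only member of their cluster."""
    clusters = []
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])  # start a new cluster
        else:
            match = re.search(r">(.+?)\.\.\.", line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return {members[0] for members in clusters if len(members) == 1}


# Hypothetical cluster report, for illustration only.
example = """>Cluster 0
0\t1500nt, >gene_a... *
1\t1498nt, >gene_b... at +/99.20%
>Cluster 1
0\t900nt, >gene_c... *
"""
print(single_copy_genes(example))
```

Genes `gene_a` and `gene_b` share a cluster and are both dropped, which corresponds to filtering out every gene of a cluster with more than one member rather than keeping one representative per cluster.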

The above steps yield a larger number of representative genes, which makes the comparison results more accurate.

Example 6 Another method for constructing the representative genes of the probiotic Lactobacillus casei

This example is adapted from Example 5. First, the genes obtained in Step 2 were filtered and de-redundancy was performed using the methods of Step 4 and Step 6; representative genes were then obtained and verified. This likewise yielded 1166 uniquely aligned single-copy representative genes.

Example 7 Method for constructing the representative genes of the probiotic Staphylococcus carnosus

In this example, Staphylococcus carnosus was selected as the target probiotic, and its species name (Organism Name) and taxonomy ID (Taxid) in the National Center for Biotechnology Information (NCBI) of the United States were obtained, which were Staphylococcus carnosus and 1281, respectively.

A total of 11 genomes at the Complete or Scaffold level were obtained from NCBI. Genomes with too many (≥200) assembled long sequence fragments (scaffolds) were filtered out (8 genomes in total), leaving 3 genomes for the species. Their accession numbers are: GCA_000009405 (genome 1), GCA_001701005 (genome 2) and GCA_003970565 (genome 3). The genome download paths were obtained and the genome data were downloaded. According to the source genomes of the genes, there are 7 gene combinations; the number of genes in each gene combination is shown in Table 2 (combinations not listed contain 0 genes).

Table 2 Gene combinations and gene numbers of the probiotic Staphylococcus carnosus

Gene combination number	Genome combination	Number of genes
1	001	2323
2	010	373
3	100	191
4	110	2270
5	111	30

There were 30 genes shared by the three genomes. MA=3 was selected, and whether a genome deviates from the whole was judged by the following criterion: after the genome is deleted, the number of common genes of the remaining genomes increases by more than 50% compared with before the deletion. Genome 3 was found to deviate from the whole, so genome 1 and genome 2 were retained. The genes shared by genome 1 and genome 2 are those of combination 4 and combination 5, so the corrected common genes of Staphylococcus carnosus totaled 2300.
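The deviation check can be reproduced from the counts in Table 2. The sketch below is illustrative only: the function names are hypothetical, and the choice to remove one deviating genome per round is an assumption, since the text does not say what happens when several genomes deviate at once.

```python
def common_count(counts, genomes):
    """Number of genes present in every listed genome (1-based positions in the key)."""
    return sum(n for key, n in counts.items() if all(key[g - 1] == "1" for g in genomes))


def deviating_genomes(counts, genomes, min_increase=0.5):
    """Genomes whose removal raises the common gene count by more than `min_increase`."""
    before = common_count(counts, genomes)
    flagged = []
    for g in genomes:
        rest = [x for x in genomes if x != g]
        after = common_count(counts, rest)
        if after - before > min_increase * before:
            flagged.append(g)
    return flagged


def remove_deviating(counts, genomes, ma=3, min_increase=0.5):
    """Repeat the check while at least `ma` genomes remain (one removal per round, assumed)."""
    genomes = list(genomes)
    while len(genomes) >= ma:
        flagged = deviating_genomes(counts, genomes, min_increase)
        if not flagged:
            break
        genomes.remove(flagged[0])
    return genomes


# Gene counts per genome combination, taken from Table 2.
table_2 = {"001": 2323, "010": 373, "100": 191, "110": 2270, "111": 30}

print(common_count(table_2, [1, 2, 3]))       # genes shared by all three genomes
print(deviating_genomes(table_2, [1, 2, 3]))  # genomes flagged as deviating
retained = remove_deviating(table_2, [1, 2, 3])
print(retained, common_count(table_2, retained))
```

Removing genome 3 raises the common gene count from 30 to 2270 + 30 = 2300, far more than a 50% increase, whereas removing genome 1 or genome 2 leaves it at 30.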

The filtering, verification and de-redundancy steps are the same as in Example 5 and are not repeated here. Finally, 1842 uniquely aligned single-copy representative genes were obtained.

Example 8 Gene database of multiple probiotics

The same method was used to obtain the uniquely aligned single-copy representative genes of all the probiotics in Table 3 and to construct the gene database.

Table 3 List of target probiotics

The gene information of the above-mentioned probiotics is shown in Table 4:

Table 4 Gene information of non-redundant gene database

Serial number	Species/subspecies name	Genes	Uniquely aligned genes (length ≥ 200)	Uniquely aligned single-copy genes (length ≥ 200)	Representative genes	Uniquely aligned representative genes (length ≥ 200)	Uniquely aligned single-copy representative genes (length ≥ 200)
1	Bifidobacterium adolescentis	5937	2952	2605	1169	1100	1051
2	Bifidobacterium animalis	2438	1160	326	1324	853	74
3	Bifidobacterium bifidum	5211	2359	1689	1302	1231	968
4	Bifidobacterium breve	8098	4544	3029	1300	1152	762
5	Bifidobacterium infantis	5995	1623	825	1012	346	10
6	Lactobacillus acidophilus	2459	1992	1653	1550	1471	1273
7	Lactobacillus casei	7436	2462	2186	3844	1184	1166
8	Lactobacillus plantarum	21705	9691	6859	1754	1650	1153
9	Lactobacillus reuteri	10062	4438	3561	1096	1064	965
10	Lactobacillus rhamnosus	7632	3337	2527	1907	1780	1481
11	Lactobacillus sakei	5173	3259	2444	1360	1316	1088
12	Streptococcus thermophilus	6318	3240	2356	1105	1061	893
13	Propionibacterium acidipropionici	4817	4063	3712	2504	2454	2387
14	Lactococcus lactis subsp. cremoris	6714	2640	2108	1440	1010	902
15	Pediococcus acidilactici	4742	3012	2455	1309	1272	1098
16	Pediococcus pentosaceus	5057	2636	2083	1329	1252	1030
17	Staphylococcus carnosus	5187	2256	2145	2300	1940	1842
18	Staphylococcus vitulinus	2539	2340	2331	2539	2340	2331
19	Staphylococcus xylosus	6834	3057	2392	1461	1380	1351
20	Bacillus coagulans	6862	4402	3524	1820	1750	1675

As can be seen from the above table, in the gene database established by the method of the present invention the number of uniquely aligned single-copy representative genes is ≥500 for most target microorganisms, but ≤200 for some target microorganisms (such as Bifidobacterium animalis and Bifidobacterium infantis). To make the comparison results more accurate, the inventors incorporated into the representative genes of these two target microorganisms the top 200 genes with the highest genome occurrence rate among their remaining genes, so that the number of uniquely aligned single-copy representative genes reached 274 and 210, respectively. The larger the number of representative genes, the more accurate the comparison results; the smaller the number of qualified genes, the higher the comparison efficiency.
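The top-up described above (74 + 200 = 274 and 10 + 200 = 210) can be sketched as a ranking of the remaining genes by genome occurrence rate. The function name, the `occurrence_rate` mapping and the toy values are hypothetical; ties are kept in input order, which is an assumption since the text does not say how ties are resolved.

```python
def top_up_representatives(representatives, occurrence_rate, y=200):
    """Append the `y` remaining genes with the highest genome occurrence rate."""
    chosen = list(representatives)
    already = set(chosen)
    remaining = [gene for gene in occurrence_rate if gene not in already]
    # sorted() is stable, so genes with equal rates keep their input order.
    remaining = sorted(remaining, key=lambda gene: occurrence_rate[gene], reverse=True)
    return chosen + remaining[:y]


# Hypothetical occurrence rates: fraction of the species' genomes that contain each gene.
occurrence_rate = {"g1": 1.0, "g2": 0.5, "g3": 0.9, "g4": 0.7}
extended = top_up_representatives(["g1"], occurrence_rate, y=2)
print(extended)
```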

The probiotic database constructed in this example contains only the sequences of the target probiotic species. Compared with the MetaPhlAn comparison and the IGC comparison, the time required for comparison is significantly shortened. The comparison times are shown in Table 5 below.

Table 5 Comparison time required for different databases

Sample	Number of bases	MetaPhlAn comparison time	IGC comparison time	Comparison time of this database
ERR1190551	5.43G	19m46.927s	48m39.494s	8m37.203s
ERR1190552	5.30G	19m8.401s	49m9.145s	8m6.807s
ERR1190553	4.47G	16m28.369s	39m53.330s	6m23.361s
ERR1190554	5.09G	18m32.386s	45m0.594s	7m26.207s
ERR1190555	5.07G	19m0.234s	41m59.191s	7m20.326s
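To quantify the time saving, the entries of Table 5 can be converted to seconds and compared. The helper names below are hypothetical and the sketch only restates the table as ratios; for sample ERR1190551 the ratios come out at roughly 2.3 (MetaPhlAn) and 5.6 (IGC).

```python
import re


def to_seconds(text):
    """Convert a time such as '19m46.927s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (MetaPhlAn, IGC, this database) comparison times from Table 5.
TABLE_5 = {
    "ERR1190551": ("19m46.927s", "48m39.494s", "8m37.203s"),
    "ERR1190552": ("19m8.401s", "49m9.145s", "8m6.807s"),
    "ERR1190553": ("16m28.369s", "39m53.330s", "6m23.361s"),
    "ERR1190554": ("18m32.386s", "45m0.594s", "7m26.207s"),
    "ERR1190555": ("19m0.234s", "41m59.191s", "7m20.326s"),
}


def speedups(row):
    """Return how many times faster this database is than MetaPhlAn and IGC."""
    metaphlan, igc, ours = (to_seconds(t) for t in row)
    return metaphlan / ours, igc / ours


for sample, row in TABLE_5.items():
    vs_metaphlan, vs_igc = speedups(row)
    print(f"{sample}: {vs_metaphlan:.2f}x vs MetaPhlAn, {vs_igc:.2f}x vs IGC")
```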

All documents mentioned in this application are incorporated by reference in this application as if each were individually incorporated by reference. In addition, it should be understood that after reading the above teaching content of the present invention, those skilled in the art can make various changes or modifications to the present invention, and these equivalent forms also fall within the scope defined by the appended claims of the present application.

Claims

1. A method for constructing a microbial gene database, comprising the following steps:

S1, obtaining the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N≥1;

S2, performing gene prediction on the genome data obtained in step S1, and obtaining a gene annotation file;

S3, using the gene annotation file obtained in step S2 to obtain the representative gene of each target microorganism;

S4, comparing each gene in the representative gene to a nucleic acid sequence database to obtain a comparison result;

S5, for the comparison result of each gene, obtaining the annotated species of the gene, and retaining the gene if the annotated species is the same as the source species;

S6, using all the retained genes to form the microbial gene database.
2. The method for constructing a microbial gene database according to claim 1, further comprising a step of de-redundancy of genes before step S4 or after step S5.
3. The method for constructing a microbial gene database according to claim 1 or 2, characterized in that, in step S3, for a target microorganism n in the target microorganism combination, wherein 1≤n≤N, the number of genomes of the target microorganism n is M, and the representative genes of the target microorganism n are obtained according to the value of M:

(1) If M=1, all the genes in the genome of the target microorganism n are representative genes;

(2) If M≥2, the common gene of all genomes is the representative gene.
4. The method for constructing a microbial gene database according to claim 3, wherein in case (2), if M≥MA, it is judged whether any genome deviates from the whole; if so, the genome that deviates from the whole is removed, and it is then judged whether any of the remaining genomes deviates from the whole; if so, that genome is removed as well, and so on, until no remaining genome deviates from the whole or the number of remaining genomes is less than MA; the common genes of the remaining genomes are then extracted as the corrected common genes of all genomes and used as the representative genes of the target microorganism n, wherein MA is the preset number of genomes at or above which it needs to be judged whether a genome deviates from the whole, and MA≥3.
5. The method for constructing a microbial gene database according to claim 3, characterized in that, if M≥MB, the common genes are further re-determined according to the following steps:

S31, forming m gene combinations according to the source genomes of each gene in the M genomes of the target microorganism n, wherein $m = 2^{M} - 1$;

S32, counting the number of genes in each gene combination, sorting the gene counts in descending order, and obtaining the gene count Q at the S-th position;

S33, judging whether the number of genes derived from the gene combination of M genomes is less than Q:

①If the number of genes from the gene combination of M genomes is not less than Q, directly extract the common genes of M genomes; ②If the number of genes from the gene combination of M genomes is less than Q, then:

S331, selecting the source genome of the gene combination with the largest number of genes as a subgroup, and extracting the common genes of the subgroup;

S332, removing the genomes in the subgroup in S331, if the number of remaining genomes<MB, then extract the common genes of the remaining genomes; if the number of remaining genomes≥MB, repeat steps S31-S33 to extract the common genes again;

S34, merging all the common genes obtained in step S33 together as the common genes corrected for all genomes, and further as the representative gene of the target microorganism n,

wherein MB is the preset number of genomes at or above which the common genes need to be re-determined, MB≥3, and 2≤S≤5.
6. The method for constructing a microbial gene database according to any one of claims 3-5, characterized in that, in case (2), the representative genes further include the top Y genes among the remaining genes other than the common genes, ranked by genome occurrence rate in descending order, wherein 100≤Y≤300.
7. A system for constructing a microbial gene database, characterized in that it comprises the following modules:

The genome data acquisition storage module is used to acquire and store the genome data of each target microorganism in the target microorganism combination, wherein the target microorganism combination includes N kinds of target microorganisms, N≥1;

A gene prediction module, connected to the genome data acquisition storage module, used to perform gene prediction on the genome data acquired by the genome data acquisition storage module, and to obtain and output gene annotation files containing sequences and species annotations;

A representative gene acquisition module, connected to the gene prediction module, is used to receive the gene annotation file output by the gene prediction module, and use the gene annotation file to obtain the representative gene of each target microorganism and output it;

A nucleic acid sequence database storage module, configured to receive and store a nucleic acid sequence database;

A gene comparison module, connected to the representative gene acquisition module and the nucleic acid sequence database storage module respectively, used to receive the representative genes output by the representative gene acquisition module, compare each gene in the representative genes to the nucleic acid sequence database, and obtain and output the comparison results;

A gene verification module, connected to the gene comparison module, used to verify whether the annotated species of a gene is the same as its source species: for the comparison result of each gene, the annotated species of the gene is obtained, and if the annotated species is the same as the source species, the gene is retained; the gene verification module is also used to output all the retained genes to construct the microbial gene database.
8. The system for constructing a microbial gene database according to claim 7, characterized in that it further comprises:

A gene de-redundancy module, connected to the gene verification module, for receiving the retained genes output by the gene verification module, and performing de-redundancy to the retained genes in each target microorganism; or

The gene de-redundancy module is connected to the representative gene acquisition module, and is used to receive the representative genes output by the representative gene acquisition module, and perform de-redundancy on the representative genes of each target microorganism.
9. The system for constructing a microbial gene database according to claim 7 or 8, characterized in that a gene filtering module is further included between the representative gene acquisition module and the gene comparison module; it is connected to the representative gene acquisition module and the gene comparison module respectively, and is used to receive and filter the representative genes output by the representative gene acquisition module, filtering out genes whose sequence length is less than 200, and then to output the filtered representative genes to the gene comparison module.
10. The system for constructing a microbial gene database according to claim 7 or 8, characterized in that a comparison result filtering module is further included between the gene comparison module and the gene verification module; it is connected to the gene comparison module and the gene verification module respectively, and is used to receive and filter the comparison results output by the gene comparison module, removing comparison results that are below the preset coverage threshold and/or below the preset identity threshold.