CN116153410A - Microbial genome reference database, construction method and application thereof - Google Patents

Info

Publication number
CN116153410A
Authority
CN
China
Prior art keywords
genome
quality
database
microbial
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211644956.6A
Other languages
Chinese (zh)
Other versions
CN116153410B (en)
Inventor
周袁杰
李少川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ruiyinmaituo Technology Guangzhou Co ltd
Original Assignee
Ruiyinmaituo Technology Guangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ruiyinmaituo Technology Guangzhou Co ltd filed Critical Ruiyinmaituo Technology Guangzhou Co ltd
Priority to CN202211644956.6A priority Critical patent/CN116153410B/en
Publication of CN116153410A publication Critical patent/CN116153410A/en
Application granted granted Critical
Publication of CN116153410B publication Critical patent/CN116153410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a microbial genome reference database constructed by a method comprising the following steps: (1) data acquisition; (2) data quality control; (3) constructing a representative genome collection; (4) genome screening; (5) constructing a reference set; and (6) constructing a pan-genome. The optimized microbial genome reference database combines the classification specificity of representative strains with the sensitivity of using all genomes, and ensures the accuracy of the final result while keeping computing resource requirements affordable.

Description

Microbial genome reference database, construction method and application thereof
Technical Field
The application belongs to the field of microorganism detection and bioinformatics, and particularly provides a microorganism genome reference database, a construction method and application thereof.
Background
Metagenomic next-generation sequencing (mNGS) detection of pathogenic microorganisms is a technique that detects a wide range of pathogenic microorganisms (viruses, bacteria, fungi, and parasites) without bias by shotgun sequencing the DNA or RNA of clinical, food, environmental, crop, culture, and other samples. More than 90% of the data obtained with this technique derive from the genome of the host or sample source, and microbial sequences make up only a small fraction.
In pathogen mNGS detection, aligning the sequencing results against reference genome data of pathogenic microorganisms is the core analysis step, yet no publicly available reference database that is complete in species coverage, high in quality, and built specifically for mNGS analysis has been disclosed to date. The only high-quality database is FDA-ARGOS, a regulatory-grade microbial database released by the U.S. FDA specifically for microbial detection, which contains only 487 species.
Thus, a pathogenic microorganism genome reference database for pathogen mNGS detection usually has to be constructed separately. Common construction methods include: 1) downloading part of the microbial genomes from a public reference genome database, such as NCBI RefSeq, to construct a relatively small database for mNGS sequencing data analysis; 2) downloading all microbial genomes from a public genome database, such as NCBI GenBank, to construct a relatively complete database for mNGS sequencing data analysis.
Microbial genomes differ considerably between strains: there is about a 5% base difference within the same species, i.e., about 200 kb of differing bases for a 4 Mb bacterial genome. Method 1) may therefore produce false negatives in mNGS analysis. Thanks to the rapid development of high-throughput sequencing technology, whole microbial genomes can now be obtained quickly, and as of September 26, 2022, 174,258 genomes had been recorded in the NCBI Genome database. If all of these genomes were pooled, the total would be about 890 Gb, calculated at an average of 5.11 Mb per genome. Such a collection necessarily contains a great deal of sequence redundancy, which reduces the efficiency of mNGS sequencing data analysis. Moreover, publicly released sequenced genomes include a large number of genomes with sequence contamination or erroneous species annotation, so analyzing mNGS pathogenic microorganisms directly against the public genome set increases false positives and compromises the accuracy of the final detection result.
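The two figures quoted above can be checked with a few lines of arithmetic. The Python sketch below only reproduces the numbers stated in the text (5% divergence on a 4 Mb genome; 174,258 genomes at an average of 5.11 Mb); the variable names are illustrative.

```python
# Worked arithmetic for the figures quoted in the background section.
genome_size_bp = 4_000_000          # a 4 Mb bacterial genome
within_species_divergence = 0.05    # about 5% base difference within a species
differing_bases = genome_size_bp * within_species_divergence
print(f"{differing_bases / 1e3:.0f} kb of differing bases")

genome_count = 174_258              # genomes recorded as of 2022-09-26
mean_genome_size_mb = 5.11          # average genome size in Mb
total_gb = genome_count * mean_genome_size_mb / 1000  # Mb -> Gb
print(f"{total_gb:.0f} Gb in total")
```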
In addition, most of the data in public databases are submitted by different contributors, and the quality of the submitted sequences varies. Taking NCBI as an example, RefSeq is a relatively high-quality genome reference database, yet the species annotation of many of its sequences is questionable. Typical representative strains (type strains) were originally designated to represent species, but because of the order in which genomes happened to be sequenced, the type-strain genome is not necessarily the genome designated as representative. Redefining species classification boundaries from scratch on the basis of type-strain genomes, and using high-quality genomes that fall within those boundaries as reference genomes, therefore avoids fuzzy or incorrect species classification boundaries as far as possible.
mNGS detection imposes strict turnaround-time requirements on the whole detection process, together with high sensitivity and accuracy requirements. A single representative genome does not cover the full diversity of a species well, whereas using all genomes introduces classification errors on the one hand and greatly increases computing resource consumption on the other, which hinders wide application of the technology.
Disclosure of Invention
In order to solve the above problems, the application provides a microbial genome reference database, and a construction method and application thereof.
In one aspect, the present application provides a microbial genome reference database constructed according to a method comprising the steps of:
(1) Data acquisition: acquiring genome data of a microorganism species;
(2) Data quality control: evaluating the quality of the genome data and designating high-quality genomes;
(3) Construction of a representative genome collection: constructing a typical representative genome collection using the high-quality genomes obtained in (2);
(4) Genome screening: screening the genomes of the microorganism species according to preset rules, and removing genomes with unclear classification, wrong classification, or low quality;
(5) Constructing a reference set: selecting part of the high-quality genomes to form the reference genomes;
(6) Construction of the pan-genome: comparing the remaining high-quality genomes with the reference genomes and removing redundant parts to obtain a pan-genome database.
In another aspect, the present application provides a method for constructing the above-mentioned microbial genome reference database, the method comprising the steps of:
(1) Data acquisition: acquiring genome data of a microorganism species;
(2) Data quality control: evaluating the quality of the genome data and designating high-quality genomes;
(3) Construction of a representative genome collection: constructing a typical representative genome collection using the high-quality genomes obtained in (2);
(4) Genome screening: screening the genomes of the microorganism species according to preset rules, and removing genomes with unclear classification, wrong classification, or low quality;
(5) Constructing a reference set: selecting part of the high-quality genomes to form the reference genomes;
(6) Construction of the pan-genome: comparing the remaining high-quality genomes with the reference genomes and removing redundant parts to obtain a pan-genome database.
Further, in the (1) data acquisition step, the genomic data may come from one or more of the following sources of microbial genome data: a genome database, the IMG/M database, the EMBL database, the FDA-ARGOS database, the EuPathDB database, NCBI GenBank, and NCBI RefSeq.
Further, the (2) data quality control comprises: evaluating the completeness and contamination of the genome data with a quality control tool, and designating genomes with completeness ≥ Cp% and contamination ≤ Cm% as high-quality genomes; wherein Cp is 85-100 and Cm is 0-10.
Further, the quality control tool is CheckM, Cp is 95, and Cm is 5.
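The quality threshold can be illustrated with a minimal Python sketch. It assumes the completeness and contamination percentages have already been produced by a quality control tool such as CheckM and loaded into memory; the GenomeQC record and the select_high_quality helper are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeQC:
    """Hypothetical record holding QC results for one genome."""
    accession: str        # illustrative identifier
    completeness: float   # percent, as reported by the QC tool
    contamination: float  # percent, as reported by the QC tool


def select_high_quality(genomes, cp=95.0, cm=5.0):
    """Keep genomes with completeness >= Cp% and contamination <= Cm%."""
    return [g for g in genomes
            if g.completeness >= cp and g.contamination <= cm]
```

The defaults follow the preferred values stated above (Cp = 95, Cm = 5); both can be varied within the stated ranges.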
Further, (3) constructing a representative genome collection comprises:
if the microorganism species has a plurality of high-quality genomes, selecting the genome with the highest completeness as the typical representative genome of the microorganism species, and comparing it with the genomes of other species: if no genome of another species with genome identity ≥ S1% is found, there is no classification error; if a genome of another species with genome identity ≥ S1% is found, there is a classification error, and the typical representative genome of the microorganism species is reselected; wherein S1 is 85-100.
Further, S1 is 95.
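A minimal Python sketch of this selection rule follows. The text does not state how the representative is reselected after a classification error; trying the next most complete candidate is an assumption made here. The identity callback stands in for whatever genome comparison tool is used and is hypothetical.

```python
def pick_representative(candidates, other_species_genomes, identity, s1=95.0):
    """Pick a typical representative genome for one species.

    candidates: list of (genome_id, completeness_percent) tuples.
    other_species_genomes: genome ids belonging to other species.
    identity(a, b): hypothetical callback returning percent genome identity.
    Returns the chosen genome id, or None if every candidate conflicts.
    """
    # Try candidates from most to least complete (assumed reselection order).
    for genome_id, _ in sorted(candidates, key=lambda c: c[1], reverse=True):
        # A hit >= S1% against another species signals a classification error.
        if all(identity(genome_id, other) < s1
               for other in other_species_genomes):
            return genome_id
    return None
```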
Further, (4) genome screening comprises comparing the genomes other than the typical representative genome with the corresponding typical representative genome, wherein a genome identity ≥ S2% indicates that the genome is correctly classified; wherein S2 is 85-100.
Further, S2 is 95.
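The screening rule can be sketched in Python as below. This sketch follows the stricter reading used in Example 1, where a genome is compared with every representative in the collection and passes only if its annotated species is the sole match; that reading, the function name, and the identity callback are assumptions for illustration.

```python
def is_correctly_classified(genome_id, annotated_species, representatives,
                            identity, s2=95.0):
    """Return True if the genome matches only its own species' representative.

    representatives: dict mapping species name -> representative genome id.
    identity(a, b): hypothetical callback returning percent genome identity.
    """
    matches = {species for species, rep in representatives.items()
               if identity(genome_id, rep) >= s2}
    return matches == {annotated_species}
```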
Further, (5) constructing the reference set comprises:
(5-1) designating the typical representative genome of the microorganism as a reference genome;
(5-2) performing similarity analysis between the genomes not designated as reference genomes and the typical representative genome of the microorganism, and assigning genomes with identity ≥ S3% and similarity ≥ O1% as represented genomes; wherein S3 is 85-100 and O1 is 75-100;
(5-3) from the genomes not assigned as represented genomes, selecting one strain in order of genome assembly level (complete genome, chromosome, scaffold, contig) and adding it to the reference genomes, then comparing the other genomes with the newly added reference genome and assigning those with genome identity ≥ S4% and similarity ≥ O2% as represented genomes; wherein S4 is 85-100 and O2 is 75-100;
(5-4) repeating (5-3) until every genome has been assigned as either a reference genome or a represented genome.
Further, S3 and S4 are 94, and O1 and O2 are 80.
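Steps (5-1) to (5-4) amount to a greedy clustering loop, which the following Python sketch illustrates. It assumes the preferred setting in which S3 = S4 and O1 = O2, so a single pair of thresholds is used; the compare callback, the assembly-level labels, and the function name are hypothetical stand-ins.

```python
# Lower rank = more complete assembly level, selected first in (5-3).
ASSEMBLY_RANK = {"Complete Genome": 0, "Chromosome": 1,
                 "Scaffold": 2, "Contig": 3}


def build_reference_set(representative, genomes, compare, s=94.0, o=80.0):
    """Greedy sketch of steps (5-1) to (5-4).

    genomes: dict genome_id -> assembly level (representative excluded).
    compare(a, b): hypothetical callback returning
                   (identity_percent, similarity_percent).
    Returns dict reference_id -> list of represented genome ids.
    """
    references = {representative: []}   # (5-1)
    unassigned = dict(genomes)
    current = representative
    while True:
        # (5-2)/(5-3): assign genomes close enough to the current reference.
        for gid in list(unassigned):
            ident, sim = compare(gid, current)
            if ident >= s and sim >= o:
                references[current].append(gid)
                del unassigned[gid]
        if not unassigned:               # (5-4): stop when all are assigned
            return references
        # (5-3): promote the best-assembled remaining genome to a reference.
        current = min(unassigned, key=lambda g: ASSEMBLY_RANK[unassigned[g]])
        del unassigned[current]
        references[current] = []
```

Each returned key is a reference genome and each value lists the genomes it represents, which is the input needed for the pan-genome step.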
Further, (6) constructing a pan genome comprises:
(6-1) setting the reference genome as the pan-genome;
(6-2) comparing the represented genomes with the species pan-genome in turn, ordered by average genome sequence length, and adding sequences with identity ≤ S5% and length ≥ L bp to the pan-genome; wherein S5 is 85-100 and L is 50-5000;
(6-3) repeating (6-1) and (6-2) until the corresponding pan-genome construction has been completed for all reference genomes.
Further, S5 is 95 and L is 1000.
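The pan-genome step for one reference genome can be sketched in Python as follows. The sketch assumes each genome is a non-empty list of sequences, processes represented genomes in descending order of average sequence length (the ordering direction is taken from the wording of claim 7), and relies on a hypothetical best_identity callback in place of a real aligner.

```python
def build_pan_genome(reference_seqs, represented_genomes, best_identity,
                     s5=95.0, min_len=1000):
    """Sketch of steps (6-1) and (6-2) for a single reference genome.

    reference_seqs: list of sequences (str) of the reference genome.
    represented_genomes: list of genomes, each a non-empty list of sequences.
    best_identity(seq, pan): hypothetical callback returning the best percent
        identity of seq against the sequences currently in the pan-genome.
    """
    pan = list(reference_seqs)  # (6-1): start from the reference genome

    def mean_len(genome):
        return sum(len(seq) for seq in genome) / len(genome)

    # (6-2): longest-on-average genomes first (assumed ordering direction).
    for genome in sorted(represented_genomes, key=mean_len, reverse=True):
        for seq in genome:
            # Keep only sufficiently long sequences that are not yet covered.
            if len(seq) >= min_len and best_identity(seq, pan) <= s5:
                pan.append(seq)
    return pan
```

Because the pan-genome grows as sequences are added, a sequence already contributed by an earlier genome is not added again by a later one.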
In another aspect, the application provides the use of the microbial genome reference database or the method for constructing the microbial genome reference database in microbial detection, wherein the use is non-diagnostic, and the use comprises the steps of sequencing a sample and comparing the sequencing result with the microbial genome reference database.
In another aspect, the present application provides a computing device performing the above method, the device comprising:
(1) Data acquisition module: for acquiring genome data of a microorganism species;
(2) Data quality control module: for evaluating the quality of the genome data and designating high-quality genomes;
(3) Representative genome collection construction module: for constructing a typical representative genome collection using the high-quality genomes obtained in (2);
(4) Genome screening module: for screening the genomes of the microorganism species according to preset rules, and removing genomes with unclear classification, wrong classification, or low quality;
(5) Reference set construction module: for selecting part of the high-quality genomes to form the reference genomes;
(6) Pan-genome construction module: for comparing the remaining high-quality genomes with the reference genomes and removing redundant parts to obtain a pan-genome database.
The microbial genome reference database of the present application may comprise the genome reference database of one or more microorganisms; a person skilled in the art can obtain a microbial genome reference database covering a plurality of microorganisms by performing all or some of the steps of the above method several times.
The methods and products of the present application are useful in a variety of microbial genome reference databases, preferably in pathogenic microbial genome reference databases.
The alignment in this application may use various alignment tools and algorithms known in the art, including but not limited to BLAST and FastANI; the alignment may be a global alignment or an alignment of a representative region, such as a 16S rDNA alignment.
Drawings
FIG. 1 is a flowchart of the construction of a reference database of microbial genomes of the present application.
FIG. 2 shows the file size, index file size, and genome base length of the three genome databases.
FIG. 3 is a bar graph of the alignment percentages of the simulated data of 8 strain genomes against the three genome databases.
FIG. 4 is a box plot of the correct and incorrect alignment rates of the simulated data of 8 strain genomes against the three genome databases.
Detailed Description
EXAMPLE 1 construction of genome reference database of pathogenic microorganisms
(1) Data acquisition:
Genomic data were downloaded from NCBI RefSeq, specifically as follows:
NCBI RefSeq is the reference sequence database of the National Center for Biotechnology Information, which collects complete, non-redundant, and well-annotated sequences, including genomic DNA, transcript, and protein sequences. The database sequences can be downloaded via its FTP server.
The NCBI RefSeq FTP site (ftp.ncbi.nlm.nih.gov/genomes/refseq/) records 265,430 genomes in total, and the related genome information is compiled in the file assembly_summary_refseq.txt.
A total of 43 genomes of Acinetobacter johnsonii (taxonomy ID 40214) were selected according to the species taxonomy ID in the species_taxid column of the assembly_summary_refseq.txt file.
The 43 genome sequences were downloaded from the data paths in the ftp_path column of the assembly_summary_refseq.txt file.
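The selection by species_taxid and the lookup of ftp_path can be sketched in Python as below. The sketch assumes the summary file is tab-separated with a header line that begins with "#" and contains the column names; the function name is hypothetical, and downloading the files themselves is left out.

```python
def select_ftp_paths(summary_text, species_taxid):
    """Return the ftp_path values of all rows matching a species taxonomy ID.

    summary_text: full text of an assembly summary file (tab-separated,
                  header line prefixed with '#').
    """
    header = None
    paths = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The header is the comment line that names the columns.
            fields = line.lstrip("# ").split("\t")
            if "species_taxid" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("species_taxid") == str(species_taxid):
            paths.append(row["ftp_path"])
    return paths
```

For the example in the text, calling select_ftp_paths with the contents of assembly_summary_refseq.txt and the taxonomy ID 40214 would list the download paths of the Acinetobacter johnsonii genomes.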
(2) Data quality control:
The quality of the 43 Acinetobacter johnsonii genome sequences was evaluated with CheckM; all 43 genomes had completeness ≥ Cp and contamination ≤ Cm.
(3) Construction of a representative genome collection:
Four of the Acinetobacter johnsonii genomes came from typical representative strains; one of the four was a complete genome, and it was taken as the typical representative genome;
the typical representative genome of Acinetobacter johnsonii was compared with the representative sequences (reference genomes or representative genomes) of other species, and no species with genome identity ≥ 95% was found;
the typical representative genome of Acinetobacter johnsonii and the representative sequences of the other species together constitute the typical representative genome collection of the species.
(4) Genome screening:
The remaining 42 Acinetobacter johnsonii genomes were each compared with the strains of the typical representative genome collection; the only strain with genome identity ≥ 95% was Acinetobacter johnsonii, i.e., all 43 Acinetobacter johnsonii genomes were correctly classified and passed the screening conditions;
(5) Constructing a reference set:
the typical representative genome of Acinetobacter johnsonii was set as a reference genome;
the remaining 42 Acinetobacter johnsonii genomes were each compared with the reference genome, and those with identity ≥ 94% and similarity ≥ 80% were assigned as represented genomes;
from the remaining genomes, one strain was selected as a candidate reference genome in order of genome assembly level (complete genome, chromosome, scaffold, contig), and the rest of the genomes were each compared with the candidate reference genome, with those having genome identity ≥ 94% and similarity ≥ 80% assigned as represented genomes;
this step was repeated until every genome had been assigned as either a reference genome or a represented genome;
(6) Construction of the pan-genome:
a pan-genome was set up for each reference genome;
the represented genomes were compared one by one with the corresponding pan-genome, sequences with sequence identity ≤ 95% and length ≥ 1000 bp were added to the pan-genome, and the pan-genome was updated;
this step was repeated until all represented strains of the reference genome had been compared with the pan-genome;
these steps were repeated until the corresponding pan-genome construction had been completed for all reference genomes;
all the pan-genomes obtained were combined to give the pan-genome of Acinetobacter johnsonii;
(7) Microbial pan-genome construction:
All the above steps were repeated to construct the pan-genome of each microorganism species, which together form the microbial pan-genome.
Example 2 Comparison of the method of the present application with conventional methods
A single reference genome or representative genome of a species was selected (conventional method one): the typical representative genome of Acinetobacter johnsonii was selected as the database, with a genome size of 3.5 Mb and a file size of 3.5 MB;
all genomes of a species were selected to construct the alignment database of that species (conventional method two): all 43 Acinetobacter johnsonii genomes in NCBI RefSeq were selected, with a genome size of 154.3 Mb and a file size of 150 MB; the results are shown in FIG. 2;
the Acinetobacter johnsonii pan-genome database constructed by the method of Example 1 of the present application had a genome size of 41.1 Mb and a file size of 40 MB;
alignment indexes were built for the three genome databases with the sequence alignment software Bowtie2: the index files were 13 MB for conventional method one, 217 MB for conventional method two, and 64 MB for the method of Example 1 of the present application;
the genomes of 6 Acinetobacter johnsonii strains and 2 Acinetobacter bouvetii strains in NCBI RefSeq were selected, and sequencing simulation software was used to simulate single-end sequencing results with a read length of 150 bp at 30 times the genome size;
the 8 strains were compared with the typical representative library of Example 1: the first 5 strains had sequence identity above 95% with Acinetobacter johnsonii in the typical representative library, while the last 3 strains had identity below 85% with Acinetobacter johnsonii and above 95% with Acinetobacter bouvetii; based on the classification defined by the typical representative library, the first 5 strains were determined to be Acinetobacter johnsonii and the last 3 strains Acinetobacter bouvetii;
the simulated data of the 8 strains were aligned with Bowtie2 against the indexes built by the three methods, and the effective alignment rates were counted; the results are shown in FIG. 3 (bar graph of alignment percentages): for the first five strains, the present method was similar to method two while method one was lower; for the last three strains, the present method was similar to method one while method two was higher;
the simulated data of the 8 strains were aligned with Bowtie2 against the indexes built by the three methods, and the correct and incorrect alignment rates were counted; the results are shown in FIG. 4 (box plot of alignment rate percentages). Alignment of the simulated Acinetobacter johnsonii data (the first 5 strains) to Acinetobacter johnsonii counts as correct alignment: method one had a lower correct alignment rate, while the present method was comparable to method two, close to 100%. Alignment of the simulated Acinetobacter bouvetii data (the last 3 strains, one of which carries a species error in NCBI RefSeq that was corrected on the basis of the typical representative library) to Acinetobacter johnsonii counts as incorrect alignment: method two was higher, while the present method was comparable to method one, not exceeding 10%;
the pan-genome database constructed by the present method eliminates misclassified genomes (Acinetobacter bouvetii misclassified as Acinetobacter johnsonii) during database construction, can accurately distinguish the two different microorganisms, and maintains a specificity similar to that of conventional method one; at the same time, the pan-genome approach effectively preserves sequence diversity within the species, so its sensitivity is similar to that of conventional method two;
taking the above analyses of the simulated data together, the sensitivity of the pan-genome database constructed by the present method is comparable to that of method two and higher than that of method one, and its specificity is comparable to that of method one and higher than that of method two; that is, its accuracy is higher than that of both method one and method two.
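How an alignment rate is labeled as correct or incorrect in this comparison can be illustrated with a short Python sketch. It assumes each simulated read has already been reduced to a flag saying whether it aligned to the single-species database; the function name and inputs are hypothetical, and no measured values from the example are reproduced.

```python
def classify_alignment_rate(aligned_flags, read_species, database_species):
    """Label and compute the alignment rate of one simulated strain.

    aligned_flags: one bool per simulated read (True = aligned to database).
    read_species: species the reads were simulated from.
    database_species: species the database represents.
    Returns ("correct" | "incorrect", fraction of reads aligned).
    """
    rate = sum(aligned_flags) / len(aligned_flags)
    # Reads hitting their own species are correct; hits from another
    # species' reads are incorrect alignments.
    kind = "correct" if read_species == database_species else "incorrect"
    return kind, rate
```

A high "correct" rate corresponds to sensitivity, and a low "incorrect" rate corresponds to specificity, in the sense used in the comparison above.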
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications in different forms will be apparent to those of ordinary skill in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims (10)

1. A microbial genome reference database, characterized by being constructed according to the following steps:
(1) Data acquisition: acquiring genome data of a microorganism species;
(2) Data quality control: evaluating the quality of the genome data and designating high-quality genomes;
(3) Construction of a representative genome collection: constructing a typical representative genome collection using the high-quality genomes obtained in (2);
(4) Genome screening: screening the genomes of the microorganism species according to preset rules, and removing genomes with unclear classification, wrong classification, or low quality;
(5) Constructing a reference set: selecting part of the high-quality genomes to form the reference genomes;
(6) Construction of the pan-genome: comparing the remaining high-quality genomes with the reference genomes and removing redundant parts to obtain a pan-genome database.
2. A method for constructing a microbial genome reference database, comprising the steps of:
(1) Data acquisition: acquiring genome data of a microorganism species;
(2) Data quality control: evaluating the quality of the genome data and designating high-quality genomes;
(3) Construction of a representative genome collection: constructing a typical representative genome collection using the high-quality genomes obtained in (2);
(4) Genome screening: screening the genomes of the microorganism species according to preset rules, and removing genomes with unclear classification, wrong classification, or low quality;
(5) Constructing a reference set: selecting part of the high-quality genomes to form the reference genomes;
(6) Construction of the pan-genome: comparing the remaining high-quality genomes with the reference genomes and removing redundant parts to obtain a pan-genome database.
3. The microbial genome reference database or the construction method according to claim 1 or 2, wherein in the (1) data acquisition step, the genomic data source is selected from one or more of microbial genome data in a genome database, the IMG/M database, the EMBL database, the FDA-ARGOS database, the EuPathDB database, NCBI GenBank, and NCBI RefSeq; the (2) data quality control comprises: evaluating the completeness and contamination of the genome data with a quality control tool, and designating genomes with completeness ≥ Cp% and contamination ≤ Cm% as high-quality genomes; wherein Cp is 85-100 and Cm is 0-10.
4. A microbial genome reference database or construction method according to any one of claims 1-3, wherein (3) constructing a representative genome collection comprises:
if the microorganism species has a plurality of high-quality genomes derived from typical representative strains, selecting the genome with the highest completeness as the typical representative genome of the microorganism species, and comparing it with the genomes of other species: if no genome of another species with genome identity ≥ S1% is found, there is no classification error; if a genome of another species with genome identity ≥ S1% is found, there is a classification error, and the typical representative genome of the microorganism species is reselected; wherein S1 is 85-100; preferably S1 is 95.
5. The microbial genome reference database or construction method of any one of claims 1-4, wherein (4) genome screening comprises comparing the genomes other than the typical representative genome with the corresponding typical representative genome, a genome identity ≥ S2% indicating correct classification; wherein S2 is 85-100; preferably S2 is 95.
6. The microbial genome reference database or construction method of any one of claims 1-5, wherein (5) constructing a reference set comprises:
(5-1) designating the typical representative genome of the microorganism as a reference genome;
(5-2) performing similarity analysis between the genomes not designated as reference genomes and the typical representative genome of the microorganism, and assigning genomes with identity ≥ S3% and similarity ≥ O1% as represented genomes; wherein S3 is 85-100 and O1 is 75-100;
(5-3) from the genomes not assigned as represented genomes, selecting one strain in order of genome assembly level (complete genome, chromosome, scaffold, contig) and adding it to the reference genomes, then comparing the other genomes with the newly added reference genome and assigning those with genome identity ≥ S4% and similarity ≥ O2% as represented genomes; wherein S4 is 85-100 and O2 is 75-100;
(5-4) repeating (5-3) until every genome has been assigned as either a reference genome or a represented genome;
preferably S3 and S4 are 94 and O1 and O2 are 80.
7. The microbial genome reference database or construction method of any one of claims 1-6, wherein (6) constructing a pan genome comprises:
(6-1) setting the reference genome as the pan genome;
(6-2) comparing the represented genomes with the species pan-genome in turn, in descending order of average genome sequence length, and adding sequences with identity ≤ S5% and length ≥ L bp to the pan-genome; wherein S5 is 85-100 and L is 50-5000;
(6-3) repeating (6-1) and (6-2) until the corresponding pan-genome construction has been completed for all reference genomes; preferably S5 is 95 and L is 500.
8. The microbial genome reference database or construction method of any one of claims 1-6, wherein the microbial genome reference database comprises the genome reference databases of a plurality of microorganisms.
9. Use of a microbial genome reference database or construction method according to any of claims 1-8 in microbial detection, said use being for non-diagnostic use, characterized in that the use comprises the steps of sequencing a sample and comparing the sequencing results to a microbial genome reference database.
10. A computing device for performing the construction method according to any one of claims 1-8, characterized in that it comprises the following modules:
(1) A data acquisition module: for obtaining genome data of a microorganism species;
(2) A data quality control module: for evaluating genome data quality and identifying high quality genomes;
(3) A representative genome set construction module: for constructing a representative genome set using the high quality genomes obtained in (2);
(4) A genome screening module: for screening the genomes of the microorganism species according to a preset rule, and removing genomes with undefined classification, wrong classification, or low quality;
(5) A reference set construction module: for selecting part of the high quality genomes to form the reference genomes;
(6) A pan genome construction module: for comparing the remaining high quality genomes with the reference genomes and removing redundant parts to obtain a pan genome database.
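The six modules of claim 10 can be pictured as stages of one pipeline. The sketch below is only an assumed wiring: the class name `DatabaseBuilder`, the field names, and the exact data passed between stages are hypothetical, since the claim lists the modules but does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DatabaseBuilder:
    """Hypothetical composition of the six modules named in claim 10."""

    acquire: Callable[[str], List[str]]                      # module (1)
    quality_control: Callable[[List[str]], List[str]]         # module (2)
    build_representative_set: Callable[[List[str]], List[str]]  # module (3)
    screen: Callable[[List[str], List[str]], List[str]]        # module (4)
    build_reference_set: Callable[[List[str]], Tuple[List[str], List[str]]]  # module (5)
    build_pan_genome: Callable[[List[str], List[str]], List[str]]  # module (6)

    def run(self, species: str) -> List[str]:
        genomes = self.acquire(species)
        high_quality = self.quality_control(genomes)
        representative = self.build_representative_set(high_quality)
        # Assumed here: screening uses the representative set as its yardstick.
        screened = self.screen(high_quality, representative)
        # Module (5) is assumed to return (reference genomes, remaining genomes).
        references, remaining = self.build_reference_set(screened)
        return self.build_pan_genome(references, remaining)
```

Passing each module in as a callable keeps the sketch independent of any particular quality metric or alignment tool; real modules would replace the placeholders.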
CN202211644956.6A 2022-12-20 2022-12-20 Microbial genome reference database, construction method and application thereof Active CN116153410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211644956.6A CN116153410B (en) 2022-12-20 2022-12-20 Microbial genome reference database, construction method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211644956.6A CN116153410B (en) 2022-12-20 2022-12-20 Microbial genome reference database, construction method and application thereof

Publications (2)

Publication Number Publication Date
CN116153410A true CN116153410A (en) 2023-05-23
CN116153410B CN116153410B (en) 2023-12-19

Family

ID=86349929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211644956.6A Active CN116153410B (en) 2022-12-20 2022-12-20 Microbial genome reference database, construction method and application thereof

Country Status (1)

Country Link
CN (1) CN116153410B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201518364D0 (en) * 2015-10-16 2015-12-02 Genome Res Ltd Methods associated with a database that stores a plurality of reference genomes
US20200104464A1 (en) * 2018-09-30 2020-04-02 International Business Machines Corporation A k-mer database for organism identification
CN111009286A (en) * 2018-10-08 2020-04-14 深圳华大因源医药科技有限公司 Method and apparatus for microbiological analysis of host samples
CN112863606A (en) * 2021-03-08 2021-05-28 杭州微数生物科技有限公司 Bacterium identification and typing analysis genome database and identification and typing analysis method
CN112992277A (en) * 2021-03-18 2021-06-18 南京先声医学检验有限公司 Construction method and application of microbial genome database
CN114974411A (en) * 2022-06-28 2022-08-30 杭州杰毅医学检验实验室有限公司 Metagenome pathogenic microorganism genome database and construction method thereof
CN115148288A (en) * 2022-06-29 2022-10-04 慕恩(广州)生物科技有限公司 Microorganism identification method, identification device and related equipment
CN115394361A (en) * 2022-08-15 2022-11-25 中国科学院心理研究所 Method, apparatus and medium for constructing a microbial genome database

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIN JUNJIE 等: ""A human gut microbial gene catalogue established by metagenomic sequencing."", 《NATURE》, vol. 464, no. 7285, pages 59 - 65, XP008132800, DOI: 10.1038/nature08821 *
王恒超: ""宏基因组基因集构建方法及其应用研究"", 《中国优秀硕士学位论文全文数据库 (基础科学辑)》, vol. 2019, no. 15, pages 006 - 337 *

Also Published As

Publication number Publication date
CN116153410B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
Steinegger et al. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold
CN110473594B (en) Pathogenic microorganism genome database and establishment method thereof
US20200294628A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
Piro et al. DUDes: a top-down taxonomic profiler for metagenomics
CN114420212B (en) Escherichia coli strain identification method and system
CN113744807B (en) Macrogenomics-based pathogenic microorganism detection method and device
Saheb Kashaf et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data
CN105740650A (en) Method for rapidly and accurately identifying high-throughput genome data pollution sources
CN114121160B (en) Method and system for detecting macrovirus group in sample
CN115719616B (en) Screening method and system for pathogen species specific sequences
Yang et al. A robust and generalizable immune-related signature for sepsis diagnostics
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
Dalevi et al. Annotation of metagenome short reads using proxygenes
CN114974411A (en) Metagenome pathogenic microorganism genome database and construction method thereof
CN114121167A (en) Construction method and system of microbial gene database
Egertson et al. A theoretical framework for proteome-scale single-molecule protein identification using multi-affinity protein binding reagents
Hickl et al. binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets
CN116153410B (en) Microbial genome reference database, construction method and application thereof
Holstein et al. PepGM: a probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores
EP2835751A1 (en) Method of deconvolution of mixed molecular information in a complex sample to identify organism(s)
CN114496089B (en) Pathogenic microorganism identification method
CN115938491B (en) High-quality bacterial genome database construction method and system for clinical pathogen diagnosis
CN116646010B (en) Human virus detection method and device, equipment and storage medium
CN116153411B (en) Design method and application of multi-pathogen probe library combination
CN211578386U (en) Metagenome analysis device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant