CN116682496A - Pathogenic microorganism genome database and construction method and application thereof - Google Patents

Info

Publication number
CN116682496A
Authority
CN
China
Prior art keywords
genome
sequence
database
species
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310221252.6A
Other languages
Chinese (zh)
Inventor
叶生鑫
周桂兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN ADICON CLINICAL LABORATORIES Inc
Original Assignee
WUHAN ADICON CLINICAL LABORATORIES Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN ADICON CLINICAL LABORATORIES Inc filed Critical WUHAN ADICON CLINICAL LABORATORIES Inc
Priority to CN202310221252.6A priority Critical patent/CN116682496A/en
Publication of CN116682496A publication Critical patent/CN116682496A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a pathogenic microorganism genome database, a construction method and an application thereof, and belongs to the field of pathogen metagenomic detection. The construction method comprises the following steps: acquiring genomes and related description files; screening species: for bacteria, removing species whose names are ambiguous or tentative, and for viruses, removing phages; screening the reference genome according to a predetermined rule; screening genomes: retaining reference genomes and representative genomes, removing abnormal genomes, and removing bacterial and viral genomes with low completeness or a high contamination rate; removing misclassified genomes, removing plasmid sequences, removing contaminating sequences, removing host homologous sequences, removing sequences homologous to the reference genome of the species, removing low-quality sequences, removing redundant sequences, and splicing the genome. The high-quality pathogenic microorganism genome database constructed by the invention is comprehensive and of high quality; when it is used to detect pathogenic microorganisms by metagenomic sequencing analysis, the analysis time is short and the results are highly accurate.

Description

Pathogenic microorganism genome database and construction method and application thereof
Technical Field
The invention relates to a pathogenic microorganism genome database, a construction method and application thereof.
Background
In recent years, the pathogenic microorganisms that cause human infectious diseases have become increasingly complex and varied, and the abuse of antibacterial drugs has led to bacterial resistance, so pathogenic microorganisms have become a focus of global attention. According to WHO statistics, in 2019 the top 10 causes of death accounted for 55% of the 55.4 million deaths worldwide. Among them, lower respiratory tract infection was the world's most deadly infectious disease and the fourth leading cause of death, claiming 2.6 million lives in 2019.
Pathogen metagenomic next-generation sequencing (mNGS) directly extracts the DNA/RNA in a clinical sample without presupposition, culture or bias, performs second-generation high-throughput sequencing, and, by alignment against a dedicated pathogen database, detects bacteria, fungi, viruses, parasites and other pathogens in a single run. It can also detect new pathogenic microorganisms.
On the dry-lab (bioinformatics) side of pathogen metagenomic detection, the most critical factor is the accuracy and comprehensiveness of the pathogenic microorganism genome database. Several published expert consensus documents on pathogen metagenomic detection mention the problem of constructing pathogenic microorganism databases. The most widely used public databases include IMG, NT, RefSeq and GenBank, and each has problems to some degree when used directly for pathogen metagenomic detection. Studies have shown that microbial genomes in the IMG database are contaminated with PhiX (Mukherjee S, Huntemann M, Ivanova N, et al. Large-scale contamination of microbial isolate genomes by Illumina PhiX control [J]. Stand Genomic Sci, 2015, 10:18.) and that more than 2 million entries in the GenBank database are contaminated (Steinegger M, Salzberg S L. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank [J]. Genome Biol, 2020, 21:115.). Databases that have not been strictly screened may contain misnamed sequences, incomplete sequences, low-quality sequences or artificial sequences (Bharucha T, Oeser C, Balloux F, et al. STROBE-metagenomics: a STROBE extension statement to guide the reporting of metagenomics studies [J]. The Lancet Infectious Diseases, 2020.). Public databases also contain redundant genomes. For example, Escherichia coli has many genomes, among them NC_000913.3 and NC_007779.1. Average Nucleotide Identity (ANI) is an index for comparing the relatedness of two genomes at the nucleotide level; the ANI of these two genomes is 99.9827%, which indicates very high homology and many similar gene fragments between them. In addition, viral genome collections contain a large number of phages, whose hosts are typically bacteria rather than humans, for example Enterobacteria phage PRD1.
There are also entries whose species level is undefined, for example Ralstonia sp. and unclassified Clostridiales. The mNGS expert consensus states that, in principle, reported pathogen information should be accurate to the species level (except for pathogen complexes with very few detected sequences or very high similarity) and should also include the corresponding genus information (Editorial Committee of the Chinese Journal of Infectious Diseases. Expert consensus on the clinical application of Chinese metagenomic second-generation sequencing technology for detecting infectious pathogens [correction appended] [J]. Chinese Journal of Infectious Diseases, 2020, 38(11): 681-689.). However, a substantial portion of genomes either lack species information or carry only a tentative species name.
In the metagenomic analysis pipelines currently used in the industry for pathogen detection, there are two general ways to construct a database. In the first, one genome or sequence is selected per species as the reference; such a database occupies little storage and is fast to analyze, but the selected genome cannot represent the diversity of the species, so detections are easily missed. In the second, all genomes of a species are incorporated into the reference database; such a database avoids missed detections, but genome quality in public databases is poor, so it contains many contaminated, erroneous and redundant sequences, which occupy a great deal of storage and computing resources, slow down the analysis, and may produce false positives that mislead clinical treatment.
Disclosure of Invention
The invention provides a pathogenic microorganism genome database, a construction method and an application thereof. A database constructed by the method is versatile, improves the accuracy of results and gives high analysis efficiency.
The technical scheme adopted by the invention is as follows:
a method of constructing a pathogenic microorganism genome database, the method comprising the following steps:
(1) Acquiring genomes and related description files;
the genomes and related description files are derived from genomic data of bacteria, fungi, viruses and parasites in one or more of the following databases: the NT, RefSeq and GenBank databases of NCBI, the PATRIC database, the VEuPathDB database, the FDA-ARGOS database, the GCM database and the ViralZone database.
(2) Screening species: for bacteria, removing species whose names are ambiguous or tentative; for viruses, removing phages;
specifically, for bacteria, species whose Latin names contain one of the following keywords are removed: uncultured, unclassified, unidentified, candidatus;
for viruses, species whose Latin names contain the keyword phage or Phage are removed;
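The species screen in step (2) amounts to a keyword filter on Latin names. The following is a minimal sketch, assuming case-insensitive substring matching and taking the virus keyword to be "phage"; the function name `keep_species` and the `kingdom` labels are hypothetical and not part of the patent.

```python
# Keywords from the text; matching them as case-insensitive substrings
# is an assumption of this sketch.
BACTERIA_KEYWORDS = ("uncultured", "unclassified", "unidentified", "candidatus")
VIRUS_KEYWORDS = ("phage",)

def keep_species(latin_name, kingdom):
    """Return True if the species passes the name-based screen."""
    name = latin_name.lower()
    if kingdom == "bacteria":
        return not any(k in name for k in BACTERIA_KEYWORDS)
    if kingdom == "virus":
        return not any(k in name for k in VIRUS_KEYWORDS)
    return True  # other groups are not screened by name in this sketch

names = [
    ("Enterococcus faecalis", "bacteria"),
    ("uncultured Clostridium sp.", "bacteria"),
    ("Enterobacteria phage PRD1", "virus"),
]
kept = [n for n, k in names if keep_species(n, k)]
print(kept)
```

A production filter would probably match whole words and use the taxonomy rather than free-text names, but the substring form is enough to show the rule.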
(3) Screening the reference genome, using the following rules in order of priority:
A. if the genome description file marks a genome as reference genome, that genome is selected as the reference genome;
B. if the genome description file has no reference genome but marks a genome as representative genome, that genome is selected as the reference genome;
C. if the genome description file has neither a reference genome nor a representative genome, genomes are screened by assembly level in the order Complete Genome, Chromosome, Scaffold, Contig;
D. if several genomes share the same assembly level, those with material source information are preferred, i.e. those containing one of the following keywords: assembly from type material, assembly from synonym type material, assembly from pathotype material, assembly designated as neotype, assembly designated as reftype, ICTV species exemplar, ICTV additional isolate;
E. if several genomes have material source information, or none has, ANI is calculated between every pair of genomes, and the genome that has a higher ANI with more of the other genomes, according to the ANI matrix, is selected as the reference genome.
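The priority rules A to E can be read as a cascade of filters. The sketch below is one possible reading, not the patent's implementation: the record keys (`acc`, `refseq_category`, `assembly_level`, `type_material`) are hypothetical, the pairwise ANI values are assumed to be precomputed by an external tool, and rule E is interpreted here as "highest mean ANI to the other candidates".

```python
ASSEMBLY_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}

def pick_by_ani(candidates, ani):
    """Rule E (as interpreted here): highest mean ANI to the other candidates."""
    def mean_ani(g):
        others = [h for h in candidates if h != g]
        return sum(ani[frozenset((g, h))] for h in others) / len(others)
    return max(candidates, key=mean_ani)

def select_reference(genomes, ani):
    """genomes: list of dicts; ani: {frozenset({acc1, acc2}): percent}."""
    # Rules A and B: a flagged reference, then a flagged representative.
    for category in ("reference genome", "representative genome"):
        hits = [g for g in genomes if g["refseq_category"] == category]
        if hits:
            return hits[0]["acc"]
    # Rule C: keep only the best assembly level that is present.
    best = min(ASSEMBLY_RANK[g["assembly_level"]] for g in genomes)
    pool = [g for g in genomes if ASSEMBLY_RANK[g["assembly_level"]] == best]
    # Rule D: prefer genomes carrying material source information.
    pool = [g for g in pool if g["type_material"]] or pool
    if len(pool) == 1:
        return pool[0]["acc"]
    return pick_by_ani([g["acc"] for g in pool], ani)

demo = [
    {"acc": "G1", "refseq_category": "na", "assembly_level": "Contig", "type_material": ""},
    {"acc": "G2", "refseq_category": "na", "assembly_level": "Complete Genome", "type_material": ""},
    {"acc": "G3", "refseq_category": "na", "assembly_level": "Complete Genome", "type_material": ""},
    {"acc": "G4", "refseq_category": "na", "assembly_level": "Complete Genome", "type_material": ""},
]
demo_ani = {
    frozenset(("G2", "G3")): 99.0,
    frozenset(("G2", "G4")): 98.0,
    frozenset(("G3", "G4")): 95.0,
}
print(select_reference(demo, demo_ani))
```

Other readings of rule E are possible, for example counting how many pairwise ANI values exceed a cutoff; the text does not fix the exact statistic.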
(4) Screening genomes: retaining reference genomes and representative genomes, removing abnormal genomes, and removing bacterial and viral genomes with low completeness or a high contamination rate;
specifically, a genome is retained if the genome description file marks it with one of the keywords reference genome or representative genome;
a genome is removed if the genome description file marks it with one of the following keywords: chimeric, contaminated, hybrid, misassembled, mixed culture, sequence duplications, unverified source organism, abnormal gene to sequence ratio, derived from environmental source, derived from single cell, fragmented assembly, from large multi-isolate project, genome length too large, genome length too small, genus undefined, low gene count, low quality sequence, many frameshifted proteins, metagenome, missing ribosomal protein genes, missing rRNA genes, missing strain identifier, missing tRNA genes, partial, RefSeq annotation failed, untrustworthy as type;
for bacteria or viruses with several genomes at the species level, completeness and contamination rate are calculated, and genomes with completeness less than or equal to 50% and/or contamination rate greater than or equal to 5% are removed;
the completeness and contamination rate of bacterial and viral genomes can generally be calculated with the CheckM and CheckV tools, respectively.
The reference genome selected in step (3) is removed from the genomes obtained in this step.
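The genome screen in step (4) combines a keyword check with two numeric cutoffs. The sketch below is illustrative only: it uses a subset of the exclusion keywords, assumes they arrive as a semicolon-separated field named `excluded_from_refseq` (as in NCBI assembly summaries), and assumes completeness and contamination percentages were already computed by a tool such as CheckM or CheckV. The function name `keep_genome` is hypothetical.

```python
# Subset of the exclusion keywords listed in the text, for illustration.
EXCLUDE_KEYWORDS = {
    "chimeric", "contaminated", "hybrid", "misassembled",
    "mixed culture", "fragmented assembly", "low quality sequence", "partial",
}

def keep_genome(record, completeness=None, contamination=None):
    """Apply the step (4) rules to one genome record (a dict)."""
    # Reference and representative genomes are always retained.
    if record["refseq_category"] in ("reference genome", "representative genome"):
        return True
    # Remove genomes flagged with any exclusion keyword.
    reasons = {r.strip() for r in record["excluded_from_refseq"].split(";") if r.strip()}
    if reasons & EXCLUDE_KEYWORDS:
        return False
    # Remove genomes with completeness <= 50% and/or contamination >= 5%.
    if completeness is not None and completeness <= 50.0:
        return False
    if contamination is not None and contamination >= 5.0:
        return False
    return True

ok = {"refseq_category": "na", "excluded_from_refseq": ""}
bad = {"refseq_category": "na", "excluded_from_refseq": "partial; low gene count"}
print(keep_genome(ok, 98.7, 0.4), keep_genome(bad, 98.7, 0.4), keep_genome(ok, 45.0, 0.4))
```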
(5) Removing misclassified genomes: calculating the ANI between each genome in the species and the reference genome, and removing genomes with abnormal ANI values;
further, genomes with an ANI value less than or equal to a preset value are preferably removed;
the preset value is 90-100%;
preferably, genomes with an ANI value less than or equal to 90% are removed; such genomes are clearly abnormal and are likely misclassified, so they need to be removed;
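As a small illustration of step (5), the filter below keeps only genomes whose ANI to the reference genome exceeds the preset value. The ANI values are assumed to come from an external ANI tool, and the accession labels are made up.

```python
def remove_misclassified(ani_to_reference, threshold=90.0):
    """Keep accessions whose ANI to the reference is strictly above threshold."""
    return [acc for acc, ani in ani_to_reference.items() if ani > threshold]

# Made-up ANI values (percent) against the species reference genome.
ani_values = {"genome_a": 99.2, "genome_b": 90.0, "genome_c": 78.5, "genome_d": 96.1}
print(remove_misclassified(ani_values))
```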
(6) Removing plasmid sequences, specifically as follows:
for the bacterial and fungal genomes obtained, sequences whose annotation contains the keyword plasmid or Plasmid are removed;
the remaining sequences are aligned against the PLDB plasmid sequence database; sequences that match and have a high proportion of matched region are removed; for the others, the unmatched segments whose length reaches a specified threshold are cut out and retained, and leading and trailing N bases are trimmed;
the alignment identity threshold for a match is 95-100%, preferably 95%; the threshold for the proportion of matched region is 50-100%, preferably 50%; the matching length threshold is an integer from 100 bp to 2000 bp, preferably 100 bp; the length threshold for retained segments is an integer from 100 bp to 2000 bp, preferably 100 bp.
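The align-then-clip logic described here can be sketched independently of the aligner. The code below is a sketch under stated assumptions: alignment hits are given as hypothetical `(start, end, identity)` tuples with 0-based half-open coordinates on the query sequence, thresholds are applied inclusively, and the function names are illustrative. It drops a sequence when the matched fraction reaches the ratio threshold, and otherwise returns the unmatched segments that are long enough, with leading and trailing N bases trimmed.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def clean_sequence(seq, hits, min_identity=95.0, min_hit_len=100,
                   max_ratio=0.5, min_keep_len=100):
    """Return the retained unmatched segments of seq (empty list if dropped)."""
    # Only hits that pass the identity and length thresholds count as matches.
    matched = merge_intervals(
        (s, e) for s, e, ident in hits
        if ident >= min_identity and e - s >= min_hit_len
    )
    covered = sum(e - s for s, e in matched)
    if seq and covered / len(seq) >= max_ratio:
        return []  # mostly matched: remove the whole sequence
    pieces, pos = [], 0
    for s, e in matched + [(len(seq), len(seq))]:
        piece = seq[pos:s].strip("Nn")  # trim leading/trailing N
        if len(piece) >= min_keep_len:
            pieces.append(piece)
        pos = e
    return pieces

demo_seq = "A" * 300 + "C" * 150 + "N" * 10 + "G" * 200
print([len(p) for p in clean_sequence(demo_seq, [(300, 450, 98.0)])])
```

Because the function takes only the hit coordinates, the same routine could in principle be reused for any step that aligns sequences against a target library and clips the matched regions.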
(7) Removing contaminating sequences, specifically as follows:
the genome sequences are aligned against the UniVec library of NCBI; sequences that match and have a high proportion of matched region are removed; for the others, the unmatched segments whose length reaches a specified threshold are cut out and retained, and leading and trailing N bases are trimmed;
the alignment identity threshold for a match is 95-100%, preferably 95%; the matching length threshold is an integer from 100 bp to 2000 bp, preferably 100 bp; the threshold for the proportion of matched region is 50-100%, preferably 50%; the length threshold for retained segments is an integer from 100 bp to 2000 bp, preferably 100 bp.
(8) Removing host homologous sequences, specifically as follows:
the genome sequences are aligned against the complete human genome GCF_009914755.1 or another human genome; sequences that match and have a high proportion of matched region are removed; for the others, the unmatched segments whose length reaches a specified threshold are cut out and retained, and leading and trailing N bases are trimmed;
the alignment identity threshold for a match is 95-100%, preferably 95%; the matching length threshold is an integer from 100 bp to 2000 bp, preferably 100 bp; the threshold for the proportion of matched region is 50-100%, preferably 50%; the length threshold for retained segments is an integer from 100 bp to 2000 bp, preferably 100 bp.
(9) Removing sequences homologous to the reference genome of the species, specifically as follows:
the genome sequences are aligned against the reference genome selected in step (3); sequences that match and have a high proportion of matched region are removed; for the others, the unmatched segments whose length reaches a specified threshold are cut out and retained, and leading and trailing N bases are trimmed;
the alignment identity threshold for a match is 95-100%, preferably 95%; the matching length threshold is an integer from 100 bp to 2000 bp, preferably 100 bp; the threshold for the proportion of matched region is 50-100%, preferably 50%; the length threshold for retained segments is an integer from 100 bp to 2000 bp, preferably 100 bp.
(10) Removing low-quality sequences and redundant sequences, specifically as follows:
sequences shorter than a specified length threshold are removed, the remaining sequences are aligned against each other, and only 1 sequence is retained from each group of matching sequences, giving a high-quality intraspecies specific sequence set.
The length threshold is an integer from 100 bp to 2000 bp, preferably 100 bp; the alignment identity threshold for a match is 95-100%, preferably 95%.
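Step (10) can be illustrated with a greedy de-duplication pass. This is a sketch only: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, whereas a real pipeline would use a sequence aligner or clustering tool; processing longer sequences first is also an assumption, since the text only says that 1 matching sequence is retained.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for alignment identity, as a percentage (assumption)."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()

def deduplicate(sequences, min_len=100, min_identity=95.0):
    """Drop short sequences, then keep one representative per match group."""
    kept = []
    # Longer sequences are considered first so they become representatives.
    for seq in sorted((s for s in sequences if len(s) >= min_len),
                      key=len, reverse=True):
        if all(similarity(seq, rep) < min_identity for rep in kept):
            kept.append(seq)
    return kept

seq_a = "ACGT" * 30          # 120 bp
seq_b = "ACGT" * 30          # exact duplicate of seq_a
seq_c = "G" * 120            # 120 bp, unlike seq_a
seq_short = "ACGTACGT"       # below the 100 bp length threshold
print(len(deduplicate([seq_a, seq_b, seq_c, seq_short])))
```

The pairwise comparison is quadratic in the number of sequences, so this form only suits small inputs.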
(11) Constructing the high-quality genome database, specifically as follows:
the high-quality reference genome and the high-quality intraspecies specific sequence set are joined into 1 sequence using a preset number of consecutive N bases, giving the high-quality genome database of the pathogenic microorganism.
The preset number is an integer from 50 to 100, preferably 50.
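The splicing in step (11) is a plain join with a run of N bases as the spacer. A minimal sketch, with a hypothetical function name and toy sequences:

```python
def splice(reference_seqs, specific_seqs, n_spacer=50):
    """Join reference and species-specific sequences with runs of N."""
    return ("N" * n_spacer).join(list(reference_seqs) + list(specific_seqs))

# Toy sequences; a spacer of 2 keeps the output readable.
print(splice(["AAA"], ["CC", "G"], n_spacer=2))
```

The N spacer keeps reads from aligning across the junction between two unrelated segments, which is presumably why a run of at least 50 is specified.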
The method of the invention can further comprise step (12), merging the databases: genomes of different species are processed according to steps (1)-(11) to obtain high-quality genome databases of the individual pathogenic microorganisms, and all the data are combined to obtain a high-quality genome database of all pathogenic microorganisms.
The pathogenic microorganism genome database is obtained by downloading genomes and related description files; removing bacteria whose species names are ambiguous or tentative and removing phages; removing misclassified genomes, plasmid sequences, contaminating sequences, host homologous sequences, sequences homologous to the reference genome of the species, low-quality sequences and redundant sequences; and splicing the genome.
In the present invention, a pathogenic microorganism is a microorganism capable of causing disease in humans or animals, including but not limited to bacteria, fungi, viruses, parasites, prions, some arthropods, spirochetes, mycoplasma, chlamydia, rickettsiae and the like.
In the step of masking host homologous regions, the host is a human or an animal: for a database of genomes of human pathogenic microorganisms, the host is human; for a database of genomes of animal pathogenic microorganisms, the host is the corresponding animal or a closely related animal.
The invention also provides a high-quality genome database of the pathogenic microorganism constructed by the method.
The high-quality genome database of the pathogenic microorganisms provided by the invention can be used for detecting the pathogenic microorganisms by metagenome sequencing analysis.
The pathogenic microorganism genome database constructed by the invention removes poor-quality genomes through pretreatment, which improves the quality of the final database. Taking the high-quality reference genome as the benchmark, the other high-quality genomes of the pathogen are screened for sequences that differ from it, and these specific sequences are added as a supplement. The reference genome is the most widely used and most widely accepted genome; it is authoritative and helps ensure the accuracy of the final identification result.
During genome screening, phages are removed, keywords describing genome quality are used, and the contamination rate and completeness are calculated, all of which improve the quality of the database.
When removing plasmid sequences, the invention aligns sequences directly and decides from the alignment thresholds whether to remove the whole sequence. In the prior art, by contrast, the genome is broken into fragments that are screened to find plasmid homologous regions on the genome; because the fragments are too short, they easily align to non-plasmid homologous regions of the genome, which causes erroneous screening and affects the quality of the database.
When processing the specific sequence set, the invention removes contaminating sequences in addition to host sequences, i.e. by alignment with the UniVec library of NCBI. Specific sequences may (1) arise from the diversity of the species, (2) result from contamination during the experiment, or (3) result from incorrect assembly during bioinformatic analysis. Sequences of type (1) are retained in the database; interference from (2) can be removed by alignment with the UniVec library of NCBI; and interference from (3) can be addressed by calculating the contamination rate during genome screening. The invention thus eliminates erroneous and contaminated specific sequences by several methods and retains the specific sequences that arise from species diversity.
The pathogenic microorganism genome database constructed by the invention comprehensively contains high-quality genome information, eliminates low-quality and erroneous genomes, retains the specific sequences of each species, and masks host and contaminating sequences. The richness of each species' genomic content is greatly improved while the accuracy of the database is ensured, so metagenomic analysis and detection take little time and give highly accurate results.
Drawings
FIG. 1 is a schematic diagram of the results of genome alignment performed on three sets of genome databases constructed by different methods.
FIG. 2 is a graph showing the results of analysis time for genome comparison in three sets of genome databases constructed by different methods.
Detailed Description
The technical scheme of the present invention will be further described with reference to specific embodiments and drawings, but the scope of the present invention is not limited thereto.
Example 1
A pathogenic microorganism genome database was constructed by the following method:
(1) Acquiring genomes and related description files
The Enterococcus faecalis genomes were downloaded from the GenBank database of NCBI, as follows:
NCBI stands for National Center for Biotechnology Information. GenBank is a sub-database of NCBI containing genomic information on animals, plants, microorganisms and so on. The bacterial genome description file was downloaded from
ftp://ftp.ncbi.nih.gov/genome/bacteria/assembly_summary.txt. All Enterococcus faecalis genomes listed in it were selected for download, 5912 in total; 17 of them had been removed by NCBI, leaving 5895 genomes. This complete set was named ALL_genome.
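As a rough illustration of this selection step, the sketch below reads a tab-separated genome description file and keeps the rows for one species. It assumes the general layout of an NCBI assembly summary (comment lines starting with `#`, the last of which names the columns); the sample rows, the reduced column set and the accession numbers are made up for the example.

```python
import io

def read_assembly_summary(handle):
    """Parse a tab-separated summary whose last '#' line holds column names."""
    header, rows = None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").strip().split("\t")
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows

def select_species(rows, species):
    """Keep rows whose organism_name starts with the species name."""
    return [r for r in rows if r["organism_name"].startswith(species)]

# Made-up sample with only a few of the real file's columns.
SAMPLE = (
    "#   illustrative header comment\n"
    "# assembly_accession\trefseq_category\torganism_name\tassembly_level\n"
    "GCA_000000001.1\tna\tEnterococcus faecalis V583\tComplete Genome\n"
    "GCA_000000002.1\tna\tEnterococcus faecium DO\tComplete Genome\n"
    "GCA_000000003.1\tna\tEnterococcus faecalis\tContig\n"
)

rows = select_species(read_assembly_summary(io.StringIO(SAMPLE)),
                      "Enterococcus faecalis")
print([r["assembly_accession"] for r in rows])
```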
(2) Screening species
Based on the organism_name column of the genome description file, genomes of indeterminate species were removed: for bacteria, species whose names are ambiguous or tentative; for viruses, phages;
specifically, for bacteria, species whose Latin names contain one of the following keywords were removed: uncultured, unclassified, unidentified, candidatus;
for viruses, species whose Latin names contain the keyword phage or Phage were removed;
(3) Screening the reference genome, using the following rules in order of priority:
A. if the genome description file marks a genome as reference genome, that genome is selected as the reference genome;
B. if the genome description file has no reference genome but marks a genome as representative genome, that genome is selected as the reference genome;
C. if the genome description file has neither a reference genome nor a representative genome, genomes are screened by assembly level in the order Complete Genome, Chromosome, Scaffold, Contig;
D. if several genomes share the same assembly level, those with material source information are preferred, i.e. those containing one of the following keywords: assembly from type material, assembly from synonym type material, assembly from pathotype material, assembly designated as neotype, assembly designated as reftype, ICTV species exemplar, ICTV additional isolate;
E. if several genomes have material source information, or none has, ANI is calculated between every pair of genomes, and the genome that has a higher ANI with more of the other genomes, according to the ANI matrix, is selected as the reference genome.
The Enterococcus faecalis reference genome was selected according to the method above and named Ref_genome.
(4) Screening genomes: retaining reference genomes and representative genomes, removing abnormal genomes, and removing bacterial and viral genomes with low completeness or a high contamination rate;
specifically, a genome was retained if the genome description file marks it with one of the keywords reference genome or representative genome;
a genome was removed if the genome description file marks it with one of the following keywords: chimeric, contaminated, hybrid, misassembled, mixed culture, sequence duplications, unverified source organism, abnormal gene to sequence ratio, derived from environmental source, derived from single cell, fragmented assembly, from large multi-isolate project, genome length too large, genome length too small, genus undefined, low gene count, low quality sequence, many frameshifted proteins, metagenome, missing ribosomal protein genes, missing rRNA genes, missing strain identifier, missing tRNA genes, partial, RefSeq annotation failed, untrustworthy as type;
the completeness and contamination rate of bacterial and viral genomes can generally be calculated with the CheckM and CheckV tools, respectively. Enterococcus faecalis is a bacterium, so CheckM was used to calculate completeness and contamination rate, and genomes with completeness less than or equal to 50% and/or contamination rate greater than or equal to 5% were removed;
the reference genome selected in step (3) was removed from the genomes obtained in this step.
(5) Removing misclassified genomes: the ANI between each genome in the species and the reference genome was calculated, and genomes with abnormal ANI values were removed;
genomes with an ANI value less than or equal to 90% were removed, and genomes with an ANI value greater than 90% were retained.
(6) Removing plasmid sequences:
for the bacterial genomes obtained, sequences whose annotation contains the keyword plasmid or Plasmid were removed;
the remaining sequences were aligned against the PLDB plasmid sequence database; sequences with alignment identity >95% and a matched proportion >50% were removed; unmatched segments with a length of at least 100 bp were cut out and retained, and leading and trailing N bases were trimmed.
(7) Removing contaminating sequences:
the genome sequences were aligned against the UniVec library of NCBI; sequences with alignment identity >95% and a matched proportion >50% were removed; unmatched segments with a length of at least 100 bp were cut out and retained, and leading and trailing N bases were trimmed.
(8) Removing host homologous sequences:
the genome sequences were aligned against the complete human genome GCF_009914755.1; sequences with alignment identity >95% and a matched proportion >50% were removed; unmatched segments with a length of at least 100 bp were cut out and retained, and leading and trailing N bases were trimmed.
(9) Removing sequences homologous to the reference genome of the species:
the genome sequences were aligned against the reference genome selected in step (3); sequences with alignment identity >95% and a matched proportion >50% were removed; unmatched segments with a length of at least 100 bp were cut out and retained, and leading and trailing N bases were trimmed.
(10) Removing low-quality sequences and redundant sequences:
sequences shorter than the 100 bp length threshold were removed, the remaining sequences were aligned against each other, and only 1 sequence was retained from each group of sequences matching at 95% identity, giving a high-quality intraspecies specific sequence set.
(11) Constructing the high-quality genome database:
the high-quality reference genome and the high-quality intraspecies specific sequence set were joined into 1 sequence using 50 consecutive N bases, giving the Enterococcus faecalis high-quality genome database HiQ_genome.
Example 2
To evaluate the Enterococcus faecalis high-quality genome database constructed in Example 1, database size, accuracy and analysis time were compared among three databases: the complete untreated genome set ALL_genome, the reference genome Ref_genome, and the Enterococcus faecalis genome database HiQ_genome of Example 1. Because ALL_genome contains 5895 genomes, 431,614 sequences and 17,995,057,128 bp in total, it is very difficult to process on an ordinary computer. So that the analysis could proceed smoothly, the strain genomes in ALL_genome whose assembly level is "Complete Genome" or "Chromosome" were selected to form the ALL_genome good database, which was used for the accuracy and analysis-time comparisons.
1. Database size
Database      Size
ALL_genome    17995.05 Mb
Ref_genome    2.87 Mb
HiQ_genome    41.63 Mb
As shown in the table, the pathogenic microorganism genome database constructed by the third method (HiQ_genome) greatly reduces storage and computing resources.
2. Effect of data analysis
2.1 data sets
10 Enterococcus faecalis genomes with assembly level Complete Genome were downloaded from the GenBank database, and software was used to simulate from them a data set, Simulated_data, with a read length of 75 bp and a depth of 10X.
2.2 accuracy
Simulated_data was aligned against each of the three genome databases; the accuracy results are shown in FIG. 1.
2.3 analysis time
Simulated_data was aligned against each of the three genome databases; the analysis time results are shown in FIG. 2.
As FIGS. 1 and 2 show, on the simulated data set Simulated_data the accuracy of the ALL_genome analysis was 100% with an average time of 78.40 s; the accuracy of the Ref_genome analysis was 85.31% with an average time of 2.88 s; and the accuracy of the HiQ_genome analysis was 95.84% with an average time of 4.26 s. Ref_genome is the fastest but the least accurate, and ALL_genome is the most accurate but takes too long. The accuracy of the HiQ_genome database constructed by the method is close to that of ALL_genome, and its average time is close to that of Ref_genome, so it balances analysis efficiency with accuracy and shortens the analysis time.
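The trade-off can be made concrete with a little arithmetic on the reported figures. The snippet below only recomputes ratios and differences from the numbers stated above; the variable names are illustrative.

```python
# Figures reported in Example 2 (accuracy in %, mean analysis time in s).
results = {
    "ALL_genome": {"accuracy": 100.00, "time_s": 78.40},
    "Ref_genome": {"accuracy": 85.31, "time_s": 2.88},
    "HiQ_genome": {"accuracy": 95.84, "time_s": 4.26},
}

hiq = results["HiQ_genome"]
# How many times faster HiQ_genome is than ALL_genome.
speedup_vs_all = results["ALL_genome"]["time_s"] / hiq["time_s"]
# How many times slower HiQ_genome is than Ref_genome.
slowdown_vs_ref = hiq["time_s"] / results["Ref_genome"]["time_s"]
# Accuracy differences in percentage points.
gain_vs_ref = hiq["accuracy"] - results["Ref_genome"]["accuracy"]
loss_vs_all = results["ALL_genome"]["accuracy"] - hiq["accuracy"]

print(f"speed-up vs ALL_genome: {speedup_vs_all:.1f}x")
print(f"slow-down vs Ref_genome: {slowdown_vs_ref:.2f}x")
print(f"accuracy gain vs Ref_genome: {gain_vs_ref:.2f} points")
print(f"accuracy loss vs ALL_genome: {loss_vs_all:.2f} points")
```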

Claims (10)

1. A method of constructing a pathogenic microorganism genome database, the method comprising the steps of:
(1) Acquiring genomes and the related description files;
(2) Screening species: for bacteria, removing species whose names are ambiguous or tentative; for viruses, removing phages;
(3) Screening a reference genome: screening the reference genome according to a predetermined rule;
(4) Screening genomes: retaining the reference genome and representative genomes, removing abnormal genomes, and removing bacterial and viral genomes with low completeness or a high contamination rate;
(5) Removing misclassified genomes;
(6) Removing plasmid sequences;
(7) Removing contamination sequences;
(8) Removing host homologous sequences;
(9) Removing sequences homologous to the reference genome of the species;
(10) Removing low-quality sequences and redundant sequences;
(11) Constructing a high-quality genome database: splicing the sequences to obtain a high-quality genome database of the pathogenic microorganism.
2. The method of claim 1, wherein in step (1), the genomes and related description files are derived from genome data of bacteria, fungi, viruses and parasites in one or more of the following databases: the NT, RefSeq and GenBank databases of NCBI, the PATRIC database, the VEuPathDB database, the FDA-ARGOS database, the GCM database, and the ViralZone database.
3. The method of claim 1, wherein in step (2), the method of screening the species is: for bacteria, identifying and removing species whose Latin name contains one of the following keywords: uncultured, unclassified, unidentified, candidatus;
for viruses, removing species whose Latin name contains the keyword phage or Phage.
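The species screen of step (2) amounts to a keyword filter on Latin names. The sketch below is an illustration only: it assumes case-insensitive substring matching, and the function name `keep_species` and the kingdom labels are hypothetical.

```python
# Keywords taken from the claim text; matching rule is an assumption.
BACTERIA_KEYWORDS = ("uncultured", "unclassified", "unidentified", "candidatus")
VIRUS_KEYWORDS = ("phage",)


def keep_species(latin_name: str, kingdom: str) -> bool:
    """Return False for species that the keyword screen would remove."""
    name = latin_name.lower()
    if kingdom == "bacteria":
        return not any(k in name for k in BACTERIA_KEYWORDS)
    if kingdom == "virus":
        return not any(k in name for k in VIRUS_KEYWORDS)
    return True  # other groups are not screened by this step


species = [
    ("Enterococcus faecalis", "bacteria"),
    ("Candidatus Pelagibacter ubique", "bacteria"),
    ("Escherichia phage T4", "virus"),
]
kept = [name for name, kingdom in species if keep_species(name, kingdom)]
```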
4. The method of claim 1, wherein in step (3), the reference genome is selected according to the following order of priority:
A. if the genome description file marks a genome as reference genome, that genome is selected as the reference genome;
B. if the genome description file marks no reference genome but marks a representative genome, that genome is selected as the reference genome;
C. if the genome description file marks neither a reference genome nor a representative genome, genomes are screened by assembly level in the order Complete Genome, Chromosome, Scaffold, Contig;
D. if several genomes share the same assembly level, genomes with material source information are preferred, i.e. genomes whose description contains one of the following keywords: assembly from type material, assembly from synonym type material, assembly from pathotype material, assembly designated as neotype, assembly designated as reftype, ICTV species exemplar, ICTV additional isolate;
E. if several genomes have material source information, or none has it, ANI is calculated pairwise between the genomes, and the genome that has a higher ANI with more of the other genomes is selected as the reference genome according to the ANI matrix.
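The priority order A to E can be sketched as a cascade of filters. This is a sketch under stated assumptions, not the claimed implementation: the record fields (`id`, `category`, `level`, `type_material`) are hypothetical, the pairwise ANI values are supplied by an external tool, and rule E is interpreted here as choosing the genome with the highest mean ANI to the other candidates.

```python
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def pick_reference(genomes, ani):
    """Pick one reference genome id following priorities A to E.

    genomes: list of dicts with keys id, category, level, type_material.
    ani: dict mapping frozenset({id_a, id_b}) to an ANI percentage.
    """
    # A and B: an explicitly marked reference or representative genome wins.
    for category in ("reference genome", "representative genome"):
        hits = [g for g in genomes if g["category"] == category]
        if hits:
            return hits[0]["id"]
    # C: keep only the genomes at the best available assembly level.
    best = min(LEVEL_RANK[g["level"]] for g in genomes)
    pool = [g for g in genomes if LEVEL_RANK[g["level"]] == best]
    # D: prefer genomes with type-material information, if any have it.
    pool = [g for g in pool if g["type_material"]] or pool
    if len(pool) == 1:
        return pool[0]["id"]

    # E (assumed reading): highest mean ANI against the other candidates.
    def mean_ani(g):
        others = [o["id"] for o in pool if o["id"] != g["id"]]
        return sum(ani[frozenset((g["id"], o))] for o in others) / len(others)

    return max(pool, key=mean_ani)["id"]
```

Other readings of rule E are possible, for example counting how many genomes exceed an ANI cut-off; the claim text does not fix the exact statistic.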
5. The method of claim 1, wherein in step (4), the method of screening the genomes comprises:
retaining genomes whose genome description file contains one of the keywords reference genome or representative genome;
removing genomes whose genome description file contains one of the following keywords: chimeric, contaminated, hybrid, misassembled, mixed culture, sequence duplications, unverified source organism, abnormal gene to sequence ratio, derived from environmental source, derived from single cell, fragmented assembly, from large multi-isolate project, genome length too large, genus undefined, low gene count, low quality sequence, many frameshifted proteins, metagenome, missing ribosomal protein genes, missing rRNA genes, missing strain identifier, missing tRNA genes, partial, RefSeq annotation failed, untrustworthy type genome;
for bacteria or viruses with a plurality of genomes at the species level, calculating the completeness and the contamination rate, and removing genomes with a completeness less than or equal to 50% and/or a contamination rate greater than or equal to 5%;
removing the reference genome selected in step (3) from the genomes obtained in this step.
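The genome screen combines a keyword exclusion with completeness and contamination cut-offs. The following is a minimal sketch: the record fields are hypothetical, only a few of the exclusion keywords are listed for brevity, completeness and contamination are assumed to be precomputed by an external tool, and retaining marked reference or representative genomes before any other check is an assumed ordering.

```python
# Illustrative subset of the exclusion keywords listed in the claim.
EXCLUDE_FLAGS = {"chimeric", "contaminated", "misassembled", "metagenome", "partial"}


def keep_genome(rec, min_completeness=50.0, max_contamination=5.0):
    """Return True if a genome record passes the screen of step (4)."""
    # Assumption: marked reference/representative genomes are always kept.
    if rec.get("category") in ("reference genome", "representative genome"):
        return True
    if EXCLUDE_FLAGS.intersection(rec.get("flags", ())):
        return False
    completeness = rec.get("completeness")    # percent, or None if not computed
    contamination = rec.get("contamination")  # percent, or None if not computed
    if completeness is not None and completeness <= min_completeness:
        return False
    if contamination is not None and contamination >= max_contamination:
        return False
    return True


records = [
    {"id": "g1", "flags": [], "completeness": 98.0, "contamination": 0.5},
    {"id": "g2", "flags": ["partial"], "completeness": 99.0, "contamination": 0.1},
    {"id": "g3", "flags": [], "completeness": 50.0, "contamination": 0.1},
    {"id": "g4", "flags": [], "completeness": 95.0, "contamination": 5.0},
]
passed = [r["id"] for r in records if keep_genome(r)]
```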
6. The method according to claim 1, wherein in step (5), the method for removing misclassified genomes comprises:
calculating the ANI value between each genome in the species and the reference genome, and removing genomes whose ANI value is less than or equal to a preset value;
the preset value is 90-100%.
7. The method of claim 1, wherein in step (6), the plasmid sequences are removed as follows:
for the obtained bacterial and fungal genomes, removing plasmid sequences whose genome annotation information contains the keyword plasmid or Plasmid;
aligning the remaining sequences against the PLDB plasmid sequence database, removing sequences that match with a high matched-region proportion, and extracting the unmatched segments whose length reaches a specified threshold; trimming leading and trailing N from the sequences;
the match threshold of the alignment result is 95-100%; the matched-length threshold is an integer from 100 bp to 2000 bp; the matched-region proportion threshold is 50-100%; the extraction length threshold is an integer from 100 bp to 2000 bp;
in step (7), the method for removing contamination sequences is as follows:
aligning the genome sequences against the UniVec library of NCBI, removing sequences that match with a high matched-region proportion, and extracting the unmatched segments whose length reaches a specified threshold; trimming leading and trailing N from the sequences;
the match threshold of the alignment result is 95-100%; the matched-length threshold is an integer from 100 bp to 2000 bp; the matched-region proportion threshold is 50-100%; the extraction length threshold is an integer from 100 bp to 2000 bp;
in step (8), the method for removing host homologous sequences is as follows:
aligning the genome sequences against the complete human genome GCF_009914755.1 or another human genome, removing sequences that match with a high matched-region proportion, and extracting the unmatched segments whose length reaches a specified threshold; trimming leading and trailing N from the sequences;
the match threshold of the alignment result is 95-100%; the matched-length threshold is an integer from 100 bp to 2000 bp; the matched-region proportion threshold is 50-100%; the extraction length threshold is an integer from 100 bp to 2000 bp;
in step (9), the method for removing sequences homologous to the reference genome of the species is as follows:
aligning the genome sequences against the reference genome selected in step (3), removing sequences that match with a high matched-region proportion, and extracting the unmatched segments whose length reaches a specified threshold; trimming leading and trailing N from the sequences;
the match threshold of the alignment result is 95-100%; the matched-length threshold is an integer from 100 bp to 2000 bp; the matched-region proportion threshold is 50-100%; the extraction length threshold is an integer from 100 bp to 2000 bp.
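Steps (6) to (9) share one post-alignment operation: drop a sequence that is mostly matched, otherwise keep its unmatched segments, trim N from their ends, and discard short segments. The sketch below illustrates that operation only. It assumes the matched intervals have already been produced by an aligner and filtered by the match and matched-length thresholds; the function name, the half-open interval convention and the default values are hypothetical.

```python
def unmatched_segments(seq, hits, min_len=100, max_matched_fraction=0.5):
    """Return unmatched segments of seq that survive trimming and length filtering.

    hits: list of (start, end) half-open intervals on seq reported as matches.
    A sequence whose matched fraction reaches max_matched_fraction is dropped whole.
    """
    segments, pos, covered = [], 0, 0
    for start, end in sorted(hits):
        if start > pos:
            segments.append(seq[pos:start])  # gap before this hit is unmatched
        if end > pos:
            covered += end - max(start, pos)  # count overlapping hits only once
            pos = end
    if pos < len(seq):
        segments.append(seq[pos:])
    if seq and covered / len(seq) >= max_matched_fraction:
        return []
    trimmed = (s.strip("Nn") for s in segments)  # remove leading/trailing N
    return [s for s in trimmed if len(s) >= min_len]


# Toy sequence: 150 A, a 50 bp matched block of C, 10 N, then 120 G.
toy = "A" * 150 + "C" * 50 + "N" * 10 + "G" * 120
kept = unmatched_segments(toy, hits=[(150, 200)], min_len=100)
```

In this toy case the matched block covers 50 of 330 bases, so the sequence is kept and split into its two flanking segments, with the N run trimmed from the second one.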
8. The method according to claim 1, wherein in step (10), the method for removing low-quality sequences and redundant sequences is as follows:
removing sequences shorter than a specified length threshold, aligning the remaining sequences against each other, and retaining only one sequence from each group of matching sequences, to obtain a set of high-quality sequences specific to the species;
the length threshold is an integer from 100 bp to 2000 bp, and the match threshold of the alignment result is 95-100%;
in step (11), the method for constructing the high-quality genome database is as follows:
joining the high-quality reference genome and the set of high-quality species-specific sequences into one sequence, separated by runs of a preset number of consecutive N, to obtain the high-quality genome database of pathogenic microorganisms;
the preset number is an integer from 50 to 100.
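Steps (10) and (11) can be illustrated with a short sketch. It is a simplification under stated assumptions: exact-duplicate removal stands in for the pairwise alignment at a 95-100% match threshold, and the function name `build_entry` and its defaults are hypothetical.

```python
def build_entry(reference, specific, min_len=100, n_spacer=100):
    """Join a reference genome and its species-specific sequences with runs of N.

    Sequences shorter than min_len are dropped; exact duplicates are kept once
    (a stand-in for alignment-based redundancy removal).
    """
    seen, kept = set(), []
    for s in specific:
        if len(s) < min_len or s in seen:
            continue
        seen.add(s)
        kept.append(s)
    spacer = "N" * n_spacer
    return spacer.join([reference, *kept])


reference = "ACGT" * 30                            # 120 bp toy reference
specific = ["G" * 100, "G" * 100, "T" * 50, "C" * 150]
entry = build_entry(reference, specific, min_len=100, n_spacer=50)
```

Joining with runs of N keeps reads from aligning across the boundary between two unrelated segments while still yielding a single sequence per species.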
9. A high quality genome database of pathogenic microorganisms constructed by the method of any one of claims 1 to 8.
10. Use of a high quality genome database of pathogenic microorganisms according to claim 9 for metagenomic sequencing analysis to detect pathogenic microorganisms.
CN202310221252.6A 2023-03-09 2023-03-09 Pathogenic microorganism genome database and construction method and application thereof Pending CN116682496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310221252.6A CN116682496A (en) 2023-03-09 2023-03-09 Pathogenic microorganism genome database and construction method and application thereof

Publications (1)

Publication Number Publication Date
CN116682496A true CN116682496A (en) 2023-09-01

Family

ID=87782510

Country Status (1)

Country Link
CN (1) CN116682496A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination