CN117174165B - Metagenome-based environmental drug resistance component analysis method - Google Patents

Metagenome-based environmental drug resistance component analysis method Download PDF

Info

Publication number
CN117174165B
CN117174165B CN202311389711.8A CN202311389711A CN117174165B CN 117174165 B CN117174165 B CN 117174165B CN 202311389711 A CN202311389711 A CN 202311389711A CN 117174165 B CN117174165 B CN 117174165B
Authority
CN
China
Prior art keywords
sequence
gene
sequences
functional
contig
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311389711.8A
Other languages
Chinese (zh)
Other versions
CN117174165A (en
Inventor
苏志国
陈吕军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202311389711.8A priority Critical patent/CN117174165B/en
Publication of CN117174165A publication Critical patent/CN117174165A/en
Application granted granted Critical
Publication of CN117174165B publication Critical patent/CN117174165B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to the field of bioinformatics, and particularly provides an environmental drug resistance group analysis method based on metagenome, which comprises the following steps: obtaining metagenomic data for an environmental sample based on the environmental sample; and identifying a functional gene based on the metagenomic data, and calculating the relative abundance of the functional gene in the environmental sample at three levels of read length sequence, contig sequence, and microbial genome sketch, respectively; meanwhile, the method also comprises a host analysis and a lateral gene transfer potential analysis based on functional genes. The environmental drug resistance component analysis method provides a systematic technical solution for environmental drug resistance component research and risk assessment.

Description

Metagenome-based environmental drug resistance component analysis method
Technical Field
The application relates to the field of bioinformatics, in particular to an environmental drug resistance component analysis method based on metagenome.
Background
The environment is a key node for the spread of antibiotic resistance, and antibiotic resistance genes (antibiotic resistance gene, ARGs) are frequently detected in different habitats such as water bodies, sediments, soil, sludge, air and the like, thus causing great threat to ecology and human health. ARG is thus considered a new type of environmental pollutant, whose origin, distribution and propagation laws are increasingly being appreciated by researchers. In the related art, many methods are used to detect ARGs, wherein the macrogenomic method based on high-throughput sequencing technology overcomes the limitations of the traditional methods (such as bacterial culture methods, PCR-based molecular biology methods, etc.), can be used to analyze the functional gene information of all microorganisms in environmental samples, and provides technical support for comprehensively characterizing environmental drug-resistant gene profiles.
However, the second generation sequencing technology represented by Illumina generates a large amount of short read length (reads) data, and how to effectively organize sequencing data by using bioinformatics analysis means remains a great challenge for environmental resistance group research. In the related art, analysis of environmental drug resistance components is often focused only on composition analysis of ARG, and the analysis tools and databases involved are numerous, so that the data analysis method is thin, and no high-efficiency data analysis method for integrity exists yet.
Therefore, a reasonable and efficient bioinformatic data analysis process is needed to be developed to provide a systematic technical solution for environmental drug resistance group research and risk assessment.
Disclosure of Invention
The present application solves at least one of the problems of the related art from the following aspects.
To this end, embodiments of the present application provide a metagenome-based environmental resistance component analysis method comprising: obtaining metagenomic data for an environmental sample based on the environmental sample; and identifying a functional gene based on the metagenomic data and calculating the relative abundance of the functional gene in the environmental sample, comprising: a. aligning and annotating read long sequences (reads) based on the metagenomic data to identify the functional gene and to calculate a first abundance of the functional gene in the environmental sample; b. assembling the read long sequences to obtain contig sequences (contigs), and aligning and annotating based on the contig sequences to identify the functional gene, and calculating a second abundance of the functional gene in the environmental sample, optionally the contig sequences having an assembly length of greater than 500 bp; binning the contig sequences to obtain a Microbial Genome Sketch (MAGs), and aligning and annotating based on the microbial genome sketch to identify the functional genes, and calculating a third abundance of the functional genes in the environmental sample.
In some embodiments, the functional genes include one or more of the following: antibiotic resistance genes, mobile genetic elements, virulence factor genes, and germicides and metal resistance genes, preferably antibiotic resistance genes.
In some embodiments, in step b, aligning and annotating based on the contig sequence to identify the functional gene specifically comprises: predicting the open reading frame of the contig sequence to obtain a nucleic acid sequence of a gene corresponding to the contig sequence and a translated protein sequence thereof; clustering the genes corresponding to the contig sequences based on the sequence similarity of the nucleic acid sequences of the genes corresponding to the contig sequences and the translated protein sequences, calculating the gene abundance of the genes corresponding to the contig sequences and obtaining non-sequencesA redundant gene set; and comparing and annotating the non-redundant gene set based on a functional gene database to identify the functional genes and obtain composition information and abundance information of the functional genes. In some embodiments, in the alignment, the non-redundant gene set e-value is less than or equal to 10 -5 The gene sequence with sequence similarity more than or equal to 80% and coverage more than or equal to 70% is identified as the functional gene.
In some embodiments, the functional gene database is selected from one or more of the following: SARG, MGE database, VFDB and BacMet.
In some embodiments, in step c, the aligning and annotating based on the microorganism genome sketch to identify the functional genes further comprises: performing redundancy elimination and quality control on the microbial genome sketch to obtain a non-redundant high-quality microbial genome sketch; predicting the open reading frame of the non-redundant high-quality microbial genome sketch to obtain the nucleic acid sequence of each corresponding gene in the non-redundant high-quality microbial genome sketch and the translated protein sequence thereof; and comparing and annotating the translated protein sequences of the corresponding genes based on a functional gene database to identify the functional genes and obtain composition information and abundance information of the functional genes. In some embodiments, the quality control is performed on the microbial genome sketch based on an integrity and a contamination rate, wherein the integrity is required to be ≡50% and the contamination rate is required to be ≡10%.
In some embodiments, the method further comprises: performing a host assay based on the functional gene, comprising: i. identifying functional gene similar sequences in the read long sequences, and annotating the functional gene similar sequences to obtain the sequence number of the functional gene similar sequences in each species as first host abundance, wherein the sequence similarity of the functional gene similar sequences and the functional genes is more than or equal to 80%, the comparison length is more than or equal to 25 aa, and optionally, the e value is less than or equal to 10 -7 The method comprises the steps of carrying out a first treatment on the surface of the identifying a contig sequence comprising a functional gene sequence of said contig sequence according to the method of any of the embodiments above, for said functional gene sequenceSpecies annotation is carried out on the contig sequences of the gene sequences, so that the sequence number of the contig sequences containing the functional gene sequences in each species is obtained as a second host abundance; identifying a microbial genome sketch containing a functional gene sequence in the microbial genome sketch, and annotating the microbial genome sketch containing the functional gene sequence with species to obtain the sequence number of the microbial genome sketch of the functional gene sequence in each species as a third host abundance according to the method described in any one of the embodiments.
In some embodiments, the method further comprises: performing a lateral gene transfer potential assay based on the functional gene, comprising: identifying a chromosomal source sequence and a plasmid source sequence in the contig sequence and obtaining composition information of the functional gene carried by the chromosomal source sequence and the plasmid source sequence according to the method of any of the embodiments above; y. identifying one of said contig sequences in which lateral gene transfer is likely to occur and obtaining composition information for said functional gene carried by said one of said contig sequences in which lateral gene transfer is likely to occur according to the method described in any of the embodiments above; and z. a lateral gene transfer network of "donor-acceptor" pairing relationship is constructed based on the microbial genome sketch and the microbial genome sketch containing the functional gene sequence, predicting a lateral gene transfer process between the microbial genome sketch containing the functional gene sequence and the microbial genome sketch not containing the functional gene sequence according to the method described in any of the above embodiments.
In some embodiments, the calculation of the first abundance of the functional gene in the environmental sample is based on the number of 16S rRNA gene sequences and the number of cells, as calculated by:
(1)
(2)
(3)
wherein Abundance 16S Representing functional gene Abundance normalized by 16S rRNA gene sequence, abundance cell Representing the abundance of functional genes normalized by cell number, C number Representing the number of cells per environmental sample; n (N) i(mapped reads) The number of sequences in metagenome reads aligned to a specific gene reference sequence; l (L) reads Is the sequence length obtained by high-throughput sequencing; l (L) i(reference reads) Is the length of a particular gene reference sequence; n (N) 16S Is the number of sequences identified as 16S rRNA genes in metagenomic reads; l (L) 16S Is the average length of all 16S rRNA gene sequences; n (N) 16S copy number Is the average copy number of the 16S rRNA gene sequence in the microbial cells; n is the number of reference sequences annotated to belong to a certain gene in the environmental sample.
In some embodiments, step ii further comprises: calculating the second host abundance by using software CoverM, wherein the average number of sequences (mean) of the contig sequences containing the functional gene sequences in the sequences of each species is taken as the second host abundance; and step iii further comprises: the third host abundance was calculated using the quant_bins module of software metaWRAP (v1.2.1), where Genome copies/ppm reads per million sequences were taken to represent the third host abundance.
In some embodiments, step x further comprises: filtering the contig sequence to remove the contig sequence having a sequence length shorter than 1000 bp; and typing the filtered contig sequences using Plasflow (v 1.1) software to identify chromosomal and plasmid origin sequences in the filtered contig sequences, and the step y further comprises: identifying, using WAAFLE software, contig sequences of the contig sequences that are likely to undergo lateral gene transfer, wherein the contig sequences in different metagenomic datasets that undergo lateral gene transfer are screened using WAAFLE software based on sequence lengths greater than 500 bp.
The embodiment of the application also provides a metagenome-based environmental drug resistance group risk evaluation method, which comprises the following steps: obtaining composition information and abundance information of functional genes of an environmental sample at the read length sequence, the contig sequence and the microorganism genome sketch level, respectively, according to a metagenome-based environmental drug resistance component analysis method of any one embodiment of the first aspect of the application, wherein the functional genes comprise one or more of the following: antibiotic resistance genes, mobile genetic elements, virulence factor genes, and germicides and metal resistance genes, preferably antibiotic resistance genes; comparing the open reading frames of all contig sequences obtained by assembly with a functional gene database to obtain a first contig sequence, wherein the first contig sequence contains any one of the functional genes of the antibiotic resistance gene, the mobile genetic element, the virulence factor gene, and the germicide and metal resistance genes; aligning the open reading frame of the first contig sequence with the functional gene database to obtain a second contig sequence, wherein the second contig sequence contains at least one of the other three functional genes; counting the functional gene combination types contained in the first contig sequence and the second contig sequence and the corresponding contig sequence numbers, and giving weight to the contig sequence numbers; and calculating an environmental drug resistance group risk index of the environmental sample based on the weight. In some embodiments, the weights are summed to obtain an environmental resistance group risk index for the environmental sample, wherein the functional gene database is selected from one or more of the following: SARG, MGE database, ACLAME, VFDB, and BacMet.
In some embodiments, the types of functional gene combinations contained in the first contig sequence and the second contig sequence and their weights are as shown in the following table:
the calculation formula of the risk index is as follows:
(1),
wherein:
(2)
(3)
(4)
nContig is the total number of all contig sequences obtained by the assembly.
The environmental drug resistance component analysis method based on the metagenome and the related application thereof provided by the embodiment of the application realize the following beneficial effects:
(1) The invention establishes a multi-level bioinformatics data analysis flow of 'short sequence (reads) -contig sequence (contigs) -Genome Sequence (MAGs)' around research contents such as gene annotation and quantification, host information identification, horizontal gene transfer analysis and the like by selecting software with relatively stable and high performance, modularizes the complex analysis flow according to different requirements, improves the application convenience, outputs more diverse analysis results, and provides possibility for subsequent personalized statistical analysis.
(2) The invention covers more functional genes closely related to antibiotic resistance, and provides a gene comparison and quantification method based on non-spliced sequences or assembled sequences, which not only can carry out large-scale rapid screening of environmental samples, but also can strictly judge occurrence conditions of target genes, reduce false positives of gene identification and provide diversified selections for users.
(3) In the aspect of ARGs host identification, the invention provides three different metagenomic analysis strategies, in particular to a strategy for directly identifying ARGs hosts in the environment based on short sequences, and compared with metagenomic assembly and binning strategies, the method has the advantages of simplifying calculation flow, obtaining high host diversity and the like.
(4) In the aspect of ARGs horizontal gene transfer screening, the invention provides corresponding analysis methods for the assembly sequence and the genome sequence respectively, constructs a microbiome horizontal gene transfer network, and provides a solution for predicting the ARGs transmission process from the gene and community level.
(5) The invention provides a method for calculating comprehensive risk indexes based on the coexistence relation of multiple risk indicator genes on metagenomic sequences, which can be used for evaluating potential risk levels of drug resistant groups in different environmental samples, has the characteristics of high throughput, rapidness, convenience, standardization and the like, saves time, manpower and related experimental cost, is suitable for large-scale environmental investigation and research, and provides basis for establishing evaluation standards of environmental drug resistant group risks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of metagenome-based environmental resistance component analysis according to examples 1-3 of the present application;
FIG. 2 is a flowchart of an environmental resistance group risk assessment method according to example 4 of the present application;
FIG. 3 shows the composition and relative abundance of different genes based on reads analysis according to example 1 of the present application, (a) ARG, (b) BMG, (c) MGE, (d) VFG;
FIG. 4 shows different gene compositions and relative abundances based on contigs analysis according to example 1 of the present application, (a) ARG, (b) BMG, (c) MGE, (d) VFG;
FIG. 5 shows different gene compositions and amounts based on MAGs analysis according to example 1 of the present application, (a) ARG, (b) BMG, (c) MGE, (d) VFG;
FIG. 6 is a representation of ARGs host composition based on reads analysis according to example 2 of the present application;
FIG. 7 is an ARGs host composition based on a contigs analysis according to example 2 of the present application;
FIG. 8 is an ARGs host composition based on MAGs analysis according to example 2 of the present application;
FIG. 9 is a genetic background analysis of different ARGs according to example 3 of the present application;
FIG. 10 is a graph showing identification of contigs-based horizontal gene transfer events according to example 3 of the present application;
FIG. 11 is a MAGs-based horizontal gene transfer network according to example 3 of the present application, (a) LH1, (b) LH2;
FIG. 12 is a graph showing calculated risk indices for drug resistant groups for BMG weights of 0, 0.5 and 1 according to example 4 of the present application;
fig. 13 is a comparison of the evaluation results of the environmental resistance group risk evaluation method according to example 4 of the present application with the prior art method in the comparative example.
Detailed Description
The invention will now be described in further detail with reference to the following specific embodiments, which are given by way of illustration only and are not intended to limit the scope of the invention. The examples provided below are intended as guidelines for further modifications by one of ordinary skill in the art and are not to be construed as limiting the invention in any way.
The examples are not to be construed as limiting the specific techniques or conditions described in the literature in this field or as per the specifications of the product. The reagents or apparatus used were conventional products commercially available without the manufacturer's attention.
The present application was made based on the following recognition by the inventors:
in the related art, various methods are used to detect ARGs, including conventional bacterial culture methods, molecular biology methods based on PCR amplification, and metagenomic methods, etc. However, the bacterial culture method is time-consuming and is susceptible to culture conditions; the PCR amplification-based method is limited by the availability of primer sequences and cannot effectively track very diverse ARGs in the environment, so that it is difficult to satisfy the related research of environmental drug resistant groups (all ARGs and precursors thereof in genome or microbial community). The limitations of these traditional methods are overcome by high throughput sequencing technology based metagenomic methods, which involve extracting total DNA from environmental samples and performing high throughput sequencing, mainly by comparing the obtained metagenomic sequence data with specific antibiotic resistance gene databases (such as arbb, CARD, etc.) via BLAST algorithm, thereby knowing the diversity composition of ARGs in environmental samples. However, the existing metagenomic methods and tools are scattered, lack a reasonable, efficient and systematic data analysis flow, and cannot meet the multi-objective requirements of environmental drug resistance group research content.
To this end, the embodiments of the present application provide a complete set of bioinformatic analysis procedures based on metagenomic short sequences, contig sequences and genomic sequences, covering analysis of four classes of risk indicator genes, namely ARG, mobile Genetic Element (MGE), virulence Factors (VFG), bactericides and metal resistance genes (BMG), drug-resistant gene host recognition, and horizontal gene transfer event screening. The analysis flow provided by the embodiment of the application surrounds research contents such as gene annotation and quantification, host information identification, horizontal gene transfer analysis and the like, software with relatively stable and efficient performance is selected, a multi-level bioinformatics data analysis flow of short sequence-contig sequence-genome sequence is established, the complex analysis flow is modularized according to different requirements, the application convenience is improved, more various analysis results are output, and the possibility is provided for subsequent personalized statistical analysis; meanwhile, the analysis flow covers more functional genes closely related to the antibiotic drug resistance, and provides a gene comparison and quantification method based on a non-spliced sequence or an assembled sequence, so that the environment sample can be rapidly screened on a large scale, the occurrence condition of a target gene can be strictly judged, the false positive of gene identification is reduced, and diversified choices are provided for users. In addition, the analysis flow of the embodiment of the application also comprises ARGs host identification and a horizontal gene transfer screening method thereof, wherein in the aspect of ARGs host identification, the analysis flow of the embodiment of the application provides three different metagenome analysis strategies, in particular a strategy for directly identifying ARGs hosts in the environment based on short sequences, and compared with metagenome assembly and binning strategies, the analysis flow of the embodiment of the application has the advantages of simplifying calculation flow, obtaining high host diversity and the like; meanwhile, in the aspect of horizontal gene transfer screening of ARGs, corresponding analysis methods are provided for an assembly sequence and a genome sequence respectively, a microbiome horizontal gene transfer network is constructed, and a solution is provided for predicting the propagation process of ARGs from the aspects of genes and communities.
In the examples herein, the term "environmental resistant group" refers to all ARGs and their precursors in a metagenome or microbiota. It will be appreciated that by analysing more functional genes (MGE, VFG, BMG etc.) in a metagenome or community of microorganisms, the spread of drug resistance genes of the environment can be effectively analysed and assessed.
In the examples herein, an "environmental sample" is an extracted source of DNA for the metagenomic data analyzed. In the embodiment of the disclosure, the environmental sample may be intestinal microorganisms, water source microorganisms, soil microorganisms, and the like, and the sample source may be intestinal, water source, soil, and the like.
In the embodiments of the present application, the "relative abundance of a functional gene in an environmental sample" refers to the number of sequences containing the functional gene in the sample, where the sequences containing the functional gene may be reads, contigs and MAGs containing fragments or full length of the functional gene. It can be appreciated that by statistically calculating the abundance of functional genes at a plurality of levels of reads, contigs and MAGs, the occurrence of functional genes in the environment can be evaluated as a whole, and false positives in gene identification can be effectively reduced.
In the examples herein, "composition information of a functional gene" refers to all the forms and the ratios of the functional gene and optionally its similar sequences present in an environmental sample.
Embodiments of the first aspect of the present application provide a metagenome-based environmental resistance component analysis method, referring to fig. 1, which may include the steps of:
(1) Collecting an environmental sample, extracting and purifying a total DNA sample in the sample by using a standard method, performing high-throughput metagenomic sequencing, and performing quality control on original sequencing data (raw reads) to obtain a high-quality clean sequence (clean reads);
(2) Comparing and annotating the clean reads with a database of functional genes, identifying functional genes such as Antibiotic Resistance Genes (ARG), mobile Genetic Elements (MGE), virulence Factor Genes (VFG) and germicides and metal resistance genes (BMG), and calculating their relative abundance (i.e., first abundance) in each environmental sample;
(3) Assembling the clean reads of each environmental sample to obtain contig sequences (contigs);
(4) Predicting open reading frames (open reading frame, ORFs) of the contigs, obtaining nucleic acid sequences of the functional genes and translated protein sequences thereof, clustering the functional genes according to sequence similarity, and calculating the relative abundance of the functional genes to obtain a non-redundant gene set;
(5) Comparing and annotating the non-redundant gene set with different functional databases to obtain composition information and abundance information (namely second abundance) of the ARG, MGE, VFG, BMG and other functional genes;
(6) Dividing the contis of each environmental sample into boxes (binding), obtaining a Microorganism Genome Sketch (MAGs), removing redundancy of the MAGs, evaluating the quality of the MAGs according to the integrity and the pollution rate, and screening non-redundant high-quality MAGs meeting the requirements;
(7) Predicting ORF of non-redundant high-quality MAGs to obtain nucleic acid sequence of functional gene and translated protein sequence thereof, comparing the protein sequence with antibiotic resistance gene database, identifying MAGs (ARG-mapping MAGs, ACMs) containing ARG sequence, and counting ARG type and number (i.e. third abundance) carried by ACMs.
Based thereon, embodiments of the present application further provide host analysis strategies based on the functional genes at multiple levels of reads, contigs and MAGs (8), comprising:
(8-1) identifying ARG similar sequences (ARG-like reads, ALRs) in clean reads, performing species annotation on the ALRs, and counting the number of the species sequences as abundance (i.e., first host abundance);
(8-2) identifying the contigs (ACCs) comprising the ARG sequence, species annotating the ACCs, and calculating the relative abundance of each ACCs (i.e., the second host abundance) according to the methods of steps (4) and (5);
(8-3) species annotating the ACMs described in step (7) and calculating the relative abundance of each ACMs (i.e., the third host abundance).
Further, the embodiments of the present application also propose a lateral gene transfer potential analysis method (9) based on the functional genes at a plurality of levels of reads, contigs and MAGs, comprising:
(9-1) identifying chromosomal and plasmid sequences from all contigs and analyzing the ARGs composition carried by both sequences according to the methods described in steps (4) and (5);
(9-2) identifying from all the constituces the constitues likely to undergo horizontal gene transfer, and analyzing the ARGs composition carried by the constitues likely to undergo horizontal gene transfer according to the methods described in steps (4) and (5);
(9-3) constructing a gene transfer network based on the "donor-acceptor" pairing relationship based on the MAGs and ACMs described in steps (6) and (7), predicting potential horizontal gene transfer processes between ACMs, and between ACMs and non-ACMs (nACMs). In some embodiments, it may be further visualized and network analyzed.
In some embodiments, in step (1), the overall quality of raw reads can be checked using FastQC (v0.11.8), and quality control of the sequences is achieved using KneadData software (https:// bitbucket. Org/biobakery/KneadData), including removal of double-ended and low-quality sequences of sequences using the software Trimmomatic (v0.39), and host contamination removal by software Bowtie2 (v2.3.5).
In some embodiments, in step (2), the composition of ARGs, MGEs, VFGs and BMGs in the sample is obtained using ARGs-OAP (v 2.0) software, and the target gene-like sequences are pre-screened from clean reads using Ublast (v11.0.667) tools, respectively, and aligned to the functional gene database using Blastx (v2.2.28+) tools.
In some embodiments, in step (2), the functional gene databases may be SARG (v 2.2, https:// github.com/biofurure/Ublastx_stageone), MGE database (https:// github.com/Katariina Parnanan/Mobilegetic elementdatabase), VFDB (VFDB_setA_prot.fas, http:// www.mgc.ac.cn/VFs/main.htm) and Baet (v 2.0, http:// BacMet. Biomedicine. Gu /). It will be appreciated that other functional gene databases may be selected for comparison and annotation analysis, as long as effective detection of ARGs, MGEs, VFGs and BMGs in the sample and obtaining of their composition information and abundance information can be achieved, and the functional gene databases used are not limited in this application.
In some embodiments, in step (2), the functional gene abundance is calculated as (1) - (3):
(1)
(2)
(3)
wherein: abundance 16S Represents the normalized gene Abundance with the 16S rRNA gene sequence, abundance cell Represents the abundance of a gene normalized by the number of cells, C number Representing the number of cells per sample; n (N) i(mapped reads) Is aligned to a specific one in metagenome clear readsSequence number of the gene reference sequence; l (L) reads Is the sequence length obtained by high-throughput sequencing; l (L) i(reference reads) Is the length of a particular gene reference sequence; n (N) 16S Is the number of sequences identified as 16S rRNA genes in metagenomic clean reads; l (L) 16S Is the average length of all 16S rRNA gene sequences; n (N) 16S copy number Is the average copy number of the 16S rRNA gene sequence in the microbial cells; n is the number of reference sequences in the sample annotated to belong to a certain gene.
In some embodiments, in step (3), the clear reads are de novo spliced using MEGAHIT (v1.1.3) software to obtain contig sequences (contigs), and contigs with splice lengths above 500 bp are selected.
In some embodiments, in step (4), the condigs are predicted for ORFs using Prodigal (v 2.6.3) software, and the obtained ORFs are clustered using CD-Hit (v 4.8.1) software to obtain a non-redundant set of genes; the reads count for each gene was calculated using Salmon (v0.13.1) software and further normalized to obtain TPM (transcripts per kilobase million) values to characterize the abundance information of the gene in each sample.
In some embodiments, in step (5), the protein sequence corresponding to the ORF is aligned to a functional gene database using Blastp (v 2.6.0) tools.
In some embodiments, in step (6), the Microorganism Genome Sketch (MAGs) in each sample is extracted (binding module) and purified (bin_definition module) by using metaWRAP (v1.2.1) software, the screening threshold of the MAGs is set to be less than or equal to 10% of pollution degree and more than or equal to 50% of integrity degree, and the obtained MAGs are subjected to redundancy elimination by using dRep (v2.6.2) software.
In some embodiments, in step (7), the gene sequence of MAGs is predicted using the metaWRAP (v1.2.1) software annotate_bins module and aligned to the antibiotic resistance gene database using the Blastp (v2.6.0) tool.
In some embodiments, in step (8-1), ARG similar Sequences (ALRs) are screened from clean reads using Ublast (v11.0.667) tools, and species annotation is performed using Krake 2 (v2.0.8-beta) software and the GTDB (r 89) database.
In some embodiments, in step (8-2), all ACCs are species annotated using Kraken2 (v2.0.8-beta) software and the GTDB (r 89) database, and the relative abundance of each ACCs is calculated using the CoverM (v0.6.1) software.
In some embodiments, in step (8-3), species annotation of MAGs is performed using GTDB-Tk (v1.0.2) software and GTDB (r 89) database, and the relative abundance of each MAGs is calculated using the quant_bins module of metaWRAP (v1.2.1) software.
In some embodiments, in step (9-1), the chromosomal and plasmid sequences are identified using plasmid (v 1.1) software to type the contigs (excluding contigs less than 1000bp in sequence length).
In some embodiments, in step (9-2), the strategy software (https:// gitsub. Com/biobiaker/waaf) is used to identify the contigs that are likely to have a horizontal gene transfer event.
In some embodiments, in step (9-3), the MetaCHIP (v1.10.10) software is used to predict potential horizontal gene transfer events between different MAGs, and the pairing relationship of "ACM-ACM" and "ACM-nACM" is screened to construct a directed gene transfer network.
In some embodiments, in step (9-3), the network visualization is performed using Gephi (v0.9.2) software.
Therefore, the environmental drug resistance group analysis method based on the metagenome, which is proposed in the embodiment of the application, covers the analysis of a plurality of functional genes related to drug resistance of a plurality of layers of reads, contigs, genome (MAGs), and provides an analysis flow module related to host analysis and horizontal gene transfer, so that the problem that the whole analysis of the drug resistance genes in the environment is difficult to effectively support only for the ARGs gene in the prior art is solved; meanwhile, the method provided by the embodiment of the application integrates various analysis tools and databases in a flow way, improves the application convenience, outputs more various analysis results, and effectively realizes the high-efficiency data analysis of the multi-level polygene integrity of the environmental sample.
An embodiment of the second aspect of the present application further provides a metagenome-based environmental resistance group risk assessment method, wherein the method further integrates the analysis result based on the analysis result obtained by the metagenome environmental resistance group analysis method set forth in any embodiment of the first aspect. Fig. 2 is a flowchart of an environmental drug resistance group risk assessment method according to an embodiment of the present application, and referring to fig. 2, in some embodiments, the risk assessment method may specifically include:
(1) Extracting total DNA from environmental samples, performing metagenomic sequencing, performing quality control on raw data (raw reads) and removing host (such as human) sequences to obtain high-quality clean sequences (clean reads);
(2) Assembling clear reads of each sample to obtain contig sequences (contigs), and counting the number of contigs of each sample;
(3) Predicting Open Reading Frames (ORFs) for all contigs, aligning all ORFs with an ARG database, identifying contigs (ACCs) containing ARG sequences, and counting the number of ACCs in each sample;
(4) Comparing the ORF of all ACCs with MGE, VFG, BMG related functional databases to identify ACCs containing MGE, VFG and BMG sequences, and counting the number of ACCs containing different gene combinations;
(5) According to the types of the contained genes, corresponding weights are given to the different ACCs, summation is carried out, the number of all the constigs in each sample is combined for standardization, and the comprehensive risk index of each sample is calculated, namely the risk level of the environmental drug resistance group is represented.
In some embodiments, the specific operation procedures, the software and the parameter and functional gene databases in steps (1) - (4) may refer to the corresponding procedures in the embodiments of the first aspect of the present application, and are not described herein.
In some embodiments, in step (5), 8 types of ACCs are screened out, and different numbers of ACCs are weighted according to the number of types containing the target gene, wherein ARG, MGE, and VFG characterize the direct risk of the environmental resistance group, the weight is 1, and BMG characterizes the indirect risk of the environmental resistance group, the weight is 0.5. The overall risk index for each sample was calculated according to formulas (1) - (4):
(1)
(2)
(3)
(4)
wherein: nContig represents the total number of contigs in the contig dataset, α represents the number of ACCs, β 1 Represents the number of MGE-carrying ACCs, beta 2 Represents the number of ACCs carrying VFG, beta 3 Represents the number of BMG-carrying ACCs, gamma 1 Represents the number of ACCs carrying MGE and VFG, gamma 2 Represents the number of ACCs carrying MGE and BMG, gamma 3 Represents the number, delta, of ACCs carrying BMG and VFG 1 The number of ACCs carrying MGE, VFG and BMG simultaneously is indicated.
It can be understood that, in the environmental resistance group risk evaluation method based on the metagenome provided in the embodiment of the second aspect of the present application, the analysis result of the environmental resistance group analysis method based on the metagenome provides an effective evaluation index (i.e., environmental resistance group risk index) for the presence and transfer of the resistance genes in the environmental sample by giving appropriate weights to the number of contigs containing different functional gene types; meanwhile, the risk evaluation method provided by the embodiment of the application not only carries out risk classification on ARGs according to the relative abundance, mobility and existence conditions in pathogenic bacteria of the ARGs and then carries out comprehensive evaluation on drug resistance risks of an environmental sample, but also carries out reasonable weighting on the existence and types of the ARGs and more related functional genes to directly carry out risk evaluation, thereby taking care of the co-selection process of antibiotic drug resistance and other resistance mechanisms of microorganisms and realizing more complete and accurate risk evaluation of an environmental drug resistance group.
The experimental methods in the following examples, unless otherwise specified, are conventional methods, and are carried out according to techniques or conditions described in the literature in the field or according to the product specifications. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.
Unless otherwise indicated, the quantitative analysis tests in the following examples were all set up in triplicate, and the results averaged. Unless specifically stated otherwise, the parameters and command lines used in the data analysis processes in the following embodiments are default parameters and default processes in the corresponding software specifications, or are conventional parameters or processes in the art.
Example 1
In the embodiment, the drug resistance group study of the sewage treatment plant is taken as an example to show the drug resistance group integrity analysis flow provided by the application.
1.1 Sample collection and pretreatment
And (3) selecting a town sewage treatment plant as a research object, collecting water inflow (inf), secondary sedimentation tank water outflow (sc) and final water outflow samples (eff), filtering and intercepting DNA in the water sample by adopting a 0.22 mu m mixed cellulose ester filter membrane (Millipore, germany), and storing the water sample in a refrigerator at the temperature of-20 ℃ until the water sample is treated after each sample point obtains 3-4 filter membranes.
1.2 DNA extraction and metagenomic sequencing
The 0.22 μm filters were cut into 2-3 mm size squares and based on these squares, DNA extraction was performed using a PowerSoil DNA Isolation Kit (Mo Bio, usa) kit, see the instructions for the kit for specific operation. The DNA sample which is qualified in detection is used for library construction, and PE150 sequencing is carried out on the library which is qualified in quality detection by adopting an Illumina Hiseq 2500 high-throughput sequencing platform, so that an original sequencing sequence (raw reads) is obtained.
1.3 Metagenomic data analysis
1.3.1 Quality control, assembly and binning of metagenomic sequences
(1) Firstly, using FastQC (v0.11.8) software to check the whole quality of raw reads, using a KneadData procedure (https:// bitbucket. Org/biobias/KneadData) to control the quality of sequences, including removing double-end-terminal sequences and low-quality sequences of sequences by Trimmomatic (v 0.39) software (parameter setting: SLIDINGWINWINDOW: 4:20 MINLEN: 50), and removing human host pollution by Bowtie2 (v2.3.5) software to obtain non-joint high-quality reads after quality control filtering, namely clean reads.
(2) The clear reads were subjected to de novo splicing (default parameter setting) using software MEGAHIT (v1.1.3) to obtain contig sequences (contigs), and contigs with splice lengths above 500 bp were screened for subsequent analysis.
(3) Performing ORF prediction (parameter setting: -p meta) on connigs by using software Prodigal (v2.6.3), and clustering (parameter setting: -aS 0.9-c 0.95) on the obtained ORFs by using software CD-Hit (v4.8.1) to obtain a non-redundant gene set; the reads count for each gene was calculated using the software Salmon (v0.13.1) and further normalized to obtain TPM (transcripts per million) values to characterize the abundance information of the genes in the individual samples.
(4) Based on the clean ready after quality control and the assembled condigs, extracting and purifying a Microorganism Genome Sketch (MAGs) in each sample by using software metaWRAP (v1.2.1), predicting the gene sequence of the MAGs, setting the screening threshold of the MAGs to be less than or equal to 10% of pollution degree and more than or equal to 50% of integrity, and using software dRep (v2.6.2) to carry out redundancy elimination on the obtained MAGs (parameter setting is-sa 0.95, -nc 0.30).
1.3.2 Qualitative and quantitative identification of different risk indicator genes
(a) Based on clean reads: the composition of the different genes in the samples was obtained using the ARGs-OAP (v 2.0) procedure (https:// gitsub. Com/biocure/Ublastx_stageone). The first step is to pre-screen the target gene-like sequences from clean reads using Ublast tool (v11.0.667), and the second step is to set (e-value.ltoreq.10) according to default parameters using Blastx tool (v2.2.28+) -7 Sequence similarity is more than or equal to 80%, and alignment is carried outLength 25 or more aa) and performing annotation and classification of genes, wherein the used functional databases include SARG (v 2.2, https:// gilthub.com/biofurure/Ublastx_stageone), MGE database (https:// gilthub.com/Katarainea Parnann/mobilegenetics elementDatabase), VFDB (VFDB_setA_prot.fas, http:// www.mgc.ac.cn/VFs/main.htm) and Bacmet (v 2.0, http:// bacmet.biomedicine.gu.se /) for annotation ARG, MGE, VFG and BMG, respectively. The abundance of the target gene is normalized according to the following formula, whereby a free parallel comparison can be made between different samples or different gene types.
(1)
(2)
(3)
Wherein: abundance 16S Represents the normalized gene Abundance with the 16S rRNA gene sequence, abundance cell Represents the abundance of a gene normalized by the number of cells, C number Representing the number of cells per sample; n (N) i(mapped reads) The number of sequences in metagenome clear reads aligned to a specific gene reference sequence; l (L) reads Is the sequence length obtained by high-throughput sequencing; l (L) i(reference reads) Is the length of a particular gene reference sequence; n (N) 16S Is the number of sequences identified as 16S rRNA genes in metagenomic clean reads; l (L) 16S Is the average length of all 16S rRNA gene sequences; n (N) 16S copy number Is the average copy number of the 16S rRNA gene sequence in the microbial cells; n is the number of reference sequences in the sample annotated to belong to a certain gene.
(b) Based on condigs: the protein sequences of the non-redundant gene sets are respectively compared with different functional gene databases by using a Blastp tool (v 2.6.0) to obtain diversity and abundance information of genes such as ARG, MGE, BMG, VFG and the like.
The functional database used is the same as the database in (a) above.
(c) Based on genome level: predicting the ORF of the non-redundant high-quality MAGs obtained in the step (4) of 1.3.1 to obtain the nucleic acid sequence of the gene and the translated protein sequence thereof, comparing the protein sequence with an antibiotic resistance gene database, identifying MAGs (ACMs) containing ARG sequences, and counting the types and the number of ARGs carried by the ACMs.
Similarly, MAGs containing MGE, BMG and VFG sequences were identified separately using the same procedure, and the type and number of MGE, BMG and VFG carried by each were counted.
Results:
table 1 below summarizes the amount of detection for ARG, MGE, BMG and VFG in the wastewater samples based on different levels in this example. As shown in FIG. 3, the clear reads of the metagenome of different wastewater samples of two wastewater treatment systems were genetically annotated. FIG. 3 (a) shows the relative abundance and composition of ARGs in different samples, and the result shows that the detected amount and relative abundance of ARGs in the inlet water sample are higher, and the sewage treatment process has a certain reduction effect on the diversity of ARGs. Among them, multidrug (multidrug), sulfonamide (sulfanilamide), aminoglycoside (aminoglycoside), macrolide-lincosamide-streptomycin (MLS), β -lactam (beta-lactam) and tetracycline (tetracyclic) resistance genes are the main ARG types in the sample. Fig. 3 (b) shows the relative abundance and composition of BMGs in different samples, with the relative abundance of BMG in the final effluent sample of LH2 being significantly higher than in other samples, with the highest metal Hg resistance gene duty cycle. FIGS. 3 (c) and (d) show the relative abundance and composition of MGE and VFG, respectively, in different samples, with LH1_sc and LH2_eff being more abundant than the other samples, with the transposase gene (transposase) being dominant in each sample, the major VFG categories including adhesion, motility, and immunomodulation (immune modulation), among others.
FIG. 4 shows the relative abundance and composition of ARG, BMG, MGE and VFG genes in different samples identified based on the contigs levels, which are consistent with the reads-based analysis in terms of the gene type, indicating that the analytical methods of the present application can yield stable and consistent results at various levels. FIG. 5 shows the distribution of the number of different genes identified based on MAGs levels in each sample, with fewer genes than contigs levels being detected due to more stringent screening conditions, but the genes identified in this way are more likely to be present in the environmental sample and to function correspondingly by certain specific microorganisms.
TABLE 1 number of different genes identified at three analysis levels
Therefore, the environmental drug resistance group analysis method based on the metagenome provided by the embodiment of the application can produce a higher number of drug resistance related genes on a plurality of layers (reads, contigs and MAGs), so that the environmental drug resistance group information can be sensitively and comprehensively captured by the method of the embodiment of the application; meanwhile, as can be seen from fig. 3 to 5, the distribution of the drug resistance related genes produced by each layer is similar, which indicates that the method provided by the embodiment of the application can provide relatively stable and accurate analysis results of the drug resistance groups on a plurality of layers, so that systematic environmental drug resistance group risk assessment can be reasonably and efficiently performed by integrating the drug resistance group information of the plurality of layers.
Example 2
The present embodiment further performs host information identification of multi-level ARGs based on the data in embodiment 1, and the specific flow includes:
(1) Based on clean reads: the clear reads were first aligned with SARG (v 2.2) using Ublast tool (v11.0.667) and Blastx tool (v2.2.28+) and ARG-like reads (ALRs) were selected from (e-value. Ltoreq.1)0 -7 Sequence similarity is more than or equal to 80%, and alignment length is more than or equal to 25 aa); and then, matching species information of the extracted ALRs by using a software Krake 2 (v2.0.8-beta) database and a GTDB (r 89) database, and counting the distribution condition of the ALRs in each sample to obtain the composition information of ARB communities in different samples.
(2) Based on condigs: the contigs (ACCs) carrying ARG were first screened out using Blastp tool (v2.6.0) and SARG (v 2.2) databases, then all ACCs were matched for species information using software Krake 2 (v2.0.8-beta) and GTDB (r 89) databases, and the relative abundance of each ACCs was calculated using software cover M (v0.6.1, https:// gitub.com/wwood/cover M) as indicated by the average number of sequences (mean) aligned to the different sites of the ACCs.
(3) Based on MAGs: alignment of MAGs protein sequences with the SARG (v 2.2) database using BLASTP tool (v 2.6.0) (e-value. Ltoreq.10) -5 Annotating ARGs in the MAGs to obtain MAGs (ACMs) carrying the ARGs, wherein the identity is more than or equal to 80 percent and the coverage is more than or equal to 70 percent; the ACMs were then annotated with species using the software GTDB-Tk (v1.0.2) and GTDB (r 89) databases, and the quant_bins module of software metaWRAP (v1.2.1) gave relative abundance information for each ACMs, expressed as genome copies/ppm reads per million sequences.
Results:
FIGS. 6, 7 and 8 show ARG host composition in different wastewater samples identified by the three strategies ALR, ACC and ACM, respectively, and the results show that Gamma-proteobacteria are the most dominant ARG hosts and exhibit higher consistency. The ARG hosts which can be identified based on the ALR strategy have higher diversity, comprise low-abundance species, can establish direct association between host abundance and ARGs abundance, have a simpler analysis flow, and can remarkably save calculation resources and time. The strategy based on ACCs and ACMs can deeply analyze the functional information of ARGs flanking sequences, identify the genome characteristics of ARG hosts, and provide references for characterizing the environmental risks and researching targeted management and control measures.
Example 3
This example further carried out a horizontal gene transfer prediction of multi-level ARGs based on the data in examples 1 and 2, and the specific procedure included:
(1) Based on condigs: the method comprises the steps of typing the contigs according to default parameters (-c 0.7) by using software plasmid (v 1.1) (the contigs with the sequence length smaller than 1000bp are excluded according to the advice of software developers), identifying chromosome sequences and plasmid sequences, and analyzing ARGs carried by the two types of sequences. The software WAAFLE was used to screen different metagenomic datasets for contigs (> 500 bp) in which horizontal (lateral) gene transfer events occurred.
(2) Based on MAGs: based on the analysis of step (3) of example 2, ACMs and non-ACMs (nACMs) in the MAGs dataset were distinguished, and the potential horizontal gene transfer events between ACMs, or between ACMs and nACMs, were extracted using software MetaCHIP, and the direction of gene transfer could be predicted, MAGs nucleic acid sequences and their species annotation information as input files. Finally, the pairing relation of 'ACMs-ACMs' or 'ACMs-nACMs' is obtained, a directed gene transfer network is constructed, and the mapping relation is visualized through software Gephi (v0.9.2).
Results:
based on the results of sequence typing (fig. 9), the relative abundance of ARGs encoded on the plasmid sequences in the samples was higher than the other two classes of sequences, indicating that most ARGs in wastewater samples are generally more mobile, with aminoglycosides, tetracyclines, sulfonamides, MLS and β -lactam resistance genes encoded more on the plasmid sequences. The WAAFLE software is used for identifying the number of the contigs of the HGT event in the contigs data set of different samples, as shown in figure 10, and the result shows that the number of the contigs of the HGT is highest in the water inlet sample and gradually decreases with the wastewater treatment process. As shown in fig. 11, based on the recognition of ACMs, a horizontal gene transfer network model was constructed for each of two sewage treatment systems, and table 2 is characteristic parameter data related to the network. In the figure, each node represents one MAG, wherein red represents ACM, blue represents non-ACM, the node size represents the number of carrying ARGs, each side represents the possibility of HGT events between two nodes, and the thickness of the side represents the number of HGTs. By comparing the two network structures and related parameters, the probability and frequency of occurrence of HGT by ARGs in LH2 are higher.
Table 2 horizontal transfer network features
Example 4
In this example, 4 independent sewage treatment systems along the coast of Hangzhou bay in China and Hangzhou bay area are taken as research objects, wherein the sewage treatment systems comprise a domestic sewage treatment System (SD), an industrial wastewater treatment System (SI) and two mixed sewage treatment systems (LH 1 and LH 2), inlet water (inf), secondary sedimentation tank outlet water (sc) and final outlet water samples (eff) are collected, 10 sampling points are distributed in the Hangzhou bay area, and surface sediment samples (HB 1-HB 10) are collected.
4.1 Sample pretreatment
Step 1.1 as in example 1.
4.2 Sample DNA extraction and metagenomic sequencing
Step 1.2 as in example 1.
4.3 metagenomic data analysis
(1) Quality control of the sequences was achieved based on the KneadData procedure (https:// bitbucket. Org/biobiaker/KneadData), including removal of double-ended splice sequences and low-quality sequences of the sequences by software Trimmomatic (v 0.39) (parameter settings: SLIDINGWINWINDOW: 4:20 MINLEN: 50), and host contamination was removed by software Bowtie2 (v 2.3.5).
(2) And (3) performing de novo splicing (parameter setting: k-min 27 k-max 191 step 20) on clear reads by adopting software MEGAHIT (v1.1.3), obtaining contig sequences (contigs), and screening the contigs with splicing length of more than 500 bp for subsequent analysis.
(3) The contigs were predicted ORF (open reading frame) using software Prodigal (v 2.6.3) (parameter set: -p meta). The ORFs were aligned with SARG (v 2.2, https:// gitsub.com/biofurure/Ublastx_stageone) using BLASTP tool (v 2.6.0) to identify contigs (ACCs) carrying ARGs
(4) The ORF of ACCs was compared with databases of ACLAME (v 0.4, http:// ACLAME. Ac. Be /), bacMet (v 2.0, http:// BacMet. Biomedicine. Gu. Se /) and VFDB (VFDB setA_prot. Fas, http:// www.mgc.ac.cn/VFs/main. Htm), respectively, using BLASTP tool (v2.6.0), to obtain the appearance of genes such as MGEs, BMGs and VFGs on ACCs.
4.4 drug resistant group Risk index calculation
4.4.1 All ACCs of each sample were classified into 8 types, the number of different ACCs was weighted according to the number of types containing the target gene, wherein ARG, MGE and VFG characterized the direct risk of the environmental resistance group with a weight of 1, and BMG characterized the indirect risk of the environmental resistance group with weights set to 0, 0.5 and 1, respectively, and the overall risk index was calculated as follows (1).
(1)
4.4.2 Results
Fig. 12 shows calculated drug resistant group risk indices for BMG weights of 0, 0.5 and 1. As shown in fig. 12, the risk level of the sewage treatment system samples is generally higher than that of the sediment samples in the gulf-of-Hangzhou, and the final drainage of the industrial wastewater treatment system in particular still shows higher risk of drug resistance group, which is consistent with the prior knowledge, thereby proving the credibility of the risk assessment of the present embodiment. Second, in the effluent samples, if the risk of BMG is ignored entirely (weight=0), the final risk score will be significantly reduced. This suggests that ARGs and BMGs are more likely to co-occur in wastewater environments due to the presence of various chemical contaminants and heavy metals, and that the presence of BMGs also increases the ability of ARBs to proliferate and diffuse more stably under external environmental pressures. Thus, the contribution of BMGs to the overall risk of antibiotic-resistant bacteria is not negligible, especially in severely contaminated environments. Setting the BMG weight to 0.5 and 1 gives little difference in final risk scores, and setting the BMG weight to 0.5 can more carefully treat its immediate risk contribution.
Thus, based on a BMG weight of 0.5, the overall risk index for each sample was calculated according to formulas (1) - (4):
(1)/>
(2)
(3)
(4)
wherein: n is n Contig Representing the total number of contigs in the contig dataset, α represents the number of ACCs, β1 represents the number of ACCs carrying MGE, β2 represents the number of ACCs carrying VFG, β3 represents the number of ACCs carrying BMG, γ1 represents the number of ACCs carrying MGE and VFG, γ2 represents the number of ACCs carrying MGE and BMG, γ3 represents the number of ACCs carrying BMG and VFG, and δ1 represents the number of ACCs carrying MGE, VFG and BMG simultaneously.
Comparative example
This comparative example the samples of example 4 were risk rated using the existing risk rating method (meta procedure) (https:// github. Com/minoh 0201/metacomputer), as compared to the results obtained by the method presented in example 4, see figure 13.
As shown in fig. 13, the risk evaluation method according to the embodiment of the present application recognizes a consistent risk variation pattern in the wastewater sample as compared with the existing method, thereby proving the feasibility and accuracy of the method according to the present application. Further, compared to the risk score calculated by metaompare, the method proposed by the embodiment of the present application can identify sampling points with higher risk, such as HB5 and HB6. Compared with metaompare, through exploration and verification, the method provided by the embodiment of the application sets a stricter threshold for identifying risk genes on contigs, and meanwhile, consideration of related functional genes such as BMGs is increased, so that comprehensive risk level assessment is achieved more comprehensively and accurately.
To achieve the above object, an embodiment of the present disclosure further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program is executed by the processor to implement the metagenome-based environmental resistance component analysis method or the metagenome-based environmental resistance component risk assessment method according to any one of the embodiments above.
To achieve the above object, the embodiments of the present disclosure further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the metagenome-based environmental resistance group analysis method or the metagenome-based environmental resistance group risk evaluation method according to any of the embodiments above.
To achieve the above object, the embodiments of the present disclosure further propose a computer program product comprising a computer program, wherein the computer program, when being executed by a processor, implements the metagenomic-based environmental resistance group analysis method or the metagenomic-based environmental resistance group risk assessment method according to any of the embodiments above.
To achieve the above object, the embodiments of the present disclosure further propose a computer program, wherein the computer program comprises computer program code, which when run on a computer, causes the computer to perform the metagenomic based environmental resistance component analysis method or the metagenomic based environmental resistance component risk assessment method as described in any of the embodiments above.
It should be noted that the foregoing explanation of the embodiments of the metagenome-based environmental resistance component analysis method or the metagenome-based environmental resistance component risk assessment method also applies to the computer device, the non-transitory computer-readable storage medium, the computer program product and the computer program of the embodiments of the present disclosure, and are not repeated herein.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (6)

1. A method of metagenome-based environmental resistance analysis, the method comprising:
obtaining metagenomic data for an environmental sample based on the environmental sample; and
identifying a functional gene based on the metagenomic data and calculating the relative abundance of the functional gene in the environmental sample, comprising:
a. aligning and annotating read long sequences based on the metagenomic data to identify the functional gene and calculating a first abundance of the functional gene in the environmental sample;
b. assembling the read long sequences to obtain contig sequences, and aligning and annotating based on the contig sequences to identify the functional gene, and calculating a second abundance of the functional gene in the environmental sample; and
c. binning the contig sequences to obtain a microbial genome sketch, and aligning and annotating based on the microbial genome sketch to identify the functional genes, and calculating a third abundance of the functional genes in the environmental sample,
Wherein the functional genes include one or more of the following: antibiotic resistance genes, mobile genetic elements, virulence factor genes and germicides and metal resistance genes,
in step b, the aligning and annotating based on the contig sequence to identify the functional gene specifically comprises:
predicting the open reading frame of the contig sequence to obtain a nucleic acid sequence of a gene corresponding to the contig sequence and a translated protein sequence thereof;
clustering the genes corresponding to the contig sequences based on the sequence similarity of the nucleic acid sequences of the genes corresponding to the contig sequences and the translated protein sequences, calculating the gene abundance of the genes corresponding to the contig sequences and obtaining a non-redundant gene set; and
comparing and annotating the non-redundant gene set based on a functional gene database to identify the functional genes and obtain composition information and abundance information of the functional genes, wherein in the comparison, the non-redundant gene set is e-value less than or equal to 10 -5 The gene sequence with the sequence similarity more than or equal to 80% and the coverage more than or equal to 70% is identified as the functional gene; and is also provided with
Wherein in step c, said aligning and annotating based on said microbial genome sketch to identify said functional genes further comprises:
Performing redundancy elimination and quality control on the microbial genome sketch to obtain a non-redundant high-quality microbial genome sketch;
predicting the open reading frame of the non-redundant high-quality microbial genome sketch to obtain the nucleic acid sequence of each corresponding gene in the non-redundant high-quality microbial genome sketch and the translated protein sequence thereof; and
comparing and annotating the translated protein sequences of the corresponding genes based on a functional gene database to identify the functional genes and obtain composition information and abundance information of the functional genes,
wherein the quality control is performed on the microbial genome sketch based on an integrity and a contamination rate, wherein the integrity is required to be equal to or more than 50% and the contamination rate is required to be equal to or less than 10%,
wherein the calculation of the first abundance of the functional gene in the environmental sample is based on the number of 16S rRNA gene sequences and the number of cells, and the calculation formula is:
wherein Abundance 16S Representing functional gene Abundance normalized by 16S rRNA gene sequence, abundance cell Representing the abundance of functional genes normalized by cell number, C number Representing the number of cells per environmental sample; n (N) i(mapped reads) The number of sequences in metagenome reads aligned to a specific gene reference sequence; l (L) reads Is the sequence length obtained by high-throughput sequencing; l (L) i(reference reads) Is the length of a particular gene reference sequence; n (N) 16S Is the number of sequences identified as 16S rRNA genes in metagenomic reads; l (L) 16S Is the average length of all 16S rRNA gene sequences; n (N) 16S copy number Is the average copy number of the 16S rRNA gene sequence in the microbial cells; n is the number of reference sequences annotated to belong to a certain gene in the environmental sample.
2. The method according to claim 1, wherein the method further comprises:
performing a host assay based on the functional gene, comprising:
i. identifying a functional gene-like sequence in the read long sequence, and annotating the functional gene-like sequence in species toObtaining the sequence number of the functional gene similar sequence in each species as the abundance of a first host, wherein the sequence similarity of the functional gene similar sequence and the functional gene is more than or equal to 80%, the comparison length is more than or equal to 25aa, and the e value is less than or equal to 10 -7
identifying a contig sequence containing a functional gene sequence in the contig sequence, and annotating the contig sequence containing the functional gene sequence to obtain the number of sequences of the contig sequence containing the functional gene sequence in each species as a second host abundance; and
identifying a microbial genome sketch containing a functional gene sequence in the microbial genome sketch, and annotating the microbial genome sketch containing the functional gene sequence in species to obtain the sequence number of the microbial genome sketch containing the functional gene sequence in each species as a third host abundance.
3. The method according to claim 2, wherein the method further comprises:
performing a lateral gene transfer potential assay based on the functional gene, comprising:
identifying a chromosome-derived sequence and a plasmid-derived sequence in the contig sequence, and obtaining composition information of the functional genes carried by the chromosome-derived sequence and the plasmid-derived sequence;
y. identifying one of said contig sequences in which lateral gene transfer is likely to occur and obtaining composition information for said functional gene carried by said one of said contig sequences in which lateral gene transfer is likely to occur; and
z. based on the microbial genome sketch and the microbial genome sketch containing the functional gene sequence, predicting a lateral gene transfer process between the microbial genome sketch containing the functional gene sequence and the microbial genome sketch not containing the functional gene sequence, and constructing a lateral gene transfer network of a 'donor-acceptor' pairing relation.
4. The method of claim 2, wherein step ii further comprises: calculating the second host abundance by using software CoverM, wherein the average sequence number of the contig sequences containing the functional gene sequences in the sequence numbers of each species is taken as the second host abundance; and is also provided with
Step iii further comprises: the third host abundance is calculated using the quant_bins module of metaWRAP version 1.2.1, where the third host abundance is expressed in copies of the genome per million sequences.
5. The method of claim 3, wherein the step of,
step x further comprises: filtering the contig sequence to remove the contig sequence having a sequence length shorter than 1000 bp; and
typing the filtered contig sequences using version 1.1 of Plasflow software to identify chromosomal and plasmid-derived sequences in the filtered contig sequences, and
the step y further comprises: identifying, using WAAFLE software, one of the contig sequences that is likely to undergo lateral gene transfer, wherein the contig sequences in different metagenomic datasets that undergo lateral gene transfer are screened using WAAFLE software based on sequence lengths greater than 500 bp.
6. A metagenome-based environmental resistance group risk assessment method, the method comprising:
the metagenome-based environmental resistance component analysis method according to any one of claims 1 to 5, obtaining composition information and abundance information of functional genes of an environmental sample at read length sequence, contig sequence and microbial genome sketch level, respectively, wherein the functional genes comprise one or more of the following: antibiotic resistance genes, mobile genetic elements, virulence factor genes, and sterilant and metal resistance genes;
comparing the open reading frames of all contig sequences obtained by assembly with a functional gene database to obtain a first contig sequence, wherein the first contig sequence contains any one of the functional genes of the antibiotic resistance gene, the mobile genetic element, the virulence factor gene, and the germicide and metal resistance genes;
aligning the open reading frame of the first contig sequence with the functional gene database to obtain a second contig sequence, wherein the second contig sequence contains at least one of the other three functional genes;
counting the functional gene combination types contained in the first contig sequence and the second contig sequence and the corresponding contig sequence numbers, and giving weight to the contig sequence numbers; and
Calculating an environmental drug resistance group risk index of the environmental sample based on the weights, wherein the weights are added to obtain the environmental drug resistance group risk index of the environmental sample,
wherein the functional gene database is selected from one or more of the following: SARG, MGE database, ACLAME, VFDB and BacMet,
wherein the types of functional gene combinations contained in the first contig sequence and the second contig sequence and weights thereof are as shown in the following table:
the calculation formula of the risk index is as follows:
wherein:
β=2β 1 +2β 2 +1.5β 3 (2)
γ=3γ 1 +2.5γ 2 +2.5γ 3 (3)
δ=3.5δ 1 (4)
nContig is the total number of all contig sequences obtained by the assembly.
CN202311389711.8A 2023-10-25 2023-10-25 Metagenome-based environmental drug resistance component analysis method Active CN117174165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311389711.8A CN117174165B (en) 2023-10-25 2023-10-25 Metagenome-based environmental drug resistance component analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311389711.8A CN117174165B (en) 2023-10-25 2023-10-25 Metagenome-based environmental drug resistance component analysis method

Publications (2)

Publication Number Publication Date
CN117174165A CN117174165A (en) 2023-12-05
CN117174165B true CN117174165B (en) 2024-03-12

Family

ID=88945246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311389711.8A Active CN117174165B (en) 2023-10-25 2023-10-25 Metagenome-based environmental drug resistance component analysis method

Country Status (1)

Country Link
CN (1) CN117174165B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334750A (en) * 2018-04-19 2018-07-27 江苏先声医学诊断有限公司 A kind of macro genomic data analysis method and system
CN110819699A (en) * 2019-11-27 2020-02-21 辽宁大学 Quantitative detection method for human excrement indicator in water environment
CN111944914A (en) * 2020-07-16 2020-11-17 中国科学院生态环境研究中心 Method for evaluating water health risk based on resistance gene and virulence factor gene
CN113689912A (en) * 2020-12-14 2021-11-23 广东美格基因科技有限公司 Method and system for correcting microbial contrast result based on metagenome sequencing
KR20220069626A (en) * 2020-11-20 2022-05-27 대한민국(농촌진흥청장) Method for providing information on poultry breeding environment using metagenomic resources of microbiome in poultry cecum
CN114822697A (en) * 2022-03-28 2022-07-29 南京农业大学 Method for analyzing drug-resistant gene pollution of traced soil by using metagenome

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334750A (en) * 2018-04-19 2018-07-27 江苏先声医学诊断有限公司 A kind of macro genomic data analysis method and system
CN110819699A (en) * 2019-11-27 2020-02-21 辽宁大学 Quantitative detection method for human excrement indicator in water environment
CN111944914A (en) * 2020-07-16 2020-11-17 中国科学院生态环境研究中心 Method for evaluating water health risk based on resistance gene and virulence factor gene
KR20220069626A (en) * 2020-11-20 2022-05-27 대한민국(농촌진흥청장) Method for providing information on poultry breeding environment using metagenomic resources of microbiome in poultry cecum
CN113689912A (en) * 2020-12-14 2021-11-23 广东美格基因科技有限公司 Method and system for correcting microbial contrast result based on metagenome sequencing
CN114822697A (en) * 2022-03-28 2022-07-29 南京农业大学 Method for analyzing drug-resistant gene pollution of traced soil by using metagenome

Also Published As

Publication number Publication date
CN117174165A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
Gruber-Vodicka et al. phyloFlash: rapid small-subunit rRNA profiling and targeted assembly from metagenomes
Ren et al. Identifying viruses from metagenomic data using deep learning
Nayfach et al. Toward accurate and quantitative comparative metagenomics
Yue et al. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets
Bickhart et al. Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation
Almeida et al. A new genomic blueprint of the human gut microbiota
CN110349629B (en) Analysis method for detecting microorganisms by using metagenome or macrotranscriptome
Miller et al. Autometa: automated extraction of microbial genomes from individual shotgun metagenomes
Freitas et al. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures
Zielińska et al. The choice of the DNA extraction method may influence the outcome of the soil microbial community structure analysis
Pond et al. Windshield splatter analysis with the Galaxy metagenomic pipeline
Louvel et al. meta BIT, an integrative and automated metagenomic pipeline for analysing microbial profiles from high‐throughput sequencing shotgun data
Nishimura et al. The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments
CN104603283A (en) Method and system to determine biomarkers related to abnormal condition
US20150376697A1 (en) Method and system to determine biomarkers related to abnormal condition
Saheb Kashaf et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data
CN105279391A (en) Metagenome 16S rRNA high-throughput sequencing data processing and analysis process control method
Zhidkov et al. MitoBamAnnotator: A web-based tool for detecting and annotating heteroplasmy in human mitochondrial DNA sequences
Gostel et al. Microfluidic Enrichment Barcoding (MEBarcoding): A new method for high throughput plant DNA barcoding
Haryono et al. Recovery of high quality metagenome-assembled genomes from full-scale activated sludge microbial communities in a tropical climate using longitudinal metagenome sampling
Basset et al. Comparison of traditional and DNA metabarcoding samples for monitoring tropical soil arthropods (Formicidae, Collembola and Isoptera)
Churcheward et al. MAGNETO: an automated workflow for genome-resolved metagenomics
Hickl et al. binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets
CN117174165B (en) Metagenome-based environmental drug resistance component analysis method
Albanese et al. Large-scale quality assessment of prokaryotic genomes with metashot/prok-quality

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant