WO2014019164A1 - Method and apparatus for analyzing microbial community composition - Google Patents
Method and apparatus for analyzing microbial community composition
- Publication number
- WO2014019164A1 (application PCT/CN2012/079492)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequencing
- stack
- module
- reference set
- fragments
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6809—Methods for determination or identification of nucleic acids involving differential detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/689—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- the present invention relates to the field of metagenomics and bioinformatics.
- the present invention relates to methods and apparatus for analyzing the composition of a microbial community in an environmental sample.
Background technique
- Metagenomics, also known as environmental genomics, ecogenomics, or community genomics, is the discipline that directly studies the microbial communities (the sum of cultivable and non-cultivable bacteria, fungi, viruses, etc.) in various environments (such as the natural environment). Studying microbial communities and species diversity in various environments has particular benefits. For example, studies of microbial communities and species diversity in the human circulatory environment are very useful for clinical drug development relating to the flora and for understanding the metabolic pathways of human bacteria. However, due to the limitations of traditional research methods, we know very little about the composition of microorganisms in the environment (such as the intestinal environment). In particular, many species cannot be identified by traditional research methods involving culture, as the environment may contain non-cultivable bacteria, fungi or viruses.
- Velvet (Zerbino and Birney 2008), EULER-SR (Chaisson and Pevzner 2008), Newbler (Margulies et al. 2006) and Soapdenovo (Li et al. 2009).
- the binning method has been widely used to determine the affiliation of connected fragments or spliced fragments, including but not limited to similarity-based methods such as MEGAN (Huson et al. 2007) and CARMA (Tzahor et al. 2009), which assign fragments by sequence alignment with a reference genome; and composition-based stacking methods, e.g. based on GC content or k-mer frequency (Schbath et al.
- the goal of metagenomics research is to reconstruct the genomes of various microorganisms in environmental samples to analyze the microbial community composition in environmental samples.
- the above method separates the assembly from the stack, and each focuses on only one aspect. Therefore, the above methods do not fully achieve the research goals of metagenomics.
- moreover, the algorithms and procedures adopted by the different methods are not necessarily compatible, so whether the final result can achieve the research goal of metagenomics, and how accurate and effective that result is, are unpredictable.
- the term "environment" refers to a variety of environments in a broad sense including, but not limited to, natural environments (e.g., soil environments, marine environments, river environments) and in vivo environments (e.g., oral environment, intestinal environment). More precisely, the term "environment" refers to any area where a microorganism or microbial community may be present.
- environmental sample refers to a sample from various environments that may contain a microbial/microbial community.
- microorganism has the meaning commonly understood by those skilled in the art including, but not limited to, bacteria, fungi, and viruses.
- microbial community refers to a combination of various microorganisms that are brought together in a particular environment.
- various microorganisms in the same microbial community not only interact with each other directly or indirectly, but also interact with the environment in which they live: changes in the environment lead to changes in the composition of the microbial community (including changes in microbial species and/or abundance); in turn, changes in the composition of the microbial community also affect the environment.
- the term "metagenome" refers to the sum of the genomes of the various organisms in a community; for a microbial community, it is the sum of the genomes of the various microorganisms in that community.
- the term "metagenomic sequencing data" refers to data obtained by sequencing the entire metagenome. Because a metagenome contains a vast amount of DNA information, it is usually sequenced using high-throughput sequencing technologies such as second-generation or third-generation sequencing. However, the desired metagenomic sequencing data can also be obtained by other methods or from other sources. Sequencing data is typically composed of a large number of sequencing reads.
- Second generation sequencing techniques are well known to those skilled in the art and include, for example, 454 sequencing (Roche), Solexa sequencing (Illumina), SOLiD sequencing (ABI), and single-molecule sequencing. For a detailed review of second generation sequencing technologies, see for example,
- sequences of low sequencing quality are known to those skilled in the art and can be determined, for example, by the sequencing platform and sequencing software during the sequencing process (see Quality Scores for Next-Generation Sequencing, Technical Note: Sequencing, Illumina).
- the expression “de-redundancy” means that for sequences having a similarity of 95% or more to each other, only one is retained, for example, the repeated connected segments and the spliced segments are removed.
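The de-redundancy rule above can be illustrated with a minimal sketch. It assumes a greedy strategy (longest sequence first) and uses `difflib.SequenceMatcher` as a stand-in similarity score; a real pipeline would compute identity with a sequence aligner. The function names `similarity` and `deduplicate` are hypothetical and do not come from the patent.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def deduplicate(seqs, threshold=0.95):
    """Keep one representative of every group of sequences >= threshold similar."""
    kept = []
    # Visit longer sequences first so the longest member of a group is retained.
    for seq in sorted(seqs, key=len, reverse=True):
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept


contig_a = "ACGT" * 10
near_copy = "ACGT" * 9 + "ACGA"        # differs from contig_a in one base
contig_b = "TTTTGGGGCCCCAAAA" * 2
non_redundant = deduplicate([contig_a, contig_a, near_copy, contig_b])
```

The pairwise comparison is quadratic in the number of sequences, so this form is only suitable for illustration on small inputs.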
- a reference set is, in a broad sense, a set of assembled fragments or genes, wherein an assembled fragment refers to a long fragment assembled from sequencing fragments, such as a contig or a scaffold.
- a gene set is a collection of genes predicted on an assembled fragment. The assembled fragments or genes constitute and are referred to as "elements" of the reference set.
- "multiple normal distribution model" and "maximum likelihood function" have the meanings commonly understood by those skilled in the art. A detailed description of these two terms can be found, for example, in Fraley and Raftery, 1998.
- "similarity-based clustering method" refers to measuring the similarity (or distance) between sequences by comparing the sequence identity between pairs of sequences, and clustering based on this similarity (or distance);
- "clustering method based on compositional features" refers to measuring the similarity (or distance) between sequences by comparing features of the sequences themselves, such as oligonucleotide frequency, GC content, etc., and clustering based on this similarity (or distance).
- Similarity-based clustering methods are, for example but not limited to, MEGAN (Huson et al. 2007) and CARMA (Tzahor et al. 2009).
- Clustering methods based on compositional features are, for example but not limited to, clustering methods based on GC content, k-mer frequency (Schbath et al. 1995) or tetranucleotide frequency (Teeling et al. 2004).
- One technical problem to be solved by the present invention is to provide a method and apparatus for efficiently analyzing the composition of a microbial community in an environmental sample. To this end, the inventors creatively combined the assembly method and the stacking method, and developed a method and apparatus capable of efficiently and accurately analyzing the metagenomic data obtained from an environmental sample and further determining the microbial community composition of the environmental sample.
- the method of the present invention is also named Soap series of Metagenome analysis (hereinafter referred to as SoapMeta). Accordingly, in one aspect, the present invention provides a method for analyzing a microbial community composition in an environmental sample, comprising the steps of:
- the genomic DNA from the environmental sample is constructed and sequenced to obtain metagenomic sequencing data consisting of a sequencing pool of reads;
- Abundance-based stacking: based on the relative abundance of the elements in the sample, a clustering algorithm, such as a bottom-up hierarchical clustering method (Hierarchical Clustering Schemes, Stephen C. Johnson, 1967), is used to determine the initial stack of each element;
- E step: according to the model parameters of each stack, calculate the posterior probability that each element belongs to each stack, and update the probability in the soft matrix that the element belongs to that stack;
- M step: according to the soft matrix, calculate the model parameters of each stack by the maximum likelihood function method
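The E step and M step can be sketched as a small expectation-maximization loop. This is a minimal illustration, not the patented implementation: it assumes each stack is modeled by a normal distribution with diagonal covariance over the per-sample abundances, and that the soft matrix is a list of rows (one per element) holding the probability of each stack. All function names are hypothetical.

```python
import math


def log_normal(x, mean, var):
    """Log density of a diagonal-covariance normal distribution at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def m_step(data, soft, min_var=1e-6):
    """Maximum-likelihood (prior, mean, variance) of each stack given the soft matrix."""
    n, dim = len(data), len(data[0])
    params = []
    for k in range(len(soft[0])):
        weights = [row[k] for row in soft]
        total = max(sum(weights), 1e-12)          # guard against an empty stack
        mean = [sum(w * x[d] for w, x in zip(weights, data)) / total
                for d in range(dim)]
        var = [max(sum(w * (x[d] - mean[d]) ** 2
                       for w, x in zip(weights, data)) / total, min_var)
               for d in range(dim)]
        params.append((total / n, mean, var))
    return params


def e_step(data, params):
    """Posterior probability that each element belongs to each stack."""
    soft = []
    for x in data:
        logs = [math.log(prior) + log_normal(x, mean, var)
                for prior, mean, var in params]
        top = max(logs)                            # subtract the max for stability
        probs = [math.exp(value - top) for value in logs]
        norm = sum(probs)
        soft.append([p / norm for p in probs])
    return soft


def em_stacking(data, initial_labels, n_iter=20):
    """Refine initial stacks; returns (hard labels, soft matrix)."""
    n_stacks = max(initial_labels) + 1
    soft = [[1.0 if k == label else 0.0 for k in range(n_stacks)]
            for label in initial_labels]
    for _ in range(n_iter):
        params = m_step(data, soft)
        soft = e_step(data, params)
    return [row.index(max(row)) for row in soft], soft


# Abundance of six elements in two samples; the third element starts in the wrong stack.
abundances = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]]
labels, soft_matrix = em_stacking(abundances, [0, 0, 1, 1, 1, 1])
```

A fixed iteration count is used here for brevity; a convergence test on the change in log-likelihood would be the usual alternative.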
- the genomic sequence of each stack is used to determine the species of microorganisms corresponding to each stack, thereby determining the microbial community composition in the environmental sample.
- the environmental sample is derived from a natural environment, such as a soil environment, a marine environment, and a river environment.
- the environmental sample is derived from an in vivo environment, such as the oral environment and the intestinal environment.
- the metagenome of the microbial community contained in the environmental sample is sequenced using second generation sequencing techniques (e.g., 454 sequencing, Solexa sequencing, SOLiD sequencing or single-molecule sequencing) or third generation sequencing techniques to provide metagenomic sequencing data from the environmental sample.
- the metagenomic sequencing data is obtained by the following steps:
- 1d) sequencing the metagenomic library, preferably using Solexa sequencing, to provide metagenomic sequencing data for the environmental sample.
- the metagenomic sequencing data is a sequencing pool of reads consisting of sequenced fragments.
- Such sequencing fragments are typically obtained by second generation sequencing techniques (e.g., Solexa sequencing) or third generation sequencing techniques.
- the sequencing fragments are paired-end reads.
- the sequencing fragments may include adapter (linker) sequences used in the sequencing process, sequences of low sequencing quality, and/or, when a sample from an in vivo environment is analyzed, sequences from the host genome. Such sequences may affect subsequent processing and analysis, and thus the removal of such sequences may be advantageous.
- the sequencing data is pre-treated, i.e., the linker sequence, the sequence with low sequencing quality, and/or the host genome sequence are removed prior to performing step 2).
- the metagenome has a sequencing depth of at least 10×, preferably at least 20×, preferably at least 30×, preferably at least 40×, more preferably at least 50×.
- the sequencing fragments are assembled into assembled fragments (e.g., connected fragments and/or spliced fragments) using Soapdenovo. Such assembly methods are known to those skilled in the art, see, for example, Li et al.
- a plurality of environmental samples are used to carry out the method of the invention, and a respective reference set is obtained for each sample.
- the reference sets of all samples are combined and de-duplicated to construct the final non-redundant reference set.
- where a known reference set exists for the environmental sample, it can be used directly as the reference set, or it can be combined with the reference set constructed from the sequencing fragments in step 2a).
- in the latter case, the reference sets are combined and de-duplicated to provide the final reference set.
- sequenced fragments are aligned to a reference set by using SOAP2 or MAQ alignment software.
- SOAP2 and MAQ are known to those skilled in the art, see, for example, R Li et al. 2009 and Li et al. 2008.
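- sequenced fragments are aligned to the reference set using SOAP2 and the relative abundance of each element in the reference set is calculated according to the following formula: $a_i = \dfrac{x_i / L_i}{\sum_j x_j / L_j}$
- $a_i$: the relative abundance of element $i$ in the sample; $L_i$: the length of element $i$; $x_i$: the number of times element $i$ was detected in the sample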
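The abundance calculation can be sketched as follows. The formula in the source is garbled, so this sketch assumes the common length-normalized form a_i = (x_i / L_i) / Σ_j (x_j / L_j), where x_i is the number of sequencing fragments aligned to element i and L_i is the length of element i. The function name and the dictionary layout are hypothetical.

```python
def relative_abundance(read_counts, lengths):
    """Length-normalized relative abundance (assumed form of the formula).

    read_counts: element name -> number of aligned sequencing fragments (x_i)
    lengths:     element name -> element length in bp (L_i)
    """
    ratios = {name: read_counts[name] / lengths[name] for name in read_counts}
    total = sum(ratios.values())
    if total == 0:
        raise ValueError("no sequencing fragments aligned to any element")
    return {name: ratio / total for name, ratio in ratios.items()}


# Two elements of equal length: abundance is proportional to the read count.
abundance = relative_abundance({"contig1": 100, "contig2": 300},
                               {"contig1": 1000, "contig2": 1000})
```

Dividing by the element length before normalizing keeps long elements from appearing more abundant simply because they attract more reads.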
- the initial stack of elements is determined by the following steps: first, the correlation between each pair of elements is calculated based on the relative abundance of the elements in the samples, using, for example, the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, Euclidean distance, Manhattan distance, etc. Then, according to the correlation between elements, a clustering algorithm, such as bottom-up hierarchical clustering, gathers closely related elements into one class, thereby determining the initial stack of each element.
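The two steps of the initial stacking (pairwise correlation, then bottom-up clustering) can be sketched in plain Python. This is an illustration under stated assumptions: Pearson correlation is used as the similarity, clusters are merged by average linkage, and merging stops when the best average correlation falls below a hypothetical `min_corr` threshold. None of these specific choices is fixed by the source.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length abundance profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v) if norm_u and norm_v else 0.0


def initial_stacks(abundance, min_corr=0.9):
    """Bottom-up (agglomerative) clustering of elements by abundance correlation.

    abundance: element name -> list of relative abundances, one per sample.
    """
    clusters = [[name] for name in abundance]

    def average_link(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(abundance[a], abundance[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best, i, j = max(((average_link(clusters[i], clusters[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters))),
                         key=lambda item: item[0])
        if best < min_corr:
            break                      # remaining clusters are not closely related
        merged = clusters.pop(j)       # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


profiles = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1],
            "c": [4, 3, 2, 1], "d": [8, 6.1, 4, 2]}
stacks = initial_stacks(profiles)
```

Here elements `a` and `b` rise together across the four samples while `c` and `d` fall together, so two stacks are expected.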
- after the stacking of step 3), the abundance of each element in the same stack across all samples conforms to a certain distribution model, such as a normal distribution. Therefore, multiple elements clustered into the same stack have the following possibilities: (1) these elements belong to the same species; (2) these elements come from symbiotic species, because the abundance distributions of symbiotic species are similar; (3) these elements are common to several species, because the abundance of elements common to several species differs from the abundance of each individual species.
About stack-based advanced assembly
- SOAP2 is used to align the sequenced fragments with the elements that have been stacked.
- the calibration is performed using a GC-depth spectra classifier and/or a tetranucleotide frequencies (TNFs) classifier (Teeling et al. 2004).
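The two compositional features named above can be computed as follows. This is a minimal sketch of the features only, not of the classifiers themselves: it returns raw GC content and raw tetranucleotide frequencies, whereas published TNF classifiers typically add further normalization. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetranucleotide_frequencies(seq):
    """Frequency of each of the 256 tetranucleotides (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)      # windows containing N are ignored
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


gc = gc_content("GGCCAATT")
tnf = tetranucleotide_frequencies("ACGTACGT")
```

Plotting `gc_content` against the sequencing depth of each assembled fragment gives the kind of GC-depth view referred to in the text.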
- the class of microorganisms corresponding to each stack is determined by aligning the genomic sequences of the respective stacks with a known genomic database.
- the genomic database includes, but is not limited to, the NCBI/IMG libraries of sequenced bacteria, NCBI's NR library, etc.
- the alignment is an alignment of nucleic acid levels and/or protein levels.
- the invention provides an apparatus for analyzing the composition of a microbial community in an environmental sample, comprising the following modules:
- a sequencing module for sequencing metagenomic DNA from an environmental sample to provide metagenomic sequencing data consisting of a pool of sequencing fragments
- a primary assembly module that is coupled to the sequencing module and includes the following modules connected to each other:
- an assembly building module for assembling the sequenced fragments to obtain assembled fragments and then de-duplicating them to construct a non-redundant reference set (i.e., an assembled fragment set); optionally, the assembly building module can further predict genes on the obtained assembled fragments and use the set of predicted genes as the reference set (i.e., a gene set);
- an alignment calculation module for comparing the sequenced fragments to a reference set and calculating the relative abundance of each element in the reference set in the sample
- a stacking module which is connected to the primary assembly module, is used to determine the stack to which each element in the reference set belongs, to obtain a stack of clusters, and includes the following modules connected to each other:
- an advanced assembly module, which is connected to the sequencing module and the stacking module, and is used for retrieving the sequencing fragments corresponding to each stack from the metagenomic sequencing data, assembling the sequencing fragments corresponding to each stack separately, and calibrating and adjusting the assembly result;
- the environmental sample is derived from a natural environment, such as a soil environment, a marine environment, and a river environment.
- the environmental sample is derived from an in vivo environment, such as the oral environment and the intestinal environment.
- the sequencing module uses second generation sequencing technology
- the device further comprises a DNA extraction module and a library construction module connected to each other, wherein the DNA extraction module is for extracting metagenomic DNA from the environmental sample, and the library construction module is coupled to the sequencing module and constructs the metagenomic library using the metagenomic DNA.
- the sequencing fragments obtained by the sequencing module are paired end reads.
- the apparatus further comprises a filtration module coupled to the sequencing module and the primary assembly module for removing linker sequences, sequences of low sequencing quality, and/or host genome sequences from the sequenced fragments prior to performing primary assembly.
- the sequencing module has a sequencing depth for the metagenome of at least 10×, preferably at least 20×, preferably at least 30×, preferably at least 40×, more preferably at least 50×.
- the assembly building module assembles the sequenced fragments into connected fragments and/or spliced fragments using Soapdenovo.
- the assembly building module further includes a receiving sub-module for receiving a known reference set.
- the assembly building module uses the received known reference set as the final reference set.
- the assembly building module combines the received known reference set with a reference set constructed using the sequencing fragments and de-duplicates the result to provide a final reference set.
- the assembly building module is capable of combining reference sets from multiple samples and de-duplicating to construct a final non-redundant reference set.
- the alignment calculation module uses SOAP2 or MAQ to align the sequenced fragments to a reference set.
- the alignment calculation module uses SOAP2 to align the sequenced fragments with a reference set and calculate the relative abundance of each element in the reference set according to the following formula:
- the abundance stacking module calculates the correlation between each pair of elements based on the relative abundance of the elements in the sample, and then determines the initial stack of each element by a clustering algorithm.
- the model stacking module determines the stack to which the element belongs by:
- Step E: calculating the posterior probability that each element belongs to each stack according to the model parameters of each stack, and updating the probability in the soft matrix that the element belongs to that stack; Step M: according to the soft matrix, calculating the model parameters of each stack using the maximum likelihood function method.
- the advanced assembly module performs its function by:
- the advanced assembly module uses SOAP2 to align the sequencing fragments with the elements that have been stacked.
- the advanced assembly module performs calibration using a GC-depth spectra classifier and/or a tetranucleotide frequencies (TNFs) classifier.
- the identification module determines the class of microorganisms corresponding to each stack by comparing the genomic sequences of the respective stacks to a known genomic database.
- the genomic database includes, but is not limited to, the NCBI/IMG libraries of sequenced bacteria, the NCBI NR library, and the like.
- the identification module performs the alignment at the nucleic acid level and/or protein level.
- the use of the device of the invention for analyzing the composition of a microbial community in an environmental sample.
- the environmental sample is derived from a natural environment, such as a soil environment, a marine environment, and a river environment.
- the environmental sample is derived from an in vivo environment, such as the oral environment and the intestinal environment.
- the method and apparatus of the present invention are based on high-throughput sequencing technology and use sequencing data of multiple samples from the same or similar environments for assembly, clustering and reassembly, thereby obtaining species composition information of the microbial community and genomic information of the species, and therefore have very wide application prospects.
- the method and device of the present invention have the following advantages:
- Cluster analysis using multiple samples has two significant advantages: a) it can cover more low-abundance species, allowing a more comprehensive study of microbial communities; b) different samples may differ in species composition and abundance due to environmental factors, and can thus be advantageously studied comparatively.
- metagenomic analysis using a single sample usually yields only dominant species, but not comprehensive analysis of microbial communities, especially low-abundance species (see, for example, Hess et al. 2011).
- Figure 1 is a schematic flow diagram of the SoapMeta method of the present invention, wherein the dashed hollow frame, the solid hollow frame and the solid frame schematically represent three different species.
- Figure 2 schematically depicts a flow chart of the primary assembly of the SoapMeta method of the present invention.
- Figure 3 is a flow chart schematically depicting the stacking of the SoapMeta method of the present invention.
- Figure 4 is a schematic diagram showing the advanced assembly of the SoapMeta method of the present invention.
- Figure 5 is a block diagram showing the structure of an apparatus for implementing the SoapMeta method of the present invention.
- Figures 6-8 show the GC content-sequencing depth plots for the three samples (samples A-C) obtained in Example 2 using the first strategy.
- Figure 6 Sample A;
- Figure 7 Sample B;
- Figure 8 Sample C;
- the results show that some of the bacteria in sample B and sample C are difficult to distinguish because their GC content and sequencing depth are very close.
- Figure 9 is a graph showing the classification of species obtained by 16S rRNA sequencing in Example 3 of the present application.
- Figure 10 shows the correlation between the number of Akkermansia 16S rRNA tags obtained by 16S rRNA sequencing and the sequencing depth of the corresponding genome assembled using the SoapMeta method of the present invention.
- Figure 11 shows the correlation between the number of Lactobacillus 16S rRNA tags obtained by 16S rRNA sequencing and the sequencing depth of the corresponding genome assembled using the SoapMeta method of the present invention.
- the simulated paired-end sequencing fragments were 90 bp in length, the insert size was 500 ± 20 bp (mean ± standard deviation), and the sequencing error rate was 0.1%.
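A read simulation with these parameters can be sketched as below. This is an illustrative sketch only; the source does not say which simulator was used. It assumes a normally distributed insert size, uniform random start positions, substitution errors only, and a genome at least as long as one read. All names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def add_errors(read, error_rate, rng):
    """Substitute each base with a different base with probability error_rate."""
    return "".join(
        rng.choice([b for b in "ACGT" if b != base]) if rng.random() < error_rate else base
        for base in read)


def simulate_pair(genome, rng, read_len=90, insert_mean=500, insert_sd=20,
                  error_rate=0.001):
    """Draw one paired-end read pair from a genome string."""
    insert = round(rng.gauss(insert_mean, insert_sd))
    insert = max(read_len, min(len(genome), insert))   # keep the insert usable
    start = rng.randrange(len(genome) - insert + 1)
    fragment = genome[start:start + insert]
    read1 = add_errors(fragment[:read_len], error_rate, rng)
    read2 = add_errors(reverse_complement(fragment[-read_len:]), error_rate, rng)
    return read1, read2


rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(2000))
read1, read2 = simulate_pair(toy_genome, rng, error_rate=0.0)
```

With `error_rate=0.0` both reads are exact substrings of the genome (the second after reverse complementing), which makes the sketch easy to check.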
- the species abundance composition ratio of each sample was determined by the relative species abundance (RSA) of the Broken-Stick model (MacArthur 1957).
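The broken-stick model has a closed-form expectation that is easy to compute. The sketch below assumes the expected-value form, in which the i-th most abundant of S species has relative abundance (1/S) Σ_{k=i}^{S} 1/k; the source does not state whether expected values or random stick breaks were used, so this is an assumption. The function name is hypothetical.

```python
def broken_stick(n_species):
    """Expected relative abundances under the broken-stick model, largest first."""
    return [sum(1.0 / k for k in range(rank, n_species + 1)) / n_species
            for rank in range(1, n_species + 1)]


two_species = broken_stick(2)      # [0.75, 0.25]
community = broken_stick(100)
```

The values form a decreasing sequence that sums to one, so they can be used directly as per-species sampling proportions when simulating a community.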
- the sequencing amount of most bacteria contained in each sample was relatively low (64% of bacteria had RSA < 0.01). After the sequencing data of the 10 samples were combined, the sequencing amount of these low-abundance bacteria was 13.6-182.0 Mbp, and the sequencing depth was 2.7-160.4×.
- N50 is a criterion for measuring the quality of a genome draft: when all assembled sequences are arranged in descending order of length and their lengths are added from largest to smallest, N50 is the length of the assembled sequence at which the running total reaches fifty percent of the total length of all assembled sequences (see, for example, Miller et al. 2010. Assembly algorithms for next generation sequencing data. Genomics. 95(6): 315-327).
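The N50 definition translates directly into a few lines of code. This is a minimal sketch; the function name `n50` is chosen here for illustration.

```python
def n50(lengths):
    """Length of the sequence at which the running total, taken from the
    longest sequence downwards, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Total length is 54; 10 + 9 + 8 = 27 reaches half of it, so N50 is 8.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Comparing `running * 2` with `total` avoids floating-point division when all lengths are integers.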
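- sequenced fragments are aligned with a non-redundant reference set, and the relative abundance of each connected fragment in the reference set is calculated by: $a_i = \dfrac{x_i / L_i}{\sum_j x_j / L_j}$, where $a_i$ is the relative abundance of connected fragment $i$ and $L_i$ is its length.
- $x_i$: the number of times the connected segment $i$ was detected in the sample.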
- the Kendall's tau rank correlation coefficient between each pair of connected segments in the abundance matrix is calculated. Then, according to the correlation between connected segments, a bottom-up hierarchical clustering algorithm is used to cluster closely related segments into one class, yielding the initial stacks.
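The rank correlation used here can be computed with a short pairwise loop. This sketch assumes the tau-b variant, which corrects for ties; the source does not say which variant was used. The function name is hypothetical and the quadratic loop is meant for illustration rather than for large abundance matrices.

```python
import math


def kendall_tau(u, v):
    """Kendall's tau-b between two equal-length abundance profiles."""
    concordant = discordant = tied_u_only = tied_v_only = 0
    for i in range(len(u)):
        for j in range(i + 1, len(u)):
            du, dv = u[i] - u[j], v[i] - v[j]
            if du == 0 and dv == 0:
                continue                  # tied in both profiles: ignored
            if du == 0:
                tied_u_only += 1
            elif dv == 0:
                tied_v_only += 1
            elif (du > 0) == (dv > 0):
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_u_only) * (untied + tied_v_only))
    return (concordant - discordant) / denom if denom else 0.0


same_order = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])       # 1.0
opposite_order = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])   # -1.0
one_swap = kendall_tau([1, 2, 3], [1, 3, 2])
```

Because it depends only on the ordering of abundances across samples, this coefficient is insensitive to the absolute scale of the abundance values.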
- the accuracy of a stack is defined as the total length of the connected segments from the best-aligned bacterium as a percentage of the total length of all connected segments in the stack. In the present experiment, the accuracy of the initial stacks is 50.3%-100.0% (average 95.1%).
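The accuracy measure can be sketched as follows, assuming each connected segment in a stack has already been assigned the species of its best alignment, and taking the "best-aligned bacterium" to be the species that accounts for the greatest total length. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def stack_accuracy(segments):
    """Return (dominant species, fraction of stack length it accounts for).

    segments: iterable of (length_in_bp, best_hit_species) for one stack.
    """
    length_by_species = defaultdict(int)
    for length, species in segments:
        length_by_species[species] += length
    total = sum(length_by_species.values())
    if total == 0:
        raise ValueError("stack contains no sequence")
    dominant = max(length_by_species, key=length_by_species.get)
    return dominant, length_by_species[dominant] / total


species, accuracy = stack_accuracy([(5000, "species A"),
                                    (3000, "species A"),
                                    (2000, "species B")])
```

In this toy stack, 8,000 of 10,000 bp come from one species, giving an accuracy of 80%.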
- Step E calculating a posterior probability of each connected segment belonging to a certain stack according to model parameters of each stack, and modifying a probability that the connected segment belongs to the stack in the soft matrix;
- Step M Calculate the model parameters of each stack using the maximum likelihood function method according to the soft matrix.
- each stack represents a species. Based on the sequence of connected fragments in each stack, we identified 86 species (86%) with a genome coverage of more than 50% per species.
- This embodiment further illustrates the SoapMeta method of the present invention by taking a real simple environment as an example, and confirms the advantages of the SoapMeta method of the present invention by comparison with the conventional analysis method.
- cultures in media containing different carbon sources (filter paper, cellobiose, glucose) were grown for 52 hours at 37 °C, and then the cells were separately harvested to obtain samples A, B, and C. For each sample, we separately constructed a sequencing library.
- each sample is first sequenced with HiSeq2000 to obtain raw reads; then, low-quality sequences and linker sequences are filtered out, providing 3.88 Gb of metagenomic sequencing data for analysis (sum of the sequencing data for the 3 samples).
- The first strategy analyzes the sequencing data of each sample separately using a conventional method to construct the microbial genomes (see MEGAN (Huson et al. 2007)); the second strategy uses the SoapMeta method of the present invention: the sequencing data of all samples are pooled, and primary assembly, stacking, and advanced assembly are then carried out to construct the microbial genomes.
- The first strategy serves as a comparison, in order to demonstrate the superiority of the SoapMeta method of the present invention in the mixed assembly of multiple samples.
- A composition-based clustering method is used to cluster the sequence fragments from each single sample, in order to determine the microorganisms potentially present in the sample.
- For the 3 samples used, we obtained 6 classes (sample A), 2 classes (sample B), and 3 classes (sample C).
- The GC plots of these 3 samples show that some of the bacteria in sample B and sample C are very difficult to distinguish, because their GC content and sequencing depth differ very little.
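Why similar GC content and sequencing depth make two genomes hard to separate can be shown with a small sketch. The functions `gc_content` and `hard_to_separate` and both tolerance values are hypothetical and chosen only for illustration; the source does not state any numeric thresholds.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / counted


def hard_to_separate(first, second, gc_tolerance=0.02, depth_ratio_tolerance=1.2):
    """first, second: (gc_fraction, mean_depth) summaries of two fragment groups.

    Returns True when both features are close, in which case a clustering
    based only on GC content and depth cannot tell the groups apart.
    """
    gc_close = abs(first[0] - second[0]) <= gc_tolerance
    low, high = sorted((first[1], second[1]))
    depth_close = low > 0 and high / low <= depth_ratio_tolerance
    return gc_close and depth_close


print(gc_content("GGCCAATT"))
print(hard_to_separate((0.45, 30.0), (0.46, 33.0)))
print(hard_to_separate((0.45, 30.0), (0.60, 33.0)))
```

In the first comparison both features nearly coincide, so the two groups would overlap on a GC-versus-depth plot; in the second, the GC difference alone is enough to separate them.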
- The results in Table 2 show that, of the 10 stacks, 6 yielded assembled genomes whose sequences are very pure (i.e., the assembled genome essentially corresponds to a single microbial species): Brevibacillus brevis NBRC 100599, Bacillus coagulans 2-6, Bacillus halodurans C-125, and Clostridium botulinum 2 Kyoto Cl
- Bacteria (Weimer and Zeikus 1977; Bayer et al. 1983; and Schwarz 2001).
- Brevibacillus and Bacillus are also known to have cellulose-degrading ability (Liang et al. 2009; Li et al. 2006; and Rastogi et al. 2009).
- The SoapMeta strategy of the present invention is not only superior to the first strategy in accuracy and coverage (i.e., genome coverage is more complete and classification accuracy is higher), but can also identify the microbial composition of environmental samples more effectively and accurately.
- Table 2. Summary of assembled genomes of cellulose-degrading bacteria
- This example exemplifies the application of the SoapMeta method of the present invention in the detection of mouse intestinal flora by taking a real complex environment as an example.
- The relative abundance of the mouse intestinal flora varies with age, sex, diet, etc., but if the diet and the environment of the mice are fixed, the microbial composition of the flora generally does not change much. Therefore, the SoapMeta method of the present invention can be used to study the intestinal flora of mice in a specific environment and on a specific diet.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201280064063.2A CN104039982B (zh) | 2012-08-01 | 2012-08-01 | Method and device for analyzing microbial community composition |
PCT/CN2012/079492 WO2014019164A1 (zh) | 2012-08-01 | 2012-08-01 | Method and device for analyzing microbial community composition |
US14/419,060 US20150242565A1 (en) | 2012-08-01 | 2012-08-01 | Method and device for analyzing microbial community composition |
HK14109940.6A HK1196642A1 (zh) | 2012-08-01 | 2014-10-07 | Method and device for analyzing microbial community composition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2012/079492 WO2014019164A1 (zh) | 2012-08-01 | 2012-08-01 | Method and device for analyzing microbial community composition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014019164A1 (zh) | 2014-02-06 |
Family
ID=50027091
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2012/079492 WO2014019164A1 (zh) | Method and device for analyzing microbial community composition | 2012-08-01 | 2012-08-01 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20150242565A1 (zh) |
CN (1) | CN104039982B (zh) |
HK (1) | HK1196642A1 (zh) |
WO (1) | WO2014019164A1 (zh) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
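CN104278091A (zh) * | 2014-09-26 | 2015-01-14 | 上海交通大学 | Method for assembling bacterial genomes from microbial metagenome sequences of wastewater treatment samples |
CN105095688A (zh) * | 2014-08-28 | 2015-11-25 | 吉林大学 | Method for detecting the bacterial communities and abundance of the human gut metagenome |
CN106778078A (zh) * | 2016-12-20 | 2017-05-31 | 福建师范大学 | DNA sequence similarity comparison method based on the Kendall correlation coefficient |
CN111161798A (zh) * | 2019-12-31 | 2020-05-15 | 余珂 | Metagenome reassembly method, reassembly device, and terminal equipment |
CN111261231A (zh) * | 2019-12-03 | 2020-06-09 | 康美华大基因技术有限公司 | Gut microbiota metagenome database construction method, analysis method, and device |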
US11694764B2 (en) | 2013-09-27 | 2023-07-04 | University Of Washington | Method for large scale scaffolding of genome assemblies |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016119190A1 (en) * | 2015-01-30 | 2016-08-04 | Bgi Shenzhen | Biomarkers for colorectal cancer related diseases |
WO2017156739A1 (zh) * | 2016-03-17 | 2017-09-21 | 上海锐翌生物科技有限公司 | Isolated nucleic acid and use thereof |
CN105925664A (zh) * | 2016-03-30 | 2016-09-07 | 广州精科生物技术有限公司 | Method and system for determining a nucleic acid sequence |
CN105950707A (zh) * | 2016-03-30 | 2016-09-21 | 广州精科生物技术有限公司 | Method and system for determining a nucleic acid sequence |
US20190318807A1 (en) * | 2016-10-26 | 2019-10-17 | The Joan & Irwin Jacobs Technion-Cornell Institute | Systems and methods for ultra-fast identification and abundance estimates of microorganisms using a kmer-depth based approach and privacy-preserving protocols |
US10733214B2 (en) | 2017-03-20 | 2020-08-04 | International Business Machines Corporation | Analyzing metagenomics data |
CN107028606A (zh) * | 2017-04-21 | 2017-08-11 | 上海耐相智能科技有限公司 | Medical intelligent monitoring ring system |
WO2019005913A1 (en) * | 2017-06-28 | 2019-01-03 | Icahn School Of Medicine At Mount Sinai | METHODS OF HIGH RESOLUTION MICROBIOME ANALYSIS |
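CN107287332A (zh) * | 2017-08-03 | 2017-10-24 | 华子昂 | Method for identifying microbial strains in liquid ferments using SMRT sequencing technology |
TWI629607B (zh) * | 2017-08-15 | 2018-07-11 | 極諾生技股份有限公司 | Method for establishing a gut bacteria database and related detection system |
CN108197434B (zh) * | 2018-01-16 | 2020-04-10 | 深圳市泰康吉音生物科技研发服务有限公司 | Method for removing human-derived gene sequences from metagenomic sequencing data |
CN109587001B (zh) * | 2018-11-15 | 2020-11-27 | 新华三信息安全技术有限公司 | Method and device for detecting performance indicator anomalies |
CN111455021B (zh) * | 2019-01-18 | 2024-06-04 | 广州微远医疗器械有限公司 | Method and kit for removing host DNA from a metagenome |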
WO2020252320A1 (en) * | 2019-06-13 | 2020-12-17 | Icahn School Of Medicine At Mount Sinai | Dna methylation based high resolution characterization of microbiome using nanopore sequencing |
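CN110277139B (zh) * | 2019-06-18 | 2023-03-21 | 江苏省产品质量监督检验研究院 | Internet-based microbial limit test system and method |
CN110349629B (zh) * | 2019-06-20 | 2021-08-06 | 湖南赛哲医学检验所有限公司 | Analysis method for detecting microorganisms using a metagenome or metatranscriptome |
CN111477267B (zh) * | 2020-03-06 | 2022-05-03 | 清华大学 | Multi-association network computation method, apparatus, device, and storage medium for microorganisms |
CN111627500A (zh) * | 2020-04-16 | 2020-09-04 | 中国科学院生态环境研究中心 | Method for identifying pathogenic bacteria carrying virulence factors in water bodies based on metagenomic technology |
CN114067911B (zh) * | 2020-08-07 | 2024-02-06 | 西安中科茵康莱医学检验有限公司 | Method and device for obtaining microbial species and related information |
CN112071366B (zh) * | 2020-10-13 | 2024-02-27 | 南开大学 | Metagenomic data analysis method based on second-generation sequencing technology |
CN112786102B (zh) * | 2021-01-25 | 2022-10-21 | 北京大学 | Method for accurately identifying unknown microbial communities in water bodies based on metagenomic analysis |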
WO2022222936A1 (en) * | 2021-04-20 | 2022-10-27 | Hangzhou Matridx Biotechnology Co., Ltd. | Methods, computer-readble media, and systems for filtering noises for dna sequencing data |
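CN113284560B (zh) * | 2021-04-28 | 2022-05-17 | 广州微远基因科技有限公司 | Method for determining background microorganisms in pathogen detection and use thereof |
CN113362890B (zh) * | 2021-04-28 | 2023-09-08 | 中国科学院生态环境研究中心 | Method for evaluating the potential of biological filter media to degrade organic matter |
CN113611359B (zh) * | 2021-08-13 | 2022-08-05 | 江苏先声医学诊断有限公司 | Method for improving strain assembly efficiency from metagenomic nanopore sequencing data |
CN114999574B (zh) * | 2022-08-01 | 2022-12-27 | 中山大学 | Parallel identification and analysis method and system for gut microbiota big data |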
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102517392A (zh) * | 2011-12-26 | 2012-06-27 | 深圳华大基因研究院 | Classification method and device based on the metagenomic 16S hypervariable region V3 |
-
2012
- 2012-08-01 US US14/419,060 patent/US20150242565A1/en not_active Abandoned
- 2012-08-01 CN CN201280064063.2A patent/CN104039982B/zh active Active
- 2012-08-01 WO PCT/CN2012/079492 patent/WO2014019164A1/zh active Application Filing
-
2014
- 2014-10-07 HK HK14109940.6A patent/HK1196642A1/zh unknown
Non-Patent Citations (2)
Title |
---|
RAMIRO LOGARES ET AL.: "Environmental microbiology through the lens of high-throughput DNA sequencing: Synopsis of current platforms and bioinformatics approaches.", JOURNAL OF MICROBIOLOGICAL METHODS., vol. 91, 28 July 2012 (2012-07-28), pages 106 - 113, XP028947544, DOI: doi:10.1016/j.mimet.2012.07.017 * |
WOLFGANG GERLACH.: "Taxonomic Classification of Metagenomic Sequences.", PHD THESIS OF BIELEFELD UNIVERSITY, February 2012 (2012-02-01), GERMANY * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11694764B2 (en) | 2013-09-27 | 2023-07-04 | University Of Washington | Method for large scale scaffolding of genome assemblies |
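CN105095688A (zh) * | 2014-08-28 | 2015-11-25 | 吉林大学 | Method for detecting the bacterial communities and abundance of the human gut metagenome |
CN104278091A (zh) * | 2014-09-26 | 2015-01-14 | 上海交通大学 | Method for assembling bacterial genomes from microbial metagenome sequences of wastewater treatment samples |
CN106778078A (zh) * | 2016-12-20 | 2017-05-31 | 福建师范大学 | DNA sequence similarity comparison method based on the Kendall correlation coefficient |
CN106778078B (zh) * | 2016-12-20 | 2019-04-09 | 福建师范大学 | DNA sequence similarity comparison method based on the Kendall correlation coefficient |
CN111261231A (zh) * | 2019-12-03 | 2020-06-09 | 康美华大基因技术有限公司 | Gut microbiota metagenome database construction method, analysis method, and device |
CN111161798A (zh) * | 2019-12-31 | 2020-05-15 | 余珂 | Metagenome reassembly method, reassembly device, and terminal equipment |
CN111161798B (zh) * | 2019-12-31 | 2024-03-19 | 余珂 | Metagenome reassembly method, reassembly device, and terminal equipment |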
Also Published As
Publication number | Publication date |
---|---|
HK1196642A1 (zh) | 2014-12-19 |
CN104039982A (zh) | 2014-09-10 |
US20150242565A1 (en) | 2015-08-27 |
CN104039982B (zh) | 2015-09-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2014019164A1 (zh) | Method and device for analyzing microbial community composition | |
Wu et al. | A novel abundance-based algorithm for binning metagenomic sequences using l-tuples | |
Bharti et al. | Current challenges and best-practice protocols for microbiome analysis | |
Gruber-Vodicka et al. | phyloFlash: rapid small-subunit rRNA profiling and targeted assembly from metagenomes | |
Dröge et al. | Taxonomic binning of metagenome samples generated by next-generation sequencing technologies | |
EP3221470B1 (en) | Method of analyzing microbiome | |
Kellis et al. | Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery | |
US20210403991A1 (en) | Sequencing Process | |
Jin et al. | Hybrid, ultra-deep metagenomic sequencing enables genomic and functional characterization of low-abundance species in the human gut microbiome | |
KR101798229B1 (ko) | Method for obtaining full-length ribosomal RNA sequence information and method for identifying microorganisms using the ribosomal RNA sequence information | |
Zhang et al. | A comprehensive investigation of metagenome assembly by linked-read sequencing | |
Méndez-García et al. | Metagenomic protocols and strategies | |
Kuster et al. | ngsComposer: an automated pipeline for empirically based NGS data quality filtering | |
Goswami et al. | RNA-Seq for revealing the function of the transcriptome | |
Kim et al. | Unraveling metagenomics through long-read sequencing: A comprehensive review | |
Yuan et al. | RNA-CODE: a noncoding RNA classification tool for short reads in NGS data lacking reference genomes | |
US20170147744A1 (en) | System for analyzing sequencing data of bacterial strains and method thereof | |
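CN113260710A (zh) | Compositions, systems, devices, and methods for validating microbiome sequence processing and differential abundance analysis via multiple custom spike-in mixtures | |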
Tanaseichuk et al. | A probabilistic approach to accurate abundance-based binning of metagenomic reads | |
WO2022192904A1 (en) | Systems and methods for identifying microbial biosynthetic genetic clusters | |
Zhang et al. | Exploring high-quality microbial genomes by assembly of linked-reads with high barcode specificity using deep learning | |
Chandrasiri et al. | CH-Bin: A convex hull based approach for binning metagenomic contigs | |
Feng et al. | MOBFinder: a tool for MOB typing for plasmid metagenomic fragments based on language model | |
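WO2023204006A1 (ja) | Microorganism identification method and microorganism identification device | |
WO2023204008A1 (ja) | Method and device for constructing a database for microorganism identification |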
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12882166; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 14419060; Country of ref document: US |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 26/06/2015) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 12882166; Country of ref document: EP; Kind code of ref document: A1 |