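WO2022262491A1 - Method for detecting and analyzing bacteria at the "species" level based on bacterial 16S rRNA gene sequences - Google Patents

Method for detecting and analyzing bacteria at the "species" level based on bacterial 16S rRNA gene sequences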

Info

Publication number
WO2022262491A1
Authority
WO
WIPO (PCT)
Prior art keywords
bacterial
sequence
rrna gene
sequences
bacteria
Prior art date
Application number
PCT/CN2022/092574
Other languages
English (en)
French (fr)
Inventor
徐建国
杨晶
卢珊
濮吉
Original Assignee
中国疾病预防控制中心传染病预防控制所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国疾病预防控制中心传染病预防控制所
Publication of WO2022262491A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria

Definitions

  • The invention discloses a method for detecting and analyzing bacterial 16S rRNA gene V3-V4 region sequences from human feces samples. The method can detect and annotate the composition diversity and composition ratio of the intestinal flora at the "species" level, including proportionally dominant unknown bacteria that have not yet been isolated and studied. It belongs to the technical fields of microbial ecology, microbial taxonomy and microbiomics.
  • Bacterial taxonomic levels include kingdom, phylum, class, order, family, genus and species. "Species" is the lowest taxonomic unit of bacteria. The taxonomic units of bacteria most commonly referred to in medicine are "genus" and "species".
  • A "genus" of bacteria can include several "species" (such as Escherichia, with 6 "species") or hundreds of "species" (such as Streptococcus, with more than 200 "species").
  • 16S rRNA is a ribosomal RNA on the small subunit of the ribosome, involved in processes such as protein synthesis, and is a molecular clock in bacterial evolution.
  • The gene sequence corresponding to 16S rRNA in the bacterial genome is the bacterial 16S rRNA gene. It is about 1500 bases in length and consists of nine variable regions (V1-V9) alternating with conserved regions. The conserved regions are highly conserved, while the variable regions differ from species to species, and the degree of variation is closely related to the phylogenetic position of the bacterium (taxonomic species, genus, family, etc.). Therefore, 16S rRNA gene sequence analysis can be used to identify and classify all bacteria. If the full-length sequence of the 16S rRNA gene is used, the bacteria to be tested can be identified to the level of "species" in most cases.
  • 16S rRNA gene sequence analysis has become an important method for bacterial detection and identification and bacterial diversity analysis.
  • High-throughput sequencing based on next-generation sequencing platforms can obtain a large number of bacterial 16S rRNA gene sequences without relying on bacterial culture, which provides a powerful tool for studying the diversity of flora.
  • the commonly used method for analyzing the diversity of intestinal flora is to conduct high-throughput sequencing of the V3-V4 region (about 400 bases) of the 16S rRNA gene on the stool sample based on the Illumina sequencing platform to obtain a large number of sequences.
  • However, sequences such as the 16S rRNA gene V3-V4 region amplified for next-generation sequencing are only about 400 bases long, so most of them can be identified only to the taxonomic level of "genus" or above, yielding data on the diversity and composition ratio of intestinal flora at the level of "genus" or above.
  • Analytical data at the level of "genus" or above cannot accurately reveal the relationship between changes in intestinal flora and health and disease, which limits the application and promotion of intestinal flora analysis.
  • the purpose of the present invention is to provide a method for detecting, identifying and analyzing human flora at the "species" level.
  • The present invention first provides a method for identifying human flora at the "species" level based on bacterial full-length or nearly full-length 16S rRNA gene sequences. The 16S rRNA gene described here refers to a full-length or nearly full-length 16S rRNA gene sequence of 1450-1500 bases. The method comprises the following steps:
  • OPU: Operational Phylogenetic Unit.
  • The reference sequence library includes all known bacteria that have been named at the "species" level, as well as unknown bacteria.
  • Unknown bacteria use the OPU and its code, together with the higher-level taxonomic unit, as their unique name.
  • The present invention accordingly constructs a full-length 16S rRNA gene reference sequence library of the human intestinal flora.
  • The database includes all named bacteria and the unknown intestinal bacteria found in the present invention.
  • The 16S rRNA gene sequences of all named reference strains of known bacteria were obtained from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence libraries included in and published by the prokaryote standard nomenclature list (LPSN), the National Center for Biotechnology Information (NCBI), and the online quality control and comparison database for bacterial 16S rRNA gene sequences (SILVA);
  • The V3-V4 region sequences are obtained by virtual (in silico) cutting of the sequences in the 16S rRNA gene reference sequence library.
  • Virtual cutting uses the binding sites of the universal 16S rRNA gene V3-V4 amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2).
  • A reference sequence working library of the 16S rRNA gene V3-V4 region of the intestinal flora is thereby formed.
  • Use the sample 16S rRNA gene sequences obtained in step (3) as query sequences, and compare them against the reference sequence working library of the 16S rRNA gene V3-V4 region of the intestinal flora from step (2) to identify bacterial species.
  • A query sequence that is completely identical (100%) to a specific sequence with taxonomic information in the reference sequence working library is identified with the annotation name of that sequence.
  • The 16S rRNA gene V3-V4 region sequences obtained from the specimen to be tested are compared with the 16S rRNA gene V3-V4 region reference sequence library. A sequence with 100% identity to a V3-V4 reference sequence of a known bacterial "species" in the working library is annotated with the taxonomic "species" name of that bacterium; a sequence with 100% identity to a V3-V4 reference sequence of an unknown bacterium in the library is annotated with the corresponding OPU and its code.
  • Unknown bacteria include suspected new species and high-order units.
  • A high-order unit is a bacterium that is difficult to identify accurately from the 16S rRNA gene sequence alone; it is represented by its upper taxonomic unit and an OPU code.
  • the method further includes the step of analyzing the type, ratio, and/or abundance of the bacterial species identified in step (3) in the specimen to be tested.
  • Depending on need, the reported results include, but are not limited to: the number of OPUs in the sample to be analyzed; the number, type and abundance of known bacteria; the type, number and abundance of unknown bacteria; the percentage of each "species" or OPU in the total intestinal flora; the type and abundance of probiotics and pathogenic bacteria; the type and abundance of recommended pathogenic bacteria; and the number and abundance of dominant OPUs.
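  • The composition ratio described above can be illustrated with a minimal sketch. It assumes each read of a sample has already been given one annotation label (a "species" name, a coded OPU, or "unidentified"); the function name `composition_ratio` and the toy reads are hypothetical and are not taken from the patent.

```python
from collections import Counter


def composition_ratio(annotations):
    """Return {label: percentage of all reads} for one sample.

    `annotations` holds one label per read (hypothetical input format).
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy sample: four reads, labels borrowed from the naming examples in the text.
reads = [
    "Prevotella copri",
    "Prevotella copri",
    "Bacteroides sp. 17 (OPU-532)",
    "unidentified",
]
ratios = composition_ratio(reads)
```

  • The number of distinct labels in `ratios` gives the count of "species" or OPUs in the sample, and the values give their share of the total flora.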
  • the 16S rRNA gene sequence in the method is a V3-V4 region sequence.
  • The method of the present invention can be used for flora identification analysis based on the V3-V4 region of the 16S rRNA gene, but it is not limited to the V3-V4 region and can also be used for flora identification analysis based on other regions of the 16S rRNA gene.
  • the human flora is derived from the flora of the digestive tract, skin, oral cavity, nasopharynx, eyes, vagina, urinary tract or ear.
  • the sequence determination of step (2) of the method is high-throughput sequencing.
  • A specific embodiment of the present invention uses sequences obtained by deep sequencing of the 16S rRNA gene V3-V4 region on the Illumina next-generation sequencing platform.
  • the present invention provides a method for detecting and identifying human intestinal flora at the "species" level based on the analysis of the full-length or near-full-length 16S rRNA gene sequence of bacteria described in step (1).
  • Human intestinal flora 16S rRNA gene sequences based on bacterial operational phylogenetic units (OPUs)
  • Sequencing and quality control: bacterial 16S rRNA gene sequences are obtained from human samples, and low-quality sequences are deleted during quality control (such as sequences containing a base with a quality value lower than 10, sequences in which the primers at both ends cannot be recognized, and chimeric sequences). In this invention, the inventors used third-generation sequencing technology on the PacBio sequencing platform to obtain full-length or near-full-length (1450-1500 bases) 16S rRNA gene sequences of the human intestinal flora from stool samples of 120 healthy people; 850,935 16S rRNA gene sequences were obtained.
  • Using the chimera detection software UCHIME together with QIIME (full name: Quantitative Insights Into Microbial Ecology), the inventors screened these down to 594,075 full-length or near-full-length 16S rRNA gene sequences;
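  • As a rough illustration of the quality-control rules above, the sketch below keeps a read only if its length is 1450-1500 bases and no base has a quality value below 10. It reads "a single base quality value lower than 10" as "any base below 10", which is an interpretation; the function name `passes_qc` and the input format are hypothetical. Primer recognition and chimera detection are not reimplemented here.

```python
def passes_qc(seq, quals, min_q=10, min_len=1450, max_len=1500):
    """Return True if a read survives the length and per-base quality filters.

    seq   : read sequence as a string
    quals : per-base quality values (integers), same length as seq
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not (min_len <= len(seq) <= max_len):
        return False
    # Reject the read if any single base falls below the quality cutoff.
    return all(q >= min_q for q in quals)


good = passes_qc("A" * 1460, [30] * 1460)
low_quality = passes_qc("A" * 1460, [30] * 1459 + [9])
too_short = passes_qc("A" * 400, [30] * 400)
```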
  • OTU: operational taxonomic unit.
  • The bacterial phylogenetic tree is constructed as follows: the representative sequence of each OTU obtained in step (3) is aligned against the 16S rRNA gene sequences of all known bacteria (LTP132 database) using SINA software (version 1.2.11). Using the built-in Parsimony tool of ARB software (version 6.0.6), the aligned OTU representative sequences are inserted into the 16S rRNA gene sequence databases of all named bacterial reference strains (LTP 132 database and NR SILVA Ref 132 database), with the parameter set to LTP50.
  • The inserted OTU representative sequences and the 16S rRNA gene sequences of the named bacterial reference strains were used to construct the all-bacteria phylogenetic tree by the neighbor-joining method with the Jukes-Cantor correction, with the conservation degree set to 30%.
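  • For readers unfamiliar with the Jukes-Cantor correction named above, the standard formula converts the observed proportion of differing sites p into an evolutionary distance d = -(3/4) ln(1 - 4p/3). The sketch below computes it for two already aligned sequences; it is a generic illustration, not the ARB implementation, and the function names are hypothetical.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed difference proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGTACGA")  # one difference in eight sites
d = jukes_cantor(p)
```

  • The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.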
  • If the identity between the OTU representative sequence and the 16S rRNA gene sequence of the closest reference strain is less than 98.7%, but the identity with the representative sequences of other "species" in the "genus" is 95% or more, the OTU can be determined to be a suspected new species of unknown bacteria (Figure 1); if the identity between the OTU representative sequence and the 16S rRNA gene sequence of the closest reference strain on the all-bacteria phylogenetic tree is less than 95% and it has not yet been named, it can be designated a high-order unit of unknown bacteria and named using the higher-level taxonomic unit plus an OPU number (Figure 1).
  • An OPU that has already been named in the prior art is a known bacterium and is annotated with that name; an OPU that has not been named is an unknown bacterium, and the OPU and its code are used as the unique name of the bacterium;
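  • The identity thresholds stated above (98.7% and 95%) can be summarized in a small decision function. This is a simplified sketch: it uses a single identity value against the closest reference strain and ignores the tree topology that the method also relies on. The function name and return labels are hypothetical.

```python
def classify_by_identity(identity_percent):
    """Classify an OTU representative sequence by identity to its closest reference strain.

    Thresholds follow the text: >= 98.7 known bacterium, 95 to < 98.7 suspected
    new species, < 95 high-order unit. Tree topology is not modeled here.
    """
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_percent >= 98.7:
        return "known bacterium"
    if identity_percent >= 95.0:
        return "suspected new species"
    return "high-order unit"


examples = {pct: classify_by_identity(pct) for pct in (99.2, 98.7, 96.5, 94.9)}
```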
  • 1235 operational phylogenetic units (OPUs) were obtained by sorting the 16S rRNA gene sequences.
  • These 1235 OPUs include 461 "species" of known bacteria and 774 unknown bacteria;
  • The sequencing in step (1), that is, determination of the full-length (1450-1500 bases) bacterial 16S rRNA gene sequence, is carried out on the third-generation PacBio sequencing platform using at least 120 healthy human stool samples. The low-quality sequences deleted in quality control include sequences containing a base with a quality value lower than 10, sequences in which the primers at both ends cannot be recognized, and chimeras.
  • The 16S rRNA gene sequences of the named bacterial reference strains in step (4) come from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence libraries included in and published by the prokaryote standard nomenclature list, the National Center for Biotechnology Information, and the online quality control and comparison database for bacterial 16S rRNA gene sequences.
  • The reference sequence library also incorporates 16S rRNA gene sequences from the online quality control and comparison database for bacterial 16S rRNA gene sequences (SILVA, https://www.arb-silva.de/) that carry the same bacterial taxonomic names, after removal of degenerate (merged) bases. A degenerate base is a single symbol that stands for two or more bases, such as N, which can represent any of the four bases U/C/A/G. These sequences total 143,000 and are mainly derived from non-reference strains.
  • As a supplement to the 16S rRNA gene sequences of the taxonomic reference strains of known bacteria, they improve diversity and coverage.
  • The three online databases described here are all open public databases and do not restrict the source or construction method of the database of the present invention; any database that provides diversity and coverage of bacterial sources can be processed and used by the method of the present invention.
  • A reference sequence library of 16S rRNA genes of intestinal bacteria is constructed, comprising more than 800,000 16S rRNA gene sequences (including those found in the intestinal flora of 120 healthy people and those in the 16S rRNA gene sequence libraries included in and published by the prokaryote standard nomenclature list, the National Center for Biotechnology Information, and the online quality control and comparison database for bacterial 16S rRNA gene sequences).
  • The figure of more than 800,000 16S rRNA gene sequences does not limit the size or construction method of the database of the present invention; any database that provides diversity and coverage of bacterial sources can be adopted by the method of the present invention.
  • The cutting in step (6) is a computer-based virtual cut that extracts the V3-V4 region of the 16S rRNA gene.
  • The sequence of the upstream cutting site used by the virtual cut is shown in SEQ ID NO.1 (CCTAYGGGRBGCASCAG), and the sequence of the downstream cutting site is shown in SEQ ID NO.2 (GGACTACNNGGGTATCTAAT).
  • The cutting described in step (6) of the above method is performed by computer at the binding sites of the universal amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2) to obtain the V3-V4 region sequences of all intestinal flora reference sequences.
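  • One way to picture the virtual cut is the sketch below, which turns the two degenerate primers into regular expressions using the standard IUPAC nucleotide codes and returns the region between their binding sites. It assumes the reference sequence is given in the forward orientation, so the 806R site is searched as its reverse complement, and it assumes the primer sites themselves are excluded from the output unless `keep_primers=True`; the patent does not state either detail. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

PRIMER_341F = "CCTAYGGGRBGCASCAG"      # SEQ ID NO.1 in the text
PRIMER_806R = "GGACTACNNGGGTATCTAAT"   # SEQ ID NO.2 in the text


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer)


def virtual_cut(seq, fwd=PRIMER_341F, rev=PRIMER_806R, keep_primers=False):
    """Return the region between the two primer sites, or None if a site is missing."""
    seq = seq.upper().replace("U", "T")
    fwd_hit = re.search(primer_regex(fwd), seq)
    if fwd_hit is None:
        return None
    # Search for the reverse primer site only downstream of the forward site.
    rev_hit = re.search(primer_regex(reverse_complement(rev)), seq[fwd_hit.end():])
    if rev_hit is None:
        return None
    start = fwd_hit.start() if keep_primers else fwd_hit.end()
    end = fwd_hit.end() + (rev_hit.end() if keep_primers else rev_hit.start())
    return seq[start:end]


# Toy sequence: flank + 341F site + insert + 806R site (reverse complement) + flank.
toy = "AAAA" + "CCTACGGGAGGCAGCAG" + "TTTTGGGG" + "ATTAGATACCCTGGTAGTCC" + "CCCC"
region = virtual_cut(toy)
```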
  • A reference sequence library of the 16S rRNA gene V3-V4 region of the intestinal flora is formed, comprising 273,000 16S rRNA gene V3-V4 sequences, which can detect and identify more than 18,000 published known bacteria as well as unknown bacteria in the gut flora of healthy humans.
  • The existing method of detecting intestinal flora by high-throughput sequencing of the V3-V4 region of the 16S rRNA gene can only detect known bacteria and cannot detect unknown bacteria.
  • The present invention solves the above technical problems through the definition, discovery and annotation of OPUs and the construction of an OPU-based bacterial phylogenetic tree. It supports analysis and prediction of pathogenicity and therapeutic applications, and greatly improves the efficiency of bacterial identification, pathogenic bacteria discovery and probiotic screening. Through the method provided by the present invention, it is found that there are 774 "species" of unknown bacteria in the human intestinal flora, that is, 774 OPUs.
  • The present invention can identify more than 95% of the high-throughput sequencing data of the 16S rRNA gene V3-V4 region from stool samples as known bacteria or unknown bacteria (OPUs).
  • the identification rate based on the sequence of the V3-V4 region has increased from 37.8% in the prior art to 95.6% and above.
  • The method of the present invention can analyze intestinal flora imbalance at the "species" level, can find known pathogenic bacteria and potential pathogenic bacteria, and can analyze the type and abundance of intestinal probiotics. It can be used to assess the diversity of human intestinal flora, health status, disease status, etc., including analysis of the diversity and composition ratio of the intestinal flora in patients.
  • Figure 2 The composition ratio threshold of 116 kinds of bacteria (OPU) in the intestinal flora of healthy people;
  • Figure 4 The composition ratio threshold of 116 kinds of bacteria (OPU) in the intestinal flora of healthy people;
  • Figure 7 The structure and abundance of fecal flora in patients with liver cirrhosis (F54);
  • Figure 8 The structure and abundance of fecal flora in infantile diarrhea patients (F181).
  • Construction example 1 Construction of the reference sequence working library of the intestinal flora 16S rRNA gene V3-V4 region
  • OPU is the abbreviation of Operational Phylogenetic Unit. It is the smallest monophyletic group in taxonomy and comprises a group of full-length 16S rRNA gene sequences that represent a group of bacterial strains.
  • the 16S rRNA gene sequences of the strains in each OPU group were the closest to each other and belonged to a monophyletic group.
  • Different OPUs belong to different monophyletic groups.
  • Unknown bacteria are annotated using the numbered OPUs of the present invention, each representing a new "species", a new "genus", a new "family", a new "order", a new "class", a new "phylum" and so on. Relying only on analysis of the full-length 16S rRNA gene sequence, and given current taxonomic knowledge, it is impossible to accurately discover and define a new taxonomic unit at the level of "genus" or above.
  • the division of OPU includes two steps: one is to divide OTU, and the other is to divide OPU.
  • the specific method is as follows:
  • Based on the sequence comparison and the topology and relationships of the phylogenetic tree, if an OTU groups with the 16S rRNA gene sequence of a known bacterium and forms an independent branch with it, it can be annotated as that known bacterium, such as Streptococcus suis.
  • Such a known bacterium, which forms an independent clade on the phylogenetic tree, is an OPU with a taxonomic name.
  • OTUs with less than 98.7% identity to the 16S rRNA gene sequences of all known bacterial reference strains were identified as unknown bacteria and annotated using the OPU method.
  • the representative 16S rRNA gene sequences of OTUs with less than 98.7% identity were added to the Silva Reference Non Redundant database (SILVA SSURef_NR_132) of the Silva database for secondary comparison.
  • Each OPU is the smallest monophyletic group.
  • Each OPU includes at least two types of sequences: the representative sequences of OTUs, and the 16S rRNA gene sequences closest to these representative sequences, especially the 16S rRNA gene sequence of the closest reference strain (Fig. 1).
  • OPU numbering: all OPUs are numbered uniformly, and the number of each OPU is unique.
  • Known bacteria are indicated by the recognized names of bacteria, such as Prevotella copri.
  • Unknown bacteria are represented by an OPU and code. For example, Bacteroides sp. 17 (OPU-532) denotes a suspected new species of Bacteroides that has not yet been isolated and identified, and Lachnospiraceae (OPU-001) denotes a new member of Lachnospiraceae.
  • The 16S rRNA gene sequences were integrated to construct a 16S rRNA gene reference sequence library of intestinal flora. It includes 850,000 high-quality bacterial 16S rRNA gene sequences and can detect and identify more than 18,000 published bacterial species and subspecies. In particular, it can detect and identify 774 unknown bacteria. It has a large library capacity, long sequence length, and accurate taxonomic annotation information. It will also be updated as new species of bacteria are discovered and published, with the goal of being able to detect and identify all known bacteria (Figure 1).
  • The 850,000 sequences in the intestinal bacterial 16S rRNA gene reference sequence library that we constructed were cut by computer at the binding sites of the 16S rRNA gene V3-V4 amplification primers 341F (CCTAYGGGRBGCASCAG) and 806R (GGACTACNNGGGTATCTAAT) to obtain the V3-V4 region sequences of all 850,000 16S rRNA genes. That is, each full-length 16S rRNA gene in the reference sequence library is virtually cut by computer and only the V3-V4 region sequence is retained, forming the reference sequence working library of the intestinal flora 16S rRNA gene V3-V4 region. In the newly established reference sequence working library, identical sequence entries are merged.
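  • Merging identical entries can be sketched as a simple dereplication step keyed on the V3-V4 sequence. The patent does not say how annotations are handled when identical V3-V4 sequences come from differently named entries, so this sketch simply keeps every distinct annotation; the function name and record format are hypothetical.

```python
from collections import defaultdict


def build_working_library(records):
    """Merge identical V3-V4 sequences.

    records : iterable of (annotation, v3v4_sequence) pairs
    returns : {sequence: sorted list of distinct annotations}
    """
    merged = defaultdict(set)
    for annotation, seq in records:
        merged[seq.upper()].add(annotation)
    return {seq: sorted(names) for seq, names in merged.items()}


# Toy records; the sequences are placeholders, not real V3-V4 regions.
records = [
    ("Prevotella copri", "acgtacgt"),
    ("Prevotella copri", "ACGTACGT"),
    ("Lachnospiraceae (OPU-001)", "GGCCGGCC"),
]
working_library = build_working_library(records)
```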
  • A working library of 16S rRNA gene V3-V4 sequences was constructed, which can detect and identify more than 18,000 bacterial species and subspecies. Since the 16S rRNA gene sequences of unknown bacteria in the intestinal tract of healthy people are included, most of the bacterial 16S rRNA gene V3-V4 sequences obtained from human stool samples can be identified to the bacterial "species".
  • The bacterial 16S rRNA gene V3-V4 sequence library constructed by the present invention is a dynamic database that can change as the online public databases and the researchers' own data grow. Changes to the database do not affect the implementation of the method of the present invention, and as the database grows, the accuracy of identifying human flora at the "species" level based on bacterial 16S rRNA gene sequences will improve accordingly.
  • The core of the present invention lies not in the composition of the database itself, but in the method of constructing a dynamic and open 16S rRNA gene reference sequence library of human flora based on bacterial operational phylogenetic units.
  • Construction example 2 The construction of the composition diversity and composition ratio analysis method of "species" level intestinal flora
  • On the basis of the database constructed in Example 1 (Fig. 1), a method or system for analyzing the composition diversity and composition ratio of the intestinal flora at the "species" level was constructed for the samples to be tested.
  • The specific implementation includes 4 parts: collection and processing of stool samples, high-throughput sequencing of the V3-V4 region of the 16S rRNA gene, taxonomic annotation at the "species" level, and presentation of human stool flora diversity and composition ratio results.
  • The extraction used a column-purification fecal nucleic acid extraction kit (Qiagen, cat.51604): take a 200 mg stool sample and extract it according to the instructions. Finally, elute the column with 200 μL deionized water to collect fecal nucleic acid for subsequent 16S rRNA gene amplification.
  • The fecal nucleic acid was amplified by PCR, the product was purified, and the Illumina MiSeq platform was used to perform paired-end sequencing of the V3-V4 region of the 16S rRNA gene.
  • Taxonomic identification at the "species" level: the obtained 16S rRNA gene V3-V4 region sequences undergo quality control by conventional methods to remove sequences with ambiguous bases and chimeras. They are then queried against the intestinal bacterial 16S rRNA gene V3-V4 region reference sequence library. Query sequences with 100% identity to a reference sequence are annotated as known bacteria or unknown bacteria according to the taxonomic information of the matched reference sequence. A known bacterium is annotated with the corresponding taxonomic name, such as Streptococcus suis. An unknown bacterium is annotated with the corresponding coded OPU, including suspected new species, high-order units, etc. Sequences that cannot be annotated are annotated as unidentified (Figure 1).
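  • Because annotation requires 100% identity, the lookup can be pictured as an exact dictionary match. The sketch below assumes the working library is available as a mapping from V3-V4 sequence to annotation label; the function names and toy sequences are hypothetical, and real pipelines would use an alignment tool rather than plain string equality.

```python
def annotate_reads(query_seqs, working_library):
    """Label each query with the annotation of an identical reference, else 'unidentified'."""
    labels = []
    for seq in query_seqs:
        key = seq.upper().replace("U", "T")
        labels.append(working_library.get(key, "unidentified"))
    return labels


def identification_rate(labels):
    """Fraction of reads that received a known-bacterium or OPU annotation."""
    if not labels:
        return 0.0
    return sum(label != "unidentified" for label in labels) / len(labels)


# Toy library; sequences are placeholders, labels follow the naming examples in the text.
library = {
    "ACGTACGT": "Streptococcus suis",
    "TTGGCCAA": "Bacteroides sp. 17 (OPU-532)",
}
labels = annotate_reads(["ACGTACGT", "ttggccaa", "ACGTACGA"], library)
rate = identification_rate(labels)
```

  • The third toy read differs from the first reference by one base, so it stays unidentified under the 100% identity rule.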
  • the method of the present invention can detect and describe the diversity of human intestinal flora from the level of taxonomy "species".
  • The present invention found that the intestinal flora of each healthy Chinese person contains an average of 186 ± 51 OPUs, of which the numbers of OPUs in the low-frequency flora (carried by fewer than 10% of people), medium-frequency flora (carried by 10%-60% of people) and high-frequency flora groups were 20 ± 11, 75 ± 29 and 90 ± 19, respectively.
  • a total of 1235 OPUs were detected, of which 774 (62.7%) OPUs were unknown bacteria (Fig. 2-Fig. 4).
  • the resident flora in the intestinal tract of healthy people refers to bacteria with a positive rate of 60% or more in the stool samples of healthy Chinese people.
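  • The carriage-frequency grouping can be sketched as follows. The cutoffs of 10% and 60% come from the text; treating exactly 10% as medium-frequency and 60% or more as high-frequency (resident) is an assumption based on the resident-flora definition above. The function names and toy samples are hypothetical.

```python
def carriage_rate(opu, samples):
    """Fraction of samples in which the OPU was detected.

    samples : list of sets, each holding the OPU labels found in one person
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(opu in sample for sample in samples) / len(samples)


def frequency_class(rate):
    """Map a carriage rate (0 to 1) to the frequency groups used in the text."""
    if rate < 0.10:
        return "low-frequency"
    if rate < 0.60:
        return "medium-frequency"
    return "high-frequency (resident)"


# Toy cohort of five people.
samples = [
    {"OPU-001", "OPU-532"},
    {"OPU-001"},
    {"OPU-001"},
    {"OPU-001", "OPU-532"},
    {"OPU-001"},
]
class_001 = frequency_class(carriage_rate("OPU-001", samples))
class_532 = frequency_class(carriage_rate("OPU-532", samples))
```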
  • Known bacteria are denoted by the bacterial names published in the standard nomenclature list for prokaryotes, mainly bacteria (LPSN: https://www.bacterio.net/), such as Prevotella copri.
  • A suspected new species is a potential new species that can be assigned to a "genus" but has not yet been isolated and identified; it is indicated by the genus name and OPU number, such as Bacteroides sp. 17 (OPU-532).
  • A high-order unit is a bacterium that is difficult to identify accurately from the 16S rRNA gene sequence alone; it is represented by an upper-level taxonomic unit and an OPU code, such as Lachnospiraceae (OPU-001), which denotes a new member of Lachnospiraceae.
  • The technical feature that plays a key role in the method of the present invention is the discovery of the 16S rRNA gene sequences of 774 unknown bacteria, which makes it possible to detect and analyze known intestinal bacteria at the "species" level and unknown intestinal bacteria at the "OPU" level.
  • For the sequencing data of the 16S rRNA gene V3-V4 region obtained on the Illumina MiSeq platform, we used two databases and comparison methods: (1) the RDP classifier Bayesian algorithm was used for taxonomic analysis of OTU representative sequences clustered at a 97% similarity level, with the Silva_132 16S rRNA database used for annotation, to obtain the species composition and abundance information of each sample; (2) the database and comparison method constructed in the present invention were used to analyze the species composition and abundance information of each sample. For the full-length 16S rRNA gene sequencing data obtained on the PacBio Sequel platform, we used the OPU strategy to analyze the species composition and abundance information of each sample.
  • The analysis of the 120 healthy human samples was divided into 3 methods according to the sequencing method, database and comparison software used:
  • (1) The Illumina MiSeq platform is used to sequence the V3-V4 region of the 16S rRNA gene, and the database and comparison software constructed in the present invention are used for analysis (hereinafter referred to as the method of the present invention); (2) the Illumina MiSeq platform is used to sequence the V3-V4 region of the 16S rRNA gene, and the Silva_132 16S rRNA database and the RDP classifier Bayesian algorithm are used for comparative analysis (hereinafter referred to as the common method); (3) the PacBio Sequel platform is used for full-length sequencing of the 16S rRNA gene, and the operational phylogenetic unit (OPU) strategy is used to analyze the composition and abundance information of each sample. Because this method obtains the full-length sequence of the 16S rRNA gene, and using the full-length 16S rRNA gene to determine the "species" is the gold standard, it is hereinafter referred to as the gold standard method.
  • The method of the present invention can identify, on average, more than 95% of the 16S rRNA gene sequences in each stool sample to the level of "species" (OPU).
  • The method of the present invention therefore has an advantage in the proportion of sequences identified at the "species" level.
  • We used the same data, that is, the sequencing data of the 16S rRNA gene V3-V4 region from the Illumina MiSeq platform, analyzed them with both the database plus comparison method constructed in the present invention and the commonly used Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm, and compared the number of sequences identified to the "species" level.
  • The comparison results show that the database and comparison method established in the present invention can identify 95.6% of the sequences to the "species" level on average, while the currently commonly used Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm can identify only 38.1% of the sequences to the "species" level.
  • Across the 120 samples, the database and comparison software constructed in the present invention found 140.47 "species" per sample on average, the gold standard method found 92.91 "species" per sample on average, and the currently commonly used database and comparison software (for example, the Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm) found only 82.08 "species" per sample on average (see attached table 3). These data show that the database and comparison software constructed in the present invention can find more "species", which is of great value for analyzing the structure and abundance of intestinal flora.
  • Application Example 2 Analyzing the composition and composition ratio of fecal flora using the method of the present invention for clinical patient samples
  • Figure 5 shows the composition ratio thresholds of 116 OPUs with a detection rate of 60% and above, called intestinal resident flora.
  • The method of the present invention was used to analyze the composition and abundance of the fecal flora, and the flora structure and abundance were compared with those of the reference population; this can be used to assess the state of the patient's intestinal flora and its correlation with disease.
  • Figure 5 shows the analysis results of the diversity and composition ratio of the intestinal flora of healthy people.
  • Human gut bacteria not only influence body weight and digestion, protect against infection and risk of autoimmune disease, but also control the body's response to drugs that treat disease. Therefore, the data obtained from the research on the diversity and composition ratio of human intestinal flora can be used as indicators of health and disease status. Doctors analyze, judge, and diagnose patients' diseases and health status by interpreting the data of human flora diversity and composition ratio.
  • Application example 2.1 Bacterial flora analysis of adult diarrheal disease fecal samples
  • Bacterial "species" (OPU)
  • The abundance of opportunistic pathogens such as Bacteroides fragilis, Klebsiella pneumoniae, and Ruminococcus torques in the tested stool sample is higher than the threshold.
  • The abundance of Klebsiella pneumoniae is above the threshold. Klebsiella pneumoniae can cause diarrhea in children.
  • The invention discloses a method for detecting and analyzing the bacterial 16S rRNA gene V3-V4 region sequences of human fecal samples, which can detect and annotate the compositional diversity and composition ratio of the intestinal flora at the "species" level.
  • The method can be implemented industrially and has industrial applicability.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Analytical Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

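The present invention discloses a method for identifying the human intestinal flora at the "species" level based on bacterial 16S rRNA gene sequences. The method comprises: (1) constructing a reference library of human intestinal flora 16S rRNA gene sequences based on bacterial operational phylogenetic units; (2) sequencing the 16S rRNA gene of the specimen to be tested; (3) comparing the specimen's 16S rRNA gene sequences with the reference library and identifying the species. The method can annotate the human intestinal flora to the "species" level and reveal its diversity, composition ratio, abundance, and related data. These data can be used to determine whether a person's intestinal flora is imbalanced, to detect known pathogens and potential pathogens, to analyze the kinds and abundance of intestinal probiotics, and to analyze the correlation between intestinal flora disorder and health status or disease.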

Description

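Method for detecting and analyzing bacteria at the "species" level based on bacterial 16S rRNA gene sequences
Technical Field
The present invention discloses a method for detecting and analyzing the V3-V4 region sequences of the bacterial 16S rRNA gene in human fecal specimens, which detects and annotates the compositional diversity and composition ratio of the intestinal flora at the "species" level and, in particular, can detect unknown bacteria that dominate in number and proportion but have not yet been isolated or studied. It belongs to the technical fields of microbial ecology, microbial taxonomy, and microbiomics.
Background Art
Since microbiome research began, many studies have suggested that human growth and development, nutrient metabolism, disease states, and immune responses are related to the intestinal flora, for example in colorectal cancer, obesity, and diabetes. But how many "species" does the human intestinal flora actually contain, and how abundant is each? There is still no clear answer. In the past, research on intestinal flora diversity relied mainly on isolation and culture. Because the media and culture conditions used (such as temperature, oxygen content, amino acid and carbohydrate composition, and salt concentration) are selective, only bacteria able to grow under those conditions could be obtained. The large number of bacteria that cannot grow on these media and under these conditions, and that therefore have not yet been isolated, cultured, or identified, were overlooked, which produced much misleading information.
It is estimated that there are about 10^12 species of prokaryotes on Earth, most of them bacteria. The bacterial taxonomic ranks are kingdom, phylum, class, order, family, genus, and species. The "species" is the lowest taxonomic unit for bacteria. The taxonomic units most often encountered in medicine are "genus" and "species". A bacterial genus may contain a few species (for example, Escherichia, with 6 species) or several hundred (for example, Streptococcus, with more than 200 species). Different species of the same genus can differ greatly in biological and medical significance: some are probiotics (such as Streptococcus thermophilus) and some are pathogens (such as Streptococcus suis). Information on the taxonomic diversity and composition ratio of the intestinal flora that is limited to the genus level is therefore far from sufficient and is easily misleading. Only species-level analysis can adequately reveal how changes in diversity and composition ratio correlate with health and disease, and provide reasonably clear medical reference value.
All bacteria have 16S rRNA, a ribosomal RNA of the small ribosomal subunit that takes part in protein synthesis and serves as a molecular clock of bacterial evolution. The corresponding gene in the bacterial genome, the 16S rRNA gene, is about 1500 bases long and consists of 9 variable regions (V1-V9) alternating with conserved regions. The conserved regions are highly conserved, whereas the variable regions differ among genera and species, and the degree of variation is closely related to the phylogenetic position of the bacterium (species, genus, family, and so on). 16S rRNA gene sequence analysis can therefore be used to identify and classify all bacteria. With the full-length 16S rRNA gene sequence, the bacterium under test can in most cases be identified to the "species" level.
With partial 16S rRNA gene sequences, such as the V3-V4 segment, well-studied known bacteria that have 16S rRNA gene sequences in public databases can be classified to "species"; for lack of reference sequences, most unknown bacteria can only be classified to higher-rank taxonomic units such as "genus" or "family". In a few cases the full-length 16S rRNA genes of several "species" are so similar that the 16S rRNA gene alone cannot identify them accurately to "species". Species that cannot be distinguished by the full-length 16S rRNA gene are usually placed in one group.
16S rRNA gene sequence analysis has become an important method for detecting and identifying bacteria and for analyzing flora diversity. As sequencing technology has advanced and costs have fallen, high-throughput sequencing on second-generation platforms yields massive numbers of bacterial 16S rRNA gene sequences without bacterial culture, providing a powerful tool for studying flora diversity. A common approach for intestinal flora diversity is high-throughput sequencing of the V3-V4 region (about 400 bases) of the 16S rRNA gene from fecal specimens on the Illumina platform. A single sample yields 100,000 or more 16S rRNA gene sequences, which are aligned, analyzed, and annotated to complete the taxonomic analysis and identification of the intestinal (fecal) flora in the sample. This produces data on diversity (how many "species" or "genera" are present) and composition ratio (the percentage of all sequences belonging to each "species" or "genus"). Because a large part of the intestinal flora consists of unknown bacteria that have not been isolated or identified, no corresponding full-length 16S rRNA gene sequences are available for comparison. Existing analysis techniques can therefore identify these numerically dominant unknown bacteria only to the "genus" level or above, not accurately to "species".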
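Technical Problem
Deficiency of the prior art: the 16S rRNA gene sequences amplified for second-generation sequencing, such as the V3-V4 region, are only about 400 bases long, so most sequences can be identified only to the "genus" level or above, yielding diversity and composition-ratio data at or above the genus level. Such data cannot accurately reveal the relationship between changes in the intestinal flora and health or disease, which limits the application and adoption of intestinal flora analysis. The object of the present invention is to provide a method for detecting, identifying, and analyzing the human flora at the "species" level.
Technical Solution
[0004] To this end, the present invention first provides a method for identifying the human flora at the "species" level based on full-length or near full-length bacterial 16S rRNA gene sequences. Here the 16S rRNA gene means a full-length or near full-length 16S rRNA gene sequence of 1450-1500 bases. The method comprises the following steps:
(1) Construct a human intestinal flora 16S rRNA gene reference sequence library with the bacterial Operational Phylogenetic Unit (OPU) as the basic annotation unit. The OPUs cover all known bacteria and the many previously undiscovered unknown bacteria of the human intestine found by the present invention. The reference library includes all known bacteria that have been named at the "species" level, as well as unknown bacteria. An OPU that already has a name in the prior art (a known bacterium) is annotated with that name; an OPU without a prior-art name is uniquely named by the OPU and its code together with its next-higher taxonomic unit. On this basis the invention constructs a full-length 16S rRNA gene reference sequence library for the human intestinal flora. The library includes all named bacteria and the unknown intestinal bacteria found by the invention. The 16S rRNA gene sequences of the reference strains of all named known bacteria come from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence collections held and published by the List of Prokaryotic names with Standing in Nomenclature (LPSN), the US National Center for Biotechnology Information (NCBI), and the online database for quality checking and alignment of bacterial 16S rRNA gene sequences (SILVA);
(2) Construct a reference sequence library for the V3-V4 region of the bacterial 16S rRNA gene: the V3-V4 region of each sequence in the above full-length reference library is excised in silico ("virtual cutting") to obtain the V3-V4 sequences. Virtual cutting uses the binding sites of the universal V3-V4 amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2). After entries with identical sequences are merged, this forms the working reference library of intestinal flora 16S rRNA gene V3-V4 sequences, which can detect and identify all known bacteria (more than 18,000 species) and the unknown bacteria of the healthy human intestinal flora found by the invention (774 OPUs);
(3) Sequence the 16S rRNA gene of the specimen to be tested; in one specific embodiment of the invention, the V3-V4 region of the 16S rRNA gene is sequenced;
(4) Use the specimen 16S rRNA gene sequences obtained in step (3) as query sequences and compare them against the working V3-V4 reference library of step (2) for identification. A query sequence that is completely identical (100%) to a specific sequence carrying taxonomic information in the working library is identified by the annotated name of that sequence. In one specific embodiment, the V3-V4 sequences obtained from the specimen are compared with the V3-V4 reference library: a sequence 100% identical to the V3-V4 reference sequence of a known bacterial "species" is annotated with the taxonomic "species" name of that bacterium; a sequence 100% identical to the V3-V4 reference sequence of an unknown bacterium is annotated as an unknown bacterium and given its unique OPU number. Unknown bacteria include suspected new species and higher-rank units. A higher-rank unit is one that is hard to identify accurately from the 16S rRNA gene sequence alone and is denoted by the next-higher taxonomic unit plus an OPU code.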
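The exact-match annotation of step (4) can be illustrated with a minimal Python sketch. It assumes the working library is held in memory as a mapping from V3-V4 sequence to annotation; the toy sequences, the `REFERENCE` table, and the function name `annotate` are hypothetical and are not taken from the patent, which does not disclose its comparison software.

```python
from collections import Counter

# Hypothetical working library: V3-V4 sequence -> annotation.
# Real entries would be roughly 400-base sequences; short toys are used here.
REFERENCE = {
    "ACGTACGT": "Streptococcus suis",          # known bacterium, species name
    "ACGTTCGT": "Bacteroides sp. (OPU-532)",   # unknown bacterium, OPU code
}

def annotate(reads, reference):
    """Count reads per annotation, accepting only 100% identical matches."""
    counts = Counter()
    for read in reads:
        # A dictionary lookup is an exact, full-length match by construction.
        counts[reference.get(read.upper(), "unidentified")] += 1
    return counts

reads = ["ACGTACGT", "acgtacgt", "ACGTTCGT", "TTTTTTTT"]
result = annotate(reads, REFERENCE)
```

Because only identical sequences are accepted, the lookup needs no alignment step; reads that differ from every library entry by even one base fall into the `unidentified` bucket.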
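In a preferred embodiment, the method further comprises analyzing the kinds, proportions, and/or abundance of the species identified in step (3) within the flora of the specimen. In a given application the following may be reported as needed, without limitation: the number of OPUs in the specimen; the number, kinds, and abundance of known bacteria; the kinds, number, and abundance of unknown bacteria; the percentage of the total intestinal flora accounted for by each "species" or OPU; the kinds and abundance of probiotics; the kinds and abundance of pathogens and putative pathogens; and the number and abundance of dominant OPUs.
In another preferred embodiment, the 16S rRNA gene sequence in the method is the V3-V4 region sequence. The method can be used for flora identification and analysis based on the V3-V4 region of the 16S rRNA gene, but it is not limited to that region and can also be applied to other regions of the 16S rRNA gene.
In a preferred embodiment, the human flora is derived from the digestive tract, skin, oral cavity, nasopharynx, eye, vagina, urinary tract, or ear.
In another preferred embodiment, the sequencing in step (2) of the method is high-throughput sequencing; in one specific embodiment of the invention, the sequences are obtained by deep sequencing of the 16S rRNA gene V3-V4 region of the intestinal or fecal sample on the Illumina second-generation sequencing platform.
Secondly, the present invention provides a method for constructing the reference library of human intestinal flora 16S rRNA gene V3-V4 region sequences, based on bacterial operational phylogenetic units, that is described in step (1) of the above method for detecting and identifying the human intestinal flora at the "species" level by analysis of full-length or near full-length bacterial 16S rRNA gene sequences. The construction method comprises: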
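(1) Sequencing and quality control: obtain bacterial 16S rRNA gene sequences from human specimens and remove low-quality sequences by quality control (for example, sequences containing a base with a quality value below 10, sequences in which both end primers cannot be recognized, and chimeras). In the present invention, the inventors used the PacBio third-generation sequencing platform on fecal specimens from 120 healthy people and obtained 850,935 full-length or near full-length (1450-1500 bases) 16S rRNA gene sequences of the human intestinal flora.
Quality control was performed with PacBio SMRT Link (version 6.0.0). Samples were demultiplexed according to RSII_384_Barcodes with the Minimum Barcode Score set to 26. Circular Consensus Sequencing (CCS) was used to reduce the sequence error rate, with a minimum of 5 CCS passes and a Minimum Predicted Accuracy above 99.9%. QIIME was then used to filter out ambiguous bases, low-quality sequences, primers, and sequencing adapters. Sequences outside the 1200-1600 bp length range were removed. In one specific example of the invention, the inventors used UCHIME, the chimera detection tool of the bioinformatics software USEARCH (http://www.drive5.com/usearch/), and QIIME (Quantitative Insights Into Microbial Ecology), and retained 594,075 full-length or near full-length 16S rRNA gene sequences;
(2) Delimit Operational Taxonomic Units (OTUs): a group of 16S rRNA gene sequences from step (1) with sequence identity of 98.7% or higher is assigned to one OTU (each fecal specimen yields several OTUs, and each OTU contains several 16S rRNA gene sequences);
(3) Determine the representative sequences of each OTU: the 10 most frequent 16S rRNA gene sequences in an OTU obtained in step (2) are chosen as the representative sequences of that OTU; if the OTU has fewer than 10 sequences, all of them are chosen;
(4) Construct the bacterial phylogenetic tree: align the representative sequences of each OTU from step (3) with the 16S rRNA gene sequences of named bacterial reference strains, and insert the aligned OTU representative sequences into the database of 16S rRNA gene sequences of all named bacterial reference strains, with the parameter set to LTP50. From the inserted OTU representative sequences and the 16S rRNA gene sequences of the named reference strains, build the all-bacteria phylogenetic tree by the neighbor-joining method with Jukes-Cantor correction, with the conservation level set to 30%.
In one specific embodiment of the invention, the tree is built as follows: the representative sequences of each OTU from step (3) are aligned with the 16S rRNA gene sequences of all known bacteria (LTP132 database) using SINA (version 1.2.11). Using the Parsimony tool built into ARB (version 6.0.6), the aligned OTU representative sequences are inserted into the databases of 16S rRNA gene sequences of all named bacterial reference strains (the LTP 132 and NR SILVA Ref 132 databases), with the parameter set to LTP50. From the inserted OTU representative sequences and the 16S rRNA gene sequences of the named reference strains, the all-bacteria phylogenetic tree is built by the neighbor-joining method with Jukes-Cantor correction, with the conservation level set to 30%.
(5) Discover unknown bacteria of the healthy human intestine: on the all-bacteria phylogenetic tree, the representative sequences of a query OTU cluster with the most similar 16S rRNA gene sequences and form a branch on the tree (Fig. 1); this branch is defined as one OPU (operational phylogenetic unit). If the identity between the OTU representative sequences and the nearest 16S rRNA gene sequence on the tree is 98.7% or higher and that sequence has been named, the OPU is annotated with that bacterial name and is regarded as a known bacterium (Fig. 1). If the identity to the nearest 16S rRNA gene sequence on the tree is below 98.7% but the identity to the representative sequences of other "species" in the "genus" is 95% or higher, the OPU is regarded as a suspected new species of unknown bacteria (Fig. 1). If the identity to the 16S rRNA gene sequence of the nearest reference strain on the tree is below 95% and no name has been assigned, the OPU is designated a higher-rank unit of unknown bacteria and is named with the numbered next-higher taxonomic unit and the OPU number (Fig. 1).
(6) Construct the OPU-based reference library of human intestinal flora 16S rRNA gene sequences: on the bacterial phylogenetic tree built from the 16S rRNA genes of known bacteria, a query sequence clusters with its taxonomically nearest reference sequence and forms an independent branch on the tree, which is named as one OPU (Fig. 1). An OPU whose query sequence is 98.7% or more similar to the nearest reference sequence is regarded as a known bacterium; if the similarity is below 98.7%, it is regarded as an unknown bacterium. An OPU already named in the prior art is a known bacterium and is annotated with its name; an unnamed OPU is an unknown bacterium and is uniquely named by the OPU and its code;
In one specific example of the invention, this step organized the more than 594,000 full-length or near full-length (1450-1500 bases) 16S rRNA gene sequences obtained from healthy human intestinal bacteria into 1235 operational phylogenetic units. These 1235 OPUs comprise 461 "species" of known bacteria and 774 kinds of unknown bacteria;
(7) Cut the 16S rRNA gene reference sequence library obtained in step (5), and merge entries with identical sequences to form the working reference library of intestinal flora 16S rRNA gene V3-V4 sequences.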
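Step (4) builds the tree from Jukes-Cantor corrected distances. As a hedged illustration of what that correction does (not a reproduction of the ARB implementation the patent uses), the sketch below computes the raw proportion of differing sites p between two aligned sequences and applies the standard JC69 formula d = -(3/4) ln(1 - 4p/3). The function names are hypothetical, and gap characters are simply treated as mismatches in this simplified version.

```python
import math

def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """JC69 corrected distance; the formula is undefined for p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two sequences at exactly 98.7% identity differ at p = 0.013 of sites.
d_species_cutoff = jukes_cantor(0.013)
```

The correction accounts for multiple substitutions at the same site, so the corrected distance is always at least as large as p; near the 98.7% identity cutoff the two values are almost equal.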
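In a preferred embodiment, the sequencing in step (1) is performed on the PacBio third-generation sequencing platform and covers fecal specimens from at least 120 healthy people, for which full-length bacterial 16S rRNA gene sequences are determined; the low-quality sequences removed in quality control include sequences containing a base with a quality value below 10, sequences in which both end primers cannot be recognized, and chimeras. In one specific embodiment of the invention, full-length (1450-1500 bases) bacterial 16S rRNA gene sequences were determined for fecal specimens from 120 healthy people.
In a preferred embodiment, the 16S rRNA gene sequences of named bacterial reference strains in step (4) come from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence collections held and published by the List of Prokaryotic names with Standing in Nomenclature, the US National Center for Biotechnology Information, and the online database for quality checking and alignment of bacterial 16S rRNA gene sequences. The List of Prokaryotic names with Standing in Nomenclature (LPSN: https://www.bacterio.net/) and the US National Center for Biotechnology Information (NCBI RefSeq database: https://www.ncbi.nlm.nih.gov/) currently publish more than 38,000 16S rRNA gene sequences of known bacterial reference strains, covering the reference strains of more than 18,000 published and recognized bacterial species and subspecies. The reference library also takes in high-quality sequences from the online quality-checking and alignment database (SILVA, https://www.arb-silva.de/): 16S rRNA gene sequences with the same bacterial taxonomic name, a proportion of degenerate bases below 2%, identity above 99%, and length above 1000 bases, about 143,000 to date. (A degenerate base, by analogy with codon degeneracy, is a single symbol that stands for two or more bases; for example, the degenerate base N can represent any of the four bases U/C/A/G.) These sequences come mainly from non-reference strains and supplement the 16S rRNA gene sequences of the taxonomic reference strains of known bacteria, improving diversity and coverage. The three online databases mentioned here are open public databases and do not limit the sources or construction of the database of the invention; any database that provides diversity and coverage of bacterial sources can be used by the method. By integrating the sequences of these three or more databases, the invention forms an intestinal bacterial 16S rRNA gene reference library of more than 800,000 16S rRNA gene sequences (those found in the intestinal flora of 120 healthy people, plus those held and published by the LPSN, NCBI, and SILVA). The figure of more than 800,000 sequences does not limit the size or construction of the database of the invention; any database that provides diversity and coverage of bacterial sources can be used by the method.
In a preferred embodiment, the cutting in step (6) is an in-silico (virtual) excision of the V3-V4 region of the 16S rRNA gene.
More preferably, the upstream cutting site of the virtual cut has the sequence shown in SEQ ID NO.1 (CCTAYGGGRBGCASCAG) and the downstream cutting site has the sequence shown in SEQ ID NO.2 (GGACTACNNGGGTATCTAAT). The cutting in step (6) of the above method uses the binding sites of the universal V3-V4 amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2) to cut in silico and obtain the V3-V4 sequences of all intestinal flora reference sequences. After entries with identical sequences are merged, the resulting working reference library of intestinal flora 16S rRNA gene V3-V4 sequences contains 273,000 16S rRNA gene V3-V4 sequences and can detect and identify all of the more than 18,000 published known bacteria as well as the unknown bacteria of the healthy human intestinal flora.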
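A minimal sketch of how such a virtual cut might be done in Python is given below. It expands the degenerate primers into regular expressions using the standard IUPAC nucleotide codes and searches the forward strand for the 341F site followed by the reverse complement of 806R. Two points are assumptions rather than statements from the patent: the primer sites themselves are excluded from the returned region, and the primers must match exactly (no mismatches). The names `virtual_cut` and `primer_regex` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD = "CCTAYGGGRBGCASCAG"        # 341F, SEQ ID NO.1
REV = "GGACTACNNGGGTATCTAAT"     # 806R, SEQ ID NO.2

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer)

# 806R binds the opposite strand, so its site appears reverse-complemented.
PATTERN = re.compile(
    primer_regex(FWD) + "([ACGT]+?)" + primer_regex(reverse_complement(REV))
)

def virtual_cut(full_length):
    """Return the region between the two primer sites, or None if not found."""
    match = PATTERN.search(full_length.upper())
    return match.group(1) if match else None

toy = "AAAA" + "CCTACGGGAGGCAGCAG" + "TTGACCA" + "ATTAGATACCCTGGTAGTCC" + "GGGG"
region = virtual_cut(toy)
```

A reference sequence in which either primer site is missing or mutated returns `None` here; a production pipeline would need a policy for such entries (for example, allowing a small number of mismatches), which the patent text does not specify.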
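Beneficial Effects
In the prior art, methods that examine the intestinal flora by high-throughput sequencing of the 16S rRNA gene V3-V4 region can detect only known bacteria, not unknown ones. By defining, discovering, and annotating OPUs and building an OPU-based bacterial phylogenetic tree, the invention solves this problem: unknown bacteria can be detected and then described and annotated as OPUs, and analysis and prediction can be carried out on the discovery of unknown bacteria and on their pathogenic and therapeutic applications, greatly improving the efficiency of bacterial identification, pathogen discovery, and probiotic screening. With the method of the invention, 774 "species" of unknown bacteria, that is, 774 OPUs, were found in the human intestinal flora. In particular, more than 60% of Chinese people share 116 OPUs in their fecal flora, comprising 38 known bacteria and 78 unknown bacteria (denoted by coded OPUs), which account for about 83.42% of the total flora. Using the full-length 16S rRNA gene sequences of the unknown intestinal bacteria we found as taxonomic references makes it possible to detect unknown bacteria of the intestinal flora, which no existing technique can currently do.
By comparison against the 16S rRNA gene sequences of these unknown and known bacteria, the invention can identify on average more than 95% of the 16S rRNA gene V3-V4 high-throughput sequencing data of a fecal specimen as known or unknown bacteria (OPUs). The identification rate based on V3-V4 sequences rises from 37.8% in the prior art to 95.6% or above. The method can analyze imbalance of the healthy human intestinal flora at the "species" level; detect known pathogens and potential pathogens; analyze the kinds and abundance of intestinal probiotics; and, in particular, examine the relationship between the intestinal flora and health or disease. It can be used to assess human intestinal flora diversity, health status, and disease status, including analysis of the diversity and composition ratio of a patient's intestinal flora.
Description of the Drawings
Fig. 1. Technical roadmap for delimiting bacterial operational phylogenetic units (OPUs);
Fig. 2. Composition-ratio thresholds of the 116 bacteria (OPUs) of the healthy human intestinal resident flora;
Fig. 3. Composition-ratio thresholds of the 116 bacteria (OPUs) of the healthy human intestinal resident flora;
Fig. 4. Composition-ratio thresholds of the 116 bacteria (OPUs) of the healthy human intestinal resident flora;
Fig. 5. Diversity (number of kinds) and abundance (composition ratio) of the fecal flora of healthy Chinese people;
Fig. 6. Fecal flora structure and abundance of an adult diarrhea patient (F32);
Fig. 7. Fecal flora structure and abundance of a liver cirrhosis patient (F54);
Fig. 8. Fecal flora structure and abundance of an infant diarrhea patient (F181).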
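Embodiments of the Invention
The invention is further described below with reference to specific examples. Its advantages and features will become clearer as the description proceeds. These examples are merely illustrative and do not limit the scope of protection defined by the claims of the invention.
Construction Example 1. Construction of the working reference library of intestinal flora 16S rRNA gene V3-V4 sequences
1.  Constructing the intestinal flora 16S rRNA gene reference sequence library
(1) Obtaining the 16S rRNA gene sequences of 1235 OPUs from healthy human intestinal bacteria
Intestinal flora specimens from 120 healthy Chinese people were sequenced on the PacBio platform, yielding 850,935 16S rRNA gene sequences. Quality control was performed with PacBio SMRT Link (version 6.0.0). Circular Consensus Sequencing (CCS) was used to reduce the sequence error rate, with a minimum of 5 CCS passes and a Minimum Predicted Accuracy above 99.9%. QIIME was then used to filter out ambiguous bases, low-quality sequences, primers, and sequencing adapters. Sequences shorter than 1200 bases or longer than 1600 bases were removed, leaving 594,075 full-length or near full-length 16S rRNA gene sequences, which were divided into 1235 OPUs. Each OPU may include several high-frequency representative 16S rRNA gene sequences as reference sequences, with identity of 99% or higher.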
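The numeric filters in this paragraph can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (at least 5 CCS passes, predicted accuracy above 99.9%, length within 1200-1600 bases, no ambiguous bases); the `CcsRead` record and its field names are hypothetical and do not correspond to the SMRT Link or QIIME data formats, and chimera removal is not covered.

```python
from dataclasses import dataclass

@dataclass
class CcsRead:
    sequence: str
    passes: int        # number of CCS passes behind the consensus
    accuracy: float    # predicted accuracy as a fraction, 0..1

def passes_qc(read, min_passes=5, min_accuracy=0.999,
              min_len=1200, max_len=1600):
    """Apply the pass-count, accuracy, length, and ambiguity filters."""
    if read.passes < min_passes:
        return False
    if read.accuracy <= min_accuracy:   # the text requires "above 99.9%"
        return False
    if not (min_len <= len(read.sequence) <= max_len):
        return False
    # Reject reads containing anything other than unambiguous bases.
    return not (set(read.sequence.upper()) - set("ACGT"))

good = CcsRead("ACGT" * 350, passes=6, accuracy=0.9995)   # 1400 bases
kept = [r for r in [good, CcsRead("ACGT" * 100, 6, 0.9995)] if passes_qc(r)]
```

In the patent's example these filters, together with chimera removal, reduce 850,935 raw sequences to 594,075.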
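OPU is the abbreviation of Operational Phylogenetic Unit. It is the smallest monophyletic group in taxonomy and comprises a set of full-length 16S rRNA gene sequences representing a set of bacterial strains. The 16S rRNA gene sequences of the strains within one OPU are each other's closest relatives and belong to one monophyletic group; different OPUs belong to different monophyletic groups. OPUs are numerous and include published known bacteria and unknown bacteria. Known bacteria can be annotated with the names published by the international committee on bacterial taxonomy through the List of Prokaryotic names with Standing in Nomenclature, for example Streptococcus pneumoniae. Unknown bacteria are annotated with the OPU numbers assigned by the invention and represent a new "species", "genus", "family", "order", "class", "phylum", and so on. With current taxonomic knowledge, a new taxonomic unit at the "genus" level or above cannot be accurately discovered and defined from full-length 16S rRNA gene sequence analysis alone.
Delimiting OPUs involves two stages: first delimiting OTUs, then delimiting OPUs. The procedure is as follows:
1)    Full-length 16S rRNA gene sequencing. The 16S rRNA gene (V1-V9) in fecal samples is sequenced on a third-generation platform (PacBio RS II platform) to obtain full-length or near full-length sequences (1450-1500 bases).
2)    Quality control of sequencing data. Ambiguous bases and chimeras are removed with UCHIME, the chimera detection tool of the bioinformatics software USEARCH (http://www.drive5.com/usearch/), and QIIME (Quantitative Insights Into Microbial Ecology). This is a routine procedure.
3)    Delimiting OTUs. OTUs are delimited with the OTU clustering and representative-sequence identification algorithms of USEARCH. All 16S rRNA gene sequences with identity of 98.7% are assigned to one OTU. The 10 most frequent 16S rRNA gene sequences in each OTU are selected as its representative sequences; if there are fewer than 10 such sequences, all are included.
4)    Identifying an OTU whose representative 16S rRNA gene sequences align as a known bacterium. The representative 16S rRNA gene sequences of the query OTU are added to the All-Species Living Tree database LTP 123 and aligned with the online 16S rRNA sequence query software SINA (The new SILVA (Web)Aligner). Sequences that align (identity of 98.7% or higher) can be inserted into the phylogenetic tree of all known bacteria. Based on the alignment and on the topology and relationships in the tree, if a sequence can be assigned to the 16S rRNA gene sequence of a known bacterium and forms an independent branch with it, it is annotated as that known bacterium, for example Streptococcus suis. Such a known bacterium forms an independent branch on the tree and is an OPU with a taxonomic name.
5)    Identifying OTUs with identity below 98.7% to the 16S rRNA gene sequences of the reference strains of all known bacteria as unknown bacteria, annotated by the OPU method. The representative 16S rRNA gene sequences of OTUs with identity below 98.7% are added to the non-redundant SILVA reference database (SILVA SSURef_NR_132) for a second alignment.
The 16S rRNA gene sequences in that database found by the second alignment to be closest in identity to the query, the representative 16S rRNA gene sequences of the query OTU, and the 16S rRNA gene sequences of all known bacterial reference strains in the LTP128 database are used, with the online query software SINA and the neighbor-joining method, to build the all-bacteria phylogenetic tree. Archaea are set as the root (Fig. 1).
The topology of the resulting all-bacteria phylogenetic tree is analyzed to define each OPU. Every OPU is a smallest monophyletic group. Every OPU contains at least two kinds of sequences: the representative sequences of the OTU, and the 16S rRNA gene sequences closest to them, in particular those of the closest reference strains (Fig. 1).
6)    OPUs annotated as suspected new species. If an OPU can be placed in a "genus" but its identity to the 16S rRNA gene sequences of the reference strains of all "species" in that "genus" is below 98.7%, it is annotated as an unknown new species.
7)    Annotation of higher-rank-unit OPUs. If, according to the bacterial phylogenetic tree, an OPU can be identified only to "family" or to a taxonomic unit above "family", we treat it as an unknown higher-rank unit, which can be considered to represent at least one unknown "genus", because taxonomic identification above the "species" level cannot be made correctly from full-length 16S rRNA gene sequences alone (Fig. 1).
8)    OPU numbering. All OPUs are numbered in a single series, and each OPU number is unique.
In the fecal specimens of 120 healthy people, 1235 OPUs were delimited by the above method. Of these, 461 OPUs were identified as known bacteria, to the "species" level; 774 OPUs (62.7%) were unknown bacteria. Of the 774 unknown-bacterium OPUs, 358 could be identified to genus and were annotated as suspected new species of that "genus". The remaining 416 OPUs could not be identified accurately and were annotated as "higher-rank taxonomic units" (Fig. 1).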
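The three-way annotation described above can be condensed into a small decision function. The sketch below is a simplification under stated assumptions: it uses only the identity thresholds given in the text (98.7% for species, 95% within a genus), whereas the patent also relies on tree topology, which is not modeled here. The function name `classify_opu` and its parameters are hypothetical. The final lines check that the reported counts are internally consistent.

```python
def classify_opu(identity_to_nearest, nearest_is_named, identity_within_genus):
    """Classify an OPU from percent identities (simplified threshold rule)."""
    if identity_to_nearest >= 98.7 and nearest_is_named:
        return "known species"
    if identity_within_genus >= 95.0:
        return "suspected new species"
    return "higher-rank unit"

# Consistency check of the counts reported for the 120 healthy people.
known, unknown, total = 461, 774, 1235
suspected_new, higher_rank = 358, 416
unknown_share = round(100.0 * unknown / total, 1)
```

For example, `classify_opu(97.2, True, 96.0)` yields "suspected new species", while `classify_opu(93.0, False, 93.0)` yields "higher-rank unit".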
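Of the full-length or near full-length 16S rRNA gene sequences obtained from fecal specimens of 120 healthy people in China, 54.45% belong to unknown bacteria that have not been isolated, named, or studied. This suggests that more than 50% of the intestinal flora consists of unknown bacteria.
Of the 1235 OPUs in the intestinal flora of healthy Chinese people, 116 can be detected in more than 60% of fecal specimens. Only 38 of them are known bacteria; 78 OPUs (67%) are unknown bacteria. Fig. 2 shows the composition ratios and their ranges of variation for the 116 bacteria with a detection rate of 60% or higher. No bacterium has a detection rate of 100%. The intestinal flora composition of different healthy individuals is not identical and varies widely, but shows similarities. We call the 116 bacteria with a detection rate of 60% or higher the resident intestinal flora of Chinese people (Fig. 2); they are the main members that keep the intestinal flora in balance. Known bacteria are denoted by their recognized names, such as Prevotella copri. Unknown bacteria are denoted by an OPU and its code: for example, Bacteroides sp. 17 (OPU-532) denotes a suspected new species of the genus Bacteroides that has not yet been isolated or identified, and Lachnospiraceae (OPU-001) denotes a new member of the family Lachnospiraceae that is hard to identify accurately from the 16S rRNA gene sequence alone and is called a higher-rank-unit OPU.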
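The "resident flora" definition is a prevalence filter: an OPU qualifies if it is detected in at least 60% of specimens. A minimal sketch, assuming each specimen is represented as the set of OPU identifiers detected in it (the identifiers and function names below are hypothetical):

```python
def detection_rates(samples):
    """Map each OPU to the fraction of samples in which it was detected."""
    counts = {}
    for detected in samples:
        for opu in detected:
            counts[opu] = counts.get(opu, 0) + 1
    return {opu: n / len(samples) for opu, n in counts.items()}

def resident_flora(samples, threshold=0.60):
    """OPUs whose detection rate is at or above the threshold, sorted by name."""
    rates = detection_rates(samples)
    return sorted(opu for opu, rate in rates.items() if rate >= threshold)

# Five toy specimens: OPU-A in 5, OPU-B in 3, OPU-C in 2.
samples = [
    {"OPU-A", "OPU-B", "OPU-C"},
    {"OPU-A", "OPU-B"},
    {"OPU-A", "OPU-B", "OPU-C"},
    {"OPU-A"},
    {"OPU-A"},
]
resident = resident_flora(samples)
```

The same rates can be binned into the low-, mid-, and high-frequency groups the description uses elsewhere by changing the cutoffs.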
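(2) Obtaining the reference 16S rRNA gene sequences of the reference strains of all known bacteria. These comprise the 16S rRNA gene sequences of known bacterial reference strains from the List of Prokaryotic (mainly bacterial) names with Standing in Nomenclature (LPSN: https://www.bacterio.net/) and the US National Center for Biotechnology Information (NCBI RefSeq database: https://www.ncbi.nlm.nih.gov/), more than 38,000 in total. Each bacterial "species" may have several 16S rRNA gene sequences.
(3) Expanding the above library of reference 16S rRNA gene sequences of known bacterial reference strains. High-quality sequences are taken from SILVA, the online database for 16S rRNA gene sequence quality checking and alignment (https://www.arb-silva.de/): sequences with exactly matching taxonomic names, a proportion of degenerate bases below 2%, length above 1000 bp, and identity above 99%, about 143,000 in total. They supplement the public-database 16S rRNA gene sequences of known bacterial reference strains and improve sensitivity, coverage, and accuracy.
(4) Constructing the intestinal flora 16S rRNA gene reference sequence library. The 16S rRNA gene sequences of the 1235 OPUs from healthy human intestinal bacteria found by the invention, the 16S rRNA gene sequences of the reference strains of all known bacteria listed in the List of Prokaryotic names with Standing in Nomenclature, and the high-quality 16S rRNA gene sequences of known bacteria from the SILVA database are integrated to build the library. It contains 850,000 high-quality bacterial 16S rRNA gene sequences and can detect and identify all of the more than 18,000 published bacterial species and subspecies; in particular, it can detect and identify the 774 kinds of unknown bacteria. It features large capacity, long sequences, and accurate taxonomic annotation. It is updated as new bacterial species are discovered and published, with the goal of detecting and identifying all known bacteria (Fig. 1).
2.  Constructing the working reference library of intestinal flora 16S rRNA gene V3-V4 sequences
The 850,000 sequences in our intestinal bacterial 16S rRNA gene reference library are cut in silico at the binding sites of the 16S rRNA gene V3-V4 amplification primers 341F (CCTAYGGGRBGCASCAG) and 806R (GGACTACNNGGGTATCTAAT) to obtain the V3-V4 sequences of all 850,000 16S rRNA genes. That is, every full-length 16S rRNA gene in the reference library is virtually cut and only its V3-V4 sequence is kept, forming the working reference library of intestinal flora 16S rRNA gene V3-V4 sequences. Identical sequence entries in the new working library are then merged. The working library built in this example contains 273,000 16S rRNA gene V3-V4 sequences and can detect and identify more than 18,000 bacterial species and subspecies. Because it includes the 16S rRNA gene sequences of unknown bacteria of the healthy human intestine, most bacterial 16S rRNA gene V3-V4 sequences obtained from human fecal specimens can be identified to bacterial "species".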
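Merging identical entries is what reduces roughly 850,000 cut sequences to 273,000 working-library entries. A minimal sketch of that merge is shown below. The patent does not state how an entry is labeled when identical V3-V4 sequences come from differently named sources, so this sketch simply keeps every distinct annotation; that choice, and the name `merge_identical`, are assumptions.

```python
from collections import defaultdict

def merge_identical(entries):
    """Collapse (annotation, v3v4_sequence) pairs that share a sequence.

    Returns a dict mapping each distinct sequence to the sorted list of
    annotations attached to it.
    """
    merged = defaultdict(set)
    for annotation, sequence in entries:
        merged[sequence.upper()].add(annotation)
    return {seq: sorted(names) for seq, names in merged.items()}

entries = [
    ("Escherichia coli", "ACGTACGT"),
    ("Shigella dysenteriae", "ACGTACGT"),   # identical V3-V4 region
    ("Prevotella copri", "ACGTTTGT"),
    ("Prevotella copri", "acgtttgt"),       # duplicate entry, different case
]
working_library = merge_identical(entries)
```

Entries that end up with more than one annotation correspond to the groups of species that the description says cannot be separated by the 16S rRNA gene alone, such as Escherichia coli/Shigella.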
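The bacterial 16S rRNA gene V3-V4 sequence library built by the invention is a dynamic database that can change as online public databases and researchers' own datasets grow. Such changes do not affect implementation of the method; as the database grows, the accuracy of identifying the human flora at the "species" level from bacterial 16S rRNA gene sequences improves accordingly. The core of the invention lies not in the composition of the database itself but in a dynamic and open method for building a reference library of human flora 16S rRNA gene sequences based on bacterial operational phylogenetic units.
Construction Example 2. Construction of a method for analyzing the compositional diversity and composition ratio of the intestinal flora at the "species" level
On the basis of the database built in Example 1 (Fig. 1), a method or system is constructed for analyzing the compositional diversity and composition ratio of the intestinal flora of a test sample at the "species" level.
The specific implementation has 4 parts: collection and processing of fecal specimens; high-throughput sequencing of the 16S rRNA gene V3-V4 region; taxonomic annotation at the "species" level; and presentation of the diversity and composition ratio of the human fecal flora.
1.  Specimen collection and processing
Fresh fecal specimens are collected in a stool cup, held temporarily in a sample box with ice packs, and then transported under cold chain to the laboratory for nucleic acid extraction. Extraction uses a column-purification fecal nucleic acid extraction kit (Qiagen, cat. 51604) on 200 mg of fecal sample according to the manufacturer's instructions. Finally, the fecal nucleic acid is eluted from the spin column with 200 μL of deionized water and collected for subsequent 16S rRNA gene amplification.
2.  High-throughput sequencing of the 16S rRNA gene V3-V4 region
The fecal nucleic acid is PCR-amplified and the product purified, then the 16S rRNA gene V3-V4 region is paired-end sequenced on the Illumina MiSeq platform.
3.  Taxonomic identification at the "species" level. The V3-V4 16S rRNA gene sequences obtained undergo routine quality control to remove ambiguous bases and chimeras, and are then queried against the working reference library of intestinal bacterial 16S rRNA gene V3-V4 sequences. Sequences with 100% identity are annotated as known or unknown bacteria according to the taxonomic information of the matched reference sequence. A known bacterium is annotated with the corresponding taxonomic name, for example Streptococcus suis. An unknown bacterium is annotated with the corresponding coded OPU, including suspected new species and higher-rank units. Sequences that cannot be annotated are labeled unidentified (Fig. 1).
4.  Analysis results for the diversity and composition ratio of the human fecal flora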
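(1) The method can detect and describe the diversity of the human intestinal flora at the taxonomic "species" level. The invention found that the intestinal flora of each healthy Chinese person contains on average 186±51 OPUs, of which the low-frequency flora (carried by fewer than 10% of people), mid-frequency flora (carried by 10% to under 60% of people), and high-frequency flora (carried by more than 60% of people) contribute 20±11, 75±29, and 90±19 OPUs, respectively. In total 1235 OPUs were detected, of which 774 (62.7%) are unknown bacteria (Figs. 2-4).
In Figs. 2-4, the resident intestinal flora of healthy people means bacteria with a detection rate of 60% or higher in fecal specimens of healthy Chinese people. Known bacteria are denoted by the names published in the List of Prokaryotic (mainly bacterial) names with Standing in Nomenclature (LPSN: https://www.bacterio.net/), such as Prevotella copri. Unknown bacteria are denoted in 2 ways: as suspected new species and as higher-rank units. A suspected new species is a potential new species that can be identified to "genus" but has not yet been isolated or identified; it is denoted by the genus name and OPU number, such as Bacteroides sp. 17 (OPU-532). A higher-rank unit is one that is hard to identify accurately from the 16S rRNA gene sequence alone and is denoted by the next-higher taxonomic unit plus an OPU code, such as Lachnospiraceae (OPU-001), a new member of the family Lachnospiraceae.
(2) Among known bacteria, V3-V4 16S rRNA gene sequences that are 100% identical to the reference sequences of pathogens, opportunistic pathogens, or probiotics can be definitively identified as the corresponding pathogen, opportunistic pathogen, or probiotic.
(3) Among known bacteria, the number of V3-V4 16S rRNA gene sequences definitively identified as pathogens, opportunistic pathogens, or probiotics, and their percentage of the total number of V3-V4 16S rRNA gene sequences in the specimen, are computed, giving composition-ratio data for every "species" of known bacteria and every OPU of unknown bacteria.
(4) These data are compared with the composition-ratio data of the 116 OPUs (38 known bacteria and 78 unknown-bacterium OPUs) present in fecal specimens of more than 60% of Chinese people, and the result is reported, for example as raised, lowered, or absent.
(5) The key technical feature of the method is the discovery of the 16S rRNA gene sequences of 774 kinds of unknown bacteria: known intestinal bacteria can be detected and analyzed at the "species" level, and unknown intestinal bacteria at the "OPU" level.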
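Items (3) and (4) amount to computing each OPU's share of the annotated reads and comparing it with a reference range. The sketch below illustrates this under stated assumptions: the reference ranges are invented placeholders (the actual thresholds are given only graphically in Figs. 2-4), and the function names and the "within range" label are hypothetical additions to the raised/lowered/absent vocabulary used in the text.

```python
def composition_ratios(counts):
    """Percentage of all reads assigned to each species or OPU."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to analyze")
    return {opu: 100.0 * n / total for opu, n in counts.items()}

def compare_to_reference(ratios, reference):
    """Label each reference OPU as absent, raised, lowered, or within range.

    reference maps OPU -> (low, high) composition ratio in percent.
    """
    report = {}
    for opu, (low, high) in reference.items():
        value = ratios.get(opu, 0.0)
        if value == 0.0:
            report[opu] = "absent"
        elif value > high:
            report[opu] = "raised"
        elif value < low:
            report[opu] = "lowered"
        else:
            report[opu] = "within range"
    return report

counts = {"OPU-A": 50, "OPU-B": 30, "OPU-C": 20}          # toy read counts
reference = {"OPU-A": (10, 40), "OPU-B": (20, 40),        # placeholder ranges
             "OPU-C": (30, 50), "OPU-D": (1, 5)}
report = compare_to_reference(composition_ratios(counts), reference)
```

With these toy numbers, OPU-A is reported as raised, OPU-B as within range, OPU-C as lowered, and OPU-D as absent.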
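Application Example 1. Comparison of different sequencing and analysis methods on samples from 120 healthy people
In this application example, we used samples from 120 healthy people to evaluate the database and comparison method of the invention.
1.  High-throughput sequencing of the 16S rRNA gene
Fecal samples from 120 healthy people were each sequenced by 2 different methods for "species"-level analysis of intestinal flora composition and proportion: 16S rRNA gene V3-V4 region sequencing on the Illumina MiSeq, and full-length 16S rRNA gene sequencing on the PacBio Sequel platform. V3-V4 sequencing yielded on average 118,261 valid sequences per sample, whereas full-length sequencing yielded on average 5502 valid sequences per sample. See Table 1 for details.
Table 1. Number of valid sequences obtained per fecal specimen by full-length and V3-V4 region sequencing of the 16S rRNA gene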
Figure 415316dest_path_image001
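2.  Databases and comparison methods used
In this application example, the 16S rRNA gene V3-V4 sequencing data from the Illumina MiSeq platform were analyzed with two databases and comparison methods: (1) taxonomic analysis of OTU representative sequences at the 97% similarity level with the RDP classifier Bayesian algorithm, annotated with the Silva_132 16S rRNA database, to obtain the species composition and abundance of each sample; (2) analysis of the species composition and abundance of each sample with the database and comparison method constructed in the invention. For the full-length 16S rRNA gene sequences from the PacBio Sequel platform, we used the OPU strategy to analyze the species composition and abundance of each sample. For the detailed method, see Yang J, Pu J, Lu S, Bai X, Wu Y, Jin D, Cheng Y, Zhang G, Zhu W, Luo X, Rosselló-Móra R, Xu J. Species-Level Analysis of Human Gut Microbiota With Metataxonomics. Front Microbiol. 2020 Aug 26;11:2029. doi: 10.3389/fmicb.2020.02029. PMID: 32983030; PMCID: PMC7479098.
3.  Results
For the samples from 120 healthy people, 3 methods are distinguished by sequencing method, database, and comparison software: (1) sequencing the 16S rRNA gene V3-V4 region on the Illumina MiSeq platform and analyzing with the database and comparison software constructed in the invention (hereinafter, the method of the invention); (2) sequencing the 16S rRNA gene V3-V4 region on the Illumina MiSeq platform and analyzing with the Silva_132 16S rRNA database and the RDP classifier Bayesian algorithm (hereinafter, the common method); (3) sequencing the full-length 16S rRNA gene on the PacBio Sequel platform and analyzing the species composition and abundance of each sample with the operational phylogenetic unit strategy; because this method obtains the full-length 16S rRNA gene sequence, and assigning "species" from the full-length 16S rRNA gene is the gold standard, it is hereinafter called the gold standard method. From the results, we compared the three methods in two respects, the proportion of sequences that could be assigned to the "species" level and the number of "species" found, to establish that the database and comparison software constructed in the invention have an excellent ability to find "species".
(1) The method of the invention can identify on average more than 95% of the 16S rRNA gene sequences in each fecal specimen to the "species" level (OPU)
We compared the database and comparison method of the invention with full-length 16S rRNA gene sequencing (the gold standard method). Across the 120 healthy samples, the method of the invention identified on average 95.6% of sequences to the "species" level (number of sequences annotated to an OPU / number of all 16S rRNA gene V3-V4 sequences in the specimen), whereas full-length 16S rRNA gene sequencing (the gold standard method) identified on average 57.95% (number of sequences annotated to an OPU / number of all full-length 16S rRNA gene sequences in the specimen). These data show that the method of the invention has an advantage over the gold standard method in the proportion of sequences identified to the "species" level. Because the gold standard method needs full-length 16S rRNA sequences, for the same number of sequences its sequencing cost is more than about 10 times, and its sequencing turnaround about 2-3 times, that of the method of the invention, so the method of the invention is more economical and practical for assigning "species".
Table 2.  Comparison of the number (%) of 16S rRNA sequences in fecal specimens that can be identified to the bacterial "species" level*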
Figure 822026dest_path_image002
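*: Method of the invention and common method (Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm): number of sequences annotated to an OPU / number of all 16S rRNA gene V3-V4 sequences in the specimen; gold standard method: number of sequences annotated to an OPU / number of all full-length 16S rRNA gene sequences in the specimen.
Using the same data, namely the 16S rRNA gene V3-V4 sequencing data from the Illumina MiSeq platform, we ran both the database plus comparison method constructed in the invention and the commonly used Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm, and compared the numbers of sequences assigned to the "species" level. The database and comparison method of the invention identified on average 95.6% of sequences to the "species" level, whereas the commonly used Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm identified only 38.1%.
(2) The method of the invention detects on average up to 92.9 bacterial "species" (OPUs) per fecal specimen
Table 3. Comparison of the number of bacterial "species" (OPUs) detectable per fecal specimen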
Figure 585583dest_path_image003
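In terms of the number of "species" found, the database and comparison software constructed in the invention found on average 140.47 "species" per sample across the 120 samples, the gold standard method found on average 92.91 "species" per sample, and commonly used databases and comparison software (for example, the Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm) found on average only 82.08 "species" per sample (see Table 3). These data show that the database and comparison software constructed in the invention can find more "species", which is of great value for analyzing the structure and abundance of the intestinal flora.
Application Example 2: Analysis of fecal flora composition and composition ratio in clinical patient samples with the method of the invention
By analyzing the intestinal flora 16S rRNA gene data of 120 healthy people, we determined standard thresholds for the different constituent intestinal bacteria in the healthy population and built a reference standard for routine examination of the human intestinal flora. Fig. 5 shows the composition-ratio thresholds of the 116 OPUs with a detection rate of 60% or higher, called the resident intestinal flora. On this basis, we analyzed the fecal flora composition and abundance of fecal specimens from 3 clinical patients with the method of the invention and compared them with the flora structure and abundance of the reference population, in order to assess the state of each patient's intestinal flora and its possible correlation with disease. Fig. 5 shows the results for the diversity and composition ratio of the intestinal flora of healthy people.
Human intestinal bacteria not only affect body weight and digestion, protect against infection, and influence the risk of autoimmune disease, but also control the body's response to therapeutic drugs. Data on the diversity and composition ratio of the human intestinal flora can therefore serve as indicators of health and disease status. By interpreting these data, physicians can analyze, assess, and diagnose a patient's disease and health status.
Application Example 2.1: Flora analysis of a fecal sample from an adult with diarrheal disease
Patient F32, female, 67 years old, clinical diagnosis "bacterial infection caused by Shigella dysenteriae". Fig. 6 shows the diversity and composition ratio of the intestinal flora of this diarrhea patient. The following conclusions can be drawn from the flora structure and abundance:
1. The abundance of Escherichia coli/Shigella in the tested fecal sample is markedly raised, significantly above the threshold (0.6%).
2. The opportunistic pathogens Enterobacter asburiae and Acinetobacter junii were detected (not detected in healthy people).
3. A total of 13 bacterial "species" (OPUs) were found. In healthy people, 140 OPUs (99-179) can be detected per fecal specimen. This suggests reduced flora diversity and a disordered flora.
4. Because the identity of their full-length 16S rRNA gene sequences is above 98.7% (sequences with identity of 98.7% or higher can be regarded as one "species"), Escherichia coli and the genus Shigella cannot be separated on the basis of 16S rRNA gene sequences alone. However, the markedly raised abundance of Escherichia coli/Shigella supports the clinical diagnosis of Shigella dysenteriae infection.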
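Application Example 2.2: Analysis of clinical liver cirrhosis patient sample 2
Patient F54, male, 42 years old, clinical diagnosis "liver cirrhosis". Fig. 7 shows the diversity and composition ratio of the patient's intestinal flora. The following conclusions can be drawn from the flora structure and abundance:
1. The abundance of opportunistic pathogens such as Bacteroides fragilis, Klebsiella pneumoniae, and Ruminococcus torques in the tested fecal sample is above the threshold.
2. A total of 69 bacterial "species" (OPUs) were found, fewer than the average of 140.47 "species" (99-179 OPUs) found per fecal specimen in healthy people. This suggests reduced flora diversity and a disordered intestinal flora.
Application Example 2.3: Analysis of clinical diarrhea patient sample 3
Patient F181, male, 1 year old, clinical diagnosis "diarrhea". Fig. 8 shows the structure and abundance of the flora in the patient's fecal specimen. The following conclusions can be drawn from the flora structure and abundance:
1. In the tested fecal sample, the abundance of Citrobacter braakii and Citrobacter freundii is markedly raised (threshold). Because Citrobacter braakii and Citrobacter freundii can cause diarrhea, they may be the pathogens.
2. The abundance of Klebsiella pneumoniae is above the threshold. Klebsiella pneumoniae can cause diarrhea in young children.
3. A total of 52 bacterial "species" (OPUs) were found, fewer than the average of 140.47 "species" (99-179 OPUs) found per fecal specimen in healthy people. This suggests reduced intestinal flora diversity and a disordered intestinal flora.
4. The probiotics Lactobacillus reuteri and Bifidobacterium breve were detected, at abundances higher than in healthy adults. It is advisable to ask whether the patient has been taking probiotic preparations.
Industrial Applicability
The invention discloses a method for detecting and analyzing the V3-V4 region sequences of the bacterial 16S rRNA gene in human fecal specimens, which can detect and annotate the compositional diversity and composition ratio of the intestinal flora at the "species" level. The method can be implemented industrially and has industrial applicability.
Sequence Listing Free Text
[0092] Sequence Listing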
 
<110>  中国疾病预防控制中心传染病预防控制所
 
<120>  Method for analyzing bacteria at the "species" level based on bacterial 16S rRNA gene sequences
 
<160>  2    
 
<170>  PatentIn version 3.3
 
<210>  1
<211>  17
<212>  DNA
<213>  Artificial
 
<400>  1
cctaygggrb gcascag                                                    17
 
 
<210>  2
<211>  20
<212>  DNA
<213>  Artificial
 
 
<220>
<221>  misc_feature
<222>  (8)..(9)
<223>  n is a, c, g, or t
 
<400>  2
ggactacnng ggtatctaat                                                 20

Claims (1)

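  1. A method for identifying the human flora at the "species" level based on bacterial 16S rRNA gene sequences, characterized in that the method comprises the following steps:
    (1) constructing a human flora 16S rRNA gene reference sequence library with the bacterial operational phylogenetic unit as the annotation unit, the reference sequence library including known bacteria that have been named at the "species" level and unknown bacteria that have not been named at the "species" level, wherein bacteria whose operational phylogenetic unit has been named are annotated with that name, and bacteria whose operational phylogenetic unit has not been named take the operational phylogenetic unit as their unique name;
    (2) sequencing the 16S rRNA gene of the specimen to be tested;
    (3) comparing the specimen 16S rRNA gene sequences obtained in step (2) with the human flora 16S rRNA gene reference sequence library constructed in step (1) and identifying the species, wherein a sequence completely identical to a specific sequence in the reference database is identified by the annotated name of that specific sequence in the reference sequence library.
    2. The method according to claim 1, characterized in that the method further comprises a step of analyzing the kinds, proportions, and/or abundance, within the flora of the specimen to be tested, of the species identified in step (3).
    3. The method according to claim 1, characterized in that the name annotation of the named bacteria in step (1) of the method includes annotation as a pathogen, opportunistic pathogen, or probiotic.
    4. The method according to claim 1, characterized in that the 16S rRNA gene sequence in the method is the V3-V4 region sequence.
    5. The method according to claim 1, characterized in that the human flora is derived from the digestive tract, skin, oral cavity, nasopharynx, eye, vagina, urinary tract, or ear.
    6. The method according to claim 1, characterized in that the sequencing in step (2) of the method is high-throughput sequencing.
    7. A method for constructing the human flora 16S rRNA gene sequence reference library based on bacterial operational phylogenetic units described in step (1) of the method of claim 1, characterized in that the method comprises:
    (1) sequencing and quality control: obtaining bacterial 16S rRNA gene sequences from human specimens and removing low-quality sequences by quality control;
    (2) delimiting bacterial operational taxonomic units: naming a group of 16S rRNA gene sequences from step (1) with sequence identity of 98.7% or higher as one bacterial operational taxonomic unit;
    (3) determining the representative sequences of a bacterial operational taxonomic unit: selecting the 10 most frequent 16S rRNA gene sequences in a bacterial operational taxonomic unit obtained in step (2) as the representative sequences of that unit, or all of its sequences if there are fewer than 10;
    (4) constructing the bacterial phylogenetic tree: aligning the representative sequences of each bacterial operational taxonomic unit obtained in step (3) with the 16S rRNA gene sequences of named bacterial reference strains, and inserting the aligned representative sequences into the database of 16S rRNA gene sequences of all named bacterial reference strains, with the parameter set to LTP50; building the all-bacteria phylogenetic tree from the inserted OTU representative sequences and the 16S rRNA gene sequences of the named bacterial reference strains by the neighbor-joining method with Jukes-Cantor correction, with the conservation level set to 30%;
    (5) discovering unknown bacteria in the healthy human flora: on the all-bacteria phylogenetic tree, the representative sequences of a query bacterial operational taxonomic unit cluster with the most similar 16S rRNA gene sequences and form a branch on the tree, and the branch is defined as one bacterial operational phylogenetic unit; if the identity between the representative sequences of the bacterial operational taxonomic unit and the nearest 16S rRNA gene sequence on the tree is 98.7% or higher and that sequence has been named, the named bacterial name is used for annotation, and such a bacterial operational phylogenetic unit is regarded as a known bacterium; if the identity to the nearest 16S rRNA gene sequence on the tree is below 98.7% but the identity to the representative sequences of other "species" in the "genus" is 95% or higher, it is regarded as a suspected new species of unknown bacteria; if the identity to the 16S rRNA gene sequence of the nearest reference strain on the tree is below 95% and no name has been assigned, it is designated a higher-rank unit of unknown bacteria and is named with the numbered next-higher bacterial operational taxonomic unit and the bacterial operational phylogenetic unit number;
    (6) obtaining the 16S rRNA gene sequence reference library based on bacterial operational phylogenetic units: merging the full-length 16S rRNA gene sequences of the unknown bacteria of the healthy human intestine obtained in step (5) with the 16S rRNA gene sequences of all named and published known bacteria to construct the human flora 16S rRNA gene reference sequence library;
    (7) cutting the 16S rRNA gene sequence reference library obtained in step (6) at the universal sequence sites of the 16S rRNA gene V3-V4 region, merging entries whose V3-V4 sequences are identical, and deleting exact duplicates to form the working reference library of human flora 16S rRNA gene sequences.
    8. The method according to claim 7, characterized in that the sequencing in step (1) is performed on the PacBio third-generation sequencing platform and covers fecal specimens from at least 120 healthy people, for which full-length bacterial 16S rRNA gene sequences are determined, and the low-quality sequences removed in quality control include sequences containing a base with a quality value below 10, sequences in which both end primers cannot be recognized, and chimeras.
    9. The method according to claim 7, characterized in that the 16S rRNA gene sequences of named bacterial reference strains in step (4) come from published reference sequence libraries, the reference sequence libraries including the 16S rRNA gene sequence collections held and published by the List of Prokaryotic names with Standing in Nomenclature, the US National Center for Biotechnology Information, and the online database for quality checking and alignment of bacterial 16S rRNA gene sequences.
    10. The method according to claim 7, characterized in that the cutting in step (7) obtains the cut sequences by in-silico virtual cutting of the 16S rRNA gene V3-V4 region.
    11. The method according to claim 10, characterized in that the upstream cutting site of the virtual cut has the sequence shown in SEQ ID NO.1 and the downstream cutting site has the sequence shown in SEQ ID NO.2.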
PCT/CN2022/092574 2021-06-13 2022-05-12 基于细菌16S rRNA基因序列的细菌"种"水平检测和分析方法 WO2022262491A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110659956.2A CN113403409A (zh) 2021-06-13 2021-06-13 基于细菌16S rRNA基因序列的细菌“种”水平检测和分析方法
CN202110659956.2 2021-06-13

Publications (1)

Publication Number Publication Date
WO2022262491A1 true WO2022262491A1 (zh) 2022-12-22

Family

ID=77683870

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092574 WO2022262491A1 (zh) 2021-06-13 2022-05-12 基于细菌16S rRNA基因序列的细菌"种"水平检测和分析方法

Country Status (2)

Country Link
CN (1) CN113403409A (zh)
WO (1) WO2022262491A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113403409A (zh) * 2021-06-13 2021-09-17 中国疾病预防控制中心传染病预防控制所 基于细菌16S rRNA基因序列的细菌“种”水平检测和分析方法
CN116825182B (zh) * 2023-06-14 2024-02-06 北京金匙医学检验实验室有限公司 一种基于基因组ORFs筛选细菌耐药特征的方法及应用

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103627800A (zh) * 2013-11-14 2014-03-12 浙江天科高新技术发展有限公司 环境微生物快速检测方法
WO2017044886A1 (en) * 2015-09-09 2017-03-16 uBiome, Inc. Method and system for microbiome-derived diagnostics and therapeutics for bacterial vaginosis
CN111816258A (zh) * 2020-07-20 2020-10-23 杭州谷禾信息技术有限公司 人体菌群16S rDNA高通量测序物种精确鉴定的优化方法
CN112863606A (zh) * 2021-03-08 2021-05-28 杭州微数生物科技有限公司 细菌鉴定和分型分析基因组数据库及鉴定和分型分析方法
CN113403409A (zh) * 2021-06-13 2021-09-17 中国疾病预防控制中心传染病预防控制所 基于细菌16S rRNA基因序列的细菌“种”水平检测和分析方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451425A (zh) * 2017-08-21 2017-12-08 上海派森诺生物科技股份有限公司 一种基于微生物rRNA基因测序技术的菌群代谢功能预测分析方法
CN109706235A (zh) * 2019-01-29 2019-05-03 广州康昕瑞基因健康科技有限公司 一种肠道微生物菌群的检测和分析方法及其系统
CN109897906A (zh) * 2019-03-04 2019-06-18 福建西陇生物技术有限公司 一种肠道菌群16S rRNA基因的检测方法及其应用
CN109971871A (zh) * 2019-03-27 2019-07-05 江南大学 一种筛选和/或鉴定乳杆菌的方法及其应用
CN110144415A (zh) * 2019-04-23 2019-08-20 大连大学 一种基于肠道菌群预测引进奶牛健康和免疫力水平方法
CN111254186B (zh) * 2020-03-31 2023-04-07 上海市第十人民医院 一种对梭杆菌进行分子检测或对其菌种水平分类鉴定的方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103627800A (zh) * 2013-11-14 2014-03-12 浙江天科高新技术发展有限公司 环境微生物快速检测方法
WO2017044886A1 (en) * 2015-09-09 2017-03-16 uBiome, Inc. Method and system for microbiome-derived diagnostics and therapeutics for bacterial vaginosis
CN111816258A (zh) * 2020-07-20 2020-10-23 杭州谷禾信息技术有限公司 人体菌群16S rDNA高通量测序物种精确鉴定的优化方法
CN112863606A (zh) * 2021-03-08 2021-05-28 杭州微数生物科技有限公司 细菌鉴定和分型分析基因组数据库及鉴定和分型分析方法
CN113403409A (zh) * 2021-06-13 2021-09-17 中国疾病预防控制中心传染病预防控制所 基于细菌16S rRNA基因序列的细菌“种”水平检测和分析方法

Also Published As

Publication number Publication date
CN113403409A (zh) 2021-09-17

Similar Documents

Publication Publication Date Title
Earl et al. Species-level bacterial community profiling of the healthy sinonasal microbiome using Pacific Biosciences sequencing of full-length 16S rRNA genes
JP7317821B2 (ja) ディスバイオシスを診断する方法
US20190367995A1 (en) Biomarkers for colorectal cancer
Minot et al. The human gut virome: inter-individual variation and dynamic response to diet
CN108350510B (zh) 用于胃肠健康相关病症的源自微生物群系的诊断及治疗方法和系统
CN105368944B (zh) 可检测疾病的生物标志物及其用途
EP3347496A1 (en) Method and system for microbiome-derived diagnostics and therapeutics for oral health
WO2022262491A1 (zh) 基于细菌16S rRNA基因序列的细菌"种"水平检测和分析方法
CN108348167B (zh) 用于脑-颅面健康相关病症的源自微生物群系的诊断及治疗方法和系统
CN107430644A (zh) 用于测定胃肠道菌群失调的方法
EP3676405A2 (en) Method and system for characterization for female reproductive system-related conditions associated with microorganisms
Gehrig et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data
CN114898808B (zh) 一种预测肺炎克雷伯菌对头孢吡肟敏感性的方法及系统
WO2017044880A1 (en) Method and system for microbiome-derived diagnostics and therapeutics infectious disease and other health conditions associated with antibiotic usage
KR20190047023A (ko) 샘플에서 1종 이상의 유형의 다양한 미생물 집단으로부터 핵산 분자를 추출하는 범용 방법
CN107075453A (zh) 冠状动脉疾病的生物标记物
CN109266766A (zh) 肠道微生物作为胆管细胞癌诊断标志物的用途
CN107002021A (zh) 类风湿性关节炎的生物标记物及其用途
Kushnir et al. Molecular characterization of Neisseria gonorrhoeae isolates in Almaty, Kazakhstan, by VNTR analysis, Opa-typing and NG-MAST
Stockdale et al. Viral dark matter in the gut virome of elderly humans
CN106795480A (zh) 类风湿性关节炎的生物标记物及其用途
WO2022253824A1 (en) Rna profiling of the microbiome and molecular inversion probes
CN106795479A (zh) 类风湿性关节炎的生物标记物及其用途
CN109913526B (zh) 微生物在鉴别和/或区分不同民族个体中的应用
D’Adamo et al. Bacterial clade-specific analysis identifies distinct epithelial responses in inflammatory bowel disease

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22823974

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE