CN113403409A - Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence - Google Patents


Publication number
CN113403409A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110659956.2A
Other languages
Chinese (zh)
Inventor
徐建国
杨晶
卢珊
濮吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Institute for Communicable Disease Control and Prevention of Chinese Center For Disease Control and Prevention
Original Assignee
National Institute for Communicable Disease Control and Prevention of Chinese Center For Disease Control and Prevention
Application filed by National Institute for Communicable Disease Control and Prevention of Chinese Center For Disease Control and Prevention
Priority to CN202110659956.2A
Publication of CN113403409A
Priority to PCT/CN2022/092574 (WO2022262491A1)

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • C12Q1/6869 Methods for sequencing

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Analytical Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method, for non-diagnostic purposes, for identifying human intestinal flora at the species level based on bacterial 16S rRNA gene sequences. The method comprises the following steps: (1) constructing a human intestinal flora 16S rRNA gene reference sequence library based on bacterial operational phylogenetic units (OPUs); (2) sequencing the 16S rRNA gene of a specimen to be tested; and (3) comparing the 16S rRNA gene sequences of the specimen with the 16S rRNA gene reference sequence library and identifying the species. The method can detect and annotate human intestinal flora to the species level and report diversity, composition ratio, abundance and similar data. These data can be used to analyze whether the human intestinal flora is disordered, to determine whether known or potential pathogenic bacteria are present, to analyze the species and abundance of intestinal probiotics, and to analyze the correlation between intestinal flora disorders and health status, diseases and the like.

Description

Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence
Technical Field
The invention discloses a method that uses the sequence of the V3-V4 region of the bacterial 16S rRNA gene to detect and analyze human fecal specimens, and that can detect and annotate the diversity and composition ratio of the intestinal flora at the "species" level. In particular, it can detect unknown bacteria that are dominant in number and proportion but have not been isolated or studied. The invention belongs to the technical fields of microbial ecology, microbial taxonomy and microbiology.
Background
Since the rise of microbiome research, many studies have suggested that human growth and development, nutritional metabolism, disease states (such as colorectal cancer, obesity and diabetes), immune responses and the like are related to the intestinal flora. However, how many "species" does the human intestinal flora contain, and how abundant is each "species"? No clear answer has been given to date. In the past, the diversity of the intestinal flora was studied mainly by isolation and culture techniques. Because of the selectivity of the culture media and culture conditions used, such as culture temperature, oxygen content, amino acid and carbohydrate composition and salt concentration, one can only obtain the bacteria that grow on these media under these conditions. The large number of bacteria that cannot grow under these conditions, and that have so far not been isolated, cultured or identified, are overlooked, which yields a great deal of misleading information.
It is estimated that there are about 10^12 prokaryotic species on Earth, most of which are bacteria. The taxonomic hierarchy of bacteria includes kingdom, phylum, class, order, family, genus and species. "Species" is the lowest taxonomic unit of bacteria. The taxonomic units of bacteria most commonly involved in medicine are "genus" and "species". A bacterial "genus" may include several "species" (e.g., Escherichia, with 6 "species") or several hundred "species" (e.g., Streptococcus, with more than 200 "species"). Bacteria of different "species" of the same genus can differ greatly in biological and medical significance: some are probiotics (e.g., Streptococcus thermophilus) and some are pathogenic bacteria (e.g., Streptococcus suis). Therefore, limiting information on the taxonomic diversity and composition ratio of the intestinal flora to the "genus" level is far from sufficient and is likely to be misleading. Only analysis at the "species" level can properly reveal the correlation between changes in the diversity and composition ratio of the intestinal flora and health, disease and the like, and thus offer reasonably clear medical reference value.
All bacteria have 16S rRNA, a ribosomal RNA of the small ribosomal subunit that participates in processes such as protein synthesis and serves as a molecular clock of bacterial evolution. The corresponding gene in the bacterial genome, the bacterial 16S rRNA gene, is about 1500 bases long and consists of 9 variable regions (V1-V9) alternating with conserved regions. The conserved regions of the 16S rRNA gene are highly conserved, while the variable regions vary between species, and the degree of variation is closely related to the phylogenetic position (taxonomic species, genus, family, etc.) of the bacterium. Thus, 16S rRNA gene sequence analysis can be used to identify and classify all bacteria. If the full-length sequence of the 16S rRNA gene is used, the bacteria under test can be identified to the "species" level in most cases.
With partial 16S rRNA gene sequences, such as the V3-V4 segment, well-studied bacteria whose 16S rRNA gene sequences are available in public databases can be classified to "species"; because reference sequences are lacking, most unknown bacteria can only be classified into higher taxonomic units such as "genus" or "family". In a few cases, the full-length 16S rRNA genes of some bacterial "species" are so similar that the "species" cannot be accurately identified from the 16S rRNA gene alone. Such "species" that cannot be distinguished using the full-length 16S rRNA gene are usually grouped together.
16S rRNA gene sequence analysis has become an important method for bacterial detection and identification and for flora diversity analysis. With the development of sequencing technology and the reduction of cost, high-throughput sequencing on second-generation sequencing platforms can yield massive numbers of bacterial 16S rRNA gene sequences without bacterial culture, providing a powerful tool for studying flora diversity. A common method for analyzing the diversity of the intestinal flora is high-throughput sequencing of the 16S rRNA gene V3-V4 region (about 400 bases) of a stool specimen on an Illumina sequencing platform. A single specimen can yield more than one hundred thousand 16S rRNA gene sequences, and taxonomic analysis and identification of the intestinal (fecal) flora in the specimen is then completed through sequence alignment, analysis and annotation. This gives data on the diversity of the intestinal flora (how many "species" or "genera" of bacteria it contains) and its composition ratio (the percentage of all sequences that belongs to each "species" or "genus"). However, a large part of the intestinal flora consists of unknown bacteria that have not been isolated or identified, so the corresponding full-length 16S rRNA gene sequences are not available for alignment. Therefore, existing intestinal flora analysis techniques can only identify the numerically dominant unknown bacteria to the level of "genus" or above, and cannot accurately identify the "species".
The defects of the prior art are as follows: the 16S rRNA gene V3-V4 region sequence amplified for second-generation sequencing has only about 400 bases, most sequences can be identified only to the taxonomic level of genus or above, and only diversity and composition ratio data at the genus level or above can be obtained for the intestinal flora. Analysis data at the genus level or above cannot accurately reveal the relationship between changes in the intestinal flora and health or disease, which limits the application and adoption of intestinal flora analysis. The invention aims to provide a method for detecting, identifying and analyzing human flora at the species level.
Disclosure of Invention
In view of the above objects, the present invention provides, in a first aspect, a method for identifying human flora at the "species" level based on full-length or nearly full-length bacterial 16S rRNA gene sequences, wherein the full-length or nearly full-length 16S rRNA gene sequence has a length of 1450-1500 bases, the method comprising the following steps:
(1) constructing a human intestinal flora 16S rRNA gene reference sequence library that uses the bacterial operational phylogenetic unit (OPU) as the basic annotation unit. The OPUs include all known bacteria as well as the numerous previously undiscovered unknown bacteria of the human gut found by the present invention. The reference sequence library includes all known bacteria that have obtained a "species" level name, and unknown bacteria. OPUs that have obtained a name in the prior art (known bacteria) are annotated with that name; for OPUs that have not, the OPU and its code, together with its higher taxonomic unit, are used as the unique designation of the bacterium. The invention constructs a full-length 16S rRNA gene reference sequence library of the human intestinal flora. This database includes all named bacteria and the unknown intestinal bacteria found by the present invention. The 16S rRNA gene sequences of the reference strains of all named known bacteria come from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence libraries recorded and published by the List of Prokaryotic names with Standing in Nomenclature (LPSN), the National Center for Biotechnology Information (NCBI), and the SILVA online quality control and alignment database for bacterial 16S rRNA gene sequences;
(2) constructing a reference sequence library of the V3-V4 region of the bacterial 16S rRNA gene: the V3-V4 region sequences are excised in silico from the full-length 16S rRNA gene reference sequence library of the human intestinal flora. The in silico excision uses the binding sites of the universal amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2) of the V3-V4 region of the 16S rRNA gene. After entries with completely identical sequences are merged, a working library of reference sequences of the V3-V4 region of the intestinal flora 16S rRNA gene is formed. It can be used to detect and identify all known bacteria (more than 18,000 species) and the unknown bacteria (774 OPUs) of healthy human intestinal flora discovered by the invention;
(3) sequencing the 16S rRNA gene of the specimen to be tested; in a specific embodiment of the invention, the V3-V4 region of the 16S rRNA gene is sequenced;
(4) taking the 16S rRNA gene sequences of the specimen obtained in step (3) as query sequences, and comparing them against the working library of reference sequences of the V3-V4 region of the intestinal flora 16S rRNA gene from step (2) for species identification. A query sequence that is completely identical (100%) to a sequence with taxonomic information in the reference sequence working library is identified under the annotation name of that sequence. In a specific embodiment of the invention, the 16S rRNA gene V3-V4 region sequences obtained from the specimen to be tested are compared with the reference sequence library of the 16S rRNA gene V3-V4 region. A sequence that is 100% identical to the V3-V4 reference sequence of a known bacterial "species" in the working library is annotated with the taxonomic "species" name of that known bacterium; a sequence that is 100% identical to the V3-V4 reference sequence of an unknown bacterium in the working library is annotated as an unknown bacterium and assigned a unique OPU number. Unknown bacteria include suspected new species and higher-rank units. Higher-rank units are those that are difficult to identify accurately from the 16S rRNA gene sequence alone; they are represented by the next higher taxonomic unit and the OPU code.
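The exact-match annotation described in step (4) can be sketched as a dictionary lookup. This is a minimal illustration, not the inventors' implementation: the function name `annotate_reads`, the toy sequences, the labels and the `"unassigned"` fallback for reads without a 100% match are all assumptions made for the example, since the text does not say how non-matching reads are handled.

```python
def annotate_reads(reads, working_library, unmatched_label="unassigned"):
    """Annotate each read by exact (100%) identity to a working-library entry.

    working_library maps a V3-V4 reference sequence to its annotation
    (a species name for known bacteria, an OPU code for unknown bacteria).
    The unmatched_label fallback is an assumption of this sketch.
    """
    annotations = []
    for read in reads:
        # Normalize case and RNA/DNA alphabet before the exact comparison.
        key = read.upper().replace("U", "T")
        annotations.append(working_library.get(key, unmatched_label))
    return annotations


# Toy working library with hypothetical sequences and labels.
toy_library = {
    "ACGTACGT": "known species A",
    "TTTTCCCC": "OPU_0001",
}
toy_result = annotate_reads(["acgtacgt", "TTTTCCCC", "GGGG"], toy_library)
```

A hash lookup is used because the rule is strict identity; no alignment scoring is needed under that assumption.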
In a preferred embodiment, the method further comprises the step of analyzing the species, composition ratio and/or abundance of the bacteria identified in step (3) in the specimen to be tested. In specific applications, the analysis can report, as required and without limitation: the number of OPUs contained in the specimen; the number, species and abundance of known bacteria; the species, number and abundance of unknown bacteria; the percentage of the total intestinal flora represented by each "species" or OPU; the species and abundance of probiotics; the species and abundance of pathogenic bacteria and potential pathogenic bacteria; and the number and abundance of dominant OPUs.
In another preferred embodiment, the 16S rRNA gene sequence used in the method is a V3-V4 region sequence. The method of the present invention can be used to identify and analyze flora based on the V3-V4 region of the 16S rRNA gene, but is not limited to the V3-V4 region, and can also be used with other regions of the 16S rRNA gene.
In a preferred embodiment, the human flora is the flora of the digestive tract, skin, mouth, nasopharynx, eye, vagina, urinary tract or ear.
In another preferred embodiment, the sequence determination in step (2) of the method is high-throughput sequencing. In a specific embodiment of the invention, the sequences are obtained by deep sequencing of the 16S rRNA gene V3-V4 region of the intestinal or fecal specimen to be tested on an Illumina second-generation sequencing platform.
In a second aspect, the invention provides a method for constructing the OPU-based reference sequence library of the V3-V4 region of the human intestinal flora 16S rRNA gene used in step (1) of the above method for detecting and identifying human intestinal flora at the "species" level based on analysis of full-length or nearly full-length bacterial 16S rRNA gene sequences, the construction method comprising the following steps:
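As an illustration of the composition-ratio and richness figures listed above, the following sketch counts annotated reads per "species" or OPU and converts the counts to percentages of all annotated reads. The function names and the toy labels are hypothetical; the text does not prescribe a particular implementation.

```python
from collections import Counter


def composition_ratio(labels):
    """Return the percentage of reads assigned to each species or OPU label."""
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {label: 100.0 * n / total for label, n in counts.items()}


def richness(labels):
    """Return the number of distinct species/OPU labels in the specimen."""
    return len(set(labels))


# Toy per-read annotations (hypothetical labels).
toy_labels = ["species A", "species A", "OPU_0001", "species B"]
toy_ratios = composition_ratio(toy_labels)
```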
(1) sequencing and quality control: obtaining bacterial 16S rRNA gene sequences from human specimens, and deleting low-quality sequences through quality control (for example, sequences containing a base with a quality value lower than 10, sequences in which the primers at both ends cannot be recognized, and chimeric sequences (chimeras)); in the invention, the inventors applied the third-generation PacBio sequencing platform to obtain full-length or nearly full-length sequences of the human intestinal flora 16S rRNA gene (1450-.
Quality control analysis was performed using PacBio SMRT Link (version 6.0.0). Samples were demultiplexed according to RSII_384_Barcodes, with the Minimum Barcode Score set to 26. Circular consensus sequencing (CCS) was used to reduce the sequence error rate, with the parameters set to a minimum of 5 CCS passes and a Minimum Predicted Accuracy higher than 99.9%. Subsequently, ambiguous bases, low-quality sequences, primers and sequencing adapters were filtered out using QIIME software. Sequences shorter than 1200 bp or longer than 1600 bp were removed. In one embodiment of the present invention, the inventors used the bioinformatics software USEARCH (http://www.drive5.com/usearch/), the chimera detection software UCHIME, and QIIME (full name: Quantitative Insights Into Microbial Ecology) to screen out 594,075 full-length or nearly full-length 16S rRNA gene sequences;
(2) delineating bacterial operational taxonomic units (Operational Taxonomic Unit, OTU): grouping the 16S rRNA gene sequences from step (1) that share a sequence identity of 98.7% or more into one OTU (several OTUs can be obtained from each stool specimen, and each OTU comprises several 16S rRNA gene sequences);
(3) determining representative sequences for each OTU: selecting the 10 most frequent 16S rRNA gene sequences in each OTU obtained in step (2) as the representative sequences of that OTU; if an OTU contains fewer than 10 sequences, all of them are selected as its representative sequences;
(4) constructing the all-bacteria phylogenetic tree: aligning each OTU representative sequence obtained in step (3) with the 16S rRNA gene sequences of named bacterial reference strains, and inserting the aligned OTU representative sequences into the database of 16S rRNA gene sequences of all named bacterial reference strains, with the parameter set to LTP 50. The all-bacteria phylogenetic tree is constructed from the inserted OTU representative sequences and the 16S rRNA gene sequences of the named bacterial reference strains using the neighbor-joining method with Jukes-Cantor correction, with the degree of conservation set at 30%.
In a specific embodiment of the present invention, the phylogenetic tree is constructed as follows: the representative sequences of each OTU obtained in step (3) are aligned to the 16S rRNA gene sequences of all known bacteria (LTP132 database) using the SINA software (version 1.2.11). The aligned OTU representative sequences are inserted into the databases of 16S rRNA gene sequences of all named bacterial reference strains (LTP132 database and SILVA Ref NR 132 database) using the Parsimony tool built into the ARB software (version 6.0.6), with the parameter set to LTP 50. The all-bacteria phylogenetic tree is constructed from the inserted OTU representative sequences and the 16S rRNA gene sequences of the named bacterial reference strains using the neighbor-joining method with Jukes-Cantor correction, with the degree of conservation set at 30%.
(5) Finding unknown bacteria in the intestinal tract of healthy people: on the constructed all-bacteria phylogenetic tree, the representative sequences of a query OTU cluster with the most similar 16S rRNA gene sequence and form a branch on the tree (FIG. 1); this branch is identified as an OPU (operational phylogenetic unit). If the representative sequence of an OTU is 98.7% or more identical to its nearest 16S rRNA gene sequence on the all-bacteria phylogenetic tree and that sequence has obtained a name, the name of the named bacterium can be used for annotation. Such OPUs are identified as known bacteria (FIG. 1). If the identity between the OTU representative sequence and its nearest 16S rRNA gene sequence on the all-bacteria phylogenetic tree is below 98.7%, but 95% or more with the representative sequences of other "species" within the genus, the OPU can be identified as a suspected new species of unknown bacteria (FIG. 1). If the identity between the OTU representative sequence and the 16S rRNA gene sequence of its nearest reference strain on the all-bacteria phylogenetic tree is below 95% and no name has been obtained, the OPU can be designated a higher-rank unit of unknown bacteria, named by the higher taxonomic unit together with the OPU number (FIG. 1).
(6) Constructing the OPU-based human intestinal flora 16S rRNA gene reference sequence library: on the phylogenetic tree constructed from the 16S rRNA genes of known bacteria, a query sequence clusters with the taxonomically closest reference sequence and forms an independent branch on the all-bacteria phylogenetic tree, which is named an OPU (FIG. 1). An OPU whose query sequence is 98.7% or more similar to the nearest reference sequence is determined to be a known bacterium; if the similarity between the query sequence and the nearest reference sequence is less than 98.7%, the OPU is determined to be an unknown bacterium. OPUs that have obtained a name in the prior art (known bacteria) are annotated with that name; an OPU without a name is an unknown bacterium, and the OPU and its code are used as the unique designation of the bacterium;
in one embodiment of the present invention, 1235 OPUs were obtained in this step from the approximately 594,000 full-length or nearly full-length 16S rRNA gene sequences (1450-. The 1235 OPUs included 461 "species" of known bacteria and 774 unknown bacteria;
(7) trimming the 16S rRNA gene reference sequence library obtained in step (5) to the V3-V4 region, and merging the entries with completely identical sequences to form a working library of reference sequences of the V3-V4 region of the intestinal flora 16S rRNA gene.
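The identity thresholds of step (5) can be read as a simple decision rule. The sketch below is a simplification under stated assumptions: it takes a single percent-identity value to the nearest reference sequence plus a flag saying whether that reference is named, whereas the text distinguishes identity to the nearest sequence from identity to other "species" within the genus and relies on the position in the phylogenetic tree. The function name and the handling of an unnamed nearest reference at 98.7% or above are choices of this example only.

```python
KNOWN_SPECIES_THRESHOLD = 98.7  # percent identity, from step (5)
NEW_SPECIES_THRESHOLD = 95.0    # percent identity, from step (5)


def classify_opu(nearest_identity, nearest_has_name):
    """Classify an OPU from its identity (in percent) to the nearest reference.

    Simplification: one identity value stands in for the tree-based
    comparison described in the text.
    """
    if nearest_identity >= KNOWN_SPECIES_THRESHOLD and nearest_has_name:
        return "known species"
    if nearest_identity >= NEW_SPECIES_THRESHOLD:
        # Assumption: an unnamed nearest reference at >= 98.7% also lands here.
        return "suspected new species"
    return "higher-rank unit of unknown bacteria"


toy_classes = [
    classify_opu(99.2, True),
    classify_opu(97.0, True),
    classify_opu(91.5, False),
]
```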
In a preferred embodiment, the sequencing in step (1) uses the third-generation PacBio sequencing platform to determine full-length bacterial 16S rRNA gene sequences from at least 120 healthy human stool specimens, and the low-quality sequences deleted in quality control comprise sequences containing a base with a quality value lower than 10, sequences in which the primers at both ends cannot be recognized, and chimeras. In a specific embodiment of the present invention, full-length (1450-1500 bases) bacterial 16S rRNA gene sequences were determined from 120 healthy human stool specimens.
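One way to read the base-quality criterion is "discard a read if any base has a quality value below 10". The sketch below applies that reading to a FASTQ quality string. The Phred+33 encoding, the function names and the interpretation of the criterion as a per-base minimum are assumptions of this example; the text does not state how the quality values were encoded or evaluated.

```python
def min_phred(quality_string, offset=33):
    """Return the lowest Phred score in a FASTQ quality string (Phred+33 assumed)."""
    return min(ord(ch) - offset for ch in quality_string)


def passes_base_quality(quality_string, threshold=10):
    """Keep a read only if every base has a quality value of at least `threshold`."""
    # Empty quality strings are rejected rather than passed to min().
    return len(quality_string) > 0 and min_phred(quality_string) >= threshold


# 'I' encodes Q40, '+' encodes Q10 and '*' encodes Q9 under Phred+33.
toy_checks = [passes_base_quality(q) for q in ("IIII", "II+I", "II*I")]
```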
In a preferred embodiment, the 16S rRNA gene sequences of the named bacterial reference strains in step (4) come from published reference sequence libraries, including but not limited to the 16S rRNA gene sequence libraries recorded and published by the List of Prokaryotic names with Standing in Nomenclature, the National Center for Biotechnology Information, and the online quality control and alignment database for bacterial 16S rRNA gene sequences. The List of Prokaryotic names with Standing in Nomenclature (LPSN: https://www.bacterio.net/) and the National Center for Biotechnology Information (NCBI RefSeq database: https://www.ncbi.nlm.nih.gov/) currently publish a total of 38,000 16S rRNA gene sequences of known bacterial reference strains, including the reference strains of 18,000 published and approved bacterial species and subspecies. The reference sequence library also includes 16S rRNA gene sequences from the online quality control and alignment database for bacterial 16S rRNA gene sequences (SILVA, https://www.arb-silva.de/) that have the same name, contain less than 2% degenerate bases, have an identity of 99% or more and a length of 1000 bases or more, 143,000 sequences in total. (A degenerate base is a single symbol that stands for two or more bases; for example, the degenerate base N may represent any of the four bases U/C/A/G.) These sequences are mainly derived from non-reference strains. They supplement the 16S rRNA gene sequences of the taxonomic reference strains of known bacteria and thereby improve diversity and coverage. The three online databases described here are all open public databases and do not limit the sources or construction method of the database of the present invention; any database that provides diversity and coverage of bacterial sources can be used in the method of the present invention.
The invention integrates the sequences of these 3 or more databases to form a reference sequence library of intestinal bacterial 16S rRNA genes containing more than 800,000 sequences (including the sequences discovered in the intestinal flora of 120 healthy people and the 16S rRNA gene sequence libraries recorded and published by the List of Prokaryotic names with Standing in Nomenclature, the National Center for Biotechnology Information, and the online quality control and alignment database for bacterial 16S rRNA gene sequences). The figure of more than 800,000 16S rRNA gene sequences does not limit the size or construction method of the database of the present invention; any database that provides diversity and coverage of bacterial sources can be used in the method of the present invention.
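Two of the inclusion criteria stated for the SILVA-derived sequences, less than 2% degenerate bases and a length of 1000 bases or more, can be checked directly on a sequence string. The sketch below covers only those two criteria; the 99% identity and same-name criteria require comparison against other records and are left out. The function names are hypothetical.

```python
UNAMBIGUOUS_BASES = set("ACGTU")


def degenerate_fraction(seq):
    """Return the fraction of bases that are not plain A, C, G, T or U."""
    seq = seq.upper()
    if not seq:
        return 0.0
    degenerate = sum(1 for base in seq if base not in UNAMBIGUOUS_BASES)
    return degenerate / len(seq)


def passes_reference_filter(seq, max_degenerate=0.02, min_length=1000):
    """Apply the length (>= 1000 bases) and degenerate-base (< 2%) criteria."""
    return len(seq) >= min_length and degenerate_fraction(seq) < max_degenerate


# Toy sequences built by repetition; they are not real 16S rRNA genes.
clean = "ACGT" * 300               # 1200 bases, no degenerate bases
too_short = "ACGT" * 200           # 800 bases
too_degenerate = "N" * 30 + clean  # 30 of 1230 bases are degenerate (about 2.4%)
```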
In a preferred embodiment, the trimming in step (6) is an in silico excision of the V3-V4 region sequence of the 16S rRNA gene.
More preferably, the sequence of the upstream cut site used for the in silico excision is shown in SEQ ID NO.1 (CCTAYGGGRBGCASCAG), and the sequence of the downstream cut site is shown in SEQ ID NO.2 (GGACTACNNGGGTATCTAAT). In the method, the trimming in step (6) uses the binding sites of the universal amplification primers 341F (SEQ ID NO.1) and 806R (SEQ ID NO.2) of the V3-V4 region of the 16S rRNA gene to excise, in silico, the V3-V4 region sequences of all intestinal flora reference sequences. After entries with completely identical sequences are merged, a working library of reference sequences of the V3-V4 region of the intestinal flora 16S rRNA gene is formed. It comprises 273,000 16S rRNA gene V3-V4 sequences and can be used to detect and identify the 18,000 published known bacteria and the unknown bacteria of healthy human intestinal flora.
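The in silico excision and merging described above can be sketched with regular expressions built from the two degenerate primer sequences. This is an illustration under assumptions, not the inventors' pipeline: it requires exact matches to the IUPAC-expanded primers (no mismatches), looks for 806R as its reverse complement on the forward strand, and by default returns the region between the primer sites; whether the primer sites themselves are retained is not stated in the text, so `keep_primers` is offered as a hypothetical switch. All function names are invented for the example.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

PRIMER_341F = "CCTAYGGGRBGCASCAG"     # SEQ ID NO.1
PRIMER_806R = "GGACTACNNGGGTATCTAAT"  # SEQ ID NO.2


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


FORWARD_SITE = primer_regex(PRIMER_341F)
REVERSE_SITE = primer_regex(reverse_complement(PRIMER_806R))


def extract_v3v4(seq, keep_primers=False):
    """Return the region bounded by the 341F and 806R sites, or None if absent."""
    seq = seq.upper().replace("U", "T")
    fwd = FORWARD_SITE.search(seq)
    if fwd is None:
        return None
    rev = REVERSE_SITE.search(seq, fwd.end())
    if rev is None:
        return None
    if keep_primers:
        return seq[fwd.start():rev.end()]
    return seq[fwd.end():rev.start()]


def build_working_library(full_length):
    """Excise V3-V4 from each named sequence and merge identical regions."""
    merged = {}
    for name, seq in full_length.items():
        region = extract_v3v4(seq)
        if region is not None:
            merged.setdefault(region, set()).add(name)
    return merged
```

Keying the merged library by the excised sequence means that entries with identical V3-V4 regions collapse into one record that lists every name sharing that region.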
In the prior art, methods that detect the intestinal flora by high-throughput sequencing of the V3-V4 region of the 16S rRNA gene can only detect known bacteria and cannot detect unknown bacteria. The invention solves this technical problem through the definition, discovery and annotation of OPUs and the construction of an OPU-based bacterial phylogenetic tree. It can detect unknown bacteria and describe and annotate them as OPUs, supports analysis and prediction of the pathogenicity and therapeutic use of unknown bacteria, and greatly improves the efficiency of bacterial identification, pathogen discovery and probiotic screening. With the method provided by the invention, 774 unknown bacteria, namely 774 OPUs, were found in the human intestinal flora. In particular, the fecal flora of more than 60% of the Chinese subjects was found to share 116 OPUs, comprising 38 known bacteria and 78 unknown bacteria (expressed as coded OPUs), which account for about 83.42% of the total flora. Detection of the unknown bacteria of the intestinal flora is achieved by using the discovered full-length 16S rRNA gene sequences of the unknown intestinal bacteria as taxonomic references, which no existing technology can currently achieve.
According to the invention, by comparison with the 16S rRNA gene sequences of the unknown bacteria and the known bacteria, on average more than 95% of the V3-V4 high-throughput sequencing data of the 16S rRNA gene from a stool specimen can be identified as known bacteria or unknown bacteria (OPUs). The identification rate based on the V3-V4 region sequence is thereby raised from 37.8% in the prior art to 95.6% or above. The method can analyze imbalance of the intestinal flora at the "species" level; it can find known and potential pathogenic bacteria, analyze the species and abundance of intestinal probiotics, and in particular analyze the relationship between the intestinal flora and health or disease. It can be used to evaluate the diversity of the human intestinal flora, health status, disease status and the like, including analysis of the diversity and composition ratio of the intestinal flora of a patient.
Drawings
FIG. 1. Technical roadmap for partitioning bacterial operational phylogenetic units (OPUs);
FIG. 2. Composition ratio thresholds of the 116 bacteria (OPUs) of the intestinal resident flora of healthy people;
FIG. 3. Composition ratio thresholds of the 116 bacteria (OPUs) of the intestinal resident flora of healthy people;
FIG. 4. Composition ratio thresholds of the 116 bacteria (OPUs) of the intestinal resident flora of healthy people;
FIG. 5. Diversity (number of species) and abundance (composition ratio) of the fecal flora of healthy Chinese people;
FIG. 6. Fecal flora structure and abundance of an adult diarrhea patient (F32);
FIG. 7. Fecal flora structure and abundance of a cirrhosis patient (F54);
FIG. 8. Fecal flora structure and abundance of an infantile diarrhea patient (F181).
Detailed Description
The invention is further described below with reference to specific examples. The advantages and features of the present invention will become more apparent as the description proceeds. These examples are only illustrative and do not limit the scope of protection defined by the claims of the present invention.
Construction example 1 construction of working library of reference sequences of intestinal flora 16S rRNA Gene V3-V4 region
1. Construction of intestinal flora 16S rRNA gene reference sequence library
(1) Obtaining 16S rRNA gene sequences of 1235 OPUs from healthy human intestinal bacteria
A PacBio sequencing platform was used to sequence intestinal flora specimens from 120 healthy Chinese people, yielding 850,935 16S rRNA gene sequences. Quality control analysis was performed using PacBio SMRT Link (version 6.0.0). Circular consensus sequencing (CCS) was used to reduce the sequence error rate, with parameters set to a minimum of 5 CCS passes and a minimum predicted accuracy (Minimum Predicted Accuracy) above 99.9%. Subsequently, ambiguous bases, low-quality sequences, primers and sequencing adapters were filtered out using QIIME software. Sequences shorter than 1200 bases or longer than 1600 bases were removed, leaving 594,075 full-length or nearly full-length 16S rRNA gene sequences, which were divided into 1235 OPUs. Each OPU can include multiple high-frequency representative 16S rRNA gene sequences with an identity of 99% or above, which can be used as reference sequences.
OPU is the abbreviation of operational phylogenetic unit. An OPU is the smallest taxonomic monophyletic group comprising a set of full-length 16S rRNA gene sequences, and represents a population of bacterial strains. The 16S rRNA gene sequences of the strains within an OPU are each other's closest relatives and belong to a single monophyletic group; different OPUs belong to different monophyletic groups. OPUs are numerous and include both published known bacteria and unknown bacteria. Known bacteria can be annotated with the names published by the international committee for bacterial taxonomy through the list of standard names of prokaryotes, such as Streptococcus pneumoniae. Unknown bacteria are annotated with a numbered OPU and may represent a new "species", "genus", "family", "order", "class", "phylum", etc. Under current taxonomic understanding, a new taxonomic unit at the "genus" level or above cannot be accurately discovered and defined from full-length 16S rRNA gene sequence analysis alone.
OPU partitioning involves two stages: partitioning into OTUs, then partitioning into OPUs. The procedure is as follows:
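The length filter described above can be illustrated with a minimal sketch. It assumes the reads are already loaded into a dictionary keyed by read ID; the function name and the toy reads are hypothetical and are not part of the QIIME workflow used in the example.

```python
def filter_by_length(sequences, min_len=1200, max_len=1600):
    """Keep reads whose length lies within [min_len, max_len] bases.

    The 1200 and 1600 base cut-offs come from the text; everything else
    here (names, data layout) is an illustrative assumption.
    """
    return {
        read_id: seq
        for read_id, seq in sequences.items()
        if min_len <= len(seq) <= max_len
    }


# Toy reads: only read_a has a plausible full-length 16S rRNA gene size.
reads = {"read_a": "A" * 1450, "read_b": "A" * 900, "read_c": "A" * 1700}
kept = filter_by_length(reads)
```

Treating the two cut-offs as inclusive matches the wording that only sequences shorter than 1200 or longer than 1600 bases are removed.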
1) Full-length 16S rRNA gene sequencing. The 16S rRNA gene (V1-V9) in fecal samples is sequenced on a third-generation sequencing platform (PacBio RS II) to obtain full-length or nearly full-length sequences (1450 to 1500 bases).
2) Sequencing data quality control. Ambiguous bases and chimeras are removed using QIIME (Quantitative Insights Into Microbial Ecology) and UCHIME, the chimera detection program of the bioinformatics software USEARCH (http://www.drive5.com/usearch/). This is a conventional procedure.
3) OTU partitioning. OTUs (operational taxonomic units) are partitioned using the OTU clustering and representative sequence identification algorithm of the USEARCH software. All 16S rRNA gene sequences with at least 98.7% identity are assigned to one OTU. The 10 most frequently occurring 16S rRNA gene sequences in each OTU are selected as its representative sequences. If an OTU contains fewer than 10 distinct sequences, all of them are included.
4) Identifying OTUs as known bacteria. The representative 16S rRNA gene sequences of the query OTU are added to LTP 123, the database of the phylogenetic tree of all known bacteria (The All-Species Living Tree), and aligned using the 16S rRNA sequence alignment software SINA (SILVA Incremental Aligner). Sequences that can be aligned (98.7% identity or more) can be inserted into the phylogenetic tree of all known bacteria. Based on the sequence alignment and the topology and interrelationships of the phylogenetic tree, an OTU is annotated as a known bacterium, such as Streptococcus suis, if its sequences can be assigned to the 16S rRNA gene sequences of that bacterium and form an independent branch. A known bacterium that forms an independent branch on the phylogenetic tree is an OPU with a taxonomic name.
5) Identifying unknown bacteria. OTUs with less than 98.7% sequence identity to the 16S rRNA genes of the reference strains of all known bacteria are identified as unknown bacteria and annotated using the OPU method. The representative 16S rRNA gene sequences of these OTUs are added to the non-redundant reference database of SILVA (SILVA Reference Non Redundant, SILVA SSURef_NR_132) for a secondary alignment.
The 16S rRNA gene sequences found in the secondary alignment that are closest in identity to the query sequences, the representative 16S rRNA gene sequences of the query OTUs, and the 16S rRNA gene sequences of all known bacterial reference strains in the LTP128 database are used to construct a phylogenetic tree of all bacteria by the neighbor-joining method using SINA. Archaea are set as the root (FIG. 1).
The topology of the resulting phylogenetic tree is analyzed and each OPU is defined. Each OPU is the smallest monophyletic group and includes at least two types of sequences: the representative sequences of OTUs, and the 16S rRNA gene sequences closest to these representative sequences, in particular the closest 16S rRNA gene sequence of a reference strain (FIG. 1).
6) Annotation of suspected new species. If an OPU can be assigned to a certain "genus" but has less than 98.7% sequence identity to the 16S rRNA genes of the reference strains of all "species" within that "genus", it can be annotated as an unknown new bacterial species.
7) Annotation of high-order unit OPUs. If, according to the bacterial phylogenetic tree, an OPU can only be assigned to a "family" or a taxonomic unit above "family", it is treated as an unknown high-order unit, which can be considered to represent at least one unknown "genus", because taxonomic identification at levels above "species" cannot be made reliably from the full-length 16S rRNA gene sequence alone (FIG. 1).
8) OPU number. All OPUs are numbered uniformly. The number of each OPU is unique.
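The representative-sequence rule in step 3) (the 10 most frequent sequences of each OTU, or all of them if there are fewer) can be sketched as follows. This is an illustrative stand-in, not the USEARCH algorithm itself; it assumes the reads of one OTU are given as a plain list of strings, and the function name is hypothetical.

```python
from collections import Counter


def representative_sequences(otu_sequences, top_n=10):
    """Return up to top_n distinct sequences of one OTU, most frequent first.

    If the OTU holds fewer than top_n distinct sequences, all are returned,
    as described in step 3).
    """
    counts = Counter(otu_sequences)
    return [seq for seq, _ in counts.most_common(top_n)]


# Toy OTU with three distinct sequences of decreasing frequency.
otu = ["AAA"] * 5 + ["CCC"] * 3 + ["GGG"]
top_two = representative_sequences(otu, top_n=2)
all_reps = representative_sequences(otu)
```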
Among the 120 healthy human stool specimens, 1235 OPUs were delineated using the above method. Of these, 461 OPUs were identified as known bacteria at the "species" level, and 774 OPUs (62.7%) were unknown bacteria. Of the 774 unknown bacterial OPUs, 358 were suspected new species that could be assigned to a genus and were annotated with that "genus". The remaining 416 OPUs, which could not be accurately identified, were annotated as "high-order units" (FIG. 1).
Of the full-length or nearly full-length 16S rRNA gene sequences obtained from the 120 healthy Chinese fecal specimens, 54.45% belong to unknown bacteria that have not been isolated, named or studied, suggesting that more than 50% of the intestinal flora consists of unknown bacteria.
Among the 1235 OPUs of the intestinal flora of healthy Chinese people, 116 OPUs could be detected in more than 60% of fecal specimens. Of these, only 38 OPUs were known bacteria and 78 OPUs (67%) were unknown bacteria. FIG. 2 shows the composition ratios of the 116 bacteria with a detection rate of 60% or more and their ranges of variation. No bacterium had a detection rate of 100%. The composition of the intestinal flora varies considerably between healthy individuals but shows similarities. The 116 bacteria with a detection rate of 60% or more are referred to as the Chinese intestinal resident flora (FIG. 2) and are the main members maintaining the balance of the intestinal flora. Among them, known bacteria are represented by their approved names, such as Prevotella copri. Unknown bacteria are represented by an OPU code: for example, Bacteroides sp. 17 (OPU-532) represents a suspected new species of the genus Bacteroides that has not been isolated and identified, and Lachnospiraceae (OPU-001) represents a new member of the family Lachnospiraceae that is difficult to identify accurately from the 16S rRNA gene sequence alone and is called a high-order unit OPU.
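The three annotation outcomes (known species, suspected new species, high-order unit) can be summarized as a simple decision rule. The sketch below is only a simplification under stated assumptions: it takes the best identity to a reference strain and the taxonomic placement as given inputs, whereas the method derives them from the phylogenetic tree topology. The function name, arguments and label format are hypothetical, and the per-genus serial number used in labels such as "Bacteroides sp. 17" is omitted.

```python
SPECIES_IDENTITY_THRESHOLD = 98.7  # percent identity, taken from the text


def annotate_opu(opu_number, best_identity, species=None, genus=None,
                 higher_taxon=None):
    """Return an annotation label for one OPU (illustrative only)."""
    if species is not None and best_identity >= SPECIES_IDENTITY_THRESHOLD:
        # Known bacterium: use its published taxonomic name.
        return species
    if genus is not None:
        # Genus can be assigned but no species reaches the threshold.
        return f"{genus} sp. (OPU-{opu_number:03d})"
    # Only a family or higher taxon can be assigned: high-order unit.
    return f"{higher_taxon} (OPU-{opu_number:03d})"


known = annotate_opu(7, 99.4, species="Streptococcus suis")
suspected_new = annotate_opu(532, 97.1, genus="Bacteroides")
high_order = annotate_opu(1, 92.0, higher_taxon="Lachnospiraceae")
```

The counts reported above are consistent with this three-way split: 461 + 358 + 416 = 1235, and 774 / 1235 is about 62.7%.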
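The definition of the resident flora (OPUs detected in 60% or more of specimens) amounts to a prevalence calculation. The following sketch assumes each specimen is represented as the set of OPU labels detected in it; the function name and the toy data are hypothetical.

```python
def resident_flora(specimens, min_detection_rate=0.60):
    """Return {opu: detection_rate} for OPUs found in at least
    min_detection_rate of the specimens."""
    n_specimens = len(specimens)
    counts = {}
    for detected in specimens:
        for opu in set(detected):  # count each OPU once per specimen
            counts[opu] = counts.get(opu, 0) + 1
    return {
        opu: count / n_specimens
        for opu, count in counts.items()
        if count / n_specimens >= min_detection_rate
    }


# Toy data: five specimens, three OPU labels.
specimens = [
    {"Prevotella copri", "OPU-001"},
    {"Prevotella copri"},
    {"Prevotella copri", "OPU-001"},
    {"Prevotella copri", "OPU-532"},
    {"OPU-001"},
]
resident = resident_flora(specimens)
```

In this toy data Prevotella copri is found in 4 of 5 specimens and OPU-001 in 3 of 5, so both meet the 60% cut-off, while OPU-532 (1 of 5) does not.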
(2) Obtaining reference 16S rRNA gene sequences of all known bacterial reference strains. These include the 16S rRNA gene sequences of known bacterial reference strains from the list of standard names of prokaryotes (mainly bacteria) (LPSN: https://www.bacterio.net/) and from the National Center for Biotechnology Information (NCBI RefSeq database: https://www.ncbi.nlm.nih.gov/), totaling 38,000 sequences. Each bacterial "species" may include multiple 16S rRNA gene sequences.
(3) Expanding the reference 16S rRNA gene sequence library of the above known bacterial reference strains. High-quality sequences were selected from SILVA (an online database for quality checking and alignment of 16S rRNA gene sequences: https://www.arb-silva.de/) according to the following criteria: a taxonomic name completely identical to that of the sequences obtained above, a degenerate base ratio of less than 2%, a length of 1000 bp or more, and an identity of more than 99%, giving 14.3 thousand sequences in total. These sequences supplement the 16S rRNA gene sequences of known bacterial reference strains from the public databases and improve sensitivity, coverage and accuracy.
(4) Constructing the intestinal flora 16S rRNA gene reference sequence library. The 16S rRNA gene sequences of the 1235 OPUs from the intestinal bacteria of healthy people, the 16S rRNA gene sequences of the reference strains of known bacteria in the list of standard names of all prokaryotes, and the high-quality 16S rRNA gene sequences of known bacteria in the SILVA database were integrated to construct the intestinal flora 16S rRNA gene reference sequence library. The library comprises 850,000 high-quality bacterial 16S rRNA gene sequences and can detect and identify all 18,000 published bacterial species and subspecies; in particular, the 774 unknown bacteria can be detected and identified. The library is characterized by large capacity, long sequences and accurate taxonomic annotation. It is also updated as new bacteria are discovered and published, so that the aim of detecting and identifying all known bacteria is achieved (FIG. 1).
2. Construction of working library of reference sequence of intestinal flora 16S rRNA gene V3-V4 region
The 850,000 sequences in the constructed intestinal flora 16S rRNA gene reference sequence library were trimmed in silico at the binding sites of the 16S rRNA gene V3-V4 region amplification primers 341F (CCTAYGGGRBGCASCAG) and 806R (GGACTACNNGGGTATCTAAT) to obtain the V3-V4 regions of all 850,000 16S rRNA gene sequences. That is, each full-length 16S rRNA gene in the reference sequence library is virtually trimmed and only the V3-V4 region is retained, forming the working library of intestinal flora 16S rRNA gene V3-V4 region reference sequences. In the newly constructed working library, identical sequence entries are merged. The working library constructed in this example comprises 273,000 16S rRNA gene V3-V4 sequences and can detect and identify more than 18,000 bacterial species and subspecies. Because it includes the 16S rRNA gene sequences of unknown bacteria from healthy human intestines, most 16S rRNA gene V3-V4 sequences obtained from human fecal specimens can be identified to the bacterial "species" level.
The bacterial 16S rRNA gene V3-V4 sequence library constructed by the invention is a dynamic database. It can change as public online databases grow and as researchers add data from their own studies, but such changes do not affect the implementation of the method of the invention. As the database grows, the accuracy of identifying the human flora at the species level on the basis of bacterial 16S rRNA gene sequences improves accordingly. The core of the invention is not the construction of the database itself, but a dynamic and open method for constructing a human flora 16S rRNA gene reference sequence library based on the bacterial operational phylogenetic unit.
Construction example 2 Construction of an analysis method for the diversity and composition ratio of intestinal flora at the species level
On the basis of the database constructed in Construction example 1 (FIG. 1), an analysis method or system is constructed for determining the diversity and composition ratio of the intestinal flora of a test sample at the "species" level.
The specific embodiment comprises 4 parts: stool specimen collection and processing, high-throughput sequencing of the 16S rRNA gene V3-V4 region, taxonomic annotation at the "species" level, and presentation of the diversity and composition ratio of the human fecal flora.
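The in silico trimming and merging step can be sketched as follows. This is a minimal illustration under several assumptions: primers are matched exactly (no mismatches tolerated) after expanding IUPAC degenerate bases into a regular expression, the primer sites themselves are excluded from the retained region, and the reference library is a dictionary of hypothetical entries. None of these details is specified in the text.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC degenerate bases."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return re.compile("".join(IUPAC[base] for base in primer))


FWD = primer_regex("CCTAYGGGRBGCASCAG")              # 341F
REV = primer_regex(revcomp("GGACTACNNGGGTATCTAAT"))  # 806R on the sense strand


def extract_v3v4(full_length):
    """Return the region between the two primer sites, or None if either
    primer site is not found (assumption: exact primer matches only)."""
    fwd_hit = FWD.search(full_length)
    if fwd_hit is None:
        return None
    rev_hit = REV.search(full_length, fwd_hit.end())
    if rev_hit is None:
        return None
    return full_length[fwd_hit.end():rev_hit.start()]


def build_working_library(reference):
    """Map each distinct V3-V4 region to the set of annotations sharing it,
    which merges identical sequence entries."""
    library = {}
    for seq, annotation in reference.values():
        region = extract_v3v4(seq.upper())
        if region:
            library.setdefault(region, set()).add(annotation)
    return library
```

For instance, `revcomp("GGACTACNNGGGTATCTAAT")` yields `ATTAGATACCCNNGTAGTCC`, the site searched for downstream of the 341F match. Keeping a set of annotations per region makes it visible when two differently named references share an identical V3-V4 sequence; how such cases are resolved is not described in the text.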
1. Collection and processing of specimens
Fresh fecal samples were collected in urine collection cups, temporarily stored in an ice-pack sample box, and transported to the laboratory under cold chain for nucleic acid extraction. Nucleic acid was extracted from 200 mg of each fecal sample using a column-purification fecal nucleic acid extraction kit (Qiagen, cat. 51604) according to the kit protocol. Finally, the spin column was eluted with 200 μL of deionized water to collect the fecal nucleic acid for subsequent 16S rRNA gene amplification.
2. High-throughput sequencing of the 16S rRNA gene V3-V4 region
Fecal nucleic acids were PCR-amplified, the products were purified, and paired-end sequencing of the 16S rRNA gene V3-V4 region was performed on the Illumina MiSeq platform.
3. Taxonomic identification at the "species" level
The obtained 16S rRNA gene V3-V4 region sequences are subjected to quality control, and ambiguous bases and chimeras are removed by conventional methods. The sequences are then queried against the working library of intestinal flora 16S rRNA gene V3-V4 region reference sequences. Sequences found to be 100% identical to a reference sequence are annotated as known or unknown bacteria according to the taxonomic information of the matched reference sequence. If the match is a known bacterium, the corresponding taxonomic name is used, such as Streptococcus suis. If the match is an unknown bacterium, the corresponding coded OPU is used, including suspected new species, high-order units, etc. Sequences that cannot be annotated are labeled as unknown sequences (FIG. 1).
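Because annotation requires 100% identity with a reference sequence, the query step can be sketched as an exact dictionary lookup. The sketch assumes quality-controlled reads and a working library that maps each V3-V4 sequence to a single label; the function name and the toy sequences are hypothetical.

```python
UNKNOWN = "unknown sequence"


def annotate_reads(reads, working_library):
    """Label each read with the annotation of an identical reference
    sequence, or as an unknown sequence when there is no exact match."""
    return [working_library.get(read, UNKNOWN) for read in reads]


# Toy working library: one known bacterium and one coded OPU.
working_library = {
    "ACGTACGT": "Streptococcus suis",
    "TTGGCCAA": "Lachnospiraceae (OPU-001)",
}
labels = annotate_reads(["ACGTACGT", "TTGGCCAA", "GGGGGGGG"], working_library)
```

An exact lookup is the simplest way to express the 100% identity criterion; a production pipeline would also have to decide how to treat reads that differ from a reference only in length, which the text does not address.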
4. Analysis results of diversity and composition ratio of human fecal flora
(1) The method of the invention allows the diversity of the human intestinal flora to be detected and characterized at the taxonomic "species" level. The invention found that the intestinal flora of each healthy Chinese person contains on average 186 ± 51 OPUs, of which low-frequency flora (carried by less than 10% of people), medium-frequency flora (carried by 10% to 60% of people) and high-frequency flora (carried by more than 60% of people) account for 20 ± 11, 75 ± 29 and 90 ± 19 OPUs, respectively. A total of 1235 OPUs were detected, of which 774 (62.7%) were unknown bacteria (FIGS. 2-4).
In FIGS. 2-4, the intestinal resident flora of healthy people refers to bacteria with a positive rate of 60% or more in fecal specimens of healthy Chinese people. Known bacteria are represented by the bacterial names published in the list of standard names of prokaryotes (mainly bacteria) (LPSN: https://www.bacterio.net/), such as Prevotella copri. Unknown bacteria are represented in 2 ways: as suspected new species and as high-order units. A suspected new species is a potential new species that can be assigned to a "genus" but has not been isolated and identified; it is indicated by the genus name and OPU number, such as Bacteroides sp. 17 (OPU-532). A high-order unit is one that is difficult to identify accurately from the 16S rRNA gene sequence alone; it is represented by the lowest taxonomic unit that can be assigned and an OPU code, such as Lachnospiraceae (OPU-001), which represents a new member of the family Lachnospiraceae.
(2) Among known bacteria, V3-V4 region 16S rRNA gene sequences that are 100% identical to the reference sequences of pathogenic bacteria, opportunistic pathogens or probiotics can be unambiguously identified as the corresponding pathogenic bacteria, opportunistic pathogens or probiotics.
(3) The number of V3-V4 region 16S rRNA gene sequences assigned to each known bacterium (including those clearly identified as pathogenic bacteria, opportunistic pathogens or probiotics) and to each OPU, expressed as a percentage of the total number of V3-V4 region 16S rRNA gene sequences in the specimen, forms the composition ratio data of all known bacterial "species" and unknown bacterial OPUs.
(4) The data are compared with the composition ratios of the 116 OPUs (38 known bacteria and 78 unknown bacterial OPUs) shared by more than 60% of Chinese fecal specimens, and the comparison result, such as increased, decreased or absent, is reported.
(5) The key technical feature of the method is the discovery of the 16S rRNA gene sequences of 774 unknown bacteria, so that known intestinal bacteria can be detected and analyzed at the "species" level and unknown intestinal bacteria can be detected and analyzed at the "OPU" level.
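Items (3) and (4) can be sketched as a percentage calculation followed by a comparison with reference ranges. The sketch assumes each read has already been labeled and that a (low, high) composition ratio range is available per resident OPU; the range values and names below are invented for illustration and are not the thresholds of FIGS. 2-4.

```python
from collections import Counter


def composition_ratio(labels):
    """Percentage of reads assigned to each label in one specimen."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def compare_to_reference(ratios, reference_ranges):
    """Classify each reference OPU as absent, decreased, increased or
    within range relative to its (low, high) percentage range."""
    result = {}
    for label, (low, high) in reference_ranges.items():
        value = ratios.get(label, 0.0)
        if value == 0.0:
            result[label] = "absent"
        elif value < low:
            result[label] = "decreased"
        elif value > high:
            result[label] = "increased"
        else:
            result[label] = "within range"
    return result


# Toy specimen with ten labeled reads and hypothetical reference ranges.
labels = ["Prevotella copri"] * 8 + ["OPU-001"] * 2
ratios = composition_ratio(labels)
status = compare_to_reference(
    ratios,
    {"Prevotella copri": (10.0, 50.0), "OPU-001": (15.0, 30.0), "OPU-532": (1.0, 5.0)},
)
```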
Application example 1: Comparison of different sequencing and analysis methods on samples from 120 healthy persons
In this application example, samples from 120 healthy people were used to evaluate the database and alignment method of the present invention.
1. High-throughput sequencing of the 16S rRNA gene
Fecal samples from 120 healthy persons were analyzed for species-level intestinal flora composition and composition ratio using 2 different sequencing methods: sequencing of the 16S rRNA gene V3-V4 region on the Illumina MiSeq platform, and full-length 16S rRNA gene sequencing on the PacBio Sequel platform. Sequencing of the V3-V4 region yielded on average 118,261 valid sequences per sample, and full-length sequencing yielded on average 5502 valid sequences per sample. The specific data are shown in Table 1.
TABLE 1 Comparison of the number of valid sequences obtained per stool specimen by full-length sequencing of the 16S rRNA gene and by sequencing of the V3-V4 region
            V3-V4 region sequencing    Full-length sequencing
Mean        118261                     5502
Minimum     52491                      1938
Maximum     127833                     20053
Median      109391                     4631.5
2. Database and comparison method adopted
In this application example, the sequencing data of the 16S rRNA gene V3-V4 region obtained from the Illumina MiSeq platform were analyzed using two databases and alignment methods: (1) OTU representative sequences at the 97% similarity level were taxonomically classified with the RDP classifier Bayesian algorithm, using the Silva_132 16S rRNA database for annotation, to obtain the species composition and abundance of each sample; (2) the database and alignment method constructed in the present invention were used to analyze the species composition and abundance of each sample. For the full-length 16S rRNA gene sequences obtained on the PacBio Sequel platform, the OPU strategy was used to analyze the species composition and abundance of each sample. For the specific methods, see Yang J, Pu J, Lu S, Bai X, Wu Y, Jin D, Cheng Y, Zhang G, Zhu W, Luo X, Rosselló-Móra R, Xu J. Species-Level Analysis of Human Gut Microbiota With Metataxonomics. Front Microbiol. 2020 Aug 26;11:2029. doi:10.3389/fmicb.2020.02029. PMID: 32983030; PMCID: PMC7479098.
3. Results of the analysis
The analyses of the 120 healthy human samples were grouped into 3 methods according to the sequencing method and the database and alignment software used: (1) sequencing of the 16S rRNA gene V3-V4 region on the Illumina MiSeq platform, analyzed with the database and alignment software constructed in the invention (hereinafter, the method of the invention); (2) sequencing of the 16S rRNA gene V3-V4 region on the Illumina MiSeq platform, analyzed with the Silva_132 16S rRNA database and the RDP classifier Bayesian algorithm (hereinafter, the common method); (3) full-length 16S rRNA gene sequencing on the PacBio Sequel platform, with the species composition and abundance of each sample analyzed using the operational phylogenetic unit strategy. Because method (3) obtains the full-length 16S rRNA gene sequence and uses it for species determination, it is regarded as the gold standard (hereinafter, the gold standard method). Based on the results, the three methods were compared in terms of the proportion of sequences that could be assigned at the "species" level and the number of "species" found, and the alignment software and database constructed in the invention were found to have an excellent ability to discover "species".
(1) The method of the invention can identify more than 95% of the 16S rRNA gene sequences in each fecal sample to the "species" (OPU) level
The database and alignment method established in the present invention were compared with full-length sequencing of the 16S rRNA gene (gold standard method). With the method of the invention, on average 95.6% of sequences were identified at the "species" level (number of sequences that can be annotated to an OPU / number of all 16S rRNA gene V3-V4 region sequences per specimen). With full-length 16S rRNA gene sequencing (gold standard method), on average 57.95% of sequences were identified at the "species" level (number of sequences that can be annotated to an OPU / number of all full-length 16S rRNA gene sequences per specimen). These data show that the method of the invention identifies a higher proportion of sequences at the "species" level than the gold standard method. Because the gold standard method requires full-length 16S rRNA sequences, for the same number of sequences its sequencing cost is about 10 times, and its sequencing turnaround about 2-3 times, that of the method of the invention. The method of the invention is therefore more economical and practical for species determination.
TABLE 2 Comparison of the proportion (%) of 16S rRNA sequences from fecal specimens identified at the bacterial "species" level*
            Method of the invention    Gold standard method    Common method
Mean        95.63879314                57.95386165             37.86043291
Minimum     79.08627377                17.30607673             4.708447103
Maximum     99.63109019                95.20708518             86.63499335
Median      97.26925676                59.6504776              38.1376648
*: Method of the invention and common method (Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm): number of sequences that can be annotated to an OPU / number of all 16S rRNA gene V3-V4 region sequences in each specimen. Gold standard method: number of sequences that can be annotated to an OPU / number of all full-length 16S rRNA gene sequences in each specimen.
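The per-specimen ratio defined in the footnote, and the four summary statistics reported in Table 2, can be computed as in the sketch below. The read counts are invented toy numbers, not data from the 120 specimens, and the function names are hypothetical.

```python
from statistics import mean, median


def identification_rate(annotated_reads, total_reads):
    """Percentage of reads in one specimen that can be annotated to an OPU."""
    return 100.0 * annotated_reads / total_reads


def summarise(rates):
    """Mean, minimum, maximum and median across specimens."""
    return {
        "mean": mean(rates),
        "min": min(rates),
        "max": max(rates),
        "median": median(rates),
    }


# Toy (annotated, total) read counts for four specimens.
counts = [(900, 1000), (1000, 1000), (950, 1000), (990, 1000)]
rates = [identification_rate(a, t) for a, t in counts]
summary = summarise(rates)
```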
The same data, namely the 16S rRNA gene V3-V4 region sequencing data from the Illumina MiSeq platform, were analyzed both with the database and alignment method constructed in the invention and with the currently common Silva_132 16S rRNA database and RDP classifier Bayesian algorithm, and the proportions of sequences assigned at the "species" level were compared. The results show that the database and alignment method established in the invention identify on average 95.6% of sequences to the species level, whereas the currently common Silva_132 16S rRNA database and RDP classifier Bayesian algorithm identify on average only 37.9% of sequences to the species level (Table 2).
(2) With the method of the invention, the average number of bacterial "species" (OPUs) detected per fecal specimen reaches 140.5
TABLE 3 Comparison of the number of bacterial "species" (OPUs) detected per stool specimen
            Gold standard method    Method of the invention    Common method
Mean        92.9137931              140.4741379                82.07758621
Minimum     34                      99                         61
Maximum     171                     179                        108
Median      94                      139.5                      81
In terms of the number of "species" found, the database and alignment software constructed in the present invention found on average 140.47 "species" per sample, the gold standard method found on average 92.91 "species" per sample, and the currently common database and alignment software (Silva_132 16S rRNA database plus RDP classifier Bayesian algorithm) found on average only 82.08 "species" per sample (see Table 3). The data show that the database and alignment software constructed in the invention find more species, which is of considerable value for analyzing the structure and abundance of the intestinal flora.
Application example 2: the method of the invention is adopted for the analysis of the composition and the composition ratio of the fecal flora in clinical patient samples
By analyzing the 16S rRNA gene data of the intestinal flora of 120 healthy people, reference thresholds for the composition of the intestinal flora in healthy people were determined, and a reference standard for routine examination of the human intestinal flora was constructed. FIGS. 2-4 show the composition ratio thresholds of the 116 OPUs with a detection rate of 60% or more, referred to as the intestinal resident flora. On this basis, the method was used to analyze the composition and abundance of the fecal flora of 3 clinical patients and to compare them with the flora structure and abundance of the reference population, in order to evaluate the patients' intestinal flora status and its correlation with disease. FIG. 5 shows the results of analysis of the diversity and composition ratio of intestinal flora in healthy humans.
Human intestinal bacteria not only influence body weight and digestive capacity and protect against infection and autoimmune disease, but also modulate the body's response to therapeutic drugs. The diversity and composition ratio data of the human intestinal flora obtained here can therefore serve as indicators of health and disease states. Physicians can analyze, assess and diagnose a patient's disease and health status by reading these data.
Application example 2.1: Fecal flora analysis of an adult diarrhea patient
Patient No. F32, female, 67 years old, clinically diagnosed with "bacterial infection by Shigella dysenteriae". FIG. 6 shows the diversity and composition ratio of the patient's intestinal flora. From the flora structure and abundance, the following conclusions can be drawn:
1. The abundance of Escherichia coli/Shigella in the fecal sample was markedly increased, well above the threshold (0.6%).
2. The opportunistic pathogens Enterobacter asburiae and Acinetobacter junii, which are not detected in healthy people, were detected.
3. The total number of bacterial "species" (OPUs) found was 13, whereas on average 140 OPUs (range 99-179) are detected per stool specimen in healthy persons. This indicates reduced flora diversity and a disturbed flora.
4. Because the full-length 16S rRNA gene sequences of Escherichia coli and Shigella species are more than 98.7% identical (98.7% identity or more is regarded as one "species"), Escherichia coli cannot be distinguished from Shigella species on the basis of the 16S rRNA gene sequence alone. Nevertheless, the markedly increased abundance of Escherichia coli/Shigella supports the clinical diagnosis of Shigella infection.
Application example 2.2: Fecal flora analysis of a cirrhosis patient
Patient No. F54, male, 42 years old, clinically diagnosed with "cirrhosis". FIG. 7 shows the diversity and composition ratio of the patient's intestinal flora. From the flora structure and abundance, the following conclusions can be drawn:
1. The abundance of opportunistic pathogens such as Bacteroides fragilis, Klebsiella pneumoniae and Ruminococcus torques in the stool sample was above the threshold.
2. The total number of bacterial "species" (OPUs) found was 69, below the average of 140.47 "species" (range 99-179 OPUs) found per stool specimen in healthy persons. This suggests reduced flora diversity and a disturbed intestinal flora.
Application example 2.3: Fecal flora analysis of an infantile diarrhea patient
Patient No. F181, male, 1 year old, clinically diagnosed with "diarrhea". FIG. 8 shows the fecal flora structure and abundance of the patient. From the flora structure and abundance, the following conclusions can be drawn:
1. In the stool sample, the abundance of Citrobacter braakii and Citrobacter freundii was significantly above the threshold. Since Citrobacter braakii and Citrobacter freundii can cause diarrhea, they are likely to be the pathogenic bacteria.
2. The abundance of Klebsiella pneumoniae was above the threshold. Klebsiella pneumoniae can cause infantile diarrhea.
3. The total number of bacterial "species" (OPUs) found was 52, below the average of 140.47 "species" (range 99-179 OPUs) found per stool specimen in healthy persons. This suggests reduced intestinal flora diversity and a disordered intestinal flora.
4. The probiotics Lactobacillus reuteri and Bifidobacterium breve were detected at abundances higher than in healthy adults. It is advisable to ask whether the patient has been taking a probiotic preparation.
Sequence listing
<110> National Institute for Communicable Disease Control and Prevention of Chinese Center For Disease Control and Prevention
<120> bacterial species level analysis method based on bacterial 16S rRNA gene sequence
<160> 2
<170> PatentIn version 3.3
<210> 1
<211> 17
<212> DNA
<213> Artificial
<400> 1
cctaygggrb gcascag 17
<210> 2
<211> 20
<212> DNA
<213> Artificial
<220>
<221> misc_feature
<222> (8)..(9)
<223> n is a, c, g, or t
<400> 2
ggactacnng ggtatctaat 20
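The two sequences listed above contain IUPAC degenerate bases (y, r, b, s, n), so using them as virtual cutting sites for the V3-V4 region (claims 7(7), 10 and 11) requires pattern matching rather than exact string search. The following Python sketch shows one way this could be done. It rests on assumptions that the text does not confirm: SEQ ID No. 2 is matched as its reverse complement on the forward strand, the cutting sites themselves are excluded from the excised region, and all function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}
COMPLEMENT = str.maketrans("acgtrykmswbdhvn", "tgcayrmkswvhdbn")

SEQ_ID_1 = "cctaygggrbgcascag"     # upstream site, SEQ ID No. 1
SEQ_ID_2 = "ggactacnngggtatctaat"  # downstream site, SEQ ID No. 2


def to_regex(site):
    """Translate a degenerate site into a regular expression."""
    return "".join(IUPAC[base] for base in site.lower())


def reverse_complement(site):
    """Reverse-complement a sequence, keeping IUPAC degeneracy."""
    return site.lower().translate(COMPLEMENT)[::-1]


def cut_v3v4(seq):
    """Return the region between the two sites, or None if a site is missing."""
    seq = seq.lower()
    up = re.search(to_regex(SEQ_ID_1), seq)
    if up is None:
        return None
    # Assumption: the downstream site is searched as its reverse complement.
    down = re.search(to_regex(reverse_complement(SEQ_ID_2)), seq[up.end():])
    if down is None:
        return None
    return seq[up.end():up.end() + down.start()]


def merge_identical(entries):
    """Group entry names (dict name -> sequence) by identical excised region."""
    merged = {}
    for name, seq in entries.items():
        region = cut_v3v4(seq)
        if region is not None:
            merged.setdefault(region, []).append(name)
    return merged
```

Grouping entry names under a shared region keeps every annotation attached to a V3-V4 sequence that several full-length entries have in common, which is the point of merging identical entries when building the working library.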

Claims (11)

1. A method, for non-diagnostic purposes, for identifying the human flora at the "species" level based on bacterial 16S rRNA gene sequences, comprising the steps of:
(1) constructing a human flora 16S rRNA gene reference sequence library that uses the bacterial operational phylogenetic unit (OPU) as the annotation unit, wherein the reference sequence library comprises known bacteria that have been named at the "species" level and unknown bacteria that have not been named at the "species" level; named bacteria are annotated with their names, and for unnamed bacteria the OPU designation is used as the unique name of the bacterium;
(2) sequencing the 16S rRNA gene of the sample to be tested;
(3) comparing the 16S rRNA gene sequences of the sample obtained in step (2) with the human flora 16S rRNA gene reference sequence library constructed in step (1) and identifying the bacterial species, wherein a sequence completely identical to a specific sequence in the reference sequence library is identified by the annotation name of that sequence in the reference sequence library.
2. The method of claim 1, further comprising the step of analyzing the species composition, proportions, and/or abundance of the flora in the test sample identified in step (3).
3. The method of claim 1, wherein the name annotation of the named bacteria in step (1) includes an annotation as a pathogenic bacterium, an opportunistic pathogen, or a probiotic bacterium.
4. The method of claim 1, wherein the 16S rRNA gene sequence is a V3-V4 region sequence.
5. The method of claim 1, wherein the human flora is derived from the flora of the digestive tract, skin, oral cavity, nasopharynx, eyes, vagina, urinary tract or ears.
6. The method of claim 1, wherein the sequencing in step (2) is high-throughput sequencing.
7. A method for constructing the human flora 16S rRNA gene reference sequence library, with the bacterial operational phylogenetic unit (OPU) as its unit, according to step (1) of the method of claim 1, wherein the method comprises:
(1) sequencing and quality control: obtaining bacterial 16S rRNA gene sequences from human samples, and removing low-quality sequences through quality control;
(2) delineating bacterial operational taxonomic units (OTUs): a group of 16S rRNA gene sequences from step (1) with a sequence identity of 98.7% or more is designated an OTU;
(3) determining the representative sequences of each OTU: selecting the 10 most frequent 16S rRNA gene sequences in each OTU obtained in step (2) as the representative sequences of that OTU; where an OTU contains fewer than 10 sequences, all of them are selected as representative sequences;
(4) constructing the bacterial phylogenetic tree: aligning each OTU representative sequence obtained in step (3) with the 16S rRNA gene sequences of named bacterial reference strains, and inserting the aligned OTU representative sequences into the database of 16S rRNA gene sequences of all named bacterial reference strains, with the parameter set to LTP 50; constructing an all-bacteria phylogenetic tree from the inserted OTU representative sequences and the named reference-strain 16S rRNA gene sequences using the neighbor-joining method with Jukes-Cantor correction, with the conservation level set to 30%;
(5) discovering unknown bacteria in the healthy human flora: on the constructed all-bacteria phylogenetic tree, the OTU representative sequences that cluster with their most similar 16S rRNA gene sequences form a branch of the tree, and that branch is defined as a bacterial operational phylogenetic unit (OPU); if the identity between the OTU representative sequence and its nearest 16S rRNA gene sequence on the all-bacteria phylogenetic tree is 98.7% or more and that sequence has a name, the OTU is determined to be a known bacterium and is annotated with that name; if the identity between the OTU representative sequence and its nearest 16S rRNA gene sequence on the all-bacteria phylogenetic tree is below 98.7% but 95% or above, the OTU is determined to be a suspected new species of unknown bacteria; if the identity between the OTU representative sequence and the 16S rRNA gene sequence of the nearest reference strain on the all-bacteria phylogenetic tree is below 95% and no name has been obtained, the OTU is designated a higher-rank unit of unknown bacteria, and such OTUs and OPUs are named using numbers;
(6) combining the full-length 16S rRNA gene sequences of the unknown bacteria of the healthy human intestine obtained in step (5) with the 16S rRNA gene sequences of all named and published known bacteria to construct the human flora 16S rRNA gene reference sequence library;
(7) cutting the 16S rRNA gene reference sequence library obtained in step (6) at the universal sequence sites of the V3-V4 region of the 16S rRNA gene, merging entries whose V3-V4 region sequences are completely identical, and deleting fully duplicated sequences, to form the human flora 16S rRNA gene sequence reference working library.
8. The method according to claim 7, wherein the sequencing in step (1) is performed on a third-generation PacBio sequencing platform, full-length bacterial 16S rRNA gene sequences are determined for fecal specimens from at least 120 healthy individuals, and the low-quality sequences removed in quality control comprise sequences with single-base quality values lower than 10, sequences in which the primers at both ends are not recognized, and chimeras.
9. The method of claim 7, wherein the 16S rRNA gene sequences of named bacterial reference strains in step (4) are from published reference sequence libraries, comprising the 16S rRNA gene sequence libraries recorded and published in the directory of standard prokaryotic names, the U.S. National Center for Biotechnology Information (NCBI), and the online quality-control and alignment database for bacterial 16S rRNA gene sequences.
10. The method of claim 7, wherein the cutting in step (7) is performed by in silico (virtual) cutting of the V3-V4 region of the 16S rRNA gene to obtain the cut sequences.
11. The method according to claim 10, wherein the sequence of the upstream virtual cutting site is shown in SEQ ID No. 1 and the sequence of the downstream virtual cutting site is shown in SEQ ID No. 2.
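The selection of representative sequences in claim 7 step (3) and the identity thresholds in claim 7 step (5) can be illustrated with the short Python sketch below. It is a hedged illustration only: the function names and return labels are hypothetical, the percent identity is assumed to have been computed elsewhere, and combinations that the claim does not address (for example, identity of 98.7% or more to an unnamed sequence) are reported as not covered rather than guessed.

```python
from collections import Counter

SPECIES_CUTOFF = 98.7       # percent identity, claim 7 steps (2) and (5)
HIGHER_RANK_CUTOFF = 95.0   # percent identity, claim 7 step (5)


def select_representatives(otu_reads, max_reps=10):
    """Return up to max_reps of the most frequent distinct sequences in an OTU."""
    counts = Counter(otu_reads)
    return [seq for seq, _ in counts.most_common(max_reps)]


def classify_otu(identity_pct, nearest_is_named):
    """Apply the thresholds of claim 7 step (5) to one representative sequence."""
    if identity_pct >= SPECIES_CUTOFF:
        if nearest_is_named:
            return "known species"
        return "not covered by the claim"
    if identity_pct >= HIGHER_RANK_CUTOFF:
        return "suspected new species"
    if not nearest_is_named:
        return "higher-rank unknown taxon"
    return "not covered by the claim"
```

For instance, `classify_otu(96.0, False)` falls in the 95% to 98.7% band and is labelled a suspected new species, while an OTU with fewer than 10 distinct sequences simply returns all of them from `select_representatives`.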
CN202110659956.2A 2021-06-13 2021-06-13 Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence Pending CN113403409A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110659956.2A CN113403409A (en) 2021-06-13 2021-06-13 Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence
PCT/CN2022/092574 WO2022262491A1 (en) 2021-06-13 2022-05-12 Bacterial 16s rrna gene sequence-based bacterial "species" level detection and analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659956.2A CN113403409A (en) 2021-06-13 2021-06-13 Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence

Publications (1)

Publication Number Publication Date
CN113403409A true CN113403409A (en) 2021-09-17

Family

ID=77683870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659956.2A Pending CN113403409A (en) 2021-06-13 2021-06-13 Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence

Country Status (2)

Country Link
CN (1) CN113403409A (en)
WO (1) WO2022262491A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451425A (en) * 2017-08-21 2017-12-08 上海派森诺生物科技股份有限公司 A kind of bacterial metabolism function prediction analysis method based on microorganism rRNA gene sequencing technologies
CN109706235A (en) * 2019-01-29 2019-05-03 广州康昕瑞基因健康科技有限公司 A kind of the detection and analysis method and its system of intestinal microflora
CN109897906A (en) * 2019-03-04 2019-06-18 福建西陇生物技术有限公司 A kind of detection method and its application of intestinal flora 16S rRNA gene
CN109971871A (en) * 2019-03-27 2019-07-05 江南大学 A kind of method and its application screened and/or identify lactobacillus
CN110144415A (en) * 2019-04-23 2019-08-20 大连大学 One kind introducing milk cow health and immunity level method based on intestinal flora prediction
CN111254186A (en) * 2020-03-31 2020-06-09 上海市第十人民医院 Method for carrying out molecular detection on clostridium or classifying and identifying strains of clostridium
CN111816258A (en) * 2020-07-20 2020-10-23 杭州谷禾信息技术有限公司 Optimization method for accurately identifying human flora 16S rDNA high-throughput sequencing species

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103627800B (en) * 2013-11-14 2015-02-25 浙江天科高新技术发展有限公司 Rapid detection method of environmental microorganisms
AU2016321334A1 (en) * 2015-09-09 2018-04-26 Psomagen, Inc. Method and system for microbiome-derived diagnostics and therapeutics for bacterial vaginosis
CN112863606B (en) * 2021-03-08 2022-07-26 杭州微数生物科技有限公司 Genome database for bacterium identification and typing analysis and identification and typing analysis method
CN113403409A (en) * 2021-06-13 2021-09-17 中国疾病预防控制中心传染病预防控制所 Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
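Yang Xuefang et al.: "Study on the diversity and functional prediction of the intestinal flora of tree shrews", Journal of Anhui Agricultural Sciences (《安徽农业科学》) *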

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022262491A1 (en) * 2021-06-13 2022-12-22 中国疾病预防控制中心传染病预防控制所 Bacterial 16s rrna gene sequence-based bacterial "species" level detection and analysis method
CN116825182A (en) * 2023-06-14 2023-09-29 北京金匙医学检验实验室有限公司 Method for screening bacterial drug resistance characteristics based on genome ORFs and application
CN116825182B (en) * 2023-06-14 2024-02-06 北京金匙医学检验实验室有限公司 Method for screening bacterial drug resistance characteristics based on genome ORFs and application

Also Published As

Publication number Publication date
WO2022262491A1 (en) 2022-12-22

Similar Documents

Publication Publication Date Title
US20190367995A1 (en) Biomarkers for colorectal cancer
CN105368944B (en) Biomarker of detectable disease and application thereof
Sepehri et al. Microbial diversity of inflamed and noninflamed gut biopsy tissues in inflammatory bowel disease
CN110283903B (en) Intestinal microflora for diagnosing pancreatitis
CN107034279A (en) Application of the tuberculosis microbial markers in the reagent of diagnosis of tuberculosis is prepared
EP3676405A2 (en) Method and system for characterization for female reproductive system-related conditions associated with microorganisms
EP3245298B1 (en) Biomarkers for colorectal cancer related diseases
CN112111586A (en) Crohn disease related microbial marker set and application thereof
CN109266766B (en) Application of intestinal microorganisms as bile duct cell cancer diagnosis marker
WO2014019267A1 (en) Method and system to determine biomarkers related to abnormal condition
CN113403409A (en) Bacterial species level detection and analysis method based on bacterial 16S rRNA gene sequence
CA2963013A1 (en) Biomarkers for rheumatoid arthritis and usage thereof
CN107937581B (en) Amplification primer pair for lactobacillus sequencing, lactobacillus species identification method and application
CN107002021A (en) Biomarker of rheumatoid arthritis and application thereof
Westaway et al. Methods for exploring the faecal microbiome of premature infants: a review
WO2021241721A1 (en) Method for treating cell population and method for analyzing genes included in cell population
CN113862382B (en) Application of biomarker of intestinal flora in preparation of product for diagnosing adult immune thrombocytopenia
WO2022253824A1 (en) Rna profiling of the microbiome and molecular inversion probes
CN109913526B (en) Use of microorganisms for identifying and/or differentiating different ethnic groups of individuals
CN109652493B (en) Use of genus oscillatoria for identifying and/or differentiating individuals of different ethnic groups
CN109913524B (en) Use of Prevotella for identifying and/or differentiating individuals of different ethnic groups
CN113151512B (en) Detection of early lung cancer using intestinal bacteria
CN114606317B (en) Flora marker for predicting lymph node metastasis of gastric cancer and application thereof
CN114317674B (en) Rheumatoid arthritis marker microorganism and application thereof
CN109735598B (en) Application of vibrio succinogenes in identifying and/or distinguishing different ethnic groups of individuals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination