WO2016049920A1

WO2016049920A1 - Biomarkers for coronary artery disease

Info

Publication number: WO2016049920A1
Application number: PCT/CN2014/088046
Authority: WO
Inventors: Qiang FENG; Zhuye JIE; Huihua XIA; Jun Wang
Original assignee: BGI Shenzhen Co., Limited; BGI Shenzhen
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2016-04-07
Also published as: CN107075563B; CN107075563A

Abstract

Provided are biomarkers and methods for predicting the risk of a disease related to microbes, in particular coronary artery disease (CAD) or related heart diseases.

Description

BIOMARKERS FOR CORONARY ARTERY DISEASE

CROSS-REFERENCE TO RELATED APPLICATION

None

FIELD

The present invention relates to biomarkers and methods for predicting the risk of a disease related to microbes, in particular coronary artery disease (CAD) or related heart diseases.

BACKGROUND

Coronary artery disease (CAD) refers to any abnormal condition of the coronary arteries that interferes with the delivery of an adequate supply of blood to the cardiac (i.e., heart) muscle or any portion thereof. Typically, CAD is caused by the accumulation of plaque on the arterial walls (i.e., atherosclerosis), particularly in the large and medium-sized arteries serving the heart. These conditions have similar causes, mechanisms, and treatments. CAD represents the leading cause of death and morbidity worldwide. Early diagnosis of CAD will help not only to prevent mortality, but also to reduce the costs of surgical intervention.

The “gold standard” for detecting CAD is invasive coronary angiography. However, this is costly and can pose risk to the patient. Prior to angiography, non-invasive diagnostic modalities such as myocardial perfusion imaging (MPI) and CT angiography may be used; however, these have drawbacks, including radiation exposure and contrast agent sensitivity, and add only moderately to the identification of obstructive CAD.

Current knowledge indicates that genetic factors, environmental factors, and their interactions jointly give rise to complex phenotypes and many diseases. Coronary artery disease (CAD), one of the most influential complex diseases, has been increasingly investigated by GWAS in recent years, which have accounted for 10.6% of the inherent cause through 46 common variants (Ehret, G. B. et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103-109, incorporated herein by reference). However, knowledge of the effect of environmental factors such as gut microbes, and of the contribution of genes and microbes to disease, still needs further study.

Our “forgotten organ”, the gut microbiota, plays a crucial role in our health in many respects, such as harvesting energy from food, producing important metabolites, promoting the development and maturation of the immune system, and protecting the host from pathogen infection. Recent studies suggested that flora dysbiosis, chronic inflammation, and metabolic abnormality exist in the intestine in some metabolic diseases such as diabetes and obesity. Most coronary artery disease is characterized by inflammation, oxidation, and lipid metabolism, which might correlate with the gut microbes and their metabolites. A recent study indicates that gut microbes can metabolize red meat ingredients (L-carnitine, phosphatidylcholine, cholesterol) into TMA, which is further oxidized into TMAO in the liver; this gives rise to oxidation reactions in blood vessels that lead to inflammation and lipid deposition, ultimately resulting in atherosclerosis and coronary heart disease. Meanwhile, compared with healthy subjects, the gut microbiota of symptomatic atherosclerosis patients exhibits obvious abnormality (Koeth, R. A. et al. Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat, promotes atherosclerosis. Nature medicine 19, 576-585, incorporated herein by reference). These studies suggested that dysbiosis of gut microbes may strongly influence the pathogenesis of coronary artery disease by inducing human metabolic abnormality. However, the characteristics of gut flora dysbiosis in coronary artery disease patients with atherosclerosis-induced pathogenesis, and its impact on the metabolic system, remain unclear.

SUMMARY

Embodiments of the present disclosure seek to solve at least one of the problems existing in the prior art to at least some extent.

The present invention is based on the following findings by the inventors:

Assessment and characterization of gut microbiota has become a major research area in human disease, including coronary artery disease (CAD). To analyze the gut microbial content in CAD patients, the inventors carried out a protocol for a Metagenome-Wide Association Study (MGWAS) (Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012), incorporated herein by reference) based on deep shotgun sequencing of the gut microbial DNA from 165 individuals. The inventors identified and validated 65 CAD-associated gut microbes and 4 optimized gut microbes. To exploit the potential ability of CAD classification by gut microbiota, the inventors calculated the probability of illness through a random forest model based on the 65 CAD-associated gut microbes and 4 optimized gut microbes. The inventors' data provide insight into the characteristics of the gut metagenome related to CAD risk, a paradigm for future studies of the pathophysiological role of the gut metagenome in other relevant disorders, and the potential usefulness of a gut-microbiota-based approach for assessment of individuals at risk of such disorders.

In one aspect of the present disclosure, there is provided a biomarker set for predicting a disease related to microbiota in a subject, consisting of:

a gut biomarker comprising at least one of Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544, or microbes with genomic DNA comprising at least a partial sequence of SEQ ID NO: 1 to 122009; alternatively, the biomarker set consists of at least one of the species listed in Table 4, preferably at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 100% of the species listed in Table 4.

preferably, at least one of Streptococcus oralis, Streptococcus sanguinis, Streptococcus mitis and Streptococcus infantis.

According to embodiments of present disclosure, the gut biomarker comprises at least a partial sequence of at least one of SEQ ID NO: 1 to 122009 as stated in Table 5-1.

In another aspect of the present disclosure, there is provided a biomarker set for predicting a disease related to microbiota in a subject, consisting of:

a gut biomarker comprising at least a partial sequence of at least one of SEQ ID NO: 1 to 122009.

According to embodiments of present disclosure, the disease is coronary artery disease or related heart disease.

In another aspect of the present disclosure, there is provided a kit for determining the gene marker set of any one of claims 1 to 4, comprising primers used for PCR amplification and designed according to the DNA sequence as set forth below:

the gut biomarker comprises at least a partial sequence of at least one of SEQ ID NO: 1 to 122009.

In another aspect of the present disclosure, there is provided a kit for determining the gene marker set described above, comprising one or more probes designed according to the genes as set forth below:

In another aspect of the present disclosure, there is provided use of the gene marker set described above for predicting the risk of coronary artery disease (CAD) or related disorder in a subject to be tested, comprising:

(1) collecting a sample from the subject to be tested;

(2) determining the relative abundance information of each biomarker of the biomarker set described above in the sample obtained in step (1);

(3) obtaining a probability of CAD by comparing the relative abundance information of each biomarker of the subject to be tested with a training dataset using a multivariate statistical model,

wherein a probability of CAD greater than a cutoff indicates that the subject to be tested has or is at risk of developing coronary artery disease (CAD) or related disorder.

According to embodiments of the present disclosure, the training dataset is constructed based on the relative abundance information of each biomarker of a plurality of subjects having CAD and a plurality of normal subjects using a multivariate statistical model; alternatively, the multivariate statistical model is a randomForest model.

According to embodiments of the present disclosure, the training dataset is a matrix with each row representing a biomarker of the biomarker set described above, each column representing a sample, and each cell representing the relative abundance of a biomarker in the sample, and the sample disease status is a vector, with 1 for CAD and 0 for control.

According to embodiments of the present disclosure, the relative abundance information of each of Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544 is obtained based on the relative abundance information of SEQ ID NO: 1 to 122009.

According to embodiments of the present disclosure, the training dataset is at least one of Tables 6-1, 6-2, 6-3, 6-4 and 6-5, and a probability of CAD of at least 0.5 indicates that the subject to be tested has or is at risk of developing coronary artery disease (CAD) or related disorder.
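By way of non-limiting illustration, the data layout and cutoff rule described above can be sketched in Python as follows. The biomarker names are taken from the text; the abundance values, the sample identifiers and the helper name `classify` are hypothetical, and the probability itself would come from a trained multivariate model (a randomForest model in the disclosure) that is not reproduced here.

```python
# Rows are biomarkers, columns are samples, cells are relative abundances.
# All numeric values below are invented for illustration only.
sample_ids = ["S1", "S2", "S3"]  # hypothetical sample identifiers
training_matrix = {
    "Streptococcus oralis": [1.2e-5, 3.4e-6, 0.0],
    "Streptococcus mitis": [8.0e-6, 1.1e-6, 2.0e-7],
}
disease_status = [1, 0, 0]  # one entry per sample: 1 for CAD, 0 for control

CUTOFF = 0.5  # cutoff value stated in the text


def classify(probability, cutoff=CUTOFF):
    """Return 1 (has or is at risk of CAD) if probability >= cutoff, else 0."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 1 if probability >= cutoff else 0
```

For example, `classify(0.73)` returns 1 and `classify(0.21)` returns 0 under the stated cutoff of 0.5.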

In another aspect of the present disclosure, there is provided use of the gene marker set described above for preparation of a kit for predicting the risk of coronary artery disease (CAD) or related disorder in a subject to be tested, comprising:

(1) collecting a sample from the subject to be tested;

In another aspect of the present disclosure, there is provided a method of diagnosing whether a subject has an abnormal condition related to microbiota or is at risk of developing an abnormal condition related to microbiota, comprising:

determining the relative abundance of the biomarkers described above in a sample from the subject, and

determining whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota based on the relative abundance.

According to embodiments of present disclosure, the method comprises:

(1) collecting a sample from the subject to be tested;

It is believed that the 65 CAD-associated gut microbes and 4 optimized gut microbes are valuable for increasing CAD detection at earlier stages for the following reasons. First, the markers of the present invention are more specific and sensitive as compared with conventional markers. Second, analysis of stool promises accuracy, safety, affordability, and patient compliance, and stool samples are transportable. Thus, the present invention relates to an in vitro method that is comfortable and noninvasive, so people will participate in a given screening program more readily. Third, the markers of the present invention may also serve as tools for therapy monitoring in CAD patients to detect the response to therapy.

BRIEF DESCRIPTION OF DRAWINGS

These and other aspects and advantages of the present disclosure will become apparent and more readily appreciated from the following descriptions taken in conjunction with the drawings, in which:

Fig. 1 Density histogram showing the P-value distribution of all genes identified in the study cohorts. The horizontal line represents the distribution of P-values under the null hypothesis.

Fig. 2 The 65 most discriminant MLGs in the Random Forest model using 126 MLG markers. The bar length indicates the importance of the variable (MLG species).

Fig. 3 Performance of 65 MLG Random Forest models. 165 samples (88 cases, 77 controls) were used as the training set and another 86 samples (29 cases, 57 controls) as the test set for validation, with a false negative rate of 2/29 and a false positive rate of 12/57.

Fig. 4 Identification of ACVD-associated markers from gut metagenome. Performance of 65 MLG Random Forest models; 165 samples (88 cases and 77 controls) were applied as the training set (AUC=98.17%). The area between the two outside curves represents the 95% CI shape.

Fig. 5 Identification of ACVD-associated markers from gut metagenome. Performance of 4 MLG Random Forest models; 165 samples (88 cases and 77 controls) were applied as the training set (AUC=85.86%). The area between the two outside curves represents the 95% CI shape.

EXAMPLES

Terms used herein have meanings as commonly understood by a person of ordinary skill in the areas relevant to the present invention. Terms such as “a” , “an” and “the” are not intended to refer to only a singular entity, but include the general class of which a specific example may be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but their usage does not delimit the invention, except as outlined in the claims.

The present invention is further exemplified in the following non-limiting Examples. Unless otherwise stated, parts and percentages are by weight and degrees are Celsius. As apparent to one of ordinary skill in the art, these Examples, while indicating preferred embodiments of the invention, are given by way of illustration only, and the agents were all commercially available.

Example 1. Identifying biomarkers for evaluating coronary artery disease risk

1.1 Sample collection

Fecal samples from 165 south Chinese subjects, including 88 atherosclerotic cardiovascular disease (ACVD) patients and 77 control subjects (training set, Table 1), were collected by Guangdong Provincial People's Hospital in 2011. ACVD patients were diagnosed and categorized according to pathological features (coronary angiography). Subjects were asked to collect fresh fecal samples at the hospital. Collected samples were put in sterile tubes and immediately stored at -80°C until further analysis.

Complete ethical approval was obtained, and all the patients gave written informed consent. The study was approved by the Institutional Review Board of Guangdong General Hospital.

Table 1 Baseline characteristics of atherosclerotic cardiovascular disease (ACVD) cases and controls. Fourth column reports results from Wilcoxon rank-sum tests.

NOTE: Gender information was unknown for one of the 88 patients and for two of the 77 controls.

1.2 DNA extraction

Fecal samples were thawed on ice and DNA extraction was performed using the Qiagen QIAamp DNA Stool Mini Kit (Qiagen) according to the manufacturer's instructions. Extracts were treated with DNase-free RNase to eliminate RNA contamination. DNA quantity was determined using a NanoDrop spectrophotometer, a Qubit Fluorometer (with the Quant-iT™ dsDNA BR Assay Kit) and gel electrophoresis.

1.3 DNA library construction and sequencing of fecal samples

DNA library construction was performed following the manufacturer's instructions (Illumina). The inventors used the same workflow as described previously to perform cluster generation, template hybridization, isothermal amplification, linearization, blocking and denaturation, and hybridization of the sequencing primers. The inventors constructed one paired-end (PE) library with an insert size of 350 bp for each sample, followed by high-throughput sequencing to obtain around 30 million PE reads of length 2x100 bp. High-quality reads were obtained by filtering out low-quality reads with ambiguous 'N' bases, adapter contamination and human DNA contamination from the Illumina raw reads, and by simultaneously trimming low-quality terminal bases of reads.
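By way of non-limiting illustration, the read filtering and trimming described above can be sketched in Python as follows. The quality threshold (`min_qual`), the minimum read length (`min_len`) and the function names are assumptions that do not appear in the disclosure, and removal of human DNA contamination, which requires alignment against a human reference, is not shown.

```python
def trim_low_quality_ends(seq, quals, min_qual=20):
    """Trim terminal bases whose Phred quality is below min_qual.

    min_qual=20 is an assumed threshold, not a value from the text.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], quals[start:end]


def keep_read(seq, adapter="", min_len=30):
    """Reject reads with ambiguous 'N' bases, adapter sequence, or short length.

    The adapter string and min_len are supplied by the caller; both are
    illustrative parameters rather than values from the text.
    """
    if "N" in seq.upper():
        return False
    if adapter and adapter in seq:
        return False
    return len(seq) >= min_len
```

A read would typically be trimmed first and then passed to `keep_read`, so that reads that become too short after trimming are discarded.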

In total, the inventors obtained about 4.77 Gb per sample of fecal microbiota sequencing data (high-quality clean data) (Table 2) from 165 samples (88 cases and 77 controls) on the Illumina HiSeq 2000 platform.

Table 2 Summary of metagenomic data. Fourth column reports results from Wilcoxon rank-sum tests.

Parameter	Controls	Cases	P-value
Average raw bases (G)	4.85	4.92	0.831
After removing low quality bases	4.76 (98.14%)	4.79 (97.36%)	
After removing human reads	4.73 (97.53%)	4.78 (97.15%)	0.874

1.4 Metagenomic data processing and analysis

1.4.1 Gene catalogue construction

Gene catalogue construction. Employing the same parameters that were used to construct the type 2 diabetes gene catalogue (Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012), incorporated herein by reference), the inventors performed de novo assembly and gene prediction for the high-quality reads of 165 samples using SOAPdenovo v1.06 (Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20, 265-272, doi: 10.1101/gr.097261.109 (2009), incorporated herein by reference) and GeneMark v2.7 (Zhu, W., Lomsadze, A. & Borodovsky, M. Ab initio gene identification in metagenomic sequences. Nucleic acids research 38, e132, doi: 10.1093/nar/gkq275 (2010), incorporated herein by reference), respectively. All predicted genes were aligned pairwise using BLAT, and genes of which over 90% of the length could be aligned to another gene with more than 95% identity (no gaps allowed) were removed as redundancies, resulting in a non-redundant gene catalogue comprising 4,537,046 genes (4.5M gene catalogue).

Taxonomic assignment of genes. Taxonomic assignment of the predicted genes was performed using an in-house pipeline described in the published T2D paper (Qin et al. 2012, supra).
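By way of non-limiting illustration, the redundancy criterion stated above (over 90% of a gene's length aligned to another gene with more than 95% identity, no gaps) can be sketched in Python as follows. The function names, the tuple layout of the alignment hits and the greedy choice of which member of a pair to remove are assumptions; the alignments themselves would come from BLAT, which is not invoked here.

```python
def is_redundant(gene_len, aligned_len, identity, has_gaps):
    """Apply the stated rule: >90% of length aligned at >95% identity, no gaps."""
    if gene_len <= 0:
        raise ValueError("gene_len must be positive")
    return (not has_gaps) and aligned_len / gene_len > 0.90 and identity > 0.95


def remove_redundant(gene_lengths, hits):
    """Return gene IDs that survive redundancy removal.

    gene_lengths: dict mapping gene ID to gene length.
    hits: iterable of (query, target, aligned_len, identity, has_gaps) tuples,
          a hypothetical layout for pairwise alignment results.
    Assumption: the query gene of a qualifying hit is the one removed, and
    hits involving an already removed gene are ignored.
    """
    removed = set()
    for query, target, aligned_len, identity, has_gaps in hits:
        if query == target or query in removed or target in removed:
            continue
        if is_redundant(gene_lengths[query], aligned_len, identity, has_gaps):
            removed.add(query)
    return [gene for gene in gene_lengths if gene not in removed]
```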

1.4.2 Data profile construction

Gene profile. These 4,537,046 genes and their associated measures of relative abundance in 165 samples were used to establish the gene profile for the association study. (The inventors used the same method described in the published T2D paper (Qin et al. 2012, supra) to compute the relative gene abundance.)

IMG species and mOTU species profiles. Total fecal clean reads were aligned to the 4,653 reference genomes from IMG v400 (Markowitz, V. M. et al. IMG: the integrated microbial genomes database and comparative analysis system. Nucleic acids research 40, D115-D122 (2012), incorporated herein by reference) and to the 79,268 sequences of the mOTU reference (Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nature methods 10, 1196-1199 (2013), incorporated herein by reference) with default parameters, respectively. 1,290 IMG species (species that were shared among at least 10 subjects) and 560 species-level mOTUs were identified.

1.4.3 Analysis of factors influencing gut microbiota gene profile. The inventors used permutational multivariate analysis of variance (PERMANOVA) to assess the effect of 25 different characteristics, including CAD status, HDLC, CHOL, Gender, FBG, hypertension, APOB, Age, CREA, LDLC, HbA1c, APOA, TP, diabetes, ALB, TRIG, BMI, WHR, Lpa, HBDH, CKMB, AST, CK, ProBNP_E_, ALT, on the gene profiles of the 4.5M reference gene catalogue. The inventors performed the analysis using the method implemented in the package “vegan” in R, and the permuted p-value was obtained from 10,000 permutations. The inventors also corrected for multiple testing using “p.adjust” in R with the Benjamini-Hochberg method to get the q-value for each test. PERMANOVA identified two significant factors associated with gut microbes (based on gene profiles) (q<0.05, Table 3). The analysis indicated that CAD status and HDLC were the most strongly associated factors, supporting that disease status was the major determinant influencing the composition of the gut microbiota. Gender, age and some CAD clinical indices, such as CHOL, FBG, hypertension and APOB, were also significant factors.
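By way of non-limiting illustration, the Benjamini-Hochberg adjustment used above to obtain q-values can be sketched in Python as follows. The inventors used “p.adjust” in R; the function below is an independent re-implementation of the standard step-up procedure, and its name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values
```

A factor would then be reported as significant when its q-value is below 0.05, as in the text.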

Table 3 PERMANOVA based on euclidean distance analysis of gene profile. The analysis was conducted to test whether clinical parameters, and ACVD status have significant impact on the gut microbiota with q-value<0.05.

1.4.4 Identification of ACVD associated markers

Identification of ACVD associated genes. To identify the association between the metagenomic profile and ACVD, a two-tailed Wilcoxon rank-sum test was applied to the profiles of 2.1M high-occurrence genes (genes that were present in fewer than 10 of the 165 samples were removed). 438,750 gene markers (20.48% of the 2.1M genes) were obtained, which were enriched in either cases or controls with p-value<0.01, FDR=2.23% (Fig. 1).

Estimating the false discovery rate (FDR). Instead of a sequential p-value rejection method, the inventors applied the “q-value” method proposed in a previous study to estimate the FDR (Storey, J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society 64, 479-498 (2002), incorporated herein by reference).
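By way of non-limiting illustration, the occurrence filter and the two-tailed rank-sum test described above can be sketched in Python as follows. The function names are assumptions, and the p-value uses a normal approximation with tie correction and no continuity correction, which may differ slightly from the exact or continuity-corrected values reported by statistical packages.

```python
import math


def occurrence_filter(profile, min_samples=10):
    """Keep genes present (abundance > 0) in at least min_samples samples."""
    return {
        gene: values
        for gene, values in profile.items()
        if sum(1 for v in values if v > 0) >= min_samples
    }


def rank_sum_test(cases, controls):
    """Two-tailed Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(cases), len(controls)
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(cases) + list(controls)))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[pooled[k][1]] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Under this sketch, a gene would be retained as a marker when `rank_sum_test` returns a value below 0.01 for its case and control abundances.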

Receiver Operator Characteristic (ROC) analysis. The inventors applied the ROC analysis to assess the performance of the ACVD classification based on metagenomic markers. The inventors then used the “pROC” package in R to draw the ROC curve.
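By way of non-limiting illustration, the area under the ROC curve can be computed directly from class labels and predicted probabilities, as sketched in Python below. The inventors used the “pROC” package in R; this function is an independent sketch based on the equivalence between the AUC and the fraction of case-control pairs that are ranked correctly, and its name is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: 1 for case, 0 for control.
    scores: predicted probability of illness for each sample.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                correct += 1.0   # case ranked above control
            elif c == d:
                correct += 0.5   # ties count as half
    return correct / (len(cases) * len(controls))
```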

1.4.5 Construction of MLG and identification of ACVD associated MLG species markers

126 MLG species based on the 438,750 ACVD-associated marker gene profile. The inventors used the 438,750 gene markers to build metagenomic linkage groups (MLGs) using the same method described in the published T2D paper (Qin et al. 2012, supra). All 438,750 genes were annotated by aligning them to the 4,653 reference genomes in IMG v400. An MLG was assigned to a genome if more than 50% of its constitutive genes were annotated to that genome; otherwise it was termed unclassified. In total, 136 MLG genomes with gene number>550 were selected. MLG genomes belonging to the same species were grouped to construct MLG species, and finally the inventors obtained 127 MLG species. The inventors applied the Wilcoxon rank-sum test to the 127 MLG species with Benjamini-Hochberg adjustment, and 126 MLGs were selected as ACVD-associated MLGs with q<0.05. To estimate the relative abundance of an MLG species, the inventors estimated the average abundance of the genes of the MLG species, after removing the 5% lowest and 5% highest abundant genes (Qin et al. 2012, supra).

In total, the inventors built 136 metagenomic linkage groups (MLGs, each>550 genes) based on the distribution and the occurrence rate (Qin et al. 2012, supra) of the 438,750 genes; 94.8% of the significant genes (P-value<0.01) were included in MLGs. The 136 MLGs (each>550 genes, >50% coverage and q<0.05) were annotated against the NCBI database, and MLGs from the same species were grouped to obtain 126 MLG species.
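By way of non-limiting illustration, the relative abundance of an MLG species (the average gene abundance after removing the 5% lowest and 5% highest abundant genes) can be sketched in Python as follows. The function name is an assumption, as is the rounding of the 5% count down to a whole number of genes.

```python
def mlg_abundance(gene_abundances, trim_fraction=0.05):
    """Mean gene abundance after dropping the lowest and highest 5% of genes."""
    values = sorted(gene_abundances)
    if not values:
        raise ValueError("an MLG must contain at least one gene")
    # Number of genes trimmed from each end (rounded down; an assumption).
    k = int(len(values) * trim_fraction)
    trimmed = values[k:len(values) - k] if k else values
    return sum(trimmed) / len(trimmed)
```

Trimming both tails makes the estimate less sensitive to a few genes with unusually low or high abundance within the group.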
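By way of non-limiting illustration, the rule for assigning an MLG to a genome (more than 50% of its genes annotated to that genome, otherwise unclassified) can be sketched in Python as follows. The function name and the representation of per-gene annotations as a list of genome names, with `None` for unannotated genes, are assumptions.

```python
from collections import Counter


def assign_mlg(gene_annotations, min_fraction=0.5):
    """Assign an MLG to a genome if more than min_fraction of genes map to it.

    gene_annotations: one entry per gene in the MLG, giving the genome the
    gene was annotated to, or None if the gene was not annotated.
    """
    if not gene_annotations:
        return "unclassified"
    counts = Counter(a for a in gene_annotations if a is not None)
    if not counts:
        return "unclassified"
    genome, n_genes = counts.most_common(1)[0]
    if n_genes / len(gene_annotations) > min_fraction:
        return genome
    return "unclassified"
```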

65 MLG species marker identification. To identify MLG species markers among the 126 ACVD-associated MLG species, the inventors used the “randomForest 4.5-36” package in R version 2.10. First, the inventors sorted all 126 MLG species by the importance given by the “randomForest” method (Liaw, Andy & Wiener, Matthew. Classification and Regression by randomForest, R News (2002), Vol. 2/3 p. 18, incorporated herein by reference). MLG marker sets were constructed by creating incremental subsets of the top-ranked MLG species, starting from 5 MLG species and ending at all 126 MLG species. For each MLG marker set, the inventors calculated the false prediction ratio in the cohort of 165 Chinese subjects. Finally, the set of 65 MLG species with the lowest false prediction ratio was selected as the MLG species markers (Fig. 2, Table 4 and Tables 5-1, 5-2) with a false negative (FN) rate of 6.81% (6/88) and a false positive (FP) rate of 3.89% (3/77) (Fig. 3, training set). Furthermore, the inventors drew the ROC curve using the OOB (out-of-bag) prediction probability of illness from the randomForest model based on the selected MLG species markers (Tables 6-1, 6-2, 6-3, 6-4, 6-5) and calculated the area under the ROC curve (AUC) to be 98.17% (95% CI: 96.6%-99.74%) using the R package “pROC” (Fig. 4).

Among the 65 MLG species, the control-enriched MLG species Bacteroides uniformis (q=4.21E-11), Bacteroides vulgatus (q=1.80E-09) and Clostridiales sp. SS3/4 (q=1.68E-08) are known SCFA-producing bacteria. Most of the case-enriched MLG species (51 in total) were opportunistic pathogens from Streptococcus (9 of 11 MLG species were oral pathogens), Clostridium (6 MLG species), Ruminococcus (4 MLG species) and Lactobacillus (4 MLG species). Rothia mucilaginosa naturally inhabits the oral cavity and upper respiratory tract and is increasingly recognized as an emerging opportunistic pathogen associated with prosthetic device infections and endocarditis. Clostridium bolteae, isolated from human fecal material, blood and intra-abdominal abscesses, is a Gram-positive pathogen that can produce toxins, including neurotoxin; it is encountered in clinically significant infections in humans, and its mean counts in autistic children were 46-fold (P-value=0.01) greater than those in control children. Gemella sanguinis can strengthen inflammation in immunodeficient patients. Akkermansia muciniphila was also enriched in CAD patients.

1.4.6 Identification of ACVD associated IMG species and mOTU species. Based on the IMG species and mOTU species profiles, the inventors identified the ACVD-associated IMG species and mOTU species with q<0.05 (Wilcoxon rank-sum test with Benjamini-Hochberg adjustment). Subsequently, IMG species markers and mOTU species markers were selected using the random forest approach, as in the MLG species marker selection.

65 IMG species (ROC AUC 98.52%) and 15 mOTU species (ROC AUC 96.16%), selected by Wilcoxon rank-sum test and random forest selection, also clearly separated CAD patients from healthy subjects (q<0.05; see Tables 7 and 8). By overlapping these with the 65 MLG markers, the inventors found that pathogens of oral origin, including Streptococcus oralis, Streptococcus sanguinis, Streptococcus mitis and Streptococcus infantis, as well as Akkermansia muciniphila, were significantly enriched in cases.

The inventors drew the ROC curve using the OOB (out-of-bag) prediction probability of illness from the randomForest model based on the 4 microbes from Streptococcus (Streptococcus oralis, Streptococcus sanguinis, Streptococcus mitis and Streptococcus infantis) as a biomarker (Table 9) and calculated the area under the ROC curve to be 85.86% (95% CI: 80.24%-91.48%) using the R package “pROC” (Fig. 5). The false negative (FN) rate was 28.40% (25/88) and the false positive (FP) rate was 20.77% (16/77).
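By way of non-limiting illustration, the incremental subset search described above can be sketched in Python as follows. The function name, the step size of one marker per subset and the tie-breaking rule (the smallest subset wins) are assumptions; `error_of` is a hypothetical callback that would train a random forest on the given markers and return its false prediction ratio, which is not implemented here.

```python
def select_marker_set(ranked_markers, error_of, start=5):
    """Pick the top-k subset of importance-ranked markers with the lowest error.

    ranked_markers: markers sorted from most to least important.
    error_of: callable taking a list of markers and returning an error ratio
              (hypothetical; stands in for model training and evaluation).
    start: size of the first subset evaluated (5 in the text).
    """
    best_subset, best_error = None, float("inf")
    for k in range(start, len(ranked_markers) + 1):
        subset = ranked_markers[:k]
        error = error_of(subset)
        if error < best_error:  # strict '<' keeps the smaller subset on ties
            best_subset, best_error = subset, error
    return best_subset, best_error
```

With 126 ranked MLG species as input, this loop evaluates subsets of size 5 through 126, matching the range stated in the text.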
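By way of non-limiting illustration, false negative and false positive rates such as those reported above can be computed from true and predicted labels as sketched in Python below. The function name is an assumption; the usage example reconstructs label vectors from the counts given in the text (25 of 88 cases and 16 of 77 controls misclassified).

```python
def error_rates(labels, predictions):
    """Return (false negative rate, false positive rate).

    labels and predictions use 1 for case and 0 for control.
    """
    n_cases = sum(1 for y in labels if y == 1)
    n_controls = sum(1 for y in labels if y == 0)
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return fn / n_cases, fp / n_controls


# Label vectors rebuilt from the counts in the text, for illustration only.
labels = [1] * 88 + [0] * 77
predictions = [0] * 25 + [1] * 63 + [1] * 16 + [0] * 61
fn_rate, fp_rate = error_rates(labels, predictions)
```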

Table 5-1. SEQ ID of the 65 MLG species

MLG ID	SEQ ID NO:	Number of genes
mlg_id: X127	1-2248	2248
mlg_id: X21	2249-2862	614
mlg_id: X63	2863-6837	3975
mlg_id: X118	6838-7608	771
mlg_id: X102	7609-9275	1667
mlg_id: X26	9276-10648	1373
mlg_id: X80	10649-13331	2683
mlg_id: X72	13332-15672	2341
mlg_id: X125	15673-18620	2948
mlg_id: X44	18621-19520	900
mlg_id: X74	19521-20240	720
mlg_id: X84	20241-20801	561
mlg_id: X95	20802-23660	2859
mlg_id: X108	23661-26771	3111
mlg_id: X115	26772-30163	3392
mlg_id: X92	30164-32611	2448
mlg_id: X109	32612-35767	3156
mlg_id: X103	35768-38880	3113
mlg_id: X10	38881-39514	634
mlg_id: X11	39515-40260	746
mlg_id: X91	40261-41635	1375
mlg_id: X48	41636-42428	793
mlg_id: X107	42429-43644	1216
mlg_id: X93	43645-45587	1943
mlg_id: X77	45588-46386	799
mlg_id: X65	46387-48007	1621
mlg_id: X123	48008-50806	2799
mlg_id: X50	50807-51375	569
mlg_id: X39	51376-52814	1439
mlg_id: X64	52815-59026	6212
mlg_id: X114	59027-60592	1566
mlg_id: X101	60593-63979	3387
mlg_id: X86	63980-64989	1010
mlg_id: X76	64990-67004	2015
mlg_id: X70	67005-68492	1488
mlg_id: X68	68493-70925	2433
mlg_id: X17	70926-72312	1387
mlg_id: X2	72313-72990	678
mlg_id: X1	72991-75225	2235
mlg_id: X88	75226-76581	1356
mlg_id: X116	76582-79319	2738
mlg_id: X82	79320-81414	2095
mlg_id: X25	81415-83112	1698
mlg_id: X28	83113-85682	2570
mlg_id: X75	85683-87229	1547
mlg_id: X40	87230-88951	1722
mlg_id: X83	88952-90306	1355
mlg_id: X59	90307-91116	810
mlg_id: X69	91117-92589	1473
mlg_id: X24	92590-93249	660
mlg_id: X104	93250-97356	4107
mlg_id: X124	97357-100151	2795
mlg_id: X79	100152-103376	3225
mlg_id: X13	103377-104672	1296
mlg_id: X96	104673-106081	1409
mlg_id: X105	106082-106998	917
mlg_id: X6	106999-107917	919
mlg_id: X85	107918-109895	1978
mlg_id: X3	109896-110770	875
mlg_id: X94	110771-114337	3567
mlg_id: X111	114338-115109	772
mlg_id: X8	115110-116480	1371
mlg_id: X98	116481-119348	2868
mlg_id: X4	119349-120420	1072
mlg_id: X5	120421-122009	1589

Table 5-2. SEQ ID of the 4 MLG species

MLG ID	SEQ ID NO:	genes number
mlg_id: X88	1～1356	1356
mlg_id: X68	1357～3789	2433
mlg_id: X96	3790～5198	1409
mlg_id: X82	5199～7293	2095

Table 9. Relative abundance profiles of the 4 MLGs in 165 samples

Example 2. Validating the biomarkers in another 86 individuals

To validate the discriminatory power of the biomarkers, namely the 65 selected MLGs and the 4 microbes from Streptococcus, the inventors used a new, independent study group of 29 case samples and 57 control samples as the test set (Table 10). These samples were also collected at Guangdong Provincial People's Hospital.

Table 10. Sample information

Group	case	control	total number
Test set	29	57	86

For each sample, DNA was extracted and a DNA library was constructed, followed by high-throughput sequencing as described in Example 1. The inventors estimated the relative abundance of an MLG in all samples by using the relative abundance values of the genes from this MLG (Qin et al. 2012, supra).
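As a rough illustration of how one MLG-level value per sample can be derived from gene-level values, the sketch below takes a trimmed mean of the gene relative abundances. This is an assumption made for illustration only: the exact procedure of Qin et al. 2012 is not reproduced in this section, and the function name `mlg_abundance`, the 5% trim fraction and the toy values are hypothetical.

```python
from statistics import mean

def mlg_abundance(gene_abundances, trim=0.05):
    """Summarize the gene-level relative abundances of one MLG in one sample.

    Assumption: drop the `trim` fraction of genes at each extreme, then
    average the rest. `trim` must be below 0.5.
    """
    values = sorted(gene_abundances)
    k = int(len(values) * trim)            # number of genes dropped per tail
    kept = values[k:len(values) - k]
    return mean(kept)

# Toy sample: 20 genes of one MLG, with one zero and one outlier.
genes = [0.0] + [1e-6] * 18 + [1e-3]
print(mlg_abundance(genes))
```

Trimming keeps a single mis-assigned or unusually abundant gene from dominating the MLG-level estimate.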

The random forest model was built with the “randomForest 4.5-36” package in R version 2.10. The inputs are a training dataset (namely Tables 6-1, 6-2, 6-3, 6-4 and 6-5, or Table 9, respectively), the sample disease status, and a test set (just the MLG relative abundance profiles of the test set). The training dataset is a matrix in which each row represents an MLG, each column represents a sample, and each cell represents the relative abundance of an MLG in a sample; the disease status of the training samples in Example 1 is a vector, with 1 for CAD and 0 for control. The inventors used the randomForest function from the randomForest package to build the classifier, and the predict function to predict the test set. The output is a matrix containing the prediction results: the first column, “0”, is the probability of health, and the second column, “1”, is the probability of CAD. The cutoff is 0.5; if the probability of CAD is ≥0.5, the subject is at risk of CAD.
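The data layout and the decision rule described above can be sketched in plain Python. The random forest itself is not implemented here; the probability rows are made-up placeholders standing in for a classifier's output, and the names `to_sample_rows` and `call_risk` are hypothetical.

```python
def to_sample_rows(mlg_by_sample):
    """Transpose a matrix with MLGs as rows and samples as columns
    into one feature vector per sample."""
    return [list(column) for column in zip(*mlg_by_sample)]

def call_risk(prob_matrix, cutoff=0.5):
    """Each row is [P(control), P(CAD)]; return 1 (at risk) when
    P(CAD) >= cutoff, else 0."""
    return [1 if p_cad >= cutoff else 0 for _, p_cad in prob_matrix]

# Toy training matrix: 2 MLGs (rows) x 3 samples (columns).
train = [[0.10, 0.02, 0.30],
         [0.00, 0.05, 0.01]]
status = [1, 0, 1]  # 1 for CAD, 0 for control, one entry per column

features = to_sample_rows(train)  # one row per sample, aligned with status

# Placeholder prediction output for three test samples (not real results).
probs = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
print(call_risk(probs))
```

Note that a probability of exactly 0.5 is called as at risk, matching the "≥0.5" rule in the text.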

The inventors re-ran the random forest with the 65 selected MLGs and calculated the probability of illness (Table 11; Fig. 3, test set). The model was tested on the test set (n = 86; 29 case samples and 57 control samples) and the prediction error was calculated. The false negative (FN) rate was 6.89% (2/29), the false positive (FP) rate was 21.05% (12/57), and the area under the ROC curve was 94.34% (95% CI: 89.86%-98.83%).
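The error rates quoted above follow directly from the misclassification counts. A minimal sketch of the arithmetic, with a hypothetical helper name `error_rates`:

```python
def error_rates(fn, n_case, fp, n_control):
    """Return (false negative rate, false positive rate) as fractions."""
    return fn / n_case, fp / n_control

# Counts reported for the 65-MLG model on the test set.
fn_rate, fp_rate = error_rates(fn=2, n_case=29, fp=12, n_control=57)
print(f"FN rate: {fn_rate:.2%}, FP rate: {fp_rate:.2%}")
```

Rounded to two decimals, 2/29 comes out as 6.90%, so the 6.89% in the text appears to be a truncated rather than a rounded value.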

Furthermore, the inventors used the 4 microbes from Streptococcus (Streptococcus oralis, Streptococcus sanguinis, Streptococcus mitis and Streptococcus infantis) as a biomarker to test their power to separate CAD patients from controls (Table 11), finding that in the test set the false negative (FN) rate was 17.24% (5/29), the false positive (FP) rate was 35.08% (20/57), and the area under the ROC curve was 81.94% (95% CI: 72.98%-90.9%).
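The text does not state how the area under the ROC curve was computed. One standard way, shown here as a sketch on made-up scores, is the fraction of (case, control) pairs in which the case receives the higher predicted probability, with ties counted as one half. The function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0      # case ranked above control
            elif case_score == control_score:
                wins += 0.5      # tie counts as half
    return wins / (len(cases) * len(controls))

# Toy data: 3 cases and 4 controls with predicted probabilities of CAD.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(roc_auc(labels, scores))
```

Unlike the FN and FP rates, this measure does not depend on the 0.5 cutoff, which is why both kinds of figure are reported side by side.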

Table 11. Prediction results of the 65 MLGs and the 4 MLGs

Thus the inventors have identified and validated 65 CAD-associated gut microbes and 4 optimized gut microbes by a random forest model based on CAD-associated gene markers. The inventors have also constructed a method to evaluate the risk of CAD based on these 65 CAD-associated gut microbes and 4 optimized gut microbes.

Although explanatory embodiments have been shown and described, it would be appreciated by those skilled in the art that the above embodiments cannot be construed to limit the present disclosure, and changes, alternatives, and modifications can be made in the embodiments without departing from the spirit, principles and scope of the present disclosure.

Claims

A biomarker set for predicting a disease related to microbiota in a subject consisting of:

Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544.
The biomarker set for predicting a disease related to microbiota in a subject according to claim 1, comprising at least a partial sequence of SEQ ID NO: 1 to 122009.
A biomarker set for predicting a disease related to microbiota in a subject consisting of:

a gut biomarker comprises at least a partial sequence of SEQ ID NO: 1 to 122009.
The biomarker set for predicting a disease related to microbiota in a subject, wherein the disease is coronary artery disease or related heart disease.
A kit for determining the gene marker set of any one of claims 1 to 4, comprising primers used for PCR amplification and designed according to the DNA sequence as set forth in claim 3.
A kit for determining the gene marker set of any one of claims 1 to 4, comprising one or more probes designed according to the genes as set forth in claim 3.
Use of the gene marker set of any one of claims 1 to 4 for predicting the risk of coronary artery disease (CAD) or related disorder in a subject to be tested, comprising:

(1) collecting a sample from the subject to be tested;

(2) determining the relative abundance information of each biomarker of the biomarker set according to any one of claims 1 to 4 in the samples obtained in step (1);

(3) obtaining a probability of CAD by comparing the relative abundance information of each biomarker of subject to be tested with a training dataset using a Multivariate statistical model,

wherein the probability of CAD greater than a cutoff indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.
The use of claim 7, wherein the training dataset is constructed based on the relative abundance information of each biomarker of a plurality of subjects having CAD and a plurality of normal subjects using a Multivariate statistical model, alternatively the Multivariate statistical model is a randomForest model.
The use of claim 8, wherein the training dataset is a matrix with each row representing each biomarker of the biomarker set according to any one of claims 1 to 4, each column representing samples, each cell representing relative abundance profile of a biomarker in the sample, and sample disease status is a vector, with 1 for CAD and 0 for control.
The use of claim 8, wherein the relative abundance information of each of Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544 is obtained based on the relative abundance information of SEQ ID NO: 1 to 122009.
The use of claim 8, wherein the training dataset is Table 6-1, 6-2, 6-3, 6-4, 6-5, and the probability of CAD being at least 0.5 indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.
Use of the gene marker set of any one of claims 1 to 4 for preparation of a kit for predicting the risk of coronary artery disease (CAD) or related disorder in a subject to be tested, comprising:

(1) collecting a sample from the subject to be tested;

(2) determining the relative abundance information of each biomarker of the biomarker set according to any one of claims 1 to 4 in the samples obtained in step (1);

(3) obtaining a probability of CAD by comparing the relative abundance information of each biomarker of subject to be tested with a training dataset using a Multivariate statistical model,

wherein the probability of CAD greater than a cutoff indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.
The use of claim 12, wherein the training dataset is constructed based on the relative abundance information of each biomarker of a plurality of subjects having CAD and a plurality of normal subjects using a Multivariate statistical model, alternatively the Multivariate statistical model is a randomForest model.
The use of claim 13, wherein the training dataset is a matrix with each row representing each biomarker of the biomarker set according to any one of claims 1 to 4, each column representing samples, each cell representing relative abundance profile of a biomarker in the sample, and sample disease status is a vector, with 1 for CAD and 0 for control.
The use of claim 13, wherein the relative abundance information of each of Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544 is obtained based on the relative abundance information of SEQ ID NO: 1 to 122009.
The use of claim 13, wherein the training dataset is Table 6-1, 6-2, 6-3, 6-4, 6-5, and the probability of CAD being at least 0.5 indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.
A method of diagnosing whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota, comprising:

determining the relative abundance of the biomarkers of any one of claims 1 to 4 in a sample from the subject, and

determining whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota based on the relative abundance.
The method according to claim 17, comprising:

(1) collecting a sample from the subject to be tested;

(2) determining the relative abundance information of each biomarker of the biomarker set according to any one of claims 1 to 4 in the samples obtained in step (1);

(3) obtaining a probability of CAD by comparing the relative abundance information of each biomarker of the subject to be tested with a training dataset using a Multivariate statistical model,

wherein the probability of CAD greater than a cutoff indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.
The method of claim 18, wherein the training dataset is constructed based on the relative abundance information of each biomarker of a plurality of subjects having CAD and a plurality of normal subjects using a Multivariate statistical model, alternatively the Multivariate statistical model is a randomForest model.
The method of claim 19, wherein the training dataset is a matrix with each row representing each biomarker of the biomarker set according to any one of claims 1 to 4, each column representing samples, each cell representing relative abundance profile of a biomarker in the sample, and sample disease status is a vector, with 1 for CAD and 0 for control.
The method of claim 19, wherein the relative abundance information of each of Akkermansia muciniphila, Bacteroides fragilis, Clostridium bolteae, Clostridium hathewayi, Clostridium nexile, Clostridium sp. HGF2, Clostridium spiroforme, Clostridium symbiosum, Coprobacillus sp. 3_3_56FAA, Eggerthella sp. HGA1, Eubacterium limosum, Gemella sanguinis, Klebsiella pneumoniae, Lachnospiraceae bacterium 9_1_43BFAA, Lactobacillus amylovorus, Lactobacillus fermentum, Lactobacillus salivarius, Lactobacillus vaginalis, Rothia mucilaginosa, Ruminococcus gnavus, Ruminococcus obeum, Ruminococcus sp. 5_1_39BFAA, Ruminococcus torques, Streptococcus anginosus, Streptococcus infantarius, Streptococcus infantis, Streptococcus mitis, Streptococcus oralis, Streptococcus parasanguinis, Streptococcus pasteurianus, Streptococcus salivarius, Streptococcus sanguinis, Streptococcus sp. 2_1_36FAA, Streptococcus vestibularis, Subdoligranulum sp. 4_3_54A2FAA, CVD 1218, CVD 1259, CVD 1486, CVD 19194, CVD 19221, CVD 2015, CVD 2448, CVD 25206, CVD 461, CVD 547, CVD 659, CVD 8035, CVD 8194, CVD 8305, CVD 9620, CVD 977, Bacteroides cellulosilyticus, Bacteroides stercoris, Bacteroides uniformis, Bacteroides vulgatus, Bacteroides xylanisolvens, Bilophila wadsworthia, Clostridiales sp. SS3/4, Parabacteroides distasonis, Con 14667, Con 14806, Con 17745, Con 3602, Con 4962, Con 5544 is obtained based on the relative abundance information of SEQ ID NO: 1 to 122009.
The method of claim 19, wherein the training dataset is Table 6-1, 6-2, 6-3, 6-4, 6-5, and the probability of CAD being at least 0.5 indicates that the subject to be tested has or is at the risk of developing the coronary artery disease (CAD) or related disorder.