US20190367995A1 - Biomarkers for colorectal cancer

Publication number: US20190367995A1
Application number: US16/541,439
Authority: US
Inventors: Qiang Feng; Dongya Zhang; Youwen QIN
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2013-08-06
Filing date: 2019-08-15
Publication date: 2019-12-05
Also published as: US20160153054A1; US20160160296A1; US10526659B2; HK1217218A1; HK1217219A1; WO2015018307A1; WO2015018308A1

Biomarkers and methods for predicting the risk of a disease related to microbiota, in particular colorectal cancer (CRC), are described.

CROSS-REFERENCE TO RELATED APPLICATION

The present patent application is a continuation of U.S. patent application Ser. No. 15/015,358, filed Feb. 4, 2016, which is a continuation-in-part of PCT Patent Application No. PCT/CN2014/083663, filed Aug. 5, 2014, which was published in the English language on Feb. 12, 2015, under International Publication No. WO 2015/018307 A1, which claims priority to PCT Patent Application No. PCT/CN2013/080872, filed Aug. 6, 2013, and the disclosure of both prior applications is incorporated herein by reference.

REFERENCE TO SEQUENCE LISTING SUBMITTED ELECTRONICALLY

This application contains a sequence listing, which is submitted electronically via EFS-Web as an ASCII formatted sequence listing with a file name “Sequence_Listing.TXT”, creation date of Aug. 13, 2019, and having a size of about 43 kilobytes. The sequence listing submitted via EFS-Web is part of the specification and is herein incorporated by reference in its entirety.

FIELD

The present invention relates to biomarkers and methods for predicting the risk of a disease related to microbiota, in particular colorectal cancer (CRC).

BACKGROUND

Colorectal cancer (CRC) is the third most common form of cancer and the second leading cause of cancer-related death in the Western world (Schetter et al., 2011, “Alterations of microRNAs contribute to colon carcinogenesis,” Semin Oncol., 38:734-742, incorporated herein by reference). Worldwide, many people are diagnosed with CRC each year, and many patients die of the disease. Although current treatment strategies, including surgery, radiotherapy, and chemotherapy, have significant clinical value for CRC, relapses and metastases after surgery have hampered the success of those treatment modalities. Early diagnosis of CRC will help not only to prevent mortality but also to reduce the costs of surgical intervention.
Current tests of CRC, such as flexible sigmoidoscopy and colonoscopy, are invasive, and patients may find the procedures and the bowel preparation to be uncomfortable or unpleasant.
The development of CRC is a multifactorial process influenced by genetic, physiological, and environmental factors. With regard to environmental factors, lifestyle, particularly dietary intake, may affect the risk of developing CRC. The Western diet, which is rich in animal fat and poor in fiber, is generally associated with an increased risk of CRC. Thus, it has been hypothesized that the relationship between the diet and CRC may be due to the influence that the diet has on the colon microbiota and bacterial metabolism, making both the colon microbiota and bacterial metabolism relevant factors in the etiology of the disease (McGarr et al., 2005, “Diet, anaerobic bacterial metabolism, and colon cancer,” J Clin Gastroenterol., 39:98-109; Hatakka et al., 2008, “The influence of Lactobacillus rhamnosus LC705 together with Propionibacterium freudenreichii ssp. shermanii JS on potentially carcinogenic bacterial activity in human colon,” Int J Food Microbiol., 128:406-410, both incorporated herein by reference). According to McGarr et al., 2005, clinical studies can now take advantage of the molecular detection techniques used to monitor changes in species of sulfate-reducing bacteria (SRB) with dietary manipulation and medical treatments.
Interactions between the gut microbiota and the immune system have an important role in many diseases both within and outside the gut (Cho et al., 2012, “The human microbiome: at the interface of health and disease,” Nature Rev. Genet. 13, 260-270, incorporated herein by reference). Intestinal microbiota analysis of fecal DNA has the potential to be used as a noninvasive test for identifying specific biomarkers that can be used as a screening tool for early diagnosis of patients having CRC, thus leading to longer survival and a better quality of life. According to Cho et al., 2012, microbiome-host interactions may have important bearings on disease susceptibility, and the microbial effects on the balance of host metabolism and immunity provide an excellent model for the broader phenomenon of disease susceptibility. Thus, modifying disease risk by altering metabolic, immunological, or developmental pathways is an obvious strategy (Cho et al., 2012).
With the development of molecular biology and its application in microbial ecology and environmental microbiology, the emerging field of metagenomics (environmental genomics or ecogenomics) has developed rapidly. Metagenomics, comprising extracting total community DNA, constructing a genomic library, and analyzing the library with strategies similar to those of functional genomics, provides a powerful tool to study uncultured microorganisms in complex environmental habitats. In recent years, metagenomics has been applied to many environmental samples, such as oceans, soils, rivers, thermal vents, hot springs, and human gastrointestinal tracts, nasal passages, oral cavities, skin and urogenital tracts, illuminating its significant value in various areas including medicine, alternative energy, environmental remediation, biotechnology, agriculture and biodefense. For the study of CRC, the inventors performed analysis in the metagenomics field.

SUMMARY

Embodiments of the present disclosure seek to solve at least one of the problems existing in the prior art to at least some extent.
The present invention is based on at least the following findings by the inventors:
Assessment and characterization of gut microbiota has become a major research area in human disease, including colorectal cancer (CRC), one of the common causes of death among all types of cancers. To carry out analysis on the gut microbial content of CRC patients, the inventors performed deep shotgun sequencing of the gut microbial DNA from 128 Chinese individuals and conducted a Metagenome-Wide Association Study (MGWAS) using a protocol similar to that described by Qin et al., 2012, “A metagenome-wide association study of gut microbiota in type 2 diabetes,” Nature, 490, 55-60, the entire content of which is incorporated herein by reference. The inventors identified and validated 140,455 CRC-associated gene markers. To test the potential ability to classify CRC via analysis of gut microbiota, the inventors developed a disease classifier system based on 31 gene markers that are defined as an optimal gene set by a minimum redundancy-maximum relevance (mRMR) feature selection method. For intuitive evaluation of the risk of CRC disease based on these 31 gut microbial gene markers, the inventors calculated a healthy index. The inventors' data provide insight into the characteristics of the gut metagenome corresponding to a CRC risk, a model for future studies of the pathophysiological role of the gut metagenome in other relevant disorders, and the potential for a gut-microbiota-based approach for assessment of individuals at risk of such disorders.
It is believed that gene markers of intestinal microbiota are valuable for improving cancer detection at earlier stages for the following reasons. First, the markers of the present invention are more specific and sensitive as compared to conventional cancer markers. Second, the analysis of stool samples ensures accuracy, safety, affordability, and patient compliance, and stool samples are transportable. As compared to a colonoscopy, which requires bowel preparation, polymerase chain reaction (PCR)-based assays are comfortable and noninvasive, such that patients are more likely to be willing to participate in the described screening program. Third, the markers of the present invention can also serve as a tool for monitoring therapy of cancer patients in order to measure their responses to therapy.

BRIEF DESCRIPTION OF DRAWINGS

These and other aspects and advantages of the present disclosure will become apparent and more readily appreciated from the following descriptions taken in conjunction with the drawings. It should be understood that the invention is not limited to the precise embodiments shown in the drawings.

In the drawings:

FIG. 1 shows the distribution of P-value association statistics of all the microbial genes analyzed in this study: the association analysis of CRC p-value distribution identified a disproportionate over-representation of strongly associated markers at lower P-values, with the majority of genes following the expected P-value distribution under the null hypothesis, suggesting that the significant markers likely represent true rather than false associations;

FIG. 2 shows the minimum redundancy maximum relevance (mRMR) method used to identify 31 gene markers that differentiate colorectal cancer cases from controls: an incremental search was performed using the mRMR method, which generated a sequential number of subsets; for each subset, the error rate was estimated by a leave-one-out cross-validation (LOOCV) of a linear discrimination classifier; and the optimum subset with the lowest error rate contained 31 gene markers;

FIG. 3 shows the discovered gut microbial gene markers associated with CRC: the CRC indexes computed for the CRC patients and the control individuals from this study are shown along with patients and control individuals from earlier studies on type 2 diabetes and inflammatory bowel disease; the boxes depict the interquartile ranges between the first and third quartiles, and the lines inside the boxes denote the medians; the calculated gut healthy index listed in Table 6 correlated well with the ratio of CRC patients in the population; and the CRC indexes for CRC patient microbiomes are significantly different from the rest (***P<0.001);

FIG. 4 shows that ROC analysis of the CRC index from the 31 gene markers in Chinese cohort I indicates excellent classification potential, with an area under the curve of 0.9932;

FIG. 5 shows the CRC index calculated for an additional 19 Chinese CRC and 16 non-CRC samples in Example 2: the boxes in the inset depict the interquartile ranges (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the lines inside denote the medians, while the points represent the gut healthy indexes in each sample; the squares represent the case group (CRC); the triangles represent the control group (non-CRC); and the triangle with the * represents non-CRC individuals that were diagnosed as CRC patients;

FIG. 6 shows species involved in gut microbial dysbiosis during colorectal cancer: the differential relative abundance of two CRC-associated and one control-associated microbial species were consistently identified using three different methods: MLG, mOTU and the IMG database;

FIG. 7 shows the enrichment of Solobacterium moorei and Peptostreptococcus stomatis in the CRC patient microbiomes;

FIG. 8 shows the receiver operating characteristic (ROC) curves of the CRC-specific species marker selection using the random forest method and three different species annotation methods: (A) the IMG species annotation method was carried out using clean reads to IMG version 400; (B) the mOTU species annotation method was carried out using published methods; and (C) all significant genes were clustered using MLG methods and species annotations using IMG version 400;

FIG. 9 shows the stage-specific abundance of three species that are associated with or enriched in stage II and later, using three species annotation methods: MLG, IMG and mOTU;

FIG. 10 shows the species involved in gut microbial dysbiosis during colorectal cancer: the relative abundances of one bacterial species enriched in control microbiomes and three bacterial species enriched in CRC-associated microbiomes, during different stages of CRC (three different species annotation methods were used) are shown;

FIG. 11 shows the correlation between quantification by the metagenomic approach and quantitative polymerase chain reaction (qPCR) for two gene markers;

FIG. 12 shows the evaluation of the CRC index from 2 genes in Chinese cohort II: (A) the CRC index based on 2 gene markers separates CRC and control microbiomes; (B) ROC analysis reveals marginal potential for classification using the CRC index, with an area under the curve of 0.73; and

FIG. 13 shows the validation of robust gene markers associated with CRC: qPCR abundance (in log10 scale, zero abundance plotted as −8) of three gene markers was measured in cohort II, which consisted of 51 cases and 113 healthy controls; two gene markers were randomly selected (m1704941: butyryl-CoA dehydrogenase from F. nucleatum, m482585: RNA-directed DNA polymerase from an unknown microbe), and one was targeted (m1696299: RNA polymerase subunit beta, rpoB, from P. micra): (A) the CRC index based on the three genes clearly separates CRC microbiomes from controls; (B) classification by the CRC index has an area under the receiver operating characteristic (ROC) curve of 0.84; and (C) the P. micra species-specific rpoB gene shows relatively higher incidence and abundance starting in CRC stages II and III (P=2.15×10⁻¹⁵) as compared to the control and stage I microbiomes.

DETAILED DESCRIPTION

Various publications, articles and patents are cited or described in the background and throughout the specification; each of these references is herein incorporated by reference in its entirety. Discussion of documents, acts, materials, devices, articles or the like which have been included in the present specification is for the purpose of providing context for the present invention. Such discussion is not an admission that any or all of these matters form part of the prior art with respect to any inventions disclosed or claimed.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which this invention pertains. Otherwise, certain terms used herein have the meanings as set in the specification. Terms such as “a”, “an” and “the” are not intended to refer to only a singular entity, but include the general class for which a specific example can be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but its usage does not delimit the invention, except as outlined in the claims.
In one aspect, the present invention relates to a method of obtaining a set of gene markers for predicting the risk of an abnormal condition related to microbiota, comprising
a) identifying abnormal-associated gene markers by a metagenome-wide association study (MGWAS) strategy comprising:
i) collecting a sample from each subject from a population of subjects with the abnormal condition (abnormal) and subjects without the abnormal condition (controls);
ii) extracting DNA from each sample, constructing a DNA library from each sample, and carrying out high-throughput sequencing of each DNA library to obtain sequencing reads for each sample;
iii) mapping the sequencing reads to a gene catalog, and deriving a gene profile from the mapping result;
iv) performing a Wilcoxon rank-sum test on the gene profile to identify differential metagenomic gene contents between the abnormal and controls;
b) ranking all of the abnormal-associated gene markers identified in step a) by minimum redundancy-maximum relevance (mRMR) method, and identifying or classifying sequential marker sets therefrom; and
c) for each of the sequential marker set identified or classified from step (b), estimating the error rate by a leave-one-out cross-validation (LOOCV) of a linear discrimination classifier, and selecting an optimal gene marker set with the lowest error rate as the set of gene markers for predicting the risk of the abnormal condition.
In another aspect, the present invention relates to a method of diagnosing whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota, comprising:
1) obtaining sequencing reads from sample j of the subject;
2) mapping the sequencing reads to a gene catalog and deriving a gene profile from the mapping result;
3) determining the relative abundance of each gene marker in a set of gene markers, wherein the set of gene markers is obtained using a method according to an embodiment of the invention; and
4) calculating an index of sample j by the following formula:
$I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right],$
wherein:
A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in the set of gene markers,
N is a subset of all of abnormal-associated gene markers in selected biomarkers related to the abnormal condition,
M is a subset of all of control-associated gene markers in the selected biomarkers related to the abnormal condition, and
|N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing the abnormal condition.
In one embodiment, in a method of the present invention, the metagenome-wide association study (MGWAS) strategy further comprises estimating the false discovery rate (FDR). In one embodiment, the gene catalog is a non-redundant gene set constructed for the related microbiota. In one embodiment, the abnormal condition related to microbiota is an abnormal condition related to environmental microbiota such as soil microbiota, sea microbiota, or river microbiota. In another embodiment, the abnormal condition related to microbiota is a disease related to microbiota present in the animal body or the human body such as microbiota found in the gastrointestinal tract, nasal passages, oral cavities, skin or the urogenital tract, and the sample is a feces sample, a nasal cavity swab, a buccal swab, a skin swab or a vaginal swab. In a preferred embodiment, the abnormal condition related to microbiota is a colorectal disease selected from the group consisting of Colorectal Cancer, Ulcerative Colitis, Crohn's Disease, Irritable Bowel Syndrome (IBS), Diverticular Disease, Hemorrhoids, Anal Fissure, and Bowel Incontinence. In a most preferred embodiment, the abnormal condition related to microbiota is colorectal cancer (CRC).
In one embodiment, the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, 2) constructing a DNA library and sequencing the library. In one embodiment, the DNA library is sequenced via a next-generation sequencing method or a next-next-generation sequencing method, preferably using at least one system selected from the group consisting of Hiseq 2000, SOLID, 454, and True Single Molecule Sequencing.
In another embodiment, the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.
In yet another aspect, the present invention relates to a method for diagnosing whether a subject has colorectal cancer (CRC) or is at the risk of developing colorectal cancer, comprising:
1) obtaining sequencing reads from sample j of the subject;
2) mapping the sequencing reads to a human gut gene catalog and deriving a gene profile from the mapping result;
3) determining the relative abundance of each of the gene markers listed in SEQ ID NOs: 1-31; and
4) calculating the index of sample j using the following formula:
$I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right],$
wherein:
A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers listed in SEQ ID NOs 1-31,
N is a subset of all of the CRC-associated gene markers and M is a subset of all of the control-associated gene markers,
wherein the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and
|N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing colorectal cancer.
In one embodiment, the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum. In a preferred embodiment, the value of said cutoff is −0.0575.
In another aspect, the present invention relates to a gene marker set for predicting the risk of colorectal cancer (CRC) in a subject, the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31.
In another aspect, the present invention relates to a kit for analyzing the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31, comprising primers used for PCR amplification that are designed according to the genes listed in SEQ ID NOs: 1-31.
In another aspect, the present invention relates to a kit for analyzing the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31, comprising one or more probes that are designed according to the genes listed in SEQ ID NOs: 1-31.
In another aspect, the present invention relates to use of the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31 for predicting the risk of colorectal cancer (CRC) in a subject.
In another aspect, the present invention relates to use of the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31 for preparation of a kit for predicting the risk of colorectal cancer (CRC) in a subject.
In one embodiment, the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, 2) constructing a DNA library and sequencing the library.
The present invention is further exemplified in the following non-limiting Examples. Unless otherwise stated, parts and percentages are by weight and degrees are in Celsius. As is apparent to one of ordinary skill in the art, these Examples, while indicating preferred embodiments of the invention, are given by way of illustration only, and the agents referenced are all commercially available.

General Method

I. Methods for Detecting Biomarkers (Detect Biomarkers by Using MGWAS Strategy)
To define CRC-associated metagenomic markers, the inventors carried out a MGWAS (metagenome-wide association study) strategy (Qin et al., 2012, “A metagenome-wide association study of gut microbiota in type 2 diabetes,” Nature 490, 55-60, incorporated herein by reference). Using a sequence-based profiling method, the inventors quantified the gut microbiota in samples. On average, with the requirement that there should be ≥90% identity, the inventors could uniquely map paired-end reads to the updated gene catalog. To normalize the sequencing coverage, the inventors used relative abundance instead of the raw read count to quantify the gut microbial genes. However, unlike what is done in a GWAS subpopulation correction, the inventors applied this analysis to microbial abundance rather than to genotype. A Wilcoxon rank-sum test was done on the adjusted gene profile to identify differential metagenomic gene contents between the CRC patients and controls. The outcome of the analyses showed a substantial enrichment of a set of microbial genes that had very small P values, as compared with the expected distribution under the null hypothesis, suggesting that these genes were true CRC-associated gut microbial genes.
The inventors next controlled the false discovery rate (FDR) in the analysis, and defined CRC-associated gene markers from these genes corresponding to a FDR.
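As a non-limiting illustration of the per-gene association test and the FDR control described above, the following Python sketch computes a two-sided Wilcoxon rank-sum P value and then adjusts a list of P values. The function names are hypothetical. The P value uses a normal approximation without a tie correction to the variance, and the Benjamini-Hochberg adjustment is an assumption chosen for simplicity; the specification does not state which FDR procedure was applied.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def rank_sum_pvalue(cases, controls):
    """Two-sided Wilcoxon rank-sum P value (normal approximation, assumed)."""
    n1, n2 = len(cases), len(controls)
    ranks = rank_with_ties(list(cases) + list(controls))
    w = sum(ranks[:n1])  # rank sum of the case group
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In this sketch, each gene's relative abundances in cases and controls would be passed to `rank_sum_pvalue`, and the resulting list of P values to `benjamini_hochberg`; genes whose adjusted value falls below a chosen FDR threshold would be retained as candidate markers.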
II. Methods for Selecting the 31 Best Markers from the Biomarkers (Maximum Relevance Minimum Redundancy (mRMR) Feature Selection Framework)
To identify an optimal gene set, a minimum redundancy-maximum relevance (mRMR) (for detailed information, see Peng et al., 2005, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans Pattern Anal Mach Intell, 27, 1226-1238, doi:10.1109/TPAMI.2005.159, which is incorporated herein by reference) feature selection method was used to select from all the CRC-associated gene markers. The inventors used the “sideChannelAttack” package of R software to perform the incremental search and found 128 sequential marker sets. For each sequential set, the inventors estimated the error rate by a leave-one-out cross-validation (LOOCV) of the linear discrimination classifier. The optimal selection of marker sets was the one corresponding to the lowest error rate. In the present study, the inventors made the feature selection on a set of 140,455 CRC-associated gene markers. Since it was computationally prohibitive to perform mRMR using all of the genes, the inventors derived a statistically non-redundant gene set. First, the inventors pre-grouped the 140,455 colorectal cancer associated genes that were highly correlated with each other (Kendall correlation >0.9). Then the inventors chose the longest gene of each group as a representative gene for the group, since longer genes have a higher chance of being functionally annotated and will draw more reads during the mapping procedure. This generated a non-redundant set of 15,836 significant genes. Subsequently, the inventors applied the mRMR feature selection method to the 15,836 significant genes and identified an optimal set of 31 gene biomarkers that are strongly associated with colorectal cancer for colorectal cancer classification, which are shown in Table 1.

TABLE 1

31 optimal Gene markers' enrichment information

Gene id	Correlation coefficient with CRC	mRMR rank	Enrichment (1 = Control, 0 = CRC)	SEQ ID NO:

2361423	−0.558205377	1	0	1
2040133	−0.500237832	2	0	2
3246804	−0.454281109	3	0	3
3319526	0.441366585	4	1	4
3976414	0.431923463	5	1	5
1696299	−0.499397182	6	0	6
2211919	0.410506085	7	1	7
1804565	0.418663439	8	1	8
3173495	−0.55118428	9	0	9
482585	−0.454270958	10	0	10
181682	0.400814213	11	1	11
3531210	0.383705453	12	1	12
3611706	0.413879567	13	1	13
1704941	−0.468122499	14	0	14
4256106	0.42048024	15	1	15
4171064	0.43365554	16	1	16
2736705	−0.417069104	17	0	17
2206475	0.411512652	18	1	18
370640	0.399015232	19	1	19
1559769	0.427134509	20	1	20
3494506	0.382302723	21	1	21
1225574	−0.407066113	22	0	22
1694820	−0.442595115	23	0	23
4165909	0.410519669	24	1	24
3546943	−0.395361093	25	0	25
3319172	0.448526551	26	1	26
1699104	−0.467388978	27	0	27
3399273	0.388569946	28	1	28
3840474	0.383705453	29	1	29
4148945	0.407802676	30	1	30
2748108	−0.426515966	31	0	31
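
The pre-grouping step of section II (grouping genes with a Kendall correlation above 0.9 and keeping the longest gene of each group) can be sketched as follows. This is a minimal illustration under stated assumptions: the specification does not describe the grouping algorithm, so a greedy pass over genes in order of decreasing length is assumed here, and the function names, gene names, and data layout are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-b between two equal-length abundance vectors."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for a, b in combinations(range(len(x)), 2):
        dx = x[a] - x[b]
        dy = y[a] - y[b]
        if dx == 0 and dy == 0:
            continue  # tied in both vectors: contributes to neither factor
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x_only) * (untied + ties_y_only)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def pick_representatives(profiles, lengths, threshold=0.9):
    """Greedy grouping (assumed): visit genes from longest to shortest and
    keep a gene only if it is not highly correlated with a kept gene, so
    every kept gene is the longest member of its group."""
    representatives = []
    for gene in sorted(profiles, key=lambda g: lengths[g], reverse=True):
        if all(kendall_tau(profiles[gene], profiles[rep]) <= threshold
               for rep in representatives):
            representatives.append(gene)
    return representatives
```

Here `profiles` maps each gene to its relative abundances across samples and `lengths` maps each gene to its length. The pairwise comparison is quadratic in the number of samples and in the number of genes, so a set as large as the 140,455 genes above would need a faster implementation than this sketch.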

III. Gut Healthy Index (CRC Index)
To exploit the potential ability of disease classification by gut microbiota, the inventors developed a disease classifier system based on the gene markers that the inventors defined. For intuitive evaluation of the risk of disease based on these gut microbial gene markers, the inventors calculated a gut healthy index (CRC index).
To evaluate the effect of the gut metagenome on CRC, the inventors defined and calculated the gut healthy index for each individual on the basis of the selected 31 gut metagenomic markers as described above. For each individual sample, the gut healthy index of sample j, denoted by I_j, was calculated by the formula below:
$I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right],$
wherein A_ij is the relative abundance of marker i in sample j,
N is a subset of all of the abnormal-associated gene markers in the selected biomarkers related to the abnormal condition (namely, a subset of all of the CRC-associated gene markers in these 31 selected gut metagenomic markers),
M is a subset of all of the control-associated gene markers in the selected biomarkers related to the abnormal condition (namely, a subset of all control-associated markers in these 31 selected gut metagenomic markers), and
|N| and |M| are numbers (sizes) of the biomarkers in these two sets, respectively.
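The index defined above can be computed directly from a sample's marker abundances. The following Python sketch is illustrative only: the function name and the dictionary layout are hypothetical, the abundance values are made up for the example, and a marker absent from the sample is assumed to have a relative abundance of zero.

```python
import math

PSEUDOCOUNT = 1e-20  # the 10^-20 term in the formula


def gut_healthy_index(abundance, crc_markers, control_markers):
    """Mean log10 abundance of CRC-associated markers (set N) minus the
    mean log10 abundance of control-associated markers (set M)."""
    if not crc_markers or not control_markers:
        raise ValueError("both marker sets must be non-empty")

    def mean_log(markers):
        total = sum(math.log10(abundance.get(m, 0.0) + PSEUDOCOUNT)
                    for m in markers)
        return total / len(markers)

    return mean_log(crc_markers) - mean_log(control_markers)


# Illustrative call with three gene ids from Table 1 and made-up abundances.
sample = {"2361423": 1e-5, "2040133": 1e-7, "3319526": 1e-6}
index = gut_healthy_index(sample,
                          crc_markers=["2361423", "2040133"],
                          control_markers=["3319526"])
```

The pseudocount keeps the logarithm finite when a marker is not detected, in which case that marker contributes approximately −20 to its group's sum.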
IV. Receiver Operator Characteristic (ROC) Analysis
The inventors applied the ROC analysis to assess the performance of the colorectal cancer classification based on metagenomic markers. Based on the 31 gut metagenomic markers selected above, the inventors calculated the CRC index for each sample. The inventors then used the “Daim” package of R software to draw the ROC curve.
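The inventors used the “Daim” package of R for this analysis; as a stand-in, the area under the ROC curve can be sketched in Python as the fraction of case-control pairs in which the case has the higher index, with ties counted as one half. The function name is hypothetical, and this pairwise formulation is an assumption chosen for clarity rather than the procedure of the R package.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of indexes."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 1.0 means every CRC sample has a higher index than every control sample, and 0.5 corresponds to no discrimination.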
V. Disease Classifier System
After identifying biomarkers using the MGWAS strategy, and following the rule that the biomarkers used should yield the best classification between disease and healthy samples with the least redundancy, the inventors ranked the biomarkers by the minimum redundancy-maximum relevance (mRMR) method and found sequential marker sets (the size can be as large as the number of biomarkers). For each sequential set, the inventors estimated the error rate using a leave-one-out cross-validation (LOOCV) of a classifier. The optimal selection of marker sets corresponded to the lowest error rate (in some embodiments, the inventors have selected 31 biomarkers).
Finally, for intuitive evaluation of the risk of disease based on these gut microbial gene markers, the inventors calculated a gut healthy index. The larger the healthy index, the higher the risk of disease. The smaller the healthy index, the healthier the subject. The inventors can build an optimal healthy index cutoff using a large cohort. If the healthy index of the test sample is larger than the cutoff, then the subject is at a higher disease risk. If the healthy index of the test sample is smaller than the cutoff, then the subject has a low risk of disease. The optimal healthy index cutoff can be determined using a ROC method when the AUC (Area Under the Curve) is at its maximum.
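The selection of the marker set with the lowest LOOCV error can be sketched as below. This is an illustration under stated assumptions, not the inventors' implementation: a nearest-centroid classifier is swapped in for the classifier used by the inventors to keep the example short, the sequential sets are taken to be the top-k prefixes of an already ranked marker list, and all function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample
    (stand-in classifier, assumed for illustration)."""
    best_label, best_dist = None, None
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(x, y):
    """Leave-one-out error rate: hold out each sample once and predict it."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) != y[i]:
            errors += 1
    return errors / len(x)


def best_prefix(x, y, ranked_features):
    """Return (k, error) for the top-k ranked markers with the lowest
    LOOCV error; the smallest k wins when error rates are equal."""
    best_k, best_err = None, None
    for k in range(1, len(ranked_features) + 1):
        columns = ranked_features[:k]
        subset = [[row[c] for c in columns] for row in x]
        err = loocv_error(subset, y)
        if best_err is None or err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

Here `x` is a list of per-sample abundance rows, `y` holds the case or control labels, and `ranked_features` lists column indexes in mRMR rank order.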
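One way to derive a cutoff from cohort data and apply it to a test sample is sketched below. The specification states only that the cutoff is determined by a ROC method; maximizing Youden's J statistic (sensitivity + specificity − 1) is an assumption made here as a common choice, and the function names and score values are hypothetical.

```python
def best_cutoff(case_scores, control_scores):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1,
    where an index strictly greater than the cutoff is called positive."""
    best = (None, -1.0)
    for cutoff in sorted(set(case_scores) | set(control_scores)):
        sensitivity = sum(s > cutoff for s in case_scores) / len(case_scores)
        specificity = sum(s <= cutoff for s in control_scores) / len(control_scores)
        j = sensitivity + specificity - 1.0
        if j > best[1]:
            best = (cutoff, j)
    return best


def classify(index, cutoff):
    """Apply the rule above: an index larger than the cutoff means higher risk."""
    return "higher risk" if index > cutoff else "lower risk"


# Made-up cohort indexes, for illustration only.
cutoff, j = best_cutoff([0.5, 1.2, 2.0], [-1.0, -0.3, 0.1])
```

A fixed cutoff, such as the value of −0.0575 given for a preferred embodiment, can be passed to `classify` in place of a derived one.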
The following examples are offered to illustrate, but not to limit the claimed invention.

Example 1. Identifying 31 Biomarkers from 128 Chinese Individuals and Using a Gut Healthy Index to Evaluate their Colorectal Cancer Risk

1.1 Sample Collection and DNA Extraction
Stool samples from 128 subjects (cohort I), including 74 colorectal cancer patients and 54 healthy controls (Table 2), were collected at the Prince of Wales Hospital, Hong Kong, with informed consent. To be eligible for inclusion in this study, individuals had to meet the following criteria for stool sample collection: 1) no use of antibiotics or other medications, no special diet (diabetic, vegetarian, etc.), and a normal lifestyle (without extra stress) for a minimum of 3 months; 2) a minimum of 3 months since any medical intervention; and 3) no history of colorectal surgery, any kind of cancer, or inflammatory or infectious diseases of the intestine. Subjects were asked to collect stool samples at home in standardized containers before a colonoscopy examination and to store the samples in their home freezer immediately. Frozen samples were then delivered to the Prince of Wales Hospital in insulating polystyrene foam containers and stored at −80° C. until use.
Stool samples were thawed on ice and DNA extraction was performed using the Qiagen QIAamp DNA Stool Mini Kit according to the manufacturer's instructions. Extracts were treated with DNase-free RNase to eliminate RNA contamination. DNA quantity was determined using a NanoDrop spectrophotometer, a Qubit Fluorometer (with the Quant-iT™ dsDNA BR Assay Kit) and gel electrophoresis.

TABLE 2

Baseline characteristics of colorectal
cancer cases and controls in cohort I.

Parameter	Controls (n = 54)	Cases (n = 74)

Age	61.76	66.04
Sex (M:F)	33:21	48:26
BMI	23.47	23.9
eGFR	72.24	74.15
DM (%)	16 (29.6%)	29 (39.2%)
Enterotype (1:2:3)	26:22:6	37:31:6
Stage of disease (1:2:3:4)	n.a.	16:21:30:7
Location (proximal:distal)	n.a.	13:61

BMI: body mass index; eGFR: estimated glomerular filtration rate; DM: diabetes mellitus type 2.

1.2 DNA Library Construction and Sequencing
DNA library construction was performed following the manufacturer's instruction (Illumina HiSeq 2000 platform). The inventors used the same workflow as described previously to perform cluster generation, template hybridization, isothermal amplification, linearization, blocking and denaturation, and hybridization of the sequencing primers (Qin, J. et al. (2012), “A metagenome-wide association study of gut microbiota in type 2 diabetes,” Nature 490, 55-60, incorporated herein by reference).
The inventors constructed one paired-end (PE) library with an insert size of 350 bp for each sample, followed by high-throughput sequencing to obtain around 30 million PE reads of a length of 2×100 bp. High quality reads were extracted by filtering out low quality reads containing ‘N’s in the read, filtering out adapter contamination and human DNA contamination from the raw data, and trimming low quality terminal bases of reads. 751 million metagenomic reads (high quality reads) were generated (5.86 million reads per individual on average, Table 3).
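The read-level part of this filtering can be sketched in Python as follows. The sketch only covers two of the steps named above (discarding reads that contain 'N' and trimming low quality terminal bases); adapter and human read removal are not shown. The quality threshold, the minimum length and the function name are hypothetical, since the specification does not state them.

```python
def clean_read(sequence, qualities, min_quality=20, min_length=30):
    """Return the trimmed (sequence, qualities) pair, or None if the read is rejected.

    sequence:  read bases as a string
    qualities: per-base Phred scores as a list of ints
    min_quality and min_length are illustrative values, not taken from the source.
    """
    # Reject reads containing ambiguous bases.
    if "N" in sequence.upper():
        return None
    # Trim low quality bases from the 3' end.
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    # Reject reads that became too short after trimming.
    if end < min_length:
        return None
    return sequence[:end], qualities[:end]


rejected = clean_read("ACGTN", [30, 30, 30, 30, 30])
trimmed = clean_read("ACGTACGT", [30, 30, 30, 30, 30, 30, 10, 5], min_length=4)
```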
1.3 Reads Mapping
The inventors mapped the high quality reads (Table 3) to a published reference gut gene catalog established from European and Chinese adults (Qin, J. et al. (2012), “A metagenome-wide association study of gut microbiota in type 2 diabetes,” Nature, 490, 55-60, incorporated herein by reference) (identity >=90%), and the inventors then derived the gene profiles using the same method as Qin et al. 2012, supra. Following Qin et al. 2012, supra, the inventors derived from the reference gene catalog a subset of 2,110,489 (2.1M) genes that appeared in at least 6 of the 128 samples.
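The occurrence filter described above can be sketched in Python as follows. The sketch assumes that a gene "appears" in a sample when its relative abundance is greater than zero; the data layout, the function name and the toy values are illustrative only.

```python
def genes_present_in_min_samples(abundance, min_samples=6):
    """Keep genes with non-zero abundance in at least min_samples samples.

    abundance: dict mapping gene id -> list of per-sample relative abundances.
    """
    kept = {}
    for gene_id, values in abundance.items():
        n_present = sum(1 for v in values if v > 0)
        if n_present >= min_samples:
            kept[gene_id] = values
    return kept


# Hypothetical toy profile with 8 samples.
toy_profile = {
    "gene_a": [0.0, 1e-6, 2e-6, 0.0, 3e-6, 1e-7, 5e-6, 2e-7],  # present in 6 samples
    "gene_b": [0.0, 0.0, 2e-6, 0.0, 0.0, 1e-7, 0.0, 0.0],      # present in 2 samples
}
kept_genes = genes_present_in_min_samples(toy_profile)
```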

TABLE 3

Summary of metagenomic data and mapping to reference
gene catalog. The fourth column reports P-value
results from Wilcoxon rank-sum tests.

Parameter	Controls	Cases	P-value

Average raw reads	60162577	60496561	0.8082
After removing low quality reads	59423292 (98.77%)	59715967 (98.71%)	0.831
After removing human reads	59380535 ± 7378751	58112890 ± 10324458	0.419
Mapping rate	66.82%	66.27%	0.252

1.4 Analysis of Factors Influencing Gut Microbiota Gene Profiles
To ensure robust comparison of the gene content of the 128 metagenomes, the inventors generated a set of 2,110,489 (2.1M) genes that were present in at least 6 subjects, and generated 128 gene abundance profiles using these 2.1 million genes. The inventors used the permutational multivariate analysis of variance (PERMANOVA) test to assess the effect of different characteristics, including age, BMI, eGFR, TCHO, LDL, HDL, TG, gender, DM, CRC status, smoking status and location, on the gene profiles of the 2.1M genes. The inventors performed the analysis using the “vegan” package of R, and the permuted p-value was obtained after 10,000 permutations. The inventors also corrected for multiple testing using the “p.adjust” function of R with the Benjamini-Hochberg method to get the q-value for each covariate.
When the inventors performed permutational multivariate analysis of variance (PERMANOVA) on 13 different covariates, only CRC status was significantly associated with these gene profiles (q=0.0028, Table 4), showing a stronger association than the second-best determinant, body mass index (q=0.15). Thus, the data suggest an altered gene composition in CRC patient microbiomes.
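For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following Python sketch shows the step-up procedure on a small set of made-up p-values. It is a generic illustration of the adjustment, not a reimplementation of the R “p.adjust” function, and the toy p-values are not taken from Table 4.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values, for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_q = benjamini_hochberg(toy_p)
```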

TABLE 4

PERMANOVA analysis using the microbial gene profile. Analysis was
conducted to test whether clinical parameters and colorectal cancer (CRC)
status have a significant impact on the gut microbiota with q < 0.05.

Phenotype	Df	SumsOfSqs	MeanSqs	F. Model	R2	Pr(>F)	q-value

CRC Status	1	0.679293	0.679293	1.95963	0.015314	0.0004	0.0028
BMI	1	0.484289	0.484289	1.39269	0.011019	0.033	0.154
DM Status	1	0.438359	0.438359	1.257642	0.009883	0.084	0.27272
Location	1	0.436417	0.436417	1.228172	0.016772	0.0974	0.27272
Age	1	0.397282	0.397282	1.138728	0.008957	0.1923	0.4487
HDL	1	0.38049	0.38049	1.083265	0.010509	0.271	0.542
TG	1	0.365191	0.365191	1.039593	0.010089	0.3517	0.564964
eGFR	1	0.358527	0.358527	1.023138	0.009471	0.38	0.564964
CRC Stage	1	0.357298	0.357298	1.002413	0.013731	0.441	0.564964
Smoker	1	0.347969	0.347969	0.999825	0.013511	0.4439	0.564964
TCHO	1	0.321989	0.321989	0.915216	0.008893	0.6539	0.762883
LDL	1	0.306483	0.306483	0.871306	0.00847	0.7564	0.814585
Gender	1	0.267738	0.267738	0.765162	0.006036	0.9528	0.9528

BMI: body mass index;
DM: diabetes mellitus type 2;
HDL: high density lipoprotein;
TG: triglyceride; eGFR: estimated glomerular filtration rate;
TCHO: total cholesterol;
LDL: low density lipoprotein.

1.5 CRC-Associated Genes Identified by MGWAS
1.5.1 Identification of colorectal cancer associated genes. The inventors performed a metagenome-wide association study (MGWAS) to identify the genes contributing to the altered gene composition in the CRC samples. To identify the association between the metagenomic profile and colorectal cancer, a two-tailed Wilcoxon rank-sum test was applied to each gene in the 2.1M (2,110,489) gene profiles. The inventors identified 140,455 gene markers, which were enriched in either case or control samples with P<0.01 (FIG. 1).
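A per-gene two-tailed Wilcoxon rank-sum test can be sketched in plain Python as shown below. The sketch uses the normal approximation with a tie correction and no continuity correction; whether the inventors used an exact or an approximate test is not stated, so this choice is an assumption, and in practice an established statistics library would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(cases, controls):
    """Two-tailed rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(cases), len(controls)
    pooled = list(cases) + list(controls)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical abundances of one gene in three cases and three controls.
toy_p_value = rank_sum_test([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
```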
1.5.2 Estimating the false discovery rate (FDR). Instead of a sequential P-value rejection method, the inventors applied the “qvalue” method proposed in a previous study (J. D. Storey and R. Tibshirani (2003), “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences of the United States of America, 100, 9440, incorporated herein by reference) to estimate the FDR. In the MGWAS, the statistical hypothesis tests were performed on a large number of features of the 140,455 genes. The false discovery rate (FDR) was 11.03%.
1.6 Gut Microbiota-Based CRC Classification
The inventors proceeded to identify potential biomarkers for CRC from the genes associated with the disease, using the minimum redundancy maximum relevance (mRMR) feature selection method. However, since the computational complexity of this method did not allow them to use all 140,455 genes from the MGWAS approach, the inventors had to reduce the number of candidate genes. First, the inventors selected a stricter set of 36,872 genes with higher statistical significance (P<0.001; FDR=4.147%). Then the inventors identified groups of genes that were highly correlated with each other (Kendall's τ>0.9) and chose the longest gene in each group, generating a statistically non-redundant set of 15,836 significant genes. Finally, the inventors used the mRMR method and identified an optimal set of 31 genes that were strongly associated with CRC status (FIG. 2, Table 5). The inventors computed a CRC index based on the relative abundance of these markers, which clearly separated the CRC patient microbiomes from the control microbiomes (Table 6), as well as from 490 fecal microbiomes from two previous studies on type 2 diabetes in Chinese individuals (Qin et al. 2012, supra) and inflammatory bowel disease in European individuals (J. Qin et al. (2010), “A human gut microbial gene catalogue established by metagenomic sequencing,” Nature, 464, 59, incorporated herein by reference) (FIG. 3, the median CRC-indexes for patients and controls in this study were 6.42 and −5.48, respectively; Wilcoxon rank-sum test, q<2.38×10⁻¹⁰ for all five comparisons, see Table 7). Classification of the 74 CRC patient microbiomes against the 54 control microbiomes using the CRC index exhibited an area under the receiver operating characteristic (ROC) curve of 0.9932 (FIG. 4). At the cutoff −0.0575, the true positive rate (TPR) was 1, and the false positive rate (FPR) was 0.07407, indicating that the 31 gene markers could be used to accurately classify CRC individuals.
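The redundancy-reduction step (grouping genes with Kendall's τ>0.9 and keeping the longest gene of each group) might be sketched in Python as follows. The specification does not say how the correlated groups were formed, so the greedy longest-first strategy, the use of the tau-a variant, and all names and toy values below are assumptions made for illustration.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                score += 1
            elif s < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


def drop_redundant(profiles, lengths, threshold=0.9):
    """Greedy sketch: visit genes from longest to shortest and keep a gene
    unless it correlates above the threshold with an already kept gene."""
    kept = []
    for gene in sorted(profiles, key=lambda g: lengths[g], reverse=True):
        if all(kendall_tau(profiles[gene], profiles[k]) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical abundance profiles over five samples and gene lengths in bp.
toy_profiles = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],  # perfectly rank-correlated with g1
    "g3": [5, 1, 4, 2, 3],
}
toy_lengths = {"g1": 900, "g2": 1200, "g3": 600}
non_redundant = drop_redundant(toy_profiles, toy_lengths)
```

Because every pair of genes is compared, this naive version scales quadratically and would need a faster formulation for tens of thousands of genes.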

TABLE 6

128 samples' calculated gut healthy
index (CRC patients and non-CRC controls)

Sample ID	Type (Con_CRC: non-CRC controls; CRC: CRC patients)	CRC-index

502A	Con_CRC	−7.505749695
512A	Con_CRC	−5.150023018
515A	Con_CRC	−4.919398163
516A	Con_CRC	−2.793151285
517A	Con_CRC	−8.078128133
519A	Con_CRC	−7.556675412
530A	Con_CRC	−0.194519906
534A	Con_CRC	−5.251127609
536A	Con_CRC	−7.08635459
M2.PK504A	Con_CRC	−5.470747464
M2.PK514A	Con_CRC	−4.441183208
M2.PK520B	Con_CRC	−8.101427301
M2.PK522A	Con_CRC	0.269338093
M2.PK523A	Con_CRC	−6.980913756
M2.PK524A	Con_CRC	−9.027027667
M2.PK531B	Con_CRC	−5.483143199
M2.PK532A	Con_CRC	−5.96003222
M2.PK533A	Con_CRC	−7.718764145
M2.PK543A	Con_CRC	−9.844975269
M2.PK548A	Con_CRC	−4.062846751
M2.PK556A	Con_CRC	−4.15150788
M2.PK558A	Con_CRC	−9.712104855
M2.PK602A	Con_CRC	−7.380042553
M2.PK615A	Con_CRC	3.232971256
M2.PK617A	Con_CRC	−8.878473599
M2.PK619A	Con_CRC	−8.279540689
M2.PK630A	Con_CRC	−5.993197547
M2.PK644A	Con_CRC	1.230424198
M2.PK647A	Con_CRC	−7.181191393
M2.PK649A	Con_CRC	−1.576643721
M2.PK653A	Con_CRC	−4.246899704
M2.PK656A	Con_CRC	−5.80900221
M2.PK659A	Con_CRC	−7.805935646
M2.PK663A	Con_CRC	−5.007057718
M2.PK699A	Con_CRC	−8.827532431
M2.PK701A	Con_CRC	−0.981728615
M2.PK705A	Con_CRC	−8.822384737
M2.PK708A	Con_CRC	−6.573782359
M2.PK710A	Con_CRC	−7.558945558
M2.PK712A	Con_CRC	−9.207916748
M2.PK723A	Con_CRC	−4.481542621
M2.PK725A	Con_CRC	−7.520375154
M2.PK729A	Con_CRC	−5.318926226
M2.PK730A	Con_CRC	−4.3710193
M2.PK732A	Con_CRC	−5.20132309
M2.PK750A	Con_CRC	−6.64771202
M2.PK751A	Con_CRC	−3.65391467
M2.PK797A	Con_CRC	−4.675123647
M2.PK801A	Con_CRC	−7.766321018
509A	Con_CRC	−2.479402638
A60A	Con_CRC	1.078322254
506A	Con_CRC	−4.246837899
A21A	Con_CRC	−4.440375851
A51A	Con_CRC	−2.809587066
A10A	CRC	13.26483131
M2.PK002A	CRC	7.002094781
M2.PK003A	CRC	5.108478224
M2.PK018A	CRC	2.243592264
M2.PK019A	CRC	−0.057498133
M2.PK021A	CRC	7.878402029
M2.PK022A	CRC	9.047909247
M2.PK023A	CRC	5.428574192
M2.PK024A	CRC	5.032760805
M2.PK026A	CRC	6.257085759
M2.PK027A	CRC	1.59430903
M2.PK029A	CRC	9.331138747
M2.PK030A	CRC	4.728023967
M2.PK032A	CRC	6.055831256
M2.PK037A	CRC	4.227424374
M2.PK038A	CRC	2.669264211
M2.PK041A	CRC	4.558926807
M2.PK042A	CRC	3.47308125
M2.PK043A	CRC	5.347387703
M2.PK045A	CRC	8.09166979
M2.PK046A	CRC	9.235279951
M2.PK047A	CRC	8.45229555
M2.PK051A	CRC	6.602608047
M2.PK052A	CRC	3.207800397
M2.PK055A	CRC	5.088317256
M2.PK056B	CRC	5.504229632
M2.PK059A	CRC	5.466091636
M2.PK063A	CRC	3.758294225
M2.PK064A	CRC	3.763414393
M2.PK065A	CRC	6.486959786
M2.PK066A	CRC	1.199091901
M2.PK067A	CRC	9.938025463
M2.PK069B	CRC	−0.04402983
M2.PK083B	CRC	8.394697958
M2.PK084A	CRC	9.25322799
M2.PK085A	CRC	7.852591304
MSC103A	CRC	4.05476664
MSC119A	CRC	4.331580986
MSC120A	CRC	3.865826479
MSC1A	CRC	9.930238103
MSC45A	CRC	9.331894011
MSC4A	CRC	0.006971195
MSC54A	CRC	12.10968629
MSC5A	CRC	3.272778932
MSC63A	CRC	7.74197911
MSC6A	CRC	8.063701275
MSC76A	CRC	6.730976418
MSC78A	CRC	6.999247399
MSC79A	CRC	6.805539524
MSC81A	CRC	8.465000094
M118A	CRC	8.675933723
M123A	CRC	8.627635602
M2.Pk.001A	CRC	7.78045553
M2.Pk.005A	CRC	4.534189338
M2.Pk.009A	CRC	8.188718934
M2.Pk.017A	CRC	6.225010462
M84A	CRC	3.497922009
M89A	CRC	0.394210537
M2.Pk.007A	CRC	5.703428174
M2.Pk.010A	CRC	7.231959163
M122A	CRC	8.387516145
M2.Pk.004A	CRC	4.246104721
M2.Pk.008A	CRC	5.299578303
M2.Pk.011A	CRC	6.354957821
M2.Pk.015A	CRC	7.719629705
M113A	CRC	7.528437656
M116A	CRC	10.54991338
M117A	CRC	0.072052278
M2.Pk.006A	CRC	9.368358379
M2.Pk.012A	CRC	1.112535148
M2.Pk.014A	CRC	8.671786146
M2.Pk.016A	CRC	8.898356611
M115A	CRC	7.241420602
M2.Pk.013A	CRC	7.331598086

Example 2. Validating the 31 Biomarkers

The inventors validated the discriminatory power of the CRC classifier using another new independent study group, including 19 CRC patients and 16 non-CRC controls that were also collected in the Prince of Wales Hospital.
For each sample, DNA was extracted and a DNA library was constructed followed by high throughput sequencing as described in Example 1. The inventors calculated the gene abundance profile for these samples using the same method as described in Qin et al. 2012, supra. The relative abundance of each of the gene markers as set forth in SEQ ID NOs: 1-31 was then determined. The index of each sample was then calculated using the following formula:
$I_{j} = \left[\frac{\sum_{i \in N} \log_{10}(A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10}(A_{ij} + 10^{-20})}{|M|}\right],$
wherein:
A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers as set forth in SEQ ID NOs 1-31,
N is a subset of all of the abnormal-associated gene markers and M is a subset of all of the control-associated gene markers,
the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and
|N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein |N| is 13 and |M| is 18.
Table 8 shows the calculated index of each sample, and Table 9 shows the relevant gene relative abundance of a representative sample, V30.
In this assessment analysis, the top 19 samples with the highest gut healthy index were all CRC patients, and all of the CRC patients were classified as CRC individuals (Table 8 and FIG. 5). Only one of the non-CRC controls (FIG. 5, the triangle with *) was classified as a CRC patient. At the cutoff −0.0575, the error rate was 2.86% (1 of 35 samples), validating that the 31 gene markers can accurately classify CRC individuals.

TABLE 8

35 samples' calculated gut healthy index

Sample ID	Type (Con_CRC: non-CRC controls; CRC: CRC patients)	CRC-index

V27	Con_CRC	0.269338056
V19	Con_CRC	−0.981728643
V26	Con_CRC	−2.793151257
V10	Con_CRC	−4.371019
V18	Con_CRC	−4.440375832
V1	Con_CRC	−4.675123655
V14	Con_CRC	−4.919398178
V9	Con_CRC	−5.007057768
V33	Con_CRC	−5.20132324
V29	Con_CRC	−5.251127667
V6	Con_CRC	−5.470747485
V21	Con_CRC	−5.96003246
V22	Con_CRC	−6.64771297
V23	Con_CRC	−7.181191336
V5	Con_CRC	−7.558945528
V32	Con_CRC	−8.101427363
V35	CRC	13.16483131
V8	CRC	12.12968629
V13	CRC	10.54991338
V7	CRC	9.958035463
V17	CRC	9.2432279
V2	CRC	9.235252955
V15	CRC	8.465000028
V25	CRC	8.188718932
V20	CRC	7.852591353
V3	CRC	7.74197955
V24	CRC	7.528437632
V16	CRC	6.225010478
V30	CRC	6.055831257
V31	CRC	5.088317266
V28	CRC	3.865826489
V4	CRC	3.758294237
V11	CRC	2.669264236
V34	CRC	2.243592293
V12	CRC	1.199091982
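The error rate quoted for this validation set can be checked directly from the values in Table 8. The Python sketch below copies the 35 CRC-index values from the table, applies the cutoff of −0.0575 (a sample is called CRC when its index is larger than the cutoff), and also computes a pairwise estimate of the area under the ROC curve. The function names are illustrative and the pairwise AUC estimator is a standard choice rather than one named in the specification.

```python
# CRC-index values copied from Table 8 (validation cohort).
CONTROL_INDEX = [
    0.269338056, -0.981728643, -2.793151257, -4.371019, -4.440375832,
    -4.675123655, -4.919398178, -5.007057768, -5.20132324, -5.251127667,
    -5.470747485, -5.96003246, -6.64771297, -7.181191336, -7.558945528,
    -8.101427363,
]
CRC_INDEX = [
    13.16483131, 12.12968629, 10.54991338, 9.958035463, 9.2432279,
    9.235252955, 8.465000028, 8.188718932, 7.852591353, 7.74197955,
    7.528437632, 6.225010478, 6.055831257, 5.088317266, 3.865826489,
    3.758294237, 2.669264236, 2.243592293, 1.199091982,
]
CUTOFF = -0.0575  # cutoff reported in Example 1


def error_rate(case_values, control_values, cutoff):
    """Fraction of samples misclassified when index > cutoff is called CRC."""
    false_negatives = sum(1 for v in case_values if v <= cutoff)
    false_positives = sum(1 for v in control_values if v > cutoff)
    return (false_negatives + false_positives) / (len(case_values) + len(control_values))


def pairwise_auc(case_values, control_values):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    wins = 0.0
    for c in case_values:
        for k in control_values:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


validation_error = error_rate(CRC_INDEX, CONTROL_INDEX, CUTOFF)
validation_auc = pairwise_auc(CRC_INDEX, CONTROL_INDEX)
```

With these values a single control (V27) lies above the cutoff and no CRC sample lies below it, which corresponds to 1 misclassified sample out of 35.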

TABLE 9

Gene relative abundance of Sample V30

Gene id	Enrichment (1 = Control, 0 = CRC)	SEQ ID NO:	Calculation of gene relative abundance

2361423	0	1	2.24903E−05
2040133	0	2	8.77418E−08
3246804	0	3	0
3319526	1	4	0
3976414	1	5	0
1696299	0	6	4.04178E−06
2211919	1	7	7.89676E−07
1804565	1	8	0
3173495	0	9	0.000020166
482585	0	10	0
181682	1	11	0
3531210	1	12	0
3611706	1	13	0
1704941	0	14	1.73798E−06
4256106	1	15	0
4171064	1	16	9.35913E−08
2736705	0	17	1.41059E−07
2206475	1	18	3.12301E−07
370640	1	19	0
1559769	1	20	0
3494506	1	21	0
1225574	0	22	0
1694820	0	23	4.57783E−07
4165909	1	24	0
3546943	0	25	0
3319172	1	26	0
1699104	0	27	4.74411E−06
3399273	1	28	6.0661E−08
3840474	1	29	0
4148945	1	30	3.00829E−07
2748108	0	31	8.14399E−08
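The index formula of this example can be applied to the abundances listed in Table 9. The Python sketch below copies the 31 rows of Table 9 for sample V30, averages log10(abundance + 10^-20) over the 13 CRC-enriched markers and over the 18 control-enriched markers, and takes the difference. The function and variable names are illustrative; the result should come out close to the value of about 6.056 listed for V30 in Table 8.

```python
import math

# (gene id, enrichment flag, relative abundance) copied from Table 9.
# Flag 0 = CRC-enriched marker (set N), flag 1 = control-enriched marker (set M).
V30_MARKERS = [
    (2361423, 0, 2.24903e-05), (2040133, 0, 8.77418e-08), (3246804, 0, 0.0),
    (3319526, 1, 0.0), (3976414, 1, 0.0), (1696299, 0, 4.04178e-06),
    (2211919, 1, 7.89676e-07), (1804565, 1, 0.0), (3173495, 0, 0.000020166),
    (482585, 0, 0.0), (181682, 1, 0.0), (3531210, 1, 0.0),
    (3611706, 1, 0.0), (1704941, 0, 1.73798e-06), (4256106, 1, 0.0),
    (4171064, 1, 9.35913e-08), (2736705, 0, 1.41059e-07), (2206475, 1, 3.12301e-07),
    (370640, 1, 0.0), (1559769, 1, 0.0), (3494506, 1, 0.0),
    (1225574, 0, 0.0), (1694820, 0, 4.57783e-07), (4165909, 1, 0.0),
    (3546943, 0, 0.0), (3319172, 1, 0.0), (1699104, 0, 4.74411e-06),
    (3399273, 1, 6.0661e-08), (3840474, 1, 0.0), (4148945, 1, 3.00829e-07),
    (2748108, 0, 8.14399e-08),
]


def gut_index(markers, pseudocount=1e-20):
    """Mean log10 abundance of CRC-enriched markers minus that of control-enriched markers."""
    crc_logs = [math.log10(a + pseudocount) for _, flag, a in markers if flag == 0]
    control_logs = [math.log10(a + pseudocount) for _, flag, a in markers if flag == 1]
    return sum(crc_logs) / len(crc_logs) - sum(control_logs) / len(control_logs)


v30_index = gut_index(V30_MARKERS)
```

The pseudocount keeps the logarithm defined for markers with zero abundance, so each undetected marker contributes −20 to its group's sum.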

The inventors have therefore identified and validated a set of 31 markers that was determined using a minimum redundancy-maximum relevance (mRMR) feature selection method based on 140,455 CRC-associated markers. The inventors have also developed a gut healthy index to evaluate the risk of CRC disease based on these 31 gut microbial gene markers.

Example 3. Identifying Species Biomarkers from the 128 Chinese Individuals

Based on the sequencing reads of the 128 microbiomes from cohort I in Example 1, the inventors examined the taxonomic differences between control and CRC-associated microbiomes to identify microbial taxa contributing to the dysbiosis. For this, the inventors used taxonomic profiles derived from three different methods, as supporting evidence from multiple methods would strengthen an association. First, the inventors mapped metagenomic reads to 4650 microbial genomes in the IMG database (version 400) and estimated the abundance of microbial species included in that database (denoted IMG species). Second, the inventors estimated the abundance of species-level molecular operational taxonomic units (mOTUs) using universal phylogenetic marker genes. Third, the inventors organized the 140,455 genes identified by MGWAS into metagenomic linkage groups (MLGs) that represent clusters of genes originating from the same genome, and they annotated the MLGs at the species level using the IMG database whenever possible, grouped the MLGs based on these species annotations, and estimated the abundance of these species (denoted MLG species).
3.1 Species Annotation of IMG Genomes
For each IMG genome, using the NCBI taxonomy identifier provided by IMG, the inventors identified the corresponding NCBI taxonomic classification at the species and genus levels using NCBI taxonomy dump files. The genomes without corresponding NCBI species names were left with their original IMG names, most of which were unclassified.
3.2 Data Profile Construction
3.2.1 Gene Profiles
The inventors mapped their high-quality reads to a published reference gut gene catalog established from European and Chinese adults (identity >=90%), and the inventors then derived the gene profiles using the same method of Qin et al. 2012, supra.
3.2.2 mOTU Profile
Clean reads (high quality reads, as in Example 1) were aligned to the mOTU reference (79268 sequences total) with default parameters (S. Sunagawa et al. (2013), “Metagenomic species profiling using universal phylogenetic marker genes,” Nature methods, 10, 1196, incorporated herein by reference). 549 species-level mOTUs were identified, including 307 annotated species and 242 mOTU linkage groups without representative genomes, the latter of which were putatively Firmicutes or Bacteroidetes.
3.2.3 IMG-Species and IMG-Genus Profiles
Bacterial, archaeal and fungal sequences were extracted from the IMG v400 reference database (V. M. Markowitz et al. (2012), “IMG: the Integrated Microbial Genomes database and comparative analysis system,” Nucleic acids research, 40, D115, incorporated herein by reference) downloaded from http://ftp.jgi-psf.org. 522,093 sequences were obtained in total, and a SOAP reference index was constructed based on 7 equal-sized segments of the original file. Clean reads were aligned to the reference using a SOAP aligner (R. Li et al. (2009), “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, 25, 1966, incorporated herein by reference) version 2.22, with the parameters “-m 4 -s 32 -r 2 -n 100 -x 600 -v 8 -c 0.9 -p 3”. SOAP coverage software was then used to calculate the read coverage of each genome, normalized by genome length, and further normalized to the relative abundance for each individual sample. The profile was generated based on uniquely-mapped reads only.
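The two normalization steps described above (by genome length, then to relative abundance within a sample) can be sketched in Python as follows. The sketch starts from per-genome counts of uniquely mapped reads rather than from the output of the SOAP coverage software, and the data layout, function name and toy numbers are assumptions.

```python
def genome_relative_abundance(read_counts, genome_lengths):
    """Length-normalize per-genome read counts, then scale them to sum to 1.

    read_counts:    dict genome -> number of uniquely mapped reads in one sample
    genome_lengths: dict genome -> genome length in bp
    """
    coverage = {g: read_counts[g] / genome_lengths[g] for g in read_counts}
    total = sum(coverage.values())
    return {g: c / total for g, c in coverage.items()}


# Hypothetical sample: genome A is three times longer and collects three
# times more reads, so both genomes end up equally abundant.
toy_counts = {"genome_A": 3000, "genome_B": 1000}
toy_lengths = {"genome_A": 3_000_000, "genome_B": 1_000_000}
toy_abundance = genome_relative_abundance(toy_counts, toy_lengths)
```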
3.3 Identification of Colorectal Cancer-Associated MLG Species
Based on the profile of the 140,455 identified colorectal cancer-associated marker genes, the inventors constructed the colorectal cancer-associated MLGs using the method described in the previous type 2 diabetes study (Qin et al. 2012, supra). All of the genes were aligned to the reference genomes of the IMG database v400 to obtain genome-level annotation. An MLG was assigned to a genome if >50% of its constitutive genes were annotated to that genome; otherwise the MLG was labeled unclassified. A constitutive gene is a gene that is transcribed continually as opposed to a facultative gene, which is only transcribed when needed. A total of 87 MLGs with a gene number over 100 were selected as colorectal cancer-associated MLGs. These MLGs were grouped based on the species annotations of these genomes to construct MLG species.
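The >50% assignment rule can be illustrated with a short Python sketch. It assumes that each gene of an MLG carries either one genome annotation or none, which is a simplification; the function name and the toy annotations are hypothetical.

```python
from collections import Counter


def assign_mlg(gene_annotations, min_fraction=0.5):
    """Assign an MLG to the genome annotating more than min_fraction of its genes.

    gene_annotations: one entry per gene in the MLG, either a genome name
                      or None when the gene has no genome-level annotation.
    """
    counts = Counter(a for a in gene_annotations if a is not None)
    if not counts:
        return "unclassified"
    genome, hits = counts.most_common(1)[0]
    if hits / len(gene_annotations) > min_fraction:
        return genome
    return "unclassified"


# Hypothetical MLGs of ten genes each.
assigned = assign_mlg(["Parvimonas micra ATCC 33270"] * 7 + [None] * 3)
split_mlg = assign_mlg(["genome_X"] * 5 + ["genome_Y"] * 5)
```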
To estimate the relative abundance of an MLG species, the inventors estimated the average abundance of the genes of the MLG species, after removing the genes with the 5% lowest and 5% highest abundance. The relative abundance of the IMG species was estimated by summing the abundance of the IMG genomes belonging to that species.
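The trimmed average used for MLG species can be sketched in Python as below. How the 5% fraction is rounded to a whole number of genes is not stated in the specification; the sketch rounds down, which is an assumption, and the toy values are illustrative.

```python
def trimmed_mean_abundance(gene_abundances, trim_fraction=0.05):
    """Average gene abundance after dropping the lowest and highest trim_fraction of genes."""
    values = sorted(gene_abundances)
    k = int(len(values) * trim_fraction)  # genes dropped at each end (rounded down)
    trimmed = values[k:len(values) - k]
    return sum(trimmed) / len(trimmed)


# Hypothetical MLG species with 20 genes: one undetected gene and one outlier.
toy_genes = [0.0] + [1.0] * 18 + [100.0]
toy_mean = trimmed_mean_abundance(toy_genes)
```

Trimming both tails makes the estimate less sensitive to a few genes that are missing from a sample or shared with other, more abundant genomes.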
These analyses identified 30 IMG species, 21 mOTUs and 86 MLG species that were significantly associated with CRC status (Wilcoxon rank-sum test, q<0.05; see Tables 10, 11). Eubacterium ventriosum was consistently associated with or enriched in the control microbiomes using all three methods (Wilcoxon rank-sum tests, IMG: q=0.0414; mOTU: q=0.012757; MLG: q=5.446×10⁻⁴), and Eubacterium eligens was enriched according to two methods (Wilcoxon rank-sum tests, IMG: q=0.069; MLG: q=0.00031). Conversely, Parvimonas micra (q<1.80×10⁻⁵), Peptostreptococcus stomatis (q<1.80×10⁻⁵), Solobacterium moorei (q<0.004331) and Fusobacterium nucleatum (q<0.004565) were consistently associated with or enriched in CRC patient microbiomes using all three methods (FIG. 6, FIG. 7). P. stomatis has been associated with oral cancer, and S. moorei has been associated with bacteremia. Recent work using 16S rRNA sequencing has reported a significant enrichment of F. nucleatum in CRC tumor samples, and this bacterium has been shown to possess adhesive, invasive and pro-inflammatory properties. The inventors' results confirmed this association in a new cohort with different genetic and cultural origins. However, the highly significant enrichment in CRC-associated microbiomes of P. micra, an obligate anaerobic bacterium that, like F. nucleatum, can cause oral infections, is a novel finding. P. micra is involved in the etiology of periodontitis, and it produces a wide range of proteolytic enzymes and uses peptones and amino acids as an energy source. It is known to produce hydrogen sulphide, which promotes tumor growth and the proliferation of colon cancer cells. Further research is required to verify whether P. micra is involved in the pathogenesis of CRC, or if its enrichment is a result of CRC-associated changes in the colon and/or rectum. Nevertheless, it represents a potential biomarker for non-invasive diagnosis of CRC.
3.4 Species Marker Identification
In order to evaluate the predictive power of these taxonomic associations, the inventors used the random forest ensemble learning method (D. Knights, E. K. Costello, R. Knight (2011), “Supervised classification of human microbiota,” FEMS microbiology reviews, 35, 343, incorporated herein by reference) to identify key species markers in the species profiles from the three different methods.
3.4.1 MLG Species Marker Identification
Based on the 87 constructed MLGs with gene numbers over 100, the inventors performed the Wilcoxon rank-sum test on each MLG with Benjamini-Hochberg adjustment, and 86 MLGs were selected as colorectal cancer-associated MLGs with q<0.05. To identify MLG species markers, the inventors used the “randomForest” package (version 4.5-36) of R version 2.10 to analyze the 86 colorectal cancer-associated MLG species. First, the inventors sorted all of the 86 MLG species by the importance given by the “randomForest” method. MLG marker sets were constructed by creating incremental subsets of the top ranked MLG species, starting from 1 MLG species and ending at 86 MLG species.
For each MLG marker set, the inventors calculated the false prediction ratio in the cohort of 128 Chinese individuals (cohort I). Finally, the MLG species set with the lowest false prediction ratio was selected as the MLG species markers. Furthermore, the inventors drew the ROC curve using the probability of illness based on the selected MLG species markers.
3.4.2 IMG Species and mOTU Species Markers Identification
Based on the IMG species and mOTU species profiles, the inventors identified the colorectal cancer-associated IMG species and mOTU species with q<0.05 (Wilcoxon rank-sum test with Benjamini-Hochberg adjustment). Subsequently, the IMG species markers and the mOTU species markers were selected using the random forest approach as in the MLG species marker selection.
This analysis revealed that 16 IMG species, 10 species-level mOTUs and 21 MLG species were highly predictive of CRC status (Tables 12, 13), with a predictive power of 0.86, 0.90 and 0.94 in ROC analysis, respectively (FIG. 8). Parvimonas micra was identified as a key species from all three methods, and Fusobacterium nucleatum and Solobacterium moorei from two out of three methods, providing further statistical support for their association with CRC status.
3.5 MLG, IMG and mOTU Species Stage Enrichment Analysis
Encouraged by the consistent species associations with CRC status, and to take advantage of the records of disease stages of the CRC patients (Table 2), the inventors explored the species profiles for specific signatures identifying early stages of CRC. The inventors hypothesized that such an effort might even reveal stage-specific associations that are difficult to identify in a global analysis. To identify which species were associated with or enriched in the four colorectal cancer stages or in healthy controls, the inventors carried out a Kruskal-Wallis test for the MLG species with a gene number over 100, and for all of the IMG species and mOTU species with q<0.05 (Wilcoxon rank-sum test with Benjamini-Hochberg adjustment), to obtain the species enrichment information using the highest rank mean among the four CRC stages and the control. The inventors also compared the significance between every two groups by a pairwise Wilcoxon rank-sum test.
In Chinese cohort I, several species showed significantly different abundances in the different CRC stages. Among these, the inventors did not identify any species enriched in stage I compared to the other CRC stages and the control samples. Peptostreptococcus stomatis, Prevotella nigrescens and Clostridium symbiosum were enriched in stage II or later compared to the control samples, suggesting that they colonize the colon/rectum after the onset of CRC (FIG. 9). However, Fusobacterium nucleatum, Parvimonas micra, and Solobacterium moorei were enriched in all four stages compared to the control samples and were most abundant in stage II (FIG. 10), suggesting that they play a role in both CRC etiology and pathogenesis, and implicating them as potential biomarkers for early CRC.

Example 4. Validation of Markers by qPCR

The 31 gene biomarkers were derived using the admittedly expensive deep metagenome sequencing approach. Translating them into diagnostic biomarkers would require reliable detection using simpler and less expensive methods such as quantitative PCR (TaqMan probe-based qPCR). Primers and probes were designed using Primer Express v3.0 (Applied Biosystems, Foster City, Calif., USA). The qPCR was performed on an ABI7500 Real-Time PCR System using the TaqMan® Universal PCR Master Mix reagent (Applied Biosystems). Universal 16S rDNA was used as an internal control, and the abundance of each gene marker was expressed as a level relative to 16S rDNA.
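One common way to express a target relative to an internal control in qPCR is the delta-Ct calculation, sketched below in Python. The specification does not state which calculation was used, so the delta-Ct form, the assumption of perfect doubling per cycle, the function name and the Ct values are all assumptions made for illustration.

```python
def relative_level(ct_target, ct_reference):
    """Relative level of a target versus a reference by the delta-Ct method.

    Assumes 100% amplification efficiency, i.e. a doubling of product per cycle.
    """
    return 2.0 ** -(ct_target - ct_reference)


# Hypothetical Ct values: the marker crosses the threshold 10 cycles after 16S rDNA.
toy_level = relative_level(ct_target=28.0, ct_reference=18.0)
```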
To validate the test, the inventors selected two case-enriched gene markers (m482585 and m1704941) and measured their abundance by qPCR in a subset of 100 samples (55 cases and 45 controls). Quantification of each of the two genes using the two platforms (metagenomic sequencing and qPCR) showed strong correlations (Spearman r=0.93-0.95, FIG. 11), suggesting that the gene markers could also be reliably measured using qPCR.
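A Spearman correlation of the kind reported here can be computed by ranking both measurement series and taking the Pearson correlation of the ranks. The Python sketch below does this for a handful of made-up paired measurements; the values and function names are illustrative and are not data from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired measurements of one marker in five samples.
sequencing_abundance = [1e-7, 5e-7, 2e-6, 8e-6, 3e-5]
qpcr_level = [0.002, 0.001, 0.01, 0.05, 0.2]
toy_rho = spearman(sequencing_abundance, qpcr_level)
```

Because only ranks are compared, the two platforms do not need to report abundance on the same scale.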
Next, in order to validate the markers in previously unseen samples, the inventors measured the abundance of these two gene markers using qPCR in 164 fecal samples (51 cases and 113 controls) from an independent Chinese cohort (cohort II). Both case-enriched gene markers were significantly associated with CRC status, at significance levels of q=6.56×10⁻⁹ (m1704941, butyryl-CoA dehydrogenase from F. nucleatum) and q=0.0011 (m482585, RNA-directed DNA polymerase from an unknown microbe). The gene from F. nucleatum was present in only 4 out of 113 control microbiomes, suggesting a potential for developing specific diagnostic tests for CRC using fecal samples. The CRC index based on the combined qPCR abundance of the two case-enriched gene markers separated the CRC samples from control samples in cohort II (Wilcoxon rank-sum test, P=4.01×10⁻⁷; FIG. 12A). However, the moderate classification potential (inferred from area under the ROC curve of 0.73; FIG. 12B) using only these two genes suggested that additional biomarkers could improve the classification of CRC patient microbiomes.
Another gene from P. micra was the highly conserved rpoB gene (namely m1696299, with identity of 99.78%) encoding RNA polymerase subunit β, often used as a phylogenetic marker (F. D. Ciccarelli et al. (2006), “Toward automatic reconstruction of a highly resolved tree of life,” Science, 311, 1283, incorporated herein by reference). Since the inventors repeatedly identified P. micra as a novel biomarker for CRC using several strategies including species-agnostic procedures, the inventors performed an additional qPCR experiment for this marker gene on Chinese cohort II as described above and found a significant enrichment in CRC patient microbiomes (Wilcoxon rank-sum test, P=2.15×10⁻¹⁵). When the inventors combined this gene with the two qPCR-validated genes, the CRC index from these three genes clearly separated case from control samples in Chinese cohort II (Wilcoxon rank-sum test, P=5.76×10⁻¹³, FIG. 13A) and showed reliable classification potential with an improved area under the ROC curve of 0.84 (FIG. 13B). The abundance of rpoB from P. micra was significantly higher compared to control samples starting from CRC stage II (FIG. 13C), agreeing with the inventors' results from species abundance analysis, and providing further evidence that this gene could serve as a non-invasive biomarker for the identification of early stage CRC.
Sequence information for the primers and probes for the 3 selected gene markers:

Gene marker	Oligonucleotide	Sequence	SEQ ID NO:

1696299	Forward	AAGAATGGAGAGAGTTGTTAGAGAAAGAA	32
1696299	Reverse	TTGTGATAATTGTGAAGAACCGAAGA	33
1696299	Probe	AACTCAAGATCCAGACCTTGCTACGCCTCA	34
1704941	Forward	TTGTAAGTGCTGGTAAAGGGATTG	35
1704941	Reverse	CATTCCTACATAACGGTCAAGAGGTA	36
1704941	Probe	AGCTTCTATTGGTTCTTCTCGTCCAGTGGC	37
482585	Forward	AATGGGAATGGAGCGGATTC	38
482585	Reverse	CCTGCACCAGCTTATCGTCAA	39
482585	Probe	AAGCCTGCGGAACCACAGTTACCAGC	40

TABLE 5

The 31 gene markers identified by the mRMR feature selection method. Detailed information regarding their enrichment, occurrence in
colorectal cancer cases and controls, a statistical test of association, taxonomy and identity percentage are listed.

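Marker gene ID	Wilcoxon test P-value	Wilcoxon test q-value	Enrich	Occurrence in controls (n = 54): count	Occurrence in controls (n = 54): rate (%)	Occurrence in cases (n = 74): count	Occurrence in cases (n = 74): rate (%)	Blastn to IMG v400: identity	Blastn to IMG v400: taxonomy	Blastp to KEGG v59: description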

3546943	1.59E−06	1.90465E−06	Case	3	5.56	27	36.49	99.09	Bacteroides sp. 2_1_56FAA	zinc protease
1225574	1.47E−06	1.8957E−06	Case	0	0.00	13	17.57	88.88	Clostridium hathewayi DSM 13479	lactose/L-arabinose transport system substrate-binding protein
2736705	5.35E−07	8.4594E−07	Case	0	0.00	21	28.38	99.68	Clostridium hathewayi DSM 13479	NA
2748108	2.12E−07	4.38881E−07	Case	0	0.00	20	27.03	99.82	Clostridium hathewayi DSM 13479	RNA polymerase sigma-70 factor, ECF subfamily
2040133	7.46E−11	7.70506E−10	Case	7	12.96	44	59.46	99.4	Clostridium symbiosum WAL-14163	cobalt/nickel transport system permease protein
1694820	9.78E−08	2.52552E−07	Case	1	1.85	18	24.32	99.17	Fusobacterium sp. 7_1	V-type H+-transporting ATPase subunit K
1704941	1.16E−08	5.12764E−08	Case	1	1.85	21	28.38	99.13	Fusobacterium nucleatum vincentii ATCC 49256	butyryl-CoA dehydrogenase
482585	3.81E−09	2.36224E−08	Case	9	16.67	50	67.57	NA	NA	RNA-directed DNA polymerase
3246804	4.19E−08	1.44418E−07	Case	1	1.85	24	32.43	NA	NA	citrate-Mg2+:H+ or citrate-Ca2+:H+ symporter, CitMHS family
1696299	8.50E−10	6.58857E−09	Case	1	1.85	33	44.59	99.78	Parvimonas micra ATCC 33270	DNA-directed RNA polymerase subunit beta
1699104	1.00E−08	5.12764E−08	Case	1	1.85	31	41.89	98.08	Parvimonas micra ATCC 33270	glutamate decarboxylase
2361423	4.89E−13	1.51641E−11	Case	7	12.96	55	74.32	93.87	Peptostreptococcus anaerobius 653-L	transposase
3173495	1.14E−12	1.77065E−11	Case	4	7.41	44	59.46	93.98	Peptostreptococcus anaerobius 653-L	transposase
3494506	4.93E−06	5.27005E−06	Control	19	35.19	4	5.41	90.37	Burkholderiales bacterium 1_1_47	ribosomal small subunit pseudouridine synthase A
2211919	3.59E−08	1.3927E−07	Control	49	90.74	39	52.70	80.99	Coprobacillus sp. 8_2_54BFAA	NA
2206475	6.49E−07	9.58475E−07	Control	23	42.59	5	6.76	98.59	Eubacterium ventriosum ATCC 27560	beta-glucosidase
3976414	1.57E−07	3.48653E−07	Control	15	27.78	3	4.05	87.12	Faecalibacterium cf. prausnitzii KLE1255	adenosylcobinamide-phosphate synthase CobD
3319172	1.12E−07	2.666E−07	Control	19	35.19	2	2.70	84.22	Faecalibacterium prausnitzii A2-165	UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate--D-alanyl-D-alanine ligase
3319526	7.04E−08	1.98403E−07	Control	21	38.89	7	9.46	90.01	Faecalibacterium prausnitzii L2-6	replicative DNA helicase
4171064	4.69E−08	1.45363E−07	Control	29	53.70	10	13.51	94.94	Faecalibacterium prausnitzii L2-6	cytidine deaminase
370640	4.06E−06	4.49308E−06	Control	12	22.22	0	0.00	99.4	Bacteroides clarus YIT 12056	NA
1804565	7.31E−07	9.85539E−07	Control	16	29.63	1	1.35	NA	NA	branched-chain amino acid transport system ATP-binding protein
3399273	4.88E−07	8.40846E−07	Control	41	75.93	23	31.08	NA	NA	two-component system, LytT family, response regulator
3531210	9.76E−06	9.75675E−06	Control	8	14.81	0	0.00	NA	NA	GDP-L-fucose synthase
3611706	1.67E−06	1.91677E−06	Control	13	24.07	0	0.00	NA	NA	anti-repressor protein
3840474	9.76E−06	9.75675E−06	Control	6	11.11	0	0.00	NA	NA	NA
4148945	5.46E−07	8.4594E−07	Control	23	42.59	8	10.81	NA	NA	NA
4165909	1.60E−06	1.90465E−06	Control	8	14.81	0	0.00	NA	NA	N-acetylmuramoyl-L-alanine amidase
4256106	3.69E−07	6.72327E−07	Control	21	38.89	4	5.41	NA	NA	integrase/recombinase XerD
181682	6.97E−07	9.82079E−07	Control	27	50.00	8	10.81	99.25	Roseburia intestinalis L1-82	NA
1559769	2.83E−07	5.48673E−07	Control	17	31.48	5	6.76	88.65	Coprococcus catus GD/7	polar amino acid transport system substrate-binding protein
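
The rate columns of Table 5 appear to be the occurrence count divided by the cohort size (54 controls, 74 cases), expressed as a percentage. The short Python check below recomputes the rates for three rows under that reading; the function name `occurrence_rate` is illustrative and does not come from the source.

```python
def occurrence_rate(count, cohort_size):
    """Percentage of samples in a cohort in which a marker was detected."""
    return round(100.0 * count / cohort_size, 2)

# Cohort sizes as stated in the Table 5 header.
N_CONTROL, N_CASE = 54, 74

# (gene ID, control count, case count) copied from three Table 5 rows.
rows = [(3546943, 3, 27), (2040133, 7, 44), (2211919, 49, 39)]

# Map each gene ID to (control rate %, case rate %).
rates = {
    gene: (occurrence_rate(ctrl, N_CONTROL), occurrence_rate(case, N_CASE))
    for gene, ctrl, case in rows
}
```

The recomputed values agree with the rates printed in the table for these rows, which supports the count-over-cohort-size interpretation.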

TABLE 7

CRC index estimated in CRC, T2D and IBD patients and healthy cohorts.

Cohort/group	Median CRC index	P-value (comparison with CRC patients)	q-value (comparison with CRC patients)

CRC patients	6.420958803	NA	NA
CRC controls	−5.476945331	1.96E−21	2.44E−21
T2D patients	−0.108110996	1.33E−27	2.21E−27
T2D controls	−1.471692382	6.21E−31	3.11E−30
IBD patients	−2.214296342	2.38E−10	2.38E−10
IBD controls	−4.724156396	7.56E−29	1.89E−28

TABLE 10

IMG and mOTU species associated with CRC with q-value < 0.05

Species	Control rank mean	Case rank mean	Enrichment	P-value	q-value

30 IMG species

Peptostreptococcus stomatis	37.25926	84.37838	0	1.29E−12	3.34E−09
Parvimonas micra	38.43519	83.52027	0	1.13E−11	1.46E−08
Parvimonas sp. oral taxon 393	39.81481	82.51351	0	1.28E−10	1.10E−07
Parvimonas sp. oral taxon 110	43.52778	79.80405	0	4.71E−08	3.04E−05
Gemella morbillorum	43.87037	79.55405	0	7.77E−08	4.01E−05
Burkholderia mallei	45.19444	78.58784	0	4.84E−07	0.000156
Fusobacterium sp. oral taxon 370	45.02778	78.70946	0	3.93E−07	0.000156
Fusobacterium nucleatum	45.09259	78.66216	0	4.33E−07	0.000156
Leptotrichia buccalis	45.60185	78.29054	0	7.30E−07	0.000209
Beggiatoa sp. PS	46.53704	77.60811	0	2.79E−06	0.000601
Prevotella intermedia	46.47222	77.65541	0	2.67E−06	0.000601
Streptococcus dysgalactiae	47.06481	77.22297	0	3.09E−06	0.000613
Streptococcus pseudoporcinus	47.5	76.90541	0	8.58E−06	0.001581
Paracoccus denitrificans	47.48148	76.91892	0	9.35E−06	0.001608
Solobacterium moorei	47.66667	76.78378	0	1.17E−05	0.001884
Streptococcus constellatus	48.2037	76.39189	0	2.20E−05	0.003153
Crenothrix polyspora	48.76852	75.97973	0	4.20E−05	0.005697
Filifactor alocis	49.06481	75.76351	0	5.84E−05	0.007533
Sulfurovum sp. SCGC AAA036-O23	52.12037	73.53378	0	6.60E−05	0.008105
Clostridium hathewayi	49.68519	75.31081	0	0.000115	0.013431
Lachnospiraceae bacterium 5_1_57FAA	50.10185	75.00676	0	0.000178	0.019084
Peptostreptococcus anaerobius	50.14815	74.97297	0	0.000186	0.019221
Streptococcus equi	50.58333	74.65541	0	0.00029	0.027747
Streptococcus anginosus	50.66667	74.59459	0	0.000316	0.029114
Leptotrichia hofstadii	50.99074	74.35811	0	0.000342	0.030424
Peptoniphilus indolicus	51.2963	74.13514	0	0.000581	0.048307
Eubacterium ventriosum	80.98148	52.47297	1	1.77E−05	0.00269
Adhaeribacter aquaticus	77.06481	55.33108	1	0.000271	0.026839
Eubacterium eligens	77.90741	54.71622	1	0.000482	0.041404
Haemophilus sputorum	77.66667	54.89189	1	0.000608	0.048977

21 mOTU species

Parvimonas micra	46.2963	77.78378	0	4.11E−08	1.80E−05
Peptostreptococcus stomatis	46.25	77.81757	0	6.56E−08	1.80E−05
motu_linkage_group_731	50.42593	74.77027	0	1.08E−06	0.000198
Gemella morbillorum	47.93519	76.58784	0	1.57E−06	0.000215
Clostridium symbiosum	48.66667	76.05405	0	1.89E−05	0.00173
Solobacterium moorei	51.22222	74.18919	0	6.31E−05	0.004331
Fusobacterium nucleatum	54.62037	71.70946	0	9.15E−05	0.004565
unclassified Fusobacterium	54.22222	72	0	0.000176	0.00806
Clostridium ramosum	50.92593	74.40541	0	0.000289	0.012202
Clostridiales bacterium 1_7_47FAA	51.27778	74.14865	0	0.000365	0.013366
Bacteroides fragilis	51.09259	74.28378	0	0.00045	0.01371
motu_linkage_group_624	51.01852	74.33784	0	0.000448	0.01371
Clostridium bolteae	51.81481	73.75676	0	0.000952	0.026134
motu_linkage_group_407	81.13889	52.35811	1	6.00E−06	0.000659
motu_linkage_group_490	80.46296	52.85135	1	3.06E−05	0.002403
motu_linkage_group_316	79.61111	53.47297	1	8.17E−05	0.004487
motu_linkage_group_443	79.66667	53.43243	1	7.63E−05	0.004487
Eubacterium ventriosum	78.09259	54.58108	1	0.000325	0.012757
motu_linkage_group_510	77.84259	54.76351	1	0.000443	0.01371
motu_linkage_group_611	77.2963	55.16216	1	0.000606	0.017499
motu_linkage_group_190	75.16667	56.71622	1	0.001694	0.044273
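
The two rank-mean columns in Table 10 are consistent with a pooled ranking of all samples, as used in a Wilcoxon rank-sum test. The Python sketch below shows how such rank means could be computed with average ranks for ties, and then checks one identity against the first IMG row: if the cohort sizes are those of Table 5 (54 controls and 74 cases, an assumption for this table), the size-weighted mean of the two rank means must equal (128 + 1) / 2 = 64.5. The function name `rank_means` and the toy abundance values are hypothetical.

```python
def rank_means(control, case):
    """Mean pooled rank of each group, using average ranks for tied values."""
    pooled = sorted(control + case)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; assign their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    control_mean = sum(rank_of[v] for v in control) / len(control)
    case_mean = sum(rank_of[v] for v in case) / len(case)
    return control_mean, case_mean

# Toy abundances (not data from the source).
toy_control_mean, toy_case_mean = rank_means([0.0, 0.0, 1.0], [2.0, 3.0, 0.0])

# Consistency check on the Peptostreptococcus stomatis row of Table 10,
# assuming 54 controls and 74 cases as in Table 5.
n_control, n_case = 54, 74
pooled_mean = (n_control * 37.25926 + n_case * 84.37838) / (n_control + n_case)
```

Here `pooled_mean` comes out at 64.5 to within rounding of the printed values, so the row is compatible with a 54/74 split.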

TABLE 11

List of 86 MLG species formed after grouping MLGs with more than 100 genes using the species annotation when available.

MLG species	Control rank mean	Case rank mean	Enrichment	P-value	q-value

Parvimonas micra	38.40741	83.54054	0	3.16E−12	2.75E−10
Fusobacterium nucleatum	40.32407	82.14189	0	2.97E−11	1.29E−09
Solobacterium moorei	42.2037	80.77027	0	3.85E−09	1.12E−07
Clostridium symbiosum	46.31481	77.77027	0	1.64E−06	3.56E−05
CRC 2881	51.25926	74.16216	0	2.57E−06	4.46E−05
Clostridium hathewayi	46.77778	77.43243	0	3.92E−06	5.69E−05
CRC 6481	52.09259	73.55405	0	1.36E−05	0.000107
Clostridium clostridioforme	50.2037	74.93243	0	1.27E−05	0.000107
Clostridiales bacterium 1_7_47FAA	48.16667	76.41892	0	2.02E−05	0.000135
Clostridium sp. HGF2	48.27778	76.33784	0	2.36E−05	0.000147
CRC 2794	51.03704	74.32432	0	3.50E−05	0.000179
CRC 4136	50.99074	74.35811	0	5.22E−05	0.000233
Bacteroides fragilis	49.09259	75.74324	0	5.97E−05	0.000236
Lachnospiraceae bacterium 5_1_57FAA	49.96296	75.10811	0	7.37E−05	0.000273
Desulfovibrio sp. 6_1_46AFAA	53.33333	72.64865	0	0.000214	0.000546
Coprobacillus sp. 3_3_56FAA	50.53704	74.68919	0	0.000265	0.000623
Cloacibacillus evryensis	52.73148	73.08784	0	0.000359	0.000801
CRC 2867	52.31481	73.39189	0	0.000552	0.001162
Fusobacterium varium	54.57407	71.74324	0	0.000586	0.001186
Clostridium bolteae	51.39815	74.06081	0	0.000647	0.001223
Subdoligranulum sp. 4_3_54A2FAA	51.56481	73.93919	0	0.000758	0.001373
Clostridium citroniae	51.71296	73.83108	0	0.000861	0.001529
Lachnospiraceae bacterium 8_1_57FAA	51.88889	73.7027	0	0.001024	0.001782
Streptococcus equinus	54.52778	71.77703	0	0.001581	0.002457
CRC 4069	53.7963	72.31081	0	0.001632	0.00249
Lachnospiraceae bacterium 3_1_46FAA	52.53704	73.22973	0	0.00178	0.002612
Dorea formicigenerans	52.98148	72.90541	0	0.002703	0.003409
Synergistes sp. 3_1 syn1	54.37963	71.88514	0	0.003358	0.004002
Lachnospiraceae bacterium 3_1_57FAA_CT1	54.07407	72.10811	0	0.004478	0.005109
CRC 3579	54.05556	72.12162	0	0.005638	0.006289
Alistipes indistinctus	54.50926	71.79054	0	0.008262	0.008766
Con 10180	82.03704	51.7027	1	4.87E−06	6.05E−05
Coprococcus sp. ART55/1	80.85185	52.56757	1	8.22E−06	8.94E−05
Con 7958	75.27778	56.63514	1	1.36E−05	0.000107
butyrate-producing bacterium SS3/4	80.57407	52.77027	1	1.98E−05	0.000135
Haemophilus parainfluenzae	80.49074	52.83108	1	2.54E−05	0.000148
Con 154	80.35185	52.93243	1	3.30E−05	0.000179
Con 4595	77.21296	55.22297	1	4.17E−05	0.000202
Con 1617	76.12963	56.01351	1	5.61E−05	0.000233
Con 1979	79.94444	53.22973	1	5.62E−05	0.000233
Con 1371	78.46296	54.31081	1	7.54E−05	0.000273
Con 1529	75.05556	56.7973	1	9.25E−05	0.00031
Eubacterium eligens	79.53704	53.52703	1	9.03E−05	0.00031
Con 1987	79.42593	53.60811	1	0.000101	0.000324
Con 5770	79.39815	53.62838	1	0.000104	0.000324
Con 1197	75.42593	56.52703	1	0.000128	0.000383
Con 4699	78.78704	54.07432	1	0.000152	0.000441
Clostridium sp. L2-50	76.37963	55.83108	1	0.000167	0.000469
Con 2606	77.5	55.01351	1	0.000189	0.000514
Eubacterium ventriosum	78.62963	54.18919	1	0.000207	0.000545
Bacteroides clarus	75.55556	56.43243	1	0.000247	0.000597
Eubacterium biforme	74.68519	57.06757	1	0.000247	0.000597
Faecalibacterium prausnitzii	78.25926	54.45946	1	0.00034	0.000779
Con 563	72.7037	58.51351	1	0.000556	0.001162
Con 6037	77.5463	54.97973	1	0.000561	0.001162
Con 8757	77.17593	55.25	1	0.000634	0.001223
Ruminococcus obeum	77.53704	54.98649	1	0.000629	0.001223
Con 1513	76.59259	55.67568	1	0.000701	0.001298
Roseburia intestinalis	76.99074	55.38514	1	0.001079	0.001841
Ruminococcus torques	76.92593	55.43243	1	0.001186	0.001984
Con 4829	76.7963	55.52703	1	0.001335	0.002151
Con 569	73.41667	57.99324	1	0.001334	0.002151
Con 10559	76.59259	55.67568	1	0.001561	0.002457
Con 1604	71.92593	59.08108	1	0.001781	0.002612
Con 2494	74.35185	57.31081	1	0.001802	0.002612
Con 1867	76.38889	55.82432	1	0.001908	0.002722
Con 1241	76.27778	55.90541	1	0.002132	0.00294
Con 5752	73.65741	57.81757	1	0.002163	0.00294
Con 7367	76.23148	55.93919	1	0.002112	0.00294
Con 6128	76.22222	55.94595	1	0.002274	0.003043
Con 5615	76.07407	56.05405	1	0.002372	0.003104
Klebsiella pneumoniae	74.7037	57.05405	1	0.00239	0.003104
Con 4909	75.72222	56.31081	1	0.002685	0.003409
Con 356	75.94444	56.14865	1	0.002808	0.00349
Eubacterium rectale	75.90741	56.17568	1	0.002953	0.003619
Con 6068	75.74074	56.2973	1	0.003338	0.004002
Con 4295	74.98148	56.85135	1	0.004171	0.004904
Con 2703	74.55556	57.16216	1	0.00437	0.005069
Con 2503	74.14815	57.45946	1	0.004522	0.005109
Con 631	70.01852	60.47297	1	0.006178	0.006804
Con 561	70.5	60.12162	1	0.008137	0.00874
Con 8420	72.64815	58.55405	1	0.008068	0.00874
Con 425	73.19444	58.15541	1	0.008397	0.008802
Con 7993	73.74074	57.75676	1	0.009358	0.009692
Burkholderiales bacterium 1_1_47	72.37963	58.75	1	0.009707	0.009935
Con 600	69.53704	60.82432	1	0.026354	0.02666

TABLE 12

IMG and mOTU species markers. The markers were identified using the random forest method among species associated with CRC and are listed by their importance as reported by the method.

Species	Control rank mean	Case rank mean	Enrichment	P-value	q-value

16 IMG species markers

Peptostreptococcus stomatis	37.25926	84.37838	0	1.29E−12	3.34E−09
Parvimonas micra	38.43519	83.52027	0	1.13E−11	1.46E−08
Parvimonas sp. oral taxon 393	39.81481	82.51351	0	1.28E−10	1.10E−07
Parvimonas sp. oral taxon 110	43.52778	79.80405	0	4.71E−08	3.04E−05
Gemella morbillorum	43.87037	79.55405	0	7.77E−08	4.01E−05
Fusobacterium sp. oral taxon 370	45.02778	78.70946	0	3.93E−07	1.56E−04
Burkholderia mallei	45.19444	78.58784	0	4.84E−07	1.56E−04
Fusobacterium nucleatum	45.09259	78.66216	0	4.33E−07	1.56E−04
Leptotrichia buccalis	45.60185	78.29054	0	7.30E−07	2.09E−04
Prevotella intermedia	46.47222	77.65541	0	2.67E−06	6.01E−04
Beggiatoa sp. PS	46.53704	77.60811	0	2.79E−06	6.01E−04
Crenothrix polyspora	48.76852	75.97973	0	4.20E−05	5.70E−03
Clostridium hathewayi	49.68519	75.31081	0	1.15E−04	1.34E−02
Lachnospiraceae bacterium 5_1_57FAA	50.10185	75.00676	0	1.78E−04	1.91E−02
Eubacterium ventriosum	80.98148	52.47297	1	1.77E−05	2.69E−03
Haemophilus sputorum	77.66667	54.89189	1	6.08E−04	4.90E−02

10 mOTU species markers

Peptostreptococcus stomatis	46.25	77.81757	0	6.56E−08	1.80E−05
Parvimonas micra	46.2963	77.78378	0	4.11E−08	1.80E−05
Gemella morbillorum	47.93519	76.58784	0	1.57E−06	0.000215
Solobacterium moorei	51.22222	74.18919	0	6.31E−05	0.004331
unclassified Fusobacterium	54.22222	72	0	0.000176	0.00806
Clostridiales bacterium 1_7_47FAA	51.27778	74.14865	0	0.000365	0.013366
motu_linkage_group_624	51.01852	74.33784	0	0.000448	0.01371
motu_linkage_group_407	81.13889	52.35811	1	6.00E−06	0.000659
motu_linkage_group_490	80.46296	52.85135	1	3.06E−05	0.002403
motu_linkage_group_316	79.61111	53.47297	1	8.17E−05	0.004487

TABLE 13

21 MLG species markers identified using the random forest method from 106 MLGs with a gene number over 100.

MLG species	Control rank mean	Case rank mean	Enrichment	P-value	q-value

21 MLG species markers

Parvimonas micra	38.40741	83.54054	0	3.16E−12	2.75E−10
Fusobacterium nucleatum	40.32407	82.14189	0	2.97E−11	1.29E−09
Solobacterium moorei	42.2037	80.77027	0	3.85E−09	1.12E−07
CRC 2881	51.25926	74.16216	0	2.57E−06	4.46E−05
Clostridium hathewayi	46.77778	77.43243	0	3.92E−06	5.69E−05
CRC 6481	52.09259	73.55405	0	1.36E−05	0.000107
Clostridiales bacterium 1_7_47FAA	48.16667	76.41892	0	2.02E−05	0.000135
Clostridium sp. HGF2	48.27778	76.33784	0	2.36E−05	0.000147
CRC 4136	50.99074	74.35811	0	5.22E−05	0.000233
Bacteroides fragilis	49.09259	75.74324	0	5.97E−05	0.000236
Clostridium citroniae	51.71296	73.83108	0	0.000861	0.001529
Lachnospiraceae bacterium 8_1_57FAA	51.88889	73.7027	0	0.001024	0.001782
Dorea formicigenerans	52.98148	72.90541	0	0.002703	0.003409
Con 10180	82.03704	51.7027	1	4.87E−06	6.05E−05
Con 7958	75.27778	56.63514	1	1.36E−05	0.000107
butyrate-producing bacterium SS3/4	80.57407	52.77027	1	1.98E−05	0.000135
Haemophilus parainfluenzae	80.49074	52.83108	1	2.54E−05	0.000148
Con 154	80.35185	52.93243	1	3.30E−05	0.000179
Con 1979	79.94444	53.22973	1	5.62E−05	0.000233
Con 5770	79.39815	53.62838	1	0.000104	0.000324
Con 1513	76.59259	55.67568	1	0.000701	0.001298

Although explanatory embodiments have been shown and described, those skilled in the art will appreciate that the above embodiments cannot be construed to limit the present disclosure, and that changes, alternatives, and modifications can be made to the embodiments without departing from the nature, principles and scope of the present disclosure.

Column key for Tables 10 to 13: Control rank mean; Case rank mean; Enrichment (1: Control; rows coded 0 have the higher case rank mean).

What is claimed is:

1. A method, comprising:

1) obtaining sequencing reads from sample j of a subject, wherein the sample j comprises microbiota;

2) mapping the sequencing reads to a gene catalog and deriving a gene profile from the mapping result;

3) determining the relative abundance of each gene marker in a set of gene markers comprising at least three genes having the nucleotide sequences of SEQ ID NO: 10, SEQ ID NO: 14 and SEQ ID NO: 6; and

4) calculating an index of sample j using the following formula:

I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right],

wherein:

A_{ij} is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in the gene marker set,

N is a subset of all of the abnormal-associated gene markers in selected biomarkers related to the abnormal condition,

M is a subset of all of the control-associated gene markers in selected biomarkers related to the abnormal condition, and

|N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively,

5) identifying the subject as having or being at a risk of developing the abnormal condition when the index is greater than a cutoff, and

6) modifying the risk by altering metabolic, immunological, or developmental pathways in the subject.
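
The index of step 4) is the mean of log10(A_ij + 10^-20) over the abnormal-associated markers N minus the same mean over the control-associated markers M. A minimal Python sketch of that calculation follows; the function name `marker_index`, the marker labels and the abundance values are all hypothetical and are not taken from the source.

```python
import math

def marker_index(abundance, case_markers, control_markers, pseudo=1e-20):
    """Mean log10 abundance over N minus mean log10 abundance over M.

    abundance: dict mapping marker ID to relative abundance in one sample.
    Markers absent from the dict are treated as abundance 0, so only the
    pseudo-count of 10^-20 from the formula contributes for them.
    """
    def mean_log(markers):
        total = sum(math.log10(abundance.get(m, 0.0) + pseudo) for m in markers)
        return total / len(markers)

    return mean_log(case_markers) - mean_log(control_markers)

# Hypothetical relative abundances for a single sample j.
sample = {"case_a": 1e-6, "case_b": 1e-8, "control_a": 1e-7, "control_b": 0.0}
index = marker_index(sample, ["case_a", "case_b"], ["control_a", "control_b"])
```

In this toy sample the N term averages to about -7 and the M term to about -13.5, giving an index of about 6.5. The pseudo-count keeps the logarithm finite for undetected markers, at the cost of each undetected marker contributing -20 to its sum.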

2. The method of claim 1, wherein the method further comprises estimating the false discovery rate (FDR).
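
Claim 2 does not name the procedure used to estimate the FDR. The Python sketch below assumes the Benjamini-Hochberg step-up adjustment, which is only one possible choice and may differ from the estimator used in the source; the function name `benjamini_hochberg` and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values for four markers.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Markers whose adjusted value falls below a chosen level (for example 0.05) would then be retained.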

3. The method of claim 1, wherein the gene catalog is a non-redundant gene set constructed for related microbiota, and the set of gene markers further comprises one or more genes having the nucleotide sequences of SEQ ID NOs: 1 to 5, SEQ ID NOs: 7 to 9, SEQ ID NOs: 11 to 13, and SEQ ID NOs: 15 to 31.

4. The method of claim 1, wherein the abnormal condition related to microbiota is an abnormal condition related to environmental microbiota.

5. The method of claim 1, wherein the abnormal condition related to microbiota is a disease related to microbiota present in the animal body or the human body, wherein the microbiota is selected from the group consisting of microbiota found in the gastrointestinal tract, nasal passages, oral cavities, skin and the urogenital tract.

6. The method of claim 1, wherein the abnormal condition related to microbiota is a colorectal disease selected from the group consisting of Colorectal Cancer, Ulcerative Colitis, Crohn's Disease, Irritable Bowel Syndrome (IBS), Diverticular Disease, Hemorrhoids, Anal Fissure, and Bowel Incontinence.

7. The method of claim 1, wherein the sequencing reads are obtained via a next-generation sequencing method or a next-next-generation sequencing method.

8. The method of claim 1, wherein the cutoff value is obtained by a Receiver Operating Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.
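
Claim 8 does not spell out how a single cutoff is read off the ROC curve. One common convention, assumed here and not taken from the source, is to choose the threshold that maximizes Youden's J (sensitivity + specificity - 1). The Python sketch below applies that rule to hypothetical index values and labels, predicting the abnormal condition when the index is greater than the cutoff, as in step 5) of claim 1.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    labels: 1 for case, 0 for control. A sample is called a case when
    its score is strictly greater than the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical index values and labels (not data from the source).
scores = [-5.0, -3.0, -1.0, 0.5, 2.0, 6.0]
labels = [0, 0, 0, 1, 0, 1]
cutoff, sensitivity, specificity = best_cutoff(scores, labels)
```

Candidate cutoffs are restricted to observed scores, which is enough to find the maximum of J on a finite sample; ties are resolved in favor of the lowest cutoff.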

9. The method of claim 1, wherein the sample is a feces sample, a nasal cavity swab, a buccal swab, a skin swab or a vaginal swab.

10. The method of claim 1, wherein the sequencing reads are obtained via steps comprising:

1) collecting the sample j from the subject;

2) extracting DNA from the sample;

3) constructing a DNA library; and

4) sequencing the library.

11. A method, comprising:

1) obtaining sequencing reads from sample j of the subject, wherein the sample j comprises microbiota;

2) mapping the sequencing reads to a human gut gene catalog and deriving a gene profile from the mapping result;

3) determining the relative abundance of each of the gene markers listed in SEQ ID NOs: 1-31; and

4) calculating an index of sample j using the following formula:

I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right],

wherein:

A_{ij} is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers listed in SEQ ID NOs: 1-31,

N is a subset of all of colorectal cancer (CRC)-associated gene markers and M is a subset of all of the control-associated gene markers,

wherein the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and

5) identifying the subject as having or being at a risk of developing CRC when the index is greater than a cutoff, and

12. The method of claim 11, wherein the cutoff value is obtained by a Receiver Operating Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.

13. The method of claim 12, wherein the value of the cutoff is −0.0575.

14. The method of claim 11, wherein the sequencing reads are obtained via steps comprising:

1) collecting the sample j from the subject;

2) extracting DNA from the sample;

3) constructing a DNA library; and

4) sequencing the library.

15. A method of diagnosing whether a subject has colorectal cancer (CRC) or is at risk of developing CRC, comprising:

1) obtaining a feces sample j from the subject;

2) measuring the abundance information of each marker gene in a gene marker set comprising at least two genes selected from the group consisting of SEQ ID NOs: 1 to 31 in sample j using quantitative PCR;

3) calculating an index of sample j using the following formula:

I_{j} = \left[ \frac{\sum_{i \in N} \log_{10} (A_{ij} + 10^{-20})}{|N|} - \frac{\sum_{i \in M} \log_{10} (A_{ij} + 10^{-20})}{|M|} \right]

wherein:

A_{ij} is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers of the gene marker set,

4) identifying the subject as having or being at a risk of developing CRC when the index is greater than a cutoff, and

5) modifying the risk by altering metabolic, immunological, or developmental pathways in the subject.

16. The method of claim 15, wherein the cutoff value is obtained by a Receiver Operating Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.

17. The method of claim 16, wherein the value of the cutoff is −0.0575.

18. The method of claim 15, wherein the gene marker set comprises at least three of the genes in SEQ ID NOs: 1-31.

19. The method of claim 15, wherein the gene marker set comprises at least four of the genes in SEQ ID NOs: 1-31.

20. The method of claim 15, wherein the gene marker set comprises the genes in SEQ ID NOs: 1-31.