CN108283012B - Methods and systems for microbiome-derived diagnosis and treatment of autoimmune system conditions

Info

Publication number: CN108283012B
Application number: CN201680032486.4A
Authority: CN
Inventors: 扎迦利·阿普特; 丹尼尔·阿尔莫纳西德; 杰西卡·里奇曼; 斯亚沃什·瑞兹万·贝赫巴哈尼
Original assignee: Prosomegen
Current assignee: Macrogenics Inc
Priority date: 2015-04-13
Filing date: 2016-04-13
Publication date: 2021-11-23
Anticipated expiration: 2036-04-13
Also published as: AU2022200242A1; WO2016168364A1; EP3283884A1; CN108283012A; AU2016248063A1; EP3283884A4

Abstract

A method for at least one of characterizing, diagnosing, and treating an autoimmune disorder in at least a subject, the method comprising: receiving an aggregate set of biological samples from a population of subjects; generating at least one of a microbiome composition dataset and a microbiome functional diversity dataset for a population of subjects; generating a characterization of the autoimmune condition based on features extracted from at least one of the microbiome composition dataset and the microbiome functional diversity dataset; generating a therapy model configured to correct the autoimmune condition based on the characterization; and at an output device associated with the subject, administering a therapy to the subject based on the characterization and the therapy model.

Description

Methods and systems for microbiome-derived diagnosis and treatment of autoimmune system conditions

Cross Reference to Related Applications

The present application claims the benefit of U.S. provisional application serial No. 62/146,818 filed on April 13, 2015, U.S. provisional application serial No. 62/146,846 filed on April 13, 2015, U.S. provisional application serial No. 62/147,287 filed on April 14, 2015, U.S. provisional application serial No. 62/147,324 filed on April 14, 2015, U.S. provisional application serial No. 62/147,328 filed on April 14, 2015, U.S. provisional application serial No. 62/147,334 filed on April 14, 2015, U.S. provisional application serial No. 62/147,345 filed on April 14, 2015, and U.S. provisional application serial No. 62/147,348 filed on April 14, 2015, each of which is incorporated herein by this reference in its entirety.

Technical Field

The present invention relates generally to the field of autoimmune diseases, and more specifically to new and useful methods and systems for microbiome-derived diagnosis and treatment in the field of autoimmune diseases.

Background

The microbiome is an ecological community of commensal, symbiotic, and pathogenic microorganisms associated with an organism. The human microbiome contains more microbial cells than there are human cells in the entire body, but characterization of the human microbiome is still in its infancy due to limitations in sample processing techniques, genetic analysis techniques, and the resources needed to process large amounts of data. Nevertheless, the microbiome is suspected to play at least a partial role in many health- and disease-related states (e.g., preparation for childbirth, gastrointestinal disorders, etc.).

In view of the profound impact of the microbiome on the health of a subject, efforts should be directed toward characterizing the microbiome, generating insights from the characterization, and generating treatments configured to correct the disordered state. However, current methods and systems for analyzing human microbiomes and providing therapeutic measures based on the insights obtained leave many questions unanswered. In particular, methods for characterizing certain health conditions, and therapies (e.g., probiotic therapies) tailored to a particular subject, have not been feasible due to limitations of current technology.

Thus, there is a need in the field of microbiology for new and useful methods and systems for characterizing autoimmune diseases in an individualized and population-wide manner. The present invention provides such a new and useful method and system.

Brief Description of Drawings

FIG. 1A is a flow chart of an embodiment of a method for characterizing a microbiome-derived condition and identifying a therapeutic measure;

FIG. 1B is a flow diagram of an embodiment of a method for generating a microbiome-derived diagnosis;

FIG. 2 illustrates an embodiment of a method and system for generating microbiome-derived diagnostics and therapy;

FIG. 3 shows a variation of a portion of an embodiment of a method for generating microbiome-derived diagnostics and therapy;

FIG. 4 shows a variation of a method for generating a model in an embodiment of a method and system for generating microbiome-derived diagnostics and therapy;

FIG. 5 shows a variation of the mechanism by which probiotic-based therapies act in an embodiment of the method for characterizing a health condition; and

FIG. 6 shows an example of therapy-related notification provision in an example of a method for generating microbiome-derived diagnostics and therapy.

Description of the embodiments

The following description of the embodiments of the invention is not intended to limit the invention to these embodiments, but is intended to enable any person skilled in the art to make and use the invention.

1. Methods for characterizing microbiome-derived conditions and identifying therapeutic measures

As shown in fig. 1A, a first method 100 for diagnosing and treating an autoimmune condition includes: receiving an aggregate set of samples from a population of subjects S110; characterizing a microbiome composition and/or functional features of each of the aggregate set of samples associated with the population of subjects, thereby generating at least one of a microbiome composition dataset and a microbiome functional diversity dataset for the population of subjects S120; receiving a supplemental dataset associated with at least a subset of the population of subjects, wherein the supplemental dataset is informative of characteristics associated with the autoimmune condition S130; and transforming the supplemental dataset and features extracted from at least one of the microbiome composition dataset and the microbiome functional diversity dataset into a characterization model of the autoimmune condition S140. In some variations, the first method 100 may further include: based on the characterization, generating a therapy model configured to improve a state of the autoimmune condition S150.

The first method 100 is used to generate models that can be used to characterize and/or diagnose a subject (e.g., as a clinical diagnostic, as a companion diagnostic, etc.) based on at least one of the microbiome composition and functional features of the subject, and to provide therapeutic measures (e.g., probiotic-based therapeutic measures, phage-based therapeutic measures, small molecule-based therapeutic measures, prebiotic-based therapeutic measures, clinical measures, etc.) to the subject based on microbiome analysis of a population of subjects. Thus, data from a population of subjects can be used to characterize subjects according to their microbiome composition and/or functional features, to indicate states of health and areas of improvement based on the characterization, and to promote one or more therapies that can shift the composition of a subject's microbiome toward one or more of a set of desired equilibrium states.

In variations, the method 100 may be used to promote targeted therapies to subjects suffering from an autoimmune condition, disorder, or adverse state, wherein the autoimmune condition produces systemic effects in one or more of: immune response, respiratory function, musculoskeletal function, gastrointestinal function, circulatory function, endocrine function, and any other suitable physiological or behavioral function. In these variations, diagnostics associated with autoimmune conditions are typically assessed using one or more of: blood testing, spirometry, imaging-based methods, endoscopy, biopsy, and any other standard method. In particular examples, the method 100 may be used for characterization of and/or therapeutic intervention in one or more of: acquired immunodeficiency syndrome (AIDS), asthma, rheumatoid arthritis, sprue, Sjögren's syndrome, multiple sclerosis, type I diabetes, and systemic lupus erythematosus. As such, the method 100 can be used to characterize autoimmune conditions, disorders, and/or adverse states in an entirely non-typical manner. In particular, the inventors propose that characterization of an individual's microbiome may be useful for predicting the likelihood of an autoimmune condition occurring in a subject. Such characterization may also be useful for screening for autoimmune conditions and/or determining a course of treatment for individuals with autoimmune conditions. For example, by deep sequencing bacterial DNA from diseased and healthy subjects, the inventors propose that features associated with certain microbiome composition and/or functional features (e.g., the amount of certain bacteria and/or bacterial sequences corresponding to certain genetic pathways) can be used to predict the presence or absence of an autoimmune condition. In certain instances, the bacteria and genetic pathways are present at certain abundances in individuals with the various autoimmune conditions discussed in more detail below, while they are present at statistically different abundances in individuals without autoimmune conditions.
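The notion of a feature being present at "statistically different abundances" in two groups can be illustrated with a small sketch. The source does not name a statistical test; the exact permutation test below is a swapped-in illustration only, and the function name and the abundance values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Illustrative only: the source does not specify which test is used.
    Enumerates every way of splitting the pooled values into two groups
    of the original sizes, so it is practical only for small groups.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical relative abundances of one taxon in two small groups.
with_condition = [0.30, 0.35, 0.40]
without_condition = [0.10, 0.12, 0.15]
p_value = exact_permutation_p_value(with_condition, without_condition)
print(f"p = {p_value:.2f}")
```

With only three subjects per group there are 20 possible splits, so the smallest attainable two-sided p-value is 0.1; larger cohorts are needed before such a comparison becomes informative.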

Thus, in some embodiments, the output of the first method 100 can be used to generate a diagnosis and/or provide a therapeutic measure for a subject based on analysis of the microbiome composition of the subject and/or the functional characteristics of the microbiome of the subject. Thus, as shown in fig. 1B, a second method 200 derived from at least one output of the first method 100 may include: receiving a biological sample from a subject S210; characterizing the subject as having a form of autoimmune condition based on processing a microbiome dataset derived from the biological sample S220; and promoting therapy to the subject having the autoimmune condition based on the characterization and the therapy model S230. Variations of the method 100 may also facilitate monitoring and/or adjusting the therapy provided to the subject, for example, by receiving, processing, and analyzing additional samples from the subject throughout the course of the therapy. Embodiments, variations, and examples of the second method 200 are described in more detail below.

Accordingly, the methods 100, 200 are used to generate models that can be used to classify individuals and/or provide therapeutic measures (e.g., therapy recommendations, therapies, therapy regimens, etc.) to individuals based on microbiome analysis of a population of individuals. Thus, data from a population of individuals may be used to generate models that may classify individuals according to their microbiome composition (e.g., as diagnostic measures), indicate aspects of health and improvement based on classification, and/or provide therapeutic measures that may advance the composition of an individual microbiome toward one or more of a set of improved equilibrium states. Variations of the second method 200 may also facilitate monitoring and/or adjusting the therapy provided to an individual, for example, by receiving, processing, and analyzing additional samples from the individual throughout the course of therapy.

In one application, at least one of the methods 100, 200 is implemented at least in part in a system 300 as shown in fig. 2, the system 300 receiving a biological sample derived from a subject (or an environment associated with the subject) by way of a sample reception kit, and processing the biological sample at a processing system that implements a characterization method and a therapy model configured to positively affect the microbial distribution in the subject (e.g., human, non-human animal, environmental ecosystem, etc.). In a variation of this application, the processing system may be configured to generate and/or refine the characterization method and the therapy model based on sample data received from a population of subjects. However, the method 100 may alternatively be implemented using any other suitable system configured to receive and process microbiome-related data of subjects, aggregated with other information, to generate models for microbiome-derived diagnostics and related therapies. Thus, the method 100 can be implemented for a population of subjects (e.g., including the subject, excluding the subject), where the population of subjects can include patients that are dissimilar and/or similar to the subject (e.g., in terms of health status, dietary needs, demographic characteristics, etc.). Thus, due to the aggregation of data from a population of subjects, information derived from the population can be used to provide additional insight into the relationship between the behaviors of a subject and their effects on the subject's microbiome.

Thus, the methods 100, 200 can be implemented for a population of subjects (e.g., including the subject, excluding the subject), where the population of subjects can include subjects that are dissimilar and/or similar to the subject (e.g., in terms of health status, dietary needs, demographic characteristics, etc.). Thus, due to the aggregation of data from a population of subjects, information derived from the population can be used to provide additional insight into the relationship between the behaviors of a subject and their effects on the subject's microbiome.

1.1 First method: sample processing

Block S110 recites: receiving an aggregate set of biological samples from a population of subjects, which functions to enable generation of data from which models for characterizing subjects and/or providing therapeutic measures to subjects can be generated. In Block S110, biological samples are preferably received from subjects of the population in a non-invasive manner. In variations, non-invasive manners of sample reception may use one or more of the following: a permeable substrate (e.g., a swab configured to wipe a region of a subject's body, toilet paper, a sponge, etc.), an impermeable substrate (e.g., a slide, tape, etc.), a container (e.g., a vial, a tube, a bag, etc.) configured to receive a sample from a region of a subject's body, and any other suitable sample-reception element. In particular examples, samples can be collected in a non-invasive manner (e.g., using swabs and vials) from one or more of the nose, skin, genitalia, mouth, and gut of a subject. However, one or more biological samples of the set of biological samples may additionally or alternatively be received semi-invasively or invasively. In variations, invasive manners of sample reception may use any one or more of: a needle, a syringe, a biopsy element, a lancet, and any other suitable instrument for collecting a sample in a semi-invasive or invasive manner. In particular examples, samples may comprise blood samples, plasma/serum samples (e.g., to enable extraction of cell-free DNA), and tissue samples.

In the above variations and examples, a sample may be obtained from the body of the subject without the assistance of another entity (e.g., a caregiver associated with the subject, a health care professional, an automated or semi-automated sample collection device, etc.), or may alternatively be obtained from the body of the subject with the assistance of another entity. In one example, where a sample is obtained from a subject's body without the aid of another entity during the sample extraction procedure, a sample-provision kit may be provided to the subject. In this example, the kit may include one or more swabs for sample collection, one or more containers configured to receive the swabs for storage, instructions for sample provision and for setup of a user account, elements configured to associate the sample with the subject (e.g., barcode identifiers, labels, etc.), and a receptacle that allows the sample from the subject to be delivered to a sample processing operation (e.g., by a mail delivery system). In another example, where a sample is drawn from the subject with the assistance of another entity, one or more samples may be collected from the subject in a clinical or research setting (e.g., during a clinical appointment).

In Block S110, the aggregate set of biological samples is preferably received from a wide variety of subjects, and may include samples from human subjects and/or non-human subjects. For human subjects, Block S110 can include receiving samples from a wide variety of human subjects, collectively including subjects of one or more of: different demographic characteristics (e.g., gender, age, marital status, ethnicity, nationality, socioeconomic status, sexual orientation, etc.), different health conditions (e.g., healthy and diseased states), different living situations (e.g., living alone, living with pets, living with a significant other, living with children, etc.), different dietary habits (e.g., omnivorous, vegetarian, vegan, sugar consumption, acid consumption, etc.), different behavioral tendencies (e.g., levels of physical activity, drug use, alcohol use, etc.), different levels of mobility (e.g., with respect to distance traveled over a given time period), biomarker states (e.g., cholesterol levels, lipid levels, etc.), weight, height, body mass index, genotypic factors, and any other suitable trait that has an effect on microbiome composition. Thus, as the number of subjects increases, the predictive power of the feature-based models generated in subsequent blocks of the method 100 increases with respect to characterizing a variety of subjects based on their microbiomes. Additionally or alternatively, the aggregate set of biological samples received in Block S110 may include biological samples from a targeted group of subjects that are similar in one or more of: demographic characteristics, health status, living situation, dietary habits, behavioral tendencies, level of mobility, age range (e.g., pediatric, adult, geriatric), and any other suitable trait that has an effect on microbiome composition. Additionally or alternatively, the methods 100, 200 may be adapted to characterize conditions typically detected by way of laboratory tests (e.g., polymerase chain reaction-based tests, cell culture-based tests, blood tests, biopsies, chemical tests, etc.), physical detection methods (e.g., manometry), medical history-based assessments, behavioral assessments, and imaging-based assessments. Additionally or alternatively, the methods 100, 200 may be adapted to characterize acute conditions, chronic conditions, conditions with differences in prevalence across demographics, conditions with characteristic disease areas (e.g., head, digestive tract, endocrine system diseases, heart, nervous system diseases, respiratory system diseases, immune system diseases, circulatory system diseases, renal system diseases, motor system diseases, etc.), and comorbid conditions.

In some embodiments, receiving the aggregate set of biological samples in Block S110 may be performed according to embodiments, variations, and examples of sample reception as described in U.S. application No. 14/593,424, filed on 9 January 2015 and entitled "Method and System for Microbiome Analysis," which is incorporated by reference herein in its entirety. However, receiving the aggregate set of biological samples in Block S110 may additionally or alternatively be performed in any other suitable manner. Furthermore, some alternative variations of the first method 100 may omit Block S110, with processing of data derived from a set of biological samples performed in subsequent blocks of the method 100 as described below.

1.2 First method: sample analysis, microbiome composition and functional aspects

Block S120 recites: characterizing the microbiome composition and/or functional features of each of the aggregate set of biological samples associated with the population of subjects, thereby generating at least one of a microbiome composition dataset and a microbiome functional diversity dataset for the population of subjects. Block S120 functions to process each of the aggregate set of biological samples in order to determine compositional and/or functional aspects associated with the microbiome of each of the population of subjects. Compositional and functional aspects can include compositional aspects at the microorganism level, including parameters related to the distribution of microorganisms across different groups of kingdoms, phyla, classes, orders, families, genera, species, subspecies, strains, infraspecies taxa (e.g., as measured in total abundance of each group, relative abundance of each group, total number of groups represented, etc.), and/or any other suitable taxa. Compositional and functional aspects may also be represented in terms of operational taxonomic units (OTUs). Compositional and functional aspects may additionally or alternatively include compositional aspects at the genetic level (e.g., regions determined by multilocus sequence typing, 16S sequences, 18S sequences, ITS sequences, other genetic markers, other phylogenetic markers, etc.). Compositional and functional aspects may include the presence, absence, or amount of genes associated with specific functions (e.g., enzymatic activity, transport function, immunological activity, etc.). Thus, the output of Block S120 can be used to provide features of interest for the characterization process of Block S140, where the features can be microorganism-based (e.g., presence of a bacterial genus), genetic-based (e.g., based on representation of specific genetic regions and/or sequences), and/or function-based (e.g., presence of a specific catalytic activity, presence of metabolic pathways, etc.).
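As a rough illustration of the abundance measures mentioned above (total abundance per group, relative abundance per group, and number of groups represented), the following minimal sketch derives them from per-taxon read counts. The taxon names and counts are hypothetical placeholders, and the function name is an assumption, not something defined by the source.

```python
def relative_abundance(counts):
    """Convert per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no assigned reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical genus-level read counts for a single biological sample.
sample_counts = {"Bacteroides": 600, "Faecalibacterium": 300, "Prevotella": 100}

profile = relative_abundance(sample_counts)
# Number of taxonomic groups represented by at least one read.
groups_represented = sum(1 for n in sample_counts.values() if n > 0)

for taxon, fraction in profile.items():
    print(f"{taxon}: {fraction:.2f}")
print("groups represented:", groups_represented)
```

Each resulting value could serve as one microorganism-based feature for a downstream characterization step; how features are actually selected is not specified here.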

In one variation, Block S120 can include characterizing features based on the identification of phylogenetic markers derived from bacteria and/or archaea that are associated with gene families associated with one or more of: ribosomal protein S2, ribosomal protein S3, ribosomal protein S5, ribosomal protein S7, ribosomal protein S8, ribosomal protein S9, ribosomal protein S10, ribosomal protein S11, ribosomal protein S12/S23, ribosomal protein S13, ribosomal protein S15P/S13e, ribosomal protein S17, ribosomal protein S19, ribosomal protein L1, ribosomal protein L2, ribosomal protein L3, ribosomal protein L4/L1e, ribosomal protein L5, ribosomal protein L6, ribosomal protein L10, ribosomal protein L11, ribosomal protein L13, ribosomal protein L14b/L23 b, ribosomal protein L b/L10 b, ribosomal protein L18 b/L5 b, ribosomal protein L b/L b, translation elongation factor EF-2, translation initiation factor IF-2, metalloendopeptidase, ffh signal recognition particle protein, phenylalanyl-tRNA synthetase alpha subunit, phenylalanyl-tRNA synthetase beta subunit, tRNA pseudouridine synthase B, porphobilinogen deaminase, phosphoribosylformylglycinamidine cyclo-ligase, and ribonuclease HII. However, the markers may comprise any other suitable marker.

Thus, characterizing the microbiome composition and/or functional features of each of the aggregate set of biological samples in Block S120 preferably comprises a combination of sample processing techniques (e.g., wet laboratory techniques) and computational techniques (e.g., using bioinformatics tools) to quantitatively and/or qualitatively characterize the microbiome and functional features associated with each biological sample from a subject or population of subjects.

In variations, the sample processing in Block S120 may include any one or more of the following: lysing the biological sample, disrupting membranes in cells of the biological sample, separating undesired components (e.g., RNA, proteins) from the biological sample, purifying nucleic acids (e.g., DNA) in the biological sample, amplifying nucleic acids from the biological sample, further purifying the amplified nucleic acids of the biological sample, and sequencing the amplified nucleic acids of the biological sample. As such, portions of Block S120 may be implemented using embodiments, variations, and examples of sample processing networks and/or computing systems as described in U.S. application No. 14/593,424, filed on 9 January 2015 and entitled "Method and System for Microbiome Analysis," which is incorporated by reference herein in its entirety. Accordingly, a computing system implementing one or more portions of the method 100 may be implemented in one or more computing systems, where the computing system may be implemented at least in part in the cloud and/or as a machine (e.g., a computing machine, server, mobile computing device, etc.) configured to receive a computer-readable medium storing computer-readable instructions. However, any other suitable system may be used for Block S120.

In a variation, lysing the biological sample and/or disrupting membranes in cells of the biological sample preferably includes physical methods (e.g., bead beating, nitrogen decompression, homogenization, sonication), which omit certain reagents that create a bias in the representation of certain bacterial groups upon sequencing. Additionally or alternatively, the lysis or disruption in Block S120 can include chemical methods (e.g., using a detergent, using a solvent, using a surfactant, etc.). Additionally or alternatively, the lysis or disruption in Block S120 may comprise biological methods. In variations, separation of undesired components may comprise removal of RNA using RNases and/or removal of proteins using proteases. In variations, purification of nucleic acids may include one or more of: precipitating nucleic acids from the biological sample (e.g., using an alcohol-based precipitation method), liquid-liquid based purification techniques (e.g., phenol-chloroform extraction), chromatography-based purification techniques (e.g., column adsorption), purification techniques involving binding moiety-bound particles (e.g., magnetic beads, buoyant beads, beads with a size distribution, ultrasonically responsive beads, etc.) configured to bind nucleic acids and to release nucleic acids in the presence of an elution environment (e.g., with an elution solution, providing a pH change, providing a temperature change, etc.), and any other suitable purification technique.

In variations, subjecting the purified nucleic acids to an amplification operation S123 may comprise performing one or more of: polymerase chain reaction (PCR)-based techniques (e.g., solid-phase PCR, RT-PCR, qPCR, multiplex PCR, touchdown PCR, nanoPCR, nested PCR, hot start PCR, etc.), helicase-dependent amplification (HDA), loop-mediated isothermal amplification (LAMP), self-sustained sequence replication (3SR), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), rolling circle amplification (RCA), ligase chain reaction (LCR), and any other suitable amplification technique. In the amplification of purified nucleic acids, the primers used are preferably selected to prevent or minimize amplification bias and are configured to amplify nucleic acid regions/sequences (e.g., of the 16S region, 18S region, ITS region, etc.) that are informative taxonomically, phylogenetically, for diagnostics, for formulations (e.g., for probiotic formulations), and/or for any other suitable purpose. Thus, universal primers configured to avoid amplification bias (e.g., an F27-R338 primer set for 16S rRNA, an F515-R806 primer set for 16S rRNA, etc.) can be used in the amplification. The primers used in variations of Block S110 may additionally or alternatively include incorporated barcode sequences specific to each biological sample, which can facilitate identification of biological samples after amplification. The primers used in variations of Block S110 may additionally or alternatively include adaptor regions configured to cooperate with sequencing techniques involving complementary adaptors (e.g., according to protocols for Illumina sequencing).
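To illustrate how sample-specific barcode sequences can support identification of biological samples after amplification, the sketch below assigns reads to samples by exact match on a leading barcode. It is a minimal sketch under stated assumptions: the barcodes, reads, sample names, and fixed barcode length are hypothetical, and practical pipelines usually also tolerate sequencing errors in the barcode.

```python
def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using an exact match on the leading barcode.

    Returns (reads_by_sample, unassigned). The barcode is trimmed from
    each assigned read; reads with an unknown barcode are kept whole.
    """
    reads_by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)
        else:
            reads_by_sample[sample].append(insert)
    return reads_by_sample, unassigned


# Hypothetical 4-base barcodes and short reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = ["ACGTGGATTACA", "TTAGCCCGGGTT", "ACGTTTTTAAAA", "GGGGAAAACCCC"]

by_sample, unassigned = demultiplex(reads, barcodes, barcode_length=4)
print(by_sample)
print(unassigned)
```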

The identification of primer sets for multiplexed amplification operations can be performed according to embodiments, variations, and examples of methods as described in U.S. application No. 62/206,654, filed on 18 August 2015 and entitled "Method and System for Multiplex Primer Design," which is incorporated by reference herein in its entirety. The multiplexed amplification operation using a set of primers in Block S123 may additionally or alternatively be performed in any other suitable manner.

Additionally or alternatively, as shown in fig. 3, Block S120 may implement any other steps configured to facilitate processing (e.g., using a Nextera kit) for performing a fragmentation operation S122 (e.g., fragmentation and tagging with sequencing adaptors) in coordination with the amplification operation S123 (e.g., S122 may be performed after S123, S122 may be performed before S123, S122 may be performed substantially simultaneously with S123, etc.). In addition, Blocks S122 and/or S123 may be performed with or without a nucleic acid extraction step. For example, extraction may be performed prior to amplification of nucleic acids, followed by fragmentation, and then amplification of the fragments. Alternatively, extraction may be performed, followed by fragmentation, and then amplification of the fragments. Thus, in some embodiments, the amplification operation in Block S123 may be performed according to embodiments, variations, and examples of amplification as described in U.S. application No. 14/593,424, filed on 9 January 2015 and entitled "Method and System for Microbiome Analysis". Furthermore, the amplification in Block S123 may additionally or alternatively be performed in any other suitable manner.

In a particular example, amplification and sequencing of nucleic acids from biological samples of the set of biological samples comprises: solid-phase PCR involving bridge amplification of DNA fragments of the biological samples on a substrate with oligonucleotide adaptors, wherein amplification involves primers having a forward index sequence (e.g., corresponding to an Illumina forward index for the MiSeq/NextSeq/HiSeq platforms) or a reverse index sequence (e.g., corresponding to an Illumina reverse index for the MiSeq/NextSeq/HiSeq platforms), a forward barcode sequence or a reverse barcode sequence, a transposase sequence (e.g., corresponding to a transposase binding site for the MiSeq/NextSeq/HiSeq platforms), a linker (e.g., a 0, 1, or 2 base fragment configured to reduce homogeneity and improve sequence results), additional random bases, and a sequence for targeting a specific target region (e.g., 16S region, 18S region, ITS region). Amplification and sequencing can further be performed on any suitable amplicon, as indicated throughout the disclosure. In particular examples, sequencing comprises Illumina sequencing using a sequencing-by-synthesis technique (e.g., with a HiSeq platform, with a MiSeq platform, with a NextSeq platform, etc.). Additionally or alternatively, any other suitable next generation sequencing technology (e.g., PacBio platform, Oxford Nanopore platform, MinION platform, etc.) may be used. Additionally or alternatively, any other suitable sequencing platform or method may be used (e.g., Roche 454 Life Sciences platform, Life Technologies SOLiD platform, etc.). In an example, sequencing can include deep sequencing to quantify the number of copies of a particular sequence in a sample, which can then also be used to determine the relative abundance of different sequences in the sample. Deep sequencing refers to highly redundant sequencing of a nucleic acid sequence, e.g., such that the original number of copies of a sequence in a sample can be determined or estimated.
The redundancy (i.e., depth) of sequencing is determined by the length (X) of the sequence to be determined, the number of sequencing reads (N), and the average read length (L). The redundancy is then N × L / X. The sequencing depth can be, or be at least, about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 100, 110, 120, 130, 150, 200, 300, 500, 700, 1000, 2000, 3000, 4000, 5000 or more.
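As a worked illustration of the redundancy relation N × L / X, the following minimal sketch computes the depth for one set of inputs. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sequencing_depth(num_reads: int, mean_read_length: float, target_length: float) -> float:
    """Redundancy (depth) = N * L / X, using the symbols defined in the text."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return num_reads * mean_read_length / target_length


# Hypothetical inputs: 10,000 reads averaging 250 bases over a 50,000-base target.
depth = sequencing_depth(10_000, 250, 50_000)
print(depth)  # 50.0
```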

Some variations of sample processing in module S120 may include further purification of the amplified nucleic acids (e.g., PCR products) prior to sequencing, which serves to remove excess amplification components (e.g., primers, dNTPs, enzymes, salts, etc.). In an example, any one or more of the following may be used to facilitate additional purification: purification kits, buffers, alcohols, pH indicators, chaotropic salts, nucleic acid binding filters, centrifugation, and any other suitable purification technique.

In variations, the computational processing in module S120 may include any one or more of the following: performing a sequence analysis operation S124 that includes identification of microbiome-derived sequences (e.g., as opposed to subject sequences and contaminants), performing an alignment and/or mapping operation S125 of the microbiome-derived sequences (e.g., alignment of fragmented sequences using one or more of single-ended alignment, ungapped alignment, gapped alignment, and pairing), and generating features S126 derived from compositional and/or functional aspects of the microbiome associated with the biological sample.

Performing the sequence analysis operation S124 to identify microbiome-derived sequences may include mapping sequence data from sample processing to a subject reference genome (e.g., provided by the Genome Reference Consortium) to remove subject genome-derived sequences. The unidentified sequences remaining after mapping the sequence data to the subject reference genome may then be clustered into operational taxonomic units (OTUs) based on sequence similarity and/or reference-based methods (e.g., using VAMPS, using MG-RAST, using QIIME databases), aligned (e.g., using a genome hashing approach, using the Needleman-Wunsch algorithm, using the Smith-Waterman algorithm), and mapped to reference bacterial genomes (e.g., provided by the National Center for Biotechnology Information) using an alignment algorithm (e.g., Basic Local Alignment Search Tool, FPGA accelerated alignment tool, BWT indexing with BWA, BWT indexing with SOAP, BWT indexing with Bowtie, etc.). Mapping of unidentified sequences may additionally or alternatively include mapping to reference archaeal genomes, viral genomes, and/or eukaryotic genomes. Furthermore, the mapping of taxonomic units may be performed with respect to existing databases and/or with respect to custom-generated databases.

Additionally or alternatively, with respect to generating a microbiome functional diversity dataset, module S120 may include extracting candidate features S127 related to functional aspects of one or more microbiome components of the aggregate set of biological samples, as indicated in the microbiome composition dataset. Extracting the candidate functional features may include identifying functional features that are associated with one or more of: prokaryotic clusters of orthologous groups of proteins (COGs); eukaryotic clusters of orthologous groups of proteins (KOGs); any other suitable type of gene product; an RNA processing and modification functional classification; a chromatin structure and dynamics functional classification; an energy production and conversion functional classification; a cell cycle control and mitosis functional classification; an amino acid metabolism and transport functional classification; a nucleotide metabolism and transport functional classification; a carbohydrate metabolism and transport functional classification; a coenzyme metabolism functional classification; a lipid metabolism functional classification; a translation functional classification; a transcription functional classification; a replication and repair functional classification; a cell wall/membrane/envelope biogenesis functional classification; a cell motility functional classification; a post-translational modification, protein turnover, and chaperone functional classification; an inorganic ion transport and metabolism functional classification; a secondary metabolite biosynthesis, transport, and catabolism functional classification; a signal transduction functional classification; an intracellular trafficking and secretion functional classification; a nuclear structure functional classification; a cytoskeleton functional classification; a general function prediction only functional classification; a function unknown functional classification; and any other suitable functional classification.

Additionally or alternatively, extracting candidate functional features in block S127 may include identifying functional features that are relevant to one or more of: systems information (e.g., pathway maps of cellular and organismal functions, modules or functional units of genes, hierarchical classifications of biological entities); genomic information (e.g., complete genomes, genes and proteins in the complete genomes, orthologous groups of genes in the complete genomes); chemical information (e.g., chemical compounds and glycans, chemical reactions, enzyme nomenclature); health information (e.g., human diseases, approved drugs, crude drugs, and health-related substances); metabolic pathway maps; genetic information processing (e.g., transcription, translation, replication and repair, etc.) pathway maps; environmental information processing (e.g., membrane transport, signal transduction, etc.) pathway maps; cellular process (e.g., cell growth, cell death, cell membrane function, etc.) pathway maps; organismal system (e.g., immune system, endocrine system, nervous system, etc.) pathway maps; human disease pathway maps; drug development pathway maps; and any other suitable pathway maps.

In extracting candidate functional features, block S127 may include conducting a search of one or more databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and/or the Clusters of Orthologous Groups (COG) database managed by the National Center for Biotechnology Information (NCBI). The search may be performed based on results of generating a microbiome composition dataset from one or more of the aggregate set of biological samples and/or sequencing of material from the set of samples. In more detail, block S127 may include an implementation of a data-oriented entry point to the KEGG database, including one or more of: a KEGG pathway tool, a KEGG BRITE tool, a KEGG module tool, a KEGG Orthology (KO) tool, a KEGG genome tool, a KEGG genes tool, a KEGG compound tool, a KEGG glycan tool, a KEGG reaction tool, a KEGG disease tool, a KEGG drug tool, and a KEGG MEDICUS tool. The search may additionally or alternatively be performed according to any other suitable filter. Additionally or alternatively, block S127 may include the implementation of an organism-specific entry point to the KEGG database, including a KEGG organisms tool. Additionally or alternatively, block S127 may include an implementation of an analysis tool, including one or more of: a KEGG mapper tool that maps KEGG pathway, BRITE, or module data; a KEGG atlas tool for exploring KEGG global maps; a BlastKOALA tool for genome annotation and KEGG mapping; a BLAST/FASTA sequence similarity search tool; and a SIMCOMP chemical structure similarity search tool. In a particular example, block S127 may include extracting candidate functional features from the KEGG database resource and the COG database resource based on the microbiome composition dataset; however, block S127 may include extracting candidate functional features in any other suitable manner. 
For example, block S127 may include extracting candidate functional features derived from a Gene Ontology functional classification, and/or any other suitable features.

In one example, a taxonomic group (taxon) may include one or more bacteria and their corresponding reference sequences. When sequence reads are aligned with the reference sequences of a taxonomic group, the sequence reads can be assigned to the taxonomic group based on the alignment. A functional group may correspond to one or more genes that are annotated as having similar functions. As such, a functional group can be represented by reference sequences of the genes in the functional group, where the reference sequences of a particular gene can correspond to different bacteria. Since each group includes one or more reference sequences that represent the group, taxonomic groups and functional groups may be collectively referred to as sequence groups. A taxonomic group containing multiple bacteria can be represented by multiple reference sequences, e.g., one reference sequence per bacterial species in the group. Embodiments can use the degree of alignment of a sequence read with multiple reference sequences to determine the sequence group to which the read is assigned.

1.2.1 examples and variations: the sequence groups correspond to the taxonomic groups

A taxonomic group can correspond to any set of one or more reference sequences representing one or more loci (e.g., genes) of the taxonomic group. Any given level of the taxonomic hierarchy will include a plurality of taxonomic groups. For example, a reference sequence in a genus-level group can also be in a family-level group.

The relative abundance value (RAV) may correspond to the proportion of reads assigned to a particular taxonomic group. The proportion may be relative to various denominator values, for example, relative to all sequence reads assigned to at least one group (taxonomic or functional), or to all sequence reads assigned at a given level of the hierarchy. The alignment can be performed in any manner that can assign sequence reads to a particular taxonomic group.

For example, based on mapping to the reference sequences in the 16S region, the taxonomic group with the best match for the alignment can be identified. The RAV can then be determined for that taxonomic group as the number of sequence reads (or votes of sequence reads) for the particular sequence group divided by the number of sequence reads identified as bacterial, which can be counted for a particular region or even for a given level of the hierarchy.
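The RAV calculation described above can be sketched as follows. This is a minimal illustration that assumes each read has already been assigned a single group label (or `None` when unassigned) and that the denominator is the number of assigned reads; the function name and sample labels are hypothetical.

```python
from collections import Counter


def relative_abundance_values(assignments):
    """Map each sequence group to its share of all assigned reads.

    `assignments` holds one group label per read; None marks an unassigned read.
    Assumption: the denominator is the count of reads assigned to any group.
    """
    counts = Counter(label for label in assignments if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


# Hypothetical read assignments for one sample.
reads = ["Bacteroides", "Bacteroides", "Prevotella", None, "Bacteroides"]
ravs = relative_abundance_values(reads)
print(ravs)  # {'Bacteroides': 0.75, 'Prevotella': 0.25}
```

A different denominator (for example, all reads assigned at one level of the hierarchy) would only change how `total` is computed.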

1.2.2 examples and variations: the sequence groups correspond to functional groups or genes

Instead of or in addition to determining the counts of sequence reads corresponding to a particular taxonomic group, embodiments may use the counts of a number of sequence reads corresponding to a particular gene or set of annotated genes having a particular function, where the set is referred to as a functional group. The RAV may be determined in a similar manner as for the taxonomic group. For example, a functional group can include a plurality of reference sequences corresponding to one or more genes of the functional group. For the same gene, reference sequences of multiple bacteria may correspond to the same functional group. Then, to determine the RAV, the number of sequence reads assigned to the functional group can be used to determine the proportion of the functional group.

The use of functional groups, which may contain a single gene, may help identify situations in which there are small changes (e.g., increases) across many taxonomic groups, each too small to be statistically significant on its own. The changes may all be for the same gene or for a set of genes of the same functional group, and thus the change may be statistically significant for the functional group even though it is not significant for any individual taxonomic group. The converse may also be true: a taxonomic group may be more predictive than a particular functional group, e.g., when a single taxonomic group includes many genes that each vary by a relatively small amount.

As an example, if 10 taxonomic groups each increase by 10%, the statistical power to distinguish between the two populations (e.g., condition and control) may be low when each taxonomic group is analyzed separately. However, if the increases all involve genes of the same functional group, the increase will be 100%, or a doubling of the proportion, for that functional group. This large increase would provide much greater statistical power to distinguish the two populations. In this manner, a functional group can act to sum small variations across taxonomic groups. Likewise, small variations in individual functional groups that all happen to fall in the same taxonomic group can be summed to provide high statistical power for that particular taxonomic group.

The taxonomic and functional groups may complement each other in that the information may be orthogonal, or at least partially orthogonal, although there may still be some relationship between the RAVs of each group. For example, RAVs of one or more taxonomic and functional groups can be used together as a plurality of features of a feature vector, which is analyzed to provide a diagnosis, as described herein. For example, the feature vector may be compared to a disease signature as part of the characterization model.

1.2.3 examples and variations: pipeline for taxonomy

Embodiments can provide a bioinformatics pipeline that taxonomically annotates microorganisms present in a sample. An example annotation pipeline may include the following operations.

In the first module, samples can be identified and sequence data can be loaded. For example, the pipeline may begin with a demultiplexed fastq file (or other suitable file) that is the product of paired-end sequencing of an amplicon (e.g., the amplicon of the V4 region of the 16S gene). For a given input sequencing file, all samples can be identified and the corresponding fastq files can be obtained from the fastq storage server and loaded into the pipeline.

In a second module, reads may be filtered. For example, global quality filtering of reads in a fastq file may accept reads with a global Q-score > 30. In one implementation, for each read, the per-position Q-scores are averaged; if the average is equal to or above 30, the read is accepted, and otherwise the read is discarded, as is its paired read.
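The averaging rule in this implementation can be sketched as below. The sketch assumes Phred+33 quality encoding and represents each read as a hypothetical `(sequence, quality_string)` tuple; neither detail is specified in the text, and the function names are illustrative.

```python
def mean_q(quality_line: str, offset: int = 33) -> float:
    """Average per-position Q-score of one read (Phred+33 encoding assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)


def filter_pairs(pairs, min_mean_q: float = 30.0):
    """Keep a read pair only if both mates have a mean Q-score >= min_mean_q."""
    kept = []
    for fwd, rev in pairs:
        if mean_q(fwd[1]) >= min_mean_q and mean_q(rev[1]) >= min_mean_q:
            kept.append((fwd, rev))
    return kept


# Hypothetical pairs: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
pairs = [
    (("ACGT", "IIII"), ("ACGT", "IIII")),  # both mates pass
    (("ACGT", "IIII"), ("ACGT", "####")),  # reverse mate fails, so the pair is dropped
]
kept = filter_pairs(pairs)
print(len(kept))  # 1
```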

In the third module, primers can be identified and removed. In one embodiment, only forward reads containing the forward primer and reverse reads containing the reverse primer (allowing primer annealing with up to 5 mismatches, or another number of mismatches) are considered further. The primer and any sequence 5' of it are removed from each read. For a forward read, only the 125 bp (or other suitable number) 3' of the forward primer is retained; for a reverse read, only the 124 bp (or other suitable number) 3' of the reverse primer is retained. All processed forward reads shorter than 125 bp and reverse reads shorter than 124 bp are excluded from further processing, as are their paired reads.
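A minimal sketch of the primer search and trimming step follows. It assumes a simple sliding-window comparison that counts substitutions only, with no handling of degenerate (IUPAC) primer bases or indels; the function names and the short example primer are hypothetical.

```python
def find_primer_end(read: str, primer: str, max_mismatches: int = 5):
    """Return the index just past the first primer match, or None if absent.

    Assumption: mismatches are substitutions only (no indels, no IUPAC codes).
    """
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            return start + len(primer)
    return None


def trim_read(read: str, primer: str, keep_length: int, max_mismatches: int = 5):
    """Drop the primer and everything 5' of it; keep `keep_length` bases 3' of it.

    Returns None when the primer is missing or the remaining read is too short.
    """
    end = find_primer_end(read, primer, max_mismatches)
    if end is None:
        return None
    insert = read[end:end + keep_length]
    return insert if len(insert) == keep_length else None


# Hypothetical 8-base primer preceded by two extra 5' bases.
primer = "GTGCCAGC"
read = "TT" + primer + "ACGTACGTAC"
print(trim_read(read, primer, keep_length=10, max_mismatches=0))  # ACGTACGTAC
```

In the pipeline described above, `keep_length` would be 125 for forward reads and 124 for reverse reads, and a pair would be dropped whenever either mate returns `None`.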

In a fourth module, forward and reverse reads may be written to a file (e.g., a FASTA file). For example, forward and reverse reads that remain paired can be used to generate a file containing 125bp from the forward read concatenated with 124bp from the reverse read (in the reverse complement direction).
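The concatenation described above can be sketched as follows. The helper names and the FASTA header format are hypothetical; the short sequences stand in for the 125 bp and 124 bp trimmed reads.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_pair(forward: str, reverse: str) -> str:
    """Concatenate the forward read with the reverse complement of the reverse read."""
    return forward + reverse_complement(reverse)


def to_fasta(records) -> str:
    """Render (name, sequence) records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical trimmed read pair.
joined = join_pair("ACGT", "AACG")
print(to_fasta([("pair_1", joined)]), end="")
# >pair_1
# ACGTCGTT
```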

In a fifth module, the sequence reads can be clustered, for example, to identify chimeric sequences or to determine consensus sequences of bacteria. For example, sequences in a file may be clustered with a distance of 1 using the Swarm algorithm. This process generates clusters comprising a central biological entity surrounded by sequences that are 1 mutation away from that entity, which are less abundant and result from the normal base calling errors associated with high throughput sequencing. Singletons are removed from further analysis. In the remaining clusters, the most abundant sequence of each cluster is then used as the representative, and the counts of all members in the cluster are assigned to it.
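The idea of an abundant central sequence absorbing its 1-mutation neighbors can be illustrated with the simplified sketch below. This is not the Swarm algorithm: it is a greedy, substitution-only stand-in that assumes equal-length reads and compares each sequence only with existing cluster seeds. All names are hypothetical.

```python
from collections import Counter


def within_one_substitution(a: str, b: str) -> bool:
    """True if the equal-length sequences differ at no more than one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1


def cluster_distance_1(sequences):
    """Greedy clustering: visit sequences from most to least abundant and attach
    each one to the first seed within one substitution, else start a new cluster.

    Returns {seed_sequence: total_read_count}, with singleton clusters removed.
    """
    clusters = {}
    for seq, count in Counter(sequences).most_common():
        for seed in clusters:
            if within_one_substitution(seed, seq):
                clusters[seed] += count
                break
        else:
            clusters[seq] = count
    return {seed: n for seed, n in clusters.items() if n > 1}


# Hypothetical reads: "ACGA" is one substitution from the abundant "ACGT".
reads = ["ACGT"] * 5 + ["ACGA"] + ["TTTT"] + ["GGGG"] * 2
print(cluster_distance_1(reads))  # {'ACGT': 6, 'GGGG': 2}
```

The seed of each cluster is its most abundant sequence, matching the representative-sequence rule in the text; the lone "TTTT" read is dropped as a singleton.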

In the sixth module, chimeric sequences may be removed. For example, amplification of a gene superfamily can result in the formation of chimeric DNA sequences, which arise when partial PCR products from one member of the superfamily are extended along a different member of the superfamily in subsequent PCR cycles. To remove chimeric DNA sequences, some embodiments may use the VSEARCH chimera detection algorithm with the de novo option and standard parameters. The algorithm uses the abundance of PCR products to identify reference "true" sequences as those that are most abundant, and chimeric products as those that are less abundant and exhibit local similarity to two or more reference sequences. All chimeric sequences can be removed from further analysis.

In a seventh module, taxonomic annotations can be assigned to sequences using a sequence identity search. To assign taxonomy to sequences that have passed all of the above filters, some embodiments may conduct an identity search against a database of reference sequences from bacterial strains annotated to the phylum, class, order, family, genus, and species level, or any other taxonomic level. The most specific level of taxonomic annotation for a sequence can be retained, since the higher-order taxonomic designations can be inferred from a lower-level one. The sequence identity search may be performed using VSEARCH with parameters that allow exhaustive exploration of the reference database used (maxaccepts 0, maxrejects 0, id 1). Decreasing values of sequence identity can be used to assign sequences to different taxonomic groups: > 97% sequence identity for assignment to species, > 95% sequence identity for assignment to genus, > 90% for assignment to family, > 85% for assignment to order, > 80% for assignment to class, and > 77% for assignment to phylum.
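The identity thresholds listed above can be expressed as a small lookup, as in the sketch below. It assumes the identity of the best database hit and that hit's full lineage are already available (for example, from a search tool's output); the function names and the example lineage are illustrative.

```python
# Thresholds from the text, ordered from most to least specific rank.
RANK_THRESHOLDS = [
    ("species", 0.97),
    ("genus", 0.95),
    ("family", 0.90),
    ("order", 0.85),
    ("class", 0.80),
    ("phylum", 0.77),
]
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(identity: float):
    """Most specific rank supported by a hit identity (strictly above threshold)."""
    for rank, threshold in RANK_THRESHOLDS:
        if identity > threshold:
            return rank
    return None


def truncate_lineage(lineage: dict, identity: float) -> dict:
    """Keep the best hit's lineage only down to the rank the identity supports."""
    rank = deepest_rank(identity)
    if rank is None:
        return {}
    keep = RANKS[: RANKS.index(rank) + 1]
    return {r: lineage[r] for r in keep if r in lineage}


# Example lineage of a best hit, used for illustration only.
hit = {
    "phylum": "Bacteroidetes", "class": "Bacteroidia", "order": "Bacteroidales",
    "family": "Bacteroidaceae", "genus": "Bacteroides", "species": "Bacteroides fragilis",
}
print(deepest_rank(0.96))                 # genus
print(list(truncate_lineage(hit, 0.96)))  # ['phylum', 'class', 'order', 'family', 'genus']
```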

In the eighth module, the relative abundance of each taxon can be estimated and output to a database. For example, after all sequences have been searched against the reference database, the relative abundance of each taxon can be determined by dividing the count of all sequences assigned to that taxon by the total number of reads that passed the filters (e.g., were assigned). The results may be uploaded to a database table that serves as a store for taxonomic annotation data.

1.2.4 examples and variations: pipeline for functional groups

For functional groups, the method can proceed as follows.

In the first step, the sample operational taxonomic units (OTUs) can be found. This may occur after the sixth module above, at which point sequences can be clustered, e.g., based on sequence identity (e.g., 97% sequence identity).

In a second step, the taxonomy can be assigned, for example, by comparing the OTU to a reference sequence of known taxonomies. The comparison may be based on sequence identity (e.g., 97%).

In a third step, taxonomic abundance can be adjusted for 16S copy number, or any genomic region that can be analyzed. Different species may have different copy numbers of the 16S gene, so those with higher copy numbers will have more 16S material for PCR amplification than other species at the same cell number. Thus, abundance can be normalized by adjusting 16S copy number.
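The copy number adjustment can be sketched as a division followed by renormalization. The taxon labels and copy numbers below are hypothetical placeholders, not measured values, and the function name is illustrative.

```python
def normalize_by_copy_number(read_counts: dict, copy_numbers: dict) -> dict:
    """Divide each taxon's 16S read count by its 16S copy number, then rescale
    so that the adjusted abundances sum to 1."""
    adjusted = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(adjusted.values())
    if total == 0:
        return {taxon: 0.0 for taxon in adjusted}
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical counts: taxon_A has 7 copies of the 16S gene, taxon_B has 1.
counts = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}
print(normalize_by_copy_number(counts, copies))  # {'taxon_A': 0.25, 'taxon_B': 0.75}
```

Although taxon_A contributes 70% of the reads, it accounts for only 25% of the estimated cells once its higher copy number is taken into account.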

In a fourth step, a pre-computed genome look-up table may be used to relate taxonomy to function and to the amount of each function. For example, a pre-computed genome look-up table giving, for each taxon, the number of genes in each important KEGG or COG functional category may be used to estimate the abundance of those functional categories based on the normalized 16S abundance data.
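One way to read the look-up step is as a weighted sum: each functional category receives, from every taxon, that taxon's normalized abundance multiplied by its gene count for the category. The sketch below assumes this interpretation; the table contents and the labels `func_1` and `func_2` are hypothetical placeholders rather than real KEGG or COG identifiers.

```python
def functional_abundance(taxon_abundance: dict, gene_table: dict) -> dict:
    """Estimate functional category abundance from taxon abundance.

    `gene_table[taxon][function]` is a pre-computed gene count for that taxon;
    taxa missing from the table contribute nothing.
    """
    totals = {}
    for taxon, abundance in taxon_abundance.items():
        for function, gene_count in gene_table.get(taxon, {}).items():
            totals[function] = totals.get(function, 0.0) + abundance * gene_count
    return totals


# Hypothetical normalized abundances and gene counts.
abundance = {"taxon_A": 0.25, "taxon_B": 0.75}
table = {
    "taxon_A": {"func_1": 2, "func_2": 1},
    "taxon_B": {"func_1": 1},
}
print(functional_abundance(abundance, table))  # {'func_1': 1.25, 'func_2': 0.25}
```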

After identifying representative microbiota of the microbiome associated with the biological sample and/or identifying candidate functional aspects (e.g., functions associated with microbiome components of the biological sample), generating features derived from compositional aspects and/or functional aspects of the microbiome associated with the aggregate set of biological samples can be performed.

In one variation, generating features may include generating features derived from multilocus sequence typing (MLST), which may be performed experimentally at any stage associated with the implementation of the methods 100, 200, to identify markers for characterization in subsequent modules of the method 100. Additionally or alternatively, generating features may include generating features that describe the presence or absence of certain taxonomic groups of microorganisms and/or the ratios between taxonomic groups of microorganisms present. Additionally or alternatively, generating features may include generating features describing one or more of: the number of taxa represented, the network of taxa represented, the correlation in representation of different taxa, the interactions between different taxa, the products produced by different taxa, the interactions between products produced by different taxa, the ratio between dead and live microorganisms (e.g., for different taxa represented, based on analysis of RNA), the phylogenetic distance (e.g., in terms of Kantorovich-Rubinstein distance, Wasserstein distance, etc.), any other suitable taxon-related feature, and any other suitable genetic or functional feature.

Additionally or alternatively, generating features may include generating features describing the relative abundance of different groups of microorganisms, for example, using the SparCC method, using the Genome Relative Abundance and Average Size (GAAS) method, and/or using the Genome Relative Abundance using Mixture Model theory (GRAMMy) method, which uses sequence similarity data to perform maximum likelihood estimation of the relative abundance of one or more groups of microorganisms. Additionally or alternatively, generating features may include generating statistical measures of taxonomic variation as derived from abundance metrics. Additionally or alternatively, generating features may include generating features derived from relative abundance factors (e.g., changes in the abundance of one taxon that affect the abundance of other taxa). Additionally or alternatively, generating features may include generating qualitative features that describe the presence of one or more taxa, alone and/or in combination. Additionally or alternatively, generating features may include generating features associated with genetic markers (e.g., representative 16S, 18S, and/or ITS sequences) that characterize microorganisms of the microbiome associated with the biological sample. Additionally or alternatively, generating features may include generating features associated with functional associations of particular genes and/or of organisms having the particular genes. Additionally or alternatively, generating features may include generating features associated with the pathogenicity of a taxon and/or of products attributed to the taxon. However, module S120 may include generating any other suitable feature derived from sequencing and mapping of nucleic acids of a biological sample. 
For example, the features may be combinatory (e.g., involving pairs, triplets), correlative (e.g., related to correlations between different features), and/or related to changes in features (e.g., temporal changes, changes across sample sites, spatial changes, etc.). However, the features may be generated in any other suitable manner in module S120.

1.3 first method: supplemental data

Block S130 recites: receiving a supplemental dataset associated with at least a subset of the population of subjects, wherein the supplemental dataset is informative of characteristics associated with the autoimmune condition. Thus, the supplemental dataset can provide information on the presence of the condition in the population of subjects. Module S130 serves to obtain further data relating to one or more subjects in the set of subjects, which may be used to train and/or validate the characterization process performed in module S140. In block S130, the supplemental dataset preferably includes survey-derived data, but may additionally or alternatively include one or more of the following: contextual data derived from sensors, medical data (e.g., current and historical medical data related to autoimmune conditions), and any other suitable type of data. In variations of module S130 that include receiving survey-derived data, the survey-derived data preferably provides physiological, demographic, and behavioral information related to the subject. The physiological information may include information related to physiological characteristics (e.g., height, body weight, body mass index, body fat percentage, body hair level, etc.). Demographic information may include information related to demographic characteristics (e.g., gender, age, race, marital status, number of siblings, socioeconomic status, sexual orientation, etc.). 
The behavioral information may include information relating to one or more of: health conditions (e.g., health and disease states), living situations (e.g., living alone, living with pets, living with a significant other, living with children, etc.), dietary habits (e.g., omnivorous, vegetarian, vegan, sugar consumption, acid consumption, etc.), behavioral tendencies (e.g., physical activity level, drug usage level, alcohol usage level, etc.), different levels of mobility (e.g., with respect to distance traveled over a given period of time), different levels of sexual activity (e.g., with respect to number of partners and sexual orientation), and any other suitable behavioral information. The survey-derived data may include quantitative data and/or qualitative data that may be converted into quantitative data (e.g., using severity ratings, mapping qualitative responses to quantitative scores, etc.).

In facilitating receipt of the survey-derived data, module S130 can include providing one or more surveys to subjects of the population of subjects or to entities related to subjects of the population of subjects. The survey may be provided in person (e.g., in coordination with sample provision and receipt from the subject), electronically (e.g., during setup of an account by the subject, at an application executed by an electronic device of the subject, at a network application accessible through an internet connection, etc.), and/or in any other suitable manner.

Additionally or alternatively, portions of the supplemental data set received in module S130 may originate from sensors associated with the subject (e.g., sensors of a wearable computing device, sensors of a mobile device, biometric sensors associated with the user, etc.). Accordingly, module S130 may include receiving one or more of: physical activity-related data or physical action-related data (e.g., accelerometer and gyroscope data from a subject's mobile device or wearable electronic device), environmental data (e.g., temperature data, elevation data, climate data, light parameter data, etc.), patient nutrition or diet-related data (e.g., data from food agency records, data from spectrophotometric analysis, etc.), biometric data (e.g., data recorded by sensors within a patient's mobile computing device, data recorded by wearable or other peripheral devices in communication with the patient's mobile computing device), location data (e.g., using GPS elements), and any other suitable data. Additionally or alternatively, portions of the supplemental data set may be derived from medical record data and/or clinical data of the subject. Thus, portions of the supplemental data set may be derived from one or more Electronic Health Records (EHRs) of one or more subjects.

Additionally or alternatively, the supplemental dataset of module S130 may include any other suitable diagnostic information (e.g., clinical diagnostic information) that may be combined with the analysis derived from the features to support characterization of the subject in subsequent modules of the method 100. For example, information derived from colonoscopy, biopsy, blood test, diagnostic imaging, survey related information, and any other suitable test may be used to supplement block S130.

1.4 first method: characterization of autoimmune conditions

Block S140 recites: transforming the supplemental dataset and features extracted from at least one of the microbiome composition dataset and the microbiome functional diversity dataset into a characterization model of the autoimmune condition. Module S140 is for performing a characterization process for identifying a feature and/or combination of features that can be used to characterize a subject or population as having an autoimmune condition based on the microbiome composition and/or functional features of the subject. Additionally or alternatively, the characterization process may be used as a diagnostic tool that may characterize the subject (e.g., in terms of behavioral characteristics, in terms of medical conditions, in terms of demographic characteristics, etc.) with respect to other health condition states, behavioral characteristics, medical conditions, demographic characteristics, and/or any other suitable characteristics based on the microbiome composition and/or functional characteristics of the subject. Such characterization may then be used to suggest or provide personalized therapies through the therapy model of block S150.

In performing the characterization process, module S140 can use computational methods (e.g., statistical methods, machine learning methods, artificial intelligence methods, bioinformatics methods, etc.) to characterize the subject as exhibiting characteristics specific to a population of subjects with an autoimmune condition.

In one variation, the characterization may be based on features derived from a statistical analysis (e.g., analysis of a probability distribution) of similarities and/or differences between a first group of subjects exhibiting a target state associated with the autoimmune condition (e.g., a healthy condition state) and a second group of subjects not exhibiting a target state associated with the autoimmune condition (e.g., a "normal" state). In practicing this variation, one or more of the Kolmogorov-Smirnov (KS) test, the permutation test (membership test), the Cram mer-von Mises test, and any other statistical test (e.g., t-test, Welch t-test, z-test, Chi-square test, distribution-related test, etc.) may be used. In particular, one or more such statistical hypothesis tests may be used to evaluate a set of features having varying degrees of abundance (or variation across) in a first group of subjects exhibiting a target state associated with an autoimmune condition (i.e., an adverse state) and a second group of subjects not exhibiting a target state associated with an autoimmune condition (i.e., having a normal state). In more detail, the set of evaluated features may be constrained based on the percentage of abundance and/or any other suitable parameter pertaining to the diversity associated with the first and second groups of subjects to increase or decrease the confidence of the characterization. In a specific implementation of this example, the features can be derived from the presence of taxa and/or functional features of microorganisms abundant in a particular percentage of subjects of the first group and subjects of the second group, wherein the relative abundance of taxa between the first group of subjects and the second group of subjects can be determined from one or more of the KS test or the Welch t-test (e.g., the t-test with log-normal transformation) with an indication of significance (e.g., in terms of p-value). 
Thus, the output of module S140 may include a normalized relative abundance value (e.g., the abundance of taxon-derived features and/or functional features in diseased subjects is 25% greater than in healthy patients) and an indication of significance (e.g., a p-value of 0.0013). Variations of feature generation may additionally or alternatively be implemented or derived from functional features or metadata features (e.g., non-bacterial markers).
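As a minimal sketch of how such an output could be produced, the following Python example computes a percent difference in mean relative abundance between two groups and a permutation-test p-value on log-transformed abundances. The function names, the pseudocount, the number of permutations, and the abundance values are illustrative assumptions and are not taken from the specification.

```python
import math
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean log abundance."""
    # Log-transform the abundances; the small pseudocount (an assumption)
    # avoids log(0) when a taxon is absent from a sample.
    pseudo = 1e-6
    a = [math.log(x + pseudo) for x in group_a]
    b = [math.log(x + pseudo) for x in group_b]
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

def percent_difference(group_a, group_b):
    """Percent by which mean abundance in group_a exceeds that in group_b."""
    return 100.0 * (mean(group_a) - mean(group_b)) / mean(group_b)

# Hypothetical relative abundances of one taxon in each group.
diseased = [0.25, 0.30, 0.28, 0.35, 0.27, 0.31, 0.29, 0.33]
healthy = [0.20, 0.22, 0.19, 0.25, 0.21, 0.24, 0.23, 0.26]

print("percent difference:", percent_difference(diseased, healthy))
print("p-value:", permutation_p_value(diseased, healthy))
```

A permutation test is used here only because it needs no distributional assumptions and no external library; a KS test or Welch's t-test could be substituted.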

In variations and examples, the characterization may use Relative Abundance Values (RAVs) for a population of subjects with the disease (a condition population) and a population of subjects without the disease (a control population). A particular sequence population can be identified as being included in the disease signature if the distribution of RAVs for that sequence population in the condition population is statistically different from the distribution of RAVs in the control population. Since the two populations have different distributions, the RAV of a new sample for a sequence population in the disease signature can be used to classify (e.g., determine the probability of) whether the sample has the disease. Classification can also be used to determine treatment, as described herein. Discrimination levels can be used to identify sequence populations with high predictive value. As such, embodiments can filter out taxonomic and/or functional groups that are not very accurate for providing a diagnosis.

Having determined the RAVs of a sequence population for both the control and condition populations, various statistical tests can be used to determine the statistical power of the sequence population for distinguishing between disease (condition) and no disease (control). In one embodiment, the Kolmogorov-Smirnov (KS) test may be used to provide a probability value (p-value) that the two distributions are actually identical. The smaller the p-value, the greater the probability of correctly identifying the population to which a sample belongs. A larger separation of the mean values between the two populations typically results in a smaller p-value (one example of a discrimination level). Other tests for comparing distributions may be used. Welch's t-test assumes that the distributions are Gaussian, which is not necessarily true for a particular sequence population. The KS test, since it is a non-parametric test, is well suited to comparing distributions of taxa or functions for which the probability distributions are unknown.
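The two-sample KS comparison described above can be sketched as follows. This is an illustrative implementation only: the function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction factor, which is an approximation and not necessarily what any given embodiment uses.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p_value(d, n_a, n_b, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.1:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))

# Hypothetical RAVs of one sequence group in control and condition samples.
control = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
condition = [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.30]

d = ks_statistic(control, condition)
print("D =", d, "p =", ks_p_value(d, len(control), len(condition)))
```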

The distributions of RAVs for the control and condition populations can be analyzed to identify sequence populations with a large separation between the two distributions. The separation can be measured as a p-value (see examples section). For example, the relative abundance values for the control population may have a distribution that peaks at a first value, with a certain width and decay. Also, the condition population may have another distribution that peaks at a second value that is statistically different from the first value. In this case, an abundance value from a control sample has a lower probability of falling within the distribution of abundance values observed in the condition samples. The greater the separation between the two distributions, the more accurate the discrimination in determining whether a given sample belongs to the control population or the condition population. As discussed later, the distributions can be used to determine the probability of a RAV in the control population and the probability of a RAV in the condition population, where the sequence populations associated with the greatest percentage difference between the two means have the smallest p-values, indicating greater separation between the two populations.

In performing the characterization, module S140 may additionally or alternatively convert input data from at least one of the microbiome composition dataset and the microbiome functional diversity dataset into feature vectors that may be tested for efficacy in predicting a characterization of the population of subjects. Data from the supplemental data set can be used to inform characterization of the autoimmune condition, where the characterization process is trained with a training data set of candidate features and candidate classifications to identify features and/or combinations of features that have a high degree (or low degree) of predictive power in accurately predicting a classification. Thus, refinement of the characterization process with the training data set identifies feature sets (e.g., of subject features, of combinations of features) that have a high correlation with the presence of an autoimmune condition.

In a variation, feature vectors effective in predicting classifications of the characterization process may include features related to one or more of the following: a microbiome diversity metric (e.g., with respect to distribution across taxa, with respect to distribution across archaeal, bacterial, viral, and/or eukaryotic populations), the presence of taxa in an individual's microbiome, the presence of particular gene sequences (e.g., 16S sequences) in an individual's microbiome, the relative abundance of taxa in an individual's microbiome, a microbiome resilience metric (e.g., in response to perturbations determined from a supplemental dataset), the abundance of genes encoding proteins or RNAs with a given function (enzymes, transporters, proteins from the immune system, hormones, interfering RNAs, etc.), and any other suitable features derived from a microbiome composition dataset, a microbiome functional diversity dataset (e.g., COG-derived features, KEGG-derived features, other functional features, etc.), and/or a supplemental dataset. Additionally, combinations of features can be used for the feature vector, where the features can be grouped and/or weighted in providing the combined features as part of the feature set. For example, a feature or set of features may include the number of bacterial classes present in a microbiome of an individual, the presence of a particular bacterial genus in a microbiome of an individual, the presence of a particular 16S sequence in a microbiome of an individual, and a weighted composite of the relative abundance of a first bacterial phylum relative to a second bacterial phylum. However, the feature vector may additionally or alternatively be determined in any other suitable manner.
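The example feature set in the preceding paragraph can be sketched as a small function that assembles a numeric feature vector. The function name, argument names, the choice of a simple weighted ratio for the composite feature, and all counts and identifiers below are hypothetical and serve only to illustrate the idea.

```python
def build_feature_vector(class_counts, genus_counts, sequence_ids, phylum_counts,
                         genus, sequence_id, phylum_a, phylum_b, weight=1.0):
    """Assemble [n_classes, genus_present, sequence_present, weighted_ratio]."""
    # Number of bacterial classes with at least one assigned read.
    n_classes = sum(1 for count in class_counts.values() if count > 0)
    # Presence (1.0) or absence (0.0) of a particular genus.
    genus_present = 1.0 if genus_counts.get(genus, 0) > 0 else 0.0
    # Presence or absence of a particular 16S sequence identifier.
    sequence_present = 1.0 if sequence_id in sequence_ids else 0.0
    # Weighted relative abundance of one phylum relative to another
    # (a plain ratio is assumed here; other composites are possible).
    total = sum(phylum_counts.values())
    rel_a = phylum_counts.get(phylum_a, 0) / total if total else 0.0
    rel_b = phylum_counts.get(phylum_b, 0) / total if total else 0.0
    ratio = weight * rel_a / rel_b if rel_b > 0 else 0.0
    return [float(n_classes), genus_present, sequence_present, ratio]

# Hypothetical read counts for a single sample.
vector = build_feature_vector(
    class_counts={"Bacteroidia": 500, "Clostridia": 400, "Bacilli": 0},
    genus_counts={"Prevotella": 120, "Lactococcus": 0},
    sequence_ids={"seq_0001", "seq_0042"},
    phylum_counts={"Bacteroidetes": 600, "Firmicutes": 300, "Other": 100},
    genus="Prevotella",
    sequence_id="seq_0042",
    phylum_a="Bacteroidetes",
    phylum_b="Firmicutes",
)
print(vector)
```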

In the example of block S140, assuming that sequencing has occurred at a sufficient depth, one can quantify the number of sequence reads that indicate the presence of a feature (e.g., a feature described in sections 1.4.1-1.4.8 below), thereby allowing one to assign a value to the estimated amount of one of the criteria. The number of reads or other measures of the amount of one of the features may be provided as absolute or relative values. An example of an absolute value is the number of 16S rRNA coding sequence reads that map to a particular genus. Alternatively, relative amounts may be determined. An example relative amount calculation is to determine the amount of 16S rRNA coding sequence reads for a particular taxon (e.g., genus, family, order, class, or phylum) relative to the total number of 16S rRNA coding sequence reads assigned to a domain. The value indicative of the amount of the feature in the sample can then be compared to a cutoff value or probability distribution in a disease marker for the autoimmune condition. For example, if the disease marker indicates that an amount of feature #1 of 50% or more of all features possible at that level is indicative of an autoimmune condition, then a quantification in the sample of less than 50% of the gene sequences associated with feature #1 would indicate a higher likelihood of health (or at least of not having that particular autoimmune condition), and optionally a quantification in the sample of more than 50% of the gene sequences associated with feature #1 would indicate a higher likelihood of disease.
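A relative amount and its comparison to a cutoff value can be sketched as below. The function names, the read counts, and the 50% cutoff are illustrative assumptions that mirror the example in the preceding paragraph.

```python
def relative_abundance(read_counts, taxon):
    """Reads assigned to a taxon divided by all reads assigned to the domain."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned")
    return read_counts.get(taxon, 0) / total

def compare_to_cutoff(value, cutoff, higher_indicates_condition=True):
    """Return 'condition' or 'control' depending on the side of the cutoff."""
    above = value >= cutoff
    return "condition" if above == higher_indicates_condition else "control"

# Hypothetical 16S read counts per genus for one sample.
counts = {"Prevotella": 600, "Bacteroides": 300, "Other": 100}

rav = relative_abundance(counts, "Prevotella")
print(rav, compare_to_cutoff(rav, cutoff=0.5))
```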

In an example, a taxonomic group and/or functional group can be referred to as a feature, or, in the context of determining the amount of sequence reads that correspond to a particular group (feature), as a sequence group. In an example, a score for a particular bacterium or genetic pathway can be determined from a comparison of abundance values with one or more reference (calibration) abundance values for known samples, e.g., where detected abundance values below a certain value are scored as being correlated with the autoimmune condition in question and detected abundance values above a certain value are scored as being correlated with health, or vice versa, depending on the particular criteria. The scores for various bacteria or genetic pathways may be combined to provide a classification for the subject. Also, in an example, the comparison of the abundance value to the one or more reference abundance values can include comparison to a cutoff value determined from the one or more reference values. Such cutoff values may be part of a decision tree or clustering technique that is determined using the reference abundance values (where the cutoff value is used to determine to which cluster the abundance value belongs). The comparison may include intermediately determining other values (e.g., probability values). The comparison may also include comparing the abundance value to a probability distribution of the reference abundance values, and thus to a probability value.

In some embodiments, certain samples may not exhibit any presence of a particular taxon, or at least not a presence above a relatively low threshold (i.e., a threshold below either of the two distributions for the control and condition populations). As such, a particular sequence population may be prevalent in a population, e.g., more than 30% of the population may have the taxonomic group. Another sequence population may be less prevalent in the population, e.g., occurring in only 5% of the population. The prevalence of a certain sequence population (e.g., the percentage of the population that exhibits it) can provide information on how likely it is that the sequence population can be used to determine a diagnosis.

In such instances, when the subject falls within the 30%, the sequence population can be used to determine the status of the condition (e.g., a diagnosis). However, when the subject does not fall within the 30%, such that the taxon is simply not present, the particular taxon may not be helpful in determining the diagnosis of the subject. As such, whether a particular taxon or functional group is useful in diagnosing a particular subject may depend on whether nucleic acid molecules corresponding to the sequence population are actually sequenced.

Thus, a disease marker may include more sequence populations than are used for a given subject. As one example, a disease marker may include 100 sequence populations, but only 60 sequence populations may be detected in a sample. The classification of the subject (including any probability in the application) will then be determined based on the 60 sequence populations.

With respect to the generation of a characterization model, sequence populations with a high level of discrimination (e.g., a low p-value) for a given disease can be identified and used as part of a characterization model, e.g., one that uses disease markers to determine the probability of a subject having a disease. The disease marker may include a set of sequence populations, and discriminating criteria (e.g., cutoff values and/or probability distributions) for providing a classification of the subject. The classifications may be binary (e.g., disease or non-disease) or have more classifications (e.g., probability values of having or not having the disease). Which sequence groups of the disease markers are used in making the classification depends on the particular sequence reads obtained, e.g., if no sequence reads are assigned to a sequence group, that sequence group is not used. In some embodiments, separate characterization models may be determined for different populations, for example, according to the geography (e.g., country, region, or continent) in which the subject currently resides, the general history of the subject (e.g., race), or other factors.

1.4.0 Selection of sequence groups, sequence group discrimination criteria, and use of sequence groups

As mentioned above, sequence populations having at least a specified level of discrimination may be selected for inclusion in the characterization model. In various embodiments, the specified discrimination level may be an absolute level (e.g., having a p-value below a specified value), a percentage (e.g., being in the top 10% of discrimination levels), or a specified number of the top discrimination levels (e.g., the top 100 discrimination levels). In some embodiments, the characterization model may comprise a network graph, wherein each node in the graph corresponds to a sequence population having at least a specified level of discrimination.
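The three selection criteria above (absolute level, percentage, and fixed number) can be sketched as a single filter over per-group p-values. The function name, parameter names, and the p-values are hypothetical.

```python
def select_sequence_groups(p_values, max_p=None, top_fraction=None, top_n=None):
    """Select sequence groups by discrimination level (lower p-value is better)."""
    # Rank all candidate groups from most to least discriminating.
    ranked = sorted(p_values, key=p_values.get)
    if max_p is not None:
        # Absolute level: keep groups whose p-value is at or below max_p.
        ranked = [group for group in ranked if p_values[group] <= max_p]
    if top_fraction is not None:
        # Percentage: keep the top fraction of all candidate groups.
        keep = max(1, round(len(p_values) * top_fraction))
        ranked = ranked[:keep]
    if top_n is not None:
        # Fixed number: keep at most the top_n best groups.
        ranked = ranked[:top_n]
    return ranked

# Hypothetical p-values for five candidate sequence groups.
p_values = {"g1": 0.0005, "g2": 0.03, "g3": 0.2, "g4": 0.001, "g5": 0.6}

print(select_sequence_groups(p_values, max_p=0.01))
print(select_sequence_groups(p_values, top_fraction=0.4))
print(select_sequence_groups(p_values, top_n=3))
```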

The sequence populations used in the disease markers of the characterization model may also be selected based on other factors. For example, a particular sequence population may be detected in only a certain percentage of the population, referred to as the percentage of coverage. An ideal sequence population will be detected in a high percentage of the population and have a high discrimination level (e.g., a low p-value). A minimum percentage may be required before adding a sequence population to a characterization model for a particular disease. The minimum percentage may vary based on the accompanying level of discrimination. For example, if the discrimination level is higher, a lower percentage of coverage may be tolerated. As another example, 95% of patients with a condition may be classified by one or a combination of several sequence groups, and the remaining 5% may be explained by one further sequence group, which relates to the orthogonality or overlap between the coverage of the sequence groups. As such, a sequence population that provides discriminating power for only 5% of affected individuals can still be valuable.

Another factor for determining which sequence populations to include in the disease signature of a characterization model is the overlap among the subjects exhibiting the sequence populations of the disease signature. For example, two sequence populations may both have a high percentage of coverage, but the sequence populations may cover the exact same subjects. Thus, adding the second of these sequence populations does not increase the overall coverage of the disease signature. In this case, the two sequence populations can be considered to be parallel to each other. Another sequence population can be selected for addition to the characterization model based on that sequence population covering different subjects than the sequence populations already in the characterization model. Such a sequence population can be considered orthogonal to the sequence populations already present in the characterization model.
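One way to favor orthogonal over parallel sequence populations is a greedy selection that repeatedly adds the population covering the most not-yet-covered subjects. This greedy strategy, the function name, and the subject identifiers below are assumptions made for illustration; the specification does not prescribe a particular selection procedure.

```python
def greedy_orthogonal_selection(coverage, min_new_subjects=1):
    """Pick sequence groups that each add coverage of previously uncovered subjects.

    coverage maps a sequence group name to the set of subject IDs in which
    that group was detected.
    """
    selected = []
    covered = set()
    remaining = dict(coverage)
    while remaining:
        # Choose the group that adds the most new subjects; iterating over
        # sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda g: len(remaining[g] - covered))
        gain = len(remaining[best] - covered)
        if gain < min_new_subjects:
            # Every remaining group is parallel to what is already selected.
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Hypothetical coverage: groups A and B cover the same subjects (parallel),
# while group C covers different subjects (orthogonal to A).
coverage = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4}, "C": {5, 6}}

selected, covered = greedy_orthogonal_selection(coverage)
print(selected, sorted(covered))
```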

For example, the selection of a sequence group may take the following factors into account. A taxon can occur in 100% of healthy individuals and 100% of diseased individuals, but with distributions in the two groups that are so close that knowing the relative abundance of the taxon only allows a few individuals to be classified as either diseased or healthy (i.e., it has a low discrimination level). By contrast, a taxon present in only 20% of healthy individuals and 30% of affected individuals may have distributions of relative abundance that are so different from each other that they allow classification of that 20% of healthy individuals and 30% of affected individuals (i.e., it has a high discrimination level).

In some embodiments, machine learning techniques may allow for automatic identification of the best combination of features (e.g., sequence populations). For example, principal component analysis may reduce the number of features used for classification to only those that are most orthogonal to each other and that account for most of the variation in the data. The same is true of a network theory approach, where one can create multiple distance metrics based on different features and evaluate which distance metric best separates diseased from healthy individuals.

The discrimination criteria for the sequence populations contained in the disease signature of the characterization model can be determined based on the condition distribution and the control distribution for the disease. For example, the discrimination criterion for a sequence population may be a cutoff value between the mean values of the two distributions. As another example, the discrimination criteria for a sequence population may include probability distributions for the control and condition populations. The probability distributions may be determined in a separate manner from the process of determining the discrimination level.

The probability distributions may be determined based on the RAV distributions of the two populations. The mean (or another average, or the median) of each population can be used to center the peaks of the two probability distributions. For example, if the average RAV of the condition population is 20% (or 0.2), the probability distribution of the condition population may have its peak at 20%. The width or other shape parameters (e.g., decay) may also be determined based on the RAV distribution of the condition population. The same can be done for the control population.
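As a sketch of these discrimination criteria, each population's RAVs can be summarized by a distribution centered on the mean, and a cutoff can be placed between the two means. A Gaussian shape is assumed here purely for simplicity (the text notes that real RAV distributions need not be Gaussian), and the function names and RAV values are hypothetical.

```python
from statistics import NormalDist

def fit_population(ravs):
    """Fit a normal distribution whose peak sits at the mean RAV.

    The Gaussian shape is a simplifying assumption; the standard deviation
    plays the role of the width parameter. At least two values are required.
    """
    return NormalDist.from_samples(ravs)

def midpoint_cutoff(control_dist, condition_dist):
    """Cutoff value halfway between the means of the two distributions."""
    return (control_dist.mean + condition_dist.mean) / 2.0

# Hypothetical RAVs of one sequence group in each population.
control_ravs = [0.08, 0.10, 0.12]
condition_ravs = [0.18, 0.20, 0.22]

control_dist = fit_population(control_ravs)
condition_dist = fit_population(condition_ravs)

print("condition peak:", condition_dist.mean)
print("cutoff:", midpoint_cutoff(control_dist, condition_dist))
```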

The sequence populations included in the disease markers of the characterization model can be used to classify new subjects. The sequence populations may be considered features of a feature vector, or the RAVs of the sequence populations may be considered features of a feature vector, wherein the feature vector may be compared to the discrimination criteria of the disease marker. For example, the RAVs of a new subject's sequence populations can be compared to the probability distributions for each sequence population of the disease marker. If a RAV is zero or near zero, the sequence group may be skipped and not used in classification.

The RAVs of the sequence populations exhibited in the new subject can be used to determine the classification. For example, the results (e.g., probability values) for each exhibited sequence group may be combined to arrive at a final classification. As another example, clustering of the RAVs can be performed, and the clusters can be used to determine a classification of a condition.
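Combining per-group results into a final classification can be sketched as follows. This example assumes Gaussian per-group distributions, treats the sequence groups as independent, and combines them as summed log-likelihood ratios with an equal prior; all of these are illustrative assumptions, as are the function name, the near-zero threshold, and the numbers.

```python
import math
from statistics import NormalDist

def classify(sample_ravs, signature, min_rav=1e-4, prior_condition=0.5):
    """Return (probability of condition, list of sequence groups actually used)."""
    log_odds = math.log(prior_condition / (1.0 - prior_condition))
    used = []
    for group, (control_dist, condition_dist) in signature.items():
        rav = sample_ravs.get(group, 0.0)
        if rav < min_rav:
            # Group absent or near zero in this sample: skip it.
            continue
        # Floor the densities to avoid log(0) far out in the tails.
        p_condition = max(condition_dist.pdf(rav), 1e-300)
        p_control = max(control_dist.pdf(rav), 1e-300)
        log_odds += math.log(p_condition) - math.log(p_control)
        used.append(group)
    # Clamp before exponentiating so that math.exp cannot overflow.
    log_odds = max(-700.0, min(700.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds)), used

# Hypothetical signature: group -> (control distribution, condition distribution).
signature = {
    "g1": (NormalDist(0.10, 0.02), NormalDist(0.20, 0.02)),
    "g2": (NormalDist(0.30, 0.05), NormalDist(0.15, 0.05)),
}

probability, used = classify({"g1": 0.19, "g2": 0.0}, signature)
print(probability, used)
```

In this usage example only g1 contributes, because g2 was not detected in the sample.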

As shown in fig. 4, in one such alternative variation of block S140, the characterization process may be generated and trained according to a Random Forest Predictor (RFP) algorithm that combines bagging (i.e., bootstrap aggregation) with selection of random sets of features from the training data set to construct a set of decision trees T associated with the random sets of features. In using a random forest algorithm, N cases are sampled at random with replacement from the training data set to create the subset used for each decision tree, and for each node, m prediction features are selected from all of the prediction features for evaluation. The prediction feature that provides the best split at the node (e.g., according to an objective function) is used to perform the split (e.g., as a bifurcation at the node, as a trifurcation at the node). By sampling from a large data set multiple times, the strength of the characterization process in identifying features that are strong in predicting classifications can be substantially increased. In this variation, measures to prevent bias (e.g., sampling bias) and/or account for the amount of bias may be included during processing to improve the robustness of the model.
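The bagging and per-node random feature selection described above can be sketched with a small pure-Python random forest. This is a simplified illustration, not the implementation of fig. 4: it assumes binary splits, Gini impurity as the objective function, a fixed maximum depth, string class labels, and hypothetical two-feature training data.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, m, rng, depth=0, max_depth=3):
    """Grow a tree; internal nodes are tuples, leaves are class labels (strings)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    # Evaluate only m randomly chosen features at this node.
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], m, rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], m, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw N cases at random with replacement for each tree.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical training data: two relative-abundance features per subject.
rows = [[0.05, 0.04], [0.06, 0.05], [0.04, 0.06], [0.07, 0.03], [0.05, 0.05],
        [0.30, 0.25], [0.28, 0.31], [0.33, 0.27], [0.29, 0.30], [0.31, 0.26]]
labels = ["control"] * 5 + ["condition"] * 5

forest = train_forest(rows, labels)
print(predict_forest(forest, [0.50, 0.50]), predict_forest(forest, [0.01, 0.01]))
```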

1.4.1 Characterization of AIDS

In one embodiment, the statistical analysis-based characterization process of block S140 may be based on an algorithm trained and validated with validation data sets derived from a subset of the population of subjects to identify a feature set having the highest correlation with Acquired Immune Deficiency Syndrome (AIDS) for which one or more therapies would have a positive effect. In particular, AIDS in this first variation is an immune disease characterized by an immunodeficiency, as generally assessed based on analysis of immune responsive material (e.g., cells, antibodies, cytokines, etc.) and comparison to a given threshold level of material. In a first variation, a set of features useful for diagnostics associated with AIDS includes features derived from one or more of the following taxa: Prevotellaceae (family), Prevotella (genus), Megasphaera (genus), Veillonellaceae (family), Erysipelotrichaceae (family), Erysipelotrichia (class), Erysipelothrix (class), Bacteroidales (order), Bacteroidia (class), Bacteroides (phylum/phylum) group, Bacteroides (order), Serenofeldes (order), Selenomonadales (order), Negativicutes (class), Laciniaceae (family), Flavobacteriales (order), Flavobacteriaceae (family), Porphyceae (family), [...], Lachnospira (genus), Barnesiella (genus), coleobacter (genus), Parasutterella (genus), and Parasutterella excrementihominis (species).

Additionally or alternatively, the set of features associated with AIDS may be derived from one or more of the following: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. In particular examples, such features may include one or more of the following: a neurodegenerative disease KEGG L2 derived feature; a transcription KEGG L2 derived feature; a cofactor and vitamin metabolism KEGG L2 derived feature; an endocrine system KEGG L2 derived feature; a cancer KEGG L2 derived feature; an amino acid metabolism KEGG L2 derived feature; a glycolysis/gluconeogenesis KEGG L3 derived feature; a streptomycin biosynthesis KEGG L3 derived feature; a restriction enzyme KEGG L3 derived feature; a fatty acid biosynthesis KEGG L3 derived feature; a PPAR signaling pathway KEGG L3 derived feature; a phosphotransferase system (PTS) KEGG L3 derived feature; a lipid metabolism KEGG L3 derived feature; an aminobenzoate degradation KEGG L3 derived feature; a pathway in cancer KEGG L3 derived feature; a mismatch repair KEGG L3 derived feature; a vitamin B6 metabolism KEGG L3 derived feature; a butirosin and neomycin biosynthesis KEGG L3 derived feature; a pantothenate and CoA biosynthesis KEGG L3 derived feature; an oxidative phosphorylation KEGG L3 derived feature; a zeatin biosynthesis KEGG L3 derived feature; an energy metabolism KEGG L3 derived feature; a limonene and pinene degradation KEGG L3 derived feature; a valine, leucine, and isoleucine biosynthesis KEGG L3 derived feature; a bacterial chemotaxis KEGG L3 derived feature; a homologous recombination KEGG L3 derived feature; a lipopolysaccharide biosynthesis protein KEGG L3 derived feature; a transcription machinery KEGG L3 derived feature; and a selenoid KEGG L3 derived feature.

Thus, characterization of the subject includes characterizing the subject as a subject with AIDS based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.2 Characterization of asthma

In another embodiment, the statistical analysis-based characterization method of module S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify the set of features with the highest correlation with asthma for which one or more therapies would have a positive effect. In particular, asthma in this first variation is a respiratory disorder characterized by chronic inflammation of the airways, as commonly assessed using one or more of the following: analysis of respiratory-related symptoms, response to therapy, and spirometry. In a first variation, a set of features useful for diagnosis in connection with asthma includes features derived from one or more of the following taxa: Filifactor (genus), Mycoplasma (genus), Mycoplasmataceae (family), and Mycoplasmatales (order).

Additionally or alternatively, the set of features associated with asthma may be derived from one or more of the following taxa: Mollicutes (phylum), Alphaproteobacteria (class), Mollicutes (class), Rhodospirillales (order), Acidaminococcaceae (family), Peptostreptococcaceae (family), Prevotellaceae (family), Christinellaceae (family), Christensenellaceae (family), Ruminococcaceae (family), Cholesterol (genus), Dorea (genus), Coprococcus (genus), Coelobacter (genus), Moryella (genus), Prevotella (genus), Clostridium (genus), Choristobacterium (genus), Choristonella (genus), Streptococcus thermophilus (Diphyllophora) (species), Parameridae (species), [...], Eubacterium desmolans, unclassified Peptostreptococcaceae (unclassified), unclassified Peptostreptococcaceae (miscellaneous) (unclassified), and environmental samples (unclassified).

Additionally or alternatively, the set of features associated with asthma may be derived from one or more of the following: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of the subject includes characterizing the subject as a subject suffering from asthma based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.3 Characterization of multiple sclerosis

In another embodiment, the statistical analysis-based characterization method of module S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify the set of features with the highest correlation with multiple sclerosis for which one or more therapies would have a positive effect. In particular, multiple sclerosis in this first variation is an inflammatory disease characterized by damage to nervous system cells and tissues, as generally assessed by medical imaging and/or testing of the cerebrospinal fluid for evidence of chronic inflammation. In a first variation, a set of features useful for diagnostics associated with multiple sclerosis includes features derived from one or more of the following taxa: Lactococcus (genus).

Additionally or alternatively, the set of features associated with multiple sclerosis may be derived from one or more of the following taxa: Verrucomicrobiae (class), Verrucomicrobiales (order), Anaerostipes (genus), Tricpiraceae (family), Cyanobacteria (phylum), Peptococcus (genus), Coprococcus (species), Clostridium bacterium A2-162 (species), Prevotellaceae (family), Prevotella (genus), butyric acid-producing bacterium L1-93 (species), Actinobacillus (genus), Pasteurellaceae (family), Pasteurellales (order), and Actinomycetales (order).

Additionally or alternatively, the set of features associated with multiple sclerosis may be derived from one or more of the following: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of a subject includes characterizing a subject as one suffering from multiple sclerosis based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.4 Characterization of rheumatoid arthritis

In another embodiment, the statistical analysis-based characterization method of block S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify a set of features having the highest correlation with rheumatoid arthritis for which one or more therapies would have a positive effect. In particular, rheumatoid arthritis in this first variation is a systemic inflammatory disorder characterized by inflammation of synovial tissue, as typically assessed by medical imaging and/or blood tests for the presence of rheumatoid factors. In a first variation, a set of features useful for diagnosis associated with rheumatoid arthritis includes features derived from one or more of the following taxa: Bacteroides uniformis (species) and Alistipes onderdonkii (species).

Additionally or alternatively, the set of features associated with rheumatoid arthritis may be derived from one or more of the following: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of the subject includes characterizing the subject as a subject with rheumatoid arthritis based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.5 Characterization of Sjögren's syndrome

In another embodiment, the statistical analysis-based characterization method of block S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify the set of features with the highest correlation with Sjögren's syndrome for which one or more therapies will have a positive effect. In particular, Sjögren's syndrome in this first variation is a chronic autoimmune disease characterized by the destruction of exocrine glands, as commonly assessed using blood tests (e.g., testing for rheumatoid factors, testing for antinuclear antibodies), the rose bengal test, the Schirmer test, and radiological assessments. In a first variation, a set of features useful for diagnosis associated with Sjögren's syndrome includes features derived from one or more of the following taxa: Oceanospirillales (order), Adlercreutzia (genus), and Adlercreutzia equolifaciens (species).

Additionally or alternatively, the set of features associated with Sjogren's syndrome may be derived from one or more of: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of the subject includes characterizing the subject as a subject with Sjogren's syndrome based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.6 Sprue/gluten intolerance characterization

In another embodiment, the statistical analysis-based characterization method of block S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify a set of features having the highest correlation with sprue and/or any other suitable type of gluten intolerance for which one or more therapies would have a positive effect. In particular, sprue in this first variation is an autoimmune disorder of the small intestine, as typically assessed by blood tests for disease-specific antibodies and by endoscopy. In a first variation, a set of features useful for diagnosis associated with sprue includes features derived from one or more of the following taxa: Bifidobacterium (genus), Moryella (genus), Dorea (genus), Oscillatoria (Oscillus) (genus), Intestinimonas (genus), Coriolus (genus), Colistella (genus), Bacteroides (genus), Dialister (genus), Subdoligranulum (genus), Hespellia (genus), Alistipes (genus), Trichospira (genus), Faecalibacterium (genus), Bifidobacteriaceae (family), Veillonellaceae (family), Bacteroidaceae (family), Ruminococcaceae (family), Ruminospiraceae (family), Fluocitaceae (family), Coriobacteriaceae (family), Streptococcaceae (family), Salmonella (family), Enterobacteriaceae (family), Coriobacteriales (order), Enterobacteriales (order), Clostridiales (order), Actinobacteria (class), Negativicutes (class), Bacteroidia (class), Clostridia (class), Actinobacteria (phylum), Bacteroidetes (phylum), and Firmicutes (phylum).

Additionally or alternatively, the set of features associated with sprue may be derived from one or more of the following taxa: Anaerostipes (genus), Streptococcus thermophilus (species), Clostridium mollicum (species), Chlamydiae/Verrucomicrobia group (superphylum), Verrucomicrobia (phylum), Verrucomicrobiae (class), Verrucomicrobiales (order), Verrucomicrobiaceae (family), Eubacterium siraeum (species), Tricpiraceae (unclassified), Akkermansia (genus), Negativicutes (class), Selenomonadales (order), Actinomycetes (phylum), Actinomycetes (subclass), Deltaproteobacteria (class), Ruminococcus fascias (species), Desulfurobacteriales (order), and delta/epsilon subdivisions (subphylum).

Additionally or alternatively, the set of features associated with sprue and/or any other suitable type of gluten intolerance may be derived from COG and/or KEGG features, including one or more of the following KEGG L2 derived features: carbohydrate metabolism; metabolism; translation; genetic information processing; enzyme families; replication and repair; nucleotide metabolism; transport and catabolism; lipid metabolism; signal transduction; neurodegenerative diseases; metabolism of cofactors and vitamins; xenobiotics biodegradation and metabolism; folding, sorting, and degradation; cell growth and death; biosynthesis of other secondary metabolites; immune system diseases; environmental adaptation; and cellular processes and signaling; and/or one or more of the following KEGG L3 derived features: ribosome biogenesis; peptidoglycan biosynthesis; biosynthesis and biodegradation of secondary metabolites; translation proteins; pentose and glucuronate interconversions; chromosome; DNA repair and recombination proteins; glyoxylate and dicarboxylate metabolism; inositol phosphate metabolism; ribosome; aminoacyl-tRNA biosynthesis; nicotinate and nicotinamide metabolism; amino acid related enzymes; purine metabolism; terpenoid backbone biosynthesis; translation factors; peptidases; nucleotide excision repair; RNA polymerase; D-alanine metabolism; ribosome biogenesis in eukaryotes; cysteine and methionine metabolism; type II diabetes; homologous recombination; replication, recombination and repair proteins; pentose phosphate pathway; pyrimidine metabolism; carbohydrate metabolism; fructose and mannose metabolism; pyruvate metabolism; mismatch repair; other glycan degradation; one carbon pool by folate; DNA replication proteins; DNA replication; riboflavin metabolism; other transporters; butanoate metabolism; sphingolipid metabolism; aminobenzoate degradation; galactose metabolism; cell cycle - Caulobacter; thiamine metabolism; protein export; glycerophospholipid metabolism; nitrogen metabolism; tuberculosis; two-component system; ethylbenzene degradation; chloroalkane and chloroalkene degradation; pores ion channels; peroxisome; sulfur metabolism; inorganic ion transport and metabolism; amino acid metabolism; propanoate metabolism; arginine and proline metabolism; histidine metabolism; primary immunodeficiency; chaperones and folding catalysts; base excision repair; amino sugar and nucleotide sugar metabolism; phenylpropanoid biosynthesis; plant-pathogen interaction; energy metabolism; streptomycin biosynthesis; D-glutamine and D-glutamate metabolism; polycyclic aromatic hydrocarbon degradation; limonene and pinene degradation; lysine degradation; and zeatin biosynthesis. Thus, characterization of the subject includes characterizing the subject as a subject with sprue and/or any other suitable type of gluten intolerance based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.7 Characterization of systemic lupus erythematosus

In another embodiment, the statistical analysis-based characterization method of block S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify a set of features having the highest correlation with systemic lupus erythematosus for which one or more therapies would have a positive effect. In particular, systemic lupus erythematosus in this first variation is an autoimmune disease, as typically assessed by antibody testing. In a first variation, a set of features useful for diagnosis associated with systemic lupus erythematosus includes features derived from one or more of the following taxa: Anaerotruncus (genus), Parabacteroides merdae (species), Parabacteroides (genus), Porphyromonadaceae (family), Fibrobacteres/Acidobacteria group, Acidobacteria (phylum), Erysipelotrichaceae (family), and unclassified Clostridiaceae (family).

Additionally or alternatively, the set of features associated with systemic lupus erythematosus may be derived from one or more of: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of the subject includes characterizing the subject as one suffering from systemic lupus erythematosus based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

1.4.8 Characterization of type I diabetes

In another embodiment, the statistical analysis-based characterization method of block S140 may be based on an algorithm trained and validated with validation datasets derived from a subset of the population of subjects to identify a set of features having the highest correlation with type I diabetes for which one or more therapies would have a positive effect. In particular, type I diabetes in this first variation is an autoimmune disorder characterized by destruction of insulin-producing beta cells in the pancreas, as typically assessed based on observations of hyperglycemia (e.g., from fasting glucose levels, plasma glucose levels, or hemoglobin measurements). In a first variation, a set of features useful for diagnosis associated with type I diabetes includes features derived from one or more of the following taxa: Porphyridonaceae (family), Oscillatoria (genus), Peptophilus (genus), Ruminococcus faecalis (species), and Deuterospiriales (order).

Additionally or alternatively, the set of features associated with type I diabetes may be derived from one or more of the following: COG derived features, KEGG L2, L3, L4 derived features, and any other suitable functional features. Thus, characterization of a subject includes characterizing a subject as one with type I diabetes based on detecting one or more of the above features, in an alternative or complementary manner to classical diagnostic methods. However, in variations of the specific examples, the set of features may include any other suitable features useful for diagnosis.

Characterization of the subject can additionally or alternatively be performed using a high false positive test and/or a high false negative test to further analyze the sensitivity of the characterization method in supporting the analysis generated according to embodiments of method 100.

Furthermore, with respect to the methods described above, deep sequencing methods may allow for the determination of sufficient copy number of DNA sequences to determine the relative amounts of the corresponding bacteria or genetic pathways in a sample. Having identified one or more of the features described in sections 1.4.1-1.4.8 above, one can now diagnose an autoimmune condition in an individual by detecting one or more of the above features by any quantitative detection method. For example, although deep sequencing can be used to detect the presence, absence, or amount of one or more of the options in sections 1.4.1-1.4.8, one can also use other detection methods. For example, without intending to limit the scope of the invention, one may use protein-based diagnostics, such as immunoassays, to detect bacterial taxa by detecting taxon-specific protein markers.

1.5 First method: therapy model and provision

As shown in fig. 1A, in some variations, the first method 100 may further include block S150, which recites: generating, based on the characterization model, a therapy model configured to correct or otherwise improve a state of the autoimmune condition. Block S150 functions to identify or predict therapies (e.g., probiotic-based therapies, prebiotic-based therapies, phage-based therapies, small molecule-based therapies, etc.) that can shift the microbiome composition and/or functional characteristics of a subject toward a desired equilibrium state in order to improve the health of the subject. In block S150, the therapy may be selected from therapies comprising one or more of: probiotic therapy, phage-based therapy, prebiotic therapy, small molecule-based therapy, cognitive/behavioral therapy, physical rehabilitation therapy, clinical therapy, drug-based therapy, diet-related therapy, and/or any other suitable therapy designed to act in any other suitable manner in improving the health of a user. In particular examples of phage-based therapies, one or more populations of phage (e.g., in terms of colony forming units) specific for certain bacteria (or other microorganisms) present in a subject with an autoimmune condition can be used to down-regulate or otherwise eliminate populations of those bacteria. Thus, phage-based therapies can be used to reduce the size of an undesirable population of bacteria present in a subject. Complementarily, phage-based therapies can be used to increase the relative abundance of bacterial populations not targeted by the phage used.

For example, with respect to variations of the autoimmune conditions in sections 1.4.1 to 1.4.8 above, therapies (e.g., probiotic therapies, phage-based therapies, prebiotic therapies, etc.) can be configured to down-regulate and/or up-regulate populations or subpopulations of microorganisms (and/or their functions) that are associated with characteristics specific to the autoimmune conditions.

In one such variation, block S150 may include one or more of the following steps: obtaining a sample from a subject; purifying nucleic acid (e.g., DNA) from the sample; deep sequencing nucleic acid from the sample to determine the amount of one or more of the features in one or more of sections 1.4.1-1.4.8; and comparing the resulting amount of each feature to one or more reference amounts for one or more of the features listed in one or more of sections 1.4.1-1.4.8, as occurs in an average individual having an autoimmune condition, an individual not having an autoimmune condition, or both. The compilation of features may sometimes be referred to as a "disease signature" for a particular disease. A disease signature can be used as a characterization model and can include a probability distribution for a control population (no disease), a condition population (with disease), or both. A disease signature can include one or more of the features (e.g., bacterial taxa or genetic pathways) in a section and can optionally include criteria determined from abundance values of the control and/or condition populations. Example criteria may include cutoff or probability values for the amounts of those features associated with an average healthy or diseased individual.
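By way of a non-limiting illustration, the comparison of measured feature amounts against reference criteria may be sketched as follows. This is a minimal sketch under stated assumptions: the feature names, the cutoff values, and the rule of counting the fraction of satisfied criteria are hypothetical and are not taken from the sections above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One entry of a hypothetical compilation of features with cutoffs."""
    feature: str                 # taxon or pathway name (placeholder)
    cutoff: float                # illustrative relative-abundance cutoff
    elevated_in_condition: bool  # True if higher amounts indicate the condition


def satisfies(criterion: Criterion, amount: float) -> bool:
    """True if the measured amount falls on the 'condition' side of the cutoff."""
    if criterion.elevated_in_condition:
        return amount >= criterion.cutoff
    return amount <= criterion.cutoff


def signature_match_fraction(signature, amounts) -> float:
    """Fraction of criteria met; features absent from the sample count as 0."""
    hits = sum(satisfies(c, amounts.get(c.feature, 0.0)) for c in signature)
    return hits / len(signature)


# Placeholder criteria and sample; none of these values come from the source.
SIGNATURE = [
    Criterion("taxon_A", cutoff=0.05, elevated_in_condition=True),
    Criterion("pathway_B", cutoff=0.10, elevated_in_condition=False),
]
sample_amounts = {"taxon_A": 0.08, "pathway_B": 0.02}
print(signature_match_fraction(SIGNATURE, sample_amounts))
```

In practice, a probability distribution for the control and/or condition population could replace the fixed cutoffs assumed here.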

In a specific example of a probiotic therapy, as shown in fig. 5, a candidate therapy of the therapy model may perform one or more of the following: blocking pathogen entry into epithelial cells by providing a physical barrier (e.g., by means of colonization resistance), inducing the formation of a mucus barrier by stimulating goblet cells, enhancing the integrity of apical tight junctions between epithelial cells of a subject (e.g., by stimulating upregulation of zonula occludens-1, by preventing redistribution of tight junction proteins), producing antimicrobial factors, stimulating the production of anti-inflammatory cytokines (e.g., by signaling of dendritic cells and induction of regulatory T cells), triggering an immune response, and performing any other suitable function that regulates the microbiome of a subject away from a dysregulated state.

In a variation, the therapy model is preferably based on data from a large population of subjects, which may include a population of subjects from which the microbiome-related dataset originated in block S110, wherein the microbiome composition and/or functional characteristics or health state are sufficiently characterized prior to and after exposure to the plurality of therapeutic measures. Such data can be used to train and validate therapy delivery models in identifying therapeutic measures that provide a desired outcome to a subject based on different microbiome characterizations. In a variation, a support vector machine, which is a supervised machine learning algorithm, may be used to generate the therapy delivery model. However, any other suitable machine learning algorithm described above may facilitate generation of the therapy delivery model.
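By way of a non-limiting illustration, a linear support vector machine of the kind mentioned above may be sketched as follows. This is a simplified sketch under stated assumptions: it trains by sub-gradient descent on the regularized hinge loss, omits the intercept term for brevity, and uses invented two-feature data in which each vector holds placeholder relative abundances before therapy and each label indicates whether a hypothetical therapy gave the desired outcome (+1) or not (-1). It is not the training procedure of the method 100.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Sub-gradient descent on the L2-regularized hinge loss (no intercept).

    X: list of feature vectors; y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            # Regularization shrinks the weights on every step.
            w = [(1.0 - lr * lam) * w_j for w_j in w]
            # The hinge loss contributes only when the example is inside the margin.
            if margin < 1.0:
                w = [w_j + lr * y_i * x_j for w_j, x_j in zip(w, x_i)]
    return w


def predict(w, x):
    """Return +1 (desired outcome predicted) or -1."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if score >= 0 else -1


# Invented training data: [abundance of taxon_A, abundance of taxon_B].
X_train = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.1],
           [0.1, 0.8], [0.2, 0.7], [0.1, 0.9]]
y_train = [1, 1, 1, -1, -1, -1]

weights = train_linear_svm(X_train, y_train)
print(predict(weights, [0.85, 0.05]), predict(weights, [0.05, 0.85]))
```

A production model would typically rely on an established machine learning library and on held-out validation data, as the paragraph above indicates.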

Although some methods of statistical analysis and machine learning are described above with respect to the execution of the blocks, variations of the method 100 may additionally or alternatively use any other suitable algorithm in performing the characterization process. In variations, the algorithm may be characterized by a learning approach that includes any one or more of the following: supervised learning (e.g., using logistic regression, using back-propagation neural networks), unsupervised learning (e.g., using an Apriori algorithm, using K-means clustering), semi-supervised learning, reinforcement learning (e.g., using a Q-learning algorithm, using temporal difference learning), and any other suitable learning approach. Further, the algorithm may implement any one or more of the following: regression algorithms (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, etc.), instance-based methods (e.g., k-nearest neighbors, learning vector quantization, self-organizing maps, etc.), regularization methods (e.g., ridge regression, least absolute shrinkage and selection operator, elastic net, etc.), decision tree learning methods (e.g., classification and regression trees, iterative dichotomiser 3, C4.5, chi-squared automatic interaction detection, decision stumps, random forests, multivariate adaptive regression splines, gradient boosting machines, etc.), Bayesian methods (e.g., naive Bayes, averaged one-dependence estimators, Bayesian belief networks, etc.), kernel methods (e.g., support vector machines, radial basis functions, linear discriminant analysis, etc.), clustering methods (e.g., k-means clustering, expectation maximization, etc.), association rule learning algorithms (e.g., Apriori algorithms, Eclat algorithms, etc.), artificial neural network models (e.g., Perceptron methods, back-propagation methods, Hopfield network methods, self-organizing map methods, learning vector quantization methods, etc.), deep learning algorithms (e.g., restricted Boltzmann machines, deep belief network methods, convolutional network methods, stacked auto-encoder methods, etc.), dimensionality reduction methods (e.g., principal component analysis, partial least squares regression, Sammon mapping, multidimensional scaling, projection pursuit, etc.), ensemble methods (e.g., boosting, bootstrap aggregation, AdaBoost, stacked generalization, gradient boosting machine methods, random forest methods, etc.), and any suitable form of algorithm.

Additionally or alternatively, the therapy model may involve identification of a "normal" or baseline microbiome composition and/or functional characteristic as assessed from subjects identified as being in good health within a population of subjects. After identifying a subset of subjects of the population of subjects characterized as being healthy (e.g., using features of the characterization process), a therapy can be generated in block S150 that modulates microbiome composition and/or functional features toward those of the healthy subjects. Thus, block S150 can include identifying one or more baseline microbiome compositions and/or functional characteristics (e.g., one baseline microbiome for each of a set of demographic characteristics), and potential therapy agents and therapy regimens that can shift the microbiome of a subject in a dysregulated state toward one of the identified baseline microbiome compositions and/or functional characteristics. However, the therapy model may be generated and/or refined in any other suitable manner.
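By way of a non-limiting illustration, identifying a baseline composition and the shift needed to reach it may be sketched as follows. This is a minimal sketch under stated assumptions: the baseline is taken to be the per-taxon mean relative abundance across the healthy subset, and the taxon names and abundance values are placeholders rather than values from the sections above.

```python
from statistics import mean


def baseline_profile(healthy_profiles):
    """Mean relative abundance per taxon across subjects characterized as healthy."""
    taxa = sorted({taxon for profile in healthy_profiles for taxon in profile})
    return {t: mean(p.get(t, 0.0) for p in healthy_profiles) for t in taxa}


def shift_toward_baseline(subject_profile, baseline):
    """Signed change per taxon needed to reach the baseline (positive = increase)."""
    taxa = sorted(set(subject_profile) | set(baseline))
    return {t: baseline.get(t, 0.0) - subject_profile.get(t, 0.0) for t in taxa}


# Placeholder profiles; the values are illustrative only.
healthy = [
    {"taxon_A": 0.30, "taxon_B": 0.10},
    {"taxon_A": 0.50, "taxon_B": 0.30},
]
subject = {"taxon_A": 0.10, "taxon_B": 0.45}

baseline = baseline_profile(healthy)
shifts = shift_toward_baseline(subject, baseline)
for taxon, delta in shifts.items():
    print(taxon, "increase" if delta > 0 else "decrease", round(abs(delta), 3))
```

A separate baseline could be computed per demographic group by calling `baseline_profile` on each group's healthy subset.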

Probiotic therapy-related microbial compositions related to therapy models preferably include microbes that are culturable (e.g., capable of expanding to provide scalable therapy) and non-lethal (e.g., non-lethal at their desired therapeutic dose). Furthermore, the microbial composition may comprise a single type of microorganism having an acute or mild effect on the microbiome of the subject. Additionally or alternatively, the microbial composition may comprise a balanced combination of multiple types of microbes configured to cooperate with each other in driving the microbiome of the subject towards a desired state. For example, a combination of multiple types of bacteria in a probiotic therapy may include a first bacterial type that produces a product that is used by a second bacterial type that has a strong effect in positively affecting the microbiome of the subject. Additionally or alternatively, the combination of multiple types of bacteria in a probiotic therapy may include several bacterial types that produce proteins with the same function that positively affects the microbiome of the subject.

In an example of a probiotic therapy, the probiotic composition may comprise one or more components of an identified microbiome taxon (e.g., as described in sections 1.4.1 to 1.4.8 above) provided at a dose of 1 million to 10 billion CFU, as determined from a therapy model that predicts a positive adjustment of the microbiome of the subject in response to the therapy. Additionally or alternatively, the therapy may include a dose of proteins resulting from functional presence in the microbiome composition of subjects not suffering from an autoimmune condition. In an example, a subject may be instructed to ingest a capsule comprising a probiotic formulation according to a regimen tailored to one or more of his/her following: physiology (e.g., body mass index, weight, height), demographic characteristics (e.g., gender, age), severity of the disorder, sensitivity to drugs, and any other suitable factors.

Furthermore, the probiotic composition of the probiotic-based therapy may be obtained naturally or synthetically. For example, in one application, the probiotic composition may be naturally derived from fecal matter or other biological matter (e.g., of one or more subjects having a baseline microbiome composition and/or functional characteristic, as identified using the characterization process and the therapy model). Additionally or alternatively, the probiotic composition may be obtained synthetically (e.g., using a benchtop method) based on baseline microbiome composition and/or functional characteristics as identified using the characterization process and the therapy model. In variations, the microbial agents that may be used in the probiotic therapy may include one or more of the following: yeasts (e.g., Saccharomyces boulardii), gram-negative bacteria (e.g., Escherichia coli Nissle, Akkermansia muciniphila, Prevotella bryantii, etc.), gram-positive bacteria (e.g., Bifidobacterium animalis (including subspecies lactis), Bifidobacterium longum (including its subspecies), Bifidobacterium bifidum, Bifidobacterium pseudolongum, Bifidobacterium breve, Lactobacillus rhamnosus, Lactobacillus acidophilus, Lactobacillus casei, Lactobacillus plantarum, Lactobacillus salivarius, Lactobacillus reuteri, Lactobacillus gasseri, Lactobacillus brevis (including subspecies coagulans), Bacillus cereus, Bacillus subtilis (including var. natto), Bacillus polyfermenticus, Bacillus clausii, Bacillus licheniformis, Bacillus coagulans, Bacillus pumilus, Faecalibacterium prausnitzii, Streptococcus thermophilus, Bacillus brevis, Lactococcus lactis, Leuconostoc mesenteroides, Enterococcus faecalis, Enterococcus durans, etc.), and any other suitable type of microbial agent.

Additionally or alternatively, the therapy scheduled by the therapy model of block S150 may include one or more of the following: consumables (e.g., food items, beverage items, nutritional supplements), suggested activities (e.g., exercise regimens, adjustments to alcohol consumption, adjustments to cigarette use, adjustments to drug use), topical therapies (e.g., lotions, ointments, antiseptics, etc.), adjustments to hygiene product use (e.g., use of shampoo products, use of conditioner products, use of soaps, use of makeup products, etc.), adjustments to diet (e.g., sugar consumption, fat consumption, salt consumption, acid consumption, etc.), adjustments to sleep behavior, adjustments to living arrangements (e.g., adjustments to living with pets, adjustments to living with plants in one's home environment, adjustments to light and temperature in one's home environment, etc.), nutritional supplements (e.g., vitamins, minerals, fiber, fatty acids, amino acids, prebiotics, probiotics, etc.), drugs, antibiotics, and any other suitable therapeutic measures. Prebiotics suitable for use in therapy, as part of any food or as a supplement, include the following components: 1,4-dihydroxy-2-naphthoic acid (DHNA), inulin, trans-galacto-oligosaccharides (GOS), lactulose, mannan-oligosaccharides (MOS), fructo-oligosaccharides (FOS), neoagaro-oligosaccharides (NAOS), pyrodextrins, xylo-oligosaccharides (XOS), isomalto-oligosaccharides (IMOS), amylose-resistant starch, soybean oligosaccharides (SBOS), lactitol, lactosucrose (LS), isomaltulose (including Palatinose), arabinoxylooligosaccharides (AXOS), raffinose oligosaccharides (RFO), arabinoxylans (AX), polyphenols, or any other compound capable of altering the composition of the microbiota with a desired effect.

Additionally or alternatively, the therapy scheduled by the therapy model of module S150 may include one or more of: transplantation (e.g., bone marrow transplantation); infusion (e.g., blood infusion); radiation therapy (e.g., chemotherapy); anti-inflammatory therapy; an inflammation-mitigating therapy; inflammatory gene suppression; anti-inflammatory gene activation based therapies; immune system suppression therapy; immunoglobulin-based therapies; vaccination; antibody-based therapies; and/or any other suitable type of therapy configured to improve the state of an autoimmune condition in a subject.

However, the first method 100 may include any other suitable modules or steps configured to facilitate: receiving a biological sample from an individual, processing the biological sample from the individual, analyzing data obtained from the biological sample, and generating a model that can be used to provide customized diagnosis and/or treatment based on the specific microbiome composition of the individual.

1.6 Example methods

Embodiments may provide methods for determining a classification of the presence or absence of a condition and/or determining the course of treatment of an individual human having the condition. The method may be performed by a computer system.

In step 1, sequence reads of bacterial DNA obtained from analysis of a test sample from an individual human are received. The analysis may be performed in various techniques, as described herein, such as sequencing or hybridization arrays. The sequence reads may be received at a computer system, for example, from a detection device, such as a sequencing machine that provides data to a storage device (which may be loaded into the computer system) or to the computer system over a network.

In step 2, the sequence reads are mapped to a bacterial sequence database to obtain a plurality of mapped sequence reads. The bacterial sequence database includes a plurality of reference sequences for a plurality of bacteria. The reference sequence may be a predetermined region for the bacterium, for example, the 16S region.

In step 3, the mapped sequence reads are assigned to sequence groups based on the mapping to obtain assigned sequence reads, each assigned to at least one sequence group. Each sequence group includes one or more of the plurality of reference sequences. Mapping may include mapping the sequence reads to one or more predetermined regions of the reference sequences. For example, sequence reads may be mapped to the 16S gene. As such, the sequence reads do not have to map to the entire genome, but only to the region covered by the reference sequences of the sequence group.
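As an illustration of step 3, the following is a minimal sketch of counting mapped reads per sequence group. It assumes the mapping step has already produced, for each read, the identifier of the reference sequence it mapped to. The names `assign_reads_to_groups`, `read_hits`, and `group_refs` are hypothetical and do not come from the described embodiments.

```python
from collections import Counter


def assign_reads_to_groups(read_hits, group_refs):
    """Count mapped reads per sequence group.

    read_hits: dict mapping read id -> reference sequence id it mapped to.
    group_refs: dict mapping group name -> set of reference sequence ids.
    A read is counted once for every group containing its reference.
    """
    # Invert group membership so each reference id lists its groups.
    ref_to_groups = {}
    for group, refs in group_refs.items():
        for ref in refs:
            ref_to_groups.setdefault(ref, []).append(group)

    counts = Counter()
    unassigned = 0
    for ref in read_hits.values():
        groups = ref_to_groups.get(ref)
        if groups:
            counts.update(groups)
        else:
            # Mapped to a bacterial reference that belongs to no known group.
            unassigned += 1
    return dict(counts), unassigned


# Hypothetical mapping results for four reads.
read_hits = {"r1": "ref_A", "r2": "ref_B", "r3": "ref_C", "r4": "ref_A"}
group_refs = {"group_1": {"ref_A", "ref_B"}, "group_2": {"ref_B"}}

group_counts, unassigned = assign_reads_to_groups(read_hits, group_refs)
print(group_counts, unassigned)
```

Keeping a separate count of unassigned bacterial reads leaves both options for the total in step 4 open.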

In step 4, the total number of assigned sequence reads is determined. In some embodiments, the total number of assigned reads may include reads identified as bacteria, but not assigned to a known sequence group. In other embodiments, the total number can be a sum of the sequence reads assigned to the known sequence groups, where the sum can include any sequence reads assigned to at least one sequence group.

In step 5, relative abundance values (RAVs) may be determined. For example, for each sequence group of a disease signature set of one or more sequence groups associated with the features described in sections 1.4.1-1.4.8 above, a relative abundance value of the sequence reads assigned to that sequence group relative to the total number of assigned sequence reads may be determined. The relative abundance values may form a test feature vector, where each value of the test feature vector is the RAV of a different sequence group.
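Steps 4 and 5 can be sketched as follows. This is an illustrative example only: the function name, the group names, and the counts are hypothetical, and the sketch assumes each read is counted in at most one sequence group so that the per-group counts can be summed into a total.

```python
def relative_abundance_values(group_counts, signature_groups, unassigned_bacterial=0):
    """Return one RAV per signature group, in the order given.

    group_counts: dict mapping sequence group -> number of assigned reads.
    signature_groups: sequence groups that make up the disease signature.
    unassigned_bacterial: reads identified as bacterial but not assigned to
        any known group; pass 0 to use only reads assigned to known groups.
    """
    total = sum(group_counts.values()) + unassigned_bacterial
    if total == 0:
        raise ValueError("no assigned sequence reads")
    return [group_counts.get(group, 0) / total for group in signature_groups]


# Hypothetical counts for three sequence groups; two are in the signature.
counts = {"group_1": 30, "group_2": 10, "group_3": 60}
signature = ["group_1", "group_2"]

test_vector = relative_abundance_values(counts, signature)
test_vector_with_unassigned = relative_abundance_values(
    counts, signature, unassigned_bacterial=100
)
print(test_vector, test_vector_with_unassigned)
```

The optional `unassigned_bacterial` argument corresponds to the embodiment in which the total includes reads identified as bacterial but not assigned to a known sequence group.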

In step 6, the test feature vector is compared to a calibration feature vector generated from relative abundance values of calibration samples having known states of the condition. The calibration samples may be samples of a condition population and samples of a control population. In some embodiments, the comparison may include various machine learning techniques, such as supervised machine learning (e.g., decision trees, nearest neighbors, support vector machines, neural networks, naive bayes classifiers, etc.) and unsupervised machine learning (e.g., clustering, principal component analysis, etc.).

In one embodiment, clustering may use a network approach, where the distance between each pair of samples in the network is calculated based on the relative abundances of the relevant sequence groups for each condition. A new sample can then be compared to all samples in the network using the same relative-abundance-based metric to decide which cluster it belongs to. A meaningful distance metric would allow all diseased individuals to form one or several clusters and all healthy individuals to form one or several clusters. One example distance metric is the Bray-Curtis dissimilarity; equivalently, a similarity network can be used, where the metric is 1 minus the Bray-Curtis dissimilarity. Another example distance metric is the Tanimoto coefficient.
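For reference, the two metrics named above can be computed on relative abundance vectors as in the following sketch. The Bray-Curtis dissimilarity is the sum of absolute differences divided by the sum of all values, and the Tanimoto coefficient is used here in its common real-valued form, a·b / (|a|² + |b|² − a·b). The handling of all-zero vectors is an assumption of this sketch.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    # Two all-zero vectors are treated as identical (assumption).
    return numerator / denominator if denominator else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient (a similarity) for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Two all-zero vectors are treated as fully similar (assumption).
    return dot / denominator if denominator else 1.0


sample_a = [0.3, 0.1]
sample_b = [0.1, 0.3]

dissimilarity = bray_curtis(sample_a, sample_b)
similarity = 1 - dissimilarity  # edge weight for a similarity network
print(dissimilarity, similarity, tanimoto(sample_a, sample_b))
```

Note that the Bray-Curtis value is a dissimilarity (0 for identical samples), whereas the Tanimoto coefficient is a similarity (1 for identical samples).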

In some implementations, the feature vectors can be compared by converting the RAVs into probability values, thereby forming probability vectors. Processing similar to that described for the feature vectors can be applied to the probability vectors; since the probability vectors are generated from the feature vectors, such a process still includes a comparison of the feature vectors.

Step 7 may classify the presence or absence of the autoimmune condition and/or determine the course of treatment of the individual human having the autoimmune condition based on the comparison. For example, the cluster to which the test feature vector is assigned may be a condition cluster, and a classification may be made that the individual human has the condition or has a certain probability of the condition.

In one embodiment that includes clustering, the calibration feature vectors may be clustered into a control cluster that does not have the condition and a condition cluster that has the condition. It can then be determined to which cluster the test feature vector belongs. The identified cluster can be used to determine the classification or to select a course of treatment. In one implementation, clustering may use Bray-Curtis dissimilarity.
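A minimal sketch of such a cluster assignment is given below. It assumes the control and condition clusters are already known from the labels of the calibration samples, and it assigns the test feature vector to the cluster with the smallest mean Bray-Curtis dissimilarity. That assignment rule, the function names, and the example values are assumptions for illustration, not a rule stated in the embodiments.

```python
from statistics import mean


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def assign_cluster(test_vector, calibration):
    """Pick the cluster whose members are, on average, closest to the test vector.

    calibration: dict mapping cluster label -> list of calibration feature vectors.
    Returns (label, mean dissimilarity per cluster).
    """
    mean_distance = {
        label: mean([bray_curtis(test_vector, vector) for vector in vectors])
        for label, vectors in calibration.items()
    }
    return min(mean_distance, key=mean_distance.get), mean_distance


# Hypothetical calibration vectors (RAVs for two sequence groups).
calibration = {
    "condition": [[0.30, 0.05], [0.28, 0.07]],
    "control": [[0.05, 0.30], [0.07, 0.25]],
}

label, distances = assign_cluster([0.25, 0.08], calibration)
print(label, distances)
```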

In one embodiment that includes a decision tree, the comparison may be made by comparing the test feature vector to one or more cutoff values (e.g., as respective cutoff vectors), where the one or more cutoff values are determined from the calibration feature vectors, thereby providing the comparison. As such, the comparing may include comparing each relative abundance value of the test feature vector with a respective cutoff value determined from the calibration feature vectors generated from the calibration samples. Respective cutoff values can be determined to provide the best discrimination for each sequence group.
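One way to picture the cutoff comparison is the single-split sketch below. For each sequence group it scans midpoints between observed calibration RAVs and keeps the cutoff and direction with the highest accuracy on the calibration samples. Using accuracy as the measure of "best discrimination" is an assumption of this sketch, and the function names and data are hypothetical.

```python
def best_cutoff(condition_values, control_values):
    """Return (cutoff, direction, accuracy) for one sequence group.

    direction is +1 when RAVs above the cutoff indicate the condition,
    and -1 when RAVs at or below the cutoff indicate the condition.
    """
    values = sorted(set(condition_values) | set(control_values))
    if len(values) < 2:
        raise ValueError("need at least two distinct calibration values")
    n = len(condition_values) + len(control_values)
    best = None
    # Candidate cutoffs are midpoints between adjacent observed values.
    for low, high in zip(values, values[1:]):
        cutoff = (low + high) / 2
        correct_if_above = sum(v > cutoff for v in condition_values) + sum(
            v <= cutoff for v in control_values
        )
        for direction, correct in ((1, correct_if_above), (-1, n - correct_if_above)):
            accuracy = correct / n
            if best is None or accuracy > best[2]:
                best = (cutoff, direction, accuracy)
    return best


def compare_to_cutoffs(test_vector, cutoffs):
    """Flag, per sequence group, whether the test RAV falls on the condition side."""
    return [
        (rav - cutoff) * direction > 0
        for rav, (cutoff, direction, _) in zip(test_vector, cutoffs)
    ]


# Hypothetical calibration RAVs for two sequence groups.
cutoffs = [
    best_cutoff([0.30, 0.28, 0.35], [0.05, 0.10, 0.08]),  # enriched in condition
    best_cutoff([0.02, 0.03, 0.01], [0.20, 0.25, 0.22]),  # depleted in condition
]
flags = compare_to_cutoffs([0.27, 0.04], cutoffs)
print(cutoffs, flags)
```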

New samples can be measured to determine the RAV for each sequence group in the disease signature. The RAV for each sequence group can be compared to the probability distributions of the control and condition populations for that sequence group. For example, for a given input RAV, the probability distribution of the condition population can provide an output that is a probability of having the condition (condition probability). Similarly, for a given input RAV, the probability distribution of the control population can provide an output that is a probability of not having the condition (control probability). As such, the values of the probability distributions at the RAV can provide the probability that the sample belongs to each population. Thus, by taking the maximum probability, it can be determined to which population the sample more likely belongs.
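The following sketch shows one possible reading of this comparison for a single sequence group. It fits a normal distribution to the calibration RAVs of each population and normalizes the two density values at the test RAV so that they sum to one. The normal model, the equal weighting of the two populations, and the example data are all assumptions made for illustration; the embodiments do not specify the form of the distributions.

```python
from statistics import NormalDist


def group_probabilities(rav, condition_ravs, control_ravs):
    """Return (condition probability, control probability) for one sequence group."""
    # Fit one normal distribution per population (assumed model).
    condition_dist = NormalDist.from_samples(condition_ravs)
    control_dist = NormalDist.from_samples(control_ravs)
    condition_density = condition_dist.pdf(rav)
    control_density = control_dist.pdf(rav)
    total = condition_density + control_density
    if total == 0:
        # The RAV is far from both populations; no evidence either way.
        return 0.5, 0.5
    return condition_density / total, control_density / total


# Hypothetical calibration RAVs for one sequence group.
condition_ravs = [0.30, 0.28, 0.35, 0.31]
control_ravs = [0.05, 0.10, 0.08, 0.07]

condition_probability, control_probability = group_probabilities(
    0.29, condition_ravs, control_ravs
)
more_likely = "condition" if condition_probability > control_probability else "control"
print(condition_probability, control_probability, more_likely)
```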

A total probability across the sequence groups of the disease signature can be used. For all sequence groups measured, a condition probability can be determined for whether the sample is in the condition population, and a control probability can be determined for whether the sample is in the control population. In other embodiments, only the condition probability or only the control probability may be determined.

The probabilities across the sequence groups can be used to determine the overall probability. For example, an average of the condition probabilities can be determined, such that a final condition probability that the subject has the condition is obtained based on the disease signature. The mean of the control probabilities can be determined, thereby obtaining a final control probability that the subject does not have the condition based on the disease signature.

In one embodiment, the final condition probability and the final control probability may be compared to each other to determine a final classification. For example, a difference between the two final probabilities may be determined, and a final classification probability determined from the difference. A larger positive difference in favor of the final condition probability results in a higher final classification probability that the subject has the disease.

In other embodiments, only the final condition probabilities may be used to determine the final classification probabilities. For example, the final classification probability may be a final condition probability. Alternatively, depending on the format of the probabilities, the final classification probability may be one minus the final control probability, or 100% minus the final control probability.
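The averaging and the alternative ways of deriving a final classification probability can be sketched as follows. The text does not state how the difference is converted into a probability, so the linear rescaling of the difference from [-1, 1] to [0, 1] is an assumption of this sketch, as are the function names and the example values.

```python
from statistics import mean


def final_probabilities(condition_probs, control_probs):
    """Average the per-group probabilities into final probabilities."""
    return mean(condition_probs), mean(control_probs)


def classification_from_difference(final_condition, final_control):
    """Map the difference between the final probabilities to [0, 1].

    Assumed mapping: a difference of -1 gives 0, 0 gives 0.5, and +1 gives 1.
    """
    return (final_condition - final_control + 1) / 2


def classification_from_control(final_control):
    """Use one minus the final control probability (probabilities in [0, 1])."""
    return 1 - final_control


# Hypothetical per-group probabilities for a three-group disease signature.
condition_probs = [0.9, 0.7, 0.8]
control_probs = [0.1, 0.3, 0.2]

final_condition, final_control = final_probabilities(condition_probs, control_probs)
from_difference = classification_from_difference(final_condition, final_control)
from_control = classification_from_control(final_control)
print(final_condition, final_control, from_difference, from_control)
```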

In some embodiments, the final classification probability for one disease in a class may be combined with the final classification probabilities for other diseases in the same class. The aggregated probability can then be used to determine whether the subject has at least one disease of the class. As such, embodiments can determine whether a subject has a health issue, which can include a plurality of diseases associated with the health issue.
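The aggregation rule is not specified above. As one hedged illustration, if the per-disease final classification probabilities are treated as independent (an assumption that will often not hold for related diseases), the probability of having at least one disease of the class is one minus the product of the complements. The function name and values below are hypothetical.

```python
from math import prod


def probability_of_at_least_one(disease_probs):
    """Probability that at least one disease is present, assuming independence."""
    return 1 - prod(1 - p for p in disease_probs)


# Hypothetical final classification probabilities for two diseases in one class.
class_probability = probability_of_at_least_one([0.2, 0.5])
print(class_probability)
```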

The classification may be one of the final probabilities. In other examples, an implementation may compare the final probability to a threshold value to determine whether the condition exists. For example, the respective condition probabilities may be averaged, and the average may be compared to a threshold value to determine whether the condition exists. As another example, the comparison of the average value to a threshold value can be used to select a treatment for the subject.

2. Method for generating a microbiome-derived diagnosis

As mentioned above, in some embodiments, the output of the first method 100 can be used to generate a diagnosis and/or provide a therapeutic measure for an individual based on an analysis of the individual's microbiome. As such, the second method 200, which results from at least one output of the first method 100, may include: receiving a biological sample from a subject S210; characterizing the subject as having a form of autoimmune condition based on processing a microbiome dataset derived from the biological sample S220; and promoting therapy to the subject having the autoimmune condition based on the characterization and the therapy model S230.

Block S210 recites: receiving a biological sample from a subject, which functions to facilitate generation of a microbiome composition dataset and/or a microbiome functional diversity dataset for the subject. Thus, processing and analyzing the biological sample preferably facilitates generating a microbiome composition dataset and/or a microbiome functional diversity dataset for the subject, which can provide inputs used to characterize the individual with respect to diagnosis of the autoimmune condition, as in block S220. Receiving a biological sample from the subject preferably occurs in a manner similar to that of one of the embodiments, variations, and/or examples of sample reception described above with respect to block S110. Thus, the biological sample in block S210 may be received and processed using processes similar to those used to receive and process the biological samples used to generate the characterization and/or therapy-providing model of the first method 100, in order to provide consistency of process. However, the biological sample reception and processing in block S210 may alternatively be performed in any other suitable manner.

Block S220 recites: characterizing the subject as having a form of an autoimmune condition based on processing a microbiome dataset derived from the biological sample. Module S220 functions to extract features from the subject's microbiome-derived data and to use the features to positively or negatively characterize the individual as having some form of autoimmune condition. Thus, characterizing the subject in block S220 preferably comprises identifying features and/or combinations of features that correlate with the microbiome composition of the subject and/or the functional characteristics of the microbiome, and comparing such features to features characteristic of a subject having an autoimmune condition. Module S220 may also include generating and/or outputting a confidence metric related to the characterization of the individual. For example, the confidence metric may be derived from the number of features used to generate the characterization, the relative weights or rankings of the features used to generate the characterization, a measure of bias in the model used in module S140 above, and/or any other suitable parameter related to an aspect of the characterization operation of module S140.

In some variations, the features extracted from the microbiome dataset may be supplemented with survey-derived and/or medical history-derived features from the individual, which may be used to further refine the characterization operations of module S220. However, the microbiome composition dataset and/or the microbiome functional diversity dataset of an individual may additionally or alternatively be used in any other suitable manner to enhance the first method 100 and/or the second method 200.

Block S230 recites: promoting a therapy to the subject having the autoimmune condition based on the characterization and the therapy model. Module S230 functions to recommend or provide personalized therapeutic measures to the subject in order to shift the microbiome composition of the individual toward a desired equilibrium state. Accordingly, block S230 may include correcting the autoimmune condition or otherwise positively affecting the user's health with respect to the autoimmune condition. Accordingly, module S230 can include promoting one or more therapeutic measures to the subject based on the characterization of the subject with respect to the autoimmune condition, as described with respect to sections 1.4.1 to 1.4.8 above, wherein the therapy is configured to adjust the taxonomic composition of the microbiome of the subject and/or adjust functional characteristic aspects of the subject in a desired manner toward a "normal" state with respect to the characterization described above.

In block S230, providing the therapeutic measure to the subject may include recommending an available therapeutic measure configured to adjust the subject's microbiome composition toward a desired state. Additionally or alternatively, block S230 can include providing the customized therapy to the subject according to the characterization of the subject (e.g., with respect to a particular type of autoimmune condition). In variations, the therapeutic measures for adjusting the microbiome composition of the subject to improve the state of the autoimmune condition may include one or more of: probiotics, prebiotics, phage-based therapies, consumables, suggested activities, topical therapies, adjustments to hygiene product use, adjustments to diet, adjustments to sleep behavior, lifestyle schedules, adjustments to sexual activity levels, nutritional supplements, drugs, antibiotics, and any other suitable therapeutic measures. The providing of therapy in block S230 may include providing a notification by an electronic device, by an entity associated with the individual, and/or in any other suitable manner.

In more detail, the therapy provision in block S230 may include providing the subject with a notification about recommended therapeutic measures and/or other courses of action for health-related goals, as shown in fig. 6. Notifications may be provided to an individual by an electronic device (e.g., a personal computer, mobile device, tablet, head-mounted wearable computing device, wrist-worn wearable computing device, etc.) executing an application, web interface, and/or messaging client configured for notification provision. In one example, a web interface of a personal computer or laptop computer associated with a subject may provide the subject with access to a user account of the subject, wherein the user account includes information about: the characterization of the subject, detailed characterization of aspects of the subject's microbiome composition and/or functional features, and notifications of suggested therapeutic measures generated in block S150. In another example, an application executing on a personal electronic device (e.g., a smartphone, a smartwatch, a head-mounted smart device) may be configured to provide notifications (e.g., visually, haptically, audibly, etc.) regarding treatment recommendations generated by the therapy model of module S150. Additionally or alternatively, the notification can be provided directly by an entity associated with the subject (e.g., a caregiver, spouse, significant other, healthcare professional, etc.). In some further variations, a notification may additionally or alternatively be provided to an entity (e.g., a healthcare professional) associated with the subject, wherein the entity is capable of administering the therapeutic measure (e.g., by prescription, by conducting a therapeutic session, etc.). However, notification of therapy administration may be provided to the subject in any other suitable manner.

Further, in an extension of block S230, a therapy effect model can be generated for each recommended treatment provided according to the model generated in block S150 by monitoring the subject during the course of the treatment regimen (e.g., by receiving and analyzing biological samples from the subject throughout therapy, by receiving survey-derived data from the subject throughout therapy).

The methods 100, 200 and/or systems of an embodiment may be at least partially embodied or implemented as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions may be executed by computer-executable components integrated with an application, applet, host, server, network, website, communication service, communication interface, hardware/firmware/software element of a patient's computer or mobile device, or any suitable combination thereof. Other systems and methods of the embodiments may be at least partially embodied and/or implemented as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions may be executed by computer-executable components integrated with devices and networks of the type described above. The instructions may be stored on any suitable computer-readable medium, such as RAM, ROM, flash memory, EEPROM, optical devices (CD or DVD), hard drives, floppy drives, or any other suitable device. The computer-executable component may be a processor, although any suitable dedicated hardware device may (alternatively or additionally) execute the instructions.

The figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to preferred embodiments, example configurations, and variations thereof. In this regard, each block in the flowchart or block diagrams may represent a module, segment, step, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As those skilled in the art will recognize from the foregoing detailed description and from the accompanying drawings and claims, modifications and variations can be made to the embodiments of the invention without departing from the scope of the invention as defined in the following claims.

Claims

1. A method for generating a characterization model of an autoimmune condition in at least one subject, the method comprising:

receiving, at a sample processing network, an aggregate set of samples from a population of subjects;

generating, at a computing system in communication with the sample processing network, a microbiome composition dataset and a microbiome functional diversity dataset for the population of subjects upon processing the nucleic acid content of each of the aggregate set of samples in a multiplex amplification operation and a sequencing analysis operation using a primer set;

receiving, at the computing system, a supplemental data set associated with the population of subjects, wherein the supplemental data set provides information on a characteristic associated with the autoimmune condition;

at the computing system, transforming the supplementary dataset and features extracted from the microbiome composition dataset and the microbiome functional diversity dataset into a characterization model of the autoimmune condition; and

generating, at the computing system, information based on the characterization model, wherein the information is configured to positively influence the microbial distribution in the subject for improving the state of the autoimmune condition,

wherein generating the characterization model comprises performing a statistical analysis to evaluate a set of microbiome composition features and microbiome functional features having variations across a first subset of the population of subjects exhibiting the autoimmune condition and a second subset of the population of subjects not exhibiting the autoimmune condition, or analyzing a set of features from the microbiome composition dataset with the statistical analysis, wherein the set of features comprises features associated with: the relative abundance of the different taxa present in the microbiome composition dataset and the phylogenetic distance between taxa present in the microbiome composition dataset,

wherein generating the microbiome functional diversity dataset comprises:

extracting candidate features associated with a set of functional aspects of microbiome components indicated in the microbiome composition dataset; and

characterizing the autoimmune condition with respect to a subset of the set of functional aspects, the subset originating from at least one of: orthologous clustering of protein features, genomic functional features, chemical functional features and systemic functional features, and

wherein the autoimmune condition comprises at least one of: asthma, sprue or gluten intolerance, acquired immunodeficiency syndrome (AIDS), Multiple Sclerosis (MS), rheumatoid arthritis, Sjogren's syndrome, type I diabetes, and systemic lupus erythematosus.

2. The method of claim 1, wherein generating a characterization of asthma comprises generating the characterization upon processing the aggregate set of samples and determining the presence of features derived from 1) a set of taxa including one or more of the following taxa: filifactor (genus), Mycoplasma family (family), and Mycoplasma order (order), or one or more of the following taxa: phylum Mollicutes (phylum), alpha-Proteobacteria (class), Mollicules (class), Rhodospirillales (order), Aminococcaceae (family), Streptococcus digestions (family), Prevoteriaceae (family), Christinellaceae (family), Christensellaceae (family), Ruminococcaceae (family), Cholesterol (genus), Dorea (genus), coprococcus (genus), Coelobacter (genus), Moryella (genus), Prevotella (genus), Clostridium (genus), Choristobacterium (genus), Choristonella (genus), Streptococcus thermophilus (Diphyllophora) (species), Parameridae (species), Clostridium (genus 1. sp.), Clostridium (species), Clostridium (genus), Streptococcus sp) (species) (genus), Streptococcus sp (strain) 1 (genus), Streptococcus sp., S (strain), Streptococcus sp. 1 (strain), Streptococcus sp. (genus), Streptococcus sp. coli (strain), Streptococcus sp. 1 (strain), Streptococcus sp. (genus), Streptococcus sp. (genus), Streptococcus strain (Streptococcus sp.) (genus), Streptococcus sp. (Streptococcus sp.) (species), Streptococcus sp. (Streptococcus sp.) (Streptococcus strain (S. sp.) (S) and Streptococcus strain (S. sp.) (genus), Streptococcus strain (S. sp.) (genus), Streptococcus strain (S. sp.) (Streptococcus strain (S. sp.), Streptococcus strain (S. sp.) (genus), Streptococcus strain (S. sp.), Streptococcus strain (S. sp.), Clostridium (S. sp.), Clostridium (S. sp.), Clostridium (S. sp.), Clostridium (S. sp.), Clostridium (S. sp.), Escherichia (S. sp.), Clostridium (S. sp.) (genus), Streptococcus sp.) (S. sp.) (strain (S. 
sp, Eubacterium desmolans, unclassified (unclassified) family of Streptococcus digesting bacteria, unclassified (promiscuous) (unclassified) family of Streptococcus digesting bacteria, and environmental samples (unclassified).

3. The method of claim 1, wherein generating a characterization relating to sprue or gluten intolerance comprises generating the characterization upon processing the aggregate set of samples and determining the presence of features derived from: 1) a set of taxa including one or more of: bifidobacterium (genus), Moryella (genus), Dorea (genus), Oscillatoria (Oscillus) (genus), Intestimonas (genus), Coriolus (genus), Colistella (genus), Bacteroides (genus), Microbacterium (Diarister) (genus), Subdoligranum (genus), Hespelia (genus), Alistepes (genus), Trichospira (genus), Faecalibacterium (genus), Bifidobacterium (Family), Veillnaceae (Family), Bacteroides (Family), Ruminococcus (Family), Ruminospiraceae (Family), Fluocitaceae (Family), Rhodobacterium Family (Corobacteriaceae) (Family), Streptococcus (Streptococcaceae), Salmonella (Family), Enterobacteriaceae) (Family), Enterobacteriaceae (Family (Enterobacteriaceae) (Family), Enterobacteriaceae (Family), Family (Enterobacteriaceae) (Family), Family (Enterobacter (Enterobacteriaceae) (Family), Family (Enterobacter (Family), Family (Enterobacter (Family), Family (Enterobacter (Family), Family (Enterobacter (Family), Family (Enterobacter (Family (Enterobacter) (Family), Family (Family ), Family (Family), Family (Family ), Family Enterobacter (Family), Family Enterobacter) (Family Enterobacter (Family), Family Enterobacter (Family), Family Enterobacter (Family ), Family Enterobacter (Family), Family Enterobacter (Family ), Family Enterobacter (Family, Family Enterobacter (Family), Family Enterobacter (Family), Family Enterobacter (Family), Family Enterobacter (Family, Family Enterobacter (Family Enterobacter), Family Enterobacter (Family, Family Enterobacter (Family), Family Enterobacter (Family ), Family Enterobacter (Family), Family Enterobacter (Family ), Family Enterobacter), Family Enterobacter (Family, Family Enterobacter), Family Enterobacter (Family), Family Enterobacter (Family ), Family Enterobacter (Family, 
coriobacteriales (order), Enterobacteriales (order), clostridiales (order), actinomycetes (order), negaviruses (order), bacteroides (order), Clostridia (order), actinomycetes (order), negavirales (order), bacteroides (order), Clostridia (order), actinomycetales (phylum), bacteroidetes (phylum), and Firmicutes (phylum), or one or more of the following taxa: "genus" or "genus", Streptococcus thermophilus (species), clostridium flexibile (species), chlamydia/Verrucomicrobia (Verrucomicrobia) group (supraphylum), Verrucomicrobia (phylum), Verrucomicrobia (class), verrucomica (order), verrucomiciaceae (family), Eubacterium sirauum (species), lachnospiraceae (unclassified), akkermanasia (genus), necavities (class), selenions (order), actinomycetales (phylum), actinomycetous subclass (actinomycetous) (subclass), delta-proteobacteria (Deltaproteobacteria) (class), Ruminococcus fascias (species), desulfuriformes (order), and subphylum (2) related to the following functions: a first Kyoto Encyclopedia of Genes and Genomes (KEGG) functional feature related to carbohydrate metabolism, a second KEGG functional feature related to ribosome biogenesis, and a third KEGG functional feature related to peptidoglycan biosynthesis.

4. The method of claim 1, wherein generating a characterization related to multiple sclerosis comprises generating the characterization upon processing the aggregate set of samples and determining the presence of features derived from: 1) a set of taxa including one or more of: lactococcus (genus), Verrucomicrobiae (class), Verrucomicrobiae (order), Verrucomicrobiales (order), Anaerostipes (genus), Tricpiraceae (family), Cyanobacter (phylum), Peptococcus (genus), Coprococcus (species), Clostridium (species), bacterium A2-162 (species), Prevoteriaceae (family), Prevotella (genus), butyric acid-producing bacterium L1-93 (species), Actinobacillus (genus), Pasteurellaceae (family), Pasteurellaceae (order), and Actinomycetales (order).

5. The method of claim 1, wherein generating a characterization related to AIDS comprises generating the characterization upon processing the aggregate set of samples and determining the presence of features derived from: 1) a set of taxa including one or more of the following: prevoteriaceae (Prevotella) (family), Prevotella (Prevotella) (genus), Megasphaera (genus), Veillonellaceae (family), Erysipellicaceae (family), Erysipellica (class), Erysipelothrix (class), Bacteroides (order), Bacteroides (class), Bacteroides (phylum/phylum) group, Bacteroides (order), Serenofeldes (order), Selenomalidae (order), Negativiridae (class), Laciniaceae (family), Flavobacterium (order), Fluorobacteriaceae (family), Porphyceae (family ), family (family, family (order of genus), family (order, genus (genus), family (genus), family (genus, family (order), family (order of genus, family (genus), family (order), family (genus, order of genus, order, genus, order of genus, order of < type (order of genus, order of < genus, order of genus, phyl), phylum (order, phylum (order, phylum (order of genus, phylum (order, order of < genus, order of < type (order of genus, order of < type (order of < genus, order of genus, order of < genus, phylum (order of < genus, phylum), phylum (order), phylum (order of < genus, phylum (order), phylum (order), phylum (genus, phylum (order), phylum (genus, phylum (order), phylum (rhodobacter (order), phylum (genus, phylum), phylum (order), phylum (rhodobacter (genus, phylum (order), phylum (genus, phylum (order), phylum (genus, lachnospira (genus), Barnesiella (genus), coleobacter (genus), paracaseella (genus), and paracaseella extracementihosis (species), and 2) a set of functions including one or more of the following: a neurodegenerative disease KEGG L2 derived feature; a transcription KEGG L2 derived feature; a cofactor and vitamin metabolism KEGG L2 derived feature; an endocrine system KEGG L2 derived feature; a cancer KEGG L2 derived feature; an amino 
acid metabolism KEGG L2 derived feature; a glycolysis/gluconeogenesis KEGG L3 derived feature; a streptomycin biosynthesis KEGG L3 derived feature; a restriction enzyme KEGG L3 derived feature; a fatty acid biosynthesis KEGG L3 derived feature; a PPAR signaling pathway KEGG L3 derived feature; a phosphotransferase system (PTS) KEGG L3 derived feature; a lipid metabolism KEGG L3 derived feature; an aminobenzoate degradation KEGG L3 derived feature; a pathway in cancer KEGG L3 derived feature; a mismatch repair KEGG L3 derived feature; a vitamin B6 metabolism KEGG L3 derived feature; a butirosin and neomycin biosynthesis KEGG L3 derived feature; a pantothenate and CoA biosynthesis KEGG L3 derived feature; an oxidative phosphorylation KEGG L3 derived feature; a zeatin biosynthesis KEGG L3 derived feature; an energy metabolism KEGG L3 derived feature; a limonene and pinene degradation KEGG L3 derived feature; a valine, leucine, and isoleucine biosynthesis KEGG L3 derived feature; a bacterial chemotaxis KEGG L3 derived feature; a homologous recombination KEGG L3 derived feature; a lipopolysaccharide biosynthesis protein KEGG L3 derived feature; a transcription machinery KEGG L3 derived feature; and a selenoid KEGG L3 derived feature.

6. The method of claim 1, wherein generating a characterization that is relevant to type I diabetes comprises generating the characterization upon processing the aggregate set of samples and determining the presence of features derived from: 1) a set of taxa including one or more of the following: porphyridonaceae (family), Oscillatoria (genus), Peptophilus (genus), Ruminococcus faecalis (species), and Deuterospiriales (order).

7. The method of claim 1, wherein generating the characterization of systemic lupus erythematosus comprises determining the presence of features derived from: 1) a set of taxa including one or more of the following: anaerotruncus (genus), Parabacteroides merdae (species), Parabacteroides (genus), Porphyromonaceae (family), Cellulobacteria (Fibrobacter)/Acidobacterium (Acidobacterium) group, Acidobacterium (phylum), Erysipelamiaceae (family), and unclassified Clostridiaceae (family).

8. The method of claim 1, wherein generating the characterization of Sjogren's syndrome comprises determining the presence of features derived from: 1) a set of taxa including one or more of the following: Oceanospirillales (order), Adlercreutzia (genus), and Adlercreutzia equolifaciens (species).

9. The method of claim 1, wherein generating a characterization related to rheumatoid arthritis comprises determining the presence of features derived from: 1) a set of taxa including one or more of the following: Bacteroides uniformis (species) and Alistipes onderdonkii (species).