WO2020028862A1

WO2020028862A1 - Panbam: bambam across multiple organisms in parallel

Info

Publication number: WO2020028862A1
Application number: PCT/US2019/044988
Authority: WO
Inventors: John Zachary Sanborn
Original assignee: Nantomics, Llc
Priority date: 2018-08-03
Filing date: 2019-08-02
Publication date: 2020-02-06

Abstract

Contemplated systems and methods are directed to in silico analysis of characterizing microbiome or identifying contamination of a biological entity or a sample. Most typically, the systems and methods use chimeric nucleic acid sequences that include a plurality of microorganism genome sequences merged together to form a single nucleic acid sequence file in BAM format to so align the nucleic acid sequences obtained from the biological entity or sample.

Description

PANBAM: BAMBAM ACROSS MULTIPLE ORGANISMS IN PARALLEL

[0001] This application claims priority to our copending US Provisional Patent Application with the serial number 62/714,570, which was filed 8/3/2018 and which is incorporated by reference herein.

Field of the Invention

[0002] The field of the invention is computational analysis of microbiomes using genetic information from tissue specimen and microorganism genome information.

Background of the Invention

[0003] The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.

[0004] All publications and patent applications herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.

[0005] A microbiome is an ecological community of all monocellular and/or multicellular symbiotic or pathogenic microorganisms. Different species of animals, plants, or any other multicellular organisms are either surrounded by or possess a microbiome, which affects their metabolism, immune system, endocrine system, and general health status. Conversely, any substantive change in the microbiome that breaks the balance in the microbiome may affect the status or condition of the host entity or any biological entity around the microbiome, necessitating a quick yet thorough analysis of a microbiome covering a large variety of microorganisms.

[0006] Some have tried to solve this problem by population analysis of a microbiome. For example, US Pat. Pub. No. 2017/0159108 to Budding discloses microbiome population analysis using taxonomic variations among DNA sequences of the microbial 16S-23S rRNA internal transcribed region as primers to amplify various types of microorganisms, and the length differences in the PCR products obtained using such primers. In another example, US Pat. Pub. No. 2017/0058430 to Watts discloses infection identification using whole metagenome sequence analysis of a sample. In Watts, human DNA is removed from the human DNA-bacterial DNA mixture obtained from the wound sample, and the isolated bacterial DNA sequences are analyzed using k-mer based sequence analysis. Unfortunately, while somewhat informative, all such methods require multiple steps of sample purification, amplification and/or sequencing, which may cause incomplete analysis of all microorganisms present in the microbiome.

[0007] Therefore, while several options for microbiome analysis are known in the art, there is still a need for a quick, in silico, and thorough microbiome analysis that can be used in many types of samples.

Summary of The Invention

[0008] The inventive subject matter is directed to in silico analysis of a microbiome to identify the presence of one or more known microorganisms in the sample as well as any changes in the microorganism that may alter the balance of the microbiome. Thus, in one inventive subject matter, the inventors contemplate a method of identifying microbiome of a sample in silico. In this method, a nucleic acid sequence, which preferably comprises a plurality of short reads, from the sample is obtained. Then, the nucleic acid sequence is aligned with a reference nucleic acid sequence to locate the reads relative to the reference nucleic acid sequence. The located reads are then aligned with a chimeric nucleic acid sequence comprising at least two microorganisms’ genomic nucleic acid sequences that are merged to form a single sequence file. An origin of the nucleic acid sequence can be identified from an alignment of the chimeric nucleic acid sequence and the nucleic acid sequence from the sample. Typically, the nucleic acid sequence comprises at least one microorganism’s genomic nucleic acid sequence, which can be selected from bacteria, yeast, a fungus, a virus, and a mycoplasma.

[0009] In some embodiments, the sample is a human tissue, and the nucleic acid sequence comprises a human genomic nucleic acid sequence and a non-human genomic nucleic acid sequence. In such embodiments, the chimeric reference nucleic acid sequence can further comprise a reference human genomic nucleic acid that is merged with the at least two microorganisms’ genomic nucleic acid sequences. Optionally, in such embodiments, the method may also include steps of generating a patient chimeric nucleic acid sequence comprising the non-human genomic nucleic acid sequence merged at an end of the human genome nucleic acid sequence, and aligning the patient chimeric nucleic acid sequence and the chimeric reference nucleic acid sequence to identify the origin of the nucleic acid sequence.

[0010] In some embodiments, the human tissue is a diseased tissue. In such embodiments, the method may also include steps of comparing the located reads with a nucleic acid sequence of a matched normal tissue to identify a tumor specific mutation.

[0011] Preferably, the nucleic acid sequence from the sample and the chimeric reference nucleic acid sequence are in BAM, SAM, FASTQ, FASTA, or FASTA index format. Also preferably, the nucleic acid sequence is aligned with a chimeric reference nucleic acid sequence using incremental synchronized alignment.

[0012] In some embodiments, the step of identifying the origin of the nucleic acid sequence comprises determining a quantity of first reads of the nucleic acid sequence that are aligned with a first portion of the chimeric nucleic acid sequence. In such embodiments, a quantity of second reads of the nucleic acid sequence that are aligned with a second portion of the chimeric nucleic acid sequence can be determined and then a relative quantity of first and second microorganisms in the sample can be determined. Preferably, the first and second portions are nucleic acids sequences of the first and second microorganisms, respectively.
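The read-counting step described above can be illustrated with a minimal sketch. It assumes that each read has already been reduced to its alignment start position on the chimeric sequence and that each portion is given as a half-open coordinate range; the function name `relative_quantities` and the portion labels are hypothetical and do not come from the source.

```python
from collections import Counter


def relative_quantities(read_positions, portions):
    """Return the fraction of counted reads falling in each portion.

    read_positions: iterable of 0-based alignment start positions on the
        chimeric sequence (assumed input shape).
    portions: dict mapping a portion label to a (start, end) half-open range.
    Reads that fall outside every portion are ignored.
    """
    counts = Counter()
    for pos in read_positions:
        for name, (start, end) in portions.items():
            if start <= pos < end:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in portions}


# Hypothetical layout: first organism at [0, 100), second at [100, 250).
portions = {"first": (0, 100), "second": (100, 250)}
positions = [5, 40, 99, 100, 180, 300]  # position 300 lies outside both portions
print(relative_quantities(positions, portions))
```

The sketch works on raw read counts, as the paragraph describes; any further normalization would be an additional design choice that the source does not specify.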

[0013] Optionally, the method further comprises generating or updating a record of the sample according to the identification of the origin and/or providing or recommending a treatment that is specific to the origin. Also, where the nucleic acid sequence is obtained from the sample periodically, the method can further comprise generating or updating a record of the sample according to a change of the identification of the origin.

[0014] Another aspect of the inventive subject matter includes a method of determining a change in a microbiome of a biological entity in silico. In this method, first and second samples from the biological entity are obtained at first and second time points. For example, the first time point can be before application of an antibiotic, and the second time point can be after application of the antibiotic. Then, first and second nucleic acid sequences are obtained from the first and second samples, respectively. Most preferably, each of the first and second nucleic acid sequences comprises a plurality of reads. The first and second nucleic acid sequences can then be aligned with a reference nucleic acid sequence to locate the reads relative to the reference nucleic acid sequence. The located reads then are aligned with a chimeric nucleic acid sequence comprising at least two microorganisms’ genomic nucleic acid sequences that are merged to form a single sequence file. Then, a change in the microbiome can be determined by comparing alignments of located reads derived from the first and second nucleic acid sequences with the chimeric nucleic acid sequence. Preferably, the first and second nucleic acid sequences comprise at least one microorganism’s genomic nucleic acid sequence, which can be selected from bacteria, yeast, a fungus, a virus, and a mycoplasma.

[0015] In some embodiments, the first and second samples are derived from a human tissue, and each of the first and second nucleic acid sequences comprises a human genomic nucleic acid sequence and a non-human genomic nucleic acid sequence. In such embodiments, the chimeric reference nucleic acid sequence further comprises a reference human genomic nucleic acid that is merged with the at least two microorganisms’ genomic nucleic acid sequences. Further, the human tissue can be a diseased tissue. In such embodiments, the method can further comprise steps of comparing the located reads derived from the first and second nucleic acid sequences with a nucleic acid sequence of a matched normal tissue to identify a tumor specific mutation. Also, where the human tissue is a diseased tissue, the method can also include steps of generating first and second patient chimeric nucleic acid sequences, each comprising the non-human genomic nucleic acid sequence merged at an end of the human genome nucleic acid sequence, and aligning the patient chimeric nucleic acid sequences and the chimeric reference nucleic acid sequence to identify the origin of the nucleic acid sequence.

[0016] Preferably, the first and second nucleic acid sequences and the chimeric nucleic acid sequence are in BAM, SAM, FASTQ, FASTA, or FASTA index format and/or the first and second nucleic acid sequences are aligned with a chimeric nucleic acid sequence using incremental synchronized alignment.

[0017] It is contemplated that the change in the microbiome comprises at least one of a quantity change of at least one microorganism, a ratio change among a plurality of microorganisms, and a mutation in at least one microorganism. Where the change is a quantity change, the quantity change is determined by measuring quantities of first and second reads of the first and second nucleic acid sequences, wherein the first and second reads are aligned with a first portion of the chimeric nucleic acid sequence. Where the change is a ratio change, the ratio change is determined by measuring first and second reads of the first nucleic acid sequence and third and fourth reads of the second nucleic acid sequence, wherein the first and third reads are aligned with a first portion of the chimeric nucleic acid sequence, and the second and fourth reads are aligned with a second portion of the chimeric nucleic acid sequence. In such embodiments, the first and second portions are nucleic acid sequences of the first and second microorganisms, respectively, or are nucleic acid sequences of different strains of the same species of microorganism.
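As an illustration of the quantity change and ratio change described above, the following sketch compares per-organism read counts from two time points. It assumes the counts have already been derived from the alignments against the chimeric sequence; the function names and the organism labels "A" and "B" are hypothetical.

```python
def quantity_change(counts_t1, counts_t2):
    """Per-organism difference in aligned read counts (second minus first)."""
    organisms = sorted(set(counts_t1) | set(counts_t2))
    return {o: counts_t2.get(o, 0) - counts_t1.get(o, 0) for o in organisms}


def ratio_change(counts_t1, counts_t2, first, second):
    """Ratio of reads for `first` over `second` at each time point.

    Returns a (ratio_t1, ratio_t2) tuple; a ratio is None when the
    denominator organism has no aligned reads at that time point.
    """
    def ratio(counts):
        denominator = counts.get(second, 0)
        return counts.get(first, 0) / denominator if denominator else None

    return ratio(counts_t1), ratio(counts_t2)


# Hypothetical read counts before and after an antibiotic is applied.
before = {"A": 120, "B": 40}
after = {"A": 30, "B": 60}
print(quantity_change(before, after))         # {'A': -90, 'B': 20}
print(ratio_change(before, after, "A", "B"))  # (3.0, 0.5)
```

The comparison is only meaningful when the two samples were sequenced to a similar depth, which is an assumption of this sketch rather than a statement in the source.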

[0018] Optionally, the method further comprises generating or updating a record of the sample according to the change of the microbiome or providing or recommending a treatment according to the change of the microbiome.

[0019] Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments.

Brief Description of The Drawing

[0020] Figure 1A is an illustrated example of a reference sequence comprising only a human genome.

[0021] Figure 1B is an illustrated example of a chimeric nucleic acid sequence comprising a plurality of microorganisms’ genomic nucleic acid sequences.

[0022] Figure 1C is an illustrated example of a chimeric nucleic acid sequence comprising a human genome sequence and a plurality of microorganisms’ genomic nucleic acid sequences.

[0023] Figure 2 shows three different scenarios of aligning the nucleic acid sequences with a reference sequence at an aligner and analyzing the aligned sequence via the aligner by comparing with a chimeric nucleic acid sequence.

Detailed Description

[0024] The inventors have discovered that the presence, quantities and/or ratio of one or more organisms in a microbiome in a sample can be readily determined by comparing the sequence information of nucleic acid sequences obtained from the sample with a plurality of genomic sequences of microorganisms. The inventors further discovered that the efficiency of the comparison of sequence information can be substantially increased when the plurality of genomic sequences of microorganisms are coupled to form a single hybrid or chimeric sequence. Consequently, in one especially preferred aspect of the inventive subject matter, the inventors contemplate a method of identifying microbiome of a sample in silico. In this method, a nucleic acid sequence can be obtained from the sample, and the nucleic acid sequence is aligned with a chimeric nucleic acid sequence that includes a plurality of microorganisms’ genomic nucleic acid sequences. Based on where and how the nucleic acid sequence aligns with the chimeric nucleic acid sequence, the source/origin of the nucleic acid sequence can be identified.

[0025] As used herein, the term “tumor” refers to, and is interchangeably used with, one or more cancer cells, cancer tissues, malignant tumor cells, or malignant tumor tissue, that can be placed or found in one or more anatomical locations in a human body. It should be noted that the term “patient” as used herein includes both individuals that are diagnosed with a condition (e.g., cancer) as well as individuals undergoing examination and/or testing for the purpose of detecting or identifying a condition. Thus, a patient having a tumor refers to both individuals that are diagnosed with a cancer as well as individuals that are suspected to have a cancer. As used herein, the term “provide” or “providing” refers to and includes any acts of manufacturing, generating, placing, enabling to use, transferring, or making ready to use.

[0026] As used herein, the term “sample” refers to any biological or nonbiological, or any organic or inorganic, substance or entity, or portions thereof, where the microbiome may be present temporarily or permanently. Thus, in one example, a sample can be a living or nonliving organism or portions or parts thereof, including animals, plants, tissues, cells, cultured cells, cultured tissues, bodily fluids (e.g., blood, mucus, cerebrospinal fluid, urine, etc.), organs, and parts (e.g., stem, root, etc.), or a food item (e.g., ground beef, etc.). A sample can also be healthy tissues or diseased tissues (e.g., tumor tissue, autoimmune disease tissue, infected tissue, etc.) that can be obtained via biopsy. In another example, a sample can be any surrounding environment of a living organism or portions or parts thereof, including cell culture media and tissue culture media.

[0027] Where the sample is a tissue of an individual human or an animal, any suitable methods of obtaining a tissue (either healthy tissue or diseased tissue) are contemplated. Most typically, tissue samples can be obtained from the individual via a biopsy (including liquid biopsy, or obtained via tissue excision during a surgery or an independent biopsy procedure, etc.), which can be fresh or processed (e.g., frozen, etc.) until further processing for obtaining nucleic acid information from the sample. For example, tissues or cells may be fresh or frozen. In another example, the tissues or cells may be in a form of cell/tissue extracts. In some embodiments, the tissues or cells may be obtained from a single or multiple different tissues or anatomical regions. For example, a metastatic breast cancer tissue can be obtained from the patient’s breast as well as other organs (e.g., liver, brain, lymph node, blood, lung, etc.) for metastasized breast cancer tissues. In another example, a healthy tissue or matched normal tissue (e.g., patient’s non-cancerous breast tissue) of the patient can be obtained from any part of the body or organs, preferably from liver, blood, or any other tissues near the tumor (in a close anatomical distance, etc.).

[0028] Most typically, the sample contains or is suspected to contain at least one microorganism (e.g., bacteria, yeast, virus, fungus, mycoplasma, etc.), and preferably its nucleic acid (e.g., DNA, RNA, etc.). In some embodiments, the sample may contain two or more different types of microorganisms (e.g., a mixture of two distinct families of bacteria (e.g., streptococcus and E. coli, etc.)) or two or more strains of the same species of microorganism. In other embodiments, the sample may contain a mixture of a host cell (e.g., human tissue, animal tissue, etc.) and one or more microorganisms. For example, a sample may be a biopsy sample from the rectum, colon, skin, oral or gastric mucosa, trachea, lung, etc., each of which is known to contain microbial entities (e.g., virus, bacteria, yeast, etc.).

[0029] Any suitable methods and/or procedures to obtain nucleic acid information from the sample are contemplated. Preferably, the nucleic acid information is whole genome information of all cells and/or microorganisms present in the sample. Thus, for example, nucleic acid information can be obtained by processing the sample to obtain DNA and/or RNA from the sample to further analyze relevant information. In another example, the nucleic acid information can be obtained directly from a database that stores the nucleic acid information of the sample, including DNA sequence analysis information and/or RNA sequence information (e.g., where the microorganism is an RNA virus, etc.) of a plurality of short reads at a length between 50-500 base pairs, preferably between 100-300 base pairs, that can be obtained by whole genome sequencing and/or exome sequencing (typically at a coverage depth of at least 10x, more typically at least 20x) and/or RNAseq using next generation sequencing. Alternatively, DNA data may also be provided from an already established sequence record (e.g., SAM, BAM, FASTA, FASTQ, or VCF file) from a prior sequence determination. Therefore, data sets may include unprocessed or processed data sets, and exemplary data sets include those having BAM format, SAM format, FASTQ format, or FASTA format. However, it is especially preferred that the data sets are provided in BAM format or as BAMBAM diff objects (e.g., US 2012/0059670A1 and US 2012/0066001A1). Likewise, computational analysis of the sequence data may be performed in numerous manners. In most preferred methods, however, analysis is performed in silico by location-guided synchronous alignment of tumor and normal samples as, for example, disclosed in US 2012/0059670A1 and US 2012/0066001A1 using BAM files and BAM servers. Such analysis advantageously reduces false positive neoepitopes and significantly reduces demands on memory and computational resources.
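For orientation, the coverage depths mentioned above can be related to read count and read length through the commonly used approximation depth = (number of reads x read length) / genome length. The sketch below applies that approximation; it is not taken from the source, and the function names and example figures are hypothetical.

```python
import math


def mean_coverage(num_reads, read_length, genome_length):
    """Approximate mean coverage depth: total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def reads_needed(target_depth, read_length, genome_length):
    """Smallest number of reads expected to reach the target mean depth."""
    return math.ceil(target_depth * genome_length / read_length)


# Hypothetical example: 150 bp reads against a 5 Mb microbial genome.
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(10, 150, 5_000_000))          # 333334
```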

[0030] Such obtained nucleic acid sequences, preferably short reads, can be aligned with a reference sequence to locate the short reads relative to the reference sequence and/or separate out some of the short reads as non-aligned sequences. The inventors contemplate that the reference sequence may vary depending on the type of samples and/or type of microorganisms of interest and/or type of analysis further required. In one preferred embodiment, the reference sequence may include the genomic sequence of one or more microorganisms. The inventors found that rapid identification or analysis of the microorganism genomic information without prior knowledge of the presence or types of microorganism can be achieved by modification of a reference genome in silico where two or more microorganism genome sequences (either RNA or DNA sequences) are merged to so form a chimeric reference nucleic acid sequence as shown in Figure 1B. In this context it should be appreciated that while genomic sequences are generally preferred (e.g., sequences covering at least 10%, or at least 25%, or at least 50%, or at least 70%, or at least 80%, or at least 90%, or at least 95% of the genome), transcriptomic sequences are also deemed suitable for use herein and may include single gene transcripts, multiple gene transcripts, and transcriptomic data (e.g., covering at least 10%, or at least 25%, or at least 50%, or at least 70%, or at least 80%, or at least 90%, or at least 95% of the transcriptome). Furthermore, such transcriptomics data may be used in conjunction with genomic data. Most preferably, the chimeric reference nucleic acid sequence includes at least 10, at least 20, at least 50, at least 100, or substantially all possible microorganisms that may be present in the sample. Thus, the chimeric reference nucleic acid sequence may vary depending on the type of samples that are analyzed (e.g., human v. other animals, types of tissues (e.g., stomach v. skin, etc.), in vitro samples (e.g., tissue culture and its media, etc.), and so on). In some embodiments, the chimeric reference nucleic acid sequence may be constructed by types of microorganisms. In such embodiments, several different chimeric reference nucleic acid sequences can be constructed, each of which has merged genomic nucleic acid sequences of known microorganisms of the same species or same family (e.g., bacterial chimeric reference nucleic acid sequence, E. coli chimeric reference nucleic acid sequence having nucleic acid sequences of all known strains of E. coli, etc.).
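A minimal sketch of how two or more genome sequences might be merged into a single chimeric sequence is given below. It assumes the genomes are available as plain strings keyed by an organism label, and it records the coordinate range each genome occupies so that a position in the chimeric sequence can be traced back to its organism. The names `Segment`, `build_chimeric_reference`, and `organism_at`, as well as the optional `N` spacer, are illustrative assumptions and are not described in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str   # organism label (hypothetical)
    start: int  # 0-based inclusive offset in the chimeric sequence
    end: int    # 0-based exclusive offset in the chimeric sequence


def build_chimeric_reference(genomes, spacer_len=0):
    """Concatenate genomes (dict: name -> sequence) into one sequence.

    Returns (chimeric_sequence, segments). An optional run of 'N' bases can
    be placed between genomes; this spacer is an assumption of the sketch.
    """
    parts, segments, offset = [], [], 0
    for name, seq in genomes.items():
        if parts and spacer_len:
            parts.append("N" * spacer_len)
            offset += spacer_len
        segments.append(Segment(name, offset, offset + len(seq)))
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), segments


def organism_at(segments, position):
    """Return the organism whose segment contains `position`, else None."""
    for seg in segments:
        if seg.start <= position < seg.end:
            return seg.name
    return None


genomes = {"organism_A": "ACGTACGT", "organism_B": "GGCC"}
chimera, segments = build_chimeric_reference(genomes, spacer_len=2)
print(chimera)                    # ACGTACGTNNGGCC
print(organism_at(segments, 12))  # organism_B
```

Keeping the segment table alongside the merged sequence is what allows a single alignment pass to report an organism of origin for each aligned read.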

[0031] In some embodiments, especially where the sample to be analyzed has mixed human tissue and microorganism (e.g., tumor tissue, infected tissue, etc.), the chimeric reference nucleic acid sequence may also include a human genome sequence as shown in Figure 1A or a human genome sequence merged with the two or more microorganism genome sequences as shown in Figure 1C. Typically, where the chimeric reference nucleic acid sequence is a human genome sequence merged with the two or more microorganism genome sequences, the human genome sequence used to form the chimeric reference nucleic acid sequence is derived from whole genome nucleic acid sequences of a healthy tissue of the same individual from which the sample to be analyzed is derived. Alternatively, the human genome sequence used to form the chimeric reference nucleic acid sequence is derived from genome sequences of a plurality of individuals, preferably stratified by gender, or an average or consensus sequence. Most typically, the reference genome will be or encompass the entire genome. However, smaller portions of the genome are also contemplated and include at least one chromosome, or two-five chromosomes, or five-ten chromosomes, or more than ten chromosomes. Alternatively, the reference genome may also be only representative of a portion (e.g., between 1-10%, between 10-30%, between 30-60%, or between 60-90%) of the entire exome or entire transcriptome. Thus, and viewed from yet another perspective, the reference genome will typically include at least 10%, or at least 30%, or at least 50%, or at least 70% of the entire genome of the human (or other species). In such embodiments, it is also contemplated that the chimeric reference nucleic acid sequence can be constructed in a fusion chromosome format or structure (e.g., in a BAM format or file), in which the genomic sequence of each human chromosome (e.g., chromosome 14, chromosome 15, etc.) is merged with the two or more microorganism genome sequences to form a plurality of chimeric reference nucleic acid sequences, one for each chromosome. Once the chimeric nucleic acid sequence file has been assembled, it is preferred that the sequence database is then updated with the so produced chimeric nucleic acid sequence file.

[0032] Once the nucleic acid sequences from the sample are aligned with the reference sequence, the short reads can then be processed using incremental synchronized alignment with the reference sequence. For example, and while not limiting the inventive subject matter, it is generally preferred that the genomic analysis is performed using a software tool in which a chimeric reference nucleic acid sequence is synchronized and incrementally compared against the nucleic acid sequence from the sample. One especially preferred tool includes BAMBAM as previously described in WO2013/074058A1, incorporated by reference herein.
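The general idea of a synchronized, incremental comparison can be pictured as walking two position-sorted data streams in lockstep so that only the records at the current position are held in memory. The following sketch is a simplified illustration of that idea only; it is not the BAMBAM implementation, and the function name `synchronized_walk` and the `(position, item)` record shape are assumptions.

```python
import heapq
from itertools import groupby


def synchronized_walk(stream_a, stream_b):
    """Yield (position, items_a, items_b) for two position-sorted streams.

    Each stream is an iterable of (position, item) tuples sorted by position.
    Records are consumed lazily, one position at a time.
    """
    tagged_a = ((pos, 0, item) for pos, item in stream_a)
    tagged_b = ((pos, 1, item) for pos, item in stream_b)
    # Merge on (position, source) so items themselves are never compared.
    merged = heapq.merge(tagged_a, tagged_b, key=lambda t: (t[0], t[1]))
    for pos, group in groupby(merged, key=lambda t: t[0]):
        items_a, items_b = [], []
        for _, source, item in group:
            (items_a if source == 0 else items_b).append(item)
        yield pos, items_a, items_b


# Hypothetical records: read names keyed by alignment position.
sample = [(1, "r1"), (3, "r2")]
control = [(1, "s1"), (2, "s2")]
for position, from_sample, from_control in synchronized_walk(sample, control):
    print(position, from_sample, from_control)
```

Because both inputs are consumed as iterators, the same pattern applies to streams that are too large to load at once, provided they are sorted by coordinate.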

[0033] It is contemplated that the form of such generated synchronously aligned sequences may vary depending on the type of reference sequence used in the alignment. For example, where the reference sequence is a chimeric nucleic acid having a plurality of microorganisms’ genomic sequences only, then the synchronously aligned sequences generated from the nucleic acid sequences obtained from the sample would be a linear combination of microorganisms’ genomic sequences aligned with the chimeric reference sequence. In such embodiment, if the nucleic acid sequences obtained from the sample contain any human nucleic acid sequences, such human nucleic acid sequences that are not aligned with the chimeric reference sequence would be left out. Alternatively, such human nucleic acid sequences may be appended to the end of the linear combination of microorganisms’ genomic sequences.

[0034] In another example, where the reference sequence is a human genomic nucleic acid only, then the synchronously aligned sequences generated from the nucleic acid sequences obtained from the sample would be a linear human genomic nucleic acid sequence as aligned with the reference sequence. In such example, if the nucleic acid sequences obtained from the sample contain any non-human genomic nucleic acid sequences, such non-human genomic nucleic acid sequences that are not aligned with the human genome reference sequence would be left out. Alternatively, such non-human genomic nucleic acid sequences may be appended to the end of the linear human genomic nucleic acid sequence.

[0035] In still another example, where the reference sequence is a chimeric reference nucleic acid sequence having a human genomic nucleic acid sequence merged (or appended) with a plurality of microorganisms’ genomic nucleic acid sequences, then the synchronously aligned sequences generated from the nucleic acid sequences obtained from the sample would be a linear human genomic nucleic acid sequence merged with one or more microorganisms’ genomic nucleic acid sequences as aligned with the chimeric reference sequence.

[0036] The synchronously aligned sequences can then be further analyzed by comparing the synchronously aligned sequences with a control nucleic acid sequence. Alternatively, it is also contemplated that instead of obtaining nucleic acid sequences from the sequencer and aligning with the reference sequence to obtain the location and alignment information of each read, the nucleic acid sequence data set can be obtained from a database, preferably in BAM or SAM format, in which each nucleic acid sequence is accompanied by location information relative to the reference sequence. In some embodiments, the control nucleic acid sequence is the same nucleic acid sequence as the reference sequence with which the nucleic acid sequences from the sample are aligned. In other embodiments, the control nucleic acid sequence is a different nucleic acid sequence from the reference sequence with which the nucleic acid sequences from the sample are aligned. For example, where the synchronously aligned sequences were aligned with the reference sequence having only human genomic nucleic acid sequence, the control sequence can be the reference sequence having only human genomic nucleic acid sequence or a nucleic acid sequence including a plurality of microorganisms’ genomic nucleic acid sequences. In such example, only the portion of the sequences that was not aligned previously with the reference sequence will be analyzed against the control nucleic acid sequence (e.g., a fusion sequence of a plurality of microorganisms’ genomic nucleic acid sequences, etc.).

[0037] Figure 2 shows at least three scenarios of how the nucleic acid reads obtained from the sequencer can be processed via an aligner and the BAMBAM analyzer. In scenarios 1-3, a plurality of short reads that may contain a mixed population of human genome sequence reads and microorganism sequence reads are obtained from the sequencer. In scenario 1, the short reads are aligned with a reference sequence comprising a fused human genome sequence and a plurality of microorganism sequences at one end of the human genome sequence. The aligned sequence at the aligner can then be formatted in BAM file format and analyzed for further information by comparing with the control nucleic acid sequence comprising a fused human genome sequence (HG) and a plurality of microorganism sequences (MO) at one end of the human genome sequence. In some embodiments, the reference sequence used in the aligner and the control nucleic acid sequence used in BAMBAM analysis can be the same or a substantially similar file (e.g., having a changed or altered order of microorganism sequences, etc.).

[0038] In scenario 2, the short reads are aligned with a reference sequence comprising fused microorganism sequences of a plurality of microorganisms. In such scenario, the human genome sequence reads can be discarded from further analysis, and only aligned microorganism sequences are converted into BAM file format. Such converted microorganism sequences can then be analyzed by comparing with the control nucleic acid sequence comprising fused microorganism sequences of a plurality of microorganisms or a control sequence comprising a fused human genome sequence (HG) and a plurality of microorganism sequences (MO) at one end of the human genome sequence.

[0039] In scenario 3, the short reads are aligned with a reference sequence comprising a human genome sequence (HG) only. In such scenario, the unaligned sequences (presumably non-human sequences) are separated out and converted into BAM file format. Such converted unaligned sequences are then analyzed by comparing with the control nucleic acid sequence comprising fused microorganism sequences of a plurality of microorganisms or a control sequence comprising a fused human genome sequence (HG) and a plurality of microorganism sequences (MO) at one end of the human genome sequence.
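The two-stage flow of scenario 3 can be sketched as follows. To keep the example self-contained, an exact substring search stands in for a real aligner; that placeholder, the function names `toy_align` and `scenario_three`, and the toy sequences are all assumptions made for illustration and do not reflect how the aligner or the BAMBAM analyzer operates.

```python
def toy_align(read, reference):
    """Placeholder aligner: exact substring match, returns position or None."""
    position = reference.find(read)
    return position if position >= 0 else None


def scenario_three(reads, human_reference, microbial_chimera):
    """Align reads to the human reference first, then pass the unaligned
    reads to the microbial chimeric sequence.

    Returns (human_hits, microbial_hits, unassigned), where the hit dicts
    map each read to its position in the respective sequence.
    """
    human_hits, microbial_hits, unassigned = {}, {}, []
    for read in reads:
        position = toy_align(read, human_reference)
        if position is not None:
            human_hits[read] = position
            continue
        position = toy_align(read, microbial_chimera)
        if position is not None:
            microbial_hits[read] = position
        else:
            unassigned.append(read)
    return human_hits, microbial_hits, unassigned


human = "AAAACCCCGGGG"      # toy stand-in for a human reference
chimera = "TTTTGACATCTAG"   # toy stand-in for merged microbial genomes
reads = ["CCCCGG", "GACATC", "TATATA"]
print(scenario_three(reads, human, chimera))
```

Scenarios 1 and 2 differ only in which sequence is used in the first pass, so the same structure could be reused with the references swapped or merged.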

[0040] The inventors contemplate that various analyses on the nucleic acid sequences obtained from the sample can be performed including, but not limited to, origin identification, mutation analysis, quantification analysis, and ratio analysis. For example, the origin of the nucleic acid sequence from the sample can be identified by aligning the nucleic acid sequence with the control nucleic acid sequence having a plurality of microorganisms’ genomic nucleic acid sequences. In some embodiments, each genomic sequence merged in the control nucleic acid sequence is associated with the microorganism information and the location in the chimeric reference nucleic acid. Thus, in such embodiments, identification of the origin of the nucleic acid sequence from the sample can be rapidly and readily accomplished by determining the aligned locus in the control nucleic acid sequence and the number of reads aligned in such locus. For example, in some embodiments, the origin of the nucleic acid sequence from the sample can be identified by identifying the portion of the chimeric reference nucleic acid sequence that is aligned with at least 5 reads, at least 10 reads, at least 20 reads, etc. The threshold for the number of reads in such embodiments can be pre-determined based on the sequencing depth (e.g., 10x, 20x, 40x, 50x, etc.). In addition, it is contemplated that any aligned read that has less than 90% homology, less than 80% homology, or less than 70% homology with the aligned chimeric reference nucleic acid sequence can be disregarded from the further analysis. Likewise, it is also contemplated that any aligned reads, taken together, that cover less than 50%, less than 40%, less than 30%, or less than 20% of any organism’s genomic sequence in the control nucleic acid sequence can be disregarded from the further analysis to reduce any false positive signal.
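The three filters described above (a minimum number of reads, a minimum per-read homology, and a minimum covered fraction of the organism's genome) can be combined as in the sketch below. The default thresholds are picked from the ranges listed in the text purely for illustration, and the `AlignedRead` record, its field names, and `detect_organisms` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedRead:
    organism: str    # organism whose segment the read aligned to
    start: int       # 0-based start within that organism's genome
    length: int      # aligned length in bases
    identity: float  # fraction of matching bases, 0.0 to 1.0


def detect_organisms(reads, genome_lengths, min_reads=5,
                     min_identity=0.9, min_coverage=0.2):
    """Return organisms passing the read-count and coverage thresholds.

    Reads below `min_identity` are disregarded before counting. Covered
    positions are tracked in sets, which is only practical for small
    examples such as this one.
    """
    counts = {name: 0 for name in genome_lengths}
    covered = {name: set() for name in genome_lengths}
    for read in reads:
        if read.organism not in genome_lengths or read.identity < min_identity:
            continue
        counts[read.organism] += 1
        end = min(read.start + read.length, genome_lengths[read.organism])
        covered[read.organism].update(range(read.start, end))
    detected = []
    for name, length in genome_lengths.items():
        coverage = len(covered[name]) / length
        if counts[name] >= min_reads and coverage >= min_coverage:
            detected.append(name)
    return detected


lengths = {"A": 100, "B": 100}
reads = [AlignedRead("A", s, 10, 0.95) for s in (0, 10, 20, 30, 40)]
reads += [AlignedRead("B", 0, 10, 0.95)] * 5   # many reads, one small locus
reads += [AlignedRead("A", 50, 10, 0.50)]      # low homology, disregarded
print(detect_organisms(reads, lengths))        # ['A']
```

In the example, organism B has enough reads but they pile up on 10% of its genome, so the coverage filter excludes it, which is the false-positive case the paragraph describes.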

[0041] Identification of a microorganism’s origin can be expanded to multiple microorganisms in the microbiome of a sample. For example, the origins of the nucleic acid sequences from the sample can be identified by identifying the portions of the control nucleic acid sequence that are aligned with at least 5 reads, at least 10 reads, at least 20 reads, etc. Thus, for example, if the nucleic acid sequences from the sample include reads that are aligned to the genomic nucleic acid sequences of microorganisms A, B, and C, it is likely that the sample includes microorganisms A, B, and C. Of course, the threshold for the number of reads in such embodiments can also be predetermined based on the sequencing depth (e.g., 10x, 20x, 40x, 50x, etc.). In addition, it is contemplated that any aligned read that has less than 90% homology, less than 80% homology, or less than 70% homology with the aligned chimeric reference nucleic acid sequence can be disregarded from further analysis. Likewise, it is also contemplated that any aligned reads that, taken together, cover less than 50%, less than 40%, less than 30%, or less than 20% of any organism’s genomic sequence in the chimeric reference nucleic acid sequence can be disregarded from further analysis to reduce any false positive signal.

[0042] Further, the inventors also contemplate that any presence or emergence of mutation in the microorganism in the sample can be identified by aligning with the control nucleic acid and identifying a mismatched nucleotide sequence indicating a mutation (e.g., a deletion, a point mutation, an insertion, a duplication, etc.).
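One way to picture the mismatch identification is to walk a single read's alignment against the reference using its CIGAR string, as in the sketch below. The function name and the toy sequences are hypothetical; a practical caller would require support from several reads before reporting a mutation, and this sketch does not attempt to recognize duplications.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def find_variants(ref, pos, cigar, read):
    """Report (kind, ref_position, ref_allele, read_allele) tuples for one alignment.
    ref: reference string; pos: 0-based alignment start; cigar: SAM CIGAR string."""
    variants = []
    r, q = pos, 0  # cursors into the reference and the read
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":          # aligned bases: compare one by one
            for i in range(n):
                if ref[r + i] != read[q + i]:
                    variants.append(("point", r + i, ref[r + i], read[q + i]))
            r += n
            q += n
        elif op == "I":          # bases present in the read only
            variants.append(("insertion", r, "", read[q:q + n]))
            q += n
        elif op == "D":          # bases present in the reference only
            variants.append(("deletion", r, ref[r:r + n], ""))
            r += n
        elif op == "N":          # skipped reference region
            r += n
        elif op == "S":          # soft-clipped read bases
            q += n
        # H and P consume neither the read nor the reference
    return variants

# Toy example: one substitution, one inserted base, and a two-base deletion.
reference = "ACGTACGTACGT"
variants = find_variants(reference, 2, "4M1I3M2D1M", "GTTCGGTAT")
# variants == [("point", 4, "A", "T"), ("insertion", 6, "", "G"), ("deletion", 9, "CG", "")]
```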

[0043] Additionally, the inventors contemplate that ratios among the multiple types of microorganisms in the sample can be determined by comparing the quantities of reads aligned with the control nucleic acid sequence. For example, once the origins of the nucleic acid sequences from the sample are identified by aligning with the control nucleic acid sequence, the quantities of reads that are aligned to each portion of the control nucleic acid sequence corresponding to the distinct genomic nucleic acid sequence of a microorganism can be determined. The ratio between or among the quantities of reads is likely to reflect the ratio of amounts between or among the multiple organisms in the sample.

[0044] In some embodiments, samples can be obtained at multiple time points in order to determine any changes in the microbiome over a relevant time period. For example, samples, or nucleic acid information from the samples, can be obtained before and after the sample is treated with an antibiotic, or at different time points after the sample is treated with the antibiotic. In another example, where the sample is a tumor tissue, samples, or nucleic acid information from the samples, can be obtained before, during, and/or after (e.g., upon completion, etc.) a one-time or a series of anti-tumor treatments (e.g., radiotherapy, chemotherapy, immunotherapy, etc.).

[0045] Consequently, the inventors contemplate that various analyses can be performed using the microorganisms’ genomic information obtained at different time points. For example, a relative quantity or a quantity change of a microorganism can be determined by quantifying the number of reads that are aligned in a portion of the chimeric nucleic acid sequence in a first and a second condition. One use of such example may include determination of the quantity change of a strain of E. coli (e.g., O157:H7, etc.) in the gastrointestinal tract of a patient infected by O157:H7 upon antibiotic A treatment to determine the effect of antibiotic A in the treatment of O157:H7 infection. In such example, genomic nucleic acid sequences are obtained from at least two gastrointestinal tract samples (e.g., via biopsy, wiping the surface of the gastrointestinal tract, fecal sample, etc.): one before the antibiotic A treatment and another after the antibiotic A treatment, and the obtained genomic nucleic acid sequences are aligned with the control nucleic acid sequence including the full genomic sequence of O157:H7. The number of reads aligned to a portion of the genomic sequence of O157:H7, preferably the same portion of the genomic sequence of O157:H7, in the chimeric reference nucleic acid sequence can be quantified for the genomic nucleic acid sequences obtained before antibiotic A treatment and for those obtained after antibiotic A treatment. The relative quantity or changes in the quantity can be measured either by a difference in the absolute number of reads in those samples or by a percentage change (increase or decrease) between those samples.
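The two measures named above, the absolute difference in read counts and the relative change between samples, reduce to simple arithmetic. The sketch below uses hypothetical read counts and a hypothetical function name, and assumes both samples were sequenced to comparable depth so that raw counts can be compared directly.

```python
def quantity_change(reads_before, reads_after):
    """Return (absolute difference, percentage change) in aligned read counts.
    The percentage is None when there were no reads before treatment."""
    diff = reads_after - reads_before
    pct = None if reads_before == 0 else 100.0 * diff / reads_before
    return diff, pct

# Hypothetical counts of reads aligned to the same O157:H7 portion of the reference.
diff, pct = quantity_change(reads_before=1200, reads_after=300)
# diff == -900, pct == -75.0
```

A negative value indicates a decrease in aligned reads after treatment.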

[0046] Another use of such example may include detection of a change in microbiome balance in a tumor by determining the ratio changes among a plurality of microorganisms in tissues of a colon cancer patient. In such example, tumor tissues of the colon cancer patient can be obtained at different time points via biopsies, and genomic nucleic acid sequences are obtained from such tumor tissues. The obtained genomic nucleic acid sequences are aligned with the control nucleic acid sequence including a full-length human genomic nucleic acid (e.g., organized by individual chromosome, etc.) merged with genomic nucleic acid sequences of any potential microorganisms that may be present in the colon cancer tissue (e.g., Actinobacteria, Saprospiraceae, Capnocytophaga, Christensenellaceae, Acidobacteria, Corynebacterium, etc.). The number of reads aligned to a portion of the genomic sequence of each microorganism in the control nucleic acid sequence can be quantified to determine the ratio among the microorganisms present in the colon cancer tissue microbiome of the patient as well as changes of the microorganism quantity in the colon cancer tissue. In some embodiments, the number of reads aligned with genomic nucleic acid sequences of a microorganism can be normalized to the number of reads aligned with a human genome sequence.
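The ratio determination and the normalization to human reads can be sketched as follows. The read counts, time-point labels, and function names are hypothetical, and dividing by the human read count is only one plausible reading of "normalized" here.

```python
def relative_ratios(microbe_reads):
    """Fraction of microbial reads attributed to each microorganism."""
    total = sum(microbe_reads.values())
    if total == 0:
        return {name: 0.0 for name in microbe_reads}
    return {name: n / total for name, n in microbe_reads.items()}

def normalize_to_human(microbe_reads, human_reads):
    """Microbial read counts expressed per human-aligned read."""
    if human_reads <= 0:
        raise ValueError("human_reads must be positive")
    return {name: n / human_reads for name, n in microbe_reads.items()}

# Hypothetical counts from two biopsies of the same tumor.
t1 = {"Capnocytophaga": 300, "Corynebacterium": 100}
t2 = {"Capnocytophaga": 100, "Corynebacterium": 100}

ratios_t1 = relative_ratios(t1)   # {"Capnocytophaga": 0.75, "Corynebacterium": 0.25}
ratios_t2 = relative_ratios(t2)   # {"Capnocytophaga": 0.5, "Corynebacterium": 0.5}
per_human_t1 = normalize_to_human(t1, human_reads=1_000_000)
```

Comparing `ratios_t1` with `ratios_t2` shows a shift in balance between the two microorganisms, while the human-normalized values allow samples with different total read counts to be compared.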

[0047] Optionally, where the sample is a diseased tissue (e.g., tumor tissue, etc.) of a patient, the nucleic acid sequence obtained from the sample can be aligned with a control nucleic acid sequence that has a human genomic nucleic acid sequence derived from a matched normal tissue from the same patient, so as to obtain reads that do not match the control nucleic acid sequence and that may represent a tumor-specific mutation. Alternatively, the nucleic acid sequence obtained from the sample can be aligned with a control nucleic acid sequence that has a human genomic nucleic acid sequence derived from a healthy individual, and further aligned with a nucleic acid sequence derived from a matched normal tissue from the same patient in order to obtain information on tumor-specific mutations as well as individual germline variations.

[0048] In some embodiments, samples can be obtained from multiple locations in order to determine any differences among distinct microbiomes. For example, tumor tissues of the colon cancer patient can be obtained from different tumor masses via biopsies, and genomic nucleic acid sequences are obtained from such tumor tissues. The obtained genomic nucleic acid sequences are aligned with the control nucleic acid sequence including a full-length human genomic nucleic acid (e.g., organized by individual chromosome, etc.) merged with genomic nucleic acid sequences of any potential microorganisms that may be present in the colon cancer tissue (e.g., Actinobacteria, Saprospiraceae, Capnocytophaga, Christensenellaceae, Acidobacteria, Corynebacterium, etc.). The number of reads aligned to a portion of the genomic sequence of each microorganism in the control nucleic acid sequence can be quantified to determine the ratio among the microorganisms present in the colon cancer tissue microbiome of the patient as well as any differences in the microorganism quantity or in microorganism ratios between those colon cancer tissues.

[0049] It is contemplated that such in silico microbiome analysis can be applied and used for many industrial and medical purposes besides the examples provided above. For example, in silico microbiome analysis can be performed with samples obtained from the gastrointestinal tract of a pet animal (e.g., a dog, a cat, etc.) in order to customize the pet food selection and avoid any food items that may not be digested or processed by the pet. In another example, in silico microbiome analysis can be performed with tissue samples obtained from a group of individuals suspected to be associated with a disease or syndrome that has no clear association with genetic or environmental factors, to identify any factors that may contribute to the prognosis of the disease or syndrome. In still another example, in silico microbiome analysis can be performed with samples (e.g., blood, tissues, etc.) obtained from livestock to track any emerging infection and/or inflammation caused by one or more microorganisms that are likely to spread among the livestock.

[0050] It should be particularly appreciated that contemplated systems and methods will be especially advantageous to identify the presence of one or more microorganisms (e.g., contamination of a tissue culture sample, contamination of a viral production environment, etc.), to track the changes in the microbiome of a diseased tissue, of an animal, or of a group of animals, or for other purposes without a priori knowledge of the existence and/or types of microorganisms in the sample. In addition, the above in silico microbiome analysis can provide quick and thorough results by avoiding the cumbersome process of amplifying and/or isolating the individual genomic nucleic acids of microorganisms using a common primer and/or a primer specific to the genomic nucleic acid of the microorganisms.

[0051] The inventors further contemplate that various further actions can be taken based on the microbiome information so obtained. For example, where the samples are obtained from a patient’s infected tissue before and after antibiotic treatment, the relative quantity and/or change of the microbiome of the patient’s infected tissue can be analyzed to update and/or generate the patient’s record with respect to the effectiveness of the antibiotic treatment. In some embodiments, a treatment regimen can be generated or updated based on the effectiveness of the antibiotic treatment, for example, by changing to another antibiotic if the quantities of one or two types of bacteria are not reduced or are even increased after the antibiotic treatment, or if the balance of the microbiome is changed due to a decrease or increase of one or two types of bacteria relative to others, and so on.

[0052] Moreover, it should be appreciated that contemplated systems and methods not only allow for detection of microorganisms, but also for detection of interactions between the host and the microorganism. For example, where the sequence information of the microorganism and the sequence information of the human (or other non-microbial host) include transcriptomics information, such transcriptomics information can be correlated to identify or detect interactions between the host and the microorganism. Among other use cases, it is contemplated that the expression of one or more microbial genes could be associated with the expression (level) of one or more genes in the human (or other non-microbial host). Such association could be indicative of a disease state in the human (e.g., acute infection, chronic infection, latent infection, etc.), or of a proper or inappropriate immune response in the host. Still further, it should be noted that such associations may also be used to identify and/or detect signatures in the host and/or microorganism. For example, contemplated host signatures include allergic or inflammatory immune responses (high expression of human immune genes) and tolerance to a microorganism (lack of expression of human immune genes), while contemplated signatures in the microorganism include specific expression of disease-related genes.
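The text does not specify how the transcriptomics information is correlated. As one illustrative possibility only, the sketch below computes a Pearson correlation between the expression level of a microbial gene and that of a host gene measured across the same samples; the expression values and the function name are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels across four samples from the same host.
microbial_gene = [1.0, 2.0, 3.0, 4.0]
host_immune_gene = [2.0, 4.0, 6.0, 8.0]
r = pearson(microbial_gene, host_immune_gene)  # close to 1.0 for this toy input
```

A strongly positive or negative coefficient across samples would flag a gene pair as a candidate host-microorganism association that merits further study; it does not by itself establish an interaction.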

[0053] It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C..., and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc. Moreover, as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

Claims

What is claimed is:

1. A method of identifying microbiome of a sample in silico, comprising:

obtaining a nucleic acid sequence from the sample, wherein the nucleic acid sequence comprises a plurality of reads;

aligning the reads with a reference nucleic acid sequence to locate the reads relative to the reference nucleic acid sequence;

aligning the located reads with a chimeric nucleic acid sequence comprising at least two microorganisms’ genomic and/or transcriptomic nucleic acid sequences that are merged to form a single sequence file; and

identifying an origin of the nucleic acid sequence from an alignment of the chimeric nucleic acid sequence and the located reads.

2. The method of claim 1, wherein the sample is a human tissue, and the nucleic acid sequence comprises a human genomic nucleic acid sequence and a non-human genomic nucleic acid sequence.

3. The method of claim 2, wherein the chimeric nucleic acid sequence further comprises a human reference genomic nucleic acid that is merged with the at least two microorganisms’ genomic and/or transcriptomic nucleic acid sequences.

4. The method of claim 2, wherein the human tissue is a diseased tissue.

5. The method of claim 4, further comprising comparing the located reads with a nucleic acid sequence of a matched normal tissue to identify a tumor specific mutation.

6. The method of claim 3, further comprising:

generating a patient chimeric nucleic acid sequence comprising the non-human genomic nucleic acid sequence merged at an end of the human genome nucleic acid sequence; and

aligning the patient chimeric nucleic acid sequence and the chimeric nucleic acid sequence to identify the origin of the nucleic acid sequence.

7. The method of claim 1, wherein the nucleic acid sequence comprises at least one microorganism’s genomic nucleic acid sequence.

8. The method of claim 1, wherein the aligning of the located reads with the chimeric nucleic acid sequence uses incremental synchronized alignment.

9. The method of claim 1, wherein the microorganism is selected from a bacterium, a yeast, a fungus, a virus, and a mycoplasma.

10. The method of claim 1, wherein the nucleic acid sequence from the sample and the chimeric reference nucleic acid sequence are in BAM, SAM, FASTQ, FASTA, or FASTA index format.

11. The method of claim 1, wherein the identifying the origin of the nucleic acid sequence comprises determining a quantity of first reads of the nucleic acid sequence that are aligned with a first portion of the chimeric nucleic acid sequence.

12. The method of claim 11, further comprising:

determining a quantity of second reads of the nucleic acid sequence that are aligned with a second portion of the chimeric nucleic acid sequence; and

determining a relative quantity of first and second microorganisms in the sample, wherein the first and second portions are nucleic acid sequences of the first and second microorganisms, respectively.

13. The method of claim 1, further comprising generating or updating a record of the sample according to the identification of the origin.

14. The method of claim 2, further comprising providing or recommending a treatment that is specific to the origin.

15. The method of claim 1, wherein the nucleic acid sequence is obtained from the sample periodically, and further comprising generating or updating a record of the sample according to a change of the identification of the origin.

16. A method of determining a change in a microbiome of a biological entity in silico, comprising:

obtaining first and second samples from the biological entity at first and second time points;

obtaining first and second nucleic acid sequences from the first and second samples, wherein each of the first and second nucleic acid sequences comprises a plurality of reads;

aligning reads of the first and second nucleic acid sequences with a reference nucleic acid sequence to locate the reads relative to the reference nucleic acid sequence; and

determining a change in the microbiome by comparing alignments of located reads derived from the first and second nucleic acid sequences with the chimeric nucleic acid sequence.

17. The method of claim 16, wherein the first and second samples are derived from a human tissue, and each of the first and second nucleic acid sequences comprises a human genomic nucleic acid sequence and a non-human genomic nucleic acid sequence, respectively.

18. The method of claim 17, wherein the chimeric nucleic acid sequence further comprises a human reference genomic nucleic acid that is merged with the at least two microorganisms’ genomic and/or transcriptomic nucleic acid sequences.

19. The method of claim 17, wherein the human tissue is a diseased tissue.

20. The method of claim 19, further comprising comparing the located reads derived from the first and second nucleic acid sequences with a nucleic acid sequence of a matched normal tissue to identify a tumor specific mutation.

21. The method of claim 19, further comprising:

generating first and second patient chimeric nucleic acid sequences, each comprising the non-human genomic nucleic acid sequence merged at an end of the human genome nucleic acid sequence; and

aligning the patient chimeric nucleic acid sequence and the chimeric reference nucleic acid sequence to identify the origin of the nucleic acid sequence.

22. The method of claim 16, wherein the first and second nucleic acid sequences comprise at least one microorganism’s genomic nucleic acid sequence.

23. The method of claim 16, wherein the aligning of the first and second nucleic acid sequences with the chimeric nucleic acid sequence uses incremental synchronized alignment.

24. The method of claim 16, wherein the microorganism is selected from a bacterium, a yeast, a fungus, a virus, and a mycoplasma.

25. The method of claim 16, wherein the first and second nucleic acid sequences and the chimeric nucleic acid sequence are in BAM, SAM, FASTQ, FASTA, or FASTA index format.

26. The method of claim 16, wherein the change in the microbiome comprises at least one of a quantity change of at least one microorganism, a ratio change among a plurality of microorganisms, and a mutation in at least one microorganism.

27. The method of claim 26, wherein the quantity change is determined by measuring quantities of first and second reads of first and second nucleic acid sequences, wherein the first and second reads are aligned with a first portion of the chimeric nucleic acid sequence.

28. The method of claim 26, wherein the ratio change is determined by measuring first and second reads of the first nucleic acid sequence and third and fourth reads of the second nucleic acid sequence, wherein the first and third reads, and second and fourth reads are aligned with a first portion or a second portion of the chimeric nucleic acid sequence, respectively.

29. The method of claim 28, wherein the first and second portions are nucleic acid sequences of the first and second microorganisms, respectively.

30. The method of claim 28, wherein the first and second portions are nucleic acid sequences of different strains of the same species of the microorganism.

31. The method of claim 16, further comprising generating or updating a record of the sample according to the change of the microbiome.

32. The method of claim 17, further comprising providing or recommending a treatment according to the change of the microbiome.

33. The method of claim 16, wherein the first time point is before application of an antibiotic, and the second time point is after application of the antibiotic.