CN113228194A

CN113228194A - Multiomic search engine for comprehensive analysis of cancer genomic and clinical data

Info

Publication number: CN113228194A
Application number: CN201980080958.7A
Authority: CN
Inventors: A·哈雷; E·辛布洛特; C·劳
Original assignee: Human Longevity Co
Current assignee: Human Longevity Co; Human Longevity Inc
Priority date: 2018-10-12
Filing date: 2019-10-14
Publication date: 2021-08-06
Also published as: AU2019356597A1; US20210319907A1; JP2022504916A; CA3115991A1; WO2020077352A1; EP3864659A1

Abstract

Methods for tumor profiling using multiomic data indexing are provided. The method can include storing a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices comprises cancer-specific tokenized data; ingesting additional multiomic data and any annotations associated with the additional multiomic data, the additional multiomic data relating to one or more of the indices; indexing the ingested additional multiomic data and annotations while preserving, in a particular index, a multiomic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multiomic data; receiving a user query; selecting one or more relevant multiomic data indices based on the user query; ranking the selected one or more multiomic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency; and returning the ranked one or more multiomic data indices to the user.

Description

Multiomic search engine for comprehensive analysis of cancer genomic and clinical data

Background

With the increasing importance of cancer genome sequencing, thousands of cancer genomes, exomes, transcriptomes, proteomes, and other cancer datasets have been generated by both private and public institutions (e.g., The Cancer Genome Atlas [TCGA], International Cancer Genome Consortium [ICGC]). Interpretation and analysis of tumor and normal sequencing data depends on integrated analysis and correlation of both private and public genomic data and databases.

The industry, biopharmaceutical companies, research institutions, and the international cancer community face barriers such as, for example, (1) providing immediate access to any sample or subset of samples; (2) integrating multiomic datasets to form a complete picture of tumor biology; and (3) effectively correlating prognostic, diagnostic, and treatment information with all available data (e.g., genomic, transcriptomic, proteomic, functional, medical, imaging, literature data) to provide clinical insight and actionability for individual cancer patients, as well as stratification of patient cohorts based on potential multiomic prognostic, diagnostic, or therapeutic biomarker(s).

Currently, publicly available data is scattered throughout publications, guidelines, and web-based resources. Ultimately, a solution to the three problems described above would allow widespread clinical application of cancer genomic analysis.

Data consolidation and integration present a particularly serious challenge in cancer sequencing, namely, standardization and integration that allow users to consolidate multiple sources of data and identify clinically and biologically relevant information. Additionally, compared to germline sequence analysis, genomic analysis of cancer requires extensive bioinformatics pipelines and multiple omics streams to generate data for the same sample. For example, for a typical cancer biopsy and blood sample, binary base calls (BCL) for tumor DNA, normal DNA, tumor RNA, and sometimes normal RNA must be converted to Variant Call Format (VCF) via alignment to a reference genome, de-duplication, realignment, and variant recalibration. Furthermore, running multiple somatic variant callers to derive a consensus set of somatic single nucleotide variants (SNVs) and small insertions and deletions (indels) is often the industry standard. Of further interest are, for example, detection of tumor copy number variants (CNVs), differential gene expression between tumor and normal RNA-Seq replicates, data processing to confirm that variants detected in somatic (tumor) DNA are also expressed in RNA, and pipelines to detect gene fusions. Also of interest are tools that call large structural variants and tools that perform advanced bioinformatics to annotate cancer alterations, calculate relevant properties of tumors (e.g., tumor mutation burden, genomic mutation signatures, microsatellite status, expressed neoantigens, HLA typing of the normal genome), and identify clinically relevant tumor alterations.

Modern cancer profiling techniques can easily generate 25 gigabytes of multiomic data per sample, which means that researchers conducting medium-sized cancer biomarker discovery studies are easily confronted with terabytes of raw data. Identifying relevant biomarkers is thus akin to finding a needle in a haystack. Moreover, once the analysis pipeline has finished running, it is virtually impossible to interact with the results to form new hypotheses.

The most common approach to solving the accessibility, multiomic integration, and actionability problems of current cancer data is to design a portal that displays pre-filtered data tables and analyses based on previously curated files and pre-computed workflows. Examples of portals include the Illumina BaseSpace Correlation Engine and Cohort Analyzer, WuXi NextCODE Portal, cBioPortal, IntOGen, Tumorscape, TumorPortal, Xena, ICGC Data Portal, St. Jude PeCan, and Qiagen OmicSoft. However, these portals often limit the types of questions that can be asked and the additional analyses that can be performed. Furthermore, data at many levels of the bioinformatics pipeline is typically not accessible for querying. The data in a portal is typically pre-filtered, not integrated, and typically not ranked. Additionally, most portals do not host individual user data. The few portals that allow users to upload their own data typically do not provide a way to integrate the user's data with portal data, or to derive advanced cancer analytics and make these data accessible and ranked by clinical actionability, pathogenicity, feature weight, or frequency.

Accordingly, there is a need for systems and methods that efficiently and effectively provide immediate access to any sample or subset of samples. There is also a need for systems and methods that efficiently integrate multiomic datasets to form a complete picture of tumor biology. There is also a need for systems and methods that effectively and efficiently correlate prognostic, diagnostic, and treatment information with all available data (e.g., genomic, transcriptomic, proteomic, functional, medical, imaging, literature data) to provide clinical insight and actionability for individual cancer patients, and stratify patient cohorts according to potential multiomic prognostic or therapeutic biomarker(s).

Disclosure of Invention

According to various embodiments, a method for tumor profiling using multiomic data indexing is provided. The method may include storing a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices includes cancer-specific tokenized data. The method may further include ingesting additional multiomic data and any annotations associated with the additional multiomic data, the additional multiomic data being related to one or more of the indices. The method may further include indexing the ingested additional multiomic data and annotations while preserving, in a particular index, a multiomic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multiomic data. The method may also include receiving a user query. The method may also include selecting one or more relevant multiomic data indices based on the user query. The method may further include ranking the selected one or more multiomic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The method may also include returning the ranked one or more multiomic data indices to the user.
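
The method steps above (store indices, ingest and tokenize, preserve the per-patient mapping, query, select, rank, return) can be illustrated with a minimal sketch. It is not the disclosed implementation: the `Document` and `MultiomicIndex` classes, the field names, the AND query semantics, and the precomputed `actionability` score are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """One indexed record; every field name here is hypothetical."""
    patient_id: str
    stream: str            # e.g. "somatic_dna" or "rna_expression"
    gene: str
    variant: str
    actionability: float   # assumed to be precomputed elsewhere
    annotations: list = field(default_factory=list)


class MultiomicIndex:
    """Toy inverted index keyed by lower-cased tokens."""

    def __init__(self, name):
        self.name = name
        self.docs = []
        self.postings = defaultdict(set)  # token -> positions in self.docs

    def ingest(self, doc):
        # Gene, variant, patient and stream stay together in one document,
        # which keeps the mapping between them for the same patient.
        pos = len(self.docs)
        self.docs.append(doc)
        tokens = [doc.gene, doc.variant, doc.patient_id, doc.stream,
                  *doc.annotations]
        for tok in tokens:
            self.postings[tok.lower()].add(pos)

    def search(self, query):
        # A document matches only if it contains every query token.
        hits = None
        for tok in query.lower().split():
            matched = self.postings.get(tok, set())
            hits = matched if hits is None else hits & matched
        return [self.docs[i] for i in sorted(hits or ())]


def answer(indices, query):
    """Collect hits from every index, then rank them by actionability."""
    hits = [doc for index in indices for doc in index.search(query)]
    return sorted(hits, key=lambda d: d.actionability, reverse=True)


dna = MultiomicIndex("somatic_dna")
rna = MultiomicIndex("rna_expression")
dna.ingest(Document("P1", "somatic_dna", "BRAF", "V600E", 0.9, ["melanoma"]))
rna.ingest(Document("P1", "rna_expression", "BRAF", "overexpressed", 0.4))
results = answer([dna, rna], "braf")
```

Here a single query term reaches two data streams for the same patient, and the ranking step orders the combined hits by the assumed actionability score.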

According to various embodiments, a non-transitory computer-readable medium is provided having stored therein a program for causing a computer to execute a method for tumor profiling using multiomic data indexing. The method can include storing a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices includes cancer-specific tokenized data. The method may further include ingesting additional multiomic data and annotations associated with the additional multiomic data, the additional multiomic data being related to one or more of the indices. The method may further include indexing the ingested additional multiomic data and annotations while preserving, in a particular index, a multiomic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multiomic data. The method may also include receiving a user query. The method may also include selecting one or more relevant multiomic data indices based on the user query. The method may further include ranking the selected one or more multiomic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The method may also include returning the ranked one or more multiomic data indices to the user.

According to various embodiments, a system for tumor profiling using multiomic data indexing is provided. The system may include an indexing unit. The indexing unit may include a storage element configured to store a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices includes cancer-specific tokenized data. The indexing unit may also include an indexing engine. The indexing unit may be configured to ingest additional multiomic data and annotations associated with the additional multiomic data, the additional multiomic data relating to one or more of the indices. The indexing unit may be further configured to index the ingested additional multiomic data and annotations while preserving, in a particular index, a multiomic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multiomic data. The system may also include a user interface configured to receive a user query. The system may also include a query engine configured to select one or more relevant multiomic data indices from the indexing unit based on the user query. The system may further include a ranking engine configured to receive the selected one or more relevant multiomic data indices and rank them based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The ranking engine may be further configured to return the ranked one or more multiomic data indices to the user via the user interface.

According to various embodiments, a system for tumor profiling using multiomic data indexing is provided. The system may include an indexing unit. The indexing unit may include a storage element configured to store a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices includes cancer-specific tokenized data. The indexing unit may also include an indexing engine. The indexing unit may be configured to ingest additional multiomic data and annotations associated with the additional multiomic data, the additional multiomic data relating to one or more of the indices. The indexing unit may be further configured to index the ingested additional multiomic data and annotations while preserving, in a particular index, a multiomic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multiomic data. The system may also include a user interface configured to receive a user query. The system may also include a query engine configured to select one or more relevant multiomic data indices from the indexing unit based on the user query. The query engine may be further configured to rank the selected one or more multiomic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The query engine may be further configured to return the ranked one or more multiomic data indices to the user via the user interface.

According to various embodiments, a multiomic cancer search engine system for tumor profiling is provided. The system may include: a storage element configured to store a plurality of integrated multiomic indices; an advanced cancer analytics software module; a multiomic indexing pipeline; a ranking engine reflecting the clinical utility of multiomic cancer alterations; a query engine that selects and combines relevant multiomic indices and returns ranked multiomic alterations for individual samples and cohorts of samples; and a user interface configured to receive a user query and perform a search of cancer data.

Additional aspects will become apparent from the following detailed description, the appended claims and the accompanying drawings.

Drawings

The above illustrative examples of various aspects and implementations provide an overview or framework for understanding the nature and character of the claimed aspects and implementations:

Fig. 1 illustrates an example of a system architecture for a multiomic cancer search engine, in accordance with various embodiments.

Fig. 2a illustrates an example of a multiomic index organization, in accordance with various embodiments. Fig. 2b illustrates an example of hierarchical propagation of annotations and ranking of variants, in accordance with various embodiments.

Fig. 3 illustrates an example of a set of cancer analyses that are pre-computed and dynamically computed for individual samples and cohorts, in accordance with various embodiments.

Fig. 4a illustrates an example of a wide and deep model for learning variant rankings, in accordance with various embodiments. Fig. 4b illustrates an example of a learning-to-rank engine that relies on a Deep Semantic Similarity Model (DSSM) for biomedical data, in accordance with various embodiments.

Fig. 5a and 5b together illustrate an example of a workflow for the operation of a query engine, in accordance with various embodiments.

FIG. 6 illustrates an example of a user interface in accordance with various embodiments. As shown, for example, a single search box allows a user to enter different queries and receive ranked results.

Fig. 7 illustrates an example of search results obtained with a particular syntax, in accordance with various embodiments.

Fig. 8a and 8b illustrate examples of search results obtained with a particular syntax, in accordance with various embodiments.

FIG. 9 illustrates an example of search results returned from a user query, in accordance with various embodiments.

FIG. 10 illustrates an example of search results returned from a user query, in accordance with various embodiments.

FIG. 11 illustrates an example of search results returned from a user query, in accordance with various embodiments.

FIG. 12 illustrates an example of search results returned from a user query, in accordance with various embodiments.

FIG. 13 illustrates a block diagram of a computer system, in accordance with various embodiments.

Fig. 14 illustrates a flow diagram of a method for tumor profiling using multiomic data indexing, in accordance with various embodiments.

Fig. 15 illustrates a system for tumor profiling using multiomic data indexing, in accordance with various embodiments.

Fig. 16 illustrates a system for tumor profiling using multiomic data indexing, in accordance with various embodiments.

It should be understood that the drawings are not necessarily drawn to scale, nor are the objects in the drawings necessarily drawn to scale in relation to one another. The accompanying drawings are included to provide a further understanding of the various embodiments of the apparatus, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be understood that the drawings are not intended to limit the scope of the present teachings in any way.

Detailed Description

The present specification describes various exemplary embodiments of a multiomic search engine for comprehensive analysis of cancer genomic and clinical data, and systems and methods associated therewith. However, the present disclosure is not limited to these exemplary embodiments and applications, nor to the manner in which the exemplary embodiments and applications operate or are described herein.

Unless defined otherwise, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments disclosed herein belong. As used in this specification and the appended claims, the singular forms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Any reference herein to "or" is intended to encompass "and/or" unless otherwise indicated.

The present disclosure describes systems and methods for operating a multiomic search engine for comprehensive analysis of cancer genomic and clinical data, which may be referred to herein by the shorthand "Cancer Search."

Unless defined otherwise, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by one of ordinary skill in the art. Furthermore, unless the context requires otherwise, singular terms shall include the plural, and plural terms shall include the singular. Generally, the terminology used in connection with, and the techniques of, cell and tissue culture, molecular biology, and protein and oligonucleotide or polynucleotide chemistry and hybridization described herein are well known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to the manufacturer's instructions or as commonly done in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. See, e.g., Sambrook et al Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclature described herein, and the laboratory procedures and techniques used in connection therewith, are well known and commonly employed in the art.

As used herein, "DNA" (deoxyribonucleic acid) refers to a chain of nucleotides consisting of 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine), and "RNA" (ribonucleic acid) consists of 4 types of nucleotides: A, U (uracil), G, and C. Certain nucleotide pairs specifically bind to each other in a complementary manner (referred to as complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides complementary to those in the first strand, the two strands bind to form a double strand. As used herein, "nucleic acid sequencing data," "nucleic acid sequencing information," "nucleic acid sequence," "genomic sequence," "genetic sequence," "fragment sequence," or "nucleic acid sequencing read" refers to any information or data indicative of the order of nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a DNA or RNA molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.).
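
As a small illustration of the complementary base pairing rules just described, the following sketch computes the reverse complement of a DNA string and the RNA form of a DNA string. The function names are illustrative only, and the input is assumed to contain nothing but the letters A, C, G, and T.

```python
# A pairs with T and C pairs with G, so each base maps to its partner.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]


def to_rna(seq: str) -> str:
    """Rewrite a DNA sequence with uracil (U) in place of thymine (T)."""
    return seq.upper().replace("T", "U")


strand = "ATGCCTG"
partner = reverse_complement(strand)
rna = to_rna(strand)
```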

It is understood that the present teachings contemplate sequence information obtained using all available techniques, platforms, or technologies, including but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, and the like. "Polynucleotide," "nucleic acid," or "oligonucleotide" refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Oligonucleotides typically range in size from a few monomeric units (e.g., 3-4) to several hundred monomeric units. Unless otherwise indicated, whenever a polynucleotide (such as an oligonucleotide) is represented by a sequence of letters (such as "ATGCCTG"), it is understood that the nucleotides are in 5'->3' order from left to right, and that "A" denotes deoxyadenosine, "C" denotes deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine. As is standard in the art, the letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases.

The phrase "next generation sequencing" (NGS) refers to sequencing technologies with increased throughput compared to traditional Sanger- and capillary electrophoresis-based approaches, e.g., with the ability to generate hundreds of thousands of relatively small sequence reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the MiSeq, HiSeq, and NextSeq systems of Illumina and the Personal Genome Machine (PGM) and SOLiD sequencing systems of Life Technologies Corp. provide massively parallel sequencing of whole or targeted genomes. The SOLiD System and associated workflows, protocols, chemistries, etc. are described in more detail in PCT Publication No. WO 2006/084132, entitled "Reagents, Methods, and Libraries for Bead-Based Sequencing," having an international filing date of February 1, 2006; U.S. patent application Ser. No. 12/873,190, entitled "Low-Volume Sequencing System and Method of Use," filed on August 31, 2010; and U.S. patent application Ser. No. 12/873,132, entitled "Fast-Indexing Filter Wheel and Method of Use," filed on August 31, 2010, each of which is incorporated herein by reference in its entirety.

The phrase "sequencing run" refers to any step or portion of a sequencing experiment that is performed in order to determine certain information about at least one biomolecule (e.g., a nucleic acid molecule).

As used herein, the phrase "genomic feature" may refer to a genomic region (e.g., a gene, protein coding sequence, mRNA, tRNA, rRNA, repeat sequence, inverted repeat, miRNA, siRNA, etc.) or a genetic/genomic variant (e.g., single nucleotide polymorphism/variant, insertion/deletion sequence, copy number variation, inversion, etc.) that represents a single gene or grouping of genes (in DNA or RNA) that has undergone a change due to mutation, recombination/crossover, or genetic drift with reference to a particular species or subpopulation within a particular species, with some annotated function.

As used herein, the term "biomarker" refers to an objective measurable indicator of a biological state.

As used herein, the term "pathogenic" refers to a property of a genetic variant that increases an individual's susceptibility or predisposition to a certain disease or disorder. Such variants are also called susceptibility, deleterious, or disease-causing mutations.

As used herein, the term "germline" refers to tissue derived from reproductive cells (egg or sperm) whose DNA becomes incorporated into every cell of the offspring's body. Germline mutations can be passed from parent to offspring.

As used herein, the term "somatic" refers to a genetic alteration acquired by a cell during cell division. Somatic mutations differ from germline mutations, which are genetic alterations that occur in germ cells.

As used herein, the term "codon" refers to a trinucleotide sequence of DNA or RNA corresponding to a particular amino acid.

As used herein, the term "UI" is an acronym for user interface.

As used herein, the term "query time" refers to the point in time at which a query is submitted by a user.

As used herein, the term "learning to rank," "ranking engine," or "relevance learning engine" refers to the application of machine learning, typically supervised, semi-supervised, or reinforcement learning, in the construction of ranking models for information retrieval systems. The training data consists of lists of items, with some partial order specified between the items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g., "relevant" or "not relevant") for each item. The purpose of the ranking model is to rank, i.e., to produce a permutation of the items in new, unseen lists in a way that is "similar" in some sense to the rankings in the training data.
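
To make the definition concrete, the following is a minimal pairwise learning-to-rank sketch: a linear scoring function is nudged whenever a more relevant training item fails to outscore a less relevant one by a margin, and the learned weights are then used to order an unseen list. The two features, the labels, the margin of 1.0, and the learning rate are illustrative assumptions and are not taken from the disclosure.

```python
def score(weights, features):
    """Linear scoring function: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(items, epochs=50, lr=0.1, margin=1.0):
    """items: list of (feature_vector, relevance_label) for one training list.

    For every pair where item a is labelled more relevant than item b,
    push the weights toward (a - b) until a outscores b by `margin`.
    """
    weights = [0.0] * len(items[0][0])
    for _ in range(epochs):
        for fa, ra in items:
            for fb, rb in items:
                if ra > rb and score(weights, fa) - score(weights, fb) < margin:
                    weights = [w + lr * (a - b)
                               for w, a, b in zip(weights, fa, fb)]
    return weights


def rank(weights, candidates):
    """Order unseen feature vectors by descending learned score."""
    return sorted(candidates, key=lambda f: score(weights, f), reverse=True)


# Hypothetical features per item: [actionability, cohort frequency].
training = [([1.0, 0.2], 2), ([0.5, 0.9], 1), ([0.1, 0.3], 0)]
learned = train_pairwise(training)
ordered = rank(learned, [[0.2, 0.1], [0.9, 0.1]])
```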

As used herein, the term "latent space" or "hidden space" refers to the space in which a feature is located.

As used herein, the term "embedding" refers to mapping a document (e.g., text, image, structured data) to a lower-dimensional latent space while preserving the main properties of the object.

As used herein, the term "wide and deep model" refers to a deep learning model that jointly trains a wide linear model (e.g., for memorization) with a deep neural network (e.g., for generalization).

As used herein, the term "language model" refers to a probability distribution over a sequence of words.

As used herein, the term "transformer model" refers to a deep learning model whose core idea is self-attention, i.e., the ability to attend to different positions of an input sequence in order to compute a representation of that sequence.

As used herein, the term "BM25" refers to a broad family of statistical functions in information retrieval that take into account the number of occurrences of each query term in a document or set of documents, i.e., the term frequency (TF), and the corresponding inverse document frequency (IDF), and that rank a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
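
One widely used member of this family is the Okapi BM25 formula, sketched below in plain Python. The parameter values `k1 = 1.2` and `b = 0.75`, the whitespace tokenizer, the "+1" smoothed IDF variant, and the three example documents are common defaults or illustrative assumptions, not values given in the disclosure.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    docs = [doc.lower().split() for doc in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term frequency saturates via k1; b normalizes for length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores


corpus = [
    "BRAF V600E melanoma",
    "EGFR L858R lung adenocarcinoma",
    "BRAF fusion glioma BRAF",
]
scores = bm25_scores("braf", corpus)
```

Term positions play no role in the score, which matches the statement that proximity within the document is ignored.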

As used herein, the term "RM3" refers to an information retrieval model that is effective for both relevance feedback and pseudo-relevance feedback.

As used herein, the term "DSSM" is an acronym that stands for a deep semantic similarity model.

As used herein, the term "Siamese network" refers to an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.
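
The weight sharing in this definition can be shown with a toy sketch: one encoder with a single set of weights is applied to both inputs, and the two outputs are compared with cosine similarity, as similarity models of the DSSM kind commonly do. The single tanh layer, the weight values, and the input vectors are arbitrary illustrations; a real model would learn its weights from data.

```python
import math


def encode(weights, x):
    """One tanh layer; the same `weights` are used for every input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def siamese_similarity(weights, x1, x2):
    """Encode both inputs with the shared weights, then compare outputs."""
    return cosine(encode(weights, x1), encode(weights, x2))


# Arbitrary shared weights mapping 3 input features to 2 outputs.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
query_vec = [1.0, 0.0, 2.0]
doc_vec = [0.0, 1.0, 0.0]
similarity = siamese_similarity(SHARED_WEIGHTS, query_vec, doc_vec)
```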

As used herein, the term "FDA" is an acronym for the U.S. Food and Drug Administration.

As used herein, the term "NCCN" is an acronym for the National Comprehensive Cancer Network.

As used herein, the term "COSMIC" is an acronym for the Catalogue of Somatic Mutations in Cancer.

As used herein, the term "TCGA" is an acronym for The Cancer Genome Atlas.

As used herein, the term "CPRA" is an acronym for chromosome, position, reference, and alternate.

As used herein, the term "SNV" is an acronym for a single nucleotide variant.

As used herein, the term "CNV" is an acronym for copy number variants.

As used herein, the term "BCL" is an acronym for binary base call.

As used herein, the term "FASTQ" refers to a text-based format for storing both biological sequences (typically nucleotide sequences) and their corresponding quality scores. For simplicity, the sequence letters and the quality scores are each encoded in a single ASCII character.
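
A minimal sketch of reading this format follows. It assumes the common four-line record layout without line wrapping and the Phred+33 quality encoding (each quality character's ASCII code minus 33); other encodings exist, and the function name is illustrative.

```python
def parse_fastq(text):
    """Parse FASTQ text into (read_id, sequence, quality_scores) tuples.

    Assumes four lines per record and Phred+33 encoded qualities.
    """
    lines = [ln for ln in text.splitlines() if ln]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # One ASCII character per base: score = ord(char) - 33.
        records.append((header[1:], seq, [ord(c) - 33 for c in qual]))
    return records


example = "@read1\nGATTACA\n+\nIIIIII#\n"
reads = parse_fastq(example)
```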

As used herein, the term "BAM" refers to a binary format for storing sequence data.

As used herein, the term "VCF" is an acronym for variant call format, and refers to a text file format used in bioinformatics to store gene sequence variations.
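
The sketch below shows how the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) could be turned into chromosome, position, reference, and alternate (CPRA) keys. The hyphen-joined key layout and the function name are assumptions made for illustration; the example line is made up.

```python
def vcf_line_to_cpra(line):
    """Return one CPRA key per alternate allele on a VCF data line.

    Header lines (starting with '#') yield an empty list.
    """
    if line.startswith("#"):
        return []
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # ALT may list several comma-separated alleles; "." means none.
    return [f"{chrom}-{int(pos)}-{ref}-{a}" for a in alt.split(",") if a != "."]


data_line = "7\t140453136\t.\tA\tT,G\t50\tPASS\tDP=100"
keys = vcf_line_to_cpra(data_line)
```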

As used herein, the term "EHR" is an acronym that stands for electronic health record.

As used herein, the term "ASCO" is an acronym for the American Society of Clinical Oncology.

The present disclosure describes various embodiments of a multiomic search engine for integrated analysis of cancer genomic and clinical data, referred to herein simply as "Cancer Search." Cancer Search is an extension of the work presented in U.S. patent application No. 15/465,454, entitled "Genomic, Metabolomic, and Microbiomic Search Engine," filed on March 21, 2017, the entire contents of which are incorporated herein by reference.

According to various embodiments, a general search engine architecture is provided that may be configured to adapt to the specific needs of cancer multiomic data. The overall architecture, discussed in more detail below with reference to Fig. 1, may include various components. For example, the general architecture may include a web-based user interface, a query engine, an indexing pipeline that can index cancer multiomic data with all annotations, a cancer analytics software module, and a ranking engine. The query engine may be configured to search, in response to a request, any combination of the multiomic data streams available for individual samples or cohorts. The cancer analytics (e.g., in a software module or engine) may be configured to derive important tumor characteristics by pre-computing some characteristics and dynamically computing others at query time. The ranking engine can be configured such that, at index time, it preloads a default ranking related to clinical actionability or pathogenicity and, at query serving time, it further refines the ranking based on the detected query intent. More detailed information about the various data types, pipelines, engines, modules, and analyses is provided below.
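
The two-stage ranking described above, a default score attached at index time and a refinement applied at query serving time according to the detected intent, can be sketched as follows. Everything specific in this sketch is hypothetical: the evidence tiers and their scores, the keyword-based intent detection, the record fields, and the frequency boost factor.

```python
# Hypothetical evidence tiers and the default score each one earns.
TIER_SCORES = {"approved_therapy": 3.0, "guideline": 2.0, "pathogenic": 1.0}


def index_record(record):
    """Index time: attach the default score once, before any query."""
    default = max((TIER_SCORES.get(t, 0.0) for t in record["evidence"]),
                  default=0.0)
    return {**record, "default_score": default}


def detect_intent(query):
    """Query time: crude keyword-based intent detection (an assumption)."""
    q = query.lower()
    return "frequency" if ("frequen" in q or "common" in q) else "actionability"


def rank_at_query_time(records, query):
    """Refine the preloaded default ranking according to query intent."""
    intent = detect_intent(query)

    def key(r):
        boost = 10.0 * r["cohort_frequency"] if intent == "frequency" else 0.0
        return r["default_score"] + boost

    return sorted(records, key=key, reverse=True)


INDEXED = [index_record(r) for r in (
    {"variant": "A", "evidence": ["approved_therapy"], "cohort_frequency": 0.01},
    {"variant": "B", "evidence": ["pathogenic"], "cohort_frequency": 0.40},
)]
by_default = rank_at_query_time(INDEXED, "BRAF variants")
by_frequency = rank_at_query_time(INDEXED, "most frequent variants")
```

With no frequency intent, the record with the stronger evidence tier leads; when the query signals an interest in frequency, the more common variant moves to the top.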

The overall functionality of the user interface (UI) can be configured to present a uniform and highly responsive way of querying and navigating multiomic cancer search results. The UI may actively maintain the state of the user's search session. The UI may be configured to accept a user query, relay the user query to the query engine, present the integrated, ranked multiomic results together with a summary visualization thereof if available, and allow the user to interact with the search results. Users may interact with search results via the UI in various ways, including, for example, by providing relevance feedback (e.g., a promote/demote/fix/delete type assessment of how well a result answers the user's information need), by commenting on the accuracy of the information presented in the search results (e.g., that a particular annotation source or publication is outdated or inconsistent), and by marking particular results for inclusion in a dynamic individual patient or cohort report. More detailed information about the UI is provided below.

Fig. 1 represents a non-limiting example of the general architecture of a multi-omic cancer search system 100. Multi-omic data 110 (e.g., genomic, transcriptomic, etc.) for one or more samples (e.g., tumor and/or normal samples) may be added to the index pipeline or indexer 115 from the somatic workflow 120 or uploaded via the user interface 125. Non-limiting examples of upload formats may include FASTQ, BAM, tumor VCF, normal VCF, somatic VCF, RNA-Seq variant confirmation VCF, RNA-Seq differential gene expression in tabular form, CNV VCF, structural variant VCF, fusion call VCF, or any combination thereof. The multi-omic data 110 can be cancer multi-omic data including BCL, FASTQ, BAM, VCF, tabular cancer data, textual cancer data, and image cancer data. A set of annotations, literature, and phenotype data 130 can be added to the indexer 115 via an annotations pipeline 135. The data may either reside on the storage unit 170 (e.g., cloud storage, internal computer storage) or be uploaded by the user via a dedicated search upload interface. Data added by index pipeline 115 may be stored in one or more indexes 140. The system architecture may also include a cancer analysis engine or module 145, which may be configured to derive important characteristics of the tumor at index time and at serving time. The cancer analysis engine 145 may derive the important characteristics whether the analysis is for a single sample or a cohort. The user interface 125 may allow a user to enter a query and receive results provided by the query engine 150. The query engine 150 may be configured to accept user queries; to select, pre-link, aggregate, and summarize the associated multi-omic indices; and to return the ranked multi-omic data or features.
According to various embodiments, the system architecture may also include a load balancer 155 to provide bi-directional data transfer between the UI 125 and the query engine 150 for a large number of users. According to various embodiments, the system architecture may also include an authentication agent 160 and an identity provider 175 (e.g., a third party provider). The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., a learning-to-rank engine), and the ranking engine 165 can be configured to derive ranking models for variants, genes, pathways, phenotypes, textual data, and images, for example. The ranked results may then be presented to the user in ranked order. As will be discussed in detail herein, a vast range of data types can be queried, analyzed, and ranked, including genomic, transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic, medical literature, phenotypic, textual, and imaging data, as well as annotation sources, cancer analyses, predictive models, features that contribute to model accuracy, and the like. More details regarding various method and system embodiments related to this example of a general architecture will be presented below.

Referring now to Fig. 14, and in accordance with various embodiments, a method 1400 is provided for tumor profiling using multi-omic data indices. The method can include, at step 1410, storing a plurality of multi-omic data indices, wherein each of the plurality of multi-omic data indices includes cancer-specific tokenized data. Further discussion is provided throughout this disclosure regarding, for example, storage characteristics, multi-omic data indexing, and cancer-specific data, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method can further include, at step 1420, ingesting additional multi-omic data and annotations associated with the additional multi-omic data, the additional multi-omic data relating to one or more of the indices. Further discussion is provided throughout the present disclosure regarding, for example, annotation and ingestion features, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method can further include, at step 1430, indexing the ingested additional multi-omic data and annotations while preserving, in a particular index, a multi-omic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multi-omic data. Further discussion is provided throughout the present disclosure regarding, for example, indices, gene names, gene variant names, and multi-omic mapping, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method can also include, at step 1440, receiving a user query. Further discussion is provided throughout this disclosure regarding, for example, receiving features and user queries, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method can also include, at step 1450, selecting one or more relevant multi-omic data indices based on the user query. Further discussion is provided throughout this disclosure regarding, for example, selection features, pre-linking of multi-omic indices, and relevance determination, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method may further include, at step 1460, ranking the selected one or more multi-omic data indices based on at least one of clinical actionability, pathogenicity, feature weight, and frequency. Other ranking factors, such as factors related to query intent, may also be included. Further discussion regarding ranking is provided throughout this disclosure and will apply to this and all embodiments discussed or contemplated herein.

The method may also include, at step 1470, returning the ranked one or more multi-omic data indices to the user. Further discussion is provided throughout this disclosure regarding, for example, return features, displays, and reports, and such discussion will apply to this and all embodiments discussed or contemplated herein.
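The flow of steps 1410 through 1470 can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the class `MultiOmicIndex`, the function `rank`, the `ACTIONABILITY` table, and all sample values are hypothetical names introduced for illustration and do not come from the disclosure.

```python
from collections import defaultdict

# Hypothetical actionability scores (assumption); in practice these would be
# derived from annotation sources at index time.
ACTIONABILITY = {"BRAF V600E": 3, "TP53 R175H": 2}


class MultiOmicIndex:
    """Toy inverted index standing in for the stored indices (steps 1410-1430)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def ingest(self, patient, stream, gene, change, annotation=""):
        record = {"patient": patient, "stream": stream, "gene": gene,
                  "variant": f"{gene} {change}", "annotation": annotation}
        # Index the same record under several tokens so that the gene name,
        # the variant name, and the patient's data streams stay linked.
        for token in (patient, stream, gene, record["variant"]):
            self.postings[token.lower()].append(record)

    def select(self, query):
        """Step 1450: look up the records whose token matches the query."""
        return list(self.postings.get(query.strip().lower(), []))


def rank(records):
    """Step 1460: order by the (assumed) default actionability score."""
    return sorted(records, key=lambda r: ACTIONABILITY.get(r["variant"], 0),
                  reverse=True)


index = MultiOmicIndex()
index.ingest("P1", "dna", "TP53", "R175H")
index.ingest("P1", "dna", "BRAF", "V600E", "drug label match")
index.ingest("P1", "rna", "BRAF", "V600E", "RNA-confirmed")
ranked = rank(index.select("P1"))  # steps 1440-1470: query, select, rank, return
print([(r["variant"], r["stream"]) for r in ranked])
```

Querying by gene name (`index.select("braf")`) returns both the DNA and the RNA record for the same patient, which is the kind of cross-stream mapping the indexing step is described as preserving.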

According to various embodiments, a non-transitory computer readable medium stores a program for causing a computer to execute a method for tumor profiling using multi-omic data indices. The steps within the method may be similar to those provided above, or may be varied as desired.

The method can include storing a plurality of multi-omic data indices, wherein each of the plurality of multi-omic data indices comprises cancer-specific tokenized data. Further discussion is provided throughout this disclosure regarding, for example, storage characteristics, multi-omic data indexing, and cancer-specific data, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method can further include ingesting additional multi-omic data and annotations associated with the additional multi-omic data, the additional multi-omic data being related to one or more of the indices. Further discussion is provided throughout the present disclosure regarding, for example, annotation and ingestion features, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method may further include indexing the ingested additional multi-omic data and annotations while preserving, in a particular index, a multi-omic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multi-omic data. Further discussion is provided throughout the present disclosure regarding, for example, indices, gene names, gene variant names, and multi-omic mapping, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method may also include receiving a user query. Further discussion is provided throughout this disclosure regarding, for example, receiving features and user queries, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method may further include selecting one or more relevant multi-omic data indices based on the user query. Further discussion is provided throughout the present disclosure regarding, for example, selection features and relevance determinations, and such discussion will apply to this and all embodiments discussed or contemplated herein.

The method may further include ranking the selected one or more multi-omic data indices based on at least one of clinical actionability, pathogenicity, feature weight, and frequency. It should be noted that the ranking can be further altered by the intent of the query (e.g., ranking in order of reverse frequency, ranking in order of feature contribution to a particular prediction of the model, ranking mutational signatures in reverse order of their weights, etc.). As such, clinical actionability may be used as the default ranking if no other ranking is requested and no other intent can be readily inferred. Further discussion related to ranking and determining features is provided throughout this disclosure and will apply to this and all embodiments discussed or contemplated herein.
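The fallback behavior described above, where an inferred intent overrides the default ranking, might be sketched as follows. The function name `rank_results`, the intent labels, and the record fields are assumptions made for illustration only.

```python
def rank_results(results, intent=None):
    """Sort results by the criterion matching the query intent.

    Falls back to clinical actionability when no intent is given or the
    intent is not recognized. All intent labels are hypothetical.
    """
    sort_keys = {
        "actionability": lambda r: -r["actionability"],
        "pathogenicity": lambda r: -r["pathogenicity"],
        "frequency": lambda r: -r["frequency"],
        "reverse_frequency": lambda r: r["frequency"],
        "feature_weight": lambda r: -r["feature_weight"],
    }
    key = sort_keys.get(intent, sort_keys["actionability"])
    return sorted(results, key=key)


# Illustrative records with made-up scores.
results = [
    {"id": "v1", "actionability": 1, "pathogenicity": 0.9,
     "frequency": 0.30, "feature_weight": 0.2},
    {"id": "v2", "actionability": 3, "pathogenicity": 0.4,
     "frequency": 0.05, "feature_weight": 0.7},
]
print([r["id"] for r in rank_results(results)])
print([r["id"] for r in rank_results(results, "reverse_frequency")])
```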

The method may also include returning the ranked one or more multi-omic data indices to the user. Further discussion is provided throughout this disclosure regarding, for example, the return feature, and such discussion will apply to this and all embodiments discussed or contemplated herein.

According to various embodiments, the multi-omic data may be selected from the group consisting of: genomic, transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic, phenotypic, image, related literature, integrated multi-omic data, and combinations of the foregoing. According to various embodiments, the plurality of multi-omic data indices may further comprise tumor (somatic) genomic alterations, normal (germline) genomic alterations, and cancer annotation sources.

According to various embodiments, the methods discussed or contemplated herein may further comprise deriving a cancer analysis for the selected one or more multi-omic data indices. The cancer analysis may comprise a tumor characteristic selected from the group comprising: quality control, tumor mutation burden, genomic mutation signature, microsatellite instability status, neoantigens and their binding affinity, HLA allele typing, RNA-confirmed variants, copy number variants, structural variants, non-coding regulatory variants, gene fusion, pathway enrichment, cancer driver identification, mutation profiling, differential gene expression, immune signature, and combinations of the foregoing. According to various embodiments, cancer analysis may be derived for a single sample or a cohort of samples. In addition, the cancer analysis may include matching information regarding treatment outcomes of similar patients. According to various embodiments, the cancer analysis may include machine learning predictions and ranked features. According to various embodiments, cancer analysis may include machine learning predictions, and machine learning model features ranked in order of their relevance to a particular prediction. The machine learning prediction may be selected from the group consisting of: primary site classification, future metastatic site prediction, microsatellite instability status prediction, neoantigen binding affinity prediction, disease state stratification, cancer lineage determination, and combinations of the foregoing. The cancer analysis may be dynamically computed after receiving a user query. Derivation of cancer analysis may include the use of deep neural networks and other machine learning methods (e.g., support vector classifiers, tree methods, ensemble methods). The derivation of the importance of the features of the model may include a gradient attribution method or other feature importance methods.
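As one concrete instance of a gradient attribution method, the sketch below applies gradient-times-input to a small logistic model and ranks the features by the magnitude of their attribution to the prediction. The model, its weights, and the feature names are hypothetical; the disclosure does not specify which attribution method or model is used.

```python
import math


def gradient_times_input(weights, bias, features):
    """Return (probability, features ranked by |gradient x input|).

    For a logistic model p = sigmoid(w . x + b), the gradient of p with
    respect to feature x_i is p * (1 - p) * w_i, so the attribution of
    feature i is x_i * p * (1 - p) * w_i.
    """
    z = bias + sum(weights[name] * value for name, value in features.items())
    p = 1.0 / (1.0 + math.exp(-z))
    grad_scale = p * (1.0 - p)
    attributions = {name: value * grad_scale * weights[name]
                    for name, value in features.items()}
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return p, ranked


# Made-up weights and inputs for illustration only.
weights = {"tmb": 0.8, "msi_score": 1.5, "age": 0.01}
sample = {"tmb": 2.0, "msi_score": 1.0, "age": 0.5}
probability, ranked_features = gradient_times_input(weights, -2.0, sample)
print(round(probability, 3), [name for name, _ in ranked_features])
```

Returning the ranked feature list alongside the prediction is what lets a user see which inputs drove a particular classification.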

According to various embodiments, the methods discussed or contemplated herein may further comprise propagating annotations from higher level genomic hierarchies to lower level genomic hierarchies.
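Propagation from higher to lower levels (for example pathway to gene to variant) can be pictured as each node inheriting the annotations of its ancestors. The following is a minimal sketch assuming an acyclic child-to-parent mapping; the function name and the example hierarchy and annotation strings are hypothetical.

```python
def propagate_annotations(parent_of, annotations):
    """Give every node its own annotations plus those of all its ancestors.

    parent_of: dict mapping a lower-level node to its higher-level parent
               (assumed acyclic).
    annotations: dict mapping a node to a set of annotation strings.
    """
    resolved = {}

    def collect(node):
        if node not in resolved:
            inherited = collect(parent_of[node]) if node in parent_of else set()
            resolved[node] = set(annotations.get(node, ())) | inherited
        return resolved[node]

    for node in set(parent_of) | set(parent_of.values()) | set(annotations):
        collect(node)
    return resolved


# Illustrative hierarchy: variant -> gene -> pathway.
parent_of = {"BRAF V600E": "BRAF", "BRAF": "MAPK signaling"}
annotations = {
    "MAPK signaling": {"pathway-level note"},
    "BRAF": {"gene-level note"},
    "BRAF V600E": {"variant-level note"},
}
resolved = propagate_annotations(parent_of, annotations)
print(sorted(resolved["BRAF V600E"]))
```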

According to various embodiments, the methods discussed or contemplated herein may further include propagation of the ranking of the selected one or more multi-omic data indices from a higher genomic level to a lower genomic level. The ranking may include a clinical ranking for cancer variants and genes. The ranking may include the probability that a gene is enriched in a particular pathway. The ranking may include importance weights determined for features of the machine learning model. Ranking may include stratifying cohorts by incorporating latent space representations and sub-selected representations of cancer data that lead to maximal disentanglement between responders and non-responders, short-term and long-term progression-free survival, one cancer subtype and another, and so on. The cohort may be stratified into responders and non-responders. The cohort may be stratified into long-term and short-term progression-free survival. The cohort may be stratified into different cancer subtypes. The latent space representation may be produced by a neural network or any other dimension reduction method (e.g., principal component analysis, independent component analysis, manifold learning). The neural network may be selected from the group consisting of: an autoencoder, a variational autoencoder, a deep belief network, a restricted Boltzmann machine, a feed-forward network, a convolutional network, a recurrent network, a gated recurrent network, a long short-term memory network, a residual network, and a generative adversarial network.
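To make the idea of stratifying a cohort in a reduced representation concrete, the sketch below uses principal component analysis (one of the dimension reduction methods named above) computed by power iteration, and splits samples by the sign of their score on the first component. It stands in for a learned latent space and is not the disclosed method; the data values and function names are invented for illustration.

```python
def first_component_scores(rows, iterations=100):
    """Project samples onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Asymmetric start vector; a sketch-level choice, not a robust initializer.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def stratify(rows):
    """Assign each sample to group 0 or 1 by the sign of its 1-D score."""
    return [0 if score >= 0 else 1 for score in first_component_scores(rows)]


# Made-up feature vectors for four samples (e.g., expression of three genes).
cohort = [
    [5.0, 1.0, 0.2],
    [4.8, 1.2, 0.1],
    [1.0, 4.9, 0.3],
    [1.2, 5.1, 0.2],
]
groups = stratify(cohort)
print(groups)
```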

According to various embodiments, including the methods discussed or contemplated herein, ranking may further include a learning-to-rank model selected from the group consisting of: support vector machines, boosted decision trees, regression methods, neural networks, and combinations of the foregoing. The learning-to-rank model may also include other machine learning models or deep neural networks. The ranking may also include a deep learning ranking. The ranking may also include similarity between the embeddings of the query and of the indexed documents in a joint embedding space learned via deep learning methods. The deep learning ranking may be derived from a deep learning model selected from the group consisting of: a deep semantic similarity model, a wide and deep model, a deep language model, a learned deep learning text embedding, a learned named entity recognition model, a Siamese neural network, and combinations of the foregoing.
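Ranking by similarity in a joint embedding space reduces, in its simplest form, to sorting documents by cosine similarity to the query vector. In the sketch below the embeddings are hand-written placeholder vectors; in the described system they would be produced by a learned model, which is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_embedding(query_vector, document_vectors):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(document_vectors,
                  key=lambda doc_id: cosine(query_vector, document_vectors[doc_id]),
                  reverse=True)


# Placeholder 2-D embeddings (assumption); real embeddings would be learned.
documents = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0], "doc_c": [0.5, 0.5]}
order = rank_by_embedding([1.0, 0.0], documents)
print(order)
```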

According to various embodiments, including the methods discussed or contemplated herein, the multi-omic data may be selected from the group consisting of: somatic (and germline) calls from whole genome sequence data, somatic (and germline) calls from whole exome sequence data, somatic (and germline) sequencing from fresh frozen tissue, somatic (and germline) sequencing from formalin-fixed paraffin-embedded tissue, somatic (and germline) sequencing from liquid biopsy, tumor and normal variant calls, tumor/normal transcriptome data indexed to confirm variants at the RNA level or at the level of gene expression, epigenetic data, chromatin accessibility data, microbiome data, proteomics data, single cell sequencing data, and combinations of the foregoing. In various embodiments, the indexed multi-omic data may either come from internal somatic calling and immune pipelines, or be provided or uploaded in real time by any external partner in the form of FASTQ, BAM, VCF, and other tabular formats.

According to various embodiments, including the methods discussed or contemplated herein, the multi-omic data indices may further include extracted phenotypic data. The phenotypic data may be selected from the group comprising: electronic health records, clinical data, functional data, and combinations of the foregoing.

According to various embodiments, including the methods discussed or contemplated herein, the multi-omic data indices may also include featurized/embedded imaging data. The featurized imaging data may be selected from the group consisting of: histological slides, MRI images, X-rays, mammograms, ultrasound, PET images, CT scans, and combinations of the foregoing.

According to various embodiments, including the methods discussed or contemplated herein, indexing of the ingested additional multi-omic data and annotations may further comprise indexing derived data selected from the group consisting of: cancer analysis, annotation, features extracted from imaging data, phenotype, medical literature data, data embedding, and combinations of the foregoing.

According to various embodiments, including the methods discussed or contemplated herein, ranking may further include matching sample alterations to established drug target labels and available clinical trials. Ranking may further comprise identifying cancer drug targets in the cohort by detecting potential biomarkers that stratify the cohort based on clinical variables of interest and/or statistical significance, wherein returning the ranked one or more multi-omic data indices to the user comprises a stratified visualization.
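Matching sample alterations to drug labels and open trials can be sketched as two lookups followed by a sort that puts label matches first. The lookup tables below contain placeholder entries only (`drug_A`, `TRIAL-001`, `TRIAL-002` are not real identifiers), and the function name is an assumption.

```python
# Placeholder lookup tables (assumption); real entries would come from curated
# annotation sources such as drug labels and trial registries.
DRUG_LABELS = {"BRAF V600E": ["drug_A"]}
OPEN_TRIALS = {"BRAF": ["TRIAL-001"], "KRAS": ["TRIAL-002"]}


def match_alterations(alterations):
    """Attach drug and trial matches to each (gene, change) alteration.

    Results are ordered so that alterations with a drug label match come
    first, then those with only a trial match, then unmatched alterations.
    """
    matches = []
    for gene, change in alterations:
        variant = f"{gene} {change}"
        matches.append({
            "variant": variant,
            "drugs": DRUG_LABELS.get(variant, []),   # matched at variant level
            "trials": OPEN_TRIALS.get(gene, []),     # matched at gene level
        })
    return sorted(matches, key=lambda m: (not m["drugs"], not m["trials"]))


matched = match_alterations([("TP53", "R175H"), ("KRAS", "G12D"), ("BRAF", "V600E")])
print([m["variant"] for m in matched])
```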

According to various embodiments, including the methods discussed or contemplated herein, returning the ranked one or more multi-omic data indices to the user may further include dynamically creating a hyperlinked report (e.g., containing the ranked alterations, where each entry is hyperlinked to a search query) for individual patients and/or cohorts, the hyperlinked report providing a comprehensive profile of the tumor or cancer. Returning the ranked one or more multi-omic data indices to the user may further include returning a summary visualization of the returned results and a list of the ranked results.

According to various embodiments, including the methods discussed or contemplated herein, the user query may include user-uploaded data selected from the group consisting of: a set of variants, genes, pathways, disease states, and phenotypes of interest, wherein the selecting comprises querying individual samples or cohort data subset by the uploaded data. The user query may be provided via a user interface and may include uploading data for indexing selected from the group consisting of: genomic data, transcriptomic data, epigenetic data, chromatin accessibility data, microbiome data, proteomic data, phenotypic data, annotation data, and combinations of the foregoing.

According to various embodiments, the methods discussed or contemplated herein may further include normalizing and/or augmenting the user query, classifying the query intent, aggregating the retrieved documents, and performing document retrieval based on similarities between the query and documents in a latent space using deep learning methods.
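Query normalization and intent classification can be illustrated with a simple rule-based stand-in. The text describes deep learning methods for these steps; the dictionary lookup and keyword rules below are a deliberately simpler substitute, and the synonym table, intent labels, and function names are all assumptions.

```python
# Small illustrative synonym table mapping common aliases to gene symbols.
SYNONYMS = {"her2": "ERBB2", "p53": "TP53"}

# Keyword-to-intent rules (hypothetical labels).
INTENT_KEYWORDS = {
    "frequent": "frequency",
    "frequency": "frequency",
    "pathogenic": "pathogenicity",
    "pathogenicity": "pathogenicity",
}


def normalize_query(query):
    """Lowercase and split the query, replacing known aliases with symbols."""
    return [SYNONYMS.get(token, token) for token in query.lower().split()]


def classify_intent(tokens, default="actionability"):
    """Return the first intent whose keyword appears, else the default."""
    for token in tokens:
        if token in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[token]
    return default


tokens = normalize_query("Most frequent p53 variants")
print(tokens, classify_intent(tokens))
```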

According to various embodiments, including the methods discussed or contemplated herein, at least one of indexing, selecting, and ranking comprises utilizing a deep neural network.

According to various embodiments, the methods (and systems) discussed or contemplated herein may operate to centralize a large amount of cancer multi-omic data to provide oncologists, medical practitioners, research scientists, and other non-programmers with a platform to interrogate cancer bioinformatics pipelines at any level of detail, as well as to obtain clinical and biological insight into cancer biology and potential cancer clinical treatment approaches. Types of data can include, for example, genomic (single nucleotide variations, insertions and deletions of tumor and normal tissues, structural rearrangements, copy number variations, gene fusions, and expressed variants of tumor genomes), transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic abundance and localization, medical literature data (publications, therapeutic guidelines, clinical trial inclusion/exclusion criteria), phenotypic data (functional, clinical, electronic medical records, histopathology, and radiology reports), imaging data (histopathology slides, MRI scans, X-rays, mammograms, ultrasound, PET images, CT scans), cancer annotation sources (variants, genes, pathways, drugs), and resulting cancer analyses (tumor mutation burden, mutation signatures, microsatellite instability status, RNA sequence-confirmed variants, differentially expressed genes, latent space multi-omic cohort representation, neoantigen binding affinity to MHC class I and class II molecules).

As described above, and as will be discussed in further detail below, according to various embodiments, various methods (and systems) described and contemplated herein include cancer analysis (e.g., as steps, features, engines, modules, or software modules). Cancer analysis gives the user access to important characteristics of the tumor including, for example, tumor mutational burden, mutation signatures, latent space multi-omic cohort representation, neoantigen binding affinity to MHC class I and class II molecules, RNA sequence-confirmed variants, differentially expressed genes, pathway enrichment, microsatellite instability status and microsatellite repeat sites, and features extracted from imaging and clinical data. According to various embodiments, this data may be pre-computed for individual samples or dynamically computed for cohorts of samples. According to various embodiments, cancer analysis may provide an integration of predictions from machine learning models and features thereof ranked by their contribution to a particular classification. Specific classifications may include, for example, prediction of primary sites, future metastatic sites, classification of variants as true or false positives, information about treatment outcomes of similar patients, outlier detection of sequencing quality, and prediction of disease state for a cohort using latent and actual representations. An advantage of returning features ranked by their contribution to a particular classification is that model predictions are more interpretable to the user.
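The split between values pre-computed per sample and values computed on demand for a cohort can be sketched with tumor mutational burden. The sketch assumes the commonly used definition of TMB as mutations per callable megabase, which the disclosure does not spell out; the class and method names and all counts are hypothetical.

```python
from statistics import median


def tumor_mutation_burden(mutation_count, callable_megabases):
    """Mutations per megabase (assumed definition; not given in the source)."""
    if callable_megabases <= 0:
        raise ValueError("callable_megabases must be positive")
    return mutation_count / callable_megabases


class CancerAnalytics:
    """Stores per-sample TMB at index time; summarizes cohorts at query time."""

    def __init__(self):
        self._tmb = {}

    def index_sample(self, sample_id, mutation_count, callable_megabases):
        # Pre-computed once when the sample is indexed.
        self._tmb[sample_id] = tumor_mutation_burden(mutation_count,
                                                     callable_megabases)

    def sample_tmb(self, sample_id):
        return self._tmb[sample_id]

    def cohort_median_tmb(self, sample_ids):
        # Computed dynamically for whichever cohort the query selects.
        return median(self._tmb[s] for s in sample_ids)


analytics = CancerAnalytics()
analytics.index_sample("S1", 300, 30.0)
analytics.index_sample("S2", 60, 30.0)
analytics.index_sample("S3", 150, 30.0)
print(analytics.sample_tmb("S1"), analytics.cohort_median_tmb(["S1", "S2", "S3"]))
```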

As described above, and as will be discussed in further detail below, in accordance with various embodiments, various methods (and systems) described and contemplated herein include multi-modal ranking (e.g., as steps, features, engines, modules, or software modules). Multi-modal ranking can provide a learning-to-rank relevance engine that integrates multi-omic data, annotation sources, literature data, clinical trial results, and well-characterized significantly mutated genes in a cohort to learn a clinical actionability ranking of cancer data. In various embodiments, machine learning models may be used to weigh contributions from annotations of multi-omic data. In various embodiments, deep learning and machine learning dimension reduction techniques may be used to derive latent space representations of sample cohorts. In various embodiments, the learned embeddings may be used to rank genomic, textual, and imaging data.

As described above, and as will be discussed in further detail below, in accordance with various embodiments, the various methods (and systems) described and contemplated herein also include mechanisms (e.g., as steps, features, engines, modules, or software modules) for integrating and ranking multiple cancer annotation sources. These sources may include, for example, FDA labels, NCCN guidelines, clinical trials, CIViC, DoCM, OncoKB, MyCancerGenome, cancer drug genomic biomarker databases, TCGA, ICGC, COSMIC, NCI60, CCLE, DrugBank, ClinVar, HGMD, PGMD, PharmGKB, dbSNP, dbNSFP, 1000 Genomes, ExAC, CPDB, KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB, BRENDA, CTD, HPRD, GXD, and BIND. In various embodiments, annotations and rankings can propagate from higher level representations to lower levels (e.g., from pathway to gene to variant, or from gene to variant codon to the full canonical variant: chromosome, position, reference, alternate).

As described above, and as will be discussed in further detail below, according to various embodiments, the various methods (and systems) described and contemplated herein also include mechanisms (e.g., as steps, features, engines, modules, or software modules) for integrating several deep learning models. Integration may serve to provide neural indexing of data (e.g., embedding multi-omic data separately and jointly to regularize the respective latent spaces for DNA and RNA tumor alterations; embedding textual data from electronic health records, clinical notes, literature, and annotations; deep transformer models for named entity recognition and aggregation of textual and annotation data; embedding imaging data). Integration may also provide neural learning-to-rank models (e.g., a deep semantic similarity model, a convolutional deep semantic similarity model, a recurrent deep semantic similarity model, a deep relevance matching model, an interaction Siamese network, a lexical and semantic matching network, DeepRank) that may be used to address the feature engineering problem of learning to rank. Integration may provide neural query models (e.g., deep learning transformer models for query normalization, synonym expansion, abbreviation expansion, term disambiguation, alternative suggestions). Integration can serve to provide neural models for advanced cancer analysis (e.g., primary site classification, prediction of future metastatic sites, prediction of neoantigen binding affinity, classification of variants as true or false positives, drug and trial matching, treatment recommendation systems using information from similar indexed cases, models comparing the decrease, increase, or maintenance of allele fraction, copy number variation, and RNA expression across serial biopsies at each site, and deep learning autoencoder methods and other dimension reduction techniques for cohort analysis and stratification).

As described above, and as will be discussed in further detail below, according to various embodiments, the various methods (and systems) described and contemplated herein may also include statistical, machine learning, and deep learning methods (e.g., as steps, features, engines, modules, or software modules) for identifying diagnostic, prognostic, or predictive biomarker(s). In various embodiments, when a user (e.g., an academic or industrial researcher) enters a phenotypic query about a sample cohort, ranked biomarkers that can stratify the cohort are returned, together with their statistical significance and a summary visualization. In various embodiments, validation queries may be suggested by the search engine to support robust algorithmic and statistical validation. In various embodiments, the systems and methods may automatically suggest iterative hypothesis refinements via suggested query refinements. According to various embodiments, the statistical visualization and analysis derived for a cancer cohort query may include, for example, Kaplan-Meier survival analysis visualization, log-rank test result visualization, Cox proportional hazards regression analysis visualization, survival tree model visualization, heat maps, scatter plots, box plots, and bar plots that provide statistical significance.
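The Kaplan-Meier analysis mentioned above rests on the product-limit estimator, which a few lines of plain Python can illustrate. This sketch computes only the survival curve points that such a visualization would plot; the function name and the follow-up data are made up for the example.

```python
def kaplan_meier(durations, events):
    """Product-limit survival estimate.

    durations: follow-up time for each patient.
    events: 1 (or True) if the event was observed, 0 (or False) if censored.
    Returns a list of (time, survival_probability) at each observed event time.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        time = pairs[i][0]
        observed = censored = 0
        # Group all patients sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == time:
            if pairs[i][1]:
                observed += 1
            else:
                censored += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((time, survival))
        at_risk -= observed + censored
    return curve


# Four illustrative patients; the second is censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)
```

Computing one curve per stratum (for example, biomarker-positive versus biomarker-negative patients) gives the pair of curves that a log-rank test would then compare.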

As described above, and as will be discussed in further detail below, according to various embodiments, the various methods (and systems) described and contemplated herein may further include using and/or receiving interactive summary visualizations and/or ranked variants, genes, pathways, derived cancer analyses, outputs of integrated machine learning models (e.g., cancer type classification, most likely recurrence sites) (e.g., as steps, features, engines, modules, or software modules). This may be provided via a query engine (discussed in more detail below). In various embodiments, the summary visualization may be dynamic, and each data point may be linked to a particular result returned.

As described above, and as will be discussed in further detail below, according to various embodiments, the various methods (and systems) described and contemplated herein may also provide interactive and rapid access to multi-omic cancer data ranked by clinical actionability, pathogenicity, feature weight, or frequency, with access times of 10000, 5000, 4000, 3000, 2000, 1000, 900, 800, 700, 500, 400, 300, 200, 100 milliseconds or less, or within any range between the aforementioned values.

As discussed above, according to various embodiments, the systems and methods described herein may provide a universal search interface (as opposed to many different entry points). In various embodiments, all knowledge, e.g., multi-omic cancer data, samples, variants, genes, drugs, pathways, phenotypes, medical literature, image data, derived cancer analyses, machine learning models for predicting tumor characteristics and the features of those models, uploaded user data, etc., may be accessible through the same simple search interface.

As described above, and as will be discussed in further detail below, according to various embodiments, the various methods (and systems) described and contemplated herein may also provide the ability (e.g., as a step, feature, engine, module, or software module) to compare sequential biopsy samples and provide differences (increase, decrease, maintenance) between new and old cancer drivers, variant allele fraction changes, copy number changes, and RNA confirmation status changes of cancer alterations.
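A comparison of sequential biopsies by variant allele fraction might look like the sketch below. The tolerance threshold, the status labels `new` and `lost` for variants present in only one biopsy, and the allele fractions are assumptions for illustration; the text names only increase, decrease, and maintenance.

```python
def compare_biopsies(old_vaf, new_vaf, tolerance=0.05):
    """Label each variant by how its allele fraction changed between biopsies.

    old_vaf / new_vaf: dicts mapping variant name to variant allele fraction.
    tolerance: absolute change below which a variant counts as maintained
               (hypothetical threshold).
    """
    changes = {}
    for variant in sorted(set(old_vaf) | set(new_vaf)):
        before, after = old_vaf.get(variant), new_vaf.get(variant)
        if before is None:
            status = "new"
        elif after is None:
            status = "lost"
        elif after - before > tolerance:
            status = "increase"
        elif before - after > tolerance:
            status = "decrease"
        else:
            status = "maintained"
        changes[variant] = status
    return changes


# Made-up allele fractions for two time points.
first = {"BRAF V600E": 0.30, "TP53 R175H": 0.40, "KRAS G12D": 0.10}
second = {"BRAF V600E": 0.55, "TP53 R175H": 0.42, "EGFR T790M": 0.20}
changes = compare_biopsies(first, second)
print(changes)
```

The same pattern extends to copy number or RNA confirmation status by swapping the compared value.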

As described above, and as will be discussed in further detail below, various methods (and systems) described and contemplated herein may also provide various comparison schemes (e.g., as steps, features, engines, modules, or software modules) according to various embodiments. These schemes can include, for example, (1) sample-to-sample comparisons, i.e., comparisons of any combination of multi-omic data streams within the same patient, (2) sample-to-cohort comparisons (e.g., an individual sample is compared to the same cancer subtype in TCGA), and (3) pairwise cohort comparisons (e.g., a cohort is compared to a well-characterized TCGA cohort of the same cancer type).

According to various embodiments, various methods (and systems) described and contemplated herein may provide for dynamic upload (e.g., as a step, feature, engine, module, or software module) of a variant/gene drug target panel (or panels currently used in practice) from a user's institution. Subsequent queries may then return the intersection of the uploaded set with the stored multi-omic data for the sample(s).
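Intersecting an uploaded panel with stored sample data is essentially a set-membership filter. The sketch below assumes the panel is a list of gene symbols and that each stored variant record carries a `gene` field; both are illustrative choices, not details from the disclosure.

```python
def intersect_with_panel(sample_variants, panel_genes):
    """Keep only the stored variants whose gene is on the uploaded panel."""
    panel = {gene.strip().upper() for gene in panel_genes}
    return [v for v in sample_variants if v["gene"].upper() in panel]


# Illustrative stored variants and an uploaded gene panel.
stored = [
    {"gene": "BRAF", "change": "V600E"},
    {"gene": "TP53", "change": "R175H"},
    {"gene": "KRAS", "change": "G12D"},
]
on_panel = intersect_with_panel(stored, ["braf", "KRAS ", "EGFR"])
print([v["gene"] for v in on_panel])
```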

In the public domain, and as already discussed herein, universal genome search has been proposed to address the issue of immediate access to germline genomic data. That search addresses the substantially different problem of germline genome profiling, which focuses on Mendelian rare variants, GWAS hits, stress testing and polygenic risk of common diseases, and genetic risk. In order to effectively address all three major issues in comprehensive cancer characterization discussed above and herein, the systems and methods described herein may further include advanced cancer analysis for individual samples and cohorts, as well as a ranking engine (discussed above and in detail herein), according to various embodiments provided and contemplated. According to various embodiments provided herein, the systems and methods described herein can augment all portions of existing general germline search systems to integrate multi-omic data at index time and serving time, rank cancer alterations by their clinical relevance and pathogenicity, and make the search engine paradigm useful for comprehensive cancer profiling of individual samples and cohorts. Additionally, according to various embodiments provided herein, the systems and methods described herein may include stratification analysis of cancer cohorts built on top of a cancer search engine, which is entirely absent from previous work.

Fig. 15 illustrates a system 1500 provided for tumor profiling with multiomic data indexing, in accordance with various embodiments. The system 1500 may include an indexing unit 1510. The indexing unit 1510 may include a storage element 1520 configured to store a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices includes cancer-specific tokenized data. The indexing unit 1510 may also include an indexing engine 1530. The indexing unit 1510 may be configured to ingest, via the data source 1540, additional multiomic data and annotations associated with the additional multiomic data, the additional multiomic data relating to one or more indices. The indexing unit 1510 may also be configured to index the additional ingested multiomic data and annotations from the data source 1540, while preserving in a particular index the multiomic mappings between gene names, gene variant names, and different data streams for the same patient, to produce tokenized additional ingested multiomic data.

The system 1500 may also include a user interface 1550 configured to receive a user query 1560.

The system 1500 may also include a query engine 1570 configured to select one or more relevant multiomic data indices from the indexing unit 1510 based on the user query 1560.

The system 1500 may also include a ranking engine 1580 configured to receive the selected one or more relevant multiomic data indices (e.g., from the query engine 1570), to rank the selected one or more multiomic data indices, and to return the ranked one or more multiomic data indices to the user via the user interface 1550.
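By way of non-limiting illustration, the division of labor among the indexing unit 1510, the query engine 1570, and the ranking engine 1580 can be sketched as three small components. The class names, record fields, and the `actionability` score below are hypothetical simplifications; in particular, matching on a single gene name stands in for the far richer query selection described herein.

```python
from dataclasses import dataclass, field


@dataclass
class IndexingUnit:
    """Stores named indices of tokenized records (storage plus ingestion)."""
    indices: dict = field(default_factory=dict)

    def ingest(self, index_name, records):
        self.indices.setdefault(index_name, []).extend(records)


class QueryEngine:
    """Selects records from every index that match the user query."""

    def __init__(self, indexing_unit):
        self.unit = indexing_unit

    def select(self, gene):
        return [
            record
            for records in self.unit.indices.values()
            for record in records
            if record["gene"] == gene
        ]


class RankingEngine:
    """Orders selected records by a hypothetical actionability score."""

    def rank(self, records):
        return sorted(records, key=lambda r: r["actionability"], reverse=True)


unit = IndexingUnit()
unit.ingest("snv", [
    {"gene": "BRAF", "variant": "V600E", "actionability": 0.9},
    {"gene": "TP53", "variant": "R175H", "actionability": 0.3},
])
unit.ingest("cnv", [
    {"gene": "BRAF", "variant": "amplification", "actionability": 0.4},
])

ranked = RankingEngine().rank(QueryEngine(unit).select("BRAF"))
print([r["variant"] for r in ranked])
```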

Fig. 16 illustrates a system 1600 provided for tumor profiling with multiomic data indexing, in accordance with various embodiments. The system 1600 may include an indexing unit 1610. The indexing unit 1610 may include a storage element 1620 configured to store a plurality of multiomic data indices, wherein each of the plurality of multiomic data indices comprises cancer-specific tokenized data. The indexing unit 1610 may also include an indexing engine 1630. The indexing unit 1610 may be configured to ingest, via the data source 1640, additional multiomic data and annotations associated with the additional multiomic data, the additional multiomic data relating to one or more indices. The indexing unit 1610 may also be configured to index the additional ingested multiomic data and annotations from the data source 1640, while preserving in a particular index the multiomic mappings between gene names, gene variant names, and different data streams for the same patient, to produce tokenized additional ingested multiomic data.

The system 1600 may also include a user interface 1650 configured to receive a user query 1660.

The system 1600 may also include a query engine 1670 configured to select one or more relevant multiomic data indices from the indexing unit 1610 based on the user query 1660. The query engine 1670 may also be configured to rank the selected one or more multiomic data indices based on clinical actionability, pathogenicity, feature weight, or frequency. The query engine 1670 may also be configured to return the ranked one or more multiomic data indices to the user via the user interface 1650.

Note that all of the previous discussion of all of the additional features, particularly with respect to the previously described methods and non-transitory computer readable media, applies to the features of the various system embodiments described and contemplated herein, in accordance with the various embodiments.

According to various embodiments, a computer-implemented system for tumor profiling using multiomic data indexing is provided. The system may include computer storage, a digital processing device including at least one processor, an operating system configured to execute executable instructions, memory, and a computer program including instructions executable by the digital processing device to create a multiomic cancer search engine application. The multiomic cancer search engine application may include a plurality of integrated multiomic cancer indices recorded in computer storage, and a software module providing advanced cancer analysis. The application may include a software module providing a multiomic indexing pipeline that ingests multiomic cancer data and the annotations, medical data, and clinical data associated with multiomic genomic and imaging data, tokenizes the data while preserving variant nomenclature, gene names, and drug names, and updates the index with the tokenized data. The application may also include a software module responsible for ranking the integrated multiomic data to reflect the clinical utility of the cancer alterations. The application may include a query engine that selects and joins relevant multiomic indices and returns ranked multiomic alterations for individual samples and sample cohorts. The application may include a software module that presents a user interface allowing a user to enter user queries and perform faceted searches on multiomic data.

According to various embodiments, a non-transitory computer-readable storage medium encoded with a computer program comprising instructions executable by a processor to create a multiomic cancer search engine application is provided. The multiomic cancer search engine application may include a plurality of integrated multiomic cancer indices recorded in computer storage, and a software module providing advanced cancer analysis. The application may include a software module providing a multiomic indexing pipeline that ingests multiomic cancer data and the annotations, medical data, and clinical data associated with multiomic genomic and imaging data, tokenizes the data while preserving variant nomenclature, gene names, and drug names, and updates the index with the tokenized data. The application may also include a software module responsible for ranking the integrated multiomic data to reflect the clinical utility, pathogenicity, frequency, and feature weight of the returned results. The application may include a query engine that selects and joins relevant multiomic indices and returns ranked multiomic alterations for individual samples and sample cohorts. The application may include a software module that presents a user interface allowing a user to enter user queries and perform faceted searches on multiomic data.

According to various embodiments, a computer-implemented method of providing a multiomic cancer search engine application is provided. The multiomic cancer search engine application may include a plurality of integrated multiomic cancer indices recorded in computer storage, and a software module providing advanced cancer analysis. The application may include a software module providing a multiomic indexing pipeline that ingests multiomic cancer data and the annotations, medical data, and clinical data associated with multiomic genomic and imaging data, tokenizes the data while preserving variant nomenclature, gene names, and drug names, and updates the index with the tokenized data. The application may include a software module responsible for ranking the integrated multiomic data to reflect the clinical utility, pathogenicity, frequency, and feature weight of the returned cancer alterations. The application may include a query engine that selects and joins relevant multiomic indices and returns ranked multiomic alterations for individual samples and sample cohorts. The application may include a software module that presents a user interface allowing a user to enter user queries and perform faceted searches on multiomic data. In various embodiments, the index is formatted into a partially pre-joined configuration, and the clinical rankings are pre-loaded, such that search speed is increased and the lag time between the search and the results is reduced. In various embodiments, pre-linking of the multiomic indices occurs before a user enters a query.

Note that all of the previous discussions of additional features, particularly with respect to the previously described computer-implemented methods, computer-implemented systems, and non-transitory computer-readable media, apply to the features of the various system embodiments described and contemplated herein, in accordance with the various embodiments.

As discussed above, according to various embodiments, the systems and methods described herein may bring together a large amount of cancer multiomic data. The data can include, for example, genomic data (e.g., single nucleotide variations, insertions and deletions in tumor and normal tissues, structural rearrangements, copy number variations, gene fusions, and expressed variants of tumor genomes), transcriptomic data (e.g., RNA-Seq variant validation and differential gene expression), epigenetic data, chromatin accessibility data, microbiome data, proteome abundance and localization data, medical literature data (e.g., publications, treatment guidelines, clinical trial inclusion/exclusion criteria), phenotypic data (e.g., functional, clinical, EHR), imaging data (e.g., histology, MRI, X-ray, mammogram, ultrasound, PET images, CT scans), cancer annotation sources (e.g., variants, genes, pathways, drugs), derived cancer analyses (e.g., tumor mutation burden, mutation signatures, microsatellite instability status, spatial multiomic lineage representation, neoantigen binding affinity for MHC class I and class II molecules), and predictions from machine learning models and their features (e.g., major primary site, microsatellite instability, potential future metastatic sites, drug and assay matches). According to various embodiments, the genomic data may be in the form of a whole exome, a whole genome, genomic set data, or a SNP array. According to various embodiments, sequential biopsy multiomic data may be indexed for the purpose of monitoring disease progression, development of drug resistance, and recurrence.

According to various embodiments, the indexed data may be in the form of, for example, but not limited to, Variant Call Format (VCF), BAM, and FASTQ, either for both tumor and normal or for tumor only. According to various embodiments, the phenotype data may be provided in a tabular format or raw format (e.g., EHR, clinical record, PDF report).

As discussed above, according to various embodiments, the systems and methods described herein may include an annotation source. Examples of annotation sources may include, but are not limited to: FDA labels, NCCN guidelines, clinical trials, CIViC, DoCM, OncoKB, MyCancerGenome, cancer drug genomic biomarker databases, TCGA, ICGC, COSMIC, NCI60, CCLE, DrugBank, ClinVar, HGMD, PGMD, PharmGKB, dbSNP, dbNSFP, 1000 Genomes, ExAC, CPDB, CADD, PolyPhen, and the like.

According to various embodiments, the systems and methods described herein may further include drug target information, which may be derived and integrated from multiple sources. These sources include, but are not limited to, FDA labels, the NCCN Drugs and Biologics Compendium, Thomson Micromedex DrugDex, the Elsevier Gold Standard Clinical Pharmacology compendium, the American Hospital Formulary Service Drug Information compendium, the ESMO guidelines, the ASCO guidelines, the NCCN guidelines, and mutations annotated in other cancer knowledge databases such as, for example, OncoKB, CIViC, DoCM, and COSMIC. According to various embodiments, drug targets may be indexed at the variant, gene, and pathway levels. According to various embodiments, the drug indications, evidence, cancer type, reported adverse reactions, and additional information may be stored in a search index.

As discussed above, according to various embodiments, the systems and methods described herein may include cancer analysis (or advanced cancer analysis), or a software module providing advanced cancer analysis, or the use of the foregoing. The software module may provide both pre-computed (e.g., computed at index time) and dynamic (e.g., computed at query time) derived cancer analyses. According to various embodiments, advanced analysis may also be visualized at query time. Fig. 3 illustrates an example of cancer analysis pre-computed and dynamically computed for individual samples and homogeneous groups. Advanced analysis modules can integrate predictions from machine learning and deep learning models for predicting important characteristics of tumor biology.

According to various embodiments, pre-computed derived cancer analyses for individual samples may include, for example, but are not limited to, tumor mutation burden (an important biomarker for therapy, e.g., immunotherapy), microsatellite instability status (an important cancer state in which mismatch repair proteins are not functional), genomic mutation signatures (reflecting the underlying etiology and mechanistic basis of the cancer), detected neoORFs (frameshift mutations that may result in new amino acid sequences and may be useful for cancer vaccines), detected neoantigens, neoantigen binding affinity for MHC class I and class II molecules, HLA allele typing (an important variable for cancer vaccine design), expressed immune genes (e.g., genes associated with response to immunotherapy), RNA-Seq-confirmed variants, and differentially expressed genes.
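By way of non-limiting illustration, tumor mutation burden is commonly expressed as the number of somatic mutations per megabase of sequenced territory. The sketch below assumes that convention; which mutation classes are counted and how the covered territory is measured are assumptions that are not specified herein, and the function name is hypothetical.

```python
def tumor_mutation_burden(n_somatic_mutations, covered_bases):
    """Somatic mutations per megabase of covered sequence (assumed convention)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_mutations / (covered_bases / 1_000_000)


# Hypothetical sample: 300 counted mutations over 30 Mb of covered exome.
tmb = tumor_mutation_burden(300, 30_000_000)
print(tmb)  # 10.0 mutations per megabase
```

Because the value depends only on the variant calls and the covered territory, it can be computed once at index time and stored alongside the sample.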

According to various embodiments, dynamic advanced cancer analysis for individual samples may include, for example, but not limited to, pathway enrichment analysis for specific types of variants (query-based, e.g., non-silent variants) and spatial multiomic lineage representation. According to various embodiments, dynamic advanced cancer analysis for a sample cohort may include, but is not limited to: cohort mutation signatures; detection of significantly mutated genes and cancer drivers by collapsing recurrent somatic alterations in the same gene after correcting for the ratio of non-silent to silent variants, gene replication timing, and other properties of cancer biology; stratification of disease states; spatial multiomic lineage representation; and pathway enrichment analysis for a subset of variants (e.g., non-silent mutations).

According to various embodiments, cancer analysis may be provided via an advanced analysis module that may be configured to integrate predictions from, for example, machine learning and deep learning models for predicting important characteristics of tumor biology (e.g., tumor-only and tumor-normal classifiers for microsatellite instability status; classification of tumor origin for metastatic tumors of unknown origin; models for predicting the most likely recurrence sites for specific patients; deep learning and machine learning methods for tumor-only variant calling; neoantigen binding prediction; machine learning models for genetic cancer risk prediction for different cancer types; machine learning models for immunotherapy outcome prediction; classification of variants as true positives or false positives; deep learning methods for named entity recognition of variants, genes, drugs, and diseases in literature, EHR, and clinical trial data; deep learning methods for identifying regions of interest and extracting features from unstructured histological and radiological slides and other imaging data; latent embedding deep learning models for learning multiomic disease states of cancer; deep learning methods for drug and test matching; machine learning models for identifying similar patients; cancer treatment recommender systems based on the outcomes of treating similar patients; and machine learning and deep learning methods for cohort biomarker stratification and cohort disease state identification).

According to various embodiments, the systems and methods described herein may include deep learning embedding of phenotypic data (e.g., learned from electronic health records, clinical and functional records), annotation sources, medical literature, or imaging data (e.g., histological slides, MRI, X-rays, mammograms, ultrasound, PET images, CT scans), for example.

According to various embodiments, the systems and methods described herein may include an advanced cancer analysis module that sets statistical thresholds for quality control and identifies outliers for indexed sequencing quality metrics. Some non-limiting examples of quality control metrics of interest may include quality control for tumor-normal matching (e.g., relatedness and identity values); tumor and normal sequencing metrics (e.g., Freemix/Conpair metrics reflecting potential tumor/normal contamination, and sequencing metrics including but not limited to average total coverage, percent reads aligned, percent duplicates, and Y/X ratio); and somatic sequencing quality control metrics including, but not limited to, the number of variants in dbSNP, dbSNP enrichment, dbSNP indel ratio, dbSNP transition/transversion ratio, and heterozygous/homozygous variant ratio.
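By way of non-limiting illustration, one way to set a statistical threshold over an indexed quality metric is a robust outlier rule based on the median absolute deviation (MAD). The choice of MAD, the cutoff `k`, the metric values, and the function name are assumptions made for this sketch and are not specified herein.

```python
import statistics


def flag_outliers(metric_by_sample, k=5.0):
    """Flag samples whose metric lies more than k MADs from the median."""
    values = list(metric_by_sample.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread: nothing can be flagged by this rule
    return sorted(
        sample
        for sample, value in metric_by_sample.items()
        if abs(value - med) / mad > k
    )


# Hypothetical "percent reads aligned" values for five samples.
percent_aligned = {"s1": 98.1, "s2": 97.9, "s3": 98.4, "s4": 72.0, "s5": 98.0}
outliers = flag_outliers(percent_aligned)
print(outliers)
```

A median-based rule is used here because a single contaminated sample would otherwise inflate a mean and standard deviation computed over a small batch.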

According to various embodiments, advanced cancer analysis (or its associated modules) may provide, for example, dynamic algorithms for mutation profiling based on suspected (multiomic) biomarkers in homogeneous groups of samples, cancer driver identification, comparison of multiple biopsies, and homogeneous group stratification. In various embodiments, a comparison of a sample to a homogeneous group of samples, and a comparison of multiple homogeneous groups may be achieved.

According to various embodiments, the systems and methods described herein may include indexing and aggregating a large amount of cancer multiomic data. As discussed in detail above, the data may include, for example, but is not limited to, genomic data (e.g., single nucleotide variations, tumor and normal insertions and deletions, structural rearrangements, copy number variations, gene fusions, and expressed variants of the tumor genome), transcriptomic data, epigenetic data, chromatin accessibility data, microbiome data, proteome abundance and localization data, medical literature data (e.g., publications, treatment guidelines, clinical trial inclusion/exclusion criteria), phenotypic data (e.g., functional, clinical, EHR), imaging data (e.g., histology slides, MRI, X-ray, mammograms, ultrasound, PET images, CT scans), cancer annotation sources (e.g., variants, genes, pathways, drugs), derived cancer analyses (e.g., tumor mutation burden, mutation signatures, differentially expressed genes, spatial multiomic lineage representation), and predictions from machine learning models and their features (e.g., major primary site, future metastatic sites, microsatellite instability status, neoantigen binding affinity for MHC class I and class II molecules).

Applicants have advantageously found that indexing, in addition to the raw data and derived analyses, the predictions from machine learning and deep learning models together with their (derived) features and embeddings can provide better machine learning interpretability, iterative hypothesis generation, and refinement of subsequent queries by the user, so that tumor biology can be better characterized and understood.

According to various embodiments and as discussed above, the systems and methods disclosed herein may include software modules for multiomic indexing of cancer data and of the annotations, medical data, and clinical data associated with genomic and imaging data, tokenizing the data while preserving variant nomenclature, gene names, and drug names, and updating the index with the tokenized data. According to various embodiments, the step of multiomic indexing may comprise integrating and pre-linking multiomic indices at the level of variants, genes, pathways, cancer subtypes, or samples.
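By way of non-limiting illustration, tokenization that preserves variant nomenclature can be sketched as a two-pass scan: protected spans are emitted verbatim, and the text between them is split into ordinary word tokens. The regular expression for HGVS-like names (such as `p.V600E` or `c.1799T>A`) is a simplified assumption and does not cover the full nomenclature; the function name is hypothetical.

```python
import re

# Simplified, hypothetical pattern for HGVS-like variant names.
PROTECTED = re.compile(r"(?<![A-Za-z0-9])[cgp]\.[A-Za-z0-9_>*+\-]+")
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens, keeping variant names as single tokens."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        tokens.extend(WORD.findall(text[pos:match.start()]))
        tokens.append(match.group())  # keep the variant name verbatim
        pos = match.end()
    tokens.extend(WORD.findall(text[pos:]))
    return tokens


text = "BRAF p.V600E responds to vemurafenib (c.1799T>A)."
tokens = tokenize(text)
naive = WORD.findall(text)  # a naive tokenizer splits the variant names
print(tokens)
print(naive)
```

The naive split breaks `p.V600E` into `p` and `V600E`, so a query for the exact variant name could no longer be matched against a single index term.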

With particular regard to cancer annotation data, according to various embodiments, the systems and methods described herein may include an indexing step (see above), or provide a software module that performs multiomic indexing for cancer annotation data. Cancer annotation data may include, but is not limited to, FDA labels and NCCN guidelines, clinical trials, public cancer databases (CIViC, DoCM, OncoKB, MyCancerGenome, COSMIC, cancer drug genomic biomarker databases, ICGC, TCGA), public genomic databases (ClinVar, dbNSFP, dbSNP), and commercial data sources (HGMD, PGMD, PharmGKB, CPDB). In another aspect, the multiomic indexing software module also indexes annotation sources that are not focused on cancer: ClinVar, dbNSFP, dbSNP, CPDB, HGMD, and PGMD. According to various embodiments, the software module for multiomic indexing may be configured to integrate and pre-link the multiomic annotation data at the level of variants, gene codon numbering, genes, pathways, cancer subtypes, or samples.

According to various embodiments, indexing may further comprise utilizing the derived content embedding to index complex phenotypes, literature data, histopathology, MRI, X-ray, mammogram, ultrasound, PET images, CT scan images.

According to various embodiments, the systems and methods described herein may further include an indexing procedure in which multiomic data integration during indexing occurs first at the sample level and then at the variant, gene codon numbering, gene, or pathway level, or any combination thereof, as depicted in figs. 2a and 2b. In a non-limiting example of multiomic index integration shown in fig. 2a, the ingested multiomic cancer data is selected from the group consisting of: single nucleotide variants (SNVs) and small insertions and deletions (indels) (expressed as chromosome number, chromosome position, reference, and alternate allele, or CPRA), copy number variants (CNVs), and variants confirmed in RNA. SNVs and small indels can be indexed from somatic VCFs containing SNVs and small insertions and deletions. CNVs that are called on a chromosomal region can be indexed from copy number call VCFs (CNVs are also mapped at the gene level, e.g., using the advanced cancer analysis module). RNA-Seq-confirmed variants can be obtained from RNA-Seq analysis (from the advanced cancer analysis module). The multiomic indices can be joined to answer complex queries (e.g., CNV gains and losses that overlap SNVs and small insertions and deletions expressed in RNA, for a set of samples). Differentially expressed genes can be derived from, for example, the advanced analysis software module.

According to various embodiments, the joined multiomic index may be generated via a selected indexing method, such as, for example, but not limited to, KEYSxCPRA, KEYSxCNV_RANGE, KEYSxCNV_GENE, KEYSxCPRA_RNA, and KEYSxGENE_RNA, which are used to index SNVs and small indels, copy number variants, and confirmed RNA variants (see again fig. 2a). Applicants have advantageously discovered that cross-indexing of multiple information streams provides, for example, the ability to query any combination of omics data streams or the individual streams themselves, as well as the ability to perform entity linking at the variant, gene codon numbering, gene, pathway, and other levels.

Referring to the example shown in fig. 2a, the first index table 210 describes single nucleotide variants and small insertions and deletions in DNA in terms of their CPRA 212 (chromosome 214, position 216, reference 218, alternate allele 220) that occur in a sample with KEYS sample ID 222. The second index table 230 describes copy number variants (CNVs) in terms of ranges 232 (chromosome 234, start 236, end 238) that occur in samples with KEYS sample ID 242. The third index table 250 describes DNA variants (CPRA) 252 (see the first index table 210) that are confirmed by RNA-Seq and occur in samples with KEYS sample ID 262. The fourth index table 270 describes copy number variants (CNV) 272 in terms of their ranges relative to single nucleotide variants and small insertions and deletions in DNA (CPRA) 274.
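By way of non-limiting illustration, joining the sample-keyed index tables of fig. 2a to answer a combined query (variants that fall inside a CNV range in the same sample and are confirmed in RNA) can be sketched with in-memory dictionaries. The table layouts, coordinates, and sample identifiers below are made up for illustration, and the nested loop stands in for whatever join strategy an actual index would use.

```python
# Hypothetical KEYSxCPRA table: (chrom, pos, ref, alt) -> sample IDs.
keys_x_cpra = {
    ("7", 140453136, "A", "T"): {"S1", "S2"},
    ("17", 7577120, "C", "T"): {"S2"},
}
# Hypothetical KEYSxCNV_RANGE table: (chrom, start, end) -> sample IDs.
keys_x_cnv_range = {
    ("7", 140000000, 141000000): {"S1"},
    ("17", 7000000, 8000000): {"S3"},
}
# Hypothetical KEYSxCPRA_RNA table: RNA-confirmed CPRA -> sample IDs.
keys_x_cpra_rna = {
    ("7", 140453136, "A", "T"): {"S1"},
}


def variants_in_cnv_confirmed_in_rna(cpra_idx, cnv_idx, rna_idx):
    """Return (sample, CPRA) pairs present in all three indices."""
    hits = []
    for cpra, samples in cpra_idx.items():
        chrom, pos, _ref, _alt = cpra
        for (cnv_chrom, start, end), cnv_samples in cnv_idx.items():
            if chrom == cnv_chrom and start <= pos <= end:
                shared = samples & cnv_samples & rna_idx.get(cpra, set())
                hits.extend((sample, cpra) for sample in sorted(shared))
    return hits


hits = variants_in_cnv_confirmed_in_rna(
    keys_x_cpra, keys_x_cnv_range, keys_x_cpra_rna
)
print(hits)
```

Because every table is keyed by sample ID, the combined query reduces to set intersections, which is what makes pre-joining the tables before a query arrives attractive.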

Referring to the example shown in fig. 2b, a CPRAxTERM ranking 300 is provided. The CPRAxTERM ranking 300 consists of rankings for the aggregated annotations (terms) at the CPRA level 310, the GENE_CODON level 312, and the GENE level 314. Formula 320 provides an example of how to compute a ranking for a CPRA at the GENE_CODON level. Formula 322 provides an example of how to compute a ranking for a CPRA at the GENE level. The fifth index table 330 provides an example of an index table that maps CPRA to GENE_CODON. The sixth index table 340 provides an example of a GENE_CODON level annotation index table. The seventh index table 350 provides an example of a CPRA level annotation index table.

As discussed above, according to various embodiments, the systems and methods described herein may provide for ranking of the selected one or more multiomic data indices. In various embodiments, the ranking may occur without a filter associated with available cancer multiomic data. As discussed above, the accessible data may include, for example, variants, genes, pathways, RNA-Seq-confirmed variants, differentially expressed genes, regions of high/low methylation, expressed proteins, copy number variants, structural variants, gene fusions, phenotypes, family history, annotations, drugs, clinical trial inclusion/exclusion criteria, derived analyses (e.g., mutation signature weights, microsatellite repeat sites, features extracted from imaging data, the images themselves, and literature data and embeddings thereof), and machine learning model predictions and their features (e.g., microsatellite instability status and microsatellite instability sites, predicted primary sites and the alterations identified as key features of the model by their relative importance, predicted metastatic sites and model key features, and predicted neoantigen binding affinities for MHC class I and class II molecules). In various embodiments, any combination of different omics streams or individual data streams may be returned based on the user query.

Fig. 2b, for example, illustrates an example of hierarchical propagation of annotations, in which the ranking of a variant (CPRA) is accumulated as a weighted ranking of variant-level CPRAxCPRATERM, codon-level CPRAxCODONTERM, and gene-level CPRAxGENETERM annotations.
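By way of non-limiting illustration, hierarchical propagation can be sketched as a weighted sum of annotation scores collected at the variant, codon, and gene levels. Formulas 320 and 322 are not reproduced in this text, so the weighted sum, the level weights, the identifiers, and the scores below are all assumptions made for this sketch.

```python
# Hypothetical level weights; the actual weighting is not given here.
LEVEL_WEIGHTS = {"cpra": 1.0, "gene_codon": 0.5, "gene": 0.25}


def variant_rank(cpra, cpra_to_codon, codon_to_gene, term_scores):
    """Accumulate annotation scores for a variant across three levels."""
    codon = cpra_to_codon[cpra]
    gene = codon_to_gene[codon]
    keys = {"cpra": cpra, "gene_codon": codon, "gene": gene}
    return sum(
        LEVEL_WEIGHTS[level] * sum(term_scores[level].get(key, []))
        for level, key in keys.items()
    )


# Made-up mapping tables and annotation term scores.
cpra_to_codon = {"7:140453136:A:T": "BRAF_600", "7:140453137:C:T": "BRAF_600"}
codon_to_gene = {"BRAF_600": "BRAF"}
term_scores = {
    "cpra": {"7:140453136:A:T": [2.0]},
    "gene_codon": {"BRAF_600": [1.0, 1.0]},
    "gene": {"BRAF": [4.0]},
}

annotated = variant_rank("7:140453136:A:T", cpra_to_codon, codon_to_gene, term_scores)
neighbor = variant_rank("7:140453137:C:T", cpra_to_codon, codon_to_gene, term_scores)
print(annotated, neighbor)  # 4.0 2.0
```

The second variant has no variant-level annotation of its own, yet it still receives a nonzero rank from the codon-level and gene-level annotations it inherits.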

As discussed above, according to various embodiments, the systems and methods described herein may provide for integration and ranking of multiple cancer annotation sources. These multiple cancer annotation sources may include, for example, FDA labels, NCCN guidelines, NCCN compendium biomarkers, clinical trials, CIViC, DoCM, OncoKB, MyCancerGenome, cancer drug genomic biomarker databases, TCGA, ICGC, COSMIC, NCI60, CCLE, DrugBank, ClinVar, HGMD, PGMD, PharmGKB, dbSNP, dbNSFP, 1000 Genomes, ExAC, CPDB, KEGG, BioCarta, BioCyc, Reactome, genp, MSigDB, Brenda, CTD, HPRD, GXD, and BIND.

According to various embodiments, the multiomic ranking engine (or module) may further be associated with a learning engine that integrates, for example, annotation sources, literature data, clinical trial results, and significantly mutated genes from well-characterized cohorts (such as TCGA) to learn a clinical actionability ranking of multiomic data in query use case settings for both individual patients and cohorts. In other embodiments, the learned ranking may be based on the predicted pathogenicity of alterations of unknown clinical significance.

As discussed above, according to various embodiments, the systems and methods described herein may provide ranking of cancer genomic alterations in terms of their clinical actionability, pathogenicity, feature weight, or frequency. According to various embodiments, the ranking model may be derived by training a supervised learning model that learns to weight features extracted from the multiomic cancer data. For variants (e.g., at precise locations and specific codons) or genes (e.g., considering mutation types), these features may include, for example, indicators of whether the variants and/or alteration types in genes have been implicated in FDA labels, NCCN guidelines, NCCN biomarker compendia, ASCO guidelines, ESMO guidelines, or other top-level cancer guidelines, and whether there are indications/contraindications for specific drugs; features of gene variants or alteration types extracted from other cancer annotation sources (such as, for example, clinical trials, OncoKB, MyCancerGenome, CIViC, DoCM, and cancer drug genomic biomarker databases); features extracted from other relevant annotation sources, such as, for example, TCGA significantly mutated genes, the COSMIC Cancer Gene Census, COSMIC, ICGC, DrugBank, Swiss-Prot, dbNSFP, HGMD, PGMD, PharmGKB, and ClinVar; population allele frequency data from HLI, HLI cancer, TCGA, COSMIC, ICGC, 1000 Genomes, ExAC, and gnomAD; embeddings of text taken from relevant clinical trials, PubMed, Medline, OMIM articles, and other medical literature; and embeddings of named entities extracted from medical literature.

According to various embodiments, the ranking may be based on support vector regression, boosted trees, and other machine learning models that weight information from annotation sources such as, for example, FDA labels, NCCN guidelines, the NCCN biomarker compendium, curated cancer genes, COSMIC, TCGA significantly mutated genes, known hotspots, clinical trials, and in silico predicted gain-of-function/loss-of-function scores (e.g., CADD, FATHMM, SIFT, PolyPhen).

According to various embodiments, three learning-to-rank approaches are used to derive the ranking. These approaches include pointwise methods (e.g., logistic regression), pairwise methods (e.g., RankSVM, RankBoost), and listwise methods (e.g., LambdaMART).
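By way of non-limiting illustration, the three approaches differ mainly in how the same graded relevance labels are turned into training examples. The sketch below builds the pointwise, pairwise, and listwise views of one query; the item names, relevance grades, and function names are hypothetical, and no model is trained.

```python
from itertools import combinations

# Hypothetical results for one query: (item, graded relevance label).
results = [("V600E", 2), ("G12D", 1), ("silent_variant", 0)]


def pointwise_examples(items):
    """Each item is an independent (item, label) training example."""
    return list(items)


def pairwise_examples(items):
    """Each example is an ordered pair (preferred item, other item)."""
    pairs = []
    for (a, rel_a), (b, rel_b) in combinations(items, 2):
        if rel_a > rel_b:
            pairs.append((a, b))
        elif rel_b > rel_a:
            pairs.append((b, a))
    return pairs


def listwise_example(items):
    """The whole result list, in ideal order, is one training example."""
    return [item for item, _rel in sorted(items, key=lambda x: -x[1])]


pairs = pairwise_examples(results)
ideal = listwise_example(results)
print(pointwise_examples(results))
print(pairs)
print(ideal)
```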

According to various embodiments, the ranking of variants and genes may be learned separately from the ranking of other documents (e.g., medical documents), for which a separate learning-to-rank model is trained using a set of weighted transformed features that may include, for example, BM25, PageRank, RM3, and other ranking models for text documents.
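By way of non-limiting illustration, BM25 is one of the text features named above and can be computed as follows. The sketch uses the common Okapi BM25 form with a smoothed, non-negative inverse document frequency; the parameter values `k1=1.2` and `b=0.75` are conventional defaults rather than values stated herein, and the documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "braf v600e melanoma vemurafenib response".split(),
    "egfr lung cancer trial".split(),
    "braf melanoma".split(),
]
scores = bm25_scores("braf melanoma".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The shortest document containing both query terms ranks first because BM25 normalizes term frequency by document length.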

According to various embodiments, the ranking of pairs of variants and genes may be learned alone or as part of a wide and deep model along with rankings for other document types. In some embodiments, ranking for text documents utilizes deep learning language modeling (LM), with items ranked by the probability of the document given the query. According to various embodiments, the deep learning language model may be a transformer model (e.g., BERT, RoBERTa, XLNet, ALBERT) that is fine-tuned on the relevant data. Such models may provide large-scale, pre-trained language model embeddings. According to various embodiments, document relevance may be generated using the text and temporal portions of a document, for example, by deriving a plurality of classes of features including, for example, entity features and temporal features derived from a set of annotations, named entity recognition (NER), and temporal annotations.

According to various embodiments, to provide additional semantic understanding, deep learning methods (e.g., deep semantic similarity models, convolutional deep semantic similarity models, recurrent deep semantic similarity models, deep relevance matching models, interaction Siamese networks, lexical and semantic matching networks, long short-term memory networks, transformer networks, word embedding methods, DeepRank) may be used to address the feature engineering task of learning to rank by primarily using features learned automatically from the raw text of the queries and documents. As such, the deep learning approach may use different types of neural networks, whether, for example, convolutional or recurrent.

As discussed above, according to various embodiments, the ranking may include a clinical ranking for cancer variants and genes. The ranking may comprise a deep-learned ranking, wherein the deep-learned ranking may be derived from a deep learning model selected from the group consisting of: a deep semantic similarity model, a wide and deep model, a deep language model, a learned deep learning text embedding, learned named entity recognition, a Siamese neural network, and combinations of the foregoing.

FIG. 4a illustrates an example of a wide and deep model for learning variant rankings. The wide part can use cross-product feature transformations from different annotation sources to effectively memorize sparse features and their interactions, while the deep part can generalize to previously unseen feature interactions and document embeddings.

FIG. 4b illustrates an example of a learning-to-rank engine that relies on a deep semantic similarity model for biomedical data (see discussion above). In the particular example shown in FIG. 4b, a Siamese network is used to learn the semantic similarity between a query (Q) and a relevant document (D⁺) by learning joint query and document embeddings. Relevance can be estimated by the cosine similarity R(Q, D) between the query and document embeddings. The network may be trained to minimize the cross-entropy loss over the candidate set D = {D⁺} ∪ D⁻, where D⁻ denotes randomly sampled negative documents.
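
By way of non-limiting illustration, the training objective described above may be sketched as follows. The sketch assumes the common softmax formulation over cosine similarities; the smoothing factor `gamma` and the function names are illustrative assumptions and are not taken from FIG. 4b.

```python
import math


def cosine(a, b):
    """Cosine similarity R(Q, D) between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy_loss(query, positive, negatives, gamma=1.0):
    """Loss for one query over the candidate set {D+} ∪ D-.

    gamma is an assumed smoothing (temperature) factor.
    """
    candidates = [positive] + list(negatives)
    scores = [gamma * cosine(query, d) for d in candidates]
    # log-sum-exp computed in a numerically stable way
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    # negative log-probability assigned to the relevant document D+
    return log_norm - scores[0]
```

The loss is small when the relevant document is closer to the query than the sampled negatives, and grows as negatives score higher than the relevant document.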

After the ranking model is trained, the document embeddings may be pre-computed (e.g., as the centroid of all unit vectors of the words in the document). At query time, a query embedding vector may be generated before evaluating the similarity between the query and document representations in the joint latent space. Note that the particular queries and documents referenced in FIG. 4b are merely exemplary, and do not limit the types of queries submitted and documents analyzed in any way.
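
By way of non-limiting illustration, the pre-computation and query-time steps described above may be sketched as follows. The word vectors, document texts, and function names are hypothetical toy values chosen only for the example; a deployed system would use trained embeddings.

```python
import math

# Hypothetical toy word vectors (assumption; not from the source).
WORD_VECTORS = {
    "tp53": [0.9, 0.1, 0.0],
    "mutation": [0.7, 0.3, 0.1],
    "melanoma": [0.1, 0.9, 0.2],
    "immunotherapy": [0.0, 0.8, 0.6],
}


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def embed(text):
    """Centroid of the unit vectors of the known words in the text."""
    vectors = [unit(WORD_VECTORS[w]) for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def rank(query, doc_embeddings):
    """Embed the query at query time and sort documents by cosine similarity."""
    q = embed(query)
    if q is None:
        return []
    scored = [(cosine(q, e), doc_id) for doc_id, e in doc_embeddings.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Document embeddings are computed once, ahead of query time.
DOCS = {"doc_a": "TP53 mutation", "doc_b": "melanoma immunotherapy"}
DOC_EMBEDDINGS = {doc_id: embed(text) for doc_id, text in DOCS.items()}
```

Only the query embedding is computed per request in this sketch; the document side is a lookup.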

According to various embodiments, global ranking may be optimized for clinical actionability (or pathogenicity when clinical utility is unknown) and preloaded into an index so that results (e.g., retrieved via a top-K algorithm) may be re-ranked to further meet specific information needs. According to various embodiments, re-ranking may involve using language modeling or weighted, transformed features from standard information retrieval models (e.g., PageRank, BM25, RM3).
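
By way of non-limiting illustration, retrieval of the top-K results by a preloaded global score followed by query-specific re-ranking may be sketched as follows. The index entries, the `text_match` feature (standing in for, e.g., a normalized BM25 score), and the blending weight are hypothetical assumptions.

```python
import heapq

# Hypothetical index entries: a preloaded global score plus a
# query-specific text-match feature (assumed to be in [0, 1]).
INDEX = [
    {"id": "variant_a", "global_score": 0.95, "text_match": 0.20},
    {"id": "variant_b", "global_score": 0.90, "text_match": 0.90},
    {"id": "variant_c", "global_score": 0.60, "text_match": 0.10},
    {"id": "variant_d", "global_score": 0.10, "text_match": 1.00},
]


def top_k(index, k):
    """Select the k entries with the highest preloaded global score."""
    return heapq.nlargest(k, index, key=lambda e: e["global_score"])


def rerank(candidates, weight=0.5):
    """Blend the global score with the query-specific feature and re-sort."""
    def blended(entry):
        return (1 - weight) * entry["global_score"] + weight * entry["text_match"]
    return sorted(candidates, key=blended, reverse=True)


reranked_ids = [e["id"] for e in rerank(top_k(INDEX, 3))]
```

Because only the K best globally ranked entries are re-scored, the more expensive query-specific features need not be evaluated over the whole index.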

According to various embodiments, ranking of potential biomarkers in a sample cohort may be achieved by first learning latent space representations of multi-omic data streams (e.g., DNA and RNA, as discussed herein, among others), and then clustering the representations and identifying a set of features (e.g., biomarkers) that are responsible for maximum disentanglement between sub-cohorts of interest. According to various embodiments, a multi-omic unsupervised deep learning approach (e.g., a variational autoencoder) may be constructed for this purpose. According to various embodiments, a deep generative adversarial network may be constructed with cycle-consistency losses between multiple data streams. According to various embodiments, standard dimensionality reduction techniques (e.g., principal component analysis, independent component analysis, manifold learning) may be used to transform sparse, high-dimensional multi-omic data into a meaningful latent space. These methods advantageously can increase the ability to detect multi-omic biomarkers.

As discussed above, according to various embodiments, the systems and methods described herein may propagate rankings learned at higher levels of the biological hierarchy to inform lower levels. For example, a gene-level ranking may inform variant-level rankings where information about the occurrence of variants in various cancer annotation sources is not available.

According to various embodiments, the ranking for variants lacking annotations may be constructed as an aggregation of the rankings for their genes and mutation types. For example, an aggregation function may be learned that predicts overall relevance given these aspects, after which a conventional learning-to-rank algorithm may be applied to the aggregated rankings.
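
By way of non-limiting illustration, falling back from a variant-level score to an aggregation of gene-level and mutation-type scores may be sketched as follows. The gene names, scores, and fixed weights are hypothetical; the weighted sum merely stands in for an aggregation function that would be learned in practice.

```python
# Hypothetical scores (assumptions for illustration only).
GENE_SCORE = {"GENE_A": 0.9, "GENE_B": 0.4}
MUTATION_TYPE_SCORE = {"nonsense": 0.8, "missense": 0.5, "synonymous": 0.05}
# Variants that do have their own annotation-derived score.
VARIANT_SCORE = {("GENE_A", "p.X1Y"): 0.95}


def aggregate(gene_score, type_score, w_gene=0.6, w_type=0.4):
    """Stand-in for a learned aggregation function (fixed weights assumed)."""
    return w_gene * gene_score + w_type * type_score


def variant_relevance(gene, change, mutation_type):
    """Use the variant's own score if available; otherwise aggregate
    the gene-level and mutation-type scores."""
    key = (gene, change)
    if key in VARIANT_SCORE:
        return VARIANT_SCORE[key]
    return aggregate(
        GENE_SCORE.get(gene, 0.0),
        MUTATION_TYPE_SCORE.get(mutation_type, 0.0),
    )
```

In this sketch an unannotated nonsense variant in a highly ranked gene still receives a high relevance score, which is the propagation behavior described above.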

According to various embodiments, clinical actionability and pathogenicity rankings may be preloaded into the index to improve retrieval speed. According to various embodiments, ranking formulas learned for particular combinations of omic data streams may be applied at index retrieval time.

As discussed above, according to various embodiments, the systems and methods described herein may include ranking of the results returned for a particular user query. The returned results may depend on the combination of multi-omic data streams queried and may vary based on user preferences, taking into account the clinical relevance of the individual and combined multi-omic data streams.

According to various embodiments, the ranking may be changed by the user (e.g., the returned results may be promoted or demoted). According to various embodiments, the ranking may be altered by indirect feedback from the user, such as, for example, click-through rate and dwell time for particular returned results.

As discussed above, according to various embodiments, the systems and methods described herein may provide for gathering user feedback via web interactivity to improve the multi-omic ranking of results. For example, variants, genes, pathways, and derived analyses may be promoted or demoted in the list of returned results based on user feedback. According to various embodiments, additional curation information may be provided and saved in the index.

In various embodiments, the systems and methods described herein may provide an interface (or interaction with an interface) to collect explicit user feedback on the relevance of the returned results (e.g., the user likes, promotes, saves for reporting, pins, or exports a particular result, or the user dislikes, demotes, or deletes a result from the list of returned results).

In various embodiments, the systems and methods described herein may facilitate the collection and analysis of implicit user feedback from search logs (e.g., analysis of clicks, dwell times, query sequences, number of returned results).
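
By way of non-limiting illustration, deriving implicit feedback signals from a search log may be sketched as follows. The log record fields and the dwell-time threshold are hypothetical assumptions.

```python
from collections import defaultdict


def feedback_scores(log, min_dwell_seconds=10.0):
    """Compute per-result click-through rate and the rate of clicks
    followed by a dwell time of at least min_dwell_seconds."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    satisfied = defaultdict(int)
    for event in log:
        rid = event["result_id"]
        shown[rid] += 1
        if event["clicked"]:
            clicked[rid] += 1
            if event["dwell_seconds"] >= min_dwell_seconds:
                satisfied[rid] += 1
    return {
        rid: {
            "ctr": clicked[rid] / shown[rid],
            "satisfied_rate": satisfied[rid] / shown[rid],
        }
        for rid in shown
    }


# Hypothetical log entries: one record per result impression.
SEARCH_LOG = [
    {"result_id": "r1", "clicked": True, "dwell_seconds": 30.0},
    {"result_id": "r1", "clicked": False, "dwell_seconds": 0.0},
    {"result_id": "r2", "clicked": True, "dwell_seconds": 2.0},
]
SCORES = feedback_scores(SEARCH_LOG)
```

Signals of this kind could then be used as additional features when results are promoted or demoted; how they are weighted is left open here.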

In various embodiments, a collaborative search user interface may be provided (or interacted with) to allow multiple users to collaboratively refine the quality of the ranking of multi-omic cancer alterations (e.g., in a virtual tumor board setting).

As discussed above, according to various embodiments, the systems described herein may include a query engine that may be configured to perform at least one of: accepting user queries, selecting and aggregating relevant multi-omic indices, and returning ranked multi-omic alterations for individual samples and/or cohorts of cancer samples.

In various embodiments, the query engine may be a stateless server that accepts user queries (e.g., as HTTP POST requests) and, based on a set of precomputed and pre-joined multi-omic index files, responds with a ranked list of results (e.g., as asynchronous JSON). In various embodiments, the query engine may perform at least one of the following functions: (a) parsing the query and classifying the user intent (e.g., whether the user wants variants, genes, pathways, samples, single-sample data, cohort sample data, sample-to-cohort comparison, cohort-to-cohort comparison, publications, images); (b) providing query auto-correction (e.g., using an auto-correction deep learning model fine-tuned on query logs), providing selective synonym expansion and abbreviation expansion, generating alternative queries (e.g., using a fine-tuned deep learning transformer model), and providing content-based suggestions (e.g., using a language model fine-tuned on consecutive queries, or a model that utilizes indexed data); (c) determining an appropriate combination of multi-omic indices to use; (e) ranking results by their relevance to the predicted query intent (e.g., clinical relevance and pathogenicity as the default ranking, frequency for certain queries, mutual information content for other queries, feature weights, etc.); (f) summarizing annotation documents and medical documents (e.g., using deep learning summarization techniques); and (g) processing interaction/feedback signals from the UI. In various embodiments, the query engine may allow for sub-second latency per query and scalability to hundreds of thousands of concurrent users.
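
By way of non-limiting illustration, a stateless request handler of the kind described above may be sketched as follows. The keyword table is a hypothetical rule-based stand-in for a learned intent classifier, and the request fields (`query`, `limit`), index layout, and default intent are assumptions made for the example.

```python
import json

# Hypothetical keyword table standing in for a learned intent classifier.
INTENT_KEYWORDS = {
    "variants": {"variant", "variants", "mutation", "mutations"},
    "genes": {"gene", "genes"},
    "pathways": {"pathway", "pathways"},
    "publications": {"paper", "papers", "publication", "publications"},
}


def classify_intent(query):
    """Return the first intent whose keywords appear in the query."""
    words = set(query.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "variants"  # assumed default intent


def handle_request(body, indices):
    """Stateless handler: everything needed comes from the request body
    and from read-only, precomputed indices."""
    request = json.loads(body)
    intent = classify_intent(request["query"])
    hits = indices.get(intent, [])
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    limit = request.get("limit", 10)
    return json.dumps({"intent": intent, "results": ranked[:limit]})
```

Because the handler keeps no per-user state between calls, identical copies can be run in parallel behind a load balancer, which is one common way to approach the concurrency target mentioned above.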

At least some of these functions are illustrated in the example workflow of FIGS. 5a and 5b. FIGS. 5a and 5b illustrate a query engine workflow that functions to (1) generate synonym and abbreviation expansions, (2) generate alternative (similar) queries, (3) generate content-based suggestions and provide query autocompletion and autocorrection functions, (4) classify user query intent (e.g., whether the user wants variants, genes, pathways, samples, single-sample data, cohort sample data, sample-to-cohort comparison, cohort-to-cohort comparison, publications, or images), (5) perform neural information retrieval (e.g., based on joint embedding of queries and indexed documents), and (6) provide summaries of documents (e.g., multi-source text summaries). These summaries may be passed back to the user via the system UI. According to various embodiments, topic-specific term embeddings may be used for query expansion, in particular the query expansion in (2) above. According to various embodiments, for textual data, the neural information retrieval model may consider both matches in term space and matches in latent space. Furthermore, named entity recognition models for, e.g., variants, genes, pathways, drugs, and cancer types may also be integrated to improve recall. Note that the particular queries, data, and summaries referenced in FIGS. 5a and 5b are merely exemplary, and do not limit in any way the types of queries submitted, documents analyzed, or summaries generated. For example, in the particular example workflow shown in FIGS. 5a and 5b, given the particular parameters of the query, the query engine may conclude that, although loss-of-function events in TP53 are very common in cancer, the R248 variant appears not only to result in loss of tumor suppression, but may also act as a gain-of-function mutation that can promote tumorigenesis in a mouse model (see annotation source CIViC and cancer drug genome biomarker database GDKB).

As discussed above, according to various embodiments, the systems and methods described herein may facilitate integration of query term expansion using deep learning models trained on available biomedical literature and medical ontologies (e.g., GO, UMLS, DO, MeSH, voc, HPO, MPO).

As discussed above, according to various embodiments, the systems described herein may facilitate integration of neural information retrieval models with the aim of providing better semantic understanding capabilities for ranking documents, images, and annotations. In various embodiments, distributed representations of terms (e.g., generated by word2vec) may be combined to generate embeddings for queries and documents, and averaged embeddings may be used for efficient document similarity retrieval.

One way to rank effectively is to build a ranking model independently for each query. However, training a separate model for each query suffers from the lack of labeled data for unseen queries. Instead, according to various embodiments, a cancer genomic alteration search engine may allow the ranking to be fine-tuned for specific subsets of queries that are grouped by query type and have critical clinical importance (e.g., queries that return cancer alterations in the order of their clinical actionability and pathogenicity, or queries that return genes in the order of their clinical actionability). To derive the clinical actionability of variants and genes, a manually labeled corpus of query and document pairs may be used. In various embodiments, the precision and recall of the results may be measured.
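
By way of non-limiting illustration, the result quality measures mentioned above may be computed against a manually labeled corpus as follows. The sketch takes the measures to be precision and recall at a cutoff `k`, which is an assumption; the identifiers are hypothetical.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k ranked results, given the set
    of results labeled relevant for the query."""
    top = ranked_ids[:k]
    hits = sum(1 for r in top if r in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranked output and curator labels for one query.
RANKED = ["v1", "v2", "v3", "v4"]
RELEVANT = {"v1", "v3", "v9"}
P_AT_2, R_AT_2 = precision_recall_at_k(RANKED, RELEVANT, 2)
```

Averaging these values over all labeled queries of a given query type gives one way to compare ranking models for that type.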

In various embodiments, the corpus may include comprehensive cancer cases that are manually examined by a cancer analyst.

In various embodiments, the manually labeled corpus may be constructed by, for example, a cancer analyst/curator. Analysts/curators can examine, for example, (1) alterations in significantly mutated genes (>0.02 p or q values from MutSigCV) within a well-characterized cohort of the same cancer type (e.g., TCGA, ICGC, internal cohort); (2) the ranking of the significantly mutated genes; (3) whether the detected mutation is of the same type as in the well-characterized cohort (e.g., missense, insertion and deletion, nonsense); (4) if the mutation is missense, whether it occurs at a hotspot; (5) the number of patients in the well-characterized cohort who have the mutation; and (6) in some cases, further details of the mutation, including its location, its structure, and the cancer types of the patients having the mutation.

As discussed above, according to various embodiments, the systems and methods described herein may provide a universal search interface (as opposed to many different entry points). In various embodiments, all knowledge (whether it be, for example, multi-omic cancer data, samples, variants, genes, drugs, pathways, phenotypes, medical literature, image data, derived cancer analyses, machine learning models and their features for predicting tumor characteristics, uploaded user data, etc.) may be accessible through the same simple search interface.

As discussed above, according to various embodiments, the systems and methods described herein may provide a list/dashboard of clinically actionable and important cancer alterations, derived cancer analyses, and quality control metrics for clinicians or researchers working with individual samples or cohorts of samples.

According to various embodiments, the systems and methods described herein can provide significant cancer variants and hereditary cancer variants that are reportable according to ACMG guidelines.

According to various embodiments, the systems and methods described herein may provide dynamically hyperlinked individual patient and cohort reports in which at least some of the terms on the report are hyperlinked to a multi-omic cancer search query, with cancer alterations ranked. In various embodiments, hyperlinked report content may be dynamically generated based on queries that the user makes and saves for reporting purposes.

According to various embodiments, the systems and methods described herein allow for the inclusion of at least one of the following in a dynamic report generated from user queries saved for reporting: integrated multi-omic results, visualizations, images, medical literature, advanced cancer analyses, and any level of data from a cancer bioinformatics pipeline (e.g., sequencing coverage, percentage of base pair change types, visualization of sequencing reads supporting individual variants).

According to various embodiments, the systems and methods described herein may operate as a web service with two-factor authentication and an access control layer to help ensure that each client can access only the samples that it is authorized to access, and that no analysis is performed across independent data sets whose access is controlled by different entities.

In various embodiments, a query may include natural language terms (which may be conceptually arbitrary) combined with special operators. In various embodiments, the query may be entered by voice using a speech-to-text model. In various embodiments, special operators may enable users to reference certain information unambiguously (e.g., a particular client) or to impose certain constraints (e.g., provide only genes or pathways as results). In various embodiments, operators may include, for example, plus, minus, equal, sum, asterisk, quotation mark, brace, curly brace, backslash, forward slash, colon, semicolon, pound sign (#), at sign (@), tilde (~), equals sign (=), greater than (>), less than (<), and the words and, or, not, and except.
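
By way of non-limiting illustration, splitting a query into operator-based parts and free terms may be sketched as follows. The grammar shown (an `@` reference, `key:value` filters, and numeric comparisons) is an assumed subset inferred from example queries such as "@PatientSeqID afrac >0.05 tmb"; the remaining operators are simply skipped in this sketch.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN = re.compile(
    r"@(?P<ref>\w+)"                                    # @PatientSeqID
    r"|(?P<key>\w+):(?P<value>\w+)"                     # cohort:CohortID
    r"|(?P<field>\w+)\s*(?P<op>>=|<=|>|<|=)\s*(?P<num>\d+(?:\.\d+)?)"  # afrac >0.05
    r"|(?P<term>\w+)"                                   # free-text term
)


def parse_query(query):
    """Split a query into references, filters, numeric constraints, and terms."""
    parsed = {"references": [], "filters": {}, "constraints": [], "terms": []}
    for m in TOKEN.finditer(query):
        if m.group("ref"):
            parsed["references"].append(m.group("ref"))
        elif m.group("key"):
            parsed["filters"][m.group("key")] = m.group("value")
        elif m.group("field"):
            parsed["constraints"].append(
                (m.group("field"), m.group("op"), float(m.group("num")))
            )
        else:
            parsed["terms"].append(m.group("term"))
    return parsed
```

For instance, `parse_query("@PatientSeqID afrac >0.05 tmb")` separates the patient reference, the allele-fraction constraint, and the free term `tmb`, which downstream components could route to the appropriate index.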

FIG. 6 illustrates an example of a user interface 600 having a single search box 610 that allows a user to enter different queries and receive ranked results. Each variant can be displayed with abundant data including, for example, variant quality control, variant indices, allele frequencies compared to germline population databases, therapeutic drug annotations, comparisons to cancer databases and annotation sources, variant variants, and the ability to view mutations and surrounding sequencing reads using the Integrative Genomics Viewer (IGV) and to explore the variants in the UCSC Genome Browser.

Section 620 of UI 600 allows the user to check the location and quality of variant calls. Chromosomes, positions, and variants can be listed, with mutated bases highlighted in a color different from the reference. UCSC links allow the user to view the variants in a genome browser (allowing in-depth investigation of the variants). The actual sequencing reads can be visualized using IGV links, which allow the user to, for example, determine the reliability of variant calls and see whether variants appear in chaotic regions or whether calls are unreliable due to sequencing artifacts.

Section 630 of UI 600 lists information at the gene level. The gene name is listed and, when clicked, leads to in-depth information about the variant, including the gene summary and the frequency of the variant in the TCGA data. As such, the user can investigate whether the variant has been found, and at what frequency, in the same and other tumor types. Clinical trials for the variant and other relevant clinical information may be displayed. The HGVS tab shows protein-level variants. The Ensembl tab shows the transcripts used to map the proteins, and the dbSNP rsID is also listed. Variants can be compared to the frequency found in healthy populations (see "HLI healthy allele frequency" in FIG. 6). The PubMed tab links to papers in PubMed relating to this variant.

Section 640 of UI 600 may allow the user to perform quality control of variant calls. If RNA-Seq was also performed, RNA-Seq allele fractions are displayed. Tumor and normal allele fractions and read depths allow the user to determine the quality of the call, as well as any evidence of the presence of the variant in normal blood.

Section 650 of UI 600 provides clinical information (if available).

In various embodiments, the systems described herein may include an interface that allows a user to enter or use a user query. In various embodiments, the methods described herein may provide for entering or using a user query via an interface. As discussed above, in various embodiments, the user query may be spoken. In various embodiments, the user query may include, for example, a patient/individual ID number, a cohort name/ID number, a certain gene name or gene symbol, a particular annotation source, a variant, and/or a phenotype. In various embodiments, the input may be a check box or clickable button that limits or filters the output to a particular set of results, e.g., variants, genes, phenotypic data, a particular combination of multi-omic data streams, or statistically significant variants, genes, or pathways. In various embodiments, the results may be sortable, designated as favorites as appropriate, or exported to another program or to a dynamically generated report. In various embodiments, individual search terms may be combinable. In various embodiments, an individual (or user) may use additional user queries or filtering to search for additional information within a certain result set. Table 1 provides examples of desired information, example user inputs, and example outputs; it is not an exclusive or exhaustive list of the queries that may be deployed by a user.

Note that all references to the figures in Table 1 are for guidance only and are not meant to limit relative user inputs and example outputs with respect to the type of information desired by the user. For example, fig. 7 illustrates an example of search results obtained with a particular syntax ("fda + nccn @ PatientSeqID"), in accordance with various embodiments.

Further, for example, FIGS. 8a and 8b illustrate examples of search results obtained with a particular syntax ("@ PatientSeqID afrac >0.05 tmb"), in accordance with various embodiments. In particular, FIG. 8b illustrates the display of one of the non-silent mutations that contribute to the overall tumor mutational burden of the tumor in this particular example. In more detail, FIGS. 8a and 8b show an example of search results obtained using the specific syntax cited above, in which the user wishes to compute the tumor mutational burden using only mutations with an allele fraction greater than 5%. The tumor mutational burden can then be displayed against the background of tumor mutational burden values from The Cancer Genome Atlas (TCGA), grouped by cohort. The number of each type of non-silent mutation found in the tumor sample can also be shown in the illustrated pie chart (see FIG. 8b). This display allows the user to quickly assess potential cancer subtypes, potential sequencing problems, and the factors behind the tumor mutational burden value. The central region of the pie chart shows the total count of non-silent mutations. This total is further subdivided, in the ring outside the central region, into the types of non-silent mutations that have been identified (a legend is provided beside the pie chart). In many cancers (as seen in this example), missense mutations may be the most frequent. If frameshift mutations characteristic of microsatellite instability constitute a large part of the mutations, the pie chart display allows this to be checked quickly. Various sequencing artifacts may also result in a high percentage of mutation types that are not commonly seen in this cancer. The pie chart display can also be used to determine the clinical relevance of the tumor mutational burden. Some immunotherapeutics work best for tumors whose mutations consist primarily of frameshift mutations or other specific types of mutations. As such, the pie chart display allows the user to quickly assess those possibilities. Below the chart, the interface produces a ranked list of all non-silent variants with allele fractions greater than 5% (FIG. 8b shows a single hit due to insufficient space).
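
By way of non-limiting illustration, computing a tumor mutational burden restricted by allele fraction, together with the per-type counts shown in the pie chart, may be sketched as follows. The set of non-silent categories, the record fields, and the size of the callable region are hypothetical assumptions.

```python
from collections import Counter

# Assumed set of non-silent mutation categories (illustrative only).
NON_SILENT = {"missense", "nonsense", "frameshift", "inframe_indel", "splice_site"}


def tumor_mutational_burden(mutations, min_afrac, callable_megabases):
    """Return (mutations per megabase, counts by mutation type) using only
    non-silent mutations whose allele fraction exceeds min_afrac."""
    kept = [
        m for m in mutations
        if m["type"] in NON_SILENT and m["afrac"] > min_afrac
    ]
    tmb = len(kept) / callable_megabases
    return tmb, Counter(m["type"] for m in kept)


# Hypothetical mutation calls for one tumor sample.
CALLS = [
    {"type": "missense", "afrac": 0.30},
    {"type": "missense", "afrac": 0.04},    # below the 5% threshold
    {"type": "frameshift", "afrac": 0.12},
    {"type": "synonymous", "afrac": 0.50},  # silent, excluded
    {"type": "nonsense", "afrac": 0.06},
]
TMB, TYPE_COUNTS = tumor_mutational_burden(CALLS, 0.05, 30.0)
```

The total in `TYPE_COUNTS` corresponds to the count shown at the center of the pie chart, and the individual entries to its outer segments.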

Further, for example, FIG. 9 illustrates an example of search results returned from a user query, in accordance with various embodiments. In particular, FIG. 9 illustrates a non-limiting example of search results obtained with a particular grammar "@ PatientSeqID mutsig". Mutation signatures are the overall pattern of base pair changes that occur in all genes in a tumor. Mutation signatures can be derived by computing the changes in all base pairs in a context to derive the overall pattern of mutation occurrence. Easy to use mutation signature definitions can be found at https:// cancer. Identification of mutation signatures can guide therapy, can help explain underlying causes of tumors, and can help address variants of unknown significance. Therefore, the mutation signature is important for analyzing the overall characteristics of the tumor.

Part A of FIG. 9 shows an X-Y plot of the types of base pair substitution (i.e., C > A, C > G, C > T, T > A, T > C, T > G) in the context of the base pairs surrounding the mutation (3 bp, shown on the X-axis). The frequency of each mutation type is plotted on the Y-axis. In this example case, the plot is compared to the signatures identified by COSMIC to derive an overall mutational signature of the tumor.
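
By way of non-limiting illustration, tallying substitutions into the six substitution types within their 3 bp context (6 × 16 = 96 channels) may be sketched as follows. Folding purine-reference mutations onto the pyrimidine strand by reverse complement follows the usual convention for such spectra; the channel label format and function names are assumptions made for the example.

```python
from collections import Counter
from itertools import product

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def channels():
    """All 96 channel labels, e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product("ACGT", repeat=2)
    ]


def channel(context, alt):
    """Label for one substitution. `context` is the 3 bp reference sequence
    centered on the mutated base; `alt` is the alternate base."""
    if context[1] in "AG":
        # Fold onto the pyrimidine strand via reverse complement.
        context = context.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def spectrum(mutations):
    """Frequency of each of the 96 channels for (context, alt) pairs."""
    counts = Counter(channel(c, a) for c, a in mutations)
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in channels()}
    return {ch: counts.get(ch, 0) / total for ch in channels()}
```

The resulting frequency vector is what would be plotted on the Y-axis and compared against reference signatures.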

Part B shows the percentage of the overall mutation signature found in the tumor on a pie chart. The display may allow the user to determine the primary signature in the tumor and any secondary signatures identified. In this example, from melanoma tumors, the main signature shown is S7, which is consistent with literature. If the displayed mutation signature is not expected for that cancer type, the user may conduct further investigation.

Mutation signatures can also help guide clinical decisions. For example, consider mutations in BRCA1/2 in breast and ovarian cancer. PARP inhibitors may be used in breast and ovarian cancer patients with BRCA1/2 mutations. COSMIC signature 3 may be characterized by a defect in BRCA or pathway genes, such that identification of signature 3 in a tumor indicates a BRCA mutational process, even in the absence of an identified mutation. If the tumor contains a BRCA mutation of unknown significance, then assaying for the presence of signature 3 can help determine whether the mutation is functional. In both cases, the potential benefits of PARP inhibitors can be explored.

Another function accessible here is the reconstruction weight (not shown) for each of the 96 trinucleotide contexts.

Additionally, for example, FIG. 10 illustrates an example of search results returned from a user query, in accordance with various embodiments. In particular, FIG. 10 illustrates a non-limiting example of search results obtained with a particular syntax "cohort: CohortID tmb". The query in this case may be to identify the tumor mutational burden in the cohort. The tumor mutational burden (TMB, mutations/Mb) of each tumor in the cohort (circles with their associated numerical TMB values) can be compared to the TMB of tumors of the same cancer type (in this case, pancreatic cancer, PAAD) from The Cancer Genome Atlas (TCGA) (the remaining circles, which form the majority on this figure and have no associated TMB values). The TMB is represented on the Y-axis, which allows the user to see whether the TMB identified in the cohort is consistent with a priori knowledge about the cancer. The TCGA median for PAAD is shown as the horizontal line in the middle of the box. The box-and-whisker representation allows the user to see whether the cohort samples fall within the typical range or the outlier range found in TCGA.

Referring to FIG. 10, a cohort TMB chart 500 is provided in which the TMB 510 is represented on a Y-axis 512. The tumor mutational burden (TMB, mutations/Mb) of each tumor in the cohort is shown as a first point 520 having an associated numerical TMB value 522. Those values are compared to the TMB of tumors of the same cancer type (in this case, pancreatic cancer, PAAD) from The Cancer Genome Atlas (TCGA), represented by second points 530, which have no TMB values associated with them and, in this example, form the majority of the plotted points.
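
By way of non-limiting illustration, placing a cohort sample's TMB relative to a reference distribution in box-and-whisker terms may be sketched as follows. The whisker bounds at 1.5 times the interquartile range are an assumed convention, and the reference values are hypothetical.

```python
import statistics


def box_stats(values):
    """Quartiles and assumed 1.5 x IQR whisker bounds of a reference set."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low": q1 - 1.5 * iqr,
        "high": q3 + 1.5 * iqr,
    }


def classify(sample_tmb, reference_tmbs):
    """Report where a sample's TMB falls relative to the reference box plot."""
    s = box_stats(reference_tmbs)
    if sample_tmb < s["low"] or sample_tmb > s["high"]:
        return "outlier"
    if s["q1"] <= sample_tmb <= s["q3"]:
        return "within interquartile range"
    return "within whiskers"


# Hypothetical reference TMB values (mutations/Mb) for one cancer type.
REFERENCE = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

A sample classified as an outlier in this sense would be one that warrants a closer look for, e.g., hypermutation or sequencing artifacts.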

Further, for example, FIG. 11 illustrates an example of search results returned from a user query, in accordance with various embodiments. In particular, FIG. 11 shows an integrated summary of multiple genomic alterations and clinical information examined in a sample cohort in response to a user query "cohort: CohortID panel: cgc noise" that requests a summary of the non-silent mutations of a cancer gene panel in a particular cohort. Effectively, the query in this case may be to identify whether the samples in a given cohort have the same number and types of mutations. Each tumor sample may be displayed in a column and each gene in a row, and available clinical information may be added to the table. The plot may be stratified by any of the displayed clinical parameters. The plot may be initially ranked (as shown) by the most frequently mutated cancer genes in the cohort, and may show the frequency of the gene classes. Mutation types (e.g., missense, nonsense, frameshift) may be distinguished using different box colors (see section B of FIG. 11). In the example shown, the driver gene (NRAS) carries missense mutations, as expected. The total mutation count for each sample can also be displayed, and the user can use this information to sort the plot. This display feature allows the user to perform in-depth analysis of the cohort, as well as to identify specific alterations in any individual sample. The co-occurrence or mutual exclusion of mutations can be seen in this figure. Individual mutations may be listed below the plot (not shown).

In the case illustrated in FIG. 11, part A shows that the leftmost sample has the highest mutation count. Within this cohort, the mutation types are fairly consistent. In some cases, samples with very high mutation counts and a high proportion of frameshift mutations may be observed. Such an observation may require further exploration to determine whether the sample is microsatellite unstable or whether an artifact is present. In addition, the third sample from the left lacks the NRAS mutation found in the remaining samples, and its number and types of mutations differ from the rest of the cohort. This observation may require more thorough exploration to determine whether the difference is an artifact or is biological. Section C illustrates a mutation table plot that can be sorted using clinical data.

Further, for example, FIG. 12 illustrates an example of search results returned from a user query, in accordance with various embodiments. In particular, FIG. 12 shows a non-limiting example of search results obtained with the specific syntax "cohort: nonresponsors EGFR", where the user wishes to compare EGFR gene mutations in two sub-cohorts: responders and non-responders. The ranked individual mutations may be listed below (not shown in this figure). In this example, part A provides a gene-level schematic of germline/somatic EGFR mutations in the two cohorts (responders versus non-responders). Part B provides a 3D protein structure highlighting the positions affected by hotspot mutations that cluster near the binding site of the drug (gefitinib) in the two cohorts.

FIG. 13 is a block diagram that illustrates a computer system 1000 upon which an embodiment or portion of an embodiment of the present teachings may be implemented. In various embodiments of the present teachings, computer system 1000 may include a bus 1002 or other communication mechanism for communicating information, and a processor 1004 coupled with bus 1002 for processing information. In various embodiments, computer system 1000 may also include a memory 1006, which memory 1006 may be a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 1002 for determining instructions to be executed by processor 1004. The memory 1006 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor 1004. In various embodiments, computer system 1000 may also include a Read Only Memory (ROM)1008 or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004. A storage device 1010, such as a magnetic disk or optical disk, may be provided and coupled to bus 1002 for storing information and instructions.

In various embodiments, computer system 1000 may be coupled via bus 1002 to a display 1012, such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), for displaying information to a computer user. An input device 1014, including alphanumeric and other keys, may be coupled to bus 1002 for communicating information and command selections to processor 1004. Another type of user input device is cursor control 1016, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 1004 and for controlling cursor movement on display 1012. This input device 1014 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allow the device to specify positions in a plane. However, it should be understood that input devices 1014 that allow 3-dimensional (x, y, and z) cursor movement are also contemplated herein. Further capabilities of display and input devices (also referred to herein as interfaces) are discussed in more detail below.

Consistent with certain implementations of the present teachings, the results may be provided by the computer system 1000 in response to the processor 1004 executing one or more sequences of one or more instructions contained in memory 1006. Such instructions may be read into memory 1006 from another computer-readable medium or computer-readable storage medium, such as storage device 1010. Execution of the sequences of instructions contained in memory 1006 may cause processor 1004 to perform processes described herein. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement the present teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" (e.g., a data store, a data storage device, etc.) or "computer-readable storage medium", as used herein and discussed in more detail below, refers to any medium that participates in providing instructions to processor 1004 for execution. Such a medium may take many forms, including, but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media may include, but are not limited to, optical disks, solid-state drives, and magnetic disks, such as storage device 1010. Examples of volatile media may include, but are not limited to, dynamic memory, such as memory 1006. Examples of transmission media may include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 1002.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read. Further discussion regarding the media is provided below.

In addition to computer-readable media, instructions or data may also be provided as signals on transmission media included in a communication device or system to provide one or more sequences of instructions to processor 1004 of computer system 1000 for execution. For example, the communication device may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein. Representative examples of data communication transmission connections may include, but are not limited to, telephone modem connections, Wide Area Networks (WANs), Local Area Networks (LANs), infrared data connections, NFC connections, and the like. Further discussion regarding data communication is provided below.

It should be understood that the methods described herein, including the flowcharts, illustrations, and accompanying disclosure, may be implemented using the computer system 1000 as a standalone device or on a distributed network (e.g., a cloud computing network) of shared computer processing resources.

It should also be understood that, in certain embodiments, a machine-readable storage device is provided for storing non-transitory machine-readable instructions for executing or carrying out the methods described herein. The machine-readable instructions may control all aspects of the systems and methods described herein. Further, the machine-readable instructions may be initially loaded into a memory module or accessed via the cloud or via an API.

In various embodiments, the systems and methods described herein may include or use a digital processing device. In various embodiments, the digital processing device may include one or more hardware Central Processing Units (CPUs) or general-purpose graphics processing units (GPGPUs) that perform device functions. In various embodiments, the digital processing device further comprises an operating system configured to execute the executable instructions. In various embodiments, the digital processing device may optionally be connected to a computer network. In various embodiments, the digital processing device may optionally be connected to the internet such that it accesses the World Wide Web. In various embodiments, the digital processing device may optionally be connected to a cloud computing infrastructure. In various embodiments, the digital processing device may optionally be connected to an intranet. In various embodiments, the digital processing device may optionally be connected to a data storage device.

According to various embodiments, suitable digital processing devices may include, by way of non-limiting example, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, handheld computers, internet appliances, mobile smartphones, tablet computers, and personal digital assistants. One of ordinary skill in the art will recognize that many smartphones are suitable for use in the systems described herein. Those of ordinary skill in the art will also recognize that select televisions, video players, and digital music players with optional computer network connectivity are suitable for use in the systems described herein. Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of ordinary skill in the art.

In various embodiments, the digital processing device includes an operating system configured to execute executable instructions. The operating system may be, for example, software, including programs and data, that manages the hardware of the device and provides services for the execution of applications. One of ordinary skill in the art will recognize that suitable server operating systems include, by way of non-limiting example, FreeBSD, OpenBSD, NetBSD, Linux, and Mac OS. Those of ordinary skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting example, Mac OS and UNIX-like operating systems. In various embodiments, the operating system is provided by cloud computing. One of ordinary skill in the art will also recognize that suitable mobile smartphone operating systems include, by way of non-limiting example, Research In Motion BlackBerry OS and Windows OS.

In various embodiments, the device includes a storage and/or memory device. A storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In various embodiments, the device is volatile memory and requires power to maintain the stored information. In various embodiments, the device is non-volatile memory and retains stored information when the digital processing device is not powered. In various embodiments, the non-volatile memory includes flash memory. In some embodiments, the volatile memory comprises Dynamic Random Access Memory (DRAM). In various embodiments, the non-volatile memory comprises Ferroelectric Random Access Memory (FRAM). In various embodiments, the non-volatile memory includes Phase-change Random Access Memory (PRAM). In various embodiments, the device is a storage device, including, by way of non-limiting example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, tape drives, optical disk drives, and cloud-based storage. In various embodiments, the storage and/or memory device is a combination of devices such as those disclosed herein.

In various embodiments, the digital processing device includes a display for sending visual information to a user. In various embodiments, the display is a Cathode Ray Tube (CRT). In various embodiments, the display is a Liquid Crystal Display (LCD). In various embodiments, the display is a thin-film transistor liquid crystal display (TFT-LCD). In various embodiments, the display is an Organic Light Emitting Diode (OLED) display. In various embodiments, the OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In various embodiments, the display is a plasma display. In various embodiments, the display is a video projector. In various embodiments, the display is a combination of devices such as those disclosed herein.

In various embodiments, the digital processing device includes an input device for receiving information from a user. In various embodiments, the input device is a keyboard. In various embodiments, the input device is a pointing device, including by way of non-limiting example, a mouse, trackball, trackpad, joystick, game controller, or stylus. In various embodiments, the input device is a touch screen or a multi-touch screen. In various embodiments, the input device is a microphone for capturing voice or other sound input. In various embodiments, the input device is a video camera or other sensor for capturing motion or visual input. In various embodiments, the input device is a Kinect, Leap Motion, or the like. In various embodiments, the input device is a combination of devices such as those disclosed herein.

In various embodiments, the systems disclosed herein may include one or more non-transitory computer-readable storage media on which the methods herein may run, the non-transitory computer-readable storage media being encoded with a program comprising instructions executable by the operating system of an optionally networked digital processing device. In various embodiments, the computer-readable storage medium is a tangible component of a digital processing device. In various embodiments, the computer-readable storage medium is optionally removable from the digital processing device. In various embodiments, the computer-readable storage medium includes, by way of non-limiting example, CD-ROMs, DVDs, flash memory devices, solid-state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In various embodiments, programs and instructions are encoded on the media permanently, substantially permanently, semi-permanently, or non-transitorily.

In various embodiments, the systems and methods disclosed herein may include or use at least one computer program. The computer program includes a sequence of instructions, executable in the CPU of the digital processing device, written to perform specified tasks. Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, etc., that perform particular tasks or implement particular abstract data types. One of ordinary skill in the art will recognize that a computer program may be written in various versions of various languages.

The functionality of the computer-readable instructions may be combined or distributed as desired in various environments. In various embodiments, a computer program comprises a sequence of instructions. In various embodiments, a computer program comprises a plurality of sequences of instructions. In various embodiments, the computer program is provided from one location. In various embodiments, the computer program is provided from multiple locations. In various embodiments, the computer program includes one or more software modules. In various embodiments, the computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or a combination thereof.

In various embodiments, the computer program comprises a web application. One of ordinary skill in the art will recognize that, in various embodiments, a web application utilizes one or more software frameworks and one or more database systems. In various embodiments, the web application is built on a software framework such as .NET or Ruby on Rails (RoR). In various embodiments, the web application utilizes one or more database systems, including, by way of non-limiting example, relational, non-relational, object-oriented, associative, and XML database systems. In various embodiments, suitable relational database systems include, by way of non-limiting example, SQL Server and MySQL™.

One of ordinary skill in the art will also recognize that, in various embodiments, web applications are written in one or more versions of one or more languages. The web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or a combination thereof. In various embodiments, web applications are written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or Extensible Markup Language (XML). In various embodiments, web applications are written to some extent in a presentation definition language such as Cascading Style Sheets (CSS). In various embodiments, web applications are written to some extent in a client-side scripting language such as Asynchronous JavaScript and XML (AJAX), ActionScript, or JavaScript. In various embodiments, web applications are written to some extent in a server-side coding language such as Active Server Pages (ASP), Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, or Groovy. In various embodiments, web applications are written to some extent in a database query language such as Structured Query Language (SQL). In various embodiments, web applications integrate enterprise server products.

In various embodiments, the web application includes a media player element. In various embodiments, the media player element utilizes one or more of many suitable multimedia technologies, including, by way of non-limiting example, HTML 5 and Java™.

In various embodiments, the computer program comprises a mobile application provided to a mobile digital processing device. In various embodiments, the mobile application is provided to the mobile digital processing device at the time of manufacture. In various embodiments, the mobile application is provided to the mobile digital processing device via a computer network as described herein.

The mobile application may be created, by techniques known to those of ordinary skill in the art, using hardware, languages, and development environments known in the art. One of ordinary skill in the art will recognize that mobile applications may be written in several languages. By way of non-limiting example, suitable programming languages include C, C++, C#, Objective-C, Java™, JavaScript, Pascal, Object Pascal, Python™, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.

Suitable mobile application development environments are available from several sources. By way of non-limiting example, commercially available development environments include AirplaySDK, alcheMo, Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform. Other development environments are freely available, including, by way of non-limiting example, Lazarus, MobiFlex, MoSync, and PhoneGap. In addition, mobile device manufacturers also distribute software developer kits, including, as non-limiting examples, the iPhone and iPad (iOS) SDK, Android™ SDK, BREW SDK, Symbian SDK, and webOS SDK.

One of ordinary skill in the art will recognize that several commercial forums are available for distributing mobile applications, including, by way of non-limiting example, App Store, Chrome WebStore, App World, App Store for Palm devices, App Catalog for webOS, Ovi Store, and Nintendo DSi Shop.

In various embodiments, the computer program comprises a standalone application, which is a program that runs as an independent computer process rather than as an add-on to an existing process (e.g., rather than as a plug-in). As will be appreciated by one of ordinary skill in the art, standalone applications are often compiled. A compiler is a computer program (or set of programs) that converts source code written in a programming language into binary object code, such as assembly language or machine code. By way of non-limiting example, suitable compiled programming languages include C, C++, Objective-C, COBOL, Delphi, Eiffel, Java™, Lisp, Python™, Visual Basic, and VB.NET. Compilation is typically performed, at least in part, to create an executable program. In various embodiments, the computer program comprises one or more executable compiled applications.

In various embodiments, the computer program includes a web browser plug-in (e.g., an extension). In computing, a plug-in is one or more software components that add specific functionality to a larger software application. Producers of software applications support plug-ins to enable third-party developers to create capabilities that extend an application, to support the easy addition of new functionality, and to reduce the size of an application. When supported, plug-ins enable customization of the functionality of a software application. For example, plug-ins are commonly used in web browsers to play videos, generate interactivity, scan for viruses, and display specific file types. One of ordinary skill in the art will be familiar with several web browser plug-ins.

In various embodiments, the toolbar includes one or more web browser extensions, add-ins, or add-ons. In various embodiments, the toolbar includes one or more explorer bars, toolbars, or desktop bars.

One of ordinary skill in the art will recognize that several plug-in frameworks are available that enable the development of plug-ins in a variety of programming languages, including, but not limited to, C++, Delphi, Java™, PHP, Python™, and VB.NET, or a combination thereof.

A web browser (also known as an Internet browser) is a software application designed for use with networked digital processing devices to retrieve, present, and traverse information resources on the World Wide Web. By way of non-limiting example, suitable web browsers include Internet Explorer, Chrome, Opera, and KDE Konqueror. In various embodiments, the web browser is a mobile web browser. Mobile web browsers (also known as microbrowsers, mini-browsers, and wireless browsers) are designed for use on mobile digital processing devices, including, by way of non-limiting example, handheld computers, tablet computers, netbook computers, sub-notebook computers, smartphones, and Personal Digital Assistants (PDAs). By way of non-limiting example, suitable mobile web browsers include RIM BlackBerry Browser, Blazer, Internet Explorer Mobile, Basic Web, and the Sony PSP™ browser.

In various embodiments, the systems and methods disclosed herein include, or are used in conjunction with, software, server, and/or database modules in methods according to various embodiments disclosed herein. Software modules may be created using machines, software, and languages known in the art by techniques known to those of ordinary skill in the art. The software modules disclosed herein are implemented in a variety of ways. In various embodiments, a software module comprises a file, a code segment, a programming object, a programming structure, or a combination thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of code segments, a plurality of programming objects, a plurality of programming structures, or a combination thereof. In various embodiments, the one or more software modules include, by way of non-limiting example, a web application, a mobile application, and a standalone application. In various embodiments, the software modules are in one computer program or application. In various embodiments, the software modules are in more than one computer program or application. In various embodiments, the software modules are hosted on one machine. In various embodiments, the software modules are hosted on more than one machine. In various embodiments, the software modules are hosted on a cloud computing platform. In various embodiments, software modules are hosted on one or more machines in one location. In various embodiments, software modules are hosted on one or more machines in more than one location.

In various embodiments, the systems and methods disclosed herein include one or more databases, or use of the same, in methods according to various embodiments disclosed herein. One of ordinary skill in the art will recognize that many databases are suitable for storing and retrieving user, query, token, and result information. In various embodiments, suitable databases include, by way of non-limiting example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Other non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase. In various embodiments, the database is internet-based.

In various embodiments, the database is web-based. In various embodiments, the database is based on cloud computing. In other embodiments, the database is based on one or more local computer storage devices.

In various embodiments, the systems and methods disclosed herein include one or more features for preventing unauthorized access. The security measures may, for example, protect the user's data. In various embodiments, the data is encrypted. In various embodiments, access to the system requires multi-factor authentication and an access control layer. In various embodiments, access to the system requires two-step authentication (e.g., via a web-based interface). In various embodiments, the two-step authentication requires the user to enter, in addition to a username and password, an access code that is sent to the user's email or cell phone. In some instances, the user is locked out of the account after failing to enter the correct username and password. In various embodiments, the systems and methods disclosed herein also include mechanisms for protecting the anonymity of a user's genome and of the user's searches in any genome.
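A minimal sketch of the two-step flow with lockout is given below. It is illustrative only: the `Account` class, the six-digit code format, and the `MAX_FAILED_ATTEMPTS` threshold are assumptions, since the text does not specify them, and a production system would store salted password hashes and deliver the code out of band rather than return it.

```python
import hmac
import secrets

MAX_FAILED_ATTEMPTS = 3  # assumed lockout threshold; not specified in the text

class Account:
    def __init__(self, username, password):
        self.username = username
        self._password = password   # a real system would store a salted hash
        self.failed_attempts = 0
        self.locked = False
        self.pending_code = None

    def check_password(self, password):
        """Step 1: verify the password and, on success, issue a one-time code."""
        if self.locked:
            return None
        if not hmac.compare_digest(password.encode(), self._password.encode()):
            self.failed_attempts += 1
            if self.failed_attempts >= MAX_FAILED_ATTEMPTS:
                self.locked = True
            return None
        self.failed_attempts = 0
        self.pending_code = f"{secrets.randbelow(10**6):06d}"
        return self.pending_code  # would be sent by email or SMS in practice

    def check_code(self, code):
        """Step 2: verify the one-time code; each code is valid only once."""
        if self.locked or self.pending_code is None:
            return False
        ok = hmac.compare_digest(code.encode(), self.pending_code.encode())
        self.pending_code = None
        return ok
```

The constant-time comparison (`hmac.compare_digest`) and single-use code are design choices of this sketch, not requirements stated in the disclosure.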

In various embodiments, the systems and methods described herein may assist an oncologist in deriving clinical insight in a collaborative environment, during case review or during a virtual tumor board, by: allowing exploration of the data of a patient or set of patients at any level of the cancer bioinformatics pipeline; verifying which cancer alterations are real and do not represent sequencing artifacts; reporting quality control values; integrating multi-omic data streams and advanced analyses to provide a key dashboard or "error-free" listing of cancer characteristics and findings; and providing clinical, prognostic, diagnostic, and therapeutic information for each ranked result returned. In various embodiments, the multi-omic cancer search described herein provides "enhanced intelligence" to physicians to aid in clinical decision-making.

According to various embodiments, users of the systems and methods described herein may include clinicians. These users can use the systems and methods described herein to generate comprehensive reports of drug targets and critical alterations in the tumor (and normal) genome.

According to various embodiments, the systems and methods described herein may be used in a virtual tumor board. According to various embodiments, the systems and methods described herein may be used by an individual clinician as a checklist of important tumor properties that should not be missed, as well as for reviewing clinical trials available at the oncology institution or globally. According to various embodiments, the systems and methods described herein may be used by an oncologist during a patient-oncologist visit. In various embodiments, multiple clinicians may use a collaborative function that queries, visualizes, and re-ranks clinically actionable and pathogenic cancer alterations to help navigate the available phenotypic, imaging, and literature data during a virtual molecular tumor board to determine the optimal diagnosis and treatment. Some non-limiting examples of questions that may be answered by the systems and methods described herein include: What are the clinically relevant cancer variants? Is there a potential therapy (FDA approved, NCCN, clinical trial)? Is the mutation identified in the tumor real? Is it supported by high-quality sequencing reads? Is the mutation in a region that is difficult to sequence? Is it present only in the tumor and not in normal tissue? Is it expressed in RNA? Is the mutation functional? What are the global tumor properties, such as tumor mutation burden or microsatellite instability? The system may display multiple metrics that may be used to determine overall quality and the quality of individual variants. Systems and methods according to various embodiments may provide for comparing a patient's mutations to mutations that have been previously described in public data sets such as The Cancer Genome Atlas (TCGA). Systems and methods according to various embodiments may provide for comparing multiple biopsies of the same patient.

In various embodiments, users of the systems and methods described herein may include biopharmaceutical or academic researchers, who may, for example, perform cohort tumor profiling to characterize the genetic profiles of patients with good or poor prognosis or of responders and non-responders, perform quality control checks, identify drug targets, stratify cohorts with respect to potential drug response biomarkers, and perform fast and iterative hypothesis generation prior to more extensive analysis of additional validation or test cohorts. In various embodiments, ranked biomarkers that can stratify the cohorts, the statistical significance of the biomarkers, and summary visualizations can be returned by the system. In various embodiments, validation queries may be suggested by the search engine to perform robust algorithmic and statistical validation. In various embodiments, the system may automatically suggest iterative hypothesis refinements via proposed query refinements.

In various embodiments, the systems and methods described herein may, for example: identify proteins, pathways, and mutational processes associated with survival, resistance, or response; explore in depth any differences found in a cohort; compare results with other data sets; check the quality control of the cohorts against any of the quality control parameters to ensure that the cohort analysis is reliable and unbiased; investigate any anomalous results to ensure that they are not due to systematic problems; drill down into a single sample or outlier to ensure that it is a true result; explore further and rapidly obtain the statistical significance of an analysis; perform multi-omic data exploration; and search literature and annotation sources for potential therapies. Standard bioinformatic analysis typically does not provide the ability to use domain knowledge to interactively query data and refine hypotheses. In-house systems are often based on database systems rather than on search indexes that can provide relevance ranking, integrate multiple information streams (e.g., genomic, transcriptomic, annotation, literature), and include relevant built-in machine learning models (such as those discussed herein).
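One way to obtain the statistical significance of a cohort comparison quickly is a two-sided Fisher's exact test on a 2x2 table of mutated versus wild-type samples in two cohorts. The text does not name a specific test; the choice of Fisher's exact test, the function name, and the example counts below are assumptions made for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: mutated / wild-type samples in cohort 1 (e.g., responders)
    c, d: mutated / wild-type samples in cohort 2 (e.g., non-responders)
    Returns the sum of the probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x mutated samples in cohort 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Hypothetical counts: 4 of 4 responders mutated, 0 of 4 non-responders mutated.
p_value = fisher_exact_two_sided(4, 0, 0, 4)
```

An exact test is used here because cohort sizes in exploratory queries are often small; for larger cohorts a chi-squared approximation would serve the same purpose.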

As discussed above, according to various embodiments, the systems and methods described herein may be configured to provide dynamically hyperlinked variant reports for individual patients and cohorts, where all terms on the report are hyperlinked to a multi-omic cancer search query. In various embodiments, hyperlinked report content is dynamically generated based on the queries that the user makes, highlights, and saves for reporting purposes.
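The idea of a report whose terms link back to search queries can be sketched as follows. This is an assumption-laden illustration: the `SEARCH_URL` placeholder, the `patient:` filter syntax, and the function names are hypothetical and are not taken from the disclosure.

```python
from html import escape
from urllib.parse import quote

# Placeholder endpoint for the search engine; not a real URL from the source.
SEARCH_URL = "https://search.example.org/query"

def link(term, patient_id):
    """Wrap a report term in a hyperlink that re-runs a search for that term."""
    query = f"patient:{patient_id} {term}"   # hypothetical query syntax
    href = f"{SEARCH_URL}?q={quote(query)}"
    return f'<a href="{escape(href)}">{escape(term)}</a>'

def render_report(patient_id, findings):
    """Render (gene, variant) findings as an HTML list of hyperlinked terms."""
    rows = [
        f"<li>{link(gene, patient_id)} {link(variant, patient_id)}</li>"
        for gene, variant in findings
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

report = render_report("P001", [("EGFR", "L858R")])
```

Because each link encodes a query rather than a stored result, following it would return whatever the index contains at that moment, which is one way a report could stay current as new data is indexed.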

As discussed above, according to various embodiments, the systems and methods described herein may be configured with expert review capabilities, giving the user the ability to select which query results to use for real-time generation of hyperlinked reports.

In various embodiments, the dynamic report is never outdated because it is updated based on newly indexed information. In addition, the user may be notified of any newly available annotations, drugs, or clinical trials.

In various embodiments, the systems and methods provided herein may allow analysis to extend beyond both static clinical reports and pre-computed cancer portal analyses by dynamically generating hyperlinked reports for individual patients or cohorts. Examples of such reports include, but are not limited to, tumor profiling, drug and trial matching, and immune reports for individual samples, and cohort profiling reports for cohorts of samples. The report may be customized based on the user query, and in various embodiments, the report contains user-selected results returned by the multi-omic cancer search.

Applicants have advantageously discovered that a dynamic reporting paradigm based on a multi-omic cancer search system can provide for: (1) user interaction with data beyond what is possible with standard static PDF reports, which cannot be modified or updated after an extensive bioinformatics pipeline has been run; (2) ranking of all multi-omic cancer alterations by their clinical actionability, pathogenicity, feature weight, or frequency; (3) user queries at any level of pipeline output, from BAM to VCF to the output of more complex analyses; and (4) user review not only of machine learning model predictions but also of the ranked list of features that drive a particular prediction.
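Ranking by the four criteria named in item (2) can be illustrated with a simple weighted score. The sketch below is not the ranking method of the disclosure: the `Hit` record, the normalization of each criterion to the range 0 to 1, and the default weights are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    variant: str
    actionability: float   # each criterion assumed normalized to 0..1
    pathogenicity: float
    feature_weight: float
    frequency: float

# Assumed default weights; the text does not specify how criteria are combined.
DEFAULT_WEIGHTS = {
    "actionability": 0.4,
    "pathogenicity": 0.3,
    "feature_weight": 0.2,
    "frequency": 0.1,
}

def score(hit, weights=DEFAULT_WEIGHTS):
    """Weighted sum over whichever criteria appear in `weights`."""
    return sum(w * getattr(hit, name) for name, w in weights.items())

def rank(hits, weights=DEFAULT_WEIGHTS):
    """Return hits ordered from highest to lowest score."""
    return sorted(hits, key=lambda h: score(h, weights), reverse=True)

hits = [
    Hit("variant_B", actionability=0.1, pathogenicity=0.3, feature_weight=0.5, frequency=0.6),
    Hit("variant_A", actionability=0.9, pathogenicity=0.8, feature_weight=0.2, frequency=0.1),
]
by_default = [h.variant for h in rank(hits)]
by_frequency = [h.variant for h in rank(hits, {"frequency": 1.0})]
```

Passing a weights dictionary with a single key ranks by that criterion alone, which mirrors the "at least one of" wording: the same results can be re-ranked by actionability, pathogenicity, feature weight, or frequency on request.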

Claims

1. A method for tumor profiling using multi-omic data indexing, the method comprising:

storing a plurality of multi-omic data indices, wherein each multi-omic data index of the plurality of multi-omic data indices comprises cancer-specific tokenized data;

ingesting additional multi-omic data and any annotations associated with the additional multi-omic data, the additional multi-omic data relating to one or more of the indices;

indexing the ingested additional multi-omic data and the annotations while preserving, in a particular said index, a multi-omic mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized ingested additional multi-omic data;

receiving a user query;

selecting one or more relevant multi-omic data indices based on the user query;

ranking the selected one or more multi-omic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency; and

returning the ranked one or more multi-omic data indices to the user.

2. The method of claim 1, wherein the multi-omics data are selected from the group consisting of: genomic, transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic, phenotypic, imagery, related literature, integrated multi-omics data, and combinations of the foregoing.

3. The method of claim 1, wherein the plurality of multi-omics data indices further comprise somatic genomic alterations, normal genomic alterations, and cancer annotation sources.

4. The method of claim 1, further comprising deriving a cancer analysis for the one or more selected multi-omics data indices, wherein the cancer analysis includes a tumor characteristic selected from the group consisting of: quality control, tumor mutation burden, genomic mutation signature, microsatellite instability status, neoantigens, HLA allelic typing, RNA confirmed variants, copy number variants, structural variants, non-coding regulatory variants, gene fusion, pathway enrichment, cancer driver identification, mutation profiling, differential gene expression, immune signatures, match information on similar patient treatment outcomes, and combinations of the foregoing.

5. The method of claim 4, wherein the cancer analysis is derived for individual samples or cohorts of samples.

6. The method of claim 4, wherein the cancer analysis comprises machine learning prediction and ranked features.

7. The method of claim 6, wherein the machine learning prediction is selected from the group consisting of: major primary site classifier, future metastatic site prediction classifier, microsatellite instability status prediction, neoantigen binding affinity prediction, disease status stratification, determining cancer lineage, and combinations of the foregoing.

8. The method of claim 1, further comprising propagating annotations from a higher level of the genome hierarchy to a lower level of the genome hierarchy.

9. The method of claim 1, further comprising ranking the one or more selected multi-omics data indices from a higher level of a genome hierarchy to a lower level of a genome hierarchy.

10. The method of claim 1, wherein the ranking comprises a clinical ranking and a pathogenicity ranking for cancer variants and genes.

11. The method of claim 1, wherein the ranking comprises stratifying cohort groups by incorporating latent space representations of cancer data.

12. The method of claim 11, wherein the cohort groups are stratified into responders and non-responders.

13. The method of claim 11, wherein the cohort groups are stratified into a long-term progression-free survival time and a short-term progression-free survival time.

14. The method of claim 11, wherein the cohort groups are stratified into different subtypes of cancer.

15. The method of claim 11, wherein the latent space representation is performed by a neural network.

16. The method of claim 11, wherein the latent space representation is performed by a dimensionality reduction technique.

17. The method of claim 16, wherein the neural network is selected from the group consisting of: an autoencoder, a variational autoencoder, a deep belief network, a restricted Boltzmann machine, a feed-forward network, a convolutional network, a recurrent network, a gated recurrent network, a long short-term memory network, a residual network, and a generative adversarial network.

18. The method of claim 1, wherein the ranking further comprises a model for learning a ranking selected from the group consisting of: support vector machines, boosted decision trees, regression methods, neural networks, and combinations of the foregoing.

19. The method of claim 1, wherein the ranking further comprises a deep learning ranking.

20. The method of claim 19, wherein the deep learning ranking is derived from a deep learning model selected from the group consisting of: a deep semantic similarity model, a convolutional deep semantic similarity model, a recurrent deep semantic similarity model, a deep relevance matching model, a wide and deep model, a deep language model, a transformer network, a long short-term memory network, a learned deep learning text embedding, a learned named entity recognition, a Siamese neural network, an interaction Siamese network, a lexical and semantic matching network, and combinations of the foregoing.

21. The method of claim 1, wherein the multi-omics data are selected from the group consisting of: somatic calling from whole genome sequence data, somatic calling from whole exome sequence data, somatic sequencing from a somatic stack of fresh frozen tissue, somatic sequencing from a formalin fixed paraffin embedded tissue, somatic sequencing from a liquid biopsy, tumor and normal variant calling, tumor/normal transcriptome data indexed as variants at RNA or gene expression levels, epigenetic data, chromatin accessibility data, microbiome data, proteome data, single cell sequencing data, and combinations of the foregoing.

22. The method of claim 1, wherein the multi-omics data index further comprises extracted phenotypic data.

23. The method of claim 22, wherein said phenotypic data is selected from the group consisting of: electronic health records, clinical data, functional data, and combinations of the foregoing.

24. The method of claim 1, wherein the multi-omics data index further comprises characterized imagery data.

25. The method of claim 24, wherein the characterized imagery data is selected from the group consisting of: histological slides, MRI images, X-rays, mammograms, ultrasound, PET images, CT scans, and combinations of the foregoing.

26. The method of claim 4, wherein the cancer analysis is dynamically computed after receipt of the user query.

27. The method of claim 1, wherein said indexing of the additional ingested multi-omics data and said annotations further comprises indexing data derived from a group comprising: cancer analysis, annotation, features extracted from imaging data, phenotype, medical literature data and their embedding, and combinations of the foregoing.

28. The method of claim 1, wherein the ranking further comprises matching sample changes to established drug target labels and available clinical trials.

29. The method of claim 1, wherein the ranking further comprises identifying cancer drug targets in the cohort by detecting potential biomarkers that stratify cohorts based on clinical variables and/or statistical significance of interest, and wherein returning the ranked one or more multi-omics data indices to the user comprises a hierarchical visualization.

30. The method of claim 1, wherein the returning the ranked one or more multi-omics data indices to the user further comprises dynamically creating a hyperlinked report for individual patients and/or cohort groups, the hyperlinked report providing a comprehensive profile of a neoplasm.

31. The method of claim 1, wherein the user query can include user uploaded data selected from the group consisting of: a set of variants, genes, pathways, disease state disorders, phenotypes of interest, and wherein the selecting comprises querying individual samples or cohort data selected by the uploaded data subsets.

32. The method of claim 1, wherein the user query can be provided via a user interface and can include uploading data for indexing selected from the group consisting of: genomic data, transcriptomic data, epigenetic data, chromatin accessibility data, microbiome data, proteomic data, phenotypic data, annotation data, and combinations of the foregoing.

33. The method of claim 1, further comprising: normalizing and/or augmenting the user query, classifying intent of the query, aggregating retrieved documents, and performing document retrieval based on similarity between the query and documents in a latent space using a deep learning approach.

34. The method of claim 1, wherein at least one of the indexing, the selecting, and the ranking comprises utilizing a deep neural network.

35. The method of claim 4, wherein deriving the cancer analysis comprises utilizing a deep neural network.

36. The method of claim 1, wherein the returning the ranked one or more multi-omics data indices to the user further comprises: returning a summary visualization of the returned results and a list of the ranked results.

37. A non-transitory computer-readable medium having stored therein a program for causing a computer to execute a method for tumor profiling using multi-omics data indexing, the method comprising:

receiving a user query;

selecting one or more related omics data indices based on the user query;

ranking the one or more selected multigroup index of mathematical data based on at least one of clinical operability, an

38. The method of claim 37, wherein the multi-omics data are selected from the group consisting of: genomic, transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic, phenotypic, imagery, related literature, integrated multi-omics data, and combinations of the foregoing.

39. The method of claim 37, wherein the plurality of multi-omics data indices further comprise somatic genomic alterations, normal genomic alterations, and cancer annotation sources.

40. The method of claim 37, further comprising deriving a cancer analysis for the one or more selected multi-omics data indices, wherein the cancer analysis comprises a cancer characteristic selected from the group consisting of: quality control, tumor mutation burden, genomic mutation signature, microsatellite instability status, neoantigens, HLA allelic typing, RNA confirmed variants, copy number variants, structural variants, non-coding regulatory variants, gene fusion, pathway enrichment, cancer driver identification, mutation profiling, differential gene expression, immune signatures, match information on similar patient treatment outcomes, and combinations of the foregoing.

41. The method of claim 40, wherein the cancer analysis is derived for individual samples or cohorts of samples.

42. The method of claim 40, wherein the cancer analysis comprises machine learning prediction and ranked features.

43. The method of claim 42, wherein the machine learning prediction is selected from the group consisting of: major primary site classifier, future metastatic site prediction classifier, microsatellite instability status prediction, neoantigen binding affinity prediction, disease status stratification, determining cancer lineage, and combinations of the foregoing.

44. The method of claim 37, further comprising propagating annotations from a higher level of the genome hierarchy to a lower level of the genome hierarchy.

45. The method of claim 37, further comprising ranking the selected one or more multi-omics data indices from a higher level of a genome hierarchy to a lower level of a genome hierarchy.

46. The method of claim 37, wherein the ranking comprises a clinical ranking for cancer variants and genes.

47. The method of claim 37, wherein the ranking comprises stratifying cohort groups by incorporating latent space representations of cancer data.

48. The method of claim 47, wherein the cohort groups are stratified into responders and non-responders.

49. The method of claim 47, wherein the cohort groups are stratified into a long-term progression-free survival time and a short-term progression-free survival time.

50. The method of claim 47, wherein the latent space representation is performed by a neural network.

51. The method of claim 50, wherein the neural network is selected from the group consisting of: an autoencoder, a variational autoencoder, a deep belief network, a restricted Boltzmann machine, a feed-forward network, a convolutional network, a recurrent network, a long short-term memory network, and a generative adversarial network.

52. The method of claim 37, wherein the ranking further comprises a model for learning a ranking selected from the group consisting of: support vector machines, boosted decision trees, regression models, neural networks, and combinations of the foregoing.

53. The method of claim 37, wherein the ranking further comprises a deep learning ranking.

54. The method of claim 53, wherein the deep learning ranking is derived from a deep learning model selected from the group consisting of: a deep semantic similarity model, a wide and deep model, a deep language model, a learned deep learning text embedding, a learned named entity recognition, a Siamese neural network, and combinations of the foregoing.

55. The method of claim 37, wherein the multi-omics data are selected from the group consisting of: somatic calling from whole genome sequence data, somatic calling from whole exome sequence data, somatic sequencing from a somatic stack of fresh frozen tissue, somatic sequencing from a formalin fixed paraffin embedded tissue, somatic sequencing from a liquid biopsy, tumor and normal variant calling, tumor/normal transcriptome data indexed as variants at RNA or gene expression levels, epigenetic data, chromatin accessibility data, microbiome data, proteome data, single cell sequencing data, and combinations of the foregoing.

56. The method of claim 37, wherein the multi-omics data index further comprises extracted phenotypic data.

57. The method of claim 56, wherein said phenotypic data is selected from the group consisting of: electronic health records, clinical data, functional data, and combinations of the foregoing.

58. The method of claim 37, wherein the multi-omics data index further comprises characterized imagery data.

59. The method of claim 58, wherein the characterized imagery data is selected from the group consisting of: histological slides, MRI images, X-rays, mammograms, ultrasound, PET images, CT scans, and combinations of the foregoing.

60. The method of claim 40, wherein the cancer analysis is dynamically computed after receipt of the user query.

61. The method of claim 40, wherein said indexing of the additional ingested multi-omics data and said annotations further comprises indexing data derived from a group comprising: cancer analysis, annotation, features extracted from imaging data, phenotype, medical literature data and their embedding, and combinations of the foregoing.

62. The method of claim 37, wherein the ranking further comprises matching sample changes to established drug target labels and available clinical trials.

63. The method of claim 37, wherein the ranking further comprises identifying cancer drug targets in the cohort by detecting potential biomarkers that stratify cohorts based on clinical variables and/or statistical significance of interest, and wherein returning the ranked one or more multi-omics data indices to the user comprises a hierarchical visualization.

64. The method of claim 37, wherein the returning the ranked one or more multi-omics data indices to the user further comprises dynamically creating a hyperlinked report for individual patients and/or cohort groups, the hyperlinked report providing a comprehensive profile of a neoplasm.

65. The method of claim 37, wherein the user query can include user uploaded data selected from the group consisting of: a set of variants, genes, pathways, disease state disorders, phenotypes of interest, and wherein the selecting comprises querying individual samples or cohort data selected by the uploaded data subsets.

66. The method of claim 37, wherein the user query can be provided via a user interface and can include uploading data for indexing selected from the group consisting of: genomic data, transcriptomic data, epigenetic data, chromatin accessibility data, microbiome data, proteomic data, phenotypic data, annotation data, and combinations of the foregoing.

67. The method of claim 37, further comprising: normalizing and/or augmenting the user query, classifying intent of the query, aggregating retrieved documents, and performing document retrieval based on similarity between the query and documents in a latent space using a deep learning approach.

68. The method of claim 37, wherein at least one of the indexing, the selecting, and the ranking comprises utilizing a deep neural network.

69. The method of claim 40, wherein deriving the cancer analysis comprises utilizing a deep neural network.

70. The method of claim 37, wherein the returning the ranked one or more multi-omics data indices to the user further comprises: returning a summary visualization of the returned results and a list of the ranked results.

71. A system for tumor profiling using multi-omics data indexing, the system comprising:

an indexing unit comprising:

a storage element configured to store a plurality of multi-omics data indices, wherein each multi-omics data index of the plurality of multi-omics data indices includes cancer-specific tokenized data, and

an indexing engine configured to:

ingest additional multi-omics data and any annotations associated with the additional multi-omics data, the additional multi-omics data being related to one or more of the indices, and

index the additional ingested multi-omics data and the annotations while preserving, in a particular said index, a multi-omics mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized additional ingested multi-omics data;

a user interface configured to receive a user query;

a query engine configured to select one or more relevant multi-omics data indices from the indexing unit based on the user query; and

a ranking engine configured to receive the selected one or more relevant multi-omics data indices, rank the selected one or more multi-omics data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency, and return the ranked one or more multi-omics data indices to the user via the user interface.

72. The system of claim 71, wherein the multi-omics data are selected from the group consisting of: genomic, transcriptomic, epigenetic, chromatin accessibility, microbiome, proteomic, phenotypic, imagery, related literature, integrated multi-omics data, and combinations of the foregoing.

73. The system of claim 71, wherein the plurality of multi-omics data indices further comprises somatic genomic alterations, normal genomic alterations, and cancer annotation sources.

74. The system of claim 71, further comprising a cancer analysis engine configured to derive a cancer analysis for the one or more selected multi-omics data indices, wherein the cancer analysis comprises a cancer characteristic selected from the group consisting of: quality control, tumor mutation burden, genomic mutation signature, microsatellite instability status, neoantigens, HLA allelic typing, RNA-confirmed variants, copy number variants, structural variants, non-coding regulatory variants, gene fusion, pathway enrichment, cancer driver identification, mutation profiling, differential gene expression, immune signatures, match information about similar patient treatment outcomes, and combinations of the foregoing.

75. The system of claim 74, wherein the cancer analysis is derived for individual samples or cohorts of samples.

76. The system of claim 74, wherein the cancer analysis includes machine learning predictions and ranked features.

77. The system of claim 76, wherein the machine learning prediction is selected from the group consisting of: major primary site classifier, future metastatic site prediction classifier, microsatellite instability status prediction, neoantigen binding affinity prediction, disease status stratification, determining cancer lineage, and combinations of the foregoing.

78. The system of claim 71, wherein the indexing engine is configured to propagate annotations from a higher level of a genome hierarchy to a lower level of a genome hierarchy.

79. The system of claim 71, wherein the ranking engine is configured to rank the selected one or more multi-omics data indices from a higher level of a genome hierarchy to a lower level of a genome hierarchy.

80. The system of claim 71, wherein the ranking comprises a clinical ranking for cancer variants and genes.

81. The system of claim 71, wherein the ranking comprises stratifying cohort groups by incorporating latent space representations of cancer data.

82. The system of claim 81, wherein said cohort groups are stratified into responders and non-responders.

83. The system of claim 81, wherein said cohort groups are stratified into a long term progression free survival time and a short term progression free survival time.

84. The system of claim 79, wherein the cohort groups are stratified into different cancer subtypes.

85. The system of claim 81, wherein the latent space representation is performed by a neural network.

86. The system of claim 85, wherein the neural network is selected from the group consisting of: an autoencoder, a variational autoencoder, a deep belief network, a restricted Boltzmann machine, a feed-forward network, a convolutional network, a recurrent network, a gated recurrent network, a long short-term memory network, a residual network, and a generative adversarial network.

87. The system of claim 71, wherein the ranking engine further comprises a model for learning a ranking selected from the group consisting of: support vector machines, boosted decision trees, regression models, neural networks, and combinations of the foregoing.

88. The system of claim 71, wherein the ranking further comprises a deep learning ranking.

89. The system of claim 88, wherein the deep learning ranking is derived from a deep learning model selected from the group consisting of: a deep semantic similarity model, a wide and deep model, a deep language model, a learned deep learning text embedding, a learned named entity recognition, a Siamese neural network, and combinations of the foregoing.

90. The system of claim 71, wherein the multi-omics data are selected from the group consisting of: somatic calling from whole genome sequence data, somatic calling from whole exome sequence data, somatic sequencing from a somatic stack of fresh frozen tissue, somatic sequencing from a formalin fixed paraffin embedded tissue, somatic sequencing from a liquid biopsy, tumor and normal variant calling, tumor/normal transcriptome data indexed as variants at RNA or gene expression levels, epigenetic data, chromatin accessibility data, microbiome data, proteome data, single cell sequencing data, and combinations of the foregoing.

91. The system of claim 71, wherein said multi-omics data index further comprises extracted phenotypic data.

92. The system of claim 91, wherein said phenotypic data is selected from the group consisting of: electronic health records, clinical data, functional data, and combinations of the foregoing.

93. The system of claim 71, wherein the multi-omics data index further comprises characterized imagery data.

94. The system of claim 93, wherein the characterized imagery data is selected from the group consisting of: histological slides, MRI images, X-rays, mammograms, ultrasound, PET images, CT scans, and combinations of the foregoing.

95. The system of claim 74, wherein the cancer analysis is dynamically computed after receipt of the user query.

96. The system of claim 71, wherein the indexing engine is further configured to index data derived from a group comprising: cancer analysis, annotation, features extracted from imaging data, phenotype, medical literature data and their embedding, and combinations of the foregoing.

97. The system of claim 71, wherein the ranking engine is further configured to match sample changes to established drug target labels and available clinical trials.

98. The system of claim 71, wherein the ranking engine is further configured to perform identification of cancer drug targets in the cohort by detecting potential biomarkers that stratify cohort groups based on clinical variables and/or statistical significance of interest, and the ranking engine is further configured to return the ranked one or more multi-omics data indices to the user via hierarchical visualization.

99. The system of claim 71, wherein the ranking engine is configured to return the ranked one or more multi-omics data indices to the user via dynamic creation of a hyperlinked report for individual patients and/or cohort groups, the hyperlinked report providing a comprehensive profile of a neoplasm.

100. The system of claim 71, wherein the user query comprises user uploaded data selected from the group consisting of: a set of variants, genes, pathways, disease state disorders, phenotypes of interest, and wherein the selecting comprises querying individual samples or cohort data selected by the uploaded data subsets.

101. The system of claim 71, wherein the user interface is configured to receive a user query comprising uploaded data for indexing selected from the group consisting of: genomic data, transcriptomic data, epigenetic data, chromatin accessibility data, microbiome data, proteomic data, phenotypic data, annotation data, and combinations of the foregoing.

102. The system of claim 71, wherein the query engine is further configured to normalize and/or augment the user query, classify intent of the query, summarize retrieved documents, and perform document retrieval based on similarity between the query and documents in a latent space using a deep learning approach.

103. The system of claim 71, wherein at least one of the indexing engine, the query engine, and the ranking engine are configured to utilize a deep neural network.

104. The system of claim 74, wherein the cancer analysis engine is configured to utilize a deep neural network to derive the cancer analysis.

105. The system of claim 71, wherein the ranking engine is further configured to return the ranked one or more multi-omics data indices to the user by returning a summary visualization of the returned results and a ranked list of results.

106. A system for tumor profiling using multi-omics data indexing, the system comprising:

an indexing unit comprising:

an indexing engine configured to:

index the additional ingested multi-omics data and annotations while preserving, in the particular index, a multi-omics mapping between gene names, gene variant names, and different data streams for the same patient, to produce tokenized additional ingested multi-omics data;

a user interface configured to receive a user query; and

a query engine configured to select one or more relevant multi-omics data indices from the indexing unit based on the user query, to rank the selected one or more multi-omics data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency, and to return the ranked one or more multi-omics data indices to the user via the user interface.