US11315658B2 - Systems and methods for deconvolution of expression data - Google Patents

Systems and methods for deconvolution of expression data

Publication number
US11315658B2
Authority
US
United States
Prior art keywords
cell type
expression data
cell
cells
rna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US17/200,492
Other languages
English (en)
Other versions
US20210287759A1 (en)
Inventor
Aleksandr Zaitsev
Maksim Chelushkin
Ilya Cheremushkin
Ekaterina Nuzhdina
Vladimir Zyrin
Daniiar Dyikanov
Alexander Bagaev
Ravshan Ataullakhanov
Boris Shpak
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BostonGene Corp
Original Assignee
BostonGene Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BostonGene Corp
Priority to US17/200,492 (US11315658B2)
Publication of US20210287759A1
Assigned to BOSTONGENE CORPORATION: assignment of assignors interest (see document for details). Assignors: BOSTONGENE LLC
Assigned to BOSTONGENE LLC: assignment of assignors interest (see document for details). Assignors: ATAULLAKHANOV, RAVSHAN, BAGAEV, Alexander, CHELUSHKIN, MAKSIM, CHEREMUSHKIN, Ilya, DYIKANOV, Daniiar, NUZHDINA, EKATERINA, SHPAK, Boris, ZAITSEV, ALEKSANDR, ZYRIN, Vladimir
Priority to US17/707,623 (US11587642B2)
Application granted
Publication of US11315658B2
Priority to US18/082,157 (US20230178178A1)
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • a tumor mass may comprise a population of malignant cells (e.g., cancer cells) and a microenvironment which may include, for example, immune cells, fibroblasts, and extracellular matrix proteins.
  • Some embodiments provide for a method comprising using at least one computer hardware processor to perform: obtaining expression data for a biological sample, the biological sample previously obtained from a subject, the expression data including first expression data associated with a first set of genes associated with a first cell type; determining a first cell composition percentage for the first cell type using the expression data and one or more non-linear regression models including a first non-linear regression model, wherein the first cell composition percentage indicates an estimated percentage of cells of the first cell type in the biological sample, wherein determining the first cell composition percentage for the first cell type comprises: processing the first expression data with the first non-linear regression model to determine the first cell composition percentage for the first cell type; and outputting the first cell composition percentage.
  • Some embodiments provide for a system, comprising: at least one hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one hardware processor, cause the at least one hardware processor to perform: obtaining expression data for a biological sample, the biological sample previously obtained from a subject, the expression data including first expression data associated with a first set of genes associated with a first cell type; determining a first cell composition percentage for the first cell type using the expression data and one or more non-linear regression models including a first non-linear regression model, wherein the first cell composition percentage indicates an estimated percentage of cells of the first cell type in the biological sample, wherein determining the first cell composition percentage for the first cell type comprises: processing the first expression data with the first non-linear regression model to determine the first cell composition percentage for the first cell type; and outputting the first cell composition percentage.
  • Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one hardware processor, cause the at least one hardware processor to perform: obtaining expression data for a biological sample, the biological sample previously obtained from a subject, the expression data including first expression data associated with a first set of genes associated with a first cell type; determining a first cell composition percentage for the first cell type using the expression data and one or more non-linear regression models including a first non-linear regression model, wherein the first cell composition percentage indicates an estimated percentage of cells of the first cell type in the biological sample, wherein determining the first cell composition percentage for the first cell type comprises: processing the first expression data with the first non-linear regression model to determine the first cell composition percentage for the first cell type; and outputting the first cell composition percentage.
  • Some embodiments provide for a method comprising using at least one computer hardware processor to perform: obtaining RNA expression data for a biological sample, the biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer, wherein the RNA expression data includes first RNA expression data associated with a first set of genes associated with a first cell type, wherein the first RNA expression data includes expression data for at least 10 genes selected from the group of genes for the first cell type in Table 2, wherein the first cell type is selected from the group consisting of B cells, CD4+ T cells, CD8+ T cells, endothelial cells, fibroblasts, lymphocytes, macrophages, monocytes, NK cells, neutrophils, and T cells; and determining a first cell composition percentage for the first cell type, using the first RNA expression data, the first cell composition percentage indicating an estimated percentage of cells of the first cell type in the biological sample, wherein determining the first cell composition percentage for the first cell type comprises: providing the first RNA expression data as input to a first non
  • Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one hardware processor, cause the at least one hardware processor to perform: obtaining RNA expression data for a biological sample, the biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer, wherein the RNA expression data includes first RNA expression data associated with a first set of genes associated with a first cell type, wherein the first RNA expression data includes expression data for at least 10 genes selected from the group of genes for the first cell type in Table 2, wherein the first cell type is selected from the group consisting of B cells, CD4+ T cells, CD8+ T cells, endothelial cells, fibroblasts, lymphocytes, macrophages, monocytes, NK cells, neutrophils, and T cells; and determining a first cell composition percentage for the first cell type, using the first RNA expression data, the first cell composition percentage indicating an estimated percentage of cells of the first cell type in the biological sample, wherein
  • Some embodiments provide for a method comprising: using at least one computer hardware processor to perform: obtaining training data comprising simulated RNA expression data, the simulated RNA expression data including first RNA expression data for first genes associated with a first cell type and second RNA expression data for second genes associated with a second cell type different from the first cell type; and training a plurality of non-linear regression models to estimate percentages of RNA from one or more respective cell types, the plurality of non-linear regression models comprising a first non-linear regression model for estimating percentage of RNA from the first cell type and a second non-linear regression model for estimating percentage of RNA from the second cell type, wherein training the plurality of non-linear regression models comprises training the first non-linear regression model at least in part by: generating, using the first non-linear regression model and the first RNA expression data, an estimated percentage of RNA from the first cell type; and updating parameters of the first non-linear regression model using the estimated percentage of RNA from the first cell type; and outputting the
  • Some embodiments provide for a system, comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining training data comprising simulated RNA expression data, the simulated RNA expression data including first RNA expression data for first genes associated with a first cell type and second RNA expression data for second genes associated with a second cell type different from the first cell type; and training a plurality of non-linear regression models to estimate percentages of RNA from one or more respective cell types, the plurality of non-linear regression models comprising a first non-linear regression model for estimating percentage of RNA from the first cell type and a second non-linear regression model for estimating percentage of RNA from the second cell type, wherein training the plurality of non-linear regression models comprises training the first non-linear regression model at least in part by: generating, using the first non-linear regression model and the first RNA expression data
  • Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining training data comprising simulated RNA expression data, the simulated RNA expression data including first RNA expression data for first genes associated with a first cell type and second RNA expression data for second genes associated with a second cell type different from the first cell type; and training a plurality of non-linear regression models to estimate percentages of RNA from one or more respective cell types, the plurality of non-linear regression models comprising a first non-linear regression model for estimating percentage of RNA from the first cell type and a second non-linear regression model for estimating percentage of RNA from the second cell type, wherein training the plurality of non-linear regression models comprises training the first non-linear regression model at least in part by: generating, using the first non-linear regression model and the first RNA expression data, an estimated percentage of RNA from the first cell type; and updating parameters
  • Some embodiments provide for a method comprising using at least one computer hardware processor to perform: obtaining expression data for a biological sample, the biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer; obtaining a plurality of expression profiles for a corresponding plurality of cell types, each of the expression profiles comprising respective expression data from one or more genes associated with a respective cell type from the plurality of cell types; and determining a plurality of cell composition percentages for the plurality of cell types at least in part by optimizing a piecewise continuous error function between the expression data and the plurality of expression profiles.
  • Some embodiments provide for a system, comprising: at least one computer hardware processor; and at least one computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform obtaining expression data for a biological sample, the biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer; obtaining a plurality of expression profiles for a corresponding plurality of cell types, each of the expression profiles comprising respective expression data from one or more genes associated with a respective cell type from the plurality of cell types; and determining a plurality of cell composition percentages for the plurality of cell types at least in part by optimizing a piecewise continuous error function between the expression data and the plurality of expression profiles.
  • Some embodiments provide for at least one computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining expression data for a biological sample, the biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer; obtaining a plurality of expression profiles for a corresponding plurality of cell types, each of the expression profiles comprising respective expression data from one or more genes associated with a respective cell type from the plurality of cell types; and determining a plurality of cell composition percentages for the plurality of cell types at least in part by optimizing a piecewise continuous error function between the expression data and the plurality of expression profiles.
  • FIG. 1A is a diagram depicting a system for determining a cell composition percentage based on expression data, according to some embodiments of the technology described herein.
  • FIG. 1B is an example diagram for determining different cell composition percentages for different cell types and cell subtypes using a non-linear regression model for each respective cell type and cell sub-type, according to some embodiments of the technology described herein.
  • FIG. 1C is a t-SNE visualization depicting exemplary cell populations including malignant and microenvironment cells, according to some embodiments of the technology described herein.
  • FIG. 1D is a t-SNE visualization depicting exemplary malignant cell populations, according to some embodiments of the technology described herein.
  • FIG. 1E is a chart depicting exemplary gene expressions for a variety of cells, according to some embodiments of the technology described herein.
  • FIG. 1F is a chart depicting an exemplary correlation between genes and selected cell proportions in a sample mixture of various cell types, according to some embodiments of the technology described herein.
  • FIG. 1G is a chart depicting exemplary gene expressions for tumor cell lines, according to some embodiments of the technology described herein.
  • FIG. 2A is a flowchart depicting an exemplary non-linear method for determining a cell composition percentage based on expression data, according to some embodiments of the technology described herein.
  • FIG. 2B is a flowchart illustrating an example implementation of method 200 for determining a cell composition percentage based on expression data, according to some embodiments of the technology described herein.
  • FIG. 2C is a flowchart illustrating an example implementation of act 216a of method 200, according to some embodiments of the technology described herein.
  • FIG. 3A is a diagram depicting use of a machine learning method for determining RNA percentages based on RNA expression data, according to some embodiments of the technology described herein.
  • FIG. 3B is a diagram depicting use of a non-linear regression model comprising sub-models for determining RNA percentages based on RNA expression data, according to some embodiments of the technology described herein.
  • FIG. 3C is a diagram depicting a method for determining cell composition percentages based on RNA percentages, according to some embodiments of the technology described herein.
  • FIG. 3D is a diagram depicting an example method for determining malignancy expression profiles based on cell composition percentages, according to some embodiments of the technology described herein.
  • FIG. 4 is a flowchart depicting an exemplary method for training one or more non-linear regression models to determine cell composition percentages based on RNA expression data, according to some embodiments of the technology described herein.
  • FIGS. 5A-5B are diagrams depicting an exemplary method for training one or more machine learning models including validation and multiple stages of training, according to some embodiments of the technology described herein.
  • FIG. 6A is a diagram depicting an exemplary method for training one or more non-linear regression models including generating simulated RNA expression data, according to some embodiments of the technology described herein.
  • FIG. 6B is an exemplary diagram for generating artificial mixes of RNA expression data to imitate real tissue, according to some embodiments of the technology described herein.
  • FIG. 6C is an exemplary diagram for generating and using artificial mixes to train cell type models, according to some embodiments of the technology described herein.
  • FIGS. 6D-6E are exemplary illustrations for generating specific artificial mixes for training particular cell type/subtype models, according to some embodiments of the technology described herein.
  • FIG. 6F is an exemplary diagram illustrating techniques for processing datasets and generating artificial mixes, according to some embodiments of the technology described herein.
  • FIG. 7A is a chart comparing simulated RNA expression data to RNA expression data from a biological sample, according to some embodiments of the technology described herein.
  • FIG. 7B is a chart depicting exemplary cell composition percentages predicted according to the deconvolution techniques developed by the inventors and corresponding true cell composition percentages, according to some embodiments of the technology described herein.
  • FIGS. 7C-7D are charts comparing exemplary prediction accuracy for the deconvolution techniques developed by the inventors, to prediction accuracy for alternative algorithms, according to some embodiments of the technology described herein.
  • FIG. 7E is a graph depicting expression of four selected genes in normal tissue, immune cell types, and cancerous tissue, according to some embodiments of the technology described herein.
  • FIG. 7F is a chart depicting exemplary prediction specificity for the deconvolution techniques developed by the inventors, according to some embodiments of the technology described herein.
  • FIG. 7G is a chart comparing exemplary non-specificity scores for the deconvolution techniques developed by the inventors to non-specificity scores for alternative algorithms, according to some embodiments of the technology described herein.
  • FIG. 8 is a flowchart depicting an exemplary linear method for determining cell composition percentages based on RNA expression data, according to some embodiments of the technology described herein.
  • FIG. 9A is a diagram depicting exemplary RNA expression profiles and overall RNA expression data, according to some embodiments of the technology described herein.
  • FIG. 9B depicts an exemplary piecewise continuous error function, according to some embodiments of the technology described herein.
  • FIG. 10 depicts an illustrative implementation of a computer system that may be used in connection with some embodiments of the technology described herein.
  • FIG. 11 is a block diagram of an illustrative environment in which one or more embodiments of the technology described herein may be implemented.
  • FIGS. 12A-12K are charts and graphs depicting analysis and results from an experiment to establish RNA transcript normalization, and analyze sequencing technical noise as described in connection with Example 1.
  • FIGS. 13A-13J are charts and graphs depicting analysis and results from an experiment to deconvolve RNA-seq of multiple normal and cancer tissues as described in connection with Example 2.
  • FIGS. 14A-14G are charts and graphs depicting analysis and results from an experiment to deconvolve single cell RNA-seq data and bulk RNA-seq of blood as described in connection with Example 3.
  • FIGS. 15A-15I are charts and graphs depicting analysis and results from an experiment to deconvolve several different cancer tissues as described in connection with Example 4.
  • determining cell composition percentages (e.g., percentages of cells of particular respective types)
  • a biological sample (e.g., a sample from a tumor or other diseased tissue)
  • RNA expression data (e.g., data collected by processing the biological sample with a sequencing technique, such as bulk RNA-sequencing)
  • determining cell composition percentages for one or more cell types may involve using one or more non-linear regression models to estimate respective cell composition percentages for the cell types.
  • the non-linear regression models may be trained using simulated RNA expression data, which may be generated according to the techniques described herein, such as by combining RNA expression data for a variety of malignant and/or microenvironment cell types and/or using any of the sampling, rebalancing, and noising techniques described herein.
  • TME: tumor microenvironment
  • immune and non-immune components of the TME participate in tumor survival, maintenance, growth, and development using cell-to-cell contacts and different molecular signals, such as growth factors and cytokines.
  • the inventors have recognized that the TME can mediate tumor survival by controlling the immune system of the host, providing immune surveillance of the tumor. The inventors have therefore appreciated that understanding the quantity and functionality of TME components is essential for cancer research and is important for therapy and understanding its clinical impact.
  • RNA-seq: bulk RNA-sequencing
  • TME cellular composition (e.g., cell composition percentages)
  • cell types may be considered as populations of cells having distinguishable expression profiles.
  • CD4+ T cells, CD8+ T cells, and NK cells tend to share the expression of a substantial number of structural and regulatory genes, including metabolic, signaling, and surface markers.
  • RNA expression data can contain both unique marker genes and genes relevant to the cell lineage.
  • the inventors have also recognized that the ratio between marker and lineage-specific gene expression may or may not provide information about cell subtypes (for example the ratio of CD4/CD3D genes may be a marker of CD4+ T cells, but CD3D is not a unique marker for subtypes of helper T cells). Since cells of different types, even if they are closely related, can have significantly different impacts on tumor pathogenesis, the inventors have recognized that it nevertheless may be critical to distinguish cell populations even between closely related cell types.
  • Another challenge with cellular deconvolution recognized by the inventors is the difficulty of distinguishing between the number of cells and their state.
  • the expression of a gene specific or semi-specific to one cell type may vary depending on the activation state of the cells of that type or may differ between subtypes of that type.
  • although multiple studies can sequence similar cell subtypes, the cells may be captured in different biological states.
  • the inventors have recognized and appreciated that the variability in biological states can play an important role in developing accurate estimates for cell composition percentages.
  • the inventors have recognized and appreciated that the tumor microenvironment may make up only a relatively small fraction of the tumor as a whole.
  • the identification of small cell populations from bulk RNA-seq data can be especially challenging because of a reduced signal-to-noise ratio.
  • the inventors have recognized that identifying changes in small cell populations (e.g., NK-cells) remains important, as even small cell populations can nevertheless have significant impact on response to treatment.
  • RNA enrichment method (e.g., total RNA-seq (REF), polyA enriched (REF), exome capture, or 3′ scRNA-seq (REF))
  • scRNA-seq: single cell RNA-seq
  • a deconvolution method comprising obtaining expression data (e.g., bulk RNA-seq data) for a biological sample from a subject, and determining cell composition percentages for one or more cell types (e.g., B cells, CD4+ T cells, CD8+ T cells, endothelial cells, fibroblasts, lymphocytes, macrophages, monocytes, NK cells, neutrophils, and T cells).
  • the cell composition percentage may indicate an estimated percentage of cells of a particular respective type in the biological sample.
  • determining a cell composition percentage for a particular cell type may comprise obtaining expression data for a set of genes associated with the cell type (e.g., such as one or more marker genes, which may be specific or semi-specific genes for the particular cell type), and processing that expression data with a non-linear regression model to determine the cell composition percentage for the particular cell type.
  • this process may be repeated or performed in parallel for each of multiple cell types (which may include subtypes of cell types, as described herein) in order to achieve a deconvolution across the multiple cell types. As described herein at least with respect to FIG. 7, these techniques present a significant improvement over the prior art.
  • machine learning techniques used for determining cell composition percentages may include using multiple non-linear regression models, each trained to determine a cell composition percentage for a particular respective cell type.
  • the non-linear regression model may have multiple parameters (e.g., thousands, tens of thousands, hundreds of thousands, at least one million, millions, tens of millions, or hundreds of millions of parameters), and training the non-linear regression model may include computationally estimating values of those parameters from expression data simulated for training.
  • generating the simulated training data may include generating many training sets (e.g., at least 25,000, at least 50,000, at least 100,000, at least 150,000, at least 200,000, at least 500,000, etc.) for each non-linear regression model, for each cell type.
  • multiple non-linear regression models may be trained respectively for multiple cell types (e.g., at least 5, at least 10, at least 20, at least 30, at least 40, etc.).
  • FIGS. 7C and 7D show that, compared to conventional techniques, the non-linear deconvolution techniques developed by the inventors (e.g., referred to as “Kassandra”) result in more accurate predictions of cell composition percentages for different cell types, even in the presence of cancer cell hyperexpression noise (e.g., as shown in FIG.
  • the techniques described herein constitute an improvement to bioinformatics generally and, specifically, to supporting clinical decision making and understanding tumor pathogenesis because the techniques described herein provide for improved methods of determining cell composition percentages (e.g., particularly for cell populations in the tumor microenvironment).
  • the machine learning techniques described herein can successfully identify dependencies and interconnections between genes of phenotypically closely related cell types by using expression data associated with genes that are associated (e.g., specific and/or semi-specific) with the particular subtype as input to a non-linear regression model specifically trained for that subtype, allowing for the accurate detection of cell subtypes even with similar expression patterns (FIGS. 7A, 7B).
  • the non-linear deconvolution techniques described herein are also more robust than prior algorithms, showing more consistent accuracy across a variety of cell types/subtypes, and providing significantly more accurate results than conventional techniques on realistic, noisy data (FIGS. 7C, 7D, 13F, 15G).
  • these more accurate results enable improved cancer diagnosis and prognosis, as well as personalized treatment options for the patient.
  • the expression data may include expression data associated with particular genes associated with the given cell type.
  • the expression data may include expression data associated with genes for a given cell type.
  • identifying the genes that are associated with a particular cell type may comprise processing expression data from multiple samples, which may be obtained from multiple databases, and/or with a variety of sequencing techniques, to identify genes that are only or predominantly expressed in certain cell types or subtypes.
  • the use of expression data associated with particular genes associated with the particular cell types allows the cellular deconvolution techniques developed by the inventors to leverage domain-specific knowledge relating to which genes are expressed by which cell types, contributing to the success of the techniques described herein.
  • a separate non-linear regression model is trained and used to estimate cell composition percentages for each respective cell type and/or subtype being analyzed in a biological sample (e.g., as described herein including at least with respect to FIG. 3A ). This may allow for cell types and/or subtypes in the biological sample to be distinguished more accurately (e.g., as shown in FIGS. 7A-7G ).
  • the model architecture may include a tiered structure (e.g., as described herein including at least with respect to FIG.
  • the model architecture may include multiple sub-models corresponding to multiple stages, in which the output of one or more previous sub-models (which may comprise, for example, initial predictions of one or more cell composition percentages for one or more cell types) may be used as part of the input for a subsequent sub-model.
  • This allows the models to develop more accurate predictions by improving upon their initial predictions (e.g., from a first stage of training and/or using the models) in order to provide more accurate final predictions (e.g., at a second, third, etc. stage of training and/or using the models).
  • a tiered structure may be utilized in which outputs from the first sub-model across multiple models for multiple cell types and/or subtypes may be provided as input to subsequent sub-model(s) for each model.
  • first sub-model predictions of cell composition percentages for all cell types may be provided as input to the second sub-models (e.g., for other cell types and/or subtypes.)
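  • As a minimal sketch of the tiered structure described above, the following Python example runs a first-stage sub-model for each cell type and then passes the first-stage predictions for all cell types to each second-stage sub-model. The function names, type aliases, and toy sub-models are illustrative assumptions and are not the trained models described herein.

```python
from typing import Callable, Dict, List

# Hypothetical signatures: a first-stage sub-model maps expression features to
# an initial estimate; a second-stage sub-model additionally receives the
# first-stage estimates for all cell types.
Stage1 = Callable[[List[float]], float]
Stage2 = Callable[[List[float], List[float]], float]


def predict_tiered(
    features: Dict[str, List[float]],
    stage1: Dict[str, Stage1],
    stage2: Dict[str, Stage2],
) -> Dict[str, float]:
    """Run stage-1 models for every cell type, then refine with stage-2 models."""
    cell_types = sorted(features)
    initial = {ct: stage1[ct](features[ct]) for ct in cell_types}
    # Every second-stage model sees the initial estimates of all cell types.
    shared = [initial[ct] for ct in cell_types]
    return {ct: stage2[ct](features[ct], shared) for ct in cell_types}


# Toy stand-ins for trained sub-models (assumptions for illustration only).
def toy_stage1(x: List[float]) -> float:
    return sum(x) / len(x)


def toy_stage2(x: List[float], initial: List[float]) -> float:
    total = sum(initial)
    own = sum(x) / len(x)
    return own / total if total > 0 else 0.0


features = {"B_cells": [2.0, 4.0], "T_cells": [8.0, 10.0]}
stage1 = {ct: toy_stage1 for ct in features}
stage2 = {ct: toy_stage2 for ct in features}
refined = predict_tiered(features, stage1, stage2)
```

  • In this sketch the second stage merely rescales each initial estimate by the sum of all initial estimates; a trained second-stage sub-model could use the shared predictions in any learned way.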
  • the models described herein have been trained with data representing artificial mixtures of cell types, allowing the training process to take into account the diverse and tissue-specific expression of malignant and microenvironment cells across much larger numbers of samples of diverse composition (e.g., simulating a wide variety of tumor microenvironments) than could be practically possible by physically sampling and analyzing tumor samples.
  • This substantially reduces the effort and computational resources associated with training the non-linear regression models for cellular deconvolution.
  • the artificial mixes described herein can also be obtained in such a way that they replicate technical noise and capture a wide biological variability, improving the ability of a machine learning model trained using this data to identify biologically meaningful signals in the presence of such noise and variability.
  • RNA expression data used to develop these artificial mixes was derived from multiple different samples, across multiple cell populations having a variety of biological states. These artificial mixes improve the ability of the non-linear regression models to effectively estimate cell composition percentages across a variety of cell types in real tumor samples.
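  • One way such artificial mixes could be simulated is sketched below in Python: random mixing fractions are drawn, one sorted-cell sample per cell type is selected at random, the samples are combined as a weighted sum, and noise is added. The uniform sampling of fractions, the multiplicative Gaussian noise, the parameter `noise_sd`, and the toy expression profiles are all assumptions made for illustration rather than the procedure used by the inventors.

```python
import random
from typing import Dict, List, Tuple


def make_artificial_mix(
    profiles: Dict[str, List[List[float]]],
    rng: random.Random,
    noise_sd: float = 0.05,
) -> Tuple[List[float], Dict[str, float]]:
    """Return (mixed expression vector, true RNA fractions per cell type)."""
    cell_types = sorted(profiles)
    # Draw random mixing fractions that sum to 1 (assumed sampling scheme).
    weights = [rng.random() for _ in cell_types]
    total = sum(weights)
    fractions = {ct: w / total for ct, w in zip(cell_types, weights)}

    n_genes = len(profiles[cell_types[0]][0])
    mix = [0.0] * n_genes
    for ct in cell_types:
        # Pick one sorted-cell sample at random to capture biological variability.
        sample = rng.choice(profiles[ct])
        for g in range(n_genes):
            mix[g] += fractions[ct] * sample[g]

    # Multiplicative Gaussian noise as a crude stand-in for technical noise.
    noisy = [max(0.0, v * (1.0 + rng.gauss(0.0, noise_sd))) for v in mix]
    return noisy, fractions


# Toy profiles: per cell type, a list of samples, each a list of gene values.
profiles = {
    "T_cells": [[100.0, 5.0, 0.0], [90.0, 8.0, 1.0]],
    "B_cells": [[2.0, 120.0, 0.0], [3.0, 110.0, 2.0]],
    "tumor": [[10.0, 10.0, 300.0]],
}
rng = random.Random(0)
training_set = [make_artificial_mix(profiles, rng) for _ in range(1000)]
```

  • Each element of `training_set` pairs a simulated expression vector with the known fractions used to build it, which is what makes such data usable as labeled training examples.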
  • the techniques developed by the inventors also include improved linear techniques for cellular deconvolution.
  • one aspect of the linear techniques that contributes to their success is the use of an error function developed by the inventors.
  • the error function may be a piecewise, continuous error function. Compared to conventional methods, such as finding a square distance, the piecewise continuous error function accounts for genes that are strongly expressed in tumor cells. This may increase the accuracy for deconvolution of cells in tumor samples.
  • the use of such an error function allows the techniques developed by the inventors to more accurately model the error associated with predicted cell composition percentages (e.g., as described herein including with respect to FIGS. 8 and 9A ), providing improved results over conventional techniques.
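  • The exact form of the error function is not given in this passage. Purely as an illustration of what a piecewise continuous error function can look like, the Python sketch below penalizes residuals quadratically, except where the observed expression exceeds the predicted expression by more than a threshold, where the penalty grows only linearly. The functional form and the parameter `delta` are assumptions and should not be read as the error function developed by the inventors.

```python
def piecewise_error(observed: float, predicted: float, delta: float = 1.0) -> float:
    """Illustrative piecewise continuous error (assumed form, see lead-in).

    Quadratic for residuals up to `delta`; linear above `delta`, so that genes
    strongly over-expressed in the sample (e.g., by tumor cells) contribute
    less than they would under a squared distance.
    """
    r = observed - predicted
    if r <= delta:
        return r * r
    # Linear branch chosen so both pieces agree at r == delta (continuity).
    return delta * (2.0 * r - delta)


def total_error(observed, predicted, delta: float = 1.0) -> float:
    """Sum the per-gene errors over two equal-length expression vectors."""
    return sum(piecewise_error(o, p, delta) for o, p in zip(observed, predicted))


squared = (15.0 - 5.0) ** 2                  # conventional squared distance
softened = piecewise_error(15.0, 5.0, 1.0)   # piecewise error for the same gene
```

  • With these toy values, the squared distance is 100.0 while the piecewise error is 19.0, showing how a single strongly over-expressed gene dominates the fit less under the piecewise form.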
  • FIG. 1A depicts a system 100 for determining cell composition percentages 110 .
  • the illustrated system may be implemented in a clinical or laboratory setting.
  • the system 100 includes a biological sample 102 , which may be, for example, a tumor biopsy obtained for a subject (e.g., a subject having, suspected of having, or at risk of having cancer).
  • a subject may be at risk of having cancer, for example, if the subject has a genetic predisposition (e.g., a known genetic mutation or mutations) to cancer or may have been exposed to cancer causing agents.
  • the biological sample 102 may be obtained by performing a biopsy, obtaining a blood sample, a salivary sample, or any other suitable biological sample from the patient.
  • the biological sample 102 may have been previously obtained from a subject. Thus any step applied to the sample (e.g., obtaining expression data from the biological sample) may be performed in vitro.
  • the biological sample 102 may include diseased tissue (e.g., a tumor), and/or healthy tissue.
  • the biological sample may be obtained from a physician, hospital, clinic, or other healthcare provider.
  • the origin or preparation methods of the biological sample may include any of the embodiments described with respect to the “Biological Samples” section.
  • the subject may include any of the embodiments described with respect to the “Subjects” section.
  • the system 100 may further include a sequencing platform 104 , which may produce sequence information 106 .
  • the sequencing platform 104 may be a next generation sequencing platform (e.g., Illumina™, Roche™, Ion Torrent™, etc.), or any high-throughput or massively parallel sequencing platform.
  • the sequencing platform 104 may include any suitable sequencing device and/or any sequencing system including one or more devices. In some embodiments, these methods may be automated; in other embodiments, there may be manual intervention.
  • the sequence information 106 may be the result of non-next generation sequencing (e.g., Sanger sequencing).
  • the sample preparation may be according to manufacturer's protocols.
  • the sample preparation may follow custom-made protocols, or other protocols which are for research, diagnostic, prognostic, and/or clinical purposes.
  • the protocols may be experimental.
  • the origin or preparation method of the sequence information may be unknown.
  • Sequence information 106 can include the sequence data generated by a sequencing protocol (e.g., the series of nucleotides in a nucleic acid molecule identified by next-generation sequencing, Sanger sequencing, etc.) as well as information contained therein (e.g., information indicative of source, tissue type, etc.) which may also be considered information that can be inferred or determined from the sequence data. For example, in some embodiments RNA sequence information may be analyzed to determine whether the nucleic acid was primarily polyadenylated or not.
  • sequence information 106 can include information included in a FASTA file, a description and/or quality scores included in a FASTQ file, an aligned position included in a BAM file, and/or any other suitable information obtained from any suitable file.
  • the sequence information 106 may be generated using a nucleic acid from a sample from a subject.
  • Reference to a nucleic acid may refer to one or more nucleic acid molecules (e.g., a plurality of nucleic acid molecules).
  • the sequence information may be sequence data indicating a nucleotide sequence of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease.
  • the nucleic acid is deoxyribonucleic acid (DNA).
  • the nucleic acid is prepared such that the whole genome is present in the nucleic acid.
  • the nucleic acid is processed such that only the protein coding regions of the genome remain (e.g., the exome). Sequencing such nucleic acid is referred to as whole exome sequencing (WES).
  • a variety of methods are known in the art to isolate the exome for sequencing, for example, solution based isolation wherein tagged probes are used to hybridize the targeted regions (e.g., exons) which can then be further separated from the other regions (e.g., unbound oligonucleotides). These tagged fragments can then be prepared and sequenced.
  • the nucleic acid is ribonucleic acid (RNA).
  • sequenced RNA comprises both coding and non-coding transcribed RNA found in a sample (sometimes referred to as total RNA).
  • the nucleic acids can be prepared such that the coding RNA (e.g., mRNA) is isolated and used for sequencing. This can be done through any means known in the art, for example by isolating or screening the RNA for polyadenylated sequences. This is sometimes referred to as mRNA-Seq.
  • sequence information 106 may include raw DNA or RNA sequence data, DNA exome sequence data (e.g., from whole exome sequencing (WES)), DNA genome sequence data (e.g., from whole genome sequencing (WGS)), RNA expression data, gene expression data, bias-corrected gene expression data, or any other suitable type of sequence data comprising data obtained from the sequencing platform 104 and/or comprising data derived from data obtained from sequencing platform 104 .
  • the origin or preparation of the sequencing information 106 may include any of the embodiments described with respect to the “Expression Data,” “Obtaining RNA expression data,” “Alignment and annotation,” “Removing non-coding transcripts,” and “Conversion to TPM and gene aggregation” sections.
  • the sequence information 106 may be processed using computing device 108 in order to determine cell composition percentages 110 .
  • the sequence information 106 may be processed by one or more software programs running on computing device 108 (e.g., as described herein with respect to FIG. 10 ).
  • the sequence information 106 may be processed according to the machine-learning based approach of FIGS. 2A-2C , or any other methods described herein for determining cell composition percentages (e.g., such as the non-linear deconvolution methods described at least with respect to FIGS. 2A-2C and 3A-3C and the linear deconvolution methods described at least with respect to FIGS. 8 and 9A-9B).
  • the computing device 108 may be operated by a user such as a doctor, clinician, researcher, patient, or other individual.
  • the user may provide the sequence information 106 as input to the computing device 108 (e.g., by uploading a file), and/or may provide user input specifying processing or other methods to be performed using the sequence information.
  • the result may be one or more cell composition percentages 110 .
  • each cell composition percentage may represent an estimated percentage of cells of a particular respective type in the biological sample 102 .
  • the cell composition percentages are normalized so that the biological sample as a whole represents 100%.
  • Cell types may include, for example, B-cells, Plasma B-cells, Non plasma B cells, T cells, CD4+ T-cells, CD8+ T-cells, Treg, T helpers, CD8+ PD1-high, CD8+ PD1-low, NK-cells, monocytes, macrophages, resting tumor associated macrophages (TAM), M1-like or activated macrophages, neutrophils, endothelial cells, and fibroblasts, and/or any other suitable cell types.
  • a cell type may comprise one or more subtypes.
  • T cells may have subtypes including CD4+ T cells, CD8+ T cells, Tregs, etc.
  • the cell composition percentages 110 may include percentages for cell subtypes as well as cell types which are not subtypes of any other cell types. According to some embodiments, the cell composition percentages may include a percentage for an “Other” cell type, which may represent an estimated percentage of cells not accounted for in the other cell composition percentages (e.g., cells of one or more types not explicitly included in the analysis).
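  • The normalization and the “Other” category described above can be sketched as follows in Python. The sketch assumes that only top-level cell types (not subtypes) are passed in, so that no cells are counted twice, and that overshooting estimates are rescaled proportionally; both choices, along with the function name, are assumptions made for illustration.

```python
from typing import Dict


def normalize_with_other(estimates: Dict[str, float]) -> Dict[str, float]:
    """Normalize top-level cell type percentages so the sample totals 100%."""
    # Negative estimates are not meaningful percentages; clip them to zero.
    clipped = {ct: max(0.0, pct) for ct, pct in estimates.items()}
    total = sum(clipped.values())
    if total > 100.0:
        # Estimates overshoot: rescale proportionally, leaving no "Other".
        result = {ct: 100.0 * pct / total for ct, pct in clipped.items()}
        result["Other"] = 0.0
    else:
        # The remainder is attributed to cell types not explicitly analyzed.
        result = dict(clipped)
        result["Other"] = 100.0 - total
    return result


under = normalize_with_other({"B_cells": 10.0, "T_cells": 25.0, "Fibroblasts": 15.0})
over = normalize_with_other({"Type_A": 80.0, "Type_B": 40.0})
```

  • In the first call the three estimates sum to 50%, so the remaining 50% is reported as “Other”; in the second call the estimates sum to 120% and are rescaled so that the total is 100%.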
  • FIG. 1B is an example diagram for determining different cell composition percentages for different cell types and cell subtypes using a non-linear regression model for each respective cell type and cell sub-type, according to some embodiments of the technology described herein.
  • a first non-linear regression model, model A 126 may be used to estimate cell composition percentage 128 for cell type A 122 , using sequence information 124 associated with cell type A 122 .
  • a second non-linear regression model, model B 136 may be used to estimate cell composition percentage 138 for cell type B 132 , using sequence information 134 associated with cell type B 132 .
  • cell type A 122 and cell type B 132 are different cell types.
  • cell type A 122 may include B-cells, while cell type B 132 may include T cells.
  • cell type A and/or cell type B may be any suitable cell type, as aspects of the technology described herein are not limited in that respect.
  • sequence information 124 and sequence information 134 may be obtained for cell type A 122 and cell type B 132 , respectively.
  • sequence information may be associated with a set of genes that is specific and/or semi-specific to the cell type.
  • sequence information 124 may be associated with a first set of genes that is specific to cell type A 122
  • sequence information 134 may be associated with a second set of genes that is specific to cell type B 132 .
  • Techniques for identifying genes that are specific and/or semi-specific to a particular cell type and/or subtype may include any of the embodiments described with respect to the “Gene Selection & Specificity” section.
  • model A 126 is used to estimate cell composition percentage 128 for cell type A 122
  • model B 136 is used to estimate cell composition percentage 138 for cell type B 132
  • each of the models may be trained to estimate cell composition percentages for a specific cell type, as described herein including at least with respect to FIG. 4 .
  • cell types may include cell subtypes.
  • cell subtypes of close origin may share common genes (e.g., with one another and/or with the cell type from which they were differentiated).
  • cell type B 132 includes subtype A 142 and subtype B 162 .
  • cell type B 132 may include T cells
  • subtype A 142 and subtype B 162 may include subtypes of T cells (e.g., CD4+ and CD8+ T cells).
  • a third non-linear regression model, model C 146 may be used to estimate cell composition percentage 148 for subtype A 142 , using sequence information 144 .
  • a fourth non-linear regression model, model D 156 may be used to estimate cell composition percentage 158 for subtype B 162 , using sequence information 164 .
  • sequence information 144 and sequence information 164 may be obtained for subtype A 142 and subtype B 162 , respectively. In some embodiments, this may include obtaining sequence information associated with a gene set that includes genes specific and/or semi-specific to the subtype. For example, sequence information 144 may be associated with a first set of genes that is specific to subtype A 142 , while sequence information 164 may be associated with a second set of genes that is specific to subtype B 162 . Techniques for identifying genes that are specific and/or semi-specific to a particular cell type and/or subtype may include any of the embodiments described with respect to the “Gene Selection & Specificity” section.
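  • The arrangement of FIG. 1B, in which each cell type or subtype has its own model and its own gene set, can be sketched in Python as a pair of lookup tables. The gene sets below are small illustrative placeholders (not the gene sets of Table 2), and `toy_model` is a stand-in for a trained non-linear regression model, so its outputs are not meaningful percentages.

```python
from typing import Callable, Dict, List

# A model maps the expression values of its gene set to an estimate.
Model = Callable[[List[float]], float]


def estimate_all(
    expression: Dict[str, float],
    gene_sets: Dict[str, List[str]],
    models: Dict[str, Model],
) -> Dict[str, float]:
    """Apply each cell type's model to the expression of that type's gene set."""
    estimates = {}
    for cell_type, genes in gene_sets.items():
        # Each model only sees expression values for its own gene set.
        inputs = [expression.get(gene, 0.0) for gene in genes]
        estimates[cell_type] = models[cell_type](inputs)
    return estimates


def toy_model(inputs: List[float]) -> float:
    # Placeholder for a trained model (assumption): returns the mean input.
    return sum(inputs) / len(inputs)


# Illustrative placeholder gene sets; CD3E is shared by the T cell subtypes,
# mirroring a gene that is semi-specific to more than one subtype.
gene_sets = {
    "B_cells": ["CD19", "MS4A1"],
    "T_cells": ["CD3E"],
    "CD4_T_cells": ["CD3E", "CD4"],
    "CD8_T_cells": ["CD3E", "CD8A"],
}
models = {cell_type: toy_model for cell_type in gene_sets}
expression = {"CD19": 10.0, "MS4A1": 20.0, "CD3E": 30.0, "CD4": 8.0, "CD8A": 4.0}
estimates = estimate_all(expression, gene_sets, models)
```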
  • FIG. 1C is a t-SNE visualization depicting expression data for a plurality of genes for exemplary cell populations including malignant and microenvironment cells.
  • the cell types and/or subtypes depicted in the t-SNE plot include macrophages, M1 macrophages, M2 macrophages, B cells, B cells (non-plasma), Plasma B cells, T cells, CD8+ T cells, PD1+ CD8+ T cells, PD1− CD8+ T cells, CD4+ T cells, Tregs, T helpers, endothelium cells, monocytes, NK cells, fibroblasts, neutrophils and tumor cells (e.g., cancer cells).
  • Malignant cells may comprise tumor cells, or any other cells associated with disease and/or diseased tissue.
  • Microenvironment cells may comprise any non-tumor cells, including, for example, immune cells, skin cells, or any other cells not included in the tumor cells.
  • RNA-seq samples which may be collected from biological samples via any of the sequencing techniques described herein.
  • the RNA-seq datasets may be combined, homogeneously annotated, and bioinformatically recalculated (e.g., expression values are bioinformatically recalculated) to obtain accurate and comparable measurements of transcript expression.
  • RNA-seq data was available for 12,450 sorted samples (e.g., sorted by flow cytometry and magnetic-assisted sorting of cells with beads), which could be subdivided into nineteen cell populations of interest. After the removal of low coverage samples and quality checks, the selected samples were distributed between 10 major cell types and 19 cell subpopulations, listed in Table 1, below.
  • the quality control techniques may include any of the embodiments described in the “Data collection, analysis, and preprocessing” section, or any other suitable quality control techniques.
  • data derived from cells with abnormal physiological states may be identified (e.g., based on the annotations provided with the data) and excluded.
  • T cell samples with phorbol myristate acetate/ionomycin activation and/or induced pluripotent stem cell-derived samples were excluded.
  • samples with low isolation purity, poor sequencing quality parameters, high contamination from other organisms (e.g., organisms other than the primary organism under investigation), and/or low coverage were also eliminated.
  • the cell populations may include tumor cells 152 .
  • the cancer types may include breast cancer, colorectal cancer, head and neck cancer, kidney cancer, lung cancer, melanoma, pancreatic cancer, prostate cancer, stomach cancer, and/or any other types of cancer.
  • some or all of the samples of RNA expression data plotted in FIGS. 1C and 1D may be used as part of selecting specific and/or semi-specific genes for particular cell types/subtypes, as described herein including at least with respect to FIG. 1E .
  • some or all of the illustrated samples of RNA expression data may be used as part of generating artificial mixes of RNA expression data, as described herein at least with respect to FIG. 6A .
  • RNA expression data may be used.
  • similar data sets that include some or all of the cell types represented in Table 1, each represented by a plurality of samples from a plurality of datasets as illustrated in Table 1, can be used.
  • FIG. 1E is a heatmap depicting exemplary expressions of genes 170 for cell types 160 .
  • the vertical axis represents the cell types 160
  • the horizontal axis represents the expression of genes 170 in transcripts per million (TPM).
  • Each row in the heat map represents a single RNA-seq sample.
  • some genes may be considered specific to certain cell types.
  • the selected genes 190 may be correlated with the RNA percentage in corresponding sorted cell populations 180 .
  • the selected genes 192 may have limited or no expression for tumor cell lines 182 .
  • Table 2 specifies, for each of multiple cell types, a set of genes which may be considered specific or semi-specific to that cell type, and/or which may be used for the deconvolution techniques described herein.
  • the cellular deconvolution techniques developed by the inventors may involve using only certain gene expression data in order to determine cell composition percentages for a particular cell type.
  • only expression data of specific and/or semi-specific genes for the particular cell type may be used, as described herein including at least with respect to FIGS. 2A-2C .
  • genes which are highly expressed in malignant cells (e.g., cancer cell lines) may be excluded (e.g., genes specific to tumor cells) from the specific and/or semi-specific genes for a particular cell type (e.g., non-malignant cell types).
  • selecting specific and/or semi-specific genes for a particular cell type may comprise performing any or all of the following techniques: literature analysis, fold change analysis with statistical Kruskal-Wallis test (nonparametric ANOVA analogue), Conover-Iman test (nonparametric pairwise test for multiple comparisons), and/or correlation analysis using the RNA-seq data from FIGS. 1C-1D .
  • gene sets may be collected from various sources. In some embodiments, only genes with a known function may be used. Some genes may be similar to the labels used in CYTOF, some may be taken from literature data (which may demonstrate the specificity of certain genes), and/or some genes may be found on existing RNA-seq samples of sorted cells (e.g., after filtering experimental conditions, sequencing quality, and quality by expressions). The search for genes in samples may be carried out in several ways: using differential gene expression, using correlations of gene expression with the proportion of cells in artificial mixes (e.g., as described herein including at least with respect to FIG.
  • a gene may be considered “specific” to a particular cell type or subtype if it is only expressed in the particular cell type or cell subtype.
  • a gene may be considered “semi-specific” to a particular cell type or subtype when: (1) it is expressed both in the particular cell type or subtype and in one or more other cell types or subtypes; and (2) it is expressed to a greater degree in the particular cell type or subtype than in the other cell type(s) or subtype(s).
  • a gene may be considered semi-specific for a particular cell type or subtype if the average expression of the gene in the particular cell type or sub-type is at least a threshold percentage (e.g., 50%, 100%, 200%, 500%, 1000%, etc.) or threshold factor (e.g., a factor of 2, 5, 10, 15, 20, etc.) higher than the average expression of the same gene in the other cell types or sub-types.
  • a gene may be considered semi-specific for a particular cell type or subtype if the average expression of the gene in the cell type or subtype is at least ten times higher than the average expression of the gene in the other cell types or subtypes.
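  • The definitions of “specific” and “semi-specific” genes given above can be sketched as a simple classification rule in Python. The sketch assumes that a gene counts as expressed when its average expression exceeds a hypothetical `detection_limit`, and that the target cell type is compared against the highest average among the other cell types; both are assumptions, as is the default factor of ten taken from the example above.

```python
from statistics import mean
from typing import Dict, List


def classify_gene(
    expression_by_type: Dict[str, List[float]],
    target: str,
    factor: float = 10.0,
    detection_limit: float = 1.0,
) -> str:
    """Classify one gene relative to a target cell type.

    `expression_by_type` maps each cell type to the gene's expression values
    (e.g., in TPM) across sorted-cell samples of that type.
    """
    target_mean = mean(expression_by_type[target])
    other_means = [
        mean(values) for ct, values in expression_by_type.items() if ct != target
    ]
    # Assumption: compare against the highest-expressing other cell type.
    highest_other = max(other_means)
    if target_mean <= detection_limit:
        return "not expressed"
    if highest_other <= detection_limit:
        return "specific"        # expressed only in the target cell type
    if target_mean >= factor * highest_other:
        return "semi-specific"   # expressed elsewhere, but far lower
    return "non-specific"


only_in_t = classify_gene({"T": [100.0, 120.0], "B": [0.0, 0.5], "NK": [0.2, 0.1]}, "T")
mostly_in_t = classify_gene({"T": [100.0, 120.0], "B": [5.0, 7.0], "NK": [0.0, 0.0]}, "T")
shared = classify_gene({"T": [100.0, 120.0], "B": [50.0, 60.0]}, "T")
```

  • With these toy values, the three genes are classified as "specific", "semi-specific", and "non-specific", respectively.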
  • the common genes may be considered semi-specific to the cell types and/or subtypes (e.g., semi-specific to both CD4+ T cells and CD8+ T cells.)
  • genes may be selected because their expression is significantly lower or absent in malignant cell (e.g., tumor) lines.
  • the specificity criterion can be evaluated when assessed on combined expression data from a plurality of datasets, as described above. In some embodiments, if several types of cells are present in the same dataset, then for each such dataset, a similar specificity analysis may also be carried out inside the datasets to control batch effects.
  • analysis may be performed to determine how these genes are expressed in TCGA (The Cancer Genome Atlas) for the desired type of tumor. For example, for a given cell type, it may be desirable that the ratios of the average TCGA expression to the average expression lie within a comparable range. In other words, if the average expression of a specific or semi-specific gene (e.g., in a specific or semi-specific set of genes) in TCGA is 70% of the average expression in the samples of the sorted cells, while the other gene expressions of this set are around 5%, then the specific or semi-specific gene is likely expressed by a tumor or other cells, or the cells in the tumor differ greatly in the expression of this gene.
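  • The ratio check described above can be sketched in Python as follows. For each gene in a set, the ratio of its average TCGA expression to its average expression in sorted cells is computed, and genes whose ratio is far above the typical ratio of the set are flagged. The gene names, the use of the median as the “typical” ratio, and the `max_fold` threshold are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def flag_ratio_outliers(
    tcga_mean: Dict[str, float],
    sorted_mean: Dict[str, float],
    max_fold: float = 5.0,
) -> List[str]:
    """Return genes whose TCGA-to-sorted-cell expression ratio is atypically high."""
    ratios = {
        gene: tcga_mean[gene] / sorted_mean[gene]
        for gene in sorted_mean
        if sorted_mean[gene] > 0
    }
    # Assumption: the median ratio represents the "comparable range" of the set.
    typical = median(ratios.values())
    return sorted(gene for gene, ratio in ratios.items() if ratio > max_fold * typical)


# Hypothetical gene set mirroring the example above: three genes at roughly 5%
# of their sorted-cell expression and one gene at 70%.
sorted_mean = {"GENE_A": 100.0, "GENE_B": 100.0, "GENE_C": 100.0, "GENE_D": 100.0}
tcga_mean = {"GENE_A": 5.0, "GENE_B": 4.0, "GENE_C": 6.0, "GENE_D": 70.0}
flagged = flag_ratio_outliers(tcga_mean, sorted_mean)
```

  • Here only the hypothetical GENE_D is flagged, consistent with the interpretation that such a gene is likely expressed by tumor cells or other cells in the sample.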
  • it may be desirable for the expression of genes from the same set to correlate with one another among the TCGA samples for this type of tumor (e.g., the desired type of tumor, above).
  • the mean among the correlations with the other genes from the set may be analyzed.
  • the characteristic values of the expression of the considered genes in TCGA LUAD may be low (e.g., less than 10 TPM), so the correlations of these genes with each other may also be low (e.g., due to insufficient sequencing depth). In some cases, there may be especially low gene expressions of NK cells and neutrophils.
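  • The correlation analysis described above can be sketched in Python by computing, for each gene in a set, the mean of its Pearson correlations with the other genes of the set across samples. The use of the Pearson coefficient, the convention of returning 0.0 for a gene with constant expression, and the gene names are assumptions made for illustration.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention for constant expression
    return cov / sqrt(vx * vy)


def mean_correlation_with_set(expr: Dict[str, List[float]]) -> Dict[str, float]:
    """For each gene, average its correlation with every other gene in the set."""
    genes = sorted(expr)
    result = {}
    for gene in genes:
        others = [pearson(expr[gene], expr[h]) for h in genes if h != gene]
        result[gene] = sum(others) / len(others)
    return result


# Hypothetical expression of three genes across four tumor samples.
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
mean_corr = mean_correlation_with_set(expr)
```

  • In this toy example GENE_C is anti-correlated with both other genes, so its mean correlation is the lowest of the set, which would mark it as a candidate for closer inspection.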
  • hematopoietic immune cells express CD45 (PTPRC) and HCLS1. Due to their development, immune cells can be divided into lymphocytes and myeloid cells. In turn, lymphocytes can be divided into T, B, NK cells, then CD4+ and CD8+ T cells can be distinguished from among T cells. But among these cells, there are also subtypes that can play an important role both in the development of tumors and in the course of treatment. Therefore, as described herein, it may be desirable for cell composition percentages to be determined for subtypes of certain cells.
  • RNA expression data may be difficult, since fewer specific and/or semi-specific genes may be expressed in cell subtypes, and the number of such cells in the tumor microenvironment may be smaller than the combined groups of cells.
  • one way to improve the accuracy of determining both cell types and subtypes may be to use information on the expression of genes specific and/or semi-specific for the combined group of cells (e.g., including cell types and cell subtypes that share common genes) in determining cell composition percentages for the cell subtypes.
  • Such common genes can be used when determining cell composition percentages of individual cell types and subtypes, for example.
  • Another way to use genes common to a group of cell subtypes may be to initially calculate a cell composition percentage for the combined group, then refine that calculation in order to determine cell composition percentages for individual cell types in the group, as described elsewhere herein.
  • FIG. 2A is a flowchart depicting a method 200 for determining a cell composition percentage for at least one cell type.
  • the method 200 may be carried out on a computing device (e.g., as described herein including at least with respect to FIG. 10 ).
  • the computing device may include at least one processor, and at least one non-transitory storage medium storing processor-executable instructions which, when executed, perform the acts of method 200 .
  • the method 200 may be carried out, for example, in a system such as system 100 (which may include, for example, a clinical setting or a laboratory setting), by one or more computing devices such as by computing device 108 .
  • the method 200 begins with obtaining expression data for a biological sample from a subject.
  • obtaining expression data may include obtaining expression data from a biological sample that has been previously obtained from a subject using any suitable techniques.
  • obtaining the expression data may include obtaining expression data that has been previously obtained from a biological sample (e.g., obtaining the expression data by accessing a database.)
  • the expression data is RNA expression data. Examples of RNA expression data are provided herein.
  • the subject may have, be suspected of having, or be at risk of having cancer. As described herein including with respect to FIG.
  • the biological sample may comprise a biopsy (e.g., of a tumor or other diseased tissue of the subject), any of the embodiments described herein including with respect to the “Biological Samples” section, or any other suitable type of biological sample.
  • the origin or preparation of the expression data may include any of the embodiments described with respect to the “Expression Data” and “Obtaining RNA expression data” sections.
  • the expression data may be RNA expression data extracted using any suitable techniques.
  • the expression data obtained at act 202 may comprise RNA expression data measured in TPM.
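  • For reference, the standard definition of transcripts per million (TPM) can be sketched in Python as follows: each read count is divided by the transcript length, and the resulting rates are scaled so that they sum to one million. The function name and the toy counts and lengths are hypothetical; the conversion actually used is described in the “Conversion to TPM and gene aggregation” section.

```python
from typing import Dict


def counts_to_tpm(
    counts: Dict[str, float], lengths_kb: Dict[str, float]
) -> Dict[str, float]:
    """Convert read counts to TPM given transcript lengths in kilobases."""
    # Reads per kilobase: normalize each count by transcript length.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    scale = sum(rpk.values())
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that the values of one sample sum to one million.
    return {gene: 1e6 * value / scale for gene, value in rpk.items()}


# Hypothetical counts: GENE_B has three times the reads but is three times
# longer, so both genes end up with the same TPM.
tpm = counts_to_tpm(
    {"GENE_A": 100.0, "GENE_B": 300.0},
    {"GENE_A": 1.0, "GENE_B": 3.0},
)
```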
  • the expression data may be stored on at least one storage medium and accessed as part of act 202 .
  • the expression data may be stored in one or more files or in a database, then read.
  • the at least one storage medium storing the RNA expression data may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment).
  • the expression data may be stored on a single storage medium or may be distributed across multiple storage mediums.
  • the expression data of act 202 may include first expression data associated with a first set of genes associated with a first cell type (e.g., a cell type of the cell types and/or subtypes being analyzed in the biological sample).
  • the first set of genes may comprise genes that are specific and/or semi-specific to the first cell type, as described herein at least with respect to FIG. 1E .
  • the set of genes may comprise: ANGPT2, APLN, CDH5, CLEC14A, ECSCR, EMCN, ENG, ESAM, ESM1, FLT1, HHIP, KDR, MMRN1, MMRN2, NOS3, PECAM1, PTPRB, RAWL ROBO4, SELE, TEK, TIE1, and/or VWF.
  • the first set of genes may be the same as a set of genes, or a subset of a set of genes, used as part of training a corresponding non-linear regression model for the cell type, as described herein including at least with respect to FIGS. 4-6 .
  • determining a first cell composition percentage for the first cell type may comprise processing first expression data associated with a first set of genes for the first cell type with a first non-linear regression model (e.g., of the one or more non-linear regression models) to determine the first cell composition percentage for the first cell type.
  • the first expression data may be provided as input to the first non-linear regression model.
  • other information may be provided as part of the input to the non-linear regression model.
  • a median of the expression data may be included as part of the input to the non-linear regression model.
  • any other suitable information may additionally or alternatively be provided as part of the input (e.g., an average of the expression data, a median or average of a subset of the expression data, or any other suitable statistics derived from or otherwise relating to the expression data).
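  • Assembling the input to a non-linear regression model as described above can be sketched in Python as follows: the expression values for the cell type's gene set are selected and a median of the expression data is appended. The sketch assumes that the median is taken over all genes in the sample and that genes missing from the sample are treated as having zero expression; the function name and the toy expression values are hypothetical.

```python
from statistics import median
from typing import Dict, List


def build_model_input(expression: Dict[str, float], gene_set: List[str]) -> List[float]:
    """Return the gene-set expression values followed by a sample-wide median."""
    # Assumption: genes absent from the sample are treated as not expressed.
    values = [expression.get(gene, 0.0) for gene in gene_set]
    # Assumption: the median is computed over all genes in the sample.
    return values + [median(expression.values())]


# Toy sample (values in TPM are made up); CDH5, VWF, and KDR appear in the
# example gene set listed above, the other two genes are unrelated.
expression = {"CDH5": 40.0, "VWF": 60.0, "KDR": 20.0, "ACTB": 1000.0, "GAPDH": 800.0}
model_input = build_model_input(expression, ["CDH5", "VWF", "KDR"])
```

  • The resulting vector has one entry per gene in the set plus one summary statistic, and other statistics (e.g., an average) could be appended in the same way.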
  • parts of act 204 may be repeated and/or performed in parallel for each cell type and/or subtype being analyzed.
  • a subset of the expression data may be provided as input to each non-linear regression model for each respective cell type and/or subtype.
  • the output of the non-linear regression model may comprise information representing an estimated percentage of RNA from the first cell type in the sample.
  • the estimated percentage of RNA from the first cell type may be used to calculate a corresponding cell composition percentage for the first cell type.
  • the techniques described herein including at least with respect to FIG. 3C may be applied as part of processing the non-linear regression model, such that the output of the non-linear regression model may be an estimated cell composition percentage for the first cell type rather than an estimated percentage of RNA.
  • process 200 then proceeds to act 206 for outputting the first cell composition percentage.
  • the output(s) of the one or more non-linear regression models may be combined, stored, or otherwise post-processed as part of method 200 .
  • the cell composition percentages for each cell type may be stored locally on the computing device used to perform method 200 (e.g., on the non-transitory storage medium).
  • the cell composition percentages may be stored in one or more external storage mediums (e.g., such as a remote database or cloud storage environment).
  • FIG. 2B is an example implementation of method 200 for determining a cell composition percentage based on expression data.
  • implementing method 200 may include any suitable combination of acts included in the example flowchart of FIG. 2B .
  • implementing method 200 may include additional or alternative steps that are not shown in FIG. 2B .
  • executing method 200 may include every act included in the example flowchart.
  • method 200 may include only a subset of the acts included in the example flowchart (e.g., acts 212 and 216 , acts 212 , 214 , 216 , and 218 , acts 212 , 216 , and 220 , etc.).
  • the example implementation 220 begins at act 212 , where expression data is obtained for a biological sample from a subject. Obtaining expression data for a biological sample from a subject is described herein above including with respect to act 202 of FIG. 2A .
  • act 212 may include obtaining first expression data and second expression data.
  • the first expression data may be associated with a first set of genes that is associated with a first cell type, while the second expression data may be associated with a second set of genes that is associated with a second cell type.
  • the first expression data may be associated with a first set of genes that is associated with B cells, while the second expression data may be associated with a second set of genes that is associated with T cells.
  • the first expression data may be associated with a first set of genes associated with a first cell subtype, while the second expression data may be associated with a second set of genes associated with a second cell subtype.
  • the first expression data may be associated with a first set of genes associated with CD4+ cells, while the second expression data may be associated with a second set of genes associated with CD8+ cells.
  • Techniques for identifying genes associated with different cell types and/or subtypes are described herein including with respect to the “Gene Selection & Specificity” section.
  • the example method 220 proceeds to act 214 , where the expression data is pre-processed.
  • the pre-processing may make the expression data suitable to be processed using the one or more non-linear regression models.
  • the expression data may be sorted, combined, organized into batches, filtered, or pre-processed with any other suitable techniques.
  • techniques for processing the expression data may include any of the embodiments described with respect to the “Alignment and annotation,” “Removing non-coding transcripts,” and “Conversion to TPM and gene aggregation” sections.
  • example method 220 proceeds to act 216 , where a plurality of cell composition percentages may be determined for a plurality of cell types using the expression data and one or more non-linear regression models (e.g., at least five, at least ten, or at least fifteen models).
  • each non-linear regression model may be trained according to the techniques described herein including at least with respect to FIGS. 4-6 .
  • a separate non-linear regression model may be used to estimate a cell composition percentage for each cell type and/or subtype.
  • act 216 may include act 216 a and act 216 b , each of which includes using a separate non-linear regression model trained for determining cell composition percentages for the first and second cell types and/or subtypes, respectively.
  • Act 216 a includes determining a first cell composition percentage for the first cell type using the first expression data and a first non-linear regression model.
  • Act 216 b includes determining a second cell composition percentage for the second cell type using the second expression data and a second non-linear regression model.
  • act 216 may include only one of acts 216 a and 216 b .
  • act 216 may include using one or more additional non-linear regression models for determining cell composition percentages for one or more other cell types (e.g., a third cell type or subtype).
  • An example implementation of act 216 a is described herein including with respect to FIG. 2C .
  • example method 220 proceeds to act 218 for outputting the plurality of cell composition percentages.
  • the plurality of cell composition percentages may be output through a graphical user interface, saved to memory, transmitted to one or more other computing devices and/or output in any other suitable way.
  • techniques may be used to post-process the plurality of cell composition percentages output at act 218 and/or the expression data obtained at act 212 .
  • post-processing techniques may include using the cell composition percentages and expression data to determine a malignancy expression profile for the biological sample at act 220 .
  • a malignancy expression profile may include information indicative of the expression of malignant cells included in the biological sample. For example, this may include the expression of different genes associated with the malignant cells.
  • determining the malignancy expression profile may include (a) estimating the expression profile for TME cells in the biological sample and (b) subtracting the expression of the TME cells from the total expression (e.g., bulk expression data, expression data obtained at act 212 , etc.) of the biological sample.
  • FIG. 2C shows an example implementation of act 216 a for determining, using the first expression data and the first non-linear regression model, a first cell composition percentage for the first cell type.
  • the first non-linear regression model may include a first sub-model and/or a second sub-model for processing the first expression data (e.g., as shown in FIG. 3C ).
  • the first expression data may include first expression data associated with a first set of genes associated with the first cell type, as well as second expression data associated with a second set of genes associated with the first cell type.
  • the example implementation begins at act 232 , for predicting a first value for the estimated percentage of RNA from the first cell type, using a first sub-model.
  • the first expression data associated with the first set of genes and/or any other input information may be provided as input to the first sub-model of the non-linear regression model, and the output may be a predicted percentage of RNA from the first cell type.
  • the example implementation proceeds to act 234 , for predicting a second value for the estimated percentage of RNA from the first cell type, using a second sub-model.
  • the second expression data associated with the second set of genes may be provided as input to the second sub-model of the non-linear regression model in addition to the prediction from the first sub-model and/or any other input information provided at the first sub-model. Additionally or alternatively, the first expression data associated with the first set of genes may be provided as input to the second sub-model.
  • predictions from multiple non-linear regression models may be provided as input to the second sub-model of the non-linear regression model for the first cell type.
  • the output of the second sub-model of the non-linear regression model may be an estimated percentage of RNA from the first cell type in the sample.
  • the output of the second sub-model may comprise the output of the non-linear regression model for the first cell type, in some embodiments.
  • the non-linear regression model may comprise more than two sub-models.
  • the second sub-model may be repeated any number of times, with the predictions from one or more of the prior sub-models being included as input each time.
  • determining the estimated percentage of RNA from the first cell type may include (a) estimating the number of cells of the first type included in the biological sample and (b) estimating the total number of cells included in the biological sample (e.g., using equation 350 .) Estimating the number of cells of the first type may include comparing the estimated percentage of RNA (e.g., R cell of equation 350 ) to an RNA per cell coefficient (e.g., A cell of equation 350 .) Estimating the total number of cells may include estimating the number of cells of each cell type, then summing those values. Techniques for estimating cell composition percentages are described herein including with respect to FIG. 3C .
  • FIG. 3A is a diagram depicting an illustrative use of a machine learning method for determining RNA percentages based on RNA expression data.
  • RNA expression data from primary tumor samples 302 available on the TCGA database is processed according to the machine learning techniques described herein including at least with respect to FIGS. 2A-2C , in order to arrive at corresponding estimated RNA percentages 306 for T cells, CD4+ T cells, and CD8+ T cells.
  • the RNA expression data for the tumor samples 302 is obtained from an online database of RNA expression data (e.g., from The Cancer Genome Atlas (TCGA) database, in this example).
  • the RNA expression data may be obtained from any suitable source, including one or more databases such as TCGA, or directly from a biological sample (e.g., as described herein including at least with respect to FIG. 1A ).
  • the RNA expression data may be processed using non-linear regression models 304 .
  • the non-linear regression models 304 may be implemented using a gradient boosting technique (e.g., as implemented in XGBoost) as described herein including at least with respect to FIGS. 4-6 .
  • non-linear regression models 304 may comprise a separate non-linear regression model for each of multiple cell types.
  • the non-linear regression models 304 include a non-linear regression model for T cells, a non-linear regression model for CD4+ T cells, and a non-linear regression model for CD8+ T cells. As shown, additional non-linear regression models for one or more additional cell types and/or subtypes may be provided, in some embodiments.
  • the input to the non-linear regression models 304 may comprise a select subset of the RNA expression data for each non-linear regression model.
  • the input to a non-linear regression model for a particular cell type may comprise RNA expression data for specific and/or semi-specific genes for that cell type.
  • the non-linear regression model for T cells may take as input RNA expression data for genes: CAMK4, CBLB, CD2, CD226, CD3D, CD3E, CD3G, CD48, CD5, CD6, CD7, FLT3LG, ITK, KCNA3, KLRB1, LAG3, LAT, LCK, LTA, SIRPG, SIT1, SLA2, TBX21, TCF7, TESPA1, TRAC, TRAF3IP3, TRAT1, TRBC2, TRDC, TRGC1, TRGC2, UBASH3A, ZBED2.
  • other information about the RNA expression data (e.g., a median of the RNA expression data, or any other suitable statistics) may additionally be provided as input to the non-linear regression models 304 .
  • the output of non-linear regression models 304 may be RNA percentages 306 for respective cell types and/or subtypes.
  • the non-linear regression model for T cells may produce as its output a predicted percentage of RNA from T cells in the input RNA expression data.
  • the non-linear regression model for CD4+ T cells may produce as its output a predicted percentage of RNA from CD4+ T cells.
  • the non-linear regression model for CD8+ T cells may produce as its output a predicted percentage of RNA from CD8+ T cells.
  • the predicted percentages of RNA may be used to calculate corresponding cell composition percentages for some or all of the cell types and/or subtypes being analyzed.
  • the sum of the predictions for the subtypes may or may not be equal to the prediction for the type comprising those subtypes.
  • the sum of predictions for CD4+ T cells and CD8+ T cells may exceed the prediction for T cells, or the sum of predictions for CD4+ T cells and CD8+ T cells may be lower than the prediction for T cells.
  • the sum of the subtype predictions may be equal to the total type prediction, and/or the subtype predictions may be normalized or adjusted so that their sum is equal to the total type prediction.
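  • One simple way to adjust subtype predictions so that they sum to the prediction for the parent type is proportional rescaling, sketched below. Proportional rescaling is an assumption chosen for illustration (this description does not prescribe a particular adjustment), and the function name and the numeric values are hypothetical.

```python
def rescale_subtypes(parent_value, subtype_values):
    """Scale subtype predictions so that they sum to the parent prediction.

    parent_value: predicted fraction for the parent type (e.g., T cells).
    subtype_values: dict mapping subtype name -> predicted fraction.
    """
    total = sum(subtype_values.values())
    if total == 0:
        # Nothing to rescale; return the predictions unchanged.
        return dict(subtype_values)
    factor = parent_value / total
    return {name: value * factor for name, value in subtype_values.items()}


# Made-up predictions: the subtypes sum to 0.25, the parent type is 0.20.
adjusted = rescale_subtypes(0.20, {"CD4_T_cells": 0.15, "CD8_T_cells": 0.10})
```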
  • FIG. 3B is a diagram depicting use of non-linear regression models 320 , 322 , 324 comprising first sub-models 326 , 328 , 330 and second sub-models 338 , 340 , 342 for determining RNA percentages based on RNA expression data.
  • each example non-linear regression model includes a first sub-model 326 , 328 , 330 , for generating a first value 332 , 334 , 336 for the estimated percentage of RNA from each cell type, and a second sub-model 338 , 340 , 342 for generating a second value 344 , 346 , 348 for the estimated percentage of RNA from each cell type.
  • expression data 316 may be obtained from a set of genes associated with cell type B 310 and used as input to the non-linear regression model 322 .
  • cell type B 310 may include immune cells and the expression data 316 may include expression data for the genes ADAP2, ADGRE3, ADGRG3, C1QA, C1QC, and C3AR1 (e.g., from the gene set associated with immune cells listed in Table 2.)
  • at least some of the expression data 316 (e.g., expression data associated with a subset of genes, expression data associated with all the genes, etc.) may be provided as input to the first sub-model 328 .
  • the first sub-model may then process the input expression data to determine a first value 334 of the estimated percentage of RNA from cell type B 310 .
  • the example non-linear regression model 322 may include a second sub-model 340 to generate a second value 346 of the estimated percentage of RNA from cell type B 310 .
  • the second sub-model 340 may use one or more inputs to generate the second value 346 .
  • at least some of the expression data 316 may be used as input.
  • the expression data may include the same expression data input to the first sub-model 328 (e.g., expression data for the genes ADAP2, ADGRE3, and ADGRG3.) In some embodiments, the expression data may include the same expression data input to the first sub-model, as well as additional expression data (e.g., expression data for the genes ADAP2, ADGRE3, ADGRG3, C1QA, and C3AR1.) In some embodiments, the expression data may include expression data different from the expression data input to the first sub-model (e.g., expression data for the genes C1QA, C1QC, and C3AR1.)
  • the second sub-model 340 may take as input estimated percentages of RNA output by the first sub-models 326 , 330 of non-linear regression models 320 , 324 for other cell types 308 , 312 .
  • the second sub-model 340 for cell type B 310 takes as input the first value 332 for the estimated percentage of RNA from cell type A 308 and the first value 336 for the estimated percentage of RNA from cell type C 312 .
  • This type of input may be informative when trying to determine the percentage of RNA from a cell type that is associated with a same gene or same set of genes as another cell type(s).
  • the second sub-model 340 can use the first values 332 , 336 to make such inferences.
  • the output of the second sub-model 340 is a second value 346 for the estimated percentage of RNA from cell type B 310 .
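  • The two-pass data flow of FIG. 3B can be sketched as follows. The sub-models here are toy stand-in functions rather than trained models, and all names (predict_two_stage, toy_first, toy_second, the gene labels g1 to g4) are hypothetical. The sketch only illustrates how first-pass values for other cell types may be fed into a second sub-model.

```python
def predict_two_stage(expression, first_models, second_models, gene_sets):
    """Two-pass prediction of RNA fractions.

    first_models / second_models: dict cell type -> callable that takes a
    list of floats and returns a float.
    gene_sets: dict cell type -> list of gene names.
    """
    # Pass 1: each first sub-model sees only its own cell type's genes.
    first_values = {
        ct: first_models[ct]([expression[g] for g in gene_sets[ct]])
        for ct in gene_sets
    }
    # Pass 2: each second sub-model sees its own genes plus the first-pass
    # values predicted for the other cell types.
    second_values = {}
    for ct in gene_sets:
        own = [expression[g] for g in gene_sets[ct]]
        others = [first_values[o] for o in sorted(gene_sets) if o != ct]
        second_values[ct] = second_models[ct](own + others)
    return first_values, second_values


# Toy stand-ins for trained sub-models (purely illustrative).
def toy_first(features):
    return sum(features) / 100.0


def toy_second(features):
    # The last two entries are the other cell types' first-pass values.
    own, others = features[:-2], features[-2:]
    return max(0.0, sum(own) / 100.0 - 0.1 * sum(others))


gene_sets = {"A": ["g1"], "B": ["g2", "g3"], "C": ["g4"]}
expression = {"g1": 10.0, "g2": 20.0, "g3": 10.0, "g4": 5.0}
first_models = {ct: toy_first for ct in gene_sets}
second_models = {ct: toy_second for ct in gene_sets}

first, second = predict_two_stage(
    expression, first_models, second_models, gene_sets
)
```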
  • the estimated RNA percentages may be processed to determine cell composition percentages for each of the cell types.
  • FIG. 3C is a diagram depicting a method for determining cell composition percentages 370 based on RNA percentages 360 .
  • the method of FIG. 3C may be applied to RNA percentages predicted according to the techniques described herein including with respect to FIGS. 2 and 3A , in order to arrive at predictions for cell composition percentages for some or all of the cell types and/or subtypes being analyzed.
  • obtaining cell composition percentages based on RNA percentages may comprise applying equation 350 to the RNA percentages for each cell type.
  • equation 350 may be applied independently to each RNA percentage (e.g., in sequence), or may be applied to some or all of the RNA percentages together (e.g., in parallel) in some embodiments.
  • equation 350 may be applied initially to RNA percentages for cell types which are not subsets of one another.
  • equation 350 may subsequently be applied to RNA percentages for cell types that are a subtype of one or more initially used cell types.
  • the calculation of cell composition percentages for cell subtypes may be modified based on the initially calculated cell composition percentages. For example, in some embodiments, subsequently calculated cell composition percentages for cell subtypes may be normalized or otherwise adjusted such that they sum to the cell composition percentage for the total cell type (i.e., the initially-calculated cell type of which they are subtypes).
  • equation 350 is:
  • $$C_{cell} = \frac{R_{cell} / A_{cell}}{\sum_{cells} R_{cell} / A_{cell}}$$
  • $C_{cell}$ is the cell composition percentage for the cell type
  • $R_{cell}$ is the RNA percentage for the cell type
  • $A_{cell}$ is an RNA per cell coefficient.
  • the denominator may comprise a sum over all cell types and/or subtypes being analyzed (cells). As such, the expression $R_{cell} / A_{cell}$ may be initially computed for all cell types and/or subtypes, then used to compute individual $C_{cell}$ values for each cell type and/or subtype.
  • an RNA percentage for a cell type may be represented as a fraction or decimal (e.g., for purposes of calculation with equation 350 ).
  • an $R_{other}$ expression may be introduced, which may be equal to $1 - \sum_{cells} R_{cell}$.
  • equation 350 includes an RNA per cell coefficient A cell , which may represent an RNA concentration per cell.
  • the inventors have recognized and appreciated that the abundance of RNA per cell may depend on the cellular size and/or other factors. As such, different cell types may contribute a different amount of RNA to the bulk sample.
  • the RNA per cell coefficient can be used to allow the conversion of RNA percentages to corresponding cell composition percentages.
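  • The conversion can be sketched in a few lines, reading equation 350 as each cell type's RNA fraction divided by its RNA per cell coefficient, normalized by the sum of those ratios over all cell types. The function name, the cell type labels, and the coefficient values below are made up for illustration and are not measured values.

```python
def rna_to_cell_fractions(rna_fractions, rna_per_cell):
    """Convert RNA fractions to cell fractions.

    Computes C_cell = (R_cell / A_cell) / sum over cells of (R_cell / A_cell).
    rna_fractions: dict cell type -> R_cell (fraction of RNA).
    rna_per_cell: dict cell type -> A_cell (RNA per cell coefficient).
    """
    # R_cell / A_cell is proportional to the number of cells of each type.
    scaled = {ct: rna_fractions[ct] / rna_per_cell[ct] for ct in rna_fractions}
    total = sum(scaled.values())
    return {ct: value / total for ct, value in scaled.items()}


# Hypothetical inputs: macrophages contribute three times more RNA per cell.
rna = {"T_cells": 0.2, "B_cells": 0.2, "Macrophages": 0.6}
coeff = {"T_cells": 1.0, "B_cells": 1.0, "Macrophages": 3.0}

cells = rna_to_cell_fractions(rna, coeff)
```

  • In this toy case the three cell types end up with equal cell fractions even though macrophages account for most of the RNA, which is the effect the RNA per cell coefficient is meant to capture.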
  • the RNA per cell coefficient A cell may be determined as part of a model training process (e.g., from simulated or artificial data with known percentages of the different cell types.) In some embodiments, the RNA per cell coefficient A cell may be determined experimentally for some or all cell types.
  • RNA per cell coefficients may be obtained by accessing data relating to RNA expression for each cell type (e.g., from available scientific literature, such as PMID: 29130882, PMID: 30726743, or estimated from single cell data, using average or non-linearly transformed UMI count per cell type) and using that data to determine a corresponding RNA per cell coefficient (e.g., by analyzing purity and/or histological TCGA lymphocyte data) for each cell type.
  • the RNA per cell coefficients may be tissue specific, and could vary based on the disease being analyzed (e.g., from cancer to cancer). In some embodiments, the RNA per cell coefficient may be tissue agnostic, and may not vary based on a disease being analyzed (e.g., because non-malignant microenvironment cells may be represented by the same or substantially similar cellular phenotypes even across different cancers, tissues, or diseases). In the latter case, data from multiple types of cancers, tissues, diseases, etc. may be combined in order to calculate the RNA per cell coefficients. For example, in some embodiments, more than 10,000 different cancer tissue samples from TCGA were analyzed as part of determining RNA per cell coefficients for cell types.
  • determining RNA per cell coefficients may comprise aligning non-malignant cell composition percentages obtained from RNA to cell composition percentages obtained from DNA in order to develop coefficients for RNA per cell type.
  • in addition to RNA-seq data, some embodiments of the technology described herein may be applied to microarray data.
  • the expression values may be normalized to lie in a range similar to that of transcripts per million (TPM) values for RNA-seq (for example, by rescaling the expression values so that they sum to 1 million), optionally on a linear scale.
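  • A minimal sketch of such a normalization is shown below. It assumes the microarray values have already been converted to a linear (non-logarithmic) scale; the function name and the values are hypothetical.

```python
def scale_to_million(values):
    """Rescale expression values so that they sum to 1,000,000 (TPM-like).

    values: dict mapping gene -> non-negative expression value on a
    linear scale.
    """
    total = sum(values.values())
    if total <= 0:
        raise ValueError("expression values must have a positive sum")
    return {gene: v * 1_000_000 / total for gene, v in values.items()}


# Made-up linear-scale microarray intensities.
microarray = {"geneA": 50.0, "geneB": 150.0, "geneC": 300.0}
tpm_like = scale_to_million(microarray)
```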
  • FIG. 3D is a diagram depicting an example method 380 for determining malignancy expression profiles based on cell composition percentages, according to some embodiments of the technology described herein. This may include obtaining a biological sample (e.g., a biopsy) and determining the expression (e.g., the expression of individual genes) of malignant cells included in the biological sample. In some embodiments, this may include removing the expression of TME cells from the overall expression of the biological sample (e.g., bulk biopsy expression).
  • the example method includes three steps.
  • the first step 382 includes determining mean expression profiles of different, non-malignant cell types.
  • this may include using expression data from sorted cell types.
  • this may include obtaining and using RNA-seq data from T cells, B cells, macrophages, fibroblasts, and any other suitable cell type that may be included in a TME.
  • the cell types may exclude tumor (e.g., malignant) cells.
  • a mean expression profile may include the mean expression of a set of genes for each cell type.
  • the example method then proceeds to the second step 384 for predicting the cell composition fractions using cellular deconvolution techniques.
  • the cell composition fractions may be indicative of the fraction of each cell type in a biological sample (e.g., a biopsy.) As shown, this may include generating a vector of cell composition fractions.
  • cellular deconvolution techniques may include any of the embodiments described herein, including with respect to FIGS. 1-3C .
  • the mean expression profiles of different cell types included in the TME (e.g., first step 382 ) and the fraction of each of those cell types in the biological sample (e.g., second step 384 ) may be used to estimate the expression of each cell type in the biological sample.
  • the third step 386 may include determining the product of the matrix of expression profiles and the vector of cell fractions. The resulting vector is an estimated expression profile of the TME cells in the biological sample.
  • determining the tumor expression profile may include subtracting the TME expression profile from the bulk expression of the biological sample (e.g., the bulk biopsy expression). As shown, this may include subtracting the vector generated for the expression profile of the TME cells from the vector of bulk expression.
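  • The three steps of FIG. 3D can be sketched with plain lists, as below: a matrix-vector product gives the estimated TME expression, which is then subtracted from the bulk expression. The function names, the cell type labels, and all numeric values are hypothetical, and the sketch applies no clipping or other handling of negative differences.

```python
def tme_profile(mean_profiles, fractions):
    """Per-gene expression contributed by TME cells (matrix-vector product).

    mean_profiles: dict cell type -> list of mean expression values per gene.
    fractions: dict cell type -> fraction of that cell type in the sample.
    """
    n_genes = len(next(iter(mean_profiles.values())))
    return [
        sum(mean_profiles[ct][i] * fractions[ct] for ct in fractions)
        for i in range(n_genes)
    ]


def malignant_profile(bulk, mean_profiles, fractions):
    """Subtract the estimated TME expression from the bulk expression."""
    tme = tme_profile(mean_profiles, fractions)
    return [b - t for b, t in zip(bulk, tme)]


# Made-up mean profiles over three genes for two non-malignant cell types.
mean_profiles = {
    "T_cells": [100.0, 0.0, 10.0],
    "Fibroblasts": [0.0, 200.0, 10.0],
}
fractions = {"T_cells": 0.2, "Fibroblasts": 0.1}  # from deconvolution
bulk = [30.0, 25.0, 50.0]  # bulk expression of the same three genes

tme = tme_profile(mean_profiles, fractions)
tumor = malignant_profile(bulk, mean_profiles, fractions)
```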
  • FIG. 4 is a flowchart depicting a method 400 for training one or more non-linear regression models to determine cell composition percentages based on RNA expression data.
  • the method 400 may comprise training one or more non-linear regression models (e.g., at least five, at least ten, at least fifteen non-linear regression models) to estimate cell composition percentages for a corresponding one or more cell types in a biological sample.
  • a separate non-linear regression model may be trained for each cell type and/or subtype, such that each non-linear regression model is trained to estimate cell composition percentages for a particular cell type in the biological sample.
  • the method 400 may be carried out on a computing device (e.g., as described herein including at least with respect to FIG. 10 ).
  • the computing device may include at least one processor, and at least one non-transitory storage medium storing processor-executable instructions which, when executed, perform the acts of method 400 .
  • the method 400 may begin with obtaining training data comprising simulated RNA expression data.
  • the “simulated” RNA expression data may include RNA expression data that is generated partially in silico.
  • the simulated RNA expression data may include data that was obtained by sampling reads from multiple expression data sets from purified cell type samples.
  • the RNA expression data may comprise expression data measured in TPM.
  • the RNA expression data includes first RNA expression data for first genes associated with a first cell type and second RNA expression data for second genes associated with a second cell type.
  • the first genes may be, for example, the specific and/or semi-specific genes for the first cell type, while the second genes may be specific and/or semi-specific genes for the second cell type.
  • the training data may comprise RNA expression data of genes associated with each cell type and/or subtype being analyzed, and/or other cell types.
  • the training data may be generated as part of act 402 .
  • the simulated RNA expression data may be generated by combining RNA expression data from malignant cells (e.g., cancer cells) with RNA expression data from microenvironment cells (e.g., immune cells, skin cells, etc.) to produce a plurality of simulated RNA mixtures (which may be referred to herein as “artificial mixtures” or “mixes”) for training.
  • at least a thousand, at least ten thousand, at least one hundred thousand, or at least one million mixes may be generated and/or accessed as part of act 402 .
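  • A simplified sketch of generating artificial mixes is shown below: each mix is a weighted combination of per-cell-type expression profiles, and the weights are kept as the known labels for training. Drawing the weights uniformly at random is an assumption made for illustration, and the function name, the cell type labels, and the profile values are hypothetical.

```python
import random


def make_mix(profiles, rng):
    """Create one artificial mix as a random convex combination of profiles.

    profiles: dict cell type -> list of expression values per gene.
    rng: a random.Random instance (seeded for reproducibility).
    Returns (mixed_expression, fractions); the fractions are the known
    labels that a model would be trained to recover.
    """
    weights = {ct: rng.random() for ct in profiles}
    total = sum(weights.values())
    fractions = {ct: w / total for ct, w in weights.items()}
    n_genes = len(next(iter(profiles.values())))
    mixed = [
        sum(profiles[ct][i] * fractions[ct] for ct in profiles)
        for i in range(n_genes)
    ]
    return mixed, fractions


# Made-up expression profiles (three genes) for one malignant and two
# microenvironment cell types.
profiles = {
    "Tumor": [500.0, 10.0, 5.0],
    "T_cells": [5.0, 300.0, 20.0],
    "Fibroblasts": [10.0, 5.0, 400.0],
}

rng = random.Random(0)
mixes = [make_mix(profiles, rng) for _ in range(1000)]
```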
  • the training data may be obtained in any suitable manner at act 402 .
  • the training data may be stored on at least one storage medium (e.g., in one or more files, or in a database).
  • the at least one storage medium storing the training data may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment).
  • the training data may be stored on a single storage medium, or may be distributed across multiple storage mediums.
  • act 402 may further comprise pre-processing the training data in any suitable manner.
  • the training data may be sorted, combined, organized into batches, filtered, or pre-processed with any other suitable techniques.
  • the pre-processing may make the training data suitable to be processed using the one or more non-linear regression models, for example.
  • the training data may be split into separate training, validation, and holdout datasets, as described herein including at least with respect to FIG. 5A .
  • at acts 404 to 408 , the method 400 may proceed with training the one or more non-linear regression models using the training data.
  • acts 404 to 408 describe training a first model of the non-linear regression models to estimate cell composition percentages for a corresponding first cell type.
  • Acts 404 and 406 may be referred to herein as a training step.
  • each model of the non-linear regression models may be trained at least in part separately for each cell type (e.g., with corresponding different input data, and different learned parameters, for each non-linear regression model).
  • each non-linear regression model of the one or more non-linear regression models may be trained, mutatis mutandis, according to the techniques described herein including with respect to acts 404 to 406 , and/or stored according to act 408 .
  • training the first model of the non-linear regression models may proceed with generating an estimated percentage of RNA for the first cell type, using the first model and the first RNA expression data.
  • the first RNA expression data may comprise first genes associated with the first cell type (e.g., only specific and/or semi-specific genes for the first cell type).
  • the first RNA expression data may be provided as input to the first model.
  • other input may additionally or alternatively be provided to the first model. For example, a median, average, or any other suitable information relating to some or all of the RNA expression data may be provided as part of the input to the first model.
  • training the first model of the non-linear regression models may proceed with updating parameters using the estimated percentage of RNA from the first cell type.
  • the estimated percentage of RNA from the first cell type may be compared to a known value for the percentage of RNA from the first cell type as part of act 406 .
  • a loss function may be applied to the estimated value and the known value in order to determine a loss associated with the estimated value.
  • the loss may be used to update the parameters of the model. For example, a gradient descent, or any other suitable optimization technique, may be applied in order to update the parameters of the model so as to minimize the loss.
  • the first model may process its input using any suitable techniques, including non-linear regression techniques, as described herein.
  • the first model may use a gradient boosting machine learning technique.
  • the first model may comprise an ensemble of weak prediction models, such as decision trees, or any other suitable prediction models, which may be combined in an iterative fashion using a gradient boosting algorithm.
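  • To illustrate the idea of iteratively combining weak prediction models, the following hand-written sketch performs gradient boosting for squared loss using one-split decision stumps. It is a simplified stand-in written for illustration, not the XGBoost or LightGBM implementation, and all function names, hyperparameters, and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split minimizing squared error."""
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] > threshold]
            if not left or not right:
                continue
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            error = sum((r - lmean) ** 2 for r in left)
            error += sum((r - rmean) ** 2 for r in right)
            if best is None or error < best[0]:
                best = (error, j, threshold, lmean, rmean)
    # (feature index, threshold, left value, right value), or None if no
    # valid split exists.
    return None if best is None else best[1:]


def stump_predict(stump, x):
    j, threshold, left_value, right_value = stump
    return left_value if x[j] <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy non-linear target: y = x^2 on ten points.
xs = [[v / 10.0] for v in range(10)]
ys = [x[0] ** 2 for x in xs]
model = fit_boosted(xs, ys)

baseline_mse = sum((y - model[0]) ** 2 for y in ys) / len(ys)
model_mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```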
  • a gradient boosting framework such as XGBoost or LightGBM may be used as part of training the first model.
  • a random forest model may be used as part of training the first model.
  • acts 404 to 406 may be repeated multiple times (e.g., at least one hundred, at least one thousand, at least ten thousand, at least one hundred thousand, or at least one million times). In some embodiments, acts 404 to 406 may be repeated for a set number of iterations, or may be repeated until a threshold is surpassed (e.g., until loss decreases below a threshold value). In some embodiments, the non-linear regression models may be trained in two or more stages, as described herein including at least with respect to FIG. 5A .
  • the method 400 may proceed with outputting the trained plurality of non-linear regression models including the first non-linear regression model and the second non-linear regression model.
  • outputting the trained plurality of non-linear regression models may comprise: storing one or more of the models in at least one non-transitory computer-readable storage medium (e.g., memory) for subsequent access, providing the model(s) to a recipient (e.g., transmitting data associated with the model(s) to a recipient using any suitable communication network or other means), displaying information associated with the model(s) to a user via a graphical user interface, and/or any other suitable manner of outputting the trained models, as aspects of the technology described herein are not limited in this respect.
  • FIG. 5A depicts an exemplary method 500 for training one or more non-linear regression models, according to the techniques developed by the inventors.
  • the illustrated techniques may be used in conjunction with any of the other techniques described herein, including at least with respect to FIGS. 2 and 4 .
  • the method 500 may begin at act 502 with preparing one or more datasets for training.
  • the datasets may be generated (e.g., according to the techniques described herein including at least with respect to FIG. 6A ) and/or accessed (e.g., from one or more databases) as part of act 502 .
  • the datasets may comprise a plurality of artificial mixes of RNA expression data, which may comprise RNA expression data from a variety of malignant (e.g., tumor) and/or microenvironment cells.
  • the datasets may comprise at least one thousand, at least ten thousand, at least one hundred thousand, or at least one million artificial mixes.
  • the datasets may be separated into training datasets and holdout datasets.
  • the datasets may be separated into the training and holdout datasets at random in some embodiments, with a set percentage of the datasets to be used for training and holdout, respectively. For instance, in the illustrated example, 80% of the datasets are used as training datasets, while the remaining 20% are retained as holdout datasets.
  • the holdout datasets may be used to develop quality metrics (e.g., as described herein including at least with respect to FIG. 7B ). In some embodiments, there may be no holdout datasets, such that all the datasets may be used for training. As shown in the figure at act 502 , the training datasets may be further subdivided into one or more (e.g., ten) folds each containing a respective training and validation set. According to some embodiments, the training datasets may be divided into folds at random. In some embodiments, cross-fold validation may be performed as part of training.
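  • As a non-limiting illustration, the random 80%/20% separation into training and holdout datasets and the subdivision of the training datasets into folds may be sketched as follows. The function name `split_and_fold`, its arguments, and the strided way of choosing each fold's validation set are assumptions made for this sketch only and are not taken from the described embodiments.

```python
import random


def split_and_fold(mix_ids, train_fraction=0.8, n_folds=10, seed=0):
    """Randomly split mixes into training/holdout sets, then cut training into folds."""
    rng = random.Random(seed)
    ids = list(mix_ids)
    rng.shuffle(ids)  # random assignment to training vs. holdout
    n_train = int(round(train_fraction * len(ids)))
    train, holdout = ids[:n_train], ids[n_train:]

    folds = []
    for k in range(n_folds):
        # Assumption: fold k validates on every n_folds-th training mix.
        validation = train[k::n_folds]
        validation_set = set(validation)
        training = [m for m in train if m not in validation_set]
        folds.append((training, validation))
    return train, holdout, folds


train, holdout, folds = split_and_fold(range(1000))
```

  • Here each fold holds its own training and validation set, and the holdout mixes are never seen during training, so they remain available for developing quality metrics.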
  • each non-linear regression model may be trained to estimate, based on input RNA expression data, a corresponding percentage of RNA from a particular cell type.
  • the non-linear regression models may be trained in two stages, the first stage corresponding to training a first sub-model of the non-linear regression model, the second stage corresponding to training a second sub-model of the non-linear regression model.
  • the first sub-model of each non-linear regression model may be trained to generate an initial prediction for the percentage of RNA from its respective cell type.
  • the input may comprise RNA expression data of specific and/or semi-specific genes for the corresponding cell type.
  • only the RNA expression data of the specific and/or semi-specific genes for the cell type may be provided as input.
  • other information such as a median of the expression data, may be provided.
  • the output of the first stage may be initial predictions for the percentages of RNA from each cell type, with each first sub-model of each non-linear regression model providing a prediction for its respective cell type.
  • the second sub-model of each non-linear regression model may be trained to generate a second prediction for the percentage of RNA from its respective cell type.
  • the input may comprise RNA expression data of specific and/or semi-specific genes for the corresponding cell type, and the predictions from the first stage.
  • the RNA expression data used at the second stage may be different from the RNA expression data used at the first stage.
  • some or all of the training data may be regenerated (e.g., according to the techniques described herein including with respect to FIGS. 5B and 6 ) for the purposes of training the non-linear regression models in the second stage.
  • the training data for the first stage and the second stage may be generated in parallel (e.g., at the same time) but independently, such that the training data for each stage is different.
  • the predictions from the first stage may be provided as input at the second stage.
  • the initial predictions for all cell types may be provided as input to the second stage. This may allow the second stage to effectively correct the predictions from the first stage, and may increase the consistency and/or accuracy of the final model.
  • the output at the second stage may be second predictions for the percentages of RNA from each cell type, with the second sub-model of each non-linear regression model providing a prediction for its respective cell type.
  • the second predictions may be the final output of the non-linear regression models (e.g., as described herein including with respect to FIGS. 2 and 4 ).
  • additional stages of training (e.g., additional sub-models) may be performed (e.g., a third stage, a fourth stage, etc.), with each stage taking as input new training data (e.g., RNA expression data) and the predictions from the previous stage.
  • Providing the predictions from the previous stage as part of the input to the next stage may allow a model for a particular cell type to use the information about estimated proportions of other cell types and adapt to them (e.g., by knowing that the total number of T cells equals 10 and number of CD4+ T cells is 8, the number of CD8+ T cells could not exceed 2).
  • a multi-stage training procedure, as described herein, may allow the model to account for this. This procedure may allow for information from different cell types and subtypes to be used for each individual cell type model.
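  • As a non-limiting illustration, the flow of data between the two stages may be sketched as follows. The sketch shows only how inputs are assembled at prediction time: each first sub-model sees the genes for its own cell type, and each second sub-model sees those genes plus the first-stage predictions for all cell types. The function `predict_two_stage`, the single marker gene per cell type, and the hand-written stand-in "models" are hypothetical; in the described embodiments the sub-models are trained non-linear regression models.

```python
def predict_two_stage(expression, gene_sets, stage1_models, stage2_models):
    """expression: gene -> value; gene_sets: cell type -> genes used for that type.
    stage1_models / stage2_models: cell type -> callable taking a list of floats."""
    cell_types = sorted(gene_sets)

    # Stage 1: each sub-model sees only the genes selected for its own cell type.
    first = {}
    for ct in cell_types:
        own_genes = [expression[g] for g in gene_sets[ct]]
        first[ct] = stage1_models[ct](own_genes)

    # Stage 2: own genes plus the stage-1 predictions for ALL cell types.
    all_first = [first[ct] for ct in cell_types]
    second = {}
    for ct in cell_types:
        own_genes = [expression[g] for g in gene_sets[ct]]
        second[ct] = stage2_models[ct](own_genes + all_first)
    return first, second


# Toy stand-ins (hypothetical): one marker gene per cell type, hand-written models.
gene_sets = {"CD4_T_cells": ["CD4"], "CD8_T_cells": ["CD8A"], "T_cells": ["CD3E"]}
expression = {"CD4": 8.0, "CD8A": 5.0, "CD3E": 10.0}
stage1 = {ct: (lambda x: x[0]) for ct in gene_sets}
# Stage-2 input layout: [own gene, pred CD4, pred CD8, pred T cells] (sorted order).
stage2 = {
    "CD4_T_cells": lambda x: x[1],
    "CD8_T_cells": lambda x: min(x[2], x[3] - x[1]),  # cannot exceed T minus CD4
    "T_cells": lambda x: x[3],
}
first, second = predict_two_stage(expression, gene_sets, stage1, stage2)
```

  • In this toy case the first stage over-estimates CD8+ T cells, and the second stage can correct the estimate because it also sees the first-stage estimates for T cells and CD4+ T cells.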
  • FIG. 5B is an exemplary, non-limiting illustration for training a machine learning model, in accordance with some embodiments of the technology described herein.
  • the illustrated techniques may be used in conjunction with any of the other techniques described herein, including at least with respect to FIGS. 2 and 4 .
  • diagram 530 illustrates the division of the datasets into one or more folds, as described herein including with respect to FIG. 5A .
  • the datasets may be randomly split into three folds, with each of the three folds being further divided into a training dataset and a validation dataset.
  • datasets may be used to generate artificial mixes, as described herein including with respect to FIG. 6A .
  • the folds may then be used to train one or more models for a given set of parameters (e.g., parameters 550 ).
  • the parameters may be generated (e.g., at random) based on a set of predetermined ranges shown in Table 3.
  • at least some of the (e.g., all) folds may be used to train each cell type model separately.
  • validation mixes may be used to evaluate each parameter set and generate associated evaluation data.
  • parameters may be updated with each stage of training and/or used as input to subsequent training stages.
  • a first fold may be used as input for a first stage of training to generate a first set of parameters.
  • a second fold may then be used as input for a second stage of training to generate an updated set of parameters.
  • Tables 4 and 5 show example parameters for one or more cell type models after a first stage and a second stage of training, respectively.
  • FIG. 6A is a diagram depicting an exemplary method 600 for training one or more non-linear regression models, including generating simulated RNA expression data (e.g., to use as training data, as described herein including at least with respect to FIGS. 4-5 ).
  • the simulated RNA expression data may be generated by combining samples of RNA expression data from malignant cells (e.g., cancer cells) and microenvironment cells (e.g., immune cells, stromal cells, etc.), as shown in branches 610 and 620 of the method 600 .
  • An exemplary process for generating artificial mixes of RNA expression data is described herein below with respect to FIG. 6A .
  • FIG. 6B is a diagram depicting an example of generating artificial mixes of RNA expression data to imitate real tissue, according to some embodiments of the technology described herein.
  • the RNA expression data is derived from one or more sorted cell types/subtypes representing one or more biological states (e.g., positive gene regulation, negative gene regulation, etc.), as shown in branch 630 .
  • the one or more cell types/subtypes are mixed in different proportions to generate artificial mixes, as shown in branches 640 and 650 .
  • FIG. 6C is an exemplary diagram for generating and using artificial mixes to train cell type models, according to some embodiments of the technology described herein.
  • the dataset is divided into folds.
  • the resultant datasets are used to create artificial mixes.
  • the artificial mixes are used to train and validate each of one or more non-linear regression models that is specific to one or more cell type/subtype.
  • the resultant models from each of the folds may be considered together or independently, as described with respect to FIG. 5A .
  • FIGS. 6D and 6E are exemplary illustrations for generating specific artificial mixes for training particular cell type/subtype models, according to some embodiments of the technology described herein.
  • one or more datasets may be excluded for training a specific cell type/subtype model, as described herein including with respect to Table 6.
  • FIG. 6F is an exemplary diagram illustrating techniques for processing datasets and generating artificial mixes, according to some embodiments of the technology described herein.
  • act 602 illustrates datasets for a cell type, prior to rebalancing (e.g., resampling large datasets to avoid overtraining models).
  • datasets may be rebalanced 604 and combined into a total set of samples for a specific cell type. Further, as described herein, samples may then be randomly selected in act 608 and averaged in act 612 .
  • hyperexpression noise may be added to the expression of the cell type, as illustrated in 614 .
  • the samples of RNA expression data may be obtained as described herein including at least with respect to FIGS. 1C-1D .
  • a large number of samples of sorted malignant and microenvironment cells may be used to construct the artificial mixes of RNA expression data.
  • the number of samples may be on the order of the number of samples included in Table 1.
  • the number of samples may be at least 5,000, at least 10,000, at least 15,000, at least 20,000, at least 30,000, at least 50,000, at least 100,000, or any number of suitable samples.
  • open source datasets such as Gene Expression Omnibus (GEO) and ArrayExpress may be used.
  • the datasets used may be selected so as to satisfy the following criteria: only Homo sapiens, standard RNA-seq (without polyA depletion, targeted panel, etc.) with read length higher than 31 bp.
  • only relevant cell types for the particular disease being analyzed (e.g., a particular type of tumor) may be used.
  • data for all cell types may instead be used.
  • selection of datasets may be based on both biological and bioinformatic parameters. For example, datasets with samples cultivated in conditions close to normal physiological conditions may be used. In some embodiments, datasets with abnormal stimulation were excluded, like datasets of CD4+ T-cells hyper stimulated with phorbol 12-myristate 13-acetate and ionomycin activation or macrophages co-cultured with an excessive number of bacterial cultures. In some embodiments, only those samples having at least 4 million coding read counts were used.
  • quality control may be performed on the RNA expression data prior to construction of the artificial mixes (e.g., to exclude strange or unreliable datasets). For example, if some samples of CD4+ T cells show no or very low expression of CD45, CD4 or CD3 genes, they may be excluded. The same may be done for other cell types, in some embodiments. For example, samples for some cell types may be excluded if they significantly express genes that are not typical for that type of cell (e.g., if in a sample of T cells, CD19, CD33, MS4A1, etc. were expressed in significant amounts, while in most other T cell samples these expressions were low). In some embodiments, samples of CD4+ T cells may be removed if they express significant amounts of CD8 genes.
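  • As a non-limiting illustration, the marker-based exclusion described above may be sketched as a simple filter. The function `passes_marker_qc`, the numeric thresholds `low` and `high`, and the example expression values are hypothetical; the text does not specify what counts as "very low" or "significant" expression. Marker names are written as they appear in the text.

```python
def passes_marker_qc(sample, required, unexpected, low=1.0, high=10.0):
    """sample: marker -> expression value. Thresholds are illustrative assumptions."""
    # Exclude samples with no or very low expression of markers the cell type should express.
    if any(sample.get(g, 0.0) < low for g in required):
        return False
    # Exclude samples that significantly express markers atypical for the cell type.
    if any(sample.get(g, 0.0) > high for g in unexpected):
        return False
    return True


# Markers for a CD4+ T cell check, named as in the text.
REQUIRED = ["CD45", "CD4", "CD3"]
UNEXPECTED = ["CD19", "CD33", "MS4A1", "CD8"]

samples = {
    "s1": {"CD45": 300.0, "CD4": 80.0, "CD3": 150.0, "CD19": 0.2, "CD8": 1.0},
    "s2": {"CD45": 250.0, "CD4": 0.1, "CD3": 120.0},               # CD4 too low
    "s3": {"CD45": 280.0, "CD4": 70.0, "CD3": 140.0, "CD8": 95.0},  # CD8 too high
}
kept = [name for name, s in samples.items()
        if passes_marker_qc(s, REQUIRED, UNEXPECTED)]
```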
  • several methods of expression analysis like t-SNE or PCA with different gene sets may be used to visualize the similarities and differences between datasets (e.g., as shown in FIGS. 1C and 1D ). If a particular cell type from one dataset fails to cluster with the same cell type in the other datasets (e.g., in a t-SNE, PCA, or other plot), then the one dataset may be further analyzed as part of quality control, and some or all of the data from that dataset may be excluded.
  • RNA expression data may be constructed using samples prepared as described herein above. Artificial mixes may be generated using sample expressions in TPM (transcripts per million) units, such that the gene expressions for an overall sample are formed as a linear combination of the expressions of individual cells from that sample.
  • RNA expression data from samples of various cell types may be mixed in predetermined proportions, as described herein below.
  • simulated RNA expression data for malignant cells (e.g., generated as shown in branch 610) may be combined with simulated RNA expression data for microenvironment cells (e.g., generated as shown in branch 620).
  • samples of each cell type may be rebalanced by datasets (e.g., reducing the weight of datasets with a large number of samples) and subtypes (e.g., changing the proportions of subtypes of a sample).
  • Techniques for rebalancing are described herein including with respect to the “Rebalancing by datasets” and “Rebalancing by subtypes” sections.
  • For each cell type multiple samples may then be randomly selected and averaged. Then, for some or all of the cell types being used, the rebalanced/averaged samples may be mixed together in particular proportions (e.g., so as to simulate a real tumor microenvironment).
  • In branch 610, an exemplary process for generating simulated malignant cell RNA expression data is shown.
  • random samples of cancer cells (e.g., NSCLC, ccRCC, Mel, HNCK, etc.) may be selected.
  • hyperexpression noise may be added to the resulting RNA expression data to account for abnormal expression of genes by malignant cells.
  • tumor cells sometimes express genes which are ordinarily absent in the parental cell type.
  • the overexpressed genes may interfere with the deconvolution techniques described herein.
  • the result of branch 610 may be simulated malignant cell RNA expression data.
  • the simulated RNA expression data for the malignant cells (e.g., generated as shown in branch 610 ) and the simulated RNA expression data for the microenvironment cells (e.g., generated as shown in branch 620 ) may be combined into an artificial mix (referred to in FIG. 6A as an “expression mix”).
  • the simulated RNA expression data for the malignant cells and the simulated RNA expression data for the microenvironment cells may be mixed together in a random proportion based on a given distribution for cancer cells.
  • noise may then be added to the mix to mimic technical noise and noise resulting from biological variability.
  • Each type of noise may be specified according to one or more suitable distributions. For example, as shown in the figure, the technical noise may be specified by a Poisson distribution, while the noise resulting from biological variability may be specified according to a normal distribution.
  • technical noise may have multiple components, which may be specified by other distributions.
  • another component of technical noise may be specified by a non-Poisson distribution.
  • the artificial mix may be representative of an artificial tumor, including the tumor microenvironment (TME).
  • the inventors have recognized and appreciated that, when creating artificial mixes, it may be desirable to use different cells of the same type from different samples. Using a small number of samples for the mixes, or even just one sample for each cell type, would provide poor performance on real tumor samples (e.g., due to the variability of cell states and their expressions, as well as noise due to limited numbers of read counts for different expressions, alignment errors and other causes of technical noise). Therefore, when creating artificial mixtures, the inventors have recognized that it may be desirable to use as many available cell samples as possible.
  • RNA-seq samples (e.g., at least one hundred, at least five hundred, at least one thousand, at least two thousand, or at least five thousand samples)
  • a number of datasets of malignant cells (e.g., pure cancer cells for various diagnoses, cancer cell lines or sorted from tumors)
  • Table 7 lists the quantities of samples remaining after quality control for a number of cell types.
  • the artificial mixes may be used as training datasets for training one or more non-linear regression models.
  • the non-linear regression models may be specific to a cell type/subtype. Accordingly, in some embodiments many (e.g., 150,000) artificial mixes may be generated to train models for each specific cell type model.
  • the sets of mixes used for each model may include or exclude specific datasets that allow for differentiation between particular cell types/subtypes, as illustrated in FIGS. 6D and 6E . For example, to train a model for CD4+ T cells, datasets that include unspecified T cells may be excluded to avoid uncertainty about the proportions of CD4+ T cells within the datasets.
  • Table 6 specifies the mixes used to train one or more corresponding cell type/subtype models.
  • This table specifies, as an example, the samples included in the artificial mixes used to train particular cell type models.
    Mixes set | Trained models
    All available samples | Immune cells, Myeloid cells, Lymphocytes, Monocytes, Fibroblasts, Endothelium, Neutrophils, NK cells, T cells, Macrophages, B cells
    Without T cell samples | CD4+ T cells, CD8+ T cells
    Without T cells, CD8+ T cells samples | CD8+ T cells PD1 high, CD8+ T cells PD1 low
    Without T cells, CD4+ T cells samples | Tregs, T helpers
    Without Macrophages samples | Macrophages M1, Macrophages M2
    Without B cells samples | Plasma B cells, Non plasma B cells
Averaging of Samples
  • multiple samples for each cell type may be averaged in any suitable manner (e.g., to improve the quality of samples before adding artificial noise). For example, in some embodiments, averaging may be performed in groups of two, such that an averaged sample of 4 million reads may contain information on 8 million reads. In some embodiments, averaging across multiple samples may reduce the noise in the expression caused by technical factors during sequencing.
  • num_av samples are selected, the expressions of which are averaged (the value of num_av is indicated in the parameter table, Table 9).
  • any subtype samples may be used at this stage. So, for example, Tregs may be processed along with T cells in some embodiments. Since this approach creates greater subtype diversity for artificial samples but can decrease the biological variability of gene expression within cell type or subtype if too many samples are averaged, the degree of averaging employed may affect the learning outcome. Therefore, the number of samples for averaging may appear as a parameter, which, together with other parameters, may be selected during training (e.g., so as to increase or maximize quality).
  • the number of samples may be rebalanced. As described herein below, in one example, the samples may be rebalanced by datasets, then by cell subtypes. Then num_av samples may be selected from the rebalanced number of samples.
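  • As a non-limiting illustration, selecting num_av samples of one cell type at random and averaging their expressions may be sketched as follows. The function `average_random_samples` is hypothetical, and drawing without replacement is an assumption of this sketch; the text states only that samples are randomly selected.

```python
import random


def average_random_samples(samples, num_av, rng):
    """samples: list of equal-length expression vectors for one cell type."""
    # Assumption: draw num_av distinct samples (without replacement).
    chosen = rng.sample(samples, num_av)
    n_genes = len(chosen[0])
    # Gene-wise mean of the chosen samples.
    return [sum(s[i] for s in chosen) / num_av for i in range(n_genes)]


rng = random.Random(0)
t_cell_samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged_pair = average_random_samples(t_cell_samples, 2, rng)
averaged_all = average_random_samples(t_cell_samples, 4, rng)
```

  • Averaging more samples reduces technical noise but, as noted above, may also reduce biological variability, which is why the number of averaged samples may be treated as a parameter to be selected during training.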
  • the number of samples of sorted cells in datasets may range from one to several hundred (e.g., at least five, at least ten, at least 50, or at least 100 samples).
  • each dataset may contain samples of one or two cell types, sorted and sequenced in the same way.
  • Cell samples within the same dataset may also have specific conditions, such as a specific set of markers for sorting or a specific disease of patients from whom the cells were taken. Datasets with a large number of samples can lead to overtraining of models for such datasets. To reduce the weight of datasets with a large number of samples, samples of all datasets are resampled in order to rebalance by datasets.
  • $$N_{\text{dataset,new}} = N_{\max} \cdot \left( \frac{N_{\text{dataset,old}}}{N_{\max}} \right)^{1 - \text{rebalance parameter}}$$
  • where $N_{\max}$ is the number of samples in the largest dataset (e.g., for the particular cell type) and $N_{\text{dataset,old}}$ is the original number of samples in the dataset.
  • the rebalance parameter in the equation is a value in the range [0, 1], where 0 means there is no change in the number of samples, and 1 means that for each dataset there will be the same number of samples.
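  • As a non-limiting illustration, the new number of samples per dataset may be computed as follows. The function name `rebalanced_size`, the dataset names, and the sample counts are hypothetical; rounding to a whole number of samples is an assumption of this sketch.

```python
def rebalanced_size(n_old, n_max, rebalance):
    """New sample count for a dataset: n_max * (n_old / n_max) ** (1 - rebalance)."""
    return n_max * (n_old / n_max) ** (1.0 - rebalance)


# Hypothetical sample counts for three datasets of one cell type.
sizes = {"dataset_a": 400, "dataset_b": 25, "dataset_c": 4}
n_max = max(sizes.values())

# Assumption: round to the nearest whole number of samples.
new_sizes = {name: round(rebalanced_size(n, n_max, 0.5)) for name, n in sizes.items()}
```

  • With a rebalance parameter of 0.5, the 16-fold difference between the largest and the middle dataset shrinks to 4-fold, which reduces the weight of the largest dataset without equalizing all datasets.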
  • the rebalancing parameter may be selected during training.
Rebalancing by Cell Subtypes
  • in addition to samples of a given cell type, there may also be samples of more specific subtypes.
  • the number of available subtype samples may not coincide with those ratios that are specified during the formation of mixes with these subtypes, in some cases. Therefore, when creating mixes for the cell type, samples of its subtypes may be rebalanced.
  • for example, there may be significantly more samples of CD4+ T cells (and T helpers with Tregs) available than samples of CD8+ T cells.
  • proportions of CD4+ and CD8+ T cell samples may be changed before the random selection of samples.
  • the proportions may be chosen similar to the ratios of the predicted average RNA fractions for the TCGA or PBMC samples for these cell types.
  • the predictions may be obtained using one or more linear models trained on mixes with equal cell proportions.
  • the subtype rebalancing algorithm may be as follows. To rebalance each subtype for a given type, resample with replacement a number of samples equal to:
  • $$P_{subtype} \cdot m_{size} / \min_P + 1$$
  • where $P_{subtype}$ is a number reflecting the proportion of a given subtype (e.g., the proportion of this subtype among all subtypes for the given type, which may be represented as the number of samples for the subtype divided by the total number of samples for the type); $m_{size}$ is the maximum number of samples among all the subtypes for the given type; and $\min_P$ is the minimum value of $P_{subtype}$ among all subtypes. According to some embodiments, the rebalancing operation may be performed recursively for all nested subtypes (e.g., subtypes which themselves have subtypes).
Microenvironment Cells Proportions Generation
  • the resulting samples of different cell types may be mixed with one another in random ratios in order to generate the simulated microenvironment cell RNA expression data.
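  • a first set of artificial mixes may be generated using random proportions of each cell type, with the proportion of each cell type obtained by scaling a random number by a cell type coefficient and normalizing to the sum:
  • $$P_{cell} = \frac{R_{cell} \cdot K_{cell}}{\sum_{cell'} R_{cell'} \cdot K_{cell'}}$$
  • where $R_{cell}$ is a random number distributed uniformly from 0 to 1 and $K_{cell}$ is the coefficient for the particular cell type.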
  • the coefficient $K_{cell}$ in the above equations may be chosen so that the most likely ratios of cell mRNA are close to what is observed in TCGA or PBMC samples. These approximate ratios may be calculated from the TCGA or PBMC samples, using models trained without using such ratios. For example, a vector of numbers may be used, reflecting approximate proportions for a given type of tissue. Each number of the vector is multiplied by a random number from 0 to 1. The resulting coefficients are normalized to the sum and used in a linear combination.
  • K cell may be selected from Table 7, which specifies, for each of multiple cell types, the most likely proportion of the cell type based on tumor tissue and blood (PBMC).
  • the inventors have recognized and appreciated that it may be desirable for the deconvolution algorithm to work in any cell range. For example, the preparation of a cell suspension from a tumor sample may lead to a dramatic increase in the proportion of lymphocytes—and it may be desirable for the algorithm to work on the sequencing data of such a suspension.
  • the inventors have recognized and appreciated that the formation of cell ratios by the method described may generate practically no samples where there is a large proportion (e.g., 70-100%) of a certain cell type, such as NK cells. Therefore, in some embodiments, additional mixtures are created in which proportions are generated from the Dirichlet distribution with parameter 1/number_of_types for each dimension. This parameter may be selected along with other parameters for creating mixtures.
  • the number of samples in a dataset formed in this way may be controlled by a parameter dirichlet_samples_proportion (Table 9). This parameter may also be selected as a parameter for creating mixtures. Thus, in the final dataset, each cell type may be found in proportions from 0 to 100 percent. However, most of the characteristic quantities may reflect cell populations that mimic real tumors.
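  • As a non-limiting illustration, the two ways of generating cell type proportions described above (coefficients multiplied by uniform random numbers and normalized to the sum, and draws from a Dirichlet distribution with parameter 1/number_of_types), together with the linear combination of expression vectors, may be sketched as follows. The function names and the two-gene expression profiles are hypothetical. The coefficients are the solid tumor values listed for four of the cell types in the table of most likely proportions. Drawing a Dirichlet sample by normalizing independent gamma variates is a standard construction chosen for this sketch.

```python
import random


def weighted_random_proportions(k, rng):
    """Multiply each coefficient by a uniform random number, then normalize to the sum."""
    raw = {cell: coeff * rng.random() for cell, coeff in k.items()}
    total = sum(raw.values())
    return {cell: value / total for cell, value in raw.items()}


def dirichlet_proportions(cell_types, rng):
    """Dirichlet draw with parameter 1/number_of_types for each dimension."""
    alpha = 1.0 / len(cell_types)
    raw = {cell: rng.gammavariate(alpha, 1.0) for cell in cell_types}
    total = sum(raw.values())
    return {cell: value / total for cell, value in raw.items()}


def mix_expression(profiles, proportions):
    """Linear combination of per-cell-type expression vectors."""
    n_genes = len(next(iter(profiles.values())))
    return [sum(proportions[c] * profiles[c][i] for c in profiles)
            for i in range(n_genes)]


rng = random.Random(0)
k_solid = {"B_cells": 11, "T_cells": 15, "Macrophages": 40, "Fibroblasts": 50}
typical = weighted_random_proportions(k_solid, rng)
extreme = dirichlet_proportions(list(k_solid), rng)

# Hypothetical two-gene profiles, used only to show the linear combination.
toy_mix = mix_expression({"a": [4.0, 0.0], "b": [0.0, 4.0]}, {"a": 0.25, "b": 0.75})
```

  • The first generator concentrates mixes around tissue-like ratios, while the Dirichlet generator with a parameter below one tends to produce mixes dominated by one or a few cell types, which covers the large-proportion cases the first generator rarely reaches.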
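  • expressions of artificial tissue may be generated based on expression vectors of each cell type and the randomly selected proportion of RNA of those cells. For example, as described herein, expression vectors are added up with random coefficients that reflect the proportion of RNA of those cells, such that the expression of each gene in the mix is the sum, over cell types, of that gene's expression in the cell type weighted by the cell type's RNA proportion.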
  • Table 7: most likely proportion of each cell type based on tumor tissue and blood (PBMC)
    Cell type | Solid tumors | PBMC
    B_cells | 11 | 20
    Plasma_B_cells | 6 | 3
    Non_plasma_B_cells | 5 | 17
    T_cells | 15 | 100
    CD4_T_cells | 7 | 50
    Tregs | 4 | 2
    CD8_T_cells | 8 | 50
    CD8_T_cells_PD1_low | 4 | 48
    CD8_T_cells_PD1_high | 4 | 2
    NK_cells | 2 | 16
    Monocytes | 2 | 80
    Macrophages | 40 | 1
    Neutrophils | 2 | 10
    Fibroblasts | 50 | 1
    Endothelium | 36 | 1
    T_helpers | 3 | 48
    Macrophages_M1 | 12 | 0.5
    Macrophages_M2 | 28 | 0.5
Noise Generation
  • noise (e.g., technical noise, uniform noise, or any suitable form of noise) may be added to the artificial mixes.
  • expression of each gene may contribute noise to the overall tissue expression.
  • $$T_i^j = \mu_{T_i} + P_i^j + N_i^{prep} + N_i^{bio}$$
  • where $\mu_{T_i}$ represents the true expression of the gene, $P_i^j$ represents Poisson technical noise, $N_i^{prep}$ represents normally distributed noise derived from sequencing library preparation, and $N_i^{bio}$ represents variable biological noise.
  • a TPM-based mathematical noise model is provided, which accounts for technical noise (both Poisson and non-Poisson).
  • this model of variability may be added to the artificial mixes generated to train the non-linear regression models, as described herein.
  • technical non-Poisson noise is assumed to be normally distributed.
  • Poisson noise is a type of technical noise which may be associated with the sequencing coverage or number of read counts and may not be normally distributed.
  • the resulting dependence of technical noise on coverage and gene expression could be expressed by a formula:
  • $$\sigma_{P_i} = \alpha \sqrt{\frac{1}{l_i \, \bar{T}_i \, R}}$$
  • where $\bar{T}_i$ is a mean TPM in technical replicates, $R$ is read counts, and $\alpha$ is an estimated proportional coefficient.
  • this model may correctly represent the gene expression variability as a result of expression levels and coverage, as shown using technical replicates of purified cell populations ( FIG. 12I ).
  • the limit of detection of gene expression varied from 1 TPM at a coverage of 20 million total reads to 12 TPM at a coverage of 1 million reads per sample. Therefore, the ability to assess gene expression may be influenced by the amount of material which is available.
  • the noise coefficient (α) can be calculated for Poisson noise ( FIG. 12K ). By calculating this coefficient, the technical noise for each sample and each gene can be inferred according to the deduced formula.
  • biological noise which may be associated with different activated states of a cell, can contribute to the overall variance in an RNA-seq sample.
  • biological noise may already be present through the use of RNA-seq data derived from cell subsets representing a variation of biological states.
  • this overall variance can be assessed, in one example, by plotting data for the same cell types, obtained by different experiments ( FIG. 12J ).
  • An example of the dependence of technical noise, both Poisson and non-Poisson, and biological variability on the average sequencing coverage is presented in FIG. 12J .
  • the noise increases from 10% to 26% from technical to biological replicates for certain cell types ( FIG. 12J , right).
  • the analysis of noise contribution due to single gene expression may be applied to simulate technical and biological noise in artificial mixes.
  • noise may be added to total gene expression in two summands:
  • $$T_i^{\text{mix after}} = T_i^{\text{mix before}} + \alpha \sqrt{\frac{T_i^{\text{mix before}}}{l_i}} \, \xi_P + \beta \, T_i^{\text{mix before}} \, \xi_N$$
  • where $\xi_P, \xi_N \sim N(0,1)$, $\alpha$ is the Poisson noise level coefficient, and $\beta$ is the coefficient of uniform level non-Poisson noise (Table 9).
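  • As a non-limiting illustration, one reading of the two-summand noise equation above may be sketched as follows: each gene's expression receives a Poisson-like term that scales with the square root of expression divided by l_i, and a uniform-level term that scales with the expression itself, each multiplied by an independent standard normal draw. The function `add_noise`, the example coefficient values, and the example inputs are hypothetical. Treating l_i as a per-gene length is an assumption of this sketch, since the text above does not define l_i.

```python
import math
import random


def add_noise(mix, l_values, alpha, beta, rng):
    """mix: expression per gene (non-negative); l_values: l_i per gene (assumed gene length)."""
    noisy = []
    for t, l in zip(mix, l_values):
        xi_p = rng.gauss(0.0, 1.0)  # standard normal draw for the Poisson-like term
        xi_n = rng.gauss(0.0, 1.0)  # standard normal draw for the uniform-level term
        poisson_term = alpha * math.sqrt(t / l) * xi_p
        uniform_term = beta * t * xi_n
        noisy.append(t + poisson_term + uniform_term)
    return noisy


rng = random.Random(0)
mix_before = [120.0, 3.5, 0.0, 48.0]
l_values = [1500.0, 800.0, 2200.0, 1000.0]  # hypothetical per-gene values
mix_after = add_noise(mix_before, l_values, alpha=0.5, beta=0.16, rng=rng)
```

  • A gene with zero expression before noise stays at zero under this formula, because both terms scale with the expression itself.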
  • the above-described approach may be validated by excluding the technical Poisson noise from the technical non-Poisson and biological noise.
  • an average variance at about 16% was obtained, which was subsequently used in mixes.
  • the noise lost the dependence on the sequencing coverage. This may be expected, since the technical non-Poisson and biological variability do not depend on the measurement method.
  • the noise model described herein may be used to add technical (both Poisson and non-Poisson) variation to artificial mixes. This results in artificial mixes which better mimic real tissues. Improved artificial mixes may subsequently be used to train the deconvolution algorithm (e.g., as described herein including with respect to FIGS. 4-6 ) to ensure model stability when encountering real sequencing variability.
  • training a non-linear regression model may comprise estimating and/or updating parameters for the model, in some embodiments.
  • the parameters for the model may include some parameters, which may be referred to herein as hyperparameters, other than the learned weights for the model (e.g., as described herein at least with respect to FIG. 4 ).
  • An exemplary list of such hyperparameters and their values is shown in Table 9.
  • values for the hyperparameters may be estimated as the non-linear regression models are trained. For example, some or all of the hyperparameters may be updated based on one or more validation sets of the training data (e.g., with each fold of the model training). In some embodiments, the hyperparameters may be estimated based on TCGA data. For example, the results for a particular setting of the hyperparameters may be checked for consistency against TCGA data, such that TCGA model concordance may be achieved.
  • the sum of results across the cell subtypes (e.g., T cells, B cells, and NK cells) is equal (or close) to the overall result for the cell type.
  • a parameter search may be performed as part of estimating the hyperparameters. Any suitable parameter search technique may be used, including a random search, a grid search, or a genetic algorithm. In some embodiments, the parameter search may be performed using Bayesian optimization, gradient-based optimization, or evolutionary optimization, for example. In some embodiments, a parameter search may select one or more hyperparameter values from a predetermined range associated with the hyperparameter.
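  • As a non-limiting illustration, a random search that draws each hyperparameter from a predetermined range and keeps the best-scoring set may be sketched as follows. All range values, the parameter names, and the toy `evaluate` function are hypothetical; in practice the evaluation would involve training on the folds and scoring on validation mixes.

```python
import random

# Hypothetical predetermined ranges; integer bounds denote integer-valued parameters.
RANGES = {
    "num_av": (1, 10),
    "uniform_noise_level": (0.0, 0.5),
    "dirichlet_samples_proportion": (0.0, 0.5),
    "rebalance_parameter": (0.0, 1.0),
    "hyperexpression_fraction": (0.0, 0.1),
    "max_hyperexpression_level": (0.0, 1000.0),
}


def sample_parameters(ranges, rng):
    """Draw one value per hyperparameter, uniformly within its range."""
    params = {}
    for name, (lo, hi) in ranges.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = rng.randint(lo, hi)
        else:
            params[name] = rng.uniform(lo, hi)
    return params


def random_search(evaluate, ranges, n_trials, seed=0):
    """Return the sampled parameter set with the highest evaluation score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_parameters(ranges, rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def evaluate(params):
    # Toy stand-in for "train on folds, score on validation mixes".
    return -abs(params["rebalance_parameter"] - 0.5)


best_params, best_score = random_search(evaluate, RANGES, n_trials=50)
```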
  • Tables 8 and 9 list example hyperparameters: number of samples for averaging (Nav), uniform noise level (β), Dirichlet samples proportion (Dp), rebalance parameter (r), hyperexpression fraction (Hf), and maximum hyperexpression level (Mhl).
  • a number of artificial mixes “Dp” may be created in which proportions are generated from the Dirichlet distribution.
  • the rebalance parameter “r” may be used in an equation to determine a new number of samples in the dataset. As described, “r” is a value in the range [0, 1], where 0 means there is no change in the number of samples, and 1 means that for each dataset there will be the same number of samples. In some embodiments, the rebalancing parameter may be selected during training.
  • hyperexpression noise may be added to each of the artificial mixes.
  • random values are added to the genes' expression of a selected tumor sample with a small probability for creating each mix. For example, with a probability of “Hf” a random number from a uniform distribution from zero to “Mhl” may be added to the expression of each gene.
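  • The two steps described above, drawing mix proportions from a Dirichlet distribution and adding hyperexpression noise, can be sketched as below. The function names, the symmetric concentration parameter alpha, and the list-based data layout are assumptions made for illustration; only the use of “Hf” as a per-gene probability and “Mhl” as the upper bound of the uniform noise comes from the text.

```python
import random


def dirichlet_proportions(n_cell_types, rng, alpha=1.0):
    """Draw cell-type proportions from a symmetric Dirichlet distribution.

    Normalized gamma draws are a standard way to sample a Dirichlet.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_cell_types)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_profiles(profiles, proportions):
    """Weighted sum of per-cell-type expression vectors (one value per gene)."""
    n_genes = len(profiles[0])
    return [
        sum(w * p[g] for w, p in zip(proportions, profiles))
        for g in range(n_genes)
    ]


def add_hyperexpression_noise(expression, hf, mhl, rng):
    """With probability hf per gene, add a uniform random value from [0, mhl]."""
    return [
        x + rng.uniform(0.0, mhl) if rng.random() < hf else x
        for x in expression
    ]
```

  • For example, with three cell-type profiles, mix_profiles(profiles, dirichlet_proportions(3, random.Random(0))) yields one artificial mix, which add_hyperexpression_noise can then perturb.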
  • the machine learning models described herein may include tens of thousands, hundreds of thousands, or millions of parameters.
  • the non-linear regression models 304, as described herein including at least with respect to FIGS. 2-6, may include at least ten thousand parameters, at least one hundred thousand parameters, or at least one million parameters.
  • processing data with machine learning models like the non-linear regression models 304 requires millions of calculations to be performed, which cannot be done practically in the human mind and without computers.
  • the algorithms for training the machine learning models described herein may require an even greater amount of computational resources, as such models are trained using tens of thousands, hundreds of thousands, or millions of artificial mixes (e.g., as described herein including at least with respect to FIG. 6A). In one specific example, three million artificial mixes may be generated for training the non-linear regression models across two stages (e.g., as described herein including at least with respect to FIG. 5A). Neither the training algorithms nor the use of the trained models may be performed without computing resources.
  • FIGS. 7A-7G depict a variety of results achieved using the techniques developed by the inventors.
  • the techniques developed by the inventors substantially outperform conventional techniques for cellular deconvolution.
  • the cellular deconvolution techniques developed by the inventors may be referred to as “Kassandra”.
  • FIG. 7A is a chart comparing simulated RNA expression data 702 (e.g., a plurality of artificial mixes generated according to the techniques of FIG. 6A ) to RNA expression data 704 from a plurality of biological samples (e.g., tumor).
  • the RNA expression data 702 is obtained from five hundred artificial lung cancer samples, developed using the techniques described herein including with respect to FIG. 6A .
  • FIG. 7B is a chart depicting exemplary cell composition percentages predicted according to the deconvolution techniques developed by the inventors, and corresponding true cell composition percentages.
  • FIGS. 7C and 7D are exemplary charts representing the Pearson correlation for different cell types between predicted and true artificial mix values (e.g., prediction accuracy).
  • the graphs compare exemplary prediction accuracy for the deconvolution techniques developed by the inventors, and the prediction accuracy for alternative algorithms.
  • In FIG. 7C, the prediction accuracy without cancer cell hyperexpression noise is presented.
  • In FIG. 7D, the prediction accuracy with cancer cell hyperexpression noise is presented.
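  • The accuracy measure used in these charts, the Pearson correlation between predicted and true values for one cell type, can be computed as in the following sketch. The function name and the plain-list inputs are assumptions for illustration, not part of the disclosure.

```python
import math


def pearson(predicted, true):
    """Pearson correlation between predicted and true values for one cell type."""
    n = len(predicted)
    if n != len(true) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, true))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)
```

  • Calling pearson once per cell type, on the predicted and true percentages across all artificial mixes, gives one value per cell type of the kind plotted in such charts.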
  • random hyperexpression noise may be added to artificial mixes (e.g., to allow the deconvolution techniques developed by the inventors to ignore aberrant expressions from malignant cells in the samples).
  • four example gene markers in TCGA data derived from four different cancer types were analyzed: CD14 in bladder cancer, FCRLA in skin cutaneous melanoma, STAP1 in clear cell renal cell carcinoma, and PAD12 in lung squamous cell carcinoma. Each of these markers was found to be overexpressed in the corresponding cancer type. While these markers are not expressed in the corresponding normal tissue, they are found to be expressed in immune cells (FIG. 7E).
  • the deconvolution techniques developed by the inventors are stable to aberrant high expression present in the data.
  • the techniques developed by the inventors produce accurate predictions across cell types, even when hyperexpression noise is present ( FIG. 7D ).
  • FIG. 7D indicates that the performance of the alternative algorithms is significantly reduced in the presence of overexpression noise, while the techniques developed by the inventors retained high correlation scores on the validation dataset.
  • the alternative algorithms include CIBERSORT, CIBERSORTx, QuanTIseq, FARDEEP, Xcell, ABIS, EPIC, MCP-counter, Scaden, and MuSiC.
  • Newman et al. (“Robust enumeration of cell subsets from tissue expression profiles.” Nat. Methods 12, 453-457, (2015)) describes CIBERSORT.
  • Newman et al. (“Determining cell type abundance and expression from bulk tissues with digital cytometry.” Nat Biotechnol 37, 773-782 (2019)), describes CIBERSORTx. Finotello et al.
  • FIG. 7F is a heatmap representing the Pearson correlation for different cell types between predicted and true artificial mix values (e.g., prediction accuracy) for the deconvolution techniques developed by the inventors. Predicted cell percentages for different cell types are shown for data from sorted samples, derived from holdout datasets. As shown, the deconvolution techniques developed by the inventors achieved high prediction accuracy scores across cell types, including closely related cell types.
  • FIG. 7G is a chart comparing exemplary non-specificity scores for the deconvolution techniques developed by the inventors to non-specificity scores for alternative algorithms.
  • non-specificity scores for eleven alternative algorithms are shown.
  • the values in the chart of FIG. 7G represent percentages of non-specific (false positive) predictions relative to specific (true positive) predictions for different cell types.
  • a low non-specificity score indicates a lower percentage of false positive predictions (e.g., indicating a more specific model).
  • the detection of signals for each cell type in pure populations was assessed, and B-cells, T-cells, and macrophages were further subdivided, with each subclass clearly distinguished from the others.
  • a linear method of cellular deconvolution may be provided.
  • An exemplary linear deconvolution technique is described herein below with respect to FIGS. 8 and 9A-9C .
  • FIG. 8 is a flowchart depicting an exemplary linear method 800 for determining cell composition percentages based on expression data (e.g., RNA expression data).
  • the method 800 may comprise estimating cell composition percentages for one or more cell types in a biological sample, using an expression profile (e.g., an RNA expression, and/or an expression profile as shown in FIG. 9A ) for each cell type.
  • the method 800 may be carried out on a computing device (e.g., as described herein including at least with respect to FIG. 10 ).
  • the computing device may include at least one processor, and at least one non-transitory storage medium storing processor-executable instructions which, when executed, perform the acts of method 800 .
  • the method 800 may be carried out, for example, in a system such as system 100 (which may include, for example, a clinical setting or a laboratory setting), by one or more computing devices such as by computing device 108 .
  • the method 800 may begin with obtaining RNA expression data for a biological sample from a subject.
  • act 802 may include accessing RNA expression data that was previously obtained from a biological sample.
  • the biological sample may comprise a biopsy (e.g., of a tumor or other diseased tissue of the subject) or any other suitable type of biological sample, and the expression data may be extracted using any suitable techniques.
  • the expression data obtained at act 802 may comprise RNA expression data measured in TPM.
  • the origin or preparation methods of the biological sample may include any of the embodiments described with respect to the “Biological Samples” section.
  • the origin or preparation methods of the expression data may include any of the embodiments described with respect to the “Expression Data” and “Obtaining RNA expression data” sections.
  • the expression data may be stored on at least one storage medium and accessed as part of act 802 .
  • the expression data may be stored in one or more files or in a database, which may be read as part of act 802 .
  • the at least one storage medium storing the expression data may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment).
  • the expression data may be stored on a single storage medium or may be distributed across multiple storage mediums.
  • act 802 may further comprise pre-processing the expression data in any suitable manner.
  • the expression data may be sorted, combined, organized into batches, filtered, or pre-processed with any other suitable techniques.
  • the pre-processing may make the expression data suitable to be processed using the linear regression technique described herein with respect to acts 804 - 806 .
  • pre-processing the RNA expression data may include any of the embodiments described with respect to the “Alignment and annotation,” “Removing non-coding transcripts,” and “Conversion to TPM and gene aggregation” sections.
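  • As a rough illustration of conversion to TPM followed by gene aggregation, the sketch below applies the standard transcripts-per-million formula (length-normalized counts rescaled to sum to one million) and then sums transcript values per gene. The function names and the dictionary-based inputs are assumptions; the referenced sections may describe additional steps not shown here.

```python
from collections import defaultdict


def counts_to_tpm(counts, lengths_kb):
    """Convert transcript read counts to TPM.

    counts maps transcript id -> read count; lengths_kb maps transcript id ->
    transcript length in kilobases.
    """
    rates = {t: counts[t] / lengths_kb[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all length-normalized counts are zero")
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def aggregate_to_genes(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values into gene-level TPM values."""
    gene_tpm = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        gene_tpm[transcript_to_gene[transcript]] += value
    return dict(gene_tpm)
```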
  • the method 800 may proceed with processing the RNA expression data using a linear regression technique in order to determine one or more corresponding cell composition percentages for the cell types.
  • the method 800 may proceed with obtaining a plurality of expression profiles (e.g., as described herein including with respect to FIG. 9A) for a corresponding plurality of cell types. For example, if CD4+ T cells, NK cells, and CD8+ T cells are being analyzed using method 800, then an expression profile for CD4+ T cells, an expression profile for NK cells, and an expression profile for CD8+ T cells may be obtained at act 804.
  • Each of the expression profiles (e.g., RNA expression profiles) may comprise respective expression data (e.g., RNA expression data) from one or more genes associated with a respective cell type from the plurality of cell types.
  • the genes associated with each respective cell type may be specific and/or semi-specific genes for the cell type.
  • the genes associated with each respective cell type may comprise corresponding genes listed in Table 2.
  • the corresponding genes may include at least 2 genes, at least 4 genes, at least 6 genes, at least 8 genes, at least 10 genes, at least 12 genes, at least 14 genes, or at least 16 genes included in Table 2.
  • the corresponding genes may include fewer than 10,000, fewer than 5,000, fewer than 2,000, fewer than 1,000, fewer than 500, fewer than 250, or fewer than 100 genes.
  • the expression profile may be obtained in any suitable manner.
  • the expression profile may be stored in one or more files or in a database, which may be read as part of act 804 .
  • the at least one storage medium storing the expression profile may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment).
  • the expression profile may be stored on a single storage medium, or may be distributed across multiple storage mediums.
  • the method 800 may proceed with determining a plurality of cell composition percentages for the plurality of cell types at least in part by optimizing a piecewise continuous error function (e.g., the example piecewise continuous error function described with respect to FIG. 9A ) between the expression data and the plurality of expression profiles.
  • Act 806 may be performed simultaneously or iteratively across the plurality of cell types, and may be repeated (e.g., for a set number of iterations, or until a measurement of error is below a threshold) in some embodiments.
  • act 806 may comprise performing a linear regression using the expression data, the plurality of expression profiles, and the piecewise continuous error function. This may include, in some embodiments, optimizing the piecewise continuous error function. In some embodiments, optimizing the piecewise continuous error function is not limited to finding a global maximum or minimum of the piecewise continuous error function, but may also encompass finding a local maximum or minimum within a threshold distance of a global maximum or minimum. For example, act 806 may involve determining a combination (e.g., a weighted sum) of the expression profiles that has a lowest error or an error below a threshold (e.g., with the error measured using the piecewise continuous error function) relative to the expression data.
  • act 806 may comprise determining, for each gene associated with the cell type, a corresponding output of a piecewise continuous error function (e.g., such as the error function of FIG. 9C ).
  • the piecewise continuous error function may serve to compare an actual measured expression value from real data (e.g., RNA-seq data), against a predicted expression value which may be computed using the gene's expression in the expression profile for the cell type (e.g., as obtained at act 804 ).
  • the predicted expression value may be computed as a product of the expression of the gene in the expression profile, and a coefficient ⁇ for the cell type.
  • the input to the error function may be the coefficient ⁇ , the expression g of the gene in the input expression data, and the expression p of the gene in the expression profile for the cell type.
  • the error function may have coefficients a, b, k, as described herein including with respect to FIG. 9C , which may be updated as part of act 806 .
  • act 806 may be performed iteratively or in parallel for some or all of the genes. For example, act 806 may be performed repeatedly across the plurality of cell types until a coefficient ⁇ is found for each cell type such that the piecewise continuous error function is below a threshold or minimized.
  • the value of coefficient ⁇ may be determined by finding the coefficient value that minimizes the weighted error sum across all of the genes (e.g., the piecewise error function as described herein including with respect to act 806 and FIG. 9C , summed across all genes).
  • the coefficient ⁇ may represent the cell composition percentage for the corresponding cell type (e.g., because ⁇ defines the weight of each expression profile in the weighted sum for the expression data).
  • determining the plurality of cell composition percentages for the plurality of cell types may comprise processing the coefficients, such as by normalizing them, in order to obtain corresponding cell composition percentages for each of the plurality of cell types.
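  • The overall shape of act 806, finding one non-negative coefficient per cell type so that the weighted sum of expression profiles approximates the measured expression and then normalizing the coefficients into percentages, can be sketched as below. This sketch substitutes a plain squared error for the piecewise continuous error function, whose exact expression is given only in the figures, and uses projected gradient descent as the optimizer; both choices, along with the function names, are assumptions made for illustration.

```python
def deconvolve(expression, profiles, n_iter=500):
    """Fit non-negative coefficients so that sum_c coeff[c] * profiles[c] ~ expression.

    expression: per-gene expression values of the sample.
    profiles: one per-gene expression vector per cell type.
    Uses squared error in place of the piecewise error function (assumption).
    """
    n_types = len(profiles)
    n_genes = len(expression)
    sum_sq = sum(v * v for p in profiles for v in p)
    if sum_sq == 0:
        raise ValueError("expression profiles are all zero")
    # A step of 1 / (2 * sum of squared entries) is small enough to be stable.
    step = 1.0 / (2.0 * sum_sq)
    coeffs = [0.0] * n_types
    for _ in range(n_iter):
        residual = [
            expression[g] - sum(coeffs[c] * profiles[c][g] for c in range(n_types))
            for g in range(n_genes)
        ]
        for c in range(n_types):
            grad = -2.0 * sum(residual[g] * profiles[c][g] for g in range(n_genes))
            # Gradient step followed by projection onto the non-negative range.
            coeffs[c] = max(0.0, coeffs[c] - step * grad)
    return coeffs


def to_percentages(coeffs):
    """Normalize coefficients so that they sum to 100 percent."""
    total = sum(coeffs)
    if total == 0:
        raise ValueError("all coefficients are zero")
    return [100.0 * c / total for c in coeffs]
```

  • Swapping in a different per-gene error function changes only the gradient computation; the non-negativity projection and the final normalization are unaffected. Note that normalizing to 100 percent leaves no share for unmodeled cell types.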
  • FIG. 9A is a diagram depicting exemplary RNA expression profiles and overall RNA expression data.
  • known RNA expression profiles are shown for CD4+ T cells, NK cells, and CD8+ T cells.
  • Each RNA expression profile is illustrated as a bar graph, with the horizontal axis representing genes, and the vertical axis representing the expression of those genes. As shown in the figure, each RNA expression profile may be unique for a given cell type.
  • the overall observed expression for a biological sample may be considered as a sum of expression profiles for cell types comprising the biological sample.
  • each RNA expression profile may be weighted by a coefficient ⁇ , such that the biological sample may be considered as a weighted sum of the RNA expression profiles.
  • the sum may further include a term for unknown expression of other cell types. This term may represent expression data that is not accounted for with the weighted sum of RNA expression profiles (e.g., as shown in gray in the observed expression for the biological sample).
  • FIG. 9B depicts an exemplary piecewise continuous error function for use with the method of FIG. 8 .
  • the error function ⁇ is piecewise, with the coefficients a and b dividing the function into three sections, and coefficient k affecting the shape of the rightmost section of the error function.
  • the error may be computed according to the illustrated expression.
  • a biological sample is obtained from a subject having, suspected of having, or at risk of having cancer.
  • the biological sample may be any type of biological sample including, for example, a biological sample of a bodily fluid (e.g., blood, urine or cerebrospinal fluid), one or more cells (e.g., from a scraping or brushing such as a cheek swab or tracheal brushing), a piece of tissue (e.g., cheek tissue, muscle tissue, lung tissue, heart tissue, brain tissue, or skin tissue), or some or all of an organ (e.g., brain, lung, liver, bladder, kidney, pancreas, intestines, or muscle), or other types of biological samples (e.g., feces or hair).
  • the biological sample is a sample of a tumor from a subject. In some embodiments, the biological sample is a sample of blood from a subject. In some embodiments, the biological sample is a sample of tissue from a subject.
  • a sample of a tumor refers to a sample comprising cells from a tumor.
  • the sample of the tumor comprises cells from a benign tumor, e.g., non-cancerous cells.
  • the sample of the tumor comprises cells from a premalignant tumor, e.g., precancerous cells.
  • the sample of the tumor comprises cells from a malignant tumor, e.g., cancerous cells.
  • tumors include, but are not limited to, adenomas, fibromas, hemangiomas, lipomas, cervical dysplasia, metaplasia of the lung, leukoplakia, carcinoma, sarcoma, germ cell tumors, and blastoma.
  • a sample of blood refers to a sample comprising cells, e.g., cells from a blood sample.
  • the sample of blood comprises non-cancerous cells.
  • the sample of blood comprises precancerous cells.
  • the sample of blood comprises cancerous cells.
  • the sample of blood comprises blood cells.
  • the sample of blood comprises red blood cells.
  • the sample of blood comprises white blood cells.
  • the sample of blood comprises platelets. Examples of cancerous blood cells include, but are not limited to, leukemia, lymphoma, and myeloma.
  • a sample of blood is collected to obtain the cell-free nucleic acid (e.g., cell-free DNA) in the blood.
  • a sample of blood may be a sample of whole blood or a sample of fractionated blood.
  • the sample of blood comprises whole blood.
  • the sample of blood comprises fractionated blood.
  • the sample of blood comprises buffy coat.
  • the sample of blood comprises serum.
  • the sample of blood comprises plasma.
  • the sample of blood comprises a blood clot.
  • a sample of a tissue refers to a sample comprising cells from a tissue.
  • the sample of the tissue comprises non-cancerous cells from a tissue.
  • the sample of the tissue comprises precancerous cells from a tissue.
  • examples of tissues include organ tissue or non-organ tissue, including but not limited to, muscle tissue, brain tissue, lung tissue, liver tissue, epithelial tissue, connective tissue, and nervous tissue.
  • the tissue may be normal tissue or it may be diseased tissue or it may be tissue suspected of being diseased.
  • the tissue may be sectioned tissue or whole intact tissue.
  • the tissue may be animal tissue or human tissue.
  • Animal tissue includes, but is not limited to, tissues obtained from rodents (e.g., rats or mice), primates (e.g., monkeys), dogs, cats, and farm animals.
  • the biological sample may be from any source in the subject's body including, but not limited to, any fluid [such as blood (e.g., whole blood, blood serum, or blood plasma), saliva, tears, synovial fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, ascitic fluid, and/or urine], hair, skin (including portions of the epidermis, dermis, and/or hypodermis), oropharynx, laryngopharynx, esophagus, stomach, bronchus, salivary gland, tongue, oral cavity, nasal cavity, vaginal cavity, anal cavity, bone, bone marrow, brain, thymus, spleen, small intestine, appendix, colon, rectum, anus, liver, biliary tract, pancreas, kidney, ureter, bladder, urethra, uterus, vagina, vulva, ovary, cervix, scrotum, penis, prostate, testicle,
  • any of the biological samples described herein may be obtained from the subject using any known technique. See, for example, the following publications on collecting, processing, and storing biological samples, each of which is incorporated herein in its entirety: Biospecimens and biorepositories: from afterthought to science by Vaught et al. (Cancer Epidemiol Biomarkers Prev. 2012 February; 21(2):253-5), and Biological sample collection, processing, storage and information management by Vaught and Henderson (IARC Sci Publ. 2011; (163):23-42).
  • the biological sample may be obtained from a surgical procedure (e.g., laparoscopic surgery, microscopically controlled surgery, or endoscopy), bone marrow biopsy, punch biopsy, endoscopic biopsy, or needle biopsy (e.g., a fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, or image-guided biopsy).
  • one or more than one cell may be obtained from a subject using a scrape or brush method.
  • the cell biological sample may be obtained from any area in or from the body of a subject including, for example, from one or more of the following areas: the cervix, esophagus, stomach, bronchus, or oral cavity.
  • one or more than one piece of tissue e.g., a tissue biopsy
  • the tissue biopsy may comprise one or more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10) biological samples from one or more tumors or tissues known or suspected of having cancerous cells.
  • any of the biological samples from a subject described herein may be stored using any method that preserves stability of the biological sample.
  • preserving the stability of the biological sample means inhibiting components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading until they are measured so that when measured, the measurements represent the state of the sample at the time of obtaining it from the subject.
  • a biological sample is stored in a composition that is able to penetrate the same and protect components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading.
  • degradation is the transformation of a component from one form to another such that the first form is no longer detected at the same level as before degradation.
  • a biological sample e.g., tissue sample
  • a “fixed” sample relates to a sample that has been treated with one or more agents or processes in order to prevent or reduce decay or degradation, such as autolysis or putrefaction, of the sample.
  • fixative processes include but are not limited to heat fixation, immersion fixation, and perfusion.
  • a fixed sample is treated with one or more fixative agents.
  • fixative agents include but are not limited to cross-linking agents (e.g., aldehydes, such as formaldehyde, formalin, glutaraldehyde, etc.), precipitating agents (e.g., alcohols, such as ethanol, methanol, acetone, xylene, etc.), mercurials (e.g., B-5, Zenker's fixative, etc.), picrates, and Hepes-glutamic acid buffer-mediated organic solvent protection effect (HOPE) fixative.
  • a formalin-fixed biological sample is embedded in a solid substrate, for example paraffin wax.
  • the biological sample is a formalin-fixed paraffin-embedded (FFPE) sample.
  • the biological sample is stored using cryopreservation.
  • examples of cryopreservation include, but are not limited to, step-down freezing, blast freezing, direct plunge freezing, snap freezing, slow freezing using a programmable freezer, and vitrification.
  • the biological sample is stored using lyophilization.
  • a biological sample is placed into a container that already contains a preservant (e.g., RNALater to preserve RNA) and then frozen (e.g., by snap-freezing), after the collection of the biological sample from the subject.
  • such storage in frozen state is done immediately after collection of the biological sample.
  • a biological sample may be kept at either room temperature or 4° C. for some time (e.g., up to an hour, up to 8 h, or up to 1 day, or a few days) in a preservant or in a buffer without a preservant, before being frozen.
  • Non-limiting examples of preservants include formalin solutions, formaldehyde solutions, RNALater or other equivalent solutions, TriZol or other equivalent solutions, DNA/RNA Shield or equivalent solutions, EDTA (e.g., Buffer AE (10 mM Tris.Cl; 0.5 mM EDTA, pH 9.0)) and other coagulants, and Acid Citrate Dextrose (e.g., for blood specimens).
  • special containers may be used for collecting and/or storing a biological sample.
  • a vacutainer may be used to store blood.
  • a vacutainer may comprise a preservant (e.g., a coagulant, or an anticoagulant).
  • a container in which a biological sample is preserved may be contained in a secondary container, for the purpose of better preservation, or for the purpose of avoiding contamination.
  • any of the biological samples from a subject described herein may be stored under any condition that preserves stability of the biological sample.
  • the biological sample is stored at a temperature that preserves stability of the biological sample.
  • the sample is stored at room temperature (e.g., 25° C.).
  • the sample is stored under refrigeration (e.g., 4° C.).
  • the sample is stored under freezing conditions (e.g., −20° C.).
  • the sample is stored under ultralow temperature conditions (e.g., −50° C. to −80° C.).
  • the sample is stored under liquid nitrogen (e.g., −170° C.).
  • a biological sample is stored at −60° C. to −80° C. (e.g., −70° C.) for up to 5 years (e.g., up to 1 month, up to 2 months, up to 3 months, up to 4 months, up to 5 months, up to 6 months, up to 7 months, up to 8 months, up to 9 months, up to 10 months, up to 11 months, up to 1 year, up to 2 years, up to 3 years, up to 4 years, or up to 5 years).
  • a biological sample is stored as described by any of the methods described herein for up to 20 years (e.g., up to 5 years, up to 10 years, up to 15 years, or up to 20 years).
  • Methods of the present disclosure encompass obtaining one or more biological samples from a subject for analysis.
  • one biological sample is collected from a subject for analysis.
  • more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples are collected from a subject for analysis.
  • one biological sample from a subject will be analyzed.
  • more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples may be analyzed.
  • the biological samples may be procured at the same time (e.g., more than one biological sample may be taken in the same procedure), or the biological samples may be taken at different times (e.g., during a different procedure including a procedure 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 days; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 weeks; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 months, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 years, or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 decades after a first procedure).
  • a second or subsequent biological sample may be taken or obtained from the same region (e.g., from the same tumor or area of tissue) or a different region (including, e.g., a different tumor).
  • a second or subsequent biological sample may be taken or obtained from the subject after one or more treatments and may be taken from the same region or a different region.
  • the second or subsequent biological sample may be useful in determining whether the cancer in each biological sample has different characteristics (e.g., in the case of biological samples taken from two physically separate tumors in a patient) or whether the cancer has responded to one or more treatments (e.g., in the case of two or more biological samples from the same tumor or different tumors prior to and subsequent to a treatment).
  • each of the at least one biological sample is a bodily fluid sample, a cell sample, or a tissue biopsy sample.
  • one or more biological specimens are combined (e.g., placed in the same container for preservation) before further processing.
  • a first sample of a first tumor obtained from a subject may be combined with a second sample of a second tumor from the subject, wherein the first and second tumors may or may not be the same tumor.
  • a first tumor and a second tumor are similar but not the same (e.g., two tumors in the brain of a subject).
  • a first biological sample and a second biological sample from a subject are samples of different types of tumors (e.g., a tumor in muscle tissue and brain tissue).
  • a sample from which RNA and/or DNA is extracted is sufficiently large such that at least 2 μg (e.g., at least 2 μg, at least 2.5 μg, at least 3 μg, at least 3.5 μg or more) of RNA can be extracted from it.
  • the sample from which RNA and/or DNA is extracted can be peripheral blood mononuclear cells (PBMCs).
  • the sample from which RNA and/or DNA is extracted can be any type of cell suspension.
  • a sample from which RNA and/or DNA is extracted is sufficiently large such that at least 1.8 μg RNA can be extracted from it.
  • at least 50 mg (e.g., at least 1 mg, at least 2 mg, at least 3 mg, at least 4 mg, at least 5 mg, at least 10 mg, at least 12 mg, at least 15 mg, at least 18 mg, at least 20 mg, at least 22 mg, at least 25 mg, at least 30 mg, at least 35 mg, at least 40 mg, at least 45 mg, or at least 50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 30 mg of tissue sample is collected. In some embodiments, at least 10-50 mg (e.g., 10-50 mg, 10-15 mg, 10-30 mg, 10-40 mg, 20-30 mg, 20-40 mg, 20-50 mg, or 30-50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 20-30 mg of tissue sample is collected from which RNA and/or DNA is extracted.
  • a sample from which RNA and/or DNA is extracted is sufficiently large such that at least 0.2 μg (e.g., at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it.
  • a sample from which RNA and/or DNA is extracted is sufficiently large such that at least 0.1 μg (e.g., at least 100 ng, at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it.
  • a subject is a mammal (e.g., a human, a mouse, a cat, a dog, a horse, a hamster, a cow, a pig, or other domesticated animal).
  • a subject is a human.
  • a subject is an adult human (e.g., of 18 years of age or older).
  • a subject is a child (e.g., less than 18 years of age).
  • a human subject is one who has or has been diagnosed with at least one form of cancer.
  • a cancer from which a subject suffers is a carcinoma, a sarcoma, a myeloma, a leukemia, a lymphoma, or a mixed type of cancer that comprises more than one of a carcinoma, a sarcoma, a myeloma, a leukemia, and a lymphoma.
  • Carcinoma refers to a malignant neoplasm of epithelial origin or cancer of the internal or external lining of the body.
  • Sarcoma refers to cancer that originates in supportive and connective tissues such as bones, tendons, cartilage, muscle, and fat.
  • Myeloma is cancer that originates in the plasma cells of bone marrow.
  • Leukemias (“liquid cancers” or “blood cancers”) are cancers of the bone marrow (the site of blood cell production). Lymphomas develop in the glands or nodes of the lymphatic system, a network of vessels, nodes, and organs (specifically the spleen, tonsils, and thymus) that purify bodily fluids and produce infection-fighting white blood cells, or lymphocytes.
  • examples of a mixed type of cancer include adenosquamous carcinoma, mixed mesodermal tumor, carcinosarcoma, and teratocarcinoma.
  • a subject has a tumor.
  • a tumor may be benign or malignant.
  • a cancer is any one of the following: skin cancer, lung cancer, breast cancer, prostate cancer, colon cancer, rectal cancer, cervical cancer, and cancer of the uterus.
  • a subject is at risk for developing cancer, e.g., because the subject has one or more genetic risk factors, or has been exposed to or is being exposed to one or more carcinogens (e.g., cigarette smoke, or chewing tobacco).
  • Expression data (e.g., indicating expression levels) for a plurality of genes may be used for any of the methods or compositions described herein.
  • the number of genes which may be examined may be up to and inclusive of all the genes of the subject. In some embodiments, expression levels may be examined for all of the genes of a subject.
  • the expression data may include, for each cell type listed in Table 2, expression data for at least 5, at least 10, at least 15, at least 20, at least 25, at least 35, at least 50, at least 75, or at least 100 genes selected from the group of genes for that cell type in Table 2.
  • any method may be used on a sample from a subject in order to acquire expression data (e.g., indicating expression levels) for the plurality of genes.
  • the expression data may be RNA expression data, DNA expression data, or protein expression data.
  • DNA expression data refers to a level of DNA in a sample from a subject.
  • the level of DNA in a sample from a subject having cancer may be elevated compared to the level of DNA in a sample from a subject not having cancer, e.g., a gene duplication in a cancer patient's sample.
  • the level of DNA in a sample from a subject having cancer may be reduced compared to the level of DNA in a sample from a subject not having cancer, e.g., a gene deletion in a cancer patient's sample.
  • DNA expression data refers to data for DNA (or gene) expressed in a sample, for example, sequencing data for a gene that is expressed in a patient's sample. Such data may be useful, in some embodiments, to determine whether the patient has one or more mutations associated with a particular cancer.
  • RNA expression data may be acquired using any method known in the art including, but not limited to: whole transcriptome sequencing, total RNA sequencing, mRNA sequencing, targeted RNA sequencing, small RNA sequencing, ribosome profiling, RNA exome capture sequencing, and/or deep RNA sequencing.
  • DNA expression data may be acquired using any method known in the art including any known method of DNA sequencing. For example, DNA sequencing may be used to identify one or more mutations in the DNA of a subject. Any technique used in the art to sequence DNA may be used with the methods and compositions described herein.
  • the DNA may be sequenced through single-molecule real-time sequencing, ion torrent sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation (SOLiD sequencing), nanopore sequencing, or Sanger sequencing (chain termination sequencing).
  • Protein expression data may be acquired using any method known in the art including, but not limited to: N-terminal amino acid analysis, C-terminal amino acid analysis, Edman degradation (including through use of a machine such as a protein sequenator), or mass spectrometry.
  • the expression data is acquired through bulk RNA sequencing.
  • Bulk RNA sequencing may include obtaining expression levels for one or more genes across RNA extracted from a population of multiple input cells, which population may include multiple different cell types.
  • the expression data is acquired through single cell sequencing (e.g., scRNA-seq). Single cell sequencing may include sequencing individual cells.
  • the expression data comprises whole exome sequencing (WES) data. In some embodiments, the expression data comprises whole genome sequencing (WGS) data. In some embodiments, the expression data comprises next-generation sequencing (NGS) data. In some embodiments, the expression data comprises microarray data.
  • a method to process RNA expression data comprises obtaining RNA expression data for a subject (e.g., a subject who has or has been diagnosed with a cancer).
  • obtaining RNA expression data comprises obtaining a biological sample and processing it to perform RNA sequencing using any one of the RNA sequencing methods described herein.
  • RNA expression data is obtained from a lab or center that has performed experiments to obtain RNA expression data (e.g., a lab or center that has performed RNA-seq).
  • a lab or center is a medical lab or center.
  • RNA expression data is obtained by obtaining a computer storage medium (e.g., a data storage drive) on which the data exists.
  • RNA expression data is obtained via a secured server (e.g., a SFTP server, or Illumina BaseSpace).
  • data is obtained in the form of a text-based file (e.g., a FASTQ file).
  • a file in which sequencing data is stored also contains quality scores of the sequencing data.
  • a file in which sequencing data is stored also contains sequence identifier information.
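As an illustration of the file contents described above (sequence, quality scores, and sequence identifier), the following is a minimal sketch of reading FASTQ records in Python. It assumes the common four-line record layout and the Phred+33 quality encoding; neither is specified in the text above, and the function name `read_fastq` and the two example reads are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, phred_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical two-record input, used only for illustration.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(example)))
```

Each yielded tuple keeps the identifier, the bases, and one integer quality score per base, which is typically what downstream alignment or filtering steps consume.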
  • a method to process RNA expression data comprises aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data.
  • alignment of RNA expression data comprises aligning the data to a known assembled genome for a particular species of subject (e.g., the genome of a human) or to a transcriptome database.
  • a variety of sequence alignment software is available and can be used to align data to an assembled genome or a transcriptome database.
  • Non-limiting examples of alignment software include short (unspliced) aligners (e.g., BLAT, BFAST, Bowtie, Burrows-Wheeler Aligner, Short Oligonucleotide Analysis package, or Mosaik), spliced aligners, aligners based on known splice junctions (e.g., Errange, IsoformEx, or Splice Seq), or de novo splice aligners (e.g., ABMapper, BBMap, CRAC, or HiSAT).
  • any suitable tool can be used for aligning and annotating data.
  • Kallisto (github.com/pachterlab/kallisto) is used to align and annotate data.
  • a known genome is referred to as a reference genome.
  • a reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled as a representative example of a species' set of genes.
  • human and mouse reference genomes used in any one of the methods described herein are maintained and improved by the Genome Reference Consortium (GRC).
  • Non-limiting examples of human reference releases are GRCh38, GRCh37, NCBI Build 36.1, NCBI Build 35, and NCBI Build 34.
  • a non-limiting example of a transcriptome database is Transcriptome Shotgun Assembly (TSA).
  • annotating RNA expression data comprises identifying the locations of genes and/or coding regions in the data to be processed by comparing it to assembled genomes or transcriptome databases.
  • data sources for annotation include GENCODE (www.gencodegenes.org), RefSeq (see e.g., www.ncbi.nlm.nih.gov/refseq/), and Ensembl.
  • annotating genes in RNA expression data is based on a GENCODE database (e.g., GENCODE V23 annotation; www.gencodegenes.org).
  • a method to process RNA expression data comprises removing non-coding transcripts from annotated RNA expression data. Aligning and annotating RNA expression data allows identification of coding and non-coding reads.
  • non-coding reads for transcripts are removed so as to concentrate analysis effort on expression of proteins (e.g., those that may be involved in pathology of cancer).
  • removing reads for non-coding transcripts from the data reduces the variance in the data, e.g., in replicates of the same or similar sample (e.g., nucleic acid from the same cells or cell-type).
  • non-limiting examples of expression data that is removed include one or more non-coding transcripts (e.g., 10-50, 50-100, 100-1,000, 1,000-2,500, 2,500-5,000 or more non-coding transcripts) that belong to one or more gene groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), rib
  • information for one or more transcripts for one or more of these types of transcripts can be obtained in a nucleic acid database (e.g., a Gencode database, for example Gencode V23, Genbank database, EMBL database, or other database).
  • a fraction (e.g., 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 99.5% or more) of the non-coding transcripts, histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, and/or T cell receptor-encoding genes as described herein are removed from aligned and annotated RNA expression data.
  • a method to process RNA expression data comprises normalizing RNA expression data per length of transcript that is read (e.g., to transcripts per million (TPM) format).
  • RNA expression data that is normalized per length of transcript is first aligned and annotated. Conversion of data to TPM allows presentation of expression in the form of concentration, rather than counts, which in turn allows comparison of samples with different total read counts and/or length of reads.
  • RNA expression data that is normalized per length of transcript read is then analyzed to obtain gene expression data (expression data per gene).
  • gene aggregation comprises combining expression data in reads for transcripts for all isoforms of a gene to obtain expression data for that gene.
  • gene aggregation to obtain gene expression data is performed after TPM normalization but before identifying genes that introduce bias. In some embodiments, gene aggregation is performed before conversion of the data to TPM.
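The length normalization and gene aggregation steps described above can be sketched as follows. This is a minimal illustration using the standard TPM definition (reads per kilobase, rescaled to sum to one million), not the implementation used by the inventors; the function names, transcript identifiers, counts, and lengths are all hypothetical.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert per-transcript read counts to transcripts per million (TPM)."""
    # Reads per kilobase: corrects for longer transcripts collecting more reads.
    rpk = {t: counts[t] / (lengths_bp[t] / 1000.0) for t in counts}
    total = sum(rpk.values())
    if total == 0:
        raise ValueError("no reads assigned to any transcript")
    # Rescale so that the values sum to one million in every sample.
    return {t: value * 1e6 / total for t, value in rpk.items()}

def aggregate_by_gene(transcript_tpm, transcript_to_gene):
    """Sum TPM over all isoforms of each gene."""
    gene_tpm = {}
    for transcript, value in transcript_tpm.items():
        gene = transcript_to_gene[transcript]
        gene_tpm[gene] = gene_tpm.get(gene, 0.0) + value
    return gene_tpm

# Hypothetical toy data: two isoforms of GENE_A and one transcript of GENE_B.
counts = {"tx1": 100, "tx2": 300, "tx3": 600}
lengths = {"tx1": 1000, "tx2": 3000, "tx3": 2000}
tx2gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}

tpm = counts_to_tpm(counts, lengths)
gene_tpm = aggregate_by_gene(tpm, tx2gene)
```

Because TPM values always sum to the same total, they behave like concentrations, which is why samples with different total read counts become comparable after the conversion.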
  • An illustrative implementation of a computer system 1000 that may be used in connection with any of the embodiments of the technology described herein (e.g., such as the method of FIGS. 2, 4, and 6 ) is shown in FIG. 10 .
  • the computer system 1000 includes one or more processors 1010 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 1020 and one or more non-volatile storage media 1030 ).
  • the processor 1010 may control writing data to and reading data from the memory 1020 and the non-volatile storage device 1030 in any suitable manner, as the aspects of the technology described herein are not limited in this respect.
  • the processor 1010 may execute one or more processor-executable instructions stored in one or more non-transitory computer-readable storage media (e.g., the memory 1020 ), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 1010 .
  • Computing device 1000 may also include a network input/output (I/O) interface 1040 via which the computing device may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 1050 , via which the computing device may provide output to and receive input from a user.
  • the user I/O interfaces may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.
  • the techniques described herein may be implemented in the illustrative environment 1100 shown in FIG. 11 .
  • one or more biological samples of a subject 1180 may be provided to a laboratory 1170 .
  • Laboratory 1170 may process the biological sample(s) to obtain expression data (e.g., DNA, RNA, and/or protein expression data) and/or sequence information and provide it, via network 1110 , to at least one database 1120 that stores information about subject (e.g., patient) 1180 .
  • Network 1110 may be a wide area network (e.g., the Internet), a local area network (e.g., a corporate Intranet), and/or any other suitable type of network. Any of the devices shown in FIG. 11 may connect to the network 1110 using one or more wired links, one or more wireless links, and/or any suitable combination thereof.
  • the at least one database 1120 may store expression data and/or sequence information for the subject (e.g., patient), medical history data for the subject (e.g., patient), test result data for the subject (e.g., patient), and/or any other suitable information about the subject 1180 .
  • Examples of stored test result data for the subject (e.g., patient) include biopsy test results, imaging test results (e.g., MRI results), and blood test results.
  • the information stored in at least one database 1120 may be stored in any suitable format and/or using any suitable data structure(s), as aspects of the technology described herein are not limited in this respect.
  • the at least one database 1120 may store data in any suitable way (e.g., one or more databases, one or more files).
  • the at least one database 1120 may be a single database or multiple databases.
  • illustrative environment 1100 includes one or more external databases 1160 , which may store information for patients other than patient 1180 .
  • external databases 1160 may store expression data and/or sequence information (of any suitable type) for one or more patients, medical history data for one or more patients, test result data (e.g., imaging results, biopsy results, blood test results) for one or more patients, demographic and/or biographic information for one or more patients, and/or any other suitable type of information.
  • external database(s) 1160 may store information available in one or more publicly accessible databases such as TCGA (The Cancer Genome Atlas), one or more databases of clinical trial information, and/or one or more databases maintained by commercial sequencing suppliers.
  • the external database(s) 1160 may store such information in any suitable way using any suitable hardware, as aspects of the technology described herein are not limited in this respect.
  • the at least one database 1120 and the external database(s) 1160 may be the same database, may be part of the same database system, or may be physically co-located, as aspects of the technology described herein are not limited in this respect.
  • server(s) 1140 may access information stored in database(s) 1120 and/or 1160 and use this information to perform processes described herein for determining one or more characteristics of a biological sample (e.g., determining cell composition percentages thereof) and/or of the sequence information.
  • server(s) 1140 may include one or multiple computing devices. When server(s) 1140 include multiple computing devices, the devices may be physically co-located (e.g., in a single room) or distributed across multiple physical locations. In some embodiments, server(s) 1140 may be part of a cloud computing infrastructure. In some embodiments, one or more server(s) 1140 may be co-located in a facility operated by an entity (e.g., a hospital, research institution) with which doctor 1150 is affiliated. In such embodiments, it may be easier to allow server(s) 1140 to access private medical data for the patient 1180 .
  • the results of the analysis performed by server(s) 1140 may be provided to doctor 1150 through a computing device 1130 (which may be a portable computing device, such as a laptop or smartphone, or a fixed computing device such as a desktop computer).
  • the results may be provided in a written report, an e-mail, a graphical user interface, and/or any other suitable way.
  • the results may be provided to patient 1180 or a caretaker of patient 1180 , a healthcare provider such as a nurse, or a person involved with a clinical trial.
  • the results may be part of a graphical user interface (GUI) presented to the doctor 1150 via the computing device 1130 .
  • the GUI may be presented to the user as part of a webpage displayed by a web browser executing on the computing device 1130 .
  • the GUI may be presented to the user using an application program (different from a web-browser) executing on the computing device 1130 .
  • the computing device 1130 may be a mobile device (e.g., a smartphone) and the GUI may be presented to the user via an application program (e.g., “an app”) executing on the mobile device.
  • FIG. 12A shows the proportions of Transcripts Per Million (TPM) covering transcripts of different biological types calculated in the different samples of purified B cells sequenced in different laboratories (as an example for a cell type).
  • GEO and ArrayExpress IDs of the different datasets of sorted B cells are shown as labels on the X axis.
  • the transcript biological type is indicated in the legend (according to GENCODE annotation, version 23).
  • variability in total expression belonging to short RNA transcripts strongly skews TPM value distribution of genes of interest due to increased variation resulting from length normalization of short transcripts.
  • removing reads for non-coding transcripts from the data may reduce the variance in the data.
  • FIG. 12B shows transcripts distribution by transcript biotype and length, as shown in the legend, of a reference human transcriptome (GENCODE, v23). Proportions of transcript numbers of different length for each biotype in the reference transcriptome are shown (with additional categories of all retained and all removed transcripts in FIG. 12C ).
  • a substantial amount of noise was derived from short transcripts of TCR- and BCR-coding genes, annotated in the transcriptome as corresponding to V, D, or J regions. While T- and B-cells produce long transcripts after VDJ recombination, these short transcripts are never synthesized; therefore, different TCR and BCR variants (TCR and BCR repertoires) could not be correctly measured without specific realignment.
  • TCR and BCR protein-coding transcripts were excluded from TPM normalization. Excluding non-coding transcripts and transcripts of TCR- and BCR-coding genes may reduce the variance in the data, as shown in FIG. 12B .
  • FIG. 12C is a schematic representation of an exemplary process for expression quantification and TPM renormalization.
  • TPM expressions of transcripts were calculated by Kallisto (Bray et al. 2016).
  • Next, non-coding transcripts, transcripts coding for TCR/BCR associated with short V, D, or J segments, and other transcripts are filtered according to their biological properties and quality/evidence information.
  • the remaining transcripts are aggregated by gene and renormalized so that the TPM values sum to 1 million.
  • Transcript biological types according to the GENCODE (Frankish et al.) annotation v23: pseudogene, polymorphic_pseudogene, processed_pseudogene, transcribed_processed_pseudogene, unitary_pseudogene, unprocessed_pseudogene, transcribed_unitary_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene, transcribed_unprocessed_pseudogene, translated_unprocessed_pseudogene, TR_J_pseudogene, TR_V_pseudogene, snRNA, snoRNA, miRNA, ribozyme, rRNA, Mt_tRNA, Mt_rRNA, scaRNA, retained_intron, sense_intronic, sense_overlapping, nonsense_mediated_decay, non_stop_decay, antisense, lincRNA, macro_lncRNA, processed_transcript, 3prime_overlapping_ncrna, sRNA, misc_RNA, vaultRNA, TEC.
  • Transcripts of V, D and J regions of … : transcripts of GENCODE biotypes IG_V_gene, IG_…
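The filtering and renormalization procedure of FIG. 12C can be sketched as below. This is a simplified illustration, not the inventors' pipeline: `REMOVED_BIOTYPES` holds only a small subset of the biotypes listed above, the input is assumed to be a list of (gene, biotype, TPM) tuples such as might be derived from a transcript quantification table, and the gene names and values are hypothetical.

```python
# Subset of the removed GENCODE biotypes named in the text (illustrative only).
REMOVED_BIOTYPES = {
    "processed_pseudogene", "snRNA", "miRNA", "lincRNA",
    "retained_intron", "IG_V_gene",
}

def filter_and_renormalize(transcripts, removed_biotypes=REMOVED_BIOTYPES):
    """Drop filtered biotypes, sum TPM per gene, rescale to 1 million.

    transcripts: iterable of (gene, biotype, tpm) tuples.
    """
    gene_tpm = {}
    for gene, biotype, tpm in transcripts:
        if biotype in removed_biotypes:
            continue  # transcript is excluded before aggregation
        gene_tpm[gene] = gene_tpm.get(gene, 0.0) + tpm
    total = sum(gene_tpm.values())
    if total == 0:
        raise ValueError("all transcripts were filtered out")
    # Renormalize so the retained genes again sum to one million.
    return {gene: value * 1e6 / total for gene, value in gene_tpm.items()}

# Hypothetical transcript-level TPM values for one sample.
sample = [
    ("GAPDH", "protein_coding", 300000.0),
    ("GAPDH", "retained_intron", 100000.0),
    ("RNU1-1", "snRNA", 400000.0),
    ("CD19", "protein_coding", 200000.0),
]
renormalized = filter_and_renormalize(sample)
```

In this toy sample the snRNA accounts for 40% of the original TPM, so removing it changes the apparent expression of every retained gene; renormalization makes the retained values comparable across samples with different amounts of such transcripts.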
  • FIGS. 12D-12E are violin plots showing the relative standard deviations in expression of 3515 housekeeping genes (Eisenberg and Levanon 2013) for different cell types before (red) and after (blue) transcript filtration and TPM renormalization. Data is grouped based on the library preparation type, using either total RNA-seq ( FIG. 12D ) or polyA RNA-seq ( FIG. 12E ). The indicated P-values are calculated by the two-sided Wilcoxon test. Medians of distributions and rank-biserial correlation coefficients are shown.
  • FIG. 12F is a PCA projection of RNA expression of sorted B cells obtained from experiments using either total RNA-seq (green) or polyA RNA-seq (red), before (left) and after (right) proposed transcript filtration and renormalization. As shown, there is a decrease in unwanted batch effects between expression profiles, after the procedure of TPM renormalization described herein. Techniques for TPM normalization are described herein including with respect to the “Conversion to TPM and gene aggregation” section.
  • FIG. 12G shows the dependence of relative standard deviation of technical replicates on gene expression levels (TPM). RNA-seq experiments with a total coverage of 1 (pink), 5 (yellow) and 10 (green) million readcounts are presented.
  • FIG. 12H shows the dependence of mean standard deviation of gene expression on the total coverage of read counts in RNA-seq.
  • the illustrated graph shows samples with sequential additions of noise level: Technical Poisson noise only (blue), all technical noise (yellow), and both technical and biological noise (red).
  • FIG. 12H (right) is a violin plot showing the distribution of the same standard deviations of gene expression calculated within samples having different types of noise.
  • a component of technical noise may be specified by a Poisson distribution
  • another component of technical noise may be specified by non-Poisson noise
  • biological noise may be specified by a normal distribution.
  • FIG. 12I is a plot showing measured Poisson noise coefficients for technical replicates of RNA-seq experiments with different total readcount coverage. Poisson noise is inversely proportional to the square root of the total readcount coverage of RNA-seq data.
  • FIG. 12J shows the dependence of mean standard deviation of gene expression on the total coverage of read counts in RNA-seq.
  • the illustrated graph shows gene expression with imputed Poisson noise (green) and data for the same samples with all technical noise (yellow).
  • FIG. 12J shows the dependence of mean standard deviation of gene expression on the total coverage of read counts in RNA-seq.
  • the illustrated graph shows the same data as presented in the left graph after subtraction of the imputed Poisson noise, revealing the non-Poisson addition to the technical noise. This non-Poisson technical noise does not show any dependence on sequencing coverage.
  • FIG. 12K shows the dependence of mean standard deviation of gene expression on the total coverage of read counts in RNA-seq.
  • the illustrated graph shows gene expression for one cell line across various laboratories and experiments, accounting for both biological and technical noise. Imputed Poisson technical noise calculated for the same samples is represented in green.
  • FIG. 12K (right) shows the dependence of mean standard deviation of gene expression on the total coverage of read counts in RNA-seq. The illustrated graph shows gene expression as shown on the left after subtraction of the imputed Poisson noise, revealing the pure biological noise in the samples, which did not depend on sequencing coverage.
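The coverage dependence of Poisson noise described for FIGS. 12I-12K can be illustrated with a short calculation. The sketch below makes two simplifying assumptions that are not stated in the text: the expected read count of a gene is taken as proportional to its TPM (transcript length is ignored), and independent noise components are assumed to add in quadrature, so the Poisson part is "subtracted" from the variance rather than from the standard deviation. Function names are hypothetical.

```python
import math

def poisson_relative_sd(tpm, total_reads):
    """Relative SD of a Poisson-distributed count for a gene at a given coverage."""
    # Assumption: expected reads scale with TPM; transcript length is ignored.
    expected_reads = tpm / 1e6 * total_reads
    # For a Poisson variable, SD / mean = 1 / sqrt(mean).
    return 1.0 / math.sqrt(expected_reads)

def residual_relative_sd(observed_relative_sd, tpm, total_reads):
    """Noise remaining after removing the Poisson component.

    Assumption: components are independent, so variances add.
    """
    poisson = poisson_relative_sd(tpm, total_reads)
    return math.sqrt(max(observed_relative_sd ** 2 - poisson ** 2, 0.0))

# A gene at 100 TPM sequenced at 1 million and at 4 million reads.
sd_1m = poisson_relative_sd(100, 1_000_000)
sd_4m = poisson_relative_sd(100, 4_000_000)
# Hypothetical observed relative SD of 0.125 at 1 million reads.
residual = residual_relative_sd(0.125, 100, 1_000_000)
```

Quadrupling the coverage halves the Poisson term, consistent with the inverse square root relationship, while a residual component computed this way would stay flat across coverage if it is truly non-Poisson.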
  • Example 2 Deconvolution of Microenvironment from RNA-Seq of Multiple Normal and Cancer Tissues
  • RNA-seq data from multiple normal and cancer tissues.
  • the cellular deconvolution techniques developed by the inventors may be referred to as “Kassandra”. Specifically, this term covers techniques for selecting specific and/or semi-specific genes for cell types and/or subtypes, generating artificial mixes, training multiple non-linear regression models to determine a plurality of cell composition percentages for a plurality of cell types, and using the trained non-linear regression models to determine the cell composition percentages, as well as other pre-processing and post-processing techniques described herein.
  • FIG. 13A is a schematic representation of a validation experiment for deconvolution based on TCGA data. Data on the number of cells obtained by other methods from hematoxylin and eosin (H&E) slides and whole exome sequencing (WES) are used.
  • FIG. 13B shows violin plots of the distributions of cell composition percentages estimated using the deconvolution techniques (e.g., using trained non-linear regression models) described herein for B-cells, CD4+, CD8+, macrophages, fibroblasts, and endothelium cells in 10,489 tumor biopsies from TCGA. As shown, tumor tissues are split by cancer type in the illustrated example.
  • FIG. 13C is a t-SNE plot showing TCGA and GTEX samples calculated based on deconvolved cell percentages.
  • FIG. 13D is a graph showing the Pearson correlation between percentages of lymphocytes predicted by the techniques described herein on TCGA RNA-seq data and predicted by machine analysis of histological TCGA data by (Saltz et al. 2018).
  • FIG. 13E is a plot showing the correlation of predicted percentages of malignant cells from RNA-seq by the techniques described herein, with tumor purity estimated from WES for 11 TCGA cancer types.
  • FIG. 13F is a graph showing Pearson correlations between tumor purity and predicted percentages of malignant cells based on RNA-seq data.
  • Tumor data was derived from TCGA.
  • the graph shows Pearson correlations for predictions by the techniques described herein, as well as Pearson correlations for predictions by various alternative algorithms. Compared to other algorithms, the non-linear deconvolution techniques developed by the inventors more accurately predicted the percentage of malignant cells, demonstrating an improvement over conventional techniques.
  • FIG. 13G is a graph showing Pearson correlations of predicted T cell RNA percentages by the techniques described herein with T cell receptor (CDR3 region of TCR) reads by MiXCR in LUSC TCGA data.
  • FIG. 13H is a graph showing Pearson correlations of predicted Plasma B cell RNA percentages by the techniques described herein with B cell receptor (CDR3 region of IgH) reads by MiXCR in LUSC TCGA data.
  • FIG. 13I is a graph showing Pearson correlation values for predicted T cell RNA percentages with T cell receptor (CDR3 region of TCR) reads in different cancer types from TCGA data. Predictions by the techniques described herein and predictions by various alternative algorithms are shown. Each data point corresponds to a different cancer type (COAD, KIRC, LUAD, LUSC, READ, SKCM, TNBC).
  • FIG. 13J is a graph showing Pearson correlation values for predicted Plasma B cell RNA percentages with B cell receptor (CDR3 region of IgH) reads in different cancer types from TCGA. Predictions by the techniques described herein and predictions by various alternative algorithms are shown. Each data point corresponds to a different cancer type (COAD, KIRC, LUAD, LUSC, READ, SKCM, TNBC).
  • the inventors analyzed the cellular composition of TCGA samples of different tumor types and healthy tissues ( FIG. 13B ).
  • Six major cell populations were quantified: B-cells, CD4+ T-cells, CD8+ T-cells, macrophages, fibroblasts, and endothelial cells ( FIG. 13C ). These values agreed with what has been reported. For example, DLBC RNA-seq data showed a strong enrichment for B-cells.
  • This analysis ( FIGS. 13E-13F ) supports the ability of the techniques described herein to accurately predict cell populations from bulk RNA-seq data.
  • the proportion of expressed T-cell receptor (TCR) and IgH/L (B cell receptor) sequences in the RNA-seq data correlates with the presence of T or plasma B cells actively producing immunoglobulins.
  • the sequences were realigned using MiXCR to measure the abundance and diversity of CDR3 transcripts, associated with different T and plasma B cell clones.
  • RNA-seq single cell RNA-seq data and bulk RNA-seq of blood data.
  • the cellular deconvolution techniques developed by the inventors may be referred to as “Kassandra”. Specifically, this term covers techniques for generating artificial mixes, selecting specific and/or semi-specific genes for cell types and/or subtypes, training multiple non-linear regression models to determine a plurality of cell composition percentages for a plurality of cell types, and using the trained non-linear regression models to determine the cell composition percentages, as well as other pre-processing and post-processing techniques described herein.
  • FIG. 14A is a schematic representation of a validation experiment for deconvolution using scRNA-seq samples from PBMC.
  • the scRNA-seq data was artificially mixed to create a bulk RNA-seq dataset.
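  • The artificial mixing step can be illustrated with a minimal sketch. It assumes that each cell is represented by an expression vector over a shared gene order and that cells are mixed by cell count; the function name `make_pseudobulk`, the toy cell types, and the expression values are hypothetical and are not taken from the disclosure, which may mix on a different basis (for example, RNA fractions).

```python
import random


def make_pseudobulk(cells_by_type, fractions, n_cells=500, seed=0):
    """Mix single-cell profiles into one artificial bulk profile.

    cells_by_type: dict of cell type -> list of per-cell expression vectors
                   (lists of floats sharing the same gene order).
    fractions:     dict of cell type -> target fraction of cells (sums to 1).
    Returns (bulk_profile, true_fractions).
    """
    rng = random.Random(seed)
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    counts = {}
    for cell_type, frac in fractions.items():
        k = round(frac * n_cells)
        counts[cell_type] = k
        # Sample with replacement so that small populations can still be drawn.
        for cell in rng.choices(cells_by_type[cell_type], k=k):
            for g, value in enumerate(cell):
                bulk[g] += value
    total = sum(counts.values())
    bulk = [v / total for v in bulk]  # average expression per mixed cell
    true_fractions = {t: c / total for t, c in counts.items()}
    return bulk, true_fractions


# Toy input: two genes and two hypothetical cell types.
cells = {
    "T_cell": [[10.0, 0.0], [12.0, 1.0]],
    "B_cell": [[0.0, 8.0], [1.0, 9.0]],
}
bulk, truth = make_pseudobulk(cells, {"T_cell": 0.75, "B_cell": 0.25}, n_cells=100)
```

  • The returned `true_fractions` play the role of the known cell percentages against which deconvolution predictions for the artificial bulk profile can be compared.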
  • FIG. 14B is a t-SNE plot of cell phenotyping across 9 single-cell PBMC datasets provided by 10x Genomics.
  • the joined plot was obtained by the Seurat pipeline including SCTransform normalization, batch correction and preceding PCA (Butler et al. 2018; Stuart et al. 2019).
  • different cell types and/or subtypes express key cell markers (e.g., specific and/or semi-specific genes) that distinguish them.
  • FIG. 14C is a graph showing the correlation between true cell percentages from scRNA-seq of PBMC, and predictions made with the techniques described herein for the bulk RNA-seq mixture.
  • FIG. 14D shows plots of the correlation between true percentages from scRNA-seq of PBMC and predictions made with the techniques described herein (e.g., using non-linear regression models to determine cell composition percentages) for eight cell subtypes.
  • FIG. 14E is a schematic representation of a validation experiment for deconvolution using bulk RNA-seq of PBMC or Whole blood and FACS measurement of the same sample.
  • FIGS. 14F-1 and 14F-2 are graphs showing the correlation of predicted cell percentages by the techniques described herein from bulk RNA-seq, and actual cell percentages obtained by flow cytometry measurements for different cell types (CD4+ T cells, CD8+ T cells, NK cells, B cells, monocytes and neutrophils). Datasets that were used for comparison are: GSE107572 (Finotello et al. 2019), GSE115823 (Altman et al. 2019), GSE60424 (Linsley et al. 2014), SDY67 (Zimmermann et al. 2016), GSE127813 (Newman et al. 2019), GSE53655 (Shin et al. 2014), GSE64655 (Hoek et al. 2015). Pearson correlations are shown for all cell types combined.
  • RNA-seq which was built from scRNA-seq datasets derived from peripheral blood mononuclear cells (PBMCs)
  • With reference to FIGS. 14A-B, a high correlation value was obtained when comparing the true scRNA-seq percentages with the predicted RNA-seq percentages.
  • As shown in FIG. 14D, when the correlation is graphed for each cell type separately, the cell types present in high numbers show the most significant correlation between true and predicted values.
  • With reference to FIG. 14E, eight different PBMC samples were analyzed, and for each sample the FACS analysis was compared with the cell composition predicted by the techniques described herein. As shown, all analyses yielded correlation coefficients ranging from 0.900 to 0.984 ( FIGS. 14F-1 and 14F-2 ).
  • FIG. 15A depicts t-SNE plots of cell phenotyping, from left to right, in melanoma (GSE72056)(Tirosh et al. 2016), lung carcinoma (E-MTAB-6149 and E-MTAB-6653)(Lambrechts et al. 2018) and head and neck carcinoma (HNC)(GSE103322)(Puram et al. 2017) single-cell datasets.
  • the t-SNE plot for lung carcinoma was obtained by the Seurat pipeline including SCTransform normalization, batch correction and preceding PCA (Butler et al. 2018; Stuart et al. 2019).
  • the melanoma and head and neck carcinoma t-SNE plots were obtained by t-SNE transformation of log TPM expression values of cell-type-specific genes.
  • FIG. 15B is a schematic representation of a validation experiment using scRNA-seq data derived from cancer tissues. scRNA-seq data was artificially mixed to create a bulk RNA-seq dataset.
  • FIGS. 15G and 15H are heatmaps showing mean Pearson correlation values ( FIG. 15G ) and mean MAE (mean absolute error) scores ( FIG. 15H ) between values predicted from artificial bulk RNA-seq data and true values derived from scRNA-seq data for melanoma, lung carcinoma and HNC.
  • results from the techniques described herein are compared with results from alternative algorithms.
  • the non-linear regression techniques developed by the inventors are shown to predict, on average, the cell composition percentages for different cell types more accurately and with a lower mean absolute error.
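  • The two comparison metrics, Pearson correlation and MAE (computed here as mean absolute error), can be sketched as follows. The cell type names and percentage values are made-up toy numbers, and the helper names are hypothetical; this is an illustration of the metrics only, not the evaluation code used for the figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_absolute_error(xs, ys):
    """Average absolute difference between true and predicted values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Toy data: true vs. predicted percentages across four samples per cell type.
true_pct = {"T_cell": [10.0, 20.0, 30.0, 40.0], "B_cell": [5.0, 10.0, 15.0, 20.0]}
pred_pct = {"T_cell": [12.0, 19.0, 33.0, 38.0], "B_cell": [6.0, 9.0, 16.0, 19.0]}

# Per-cell-type scores, then their means (as summarized in a heatmap cell).
per_type = {
    t: (pearson_r(true_pct[t], pred_pct[t]),
        mean_absolute_error(true_pct[t], pred_pct[t]))
    for t in true_pct
}
mean_r = sum(r for r, _ in per_type.values()) / len(per_type)
mean_mae = sum(m for _, m in per_type.values()) / len(per_type)
```

  • Averaging per-cell-type scores treats every cell type equally, whereas pooling all cell types into a single correlation lets abundant cell types dominate the result; the two summaries can therefore differ for the same predictions.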
  • FIG. 15I shows the correlation between predicted cell percentages by the techniques described herein and actual cell percentage obtained by FACS for lymphocytes, fibroblasts and lung adenocarcinoma cell line from dataset GSE121127 (Wang et al. 2018)(top) and CYTOF for bone marrow from dataset GSE120444 (Oetjen et al. 2018)(bottom).
  • the Pearson correlation value (r) represents the correlation value for all cell types combined.
  • cells from scRNA-seq were annotated manually ( FIG. 15A ) and certain percentages of each cell type were mixed to resemble a bulk RNA-seq sample (e.g., as described herein above at least with respect to FIG. 6A ). Subsequently, these cell percentages were compared with the values predicted by the techniques described herein. The ability of the techniques described herein to reconstruct cell composition percentages for each cell type was measured ( FIGS. 15C-F ). The median correlation of cell type reconstruction reached ~0.97 and was the highest among the methods compared.
  • Gene NCBI Gene ID NCBI Accession Number(s)
    ACAP1 9744 NM_004288, XM_017005386
    ACRBP 84519 NM_032415, NM_001324281
    ACTA2 59 NM_001141945, NM_001613, NM_001320855
    ADAM28 10863 XM_011544370, XM_006716273, XM_011544369, XM_011544371, XR_949375, XM_005273380, XM_006716274, XM_005273382, XM_017012976, XM_017012974, XR_247120, NM_001304351, NM_014265, NR_130710, XM_017012975, NR_130709, XM_01
  • One or more aspects and embodiments of the present disclosure involving the performance of processes or methods may utilize program instructions executable by a device (e.g., a computer, a processor, or other device) to perform, or control performance of, the processes or methods.
  • inventive concepts may be embodied as a computer readable storage medium (or multiple computer readable storage media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement one or more of the various embodiments described above.
  • the computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various ones of the aspects described above.
  • computer readable media may be non-transitory media.
  • the terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects as described above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the present disclosure.
  • Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • functionality of the program modules may be combined or distributed as desired in various embodiments.
  • data structures may be stored in computer-readable media in any suitable form.
  • data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that convey relationship between the fields.
  • any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
  • the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
  • a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer, as non-limiting examples. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smartphone, a tablet, or any other suitable portable or fixed electronic device.
  • a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible formats.
  • Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN) or the Internet.
  • networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
  • some aspects may be embodied as one or more methods.
  • the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
  • a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
  • the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
  • “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
  • the terms “approximately,” “substantially,” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, within ±2% of a target value in some embodiments.
  • the terms “approximately,” “substantially,” and “about” may include the target value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Genetics & Genomics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
US17/200,492 2020-03-12 2021-03-12 Systems and methods for deconvolution of expression data Active US11315658B2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/200,492 US11315658B2 (en) 2020-03-12 2021-03-12 Systems and methods for deconvolution of expression data
US17/707,623 US11587642B2 (en) 2020-03-12 2022-03-29 Systems and methods for deconvolution of expression data
US18/082,157 US20230178178A1 (en) 2020-03-12 2022-12-15 Systems and methods for deconvolution of expression data

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202062988700P 2020-03-12 2020-03-12
US202063108262P 2020-10-30 2020-10-30
US17/200,492 US11315658B2 (en) 2020-03-12 2021-03-12 Systems and methods for deconvolution of expression data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/707,623 Continuation US11587642B2 (en) 2020-03-12 2022-03-29 Systems and methods for deconvolution of expression data

Publications (2)

Publication Number Publication Date
US20210287759A1 US20210287759A1 (en) 2021-09-16
US11315658B2 US11315658B2 (en) 2022-04-26

Family

ID=75396875

Family Applications (3)

Application Number Title Priority Date Filing Date
US17/200,492 Active US11315658B2 (en) 2020-03-12 2021-03-12 Systems and methods for deconvolution of expression data
US17/707,623 Active US11587642B2 (en) 2020-03-12 2022-03-29 Systems and methods for deconvolution of expression data
US18/082,157 Pending US20230178178A1 (en) 2020-03-12 2022-12-15 Systems and methods for deconvolution of expression data

Family Applications After (2)

Application Number Title Priority Date Filing Date
US17/707,623 Active US11587642B2 (en) 2020-03-12 2022-03-29 Systems and methods for deconvolution of expression data
US18/082,157 Pending US20230178178A1 (en) 2020-03-12 2022-12-15 Systems and methods for deconvolution of expression data

Country Status (7)

Country Link
US (3) US11315658B2 (ja)
EP (2) EP4118657B1 (ja)
JP (1) JP2023518185A (ja)
AU (1) AU2021233926A1 (ja)
CA (1) CA3175126A1 (ja)
IL (1) IL296316A (ja)
WO (1) WO2021183917A1 (ja)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220230707A1 (en) * 2020-03-12 2022-07-21 Bostongene Corporation Systems and methods for deconvolution of expression data

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022232615A1 (en) 2021-04-29 2022-11-03 Bostongene Corporation Machine learning techniques for estimating tumor cell expression complex tumor tissue
CA3220280A1 (en) 2021-05-18 2022-11-24 Bostongene Corporation Techniques for single sample expression projection to an expression cohort sequenced with another protocol
CN114038505B (zh) * 2021-10-19 2024-06-14 清华大学 一种在线整合多来源单细胞数据的方法和系统
WO2023076574A1 (en) 2021-10-29 2023-05-04 Bostongene Corporation Tumor microenvironment types in breast cancer
US20240170096A1 (en) 2022-11-17 2024-05-23 Bostongene Corporation Rna-seq immunoprofiling of peripheral blood

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019018684A1 (en) 2017-07-21 2019-01-24 The Board Of Trustees Of The Leland Stanford Junior University SYSTEMS AND METHODS FOR ANALYZING MIXED CELL POPULATIONS
US20190233898A1 (en) 2015-01-22 2019-08-01 The Board Of Trustees Of The Leland Stanford Junior University Methods and Systems for Determining Proportions of Distinct Cell Subsets
US20210151128A1 (en) * 2018-06-29 2021-05-20 Preferred Networks, Inc. Learning Method, Mixing Ratio Prediction Method, and Prediction Device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020142563A1 (en) 2018-12-31 2020-07-09 Tempus Labs, Inc. Transcriptome deconvolution of metastatic tissue samples
US11315658B2 (en) 2020-03-12 2022-04-26 Bostongene Corporation Systems and methods for deconvolution of expression data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190233898A1 (en) 2015-01-22 2019-08-01 The Board Of Trustees Of The Leland Stanford Junior University Methods and Systems for Determining Proportions of Distinct Cell Subsets
WO2019018684A1 (en) 2017-07-21 2019-01-24 The Board Of Trustees Of The Leland Stanford Junior University SYSTEMS AND METHODS FOR ANALYZING MIXED CELL POPULATIONS
US20210151128A1 (en) * 2018-06-29 2021-05-20 Preferred Networks, Inc. Learning Method, Mixing Ratio Prediction Method, and Prediction Device

Non-Patent Citations (65)

* Cited by examiner, † Cited by third party
Title
[No Author Listed], Artyomov Lab Systems Immunology. Maxim N. Artyomov. 2013. https://artyomovlab.wustl.edu/site/index.html [last accessed Jun. 7, 2021]. 3 pages.
[No Author Listed], Deconvolution of ABsolute Immune Signal. shinyapps.io. 2021. https://giannimonaco.shinyapps.io/ABIS/ [last accessed Jun. 7, 2021]. 1 page.
[No Author Listed], MCP-counter. CIT. 2021. https://cit.ligue-cancer.net/mcp-counter/ [last accessed Jun. 7, 2021]. 2 pages.
[No Author Listed], Using EPIC to estimate the proportion of various cell types in bulk samples. EPIC. 2021. http://epic.gfellerlab.org/ [last accessed Jun. 7, 2021]. 1 page.
Abbas et al. Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data Genes and Immunity vol. 6 pp. 319-331 (Year: 2005). *
Altman et al., Transcriptome networks identify mechanisms of viral and nonviral asthma exacerbations in children. Nature immunology. May 2019;20(5):637-51.
Aran et al., Systematic pan-cancer analysis of tumour purity. Nature communications. Dec. 4, 2015;6(1):1-11.
Aran et al., xCell. UCSF Institute for Computational Health Sciences. 2017. https://xcell.ucsf.edu/ [last accessed Jun. 7, 2021]. 2 pages.
Aran et al., xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome biology. Dec. 2017;18(1):1-14.
Becht et al., Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome biology. Dec. 2016;17(1):1-20.
Ben-Moshe et al., mRNA-seq whole transcriptome profiling of fresh frozen versus archived fixed tissues. BMC genomics. Dec. 2018;19(1):11 pages.
Bray et al., Near-optimal probabilistic RNA-seq quantification. Nature biotechnology. May 2016;34(5):525-7.
Butler et al., Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology. May 2018;36(5):411-20.
Chen et al., Profiling tumor infiltrating immune cells with CIBERSORT. Cancer systems biology. 2018:243-259.
Cieslik et al., The use of exome capture RNA-seq for highly degraded RNA with application to clinical cancer sequencing. Genome research. Sep. 1, 2015;25(9):1372-81.
Eisenberg et al., Human housekeeping genes, revisited. TRENDS in Genetics. Oct. 1, 2013;29(10):569-74.
Finotello et al., Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome medicine. Dec. 2019;11(1):1-20.
Finotello et al., quanTIseq documentation. quanTIseq. Feb. 25, 2019. https://icbi.i-med.ac.at/software/quantiseq/doc/ [last accessed Jun. 7, 2021]. 9 pages.
Frankish et al., GENCODE reference annotation for the human and mouse genomes. Nucleic acids research. Jan. 8, 2019;47(D1):D766-73.
Galon et al., Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science. Sep. 29, 2006;313(5795):1960-4.
George et al., Hemophilia B gene therapy with a high-specific-activity factor IX variant. New England Journal of Medicine. Dec. 7, 2017;377(23):2215-27.
Griffiths et al., Detection and removal of barcode swapping in single-cell RNA-seq data. Nature communications. Jul. 10, 2018;9(1):1-6.
Hao et al., Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares. PLoS computational biology. May 6, 2019;15(5):e1006976. 21 pages.
Hirata et al., Tumor microenvironment and differential responses to therapy. Cold Spring Harbor perspectives in medicine. Jul. 1, 2017;7(7):a026781. 14 pages.
Hoek et al., A cell-based systems biology assessment of human blood to monitor immune responses after influenza vaccination. PloS one. Feb. 23, 2015;10(2):e0118528. 24 pages.
Holik et al., RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic acids research. Mar. 17, 2017;45(5):e30. 18 pages.
International Search Report and Written Opinion for International Application No. PCT/US2021/022155 dated Jul. 5, 2021.
Izar et al., A Single-Cell Landscape of High-Grade Serous Ovarian Cancer. Nature medicine. 2020;26:1271-1279. 23 pages.
Ke et al., Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems (NIPS). 2017;30:3146-54.
Lambrechts et al., Phenotype molding of stromal cells in the lung tumor microenvironment. Nature medicine. Aug. 2018;24(8):1277-89. 19 pages.
Levine et al., Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell. Jul. 2, 2015;162(1):184-97. 31 pages.
Linsley et al., Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PloS one. Oct. 14, 2014;9(10):e109760. 9 pages.
Lun et al., EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome biology. Dec. 2019;20(1):1-9.
Ma et al., PD1 Hi CD8+ T cells correlate with exhausted signature and poor clinical outcome in hepatocellular carcinoma. Journal for immunotherapy of cancer. Dec. 2019;7(1):331. 15 pages.
Macosko et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. May 21, 2015;161(5):1202-14. 25 pages.
Marioni et al., RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research. Sep. 1, 2008;18(9):1509-17.
Melsted et al., Modular and efficient pre-processing of single-cell RNA-seq. BioRxiv. Jun. 17, 2019:673285. 16 pages.
Monaco et al., RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell reports. Feb. 5, 2019;26(6):1627-40.e7.
Neftel et al., An integrative model of cellular states, plasticity, and genetics for glioblastoma. Cell. Aug. 8, 2019;178(4):835-49.e29. 37 pages.
Newman et al. Robust enumeration of cell subsets from tissue expression profiles Nature Methods vol. 12, pp. 453-457 (Year: 2015). *
Newman et al., CIBERSORT. Stanford University. 2021. https://cibersort.stanford.edu/ [last accessed Jun. 7, 2021]. 1 page.
Newman et al., CIBERSORTx. Stanford University. 2021. https://cibersortx.stanford.edu/ [last accessed Jun. 7, 2021]. 1 page.
Newman et al., Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature biotechnology. Jul. 2019;37(7):773-82.
Newman et al., Robust enumeration of cell subsets from tissue expression profiles. Nature methods. May 2015;12(5):453-7. 20 pages.
Norton et al., Pancreatic cancer associated fibroblasts (CAF): under-explored target for pancreatic cancer treatment. Cancers. May 2020;12(5):1347. 18 pages.
Puram et al., Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell. Dec. 14, 2017;171(7):1611-24.e24. 40 pages.
Racle et al., EPIC: a tool to estimate the proportions of different cell types from bulk gene expression data. Bioinformatics for Cancer Immunotherapy. 2020:233-248.
Racle et al., Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. elife. Nov. 13, 2017;6:e26476. 25 pages.
Rakaee et al., Prognostic value of macrophage phenotypes in resectable non-small cell lung cancer assessed by multiplex immunohistochemistry. Neoplasia. Mar. 1, 2019;21(3):282-93.
Roider et al., Dissecting intratumour heterogeneity of nodal B-cell lymphomas at the transcriptional, genetic and drug-response levels. Nature Cell Biology. Jul. 2020;22(7):896-906. 27 pages.
Saltz et al., Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell reports. Apr. 3, 2018;23(1):181-93.e7. 21 pages.
Shin et al., Variation in RNA-Seq transcriptome profiles of peripheral whole blood from healthy individuals with and without globin depletion. PloS one. Mar. 7, 2014;9(3):e91041. 11 pages.
Stuart et al., Comprehensive integration of single-cell data. Cell. Jun. 13, 2019;177(7):1888-902.e21. 52 pages.
Sturm et al., Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology. Bioinformatics. Jul. 15, 2019;35(14):i436-45.
Sun et al. An Efficient and Flexible Method for Deconvoluting Bulk RNA-Seq Data with Single-Cell RNA-Seq Data Cells vol. 8 article 1161 (Year: 2019). *
Tirosh et al., Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. Apr. 8, 2016;352(6282):189-96.
Van Gassen et al., FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry Part A. Jul. 2015;87(7):636-45.
Vivian et al., Toil enables reproducible, open source, big biomedical data analyses. Nature biotechnology. Apr. 2017;35(4):314-6.
Wagner et al., Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in biosciences. Dec. 1, 2012;131(4):281-5.
Wikipedia Nonlinear Regression retrieved from the internet on May 20, 2021 https://en.wikipedia.org/wiki/Nonlinear_regression (Year: 2021). *
Wu et al., Stromal PD-L1-positive regulatory T cells and PD-1-positive CD8-positive T cells define the response of different subsets of non-small cell lung cancer to PD-1/PD-L1 blockade immunotherapy. Journal of Thoracic Oncology. Apr. 1, 2018;13(4):521-32.
Xu et al., Mapping of γ/δ T cells reveals Vδ2+ T cells resistance to senescence. EBioMedicine. Jan. 1, 2019;39:44-58.
Zaitsev et al., Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nature communications. May 17, 2019;10(1):2209. 16 pages.
Zheng et al., Massively parallel digital transcriptional profiling of single cells. Nature communications. Jan. 16, 2017;8(1):1-12.
Zimmermann et al., System-wide associations between DNA-methylation, gene expression, and humoral immune response to influenza vaccination. PloS one. Mar. 31, 2016;11(3):e0152034. 21 pages.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220230707A1 (en) * 2020-03-12 2022-07-21 Bostongene Corporation Systems and methods for deconvolution of expression data
US11587642B2 (en) * 2020-03-12 2023-02-21 Bostongene Corporation Systems and methods for deconvolution of expression data

Also Published As

Publication number Publication date
IL296316A (en) 2022-11-01
WO2021183917A1 (en) 2021-09-16
AU2021233926A1 (en) 2022-09-29
US20220230707A1 (en) 2022-07-21
EP4118657B1 (en) 2024-05-01
EP4118657A1 (en) 2023-01-18
US20230178178A1 (en) 2023-06-08
EP4383262A2 (en) 2024-06-12
US20210287759A1 (en) 2021-09-16
US11587642B2 (en) 2023-02-21
JP2023518185A (ja) 2023-04-28
CA3175126A1 (en) 2021-09-16
WO2021183917A8 (en) 2021-12-16
WO2021183917A9 (en) 2021-11-04

Similar Documents

Publication Publication Date Title
US11315658B2 (en) Systems and methods for deconvolution of expression data
US11802314B2 (en) Methods and systems for determining proportions of distinct cell subsets
Ravi et al. T-cell dysfunction in the glioblastoma microenvironment is mediated by myeloid cells releasing interleukin-10
Mo et al. Stromal gene expression is predictive for metastatic primary prostate cancer
Guiu et al. Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 Working Group Statement
Stavropoulou et al. MLL-AF9 expression in hematopoietic stem cells drives a highly invasive AML expressing EMT-related genes linked to poor outcome
US20200395097A1 (en) Pan-cancer model to predict the pd-l1 status of a cancer cell sample using rna expression data and other patient data
JP7299169B2 (ja) 体細胞突然変異のクローン性を決定するための方法及びシステム
CN112602156A (zh) 用于检测残留疾病的系统和方法
CN105917008A (zh) 用于前列腺癌复发的预后的基因表达面板
CN109642259A (zh) 使用肿瘤教育的血小板的针对癌症的群智能增强的诊断和治疗选择
CN106755415A (zh) 用于诊断和预测移植物排斥的生物标志物板
JP2022538499A (ja) サンプル調製、サンプルシークエンシング、およびシークエンシングデータのバイアス補正と品質管理のためのシステムならびに方法
Barefoot et al. Detection of cell types contributing to cancer from circulating, cell-free methylated DNA
US20230073731A1 (en) Gene expression analysis techniques using gene ranking and statistical models for identifying biological sample characteristics
CN106661634A (zh) 用于诊断肾异体移植物纤维化和排异风险的方法
Jin et al. Cross-species gene expression analysis reveals gene modules implicated in human osteosarcoma
CN101400804B (zh) 用于结肠直肠癌预后的基因表达标志物
Kastenschmidt et al. A human lymphoma organoid model for evaluating and targeting the follicular lymphoma tumor immune microenvironment
Tuggle et al. Methods for transcriptomic analyses of the porcine host immune response: application to Salmonella infection using microarrays
Lu et al. Integrated identification of disease specific pathways using multi-omics data
Neary et al. Methylated DNA immunoprecipitation sequencing (MeDIP-seq): Principles and applications
Zhang et al. Comprehensive analysis of a novel RNA modifications-related model in the prognostic characterization, immune landscape and drug therapy of bladder cancer
US20220223227A1 (en) Machine learning techniques for identifying malignant b- and t-cell populations
Lambrechts et al. Differences in the tumor molecular and microenvironmental landscape between early (non-metastatic) and de novo metastatic primary luminal breast tumors

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: BOSTONGENE CORPORATION, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BOSTONGENE LLC;REEL/FRAME:058273/0181

Effective date: 20211117

Owner name: BOSTONGENE LLC, RUSSIAN FEDERATION

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZAITSEV, ALEKSANDR;CHELUSHKIN, MAKSIM;CHEREMUSHKIN, ILYA;AND OTHERS;REEL/FRAME:058273/0110

Effective date: 20211111

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE