WO2018089927A1

WO2018089927A1 - Identification of instance-specific somatic genome alterations with functional impact

Info

Publication number: WO2018089927A1
Application number: PCT/US2017/061373
Authority: WO
Inventors: Xinghua LU; Gregory Cooper; Chunhui CAI; Shyam VISWESWARAN
Original assignee: University Of Pittsburgh - Of The Commonwealth System Of Higher Education
Priority date: 2016-11-11
Filing date: 2017-11-13
Publication date: 2018-05-17
Also published as: EP3538673A4; US11990209B2; US20190287651A1; EP3538673A1

Abstract

The present application provides methods for the identification of somatic genome alterations with functional impact in the genome of a tumor. In several embodiments, the methods comprise generating a bipartite causal Bayesian network with maximal posterior probability including causal edges pointing from genes including somatic mutations and somatic copy number alterations in the genome of the tumor to genes having differential expression in the tumor. The methods can be used, for example, to identify driver somatic genome alterations in the genome of a tumor.

Description

IDENTIFICATION OF INSTANCE-SPECIFIC SOMATIC GENOME ALTERATIONS WITH FUNCTIONAL IMPACT

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/420,936, filed November 11, 2016, which is herein incorporated by reference in its entirety.

FIELD

The present disclosure relates to methods and processes for the characterization of cancer and, more particularly, to methods for identifying somatic genome alterations with functional impact in the genome of a tumor.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

This invention was made with government support under Grant Nos. LM010144, HG008540, and HG007934 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

Cancer is a highly complex disease caused by a variety of somatic genome alterations (SGAs), including, but not limited to, mutations, DNA copy number variations, gene fusions, chromosome rearrangement, and epigenetic changes. A tumor commonly hosts a unique combination of SGAs ranging in number from hundreds to thousands, of which only a small fraction contribute to tumorigenesis (drivers or SGAs with functional impact, SGA-FIs), while the rest are non-consequential with respect to cancer (passengers). A key task of precision oncology is to identify and target cellular aberrations resulting from driver SGAs in an individual tumor.

Current methods for identifying candidate driver genes concentrate on finding those that have a higher than expected mutation rate in a cohort of tumor samples. Mutation-frequency-oriented models have successfully identified many major oncogenes and tumor suppressors across cancer types. However, these methods are constrained by the need to define the baseline mutation rate, and as such, different models lead to different findings. Furthermore, current approaches are limited in their ability to determine whether a gene altered at a low frequency (prevalence < 2%) is a driver.

SUMMARY

Described herein are novel methods for identifying SGA-FIs in a tumor. Unlike prior methods that focus on identifying SGA-FIs at a cohort level, the disclosed methods can identify tumor-specific SGA-FIs. Accordingly, such methods are referred to herein as Tumor-specific Driver Identification (TDI) methods.

In some embodiments, a computer-implemented method for identifying a SGA-FI in a genome of a tumor t is provided. The method comprises receiving a set of SGAs in the genome of t, and receiving a set of differentially expressed genes (DEGs) in the genome of t. A bipartite causal Bayesian network (CBN) with maximal posterior probability is generated from these sets of data, wherein the CBN with maximal posterior probability comprises causal edges pointing from SGAs in the set of SGAs to DEGs in the set of DEGs with a single causal edge pointing to each DEG. A SGA from which five or more causal edges in the CBN point to DEGs is identified as the SGA-FI in the genome of the tumor t. In several embodiments, the SGA-FI is a driver SGA.

In some embodiments, generating the CBN with maximal posterior probability comprises generating a plurality of test CBNs for the set of SGAs and the set of DEGs. A posterior probability for each test CBN in the plurality is determined, and the test CBN with the maximal posterior probability is identified as the CBN with maximal posterior probability. New approaches for determining the posterior probability for the test CBNs in the plurality are provided and include novel marginal likelihood and prior probability calculations.

In additional embodiments, computer-implemented methods, computer systems, and computer readable media are provided.

The foregoing and other features and advantages of the disclosed technology will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE FIGURES

FIGs. 1A-1D illustrate the steps of an embodiment of the TDI method. 1A. A compendium of cancer omics data is used as the training dataset. Three types of data from the pan-cancer tumors were used in this study, including somatic mutation (SM) data (1,176,482 mutation events in 35,208 genes), somatic copy number alteration (SCNA) data (3,915,798 alteration events in 24,776 genes), and gene expression data (13,135,684 DEG events in 19,524 genes). SM and SCNA data were integrated as SGA data. Expression of each gene in each tumor was compared to a distribution of the same gene in the "normal control" samples, and, if a gene's expression value was outside the significance boundary, it was designated as a DEG in the tumor. The final dataset included 4,468 tumors with 1,394,182 SGA events and 6,094,652 DEG events. 1B. A set of SGAs and a set of DEGs from an individual tumor serve as input for the TDI method. 1C. The TDI method infers the causal relationships between SGAs and DEGs for a given tumor t and outputs a tumor-specific causal model. 1D. A hypothetical model illustrates the results of analysis using the TDI method. In this tumor, SGA_SETt has three SGAs plus the non-specific factor A0, and DEG_SETt has six DEG variables. Each Ei must have exactly one arc into it, which represents having one cause among the variables in SGA_SETt. In this model, E1 is caused by A0; E2, E3, E4 are caused by A1; E5, E6 are caused by A3; A2 does not have any regulatory impact.

FIGs. 2A-2E illustrate the landscape of SGAs and SGA-FIs. 2A & 2B. The distributions of the number of SGAs per tumor and SGA-FIs per tumor of different cancer types. Beneath the bar box plots, the distributions of different types of SGAs (SM, copy number amplification, and deletion) are shown. 2C. Identification of SGA-FIs is independent of the alteration frequency or gene length. Darker dots indicate SGA-FIs, and lighter dots represent SGAs that were not designated as SGA-FIs. A few commonly altered genes are indicated by their gene names, where genes labeled with black font are well-known drivers, and those labeled with gray font are novel candidate drivers. 2D. A Circos plot shows SGA events and SGA-FI calls along the chromosomes. Different types of SGA events (SM, copy number amplification, deletion) are shown in tracks 2, 3, and 4, respectively. Track 1 shows the frequency of an SGA being called as an SGA-FI. The gene names denote the top 62 SGA-FIs (some are SGA units) that were called in over 300 tumors with a call rate > 0.8. Genes labeled with black font are known drivers from two TCGA reports, and gray ones are novel candidate drivers identified using the TDI method. 2E. SGA-FIs that were called in less than 300 tumors and with a call rate > 0.8 are shown in this frequency-vs-call rate plot. Similarly, genes labeled with black font are known drivers from TCGA studies, and gray ones are novel candidate drivers identified using the TDI method.

FIGs. 3A and 3B illustrate the statistical characteristics of the TDI method. 3A. The causal relationships inferred by the TDI method are statistically sound. Plots in this panel show the cumulative distribution of the posterior probabilities assigned to candidate edges when the TDI method was applied to real data and the random data, in which DEGs were randomly permutated across tumors. The top plot shows the results for the posterior probabilities for all candidate edges in the whole dataset; the rest of the plots show the posterior probabilities of the candidate edges pointing from 3 specific SGAs to all DEGs, namely, PIK3CA, CSMD3, and ZFHX4. 3B. Combining SM and SCNA data creates a "linkage disequilibrium" among genes that are enclosed in a common SCNA fragment. The chromosome cytobands enclosing three example genes (PIK3CA, CSMD3, and ZFHX4) are shown. The bar charts show the frequency of SCNA (dark grey, standing for amplification) and SM (light grey). The disequilibrium plots beneath the bar charts depict the correlations among genes within a cytoband.

FIGs. 4A-4C illustrate that an SM and a SCNA perturbing a gene exert a common functional impact. The SGA patterns for 3 genes, PIK3CA, CSMD3, and ZFHX4, across different cancer types are shown on the left side. Venn diagrams illustrating the relationships of DEGs caused by SCNA (dark grey) and SM (light grey) are shown on the right side. The p-values were calculated using the Fisher's exact test.

FIGs. 5A-5B illustrate that detection of functional impacts of SGA-FIs reveals functional connections among SGA-FIs. 5A. Mapping novel SGA-FIs to known pathways. Known members of the p53 and Notch1 pathways from the KEGG database that were identified as SGA-FIs using the TDI method are shown as nodes (marked with an "*") in the respective networks. An edge between a pair of nodes indicates that two SGA-FIs share significantly overlapping target DEGs. Novel SGA-FIs that were densely connected to the known members (> 50% of known members) are shown as yellow nodes. 5B. Top 5 SGA-FIs that share the most significant overlapping target DEGs with 3 SGA-FIs (PIK3CA, CSMD3, ZFHX4). An edge between a pair of SGA-FIs indicates that they share significantly overlapping target DEG sets, and the thickness of the line is proportional to the negative log of the hypergeometric p-values of overlapping target DEG sets.

FIGs. 6A-6I illustrate the cell biology evaluation of oncogenic properties of CSMD3 and ZFHX4. 6A & 6B. Expression status of CSMD3 and ZFHX4 target genes in tumors with or without alteration in the respective genes. 6C & 6D. The impact of knocking down CSMD3 and ZFHX4 on the expression of their target genes. For each cell line, we applied a tailored pair of siRNAs (indicated as 1 and 2); (ns: not significant; * p < 0.05; ** p < 0.01; *** p < 0.001; **** p < 0.0001). 6E - 6H. The impact of knocking down CSMD3 and ZFHX4 on cell proliferation (6E & 6F) and cell migration (6G & 6H). 6I. Impact of ZFHX4 knockdown on apoptosis in the FC3 cell line as measured by flow cytometry using Annexin V staining (to assess apoptotic cells) and propidium iodide (PI) staining (to assess cellular viability), where the pseudo-color of a point indicates the proportion of cells with corresponding values of two markers, and the range of color from blue to red indicates increasing proportion.

FIGs. 7A and 7B illustrate a comparison of causal analysis results from real data and random data. 7A. Comparison of cumulative distributions of the posterior probabilities of the candidate causal edges pointing from 4 well-known cancer drivers to DEGs. 7B. Comparison of cumulative distributions of the posterior probabilities of the candidate causal edges pointing from 4 novel candidate drivers to DEGs.

FIG. 8 shows that networks of SGA-FIs share significantly overlapping DEGs. Top SGA-FIs that share the most significant overlapping target DEGs with 3 SGA-FIs (TTN, MUC16, and LRP1B). An edge between a pair of SGA-FIs indicates that they share significantly overlapping target DEG sets (hypergeometric test p ≤ 10^-6), and the thickness of the line is proportional to the negative log of the hypergeometric p-values of overlapping target DEG sets.

FIG. 9 is a graph illustrating the false discovery rate using the TDI method. The plot shows the relationship of number of SGAs per tumor being designated as SGA-FIs (drivers) with respect to the minimal number of targeted (caused) DEGs an SGA-FI must have. Results are shown for random and real data. The x-axis shows the different thresholds, i.e., the number of DEGs predicted to be regulated by an SGA with p < 0.001, and the y-axis shows the number of SGAs designated as SGA-FI in a tumor.

FIG. 10 shows a graph comparing the performances of GLMNET models trained with TDI-derived features and features selected using the RFE algorithm. The x-axis (AUROC, area under ROC) represents the area under the ROC curve (the higher, the better the models perform, with a value of 0.5 indicating performance of random calls); the y-axis represents the number of models within a bin with an AUROC range. Overall performances of the two groups of models were compared using a t-test, and the p-value of the test is 9.24E-31.

FIG. 11 shows a flow chart illustrating an embodiment of the TDI method for identifying a SGA-FI in the genome of a tumor.

FIG. 12 shows a diagram of an example computing system in which described embodiments can be implemented.

SEQUENCES

The nucleic and amino acid sequences listed in the accompanying sequence listing are shown using standard letter abbreviations for nucleotide bases, and three letter code for amino acids, as defined in 37 C.F.R. 1.822. Only one strand of each nucleic acid sequence is shown, but the complementary strand is understood as included by any reference to the displayed strand. The Sequence Listing is submitted as an ASCII text file in the form of the file named "Sequence.txt" (~4 kb), which was created on November 8, 2017, and which is incorporated by reference herein.

DETAILED DESCRIPTION

I. Introduction

Identifying causative or "driver" SGAs driving the development of an individual tumor can provide insight into disease mechanisms and enable personalized modeling for cancer precision medicine. Although methods exist for identifying driver SGAs at the cohort level, few focus on the drivers of individual tumors. This disclosure provides TDI methods and related systems that infer causal relationships between SGAs and molecular phenotypes (e.g., transcriptomic, proteomic, or metabolomic changes) within a specific tumor using a CBN, in which causal edges are only allowed to point from SGAs to DEGs.

The TDI method combines the strengths of both frequency- and functional-impact-oriented approaches for identifying driver SGAs in a tumor. TDI uses the frequency with which a gene g is perturbed by SGA events in a reference set of tumor genomes, D, to estimate the prior probability P(g) that g is a cancer driver in any given tumor. To further incorporate information from a functional-impact perspective, TDI employs the likelihood function P(data | g), which represents the probability of the molecular phenotype data that is observed (e.g., DEGs) given that g is assumed to be a driver. From such priors and likelihoods, TDI derives the posterior probabilities of drivers for a given tumor. By inferring causal relationships between SGA events and molecular phenotypes, TDI can identify cancer drivers and provide insight into the disease mechanism of individual tumors.
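The combination of a frequency-based prior P(g) with a functional-impact likelihood P(data | g) can be sketched as a plain application of Bayes' rule. The following is a minimal illustration only, not the disclosed calculation: the function name driver_posteriors, the gene labels, the numeric values, and the choice to normalize over a fixed list of candidate genes are all assumptions made for this example.

```python
def driver_posteriors(priors, likelihoods):
    """Return P(g | data) for each candidate gene g via Bayes' rule.

    priors:      dict gene -> P(g), e.g. derived from alteration frequency
    likelihoods: dict gene -> P(data | g)
    Normalizing over the candidate list is an assumption of this sketch.
    """
    unnormalized = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


# Hypothetical numbers, for illustration only.
priors = {"geneA": 0.6, "geneB": 0.3, "geneC": 0.1}
likelihoods = {"geneA": 0.02, "geneB": 0.10, "geneC": 0.02}
posteriors = driver_posteriors(priors, likelihoods)
```

In this toy case the frequently altered geneA has the largest prior, but geneB ends up with the largest posterior because the observed data are more probable under it, which mirrors how a functional-impact likelihood can outweigh alteration frequency.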

The advantageous tumor-specific nature of the TDI method is reflected in several ways, including (notation described in more detail in Example 1):

• Each tumor has a unique set of SGAs and DEGs, and TDI infers causal relationships between those two sets; thus, the resulting CBN model M is necessarily tumor-specific;

• The prior probability of a model, P(M), which is based on the component priors P(Aj → Ei), is tumor-specific;

• Calculation of the marginal likelihood of a causal edge Aj → Ei includes both tumor-specific and global components. The tumor-specific component is computed using SGA and DEG data from the tumors in which Aj = 1 (i.e., "tumors like me") to evaluate how strongly Aj affects Ei in these tumors. The global component uses data from tumors with Aj = 0 and is calculated with respect to a global cause, that is, an SGA that best explains Ei at the population (global) level. Thus, the TDI method provides a relative strength assessment for how strongly Aj regulates Ei in tumors with Aj = 1 (tumors like the one under consideration), given that the causes of Ei in the rest of the tumors are explained by a global cause;

• The posterior probability for a candidate causal edge Aj → Ei is dependent on how well other candidate causal edges explain Ei. As such, the posterior probability of a given edge Aj → Ei is specific to a given tumor, and the same candidate edge Aj → Ei may have different posterior probabilities in different tumors depending on the composition of the SGAs in the tumor.

As discussed in Example 1, the causal relationships inferred by the TDI method are statistically robust, validated by experimental evidence, and more sensitive and specific than prior methods of identifying driver mutations in a given tumor genome. Taken together, the novel features and advantages of the TDI method and related systems allow for surprisingly effective elucidation of driver SGAs in a tumor.

II. Abbreviations

CBN Bipartite Causal Bayesian Network

DEG Differentially Expressed Gene

SCNA Somatic Copy Number Alteration

SGA Somatic Genome Alteration

SGA-FI Somatic Genome Alteration with Functional Impact

SM Somatic Mutation

TCGA The Cancer Genome Atlas

TDI Tumor-specific Driver Identification

III. Terms and Discussion

For purposes of this description, certain aspects, advantages, and novel features of the embodiments of this disclosure are described herein. The disclosed methods and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and sub-combinations with one another. The methods, apparatuses, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.

Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the specification and attached figures may not show all the various ways in which the disclosed methods can be used in conjunction with other methods. The singular terms "a", "an", and "the" include plural referents unless context clearly indicates otherwise. The term "comprises" means "includes without limitation." The term "and/or" means any one or more of the elements listed. Thus, the term "A and/or B" means "A", "B" or "A and B."

Unless otherwise noted, technical terms are used according to conventional usage. To facilitate review of the various embodiments, the following explanations of terms are provided:

Cancer: A biological condition in which a malignant tumor or other neoplasm has undergone characteristic anaplasia with loss of differentiation, increased rate of growth, invasion of surrounding tissue, and which is capable of metastasis. A neoplasm is a new and abnormal growth, particularly a new growth of tissue or cells in which the growth is uncontrolled and progressive. A tumor is an example of a neoplasm. Non-limiting examples of types of cancer include lung cancer, stomach cancer, colon cancer, breast cancer, uterine cancer, bladder cancer, head and neck cancer, kidney cancer, liver cancer, ovarian cancer, pancreas cancer, prostate cancer, and rectal cancer.

Control: A sample or standard used for comparison with an experimental sample. In some embodiments, the control is a sample obtained from a healthy individual (such as an individual without cancer) or a non-tumor tissue sample obtained from a patient diagnosed with cancer. In some embodiments, the control is a historical control or standard reference value or range of values (such as a previously tested control sample, such as a group of cancer patients with poor prognosis, or group of samples that represent baseline or normal values, such as the level of gene expression in a tumor sample).

Differentially expressed gene (DEG): A gene with an alteration (such as an increase or decrease) in expression of the corresponding gene product (for example, miRNA, non-coding RNA, RNA encoding a protein, or the corresponding protein) that is detectable in a biological sample relative to a control. An "alteration" in expression includes an increase in expression (up-regulation) or a decrease in expression (down-regulation) relative to a control.

Molecular phenotypes of a tumor: A set of molecular measurements that reflect the state of the cellular system of a tumor. Molecular phenotypic variations include, but are not limited to: DEGs, proteomic changes (quantity and chemical modification of proteins), and metabolomic changes (quantity and modification of metabolites).

Epigenetic status of a gene: Any structural feature at the molecular level of a nucleic acid (e.g., DNA or RNA) other than the primary nucleotide sequence. For instance, the epigenetic state of a genomic DNA may include its secondary or tertiary structure determined or influenced by, for example, its methylation pattern or its association with cellular proteins, such as histones, and the modifications of such proteins, such as acetylation, deacetylation, and methylation.

Gene: A segment of chromosomal DNA that contains the coding sequence for a protein, wherein the segment may include promoters, exons, introns, and other untranslated regions that control expression.

Somatic Copy Number Alteration (SCNA) of a gene: An alteration in the number of copies of DNA in a genome, including all or a portion of one or more genes. This includes homozygous deletion, single copy deletion, diploid normal copy, low copy number amplification, and high copy number amplification. In some embodiments, only genes with homozygous deletion or high copy number amplification are included in the analytical methods provided herein. SCNA can be detected with various types of tests, including, but not limited to, full genome sequencing.

Somatic mutation of a gene: Any change of a chromosomal nucleic acid sequence encoding a gene or including regulatory elements of the gene. For example, mutations can occur within an exon of a gene, or changes in or near regulatory regions of genes. Types of mutations include, but are not limited to, base substitution point mutations (which are either transitions or transversions), deletions, and insertions. Missense mutations are those that introduce a different amino acid into the sequence of the encoded protein; nonsense mutations are those that introduce a new stop codon; and silent mutations are those that introduce the same amino acid, often with a base change in the third position of the codon. In the case of insertions or deletions, mutations can be in-frame (not changing the frame of the overall sequence) or frame shift mutations, which may result in the misreading of a large number of codons (and often lead to abnormal termination of the encoded product due to the presence of a stop codon in the alternative frame). Synonymous somatic mutations are those that do not change amino acids in the sequence of the encoded protein and are generally biologically silent. Non-synonymous somatic mutations are those that change at least one amino acid in the sequence of the encoded protein, and therefore may alter the structure and/or function of the encoded protein. Gene mutations can be detected with various types of assays, including, but not limited to, full genome sequencing.

Somatic genome alteration (SGA): An alteration to chromosomal DNA acquired by a cell that can be passed to the progeny of the cell in the course of cell division. Non-limiting examples of SGAs include somatic mutation (SM) of a gene in chromosomal DNA, somatic copy number alteration (SCNA) of a gene in chromosomal DNA, somatic single nucleotide variations in chromosomal DNA that alter regulation of protein-coding genes, and alterations in the epigenetic status of a gene in chromosomal DNA.

Somatic genome alteration with functional impact (SGA-FI): A SGA that leads to or causes a change in a molecular phenotype of a cell, such as differential gene expression.

Driver SGA: An SGA-FI that leads to or causes malignancy of a cell. In some embodiments, a SGA-FI that leads to or causes differential expression of at least five different genes is identified as a driver SGA. More generally, a driver SGA is a genomic event that occurs in cancer cells and bears biological impact on molecular phenotypes that contribute to cancer development in a subject.

Subject: A living multi-cellular vertebrate organism, a category that includes human and non-human mammals.

Tumor: An abnormal growth of cells, which can be benign or malignant. Cancer is a malignant tumor, which is characterized by abnormal or uncontrolled cell growth.

Features often associated with malignancy include metastasis, interference with the normal functioning of neighboring cells, release of cytokines or other secretory products at abnormal levels and suppression or aggravation of inflammatory or immunological response, invasion of surrounding or distant tissues or organs, such as lymph nodes, etc.

The amount of a tumor in an individual is the "tumor burden" which can be measured as the number, volume, or weight of the tumor. A tumor that does not metastasize is referred to as "benign." A tumor that invades the surrounding tissue and/or can metastasize is referred to as "malignant."

Examples of hematological tumors include leukemias, including acute leukemias (such as 11q23-positive acute leukemia, acute lymphocytic leukemia, acute myelocytic leukemia, acute myelogenous leukemia and myeloblastic, promyelocytic, myelomonocytic, monocytic and erythroleukemia), chronic leukemias (such as chronic myelocytic (granulocytic) leukemia, chronic myelogenous leukemia, and chronic lymphocytic leukemia), polycythemia vera, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma (indolent and high grade forms), multiple myeloma, Waldenstrom's macroglobulinemia, heavy chain disease, myelodysplastic syndrome, hairy cell leukemia and myelodysplasia.

Examples of solid tumors, such as sarcomas and carcinomas, include fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, and other sarcomas, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, colon carcinoma, lymphoid malignancy, pancreatic cancer, breast cancer (including basal breast carcinoma, ductal carcinoma and lobular breast carcinoma), lung cancers, ovarian cancer, prostate cancer, hepatocellular carcinoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, medullary thyroid carcinoma, papillary thyroid carcinoma, pheochromocytomas, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, Wilms' tumor, cervical cancer, testicular tumor, seminoma, bladder carcinoma, and CNS tumors (such as a glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, melanoma, neuroblastoma and retinoblastoma). In several examples, a tumor is melanoma, lung cancer, lymphoma, breast cancer or colon cancer.

The disclosed methods can be used to identify SGA-FIs in tumors, such as any of the types of tumors listed above.

Tumor genome: The full complement of chromosomal DNA found within cells from a tumor.

IV. Methods of Identifying a SGA-FI

FIG. 11 depicts an exemplary method 100 for identifying a SGA-FI in a tumor, t. The disclosed method infers the functional-impact relationships between SGAs and DEGs for a given tumor using a CBN, in which causal edges are only allowed to point from SGAs to DEGs. Using this method, it is possible for the first time to interrogate a tumor sample and elucidate the SGAs in the tumor genome that cause the phenotype of the tumor, such as malignancy. The tumor genome can be, for example, the genomic sequence determined from a tumor sample from a subject.

As shown in FIG. 11 at process block 110, the method begins by receiving a set of SGAs in the genome of the tumor, t. The set of SGAs includes genes in the genome of the tumor t that contain SMs or that are affected by SCNA within the tumor genome, or both.
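The assembly of the SGA set at process block 110 can be sketched as a set union of genes carrying an SM and genes affected by an SCNA. This is a minimal sketch under stated assumptions: the function name build_sga_set, the state labels "homozygous_deletion" and "high_amplification", and the example gene inputs are hypothetical, and the filter reflects only the embodiment in which SCNAs are restricted to homozygous deletion or high copy number amplification.

```python
def build_sga_set(sm_genes, scna_calls,
                  kept_states=("homozygous_deletion", "high_amplification")):
    """Combine SM and SCNA calls for one tumor into a single set of SGA genes.

    sm_genes:   iterable of gene names carrying a somatic mutation
    scna_calls: dict gene -> copy number state label (labels are hypothetical)
    """
    scna_genes = {gene for gene, state in scna_calls.items()
                  if state in kept_states}
    # A gene altered by an SM, an SCNA, or both appears once in the set.
    return set(sm_genes) | scna_genes


# Illustrative inputs only.
sga_set = build_sga_set(
    sm_genes=["TP53", "PIK3CA"],
    scna_calls={
        "PIK3CA": "high_amplification",
        "CDKN2A": "homozygous_deletion",
        "MYC": "low_amplification",
    },
)
```

Here PIK3CA is altered by both an SM and an SCNA but contributes a single SGA entry, while MYC is dropped because its low-level amplification is outside the retained states.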

At process block 120, the method comprises receiving a set of DEGs in the genome of the tumor, t. The set of DEGs includes genes in the genome of tumor t that have increased or decreased expression relative to a control as discussed in more detail below.

The expression level of the genes in the tumor can be determined, for example, by obtaining microarray data showing the levels of RNA in the tumor. In some embodiments, if the expression of the genes follows a Gaussian distribution in normal cells, the p values of corresponding genes in the tumor can be calculated to evaluate how significantly the gene expression in cells of the tumor differs from that of corresponding non-tumor cells. In some embodiments, a gene is included in the set of DEGs if the p value is equal to or smaller than 0.005 on either side of the Gaussian distribution. Additionally, for genes whose expression does not follow a Gaussian distribution in normal samples (for example, due to low variance (i.e., <0.1)), the expression fold change can be calculated (for example, the gene expression in the tumor cell over the average gene expression in normal cells). Genes with a 3 fold or greater change in expression in the tumor cell compared to the control can be included in the set of DEGs. In some embodiments, the method includes obtaining a tumor sample from a subject, identifying DEGs in the genome of the tumor sample (for example, using microarray analysis), and identifying SGAs in the genome of the tumor sample (for example, by sequencing the chromosomal DNA in the tumor sample using whole genome sequencing analysis) to provide the set of SGAs and the set of DEGs.
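The DEG designation rules above can be sketched for a single gene as follows. This is an illustrative sketch, not the disclosed implementation: the function name is_deg and its parameter names are hypothetical, the variance cutoff alone stands in for a test of whether expression is Gaussian, expression values are assumed to be positive and on a linear scale, and a "3 fold or greater change" is interpreted as either a 3-fold increase or a 3-fold decrease.

```python
from statistics import NormalDist, mean, stdev, variance


def is_deg(tumor_value, normal_values,
           p_cutoff=0.005, fold_cutoff=3.0, low_variance=0.1):
    """Decide whether one gene is differentially expressed in one tumor."""
    if variance(normal_values) >= low_variance:
        # Treat normal-sample expression as Gaussian and compute a tail p-value.
        dist = NormalDist(mean(normal_values), stdev(normal_values))
        lower_tail = dist.cdf(tumor_value)
        tail_p = min(lower_tail, 1.0 - lower_tail)
        return tail_p <= p_cutoff
    # Low-variance gene: fall back to fold change over the normal average.
    fold = tumor_value / mean(normal_values)
    return fold >= fold_cutoff or fold <= 1.0 / fold_cutoff


# Hypothetical expression values for two genes in normal control samples.
gaussian_like = [10.0, 11.0, 9.0, 10.5, 9.5]
low_variance_gene = [5.0, 5.1, 4.9, 5.0]
```

With these inputs, is_deg(15.0, gaussian_like) takes the p-value branch, whereas is_deg(16.0, low_variance_gene) takes the fold-change branch because the sample variance of the normal values is below 0.1.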

The tumor sample can be any sample that includes genomic DNA from a tumor cell. Such samples include, but are not limited to, tissue from biopsies (including formalin-fixed paraffin-embedded tissue), autopsies, and pathology specimens; sections of tissues (such as frozen sections or paraffin-embedded sections taken for histological purposes); body fluids, such as blood, or fractions thereof such as plasma or serum, and so forth. In several embodiments, the tumor sample is from an individual suspected of having a cancer.

At process block 130, the method comprises generating a CBN with maximal posterior probability from the sets of SGAs and DEGs. The CBN includes causal edges pointing from SGAs in the set of SGAs to DEGs in the set of DEGs. The result of step 130 is generation of a CBN model M that best explains (that is, has maximal posterior probability) the SGA and DEG data from the tumor t. The CBN model assigns a single SGA in the set of SGAs as the most probable cause for each DEG in the set of DEGs. In model M, a given SGA can have zero, one, or more causal edges emanating from it to the variables in the set of DEGs.

Prior methods for learning Bayesian networks from observations, and computational algorithms to do so, are well understood and have been used successfully in many applications (see, e.g., Glymour C, Cooper G. Computation, Causation, and Discovery. Cambridge, MA: MIT Press; 1999; Friedman, Nir. "Inferring cellular networks using probabilistic graphical models." Science 303, no. 5659 (2004): 799-805; and Pe'er, Dana. "Bayesian network analysis of signaling networks: a primer." Science's STKE 2005, no. 281 (2005): pl4). In some embodiments, such prior methods can be modified as described herein to generate the CBN including causal edges pointing from SGAs in the set of SGAs to DEGs in the set of DEGs.

In some embodiments, generating the CBN with maximal posterior probability includes generating a plurality of test CBNs using the sets of SGAs and DEGs from the tumor t. As used herein, a "test CBN" is a CBN that provides a model of the causal relationships between SGAs and DEGs in the current tumor being analyzed. The plurality of test CBNs forms a "search space" from which the CBN with maximal posterior probability is identified. The posterior probability for each test CBN in the plurality is determined, and the test CBN in the plurality with the maximal posterior probability is identified as the CBN with maximal posterior probability. Determining the posterior probability for the test CBNs in the plurality can be done, for example, by calculating P(D | M) × P(M) for each test CBN, wherein P(D | M) is the marginal likelihood of each test CBN in the plurality (or search space), and P(M) is the prior probability of each test CBN in the plurality (or search space).
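The search over test CBNs can be illustrated with a small exhaustive sketch. It assumes, as the later embodiments describe, that both the prior and the marginal likelihood factor into per-edge terms, and that every per-edge probability is strictly positive. The function name `best_cbn` and the dictionary layout are hypothetical.

```python
from itertools import product
from math import log


def best_cbn(sgas, degs, edge_prior, edge_likelihood):
    """Enumerate every test CBN and return the one with maximal score.

    A test CBN is represented as a dict mapping each DEG to the single
    SGA chosen as its cause. `edge_prior[(sga, deg)]` and
    `edge_likelihood[(sga, deg)]` are assumed per-edge probabilities.
    """
    best_model, best_score = None, float("-inf")
    for causes in product(sgas, repeat=len(degs)):
        model = dict(zip(degs, causes))
        # log P(D | M) + log P(M), summed over the edges of the model
        score = sum(log(edge_likelihood[(a, e)]) + log(edge_prior[(a, e)])
                    for e, a in model.items())
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

Exhaustive enumeration grows exponentially with the number of DEGs. When the score factors over edges as assumed here, choosing the highest-scoring SGA for each DEG independently gives the same model at far lower cost.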

In some embodiments, P(M) is proportional to the frequencies at which the SGAs of a test CBN are present in a reference set of tumor genomes, D. The reference set of genomes can be, for example, tumor genomes available from The Cancer Genome Atlas (TCGA). TCGA provides genomic and transcriptomic data from multiple cancer types, providing a systematic characterization of somatic mutations, somatic copy number alterations, and oncogenic processes for many different types of tumors.
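A frequency-based prior of this kind can be sketched as follows. The sketch assumes the reference set is available as a list of per-tumor sets of altered genes; the function name `frequency_priors` and the uniform fallback for genes never seen in the reference set are hypothetical choices, not taken from the disclosure.

```python
from collections import Counter


def frequency_priors(reference_tumors, tumor_sgas):
    """Prior for each SGA of the current tumor, proportional to how many
    reference tumors carry an alteration in the same gene."""
    counts = Counter()
    for sgas in reference_tumors:
        counts.update(set(sgas))  # count each gene once per tumor
    raw = {g: counts[g] for g in tumor_sgas}
    total = sum(raw.values())
    if total == 0:
        # Hypothetical fallback: no gene observed in the reference set.
        return {g: 1.0 / len(tumor_sgas) for g in tumor_sgas}
    return {g: raw[g] / total for g in tumor_sgas}
```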

In additional embodiments, P(M) comprises a product of prior probabilities of causal edges of a test CBN. For example, in such embodiments, a prior probability of a causal edge from an SGA A_h (a somatic alteration of gene A_h) to a DEG E_i can be stated as P(A_h → E_i) (abbreviated as π_h) and determined according to:

wherein π_0 is a prior probability that the cause of DEG E_i is not an SGA, h' indexes over the number m of genes in tumor t that have SGAs, and α_h is a Dirichlet parameter set forth as

wherein U_h denotes tumor genomes in a reference set of tumor genomes, D, that (similar to tumor t) have a somatic alteration in gene A_h, and m_t' denotes the number of genes with SGAs in the genome of tumor t'. The reference set of tumor genomes, D, is obtained prior to the analysis of the current tumor, and provides a training set of prior tumors. Similarly, α_h' is a Dirichlet parameter set forth as

In additional embodiments, known values for the number of unique synonymous mutations in a particular gene in a reference set of tumor genomes D and the number of somatic copy number alterations in the particular gene in subjects without cancer can be applied to account for mutation and copy number alterations that are due to differences in gene lengths and chromosome locations. Thus, in one embodiment, the number of unique synonymous mutations in gene A_h in a reference set of tumor genomes, D, has a known value (a1); the number of somatic copy number alterations in gene A_h in subjects without cancer has a known value (a2); and α_h is:

wherein U_h denotes tumor genomes from a reference set of tumor genomes, D, that have a somatic alteration in gene A_h, and w_ht' denotes a weight proportional to the probability that the somatic alteration in gene A_h is the SGA with functional impact in the genome of tumor t'.

w_ht' is calculated as: wherein

In the above equations, the terms denote, respectively: whether gene A_h has a non-synonymous somatic mutation or not in tumor t' (e.g., 1 or 0, respectively); the number of unique synonymous somatic mutation events in gene A_h in the reference set of tumor genomes, D; whether gene A_h has a somatic copy number alteration or not in tumor t' (e.g., 1 or 0, respectively); and the expected number of times gene A_h has a somatic copy number alteration among the tumors in D, and yet is only a passenger alteration (that is, has no functional impact), based on the number of times gene A_h has a somatic copy number alteration in a reference set of genomes from subjects without known cancer (e.g., subjects in the normal human population).
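The equations for the edge prior are not reproduced in this text, so the following Python sketch shows only one form that is consistent with the symbol definitions given above. It is an assumption, not the claimed formula: each reference tumor t' that carries an alteration in a gene contributes 1/m_t' to that gene's Dirichlet parameter, the non-SGA cause receives a fixed prior `pi0`, and the remaining probability mass is divided among the tumor's SGAs in proportion to their Dirichlet parameters. The label "A0", the default `pi0=0.1`, and the function name are hypothetical.

```python
def edge_priors(tumor_sgas, reference_tumors, pi0=0.1):
    """Assumed form of the prior that each SGA of a tumor causes a DEG.

    alpha_h = sum over reference tumors t' containing gene h of 1 / m_t'
    prior_h = (1 - pi0) * alpha_h / sum of alpha over the tumor's SGAs
    """
    alpha = {}
    for gene in tumor_sgas:
        alpha[gene] = sum(1.0 / len(sgas)
                          for sgas in reference_tumors if gene in sgas)
    total = sum(alpha.values())
    priors = {"A0": pi0}  # "A0": hypothetical label for a non-SGA cause
    for gene in tumor_sgas:
        if total > 0:
            priors[gene] = (1.0 - pi0) * alpha[gene] / total
        else:
            # Hypothetical fallback when no SGA appears in the reference set
            priors[gene] = (1.0 - pi0) / len(tumor_sgas)
    return priors
```

Under this assumed form, a gene that is frequently altered in tumors carrying few other alterations receives a larger prior than a gene that is altered mainly in heavily mutated tumors. The weighted variant described above would replace the 1/m_t' contribution with the weight w_ht'.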

In some embodiments, P(D | M) comprises a product of marginal likelihoods of causal edges of the test CBN for tumor t. Calculating P(D | M) can comprise, for example, calculating a marginal likelihood of each causal edge A_h → E_i of the test CBN based on tumor genomes in a reference set of tumor genomes, D, that have the somatic alteration of gene A_h, and tumor genomes in D that do not have the somatic alteration of gene A_h. The reference set of genomes can be, for example, tumor genomes available from TCGA.
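A per-edge marginal likelihood of this kind can be sketched with the standard Bayesian Dirichlet score for discrete variables, computed in log space with the log-gamma function. This is a swapped-in textbook form under stated assumptions, not the exact formula of the disclosure: a single symmetric Dirichlet parameter `alpha` is used for every cell, and the function names are hypothetical.

```python
from math import lgamma


def log_bd_score(counts, alpha=1.0):
    """Log marginal likelihood of a DEG given one discrete cause.

    counts[j][k] = number of tumors in which the cause is in state j
    and the DEG is in state k. Uses a symmetric Dirichlet prior.
    """
    total = 0.0
    for row in counts:
        alpha_j = alpha * len(row)
        n_j = sum(row)
        total += lgamma(alpha_j) - lgamma(alpha_j + n_j)
        for n_jk in row:
            total += lgamma(alpha + n_jk) - lgamma(alpha)
    return total


def edge_log_likelihood(counts_with_alteration, counts_without_alteration,
                        alpha=1.0):
    """Combine the two parts of the reference set: tumors that have the
    alteration (scored against the candidate SGA) and tumors that do not
    (scored against whatever cause is assumed for them)."""
    return (log_bd_score(counts_with_alteration, alpha)
            + log_bd_score(counts_without_alteration, alpha))
```

Working in log space avoids overflow of the gamma function when the reference set contains thousands of tumors.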

In some embodiments, P(D | M) comprises a product of posterior probabilities of causal edges of the test CBN for tumor t. Calculating P(D | M) can comprise, for example, calculating a posterior probability of each causal edge A_h → E_i of the CBN as:

wherein:

In the formulas listed above, D^h denotes tumor genomes having the somatic alteration of gene A_h that are present in a reference set of tumor genomes, D; D^¬h denotes tumor genomes that do not have the somatic alteration of gene A_h that are present in D; and G(i) denotes a gene that has a maximal posterior probability for causing E_i in D^¬h (or a "global driver"). G(i) can be calculated by finding the gene g that maximizes the following function and assigning G(i) to be g:

wherein N_ijk is the number of tumor genomes in D wherein node E_i has value k and its cause A_g has the value denoted by j, and π_g denotes the prior probability that A_g is the cause for E_i for the tumor genomes D; j indexes over the states of A_g, which is being considered as the cause of E_i; k is an indicator variable which indexes over the states of E_i; α_ijk is a parameter in a Dirichlet distribution that is proportional to the prior probability of E_i having value k given that A_g has the value indexed by j; Γ is the gamma function; N^h_ijk is the number of tumor genomes in D^h for which E_i has value k and A_h has the value indexed by j; N^G(i)_ijk is the number of tumor genomes in D^¬h for which E_i has value k and A_G(i) has the value indexed by j; and wherein

wherein π_0 is a prior probability that something other than an SGA is the cause of DEG E_i; h' indexes over the number m of genes in tumor t that have SGAs; and α_g is a Dirichlet parameter set forth as

wherein U_g denotes a reference set of tumor genomes that have a somatic alteration in gene A_g; and m_t' denotes the number of genes with somatic alterations in the genome of tumor t'.

In additional embodiments, known values for the number of unique synonymous mutations in a particular gene in a reference set of tumor genomes D and the number of somatic copy number alterations in the particular gene in subjects without cancer can be applied to account for mutation and copy number alterations that are due to differences in gene lengths and chromosome locations. Thus, in one embodiment, the number of unique synonymous mutations in gene A_g in a reference set of tumor genomes, D, has a known value (a1); the number of somatic copy number alterations in gene A_g in subjects without cancer has a known value (a2); and α_g is:

wherein U_g denotes tumor genomes from a reference set of tumor genomes, D, that have a somatic alteration in gene A_g; m_t' denotes the number of genes with somatic alterations in the genome of tumor t'; and w_gt' denotes a weight proportional to the probability that the somatic alteration in gene A_g is the SGA with functional impact in the genome of tumor t'. w_gt' is calculated as: wherein

In the above equations, the terms denote, respectively: whether gene A_g has a non-synonymous somatic mutation or not in tumor t' (e.g., 1 or 0, respectively); the number of unique synonymous mutations in gene A_g in the reference set of tumor genomes, D; whether gene A_g has a somatic copy number alteration or not in tumor t' (e.g., 1 or 0, respectively); and the expected number of times A_g has a somatic copy number alteration among the tumors in D, and yet is only a passenger alteration (that is, has no functional impact), based on the number of times gene A_g has a somatic copy number alteration in a reference set of genomes from subjects without known cancer (e.g., subjects in the normal human population).

Causal edges from different SGAs have different distributions of posterior probabilities. To standardize how to interpret the significance of an edge probability P_e, a random dataset was created by permuting DEG data in reference genomes across tumors such that the statistical relationships between SGAs and DEGs were random in this dataset. The TDI method was applied to this dataset to calculate the posterior probability of edges using the permuted data, and the values were treated as samples from a distribution of random posterior probabilities of edges pointing from a given SGA to DEGs. Given a posterior probability for a causal edge from real data (P_e), the probability that an edge from an SGA can be assigned a given P_e or higher in data from permutation experiments (i.e., the p value of the edge with a given P_e) is determined based on the random samples.

As shown at process block 140 in FIG. 11, an SGA of the CBN with maximal posterior probability is identified as an SGA-FI in the tumor if five or more causal edges point from the SGA to DEGs in the CBN. In several embodiments, the SGA is identified as an SGA-FI if it has 5 or more causal edges to DEGs that are each assigned a p value < 0.001. In several embodiments, the identified SGA-FI is a driver SGA. As discussed in the examples and shown in FIG. 9, the overall false discovery rate of the joint causal relationships between an SGA and 5 or more target DEGs is less than or equal to 10^-15, effectively controlling the false discovery when identifying SGA-FIs in a tumor.
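The permutation-based p value and the five-edge rule can be sketched as follows. The sketch assumes that, for each SGA, a list of posterior probabilities obtained from permuted data is already available; the function names and the nested-dictionary layout are hypothetical.

```python
def empirical_p_value(p_e, null_posteriors):
    """Fraction of permutation-derived posteriors that are >= p_e."""
    hits = sum(1 for value in null_posteriors if value >= p_e)
    return hits / len(null_posteriors)


def call_sga_fis(edges, null_by_sga, min_edges=5, p_cutoff=0.001):
    """Return the SGAs with at least `min_edges` significant causal edges.

    edges:       {sga: {deg: posterior probability of the edge}}
    null_by_sga: {sga: list of posteriors from permuted data}
    """
    called = []
    for sga, targets in edges.items():
        n_significant = sum(
            1 for posterior in targets.values()
            if empirical_p_value(posterior, null_by_sga[sga]) < p_cutoff
        )
        if n_significant >= min_edges:
            called.append(sga)
    return called
```

With this plain empirical estimate, an edge whose posterior exceeds every permuted value receives a p value of zero; a common alternative adds one to both numerator and denominator to avoid that.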

Once the SGA-FIs are identified, the information can be used in downstream applications, for example, to select a treatment for tumor t in a subject, or to provide a diagnosis or prognosis concerning the subject with tumor t. Providing a diagnosis or prognosis concerning the subject with tumor t can include, for example, predicting the outcome (for example, likelihood of aggressive disease, recurrence, metastasis, or chance of survival) and sensitivity to molecularly targeted therapy of the subject.

Computer Implementation

The analytic methods described herein can be implemented by use of computer systems. For example, any of the steps described above for identifying a SGA-FI may be performed by means of software components loaded into a computer or other information appliance or digital device. When so enabled, the computer, appliance or device may then perform all or some of the above-described steps to assist the analysis. The above features embodied in one or more computer programs may be performed by one or more computers running such programs.

Aspects of the disclosed methods for identifying a SGA-FI can be implemented using computer-based calculations and tools. For example, a CBN model can be generated by a computer based on underlying sets of SGAs and DEGs from a tumor. The tools are advantageously provided in the form of computer programs that are executable by a general purpose computer system (for example, as described in the following section) of conventional design.

Computer code for implementing aspects of the present invention may be written in a variety of languages, including PERL, C, C++, Java, JavaScript, VBScript, AWK, or any other scripting or programming language that can be executed on the host computer or that can be compiled to execute on the host computer. Code may also be written or distributed in low level languages such as assembler languages or machine languages. The host computer system advantageously provides an interface via which the user controls operation of the tools.

Any of the methods described herein can be implemented by computer-executable instructions in (e.g., encoded on) one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Such instructions can cause a computer to perform the method. Any of the methods described herein can also be implemented by computer-executable instructions stored in one or more computer-readable storage devices (e.g., memory, magnetic storage, optical storage, or the like). Such instructions can cause a computer to perform the method.

Example Computing System

FIG. 12 illustrates a generalized example of a suitable computing system 200 in which several of the described innovations may be implemented. The computing system 200 is not intended to suggest any limitation as to scope of use or functionality, as the innovations may be implemented in diverse computing systems, including special-purpose computing systems. In practice, a computing system can comprise multiple networked instances of the illustrated computing system.

With reference to FIG. 12, the computing system 200 includes one or more processing units 210, 215 and memory 220, 225. In FIG. 12, this basic configuration 230 is included within a dashed line. The processing units 210, 215 execute computer-executable instructions. A processing unit can be a central processing unit (CPU), processor in an application-specific integrated circuit (ASIC), or any other type of processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. For example, FIG. 12 shows a central processing unit 210 as well as a graphics processing unit or coprocessing unit 215. The tangible memory 220, 225 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two, accessible by the processing unit(s). The memory 220, 225 stores software 280 implementing one or more innovations described herein, in the form of computer-executable instructions suitable for execution by the processing unit(s).

A computing system may have additional features. For example, the computing system 200 includes storage 240, one or more input devices 250, one or more output devices 260, and one or more communication connections 270. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system 200. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing system 200, and coordinates activities of the components of the computing system 200.

The tangible storage 240 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed within the computing system 200. The storage 240 stores instructions for the software 280 implementing one or more innovations described herein.

The input device(s) 250 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system 200. For video encoding, the input device(s) 250 may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system 200. The output device(s) 260 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing system 200.

The communication connection(s) 270 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.

The innovations can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing system on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing system.

For the sake of presentation, the detailed description uses terms like "determine" and "use" to describe computer operations in a computing system. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.

Computer-Readable Media

Any of the methods described herein can be implemented by computer-executable instructions in (e.g., stored on, encoded on, or the like) one or more computer-readable media (e.g., computer-readable storage media or other tangible media) or one or more computer-readable storage devices (e.g., memory, magnetic storage, optical storage, or the like). Such instructions can cause a computing device to perform the method. The technologies described herein can be implemented in a variety of programming languages.

Any of the computer-readable media herein can be non-transitory (e.g., volatile memory such as DRAM or SRAM, nonvolatile memory such as magnetic storage, optical storage, or the like) and/or tangible. Any of the storing actions described herein can be implemented by storing in one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Any of the things (e.g., data created and used during implementation) described as stored can be stored in one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Computer-readable media can be limited to implementations not consisting of a signal.

EXAMPLES

The following examples are provided to illustrate particular features of certain embodiments, but the scope of the claims should not be limited to those features exemplified.

Example 1

Identifying cancer drivers by inferring tumor-specific causal relationships between somatic genome alterations and molecular phenotypes

This example illustrates a method of delineating the causal relationships between SGAs and observed molecular phenotypes as a means to reveal SGAs that likely contribute to the development of a tumor. For example, a tumor t hosts a set of SGAs, denoted as SGA_SETt, and a set of DEGs, designated as DEG_SETt. The TDI method represents the causal relationships between SGAs and DEGs as a CBN (see FIG. 1) and searches for the model with the maximal posterior probability P(M | D) that best explains the data (SGAs and DEGs) of tumor t.

The TDI method combines the strengths of both frequency- and functional-impact oriented approaches for identifying driver SGAs. The TDI method uses the frequency with which a gene g is perturbed by SGA events in a reference set of tumor genomes, D, to estimate the prior probability P(g) that g is a cancer driver in any given tumor. To further incorporate information from a functional-impact perspective, TDI employs the likelihood function P(data | g), which represents the probability of the molecular phenotype data that is observed (e.g., DEGs) given that g is assumed to be a driver. From such priors and likelihoods, TDI derives the posterior probabilities of drivers for a given tumor. By inferring causal relationships between SGA events and molecular phenotypes, TDI can identify cancer drivers and provide insight into the disease mechanism of individual tumors.

RESULTS

The TDI method was used to analyze data from 4,468 tumors across 16 cancer types in TCGA to derive 4,468 tumor-specific models. An SGA was designated as an SGA with functional impact (SGA-FI) in a specific tumor if it was predicted by TDI to causally regulate 5 or more DEGs in the tumor with an expected false discovery rate < 10^-15. Since it is statistically difficult to evaluate whether the driver identified within an individual tumor is valid, the results from multiple tumors were pooled to evaluate whether the causal relationship between a pair of SGA and DEG is conserved across tumors, and biological experiments were further performed to validate predictions by TDI.

The landscape of candidate cancer drivers identified using the TDI method.

First, the landscape of SGA-FIs was investigated. SGAs were organized according to the genes perturbed by them, and each SGA was represented by the corresponding gene name. The distribution of the number of SGAs and SGA-FIs per tumor was compared across cancer types (FIGs. 2A - 2B). The average number of SGAs per tumor across cancer types was approximately 500, whereas the average number of SGA-FIs identified by TDI was approximately 50 per tumor. TDI identified 490 SGAs that had a significant functional impact in more than 1% of the tumors; these SGAs include the majority (86.1%) of the known drivers published by the TCGA network (see Kandoth et al., Nature, 502, 333-339, 2013; and Lawrence et al., Nature, 505, 495-501, 2014) as well as novel candidate drivers. SGA-FIs identified by TDI include protein encoding genes as well as microRNAs and intergenic non-protein-coding RNAs (e.g., MIR548K, MIR1207, and PVT1). Being a functional impact-oriented model, TDI can identify SGA-FIs independent of their alteration frequency or gene length (FIG. 2C). In addition, many SGAs with similar alteration frequency or gene length as those designated as SGA-FIs by TDI are not called as SGA-FIs by TDI, further demonstrating that TDI predictions are not simply a function of alteration frequency or gene length.

The landscape of common SGAs and SGA-FIs is illustrated (FIG. 2D) using a Circos plot (circos.ca) that includes 62 SGA-FIs that occurred in more than 300 tumors (> 6% of the tumors) and that also have a call rate (the fraction of the SGA instances affecting a gene being called an SGA-FI) greater than or equal to 0.8. TDI also designated many low-frequency SGAs (less than 300 tumors or less than 7%) as SGA-FIs with high call rates (≥ 0.8) (FIG. 2E). TDI designated as SGA-FIs both well-known drivers (e.g., TP53, PIK3CA, PTEN, KRAS, and CDKN2A) and more controversial drivers (e.g., TTN, CSMD3, MUC16, LRP1B, and ZFHX4).
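The call rate defined above can be computed directly from per-tumor results. The sketch below assumes each tumor is summarized as a pair of sets, the genes carrying SGAs and the subset designated as SGA-FIs; the function name and input layout are hypothetical.

```python
from collections import Counter


def call_rates(tumors):
    """Fraction of the SGA instances of each gene that were called SGA-FIs.

    tumors: iterable of (sgas, sga_fis) pairs, each a set of gene names.
    """
    altered, called = Counter(), Counter()
    for sgas, sga_fis in tumors:
        altered.update(sgas)
        called.update(sga_fis & sgas)  # only count calls on observed SGAs
    return {gene: called[gene] / altered[gene] for gene in altered}
```

Filtering the returned dictionary for values of at least 0.8, together with the number of tumors in `altered`, would reproduce the kind of selection described for FIGs. 2D and 2E.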

Causal relationships inferred by TDI are statistically robust.

Whether the causal relationships between SGAs and DEGs inferred by TDI for SGA-FIs are statistically robust was evaluated. A series of random datasets was generated from the TCGA data, in which the DEG status of each gene expression variable was permuted across tumors, while the SGA status in each tumor remained as reported by TCGA. As such, the statistical relationships between SGAs and DEGs would be expected to be random. TDI was applied to these random datasets to calculate the posterior probabilities for the candidate causal edges in these permuted tumor data, and the distributions of the posterior probabilities assigned by TDI to all candidate edges in the real and the random data were compared.
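The randomization step can be sketched as a column-wise shuffle of a binary DEG matrix. This assumes that "permuted across tumors" means each gene's DEG status is shuffled independently of the other genes, which preserves how often each gene is a DEG while breaking its association with the SGAs; the function name and the fixed seed are hypothetical.

```python
import random


def permute_deg_status(deg_matrix, seed=0):
    """Shuffle each gene's DEG status across tumors.

    deg_matrix[t][i] is 1 if gene i is a DEG in tumor t, else 0.
    Returns a new matrix; the input (and any SGA data) is left untouched.
    """
    rng = random.Random(seed)
    n_tumors = len(deg_matrix)
    n_genes = len(deg_matrix[0]) if deg_matrix else 0
    permuted = [row[:] for row in deg_matrix]
    for i in range(n_genes):
        column = [deg_matrix[t][i] for t in range(n_tumors)]
        rng.shuffle(column)
        for t in range(n_tumors):
            permuted[t][i] = column[t]
    return permuted
```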

TDI was able to differentiate statistical relationships between SGAs and DEGs from random ones (FIG. 3A) in that it generally assigned much higher posterior probabilities to candidate edges obtained from real data than those obtained from random data. A large number of edges from well-known transcriptomic regulators (e.g., PIK3CA) were assigned high posterior probabilities. Interestingly, the results also show more causal edges to DEGs from TTN, CSMD3, MUC16, RYR2, LRP1B, and ZFHX4, with higher posterior probabilities than would be expected by random chance, indicating that perturbing any of these genes has transcriptomic impact (FIG. 3A). Overall, the results from these randomization experiments indicate that the causal relationships inferred by TDI are statistically sound.

An SCNA event (amplification or deletion of a chromosome fragment) in a tumor often encloses a large number of genes, making it a challenge to detect the distinct signals perturbed by individual genes within a SCNA fragment. TDI addresses this problem by integrating both SM and SCNA data, which can create a "linkage disequilibrium" among genes within a SCNA fragment (FIG. 3B). For example, the chromosome cytoband 3q26 enclosing PIK3CA is commonly amplified, and it would be difficult to distinguish the signal of amplification of PIK3CA from those of the other genes within this fragment based on SCNA data alone. However, when combined with SM data, PIK3CA clearly has a higher combined alteration rate than its neighbor genes, making it uncorrelated (as illustrated by darker blue) with its neighbor genes (FIG. 3B). During the process of calculating whether PIK3CA amplification is causally responsible for a DEG observed in a specific tumor, TDI uses the statistics collected from all tumors with PIK3CA alterations (combining SM and SCNA) to calculate the likelihood that such a causal relationship is preserved in the tumor, enabling it to differentiate the signals perturbed by PIK3CA from those of the other genes in the same SCNA fragment. It can be noted that alterations of CSMD3 and ZFHX4 exhibit similar patterns, enabling TDI to detect their distinct signals (FIG. 3B).

Causal relationships inferred by TDI are biologically sensible.

A driver gene can be perturbed due to different SGA events (e.g., SMs, SCNAs, or both) and cancer types vary in the number of each type of alteration exhibited. For example, in the dataset used here (FIG. 4), PIK3CA is commonly altered by SMs in breast cancer, whereas it is mainly altered by SCNAs in ovarian cancer and is almost equally altered by SMs and SCNAs in head and neck squamous carcinoma. As a well-known cancer driver in many cancer types, PIK3CA amplification and mutations should share a common functional impact.

The target DEGs regulated by each SGA event in PIK3CA (SM or SCNA) were identified in individual tumors, and the lists of DEGs that were predicted to be caused by SMs and SCNAs significantly overlap (p < 10^-100, Fisher's exact test). Thus, TDI correctly detected the shared functional impact of PIK3CA when perturbed by different types of SGAs in different cancer types. Similar results were obtained for 157 SGA-FIs that were commonly perturbed by both SMs and SCNAs (with each type accounting for > 30% of instances for each SGA-FI), including CSMD3 and ZFHX4 (FIGs. 4B and 4C).

Additionally, whether the TDI-inferred causal relationships between SGAs and DEGs agree with existing knowledge was evaluated. For example, because RB1 protein regulates the transcription factor E2F1, E2F1-regulated genes should be enriched among the RB1-targeted DEGs predicted by TDI. The PASTAA program (trap.molgen.mpg.de/PASTAA.htm) was used to search for motif binding sites in the promoters of the 164 DEGs that TDI predicted to be regulated by RB1, and E2F1 binding sites were found in the promoters of 116 of these (p < 10^-100).

In addition, an SGA was represented as a vector reflecting the profile of its target DEGs (a vector spanning the DEG space), in which an element reflects the number of tumors in which the SGA is predicted by TDI to regulate a DEG, and the pair-wise similarity (closeness in DEG space) of SGAs was evaluated. The KEAP1-NFE2L2 pair was among the closest SGAs, with DEGs that significantly overlapped (p < 10^-52), agreeing with the knowledge that KEAP1 regulates the function of NFE2L2. These results indicate that the functional impacts inferred by TDI are biologically sensible.

TDI analyses reveal functional connections among SGA-FIs

Identification of DEGs that are functionally impacted by SGAs can suggest whether different SGAs perturb a common signal. All pair-wise intersections between target DEG sets of SGA-FIs were evaluated to identify SGA pairs sharing significantly overlapping target DEGs (p < 10^-3, Fisher's exact test). SGA-FIs that perturb common signals were then organized in a network, in which an edge connecting a pair of SGA-FI nodes indicates that their target DEG sets significantly overlap.
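The pairwise overlap test and the resulting network can be sketched with the standard library alone. The sketch assumes a one-sided Fisher's exact test for enrichment (the hypergeometric upper tail), since the sidedness is not stated, and it assumes a known universe size, meaning the total number of genes that could be DEGs. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def overlap_p_value(set_a, set_b, universe_size):
    """One-sided Fisher's exact test: probability of an overlap at least
    as large as observed if set_b were drawn at random from the universe."""
    a, b = len(set_a), len(set_b)
    k = len(set_a & set_b)
    tail = sum(comb(a, i) * comb(universe_size - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / comb(universe_size, b)


def shared_target_network(targets, universe_size, p_cutoff=1e-3):
    """Return (sga_1, sga_2, p) for every pair of SGA-FIs whose target
    DEG sets overlap significantly.

    targets: {sga_fi: set of target DEGs}
    """
    edges = []
    for g1, g2 in combinations(sorted(targets), 2):
        p = overlap_p_value(targets[g1], targets[g2], universe_size)
        if p < p_cutoff:
            edges.append((g1, g2, p))
    return edges
```

Sorting the returned edges of one SGA-FI by p value and keeping the first five would give a ranked list of its most closely connected partners.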

Within the results, SGA-FIs were identified that belong to two well-known cancer pathways, the p53 pathway and the Notch1 pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG), and these were connected in networks based on their shared target DEGs (FIG. 5A). Indeed, the members of these two pathways are densely connected and share common functional impacts. Through the created network graph, novel candidate drivers were found that are highly connected to the well-known members of these pathways, which provides the means to evaluate the potential oncogenic properties of candidate SGA-FIs.

For each SGA-FI, the top five SGA-FIs (ranked according to p-value significance) that share significantly overlapping target DEGs with it were identified and connected as a network. For example, the top five SGAs that share DEGs with PIK3CA include PTEN, TP53, and GATA3, which are known cancer drivers, and their connections agree with existing knowledge. The top five SGA-FIs connected with CSMD3 and ZFHX4 (FIG. 5B) also form densely connected networks that include well-known cancer drivers, such as TP53, KRAS, and APC, suggesting that alterations of CSMD3 and ZFHX4 perturb some of the same signaling pathways. We found similar results for other common SGA-FIs, including TTN, MUC16, and LRP1B (FIG. 8).

Cellular experiments provide support for ZFHX4 and CSMD3 as cancer drivers

The results of TDI analyses indicate that commonly observed SGAs, including TTN, CSMD3, MUC16, and ZFHX4, have a functional impact on many DEGs in tumors, even though they are not commonly considered to be cancer drivers. The predicted causal relationships between CSMD3 and ZFHX4 and target DEGs were evaluated experimentally by testing their impact on cellular behavior in cancer cell lines. Both CSMD3 and ZFHX4 are very long genes, and they are often perturbed by both SMs and SCNAs, but these SGAs are commonly deemed as passenger events. Whether manipulating these genes in cell lines causally affects their target DEGs, as predicted by TDI, was also examined. The top five DEGs most frequently designated as the targets of CSMD3 (CDKN3, GJB2, MADL2, MMP11, WNT11) and of ZFHX4 (DES, NPR1, CDCA3, GSTM5, FHL1) were selected. Four of the five DEGs of CSMD3 are over-expressed in tumors (FIG. 6A) relative to normal tissue of the same type, suggesting their overexpression was positively selected. The five selected DEGs that are targeted by ZFHX4 were suppressed in tumors compared with normal tissue (FIG. 6B). A gastric cancer cell line (HGC27) was identified with CSMD3 amplification and a prostate cancer cell line (PC3) with ZFHX4 amplification. We performed RNAi-mediated gene silencing experiments using multiple siRNAs targeting different regions of CSMD3 and ZFHX4 transcripts. Knocking down CSMD3 resulted in significant repression of four overexpressed target DEGs, with WNT11 being non-significant (FIG. 6C). Knocking down ZFHX4 resulted in significant induction of three target DEGs (DES, GSTM5, and NPR1) (FIG. 6D); CDCA3 was not expressed in the PC3 cell line, and FHL1 was further repressed. While preliminary, these results support that the SGA-FIs inferred by TDI have a biological influence on their associated DEGs.

We also examined whether alteration of CSMD3 and ZFHX4 expression affected oncogenic phenotype. The experimental results indicate that knocking down CSMD3 and ZFHX4 in the respective cell lines significantly attenuated cell proliferation (viability) and migration (FIGs. 6E - 6H). In addition, knockdown of ZFHX4 induced apoptosis (FIG. 6I).

TDI reveals causal relationships at the individual tumor level that cannot be detected by global causal analysis.

While it is possible to infer causal relationships between SGAs and DEGs at the cohort level (global) in a fashion similar to expression quantitative trait loci (eQTL) analysis (Matthew V. Rockman & Leonid Kruglyak, "Genetics of global gene expression", Nature Reviews Genetics 7, 862-872; 2006), global analysis usually prefers the SGAs that are commonly perturbed at the whole-cohort level as drivers for DEGs and ignores low-frequency SGAs, even when the latter are more specific drivers of a DEG. For example, PIK3CA is the second most commonly mutated gene in all cancers and a common driver of many cancer processes. In global causal analysis, it is deemed the most probable driver regulating the expression of KLK8, a gene believed to be involved in cancer. KLK8 is differentially expressed in 1,497 tumors (out of 4,468 tumors from TCGA), and 34% of these instances are predicted to be caused by SGAs in PIK3CA in global analysis. However, in TDI analysis, PIK3CA is only assigned as the driver of KLK8 in 21% of the 1,497 instances, whereas the other 13% of instances are assigned to driver SGAs that have lower prevalences in the cohort, including CDKN2A, which is assigned to cause 21% of the 1,497 instances of KLK8 DEG events. Interestingly, when SGAs in PIK3CA and CDKN2A co-occur in 87 tumors with differentially expressed KLK8, CDKN2A, rather than PIK3CA, is assigned as the driver of KLK8, indicating that TDI has detected that CDKN2A has a stronger causal relationship with respect to KLK8. The phenomenon can be explained as follows: there is a causal chain PIK3CA → CDKN2A → KLK8. Although CDKN2A is a more direct regulator of KLK8, and it is involved in regulating KLK8 in tumors with PIK3CA, its impact is not observed in those tumors. Since PIK3CA is altered in more tumors than CDKN2A, PIK3CA explains the overall variance of KLK8 in the cohort better than CDKN2A, and therefore it is designated as the global driver of KLK8. However, when evaluated at the individual tumor level, TDI can detect that the causal relationship between CDKN2A and KLK8 is stronger than that between PIK3CA and KLK8, due to the non-deterministic causal relationship between PIK3CA and CDKN2A, and as such TDI correctly assigns CDKN2A as the driver of KLK8 even though both PIK3CA and CDKN2A are present in a tumor as candidate drivers and PIK3CA is the global driver. Such results can be detected only by methods that perform instance-based inference, such as TDI, but not by methods aiming to infer causal relationships at the cohort level. This reflects the key innovation and uniqueness of the TDI algorithm.

TDI outperforms conventional feature selection methods in identifying SGAs that are predictive of DEGs in tumors

The disclosed TDI was compared to an alternative approach for identifying potential causal relationships between SGAs and DEGs. The alternative approach uses a classification model in combination with a feature-selection method to search for SGAs that are predictive of the expression status of a DEG. That is, for each DEG, this approach searches for a set of SGAs that can be used as input features to a classification model to predict the status of the DEG.

Over 600 DEGs involved in regulating immune responses were selected as target classes; the elastic net model was used as the classification model (implemented in the GLMNET package in R); and the recursive feature elimination (RFE) algorithm was used as the conventional approach to search for informative features. A set of features for each of 600 DEGs was selected and a GLMNET model was trained. The performance of the classification model for each DEG was evaluated using the area under the receiver operating characteristic curve (AUROC) as the metric. As a comparison, a GLMNET model was also trained for each DEG using the SGAs identified by TDI as input features, and its performance was evaluated. FIG. 10 shows that models trained with TDI-derived features significantly outperformed those trained with conventional feature selection and classification methods. These results clearly indicate that the disclosed TDI outperforms conventional methods of studying the relationships between SGAs and DEGs.

DISCUSSION

The TDI algorithm integrates mixed data types to infer causal relationships between SGAs and molecular phenotypes in individual tumors. In general terms, TDI provides a principled statistical framework that can combine multiple genomic and epigenetic alterations as candidate causal events (causes) and use transcriptomic data, proteomic data, and metabolomic data as molecular phenotypes (effects). Thus, TDI combines frequency-oriented information with information about functional impact to investigate whether the SGAs observed in a single tumor or in a set of tumors contribute to cancer development.

The tumor-specific inference framework enables TDI to delineate the causal relationship between SGAs and DEGs in specific tumors. Other approaches that infer associations between genomic variants and expression changes, such as expression quantitative trait loci (eQTL) analysis, are not tumor-specific, but rather identify functional impact across a set of tumors. However, a DEG can be causally regulated by multiple SGAs perturbing a common pathway in different tumors. For example, genes regulated by the PI3K pathway can be causally affected by SGAs in PIK3CA, PTEN, PIK3R1, and other member genes of the pathway. In such a case, a member gene altered at a relatively low frequency (e.g., PIK3R1) cannot explain the overall variance of a given pathway-regulated DEG at the cohort level. However, in an individual tumor, PIK3R1 may be the only altered member of the PI3K pathway and may therefore account for the PI3K-regulated DEGs in the tumor better than other SGAs. A strength of TDI is its ability to designate PIK3R1 as the major cause of the DEG in such a tumor with high confidence. Indeed, TDI offers, more globally, a general approach to case-specific inference that can build causal models between perturbations and responses in any biological system at the individual case level, be it a person, organ, tissue, or cell. Furthermore, as shown above, TDI outperforms conventional methods of studying the relationships between SGAs and DEGs.

Methods

SGA data collection and preprocessing.

SM data for 16 cancer types were obtained directly from the TCGA data matrix portal (https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm). All the non-synonymous mutation events of all genes were considered at the gene level, where a mutated gene is defined as one that contains one or more non-synonymous mutations or indels.

SCNA data (processed using GISTIC2) were obtained from the Firehose browser of the Broad Institute (gdac.broadinstitute.org). The TCGA network employed GISTIC 2.0 to process SCNA data, which discretized the gene SCNA into five different levels: homozygous deletion, single copy deletion, diploid normal copy, low copy number amplification, and high copy number amplification. Only genes with homozygous deletion or high copy number amplification were included for further analysis. Additionally, genes with inconsistent copy number variations across tumors in a given cancer type were screened out. More specifically, if a gene is perturbed by both copy number amplification and deletion events in a cancer type, and both types of events have a prevalence > 25% of tumors, the gene is deemed to be affected by inconsistent SCNA events, and it is excluded from further analysis.
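The SCNA filtering rules above can be sketched as follows. This is only an illustration under stated assumptions: GISTIC2 levels are encoded as integers from -2 (homozygous deletion) to 2 (high copy number amplification), the 25% prevalence is computed on those two extreme levels, and the function name `filter_scna` and the gene and tumor labels are hypothetical.

```python
def filter_scna(levels_by_gene, max_conflict=0.25):
    """Keep high-level SCNA events and drop genes with inconsistent SCNAs.

    levels_by_gene: {gene: {tumor: level}} for one cancer type, where level
    is an integer in -2..2 (assumed encoding of the five GISTIC2 levels).
    Returns {gene: set of tumors carrying a retained SCNA event}.
    """
    kept = {}
    for gene, levels in levels_by_gene.items():
        n_tumors = len(levels)
        amplified = {t for t, v in levels.items() if v == 2}
        deleted = {t for t, v in levels.items() if v == -2}
        if not amplified and not deleted:
            continue  # no homozygous deletion or high-level amplification
        # A gene is inconsistent when both event types exceed the cutoff.
        inconsistent = (
            len(amplified) / n_tumors > max_conflict
            and len(deleted) / n_tumors > max_conflict
        )
        if not inconsistent:
            kept[gene] = amplified | deleted
    return kept


# Hypothetical toy input: three genes measured in four tumors.
gistic = {
    "GENE_X": {"t1": 2, "t2": 2, "t3": -2, "t4": -2},  # amplified and deleted
    "GENE_Y": {"t1": 2, "t2": 0, "t3": 1, "t4": -1},   # one high-level event
    "GENE_Z": {"t1": 0, "t2": 1, "t3": -1, "t4": 0},   # only low-level events
}
kept = filter_scna(gistic)
```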

Preprocessed SM data and SCNA data were combined as SGA data, such that a gene in a given tumor is designated as altered if it is affected by an SM, an SCNA event, or both. Since certain recurrent SCNAs enclose multiple genes whose alteration events are inseparable, neighboring genes were organized into an SGA unit by first identifying a pair of genes that are co-perturbed 90% of the time in all instances, followed by greedily including a neighbor gene if it is also co-perturbed 90% of the time with the boundary gene.

DEG data collection and preprocessing.

Gene expression data were preprocessed and obtained from the Firehose browser of the Broad Institute. We used RNASeqV2 for cancer types with expression measurements in normal tissues. For cancer types without RNASeqV2 measurements in normal cells (i.e., GBM and OV), we used microarray data to identify DEGs. We determined whether a gene is differentially expressed by comparing the gene expression in the tumor cell against that in the corresponding tissue-specific normal cells. For a given cancer type, assuming the expression of each gene follows a Gaussian distribution in normal cells, we calculated a p-value for each gene in a tumor, estimating how significantly the gene expression in the tumor differs from that of normal cells. If the p-value was equal to or smaller than 0.005 on either side, the gene was considered differentially expressed in the corresponding tumor. For genes whose expression does not follow a Gaussian distribution in normal samples due to low variance (i.e., below 0.1), we calculated the expression fold change (i.e., the gene expression in the tumor cell over the average gene expression in normal cells) and set a 3-fold change as the threshold to determine the DEGs in the tumor cell.

We used a 3-fold change cutoff because it represents a stringent threshold for determining a gene-expression on/off state. By applying both p-value calculations and 3-fold change calculations to the genes whose expression follows a Gaussian distribution in normal samples, we found that over 80% of the DEG events identified using fold changes are also recovered using the p-value threshold of 0.005. Furthermore, if a gene expression up-regulation co-occurs with its corresponding copy number amplification, or a gene expression down-regulation co-occurs with its corresponding copy number deletion, we removed it from the DEG list of the tumor, because its differential expression is more likely due to copy number alteration than to dysregulation. Moreover, we removed tissue-specific DEGs if they are highly correlated with cancer types or tissue origin (i.e., Pearson correlation coefficient larger than 0.9). In this way, we identified the DEGs for each tumor and created a tumor-gene binary matrix where 1 represents expression change and 0 represents no expression change.
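The DEG-calling rule described above can be sketched as follows. This is a simplified illustration with assumptions that are not stated in the text: the sample variance is used for the low-variance check, the 3-fold threshold is applied in both directions, and genes with a non-positive normal mean are not called by fold change. The function name `is_deg` and all expression values are hypothetical.

```python
from statistics import NormalDist, mean, variance


def is_deg(tumor_value, normal_values, p_cut=0.005, min_var=0.1, fold_cut=3.0):
    """Return True if the gene is called differentially expressed in a tumor."""
    mu = mean(normal_values)
    var = variance(normal_values)
    if var >= min_var:
        # Gaussian model of normal expression; test either tail.
        cdf = NormalDist(mu, var ** 0.5).cdf(tumor_value)
        return min(cdf, 1.0 - cdf) <= p_cut
    if mu <= 0:
        return False  # assumption: fold change is undefined, so no call
    fold = tumor_value / mu
    return fold >= fold_cut or fold <= 1.0 / fold_cut


# Hypothetical expression values for two genes in normal tissue.
normal_wide = [10.0, 11.0, 9.0, 10.5, 9.5]   # variance above 0.1
normal_narrow = [2.0, 2.1, 1.9, 2.0]         # variance below 0.1

# One row of the tumor-gene binary matrix (1 = expression change).
row = [
    int(is_deg(14.0, normal_wide)),
    int(is_deg(10.5, normal_wide)),
    int(is_deg(6.5, normal_narrow)),
    int(is_deg(4.0, normal_narrow)),
]
```

The copy-number and tissue-correlation filters described in the text would be applied to this matrix afterwards and are not shown.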

TDI method

The TDI method was designed to infer the functional-impact relationships between SGAs and DEGs for a given tumor using a CBN, in which causal edges are only allowed to point from SGAs to DEGs. It was assumed that each DEG is likely regulated by one aberrant pathway in an individual tumor and that such a pathway is likely perturbed by a single SGA, due to the well-known mutual exclusivity among SGAs perturbing a common pathway. Let T = {T_1, T_2, ..., T_t, ..., T_N} denote the tumor set, which contains a total of N tumor samples, where t indexes over the tumors included in the tumor set. Let SGA_SET_t = {A_1, A_2, ..., A_h, ..., A_m} denote a subset of m genes with genome alterations in tumor t (i.e., the SGAs), and let h index over the variables in SGA_SET_t. Let DEG_SET_t = {E_1, E_2, ..., E_i, ..., E_n} denote n genes that are differentially expressed in tumor t (i.e., the DEGs), and let i index over the variables in DEG_SET_t. A variable A_0 was further included to collectively represent factors other than SGAs (e.g., tumor microenvironment) that may affect gene expression. Therefore, for a tumor t, the TDI method searches for a CBN model M that best explains the data and assigns one A_h in SGA_SET_t as the most probable cause for each E_i in DEG_SET_t. In model M, a given A_h can have zero, one, or more arcs emanating from it to the variables in DEG_SET_t. For each tumor from the TCGA dataset (D), TDI searches for the CBN with a maximal posterior probability based on Bayes rule:

$$P(M \mid D) \propto P(D \mid M)\, P(M)$$

where P(D | M) is the marginal likelihood of causal network structure M, and P(M) is the prior probability of M.
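Because each DEG receives exactly one incoming edge, and assuming (as the paragraphs on the model prior and the marginal likelihood state) that both P(M) and P(D | M) factorize over edges, the maximum-posterior network can be found one DEG at a time. The sketch below illustrates this reasoning only; the function name `map_network`, the labels, and all numbers are hypothetical.

```python
from math import log


def map_network(edge_terms):
    """Pick the most probable cause for each DEG in one tumor.

    edge_terms maps each DEG to {candidate cause: (edge prior, edge marginal
    likelihood)}. Under edge-wise factorization, maximizing P(D | M) P(M)
    over whole networks reduces to an independent argmax per DEG.
    """
    network = {}
    for deg, candidates in edge_terms.items():
        network[deg] = max(
            candidates,
            key=lambda c: log(candidates[c][0]) + log(candidates[c][1]),
        )
    return network


# Hypothetical edge terms; "A0" stands for the non-SGA factor.
edge_terms = {
    "E1": {"A0": (0.10, 1e-12), "A1": (0.60, 1e-10), "A2": (0.30, 1e-9)},
    "E2": {"A0": (0.10, 1e-8), "A1": (0.60, 1e-9), "A2": (0.30, 1e-11)},
}
network = map_network(edge_terms)
```

In this toy input, A1 has the largest prior for both DEGs but is not selected for either, because the likelihood term outweighs it.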

The tumor-specific nature of the TDI method is reflected in the following aspects: 1) Each tumor has a unique SGA_SET_t and DEG_SET_t; thus, the CBN structure M inferred by the TDI method is tumor-specific. 2) The prior probability of a model P(M) (i.e., P(A_h → E_i)) is tumor-specific. 3) Calculation of the marginal likelihood P(D | A_h → E_i) consists of two components: one component is computed using the data from the tumors with A_h = 1 (a.k.a. "tumors like me"), and the other component is derived using a global model. As such, the posterior probability of the edge A_h → E_i is specific to a given tumor, and the same edge A_h → E_i may have different posterior probabilities in different tumors depending on the composition of SGA_SET_t.

Tumor-specific model priors P(M). Assuming the structure prior is modular, we factorize the probability P(M) as a product of prior probabilities for each edge included in M, i.e.,

$$P(M) = \prod_{(A_h \rightarrow E_i) \in M} P(A_h \rightarrow E_i)$$

While it would be ideal to define the prior probability for each edge A_h → E_i using specific prior knowledge, we usually have very limited information about it. Despite the lack of specific prior knowledge, we derive the prior probability P(M) by incorporating well-appreciated general knowledge (on which the frequency-oriented approaches are based) regarding the probability that a gene is a driver: the more frequently a gene is perturbed in a cancer cohort, the more likely it is a driver in an individual tumor. Furthermore, we also consider the tumor-specific context of the SGAs in each tumor.

We designed a Bayesian method that incorporates all the above considerations. For a tumor t and an arbitrary DEG E_i, we defined the prior probability of A_h being a parent of E_i using a multinomial distribution with a parameter vector θ = (θ_0, θ_1, ..., θ_m), where the elements of θ sum to 1. Here, θ_0 is a user-defined parameter representing the prior belief that the non-SGA factor A_0 is the cause of E_i, and θ_h represents the prior probability of A_h being the cause of E_i. We assumed that θ follows a Dirichlet distribution with parameter vector α = (α_0, α_1, ..., α_m), where α is a tumor-specific Dirichlet parameter vector governing the distribution of θ in a tumor t. For a tumor t, we calculated the prior probability as follows:

where h' indexes over the m variables in SGA_SET_t; α_h is a Dirichlet parameter; U_h denotes tumor genomes in a reference set of tumor genomes, D, for which A_h = 1; and m_t' denotes the number of variables in SGA_SET_t' for tumor t'.

When additional genomic information, such as the number of unique synonymous mutations in a particular gene in a reference set of tumor genomes D and the number of somatic copy number alterations in the particular gene in subjects without cancer, was available for each gene A_h, we applied the additional genomic information to account for mutations and copy number alterations that are due to differences in gene lengths and chromosome locations and derived alternative versions of the Dirichlet parameter. Thus, where the number of unique synonymous mutations in gene A_h in a reference set of tumor genomes, D, has a known value (a1), and the number of somatic copy number alterations in gene A_h in subjects without cancer has a known value (a2), the Dirichlet parameter is calculated as:

wherein U_h denotes tumor genomes from a reference set of tumor genomes, D, that have a somatic alteration in gene A_h; m_t' denotes the number of genes with SGAs in the genome of tumor t'; and w_ht' denotes a weight proportional to the probability that the somatic alteration in gene A_h is the SGA with functional impact in the genome of tumor t'. w_ht' is calculated as:

The quantities in the above equations are: (i) an indicator of whether gene A_h has a non-synonymous somatic mutation or not in tumor t' (e.g., it can be 1 or 0, respectively); (ii) the number of unique synonymous somatic mutation events in gene A_h in the reference set of tumor genomes; (iii) an indicator of whether gene A_h has a somatic copy number alteration or not in tumor t' (1 or 0, respectively); and (iv) the expected number of times gene A_h has a somatic copy number alteration among the tumors in D and yet is only a passenger alteration (that is, has no functional impact), based on the number of times gene A_h has a somatic copy number alteration in a reference set of genomes from subjects without known cancer (e.g., subjects in the normal human population).

Marginal likelihood function P(D | M). The specifications of the TDI method allow the marginal likelihood of a model M to be decomposed into a product of the marginal likelihoods of all edges in M. We used the Bayesian BDeu scoring measure to derive P(D | M) in closed form:

In this expression, j indexes over the states of A_h, which is being considered as the cause of E_i; q_i is the number of possible values of A_h (in our case, it is 2, because A_h is modeled as a binary variable); k is an indicator variable that indexes over the states of E_i; r_i denotes the total number of possible states of E_i (in our case, it is set to 2); and Γ is the gamma function. The remaining terms are the parameters of the model M, the number of tumors in D in which node E_i has value k and its cause has the value denoted by j, and a parameter in a Dirichlet distribution that represents prior belief. A function that calculates the Bayesian score of the edge A_h → E_i in tumor t can then be defined as follows:

The tumor-specific calculation of the marginal likelihood uses the data that are most relevant to the given tumor (a.k.a. "tumors like me") to infer whether the hypothesis of a candidate causal edge A_h → E_i is supported among these tumors. To this end, we modified the Bayesian scoring function (Eq (3)) by dividing the training data D into a tumor-specific subset (the subset of tumors with A_h = 1) and a global training set (the subset in which the variance of E_i is explained by a global cause). Let D¹ denote the subset of tumors in which A_h = 1 and D⁰ denote the subset of tumors in which A_h = 0, such that D = D⁰ ∪ D¹. Let G(i) represent the SGA with the maximal Bayesian score for E_i, derived from Eq (4) at the whole-cohort level (where G(i) is referred to as the "global driver"), and let a corresponding parameter denote the prior probability of G(i) as the cause for any E_i in tumor t. The Bayesian scores for the tumor set with A_h = 1 and for the tumor set with A_h = 0 can be calculated, respectively, as follows,

where the counts for D¹ are the numbers of tumors in D¹ in which E_i has value k and A_h has the value indexed by j, and the counts for D⁰ are the numbers of tumors in D⁰ in which E_i has value k and A_G(i) has the value indexed by j. Finally, the posterior probability of a causal edge A_h → E_i can be calculated with the following equation:
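As an illustration of the scoring idea in this section (and not a reproduction of the equations referenced above), the sketch below computes a textbook BDeu log score for a single edge from a table of counts, combines a "tumors like me" component with a global component, and normalizes prior-weighted scores into edge posteriors. The equivalent sample size `ess`, the summation of the two log components, the normalization step, and all function names are assumptions made for this example.

```python
from math import exp, lgamma


def bdeu_log_score(counts, ess=1.0):
    """Textbook BDeu log marginal likelihood of one edge.

    counts[j][k] is the number of tumors in which the candidate cause is in
    state j and the DEG is in state k (both binary in the setting above).
    """
    q, r = len(counts), len(counts[0])
    a_j, a_jk = ess / q, ess / (q * r)
    score = 0.0
    for row in counts:
        score += lgamma(a_j) - lgamma(a_j + sum(row))
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score


def tumor_specific_log_score(counts_like_me, counts_global, ess=1.0):
    """Assumed combination: local (A_h = 1) component plus global component."""
    return bdeu_log_score(counts_like_me, ess) + bdeu_log_score(counts_global, ess)


def edge_posteriors(log_scores, priors):
    """Normalize prior * marginal likelihood over the candidate causes."""
    top = max(log_scores.values())
    weights = {c: priors[c] * exp(log_scores[c] - top) for c in log_scores}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical counts: A1 tracks the DEG perfectly, A2 is unrelated to it.
strong = bdeu_log_score([[50, 0], [0, 50]])
weak = bdeu_log_score([[25, 25], [25, 25]])
posteriors = edge_posteriors({"A1": strong, "A2": weak}, {"A1": 0.5, "A2": 0.5})
```

Working in log space and subtracting the largest score before exponentiating avoids numerical underflow when the cohort contains thousands of tumors.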

Identification of SGA-FIs. Causal edges from different SGAs have different posterior probabilities, as expected. To standardize the interpretation of the significance of an edge probability P_e, a method was developed that uses a permutation test to determine the probability that an edge from an SGA can be assigned a given P_e or higher in data from permutation experiments, i.e., the p-value of the edge with a given P_e. The p-value in this setting is also the expected rate of false discovery of an SGA as the cause of a DEG by random chance. A series of experiments was performed in which the values of DEGs were permuted across tumors, so that the statistical relationship in the real data was disrupted, and then TDI was applied to both random and real datasets to evaluate the extent to which a random sample can lead to the false discovery of causal relationships. As FIG. 9 shows, no SGA was designated as an SGA-FI in the random dataset (when the number of driver-targeted DEGs was 5 or greater), indicating that false discovery is well controlled. Therefore, this property was utilized to control false discovery when identifying SGA-FIs in a tumor. An SGA event in a tumor was designated as an SGA-FI if it has 5 or more causal edges to DEGs that are each assigned a p-value < 0.001. The overall false discovery rate of the joint causal relationships between an SGA and 5 or more target DEGs is less than or equal to 10⁻¹⁵.
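The designation rule above can be sketched as follows. This is a minimal illustration: the empirical p-value is taken as the fraction of permutation-derived edge probabilities that reach the observed value, and the function names, labels, and numbers are hypothetical. If the five edges are treated as independent, five p-values below 0.001 give a joint bound of 0.001⁵ = 10⁻¹⁵, which matches the figure stated above.

```python
def empirical_p_value(observed, permuted_probabilities):
    """Fraction of permutation-derived edge probabilities >= the observed one."""
    hits = sum(1 for p in permuted_probabilities if p >= observed)
    return hits / len(permuted_probabilities)


def call_sga_fi(edge_p_values, p_cut=0.001, min_edges=5):
    """Return the SGAs of one tumor that qualify as SGA-FIs.

    edge_p_values: {sga: {deg: p-value of the causal edge sga -> deg}}.
    """
    return sorted(
        sga
        for sga, edges in edge_p_values.items()
        if sum(1 for p in edges.values() if p < p_cut) >= min_edges
    )


# Hypothetical edge p-values for two SGAs in one tumor.
edge_p_values = {
    "SGA_A": {"E1": 1e-4, "E2": 2e-4, "E3": 5e-4, "E4": 9e-4, "E5": 1e-5},
    "SGA_B": {"E6": 1e-4, "E7": 2e-4, "E8": 0.2, "E9": 0.01, "E10": 1e-5},
}
sga_fis = call_sga_fi(edge_p_values)
p_example = empirical_p_value(0.9, [0.1, 0.2, 0.95, 0.3])
```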

Cell culture and siRNA transfection. HGC27 (Sigma-Aldrich) and PC-3 (ATCC) cells were cultured according to the manufacturer's recommendations. The non-targeting and the CSMD3 and ZFHX4 siRNAs were obtained from OriGene (Rockville, MD). The siRNA sequences are as follows:

si-CSMD3-1, GGUAUAUUACGAAGAAUUGCAGAGT (SEQ ID NO: 1)

si-CSMD3-2, ACAAAUGGAGGAAUACUAACAACAG (SEQ ID NO: 2)

si-ZFHX4-1, CGAUGCUUCAGAAACAAAGGAAGAC (SEQ ID NO: 3)

si-ZFHX4-2, GGAACGACAGAGAAAUAAAGAUUCA (SEQ ID NO: 4)

The siRNAs were transfected into cells using DharmaFECT transfection reagents for 48 hrs according to the manufacturer's instructions.

RNA extraction and real-time RT-qPCR. Total RNA from cells was extracted using the RNeasy Mini kit (Qiagen, USA) following standard procedures. One microgram of total RNA was used to generate cDNA using the iScript cDNA synthesis kit. Real-time PCR was subsequently performed using the iTaq Universal SYBR Green Supermix (Bio-Rad) on the CFX96 Real-Time PCR Detection System (Bio-Rad, Richmond, CA, USA). Data were normalized using ribosomal protein L37A (RPL37A) as an internal control.

Cell proliferation and viability assays. Cell proliferation/viability was assayed by CCK-8 assay (Dojindo Laboratories, Kumamoto, Japan). Briefly, HGC27 and PC3 cells were plated at a density of 3 x 10³ cells/well in 96-well plates. After siRNA transfection for 3 or 6 days, CCK-8 solution containing a highly water-soluble tetrazolium salt WST-8 [2-(2-methoxy-4-nitrophenyl)-3-(4-nitrophenyl)-5-(2,4-disulfophenyl)-2H-tetrazolium, monosodium salt] was added to cells in each well, followed by incubation for 1-4 h. Cell viability was determined by measuring the O.D. at 450 nm. Percent over control was calculated as a measure of cell viability.

Transwell migration assay. Cell migration was measured using 24-well transwell chambers with 8 µm pore polycarbonate membranes (Corning, Corning, NY). siRNA-transfected cells were seeded at a density of 7.5 x 10⁴ cells/ml to the upper chamber of the transwell chambers in 0.5 ml growth media with 0.1% FBS. The lower chamber contained 0.9 ml of growth medium with 20% FBS as chemoattractant media. After 20 hrs of culture, the cells in the upper chamber that did not migrate were gently wiped away with a cotton swab, and the cells that had moved to the lower surface of the membrane were stained with crystal violet and counted from five random fields under a light microscope.

Apoptotic assay. Apoptosis was assessed by flow cytometry analysis of annexin V and propidium iodide (PI) double-stained cells using the Vybrant Apoptosis Assay Kit (Thermo Fisher Scientific, Carlsbad, CA). Briefly, the cells, after washing with PBS, were incubated in annexin V/PI labeling solution at room temperature for 10 min, then analyzed in the BD FACSCalibur flow cytometer (Becton, Dickinson and Company, Franklin Lakes, NJ).

In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are only preferred examples and should not be taken as limiting the scope of the disclosure. Rather, the scope of the disclosure is at least as broad as the following claims. We therefore claim all that comes within the scope of the following claims.

Claims

It is claimed:

1. A computer-implemented method for identifying a somatic genome alteration (SGA) with functional impact in a genome of a tumor t, comprising:

receiving a set of SGAs in the genome of t;

receiving a set of differentially expressed genes (DEGs) in the genome of t;

generating a bipartite causal Bayesian network (CBN) with maximal posterior probability, wherein the CBN with maximal posterior probability comprises causal edges pointing from SGAs in the set of SGAs to DEGs in the set of DEGs with a single causal edge pointing to each DEG; and

identifying a SGA from which five or more causal edges in the CBN point to DEGs as the SGA with functional impact in the genome of the tumor t.

2. The method of claim 1, wherein generating the CBN with maximal posterior probability comprises:

generating a plurality of test CBNs for the set of SGAs and the set of DEGs, determining a posterior probability for each test CBN in the plurality, and identifying the test CBN with the maximal posterior probability.

3. The method of claim 2, wherein determining the posterior probability for a test CBN in the plurality comprises calculating P(D | M) P(M), wherein:

P(D | M) is a marginal likelihood of the test CBN; and

P(M) is a prior probability of the test CBN.

4. The method of claim 3, wherein P(M) comprises a product of the frequencies at which the SGAs of the test CBN are present in a reference set of tumor genomes, D.

5. The method of claim 3, wherein P(M) comprises a product of prior probabilities of causal edges of the test CBN, wherein P(A_h → E_i) represents a prior probability of a causal edge from a somatic alteration of gene A_h to DEG E_i, which is abbreviated as , wherein

wherein

is a prior probability that the cause of DEG E_i is not a SGA;

h' indexes over the number m of genes in tumor t that have SGAs in gene h'; and

is a Dirichlet parameter set forth as or

wherein U_h denotes tumor genomes from a reference set of tumor genomes, D, that have a somatic alteration in gene A_h;

denotes the number of genes with SGAs in the genome of tumor t'; and

denotes a weight proportional to the probability that the somatic alteration in gene A_h is the SGA with functional impact in the genome of tumor t'.

6. The method of claim 5, wherein

is set forth as:

and wherein:

denotes whether gene A_h has a non-synonymous somatic mutation or not in tumor t';

denotes the number of unique synonymous somatic mutations in gene A_h in the reference set of tumor genomes, D;

denotes whether gene A_h has a somatic copy number alteration or not in tumor t'; and

denotes an expected number of times that gene A_h has a somatic copy number alteration without a functional impact in the reference set of tumor genomes, D, based on a corresponding number of times that gene A_h has a somatic copy number alteration in a reference set of genomes from subjects without known cancer.

7. The method of any of claims 3-6, wherein P(D | M) comprises a product of marginal likelihoods of causal edges of the test CBN for tumor t, and calculating said product comprises:

calculating a marginal likelihood of each causal edge A_h → E_i of the test CBN based on tumor genomes in D that have the somatic alteration of gene A_h and tumor genomes in D that do not have the somatic alteration of gene A_h.

8. The method of any of claims 3-6, wherein P(D | M) comprises a product of posterior probabilities of causal edges of the test CBN, and calculating said product comprises:

calculating a posterior probability of each causal edge A_h → E_i of the CBN as:

wherein:

D¹ denotes tumor genomes having the somatic alteration of gene A_h that are present in D;

D⁰ denotes tumor genomes that do not have the somatic alteration of gene A_h that are present in D;

G(i) denotes a gene that has a maximal posterior probability for causing E_i in D, which is calculated by finding the gene g that maximizes the following function and assigning G(i) to be g:

wherein

is the number of tumor genomes in D, wherein node E_i has value k and its modeled cause A_g has the value denoted by j, and

denotes the prior probability that A_g is the cause for E_i for the tumor genomes D; and

j indexes over the states of A_g, which is being considered as the cause of E_i;

q_i is 2;

k is an indicator variable which indexes over the states of E_i;

r_i is 2;

is a parameter in a Dirichlet distribution that is proportional to the prior probability of

Γ is the gamma function;

is the number of tumor genomes in D¹ for which E_i has value k and A_h has the value indexed by j;

is the number of tumor genomes in D⁰ for which E_i has value k and A_G(i) has the value indexed by j;

and wherein

wherein:

is a prior probability that something other than a SGA is the cause of DEG E_i;

h' indexes over the number m of genes in tumor t that have SGAs in gene h'; and

is a Dirichlet parameter set forth as

or

wherein U_g denotes tumor genomes from a reference set of tumor genomes, D, that have a somatic alteration in gene A_g, m_t' denotes the number of genes with somatic alterations in the genome of tumor t', and w_gt' denotes a weight proportional to the probability that the somatic alteration in gene A_g is the SGA with functional impact in the genome of tumor t'.

9. The method of claim 8, wherein

is set forth as:

and wherein:

denotes whether gene A_g has a non-synonymous somatic mutation or not in tumor t';

denotes the number of unique synonymous mutations in gene A_g in the reference set of tumor genomes, D;

denotes whether gene A_g has a somatic copy number alteration or not in tumor t'; and

denotes an expected number of times that gene A_g has a somatic copy number alteration without a functional impact in the reference set of tumor genomes, D, based on a corresponding number of times that gene A_g has a somatic copy number alteration in a reference set of genomes from subjects without known cancer.

10. The method of any of the prior claims, wherein the five or more causal edges pointing from the SGA identified as the SGA with functional impact to DEGs in the set of DEGs are each assigned a p-value of less than 0.001 relative to a distribution of random posterior probabilities of causal edges pointing from the given SGA to DEGs in a control data set.

11. The method of any of the prior claims, wherein the SGAs in the set of SGAs comprise or consist of somatic mutations and somatic copy number alterations.

12. The method of any of the prior claims, wherein the genes in the set of DEGs comprise:

genes of the tumor genome with an RNA expression level having a p-value of 0.005 or less for a Gaussian distribution of control RNA expression levels of corresponding genes in non-tumor tissue; and/or

genes of the tumor genome with a 3-fold or greater change in RNA expression level compared to control RNA expression levels of corresponding genes in non-tumor tissue.
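By way of non-limiting illustration, the two DEG criteria of claim 12 can be sketched as follows. The sketch assumes a two-sided Gaussian test, a sample standard deviation estimated from the control values, and expression values on a positive linear scale; none of these choices, nor the function names, are specified by the claim.

```python
import math


def gaussian_two_sided_p(x, mean, std):
    """Two-sided p-value of x under a Gaussian with the given mean and std."""
    z = abs(x - mean) / std
    return math.erfc(z / math.sqrt(2.0))


def is_deg(tumor_expr, control_values, p_cutoff=0.005, fold_cutoff=3.0):
    """Apply the p-value criterion and/or the fold-change criterion to one gene."""
    n = len(control_values)
    mean = sum(control_values) / n
    # Sample variance of the control (non-tumor) expression levels.
    var = sum((v - mean) ** 2 for v in control_values) / (n - 1)
    std = math.sqrt(var)
    p_ok = std > 0 and gaussian_two_sided_p(tumor_expr, mean, std) <= p_cutoff
    # Fold change is checked in both directions (up- or down-regulation).
    fold_ok = (
        mean > 0
        and tumor_expr > 0
        and (tumor_expr / mean >= fold_cutoff or mean / tumor_expr >= fold_cutoff)
    )
    return p_ok or fold_ok
```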

13. The method of any of the prior claims, further comprising:

obtaining a tumor sample from a subject;

identifying DEGs in the genome of the tumor sample; and

identifying SGAs in the genome of the tumor sample;

to provide the set of SGAs and the set of DEGs.

14. The method of any of the prior claims, wherein the tumor genome is from a tumor sample from a subject, and wherein the method further comprises treating the subject based on the identified SGAs with functional impact.

15. The method of any one of the prior claims, wherein the SGA with functional impact is a driver SGA.

16. One or more computer-readable media comprising computer-executable instructions that when executed cause a computing system to perform the method of any of claims 1-15.

17. A computing system, comprising:

one or more processors;

memory; and

a classification tool configured to:

receive a set of somatic genome alterations (SGAs) in the genome of a tumor t;

receive a set of differentially expressed genes (DEGs) in the genome of t;

generate a bipartite causal Bayesian network (CBN) with maximal posterior probability, wherein the CBN with maximal posterior probability comprises causal edges pointing from SGAs in the set of SGAs to DEGs in the set of DEGs with a single causal edge pointing to each DEG; and

classify one or more of the SGAs from which five or more causal edges point to DEGs as SGAs with functional impact.
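By way of non-limiting illustration, the classification step of claim 17 can be sketched as follows. The sketch assumes that the network score factorizes over DEGs, so that the bipartite network with a single causal edge per DEG can be assembled by choosing the highest-scoring candidate SGA for each DEG independently. That factorization, the input layout, and the function name are assumptions of this sketch.

```python
from collections import Counter


def classify_sga_fis(edge_posteriors, min_edges=5):
    """Pick one cause per DEG, then flag SGAs with at least `min_edges` outgoing edges.

    edge_posteriors: {deg: {sga: posterior probability of the edge sga -> deg}}
    Returns (best_cause, sga_fis), where best_cause maps each DEG to its chosen SGA.
    """
    best_cause = {}
    for deg, candidates in edge_posteriors.items():
        if candidates:
            # A single causal edge points to each DEG: keep the best candidate.
            best_cause[deg] = max(candidates, key=candidates.get)
    edge_counts = Counter(best_cause.values())
    sga_fis = sorted(sga for sga, count in edge_counts.items() if count >= min_edges)
    return best_cause, sga_fis
```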

18. The computing system of claim 17, wherein the classification tool is further configured to generate the CBN with maximal posterior probability by:

generating a plurality of test CBNs for the set of SGAs and the set of DEGs;

determining a posterior probability for each test CBN in the plurality; and

identifying the test CBN with the maximal posterior probability.

19. The computing system of claim 18, wherein the classification tool is further configured to determine the posterior probability for each test CBN in the plurality by calculating

P(D | M) × P(M), wherein:

P(D | M) is a marginal likelihood of the test CBN; and

P(M) is a prior probability of the test CBN.
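By way of non-limiting illustration, the search of claims 18 and 19 can be sketched as an exhaustive enumeration of test CBNs, each scored as the product of a marginal likelihood and a prior, computed here as a sum of logarithms for numerical stability. The two scoring callables are hypothetical placeholders supplied by the caller, and exhaustive enumeration is an assumption of this sketch that is practical only for very small inputs, since the number of test CBNs grows as the number of SGAs raised to the number of DEGs.

```python
import itertools
import math


def best_test_cbn(sgas, degs, log_marginal_likelihood, log_prior):
    """Return the test CBN (a DEG -> SGA mapping) with the maximal posterior score.

    log_marginal_likelihood(model) and log_prior(model) are caller-supplied
    functions returning the log marginal likelihood and log prior of a test CBN.
    """
    best_model, best_score = None, -math.inf
    # Each test CBN assigns exactly one causal SGA to every DEG.
    for causes in itertools.product(sgas, repeat=len(degs)):
        model = dict(zip(degs, causes))
        score = log_marginal_likelihood(model) + log_prior(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```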

20. The computing system of claim 19, wherein P(M) comprises a product of the frequencies at which the SGAs of the test CBN are present in a reference set of tumor genomes, D.

21. The computing system of claim 19, wherein P(M) comprises a product of prior probabilities of causal edges of the test CBN, wherein P(A_h → E_i) represents a prior probability of a causal edge from a somatic alteration of gene A_h to DEG E_i, which is abbreviated as , wherein

wherein

is a prior probability that the cause of DEG E_i is not a SGA;

indexes over the number m of genes in tumor t that have SGAs in gene h'; and

is a Dirichlet parameter set forth as

or

denotes the number of genes with SGAs in the genome of tumor t'; and w_ht' denotes a weight proportional to the probability that the somatic alteration in gene A_h is the SGA with functional impact in the genome of tumor t'.

22. The computing system of claim 21, wherein

w_ht' is set forth as:

is set forth as:

and wherein:

denotes whether gene A_h has a somatic copy number alteration or not in tumor t'; and

denotes an expected number of times that gene A_h has a somatic copy number alteration without a functional impact in the reference set of tumor genomes, D, based on a corresponding number of times that gene A_h has a somatic copy number alteration in a reference set of genomes from subjects without known cancer.

23. The computing system of any of claims 19-22, wherein P(D | M) comprises a product of marginal likelihoods of causal edges of the test CBN for tumor t, and calculating said product comprises:

calculating a marginal likelihood of each causal edge A_h → E_i of the test CBN based on tumor genomes in D that have the somatic alteration of gene A_h and tumor genomes in D that do not have the somatic alteration of gene A_h.

24. The computing system of any of claims 19-22, wherein P(D | M) comprises a product of posterior probabilities of causal edges of the test CBN, and calculating said product comprises

calculating a posterior probability of each causal edge A_h → E_i of the CBN as:

wherein:

wherein:

>^? denotes tumor genomes having the somatic alteration of gene A_h that are present in D, > denotes tumor genomes that do not have the somatic alteration of gene A_h that are present in D;

G(i) denotes a gene that has a maximal posterior probability for causing E_i in D, which is calculated by finding the gene g that maximizes the following function and assigning G(i) to be g:

wherein

N_ijk is the number of tumor genomes in D wherein node E_i has value k and its cause A_g has the value denoted by j, and

g denotes the prior probability that A_g is the cause for E_i for the tumor genomes D; and j indexes over the states of A_g, which is being considered as the cause of E_i;

q_i is 2;

k is an indicator variable which indexes over the states of E_i; is 2;

Γ is the gamma function;

G^? _H is the number of tumor genomes in > ⁷ for which E_i has value k and A_h has the value indexed by j;

G H is the number of tumor genomes in > for which E_i has value k and A_G(i) has the value indexed by j;

and wherein

wherein

is a prior probability that something other than a SGA is the cause of DEG E_i; h' indexes over the number m of genes in tumor t that have SGAs in gene h'; and

K is a Dirichlet parameter set forth as

or

wherein U_g denotes a reference set of tumor genomes that have a somatic alteration in gene A_g, m_t' denotes the number of genes with somatic alterations in the genome of tumor t'; and w_gt' denotes a weight proportional to the probability that the somatic alteration in gene A_g is the SGA with functional impact in the genome of tumor t'.
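By way of non-limiting illustration, claim 24 refers to a gamma function, Dirichlet parameters, and counts N_ijk over two-state variables, which is the vocabulary of a Bayesian Dirichlet marginal likelihood. The equations of the claim are not reproduced in this text, so the sketch below shows the standard Bayesian Dirichlet form for a single edge and is not asserted to be the claimed expression. The function name and the nested-list layout of the counts and Dirichlet parameters are assumptions of this sketch.

```python
from math import lgamma


def log_bd_marginal_likelihood(counts, alphas):
    """Log marginal likelihood of one DEG given one candidate cause.

    counts[j][k]: number of tumor genomes in which the cause has state j
                  and the DEG has state k (the N_ijk of the claim).
    alphas[j][k]: Dirichlet parameter for the same cell (assumed layout).
    """
    total = 0.0
    for n_j, a_j in zip(counts, alphas):
        a_sum, n_sum = sum(a_j), sum(n_j)
        # Normalizing term for parent state j.
        total += lgamma(a_sum) - lgamma(a_sum + n_sum)
        # One term per state k of the DEG.
        for n, a in zip(n_j, a_j):
            total += lgamma(a + n) - lgamma(a)
    return total
```

Working with log-gamma values rather than gamma values avoids overflow when the counts reach the hundreds or thousands of tumor genomes typical of a reference set.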

25. The computing system of claim 24, wherein w_gt' is set forth as:

is set forth as:

and wherein:

denotes whether gene A_g has a non-synonymous somatic mutation or not in tumor t';

denotes the number of unique synonymous mutations in gene A_g in the reference set of tumor genomes, D;

denotes whether gene A_g has a somatic copy number alteration or not in tumor t'; and

denotes an expected number of times that gene A_g has a somatic copy number alteration without a functional impact in the reference set of tumor genomes, D, based on a corresponding number of times that gene A_g has a somatic copy number alteration in a reference set of genomes from subjects without known cancer.

26. The computing system of any of claims 17-25, wherein the five or more causal edges pointing from the SGA identified as the SGA with functional impact to DEGs in the set of DEGs are each assigned a p-value of less than 0.001 relative to a distribution of random posterior probabilities of causal edges pointing from the given SGA to DEGs in a control data set.

27. The computing system of any of claims 17-25, wherein the SGAs in the set of SGAs comprise or consist of mutations and copy number variations.

28. The computing system of any of claims 17-25, wherein the genes in the set of DEGs comprise:

genes of the tumor genome with an RNA expression level having a p-value of 0.005 or less for a Gaussian distribution of control RNA expression levels of corresponding genes in non-tumor tissue; and/or