WO2022185028A1 - Evaluation framework for target identification in precision medicine

Info

Publication number: WO2022185028A1
Application number: PCT/GB2022/050440
Authority: WO
Inventors: Alex DEGIORGIO; Harry Rose; Meltem GUREL; Paidi Creed; Gregor LUEG
Original assignee: Benevolentai Technology Limited
Priority date: 2021-03-02
Filing date: 2022-02-18
Publication date: 2022-09-09
Also published as: GB202102948D0

Abstract

A computer-implemented method for evaluating a target identification workflow in precision medicine is provided. The target identification workflow comprises: an endotype detection module configured to detect endotypes from cohort data, and a target prediction module configured to predict targets for each of the endotypes. The method comprises: mapping endotypes detected by the endotype detection module to assays; assessing targets predicted by the target prediction module for endotype specificity; and evaluating the workflow for its ability to predict endotype specific targets. It is intended that the abstract, when published, will be accompanied by Figure 6.

Description

EVALUATION FRAMEWORK FOR TARGET IDENTIFICATION IN PRECISION MEDICINE

[0001] The present application relates to systems and methods for evaluating the performance of a drug discovery workflow. The presently disclosed techniques find particular application in the field of precision medicine where there is a need to identify drug targets for specific endotypes of a disease.

Background

[0002] Diseases typically have subtypes with distinct observable traits called phenotypes. When differences in traits can be linked to distinct underlying patho-biological mechanisms, disease subtypes are referred to as endotypes. In precision medicine, the aim is to find therapies that are particularly well suited to a specific endotype of a disease based on its underlying mechanism. As such, a key step in precision medicine drug discovery programmes is to identify endotype-specific therapeutic targets.

[0003] Various methods exist for identifying endotype-specific targets. These typically use machine learning methods to process large biological data sets from patient cohorts in order to identify endotypes and predict endotype-specific targets. However, there is a lack of suitable evaluation techniques for assessing the ability of precision medicine workflows to accurately predict endotype-specific targets.

[0004] Accordingly, there is a need for a suitable evaluation framework for assessing the ability of precision medicine workflows to accurately predict endotype-specific targets.

[0005] The embodiments described below are not limited to implementations which solve any or all of the disadvantages of the known approaches described above.

Summary

[0006] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to determine the scope of the claimed subject matter.

[0007] In a first aspect, the present disclosure provides a computer-implemented method for evaluating a target identification workflow in precision medicine, the target identification workflow comprising an endotype detection module configured to detect endotypes from cohort data relating to individuals of a cohort and a target prediction module configured to predict targets for each of the endotypes, the method comprising: mapping endotypes detected by the endotype detection module to assays; assessing targets predicted by the target prediction module for endotype specificity; and evaluating the workflow for its ability to predict endotype specific targets.

[0008] Optionally, the cohort data comprises one or more of Bulk RNA-seq, scRNA-seq, and methylation data. Optionally, the target identification workflow is configured to detect endotypes by assigning individuals of the cohort to subgroups. Optionally, the target identification workflow is configured to detect endotypes by generating a gene signature for each subgroup. Optionally, the target identification workflow is configured to predict targets by using PPI networks. Optionally, the target identification workflow is configured to predict targets by a method comprising one or more of: ranking genes according to their weight in a gene signature of a detected endotype; performing upstream regulator analysis; and performing directed upstream regulator analysis.

[0009] Optionally, mapping the endotypes to assays comprises comparing gene signatures of the endotypes to gene signatures of the assays. Optionally, the method comprises computing affinity scores, each affinity score representing a similarity between a gene signature of an endotype and a gene signature of an assay. Optionally, computing the affinity scores comprises using a Nearest Template Mapping method. Optionally, computing the affinity scores comprises using a Single Sample Scoring method. Optionally, the Single Sample Scoring method comprises Single Sample Gene Set Enrichment Analysis.

Optionally, assessing the targets for endotype specificity comprises: using assay readouts that indicate an extent to which a perturbation of a target in an assay associated with an endotype has an effect on a phenotype of interest. Optionally, assessing the targets for endotype specificity comprises: using affinity scores that indicate an association between an endotype and an assay. Optionally, assessing the targets for endotype specificity comprises: calculating an endotype specificity score for a target in relation to an endotype by using the assay readouts and the affinity scores to determine an extent to which perturbation of the target has a greater effect on a phenotype of interest in assays associated with the endotype compared to assays not associated with the endotype. Optionally, calculating the endotype specificity score for the target in relation to the endotype comprises: determining an average assay readout for the target in assays associated with the endotype; determining an average assay readout for the target in assays not associated with the endotype; and determining a difference between the averages.

[0010] Optionally, evaluating the workflow for its ability to predict endotype specific targets comprises: calculating an average endotype specificity score of an endotype of interest. Optionally, the method comprises arranging the workflow in a plurality of configurations and evaluating the workflow for its ability to predict endotype specific targets in each of the plurality of configurations. Optionally, the method comprises determining an optimum configuration of the workflow for predicting endotype specific targets.
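By way of a non-limiting illustration, the evaluation across configurations might be sketched as follows. The sketch assumes that the average endotype specificity score is taken over the targets predicted for the endotype of interest, and that the optimum configuration is the one with the highest average; the function names, configuration labels and score values are hypothetical and are not taken from the disclosure.

```python
from statistics import mean

def workflow_score(predicted_targets, specificity_scores):
    """Average endotype specificity score over the predicted targets that have a score."""
    scored = [specificity_scores[t] for t in predicted_targets if t in specificity_scores]
    return mean(scored) if scored else None

def best_configuration(predictions_by_config, specificity_scores):
    """Score every configuration and return the highest-scoring one (assumed optimum)."""
    scores = {}
    for config, targets in predictions_by_config.items():
        score = workflow_score(targets, specificity_scores)
        if score is not None:  # skip configurations with no scored targets
            scores[config] = score
    return max(scores, key=scores.get), scores

# Made-up specificity scores for one endotype of interest.
specificity = {"G1": 0.5, "G2": 0.1, "G3": -0.2, "G4": 0.3}
# Made-up target predictions from two hypothetical workflow configurations.
predictions = {
    "clustering+ranking": ["G1", "G2"],
    "latent_factor+ppi": ["G1", "G4"],
}
best, config_scores = best_configuration(predictions, specificity)
```

Other aggregation choices, such as weighting targets by their rank in the prediction, would fit the same structure.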

[0011] In a second aspect, the present disclosure provides a computer-readable medium storing code that, when executed by a computer, causes the computer to perform any method provided by the present disclosure.

[0012] In a third aspect, the present disclosure provides a system for evaluating a target identification workflow in precision medicine, the target identification workflow comprising an endotype detection module configured to detect endotypes from cohort data relating to individuals of a cohort and a target prediction module configured to predict targets for each of the endotypes, the system comprising: an endotype mapping module configured to map endotypes detected by the endotype detection module to assays; a target assessment module configured to assess targets predicted by the target prediction module for endotype specificity; and a workflow evaluation module configured to evaluate the workflow for its ability to predict endotype specific targets.

[0013] Optionally, the endotype mapping module is configured to compare gene signatures of the endotypes to gene signatures of the assays. Optionally, the endotype mapping module is configured to compute affinity scores, each affinity score representing a similarity between a gene signature of an endotype and a gene signature of an assay. Optionally, the target assessment module is configured to use assay readouts that indicate an extent to which a perturbation of a target in an assay associated with an endotype has an effect on a phenotype of interest. Optionally, the target assessment module is configured to use affinity scores that indicate an association between an endotype and an assay. Optionally, the target assessment module is configured to calculate an endotype specificity score for a target in relation to an endotype by using the assay readouts and the affinity scores to determine an extent to which perturbation of the target has a greater effect on a phenotype of interest in assays associated with the endotype compared to assays not associated with the endotype.

[0014] The methods described herein may be performed by software in machine readable form on a tangible storage medium e.g. in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer and where the computer program may be embodied on a computer readable medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory cards etc. and do not include propagated signals. The software can be suitable for execution on a parallel processor or a serial processor such that the method steps may be carried out in any suitable order, or simultaneously.

[0015] This application acknowledges that firmware and software can be valuable, separately tradable commodities. It is intended to encompass software, which runs on or controls “dumb” or standard hardware, to carry out the desired functions. It is also intended to encompass software which “describes” or defines the configuration of hardware, such as HDL (hardware description language) software, as is used for designing silicon chips, or for configuring universal programmable chips, to carry out desired functions.

[0016] The preferred features may be combined as appropriate, as would be apparent to a skilled person, and may be combined with any of the aspects of the invention.

Brief Description of the Drawings

[0017] Embodiments of the invention will be described, by way of example, with reference to the following drawings, in which:

Figure 1 is a flow chart showing a method of predicting endotype-specific targets according to a precision medicine workflow;

Figure 2 is a block diagram of a system for carrying out the method of Figure 1;

Figure 3 is a flow chart showing optional steps of the method of Figure 1;

Figure 4 is a flow chart showing a method of evaluating a precision medicine workflow according to an embodiment of the invention;

Figure 5 is a block diagram of a system for carrying out the method of Figure 4;

Figure 6 is a flow chart showing a combined method of predicting endotype-specific targets according to a precision medicine workflow and evaluating the precision medicine workflow according to an embodiment of the invention;

Figure 7 is a block diagram of a system for carrying out the method of Figure 6;

Figure 8 is a flow chart showing optional steps of the method of Figure 4;

Figure 9 is a flow chart showing an extended method of predicting endotype-specific targets according to a precision medicine workflow, evaluating the precision medicine workflow according to an embodiment of the invention, and determining an optimum configuration of the precision medicine workflow; and

Figure 10 is a block diagram of computer hardware suitable for implementing embodiments of the invention.

[0018] Common reference numerals are used throughout the figures to indicate similar features.

Detailed Description

[0019] Embodiments of the present invention are described below by way of example only. These examples represent the best ways of putting the invention into practice that are currently known to the Applicant although they are not the only ways in which this could be achieved. The description sets forth the functions of the example and the sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different examples.

[0020] Figure 1 shows a method 100 of predicting endotype-specific targets according to a precision medicine workflow that can be evaluated in accordance with the invention. The method 100 comprises receiving 102 cohort data relating to features such as biological features of individuals of a cohort. The cohort of individuals may comprise patients or other individuals having a disease of interest. It is this data about the cohort of individuals that the precision medicine workflow is configured to use to predict therapeutic targets for specific endotypes of the disease. When the appropriate cohort data has been received, the method 100 proceeds to detecting 104 endotypes of the disease of interest from the cohort data. This step requires analysis of the cohort data using an algorithm or other suitable technique that enables the cohort to be separated into individuals having different endotypes of the disease. Finally, the method 100 comprises predicting 106 targets for each of the endotypes. This step involves using a representation of an endotype such as a gene signature of the endotype together with other resources such as protein-protein interaction (PPI) networks to generate predictions of therapeutic targets for the endotype.

[0021] Figure 2 shows a system 200 for carrying out the method of Figure 1. The system 200 comprises an input module 202 configured to receive cohort data, an endotype detection module 204 configured to detect endotypes from the cohort data, and a target prediction module 206 configured to predict targets for each of the endotypes.

[0022] Figure 3 shows a method 300 that includes several optional steps of the method of Figure 1. Referring to Figure 3, the method 300 comprises the step of receiving 102 cohort data. The cohort data may suitably comprise a large, multidimensional biological dataset that captures individual variability in biological, clinical and/or environmental factors that are or may be relevant to at least one disease of interest. In suitable examples, the cohort is clinically heterogeneous, typically suggesting the presence of multiple endotypes. In order to maximise the prospects of finding well separated endotypes with cohesive members, cohorts with sufficient breadth and depth of cohort data are required.

[0023] Cohort data may be sourced from a variety of datasets. Depending on the disease, there may be an established dataset with relatively high sample sizes. For example, this is the case for many neoplastic diseases. Extensive, public datasets that may provide sources of cohort data include The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) dataset, and similar.

[0024] The cohort data may comprise various types of data. The type or types of data used to represent each individual of the cohort should enable the identification of endotypes by exhibiting some low dimensional structure that can be captured by precision medicine models. Data types may comprise bulk ribonucleic acid sequences (Bulk RNA-seq), single cell ribonucleic acid sequences (scRNA-seq), and methylation data or any other type of biological or clinical data derived from individuals of a cohort such as a cohort of patients. As such, the step of receiving 102 cohort data may comprise one or more of the steps of: receiving cohort data comprising Bulk RNA-seq 302, receiving cohort data comprising scRNA-seq 304, and receiving cohort data comprising methylation data 306, as shown in Figure 3. In other examples, data types may comprise genomics and/or proteomics data. Optionally, each individual sample of the cohort may be accompanied by clinical data relevant to the endotype. The selection of data types may suitably depend on the type of algorithms implemented in the precision medicine model to identify endotypes.

[0025] After the step of receiving 102 cohort data, the method 300 comprises detecting 104 endotypes. With reference to Figure 3, this step of detecting 104 endotypes may comprise assigning 308 individuals of the cohort to subgroups and generating 310 a gene signature for each subgroup. Assigning 308 individuals to subgroups may be implemented on the basis of gene expression data relating to those individuals. It is the subgroups that represent potential endotypes of the disease in the sense that the individuals of a subgroup are likely to have the same endotype of the disease. In addition to gene expression data, clinical data about the individuals of the cohort may also be used as an input for endotype detection. In this case, the clinical data may be considered as metadata.

[0026] Any suitable method that assigns 308 individuals to subgroups on the basis of gene expression data may be used. For example, individuals may be assigned to subgroups by using a clustering method 312. A clustering method uses a clustering algorithm to detect endotypes. The clustering algorithm may be executed on the gene expression data or any suitable alternative representation of the individuals that has been derived from the gene expression data. A clustering algorithm may assign each individual of the cohort to a single cluster or may assign each individual to one or more clusters. An endotype is then represented by a cluster. In one approach, an endotype may be thought of as being represented by the set of individuals that are assigned to the cluster, or alternatively an endotype may be thought of as being represented by the characteristics of the individuals that are assigned to the cluster.
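As a non-limiting sketch of the clustering approach, the following assigns each individual to a single cluster using k-means. The disclosure does not name a particular clustering algorithm, so k-means, the function name `kmeans`, the deterministic initialisation and the toy expression values are all assumptions made for illustration only.

```python
from math import dist

def kmeans(profiles, k, iterations=20):
    """Assign each expression profile to one of k clusters (hypothetical helper)."""
    # Deterministic initialisation (an assumption): the first k profiles seed the centroids.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in profiles]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy cohort: four individuals described by two genes (made-up expression values).
cohort = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9]]
labels = kmeans(cohort, k=2)
```

Each distinct label then stands for one candidate endotype, represented by the set of individuals carrying that label.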

[0027] In a further example, individuals of the cohort may be assigned to subgroups by using a latent factor model 316. One or more latent factor models may be used to decompose the gene expression data into independent factors of variation, each represented by a latent variable. A latent variable may be considered to be gene-sparse if it describes variation in a subset of genes, where the subset contains far fewer genes than the gene expression data of the input. In some non-limiting examples, the input gene expression data may relate to approximately 20,000 genes. Alternatively, a latent variable may be considered to be gene-sparse if the number of values in its loading vector that are not equal to or close to zero is significantly smaller than the total number of genes. In suitable examples, latent variables that describe useful biological features may be gene-sparse and contain between 20 and 500 non-zero genes from the initial gene expression input matrix. Unlike the clustering approach, latent factor models do not uniquely assign individuals of the cohort to endotypes and can potentially generate a large number of candidate endotypes. As a result, an algorithm based on latent factor models would suitably include a latent variable selection step, in which information inferred from input metadata such as clinical metadata or other sources of additional data, such as but not limited to PPI networks or gene set libraries, is used to prune the list of candidate endotypes for consideration.
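The gene-sparsity criterion may be illustrated by a minimal sketch that counts the loadings that are not close to zero. The bounds of 20 and 500 genes follow the example range given above; the tolerance used to decide that a loading is close to zero, the function names and the toy loading vectors are hypothetical.

```python
def nonzero_gene_count(loading_vector, tolerance=1e-6):
    """Number of genes whose loading is not equal to or close to zero."""
    return sum(1 for weight in loading_vector if abs(weight) > tolerance)

def is_gene_sparse(loading_vector, min_genes=20, max_genes=500, tolerance=1e-6):
    """True if the latent variable loads on a small subset of the input genes."""
    count = nonzero_gene_count(loading_vector, tolerance)
    return min_genes <= count <= max_genes

# Toy loading vectors over 20,000 genes (made-up values).
sparse_latent = [0.8] * 100 + [0.0] * 19900   # 100 genes carry weight
dense_latent = [0.01] * 20000                 # every gene carries some weight
```

A latent variable selection step could apply such a check before metadata, PPI networks or gene set libraries are used to prune the candidates further.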

[0028] Any suitable method that generates 310 gene signatures for the subgroups may be used. For example, gene signatures may be generated for the subgroups by using differential expression analysis 318. In this approach, the gene signature for a subgroup is generated by performing a statistical test for each gene on the difference in mean gene expression between samples of individuals in the subgroup and samples of individuals outside the subgroup. A threshold on corrected probability values (p-values) may be used to identify which genes to include with non-zero weight in the gene signature and a log fold-change may be used to represent a value assigned to these genes. It will be appreciated that any alternative method for generating gene signatures for the subgroups may be used, including generating gene signatures as part of an endotyping algorithm. Any other suitable method which can be used to derive gene signatures by comparing expression levels of genes across two samples may be used.
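A minimal sketch of signature generation by differential expression is given below. It makes several simplifying assumptions that are not part of the disclosure: a Welch-type statistic with a normal approximation stands in for the per-gene statistical test, a Bonferroni correction is used for the corrected p-values, and a pseudocount stabilises the log fold-change. All names and values are hypothetical.

```python
from math import log2, sqrt
from statistics import NormalDist, mean, variance

def gene_signature(inside, outside, alpha=0.05, pseudocount=1.0):
    """Map each gene to a weight: log2 fold-change if significant, otherwise zero.

    inside / outside: dict of gene -> expression values for samples in / out of the subgroup.
    """
    n_genes = len(inside)
    signature = {}
    for gene in inside:
        a, b = inside[gene], outside[gene]
        standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
        if standard_error == 0:
            p_value = 0.0 if mean(a) != mean(b) else 1.0
        else:
            z = (mean(a) - mean(b)) / standard_error
            # Two-sided p-value under a normal approximation (an assumption).
            p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        corrected = min(1.0, p_value * n_genes)  # Bonferroni correction (an assumption)
        if corrected < alpha:
            signature[gene] = log2((mean(a) + pseudocount) / (mean(b) + pseudocount))
        else:
            signature[gene] = 0.0
    return signature

# Made-up expression values: G1 differs between the groups, G2 does not.
inside = {"G1": [10.0, 11.0, 9.0, 10.5], "G2": [5.0, 5.2, 4.8, 5.1]}
outside = {"G1": [2.0, 2.5, 1.5, 2.2], "G2": [5.1, 4.9, 5.0, 5.2]}
signature = gene_signature(inside, outside)
```

With realistic sample sizes, an established differential expression tool would normally replace the crude normal approximation used here.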

[0029] In suitable examples, the output of the step of detecting 104 endotypes may comprise, for each candidate endotype that has been detected, a subgroup of individuals of the cohort that have been assigned to the endotype, and a gene signature for the endotype that may comprise a vector assigning a weight to each gene present in the gene expression data that was provided as an input.

[0030] After the step of detecting 104 endotypes, the method 300 comprises predicting 106 therapeutic targets for an endotype. With reference to Figure 3, predicting 106 targets for an endotype may be performed by an algorithm that uses the gene signature of the endotype as input and optionally also uses as input additional algorithm-specific data such as PPI networks 320 or other suitable resources commonly used in bioinformatics. In suitable examples, the output of the step of predicting 106 targets may comprise a set of genes, all of which are considered to be potential therapeutic targets for the endotype. In other suitable examples, the output may comprise a ranked list of genes, in which the position of a gene in the ranking is representative of how likely that gene is to be a suitable therapeutic target for the endotype.

[0031] With reference to Figure 3, targets for an endotype may be predicted using any suitable approach. For example, one approach involves ranking 322 genes according to the absolute value of their weight in the gene signature of the endotype. This approach may optionally include restricting the ranked genes to either the positively or the negatively weighted genes. The decision on how to rank the genes by their weight in the signature could depend on the biological question of interest; for example, a user may be specifically interested in identifying therapeutic targets from the genes which are over-expressed in the endotype.
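The ranking approach can be sketched in a few lines. The `direction` parameter, the exclusion of zero-weight genes and the gene names are illustrative assumptions.

```python
def rank_genes(signature, direction="both"):
    """Rank genes by the absolute value of their signature weight.

    direction: "both", "up" (positively weighted only) or "down" (negatively weighted only).
    """
    if direction == "up":
        candidates = {g: w for g, w in signature.items() if w > 0}
    elif direction == "down":
        candidates = {g: w for g, w in signature.items() if w < 0}
    else:
        # Zero-weight genes are dropped (an assumption of this sketch).
        candidates = {g: w for g, w in signature.items() if w != 0}
    return sorted(candidates, key=lambda g: abs(candidates[g]), reverse=True)

# Made-up gene signature for one endotype.
toy_signature = {"GENE_A": 2.5, "GENE_B": -3.1, "GENE_C": 0.0, "GENE_D": 0.4}
ranked_all = rank_genes(toy_signature)
ranked_up = rank_genes(toy_signature, direction="up")
```

Restricting to `direction="up"` corresponds to the case where the user is interested only in genes that are over-expressed in the endotype.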

[0032] Another approach uses protein-protein interaction (PPI) network analysis 324. This approach involves selecting genes which are connected to a significant number of the genes which have non-zero weight in the gene signature of the endotype. The selection involves calculations that compare a set of genes which are connected to a gene g in the PPI network with the set of genes with non-zero weight in the gene signature, for example using Fisher’s Exact Test. In this method, we optionally exclude genes using a p-value threshold, to offset the fact that genes which are highly connected in the PPI network will tend to be connected to genes in the signatures of many different endotypes. Additionally, ranking based on the odds ratio or some other statistic related to the method by which genes are assessed can be used to provide a ranking on the list of candidate gene targets.
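The comparison of a gene's PPI neighbours with the signature genes might be sketched as below, using a one-sided Fisher's Exact Test computed from the hypergeometric distribution. The choice of a one-sided test, the ranking by p-value rather than odds ratio, the function names and the toy network are assumptions made for illustration.

```python
from math import comb

def fisher_exact_greater(overlap, n_neighbours, n_signature, n_background):
    """One-sided p-value: probability of at least `overlap` signature genes among the neighbours."""
    max_overlap = min(n_neighbours, n_signature)
    favourable = sum(
        comb(n_signature, x) * comb(n_background - n_signature, n_neighbours - x)
        for x in range(overlap, max_overlap + 1)
    )
    return favourable / comb(n_background, n_neighbours)

def candidate_targets(ppi, signature_genes, background, p_threshold=0.05):
    """Return (gene, p-value) pairs passing the threshold, most significant first."""
    hits = []
    for gene, neighbours in ppi.items():
        neighbours = neighbours & background
        overlap = len(neighbours & signature_genes)
        p_value = fisher_exact_greater(
            overlap, len(neighbours), len(signature_genes), len(background)
        )
        if p_value < p_threshold:
            hits.append((gene, p_value))
    return sorted(hits, key=lambda item: item[1])

# Toy data: 20 background genes, 5 of which carry non-zero signature weight.
background = {f"g{i}" for i in range(20)}
signature_genes = {"g0", "g1", "g2", "g3", "g4"}
ppi = {
    "REG1": {"g0", "g1", "g2", "g3"},                             # neighbours mostly in the signature
    "HUB": {"g0", "g10", "g11", "g12", "g13", "g14", "g15", "g16"},  # highly connected, little overlap
}
hits = candidate_targets(ppi, signature_genes, background)
```

In this toy network the highly connected gene is filtered out by the p-value threshold, which mirrors the purpose of the threshold described above.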

[0033] A further approach uses directed upstream regulator analysis 326. This approach is an extension of the PPI network analysis, and considers the sign of the weights of genes in the gene signature of an endotype as well as the direction of regulation in the PPI network. The Ingenuity Pathway Analysis (IPA) algorithm may be used to achieve this, or alternatively multiple instances of Fisher’s Exact Tests may be combined.

[0034] Figure 4 shows a method 400 of evaluating a precision medicine workflow according to an embodiment of the invention. The method 400 is suitable for evaluating the ability of a precision medicine workflow to accurately predict therapeutic targets that are endotype specific. Determining the extent to which a precision medicine workflow, or a particular configuration of a precision medicine workflow, can predict endotype specific targets may contribute advantageously to the development of endotype specific drugs. The method 400 comprises mapping 402 endotypes that have been detected by a precision medicine workflow to assays. The assays provide experimental evidence relating to the efficacy of targets for producing a particular phenotypic effect in specific cell lines. This enables the targets that have been predicted by the precision medicine workflow to be assessed for their endotype specificity on the basis of experimental evidence. It is on this basis that the precision medicine workflow is evaluated for its ability to predict endotype specific targets. As such, the method 400 comprises assessing 404 targets that have been predicted by the precision medicine workflow for endotype specificity using data from assays. Finally, the method 400 comprises evaluating 406 the precision medicine workflow. The evaluation is based on the endotype specificity of the targets that the precision medicine workflow predicts.

[0035] Figure 5 shows a system 500 for carrying out the method of Figure 4. The system 500 comprises an endotype mapping module 502 configured to map endotypes detected by a precision medicine workflow to assays; a target assessment module 504 configured to assess targets predicted by the precision medicine workflow for endotype specificity; and a workflow evaluation module 506 configured to evaluate the precision medicine workflow for its ability to predict endotype specific targets.

[0036] Referring to Figure 6, the invention extends to a combined method 600 of predicting endotype-specific targets according to a precision medicine workflow and of evaluating the precision medicine workflow according to an embodiment of the invention. The method 600 is a combination of the methods shown in Figures 1 and 4. The method 600 may be performed in order to both predict therapeutic targets and evaluate the endotype specificity of the predictions.

[0037] Method 600 comprises receiving 102 cohort data relating to features of individuals of a cohort; detecting 104 endotypes of a disease of interest using the cohort data; predicting 106 targets for each of the endotypes; mapping 402 the endotypes to assays; assessing 404 the targets for endotype specificity based on the assays; and evaluating 406 the ability of steps 102, 104 and 106 to predict therapeutic targets that are endotype specific.

[0038] Referring to Figure 7, the invention extends to a system 700 for carrying out the method of Figure 6. The system 700 is a combination of the systems shown in Figures 2 and 5. The system 700 comprises an input module 202 configured to receive cohort data relating to features of individuals of a cohort; an endotype detection module 204 configured to detect endotypes of a disease of interest using the cohort data; a target prediction module 206 configured to predict targets for each of the endotypes; an endotype mapping module 502 configured to map the endotypes to assays; a target assessment module 504 configured to assess the targets for endotype specificity based on the assays; and a workflow evaluation module 506 configured to evaluate the ability of modules 202, 204 and 206 to predict therapeutic targets that are endotype specific.

[0039] Figure 8 shows a method 800 that includes several optional steps of the method of Figure 4. Referring to Figure 8, the method 800 comprises the step of mapping 402 endotypes to assays. The purpose of mapping endotypes to assays is to utilise experimental evidence from assays for the suitability of a potential therapeutic target for treating a specific endotype. As such, there is a need to identify assays that are relevant for or representative of a given endotype.

[0040] To achieve this, the mapping 402 of endotypes to assays may suitably comprise comparing 802 gene signatures of endotypes to gene signatures of assays. It will be appreciated that the gene signature of an assay refers to the gene signature of a cell line of the assay. In some examples, additional metadata may be used to compare endotypes and assays. For example, other biological data relating to features of the individuals associated with an endotype such as the expression of genes on the cell membrane, as assessed by immunohistochemistry (IHC) or genetic features such as mutations or copy-number (CN) alterations may be used together with corresponding data relating to the assays.

[0041] Comparing the gene signatures of an endotype and an assay may suitably comprise computing 804 an affinity score that represents an extent of similarity between the gene signature of the endotype and the gene signature of the assay. The affinity scores may take a range of values. For example, they could take values from a continuous range where a higher score indicates a higher degree of similarity between the gene signatures of the endotype and the assay and a lower score indicates a lower degree of similarity between the gene signatures of the endotype and the assay. In another approach, the values could range from, say, -1 to +1, where negative scores indicate a lack of similarity (the endotype does not match the assay), positive scores indicate a similarity (the endotype matches the assay), and the magnitude of the score indicates the confidence in the match. In other approaches, the affinity scores could take binary values indicating whether or not an endotype matches an assay, or alternatively ternary values indicating whether or not an endotype matches an assay or whether the question of whether they match could not be determined from the available data.

[0042] The affinity scores may be determined using any suitable method. For example, a Nearest Template Mapping method 806 may be used for matching a gene signature to a set of templates. In this case, the gene signature of each assay is regarded as a sample and the gene signature of the endotype is regarded as the template.
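A simplified sketch of template-based scoring is shown below, in which the endotype signature is the template and each assay signature is a sample. Cosine similarity is assumed as the similarity measure, which yields scores between -1 and +1; the disclosure does not fix the measure, and the names and values are hypothetical.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def affinity_scores(endotype_signature, assay_signatures):
    """Score every assay (sample) against the endotype signature (template)."""
    genes = sorted(endotype_signature)
    template = [endotype_signature[g] for g in genes]
    return {
        assay: cosine_similarity(template, [sig.get(g, 0.0) for g in genes])
        for assay, sig in assay_signatures.items()
    }

# Made-up signatures for one endotype and two cell lines.
endotype = {"G1": 1.0, "G2": -1.0, "G3": 0.0}
assays = {
    "line_A": {"G1": 2.0, "G2": -2.0, "G3": 0.0},   # same direction as the template
    "line_B": {"G1": -1.0, "G2": 1.0, "G3": 0.5},   # opposite direction
}
scores = affinity_scores(endotype, assays)
```

A positive score would then be read as a match between the endotype and the assay, and a negative score as a mismatch.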

[0043] In another approach, the affinity scores may be determined using a Single Sample Scoring method 808 such as a Single Sample Gene Set Enrichment Analysis (GSEA) 810 which compares sample transcriptomes to molecular signatures. The Single Sample GSEA 810 approach uses the GSEA algorithm to compute a Normalised Enrichment Score (NES), with the gene signature of an endotype taken to be the gene set and the expression profile of an assay used in place of the correlation vector. The NES represents the degree to which the genes which receive non-zero weight in the endotype signature are over-represented in the set of genes which are highly expressed in the assay.

[0044] Normalisation can be achieved by comparing the NES to a null distribution of enrichment scores for randomly selected gene sets of a similar or same size. Dividing the NES by the mean of this distribution ensures that NES scores for the different candidate endotypes are on similar scales.
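The enrichment and normalisation steps can be illustrated with a deliberately simplified sketch. It assumes an unweighted running-sum statistic rather than the weighted statistic of the GSEA algorithm, treats the quantity being divided as the raw enrichment score, and normalises by the mean absolute score of random gene sets of the same size. The function names, the number of random sets and the toy expression profile are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a signature gene, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best

def normalised_enrichment_score(expression, gene_set, n_random=200, seed=0):
    """Divide the enrichment score by the mean absolute score of same-size random gene sets."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    score = enrichment_score(ranked, gene_set)
    rng = random.Random(seed)
    null = [
        enrichment_score(ranked, set(rng.sample(ranked, len(gene_set))))
        for _ in range(n_random)
    ]
    null_mean = sum(abs(s) for s in null) / len(null)
    return score / null_mean if null_mean else 0.0

# Toy assay expression profile: g0 is the most highly expressed gene, g49 the least.
expression = {f"g{i}": float(50 - i) for i in range(50)}
endotype_genes = {"g0", "g1", "g2", "g3", "g4"}
ranked = sorted(expression, key=expression.get, reverse=True)
raw_score = enrichment_score(ranked, endotype_genes)
nes = normalised_enrichment_score(expression, endotype_genes)
```

Because the signature genes here are exactly the most highly expressed genes of the assay, the raw score reaches its maximum and the normalised score exceeds one.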

[0045] In some examples, the NES is taken to be the affinity score between an endotype and an assay. In some other examples, other algorithms such as but not limited to GSVA (Gene Set Variation Analysis) may be used.

[0046] With reference to Figure 8, the method 800 comprises assessing 404 targets for endotype specificity. The purpose of this assessment is to establish whether the precision medicine workflow is good at predicting therapeutic targets that are especially efficacious or exclusively efficacious for treating an endotype of interest. For example, let’s say a precision medicine workflow predicts a target g (i.e. gene g) for treating an endotype E. In this case, we want to know if gene g is specific for endotype E. As a result, we may be interested in asking the question ‘Is target g disproportionately important for a given phenotype (such as cell viability) in the cell lines that are associated with endotype E?’ To answer this question, we use two items of information:

(1 ) We need to know which assays have cell lines that are associated with endotype E (i.e. which assays match (or map to) endotype E) and which assays have cell lines that are not associated with endotype E (i.e. which assays do not match endotype E); and

(2) We need to know the phenotypic effect of perturbing gene g in the matching assays and in the unmatching assays.

[0047] With this information, it becomes possible to determine whether a given phenotypic effect of perturbing gene g is greater in assays that match endotype E compared with assays that do not match endotype E. If the phenotypic effect of the perturbation is greater in matching assays, then gene g is taken to be an endotype-specific hit for endotype E. The greater the difference in the phenotypic effect between matching and unmatching assays, the greater the endotype specificity.

[0048] The extent of a phenotypic effect of perturbing a target in a cell line may be established from assay readouts. Assay readouts provide a measure, preferably quantitative, of the effect of a perturbation or any other intervention (such as, for example, a compound treatment or genetic knockout using techniques such as clustered regularly interspaced short palindromic repeats or ‘CRISPR’) on a phenotype of interest. The phenotype could be any phenotype of interest for the endotype being studied, for example glucose uptake, cell viability, and so on. In examples, the assay readouts may be from large-scale experiments, comprising thousands of assay readouts across hundreds of cell lines. Suitably, the assay readouts are sufficient to provide a reasonably comprehensive representation of all or most of the possible endotypes of a given disease. In examples, assay readouts may comprise or be derived from CRISPR or CERES screening data or scores, or any other gene effect scores. Assay readouts may additionally or alternatively comprise scores derived from any large-scale genomic and/or chemical perturbation experiments or scores derived from short hairpin ribonucleic acid (shRNA), small molecule screens, genome editing tools, chemical screens, or cellular potency data, for example from pharmacogenomics screening. It will be appreciated that in non-limiting examples, assay readouts may convey information relating to essential genes in a cancer cell line.

[0049] Assay readouts may be normalised using positive and negative controls. For example, a negative control could relate to a perturbation that is known to have no effect on the cells while a positive control could relate to a perturbation that is known to kill the cells. It is customary to normalise across the positive and negative controls to achieve comparable assay readout values between experimental batches. In suitable examples, gene effect scores such as CRISPR or CERES scores may be normalised between 0 (which could indicate that the target is not essential for optimal cell proliferation) and 1 (which could indicate that the target is essential for optimal cell proliferation).
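By way of illustration only, one common way to normalise a readout against its controls is a linear rescaling in which the negative control maps to 0 and the positive control maps to 1. The description does not specify a formula, so the rescaling below and the function name `normalise_readout` are assumptions made for this sketch.

```python
def normalise_readout(raw, negative_control, positive_control):
    """Rescale a raw readout so that the controls map to 0 and 1.

    negative_control: readout for a perturbation with no effect (maps to 0).
    positive_control: readout for a perturbation that kills the cells (maps to 1).
    All three values are assumed to come from the same experimental batch.
    """
    span = positive_control - negative_control
    if span == 0:
        raise ValueError("controls must differ to define a scale")
    return (raw - negative_control) / span


def normalise_batch(readouts, negative_control, positive_control):
    """Apply the same rescaling to every readout in one batch."""
    return {
        name: normalise_readout(value, negative_control, positive_control)
        for name, value in readouts.items()
    }
```

Applying the rescaling separately within each batch, using that batch's own controls, is what makes values from different batches comparable in this sketch.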

[0050] In order to determine the degree of specificity of a target to a given endotype, an endotype specificity score may be calculated. As described above, the approach is to determine the extent to which the target produces a greater phenotypic effect in assays that map to the given endotype compared with assays that do not map to the given endotype. Affinity scores provide a measure of which assays map to the endotype of interest and which do not, and assay readouts provide a measure of the phenotypic effect of perturbation of the target on the cell lines of the respective matching and unmatching assay. As such, the step of assessing 404 targets for endotype specificity may suitably comprise calculating 812 endotype specificity scores using assay readouts and affinity scores, as shown in Figure 8.

[0051] A non-limiting example of a calculation of an endotype specificity score will now be described. In this example, the disease is a cancer and the assay readouts are target efficacy scores such as CERES scores or CRISPR scores computed from CRISPR gene essentiality screens. In this example, the endotype specificity score is a difference between an average assay readout for assays that map to the endotype of interest and an average assay readout for assays that do not map to the endotype of interest. In particular, the endotype specificity score for a gene g may be defined as follows.

S_g = median({c_i_g : m(i) = +1}) - median({c_i_g : m(i) = -1})

[0052] In this equation, the term c_i_g is the target efficacy of gene g in cell line i, and the term m(i) is the affinity score that matches the endotype of interest to the assay having the ith cell line. In this example, affinity scores m(i) may take values of: +1, indicating that the endotype maps to an assay having the ith cell line; -1, indicating that the endotype does not map to an assay having the ith cell line; or 0, indicating that it is not known whether or not the endotype maps to an assay having the ith cell line. As such, the term {c_i_g : m(i) = +1} means the set of target efficacy values c_i_g for which the assays map to the endotype of interest. Therefore, the equation as a whole means the median target efficacy for assays that map to the endotype of interest minus the median target efficacy for assays that do not map to the endotype of interest.

[0053] It will be appreciated that the above equation is just one non-limiting example of an endotype specificity score for a target g, and that any suitable definition could be used that provides a measure of how specific a target g is for an endotype of interest. Referring to Figure 8, calculating 812 an endotype specificity score may in some examples comprise determining an average assay readout for assays that match (i.e. map to) an endotype of interest and determining an average assay readout for assays that do not match the endotype of interest, and determining 814 a difference between the averages.
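By way of illustration only, the median-difference form of the endotype specificity score described above might be computed as in the following sketch. The function name `endotype_specificity_score` and the dictionary-based inputs are assumptions made for this sketch; affinity values of +1, -1 and 0 follow the convention given in the description.

```python
from statistics import median


def endotype_specificity_score(efficacy, affinity):
    """Median efficacy in matching assays minus median in unmatching assays.

    efficacy: dict mapping cell line -> target efficacy of gene g.
    affinity: dict mapping cell line -> +1 (maps to the endotype),
              -1 (does not map) or 0 (unknown). Missing lines count as 0.
    """
    matching = [c for line, c in efficacy.items() if affinity.get(line, 0) == 1]
    unmatching = [c for line, c in efficacy.items() if affinity.get(line, 0) == -1]
    if not matching or not unmatching:
        raise ValueError("need at least one matching and one unmatching assay")
    return median(matching) - median(unmatching)


# Hypothetical readouts for one gene across five cell lines.
efficacy = {"A": 0.9, "B": 0.8, "C": 0.2, "D": 0.1, "E": 0.5}
affinity = {"A": 1, "B": 1, "C": -1, "D": -1, "E": 0}
score = endotype_specificity_score(efficacy, affinity)  # approximately 0.7
```

Cell line E has an affinity of 0 and is therefore ignored, so only assays with a known mapping contribute to either median.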

[0054] An endotype specificity score provides a measure of the specificity of a particular target g for an endotype of interest. However, in order to evaluate a precision medicine workflow for its ability to reliably predict endotype-specific targets, multiple target predictions should be taken into account. Referring to Figure 8, a suitable approach for evaluating 406 a precision medicine workflow comprises calculating 816 an average endotype specificity score for a set of targets T that have been predicted by the precision medicine workflow. This example approach of defining a workflow score may be expressed by the following equation.

Workflow score = mean(S_g : g ∈ T)

[0055] The term S_g is the endotype specificity score for target g for the endotype of interest, and T is the set of targets g that have been predicted by the precision medicine workflow. As a result, the equation defines the workflow score as the mean of the endotype specificity scores for all the targets g that have been predicted by the precision medicine workflow. The workflow score therefore provides a measure of the ability of the precision medicine workflow to predict targets that are specific for the endotype of interest.
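By way of illustration only, the workflow score described above reduces to a mean over the predicted targets. The function name `workflow_score` and the dictionary of precomputed specificity scores are assumptions made for this sketch.

```python
from statistics import mean


def workflow_score(specificity, targets):
    """Mean endotype specificity score S_g over the predicted target set T.

    specificity: dict mapping gene -> endotype specificity score S_g.
    targets: non-empty collection of genes predicted by the workflow;
             a gene without a score raises KeyError.
    """
    return mean(specificity[gene] for gene in targets)
```

For example, with hypothetical scores of 0.7 and 0.1 for two predicted targets, the workflow score would be approximately 0.4.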

[0056] The workflow score may be normalised by comparison to a null distribution of mean workflow scores for randomly selected gene sets of the same size. Normalisation can be achieved by, for example, dividing the workflow score by the mean of the null distribution or by subtracting the mean and dividing by the standard deviation. Applying normalisation ensures that workflow scores for workflows that generate sets of predicted targets of different sizes are on a comparable scale.

[0057] Figure 9 shows a method 900 for determining an optimum configuration of a precision medicine workflow. The method 900 is an extended version of the method 600 of Figure 6 and comprises changing 902 the configuration of the precision medicine workflow and evaluating 406 the workflow in each of its configurations, and determining 904 an optimum configuration of the workflow based on the evaluations of the different configurations.
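By way of illustration only, the normalisation against a null distribution and the comparison of configurations might be combined as in the following sketch. It uses the variant that subtracts the null mean and divides by the null standard deviation. The function names, the number of random gene sets, and the representation of each configuration as a pair of specificity scores and predicted targets are assumptions made for this sketch.

```python
import random
from statistics import mean, stdev


def normalised_workflow_score(specificity, targets, n_null=500, seed=0):
    """Z-score of the mean specificity of targets against random gene sets.

    specificity: dict mapping every scored gene -> endotype specificity score.
    targets: genes predicted by the workflow (all must be keys of specificity).
    """
    rng = random.Random(seed)
    observed = mean(specificity[gene] for gene in targets)
    background = [specificity[gene] for gene in sorted(specificity)]
    # Null distribution: mean score of random gene sets of the same size.
    null = [mean(rng.sample(background, len(targets))) for _ in range(n_null)]
    spread = stdev(null)
    if spread == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mean(null)) / spread


def best_configuration(runs):
    """Return the configuration name with the highest normalised score.

    runs: dict mapping configuration name -> (specificity dict, target list),
          one entry per evaluated configuration of the workflow.
    """
    return max(runs, key=lambda name: normalised_workflow_score(*runs[name]))
```

Because each configuration is scored against random gene sets of its own size, configurations that predict different numbers of targets can be ranked on a comparable scale in this sketch.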

[0058] A computer apparatus 1000 suitable for implementing methods according to the present invention is shown in Figure 10. The apparatus 1000 comprises a processor 1002, an input-output device 1004, a communications portal 1006 and computer memory 1008. The memory 1008 may store code that, when executed by the processor 1002, causes the apparatus 1000 to perform the method 400 shown in Figure 4.

[0059] In the embodiment described above the server may comprise a single server or network of servers. In some examples the functionality of the server may be provided by a network of servers distributed across a geographical area, such as a worldwide distributed network of servers, and a user may be connected to an appropriate one of the network of servers based upon a user location.

[0060] The above description discusses embodiments of the invention with reference to a single user for clarity. It will be understood that in practice the system may be shared by a plurality of users, and possibly by a very large number of users simultaneously.

[0061] The embodiments described above are fully automatic. In some examples a user or operator of the system may manually instruct some steps of the method to be carried out.

[0062] In the described embodiments of the invention the system may be implemented as any form of a computing and/or electronic device. Such a device may comprise one or more processors which may be microprocessors, controllers or any other suitable type of processors for processing computer executable instructions to control the operation of the device in order to gather and record routing information. In some examples, for example where a system on a chip architecture is used, the processors may include one or more fixed function blocks (also referred to as accelerators) which implement a part of the method in hardware (rather than software or firmware). Platform software comprising an operating system or any other suitable platform software may be provided at the computing-based device to enable application software to be executed on the device.

[0063] Various functions described herein can be implemented in hardware, software, or any combination thereof. If implemented in software, the functions can be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media may include, for example, computer-readable storage media. Computer-readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media can be any available storage media that may be accessed by a computer. By way of example, and not limitation, such computer-readable storage media may comprise RAM, ROM, EEPROM, flash memory or other memory devices, CD-ROM or other optical disc storage, magnetic disc storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disc and disk, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc (BD). Further, a propagated signal is not included within the scope of computer-readable storage media. Computer-readable media also include communication media including any medium that facilitates transfer of a computer program from one place to another. A connection, for instance, can be a communication medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of communication medium. Combinations of the above should also be included within the scope of computer-readable media.

[0064] Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, hardware logic components that can be used may include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-On-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

[0065] Although illustrated as a single system, it is to be understood that the computing device may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device.

[0066] Although illustrated as a local device it will be appreciated that the computing device may be located remotely and accessed via a network or other communication link (for example using a communication interface).

[0067] The term 'computer' is used herein to refer to any device with processing capability such that it can execute instructions. Those skilled in the art will realise that such processing capabilities are incorporated into many different devices and therefore the term 'computer' includes PCs, servers, mobile telephones, personal digital assistants and many other devices.

[0068] Those skilled in the art will realise that storage devices utilised to store program instructions can be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network). Those skilled in the art will also realise that, by utilising conventional techniques, all or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.

[0069] It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages.

[0070] Any reference to “an” item refers to one or more of those items. The term 'comprising' is used herein to mean including the method steps or elements identified, but that such steps or elements do not comprise an exclusive list and a method or apparatus may contain additional steps or elements.

[0071] As used herein, the terms "component" and "system" are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.

[0072] Further, as used herein, the term "exemplary" is intended to mean "serving as an illustration or example of something".

[0073] Further, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

[0074] The figures illustrate exemplary methods. While the methods are shown and described as being a series of acts that are performed in a particular sequence, it is to be understood and appreciated that the methods are not limited by the order of the sequence. For example, some acts can occur in a different order than what is described herein. In addition, an act can occur concurrently with another act. Further, in some instances, not all acts may be required to implement a method described herein.

[0075] Moreover, the acts described herein may comprise computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions can include routines, sub-routines, programs, threads of execution, and/or the like. Still further, results of acts of the methods can be stored in a computer-readable medium, displayed on a display device, and/or the like.

[0076] The order of the steps of the methods described herein is exemplary, but the steps may be carried out in any suitable order, or simultaneously where appropriate. Additionally, steps may be added or substituted in, or individual steps may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.

[0077] It will be understood that the above description of a preferred embodiment is given by way of example only and that various modifications may be made by those skilled in the art. What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable modification and alteration of the above devices or methods for purposes of describing the aforementioned aspects, but one of ordinary skill in the art can recognize that many further modifications and permutations of various aspects are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the scope of the appended claims.

Claims

1. A computer-implemented method for evaluating a target identification workflow in precision medicine, the target identification workflow comprising an endotype detection module configured to detect endotypes from cohort data relating to individuals of a cohort and a target prediction module configured to predict targets for each of the endotypes, the method comprising: mapping endotypes detected by the endotype detection module to assays; assessing targets predicted by the target prediction module for endotype specificity; and evaluating the workflow for its ability to predict endotype specific targets.

2. A computer-implemented method according to claim 1, wherein the cohort data comprises one or more of Bulk RNA-seq, scRNA-seq, and methylation data.

3. A computer-implemented method according to any preceding claim, wherein the target identification workflow is configured to detect endotypes by assigning individuals of the cohort to subgroups.

4. A computer-implemented method according to claim 3, wherein the target identification workflow is configured to detect endotypes by generating a gene signature for each subgroup.

5. A computer-implemented method according to any preceding claim, wherein the target identification workflow is configured to predict targets by using PPI networks.

6. A computer-implemented method according to any preceding claim, wherein the target identification workflow is configured to predict targets by a method comprising one or more of: ranking genes according to their weight in a gene signature of a detected endotype; performing upstream regulator analysis; and performing directed upstream regulator analysis.

7. A computer-implemented method according to any preceding claim, wherein mapping the endotypes to assays comprises comparing gene signatures of the endotypes to gene signatures of the assays.

8. A computer-implemented method according to claim 7, comprising computing affinity scores, each affinity score representing a similarity between a gene signature of an endotype and a gene signature of an assay.

9. A computer-implemented method according to claim 8, wherein computing the affinity scores comprises using a Nearest Template Mapping method.

10. A computer-implemented method according to claim 8, wherein computing the affinity scores comprises using a Single Sample Scoring method.

11. A computer-implemented method according to claim 10, wherein the Single Sample Scoring method comprises Single Sample Gene Set Enrichment Analysis.

12. A computer-implemented method according to any preceding claim, wherein assessing the targets for endotype specificity comprises: using assay readouts that indicate an extent to which a perturbation of a target in an assay associated with an endotype has an effect on a phenotype of interest.

13. A computer-implemented method according to claim 12, wherein assessing the targets for endotype specificity comprises: using affinity scores that indicate an association between an endotype and an assay.

14. A computer-implemented method according to claim 13, wherein assessing the targets for endotype specificity comprises: calculating an endotype specificity score for a target in relation to an endotype by using the assay readouts and the affinity scores to determine an extent to which perturbation of the target has a greater effect on a phenotype of interest in assays associated with the endotype compared to assays not associated with the endotype.

15. A computer-implemented method according to claim 14, wherein calculating the endotype specificity score for the target in relation to the endotype comprises: determining an average assay readout for the target in assays associated with the endotype; determining an average assay readout for the target in assays not associated with the endotype; and determining a difference between the averages.

16. A computer-implemented method according to claim 14, wherein evaluating the workflow for its ability to predict endotype specific targets comprises: calculating an average endotype specificity score of an endotype of interest.

17. A computer-implemented method according to any preceding claim, comprising: arranging the workflow in a plurality of configurations; and evaluating the workflow for its ability to predict endotype specific targets in each of the plurality of configurations.

18. A computer-implemented method according to claim 17, comprising determining an optimum configuration of the workflow for predicting endotype specific targets.

19. A computer-readable medium storing code that, when executed by a computer, causes the computer to perform the method of any previous claim.

20. A system for evaluating a target identification workflow in precision medicine, the target identification workflow comprising an endotype detection module configured to detect endotypes from cohort data relating to individuals of a cohort and a target prediction module configured to predict targets for each of the endotypes, the system comprising: an endotype mapping module configured to map endotypes detected by the endotype detection module to assays; a target assessment module configured to assess targets predicted by the target prediction module for endotype specificity; and a workflow evaluation module configured to evaluate the workflow for its ability to predict endotype specific targets.

21. A system according to claim 20, wherein the endotype mapping module is configured to compare gene signatures of the endotypes to gene signatures of the assays.

22. A system according to claim 21, wherein the endotype mapping module is configured to compute affinity scores, each affinity score representing a similarity between a gene signature of an endotype and a gene signature of an assay.

23. A system according to claim 20, 21 or 22, wherein the target assessment module is configured to use assay readouts that indicate an extent to which a perturbation of a target in an assay associated with an endotype has an effect on a phenotype of interest.

24. A system according to claim 23, wherein the target assessment module is configured to use affinity scores that indicate an association between an endotype and an assay.

25. A system according to claim 24, wherein the target assessment module is configured to calculate an endotype specificity score for a target in relation to an endotype by using the assay readouts and the affinity scores to determine an extent to which perturbation of the target has a greater effect on a phenotype of interest in assays associated with the endotype compared to assays not associated with the endotype.