WO2023011723A1

WO2023011723A1 - Identification of somatic variants

Info

Publication number: WO2023011723A1
Application number: PCT/EP2021/071952
Authority: WO
Inventors: Jyoti NANGALIA; Nicholas Williams
Original assignee: Genome Research Limited
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2023-02-09

Abstract

Methods for identifying somatic variants from sequence data are described, the method accounting for tumour in normal contamination to improve the sensitivity of variant calling when tumour in normal contamination is expected. The methods comprise receiving sequence data from one or more tumour samples of a subject and one or more normal samples, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples.

Description

IDENTIFICATION OF SOMATIC VARIANTS

Field of the invention

The present invention relates to methods of identifying somatic variants from sequence data, in particular by taking into account tumour in normal contamination. Systems and related products for implementing such methods are also described.

Background to the invention

Somatic mutations are a hallmark of cancers. Detecting, cataloguing and interpreting the somatic mutations that drive the transformation from normal cells to malignant cells is therefore essential to develop our understanding of this disease. As such, tremendous effort has been put towards this goal, as exemplified by projects such as The Cancer Genome Atlas (TCGA)/ International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes Consortium (PCAWGC). However, identifying somatic variants remains a challenging task. Indeed, they may occur at very low frequency in the genome, may occur in only a sub-population of tumour cells, which themselves may only represent a proportion of a tumour sample (due to normal cell contamination), and occur in a background that often has high copy number variability.

Multiple methods for identifying somatic variants have been developed in the last few years to address these challenges. For example, CaVEMan (Jones et al., 2016) is an expectation maximization-based somatic substitution-detection algorithm that performs a comparative analysis of a tumour and normal samples from the same patient to derive a probabilistic estimate for putative somatic substitutions. Combined with a set of validated post-hoc filters designed to remove putative false positive identifications, CaVEMan has been shown to generate a set of somatic substitution calls with high recall and positive predictive value (Jones et al., 2016). Despite this tremendous progress, some types of tumours remain challenging to study. For example, in order to unambiguously identify somatic mutations in a cancer genome, many approaches including CaVEMan require the knowledge of the corresponding germline single nucleotide polymorphisms (SNPs), which knowledge is obtained using a matched normal sample (typically a blood sample). This is problematic for haematological malignancies as obtaining germline DNA from such patients is very difficult. Some advances have been made using less traditional sources of germline DNA such as nail clippings (see e.g. Kakadia et al., 2018). However, in the majority of cases, the inability to confidently identify germline variants in haematological cancers have led to somatic variant analysis in such contexts being limited to putative variants that are known to be driver mutations.

Thus, there exists a need for new methods for identifying somatic variants, that alleviate one or more of the drawbacks of existing methods.

Brief Description of the Invention

Broadly, the present inventors recognised that the identification of somatic variants (in particular single nucleotide variants) from sequence data from tumour and matched normal samples could be improved by taking into account tumour in normal (TiN) contamination in the identification process. Tumour in normal contamination may occur in many clinical contexts, including but not limited to liquid and solid tumours. For example, even in the case of solid tumours, blood samples which are commonly used as matched normal samples are known to sometimes contain circulating tumour cells. The problem of TiN is particularly acute in the context of haematological cancers where obtaining the usual blood sample as a representation of germline DNA is not an option. Blood can be separated into specific cell types that are unaffected by the cancer, which can work in some instances in providing an accurate germline sequence, however, some contamination of the DNA with tumour is common. A similar problem arises when using alternative tissues (eg DNA from skin or buccal epithelium) as the source of germline DNA due to contamination of these tissues with blood cells. In this context, most of the currently used variant identification processes fail because they assume (whether in the identification and/or post-hoc filtering steps) that the matched normal sample is truly representative of germline sequence. Thus variants present in the tumour which are then also found contaminating the germline sample, are missed by the variant caller as such variants are assumed to have a germline rather than a somatic origin. The inventors set out to develop methods to identify somatic variants from matched tumour and normal samples that did not rely on the assumption that the matched normal sample is truly representative of the germline sequence, and instead explicitly built into the models used for the variant identification and/or post-hoc filtering the possibility that the matched normal sample may be contaminated with tumour derived sequences. As a result, the methods described herein are able to identify somatic variants in multiple cancers (particularly blood cancers) that would have been missed if state of the art variant calling methods were used.

Accordingly, in a first aspect the present invention provides a method for identifying somatic variants from sequence data, the method comprising: receiving sequence data from one or more tumour samples of a subject and one or more normal samples, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples.

As a result of the methods using a model that accounts for tumour in normal contamination, an improved sensitivity of variant calling is achieved when tumour in normal contamination is expected or confirmed.

The method may have any one or more of the following optional features.

As the skilled person understands, the complexity of the operations described herein (due at least to the amount of data that is typically generated by sequencing genomic DNA) are such that they are beyond the reach of a mental activity. Thus, unless context indicates otherwise (e.g. where sample preparation or acquisition steps are described), all steps of the methods described herein are computer implemented.

According to any aspect, sequence data may be received from a user, for example through a user interface. Sequence data may be received from a computing device, from a sequence data acquisitions means, and/or from one or more databases. Receiving sequence data may comprise receiving sequence data that is aligned to a reference genome. Alternatively, the method may further comprise aligning the sequence data to a reference genome. The method of any aspect may further comprise obtaining sequence data from one or more tumour samples of a subject and/or from one or more normal samples. Obtaining sequence data from a sample may comprise sequencing nucleic acids present in the sample, preferably using next generation sequencing such as WGS, WES or targeted (also referred to as "panel" or "pulldown" sequencing). Obtaining sequence data from a sample may comprise one or more in vitro steps. The one or more in vitro steps may comprise nucleic acid extraction, library preparation, target sequence capture, sequencing and/or hybridisation to a copy number array.

According to any aspect, the sample(s) may have been previously obtained from one or more subjects. Alternatively, the method may further comprise obtaining a tumour sample and/or a normal sample from a subject. As the skilled person understands, once sequence data has been obtained through e.g. sequencing, additional steps may be applied as known in the art, such as e.g. quality control steps, reference genome alignment step (also referred to as mapping),normalisation steps, etc. A sample may be a cell or tissue sample (in which case the sample may comprise multiple cell types each comprising respective genomic material), or a sample of nucleic acid derived therefrom. The one or more normal samples may have been obtained from the same subject as the tumour sample. The sequencing data that may have a sequencing depth of at least lOx, preferably at least 20x, more preferably at least 30x. Having a reasonable depth of sequencing in the normal sample may ensure that there is sufficient evidence to stop a variant being classified as germline in the presence of tumour in normal contamination. This is because the prior germline probability is higher than the prior somatic probability, thus requiring more evidence to be "overturned". The probabilistic model may be a Bayesian model. The one or more normal samples may be from the same subject. The sequence data from the one or more normal samples may comprise sequence data associated with the presence of tumour genomic material. According to any aspect, the one or more normal samples may be from the same subject. The sequence data from the one or more normal samples may comprise sequence data associated with the presence of tumour genomic material In other words, the one or more normal samples may comprise or be suspected of comprising tumour-in-normal contamination. The tumour may be a tumour of the hematopoietic or lymphoid tissue. The normal sample may be a buccal sample, a T cell sample, a blood sample, a plasma sample, or a purified sample derived from blood or plasma. The one or more normal samples may comprise at least 1%, at least 2%, at least 3%, at least 4% or at least 5% tumour in normal contamination.

According to any aspect, the tumour may be a tumour of the hematopoietic or lymphoid tissue, and/or the normal sample may be a buccal sample, a T cell sample, a blood sample, a plasma sample, or a purified sample derived from blood or plasma. According to any aspect, the one or more normal samples may comprise at least 1%, at least 2%, at least 3%, at least 4% or at least 5% tumour in normal contamination. The percentage of tumour in normal contamination may be the percentage of cells in a cell sample that are tumour (also referred to as "aberrant") cells, or the equivalent percentage of DNA containing cells in the sample that would result in a proportion of tumour/aberrant vs germline genomic material. The percentage of tumour in normal contamination in a normal cell sample may be referred to as the aberrant cell fraction for the normal sample (ot_n). The methods of the present invention have been shown to be able to identify more somatic variants than prior art methods in the present of tumour-in-normal contamination. Further, the proportion of somatic variants that would have been missed without the methods of the present invention increases with the percentage of tumour in normal contamination. The one or more tumour samples may comprise at least 5%, at least 10%, at least 15% or at least 20% normal cell contamination. Thus, the one or more tumour samples may have an aberrant cell fraction (a_t) of up to 80%, up to 85%,up to 90%, or up to 95%. The one or more normal samples may comprise at most 20%, or at most 15% tumour in normal contamination. The one or more normal samples may comprise between 0 and 20%, between 1 and 20%, between 2 and 20%, between 3 and 20%, between 4 and 20%, between 5 and 20%, between 0 and 15%, between 1 and 15%, between 2 and 15%, between 3 and 15%, between 4 and 15%, between 5 and 15% tumour in normal contamination. Where a plurality of tumour/normal samples are used, at least one of the samples may meet the above criteria. Where a plurality of tumour/normal samples are used, one or more of the samples (such as e.g. all of the samples) may meet the above criteria. The somatic variants may be single nucleotide variants. The somatic variants may be multinucleotide variants. The one or more variants may be insertion/deletion variants. The sequence data may comprise DNA sequence data. The DNA sequence data may be genomic sequence data.

Determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples may comprise: identifying a set of one or more candidate somatic variant sites; and applying one or more filters to the set of candidate somatic variant sites to obtain a filtered set; wherein the step of identifying a set of candidate somatic variant sites and/or the one or more filters use a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The one or more filters may be selected from: a proximal gap filter that excludes variants that are within a predetermined distance (such as e.g. 10 bp) of an indel, a poor mapping region filter that excludes variants that are in locations having a mapping quality score below a predetermined threshold, a clustered variant filter that excludes variants that are within a predetermined distance from each other (such as e.g. 10 bp), a strand bias filter that excludes variants where the genotype inferred from reads mapped to the forward strand is not the same as the genotype inferred from reads mapped to the reverse strand, a read depth filter that excludes variants that are in regions with a read depth below a predetermined threshold, optionally wherein the one or mor filters do not include a filter that excludes sites where the sequence data from the one or more normal samples contain evidence of the presence of the variant unless the filter uses a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The one or more filters may comprise a filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples. In other words, determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples may comprise: identifying a set of one or more candidate somatic variant sites; and excluding candidate somatic variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples. The step of identifying a set of one or more candidate somatic variant sites may eb performed using any variant calling method known in the art, such as e.g. CaveMAn (Jones et al., 2016).

The likelihood assuming that the site has a germline variant may be obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a first predetermined variant allele fraction. The likelihood assuming that the site has a somatic variant may be obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a second predetermined variant allele fraction, wherein the first and second predetermined variant allele fractions are different from each other. The second predetermined variant allele fraction may be obtained by multiplying the first predetermined variant allele fraction by a predetermined tumour in normal contamination.

According to any embodiment of any aspect, a predetermined tumour in normal contamination may be defined as a default value, such as e.g. a default value that enables the calling of variants with modest tumour in normal contamination without unduly classifying variants as somatic rather than germline. For example, a value of 0.1 may be used. The predetermined value may be set based on expert knowledge and/or the expected level of tumour contamination in a particular normal sample. The predetermined value may be determined by obtaining an estimated value using read counts from mutations that are expected to be present in tumour DNA but not in normal DNA, such as e.g. driver mutations. The predetermined value may be determined by analysing the variant allele fractions for called variants in the normal and tumour samples, when called without a matched sample. This can be performed for example by clustering the VAF for these variants int those samples, excluding germline variants (e.g. by applying a cutoff on the normal VAF, e.g. 25%), and identifying a cluster that likely comprises somatic variants with a VAF >0 in the normal sample. The centroid or another suitable summary metric for such a cluster may be used to estimate the aberrant cell fraction (tumour in normal contamination and optionally also tumour cell fraction in the tumour sample). The value for an aberrant cell fraction in the tumour sample may be estimated using methods known in the art, such as e.g. ASCAT (Van Loo et al., 2010) or Battenberg (Nik-Zainal et al., 2012). Alternatively, a suitable predetermined value may be used.

The probability of sequencing error associated with the sequence data may be obtained for each read in the sequence data. The probability of sequencing error associated with the sequence data may be a predetermined value that is the same for all sequence reads. The use of a single value across all reads may advantageously enable the method to run using only summary read counts, rather than e.g. BAM files from which per base call quality scores can be extracted. The predetermined value may be an error probability, based on e.g. the minimum phred scale quality expected in the data. The predetermined threshold (0) may depend on a desired level of certainty (5), a prior probability that the candidate somatic variant is a somatic variant, and a prior probability that the candidate somatic variant is a germline variant. The parameter 5 may represent the minimum ratio of: (i) the posterior probability that the candidate somatic variant is a somatic variant, and (ii) the posterior probability that the candidate somatic variant is a germline variant. Thus, requiring that the ratio of the posterior probabilities are above 5 may be equivalent to requiring that the variant is at least 5 times more likely to be a somatic variant thana germline variant. The posterior probability that the candidate somatic variant is a somatic/germline variant is the product of the likelihood of the data assuming that the candidate somatic variant is a somatic/germline variant (as the case may be), and a prior probability that the candidate somatic variant is a somatic/germline variant (as the case may be).

The filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold may be a filter that excludes variants that do not meet the criterion:

Or the equivalent criterion: login

where P(Data|Somatic) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant, P(Data|Germline) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a germline variant, P(Somatic) is the prior probability that the candidate somatic variant is a somatic variant and P(Germline) is the prior probability that the candidate somatic variant is a germline variant. As the skilled person understands, although a loglO is used in the formula above, any other transformation (including any other log, or non-logarithmic transformation, or no transformation) of the ratio may be used.

The prior probability that the candidate somatic variant is a somatic variant may be set to a predetermined value, such as e.g. 3xl0^-6. The prior probability that the candidate somatic variant is a germline variant may depend on whether the variant is catalogued in one or more data sets of known single nucleotide polymorphisms. According to any embodiment of any aspect described herein, a prior probability that a variant is a somatic variant (P(Som)) may be set to a predetermined value, for example based on prior knowledge, such as e.g. a typical mutation frequency in the genome under investigation. For humans this can be set for example to 3xl0^-6 as a sensible default. In embodiments, an estimate of the number of SNVs acquired per year may be used to set P(Som) depending on the age of the sample. For example, assuming that 20 SNVs are expected per year, approximately 1000 SNVs may be expected at age 50, leading to a P(Som) estimate of 1000/3xl0⁹ for a human genome, i.e. 3xl0^-7. According to any embodiment of any aspect described herein, a prior probability that the candidate somatic variant is a somatic variant may be set to a predetermined value, which may depend on whether the variant is catalogued in one or more data sets of known single nucleotide polymorphisms. According to any embodiment of any aspect described herein, a prior probability that a variant is a germline variant (P(SNP) may be set to a predetermined value, for example based on prior knowledge, such as e.g. the average number of SNPs expected in a human genome. For example, assuming that every human genome (approximately 3xl0⁹ bases) contains approximately 3xl0⁶ SNPs, a value of P(SNP)=lxlO³ may be used. The use of a prior probability that the candidate somatic variant is a germline variant that depends on whether the variant is catalogued in one or more data sets of known SNPs advantageously includes prior knowledge about germline variant sites in the prior probability. For example, a higher prior probability of the variant being a germline variant may be used for candidate somatic variants that are known to occur as germline variants than for variants that are not known to occur as germline variants, thereby requiring more evidence in favour of the variant being somatic in order for the variant being identified as somatic in the former case (variant is a known germline variant) than in the latter case (variant is not a known germline variant). The one or more data sets of known single nucleotide polymorphisms may comprise one or more of: dbSNP, 1000 genomes and ExAC data sets. The use of a plurality of data sets may advantageously lower the evidence required for a variant to be classified as somatic at known SNP sites. The one or more data sets of known single nucleotide polymorphisms may be filtered to exclude single nucleotide variants that are catalogued in one or more data sets of known single nucleotide variants in cancer. For example, the one or more data sets of known single nucleotide variants in cancer may be selected from the COSMIC database (Tate et al., 2019). This may advantageously increase the sensitivity of detection of somatic variants by reducing the amount of evidence required for a variant to be classified as somatic at known SNP sites that are also known SNV sites. For example, a first prior probability that the candidate somatic variant is a germline variant may be used if the candidate somatic variant is catalogued in one or more data sets of known single nucleotide polymorphisms, and a second prior probability that the candidate somatic variant is a germline variant may be used if the candidate somatic variant is not catalogued in one or more data sets of known single nucleotide polymorphisms.

The first prior probability may depend on: the probability that a site is a germline variant if it is in the one or more data sets (P(germline|site in db), which may for example be estimated as the proportion of single nucleotide polymorphisms expected to be represented in the one or more data sets of known single nucleotide polymorphisms), the probability that any site in a genome is a germline mutation (P(site is germline), which can be estimated for example as the number of germline mutations per genome), and the probability that a site is in the one or more data sets (P(site in db), which may be estimated for example as the number of single nucleotide polymorphisms sites represented in the one or more data sets of known single nucleotide polymorphisms). The first prior probability may be obtained as P(germline|site is in db)=

P(germline|site in db)x P(site is germline)/P(site in db). The first prior probability may depend on: an estimate proportion of single nucleotide polymorphisms expected to be represented in the one or more data sets of known single nucleotide polymorphisms (such as e.g.0.95 when using dbSNP or 0.995 when using a combination of dbSNP, 1000 genomes and ExAc), an estimated number of germline mutations per genome (such as e.g. 3e6), and the number of single nucleotide polymorphisms sites represented in the one or more data sets of known single nucleotide polymorphisms (e.g. approx. 30e6 when using dbSNP, or approx. 10e6 when using a combination of dbSNP, 1000 genomes and ExAC). The second prior probability may depend on: the probability that a site is a germline variant if it is not in the one or more data sets (P(germline|site not in db), which may for example be estimated as the proportion of single nucleotide polymorphisms expected not to be represented in the one or more data sets of known single nucleotide polymorphisms), the probability that any site in a genome is a germline mutation (P(site is germline), which can be estimated for example as the number of germline mutations per genome), and the probability that a site is not in the one or more data sets (P(site not in db), which may be estimated for example as the number of single nucleotide polymorphisms sites expected not to be represented in the one or more data sets of known single nucleotide polymorphisms). The second prior probability may be obtained as P(germline|site is not in db)= P(germline|site not in db)x P(site is germline)/P(site not in db). The second prior probability may depend on: an estimate proportion of single nucleotide polymorphisms not represented in the one or more data sets of known single nucleotide polymorphisms (such as e.g. 0.05 when using dbSNP, or 0.005 when using a combination of dbSNP, Exac and 1000 Genomes, or 1-the estimated proportion used for the first prior probability), an estimated number of germline mutations per genome (which may be the same as that used for the first prior probability, for example 3e6), and an estimated number of single nucleotide polymorphisms sites not represented in the one or more data sets of known single nucleotide polymorphisms (such as e.g. 3e9).

Determining whether each of the one or more identified sites is a somatic variant may comprise: determining the posterior probability of each of one or more candidate joint genotypes having a somatic variation, in view of the sequence data for the one or more tumour samples and the sequence data for the one or more normal samples, wherein said posterior probability depends on the aberrant cell fraction in the one or more tumour samples and the aberrant cell fraction in the one or more normal samples. The aberrant cell fraction in the one or more tumour samples and/or the aberrant cell fraction in the one or more normal samples may be set to a respective predetermined value. Alternatively, the aberrant cell fraction in the one or more tumour samples and/or the aberrant cell fraction in the one or more normal samples may be estimated using any method known in the art such as e.g. ASCAT (Van Loo et al., 2010). The posterior probability may further depend on the site specific copy number for the germline genome and the site specific copy number for the tumour genome. These values may be estimated using any method known in the art, such as e.g. ASCAT Van Loo et al., 2010). Alternatively, these values may be set to a respective predetermined value. For example, a value of 2 may be used for the one or more normal samples. A value equal to or higher than 2 (such as e.g. 3 or more, 4 or more, 5 or more, or 5) may be used for the tumour sample(s). The present inventors have found using values of 5 to be sufficiently permissive for many tumour genomes. The posterior probability may further depend on a respective prior probability for the joint genotype for which a posterior probability is obtained. The posterior probability may further depend on the value of one or more covariates associated with the sequence data, for example through the likelihood of the data assuming the joint genotype for which a joint posterior probability is obtained. The one or more covariates may include one or more covariates defined at the level of individual sequence reads (such as e.g. lane group, read group, read order, strand mapping quality) and/or one or more covariates defined at the level of individual base calls (such as e.g. base quality, and position within read). The posterior probability for a joint genotype may be obtained as the product of the likelihood of the sequence data assuming the joint genotype and a prior probability for the joint genotype, divided by a normalising term. The normalising term may represent the sum over all joint genotypes considered of the product of the likelihood of the sequence data assuming a joint genotype and a prior probability for the joint genotype. The prior probability for a joint genotype may be estimated for a genotype that is a somatic joint genotype using the probability that any site has a somatic variant (P(Som)) and the probability that any site has a germline variant (P(SNP). The prior probability for a joint genotype may be estimated for a genotype that is a germline joint genotype using the probability that any site has a germline variant. The prior probability for a joint genotype may be estimated for a genotype that is a reference joint genotype using the probability that any site has a somatic variant and the probability that any site has a germline variant. For example, the prior probability P(G_g,G_t) for a joint genotype G_g,G_t may be obtained as P((G_g,Gt) e co(G_g,G_t))/| co(G_g,G_t)|, where P((G_g,Gt) £ co(G_g,Gt)) may be estimated as P((G_g,G_t)e Q_s)=P(Som)(1-P(SNP)), P((G_g,G_t)£ Q_g)=P(SNP), and P((G_g,G_t)e Q_r)=(1-P(Som))(1-P(SNP)), respectively for the somatic, germline and reference joint genotype sets, and each joint genotype in a joint genotype set is equally likely.

The posterior probability may depend on the aberrant cell fraction in the one or more tumour samples and the aberrant cell fraction in the one or more normal samples through the likelihood of the sequence data assuming the joint genotype. The likelihood of the data assuming the joint genotype may bethe product of: a term for the sequence data from the tumour sample(s), which depends on the aberrant cell fraction in the one or more tumour samples, and a term for the sequence data from the normal sample(s), which depends on the aberrant cell fraction in the one or more normal samples. The likelihood of the data assuming the joint genotype may be the product of: a term for the likelihood of the sequence data from the tumour sample(s), in view of the aberrant cell fraction in the one or more tumour samples, the joint genotype and optionally the value of one or more covariates, and a term for the likelihood of the sequence data from the normal sample(s), in view of the aberrant cell fraction in the one or more normal samples, the joint genotype and optionally the value of one or more covariates. The one or more covariates may include one or more covariates defined at the level of individual sequence reads (such as e.g. lane group, read group, read order, strand mapping quality) and/or one or more covariates defined at the level of individual base calls (such as e.g. base quality, and position within read). The likelihood of the sequence data from the normal or tumour samples may be expressed as the sum of a term for the reference allele and a term for the variant allele in the joint genotype. The likelihood of the sequence data from the normal or tumour samples may be expressed using equation 3. The likelihood of the sequence data from the normal or tumour samples may be expressed using equation 4.

The likelihood of the sequence data from the normal or tumour samples may depend on: a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the reference allele for the joint genotype and in view of the values of one or more covariates, and a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the variant allele for the joint genotype and in view of the value of one or more covariates. The probability of a particular base being called from the sequence data assuming a true genotype and in view of the value of one or more covariates may be estimated empirically from the sequence data from the tumour samples and/or the normal samples, by assuming that all sites that differ from a reference sequence are sequencing errors, preferably excluding all sites that are known germline variant. Known germline variants can be obtained from one or more databases such as e.g. dbSNP, 100 Genomes and/or ExAc. The likelihood of the sequence data from the normal or tumour samples may be obtained as the sum of: a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the reference allele for the joint genotype and in view of the values of one or more covariates (profile probability), multiplied by a term that represents the probability that a read from the sequence data has the reference allele in view of the joint genotype, and a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the variant allele for the joint genotype and in view of the values of one or more covariates (profile probability), multiplied by a term that represents the probability that a read from the sequence data has the variant allele in view of the joint genotype.

Determining whether each of the one or more identified sites is a somatic variant may further comprise: determining whether the posterior probability a candidate joint genotypes having a somatic variation and/or the sum of the posterior probabilities of each of the candidate joint genotypes having a somatic variation is/are above a predetermined threshold, and determining that the site has a somatic variant in the affirmative. The step of identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome may not use the sequence data from the one or more normal samples, or may use sequence data from one or more unmatched normal samples. The step of identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome may use any variant calling process, including any variant calling process known in the art. For example, the step of identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome may use the CaVEMan method as described in Jones et al. 2016 (where TiN contamination is not accounted for). This may use sequence data from one or more normal samples that are matched normal samples, and/or one or more samples that are not matched normal samples (such as e.g. an unmatched normal sample, as exemplified in Example 5). Alternatively this step may use a variant calling process that does not use a normal sample at all, for example as described in the second aspect below and in Example 2. The method may further comprise filtering out sites of the one or more identified sites that are likely to be germline SNP by comparison to a germline SNP data set.

The method may further comprise analysing the sequence data from the one or more tumour samples of the subject and the one or more normal samples using embodiments comprising using a filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples; and(ii) analysing the sequence data from the one or more tumour samples of the subject and the one or more normal samples using embodiments comprising determining the posterior probability of each of one or more candidate joint genotypes having a somatic variation, in view of the sequence data for the one or more tumour samples and the sequence data for the one or more normal samples, wherein said posterior probability depends on the aberrant cell fraction in the one or more tumour samples and the aberrant cell fraction in the one or more normal samples. The method may further comprise(iii) comparing the identified sites determined to be somatic variants at step (i) and the identified sites determined to be somatic variants at step (ii) and further filtering the identified sites determined to be somatic variants at step (i) but not determined to be somatic variants at step (ii). Variants that are identified as somatic variants using the filtering approach in claims 8 to 14 may include variants that are not true somatic variants, particularly if the filtering approach was applied to a set of sites called without a matched normal sample. Thus, this set of variants may be further analysed to filter out the variants that are unlikely to eb true somatic variants. This may be done, for example, by analysing the VAF in the normal and tumour sample(s) for these variants, and/or by looking at the trinucleotide context of these variants compared to the distribution of trinucleotide contexts of variants identified at both steps (i) and (ii). According to a second aspect, there is provided a method for identifying somatic variants from sequence data, the method comprising: receiving sequence data from one or more tumour samples of a subject, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining the posterior probability of each of one or more candidate joint genotypes having a somatic variation, in view of the sequence data for the one or more tumour samples, wherein said posterior probability depends on the likelihood of observing the sequence data from the one or more tumour samples assuming each of the one or more candidate joint genotypes having a variation compared to the reference genome and an aberrant cell fraction in the one or more tumour samples.

The aberrant cell fraction in the one or more tumour samples may be set to a respective predetermined value. Alternatively, the aberrant cell fraction in the one or more tumour samples may be estimated using any method known in the art such as e.g. ASCAT (Van Loo et al., 2010). The posterior probability may further depend on the site specific copy number for a germline genome and the site specific copy number for the tumour genome. These values may be estimated using any method known in the art, such as e.g. ASCAT Van Loo et al., 2010). Alternatively, these values may be set to a respective predetermined value. For example, a value of 2 may be used for the one or more normal samples. A value equal to or higher than 2 (such as e.g. 3 or more, 4 or more, 5 or more, or 5) may be used for the tumour sample(s). The present inventors have found using values of 5 to be sufficiently permissive for many tumour genomes. The posterior probability may further depend on a respective prior probability for the joint genotype for which a posterior probability is obtained. The posterior probability may further depend on the value of one or more covariates associated with the sequence data, for example through the likelihood of the data assuming the joint genotype for which a joint posterior probability is obtained. The one or more covariates may include one or more covariates defined at the level of individual sequence reads (such as e.g. lane group, read group, read order, strand mapping quality) and/or one or more covariates defined at the level of individual base calls (such as e.g. base quality, and position within read). The posterior probability for a joint genotype may be obtained as the product of the likelihood of the sequence data assuming the joint genotype and a prior probability for the joint genotype, divided by a normalising term. The normalising term may represent the sum over all joint genotypes considered of the product of the likelihood of the sequence data assuming a joint genotype and a prior probability for the joint genotype. The prior probability for a joint genotype may be estimated for a genotype that is a somatic joint genotype using the probability that any site has a somatic variant (P(Som)) and the probability that any site has a germline variant (P(SNP). The probability that any site has a germline variant (P(SNP)) may be set to 0. The prior probability for a joint genotype may be estimated for a genotype that is a germline joint genotype using the probability that any site has a germline variant. The prior probability for a joint genotype may be estimated for a genotype that is a reference joint genotype using the probability that any site has a somatic variant and the probability that any site has a germline variant. For example, the prior probability P(G_g,Gt) for a joint genotype G_g,G_t may be obtained as P((G_g,G_t) £ co(G_g,G_t))/| co(G_g,G_t)|, where P((G_g,G_t) £ co(G_g,G_t)) may be estimated as P((G_g,G_t)£ Q_s)=P(Som)(1-P(SNP)), P((G_g,G_t)£ Q_g)=P(SNP), and P((G_g,G_t)£ Q_r)=(1-P(Som))(1-P(SNP)), respectively for the somatic, germline and reference joint genotype sets, and each joint genotype in a joint genotype set is equally likely.

When data for normal samples is available, according to the present aspect the likelihood of the data assuming the joint genotype does not include a term for the sequence data from the normal sample(s). The likelihood of the sequence data from the tumour sample(s) may be a likelihood in view of the aberrant cell fraction in the one or more tumour samples, the joint genotype and the value of one or more covariates. The one or more covariates may include one or more covariates defined at the level of individual sequence reads (such as e.g. lane group, read group, read order, strand mapping quality) and/or one or more covariates defined at the level of individual base calls (such as e.g. base quality, and position within read). The likelihood of the sequence data from the tumour samples may be expressed as the sum of a term for the reference allele and a term for the variant allele in the joint genotype. The likelihood of the sequence data from the normal or tumour samples may be expressed using equation 3. The likelihood of the sequence data from the normal or tumour samples may be expressed using equation 4. The likelihood of the sequence data from the tumour samples may depend on: a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the reference allele for the joint genotype and in view of the values of one or more covariates, and a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the variant allele for the joint genotype and in view of the value of one or more covariates. The probability of a particular base being called from the sequence data assuming a true genotype and in view of the value of one or more covariates may be estimated empirically from the sequence data from the tumour samples, by assuming that all sites that differ from a reference sequence are sequencing errors, preferably excluding all sites that are known germline variant. Known germline variants can be obtained from one or more databases such as e.g. dbSNP, 100 Genomes and/or ExAc. The likelihood of the sequence data from the tumour samples may be obtained as the sum of: a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the reference allele for the joint genotype and in view of the values of one or more covariates (profile probability), multiplied by a term that represents the probability that a read from the sequence data has the reference allele in view of the joint genotype, and a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the variant allele for the joint genotype and in view of the values of one or more covariates (profile probability), multiplied by a term that represents the probability that a read from the sequence data has the variant allele in view of the joint genotype.

Determining whether each of the one or more identified sites is a somatic variant may further comprise: determining whether the posterior probability a candidate joint genotypes having a somatic variation and/or the sum of the posterior probabilities of each of the candidate joint genotypes having a somatic variation is/are above a predetermined threshold, and determining that the site has a somatic variant in the affirmative. In such embodiments, the method may comprise: receiving sequence data from one or more tumour samples of a subject and one or more normal samples, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The model may be a Bayesian model. Determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples may comprise: identifying a set of one or more candidate somatic variant sites; and applying one or more filters to the set of candidate somatic variant sites to obtain a filtered set; wherein the one or more filters use a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The one or more filters may be selected from: a proximal gap filter that excludes variants that are within a predetermined distance (such as e.g. 10 bp) of an indel, a poor mapping region filter that excludes variants that are in locations having a mapping quality score below a predetermined threshold, a clustered variant filter that excludes variants that are within a predetermined distance from each other (such as e.g. 10 bp), a strand bias filter that excludes variants where the genotype inferred from reads mapped to the forward strand is not the same as the genotype inferred from reads mapped to the reverse strand, a read depth filter that excludes variants that are in regions with a read depth below a predetermined threshold. The one or mor filters may not include a filter that excludes sites where the sequence data from the one or more normal samples contain evidence of the presence of the variant unless the filter uses a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The one or more filters may comprise a filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples. The method may comprise excluding candidate somatic variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples. The likelihood assuming that the site has a germline variant may be obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a first predetermined variant allele fraction, the likelihood assuming that the site has a somatic variant may be obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a second predetermined variant allele fraction, wherein the first and second predetermined variant allele fractions are different from each other. The second predetermined variant allele fraction may be obtained by multiplying the first predetermined variant allele fraction by a predetermined tumour in normal contamination. The probability of sequencing error associated with the sequence data may be obtained for each read in the sequence data or wherein the probability of sequencing error associated with the sequence data is a predetermined value that is the same for all sequence reads. The predetermined threshold (0) may depend on a desired level of certainty (5), a prior probability that the candidate somatic variant is a somatic variant, and a prior probability that the candidate somatic variant is a germline variant. The parameter 5 may represent the minimum ratio of: (i) the posterior probability that the candidate somatic variant is a somatic variant, and (ii) the posterior probability that the candidate somatic variant is a germline variant. Thus, requiring that the ratio of the posterior probabilities are above 5 may be equivalent to requiring that the variant is at least 5 times more likely to be a somatic variant thana germline variant. The posterior probability that the candidate somatic variant is a somatic/germline variant is the product of the likelihood of the data assuming that the candidate somatic variant is a somatic/germline variant (as the case may be), and a prior probability that the candidate somatic variant is a somatic/germline variant (as the case may be).

Or the equivalent criterion:

where P(Data|Somatic) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant, P(Data|Germline) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a germline variant, P(Somatic) is the prior probability that the candidate somatic variant is a somatic variant and P(Germline) is the prior probability that the candidate somatic variant is a germline variant. As the skilled person understands, although a loglO is used in the formula above, any other transformation (including any other log, or non-logarithmic transformation, or no transformation) of the ratio may be used. The prior probability that the candidate somatic variant is a somatic variant may be set to a predetermined value, such as e.g. 3xl0^-6 and/or wherein the prior probability that the candidate somatic variant is a germline variant depends on whether the variant is catalogued in one or more data sets of known single nucleotide polymorphisms. The use of a prior probability that the candidate somatic variant is a germline variant that depends on whether the variant is catalogued in one or more data sets of known SNPs advantageously includes prior knowledge about germline variant sites in the prior probability. For example, a higher prior probability of the variant being a germline variant may be used for candidate somatic variants that are known to occur as germline variants than for variants that are not known to occur as germline variants, thereby requiring more evidence in favour of the variant being somatic in order for the variant being identified as somatic in the former case (variant is a known germline variant) than in the latter case (variant is not a known germline variant). The one or more data sets of known single nucleotide polymorphisms may comprise one or more of: dbSNP, 1000 genomes and ExAC data sets. The use of a plurality of data sets may advantageously lower the evidence required for a variant to be classified as somatic at known SNP sites. The one or more data sets of known single nucleotide polymorphisms may be filtered to exclude single nucleotide variants that are catalogued in one or more data sets of known single nucleotide variants in cancer. For example, the one or more data sets of known single nucleotide variants in cancer may be selected from the COSMIC database (Tate et al., 2019). This may advantageously increase the sensitivity of detection of somatic variants by reducing the amount of evidence required for a variant to be classified as somatic at known SNP sites that are also known SNV sites. For example, a first prior probability that the candidate somatic variant is a germline variant may be used if the candidate somatic variant is catalogued in one or more data sets of known single nucleotide polymorphisms, and a second prior probability that the candidate somatic variant is a germline variant may be used if the candidate somatic variant is not catalogued in one or more data sets of known single nucleotide polymorphisms. The first prior probability may depend on: the probability that a site is a germline variant if it is in the one or more data sets (P(germline|site in db), which may for example be estimated as the proportion of single nucleotide polymorphisms expected to be represented in the one or more data sets of known single nucleotide polymorphisms), the probability that any site in a genome is a germline mutation (P(site is germline), which can be estimated for example as the number of germline mutations per genome), and the probability that a site is in the one or more data sets (P(site in db), which may be estimated for example as the number of single nucleotide polymorphisms sites represented in the one or more data sets of known single nucleotide polymorphisms). The first prior probability may be obtained as P(germline|site is in db)= P(germline|site in db)x P(site is germline)/P(site in db). The first prior probability may depend on: an estimate proportion of single nucleotide polymorphisms expected to be represented in the one or more data sets of known single nucleotide polymorphisms (such as e.g.0.95 when using dbSNP or 0.995 when using a combination of dbSNP, 1000 genomes and ExAc), an estimated number of germline mutations per genome (such as e.g. 3e6), and the number of single nucleotide polymorphisms sites represented in the one or more data sets of known single nucleotide polymorphisms (e.g. approx. 30e6 when using dbSNP, or approx. 10e6 when using a combination of dbSNP, 1000 genomes and ExAC). The second prior probability may depend on: the probability that a site is a germline variant if it is not in the one or more data sets (P(germline|site not in db), which may for example be estimated as the proportion of single nucleotide polymorphisms expected not to be represented in the one or more data sets of known single nucleotide polymorphisms), the probability that any site in a genome is a germline mutation (P(site is germline), which can be estimated for example as the number of germline mutations per genome), and the probability that a site is not in the one or more data sets (P(site not in db), which may be estimated for example as the number of single nucleotide polymorphisms sites expected not to be represented in the one or more data sets of known single nucleotide polymorphisms). The second prior probability may be obtained as

P(germline|site is not in db)= P(germline|site not in db)x P(site is germline)/P(site not in db). The second prior probability may depend on: an estimate proportion of single nucleotide polymorphisms not represented in the one or more data sets of known single nucleotide polymorphisms (such as e.g. 0.05 when using dbSNP, or 0.005 when using a combination of dbSNP, Exac and 1000 Genomes, or 1-the estimated proportion used for the first prior probability), an estimated number of germline mutations per genome (which may be the same as that used for the first prior probability, for example 3e6), and an estimated number of single nucleotide polymorphisms sites not represented in the one or more data sets.

The methods according to any embodiment of any aspect may comprise providing to a user, for example through a user interface, information identifying the sites determined to be somatic variants. The method may further comprise providing to a user, for example through a user interface, information identifying the sites determined not to be somatic variants. The information identifying a site determined to be a somatic variant may comprise one or more of the location of the site in the reference genome, the identity of the variant, and one or more confidence metrics associated with the determination of the variant as somatic and/or the sequence data associated with the site. Confidence metrics associated with the sequence data may comprise base call quality scores, alignment scores, etc. Confidence metrics associated with the determination of the variant as somatic may comprise a posterior probability that the site has a somatic variant, a posterior probability that the site has a germline variant, a posterior probability that the site has the reference genome, one or more posterior probabilities that the site has a joint genotype associated with the presence of a somatic variant, one or more posterior probabilities that the site has a joint genotype associated with the presence of a germline variant, and/or a metric reflecting the relative likelihood that the site has a somatic variant or a germline variant in view of the sequence data for the one or more normal samples. A joint genotype may refer to a pair of genotypes for the tumour and normal cells. According to a third aspect, there is provided a method of providing a prognosis for a subject that has been diagnosed as having cancer, the method comprising identifying one or more somatic variants in one or more tumour samples from the subject using the method of any embodiment of a preceding aspect, and classifying the subject between a plurality of groups associated with different prognosis based on the one or more somatic variants identified.

According to a fourth aspect, there is provided a method of monitoring a subject that has been diagnosed as having cancer, the method comprising identifying one or more somatic variants in one or more tumour samples from the subject acquired at a first time point and in one or more tumour samples from the subject acquired at a further time point, using the method of any embodiment of the first aspect.

According to a fifth aspect, there is provided a method of determining whether a subject that has been diagnosed as having cancer is likely to respond to a therapy, optionally wherein the therapy is an immunotherapy, the method comprising identifying one or more somatic variants in one or more tumour samples from the subject using the method of any embodiment of the first aspect, and classifying the subject between a plurality of groups associated with different responses to the therapy based on the one or more somatic variants identified. The method may further comprise administering the therapy, such as the immunotherapy, to a subject that has been classified in a group likely to respond to the therapy. The method may comprise recommending a subject that has been diagnosed as likely to respond to the therapy for treatment with the therapy. The method may comprise administering an alternative therapy (e.g. a conventional chemotherapy, radiotherapy, etc.) and/or recommending a subject for treatment with an alternative therapy, where the subject has been diagnosed as not likely to respond to the therapy. In embodiments, the method may comprise classifying the subject between a group that is likely to respond to the immunotherapy (e.g. CPI therapy), and a group that is not likely to respond to the immunotherapy (e.g. CPI therapy), based on an estimated tumour mutational burden for the subject obtained from the identified somatic variants. Such a classification may be obtained using a multivariate classification model (for example, a classification method using a trained generalised linear model). The parameters of a classification model (e.g. threshold(s), parameters of a multivariate model) may be determined using an appropriate training cohort, for example using a group of subjects with known Immunotherapy response status.

According to a further aspect, the present invention provides a system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprisingreceiving sequence data from one or more tumour samples of a subject and one or more normal samples, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples. The system according to the present aspect may have any of the features described in relation to the first aspect. The system according to the present aspect may be configured to implement the method of any embodiment of the preceding aspects. In particular, the at least one non- transitory computer readable medium may contain instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising any of the operations described in relation to the first aspect.

The system may further comprise, in operable connection with the processor, one or more of: a user interface, wherein the instructions further cause the processor to provide, to the user interface for outputting to a user, at least information identifying one or more somatic variants identified; one or more sequence data acquisition devices (such as e.g. a sequencing device); one or more data stores, such as e.g. a sequence data and/or reference sequence data database.

According to a further aspect, there is provided a non-transitory computer readable medium or media comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any aspect described herein.

According to a further aspect, there is provided a computer program comprising code which, when the code is executed on a computer, causes the computer to perform the method of any embodiment of any aspect described herein.

The present invention includes the combination of the aspects and preferred features described except where such a combination is clearly impermissible or is stated to be expressly avoided. These and further aspects and embodiments of the invention are described in further detail below and with reference to the accompanying examples and figures.

Brief description of the figures

Figure 1 is a flowchart illustrating schematically a method of identifying somatic variants according to the present disclosure.

Figure 2 shows an embodiment of a system for identifying somatic variants according to the present disclosure.

Figure 3 shows the results of an analysis of variants identified in bulk, whole genome sequenced blood samples (Bone marrow, peripheral blood) from patients with myeloproliferative neoplasms using standard off the shelf Caveman ("NST") or a method described herein incorporating tumour in normal modelling in the variant calling process (CAVETIN - described in Example 2). A. Absolute number of SNVs identified in each sample. B. Proportion of total number of SNVs identified in each sample. C. Proportion of variants identified only by the methods described herein as a function of contamination of the normal sample with tumour (ACF=aberrant cell fraction). D. Violin plot of variant allele fraction in the tumour and normal sample, for variants identified by either or both methods. E-G show results for a single exemplary patient. E. VAF in tumour and normal sample for an individual sample. F. Cosine similarity in the trinucleotide distribution of the variants identified by either or both methods. G. Trinucleotide sequence distribution of the variants identified by either or both methods.

Figure 4 shows the results of an analysis of variants identified in bulk, whole genome sequenced blood samples (Bone marrow, peripheral blood) from patients with myeloproliferative neoplasms using standard off the shelf Caveman ("NST") or a method described herein incorporating tumour in normal modelling in the variant filtering process (GLOD - described in Example 1). A. Absolute number of SNVs identified in each sample. B. Proportion of total number of SNVs identified in each sample. C. Proportion of variants identified only by the methods described herein as a function of contamination of the normal sample with tumour (ACF=aberrant cell fraction). D. Violin plot of variant allele fraction in the tumour and normal sample, for variants identified by either or both methods. E-G show results for a single exemplary patient. E. VAF in tumour and normal sample for an individual sample. F. Cosine similarity in the trinucleotide distribution of the variants identified by either or both methods. G. Trinucleotide sequence distribution of the variants identified by either or both methods.

Figure 5 shows the results of an analysis of variants identified in bulk, whole genome sequenced blood samples (Bone marrow, peripheral blood) from patients with myeloproliferative neoplasms using a gold standard validated method comprising using multiple variant callers and requiring agreement between at least two callers (PANCAN) or a method described herein incorporating tumour in normal modelling in the variant calling process (CAVETIN - described in Example 2). A. Absolute number of SNVs identified in each sample. B. Proportion of total number of SNVs identified in each sample. C. Proportion of variants identified only by the methods described herein as a function of contamination of the normal sample with tumour (ACF=aberrant cell fraction). D. Violin plot of variant allele fraction in the tumour and normal sample, for variants identified by either or both methods. E. Trinucleotide sequence distribution of the variants identified by either or both methods. F. Cosine similarity in the trinucleotide distribution of the variants identified by either or both methods.

Figure 6 shows the results of an analysis of variants identified in bulk, whole genome sequenced blood samples (Bone marrow, peripheral blood) from patients with myeloproliferative neoplasms using a gold standard validated method comprising using multiple variant callers and requiring agreement between at least two callers (PANCAN) or a method described herein incorporating tumour in normal modelling in the variant filtering process (GLOD - described in Example 1). A. Absolute number of SNVs identified in each sample. B. Proportion of total number of SNVs identified in each sample. C. Proportion of variants identified only by the methods described herein as a function of contamination of the normal sample with tumour (ACF=aberrant cell fraction). D. Violin plot of variant allele fraction in the tumour and normal sample, for variants identified by either or both methods. E. Trinucleotide sequence distribution of the variants identified by either or both methods. F. Cosine similarity in the trinucleotide distribution of the variants identified by either or both methods.

Figure 7 shows the results of an analysis of the sensitivity of a method of identifying somatic variants to the aberrant cell fraction in the normal sample. Each plot shows, for a different aberrant cell fraction in the normal sample (A:0, B:0.1, C:0.2, D:0.3), the number of mutant reads in the tumour (y axis) and the normal sample (x axis) covering a position, coloured according to whether the position is classified as reference (no SNV or SNP), ambiguous, SNV (i.e. somatic mutation), SNV or SNP , or SNP (germline variant).

Figure 8 shows the results of an analysis of the sensitivity of a method of identifying somatic variants to the tumour copy number. Each plot shows, for a different tumour copy number (A:l, B:2, C:5, D:10), the number of mutant reads in the tumour (y axis) and the normal sample (x axis) covering a position, coloured according to whether the position is classified as reference (no SNV or SNP), ambiguous, SNV (i.e. somatic mutation), SNV or SNP , or SNP (germline variant).

Figure 9 shows the results of an analysis of the sensitivity of a method of identifying somatic variants to the variant allele fraction. A. For each class (reference (REF), ambiguous (AMB), SNV (SOM), SNP (SNP), "SNV or SNP" (referred to as "VAR")), two curves are shown, one assuming a tumour copy number of two (solid lines), and one assuming a tumour copy number of 5 (dotted lines)- all with a normal copy number of 2. B-C. For each class (reference (REF), ambiguous (AMB), SNV (SOM), SNP (SNP), "SNV or SNP" (referred to as "VAR")), two curves are shown, one assuming a tumour in normal contamination of 0.1 (solid line), and one assuming a tumour in normal contamination of 0 (dotted line). In B, the SNP prior was set to 0.001, whereas it was set to 0.00001 in C. In all figures, a fixed depth of 20 and 40, respectively normal sample and the tumour sample, and aberrant cell fraction of 0.9 in the tumour sample were used. In B and C the copy numbers were assumed to be 5 in the tumour and 2 in the normal.

Figure 10 shows the percentages of variants filtered using a variety of posthoc variant filters, in a representative patient (see Example 5). The diagonal entries show the percentage of filtered candidate variants that are removed by each filter and the off-diagonal elements show the number of filtered variants removed by both the corresponding filters. The germline filter "bglod" according to aspects of the invention is most active, here filtering out 93.5% of the candidate somatic SNVs (note, the filtering here was also applied to variants present in all colonies to confirm their germline nature).

Figure 11 shows the results of variant calling using CaveMAn with or without the new GLOD filter described herein, for three patients. Each plot shows the VAF in the tumour and in the normal sample for each variant identified (an=aberrant cell fraction in normal sample, at=aberrant cell fraction in tumour sample). The dots in black were identified as Germline SNPs (0.5 VAF in both tumour and normal). The dots in grey are somatic variants that have a non-zero VAF in the normal sample due to tumour-in-normal contamination, but pass the new GLOD filter described herein.

Figure 12 shows for two patients, the VAF estimated for each variant (both germline and somatic) identified in an unmatched variant calling process for a tumour and a normal sample. The red dot is the centroid of a cluster from k-means cluster analysis performed on this data, excluding variants with a read depth below 40 in the tumour sample and a VAF in the normal sample below 25% (i.e. excluding normal variants).

Figure 13 shows in an exemplary data set (Example 6), the number of variants that could be identified only using the methods described herein taking into account tumour in normal contamination (y axis) and the number of variants that could be identified using a method of the prior art that does not take tumour in normal contamination into account (x axis).

Detailed description of the invention

In describing the present invention, the following terms will be employed, and are intended to be defined as indicated below.

"and/or" where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example "A and/or B" is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.

A "sample" as used herein may be a cell or tissue sample (e.g. a biopsy), a biological fluid, an extract (e.g. a protein or DNA extract obtained from the subject), from which material (i.e. nucleic acids) can be obtained for genomic analysis, such as genomic sequencing (whole genome sequencing, whole exome sequencing targeted (also referred to as "panel") sequencing), array profiling, or transcriptome sequencing (e.g. whole transcriptome sequencing (RNA-seq)). Preferably, the sample comprises genomic DNA or DNA derived therefrom (such as e.g. cell free DNA, fragmented DNA, a DNA library, etc.). In particular, the sample may be a normal sample, or a tumour sample. The sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extractions steps). The sample may be a cell or tissue culture sample. As such, a sample as described herein may refer to any type of sample comprising cells or genomic material derived therefrom, whether from a biological sample obtained from a subject, or from a sample obtained from e.g. a cell line. The sample may be from any species, suitably a eukaryotic species, such as from a mammalian (including in particular a model animal such as mouse, rat, etc.), preferably from a human (such as e.g. a human cell sample or a sample from a human subject). Further, the sample may be transported ad/or stored, and collection may take place at a location remote from the genetic sequence data acquisition (e.g. sequencing) location, and/or the computer-implemented method steps may take place at a location remote from the sample collection location and/or remote from the genetic data acquisition (e.g. sequencing) location (e.g. the computer-implemented method steps may be performed by means of a networked computer, such as by means of a "cloud" provider).

A "tumour sample" refers to a sample derived from or obtained from a tumour. A tumour sample may be a sample that comprises tumour cells and at least one other cell type (and/or genetic material derived therefrom). For example, a sample obtained from or derived from a tumour may contain tumour cells and normal (non-tumour cells). Preferably, a tumour sample comprises more tumour cells than normal cells (or more tumour-derived genomic material than germline genomic material). The cancer may be any type of cancer including in particular any type of solid or liquid cancer. In particular, the cancer may be a tumour of the hematopoietic or lymphoid tissue(also referred to as "hematological malignancy" or "blood cancers"), including but not limited to leukemia (e.g. acute lymphocytic (ALL), chronic lymphocytic (CLL), acute myeloid (AML), chronic myeloid (CML)), myeloma, and lymphoma (Hodgkin's and non-Hodgkin's (NHL)) and myeloma (e.g. myeloproliferative neoplasms (MPN)). A solid cancer may be a sarcoma (a cancer that arises from cells of mesenchymal origin such as e.g. a bone sarcoma or a soft tissue sarcoma), carcinoma (a cancer of epithelial origin, including an adenocarcinoma, basal cell carcinoma, squamous cell carcinoma and transitional cell carcinoma) or lymphoma (a cancer that arises from immune system cells). The tumour may be a lung tumour, a bladder tumour, a gastrointestinal tumour, a brain tumour, a spinal cord tumour, a breast tumour, an ovarian tumour, a oesophagal tumour, an endometrial tumour, a kidney tumour, a melanoma, a merkel cell carcinoma, a leukemia, a small bowel tumour, a pancreatic tumour, a germ cell cancer, a prostate tumour, a head and neck tumour, a thyroid tumour, a bone marrow tumour, etc.

A "normal sample" or "germline sample" refers to a sample that is assumed not to comprise tumour cells or genomic material derived from tumour cells, or where the proportion of tumour cells or material genetic derived therefrom is assumed to be lower than that of normal cells or genomic material derived therefrom. A germline sample may be a blood sample, a tissue sample (e.g. a buccal swab sample), or a purified sample (such as a sample of peripheral blood mononuclear cells, a sample of T cells, etc.) from a subject. Similarly, the terms "normal", "germline" or "wild type" when referring to sequences or genotypes refer to the sequence / genotype of cells other than tumour cells. A germline sample may comprise a proportion of tumour cells or genomic material derived therefrom, and may nevertheless be referred to as a "normal sample", for practical purposes, if it is compared with a tumour sample that comprises a higher proportion of tumour cells or genetic material derived therefrom. A "matched normal sample" refers to a normal sample that has been obtained from the same subject as a tumour sample to be analysed. For example, a tumour sample may be a sample of bone marrow from a subject who has been diagnosed as having a myeloma, and a matched tumour sample may be a buccal swab sample or a T-cell sample from the same subject. A "mixed sample" refers to a sample that is assumed to comprise multiple cell types or genetic material derived from multiple cell types. Within the context of the present invention, a tumour sample is typically a mixed sample. Further, within the context of the present invention, a normal sample (e.g. a blood sample) may also be a mixed sample.

The term "aberrant cell fraction" refers to the proportion of DNA containing cells within a mixed sample that are not germline cells (typically, the aberrant cells are tumour cells, and as such the terms "aberrant cell fraction" and "tumour cell fraction" or "tumour fraction" may be used interchangeably). In the case of a mixed sample that comprises genomic material derived from cells, the aberrant cell faction refers to the equivalent proportion of DNA containing cells in the sample that would result in a proportion of aberrant vs germline genomic material. The aberrant cell fraction in a sample may be denotated herein with a symbol "a". The symbol "a_t" may refer to the fraction of tumour cells in a tumour sample. The symbol "a_n" may refer to the fraction of tumour cells in a normal (germline) sample. In the context of a normal sample, the aberrant cell fraction may also be referred to as "tumour in normal contamination" (or "tumour contamination", or "contamination", c). The fraction of normal (germline) cells in a normal sample may be equal to (l-ot_n). Similarly, the fraction of normal (germline) cells in a tumour sample may be equal to (l-a_t).

The term "sequence data" refers to information about the identity (sequence) and abundance of nucleic acid molecules present in a sample. Sequence data may comprise one or more sequencing reads, and optionally one or more associated quality metrics. Alternatively, sequence data may comprise one or more intensity values indicative of the abundance of nucleic acid sequences that have a particular sequence. Within the context of the present invention, the nucleic acids for which sequence data is obtained are typically DNAs.

Indeed, the methods of the present invention aim to identify genomic variants, which are primarily identified from genomic DNA or fragments thereof (including e.g. cfDNA). Nevertheless, as the skilled person understands, other types of nucleic acids such as e.g. RNA (whether obtained from a subject or transcribed from DNA obtained from a subject) may also contain information indicative of the genomic sequence of a subject, and as such may also be used. The term "read depth" or "read count" refers to a signal that is indicative of the amount of genomic material in a sample that maps to a particular genomic location. Sequencing data and derived signals such as read depth/count may be obtained using sequencing technologies, such as e.g. next generation sequencing (NGS, such as e.g. WES, WGS, or sequencing of captured genomic loci, sometimes also referred to as "targeted pull-down"), or using array technologies, such as e.g. SNP arrays. When NGS technologies are used, a read depth may be a read depth within the common sense of the word, i.e. a count of the number of sequencing reads mapping to a genomic location. When array technologies are used, a read depth may be an intensity value associated with a particular genomic location, which can be compared to a control to provide an indication of the amount of genomic material that maps to the particular location. Within the context of the present invention, a read depth or count is preferably obtained by sequencing in which case it refers to a count of the number of sequencing reads mapping to a genomic location.

A "variant" refers to a genomic locus where a subject genome sequence differs from a reference genome sequence. The reference genome sequence may be a reference genome sequence from a database, such as e.g. human reference genome build 37 (GRCh37, also referred to as hgl9, available at https://www.ncbi.nlm.nih.gov/assembly/GCF 000001405.13/) or human reference genome build 38 (GRCh38, also referred to as hg38, available at https://www.ncbi.nlm.nih.gov/assembly/GCF 000001405.39). A reference genome sequence may be the sequence of a reference sample, such as a normal sample. The normal sample may be a matched normal sample, or a common (e.g. pooled) normal sample for a plurality of subject samples. A "single nucleotide variant" (SNV) is a variation in a single nucleotide (i.e. where a base at a particular position in a reference genome is substituted for another base at the corresponding position in the subject genome). Within the context of the present invention, a variant is typically a single nucleotide variant. A "single nucleotide polymorphism" (SNP) is a substitution of a single nucleotide at a specific genomic position that is known to be present in the germline sequence of individuals within a population. A SNV may be required to be present in a population at a frequency above a particular threshold (e.g. 1%) in order to be identified as a SNP. In other words, a SNP is a single nucleotide variant that has been identified as a normal part of genetic variation in a population. In the context of the present disclosure, the term SNV may be used to refer to those single nucleotide substitutions that are not SNPs. These substitutions are therefore likely to be somatic mutations. Thus, putative somatic mutations may be referred to as "SNVs", whereas known germline substitutions may be referred to as "SNPs".

The term "variant calling" refers to the process by which variants are identified from sequence data. The variant calling process may include the following steps: 1. acquire sequence data (such as e.g. through WGS or WES) - this may produce one or more FASTQ files (a text-based format that stores one more nucleic acid sequences, for example the sequence of reads from a next-generation sequencing process, and their corresponding quality scores); 2. Align the sequences to a reference genome - this may create one or more BAM (the compressed binary version of a SAM file(sequence alignment/map file, a text-based format that stores information about alignment of one or more sequences, such as nucleic acid sequence reads, to a reference genome)) or CRAM (a compressed columnar file format for storing biological sequences aligned to a reference sequence)files; 3. Identify positions where the aligned sequences differ from the reference genome - this may result in one or more VCF files (variant call format, a text-based format that stores information about positions in a reference genome, such as e.g. information identifying the position in the genome, the identity of the base at the location in the reference genome, the identity of one or more non-reference alleles at the position, quality scores (e.g. Phred- scaled quality scores) for each of the non-reference alleles, a filter status resulting e.g. from one of more quality control filters, etc.). Step 3 is the variant calling step per se, and thus the reference to "variant calling" may in practice only encompass the steps necessary to identify variant positions from an aligned set of sequences. The methods described herein may be performed at least in part after variant calling has been performed for one or more samples (such as e.g. starting from a VCF file, or from a list of candidate variants in any other formats). In such cases, the methods may include a variant filtering step. Such methods may use one or more outputs of steps preceding the variant identification, such as the output of the sequence alignment step (e.g. one or more SAM/BAM/CRAM files). Instead or in addition to this, the methods described herein may result in a set of called variants (such as e.g. in the form of a VCF file) for one or more samples, and as such may be performed at least in part after the sequence acquisition and alignment steps have been performed (such as e.g. starting from a BAM or CRAM file). In such cases, the methods may include a variant calling step. The methods described herein may also include the steps of acquiring sequence data for the one or more samples, and/or aligning the sequence data. The one or more samples may include at least one (optionally, more than one) tumour sample. The one or more samples may include at least one (optionally, more than one) tumour sample and one (optionally, more than one) matched normal sample. The one or

As used herein, the terms "computer system" includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a central processing unit (CPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display (for example in the design of the business process). The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. It is explicitly envisaged that computer system may consist of or comprise a cloud computer.

As used herein, the term "computer readable media" includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.

Identifying somatic variants by modelling of tumour in normal contamination

The present disclosure provides method for identifying somatic variants, using sequence data from a tumour sample and a normal sample. An illustrative method will be described by reference to Figure 1. At optional step 10, one or more tumour samples, which each typically comprise genomic material from multiple cell types including cells with a normal (germline) genome and cell with a tumour genome, may be obtained from a subject. Advantageously, one or more normal samples may also be obtained from the same subject, in which case these may be referred to as "matched normal" samples. Each normal sample may comprise genomic material from multiple cell types including cells with a normal (germline) genome and cell with a tumour genome At optional step 12, sequence data may be obtained from each sample, for example by sequencing the genomic material in the sample using one of whole exome sequencing, whole genome sequencing, or panel sequencing. At step 14, the sequence data for the/each sample may be aligned to a reference genome.

A step 16, candidate somatic variants in the tumour sample are identified. This may be performed using any method known in the art, such as e.g. using CaveMAn (Jones et al., 2020). In its simplest form, this may comprise the step 16A of identifying one or more locations where the mapped sequence data for the tumour sample (and optionally also for the normal sample) differ from a reference genome. These location are candidate variants, also referred to as candidate alterations (although based on the tumour sequence data alone it is not possible to determine whether the candidate variants are candidate somatic variants or candidate germline variants). Variants may be single nucleotide variants, multiple nucleotide variants, or indels. Indels may be of any length that can be identified by the variant caller used. Advantageously, step 16 may further comprise step 16B of evaluating the confidence in the presence of a somatic alteration at these one or more locations based on the sequence data for the tumour and optionally the sequence data for one or more normal samples (that may be matched or non-matched, such as e.g. pooled) tumour samples. For example, when looking at single nucleotide variants, step 16B can be performed using CaveMan. As another alternative, step 16B can be performed using Mutect (Cibulkis et al., 2013). When looking at indels, step 16B can be performed using cgpPindel (Ye et al., 2009). As another alternative, step 16B may be performed using a method that does not use data from a normal sample at all (e.g. as described in Example 2 - unmatched mode). Instead or in addition to these, step 16B may be performed using a probabilistic method that determines the posterior probability of one or more candidate joint genotypes in view of the sequence data available, using a model that accounts for tumour-in- normal contamination as well as normal-in-tumour contamination (e.g. as described in Example 2).

At step 18, candidate somatic variants identified at step 16 are filtered to remove false positives. This may be performed using any posthoc filters known in the art, such as e.g. any one or more of the filters described in Jones et al., 2016, or e.g. the germline filter described in Cibulkis et al., 2013. For example, the candidate somatic variants may be filtered to remove polymorphisms (known SNPs), locations prone to aberrant mapping or systematic sequencing artifacts in genome sequencing, regions where depth is abnormally higher (and where there is a higher likelihood of false positive calls), etc. In embodiments, step 18 comprises using one or more filters selected from: a proximal gap filter that excludes variants that are within a predetermined distance (such as e.g. 10 bp) of an indel, a poor mapping region filter that excludes variants that are in locations having a mapping quality score below a predetermined threshold, a clustered variant filter that excludes variants that are within a predetermined distance from each other (such as e.g. 10 bp), a strand bias filter that excludes variants where the genotype inferred from reads mapped to the forward strand is not the same as the genotype inferred from reads mapped to the reverse strand, and a read depth filter that excludes variants that are in regions with a read depth below a predetermined threshold. Step 18 may comprise a step of filtering the variants based on the sequence data from the normal sample(s), taking into account the possible presence of tumour-in-normal contamination in the normal sample(s). In such embodiments, the method may comprise optional step 18A of filtering candidate somatic variants using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples (such as e.g. as explained in Example 1). The method may comprise: (i) identifying candidate somatic variants in the tumour sample at step 16 using a probabilistic method that determines the posterior probability of one or more candidate joint genotypes in view of the sequence data available, using a model that accounts for tumour-in-normal contamination as well as normal-intumour contamination (e.g. as described in Example 2), and filtering variants at step 18 not based on the sequence data from the normal samples at step 18; and/or (ii) identifying candidate somatic variants in the tumour sample at step 16 using a method that does not account for tumour-in-normal contamination (or may not even use a normal sample, for example as explained in Example 2- unmatched mode, or in Example 5) and filtering variants at step 18 based on the sequence data from the normal sample(s), taking into account the possible presence of tumour-in-normal contamination in the normal sample(s). The results of steps (i) and (ii) may be further analysed to identify variants identified in (ii) but not in (i), and further filtering those variants at step 18. Preferably, step 16 does not comprise any filtering step based on the number of reads that show the candidate variant in one or more normal samples, unless the step takes into account the possible presence of tumour-in-normal contamination in the normal sample(s).

At optional step 20, a list of filtered candidate variants and optionally any value associated with these may be provided to a user, for example through a user interface.

Applications

The above methods find applications in a variety of clinical contexts in oncology. Indeed, accurate identification of somatic variants present in a tumour is known to have value in providing prognostics, diagnostics and therapeutic indications for many cancer types. For example, tumour mutational burden (TMB) has been shown to have value as a clinical biomarker associated with response to immune checkpoint inhibitor therapy. The quantification of TMB requires the identification of somatic variants in the tumour. As another example, some mutational signatures have been shown to be indicative of likelihood of response to therapies (see e.g. Ma et al., 2018). Again, the identification of such signatures requires the identification of somatic variants in the tumour. Many somatic variants have been shown to have prognostic, diagnostic or therapeutic significance(see e.g. Li et al., 2017). Again, the identification of the presence of such variants in a patient requires the identification of somatic variants in the tumour. As illustrated in Example 1 below, the identification of somatic variants in tumours can also be used to reconstruct the phylogeny of tumours, providing valuable clinical information in terms of early diagnosis and monitoring.

Accurate identification of somatic variants typically relies on a comparison with a matched normal sample (to filter out departures from a reference genome that are attributable to germline variation rather than somatic variation), whether in the process of identifying or filtering somatic variants. The presence of tumour- in-normal contamination may negatively influence the sensitivity of such processes in any clinical contexts where such contamination is possible or likely. This is particularly the case in the context of blood tumours.

Thus, also described herein are methods of providing a prognosis, diagnosis or therapeutic recommendation for a subject that has been diagnosed as having a cancer, the method comprising identifying somatic variants in a tumour from the subject using the methods described herein.

Systems

Fig. 2 shows an embodiment of a system for identifying somatic variants, according to the present disclosure. The system comprises a computing device 1, which comprises a processor 101 and computer readable memory 102. In the embodiment shown, the computing device 1 also comprises a user interface 103, which is illustrated as a screen but may include any other means of conveying information to a user such as e.g. through audible or visual signals. The computing device 1 is communicably connected, such as e.g. through a network 4, to sequence data acquisition means 3, such as a sequencing machine, and/or to one or more databases 2 storing sequence data. The computing device may be a smartphone, tablet, personal computer or other computing device. The computing device is configured to implement a method for identifying somatic variants, as described herein. In alternative embodiments, the computing device 1 is configured to communicate with a remote computing device (not shown), which is itself configured to implement a method of identifying somatic variants, as described herein. In such cases, the remote computing device may also be configured to send the result of the method of identifying somatic variants to the computing device. Communication between the computing device 1 and the remote computing device may be through a wired or wireless connection, and may occur over a local or public network such as e.g. over the public internet. The sequence data acquisition means 3 may be in wired connection with the computing device 1 and/or the database 2, or may be able to communicate through a wireless connection, such as e.g. through WiFi. The connection between the computing device 1 and the sequence data acquisition means 3 may be direct or indirect (such as e.g. through a remote computer). The sequence data acquisition means 3 are configured to acquire sequence data from nucleic acid samples, for example genomic DNA samples extracted from cells and/or tissue samples. In some embodiments, the sample may have been subject to one or more preprocessing steps such as DNA purification, fragmentation, library preparation, target sequence capture (such as e.g. exon capture and/or panel sequence capture). The sample may not have been subject to amplification. Alternatively, the sample may have been subject to amplification this in the presence of amplification bias controlling means such as e.g. using unique molecular identifiers. Any sample preparation process that is suitable for use in the identification of variants may be used within the context of the present invention. The sequence data acquisition means 3 is preferably a next generation sequencer. The one or more databases 2 may further store information that may be used by the computing device 1 and/or the sequence data acquisition means 3, such as e.g. a reference genome sequence, one or more parameters used in e.g. sequence read alignment, quality scoring, etc.

The following is presented by way of example and is not to be construed as a limitation to the scope of the claims.

Examples

Example 1 — Development of a new post variant calling filtering method: Germline Filter (GLOP)

Variants of any type can be identified using any variant caller, such as e.g. CaVEMan (Jones et al., 2016), the method described in Example 2 in unmatched mode, the GATK variant caller from the Broad institute, etc. The method described below can then be applied to "rescue" variants that may have been filtered out due to the presence of TiN contamination. The method uses just the normal sample and compares the likelihood of the observed read counts assuming the site is a heterozygous SNP compared to the likelihood assuming it is a somatic variant:

A similar approach was described by Cibulkis et al. (2013) to classify variants detected in a tumour sample as somatic (not present in the matched normal sample), germline (present in the matched normal sample) or undetermined. In their paper, Cibulksi et al., formulate an expression for the likelihood of observed read counts (P(Data|somatic) and P(Data|germline)) for a given VAF (set to 0 and 0.5, respectively for the likelihood of observed read counts assuming the site is a somatic variant and hence not present in the normal sample, and for the likelihood of observed read counts assuming the site is a heterozygous SNP present in the normal sample). These are denoted L(M₀) and L(Mo.s), respectively for P(Data|germline) and P(Data|somatic). The likelihood for a particular sample and a particular variant allele fraction is given by: L(M)=n_i=i^d P(b±|e±,r,m,f) where f is the VAF assumed in the model, bi is the called base in read i (i=l...d) that covers the site in the sample, e±is the probability of error at that base call (each base in each read being associated with a Phred-like quality score q± where eⁱ⁼10^(_qi/10) ), r is the reference allele and m is a true variant allele. These probabilities are evaluated assuming the sequencing errors are independent across reads, and all substitutions are equally likely and occur with probability eⁱ/3, such that:

P(bi

- P(bi

P(bi|ei,r,m,f)= e^J/3 otherwise.

In other words, Cibulkis et al. classify variants that have been called in the tumour by comparing two models in view of the normal sample data: one in which there is no variant at the site (i.e. VAF=0) in the normal sample and all nonreference bases in the normal sequence reads are due to sequencing noise (Mo) (i.e. the variant is a somatic variant not present in the normal sample), and one in which a heterozygous variant allele m truly exists at the site, with a VAF=f (where sequencing noise is also present in addition)( M_f ^m with f=0.5). A variant is classified by comparing the likelihood of these two models using a log odds score. If this score exceeds a predetermined decision threshold, then the variant is a true somatic variant.

Cilbulkis et al. further assume the following priors: P(Somatic)= 3X 10^-6 (this is a typical mutation frequency for human samples) and that P(Germline) depends on whether the site is represented in dbSNP or not:

This effectively divides the calculation of P(germline) into two possible cases: (i) the variant is a site that is known to be a germline variant site in a relevant population, and (ii) the variant is any other site. The number of known variants in dbSNP at the time was 30xl0⁶, i.e. approximately 1000 variants / Mb of human genome. A given individual human genome is expected to contain approximately 3xl0⁶ variants. 95% of the variants presents in any human genome are expected to be represented in dbSNP. The number of variants not in dbSNP in a human genome is therefore approximately 5% of 3xl0⁶ or 50 variants per Mb (0.05x3xl0^-6 / 3xl0⁹) assuming a genome size of 3xl0⁹. A cut-off (5=10 in Cibulkis et al.) is then applied to the log odds score (LOD) as follows: giving

This evaluates to and

In the present method, a similar approach is taken but modified to allow for Tumour contamination,c, of the Normal. The parameter c can be set to a predetermined value, such as e.g. approximately 0.1, to enable the calling of variants with modest tumour in normal contamination without unduly classifying variants as somatic rather than germline. The predetermined value may be set based on expert knowledge and/or the expected level of tumour contamination in a particular normal sample. Alternatively, the parameter can be estimated using read counts from mutations that are expected to be present in tumour DNA but not in normal DNA, such as e.g. driver mutations. Alternatively, the parameter can be estimated by analysing patterns in the read counts in the normal sample, for example using cluster analysis of unmatched calls (as explained in Example 6 - "Jointly estimating aberrant cell fraction in tumour and normal"). It is also possible to iteratively refine, c, by measuring the normal VAF in called variants. In particular, the parameter c can be set to a value that results in a distribution of normal VAF in the variants called that most closely matches an expected distribution.

Then for somatic variants:

We conservatively assume a tumour VAF of 0.5 (although other assumptions are possible, such as e.g. if the tumour sample may not be clonal - in which case the tumour VAF of 0.5 may under-filter the variants), giving:

where L(M) are calculated in a similar way as above, i.e. L(M)=n_i=i^d P(bi|e±,r,m,f), except that f=0.5xc and 0.5, respectively (instead of 0 and 0.5). Finally, the read counts have a minimum phred scale quality of 30 - which our filter uses as the error probability (e± in the formula above) rather than the per-read reported base qualities as done in Cibulkis et al. This enables the method to run using only summary read counts, whereas the method in Cibulkis et al. requires BAM files from which to extract per base call quality scores. The 5 threshold is conservatively set to 5 = 10 so that a variant is only called as Somatic if it is at least 10 times more likely that the variant is Somatic than that it is Germline.

In addition to the above modifications, the present method used an alternative SNP site definition. In particular, the following pseudo-code was used to define SNP sites:

SNP=[DBSNP or 1000 Genomes or ExAC ] and not COSMIC This gives significantly more sites than specified above (the union of distinct sites from dbSNP, 1000 genomes and ExAC SNP sites yields 96X 10⁶ sites - which we approximate as 100 million). This lowers the evidence required for a variant to be classified as Somatic at known SNP sites:

0(SNPsite)= 5.0

This means that the conditional sensitivity is improved at SNP sites, but there are now more SNP sites at which the SNP site threshold is applied rather than the non SNP site threshold. In the 1000 Genomes Phase 3 (Sudmant et al., 2015) it is found that on average GBR samples have around 4 million variant sites per GBR genome and around 10,000 singletons. This corresponds to approximately 99.75% probability of each variant being represented in 1000 Genomes. This considerably lowers the threshold at nongermline sites. Looking at our samples we find that 20,000 (corresponding to 99.5% chance each variant is in our germline resources) is perhaps a more reasonable number of germline variants not found in our resources. This gives:

The present method differs in important ways from the method in Cibulksi et al. Firstly, the present method models the tumour-in- normal contamination, which was not taken into account or even recognised as a problem in Cibulksi et al. This results in many genuine somatic variants being lost when using the approach in Cibulkis et al. Secondly, the present method assumes a general error rate whereas the method in Cilbulkis et al. uses an error rate from the bam file for every read and is therefore less efficient. Finally, Cilbulkis et al. describes the variant classification as part of a variant calling pipeline that includes the steps of a variant detection process that compares the likelihood of the tumour sequence data using a model in which there is no variant at the site and all nonreference bases are due to sequencing noise, and a model in which a variant allele truly exists with a certain allele fraction and in the presence of sequencing noise. Variant detection is performed by comparing the likelihood of these models using a LCD score, and this exceeds a decision threshold then a candidate variant is called. The variant classification step performs the corresponding step of this for the normal sample, assuming a variant allele fraction f=0.5 for a germline heterozygous variant, and using a more stringent threshold. Thus, the variant calling process in Cibulkis et al. comprises a two-step process where a first step looks at the tumour sample and a second step looks at the normal sample. By contrast, the present filtering method is applicable to any list of variants generated by any variant calling process, including in particular processes such as CaveMAn that already take the normal sequence data into account in the variant calling process. As such, the present approach is both more general and more sensitive than that in Cibulkis et al.

In the present example, the inventors developed a new method to take into account tumour-in-normal contamination in the variant identification process, this time as part of the variant calling process itself (not the variant filtering process). A new method to perform variant calling which models the presence of tumour-in- normal contamination as well as the presence of normal-in-tumour- contamination (more commonly referred to as "tumour cell fraction" or "aberrant cell fraction") was therefore developed, as explained below.

Variant calling using CaVEMan

CaVEMan (Cancer Variants through Expectation Maximization, Jones et al., 2016)is a method that applies an expectation maximisation algorithm to call single nucleotide substitutions. The approach calculates a probability for each possible genotype per base (given tumour and normal copy numbers), by comparing read data from tumour and normal samples (typically in the form of BAM/CRAM next generation sequencing files). A sum of all possible somatic genotype probabilities is calculated, and if this is above a predetermined probability cut-off, the position is designated as a candidate variant (by outputting to a VCF file). In order to provide more accurate estimates of sequence error rates within the algorithm, the approach incorporates variables such as base quality, read position, lane, and read orientation into the calculations.

The method has four main steps:

- Split: the split step divides the genome into regions of approximately equal read count, which are then analysed in dividually in the mstep and estep, for computational efficiency;

- Mstep: the mstep creates a profile of the genomic data in a split region, using covariates including base quality, reference base, called base, read position, MAPQ (mapping quality) scores, and mapping strand;

- Merge: the merge step combines the per-split profiles into a profile for the whole genome; and

Estep: the estep uses the genome profile generated by the mstep, location-specific copy numbers, and the reference base in combination with sequence data to assign a probability to each possible genotype.

This may be followed by a "flagging" step (post-hoc filtering aimed at reducing false positive somatic mutation calls), which may include one or more filters as explained above. The m-step essentially recalibrates the sequencing error rate by assuming a background model where all reads are drawn from the reference genome and thus any read exhibiting a mutant base is interpreted as an error. This model is applied by stratifying on various features of the read and base call e.g. position in read and input base quality (further detail in the Model description below). More information on the basic CaVEMan process can be found in Jones et al., 2016, and at https://github.com/cancerit/CaVEMan, both of which are incorporated herein by reference. In the present example, a new model was implemented for use in the Estep, as explained below. Model Using the model described herein, a probability of there being a somatic mutation is calculated for each site of the genome given each of: - α_n: a pre-specified aberrant cell fraction in the normal sample, - α_t: a pre-specified aberrant cell fraction in the tumour sample, - N_g: a site specific copy number for the germline, - N_t: a site-specific copy number for the tumour, - P(Som): a prior probability that the site has a somatic variant, and - P(SNP): a prior probability that the site has a germline variant. The value for α_t may be estimated using methods known in the art, such as e.g. ASCAT (Van Loo et al., 2010) or Battenberg (Nik-Zainal et al., 2012). Alternatively, a suitable predetermined value may be used. The value for α_n may be estimated in a similar manner. Alternatively, a suitable predetermined value may be used, such as e.g. 0.1. Such values were found by the inventors to be suitable in many cases, particularly for blood malignancies. Instead or in addition to this, a suitable estimate for the aberrant cell fraction in the normal and/or the tumour samples (α_n and/or α_t) may be obtained by analysing the estimated VAF in the normal and tumour samples and identifying a cluster of sites (using any clustering method known in the art, such as e.g. k-means) with the same normal VAF and various tumour VAF not equal to 0 (see example plot on Figure 3E, star), where the centroid of said cluster may be used as to obtain an estimate of the aberrant cell fractions in the normal (x axis coordinates) and/or tumour (y axis coordinates – assuming clonal variants). A detailed explanation of an exemplary method to do this is described in Example 6 (section “Joint estimation aberrant cell fraction in tumour and normal”). A suitable predetermined value for P(Som) may be used. For example, a default value of P(Som)= 3x10^-6 may be used. Alternatively, a suitable default value may be set based on prior knowledge depending on the samples to be analysed, for example based on prior knowledge of the expected number of mutations acquired per year. In embodiments, for a clonal blood sample an estimate of the number of SNVs acquired per year may be used to set P(Som) depending on the age of the sample. For example, assuming that 20 SNVs are expected per year, approximately 1000 SNVs may be expected at age 50, leading to a P(Som) estimate of 1000/3x10⁹ for a human genome, i.e. 3x10^-7. A suitable predetermined value for P(SNP) may be used. For example, a value of P(SNP)= 1x10^-4 may be used. Alternatively, a suitable default value may be set based on prior knowledge, such as e.g. the average number of SNPs expected in a human genome. For example, assuming that every human genome (approximately 3x10⁹ bases) contains approximately 3x10⁶ SNPs, a value of P(SNP)=1x10³ may be used. Genotypes and genotype priors: At each site, the model considers the set of possible germline (G_g) and tumour (G_t) genotypes (i.e. joint genotypes (G_g,G_t)) constructed from the reference allele (A) and all three possible variant alleles (B!=A), given by: - Somatic genotypes Ω_s, which comprise a set of genotypes for each of the 3 possible non-reference alleles B, i.e. ^^^^ _^^^^ =

o each set of somatic genotypes for a non-reference allele B is given by: Ω_s(B)={(G_g,G_t): G_g=(A…A)_Ng, G_t=(A…A)_Nt- _k(B…B)_k:N_t≥k>0} (i.e. all genotypes compatible with the germline having N_g copies of allele A, and the tumour having N_t-k copies of allele A and k copies of allele B, k taking any value between 1 and N_t); - SNP genotypes Ω_g, which comprise a set of genotypes for a heterozygous SNP

and a set of genotypes for a homozygous SNP , each set comprising a set of genotypes for each of the 3 possible non reference alleles B, i.e. ^^^^ _^^^^ = ⋃ _{^^^^!= ^^^^} ^^^^ _{^^^^ ^^^^ ^^^^ ^^^^} ( ^^^^) a^{nd ^^^^} _{^^^^ ^^^^ ^^^^ ^^^^} ⁼ _{⋃ ^^^^!= ^^^^} ^{^^^^} _{^^^^ ^^^^ ^^^^ ^^^^} ^{( ^^^^);} o each set of heterozygous SNP genotypes for a non- reference allele B is given by: Ω_gHET(B)={(G_g,G_t): G_g^ {(A…A)_Ng-k(B…B)_k:N_g≥k>0}, G_t^{(A…A)_Nt-k(B…B)_k:N_t≥k>0}} (i.e. all genotypes compatible with the germline having N_g-k copies of allele A and k copies of allele B, and the tumour having N_t-k copies of allele A and k copies of allele B, k taking any value between 1 and N_t for the tumour and any value between 1 and N_g for the germline); o each set of homozygous SNP genotypes for a non-reference allele B is given by:

G_g=(B…B)_Ng, G_t=(B…B)_Nt} (i.e. both the tumour and germline only have allele B, with copy number N_t and N_g, respectively); With these notations, the reference genotype Ω_r is given by Ω_r={( G_g,G_t): G_g=(A…A)_Ng, G_t=(A…A)_Nt} (i.e. both the tumour and germline only have allele A, with copy number N_t and N_g, respectively). As mentioned above, the variants allele B can be any of the 3 non- reference alleles. Thus, the joint genotypes consistent with somatic and germline variation with a single variant allele are: ^^^^ _^^^^ =

The union of the 3 sets Ω_s, Ω_g, and Ω_r represents the full set of possible genotypes (Ω)_. The prior probability for each genotype to belong to each of these three sets can be expressed in terms of the prior probabilities that the site has a germline or somatic variant: - prior probability that the genotype is a somatic genotype: P((G_g,G_t)^ Ω_s)=P(Som)(1-P(SNP)) (i.e. the product of the prior probability that the site has a somatic mutation and the prior probability that the site does not have a germline variant); - prior probability that the genotype is a SNP genotype: P((G_g,G_t)^ Ω_g)=P(SNP) (i.e. the prior probability that the site has a germline variant); - prior probability that the genotype is the reference genotype: P((G_g,G_t)^ Ω_r)=(1-P(Som))(1-P(SNP)) (i.e. the product of the prior probability that the site does not have a somatic mutation and the prior probability that the site does not have a germline variant). Let ω(G_g,G_t) represent one of the above subsets that G_g,G_t belong to. we can express as: P(G_g,G_t)=P((G_g,

_t) ^ ω(G_g,G_t)) P((G_g,G_t) ^ ω(G_g,G_t)) or, assuming a flat prior within each of these sets: P(G_g,G_t)=P((G_g,G_t) ^ ω(G_g,G_t))/| ω(G_g,G_t)| where P((G_g,G_t) ^ ω(G_g,G_t))is given by the above prior probabilities for each of the sets Ω_s, Ω_g, and Ω_r, and | ω(G_g,G_t)| is the number of genotypes in the set ω(G_g,G_t) (where ω(G_g,G_t) refers to one of Ω_s, Ω_g, and Ω_r). Probability of each joint genotype given the data: the probability of each joint genotype given the data D is calculated using Bayes’ rule: (equation 1)

The priors P(G_g,G_t) in equation (1) can be calculated as explained above. For the calculation of the likelihoods P(D|G_g,G_t) in equation 1, it is assumed that for a given mapped read at position p, the true base Ȓ_p (where Ȓ_p=x) is called as base C with a probability that depends on covariates θ (i.e. the probability of calling base C when the true base is x is given by P(Ȓ_p=C| Ȓ_p=x, θ)). These may be referred to as “profile probabilities” (P(Ȓ_p=A| Ȓ_p=x, θ)) and P(Ȓ_p=B| Ȓ_p=x, θ))). The covariates used in these examples include lane/read group, read order, strand mapping quality, base quality, and position within read. The first 4 are defined at the level of the read and the last 2 are defined at the level of the base call. The data D at position p is assumed to comprise: - D_p ⁽ⁿ⁾: a set of reads (also referred to as a “pileup”) at the specified genomic position for the normal sample, and - D_p ^(t): a set of reads also referred to as a “pileup”) at the specified genomic position for the tumour sample. Assuming that errors in the calling of bases are conditionally independent given the covariates, we can write:

(Equation 2) In other words, the likelihood of the data given a particular joint genotype can be expressed as a product of the likelihood of the tumour data and the normal data, each given a set of covariates and an aberrant cell fraction. Any sample (e.g. a tumour sample or a normal sample) is assumed to be a combination of germline cells and tumour cells with aberrant cell fraction α. For such a sample, the probability of the data can be expressed as a sum of two terms, one for the reference allele and one for the variant allele:

The aberrant cell fraction and copy number estimates can be used to obtain an “adjusted aberrant cell fraction” α’ that represents the probability that a read from a sample comprising tumour and germline cells comes from a tumour cell: α’=( αN_t)/((1- α)N_g+ αN_t). The proportion of variant alleles in any genotype G can be expressed as (G)=B(B)/(A(G)+B(G)). This can be adjusted for reference bias (r_b) to obtain an adjusted value ψ’(G) that represents the probability that a given read that maps to position p, from a cell that has genotype G, carries the variant read (i.e. carries a read that originates from a variant molecule, regardless of whether the base was correctly or incorrectly called): (P(Ȓ_p=B|G)): ψ’(G)=(r_b ψ(G))/((1- r_b)+ r_b ψ(G)). The above values can be used to calculate the probabilities that a true read base is the reference allele A or variant allele B, given a joint genotype: P(Ȓ_p=A| G_g,G_t)= P(Ȓ_p=A|R ^ germline,G_g)P(R ^ germline)+ P(Ȓ_p=A|R ^ tumour,G_t)P(R ^ tumour) = P(Ȓ_p=A|R ^ germline,G_g)(1- α’)+ P(Ȓ_p=A|R ^ tumour,G_t) α’

’ and similarly: P(Ȓ_p=B| G_g,G_t)=(1- α’) ψ’(G_g)+ α’ ψ’(G_t). These expressions can be substituted into equation (3), to obtain equation (4): u_{ation 4).}

For the normal sample, equation 4 would be expressed as: Pⁿ(R_p=C|θ ,Ȓ_p=R,G_g,G_t,α_n)= P(R_p=C|Ȓ_p=A,θ)((1- ψ’(G_g))(1- α_n’)+ (1-

α_n’=(α_nN_t)/((1- α_n)N_g+ α_nN_t). For the tumour sample, equation 4 would be expressed as: P^t(R_p=C|θ ,Ȓ_p=R,G_g,G_t,α_t)= P(R_p=C|Ȓ_p=A,θ)((1- ψ’(G_g))(1- α_t’)+ (1-ψ’(G_t)) α_t’)+ P(R_p=C|Ȓ_p=B,θ)((1- α_t’) ψ’(G_g)+ α_t’ ψ’(G_t)) where α_t’=(α_tN_t)/((1- α_t)N_g+ α_tN_t). The profile probabilities (P(R_p=C|Ȓ_p=x,θ)) in Equation 4 can be calculated as follows. Each read R at position p has covariates θ(R,p). An overall count χ of the number of times that each covariate appears across the genome in both the set of overlapping reads for the normal sample (D_p ⁽ⁿ⁾) and the set of overlapping reads for the tumour sample (D_p ^(t)) can be expressed as: χ(R_p=C,Ȓ_p=R,θ)=∑_{p ^}

θ(R,p)= θ|)}. This count matrix can be used to obtain an empirical estimate of the profile probabilities as: P(R_p=C|Ȓ_p=x,θ)= χ(R_p=C,Ȓ_p=x,θ)/(∑_{C ^ {a,c,g,t}} χ(R_p=C,Ȓ_p=x,θ)). These counts are obtained in the mstep and merged in the merge step of the algorithm. The counts may be computed over all positions, as is the case in the state of the art CaveMAn method. This is equivalent to assuming a background model where all reads are drawn from the reference genome and thus any read exhibiting a mutant base is interpreted as an error. Such a background model may be suitable when base call quality is poor and the observed mutant base calls are overwhelmingly due to sequencing error. However, it does mean that mutant base calls that have a biological source are misinterpreted as error, and may thus systematically underestimate the effective base call quality. Given that SNPs typically occur at approximately one in a thousand sites this effectively floors the error rate at 0.001 (or quality 30 on the PHRED scale). Alternatively, the counts may advantageously be computed over all positions excluding a set of sites that are known sites of germline variability (i.e. known SNP sites). Such a set of sites may be obtained for example from one or more databases such as dbSNP (https://www.ncbi.nlm.nih.gov/snp/), 1000 Genomes (https://www.internationalgenome.org/) and/or ExAC (Karczewski et al. 2017). This may remove some important biases and improve the sensitivity of the variant calling process.

These profile probabilities (obtained from the results of the mstep/merge steps) are then used in the estep, to calculate a probability for each possible joint genotype. In particular, the profile probabilities can be included in equation 4, such that all of the terms of equation 4 can now be calculated (both for the normal sample and the tumour sample), leading to values for Pⁿ(R_p=C|0 ,R_P=R,G_g,G_t,ot_n) and P^fc (R_P=C|0 ,R_P=R,G_g,G_t,a_t). These can in turn be used to calculate P(D|G_g,G_t) with equation 2. P(D|G_g,Gt) can in turn be used to calculate P(G_g,G_t|D) using equation 1 and the prior probabilities P(G_g,Gt) given by P(G_g,Gt)=P((G_g,Gt) £ co(G_g,Gt))/| co(G_g,G_t)|, where as explained above P((G_g,G_t) £ co(G_g,G_t)) is given by P((G_g,Gt)£ Q_S)=P(Som)(1-P(SNP)), P((G_g,G_t)£ Q_g)=P(SNP), and P((G_g,G_t)£ Q_r)=(1-P(Som))(l-P(SNP)), respectively for the somatic, germline and reference joint genotype sets. Thus, it is now possible to calculate a posterior probability for each possible joint genotype at each position. From these, it is possible to calculate a posterior probability that at each position, the joint genotype has a somatic variant, a germline variant or a reference genotype, by summing the posterior probabilities for each genotype within the somatic, germline and reference sets, respectively.

Unmatched mode

It is possible to apply the above statistical model in an unmatched context where no matched normal sample is available or it is not desirable to use one. For example, when using the above approach in comparison with the filtering method described in Example 1, it may be desirable to run the variant calling method without a matched normal sample (to include a larger set of candidate variants, which can then be filtered to include those variants that are not present in the normal sample, taking into account TiN in the normal sample as explained in Example 1).

In this context, a special case of equation (2) can be used (Equation 2')): (Equation 2')

where the normal sample is simply ignored (a similar equation can be written when the normal sample is analysed instead of the tumour sample, i.e.

Because of the absence of a normal sample, this particular mode is unlikely to robustly distinguish somatic and germline variants.

Thus, in order for all variants to be interpreted as somatic variants, the prior SNP probability is preferably set to 0. The task of distinguishing somatic from germline variants then becomes a post-processing filtering step. This can for example make use of SNP resources and a matched normal sample if available, for example using the method described in Example 1.

Unless specified otherwise, the method described in this Example is applied in a matched mode. Thus, references to the method of Example 2 refer to the matched mode unless specified otherwise.

otherwise In this example, the inventors demonstrate the use of the methods described in Examples 1 and 2 to analyse blood samples from patients with myeloproliferative neoplasms, where the tumour in the blood typically contaminates the matched germline control used to identify mutations.

In a first analysis, bulk whole genome sequenced blood samples (Bone marrow, peripheral blood) from patients with myeloproliferative neoplasms were analysed using matched germline samples that were a buccal DNA sample, T cell DNA or in vivo expanded T-Cells (in an attempt to get a cleaner germline free of TiN). The samples were analysed using: (i) standard off the shelf CaVEMan (as described in Jones et al., 2016) - labelled as "NST" in the data shown; (ii) the new variant calling method taking into account tumour in normal contamination described in Example 2 - labelled as "CAVETIN" in the data shown; or (iii) the variant calling method without matched normal described in Example 2 in combination with the new variant filter described in Example 1 - labelled "GLOD" in the data below.

Figure 3 shows the results for the new variant calling method taking into account tumour in normal contamination described in Example 2. The results on Figure 3A and Figure 3B show the comparison between (i) standard CaVEMan (NST) and (ii) CaVEMan with tumour in normal model (CAVETIN), in terms of the numbers of SNVs identified. Figure 3A shows the absolute number of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the standard CaVEMan only (middle bar) and by the new method with TIN modelling (top of the bar). Figure 3B shows the proportion of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the standard CaVEMan only (middle bar) and by the new method with TIN modelling (top of the bar). The data shows that the new method identified the vast majority of the variants identified by the standard CaVEMan method, and additionally identified further SNVs that could not be identified by the standard method. For most of the samples, the number of SNVs that have been "rescued" thanks to the new method outnumbers the number of SNVs that could not be identified by the new method. Indeed for some samples the proportion of SNVs that could only be identified using the new method is very high.

The results on Figure 3C show that the proportion of variants rescued by the new method that takes into account TiN (y axis) goes up with increasing tumour contamination of the germline sample (x axis, ACF=aberrant cell fraction in matched germline control). The results on Figure 3D confirm that the VAF of the tumour variants is higher in the normal sample for those variants that were rescued by the new method, compared to the variants identified by both calling methods or only by the standard calling method.

Figures 3E-G show more detailed results for a single exemplary patient (PD4775c). Figure 3E shows all variants in one individual sample and the VAF of those variants in the germline (x axis) and tumour (y axis). Central black dots represent germline SNPs. Red dots represent 'easy to identify' somatic variants present in the tumour but not present in the normal, identified by both the standard method (Caveman) and the new method (Caveman_TiN). Green dots represent definitive somatic variants in the tumour but missed by the standard method (caveman) due to their presence in the normal. These green coloured variants are picked up by the new method (Caveman_TiN). The star is the VAF centroid of a cluster from a k-means based cluster analysis, from which the aberrant cell fraction in the normal sample used in the new method was estimated, which enabled the identification of the green variants (see Example 6, section "Joint estimation aberrant cell fraction in tumour and normal"). Thus, the star represents the expected VAF of the clonal variants with the estimated aberrant cell fractions given by the coordinates of the star. Figure 3F shows the cosine similarity in trinucleotide sequence distribution of the variants identified by either or both methods. This data shows that the variants identified by the new method only are more similar to the variants identified by both methods than the variants that were only identified by the standard method. This indicates that the rescued variants are more likely to be true variants than the variants that were only identified by the standard method (which are more likely to be artefacts). Figure 3G shows the trinucleotide sequence of the variants identified by either or both methods. This shows that the variants rescued by the new method have a similar nature to the variants that were identified by both methods. By contrast, the variants identified only by the standard method (and not the new method) appear to contain at least some variants that are artefacts (see the enrichment in T>G variants in the last column of the plot).

Thus, the data on Figure 3 shows that the new method described in Example 2 is able to identify many variants that would have been missed by the gold standard comparative method, while still identifying the majority of the true variants identified by the gold standard comparative method.

Figure 4 shows the results for the new variant filtering method taking into account tumour in normal contamination described in Example 1. The results on Figure 4A and Figure 4B show the comparison between (i) standard CaVEMan (NST) and (ii) unmatched CaVEMan with tumour in normal model-based filter (GLOD), in terms of the numbers of SNVs identified. Figure 4A shows the absolute number of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the standard CaVEMan only (middle bar) and by the new method with TIN modelling (top of the bar). Figure 4B shows the proportion of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the standard CaVEMan only (middle bar) and by the new method with TIN modelling (top of the bar). The data shows that the new method identified the vast majority of the variants identified by the standard CaVEMan method, and additionally identified further SNVs that could not be identified by the standard method. For most of the samples, the number of SNVs that have been "rescued" thanks to the new method outnumbers the number of SNVs that could not be identified by the new method. Indeed for some samples the proportion of SNVs that could only be identified using the new method is very high.

The results on Figure 4C show that the proportion of variants rescued by the new method that takes into account TiN (y axis) goes up with increasing tumour contamination of the germline sample (x axis, ACF=aberrant cell fraction in matched germline control). The results on Figure 4D confirm that the VAF of the tumour variants is higher in the normal sample for those variants that were rescued by the new method, compared to the variants identified by both calling methods or only by the standard calling method.

Figures 4E-G show more detailed results for a single exemplary patient (PD4775c). Figure 4E shows all variants in one individual sample and the VAF of those variants in the germline (x axis) and tumour (y axis). Central black dots represent germline SNPs. Red dots represent 'easy to identify' somatic variants present in the tumour but not present in the normal, identified by both the standard method (Caveman) and the new method (GLOD). Green dots represent definitive somatic variants in the tumour but missed by the standard method (caveman) due to their presence in the normal. These green coloured variants are picked up by the new method (GLOD). The star is the VAF centroid of a cluster from a k-means based cluster analysis, from which the aberrant cell fraction in the normal sample used in the new method was estimated, which enabled the identification of the green variants (see Example 6 - section "Joint estimation aberrant cell fraction in tumour and normal"). Thus, the star represents the expected VAF of the clonal variants with the estimated aberrant cell fractions given by the coordinates of the star. Note that the number of green variants in this case is higher than on Figure 3E. This is not unexpected as this method did not use a matched germline sample for variant calling (just for variant filtering). Some of these variants may therefore be artefacts, such as e.g. the lower cluster of green variants that appear to have the same VAF in tumour and normal. However, this more inclusive set of variants beneficially decreases the chance of variants being missed. The variants identified can then be further analysed to exclude those that may be artefacts. For example, the new method described in Example 1 (GLOD) and the new method described in Example 2 (CAVETIN) with a matched germline sample may both be run on the same sample, and variants only identified by one of these two (typically the method in Example 1 as this is more permissive than the method in Example 2 used with a matched germline sample) can be further filtered. This can be done, for example, by looking at the VAF for these variants in the normal and tumour samples (e.g. to identify variants that have a similar VAF in the normal and tumour samples) and/or by looking at the trinucleotide context (e.g. to identify variants that have trinucleotide sequences that are enriched compared to the distribution in variants identified by both the new methods and other methods). Thus, the methods can be used alone or in combinations and will result in a number of variants being rescued, which number increases with the level of TiN contamination. Figure 4F shows the cosine similarity in trinucleotide sequence distribution of the variants identified by either or both methods. This data shows that the variants identified by the new method only are more similar to the variants identified by both methods than the variants that were only identified by the standard method. This indicates that the rescued variants are more likely to be true variants than the variants that were only identified by the standard method (which are more likely to be artefacts). Figure 4G shows the trinucleotide sequence of the variants identified by either or both methods. This shows that the variants rescued by the new method have a similar nature to the variants that were identified by both methods. By contrast, the variants identified only by the standard method (and not the new method) appear to contain at least some variants that are artefacts (see the enrichment in T>G variants in the last column of the plot).

Thus, the data on Figure 4 shows that the new method described in Example 1 is able to identify many variants that would have been missed by the gold standard comparative method, while still identifying the majority of the true variants identified by the gold standard comparative method.

In a second analysis, the methods described in Examples 1 and 2 were compared to the results obtained for the same samples in an analysis previously published (Gerstung, M. et al. The evolutionary history of 2,658 cancers. Nature 578, 122-128 (2020); ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82-93 (2020)). The PanCan dataset used in this example (and described in the references above) is a large cancer dataset including the bulk myeloproliferative neoplasm samples above and further samples used in Example 6. In this dataset, every sample had mutations called by 4 different internationally used variant callers including Caveman. A variant 'passed' if it was identified by two variant callers. These samples and variants have gone through peer review and publication. Thus, these variants can be assumed to represent a 'truth set'.

Figure 5 shows the results for the new variant calling method taking into account tumour in normal contamination described in Example 2, compared to the method applied in the references above (multiple callers and requirement for consensus between at least 2 callers). The results on Figure 5A and Figure 5B show the comparison between (i) the multiple callers approach (PANCAN) and (ii) CaVEMan with tumour in normal model (CAVETIN), in terms of the numbers of SNVs identified. Figure 5A shows the absolute number of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the multiple callers method only (middle bar) and by the new method with TIN modelling (top of the bar). Figure 5B shows the proportion of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the multiple callers approach only (middle bar) and by the new method with TIN modelling (top of the bar). The data shows that the new method identified the vast majority of the variants identified by the multiple callers approach, and additionally identified further SNVs that could not be identified in the "truth set". Again, for at least some of the samples, the proportion of SNVs that could only be identified using the new method is very high. The results on Figure 5C show that the proportion of variants rescued by the new method that takes into account TiN (y axis) goes up with increasing tumour contamination of the germline sample (x axis, ACF=aberrant cell fraction in matched germline control). The results on Figure 5D confirm that the VAF of the tumour variants is higher in the normal sample for those variants that were rescued by the new method, compared to the variants identified by both calling methods or only by the multiple callers method. The variants identified only by the multiple callers method also contain variants with higher VAF in the normal sample than the variants identified by both methods, but fewer variants with high VAF in the tumour compared to those identified by both methods. Figures 5E-F show an analysis of the trinucleotide context of the variants identified. Figure 5F shows the cosine similarity in trinucleotide sequence distribution of the variants identified by either or both methods. This data shows that the variants identified by the new method only are more similar to the variants identified by both methods than the variants that were only identified by the multiple callers method (although this effect is less pronounced than when comparing with a single caller, likely due to the added stringency of requesting agreement between multiple callers or inclusion in this set). This indicates that the rescued variants are more likely to be true variants than the variants that were only identified by the multiple callers method (which are more likely to be artefacts). Figure 5E shows the trinucleotide sequence of the variants identified by either or both methods. This shows that the variants rescued by the new method have a similar nature to the variants that were identified by both methods. By contrast, the variants identified only by the multiple callers method (and not the new method) appear to contain at least some variants that are artefacts (see the enrichment in a particular category of T>G variants in the last column of the plot).

Thus, the data on Figure 5 shows that the new method described in Example 2 is able to identify many variants that would have been missed by even the most thorough comparative method, while still identifying the majority of the true variants identified by such an approach. In other words, the method described in Example 2 rescued variants that are missed by all current commonly used variant callers in the PCAWG (Pan-Cancer Analysis of Whole Genomes) studies.

Figure 6 shows the results for the new variant filtering method taking into account tumour in normal contamination described in Example 1, compared to the method applied in the references above (multiple callers and requirement for consensus between at least 2 callers). The results on Figure 6A and Figure 6B show the comparison between (i) the multiple callers approach (PANCAN) and (ii) the tumour in normal based filter(GLOD), in terms of the numbers of SNVs identified. Figure 6A shows the absolute number of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the multiple callers method only (middle bar) and by the new method with TIN modelling (top of the bar). Figure 6B shows the proportion of SNVs (y axis) identified in each sample (x axis) by both methods (bottom of the bar), by the multiple callers approach only (middle bar) and by the new method with TIN modelling (top of the bar). The data shows that the new method identified the vast majority of the variants identified by the multiple callers approach, and additionally identified further SNVs that could not be identified in the "truth set". Again, for at least some of the samples, the proportion of SNVs that could only be identified using the new method is very high. The results on Figure 6C show that the proportion of variants rescued by the new method that takes into account TiN (y axis) goes up with increasing tumour contamination of the germline sample (x axis, ACF=aberrant cell fraction in matched germline control). The results on Figure 6D confirm that the VAF of the tumour variants is higher in the normal sample for those variants that were rescued by the new method, compared to the variants identified by both calling methods or only by the multiple callers method. Figures 6E-F show an analysis of the trinucleotide context of the variants identified. Figure 6F shows the cosine similarity in trinucleotide sequence distribution of the variants identified by either or both methods. This data shows that the variants identified by the new method only are more similar to the variants identified by both methods than the variants that were only identified by the multiple callers method (although this effect is less pronounced than when comparing with a single caller, likely due to the added stringency of requesting agreement between multiple callers or inclusion in this set). This indicates that the rescued variants are more likely to be true variants than the variants that were only identified by the multiple callers method (which are more likely to be artefacts). Figure 6E shows the trinucleotide sequence of the variants identified by either or both methods. This shows that the variants rescued by the new method have a similar nature to the variants that were identified by both methods. By contrast, the variants identified only by the multiple callers method (and not the new method) appear to contain at least some variants that are artefacts (see the enrichment in a particular category of T>G variants in the last column of the plot). Thus, the data on Figure 6 shows that the new method described in Example 1 is able to identify many variants that would have been missed by even the most thorough comparative method, while still identifying the majority of the true variants identified by such an approach. In other words, the method described in Example 1 rescued variants that are missed by all current commonly used variant callers in the PCAWG (Pan-Cancer Analysis of Whole Genomes) studies.

Example 4 — Theoretical analysis of the method described in Example 2

Sensitivity

For modest depth in Tumour and Normal it is possible to directly calculate a theoretical model sensitivity for fixed values of specific model parameters and with a simplified error model for the profile probabilities P(R_P=C|R_p=x,0). In particular, as explained below, by fixing the sequencing error rate e at 0.001, the impact of various combinations of values for the depth in the tumour and normal samples, the aberrant cell fraction in the normal and tumour samples, the SNP prior with a fixed somatic prior of 3xl0^-6, and the normal and tumour copy numbers can be used to assess sensitivity. The simplified error model for the profile probabilities assumes that the probability of error is e and each of the 3 possible incorrect bases have a probability of e/3, i.e.:

P(R_P=B|R_P=A)=S=SB!=A e/3 P(R_P=A |R_P=B)=S=2A!=B e/3 P(R_P=A |R_P=A)=1-s P(R_P=B|R_P=B)=l-e where P(R_p=A|R_p=A) is the probability that a called genotype is A given that the true genotype is A, and similarly P(R_p=B|R_p=B) is the probability that a called genotype is B given that the true genotype is B.

Under this simplified model, the data at each locus for a sample is completely specified by the read counts for each of the bases (i.e. N⁽ⁿ⁾=[N_A,N_B,N_c,N_D] and N^(t)=[N_A,N_B,N_c,N_D]) (where A is the reference allele, B the variant allele and C, D the other two possible bases). With this model it is possible to analyse the power of the model to correctly detect and classify mutations with a specified variant allele fraction (VAF) equal to V_t in the underlying tumour and V_g in pure germline tissue, sequenced to depth d_t and d_n in the tumour sample and normal sample respectively. This can directly be modelled using equation 5 to determine sensitivity:

P((Gg,Gt)I Vg,Vt,dt,dn,a_n,a_t)=P((G_g,G_t)|N<ⁿ>,N<t>,a_n,a_t)P(N<ⁿ>|V_g,V_t,d_n,a_n,e) P(N^(t) |Vg,V_t,d_t,oft,£) where the first term (P((G_g,G_t)|N⁽ⁿ⁾ ,N^(t) ,a_n,a_t)) gives the model probability of each genotype for a particular set of read counts in the normal sample and tumour sample. The final two terms are exhaustively calculated by enumerating all possible counts for the 4 bases given the specified depth and applying the multinomial probability as explained below.

Classification of mutation types

The posterior probability that at each position, the joint genotype has a somatic variant, a germline variant or a reference genotype can be used to classify each position between one of three classes of variants (SNP, SNV, reference). This can be done for example by summing the probabilities for all joint genotypes in a respective class of variants, thereby obtaining a probability for the class. These class probabilities can then be sued to perform a hard classification, for example by classifying a position in one of the classes (SNP, SNV, reference) if the corresponding probability exceeds a predetermined threshold (such as e.g. 0.8, as was used to generate the results below). The classification can be left ambiguous if none of the probabilities reaches the predetermined threshold.

Using the above simplified model for sensitivity described in Example 2, it is possible to directly get a sense of how the model classification (between each class of variants, i.e. ) corresponds to normal and tumour read counts. It is possible to model: P(N⁽ⁿ⁾ |Vg,V_t,d_n,otn)-Multinomial(N⁽ⁿ⁾ ,[P(A |V_g,V_t,d_n,ot_n,£), P(B|Vg,Vt,d_n,a_n,£), P(C|Vg,V_t,d_n,a_n,£) P(D|Vg,Vt,d_n,a_n,£)]) and similarly

P(B|Vg,V_t,d,ot,e)=(l-a)[(l-e)Vg+((1-Vg)e/3)]+ a[(1-e)V_t+(1-V_t)e/3]

P(c|Vg,V_t,d,a,e)= s/3

P(D|Vg,V_t,d,a,e)= s/3

With a=a_n, d=d_n for a normal sample and a=a_t, d=d_t for a tumour sample.

Fixing the sequencing error rate e at 0.001, the impact of various combinations of values for the depth in the tumour and normal samples, the aberrant cell fraction in the normal and tumour samples, and the normal and tumour copy numbers were used to assess sensitivity on simulated data.

In particular, to investigate the sensitivity of the model to the aberrant cell fraction in the normal sample, the following parameters were used (with a fixed e=0.001 and copy numbers=2 in the normal and 5 in the tumour):

- d_t=40,d_n=20,a_n=0,a_t=0.8 (results on Fig. 7A);

- d_t=40,d_n=20,ot_n=0.l,a_t=0.8 (results on Fig. 7B);

- d_t=40,d_n=20,o<n=0.2,a_t=0.8(results on Fig. 7C);

- d_t=40,d_n=20,o<n=0.3,a_t=0.8(results on Fig. 7D).

Additionally, to investigate the sensitivity of the model to the tumour copy number, the following parameters were used (with a fixed e=0.001 and aberrant cell fractions=0.1 in the normal and 0.9 in the tumour):

- d_t=40,d_n=20, N_g (site specific copy number for the germline)=2, N_t (site-specific copy number for the tumour)=1(results on Fig. 8A);

- dt=40,d_n=20, Ng=2, N_t=2(results on Fig. 8B); dt=40,d_n=20, Ng=2, N_t=5(results on Fig. 8C);

- dt=40,d_n=20, Ng=2, N_t=10(results on Fig. 8D).

Finally, the influence of the variant allele fraction (in a theoretical pure tumour sample, i.e. true VAF in the tumour) on the classification proportions was investigated by fixing the sequencing depths, aberrant cell fractions and sequencing error prior to d_t=40,d_n=20, a_n=0.0,a_t=0.9 and e=0.001 with tumour/normal copy numbers set to 2/2 and 5/2(results on Figure 9A), d_t=40,d_n=20, ot_n=0.l,a_t=0.9 and SNP prior=0.001 with copy numbers of 2 and 5 respectively in the normal and tumour and tumour in normal parameter set to 0 or 0.1(results on Figure 9B), and d_t=40,d_n=20, ot_n=0.l,a_t=0.9 and SNP prior=0.00001 with copy numbers of 2 and 5 respectively in the normal and tumour and tumour in normal parameter set to 0 or 0.1 (results on Figure 9C). The results on Figure 9 investigate somatic mutations that have a particular VAF in a theoretical pure tumour sample (whereas of course the normal and tumour sample both have specified fractions of pure tumour and pure germline) in terms of the ability of the new TiN aware method to correctly identify these mutations as somatic, compared to the previous method that was not TiN aware (i.e. equivalent to TiN set to 0, even though of course in the previous method no TiN parameter was in fact provided). On Figures 9A-9C, the "SOM" lines coincide with the sensitivity of detection of a somatic variant. The other lines represent classification proportions (misclassified variants - reference - REF, ambiguous (AMB - no classification (SNV,SNP,SNV or SNP) has greater than 80% probability), SNV - SOM, SNP, SNV or SNP - VAR). Figure 9A shows that in the absence of tumour-in-normal contamination (i.e. aberrant cell fraction in normal sample=0 for all curves in this plot), standard caveman can sensitively detect somatic mutations with minimal classification error. Moreover the sensitivity of detection is improved using a 5/2 setting. Figure 9B shows that standard caveman (dotted curves, tumour in normal contamination set to 0) is unable to reliably classify somatic variants in presence of even modest tumour-in-normal contamination with an approximately linearly increasing probability of classifying somatic variants as germline with increasing VAF. In contrast, TiN aware Caveman (solid curves, TiN set to 0.1) is able to correctly classify somatic variants as somatic variants across a wide range of VAF values. Figure 9C shows that even when the SNP prior is set to the low value of le-5, standard caveman still has limited ability to correctly classify high VAF variants (dotted lines, TiN set to 0). By contrast, TiN aware Caveman (solid curves, TiN set to 0.1) is able to correctly classify somatic variants as somatic variants across a wide range of VAF values.

Figure 7 shows the results of investigating the sensitivity of the model to the aberrant cell fraction in the normal sample. Figure 7 illustrates how the number of mutant reads in the normal sample that are consistent with the method calling a variant as somatic varies with the number of mutant reads in the tumour sample and tumour-in- normal aberrant cell fraction. Figure 8 shows the results of investigating the sensitivity of the model to the tumour copy number. Figure 8 illustrates how the number of mutant reads in the normal sample that are consistent with the method calling a variant as somatic varies with the number of mutant reads in the tumour sample and tumour copy number. Each plot shows the number of mutant reads in the tumour and the normal from which the position is classified as reference, ambiguous, SNV (i.e. somatic mutation), SNV or SNP (VAR), or SNP (germline variant). Generally, where there is a high proportion of mutant reads in the tumour then the cap on the number of mutant reads in the normal that support a somatic mutation call is higher, as expected.

Example 5 — Improved variant filtering process enables phylogenetic reconstruction of myeloproliferative neoplasm, revealing very early origins and lifelong evolution Introduction

Human cancers harbour hundreds to hundreds of thousands of somatically acquired DNA mutations. Whilst the majority of such mutations do not affect the cancer's biology, a minority drive tumour initiation, growth and progression (ICGC/TCGA, 2020). These so-called driver mutations occur in recurrently mutated cancer genes, and stimulate the cell acquiring it to expand into a clone. With a large enough clonal expansion, typically abetted by acquisition of further driver mutations, a cancer emerges. Little is known about what ages driver mutations occur, the timelines of clonal expansion over a patient's lifetime, or how these relate to clinical presentation with overt cancer. Some mutational processes accrue at a constant rate across life, representing a 'molecular clock' (Welch et al, 2021, Alexandrov et al.,2015). Knowing this tissue-specific rate of mutation accumulation, it has been possible to infer broad estimates for the timing of driver mutations for some cancers (Mitchell et al.,2018, Gerstung et al., 2020).

Understanding the absolute timelines of cancer evolution is critical for efforts aimed at early detection and intervention, especially if a given cancer takes decades to emerge after its first driver mutation. Myeloproliferative neoplasms (MPN) are blood cancers driven by somatic driver mutations in haematopoietic stem cells (HSC) that result in increased mature myeloid cell production (Vainchenker & Kralovics, 2017). Most patients harbor JAK2V617F, and this can either be the only driver mutation or occur in co-operation with driver mutations in other genes such as DNMT3A or TET2 (Grinfield et al., 2018). The clinical course often spans decades, with phenotypic disease progression occurring upon acquisition of additional driver mutations (Nangalia & Green, 2017). MPNs provide a unique opportunity to capture the earliest stages of tumourigenesis through to disease evolution which are otherwise inaccessible in other malignancies. The inventors undertook whole-genome sequencing (WGS) of individual single-cell derived haematopoietic colonies, and targeted resequencing of longitudinal blood samples from patients with MPN, to assess the absolute timing of driver mutations, tumour evolutionary dynamics, and the fundamental nature of driver mutation mediated clonal selection in vivo. The results of these studies are described in Williams et al., 2020 (available at https://www.biorxiv.org/content/10.1101/2020.11.09.374710vl.full.pdf ), which is incorporated herein by reference.

Such extensive phylogenetic analysis was made possible by the development of a new approach to identify somatic variants in cases where there is tumour-in-normal contamination (as is the case with MNP), described for the first time in this example.

Methods 1. Patients and Samples: Peripheral blood and bone marrow samples were obtained from patients with myeloproliferative neoplasms as described in Williams et al., 2020.

2. In—vitro colony culture for whole genome sequencing and targeted recapture of bulk DNA samples Peripheral blood mononuclear cells were isolated from patient blood or bone marrow samples, cultured and analysed by next generation sequencing as described in Williams et al., 2020. For the targeted sequencing of bulk peripheral blood (recapture) samples, we used Agilent SureDesign to design a baitset that captured all unmasked shared mutations across colonies. For private branches, we assayed up to 4 randomly selected unmasked variants per year of approximate branch length, capping the total number per private branch variants at 80 mutations. Variants that passed the SureDesign "most stringent" filter were preferentially selected. Sequencing on Illumina Novaseq was undertaken to depth of roughly 300-400x across all recapture samples.

3. Somatic mutation identification and quality filtering

3.(i). Single nucleotide variants (SNV) identification. SNVs were identified using CaVEMan (Jones et al., 2016 - without tumour in normal modelling) for each colony by comparison to an unmatched normal colorectal crypt sample (PD26636b) that had previously undergone whole genome sequencing. CaVEMan was run with the normal contamination of tumour parameter set to zero, and the tumour/normal copy numbers set to 5/2. Loci-specific copy number value estimates were investigated but were not found to be more useful than using such an inclusive copy number estimate for the tumour genome (which allows most variants to be identified, whereas using loci-specific estimates that are more constraining may lead to variants being missed if the patterns observed do not fit the estimated prior for copy number at the locus). In addition to standard filters (as described in Jones et al., 2016), reads supporting a SNV had to have a median BWA-MEM Alignment Score >=140 (ASMD>=140) and less than half of the reads should be clipped (CLPM=0). Further filters that were designed for the quality control post processing through the low-input sequencing pipeline were applied to all colonies, as recently described in Ellis et al, 2020 and Moore et al., 2020. The choice of filters used may depend on the data and the purpose of the analysis. The present invention is not limited in terms of filters applied in addition to filtering and/or variant calling using TiN modelling. The use of the unmatched normal means that this process calls both somatic and germline SNVs. The germline SNVs are subsequently removed in a downstream filtering process that uses pooled information across per-patient colonies and read counts from a matched germline WGS sample (buccal or T-cell DNA) (see section v). Note that a similar result could be obtained by running the method of Example 2 in unmatched mode, instead of the standard CaVEMan.

3.(ii). Identification of small insertions and deletions. Short Insertions and deletions were called using cgpPindel (Ye et al., 2009) with the standard cgpPindel VCF filters applied, except the F018 Pindel filter was disabled as it excludes loci of depth<10.

3.(ill). Identification of copy number changes. Copy-number aberrations (CNA) were identified using ASCAT (Van Loo et al., 2010) with comparison to a matched normal sample. Note that given the clonal nature of samples more refined identification of sub-clonal copy number changes was not required.

3.(iv). Joint loci pileups across all samples. The union of colony SNVs and Indels was taken and reads counted across all samples belonging to the patient (colonies, recapture samples, buccal and T- cells) using VAFCorrect (https://github.com/cancerit/vafCorrect). VAFCorrect avoids double counting of overlapping reads and minimises reference bias in the variant allele fractions of indels by performing a pileup mapping to both the reference genome and a locally altered reference genome that incorporates the alternative allele in the reference sequence. The basis for almost all further analysis is this per patient matrix of mutant read counts and depth. This joint pileup was analysed using cgpvaf, and used as input for the GLOD filter (as described in Example 1). 3.(v). Additional filters conducted post loci pileup. The next filtering steps are designed to identify variants that are germline, artefactual or non-clonal. Given that some patients have very few wild-type samples it is impractical to identify germline variants exclusively by measurements that work well in cases where there is more equitable split in the phylogeny as is characteristic for cells sampled from a phenotypically normal patient (e.g. one approach is to only keep variants that exhibit an aggregate VAF <0.5 (binomial test), but this risks excluding somatic variants that are on a highly shared mutant branch). So as an alternative approach that also accounts for tumour-in-normal contamination we defined a Germline SNV Filter (GLOD) that was an adaptation of the Mutect germline classifier (Cibulkis et al., 2013) to account for modest levels of potential contamination in a matched normal sample (Buccal or T-Cell)(see section 15 below). Figure 10 shows an example of the activity and pairwise overlap of these filters for a typical patient dataset. The diagonal entries show the percentage of filtered candidate variants that are removed by each filter and the off- diagonal elements show the number of filtered variants removed by both the corresponding filters. The germline filter "bglod" is most active, here filtering out 93.5% of the candidate somatic SNVs (note, the filtering here was also applied to variants present in all colonies to confirm their germline nature). The quality control process generates this information for each patient. The following filters were used: a. bglod: Remove loci where the GLOD filter (see Example 1 above) applied to the nominated matched normal indicates that the variant is probably a germline. b. near_indel: Remove loci within 10 bp of an indel. c. too_close: Remove loci that are within 10 bp of each other. d. max_miss: Remove loci where more than a fifth of the colonies have depth<6. e. max_gmiss: Remove loci where more than a fifth of the colonies have missing genotype. f. count_fliter: Remove loci where no samples have a genotype 1 or where all samples have a genotype of 1. g. vaf_too_low_s: Remove loci where the total mutant read count is significantly less (p=?) than 0.9*Expected VAF*total depth across all colonies that have genotype=l (Binomial Test). This implicitly assumes that 10% false positive rate in genotyping is acceptable. The expected VAF accounts for copy number variation and is the depth weight average of the minimum mutant VAF across the mutant colonies at that locus. h. vaf_too_low_m2: Remove loci where the total mutant read count is significantly less than 0.9*0.5*total depth across all colonies that have at least 2 mutant reads (Binomial Test). Generally, dominates the above. i. vaf_zg_too_noisy: Remove loci where total mutant read count at sites with genotype=0 is significantly greater than

0.01*Depth.

Results

Using somatic mutations for haematopoietic lineage tracing in patients with MPN

The mutations present in a somatic cell's genome have accumulated throughout its ancestral lineage, passed from mother to both daughter cells with each cell division. The inventors identified somatic mutations in individual HSCs from patients with MPN, using them to reconstruct the lineage relationships among both malignant and normal blood cells in each patient (Lee-Six et al., 2018). Given that somatic mutation burden does not differ between HSCs and myeloid progenitors (Lee-Six et al., 2018; Osorio et al., 2018), we undertook WGS of in-vitro expanded single-cell derived haematopoietic colonies as faithful surrogates for the genomes of their parental HSCs. The inventors then 'recaptured' these somatic mutations in bulk peripheral blood cells using targeted sequencing in order to longitudinally track clones and infer population estimates. The cohort used comprised 10 patients with MPN diagnosed between ages 20 and 76 years (3 essential thrombocythemia (ET), 5 polycythemia vera (PV) and 2 post-PV myelofibrosis (MF)).The inventors obtained 952 colonies from 15 timepoints spanning both diagnosis and disease course for WGS to a mean depth of ~14x. Colonies with low sequencing coverage or evidence of non-clonality were excluded, and 843 were included in the final analyses. The inventors identified 448,553 somatic single nucleotide variants (SNV) and 14,851 small insertions and deletions. Variant allele fractions (VAFs) clustered around 0.5, confirming colonies derived from a single cell. There were no additional subclonal peaks in the VAF distribution confirming that few mutations were acquired in- vitro. All patients harbored mutated-<7AFC2 (9 JAK2V617F, 1 JAK2exon 12), and 9 patients had additional driver mutations, most commonly in DNMT3A (n=8), TET2 (n=4) and PPM1D (n=3).

Improved somatic mutations calling taking into account tumour-in- normal contamination

The extent of the possible tumour-in-normal contamination can be investigated by looking at known somatic mutation sites in patients. For example, by aligning reads at the site of mutation JAK2V^617F for a tumour sample and matched germline (buccal) sample and estimating the VAF of the mutation in the two samples. In the case of tumour sample PD11410c and matched buccal sample P11410b, the aligned reads were associated with a calculated VAF for the mutation of 38.5% in the tumour sample and 15.4% in the normal sample (where germline cells should not contain the mutation), indicating approximately 30% tumour in normal contamination. For another tumour sample (PD5028d) and matched germline (T-cell) sample (PD5028b, 5.5%), the aligned reads were associated with a calculated VAF for the mutation of 100% in the tumour sample and 5.5% in the normal sample (where germline cells should not contain the mutation), indicating approximately 10% tumour in normal contamination.

Next, variants were called as explained above, and sites that were previously identified as somatic variants using the normal filters in CaveMAn but that passed the new GLOD filter were identified.

These represent somatic variants that would have been missed without the new filtering method described herein. The results of this are shown on Figure 11 for three patients, where each plot shows the VAF in the tumour and in the normal sample for each variant identified (an=aberrant cell fraction in normal sample, at=aberrant cell fraction in tumour sample). The dots in black were identified as Germline SNPs (0.5 VAF in both tumour and normal). The dots in grey are somatic variants that have a non-zero VAF in the normal sample due to tumour-in-normal contamination, and would therefore fail standard filters. They could however be "rescued" using the new methods described herein. The stars in the plots represent the centroid of a cluster from a k-means cluster analysis comprising those variants in grey, which was used to identify the major tumour clone and the degree of tumour in normal contamination (as explained in Example 6 - section "Joint estimation aberrant cell fraction in tumour and normal").

Phylogeny of haematopoiesis, patterns of driver mutations, mutation acquisition in adulthood, estimated age of acquisition of driver mutations, clonal expansion and disease latency Using the variant data analysed as explained above (Methods) and in Example 1, the inventors were able to reconstruct phylogenetic trees from the presence or absence of SNVs across colonies, and to estimate mutation rates in individual patients. These estimates were used to calculate estimates for the age at which driver mutations occurred in MPN patients. These results and further analyses are provided in Williams et al., 2020.

Conclusions

By using somatic mutations for lineage tracing, the inventors were able to retrace life histories from early embryogenesis through to the single cell HSC origin of MPN and CH, and delineate driver mutation timing, clonal selection and clonal evolution over the life of 10 patients with MPN. This was enabled at least in part due to the development of a new method for identifying somatic variant which showed greatly improved sensitivity in contexts where tumour- in-normal contamination is present.

The analysis described above and in Williams et al.,2020 is just one example of a context in which this method can be applied to great benefits. In particular, the study revealed, amongst other important insights, that regardless of age of diagnosis JAK2^V617F is usually acquired early in life (earliest appearance few weeks post conception). DNMT3A mutations, most commonly associated with age related CH (Genovese et al., 2014, Jaiswal et al., 2014, Xie et al., 2014), were also acquired in utero and during childhood with slow and steady growth over life. The inventors further identified that clonal expansions spanned the lifetime of MPN patients, often with sequential driver mutation acquisitions, including JAK2V617F as a second driver event. Additionally, the rate of clonal expansion upon JAK2^V617F acquisition was found to be variable. The results indicate that the point at which a clinical diagnosis is made for MPN (currently defined phenotypically, by blood count parameters and bone marrow histomorphology), represents one time-point on a continuous trajectory of lifelong clonal outgrowth, at which blood counts have reached certain thresholds, or clinical complications have already occurred. Current diagnostic criteria do not best capture when patients begin to have disease as life-threatening thromboses often trigger diagnosis. The data show that mutant clones will generally have been present for 10 to 40 years before diagnosis and would have been detectable for much of this time using sensitive assays. Taken together, the key to early detection and prevention may lie in both detecting low burden mutant clones early and in establishing their rate of growth, by repeated sampling, to capture those individuals on a future trajectory to clinical complications. The cornerstone of MPN management currently is aimed at normalising blood counts and reducing risk of thrombotic or haemorrhagic events - such treatments are mostly safe and well-tolerated, and could be offered to individuals with high-risk molecular profiles.

Thus, the new methods described herein enabled the discovery of important insights that may change the approach to MPN diagnosis. As the skilled person understand, any molecular diagnostic or prognostic method that relies on the identification of somatic variants and where some level of tumour in normal contamination may be present is likely to benefit equally from the methods disclosed herein.

Example 6 — Improved variant calling enables identification of changes in clonal architecture that inform MPN disease course in advance of phenotypic manifestations Introduction

Myeloproliferative neoplasms (MPNs) are chronic haematopoietic neoplasms and clinical disease can often span decades. The commonest WHO-defined Philadelphia-negative MPNs are polycythemia vera (PV), essential thrombocythemia (ET) and myelofibrosis (MF), with rarer entities including systemic mastocytosis (SM), chronic myelomonocytic leukemia (CMML) and MPN/myelodysplasia (MDS) overlap disorders. Disease progression can occur in MPN, in particular to myelofibrosis (MF) and acute myeloid leukaemia (AML) in 5-20% of patients and is associated with a poor prognosis. We currently diagnose disease progression based on blood parameters, histology, patient symptoms and clinical exam without incorporating genomic information. Incorporating serial genomic data into the diagnosis and monitoring of disease and understanding this relationship will be key to improved risk stratification, earlier recognition of disease progression and offering of targeted or definitive therapies for patients at a stage of lower clinical burden of disease. In this example, the inventors study longitudinal changes in the underlying genomic architecture of MPN and how they relate to clinical parameters and disease status to better define disease progression. This analysis was made possible by the use of a new filter termed Glod (described in Example 1) that was able to rescue variants initially filtered out by the matched germline filter applied in standard variant calling, and which were lost due to significant and variable contamination of tumour in the germline sample.

Methods

Patient samples: Patients were recruited from Addenbrooke's NHS Foundation trust under the clonal disorders of haematopoiesis study (07/MRE05/44). Whole blood was obtained during routine clinical visits.

Whole Genome sequencing (WGS): Samples were sequenced on Illumina Hiseq2500 lOObp PE reads to an average depth of 44x (range 36 - 75x) in samples and 33x range (27- 41x ) in matched germline (T cell and buccal swabs) samples. Data analysis: The samples were analysed using the established whole genome sequencing pipeline at the Sanger institute with alignment to the human reference genome (GRCh37d5) with BWA-Mem and sorting by samtools. Single nucleotide variants (SNVs) were called using the CAVEman (1.11.0) variant caller with both a matched (standard Caveman) and unmatched setting (as described in Example 2 - unmatched setting). Pindel (2.1.0) was used to identify insertion/deletion (indel) type mutations with the germline flag F015 removed in order to be able to rescue all variants. Both SNVs and indels underwent additional filtering to rescue variants missed due to tumour in normal contamination (TiN). Copy number aberrations (CNAs) were detected using ASCAT(4.0.1; Van Loo et al.,2010) and Battenberg (3.5.3; Nik-Zainal et al., 2012) with the latter providing information on subclonal copy number states used as input into the dpclust algorithm for subclonal decomposition of the MPN tumour.

Tumour in normal contamination: The Glod filter described in Example 1 was applied to an unmatched caveman analysis which contained calls in the germline as well as the tumour sample. The specific TiN value for each germline was taken into account to rescue variants (see below).

Joint estimation aberrant cell fraction in tumour and normal: An unmatched analysis was carried out that identifies all somatic variants and also less common germline variants. This resulted in 2-D VAF dataset (VAF in contaminated normal sample vs VAF in contaminated tumour sample). We assume that in a pure tumour the main clone will have a VAF=0.5 in autosomal chromosomes with copy number regions excluded - so in a contaminated sample the observed VAF,v, will be v=0.5*acf (where acf is the aberrant cell fraction in the sample) and thus acf=2*v. This main clone is identified by kmeans based cluster analysis on all sufficiently well covered variants (e.g. depth>40 in both tumour and normal) with a VAF in the normal sample of that excludes germline variants, such as e.g. less than 25%. Two clusters are fitted and the aberrant cell fraction is determined by the centroid (vn,vt) of the largest VAF cluster (other summary metrics may be used). The aberrant cell fractions are then determined as follows: aberrant cell fraction in the normal=2*vn aberrant cell fraction in the tumour=2*vt

2-D VAF Plots of the above clustering analysis are manually inspected to check whether the clustering is plausible. The aberrant cell fraction in tumour can be checked against values obtained using other methods e.g. ASCAT or Battenberg.

Additionally, the VAF of driver variants can be inspected to provide additional evidence of tumour-in-normal contamination. The above method is particularly suitable when the samples can be assumed to be clonal. dNdScv: We carried out a somatic driver mutation screen by looking at the ratio of synonymous to non-synonymous mutations using the R package dNdScv (https://github.com/im3sanger/dndscv).

Subclonal reconstruction: DPclust V2.2.8 (https://github.com/Wedge- lab/dpclust) was used to reconstruct the subclonal structure as has been previously described (nik-zainal and maura papers). Variants assigned to specific clusters (clones) were annotated using ANNOVAR.

Correction of mutation burden: The sensitivity of CAVEman to identify mutations at specific VAFs and depth was taken into account to correct number of mutations of clones.

Timing of mutations: The timing of acquisition of key mutant clones was inferred using the burden of somatic mutations within individual clones following subclonal reconstruction. Following correction of mutation burden, a constant rate of mutation burden was assumed at 18 SNVs/year with a range (x-y mutations/year). For the earliest detectable clone, an additional penalty of 50 SNVs was given to account for recently observed accelerated mutation burden during embryogenesis. The resultant figure therefore is the upper bound for time of acquisition of all SNVs in the clone.

Results Longitudinal WGS was undertaken in 32 MPN patients(14 PV, 15 ET, 1 SM/CMML, 1 CMML and 1 MPN/MDS-RS-T), across a total of 71 timepoints, together with matched germline controls. The median interval between WGS timepoints was 9.8 years (range 2-174 months). Disease transformation to MF (n=8), AML (n=10) occurred in 18 patients by the latest time point of sequencing and 14 patients had clinically stable disease at all sequencing timepoints.

In order to identify somatic variants in the tumour samples, we used matched germline samples from buccal swabs, T cell cultures and b cells. Variable germline contamination with tumour (range 0-44%) (Figure 12) was observed in our samples which required a custom strategy in order to accurately rescue somatic variants filtered out by the germline filters. The custom Glod filter (see Example 1) was run on an unmatched call set of SNV and indel variants allowing for rescue of somatic variants (Figure 13) in those with higher levels of tumour in normal contamination. The above variants underwent dirichlet based clustering using cancer cell fractions (CCFs) derived from the VAFs of each variant accounting for ploidy at variant site. The clustering allowed for each variant to be assigned to a unique clone/subclone enabling subclonal reconstruction of the tumour compartment. Since each patient had at least 2 WGS timepoints, assignment of mutations to clones was accurately achieved by considering the change in cancer cell fraction (CCF) of each variant over the samples.

Somatic driver discovery using the dNdScv algorithm to analyse the ratio of synonymous to non-synonymous mutations identified mutations in JAK2, TET2, EZH2, ASXL1, CALR, PHF6, ZRSR2 and SH2B3 in those patients that had resultant disease transformation (FDR 0.01). In comparison only JAK2 was identified in patients with lasting stable disease. No recurrent non-coding mutations were identified.

By looking at the corrected total number of single nucleotide variants (SNVs) attributable to each clone and assuming a constant rate of SNV acquisition (18 SNVs/year), we were able to ascribe an upper bound on the timing of mutation acquisition for drivers present in clones. To test this concept, we overlaid the called SNVs for each clone in 3 patients with SNVs called from WGS of single cell derived colonies that underwent phylogenetic reconstruction. Using this phylogenetic approach had enabled accurate assessment of mutation burden as well as providing estimates of mutational time. Clearly, we are not able to capture resolution of all SNVs to a single cell level with WGS of bulk samples at this depth. However, given the size of the clones it was reasonable to expect coverage of the shared branches of the genomic phylogeny. The SNVs called by the multi-sample bulk approach cover the majority of the early shared branches with resolution being lost further down the tree as to be expected with the drop off in VAF of those mutations. Nonetheless, after correction for sensitivity for detecting SNVs at a mean depth and VAF we can ascribe reasonable confidence to our approach of using the total SNVs in a clone to be represent the mutation history of that clone and thereby offer estimates of time. Following correction for the increased mutation burden in the earliest time of life, the median age for latest acquisition of earliest detectable clone (EDC) across the cohort was 34.6 years (range 2.8-89.6 yrs) with this value reducing to 27.5 years (range 2.8 to 62.9 yrs) when considering patients with JAK2 driver mutations (21 out of 32). However, when we look in patients with an identifiable JAK2 only founder clone the median age for the upper bound is 17 years with range (2.8 - 43 years). This suggests that JAK2 mutations in particular are acquired in the early decades of life with the MPN disease manifesting years to decades later. Indeed, in this cohort of patients with a detectable JAK2 founder clone the median age of diagnosis was 45 (range 36-64yrs). In a few patients we were able to time acquisition of clones that drive transformation to MF or AML to be prior to the first sample time point. In PD4060, a patient diagnosed with PV age 50 years old, we timed the JAK2+TET2 founder MPN clone to have been acquired by age 23.8 years and the NOTCH2+9pLOH clone responsible for their MF transformation by age 37.8 years. In PD9478, the DNMTJa clone has been acquired by 51.9 years before the age of diagnosis at age 53.

Conclusion Our study represents a rare dataset of WGS on serial samples paired with clinical data from MPN/MDS patients who have had long term clinical follow-up leading to various clinical paths. A challenge with calling somatic variants in this dataset was the variable levels of TiN. This required the development of a specialised filter (Glod) to rescue variants from an unmatched call set by comparing the VAFs of mutations in germline to the tumour samples whilst accounting for known SNP sites. This rescuing of variants was most profound in samples with high TiN values (see Figure 6c).

By performing WGS on serial samples we were able to accurately ascribe mutations to individual clones and provide some measure of timing for acquisition of mutations in the clone. We show that by using the adjusted total number of SNVs in each clone we can provide estimates of timing of mutation assuming a constant rate of SNV acquisition. These estimates of timing provide an upper bound and so any driver mutation or event within that clone may have occurred anywhere from birth or where appropriate, the upper bound of the previous clone. Crucially, we are able to show that driver clones are acquired years to decades in advance of presentation of MPN. This is most profound in the patients with an identifiable JAK2 founder clone with median for latest age of acquisition being 17 years.

We describe that changes in the underlying clonal composition of disease are a prequel to changes in laboratory blood count parameters, which in turn pre-date clinical recognition of a change in disease status offering opportunities for earlier diagnosis of disease progression. In such patients, the drivers of subsequent disease transformation may have already been acquired years in advance of the initial MPN diagnosis. In contrast, clonal stability correlates with clinically stable disease. Serial genetic monitoring of mutant clones in MPN patients could allow prediction of those patients with stable clinical disease, and earlier identification of patients at high risk of disease transformation whereby targeted or definitive therapies may be offered to patients prior to deterioration in clinical status. References

Jones D, Raine KM, Davies H, Tarpey PS, Butler AP, Teague JW, Nik- Zainal S, Campbell PJ. cgpCaVEManWrapper: Simple Execution of CaVEMan in Order to Detect Somatic Single Nucleotide Variants in NGS Data. Curr Protoc Bioinformatics. 2016 Dec 8;56:15.10.1-15.10.18. doi: 10.1002/cpbi.20.

Welch, J. S. et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 150, 264-278 (2012).

Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477-2487 (2014).

Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488-2498 (2014).

Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472-1478 (2014).

Osorio, F. G. et al. Somatic Mutations Reveal Lineage Relationships and Age-Related Mutagenesis in Human Hematopoiesis. Cell Rep. 25, 2308-2316.e4 (2018).

John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes, COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue DI, 08 January 2019, Pages D941-D947.

Li, M. M., Datto, M., Duncavage, E. J., Kulkarni, S., Lindeman, N. I., Roy, S., Tsimberidou, A. M., Vnencak-Jones, C. L., Wolff, D. J., Younes, A., & Nikiforova, M. N. (2017). Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. The Journal of molecular diagnostics : JMD, 19(1), 4-23.

Ma, J., Setton, J., Lee, N. et al. The therapeutic significance of mutational signatures from DNA repair deficiency in cancer. Nat Commun 9, 3292 (2018).

Kakadia, P.M., Van de Water, N., Browett, P.J. et al. Efficient identification of somatic mutations in acute myeloid leukaemia using whole exome sequencing of fingernail derived DNA as germline control. Sci Rep 8, 13751 (2018). ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pancancer analysis of whole genomes. Nature 578, 82-93 (2020).

Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 47, 1402-1407 (2015).

Mitchell, T. J. et al. Timing the Landmark Events in the Evolution of Clear Cell Renal Cell Cancer: TRACERx Renal. Cell 173, 611- 623.el7 (2018).

Gerstung, M. et al. The evolutionary history of 2,658 cancers. Nature 578, 122-128 (2020).

Karczewski KJ, Weisburd B, Thomas B, Solomonson M, Ruderfer DM, Kavanagh D, Hamamsy T, Lek M, Samocha KE, Cummings BB, Birnbaum D; The Exome Aggregation Consortium, Daly MJ, MacArthur DG. The ExAC browser: displaying reference data information from over 60000 exomes. Nucleic Acids Res. 2017 Jan 4;45(DI):D840-D845.

Williams, Nicholas and Lee, Joe and Moore, Luiza and Baxter, E Joanna and Hewinson, James and Dawson, Kevin J and Menzies, Andrew and Godfrey, Anna L and Green, Anthony R and Campbell, Peter J and Nangalia, Jyoti. Phylogenetic reconstruction of myeloproliferative neoplasm reveals very early origins and lifelong evolution. Biorxiv elocation ID 2020.11.09.374710. November 9, 2020. https://doi.org/10.1101/2020.ll.09.374710

Vainchenker, W. & Kralovics, R. Genetic basis and molecular pathophysiology of classical myeloproliferative neoplasms. Blood 129, 667-679 (2017).

Grinfeld, J. et al. Classification and Personalized Prognosis in Myeloproliferative Neoplasms. N. Engl. J. Med. 379, 1416-1430

(2018).

Nangalia, J. & Green, A. R. Myeloproliferative neoplasms: from origins to outcomes. Blood 130, 2475-2483 (2017).

Sudmant, Peter H., et al. "An integrated map of structural variation in 2,504 human genomes." Nature 526.7571 (2015): 75-81.

Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94-101 (2020).

Lee-Six, H. et al. Population dynamics of normal human blood inferred from somatic mutations. Nature 561, 473-478 (2018).

Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213-219 (2013).

Moore, L. et al. The mutational landscape of normal human endometrial epithelium. Nature 580, 640-646 (2020).

Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865-2871 (2009).

Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl. Acad. Sci. U. S. A. 107, 16910-16915 (2010).

Nangalia, J. et al. Somatic CALR mutations in myeloproliferative neoplasms with nonmutated JAK2. N. Engl. J. Med. 369, 2391-2405 (2013).

Ellis, P. et al. Reliable detection of somatic mutations in solid tissues by laser-capture microdissection and low-input DNA sequencing. Nat. Protoc. 1-31 (2020) doi:10.1038/s41596-020-00437-6.

Nik-Zainal et al. The life history of 21 breast cancers. Cell 2012 May 25;149(5):994-1007.All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.

The specific embodiments described herein are offered by way of example, not by way of limitation. Various modifications and variations of the described compositions, methods, and uses of the technology will be apparent to those skilled in the art without departing from the scope and spirit of the technology as described. Any sub-titles herein are included for convenience only, and are not to be construed as limiting the disclosure in any way.

The methods of any embodiments described herein may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described above.

Unless context dictates otherwise, the descriptions and definitions of the features set out above are not limited to any particular aspect or embodiment of the invention and apply equally to all aspects and embodiments which are described.

Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase "in one embodiment" as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase "in another embodiment" as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.

It must be noted that, as used in the specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from "about" one particular value, and/or to "about" another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent "about," it will be understood that the particular value forms another embodiment. The term "about" in relation to a numerical value is optional and means for example +/- 10%.

Throughout this specification, including the claims which follow, unless the context requires otherwise, the word "comprise" and "include", and variations such as "comprises", "comprising", and "including" will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term "comprising" replaced by the term "consisting of" or "consisting essentially of", unless the context dictates otherwise.

The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.

Claims

1. A method for identifying somatic variants from sequence data, the method comprising: receiving sequence data from one or more tumour samples of a subject and one or more normal samples, identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome, and determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples.

2. The method of claim 1, wherein the probabilistic model is a Bayesian model.

3. The method of any preceding claim, wherein the one or more normal samples are from the same subject, and/or the sequence data from the one or more normal samples comprise sequence data associated with the presence of tumour genomic material.

4. The method of any preceding claim, wherein the tumour is a tumour of the hematopoietic or lymphoid tissue, and/or wherein the normal sample is a buccal sample, a T cell sample, a blood sample, a plasma sample, or a purified sample derived from blood or plasma.

5. The method of any preceding claim, wherein the one or more normal samples comprise at least 1%, at least 2%, at least 3%, at least 4% or at least 5% tumour in normal contamination.

6. The method of any preceding claim, wherein determining whether each of the one or more identified sites is a somatic variant using the sequence data from the one or more normal samples and a probabilistic model that models the presence of tumour genetic material in the one or more normal samples comprise: identifying a set of one or more candidate somatic variant sites; and applying one or more filters to the set of candidate somatic variant sites to obtain a filtered set; wherein the step of identifying a set of candidate somatic variant sites and/or the one or more filters use a probabilistic model that models the presence of tumour genetic material in the one or more normal samples.

7. The method of claim 6, wherein the one or more filters are selected from: a proximal gap filter that excludes variants that are within a predetermined distance (such as e.g. 10 bp) of an indel, a poor mapping region filter that excludes variants that are in locations having a mapping quality score below a predetermined threshold, a clustered variant filter that excludes variants that are within a predetermined distance from each other (such as e.g. 10 bp), a strand bias filter that excludes variants where the genotype inferred from reads mapped to the forward strand is not the same as the genotype inferred from reads mapped to the reverse strand, a read depth filter that excludes variants that are in regions with a read depth below a predetermined threshold, optionally wherein the one or mor filters do not include a filter that excludes sites where the sequence data from the one or more normal samples contain evidence of the presence of the variant unless the filter uses a probabilistic model that models the presence of tumour genetic material in the one or more normal samples.

8. The method of claim 6 or claim 7, wherein the one or more filters comprise a filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold, wherein the likelihood assuming that the site has a somatic variant is obtained using a model that takes into account the tumour in normal contamination in the one or more normal samples.

9. The method of claim 8, wherein the likelihood assuming that the site has a germline variant is obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a first predetermined variant allele fraction, the likelihood assuming that the site has a somatic variant is obtained by determining the probability of observing the sequence data for the site in the one or more normal samples in view of a probability of sequencing error associated with the sequence data and a second predetermined variant allele fraction, wherein the first and second predetermined variant allele fractions are different from each other.

10. The method of claim 9, wherein the second predetermined variant allele fraction is obtained by multiplying the first predetermined variant allele fraction by a predetermined tumour in normal contamination.

11. The method of claim 9 or claim 10, wherein the probability of sequencing error associated with the sequence data is obtained for each read in the sequence data or wherein the probability of sequencing error associated with the sequence data is a predetermined value that is the same for all sequence reads.

12. The method of any of claims 8 to 11, wherein the predetermined threshold (0) depends on a desired level of certainty (5), a prior probability that the candidate somatic variant is a somatic variant, and a prior probability that the candidate somatic variant is a germline variant.

13. The method of claim 12, wherein the filter that excludes variants where a metric reflecting the relative likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant or a germline variant is below a predetermined threshold is a filter that excludes variants that do not meet the criterion:

where P(Data|Somatic) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a somatic variant, P(Data|Germline) is the likelihood of the sequence data for the one or more normal samples assuming that the site has a germline variant, P(Somatic) is the prior probability that the candidate somatic variant is a somatic variant and P(Germline) is the prior probability that the candidate somatic variant is a germline variant.

14. The method of claim 12 or claim 13, wherein the prior probability that the candidate somatic variant is a somatic variant is set to a predetermined value, such as e.g. 3xl0^-6 and/or wherein the prior probability that the candidate somatic variant is a germline variant depends on whether the variant is catalogued in one or more data sets of known single nucleotide polymorphisms.

15. The method of any preceding claim, wherein determining whether each of the one or more identified sites is a somatic variant comprises: determining the posterior probability of each of one or more candidate joint genotypes having a somatic variation, in view of the sequence data for the one or more tumour samples and the sequence data for the one or more normal samples, wherein said posterior probability depends on the aberrant cell fraction in the one or more tumour samples and the aberrant cell fraction in the one or more normal samples.

16. The method of claim 15, wherein the posterior probability depends on the aberrant cell fraction in the one or more tumour samples and the aberrant cell fraction in the one or more normal samples through the likelihood of the sequence data assuming the joint genotype, optionally wherein the likelihood of the data assuming the joint genotype is the product of: a term for the sequence data from the tumour sample(s), which depends on the aberrant cell fraction in the one or more tumour samples, and a term for the sequence data from the normal sample(s), which depends on the aberrant cell fraction in the one or more normal samples.

17. The method of claim 16, wherein the likelihood of the sequence data from the normal or tumour samples depend on: a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the reference allele for the joint genotype and in view of the values of one or more covariates, and a term that represents the probability of a particular base being called from the sequence data assuming that the true genotype is the variant allele for the joint genotype and in view of the value of one or more covariates, optionally wherein the probability of a particular base being called from the sequence data assuming a true genotype and in view of the value of one or more covariates is estimated empirically from the sequence data from the tumour samples and/or the normal samples, by assuming that all sites that differ from a reference sequence are sequencing errors, preferably excluding all sites that are known germline variant.

18. The method of any of claims 15 to 17, wherein determining whether each of the one or more identified sites is a somatic variant further comprises: determining whether the posterior probability a candidate joint genotypes having a somatic variation and/or the sum of the posterior probabilities of each of the candidate joint genotypes having a somatic variation is/are above a predetermined threshold, and determining that the site has a somatic variant in the affirmative.

19. The method of any of claims 1 to 14, wherein the step of identifying one or more sites where the sequence data from the one or more tumour samples is indicative of a difference between the tumour genome and a reference genome: does not use the sequence data from the one or more normal samples, or uses sequence data from one or more unmatched normal samples.

20. The method of any preceding claim, comprising:

(i) analysing the sequence data from the one or more tumour samples of the subject and the one or more normal samples using the method of any of claims 8 to 14; and

(ii) analysing the sequence data from the one or more tumour samples of the subject and the one or more normal samples using the method of any of claims 15 to 18; and optionally (iii) comparing the identified sites determined to be somatic variants at step (i) and the identified sites determined to be somatic variants at step (ii) and further filtering the identified sites determined to be somatic variants at step (i) but not determined to be somatic variants at step (ii).

21. The method of any preceding claim, further comprising providing to a user, for example through a user interface, information identifying the sites determined to be somatic variants, and optionally information identifying the sites determined not to be somatic variants.

22. A method of providing a prognosis for a subject that has been diagnosed as having cancer, the method comprising identifying one or more somatic variants in one or more tumour samples from the subject using the method of any of claims 1 to 21, and classifying the subject between a plurality of groups associated with different prognosis based on the one or more somatic variants identified.

23. A method of determining whether a subject that has been diagnosed as having cancer is likely to respond to a therapy, optionally wherein the therapy is an immunotherapy, the method comprising identifying one or more somatic variants in one or more tumour samples from the subject using the method of any of claims 1 to 21, and classifying the subject between a plurality of groups associated with different responses to the therapy based on the one or more somatic variants identified.

24. A system comprising: a processor; and a computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform the steps of the method of any of claims 1 to 23, optionally wherein the system further comprises sequence data acquisition means.

25. One or more non-transitory computer readable media comprising instructions that when executed by one or more processors, cause the one or more processors to perform the steps of the method of any of claims 1 to 23.