WO2018042185A1 - Methods, systems and apparatus for identifying pathogenic gene variants - Google Patents

Methods, systems and apparatus for identifying pathogenic gene variants

Info

Publication number
WO2018042185A1
Authority
WO
WIPO (PCT)
Prior art keywords
variant
gene
disease
information
rules
Application number
PCT/GB2017/052545
Other languages
English (en)
Inventor
Stuart Alexander COOK
James WARE
Paul Barton
Roddy WALSH
Gillian REA
Nicola WHIFFIN
Elizabeth Edwards
Daniel Geoffrey MACARTHUR
Eric MINIKEL
Original Assignee
Imperial Innovations Ltd
Application filed by Imperial Innovations Ltd

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the present invention relates to methods, systems and apparatus for classifying gene variants according to pathogenicity.
  • the invention relates to diagnosing inherited cardiac conditions based on the genetic variant profile of an individual and identifying causative variants for these conditions, for example to allow family screening after an individual has been diagnosed.
  • Genetic information is increasingly used as part of the diagnostic toolbox for conditions that have an inheritable component, supported by advances in sequencing technologies such as the development of affordable high-throughput next generation sequencing.
  • the generation of increasing amounts of genetic data is accompanied by new challenges in interpreting this information. Indeed, as every individual is estimated to carry approximately 12,000 to 14,000 predicted protein-altering variants, distinguishing disease-causing variants from benign bystanders is perhaps the principal challenge in contemporary clinical genetics.
  • a system for assessing the pathogenicity of a genetic variant comprising: a data analysis
  • the data analysis server being connected to at least one of a genetic information data source storing frequency information relating to frequency of at least one genetic variant in at least a control population, a disease-variation association data source storing information on associations between at least one gene variant, at least one gene or other gene variants in the at least one gene with diseases; and a protein-related data source storing information on the known or predicted effects of at least one genetic variant on a gene product; wherein the data analysis server is connected to a user device and configured to receive information from a user about a genetic variant identified in an individual; the data analysis server further including a search application configured to query at least one of the genetic information data source for frequency information relating to the genetic variant in at least a control population, the protein-related data source for information on the known or predicted effect of the variant on the gene product, and the disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases; and a data analysis application configured to execute at least one of one or
  • mgc: maximal genetic contribution
  • mac: maximum allelic contribution
  • the data analysis application is configured to transmit a pathogenicity assessment to the user device, wherein the pathogenicity assessment is based at least in part on the pathogenicity score.
  • a method for assessing the pathogenicity of a protein altering genetic variant comprising: (1) receiving information from a user about a genetic variant identified in an individual; (2) querying: (i) a genetic information data source for frequency information relating to the variant in at least a control population; and (ii) a protein-related data source for information about the known or predicted effect of the variant on the gene product; and/or (iii) a disease-variation association data source for information on association between a chosen disease and the variant under assessment, other variants in the same gene; and/or variants in a paralogous gene; (3) executing one or more rules that include a comparison of the frequency of the variant in a control population and one or more rules that include a comparison of the known or predicted effect of the variant on the gene product and/
  • a data analysis server comprising a processor and a memory, wherein the memory includes a processor executable program configured to perform the method of the invention and the processor is configured to execute the program.
  • a computing device comprising a software program configured to control the computing device to perform the method of the invention.
  • a system for assessing the pathogenicity of a genetic variant comprising: a data analysis server, a genetic information data source, a disease-variation association data source and a protein-related data source wherein the data analysis server is programmed to: (1) receive information from a user about a genetic variant identified in an individual; (2) query a genetic information data source for frequency information relating to the variant in at least a control population and a protein-related data source for information on the known or predicted effect of the variant on the gene product and/or a disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases; (3) evaluate the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variation association data source; (4) combine the results of the tests of step (3) into a pathogenicity score; and (5) provide the path
  • a method for assessing the pathogenicity of a protein altering genetic variant comprising:
  • a genetic information data source for frequency information relating to the variant in at least a control population; and (ii) a protein-related data source for information about the known or predicted effect of the variant on the gene product; and/or (iii) a disease-variation association data source for information on association between a chosen disease and the variant under assessment, other variants in the same gene; and/or variants in a paralogous gene; (3) evaluating the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from the disease-variant association data source; (4) combining the results of the tests of step (3) into a pathogenicity score; and (5) providing the pathogenicity assessment to a user, wherein the evaluating the results of at least one test based on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining
  • a data analysis server comprising a processor and a memory, wherein the processor is programmed to perform the method of the invention.
  • a computing device comprising software adapted to perform the method of the invention.
  • Figure 1 schematically shows relevant parts of a representative genetics diagnostic system suitable for implementing an embodiment of the disclosure.
  • Figures 2a and 2b schematically illustrate relevant functions of a user device and a data analysis server, each suitable for implementing an embodiment of the disclosure.
  • Figure 3 describes a method according to an aspect of the disclosure.
  • Figure 4 is an example of the use of a rare variant frequency filtering method according to embodiments of the invention.
  • Figure 5a shows the results of applying a variant frequency filtering method according to an aspect of the invention to the ExAC data and a simulated disease.
  • Figure 5b shows how the use of methods according to aspects of the invention in the context of cardiac disease allows filtering for clinically significant genes.
  • Figure 6 shows an extract of an exemplary report generated using the methods and systems of the invention.
  • a reference sample or reference data / database is a source of genetic data from one or more individuals that have not been specifically selected for the presence or absence of a particular condition. Therefore, the frequency of a disease or disease causing variant in this data is not expected to be larger than the disease prevalence in the general population or subpopulation from which the genetic data was extracted.
  • An example of reference data and an associated database is the Exome Aggregation Consortium (ExAC) dataset (see ExAC et al.
  • a genetic information data source is a data source that comprises reference data.
  • a genetic information data source may also contain genetic data from one or more individuals in one or more case cohorts, as part of the same or multiple databases.
  • a genetic variant is any departure from the sequence of a reference genome.
  • Variants may be single nucleotide changes, changes in copy number, insertions, deletions, or other structural variants.
  • 'variant' refers to genetic sequence modifications that influence protein function, either by influencing the abundance of the protein, or by causing a change in protein coding genetic sequences.
  • the term 'variant' refers in particular to protein altering variants. These are variations in a gene sequence that alter the protein that results from the gene once transcribed and translated (when compared to a non-variant / wild-type gene), for example, by changing the amino acid incorporated at a position or by causing the premature termination of translation (truncating variants).
  • protein altering variants may result in a frameshift (where a mutation caused by the addition or deletion of a base pair or base pairs in the DNA of a gene results in the translation of the genetic code in an unnatural reading frame from the position of the mutation to the end of the gene / next stop codon), a nonsense codon (a mutation replacing a codon corresponding to an amino acid by a stop codon), a splicing error (creation of a new splice donor / acceptor or loss of a site), missense (point mutation resulting in change of the amino acid incorporated) and in-frame insertions / deletions.
  • the known or predicted effect of a variant on a gene product refers to the consequences of the variant being present on the subsequent steps of expression and function of the gene in which the variant is located. These may include predicted effects of the variant on the expression of the gene (e.g. transcription and/or translation rate), but in the context of protein altering variants, these relate to the way in which the resulting protein differs from the reference one.
  • this includes whether the variant will cause a frameshift (and hence a completely different sequence), a nonsense codon resulting in truncation of the protein, a splicing error, resulting in a different combination of exons being included in the protein, a missense mutation resulting in a different amino acid being incorporated, including the identity of the new amino acid, or an insertion-deletion resulting in a change of the total length of the protein.
  • the known or predicted effect of a variant on a gene product may also refer to the functional effect of such protein sequence modifications, whether known or predicted, such as: does the variant cause a loss or degradation of the function of the protein, a structural change, etc.
  • the known or predicted effect of a variant on a gene product refers to changes in the identity and/or function of the protein, not to whether or not these changes, in turn, may contribute to the development of a disease.
  • 'allele' takes its common meaning in the art and refers to one of a number of alternative forms of the same gene or genetic locus.
  • 'Inherited Cardiovascular Conditions' refers to a diverse set of diseases of the heart and blood vessels with a strong genetic predisposition, and in which genetic testing may be applicable. These include cardiomyopathies (heart muscle diseases), arrhythmia syndromes or "channelopathies" (leading to abnormalities of heart rhythm), dyslipidaemias (abnormalities of blood lipids including cholesterol), aortopathies (abnormalities of the aorta), and a number of congenital structural abnormalities.
  • The phenotypic features of these diseases are known in the art, and these terms are used throughout this disclosure with the meaning that is common in the art.
  • prevalence, penetrance and heterogeneity are used herein with the meaning common in the art.
  • penetrance refers to the proportion of individuals carrying a particular variant of a gene (allele or genotype) who also express an associated trait (phenotype).
  • the penetrance of a disease causing mutation is the proportion of individuals with the mutation who also exhibit clinical manifestations.
  • a variant penetrance of 0.5 may advantageously be used. This corresponds to the minimum variant penetrance found when researching HCM and other variants / disorders.
  • Heterogeneity refers to a phenomenon whereby a disorder may be caused by any one of a number of disease-causing variants.
  • the maximum allelic or genetic contribution refers to the maximum proportion of cases potentially attributable to a single allele or gene (depending on whether allelic or genetic heterogeneity is investigated). This maximum allelic / genetic contribution is inversely proportional to the allelic / genetic heterogeneity. Where a large cohort exists for a disorder, the upper bound of the confidence interval of the frequency of the most common variant in this cohort may be used.
  • the prevalence of a condition is the proportion of a population found to have a condition. Estimates of disease prevalence may be obtained from the literature. Where multiple different values are reported, the highest value may be used in the calculation, which leads to conservative filtering.
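The quantities defined above (prevalence, maximum allelic contribution and penetrance) are the inputs to the frequency-based tests. Eq. 1 and Eq. 3 are cited in the claims but are not reproduced in this part of the document, so the Python sketch below assumes a simple dominant-inheritance form: prevalence × maximum allelic contribution / (2 × penetrance). The function names, the factor of 2 (two alleles per individual) and the example prevalence and contribution values are illustrative assumptions; only the penetrance of 0.5 comes from the text.

```python
def max_tolerated_allele_frequency(prevalence: float,
                                   max_allelic_contribution: float,
                                   penetrance: float) -> float:
    """Assumed dominant-inheritance form; not a verbatim copy of Eq. 1."""
    for name, value in (("prevalence", prevalence),
                        ("max_allelic_contribution", max_allelic_contribution),
                        ("penetrance", penetrance)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    # Each individual carries two alleles, hence the factor of 2 (assumption).
    return prevalence * max_allelic_contribution / (2.0 * penetrance)


def exceeds_threshold(observed_allele_frequency: float, threshold: float) -> bool:
    """True if a variant is too common in the reference data to cause the disease."""
    return observed_allele_frequency > threshold


# Hypothetical inputs: prevalence 1 in 500, a single allele explaining at most
# 2% of cases, and the penetrance of 0.5 suggested in the text.
threshold = max_tolerated_allele_frequency(1 / 500, 0.02, 0.5)
print(f"maximum tolerated allele frequency: {threshold:.1e}")
```

Using the highest reported prevalence, as the text suggests, raises the threshold and therefore filters out fewer variants, which is the conservative choice.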
  • the ACMG Standard and Guidelines refers to a set of guidelines published by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, for assessing the pathogenicity of a genetic variant. These are published in Richards, S. et al. (2015), Genetics in Medicine 17, 405-423.
  • a protein-related data source is a source of information regarding the sequence, function or properties of proteins.
  • protein-related data sources may contain information regarding the predicted effect of a protein-coding genetic variant on the resulting protein (e.g. frameshift, truncation due to stop codon, a splicing error, missense and in-frame insertions / deletions).
  • Protein-related data sources may instead or in addition contain information regarding functional domains, predicted effect of mutations on the function of a protein, etc.
  • protein-related data sources may be in the form of a static data repository, or may be in the form of algorithms that can e.g. predict the effect of a variant on the gene product.
  • a disease-variation association data source refers to a data source, in the form of e.g. a database that collects information about genetic variants and their association with diseases. This may be in the form of e.g. annotations of a variant for reported pathogenicity, functional data, etc.
  • Examples of such data sources include the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) and the HGMD® database (http://www.hgmd.cf.ac.uk/ac/index.php).
  • the present invention is directed to methods, systems and apparatus for analysing a patient's genetic variation profile and determining whether any variants identified are likely disease causing variants.
  • the present invention determines whether any variants identified in a patient are likely to cause an inherited cardiovascular condition in the patient.
  • Figure 1 shows schematically relevant parts of a representative genetics diagnostic system suitable for implementing an embodiment of the disclosure.
  • a user (not shown) is provided with an electronic device 2 - this may be for example a personal computer or a mobile device 2 (such as a mobile phone, tablet, laptop, or other mobile computing device). These devices typically have processors and memories for storing information including firmware and applications run by the respective processors.
  • the user device 2 may comprise an antenna and associated hardware and software to allow communications with a data analysis server 4 via the internet 6 via a local WiFi router 10, a 3G/4G telecommunications network 8, any combination of the above or any wireless communications protocol, or may connect to the internet using a wired connection.
  • the data analysis server 4 (represented here as a single server, although it may of course comprise any appropriate computer system or set of computer systems) is shown as interacting with both the user device 2 and a genetic data source 12.
  • the genetic data source 12 contains genetic data from individuals in a reference or control group.
  • the data source 12 also contains data from individuals in a case cohort for a particular disease.
  • the genetic data source 12 is used to represent a repository of genetic data but may in fact be implemented as a collection of databases, such as e.g. a database containing reference genetic information, and one or more databases containing genetic information of disease cohorts.
  • the data analysis server 4 also interacts with one or more other data sources 14, such as a disease-variation association data source 14a, a genomic database 14b, a protein-related data source 14c, etc.
  • the protein-related data source 14c contains information about the known or predicted effect of a variant on a protein sequence, structure and/or function.
  • the protein-related data source 14c may comprise data on or tools to determine the class of a variant in terms of its consequence on the protein sequence (as detailed above in reference to protein-altering variants).
  • the disease-variation association data source 14a contains information about variants and genes and their relationship with diseases, such as e.g.: reported association with disease; associated publications; associated in vitro and/or in vivo functional evaluation; which diseases are linked to a gene; what classes of variant are relevant for each disease; whether there are specific sub-regions that are 'hotspots'; what the possible inheritance patterns associated with each class of variant and disease are, and so on.
  • the different sources of data 12, 14 may in fact be organised as a single database, or multiple separate databases or tools that provide the required data on query.
  • Figures 2a and 2b illustrate schematically relevant functions of a user device and a data analysis server that are suitable for implementing embodiments of the disclosure.
  • Figure 2a shows a user device 2 such as a mobile phone, though it should be noted that any other portable computing apparatus such as a laptop, notebook or tablet computer, or even a fixed apparatus such as a desktop computer, can be used as computing apparatus in embodiments of the disclosure.
  • the mobile device comprises a processor 202 and a memory 204, such that the memory stores and the processor will subsequently run applications 206.
  • the user device has a user interface comprising a display 208 and an input device 210 such as a keyboard, a mouse, touchpad, touchscreen or any combination of these and associated drivers to allow a user to enter data into and view information from the applications 206.
  • where the user device 2 is a mobile phone, it also has a cellular telecommunications capability, including a wireless communication element 212 providing the ability to connect to a cellular communications network.
  • the user device 2 may, instead or in addition to the wireless communication element 212, include a local networking element 214, in order to establish a short range wireless network.
  • the computing device may be a tablet computer without cellular telecommunications capability but capable of making a local wireless network connection, and so a connection to the data analysis server through the public internet.
  • the device may be a fixed apparatus such as a desktop computer, establishing a wired or wireless connection to the data analysis server 4 via the internet.
  • Figure 2b describes elements of the data analysis server 4. This is shown as comprising a server 220 with processor 222 and memory 224, with associated communications functionality 226.
  • the communications functionality may include networking capability allowing communication with the user device 2.
  • the processor 222 is a representation of processing capability and may in practice be provided by several processors.
  • the server provides at least a data analysis application 228 stored in the memory 224 and run on the processor 222, and a search engine 230 interacting with the one or more databases 12, 14.
  • the memory 224 also stores the genetic data 12, and/or one or more other data 14.
  • the data analysis server 4 receives information from the user device 2, and interacts with the data analysis application 228 and the search engine 230 to obtain the required data (as will be further described below) from the databases 12, 14.
  • the data analysis application 228 collects and analyses the data and serves it to the user device 2 for display by an application 206.
  • the application 206 is a browser, and the data is submitted by, and provided to, a user via a webpage.
  • the methods of the invention may be run locally on the user device 2.
  • the functionalities of the data analysis server 4 are run directly on the user device as part of an application 206.
  • the data from databases 12, 14 may also be locally stored on the device, or may be accessed via e.g. the internet.
  • Figure 3 describes a method according to an aspect of the disclosure.
  • a user enters genetic information in the form of variant data about a patient, and any other relevant information available (see below).
  • Variant data may advantageously be in the form of a list of variants found in a sample (i.e. a list of loci where the sample sequence was found to depart from a reference genome, and the nature of the departure, for example a VCF file). Any number of variants may be included in such a list, and means to obtain such a list from a DNA sample or resulting sequencing data are known in the art.
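As an illustration of the variant list described above, the following Python sketch reads the first five tab-separated columns of VCF data lines (CHROM, POS, ID, REF, ALT) and emits one record per alternate allele. The `Variant` type, the function name and the example coordinates are hypothetical, and a real deployment would more likely use an established VCF parsing library.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position on the reference genome
    ref: str   # reference allele
    alt: str   # alternate allele observed in the sample


def parse_vcf_lines(lines: Iterable[str]) -> List[Variant]:
    """Turn VCF data lines into a flat list of variants (one per ALT allele)."""
    variants = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"malformed VCF line: {line!r}")
        chrom, pos, _variant_id, ref, alts = fields[:5]
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele, or an allele removed by a deletion
            variants.append(Variant(chrom, int(pos), ref, alt))
    return variants


# Made-up example lines; the coordinates do not refer to real variants.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "2\t2000\t.\tC\tT,CA\t60\tPASS\t.",
]
for variant in parse_vcf_lines(example):
    print(variant)
```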
  • the server collects the data that is relevant to any variants present in the user data.
  • the server computes 330a a series of evidence rules (as described below) and combines 330b this evidence to generate a variant score.
  • the variant score is to be understood here in the broadest sense as any combined measure of how likely a variant is to be pathogenic.
  • the score is computed as the assignment of a variant to one of a discrete set of categories based on the combined result of the evaluation of the rules (see section 'Combining evidence for pathogenicity' below).
  • the server sends that combined result to the user in the form of a report highlighting activated evidence rules. The report is displayed by the user device at step 350.
  • the report may be queried, for example by triggering the display of underlying evidence for a rule, or any evidence rule may be modified by the user. This may result in the modified evidence being used to re-start the process from step 330.
  • a user may then decide 360, based on the output of the method, whether any of the variants identified in a patient are disease causing for a specific disease.
  • the server will collect, analyse and report data separately for each variant in the data set for which evidence is available.
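The workflow of Figure 3 (collect evidence per variant, evaluate the evidence rules, report the activated rules, optionally re-run with user-modified evidence) can be sketched as follows. This is a minimal Python sketch under stated assumptions: the rule names, the evidence keys and the variant identifiers are hypothetical, and the combination of activated rules into a final category is deliberately left out.

```python
from typing import Callable, Dict, List, Optional

Evidence = Dict[str, object]
Rule = Callable[[Evidence], bool]


def evaluate_rules(evidence: Evidence, rules: Dict[str, Rule]) -> List[str]:
    """Return the names of the rules activated by this variant's evidence."""
    return [name for name, rule in rules.items() if rule(evidence)]


def build_report(evidence_by_variant: Dict[str, Optional[Evidence]],
                 rules: Dict[str, Rule]) -> Dict[str, List[str]]:
    """Evaluate each variant separately; variants without evidence are skipped."""
    report = {}
    for variant_id, evidence in evidence_by_variant.items():
        if evidence is None:
            continue  # no data available for this genetic location
        report[variant_id] = evaluate_rules(evidence, rules)
    return report


# Hypothetical rules; each one is a yes/no test on the collected evidence.
RULES: Dict[str, Rule] = {
    "rare_in_controls": lambda ev: ev["control_af"] <= ev["max_tolerated_af"],
    "truncating": lambda ev: bool(ev["is_truncating"]),
}

evidence = {
    "variant_1": {"control_af": 1e-6, "max_tolerated_af": 4e-5, "is_truncating": True},
    "variant_2": {"control_af": 1e-2, "max_tolerated_af": 4e-5, "is_truncating": False},
    "variant_3": None,
}
print(build_report(evidence, RULES))

# A user edit to the evidence re-runs the evaluation from the rule step.
evidence["variant_2"]["control_af"] = 1e-7
print(build_report(evidence, RULES))
```

Keeping each rule as an independent yes/no function makes it straightforward to add or remove rules as knowledge about a disease grows.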
  • the methods of the invention rely on computing the results of multiple evidence rules (i.e. tests based on evidence related to the variant, for which a yes/no answer provides evidence of the variant being benign or pathogenic), each of which analyses a piece of data that is relevant to the pathogenicity (or lack thereof) of a variant.
  • the rules are also referred to herein as 'tests'.
  • ICCs: inherited cardiac conditions
  • additional rules may be added or removed as appropriate, for example because of growing knowledge about a disease.
  • variants are only analysed if they are protein altering variants, as many of the rules mentioned below relate to the function of the resulting protein.
  • variants that may alter protein function are analysed, including e.g. synonymous variants.
  • variants may only be analysed if data is available for this genetic location; for example, if sufficient data is available to evaluate the result of at least some of the evidence tests detailed below, in relation to the genetic location of the variant. Additionally, some of the evidence required to assess pathogenicity (see e.g. the discussion on the rarity of variants below), depends on the disease that the variant is analysed for. Much of this document is centred on cardiomyopathies, as the inventors have found the set of rules described in the embodiments below to be particularly useful in diagnosing pathogenic variants in such diseases. However, the person skilled in the art would understand that the principles of this invention may be applicable to a variety of other inheritable diseases, provided that adequate data is available.
  • rules are divided into different categories, following a scheme set out in the ACMG guidelines. In particular, rules are referred to as 'Pathogenic Very Strong' (PVS, indicating that activation of such a rule represents very strong evidence of pathogenicity), 'Pathogenic Strong' (PS, strong evidence of pathogenicity), 'Pathogenic Moderate' (PM, moderate evidence of pathogenicity), 'Pathogenic Supporting' (PP, supporting evidence of pathogenicity), 'Benign Stand Alone' (BSA, indicating that a variant is very likely to be benign), 'Benign Strong' (BS, strong evidence of the variant being benign), and 'Benign Supporting' (BP, supporting evidence of the variant being benign).
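The seven rule categories listed above lend themselves to a small lookup structure. The Python sketch below is one possible encoding; the class name, the `category_of` helper and the convention that a rule code is a category prefix followed by a number (e.g. "PVS1") are assumptions made for illustration.

```python
from enum import Enum


class EvidenceCategory(Enum):
    # value = (direction of the evidence, strength of the evidence)
    PVS = ("pathogenic", "very strong")
    PS = ("pathogenic", "strong")
    PM = ("pathogenic", "moderate")
    PP = ("pathogenic", "supporting")
    BSA = ("benign", "stand alone")
    BS = ("benign", "strong")
    BP = ("benign", "supporting")

    @property
    def direction(self) -> str:
        return self.value[0]

    @property
    def strength(self) -> str:
        return self.value[1]


def category_of(rule_code: str) -> EvidenceCategory:
    """Map a rule code such as 'PVS1' or 'BP4' to its category."""
    prefix = rule_code.rstrip("0123456789")
    try:
        return EvidenceCategory[prefix]
    except KeyError:
        raise ValueError(f"unknown rule code: {rule_code!r}") from None


for code in ("PVS1", "PS1", "BP4"):
    category = category_of(code)
    print(code, category.direction, category.strength)
```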
  • Very strong evidence for pathogenicity (PVS rules)
  • the first PVS rule determines whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease. In embodiments, truncating variants are assumed to cause loss of function. In embodiments, the first PVS rule is activated when the variant is a truncating variant in a gene that has been implicated in a disease with loss of function as a reported mechanism.
  • a variant causes activation of the rule if any one of three criteria is fulfilled: (i) the variant is in a gene with a significant burden in disease cases compared to control populations based on one or more genetic information data sources; (ii) the variant is in a gene where truncating mutations are reported in excess in case cohorts compared to reference data; (iii) the variant is in a gene associated with phenocopy and where truncating mutations are reported in excess in case cohorts compared to reference data.
  • a truncating variant in any of the following genes may activate this rule: LMNA, DSP, VCL, MYBPC3, TNNT2, PLN, DSP, DSG2, PKP2 and DSC2 (based on criterion (i) and analysis of 7,855 cardiomyopathy cases and 60,706 controls from http://biorxiv.org/content/early/2017/02/24/041111), KCNQ1, KCNH2, SCN5A, FHL1, BAG3, TAZ, FBN1, TGFB2, LDLR (based on criterion (ii) and comparison of data between the HGMD® database (http://www.hgmd.cf.ac.uk/ac/index.php) and the ExAC data), and GLA, LAMP2 (based on criterion (iii)).
  • null variants in a gene where loss of function is a known mechanism of disease do not activate this rule if there is a strong regional effect in the gene (i.e. the pathogenicity is highly dependent on the location of the variant). This may be the case, for example, where variants are located in regions that are frequently spliced out. This exception was not identified in the ACMG guidelines, but the inventors have found that it allowed for a more reliable output of the rule. In particular, in embodiments relating to cardiomyopathy, the gene TTN is excluded on this basis.
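The first PVS rule, including the regional-effect exception for TTN, can be sketched as a simple gene-list check. In the Python sketch below the gene symbols are taken from the text, while the constant names, the function name and the choice to pass truncation status in as a boolean (rather than deriving it from a consequence annotation) are assumptions.

```python
# Genes in which a truncating variant may activate the rule (from the text).
LOSS_OF_FUNCTION_GENES = frozenset({
    # criterion (i): significant burden in cases compared to controls
    "LMNA", "DSP", "VCL", "MYBPC3", "TNNT2", "PLN", "DSG2", "PKP2", "DSC2",
    # criterion (ii): truncating variants in excess in case cohorts
    "KCNQ1", "KCNH2", "SCN5A", "FHL1", "BAG3", "TAZ", "FBN1", "TGFB2", "LDLR",
    # criterion (iii): phenocopy genes
    "GLA", "LAMP2",
})

# Genes excluded because of a strong regional effect (from the text).
REGIONAL_EFFECT_GENES = frozenset({"TTN"})


def pvs1_activated(gene: str, is_truncating: bool) -> bool:
    """Sketch of the first PVS rule for cardiomyopathy-related genes."""
    if not is_truncating:
        return False  # only truncating variants are assumed to cause loss of function
    if gene in REGIONAL_EFFECT_GENES:
        return False  # pathogenicity depends too strongly on variant location
    return gene in LOSS_OF_FUNCTION_GENES


for gene in ("MYBPC3", "TTN", "GLA"):
    print(gene, pvs1_activated(gene, is_truncating=True))
```

Keeping the exclusion list separate from the inclusion list means a gene such as TTN stays excluded even if it is later added to the loss-of-function list.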
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information on the frequency of the variant in a control population (as compared to a case population).
Strong evidence for pathogenicity (PS rules)
  • the first PS rule is activated if the variant results in the same amino acid change as a previously established pathogenic variant.
  • any variants from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) that have multiple submitters, for the phenotype of interest, with no conflicting evidence and classed as 'pathogenic' are used.
  • while the ACMG guidelines indicate that variants resulting in the same amino acid change and "previously established as pathogenic" should be considered as strong evidence for pathogenicity, there is no indication in the guidelines as to what level of evidence constitutes "established pathogenic variants".
  • a disease-variation association data source containing annotations from individual laboratories, publications etc.
  • a filter on the other variant being classified as "pathogenic"
  • a filter on the number of lines of individual evidence (e.g. submitters) for the specific phenotype / disease of interest, and a filter on the presence of conflicting annotations.
  • users can save the final classification of a variant being analysed using the methods and systems of the invention, and this data may be queried to evaluate this rule (for example, it may be considered as one of the lines of evidence as mentioned above).
  • a user can choose to include previous user data in the evaluation of this rule.
  • the method comprises displaying previous user data so that a user can decide to activate this rule or not.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
  • a further PS rule may be activated if the variant is observed de novo in a patient that has the disease and the paternity and maternity of the patient are confirmed.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • a further PS rule may be activated if there are well established in vitro or in vivo functional studies supportive of a damaging effect of the variant on the gene or gene product.
  • the rule is activated if the variant has been shown to recapitulate a disease phenotype or endophenotype in a model system that has been shown to be predictive of human disease. While the ACMG guidelines suggest the use of functional studies supportive of a damaging effect of the variant, they provide no indication as to how this should be assessed. The present inventors have found that the above criterion provided a reliable and transferable way of assessing this rule.
  • the rule is activated when such evidence is available in a database, such as a database collated from previous user reports, or a source of curated data on the effect of mutations on protein function.
  • the rule is only activated if the user provides information to activate it, at step 310 or 350 above. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
  • a fourth PS rule may be activated if the prevalence of the variant in affected individuals is significantly increased compared with controls. According to ACMG guidelines, this rule should be applied based on an odds ratio above 5 and a confidence interval not including 1. However, the present inventors have found that this threshold was not generally applicable in cases where the case and control cohorts were imbalanced. Accordingly, in embodiments, a case database is compared with a reference database in the following way: the frequency of each rare variant in the two data sets is compared using a Fisher's exact test to assay for association of each variant with disease. The results across variants are adjusted for multiple testing as known in the art, for example, using a Bonferroni correction. An appropriate threshold for statistical significance of the corrected test result may then be used.
  • the strength of the disease association data is taken into account in calculating the weight associated with activation of this rule (see 'Combining evidence' below).
  • the weight associated with activation of this rule may be proportional to the odds ratio (odds of developing the condition if an individual has the variant versus odds of developing the condition if an individual does not have the variant).
  • a minimum threshold on the number of individuals with a variant in the case cohort data may also be applied in order to avoid including variants where there is not enough data available.
  • this data is precomputed, based on chosen data sources for a given disease, for each rare variant found in these databases.
  • this test may be computed dynamically based on the disease being analysed, and e.g. a choice of case cohort and reference database given to the user.
  • a threshold for what is considered a rare variant may be set at e.g. a frequency in control data of below 0.0001.
  • This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.
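The case-versus-reference comparison described for the fourth PS rule can be sketched as follows. This is a minimal illustration only: the function names, the two-sided form of the test, the significance level `alpha`, the `min_case_carriers` cut-off and the choice to apply that cut-off before the multiple-testing correction are all assumptions, not values taken from the description.

```python
from math import lgamma, exp


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Two-sided Fisher's exact p-value for a carrier / non-carrier 2x2 table."""
    carriers = case_carriers + ctrl_carriers
    total = case_total + ctrl_total

    def log_p(k):
        # Hypergeometric log-probability of seeing k of the carriers among cases.
        return (_log_comb(case_total, k)
                + _log_comb(ctrl_total, carriers - k)
                - _log_comb(total, carriers))

    lowest = max(0, carriers - ctrl_total)
    highest = min(carriers, case_total)
    observed = log_p(case_carriers)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(exp(log_p(k)) for k in range(lowest, highest + 1)
            if log_p(k) <= observed + 1e-7)
    return min(1.0, p)


def associated_variants(tables, alpha=0.05, min_case_carriers=2):
    """Bonferroni-corrected association call per variant.

    tables maps a variant id to
    (case_carriers, case_total, ctrl_carriers, ctrl_total).
    Variants with too few case carriers are not tested (assumed cut-off).
    """
    tested = {v: t for v, t in tables.items() if t[0] >= min_case_carriers}
    n_tests = len(tested)
    calls = {}
    for variant, table in tested.items():
        p_adjusted = min(1.0, fisher_exact_two_sided(*table) * n_tests)
        calls[variant] = p_adjusted < alpha
    return calls
```

Working with log-probabilities keeps the computation stable for cohorts of tens of thousands of individuals, where the binomial coefficients themselves would be astronomically large.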
  • a fifth PS rule may be activated if the variant is a truncating variant in a gene where truncating variants are known to cause disease, but the gene shows a strong regional effect such that not all truncating variants are equally deleterious. This may be the case e.g. where the variant only truncates some isoforms (e.g. Titin).
  • the rule is only activated if the truncating variant is in an exon that is constitutively expressed in the specific transcripts relevant to the disease. For example, in the case of inherited cardiac conditions, the rule may only be activated if it is in an exon constitutively expressed in the isoforms relevant to the heart.
  • this rule was not present in the ACMG guidelines but the present inventors have found that introducing this rule and excluding such variants from activating a pathogenic very strong rule led to more reliable results.
  • this rule is activated for nonsense, frameshift and essential splice site variants within exons with proportion spliced in (PSI) > 0.9.
  • this rule is restricted to a predetermined set of genes.
  • the predetermined set of genes may comprise TTN.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information on the frequency of the variant in a control population (as compared to a case population).
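The interplay between the very strong (PVS) rule and the regional-effect PS rule for truncating variants can be sketched as below. The gene sets are copied from the cardiomyopathy embodiment described above; the consequence labels, the function name and the returned strings are hypothetical, and the sketch assumes that a per-exon PSI value is supplied by the caller.

```python
# Genes where a truncating variant may activate the very strong rule
# (taken from the embodiment above).
LOSS_OF_FUNCTION_GENES = {
    "LMNA", "DSP", "VCL", "MYBPC3", "TNNT2", "PLN", "DSG2", "PKP2", "DSC2",
    "KCNQ1", "KCNH2", "SCN5A", "FHL1", "BAG3", "TAZ", "FBN1", "TGFB2", "LDLR",
    "GLA", "LAMP2",
}
# Genes with a strong regional effect, handled by the PS rule instead.
REGIONAL_EFFECT_GENES = {"TTN"}
# Hypothetical consequence labels for truncating variants.
TRUNCATING = {"nonsense", "frameshift", "essential_splice_site"}
PSI_THRESHOLD = 0.9


def truncating_variant_rule(gene, consequence, exon_psi=None):
    """Return which rule (if any) a truncating variant activates."""
    if consequence not in TRUNCATING:
        return None
    if gene in REGIONAL_EFFECT_GENES:
        # Only exons that are (almost) always spliced in count.
        if exon_psi is not None and exon_psi > PSI_THRESHOLD:
            return "PS (regional effect)"
        return None
    if gene in LOSS_OF_FUNCTION_GENES:
        return "PVS"
    return None
```

Checking the regional-effect set first guarantees that a gene such as TTN can never reach the very strong rule, which mirrors the exclusion described above.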
  • a first PM rule may be activated when the variant is located in a mutational hot spot, and/or in a critical and well established functional domain.
  • mutational hot spots are defined regions of genes that are either enriched in variation in cases, or depleted of variation in controls such that the odds ratio associated with variants in that region is higher than for other parts of the gene. They may be defined using curated literature evidence, or by comparing variant frequencies from case data to reference data over a set of defined regions, as known in the art. In embodiments, the rule is evaluated by calculating the prior probability that a variant in certain regions is pathogenic before considering other evidence, based on the frequencies of variants in disease and control populations.
  • Protein functional domains may be extracted e.g. from protein databases.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product (when the rule is activated based on the location in a critical functional domain), or the frequency of variants in a gene in a control and disease population (when the rule is activated based on the location in a mutational hotspot).
  • a second PM rule may be activated if the variant is present at extremely low frequency in a control population.
  • this rule is activated if the allele frequency in a reference population is below the maximum acceptable frequency calculated as described in the 'Rare variants' section below.
  • the user is able to directly access the reference data from the report and check the coverage at the variant location. The user may then be able to overrule an activation of this rule if the coverage is not sufficient. In some embodiments, the rule is automatically deactivated if the coverage in the reference data used is insufficient.
  • This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.
  • a further PM rule may be activated if the disorder analysed is recessive and the variant is detected in trans with a pathogenic variant.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • This rule is an example of a rule that is evaluated based on information from a disease-variant association data source.
  • a fourth PM rule may be activated if the variant results in an in-frame deletion or insertion in a non-repeat region or a stop-loss variant, resulting in a protein length change.
  • this rule is activated based on a prediction of the effect of the variant on the protein. Methods of prediction such as that implemented in the Ensembl Variant Effect Predictor (VEP) are suitable for the purpose of the invention, and known in the art.
  • VEP Ensembl Variant Effect Predictor
  • variants that are in-frame insertions / deletions only activate the rule if they are not within a repeat region, where repeat regions are available from e.g.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
  • a fifth PM rule may be activated if the variant results in a novel missense change at an amino acid residue where a different missense change has previously been determined to be pathogenic.
  • data from a suitable database is used, in addition with filters (based on reliability) on the evidence provided. For example, any variants from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) that have multiple submitters, for the phenotype of interest, with no conflicting evidence and classed as 'pathogenic' may be used.
  • data from previous uses of the method are stored and queried for any 'pathogenic' variants at the same residue that result in a different missense change.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
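The reliability filters described for the first PS rule (same amino acid change) and the fifth PM rule (different missense change at the same residue) can be sketched as follows. The record layout, field names and `min_submitters` default are hypothetical and do not reflect the actual ClinVar schema; the query variant is assumed to be a missense change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One curated disease-variant assertion (hypothetical layout)."""
    gene: str
    residue: int
    alt_aa: str
    phenotype: str
    classification: str
    submitters: int
    conflicting: bool


def established_pathogenic(records, phenotype, min_submitters=2):
    """Keep assertions that pass the reliability filters described above."""
    return [r for r in records
            if r.classification == "pathogenic"
            and r.phenotype == phenotype
            and r.submitters >= min_submitters
            and not r.conflicting]


def amino_acid_rule(gene, residue, alt_aa, records, phenotype):
    """Return the rule activated by a missense change, if any."""
    hits = [r for r in established_pathogenic(records, phenotype)
            if r.gene == gene and r.residue == residue]
    if any(r.alt_aa == alt_aa for r in hits):
        return "first PS rule"   # same amino acid change already established
    if hits:
        return "fifth PM rule"   # different change at the same residue
    return None
```

Stored classifications from previous uses of the method could be passed in through the same `records` argument, since they are treated as further lines of evidence.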
  • a further PM rule may be activated if the variant is observed de novo in a patient that has the disease, but the paternity and maternity of the patient are not confirmed.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • the user provides information to indicate whether the variant is de novo or inherited. In other embodiments, this may be obtained from the variant input file. Having been identified as a de novo variant once (either through use of the tool, or in literature or other data source available to the inventors), such information may be stored and used to activate this rule in further uses.
  • a seventh PM rule may be activated if an equivalent amino acid change in a paralogous gene is pathogenic.
  • This rule is not used in the ACMG guidelines, but the inventors have found that it strengthened the diagnostic provided by the tool of the invention.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
Supporting evidence for pathogenicity (PP rules)
  • a first PP rule may be activated if the variant co-segregates with the disease in multiple affected family members, and lies in a gene that is known to cause the disease.
  • Information to assess that rule may be obtained from disease-variant association data sources such as ClinVar. Alternatively, this information can be provided directly by the user at steps 310 or 350.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • the strength of the segregation data is taken into account in calculating the weight associated with activation of this rule (see 'Combining evidence' below).
  • the weight associated with activation of this rule may be proportional to the strength of segregation as quantified by the LOD score (log odds).
  • the LOD score may be estimated as 0.3 x the number of informative meioses, or more formally calculated, as known in the art.
  • LOD thresholds for supporting, moderate and strong evidence may be predefined or specified by the user. For example the following thresholds may be used: i) strong when random chance ≤1% (≥7 meioses/segregations); ii) moderate when random chance ≤5% (≥5 meioses/segregations); and iii) supporting when random chance ≤25% (≥3 meioses/segregations).
  • thresholds of 3, 6 and 10 meioses/segregations, respectively for supporting, moderate and strong evidence may be used.
  • This rule is an example of a rule that is evaluated based on information from a disease-variant association data source.
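The segregation arithmetic above can be sketched as follows. The factor 0.3 per informative meiosis is approximately log10(2), so the probability of co-segregation arising by chance is taken here as (1/2)^n for n informative meioses; that reading, and the function names, are assumptions made for illustration.

```python
def estimated_lod(informative_meioses):
    """Simplified LOD estimate: 0.3 per informative meiosis (about log10 of 2)."""
    return 0.3 * informative_meioses


def chance_of_cosegregation(informative_meioses):
    """Probability that the observed co-segregation arose by chance alone."""
    return 0.5 ** informative_meioses


def segregation_strength(informative_meioses,
                         thresholds=(("strong", 7), ("moderate", 5), ("supporting", 3))):
    """Map a meiosis count to an evidence level, strongest threshold first."""
    for label, minimum in thresholds:
        if informative_meioses >= minimum:
            return label
    return None
```

The alternative thresholds of 3, 6 and 10 meioses can be used by passing `thresholds=(("strong", 10), ("moderate", 6), ("supporting", 3))`.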
  • a further PP rule may be activated if the variant is a missense variant in a gene with a low rate of benign missense variation and in which missense variants are common mechanisms of disease.
  • this rule is activated based on a comparison between the frequencies of missense variants in a gene of interest in a case cohort and control population. This rule may be activated if the variant is in a gene with etiological fraction (i.e.
  • the rule can be evaluated in an unambiguous and widely applicable way, thereby providing a consistent and reliable method of assessing evidence for pathogenicity.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product. In embodiments, this rule is not activated if the first PM rule mentioned above (activated when the variant is located in a mutational hot spot, and/or in a critical and well established functional domain) is already activated.
  • a third PP rule may be activated if multiple lines of computational evidence support a deleterious effect on the gene or gene product.
  • the ACMG guidelines recommend that the rule only be activated if all tools provide a consistent result.
  • the present inventors have found that a more reliable prediction could be obtained by using multiple (4 or more) independent computational tools and combining their results in a slightly less stringent way.
  • at least 5 tools, preferably at least 7 tools are used and the rule is activated if: (i) only 1 tool predicts that the variant is benign and less than 3 have unknown classifications, or (ii) 3 or more tools have unknown outcomes and all other tools predict that the variant is damaging.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
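The relaxed consensus over computational predictors can be sketched as below. Two readings are assumptions: "only 1 tool" is taken to mean "at most one" dissenting tool, and criterion (ii) is taken to require at least one tool that actually supports the call. The labels `"damaging"`, `"benign"` and `"unknown"` are hypothetical; the `direction` argument allows the same counting to be mirrored for the benign-direction rule described elsewhere in the text.

```python
def computational_evidence_rule(predictions, direction="damaging", min_tools=5):
    """Decide whether in-silico predictions activate the rule.

    predictions is a list of labels, one per tool:
    "damaging", "benign" or "unknown".
    """
    if len(predictions) < min_tools:
        return False
    opposite = "benign" if direction == "damaging" else "damaging"
    n_opposite = predictions.count(opposite)
    n_unknown = predictions.count("unknown")
    n_support = predictions.count(direction)
    if n_unknown >= 3:
        # Criterion (ii): every tool that gives an answer must agree.
        return n_opposite == 0 and n_support >= 1
    # Criterion (i): at most one dissenting tool (assumed reading).
    return n_opposite <= 1
```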
  • a fourth PP rule may be activated if the patient's phenotype or family history is highly specific for a disease with a single genetic aetiology. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In other embodiments, the rule is activated if information from the disease-variation association data source indicates that the disease implies a specific single genetic aetiology concordant with input by the user.
  • a further PP rule may be activated when a reputable source has reported the variant as pathogenic, but the evidence is not available to the user to perform an independent validation. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • a sixth PP rule may be activated if there is a missense mutation at an equivalent amino acid residue of a paralogous gene and this mutation is pathogenic.
  • This rule is not used in the ACMG guidelines, but the inventors have found that it strengthened the diagnostic provided by the tool of the invention.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
Stand-alone benign evidence (BA rules)
  • a benign stand-alone rule is a rule that is activated when the allele frequency of the variant in a control population is above a threshold.
  • the threshold suggested is 5%.
  • variants present in a control population at a frequency >0.1% for heterozygotes or >3.16% (sqrt(0.001)) for homozygotes activate this rule.
  • the sampling variance in subset populations is taken into account in applying this threshold.
  • totalCount is the number of individuals in the control population covered at that variant position
  • maximumFreq is the threshold described above (e.g. 0.1% and 3.16% respectively for heterozygotes and homozygotes).
  • This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.
  • this rule is used as a pass / fail test, whereby any variant that activates this rule is automatically classified as benign.
  • a first BS rule may be activated if the allele frequency in a control population is higher than expected for the disorder.
  • the maximum credible population frequency for any variant involved in a disease is calculated using embodiments of the method described in the 'Rare variants' section below.
  • while the ACMG guidelines refer to an allele frequency being "too high" for a disorder, they do not provide any indication of how to decide what is "too high".
  • the solution of the present invention, as described below, provides a reliable framework to confidently assess this rule, taking the genetic architecture of the disease under consideration into account.
  • the penetrance is set at 0.5. This is a conservatively low value that the present inventors have found useful when specific information is not available.
  • the maximum allelic contribution is defined as the upper confidence interval of the most common causal variant in the case cohort.
  • the frequency in a mutation database is used instead.
  • the maximum allelic contribution is set to the maximum proportion of cases due to a single variant across diseases of interest (e.g. diseases that are similar or related to a disease of interest) where this was known. In the case of cardiac diseases, this may be set to 0.1.
  • This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.
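How the parameters above might combine into a maximum credible population allele frequency for a dominant disorder is sketched below. The formula itself is an assumption: the governing equation is not reproduced in this section, so the sketch simply takes the frequency to scale with prevalence and maximum allelic contribution, to scale inversely with penetrance, and to be halved to convert a per-individual prevalence into a per-allele frequency. The function name and the example prevalence are hypothetical.

```python
def max_credible_af_dominant(prevalence, max_allelic_contribution=0.1, penetrance=0.5):
    """Assumed form of the maximum credible population allele frequency.

    prevalence               : fraction of individuals affected
    max_allelic_contribution : largest share of cases due to one variant
    penetrance               : probability that a carrier is affected
    The division by 2 (two alleles per individual) is an assumption.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * max_allelic_contribution / penetrance / 2


# Illustrative input only: a disease affecting 1 in 500 individuals.
example = max_credible_af_dominant(1 / 500)
```

Lower penetrance raises the tolerated frequency, which is why setting penetrance to 0.5 in the absence of specific information is the conservative choice.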
  • a further BS rule may be activated if the variant is observed in a healthy individual, with full penetrance expected at an early age. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • a third BS rule may be activated if there are well-established in vitro or in vivo functional studies showing that there is no damaging effect on protein function or splicing. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is activated based on stored results of previous users. In embodiments, this information is obtained from disease-variant association data sources.
  • a fourth BS rule may be activated if there is a lack of segregation in affected members of a family.
  • This information can be provided directly by the user at steps 310 or 350.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • this information is obtained from disease-variant association data sources.
  • a first BP rule may be activated if the variant is a missense variant in a gene for which primarily truncating variants are known to cause disease. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is activated based on data from a variation-disease association database. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
  • a second BP rule may be activated if the variant is or has been observed in trans with a pathogenic variant for a fully penetrant dominant gene / disorder, or observed in cis with a pathogenic variant in any inheritance pattern.
  • This information can be provided directly by the user at steps 310 or 350.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • the rule is activated based on data from a variation-disease association database.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
  • a further BP rule may be activated if the variant is an in-frame deletion or insertion in a repetitive region without known function.
  • Data on repetitive regions may be obtained from e.g. genomic data sources, such as the UCSC table browser (https://genome-euro.ucsc.edu/cgi-bin/hgTables). Such data may for example be cross referenced with gene regions, also available from genomic data sources.
  • any variant that is an in-frame insertion / deletion that overlaps with a repeat region activates this rule.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
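The cross-referencing of in-frame insertions / deletions with repeat regions can be sketched as an interval-overlap check. The coordinate convention (half-open, BED-style intervals), the data layout and the returned labels are assumptions; the sketch also ignores the "without known function" qualifier, which would need a separate annotation.

```python
def overlaps_repeat(chrom, start, end, repeats):
    """True if the half-open interval [start, end) overlaps a repeat on chrom.

    repeats maps a chromosome name to a list of (start, end) intervals.
    """
    return any(start < r_end and r_start < end
               for r_start, r_end in repeats.get(chrom, []))


def inframe_indel_rule(chrom, start, end, repeats):
    """Route an in-frame insertion / deletion to the benign or pathogenic rule."""
    if overlaps_repeat(chrom, start, end, repeats):
        return "BP (in-frame indel in repeat region)"
    return "PM (protein length change outside repeats)"
```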
  • a fourth BP rule may be activated if multiple lines of computational evidence suggest that the variant has no impact on the gene or gene product.
  • the ACMG guidelines recommend that the rule only be activated if all tools provide a consistent result.
  • the present inventors have found that a more reliable prediction could be obtained by using multiple (4 or more) independent computational tools and combining their results in a slightly less stringent way.
  • at least 5 tools, preferably at least 7 tools are used and the rule is activated if: (i) only 1 tool predicts that the variant is damaging and less than 3 have unknown classifications, or (ii) 3 or more tools have unknown outcomes and all other tools predict that the variant is benign.
  • This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
  • a fifth BP rule may be activated if the variant was found in a case with an alternative molecular basis for disease. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is blocked from activation if the user has already activated the rule indicating that the variant is observed in trans with a pathogenic variant for a fully penetrant dominant gene / disorder, or observed in cis with a pathogenic variant in any inheritance pattern.
  • a sixth BP rule may be activated if a reputable source has reported the variant as benign, but the evidence is not available to the user to perform an independent evaluation. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • a seventh BP rule may be activated if the variant is a synonymous (silent) variant for which splicing prediction algorithms predict no impact and the nucleotide is not highly conserved.
  • This information can be provided directly by the user at steps 310 or 350.
  • this rule is displayed in the diagnostic report and a user can insert evidence.
  • the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.
  • data about the known or predicted effect of the variant on the gene product is combined with data from a genomic data source to evaluate this rule.
  • the present invention is directed to identification of a genetic variant as likely pathogenic or discounting it as likely benign.
  • a variant's low frequency in, or absence from, reference databases is a necessary but not sufficient criterion for variant pathogenicity, and a high frequency is strong evidence for a benign role.
  • assessing how rare a variant has to be in order to confidently mark it as likely pathogenic is not trivial. In practice, there exists considerable ambiguity around what allele frequency should be considered 'too common' (in a reference sample), with the conservative values of 1% and 0.1% often invoked as frequency cut-offs for recessive and dominant diseases respectively.
  • the present invention provides methods and systems for confidently determining an allele frequency threshold under which a variant may be considered as a pathogenic candidate for a given disease.
  • the invention provides a method for assessing whether rare variants are sufficiently rare to cause penetrant Mendelian diseases, while accounting for both disease-specific genetic architecture and sampling variance in observed allele counts.
  • the method disclosed relies on the principle that when assessing a variant for a causative role in a dominant Mendelian disease, the frequency of a variant in a reference sample not selected for the condition, should not exceed the prevalence of the condition.
  • this will be influenced by different inheritance modes, genetic and allelic heterogeneity, and reduced penetrance.
  • estimation of true population allele frequency is clouded by considerable sampling variance, even in the largest samples currently available.
  • the invention also provides systems to allow a user to determine a frequency threshold for a particular disease, where the system comprises a reference data source and a processor programmed to perform the method described herein and below.
  • the system may also comprise input means for a user to specify the genetic architecture of the disease of interest.
  • the maximum allele count tolerated in a reference data set can be calculated, taking into account the sampling variation in the reference data set using a Poisson distribution and the size of the reference data set used. Any allele count in the reference data set that is above e.g. the 95th percentile of the Poisson distribution (upper bound of the one tailed 95% confidence interval) for that allele frequency - given the observed allele number where λ is the expected allele count given by Eq. 2:
  • the 95th percentile of the Poisson distribution with the above λ is a maximum sample estimate corresponding to the population frequency mpaf obtained at the chosen confidence level of 95% for the population size (in the number of individuals in the reference data) of sample size.
  • the confidence chosen is, as would be clear to the person skilled in the art, a matter of preference. Confidence above 95% is generally preferred, however thresholds based on the 90th, 95th or 99th percentile may be used, depending on the number of false positive results that a user may be prepared to take into account. This may also depend on the availability of any orthogonal data to make the pathogenicity assessment, and on how conservatively any of the parameters of Eq. 1 have been set, as would be clear to the person skilled in the art. Accordingly, the maximum sample estimate may be based on a 90%, 95% or 99% confidence threshold (or any other appropriate level).
  • the allele number will depend on the size of the population that is considered, and as the tightness of a 95% confidence interval in the Poisson distribution depends upon sample size, the stringency of the filter depends upon the allele number (AN).
  • AN allele number
  • the method therefore comprises computing a maximum tolerated AC for each distinct subpopulation of a reference sample, and filtering based on the highest allele frequency observed in any major population.
  • the sequencing coverage in the reference data at a particular site is taken into account to correct the sample size.
  • the stringency of the filter may therefore vary according to the size of the sub-population in which the variant is observed, and/or the sequencing coverage at that site.
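The Poisson-based maximum tolerated allele count can be sketched as follows. The sketch assumes that the expected allele count is the maximum credible population allele frequency multiplied by the allele number (AN) observed at the site; the equation for the expected count is not reproduced in this section, so that product, along with all function names, is an assumption.

```python
from math import exp


def poisson_quantile(expected, confidence=0.95):
    """Smallest k such that P(X <= k) >= confidence for X ~ Poisson(expected).

    Computed by direct summation; suitable for modest expected counts
    (exp(-expected) underflows for very large values).
    """
    k = 0
    term = exp(-expected)   # P(X = 0)
    cdf = term
    while cdf < confidence:
        k += 1
        term *= expected / k   # P(X = k) from P(X = k - 1)
        cdf += term
    return k


def max_tolerated_allele_count(mpaf, allele_number, confidence=0.95):
    """Largest allele count compatible with mpaf at the given confidence."""
    return poisson_quantile(mpaf * allele_number, confidence)


def max_tolerated_by_population(mpaf, allele_numbers, confidence=0.95):
    """Per-population limits; allele_numbers maps population name to AN."""
    return {pop: max_tolerated_allele_count(mpaf, an, confidence)
            for pop, an in allele_numbers.items()}
```

Because AN enters the expected count directly, a smaller sub-population or a poorly covered site yields a lower tolerated allele count but a higher tolerated frequency, which is the varying stringency described above.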
  • the maximum allelic / genetic contribution i.e. the maximum proportion of the disease attributable to a single gene - rather than a single variant as used above.
  • Example 1 below demonstrates application of this approach to dominant diseases.
  • Example 1a demonstrates the application of this approach to hypertrophic cardiomyopathy, for which case cohort data exists.
  • Example 1b demonstrates the application of this approach to a disease where disease-specific variant databases exist and can be used to estimate maximum allelic / genetic contribution.
  • Example 1c demonstrates the application of this approach to diseases where no mutation database exists.
  • Example 1d demonstrates the application of this approach to a disease where allelic heterogeneity is poorly characterised.
  • mgc maximal genetic contribution
  • mac maximum allelic contribution
  • Example 2 demonstrates the application of this approach to a recessive disease, Primary Ciliary Dyskinesia.
  • Example 3 shows the application of this approach on every variant in the ExAC data using a simulated dominant Mendelian variant discovery analysis, and uses HCM data to demonstrate that the approach results in filtering of candidate variants that are in the clinically actionable range of disease odds ratio. Note that the approach has been described and illustrated below by analysing frequencies at the level of a disease. However, in some cases this approach may be further refined by calculating distinct thresholds for individual genes, or even variants. For example, if there is one common founder mutation but no other variants that are recurrent across cases, then the founder mutation may be seen as an exception to the calculated threshold.
  • the method of the invention classifies a variant into one of a series of diagnostic categories.
  • the classification is based on how many of the evidence rules described above are activated (i.e. the rule produces a positive outcome when assessed), and the nature (in terms of evidence category, as described above) of these activated rules.
  • the categories are: Pathogenic, Likely Pathogenic, Benign, Likely Benign and Uncertain Significance. As the person skilled in the art would understand, further categories or subcategories may be created and the combination of evidence rules leading to a certain classification may be adapted accordingly.
  • evidence rules relating to pathogenicity may each be assigned a weight $x_1, \ldots, x_N$ for each rule $P_1, \ldots, P_N$ relating to pathogenicity
  • evidence rules relating to benignity may be assigned a weight $y_1, \ldots, y_M$ for each rule $B_1, \ldots, B_M$ relating to benignity.
  • a combined score may for example be obtained by summing the evidence for pathogenicity and subtracting the sum of evidence for benignity, i.e. variant score $= \sum x_P - \sum y_B$, where each sum runs over the activated rules.
  • a variant may be classified based on thresholds on the variant score, such as variant score > pp: pathogenic; lp < variant score ≤ pp: likely pathogenic; etc.
  • the thresholds will depend on the values of the individual rules scores, and on the number of categories used.
  • rules may be divided in categories, such as very strong pathogenic / strong pathogenic / pathogenic moderate / supporting pathogenic / strong benign / supporting benign / stand-alone benign as described above, and all rules in a category may be assigned the same weight.
  • weights $x_{PVS}, x_{PS}, x_{PM}, x_{PP}, y_{BA}, y_{BS}, y_{BP}$ may be used, wherein $x_{PVS} > x_{PS} > x_{PM} > x_{PP}$, and $y_{BA} > y_{BS} > y_{BP}$.
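As an illustration of the weighted scoring scheme, the sketch below sums category weights for activated pathogenic rules and subtracts those for activated benign rules. The numeric weights and the thresholds `pp` and `lp` are hypothetical values chosen only to respect the stated ordering (very strong > strong > moderate > supporting; stand-alone > strong > supporting); the text does not specify them. The category codes PVS, PS, PM, PP, BA, BS and BP are used here as shorthand for those categories.

```python
# Hypothetical category weights; only their ordering comes from the text.
PATHOGENIC_WEIGHTS = {"PVS": 8.0, "PS": 4.0, "PM": 2.0, "PP": 1.0}
BENIGN_WEIGHTS = {"BA": 8.0, "BS": 4.0, "BP": 1.0}


def variant_score(activated_rules):
    """Sum of pathogenic weights minus sum of benign weights.

    activated_rules is a list of category codes, one entry per activated rule.
    """
    unknown = [r for r in activated_rules
               if r not in PATHOGENIC_WEIGHTS and r not in BENIGN_WEIGHTS]
    if unknown:
        raise ValueError(f"unknown rule categories: {unknown}")
    pathogenic = sum(PATHOGENIC_WEIGHTS[r] for r in activated_rules
                     if r in PATHOGENIC_WEIGHTS)
    benign = sum(BENIGN_WEIGHTS[r] for r in activated_rules
                 if r in BENIGN_WEIGHTS)
    return pathogenic - benign


def classify_by_score(score, pp, lp):
    """Apply the two thresholds given as examples in the text.

    Returns None when neither threshold is exceeded; further thresholds for
    the remaining categories would be needed and are not given in the text.
    """
    if lp >= pp:
        raise ValueError("lp must be below pp")
    if score > pp:
        return "Pathogenic"
    if score > lp:
        return "Likely Pathogenic"
    return None
```

Assigning one weight per category, rather than per rule, keeps the number of free parameters small, at the cost of treating all rules within a category as equally informative.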
  • a variant is put in the pathogenic category if any of the following apply: one rule with a 'pathogenic very strong' label is activated, and any of the following applies
  • in line with the ACMG guidelines, a variant is put in the likely pathogenic category if any of the following apply:
  • one rule with a 'pathogenic strong' label is activated and one or two rules with a 'pathogenic moderate' label is/are activated;
  • one rule with a 'pathogenic strong' label is activated and at least two rules with a 'pathogenic supporting' label are activated;
  • one rule with a 'pathogenic moderate' label and at least four rules with a 'pathogenic supporting' label are activated.
  • in line with the ACMG guidelines, a variant is put in the benign category if either a rule with a 'benign standalone' label is activated, or at least two rules with a 'benign strong' label are activated.
  • in line with the ACMG guidelines, a variant is put in the likely benign category if any of the following apply: one rule with a 'benign strong' label is activated and at least one rule with a 'benign supporting' label is activated;
  • at least two rules with a 'benign supporting' label are activated.
  • in line with the ACMG guidelines, a variant is put in the 'uncertain significance' category if none of the other criteria apply, or the criteria for benign and pathogenic are contradictory.
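The rule-combination logic listed above can be sketched as a small function. This is an illustrative reading of the criteria as they appear in this passage, not a complete ACMG implementation: the sub-criteria for the Pathogenic category are truncated in the text, so the sketch takes them as a caller-supplied flag (`meets_pathogenic`, a hypothetical parameter) instead of guessing them. "One rule" is read literally as exactly one.

```python
from collections import Counter


def classify(labels, meets_pathogenic=False):
    """Classify a variant from the labels of its activated rules.

    labels: list of strings such as 'pathogenic strong' or 'benign supporting',
            one entry per activated rule.
    meets_pathogenic: result of the Pathogenic criteria, which are not fully
            reproduced in the text and are therefore evaluated elsewhere.
    """
    n = Counter(labels)
    ps = n["pathogenic strong"]
    pm = n["pathogenic moderate"]
    pp = n["pathogenic supporting"]
    ba = n["benign standalone"]
    bs = n["benign strong"]
    bp = n["benign supporting"]

    likely_pathogenic = (
        (ps == 1 and pm in (1, 2))
        or (ps == 1 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp >= 1) or bp >= 2

    # Contradictory benign and pathogenic evidence gives an uncertain result.
    if (meets_pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain Significance"
    if meets_pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "Uncertain Significance"
```

For example, `classify(["pathogenic strong", "pathogenic moderate"])` falls under the first likely pathogenic criterion, while adding `"benign standalone"` to the same list makes the evidence contradictory.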
  • any combination of the rules described in this document may be used, provided that at least one test is based at least on the frequency of the variant in a control population and at least one test is based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variant association data source.
  • the user can decide which rules are used. Having obtained a score and/or a classification for the variant, this information is sent to the user (via the user device 2) for assessment of the pathogenicity of the variant.
  • the methods of the invention may involve producing a report that allows a user to confidently decide whether a variant may be pathogenic by providing the evidence that supports this decision.
  • the report may display the result of evaluation of each rule, as well as any evidence that has led to this activation, and highlight any rule that is activated. Additional evidence may be displayed together with the classification and the outcome of the rules, e.g. in the form of frequency data in one or more reference or disease cohort datasets, predicted effect of a mutation on the resulting protein function, location of the variant in the gene sequence, possibly in relation to other known disease-causing variants, etc.
  • Example 4 shows an example of a report generated using a method of the invention.
  • Example 1: Assessing rarity of variants for specific diseases. Data from the ExAC database was used to assess the maximum tolerated allele count in the data for variants causative of a series of inherited cardiac conditions with different patterns of available information.
  • Example 1a: Case cohort data available
  • HCM: hypertrophic cardiomyopathy
  • Marfan syndrome is a rare connective tissue disorder caused by variants in the FBN1 gene.
  • the UMD-FBN1 database contains 3,077 variants in FBN1 from 280 references (last updated 28/08/14).
  • the most common variant is in 30/3,006 records (1.00%; 95CI 0.53-1.46%), which likely overestimates its contribution to disease if related individuals are not systematically excluded. Taking the upper bound of this frequency as our HF, a maximum tolerated allele count of 2 is derived. None of the five most common variants in the database are present in ExAC.
  • CPVT: Catecholaminergic Polymorphic Ventricular Tachycardia
  • the maximum genetic contribution, i.e. the maximum proportion of the disease attributable to a single gene
  • the maximum genetic contribution can be used as a conservative estimate.
  • in Ehlers-Danlos syndrome, up to 40% of the disease is caused by variation in the COL5A1 gene. Taking 0.4 as our HF, and a population prevalence of 1/200,000, a maximum tolerated ExAC AC of 5 is derived.
  • PCD: Primary Ciliary Dyskinesia
  • Example 3: Computing threshold values for the ExAC population.
  • a 'filtering allele frequency' was defined, which represents the highest disease-specific 'maximum tolerated allele frequency' that would be incompatible with that variant causing disease. If the disease under study has a maximum tolerated allele frequency ≤ the filtering allele frequency, the variant should be filtered, while if it has a maximum tolerated allele frequency > the filtering allele frequency, the variant remains a candidate.
  • the filtering allele frequency was calculated based on 60,206 exomes and the filters were applied to a simulated dominant Mendelian variant discovery analysis on the remaining 500 exomes.
  • Figure 5a shows that filtering at allele frequencies lower than 0.1% can substantially reduce the number of predicted protein-altering variants in consideration, with the mean number of variants per exome falling from 176 at a cutoff of 0.1% to 63 at a cutoff of 1e-6.
  • Case / control variant frequencies were calculated for all protein-altering variants (frameshift, nonsense, splice donor / acceptor, missense and in-frame insertions / deletions), with frequencies and case / control odds ratios calculated separately for non-overlapping ExAC allele frequency bins with the following breakpoints: 1×10⁻⁵, 5×10⁻⁵, 1×10⁻⁴, 5×10⁻⁴, 1×10⁻³, 5×10⁻³ and 1×10⁻².
  • OR = (cases with variant / cases without variant) / (ExAC samples with variant / ExAC samples without variant), reported along with 95% confidence intervals.
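The odds ratio calculation can be illustrated with a short sketch. The text does not state how the 95% confidence intervals were obtained; the version below assumes the common log (Woolf) approximation, and the function name `odds_ratio_ci` is hypothetical.

```python
import math


def odds_ratio_ci(cases_with, cases_without, controls_with, controls_without, z=1.96):
    """Case / control odds ratio with an approximate confidence interval.

    Assumption: the interval uses the log (Woolf) method, i.e.
    exp(ln(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)); z = 1.96 gives ~95%.
    All four counts must be positive for this approximation.
    """
    counts = (cases_with, cases_without, controls_with, controls_without)
    if min(counts) <= 0:
        raise ValueError("all four counts must be positive")
    odds_ratio = (cases_with / cases_without) / (controls_with / controls_without)
    se_log = math.sqrt(sum(1.0 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

In a binned analysis such as the one described, the function would be called once per allele frequency bin, with the counts restricted to variants falling in that bin.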
  • Figure 5b shows that the odds ratio for disease-association increases markedly at very low allele frequencies demonstrating that increasing the stringency of a frequency filter improves the information content of a genetic result. Therefore, for established disease genes it has been shown that prioritisation of variants purely by rarity can achieve disease-association odds ratios in the clinically-actionable range.
  • Figure 6 shows an extract of a report generated using the methods and systems of the invention.
  • the report includes a top line providing information about the variant, including the variant score (or its corresponding classification, i.e. 'Likely Pathogenic', in this case), the gene that the variant is found in, the type of variant (e.g. non-synonymous single nucleotide polymorphism, or snSNP, in this case), the effect of the variant on the coding sequence of the gene, the effect on the protein sequence, the zygosity of the variant in the patient, and the data source (e.g. sequencing platform used to obtain the data).
  • the exact information displayed will of course depend on the data that is available, and the variant being assessed.
  • the report also includes a table that details the evidence rules that have been used to produce the assessment.
  • the table is subdivided into two main sections: a set of columns showing rules that are evidence of benignity of the variant, and a set of columns showing rules that are evidence of pathogenicity of the variant.
  • the table is further divided into rows according to the type of data on which the rule is based. Fields that are 'greyed out' indicate that the rule was not used to calculate the score of the variant, or that the assessment of the rule led to the rule not being activated.
  • a user can immediately identify those rules that are activated, make a diagnostic assessment based on the balance of benign / pathogenic rules that are activated (i.e. coloured rather than greyed out), and the strength of evidence that is associated with the activation of these rules (i.e. whether a rule is strong, moderate or supporting of a pathogenic diagnostic, or strong / supporting of a benign diagnostic).
  • the report displayed in Figure 6 contains, below the table, additional evidence that can be taken into account by the user to refine the diagnostic.
  • the report shows detailed accounts of the number of individuals with the variant in a series of genetic information databases, including reference and disease cohort data sources.
  • the report displayed in Figure 6 shows the results produced by multiple computational tools that predict the functional effect that the variant might have on the protein, and the location of the variant in relation to other variants of the gene.
  • the additional data displayed here may depend on the data that is available for a variant, as well as user preferences on which evidence they would like to be able to investigate for themselves.
  • a system for assessing the pathogenicity of a genetic variant comprising:
  • a data analysis server (4), a genetic information data source (12), a disease-variation association data source (14a) and a protein-related data source (14c), wherein the data analysis server is programmed to: (1) receive (310) information from a user about a genetic variant identified in an individual;
  • the evaluating the results of one or more tests based at least on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level.
  • evaluating the results of one or more tests comprises determining whether the prevalence of the variant in affected individuals is significantly increased compared with controls, and a Fisher's exact test is used to determine whether a variant is associated with a disease based on the frequency of the variant in the control and the diseased population.
  • the results of the one or more tests are pre-computed across all variants present in the control and the diseased population, and the results are corrected for multiple testing.
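A case / control enrichment test of this kind can be sketched with the standard library alone. Two choices below are assumptions, not taken from the text: the test is one-sided (testing for an increase in cases), and the multiple-testing correction is Bonferroni, since the text names no specific method. The function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(cases_with, cases_without, controls_with, controls_without):
    """One-sided Fisher's exact p-value for enrichment of a variant in cases.

    Sums hypergeometric probabilities of tables at least as extreme as the
    observed one (that is, with at least as many carriers among the cases).
    """
    n_cases = cases_with + cases_without
    n_controls = controls_with + controls_without
    n_carriers = cases_with + controls_with
    denominator = comb(n_cases + n_controls, n_carriers)
    k_max = min(n_cases, n_carriers)
    numerator = sum(
        comb(n_cases, k) * comb(n_controls, n_carriers - k)
        for k in range(cases_with, k_max + 1)
    )
    return numerator / denominator


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Exact integer arithmetic avoids rounding problems with large reference cohorts, although a library routine would be faster when the test is pre-computed across many variants.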
  • the maximum allelic / genetic contribution parameter in Eq. 1 or Eq. 3 is determined based on the frequency of the most common pathogenic variant in the diseased population, or in a disease population for a similar disease.
  • genomic database contains information about paralogous genes
  • evaluating the one or more tests comprises determining whether the variant is a missense mutation and an equivalent amino acid change in a paralogous gene is pathogenic, based on information from the disease-variation association data source.
  • disease-variant association data source contains information on the association between a chosen disease and the variant under assessment, other variants in the same gene, and/or variants in a paralogous gene.
  • the known or predicted effect of the variant on the gene product comprises information on whether the variant is a null variant
  • the disease-variation association data source contains information on whether loss of function of the gene containing the variant is a known mechanism of disease; and wherein evaluating the results of one or more tests based at least on the frequency of the variant comprises determining whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease.
  • determining whether the pathogenicity of the variant is highly dependent on its location, and/or determining whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease additionally comprises determining whether the variant is a nonsense, frameshift or essential splice site variant within exons with a high proportion spliced in (PSI), such as a PSI > 0.9.
  • PSI: proportion spliced in
  • the protein-related data source comprises the results of at least five tools for prediction of the effect of a variant on the function of the protein
  • evaluating the results of one or more tests comprises: determining whether at least two of the tools predict a deleterious effect on the gene or gene product
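The tool-consensus rule can be illustrated as a simple count over per-tool predictions. The function name, the dictionary layout and the tool names in the usage note are hypothetical; only the "at least five tools" and "at least two deleterious" figures come from the text.

```python
def deleterious_consensus(predictions, min_tools=5, min_deleterious=2):
    """True if enough prediction tools call the variant deleterious.

    predictions maps a tool name to True (deleterious) or False (tolerated).
    Raises ValueError if fewer than min_tools results are available, since
    the rule is defined over results from at least that many tools.
    """
    if len(predictions) < min_tools:
        raise ValueError(f"need results from at least {min_tools} tools")
    n_deleterious = sum(1 for is_deleterious in predictions.values() if is_deleterious)
    return n_deleterious >= min_deleterious
```

For instance, with five hypothetical tools `tool_a` to `tool_e`, of which two predict a deleterious effect, the rule would be activated.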
  • evaluating the results of one or more tests based at least on the frequency of the variant in a control population comprises determining whether the allele frequency of the variant in a control population is above a threshold, wherein the threshold is the maximum sample estimate corresponding to the population maximum tolerated allele frequency at the chosen confidence level.
  • step (3) is assigned weights; and combining the results of the tests of step (3) into a pathogenicity score in step (4) comprises computing a sum of the weights for all the tests that are evaluated as positive.
  • calculating the maximum sample estimate corresponding to the population frequency mpaf obtained at the chosen confidence level x comprises calculating the x-th percentile of a Poisson distribution where λ is given by Eq. 2, wherein sample size is the number of individuals in the control population from the genetic information data source.
  • the evaluating the results of at least one test based on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level.
  • a data analysis server comprising a processor 222 and a memory 224, wherein the processor is programmed to perform the method of Clause 27.
  • a computing device comprising software adapted to perform the method of Clause 27.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A system for assessing the pathogenicity of a genetic variant comprises a data analysis server connected to at least one of: a genetic information data source storing frequency information concerning the frequency of at least one genetic variant in at least one control population; a disease-variation association data source storing information on associations between at least one genetic variant, at least one gene or other genetic variants in the at least one gene with diseases; and a protein-related data source storing information on known or predicted effects of at least one genetic variant on a gene product. The data analysis server is connected to a user device and is configured to receive information from a user about a genetic variant identified in an individual, and to determine and transmit a pathogenicity score to the user device. A method of assessing the pathogenicity of a genetic variant, a data analysis server and a computing device are also described.
PCT/GB2017/052545 2016-09-02 2017-09-01 Methods, systems and apparatus for identifying pathogenic gene variants WO2018042185A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662383189P 2016-09-02 2016-09-02
US62/383,189 2016-09-02

Publications (1)

Publication Number Publication Date
WO2018042185A1 true WO2018042185A1 (fr) 2018-03-08

Family

ID=60037637

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2017/052545 WO2018042185A1 (fr) 2016-09-02 2017-09-01 Methods, systems and apparatus for identifying pathogenic gene variants

Country Status (1)

Country Link
WO (1) WO2018042185A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
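CN111816253A (zh) * 2020-06-16 2020-10-23 荣联科技集团股份有限公司 Gene test interpretation method and device
CN112908412A (zh) * 2021-02-10 2021-06-04 北京贝瑞和康生物技术有限公司 Method, device and medium for assessing the applicability of pathogenicity evidence for compound heterozygous variants
CN113832224A (zh) * 2021-09-29 2021-12-24 苏州赛美科基因科技有限公司 Method for detecting poison exon variants of the SCN1A gene
CN114429785A (zh) * 2022-04-01 2022-05-03 普瑞基准生物医药(苏州)有限公司 Automatic classification method and apparatus for gene variants, and electronic device
CN114882946A (zh) * 2022-03-29 2022-08-09 深圳裕康医学检验实验室 Method, apparatus and storage medium for pathogenicity classification of tumor-associated gene variants
WO2023070422A1 (fr) * 2021-10-28 2023-05-04 京东方科技集团股份有限公司 Disease prediction method and apparatus, electronic device, and computer-readable storage medium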

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008067551A2 (fr) * 2006-11-30 2008-06-05 Navigenics Inc. Methods and systems for genetic analysis
US20090087854A1 (en) * 2007-09-27 2009-04-02 Perlegen Sciences, Inc. Methods for genetic analysis
US20160140288A1 (en) * 2014-11-19 2016-05-19 TCI Gene, Inc. Method for forming personal nutrition complex according to incidence of disease and genetic polymorphism by a prediction system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EXAC ET AL., NATURE, 2015
LANDRUM, M. J. ET AL., NUCLEIC ACIDS RESEARCH, vol. 42, 2013, pages D980 - D985
RICHARDS, S. ET AL., GENETICS IN MEDICINE, vol. 17, 2015, pages 405 - 423

Similar Documents

Publication Publication Date Title
Khera et al. Polygenic prediction of weight and obesity trajectories from birth to adulthood
Schaid et al. From genome-wide associations to candidate causal variants by statistical fine-mapping
WO2018042185A1 (fr) Methods, systems and apparatus for identifying pathogenic gene variants
Chen et al. Analysis of 589,306 genomes identifies individuals resilient to severe Mendelian childhood diseases
Oliver et al. Bioinformatics for clinical next generation sequencing
Duzkale et al. A systematic approach to assessing the clinical significance of genetic variants
DiVincenzo et al. The allelic spectrum of Charcot–Marie–Tooth disease in over 17,000 individuals with neuropathy
Girolami et al. Contemporary genetic testing in inherited cardiac disease: tools, ethical issues, and clinical applications
He et al. Systems biology of kidney diseases
Morrison et al. (Cohorts for Heart and Aging Research in Genetic Epidemiology (CHARGE) Consortium) Whole-genome sequence–based analysis of high-density lipoprotein cholesterol
Golbus et al. Population-based variation in cardiomyopathy genes
Striano et al. Clinical significance of rare copy number variations in epilepsy: a case-control survey using microarray-based comparative genomic hybridization
Granka et al. Limited evidence for classic selective sweeps in African populations
Pan et al. Cardiac structural and sarcomere genes associated with cardiomyopathy exhibit marked intolerance of genetic variation
Byun et al. Cross-ancestry genome-wide meta-analysis of 61,047 cases and 947,237 controls identifies new susceptibility loci contributing to lung cancer
Fernandez-San Jose et al. Targeted next-generation sequencing improves the diagnosis of autosomal dominant retinitis pigmentosa in Spanish patients
Farlow et al. Lessons learned from whole exome sequencing in multiplex families affected by a complex genetic disorder, intracranial aneurysm
Mendes de Almeida et al. Whole gene sequencing identifies deep-intronic variants with potential functional impact in patients with hypertrophic cardiomyopathy
Walsh et al. Paralogue annotation identifies novel pathogenic variants in patients with Brugada syndrome and catecholaminergic polymorphic ventricular tachycardia
Aguet et al. Molecular quantitative trait loci
Zhu et al. An iterative approach to detect pleiotropy and perform Mendelian Randomization analysis using GWAS summary statistics
Brandes et al. Genetic association studies of alterations in protein function expose recessive effects on cancer predisposition
Kapplinger et al. Enhancing the predictive power of mutations in the C-terminus of the KCNQ1-encoded Kv7.1 voltage-gated potassium channel
WO2017218798A1 (fr) Systems and methods for diagnosing familial hypercholesterolemia
Hernandez et al. Singleton variants dominate the genetic architecture of human gene expression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17780852

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17780852

Country of ref document: EP

Kind code of ref document: A1