WO2017025925A1

WO2017025925A1 - Method and system for filtering whole exome sequence variants

Info

Publication number: WO2017025925A1
Application number: PCT/IB2016/054845
Authority: WO
Inventors: Brigitte GLANZMANN; Hendrik Jacobus HERBST
Original assignee: Stellenbosch University
Priority date: 2015-08-11
Filing date: 2016-08-11
Publication date: 2017-02-16
Also published as: ZA201801633B

Abstract

A computer-implemented method, system and computer program product for filtering a plurality of exomic sequencing variants in a dataset in order to identify potential disease-causing variants is provided. The invention allows a user to obtain a shortlist of variants, diseases likely to be associated with those variants, and scores for each variant-disease association with minimal computational requirements and bioinformatics knowledge.

Description

METHOD AND SYSTEM FOR FILTERING WHOLE EXOME SEQUENCE VARIANTS

CROSS-REFERENCE(S) TO RELATED APPLICATIONS

This application claims priority from South African provisional patent application number 2015/05726 filed on 11 August 2015, which is incorporated by reference herein.

FIELD OF THE INVENTION

This invention relates to a method and system for filtering and prioritizing sequence variants, more particularly, human whole exome sequence variants.

BACKGROUND TO THE INVENTION

Rapid developments in high throughput sequence capture methods as well as in next generation sequencing (NGS) approaches have made whole exome sequencing (WES) both technically feasible and cost-effective. Moreover, the success of WES in the discovery of novel disease-causing mutations in numerous rare diseases is well established. WES can typically yield hundreds of thousands of variants per sequenced individual; the generation of data for analysis is therefore not considered to be the challenge with WES or NGS, but rather the way in which data is analyzed is proving to be the major conundrum. The identification of a single, plausible disease-causing mutation for a particular disease is proving to be as difficult as looking for the proverbial "needle in a haystack". In addition to this, it has been well documented that every individual or pedigree will carry several so-called private mutations that do not cause overt disease. Although numerous software tools are available that can aid in the prioritization of candidate disease-causing variants, all of the functionalities are disseminated in various analytical tools and researchers are forced to do all the analyses separately and then pool all of the results together - a task which is both time-consuming and demands a considerable understanding of each of the bioinformatics tools that are used. Moreover, some functional prediction tools provide inconsistent results thereby making it exceptionally difficult to obtain a shortlist of candidates for validation and further follow up studies. Bioinformatics analyses of WES data are offered by commercial companies but these are usually quite expensive. Additionally, in developing countries such as South Africa, the limited number of bioinformaticists and the lack of adequate computational infrastructure further limit the successful application and implementation of NGS technologies. 
The paucity of trained bioinformaticists means that the task of prioritizing candidate disease-causing variants from files in variant call format (VCF) is left to wet bench scientists with limited bioinformatics knowledge, which presents a daunting challenge.

Furthermore, current approaches to identifying disease-causing variants are unreliable and do not take disease symptoms or possible modes of inheritance into consideration.

There is therefore a need for a means of filtering and prioritizing whole exome sequencing variants in order to identify potential disease-causing variants that at least to some extent addresses these challenges.

The preceding discussion of the background to the invention is intended only to facilitate an understanding of the present invention. It should be appreciated that the discussion is not an acknowledgment or admission that any of the material referred to was part of the common general knowledge in the art as at the priority date of the application.

SUMMARY OF THE INVENTION

In accordance with a first aspect of the invention there is provided a method for filtering and prioritizing a plurality of exomic sequencing variants in a dataset on a computing device in order to identify potential disease-causing variants, wherein the method comprises the steps of:

sending the dataset to a remotely-accessible variant caller;

receiving an annotated dataset from the remotely accessible variant caller;

removing from the annotated dataset:

any variant annotated as synonymous or non-frameshift,

any variant identified as occurring in a database of the 1000 Genomes Project (1KGP) at a Minor Allele Frequency (MAF) greater than a predetermined threshold,

any variant identified as occurring in a database of the Exome Sequencing Project (ESP) at a MAF greater than a predetermined threshold,

any variant having a negative conservation score calculated using Genomic Evolutionary Rate Profiling (GERP) software, and

any variant having a positive score calculated by a Functional Analysis through Hidden Markov Models (FATHMM) analysis on each of the variants in the annotated dataset, to yield the filtered dataset;

comparing the filtered dataset with a reference database of variants and associated diseases, matching the variants in the filtered dataset to the diseases in the reference database and obtaining a score for each variant-disease match, wherein the score indicates a likely degree of association between a variant and a disease; and

outputting the results.

A further feature of the invention provides for the dataset to be received by the computing device as a Variant Call Format (VCF) file.

Even further features of the invention provide for the 1KGP database to be received by the computing device as a VCF file; and for the predetermined MAF threshold for variants occurring in the 1KGP database to be less than or equal to 5%, such as 4%, 3%, 2%, 1% or 0.1%, preferably 1%.

Still further features of the invention provide for the ESP database to be received by the computing device as a VCF file; for the ESP database to be ESP6500; and for the predetermined MAF threshold to be less than or equal to 5%, such as 4%, 3%, 2%, 1%, or 0.1%, preferably 1%.

A yet further feature of the invention provides for the GERP score to be a GERP++ score.

Still further features of the invention provide for the method to include a step of removing from the annotated dataset any variant having a position on an X or Y chromosome when it is known that a disease, for which a disease-causing variant is sought, is not sex-dependent.

Yet further features of the invention provide for the reference database in the step of comparing the filtered dataset with a reference database of diseases and associated variants to be a database of the Online Mendelian Inheritance in Man (OMIM) or a database of the Jensen Laboratory (http://diseases.jensenlab.org/Search).

An even further feature of the invention provides for the results to be output as a Comma Separated Values (CSV) file.

It will be understood by a person ordinarily skilled in the art that the steps of the method may be carried out in a variety of sequences without departing from the scope of the invention.

In accordance with a second aspect of the invention, there is provided a system for filtering and prioritizing a plurality of exomic sequencing variants in a dataset on a computing device in order to identify potential disease-causing variants, the system comprising:

a communication component for sending the dataset to a remotely-accessible variant caller and receiving an annotated dataset therefrom;

a filtering component for creating a filtered dataset by removing from the annotated dataset:

any variant annotated as synonymous or non-frameshift,

any variant identified as occurring in a database of the 1000 Genomes Project (1KGP) at a Minor Allele Frequency (MAF) which is greater than a predetermined threshold,

any variant identified as occurring in a database of the Exome Sequencing Project (ESP) at a MAF which is greater than a predetermined threshold,

any variant having a negative conservation score calculated using Genomic Evolutionary Rate Profiling (GERP) software, and

any variant having a positive score calculated by a Functional Analysis through Hidden Markov Models (FATHMM) analysis on each of the variants in the annotated dataset, to yield the filtered dataset;

a comparing component for comparing the filtered dataset with a reference database of variants and associated diseases, matching the variants in the filtered dataset to the diseases in the reference database and obtaining a score for each variant-disease match, wherein the score indicates a likely degree of association between a variant and a disease; and

an output component for outputting the results.

A further feature of the invention provides for the system to include a component for receiving the dataset in Variant Call Format (VCF).

Even further features of the invention provide for the system to receive the 1KGP database as a VCF file and for the predetermined MAF threshold for variants occurring in the 1KGP database to be less than or equal to 5%, such as 4%, 3%, 2%, 1% or 0.1%, preferably 1%.

Still further features of the invention provide for the ESP database to be received by the system as a VCF file, for the ESP database to be ESP6500, and for the predetermined MAF threshold to be less than or equal to 5%, such as 4%, 3%, 2%, 1%, or 0.1%, preferably 1%.

A yet further feature of the invention provides for the GERP score to be a GERP++ score.

A still further feature of the invention provides for the filtering component to remove from the annotated dataset any variant having a position on an X or Y chromosome when it is known that a disease, for which a disease-causing variant is sought, is not sex-dependent.

A yet further feature of the invention provides for the reference database to be a database of the Online Mendelian Inheritance in Man (OMIM) or a database of the Jensen Laboratory (http://diseases.jensenlab.org/Search).

An even further feature of the invention provides for the results to be output as a Comma Separated Values (CSV) file.

In accordance with a third aspect of the invention there is provided a computer program product for filtering and prioritizing a plurality of exomic sequencing variants in a dataset on a computing device in order to identify potential disease-causing variants, the computer program product comprising a computer-readable medium having stored computer-readable program code for performing the steps of:

sending the dataset to a remotely-accessible variant caller;

receiving an annotated dataset from the remotely accessible variant caller;

removing from the annotated dataset:

any variants annotated as synonymous or non-frameshift,

any variant identified as occurring in a database of the Exome Sequencing Project (ESP) at a MAF greater than a predetermined threshold,

any variants having a positive score calculated when performing a Functional Analysis through Hidden Markov Models (FATHMM) analysis on each of the variants in the annotated dataset,

to yield the filtered dataset;

comparing the filtered dataset with a reference database of variants and associated diseases, matching the variants in the filtered dataset to the diseases in the reference database and obtaining a score for each variant-disease match, wherein the score indicates a likely degree of association between a variant and a disease; and

outputting the results.

It will be understood by a person ordinarily skilled in the art that the steps of the computer program product may be carried out in a variety of sequences without departing from the scope of the invention.

An embodiment of the invention will now be described, by way of example only, with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

Figure 1 is a schematic representation of a system in which various aspects of the disclosure may be implemented;

Figure 2 is a swimlane diagram illustrating the steps of the method according to a first aspect of the disclosure;

Figure 3 is a scheme in which the filtering steps of the method illustrated in Figure 2 are illustrated in more detail;

Figure 4 is a block diagram illustrating the components of the system according to a second aspect of the disclosure; and

Figure 5 is a block diagram of a computing device useful in a system and method according to the disclosure, which may include subsystems or components interconnected via a communication infrastructure.

DETAILED DESCRIPTION WITH REFERENCE TO THE DRAWINGS

The present disclosure relates to a method, system and computer program product for filtering a plurality of exomic sequencing variants in a dataset on a computing device in order to identify potential disease-causing variants.

Before meaningful information can be extracted from a sequenced exome, raw, unaligned sequences, often obtained in FASTQ format and subjected to quality control by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), are aligned to a reference genome, such as the NCBI Human Reference Genome hg19 or hg38 using an alignment program such as NovoAlign (http://www.novocraft.com/main/page.php?s=novoalign). Alignment of a sequenced exome with a reference genome can yield hundreds of thousands of variants. It is therefore necessary for a researcher to be able to filter large quantities of variants in order to prioritize sequence variants of interest. A sequence of variant filtering steps is known in the art as a pipeline.

The pipeline of the present disclosure allows a user to filter variants according to adjustable settings. All of the predetermined filtration criteria are implemented according to a customisable method determined by the user.
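By way of illustration only, the adjustable settings referred to above might be grouped as in the following Python sketch. The class name `FilterSettings` and its field names are hypothetical and do not appear in the disclosure; the default values follow the preferred thresholds stated in this specification (a MAF threshold of 1%, removal of negative GERP scores and removal of positive FATHMM scores).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterSettings:
    """Hypothetical container for the user-adjustable filtration criteria."""

    max_maf_1kgp: float = 0.01          # variants above this 1KGP MAF are removed
    max_maf_esp: float = 0.01           # variants above this ESP MAF are removed
    min_gerp: float = 0.0               # variants with a GERP score below this are removed
    max_fathmm: float = 0.0             # variants with a FATHMM score above this are removed
    drop_sex_chromosomes: bool = False  # optional removal of X and Y variants

    def __post_init__(self):
        # The specification describes MAF thresholds of at most 5%.
        for name in ("max_maf_1kgp", "max_maf_esp"):
            value = getattr(self, name)
            if not 0.0 <= value <= 0.05:
                raise ValueError(f"{name} must lie between 0 and 0.05, got {value}")


# Example: a stricter 1KGP threshold for a very rare disorder.
strict = FilterSettings(max_maf_1kgp=0.001)
```

Keeping the settings in one immutable object is merely one possible design; it allows the same settings to be passed unchanged to each filtering step.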

Figure 1 illustrates a system (10) in which various aspects of the disclosure may be implemented. The system (10) comprises a computing device (100) which may be suitable for storing and executing computer program code. The various participants and elements in the described system diagrams may use any suitable number of subsystems or components of the computing device (100) to facilitate the functions described herein. The computing device (100) is in communication with a remotely-accessible variant caller (102) via a communication infrastructure (105). The computing device (100) is also in communication, via the communication infrastructure (105), with a plurality of databases, which may be downloadable to or reproducible on, as the case may be, the computing device (100). The downloadable or reproducible databases may include a database comprising exomic sequence variants identified in the 1000 Genomes Project (1KGP) (110), a database comprising exomic sequence variants identified in the Exome Sequencing Project (ESP) (115), particularly ESP6500, and at least one reference database comprising a plurality of exomic sequence variants with associated diseases, such as a database of the Online Mendelian Inheritance in Man (OMIM) (120) or a publicly-accessible database of the Jensen Laboratory (http://diseases.jensenlab.org/Search) (125).

Figure 2 is a swimlane diagram which represents the steps of the method according to a first aspect of the disclosure. In a preliminary step, the computing device (100) is preloaded with a dataset comprising a plurality of exomic sequencing variants in Variant Call Format (VCF), as well as a plurality of databases (110, 115, 120, 125). A dataset comprising sequencing variants is sent (202) from the computing device (100) to a variant caller (102), such as wANNOVAR (http://wannovar.usc.edu/) or SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation141/). The variant caller (102) receives the dataset (204) and annotates the variants in the dataset (206). Each variant is given an annotation indicating the functional consequences of that variant. Each variant is also given a Genomic Evolutionary Rate Profiling (GERP) score (http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html) and a Functional Analysis Through Hidden Markov Models (FATHMM) score (http://fathmm.biocompute.org.uk/index.html).

The GERP score provides an indication of the degree of conservation of a given variant and is derived from the dbNSFP (database for nonsynonymous SNPs and their functional predictions) where higher scores are indicative of greater conservation and scores greater than zero are considered to be conserved. The GERP score may be a GERP++ score. GERP++ (also referred to as GERP2) consists of two programs: gerpcol and gerpelem. Gerpcol estimates constraint and gerpelem identifies constrained elements from gerpcol's output (http://mendel.stanford.edu/SidowLab/downloads/gerp/). The FATHMM scores are used to determine species-specific weightings for predictions of the functional effects of protein missense variants. The use of FATHMM scores has been shown to outperform conventional prediction methods such as SIFT, PolyPhen2 and MutationTaster. Positive FATHMM scores predict a tolerance to the variation while negative FATHMM scores predict an intolerance to the variation, and are considered to be pathogenic.

Once the variants in the dataset have been annotated, the dataset is sent (208) from the variant caller to the computing device (100). The computing device (100) receives (210) the dataset and performs a plurality of filtering steps (212) to yield a filtered dataset of prioritized variants. The computing device then compares (214) the filtered dataset with a reference database of variants and associated diseases, generates (216) disease association scores for each of the prioritized variants, and outputs (218) the results. The reference database of variants and associated diseases may contain an algorithm that calculates the disease association scores for each of the prioritized variants. The result is preferably output in Comma Separated Values (CSV) file format, in which the prioritized variants, associated diseases, and association scores are presented. The output result can be further manipulated by a suitable program for representing the data in graph format (for example, Microsoft Excel).

The plurality of filtering steps (212) are further elaborated on in Figure 3. Variants annotated as synonymous or non-frameshift are removed (302) from the dataset. Synonymous variations are defined as codon substitutions that do not change the synthesized amino acid, do not affect the final protein structure, and are therefore unlikely to be the underlying cause for rare diseases. For this reason, these variants, along with those that do not cause frameshifts (which also do not significantly alter the final protein structure), are removed from the dataset of prioritized variants.

Variants in the dataset that are also present in the database of the 1KGP at a MAF which is greater than a predetermined threshold, are removed (304). The predetermined threshold may be less than or equal to 5%, and may be 0.1%, 1%, 2%, 3%, 4% or 5%, depending on the disease of interest. Preferably the predetermined threshold is 1% for rare diseases. Any variant that is found in the 1KGP database at a frequency of 1% or less is considered to be rare. It is hypothesized that disease-causing variants are unlikely to be found at a high frequency in a normal, healthy population. For this reason, variants with very low or no available frequency data are prioritized.
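The removal steps (302) and (304) described above can be sketched as a single predicate, offered here as a minimal illustration rather than as the implementation of the disclosure. The dictionary keys (`annotation`, `maf_1kgp`) and the annotation labels are assumptions made for the example; a real annotator uses its own vocabulary. A variant with no 1KGP frequency data is kept, consistent with the prioritization of variants with very low or no available frequency data.

```python
# Assumed annotation labels; real annotators use their own vocabulary.
REMOVED_ANNOTATIONS = {"synonymous", "nonframeshift"}


def passes_steps_302_and_304(variant, max_maf=0.01):
    """Return True if the variant survives removal steps (302) and (304)."""
    # Step 302: drop variants annotated as synonymous or non-frameshift.
    # Exact matching is used so that "nonsynonymous" is not caught by accident.
    if variant["annotation"] in REMOVED_ANNOTATIONS:
        return False
    # Step 304: drop variants whose 1KGP MAF exceeds the threshold.
    # A missing frequency (None) is treated as "rare" and the variant is kept.
    maf = variant.get("maf_1kgp")
    return maf is None or maf <= max_maf


variants = [
    {"id": "v1", "annotation": "nonsynonymous", "maf_1kgp": None},
    {"id": "v2", "annotation": "synonymous", "maf_1kgp": 0.001},
    {"id": "v3", "annotation": "nonsynonymous", "maf_1kgp": 0.20},
    {"id": "v4", "annotation": "frameshift", "maf_1kgp": 0.01},
]
kept = [v["id"] for v in variants if passes_steps_302_and_304(v)]
print(kept)  # ['v1', 'v4']
```

The same predicate could be reused for the ESP frequency filter by reading a different key.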

Variants in the dataset that are also present in a database of the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) at a MAF which is greater than a predetermined threshold, are removed (306). The ESP database may be ESP6500 and the predetermined threshold may be less than or equal to 5%, and may be 0.1%, 1%, 2%, 3%, 4% or 5%, depending on the disease of interest. Preferably the predetermined threshold is 1% for rare diseases. The rationale for inclusion of this step (306) is similar to that of the filtering step (304), namely, that any variant that is found in the ESP database at a frequency of 1% or less is considered to be rare. Since disease-causing variants are hypothesized to be rare, those occurring at a low frequency are prioritized.

Variants having a negative GERP score are removed (308) from the dataset. The GERP score gives an indication of the degree of conservation of the variant. Variants that are plausibly disease-causing are unlikely to be found in regions of the human genome that are subject to change. They are more likely to be found in highly conserved regions across multiple species and individuals. This filtering step accounts for this phenomenon.

Variants having a positive FATHMM score are removed (310) from the dataset. Variants that are predicted to have a positive FATHMM score are unlikely to be disease-causing since the variation is predicted to be tolerated.

When it is known that a disease for which a variant is sought is not sex-dependent, an optional filtering (312) may be performed in which any variants corresponding to positions on X and Y chromosomes are removed from the dataset.
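As a non-limiting sketch, the score-based steps (308) and (310) and the optional step (312) might be expressed as follows. The keys `gerp`, `fathmm` and `chrom` are hypothetical, both scores are assumed to be present for every variant, and scores of exactly zero are retained because only negative GERP scores and positive FATHMM scores are described as being removed.

```python
def passes_steps_308_to_312(variant, drop_sex_chromosomes=False):
    """Return True if the variant survives steps (308), (310) and optional (312)."""
    # Step 308: a negative GERP score indicates a poorly conserved position.
    if variant["gerp"] < 0:
        return False
    # Step 310: a positive FATHMM score predicts that the change is tolerated.
    if variant["fathmm"] > 0:
        return False
    # Step 312 (optional): drop X and Y variants for non-sex-dependent diseases.
    if drop_sex_chromosomes:
        chrom = variant["chrom"].upper()
        if chrom.startswith("CHR"):
            chrom = chrom[3:]  # accept both "chrX" and "X" spellings
        if chrom in {"X", "Y"}:
            return False
    return True


conserved_damaging = {"chrom": "chr2", "gerp": 4.2, "fathmm": -2.5}
x_linked = {"chrom": "X", "gerp": 3.0, "fathmm": -1.0}

print(passes_steps_308_to_312(conserved_damaging))                  # True
print(passes_steps_308_to_312(x_linked))                            # True
print(passes_steps_308_to_312(x_linked, drop_sex_chromosomes=True)) # False
```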

After completing one or more of the filtering steps (212), the prioritized variants in the dataset are compared (214) with a reference database of variants and associated diseases, wherein the variants in the dataset are matched to diseases in the database and each match given a score (216). The score indicates the likelihood that the variant is associated with the disease. Examples of suitable databases for carrying out this step include a database of the OMIM (120) or a database of the Jensen Laboratory (125).
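The comparison (214), scoring (216) and CSV output (218) might be sketched as shown below. This is an illustration under stated assumptions: matching is done here by gene symbol, the reference database is modelled as an in-memory mapping, and the scores are placeholder values read from that mapping rather than computed. The disclosure does not specify how the reference database derives its scores, and the gene and disease names below are invented placeholders.

```python
import csv
import io


def match_diseases(variants, reference):
    """Pair each prioritized variant with the diseases listed for its gene."""
    rows = []
    for variant in variants:
        # Assumption: the reference maps a gene symbol to (disease, score) pairs.
        for disease, score in reference.get(variant["gene"], []):
            rows.append({"variant": variant["id"], "gene": variant["gene"],
                         "disease": disease, "score": score})
    # Highest association score first.
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


def to_csv(rows):
    """Serialize the matches as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["variant", "gene", "disease", "score"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder data, invented for the example.
reference = {"GENE_A": [("disease 1", 0.9), ("disease 2", 0.4)],
             "GENE_B": [("disease 3", 0.7)]}
filtered = [{"id": "v1", "gene": "GENE_A"},
            {"id": "v2", "gene": "GENE_B"},
            {"id": "v3", "gene": "GENE_C"}]

matches = match_diseases(filtered, reference)
print(to_csv(matches))
```

A variant whose gene has no entry in the reference (here `v3`) simply produces no row; whether such variants should instead be reported separately is a design choice left open by this sketch.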

It will be understood by a person ordinarily skilled in the art that the steps of the method and/or computer program product may be carried out in a variety of sequences without departing from the scope of the invention. The steps and components are numbered for identification purposes and not necessarily to define an order in which they are carried out.

In order to perform the methods and/or functions described herein, the computing device (100) may include several components, as illustrated in Figure 4. These include a dataset sending component (400), a dataset receiving component (405), a dataset filtering component (410) having first (415), second (420), third (425), fourth (430) and fifth (435) filtering subcomponents, a component for comparing the filtered dataset (440) with a reference database of variants and associated diseases (120, 125), and a results outputting component (445).

Figure 5 shows a block diagram of the computing device (100), which may include subsystems or components interconnected via a communication infrastructure (505) (for example, a communications bus, a cross-over bar device, or a network). The computing device (100) may include at least one central processor (510) and at least one memory component in the form of computer-readable media.

The memory components may include system memory (515), which may include read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS) may be stored in ROM. System software may be stored in the system memory (515) including operating system software. The memory components may also include secondary memory (520). The secondary memory (520) may include a fixed disk (521), such as a hard disk drive, and, optionally, one or more removable-storage interfaces (522) for removable-storage components (523). The removable-storage interfaces (522) may be in the form of removable-storage drives (for example, magnetic tape drives, optical disk drives, floppy disk drives, etc.) for corresponding removable-storage components (for example, a magnetic tape, an optical disk, a floppy disk, etc.), which may be written to and read by the removable-storage drive. The removable-storage interfaces (522) may also be in the form of ports or sockets for interfacing with other forms of removable-storage components (523) such as a flash memory drive, external hard drive, or removable memory chip, etc.

The computing device (100) may include an external communications interface (530) for operation of the computing device (100) in a networked environment enabling transfer of data between multiple computing devices (100). Data transferred via the external communications interface (530) may be in the form of signals, which may be electronic, electromagnetic, optical, radio, or other types of signal. The external communications interface (530) may enable communication of data between the computing device (100) and other computing devices including servers and external storage facilities. Web services may be accessible by the computing device (100) via the communications interface (530). The external communications interface (530) may also enable other forms of communication to and from the computing device (100) including voice communication, near field communication, Bluetooth, etc.

The computer-readable media in the form of the various memory components may provide storage of computer-executable instructions, data structures, program modules, and other data. A computer program product may be provided by a computer-readable medium having stored computer-readable program code executable by the central processor (510).

A computer program product may be provided by a non-transient computer-readable medium, or may be provided via a signal or other transient means via the communications interface (530). Interconnection via the communication infrastructure (505) allows a central processor (510) to communicate with each subsystem or component and to control the execution of instructions from the memory components, as well as the exchange of information between subsystems or components.

Peripherals (such as printers, scanners, cameras, or the like) and input/output (I/O) devices (such as a mouse, touchpad, keyboard, microphone, joystick, or the like) may couple to the computing device (100) either directly or via an I/O controller (535). These components may be connected to the computing device (100) by any number of means known in the art, such as a serial port.

One or more monitors (545) may be coupled via a display or video adapter (540) to the computing device (100).

The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure. Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. The described operations may be embodied in software, firmware, hardware, or any combinations thereof.

The software components or functions described in this application may be implemented as software code to be executed by one or more processors using any suitable computer language such as, for example, Java, C++, or Perl using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions, or commands on a non-transitory computer-readable medium, such as a random access memory (RAM), a read-only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a CD-ROM. Any such computer-readable medium may also reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network. Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a non-transient computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.

Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

The pipeline was validated in three proof-of-concept examples in which known mutations corresponding to Mendelian disorders were identified by the method, system and computer program product of the present disclosure. These are provided below.

EXAMPLES

A proof of concept study was conducted on existing processed WES data for which known genes and variants associated with disease were identified. This was performed in order to determine the effectiveness of the pipeline, as well as to determine specific threshold values for each of the parameters used. The datasets that were used were sourced from research collaborators as well as from files published in recent publications of the journal, Clinical Genetics (Wiley Online Library). The datasets used were the subject of previous studies conducted to identify variants implicated in Parkinson's Disease, severe intellectual disability and microcephaly, and ataxia and myoclonic epilepsy. Through the use of the proof of concept method, a FATHMM threshold of 0.1 was determined to be too stringent and for this reason was adjusted to >0. The pipeline was successfully used to identify variants that were identified in previously reported studies as the disease-causing variant for specific diseases. A summary of the results is provided in Tables 1 and 2 below.

Table 1: Proof of concept studies carried out according to the present invention to identify disease-causing variants

UPMC - Université Pierre et Marie Curie; ICM - Institut du Cerveau et de la Moelle Épinière

Table 2: Stepwise breakdown of results obtained by the pipeline of the present invention

EXAMPLE 1: The pathogenic mutation in the FBXO7 gene was identified from the shortlist of variants obtained by the pipeline of the present invention by studying the shortlisted variants obtained for mother (Individual 1), father (Individual 2) and affected child (Individual 3). Variants that were heterozygous in the parents and homozygous in the patient were prioritized further, which enabled the target FBXO7 variant, L34R, to be identified.
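The further prioritization used in Example 1 (heterozygous in both parents, homozygous in the affected child) can be sketched as follows. This is an illustration only: the function name and the variant identifiers are hypothetical, and genotypes are assumed to be encoded as unphased VCF-style strings, "0/1" for heterozygous and "1/1" for homozygous for the alternate allele.

```python
HET = "0/1"      # heterozygous (assumed unphased VCF-style encoding)
HOM_ALT = "1/1"  # homozygous for the alternate allele


def recessive_candidates(mother, father, child):
    """Variants homozygous in the child and heterozygous in both parents.

    Each argument maps a variant identifier to a genotype string.
    """
    return sorted(
        variant_id
        for variant_id, genotype in child.items()
        if genotype == HOM_ALT
        and mother.get(variant_id) == HET
        and father.get(variant_id) == HET
    )


# Invented identifiers, for illustration only.
mother = {"var_a": "0/1", "var_b": "0/1", "var_c": "1/1"}
father = {"var_a": "0/1", "var_c": "0/1"}
child = {"var_a": "1/1", "var_b": "0/1", "var_c": "1/1"}

print(recessive_candidates(mother, father, child))  # ['var_a']
```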

EXAMPLE 2: The pathogenic mutation in the SLC1A4 gene was identified from the shortlist of variants obtained by the pipeline of the present invention by looking for rare, overlapping variants between two siblings (Individuals 1 and 2). The target variant, E256K, was homozygous for both affected individuals and not found in any of their 11 other unaffected siblings.
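The overlap analysis of Example 2 amounts to a set intersection followed by a set difference, sketched below. Each individual is represented by the set of variant identifiers for which that individual is homozygous; this representation and the identifiers are assumptions made for the example.

```python
def shared_candidates(affected, unaffected):
    """Variants common to every affected individual and absent from all unaffected ones.

    Both arguments are lists of sets of variant identifiers.
    """
    shared = set.intersection(*affected) if affected else set()
    seen_in_unaffected = set().union(*unaffected)
    return shared - seen_in_unaffected


# Invented identifiers, for illustration only.
sibling_1 = {"var_a", "var_b", "var_c"}
sibling_2 = {"var_a", "var_c", "var_d"}
unaffected_siblings = [{"var_c"}, {"var_e"}]

print(shared_candidates([sibling_1, sibling_2], unaffected_siblings))  # {'var_a'}
```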

EXAMPLE 3: The pathogenic mutation in the KCNA2 gene was identified from the shortlist of variants obtained by the pipeline of the present invention as it was not found in either of the parents (who were not consanguineous with the child - Individual 1) and the gene itself had previously been associated with ataxia and convulsions in KCNA2-null mice.
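The check used in Example 3, namely that the variant is present in the child but in neither parent, might be sketched as a set difference. The function name and identifiers are hypothetical, and each individual is assumed to be represented by the set of shortlisted variant identifiers carried by that individual.

```python
def absent_from_parents(child, mother, father):
    """Variants carried by the child that appear in neither parent."""
    return set(child) - set(mother) - set(father)


# Invented identifiers, for illustration only.
child_variants = {"var_a", "var_b", "var_c"}
mother_variants = {"var_a"}
father_variants = {"var_b", "var_d"}

print(absent_from_parents(child_variants, mother_variants, father_variants))  # {'var_c'}
```

A variant surviving this check is only a candidate; in Example 3 the prior association of the gene with ataxia and convulsions in null mice supplied the supporting evidence.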

The variant prioritizing pipeline of the present invention combines information obtained from the annotation of variants into a comprehensive multistep analysis providing poorly-resourced researchers the ability to carry out WES-based discovery of genetic variants in rare disorders with minimal computing power and/or bioinformatics knowledge. Moreover, this analysis is conducted using a hypothesis-free approach, in which no inheritance pattern for disease nor phenotypic characteristics are considered in the filtering steps, allowing the pipeline to prioritize variants without bias to a particular disease. This is a notable advantage over existing pipelines in the art, which may discard variants of interest by filtering datasets according to phenotypic characteristics and/or hypothesized causal disease. However, and in contrast to some pipelines which do not consider disease symptoms or possible modes of inheritance, the inclusion of the step of comparing the variants in the dataset to a reference database of variants and associated diseases after filtering allows the pipeline to obtain a shortlist of variants, diseases likely to be associated with those variants, and scores for each variant-disease association with minimal computational requirements. The output result would not be expected if the steps were carried out individually and the results combined.
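The filtering and matching steps summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the field names ("effect", "maf_1kgp", "maf_esp", "gerp", "fathmm", "chrom", "gene"), the effect labels, the 1% MAF default and the in-memory reference mapping are hypothetical stand-ins for whatever the annotated dataset and the reference database actually provide. The sketch also assumes that a variant with a missing score or frequency is kept rather than removed.

```python
from typing import Any, Dict, List, Tuple

Variant = Dict[str, Any]

# Hypothetical annotation labels for the effect classes that are removed.
REMOVED_EFFECTS = {"synonymous", "non-frameshift"}


def keep_variant(v: Variant, maf_threshold: float = 0.01,
                 sex_dependent: bool = True) -> bool:
    """Return True if the variant survives every filtering step."""
    if v.get("effect") in REMOVED_EFFECTS:
        return False
    # Remove variants that are common in 1KGP or ESP.
    for key in ("maf_1kgp", "maf_esp"):
        maf = v.get(key)
        if maf is not None and maf > maf_threshold:
            return False
    # Remove variants with a negative GERP score.
    gerp = v.get("gerp")
    if gerp is not None and gerp < 0:
        return False
    # Remove variants with a positive FATHMM score.
    fathmm = v.get("fathmm")
    if fathmm is not None and fathmm > 0:
        return False
    # Optionally remove X/Y variants when the disease is not sex-dependent.
    if not sex_dependent and v.get("chrom") in {"X", "Y"}:
        return False
    return True


def filter_variants(variants: List[Variant], maf_threshold: float = 0.01,
                    sex_dependent: bool = True) -> List[Variant]:
    """Apply keep_variant to each annotated variant."""
    return [v for v in variants if keep_variant(v, maf_threshold, sex_dependent)]


def match_diseases(
    variants: List[Variant],
    reference: Dict[str, List[Tuple[str, float]]],
) -> List[Tuple[str, str, float]]:
    """Match filtered variants to diseases via a hypothetical reference
    mapping gene -> [(disease, association score)], highest score first."""
    matches = [
        (v["id"], disease, score)
        for v in variants
        for disease, score in reference.get(v["gene"], [])
    ]
    return sorted(matches, key=lambda m: m[2], reverse=True)
```

Because no phenotype or inheritance pattern is passed to filter_variants, the filtering itself stays hypothesis-free; disease information enters only in match_diseases, after the shortlist has been produced.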

Throughout the specification and claims, unless the context requires otherwise, the word 'comprise' or variations such as 'comprises' or 'comprising' will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.

Claims


1. A computer-implemented method of filtering a plurality of exomic sequencing variants in a dataset in order to identify potential disease-causing variants, the method comprising the steps of:

sending (202) the dataset to a remotely-accessible variant caller (102);

receiving (210) an annotated dataset from the remotely accessible variant caller; and

removing (212) from the annotated dataset:

any variant annotated as synonymous or non-frameshift,

any variant identified as occurring in a database of the 1000 Genomes Project (1KGP) (110) at a Minor Allele Frequency (MAF) greater than a predetermined threshold,

any variant identified as occurring in a database of the Exome Sequencing Project (ESP) (115) at a MAF greater than a predetermined threshold,

any variant having a negative Genomic Evolutionary Rate Profiling (GERP) score, and

any variant having a positive score calculated by a Functional Analysis through Hidden Markov Models (FATHMM) analysis, to yield the filtered dataset.

2. The method as claimed in claim 1 further comprising the steps of comparing (214) the filtered dataset with a reference database of variants and associated diseases, and matching the variants in the filtered dataset to the diseases in the reference database.

3. The method as claimed in claim 2 further comprising obtaining a score (216) for each variant-disease match, wherein the score indicates a likely degree of association between a variant and a disease, and outputting (218) the result.

4. The method as claimed in any one of claims 1 to 3 further comprising a step of removing (312) from the annotated dataset any variant having a position on an X or Y chromosome when it is known that a disease, for which a disease-causing variant is sought, is not sex-dependent.

5. The method as claimed in any one of claims 1 to 4, wherein the predetermined MAF threshold for variants occurring in the 1KGP database (110) is less than or equal to 5%.

6. The method as claimed in claim 5, wherein the predetermined MAF threshold for variants occurring in the 1KGP database (110) is 1%.

7. The method as claimed in any one of claims 1 to 6, wherein the predetermined MAF threshold for variants occurring in the ESP database (115) is less than or equal to 5%.

8. The method as claimed in claim 7, wherein the predetermined MAF threshold for variants occurring in the ESP database (115) is 1%.

9. The method as claimed in any one of claims 1 to 8, wherein the GERP score is a GERP++ score.

10. A system (10) for filtering a plurality of exomic sequencing variants in a dataset on a computing device (100) in order to identify potential disease-causing variants, the system (10) comprising:

a dataset sending component (400) for sending the dataset to a remotely-accessible variant caller (102) and a dataset receiving component (405) for receiving an annotated dataset therefrom; and

a dataset filtering component (410) for creating a filtered dataset by removing from the annotated dataset:

any synonymous or non-frameshift variant,

any variant having a positive score calculated by a Functional Analysis through Hidden Markov Models (FATHMM) analysis,

to yield the filtered dataset.

11. A computer program product for filtering a plurality of exomic sequencing variants in a dataset on a computing device (100) in order to identify potential disease-causing variants, the computer program product comprising a computer-readable medium having stored computer-readable program code for performing the steps of:

sending (202) the dataset to a remotely-accessible variant caller (102);

receiving (210) an annotated dataset from the remotely accessible variant caller (102); and

removing (212) from the annotated dataset:

any variant annotated as synonymous or non-frameshift,

to yield the filtered dataset.