WO2022218509A1

WO2022218509A1 - A method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system

Info

Publication number: WO2022218509A1
Application number: PCT/EP2021/059567
Authority: WO
Inventors: Jun Cheng; Carolin LAWRENCE; Mathias Niepert
Original assignee: NEC Laboratories Europe GmbH
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2022-10-20

Abstract

For realizing a very high prediction accuracy by simple means a method for predicting an effect of a gene variant on an organism by means of a data processing system is provided, comprising the steps: providing or collecting benign and pathogenic gene variants; creating a variant-gene graph by connecting each gene variant to one or more genes to which it belongs and by connecting each gene to one or more other genes according to a pre-definable rule; training a graph neural network, GNN, on the variant-gene graph; and feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic. Further, a corresponding data processing system for carrying out the above method for predicting an effect of a gene variant on an organism is provided.

Description

A METHOD FOR PREDICTING AN EFFECT OF A GENE VARIANT ON AN ORGANISM BY MEANS OF A DATA PROCESSING SYSTEM AND A CORRESPONDING DATA PROCESSING SYSTEM

The present invention relates to a method for predicting an effect of a gene variant on an organism by means of a data processing system and a data processing system for carrying out this method.

Genetic mutations can cause disease by disrupting normal gene function. However, identifying the disease-causing mutations from millions of genetic variants within an individual patient is challenging. Computational methods which can prioritize disease-causing mutations have enormous applications. It is well known that genes function through a complex regulatory network.

Methods for predicting an effect of a gene variant on an organism by means of a data processing system and corresponding data processing systems are known from prior art. Corresponding prior art documents are listed as follows:

US 2016/0357903 A1 - A framework for determining the relative effect of genetic variants.

US 2015/0066378 A1 - Identifying Possible Disease-Causing Genetic Variants by Machine Learning Classification.

US 2019/0114547 A1 - Deep Learning-Based Splice Site Classification.

Further, US 2019/0139622 A1 discloses a method and a system for predicting effects of perturbations to an organism. The method discloses that a neural network is trained to classify the effects of perturbations to a gene or other features of the organism. After training the graph neural network is configured to predict activity of a new strain having one or more modifications to the gene.

The prior art reference EP 3 514 798 A1 discloses a system for predicting information about genetic variants with a machine learning model. It discloses an automated computational system comprising a microprocessor, which determines the functionality of each gene based on the genetic variant data and generates a weighted genetic network comprising the plurality of genes of the genome with connections between them. The method also comprises a regression model explaining the type of variant affecting genes.

The prior art reference WO 2016/172 464 A1 discloses a method for predicting gene dysfunction caused by a defined genetic mutation in the genome of an organism. This reference also discloses a variant-gene graph and the classification of a variant as either benign or pathogenic based on a trained machine learning model. It does not disclose the feature of identifying a newly added variant category to be analyzed.

The prior art reference US 2016/0371431 A discloses a method of predicting pathogenicity of genetic sequence variants. It also discloses that after the machine learning model is trained and has categorized the variant with respect to category of disease causing variant or not, it will identify or predict the variant pathogenicity of newly added variant. The prior art reference does not disclose a gene interaction network.

It is an object of the present invention to improve and further develop a method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system for realizing a very high prediction accuracy by simple means.

In accordance with the invention, the aforementioned object is accomplished by a method for predicting an effect of a gene variant on an organism by means of a data processing system, comprising the following steps:

- providing or collecting benign and pathogenic gene variants;

- creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;

- training a graph neural network, GNN, on the variant-gene graph; and

- feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.

Further, the aforementioned object is accomplished by a data processing system for carrying out the method for predicting an effect of a gene variant on an organism, comprising:

- providing or collecting means for providing or collecting benign and pathogenic gene variants;

- creating means for creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;

- training means for training a graph neural network, GNN, on the variant-gene graph; and

- feeding means for feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.

According to the invention it has been recognized that it is possible to realize a very high prediction accuracy by simply providing a particularly suitable graph neural network model and training set and proceeding. Firstly, benign and pathogenic gene variants are provided or collected from a suitable source. This means that relevant data and/or features of such gene variants are provided or collected for further use in the method. In a next step a suitable variant-gene graph is created by a) connecting each gene variant to one or more genes to which this gene variant belongs and by b) connecting each gene to one or more other genes according to a pre-definable rule. Then, with such a variant-gene graph a graph neural network is trained. In a last step a new or unknown gene variant is fed to the graph neural network for predicting by the graph neural network whether the new or unknown gene variant is benign or pathogenic. All or some of the method steps can be performed or supported by the data processing system, e.g. a computer. This graph neural network approach operates on a heterogeneous graph with genes and gene variants. This graph is created by assigning gene variants to genes and connecting genes with an existing gene-gene interaction network. The invention improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction. The prediction of effects of new observed gene variants is possible with very high accuracy.

Thus, on the basis of the invention a method and system are provided which realize a very high prediction accuracy by simple means.

According to an embodiment of the invention the provided or collected benign and pathogenic gene variants can be provided or collected from one or more databases comprising data or features of gene variants. There can be a definable communication between the data processing system and one or more database for the step of providing or collecting the gene variants. Thus, a large amount of gene variants can be used for realizing a high prediction accuracy by simple means.

Within a further embodiment labeling for each variant to which gene or genes it belongs can be based on suitable coordinates. Benign and pathogenic gene variants can be assigned to the closest gene or genes in a related genome. This simplifies the method and provides a realization of a high prediction accuracy.

According to a further embodiment the pre-definable rule can comprise connecting each gene to every other gene. Alternatively, the pre-definable rule can comprise connecting each gene to one or more other genes which is or are connected to said gene based on one or more predefined biological interactions. Preferably, the one or more predefined biological interactions can simply be retrieved from a biological database or from a gene-gene interaction graph of a biological database.

Within a further embodiment at least one feature can be collected for at least one or each gene variant, wherein preferably the at least one feature can be the output of another variant prediction model that does not use a graph. According to a further embodiment and for providing a simple and effective representation the at least one or each gene variant can be represented by a feature vector.

Within a further embodiment at least one feature can be collected for at least one or each gene. As a result, at least one or each gene can be specified by such a feature.

According to a further embodiment at least one or each gene can be represented by an N-dimensional vector, wherein N is an integer. This provides a very simple and clear representation.

Within a further embodiment the N-dimensional vector can be a randomly initialized vector, which is optimized in the training step. Such a type of vector is very suitable for effectively performing the method.

Within a further embodiment the N-dimensional vector can comprise at least one collected feature and/or can be a concatenation of a randomly initialized vector, which is trainable, with one or more collected gene features. Such a type of vector is also very suitable for effectively performing the method.
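The concatenated gene representation can be illustrated with a minimal sketch. The gene names, the feature values, the dimension of the trainable part and the helper `init_gene_vector` are hypothetical and chosen only for illustration; in an actual embodiment the trainable part would be updated during the training step while the collected features stay fixed.

```python
import random

def init_gene_vector(gene_features, trainable_dim, rng):
    """Concatenate a randomly initialized (trainable) part with fixed collected features."""
    trainable = [rng.gauss(0.0, 0.1) for _ in range(trainable_dim)]
    return trainable + list(gene_features)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Hypothetical collected gene features, e.g. two annotation scores per gene.
collected = {"GENE_A": [0.7, 0.1], "GENE_B": [0.2, 0.9]}

# N = 4 trainable entries + 2 collected entries = 6 dimensions per gene.
gene_vectors = {
    gene: init_gene_vector(features, trainable_dim=4, rng=rng)
    for gene, features in collected.items()
}
```

Here each gene is represented by an N-dimensional vector with N = 6, of which only the first four entries would be optimized during training.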

According to a further embodiment, for a training set in the training step, each gene variant in the training set can have a definable label, e.g. 0 for benign and 1 for pathogenic. By means of such a label very efficient and structured prediction with high prediction accuracy is possible.

Within a further embodiment in the training step one or more parameters of the graph neural network can be updated using gradient descent. This proceeding supports an increase of the likelihood for gene variants in the training set to obtain the correct label from the network.

According to a further embodiment and for further enhancing the prediction accuracy an explanation for the prediction of a gene variant or variants can be provided by returning which other gene variant or gene variants and/or which gene or genes the graph neural network has utilized to arrive at the prediction, wherein preferably the impact can be provided, for example to an expert, that the gene variant or gene variants and/or gene or genes had on the prediction. This proceeding provides a high degree of information to a user of the method.

Advantages and aspects of embodiments of the present invention are summarized and further explained as follows:

According to embodiments of the present invention a graph neural network approach is proposed that operates on a heterogeneous graph with genes and gene variants. The graph can be created by assigning variants to genes and connecting genes with an existing gene-gene interaction network. The graph neural network can be trained to aggregate information between genes, and between genes and gene variants. Gene variants can exchange information via the genes they connect to. This method improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction.

Generally, all embodiments of the present invention provide a variant effect prediction with graph neural network, VEGN.

Predicting variant effects is a long-standing problem in genetics. Previous Artificial Intelligence, AI, systems that perform variant effect prediction do this by predicting each gene variant in isolation. Even though work has been done to interpret variant effects in the context of biological regulatory networks, no existing method can effectively integrate the gene-variant and gene-gene networks to predict the effect of newly observed gene variants. In embodiments of the present invention, we learn the higher-order relationships between variants and between genes and variants with a graph neural network. Compared to existing approaches, VEGN has two advantages. First, VEGN considers variant effects in the context of a biological network instead of as isolated events. Such an approach enables VEGN to capture potential remote (trans) effects of variants on indirectly connected genes. Second, existing annotations of disease variants are sparse and are focused on some well-studied genes. VEGN enables information flow from well-studied genes to less-studied genes through the biological network. It also considers the correlated disease-causing status of variants within the same functional module in a biological network. The present approach greatly improves the prediction accuracy compared to previous methods.

In embodiments of this invention, we formulate variant effect prediction as a graph. A graph can consist of a set of nodes and a set of edges, where each edge connects two nodes. A gene variant or variant is a genetic variation in a genome that differs from the reference genome. Such a variant can be identified to belong to a certain gene or genes by assigning it to the nearest gene, or genes in the case of equal distance, in genome coordinates. Given a set of variants and the set of genes they belong to, the union of these two sets is the set of nodes in a graph. For each variant there is an edge to the genes it belongs to. For edges between genes, we consider two options: (1) the edges are given as input, e.g. a domain expert labelled the edges; (2) we assume an edge exists from each gene to every other gene.
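The graph construction described above can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not part of the description: all genes and variants lie on a single chromosome, a gene is given as a hypothetical (start, end) coordinate pair, and a variant is given as a single position. The function names are likewise hypothetical.

```python
from itertools import permutations

def nearest_genes(position, genes):
    """Return every gene at minimal distance from a variant position (ties keep all genes)."""
    def distance(span):
        start, end = span
        if start <= position <= end:
            return 0  # the variant lies inside the gene
        return min(abs(position - start), abs(position - end))

    best = min(distance(span) for span in genes.values())
    return sorted(g for g, span in genes.items() if distance(span) == best)

def build_variant_gene_graph(variants, genes, gene_edges=None):
    """Nodes are variants plus genes; gene-gene edges are given (option 1) or fully connected (option 2)."""
    variant_edges = [(v, g) for v, pos in variants.items() for g in nearest_genes(pos, genes)]
    if gene_edges is None:
        gene_edges = list(permutations(sorted(genes), 2))
    nodes = sorted(variants) + sorted(genes)
    return nodes, variant_edges, list(gene_edges)

# Hypothetical coordinates on one chromosome.
genes = {"GENE_A": (100, 200), "GENE_B": (300, 400), "GENE_C": (900, 1000)}
variants = {"v1": 150, "v2": 250, "v3": 950}

nodes, variant_edges, gene_edges = build_variant_gene_graph(variants, genes)
```

In this example the variant `v2` is equally distant from `GENE_A` and `GENE_B` and is therefore connected to both, and since no gene-gene edges are passed in, every gene is connected to every other gene.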

Based on this graph, a graph neural network, GNN, with weights w can be trained. For this, we assume we are given a training set where each variant has a feature vector - e.g. predicted variant effect on transcription factor binding, on splicing, conservation score - and a classification label, e.g. 0 or 1 for benign or pathogenic. For each variant in the training data, we can utilize the associated classification label to define a loss - e.g. binary cross entropy loss - and use - stochastic - gradient descent and backpropagation to update the weights w of the GNN. Once trained, new variants can be added to the graph and applying the GNN will classify the variant, e.g. as benign or pathogenic.

The GNN itself can take various forms, e.g. it could be a graph attention network, see Velickovic et al. 2018, Graph Attention Networks. International Conference on Learning Representations, ICLR. Furthermore, we can learn one joint GNN or we could learn a different GNN depending on the edge type, e.g. a different network is learnt for gene-variant, variant-gene and gene-gene edges. Furthermore, if we assume that each gene has an edge to every other gene, then we learn the strength of each edge. This can be done with a fully connected neural network, e.g. using a Transformer, see Vaswani et al. 2017, Attention is All you Need, Neural Information Processing Systems, NeurIPS. The fully connected neural network can then be used for the edge type gene-gene, whereas a GNN can be used for other edge types. This allows us to combine a given graph structure and a learnt graph structure in one joint neural network.
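One way to picture learnt gene-gene edge strengths is a single-head scaled dot-product attention over gene vectors. The sketch below is an assumption made for illustration, not the specific network of the embodiments: it omits the learned query and key projections of a Transformer, and the gene vectors are hypothetical fixed numbers rather than trained parameters.

```python
import math

def attention_edge_strengths(embeddings):
    """Softmax-normalised scaled dot-product scores from each gene to every other gene."""
    genes = sorted(embeddings)
    dim = len(embeddings[genes[0]])
    strengths = {}
    for src in genes:
        others = [g for g in genes if g != src]
        scores = [
            sum(a * b for a, b in zip(embeddings[src], embeddings[dst])) / math.sqrt(dim)
            for dst in others
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for dst, e in zip(others, exps):
            strengths[(src, dst)] = e / total
    return strengths

# Hypothetical gene vectors; in training these would be updated by backpropagation.
embeddings = {"GENE_A": [1.0, 0.0], "GENE_B": [0.9, 0.1], "GENE_C": [-1.0, 0.0]}
strengths = attention_edge_strengths(embeddings)
```

Because the strengths are differentiable functions of the gene vectors, they could be learnt jointly with the rest of the network, with similar gene vectors yielding stronger edges.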

When classifying a new variant, information flows via the graph neural network from the variant’s feature vector to the gene the variant is attached to and from there to other parts of the graph. This enables us to track which genes and other variants had an influence on the prediction. This can be done by using an explanation method suitable for GNNs, e.g. GNNExplainer, see Ying et al. 2019, GNNExplainer: Generating Explanations for Graph Neural Networks, Neural Information Processing Systems, NeurIPS. This is a powerful advantage of embodiments of our invention because it allows us to explain the model’s variant effect prediction to a domain expert by providing the information on which variants and genes had an impact on the prediction. This information may help the domain experts to discover new disease-associated genes or non-additive effects of variant combinations.
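The idea of reporting the impact of neighbouring nodes can be illustrated with a simple leave-one-out (occlusion) analysis. This is a swapped-in stand-in chosen for brevity, not GNNExplainer itself, and the scoring function, its fixed weights and the scalar features are all hypothetical.

```python
import math

def predict(variant_x, neighbour_xs, w_self=2.0, w_ctx=3.0, bias=-2.5):
    """Toy scorer: sigmoid of the variant's own feature plus the mean feature of its neighbours."""
    ctx = sum(neighbour_xs.values()) / len(neighbour_xs) if neighbour_xs else 0.0
    logit = w_self * variant_x + w_ctx * ctx + bias
    return 1.0 / (1.0 + math.exp(-logit))

def occlusion_impacts(variant_x, neighbour_xs):
    """Change in P(pathogenic) when each neighbouring node is removed in turn."""
    full = predict(variant_x, neighbour_xs)
    impacts = {}
    for name in neighbour_xs:
        reduced = {k: v for k, v in neighbour_xs.items() if k != name}
        impacts[name] = full - predict(variant_x, reduced)
    return impacts

# Hypothetical scalar features of variants reachable through the same gene.
neighbours = {"v_known_pathogenic": 0.9, "v_known_benign": 0.1}
impacts = occlusion_impacts(0.6, neighbours)
```

A positive impact means that the neighbour pushed the prediction towards pathogenic, a negative impact that it pushed the prediction towards benign; such a ranked list could be shown to a domain expert.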

For each variant v_i, VEGN predicts the probability that the variant is disease-causing (pathogenic): P_i(pathogenic). The graph neural network model with weights w can be trained with standard stochastic gradient descent and a cross entropy loss function:

$$\mathcal{L}(w) = -\sum_{i} \left[ y_i \log P_i(\text{pathogenic}) + (1 - y_i) \cdot \log\left(1 - P_i(\text{pathogenic})\right) \right],$$

where y_i is the label of the variant v_i in the training data, pathogenic being 1 and benign being 0, P_i(pathogenic) is the prediction for v_i, and i is an integer indexing the variants in the training data.
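Training with this loss can be sketched on a deliberately small stand-in model. The sketch assumes the standard binary cross entropy with a leading minus sign, so that minimising it increases the likelihood of the correct labels. Instead of a full GNN it uses one round of mean aggregation over the variants attached to the same gene followed by a logistic layer; the data, the function names and the hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model_inputs(features, variant_gene):
    """Append to each variant's features the mean features of all variants on its gene."""
    sums, counts = {}, {}
    for v, g in variant_gene.items():
        acc = sums.setdefault(g, [0.0] * len(features[v]))
        for k, x in enumerate(features[v]):
            acc[k] += x
        counts[g] = counts.get(g, 0) + 1
    return {v: features[v] + [s / counts[g] for s in sums[g]] for v, g in variant_gene.items()}

def cross_entropy(inputs, labels, w, b):
    """Summed binary cross entropy over all training variants."""
    total = 0.0
    for v, x in inputs.items():
        p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
        total -= labels[v] * math.log(p) + (1 - labels[v]) * math.log(1 - p)
    return total

def train(inputs, labels, epochs=300, lr=0.5):
    """Stochastic gradient descent, one variant at a time."""
    dim = len(next(iter(inputs.values())))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for v, x in inputs.items():
            p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
            grad = p - labels[v]  # derivative of the per-variant loss w.r.t. the logit
            w = [wk - lr * grad * xk for wk, xk in zip(w, x)]
            b -= lr * grad
    return w, b

# Hypothetical training data: one feature per variant, two genes.
features = {"v1": [0.9], "v2": [0.8], "v3": [0.2], "v4": [0.1]}
variant_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B", "v4": "GENE_B"}
labels = {"v1": 1, "v2": 1, "v3": 0, "v4": 0}

inputs = model_inputs(features, variant_gene)
loss_before = cross_entropy(inputs, labels, [0.0, 0.0], 0.0)
w, b = train(inputs, labels)
loss_after = cross_entropy(inputs, labels, w, b)
```

The per-variant gradient `p - y` follows from differentiating the loss with respect to the logit of the sigmoid; in a full GNN the same quantity would be propagated further back through the message-passing layers.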

Validating our method empirically shows large improvements over the previous state of the art, both in terms of average precision and area under the curve:

Further advantages and aspects of embodiments of the present invention can be summarized as follows:

1) Embodiments can formulate variant effect prediction as a graph via gene attachments and can learn a graph neural network.

2) Embodiments can learn an application specific gene-gene interaction graph.

3) Embodiments can combine a given graph structure with a learnt graph structure in one joint neural network.

4) Embodiments can explain a prediction of a variant by providing the variants and genes that influenced it and the impact they had on the prediction.

Further aspects and advantages of embodiments of the method and data processing system according to the present invention can be summarized as follows:

An embodiment can comprise a method for predicting what effect a human’s gene variant will have on their body. The method can comprise the steps of:

1) Collecting existing benign and pathogenic variants from databases.

2) Labeling for each variant to which genes it belongs, based on the coordinates of the variant and the coordinates of the genes. Variants can be assigned to the closest genes in the genome.

3) Optional: For each gene, labeling which other genes are connected to it based on biological interactions, e.g. retrieved from a gene-gene interaction graph of a biological database.

4) Creating a variant-gene graph, where:

a. each variant can be connected to one or more genes based on step 2);

b. each gene can be either

i. connected to every other gene, or

ii. connected to the genes identified in step 3), if step 3) is present.

5) Collecting features for each variant. For example, the feature could be the output of another variant prediction model that does not use a graph.

6) Optional: Collecting features for each gene.

7) Each variant can be represented by the feature vector collected in step 5).

8) Each gene can be represented by an N-dimensional vector, which may be either one of the below or a concatenation:

a. A randomly initialized vector, which can be optimized in the training process.

b. The gene features collected in step 6).

c. A concatenation of the randomly initialized vector, which is trainable, with gene features collected in step 6).

9) Training a graph neural network model on the graph defined in step 4), where

a. for a training set, each variant in the training set can have a label, e.g. 0 for benign and 1 for pathogenic;

b. the model’s parameters can be updated using gradient descent in order to increase the likelihood for variants in the training set to obtain the correct label from the network.

10) Once the model is trained, giving a new, previously unseen variant to the model to have the model predict whether the variant is benign or pathogenic.

11) Optional: Providing an explanation for the prediction by returning which other variants and which genes the model utilized to arrive at the prediction.

Previous methods classify each variant in isolation. By treating the problem as a graph where variants are linked to each other via genes and by automatically learning a gene-gene network, embodiments of the present method can learn a graph neural network that greatly improves the accuracy of the variant prediction.

There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the following explanation of examples of embodiments of the invention, illustrated by the drawing. In the drawing

Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention,

Fig. 2 shows in a diagram a further embodiment of the present invention,

Fig. 3 shows in a block diagram a further embodiment of the present invention, and

Fig. 4 shows in a block diagram a further embodiment of the present invention.

Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention, concretely a VEGN. The goal is to classify gene variants - in short form: variants - which are denoted by triangles. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. New variants are added to the graph via the gene they attach to. Given a new variant’s feature vector, the GNN classifies the new variants and can give an explanation of which other variants and genes were relevant for the classification.

Fig. 2 shows in a diagram a further embodiment of the present invention. Here is shown a concrete instantiation with a different GNN for each edge type: The goal is to classify variants which are denoted by triangles, e.g. as benign 0 or pathogenic 1. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. This can either be one joint GNN or different GNNs can be learnt for different edges. E.g. for the three different edge types - “gene has variant”, “gene interacts with gene” and “variant in gene” - separate GNN layers are instantiated and learnt. Arrows within a layer indicate the direction of information flow, where the hidden representation of the arrow's source is used to update the hidden representation of the arrow's target. Within a layer the arrows represent the weights of the GNN that is learnt and these weights are shared within this layer, i.e. for “variant in gene”, each variant has its own feature vector and to this the same GNN layer's weights are applied to update the target hidden representation. The hidden representations of each layer are aggregated, e.g. by sum. Finally, there is one further GNN layer where information flows from the gene to the variant. Based on this update, a classification layer, e.g. via a sigmoid function, determines the likelihood of a variant being benign or pathogenic. During training, the true label of a variant v is observed and weights can be updated via a loss function and backpropagation. During test time, new variants can be added to the graph via the gene they attach to. Based on the features associated with the variant, the learnt weights can be applied in a forward pass to derive a prediction.
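The forward pass of such an architecture can be sketched as follows. This is a simplified reading of the description above under explicit assumptions: each variant is attached to exactly one gene, variant features and gene hidden representations are both two-dimensional, a tanh non-linearity is used after aggregation, and all weight values are hypothetical placeholders rather than learnt parameters.

```python
import math

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def forward(variant_x, gene_h, variant_gene, gene_gene, params):
    """One pass: per-edge-type layers, sum aggregation, gene-to-variant layer, sigmoid."""
    # "variant in gene": every variant sends a message to its gene (shared weights).
    from_variants = {g: [0.0] * len(h) for g, h in gene_h.items()}
    for v, g in variant_gene:
        from_variants[g] = vadd(from_variants[g], matvec(params["variant_in_gene"], variant_x[v]))
    # "gene interacts with gene": every gene sends a message along each gene-gene edge.
    from_genes = {g: [0.0] * len(h) for g, h in gene_h.items()}
    for src, dst in gene_gene:
        from_genes[dst] = vadd(from_genes[dst], matvec(params["gene_gene"], gene_h[src]))
    # Aggregate the hidden representations of the edge-type layers by sum.
    updated = {g: [math.tanh(t) for t in vadd(from_variants[g], from_genes[g])] for g in gene_h}
    # "gene has variant": information flows back from the gene to the variant, then classify.
    probs = {}
    for v, g in variant_gene:
        hidden = vadd(matvec(params["self"], variant_x[v]),
                      matvec(params["gene_has_variant"], updated[g]))
        logit = sum(c * h for c, h in zip(params["classifier"], hidden))
        probs[v] = 1.0 / (1.0 + math.exp(-logit))
    return probs

identity = [[1.0, 0.0], [0.0, 1.0]]
params = {"variant_in_gene": identity, "gene_gene": identity, "self": identity,
          "gene_has_variant": identity, "classifier": [1.0, 1.0]}
variant_x = {"v1": [0.9, 0.8], "v2": [0.1, 0.2], "v_new": [0.5, 0.5]}
gene_h = {"GENE_A": [0.1, 0.0], "GENE_B": [0.0, 0.1]}
variant_gene = [("v1", "GENE_A"), ("v2", "GENE_B"), ("v_new", "GENE_A")]
gene_gene = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_A")]

probs = forward(variant_x, gene_h, variant_gene, gene_gene, params)
```

The new variant `v_new` is scored simply by listing it among the variant-gene edges: its prediction then depends both on its own features and on the representation of `GENE_A`, which in turn aggregates information from `v1` and from `GENE_B`.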

Further embodiments:

1. Genetic diagnostics for patients

Each individual has millions of genetic variants. Even though such variants can be identified with high-throughput sequencing and bioinformatics variant calling methods, it is challenging to prioritize potential disease-causing variants. VEGN or embodiments of the present invention can be used to prioritize a short list of variants for a clinician to manually inspect.

Fig. 3 shows in a block diagram such a further embodiment of the present invention. In genetic diagnosis, patients first have their genome sequenced with whole genome sequencing or whole exome sequencing. A list of variants is generated through variant calling on the sequencing data. VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic). The variants can then be sorted based on the score in descending order. The top k variants, wherein k is an integer, are selected for further manual investigation by domain experts. The value of k depends on the available resources.
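The ranking step of this workflow can be sketched in a few lines. The variant identifiers and scores below are hypothetical, and the function name `top_k_variants` is chosen only for illustration.

```python
def top_k_variants(scores, k):
    """Sort variants by predicted P(pathogenic), highest first, and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical disease-relevance scores for the called variants of one patient.
scores = {
    "chr1:12345A>G": 0.12,
    "chr2:67890C>T": 0.97,
    "chr7:13579G>A": 0.55,
    "chrX:24680T>C": 0.03,
}

shortlist = top_k_variants(scores, k=2)
```

With k = 2 the shortlist contains the two highest-scoring variants, which would then be passed on for manual inspection.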

2. Neoantigen selection

Neoantigens are antigens found specifically in tumor samples. They are products from tumor-specific variants. Due to the tumor-specificity of neoantigens, they are frequently used as targets for immunotherapy. Existing neoantigen selection pipelines typically do not consider the effects of variants. VEGN or embodiments of the present invention can help to prioritize and select most biologically relevant variants.

Fig. 4 shows in a block diagram such a further embodiment of the present invention. In neoantigen discovery, tumor samples are whole genome sequenced or whole exome sequenced. A list of missense variants is generated through variant calling on the sequencing data. VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic). The variants can then be sorted based on the score in descending order. The predicted disease-causing probabilities are combined with other evidence in an existing neoantigen discovery pipeline to select for neoantigens.

Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A method for predicting an effect of a gene variant on an organism by means of a data processing system, comprising the following steps:

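- providing or collecting benign and pathogenic gene variants;

- creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;

- training a graph neural network, GNN, on the variant-gene graph; and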

- feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.

2. A method according to claim 1, wherein the provided or collected benign and pathogenic gene variants are provided or collected from one or more databases.

3. A method according to claim 1 or 2, wherein the benign and pathogenic gene variants are assigned to the closest gene or genes in a related genome.

4. A method according to one of claims 1 to 3, wherein the pre-definable rule comprises connecting each gene to every other gene or connecting each gene to one or more other genes which is or are connected to said gene based on one or more predefined biological interactions, wherein preferably the one or more predefined biological interactions are retrieved from a biological database or from a gene-gene interaction graph of a biological database.

5. A method according to one of claims 1 to 4, wherein at least one feature is collected for at least one or each gene variant, wherein preferably the at least one feature is the output of another variant prediction model that does not use a graph.

6. A method according to one of claims 1 to 5, wherein at least one or each gene variant is represented by a feature vector.

7. A method according to one of claims 1 to 6, wherein at least one feature is collected for at least one or each gene.

8. A method according to one of claims 1 to 7, wherein at least one or each gene is represented by a N dimensional vector.

9. A method according to claim 8, wherein the N dimensional vector is a randomly initialized vector, which is optimized in the training step.

10. A method according to claim 8 or 9, wherein the N dimensional vector comprises at least one collected feature.

11. A method according to one of claims 8 to 10, wherein the N dimensional vector is a concatenation of a randomly initialized vector, which is trainable, with one or more collected gene features.

12. A method according to one of claims 1 to 11, wherein for a training set in the training step, each gene variant in the training set has a definable label, e.g. 0 for benign and 1 for pathogenic.

13. A method according to one of claims 1 to 12, wherein in the training step one or more parameters of the graph neural network are updated using gradient descent.

14. A method according to one of claims 1 to 13, wherein an explanation for the prediction of a gene variant or variants is provided by returning which other gene variant or gene variants and/or which gene or genes the graph neural network has utilized to arrive at the prediction, wherein preferably the impact is provided that the gene variant or gene variants and/or gene or genes had on the prediction.

15. A data processing system for carrying out the method for predicting an effect of a gene variant on an organism according to any one of claims 1 to 14, comprising:

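- providing or collecting means for providing or collecting benign and pathogenic gene variants;

- creating means for creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;

- training means for training a graph neural network, GNN, on the variant-gene graph; and

- feeding means for feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.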