WO2022218509A1 - A method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system - Google Patents

A method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system

Info

Publication number
WO2022218509A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
variant
graph
variants
genes
Prior art date
Application number
PCT/EP2021/059567
Other languages
French (fr)
Inventor
Jun Cheng
Carolin LAWRENCE
Mathias Niepert
Original Assignee
NEC Laboratories Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories Europe GmbH
Priority to PCT/EP2021/059567
Publication of WO2022218509A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20: Probabilistic models
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10: Ontologies; Annotations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20: Heterogeneous data integration

Definitions

  • the present invention relates to a method for predicting an effect of a gene variant on an organism by means of a data processing system and a data processing system for carrying out this method.
  • US 2019/0139622 A1 discloses a method and a system for predicting effects of perturbations to an organism.
  • the method discloses that a neural network is trained to classify the effects of perturbations to a gene or other features of the organism. After training the graph neural network is configured to predict activity of a new strain having one or more modifications to the gene.
  • the prior art reference EP 3 514 798 A1 discloses a system for the prediction of genetic variants with a machine learning model.
  • the prior art discloses an automated computational system for predicting information about genetic variants.
  • the method comprises a microprocessor determining the functionality of each gene based on the genetic variant data and generating a weighted genetic network comprising the plurality of genes of the genome with connections between them.
  • the method also comprises a regression model explaining the type of variant affecting genes.
  • WO 2016/172 464 A1 discloses a method for predicting gene dysfunction caused by a defined genetic mutation in the genome of an organism. This reference also discloses a variant gene graph and classifies the variant as either benign or pathogenic based on a trained machine learning model. This prior art does not disclose the feature of identifying a newly added variant category to be analyzed.
  • the prior art reference US 2016/0371431 A discloses a method of predicting the pathogenicity of genetic sequence variants. It also discloses that, after the machine learning model has been trained and has categorized each variant as disease-causing or not, it will identify or predict the pathogenicity of a newly added variant.
  • the prior art reference does not disclose a gene interaction network.
  • the aforementioned object is accomplished by a method for predicting an effect of a gene variant on an organism by means of a data processing system, comprising the following steps:
  • a data processing system for carrying out the method for predicting an effect of a gene variant on an organism comprising:
  • - creating means for creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;
  • - feeding means for feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.
  • According to the invention it has been recognized that it is possible to realize a very high prediction accuracy by simply providing a particularly suitable graph neural network model and training set and proceeding.
  • benign and pathogenic gene variants are provided or collected from a suitable source. This means that relevant data and/or features of such gene variants are provided or collected for further use in the method.
  • a suitable variant-gene graph is created by a) connecting each gene variant to one or more genes to which this gene variant belongs and by b) connecting each gene to one or more other genes according to a pre-definable rule. Then, with such a variant-gene graph a graph neural network is trained.
  • a new or unknown gene variant is fed to the graph neural network for predicting by the graph neural network whether the new or unknown gene variant is benign or pathogenic. All or some of the method steps can be performed or supported by the data processing system, e.g. a computer.
  • This graph neural network approach operates on a heterogeneous graph with genes and gene variants. This graph is created by assigning gene variants to genes and connecting genes with an existing gene-gene interaction network. The invention improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction. The prediction of effects of new observed gene variants is possible with very high accuracy.
  • the provided or collected benign and pathogenic gene variants can be provided or collected from one or more databases comprising data or features of gene variants.
  • a large amount of gene variants can be used for realizing a high prediction accuracy by simple means.
  • labeling for each variant to which gene or genes it belongs can be based on suitable coordinates. Benign and pathogenic gene variants can be assigned to the closest gene or genes in a related genome. This simplifies the method and provides a realization of a high prediction accuracy.
  • the pre-definable rule can comprise connecting each gene to every other gene.
  • the pre-definable rule can comprise connecting each gene to one or more other genes which is or are connected to said gene based on one or more predefined biological interactions.
  • the one or more predefined biological interactions can simply be retrieved from a biological database or from a gene-gene interaction graph of a biological database.
  • At least one feature can be collected for at least one or each gene variant, wherein preferably the at least one feature can be the output of another variant prediction model that does not use a graph.
  • the at least one or each gene variant can be represented by a feature vector.
  • At least one feature can be collected for at least one or each gene.
  • at least one or each gene can be specified by such a feature.
  • At least one or each gene can be represented by an N-dimensional vector, wherein N is an integer. This provides a very simple and clear representation.
  • the N-dimensional vector can be a randomly initialized vector, which is optimized in the training step.
  • Such a type of vector is very suitable for effectively performing the method.
  • the N-dimensional vector can comprise at least one collected feature and/or be a concatenation of a randomly initialized vector, which is trainable, with one or more collected gene features. Such a type of vector is also very suitable for effectively performing the method.
  • each gene variant in the training set can have a definable label, e.g. 0 for benign and 1 for pathogenic.
  • one or more parameters of the graph neural network can be updated using gradient descent. This proceeding supports an increase of the likelihood for gene variants in the training set to obtain the correct label from the network.
  • an explanation for the prediction of a gene variant or variants can be provided by returning which other gene variant or gene variants and/or which gene or genes the graph neural network has utilized to arrive at the prediction, wherein preferably the impact can be provided, for example to an expert, that the gene variant or gene variants and/or gene or genes had on the prediction.
  • This proceeding provides a high degree of information to a user of the method.
  • a graph neural network approach operates on a heterogeneous graph with genes and gene variants.
  • the graph can be created by assigning variants to genes and connecting genes with an existing gene-gene interaction network.
  • the graph neural network can be trained to aggregate information between genes, and between genes and gene variants. Gene variants can exchange information via the genes they connect to. This method improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction.
  • all embodiments of the present invention provide a variant effect prediction with graph neural network, VEGN.
  • a graph can consist of a set of nodes and a set of edges, where an edge holds between two nodes.
  • a gene variant or variant is a genetic variation in a genome that differs from the reference genome. Such a variant can be identified as belonging to a certain gene or genes by assigning it to the nearest gene - or genes in the case of equal distance - in genome coordinates. Given a set of variants and the set of genes they belong to, the union of these sets is the set of nodes in a graph. For each variant there is an edge to the genes it belongs to. For edges between genes, we consider two options: (1) the edges are given as input, e.g. a domain expert labelled the edges; (2) we assume an edge exists from each gene to every other gene.
  • a graph neural network, GNN, with weights w can be trained.
  • each variant has a feature vector - e.g. predicted variant effect on transcription factor binding, on splicing, conservation score - and a classification label, e.g. 0 or 1 for benign or pathogenic.
  • the GNN itself can take various forms, e.g. it could be a graph attention network, see Velickovic et al. 2018, Graph Attention Networks. International Conference on Learning Representations, ICLR. Furthermore, we can learn one joint GNN or we could learn a different GNN depending on the edge type, e.g. a different network is learnt for gene-variant, variant-gene and gene-gene edges. Furthermore, if we assume that each gene has an edge to every other gene, then we learn the strength of each edge. This can be done with a fully connected neural network, e.g. using a Transformer, see Vaswani et al. 2017, Attention is All you Need, Neural Information Processing Systems, NeurIPS. The fully connected neural network can then be used for the edge type gene-gene, whereas a GNN can be used for other edge types. This allows us to combine a given graph structure and a learnt graph structure in one joint neural network.
  • for a variant v, VEGN predicts the probability of the variant being disease-causing (pathogenic): P(pathogenic).
  • the graph neural network model with weights w can be trained with standard stochastic gradient descent and a cross entropy loss function:
  • $L(w) = -\sum_{i=1}^{m} \left[ y_i \log P_i(\text{pathogenic}) + (1 - y_i) \log\left(1 - P_i(\text{pathogenic})\right) \right]$, where $y_i$ is the label of the variant $v_i$ in the training data, pathogenic being 1 and benign being 0, $P_i(\text{pathogenic})$ is the prediction for $v_i$, and $i$ is an integer.
  • Embodiments can formulate variant effect prediction as a graph via gene attachments and can learn a graph neural network.
  • Embodiments can learn an application specific gene-gene interaction graph.
  • Embodiments can combine a given graph structure with a learnt graph structure in one joint neural network.
  • Embodiments can explain a prediction of a variant by providing the variants and genes that were relevant and the impact they had on the prediction.
  • An embodiment can comprise a method for predicting what effect a human’s gene variant will have on their body.
  • the method can comprise the steps of:
  • each variant can be connected to one or more genes based on step
  • each gene can be either i. connected to every other gene ii. connected to the genes identified in step 3) if step 3) is present.
  • the feature could be the output of another variant prediction model that does not use a graph.
  • Each variant can be represented by the feature vector collected in step 5).
  • Each gene can be represented by an N-dimensional vector, which may be either one of the below or a concatenation: a. A randomly initialized vector, which can be optimized in the training process. b. The gene features collected in step 6). c. A concatenation of the randomly initialized vector, which is trainable, with gene features collected in step 6).
  • each variant in the training set can have a label, e.g. 0 for benign and 1 for pathogenic.
  • the model’s parameters can be updated using gradient descent in order to increase the likelihood for variants in the training set to obtain the correct label from the network.
  • Previous methods classify each variant in isolation. By treating the problem as a graph where variants are linked to each other via genes and by automatically learning a gene-gene network, embodiments of the present method can learn a graph neural network that greatly improves the accuracy of the variant prediction.
  • Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention
  • Fig. 2 shows in a diagram a further embodiment of the present invention
  • Fig. 3 shows in a block diagram a further embodiment of the present invention
  • Fig. 4 shows in a block diagram a further embodiment of the present invention.
  • Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention, concretely a VEGN.
  • the goal is to classify gene variants - in short form: variants - which are denoted by triangles. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. New variants are added to the graph via the gene they attach to. Given a new variant’s feature vector, the GNN classifies the new variants and can give an explanation of which other variants and genes were relevant for the classification.
  • Fig. 2 shows in a diagram a further embodiment of the present invention.
  • Here is shown a concrete instantiation with a different GNN for each edge type: The goal is to classify variants which are denoted by triangles, e.g. as benign 0 or pathogenic 1. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. This can either be one joint GNN or different GNNs can be learnt for different edges. E.g. for the three different edge types - “gene has variant”, “gene interacts with gene” and “variant in gene” - separate GNN layers are instantiated and learnt.
  • Arrows within a layer indicate the direction of information flow, where the hidden representation of the arrow's source is used to update the hidden representation of the arrow's target.
  • the arrows represent the weights of the GNN that is learnt and these weights are shared within this layer, i.e. for “variant in gene”, each variant has its own feature vector and to this the same GNN layer's weights are applied to update the target hidden representation.
  • the hidden representations of each layer are aggregated, e.g. by sum.
  • a classification layer, e.g. via a sigmoid function, determines the likelihood of a variant being benign or pathogenic.
  • weights can be updated via a loss function and backpropagation.
  • new variants can be added to the graph via the gene they attach to.
  • the learnt weights can be applied in a forward pass to derive a prediction.
  • VEGN or embodiments of the present invention can be used to prioritize a short list of variants for a clinician to inspect manually.
  • Fig. 3 shows in a block diagram such a further embodiment of the present invention.
  • patients first have their genome sequenced with whole genome sequencing or whole exome sequencing.
  • a list of variants is generated through variant calling on the sequencing data.
  • VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic).
  • the variants can then be sorted based on the score in descending order.
  • the top k variants, wherein k is an integer, are selected for further manual investigation by domain experts. The value of k depends on the available resources.
  • Neoantigens are antigens found specifically in tumor samples. They are products of tumor-specific variants. Due to the tumor-specificity of neoantigens, they are frequently used as targets for immunotherapy. Existing neoantigen selection pipelines typically do not consider the effects of variants. VEGN or embodiments of the present invention can help to prioritize and select the most biologically relevant variants.
  • Fig. 4 shows in a block diagram such a further embodiment of the present invention.
  • tumor samples are whole genome sequenced or whole exome sequenced.
  • a list of missense variants is generated through variant calling on the sequencing data.
  • VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic). The variants can then be sorted based on the score in descending order.
  • the predicted disease-causing probabilities are combined with other evidence in an existing neoantigen discovery pipeline to select for neoantigens.
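The sort-and-select step shared by the workflows of Fig. 3 and Fig. 4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `top_k_variants`, the variant identifiers, and the scores are hypothetical, and the scores stand in for P(pathogenic) values that a trained model would produce.

```python
def top_k_variants(scores, k):
    """Sort variants by predicted P(pathogenic), descending, and keep the top k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical scores; in practice these would come from the trained model.
scores = {"v1": 0.12, "v2": 0.91, "v3": 0.47, "v4": 0.78}
shortlist = top_k_variants(scores, k=2)
print(shortlist)
```

The value of k is left as a parameter because, as stated above, it depends on the resources available for manual inspection.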

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

For realizing a very high prediction accuracy by simple means a method for predicting an effect of a gene variant on an organism by means of a data processing system is provided, comprising the steps: providing or collecting benign and pathogenic gene variants; creating a variant-gene graph by connecting each gene variant to one or more genes to which it belongs and by connecting each gene to one or more other genes according to a pre-definable rule; training a graph neural network, GNN, on the variant-gene graph; and feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic. Further, a corresponding data processing system for carrying out the above method for predicting an effect of a gene variant on an organism is provided.

Description

A METHOD FOR PREDICTING AN EFFECT OF A GENE VARIANT ON AN ORGANISM BY MEANS OF A DATA PROCESSING SYSTEM AND A CORRESPONDING DATA PROCESSING SYSTEM
The present invention relates to a method for predicting an effect of a gene variant on an organism by means of a data processing system and a data processing system for carrying out this method.
Genetic mutations can cause disease by disrupting normal gene function. However, identifying the disease-causing mutations from millions of genetic variants within an individual patient is challenging. Computational methods which can prioritize disease-causing mutations have enormous applications. It is well known that genes function through a complex regulatory network.
Methods for predicting an effect of a gene variant on an organism by means of a data processing system and corresponding data processing systems are known from prior art. Corresponding prior art documents are listed as follows:
US 2016/0357903 A1 - A framework for determining the relative effect of genetic variants.
US 2015/0066378 A1 - Identifying Possible Disease-Causing Genetic Variants by Machine Learning Classification.
US 2019/0114547 A1 - Deep Learning-Based Splice Site Classification.
Further, US 2019/0139622 A1 discloses a method and a system for predicting effects of perturbations to an organism. The method discloses that a neural network is trained to classify the effects of perturbations to a gene or other features of the organism. After training, the graph neural network is configured to predict activity of a new strain having one or more modifications to the gene.
The prior art reference EP 3 514 798 A1 discloses a system for the prediction of genetic variants with a machine learning model. The prior art discloses an automated computational system for predicting information about genetic variants. The method comprises a microprocessor determining the functionality of each gene based on the genetic variant data and generating a weighted genetic network comprising the plurality of genes of the genome with connections between them. The method also comprises a regression model explaining the type of variant affecting genes.
The prior art reference WO 2016/172 464 A1 discloses a method for predicting gene dysfunction caused by a defined genetic mutation in the genome of an organism. This reference also discloses a variant gene graph and classifies the variant as either benign or pathogenic based on a trained machine learning model. This prior art does not disclose the feature of identifying a newly added variant category to be analyzed.
The prior art reference US 2016/0371431 A discloses a method of predicting the pathogenicity of genetic sequence variants. It also discloses that, after the machine learning model has been trained and has categorized each variant as disease-causing or not, it will identify or predict the pathogenicity of a newly added variant. The prior art reference does not disclose a gene interaction network.
It is an object of the present invention to improve and further develop a method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system for realizing a very high prediction accuracy by simple means.
In accordance with the invention, the aforementioned object is accomplished by a method for predicting an effect of a gene variant on an organism by means of a data processing system, comprising the following steps:
- providing or collecting benign and pathogenic gene variants;
- creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;
- training a graph neural network, GNN, on the variant-gene graph; and
- feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.
Further, the aforementioned object is accomplished by a data processing system for carrying out the method for predicting an effect of a gene variant on an organism, comprising:
- providing or collecting means for providing or collecting benign and pathogenic gene variants;
- creating means for creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;
- training means for training a graph neural network, GNN, on the variant-gene graph; and
- feeding means for feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.
According to the invention it has been recognized that it is possible to realize a very high prediction accuracy by simply providing a particularly suitable graph neural network model and training set and proceeding. Firstly, benign and pathogenic gene variants are provided or collected from a suitable source. This means that relevant data and/or features of such gene variants are provided or collected for further use in the method. In a next step a suitable variant-gene graph is created by a) connecting each gene variant to one or more genes to which this gene variant belongs and by b) connecting each gene to one or more other genes according to a pre-definable rule. Then, with such a variant-gene graph a graph neural network is trained. In a last step a new or unknown gene variant is fed to the graph neural network for predicting by the graph neural network whether the new or unknown gene variant is benign or pathogenic. All or some of the method steps can be performed or supported by the data processing system, e.g. a computer. This graph neural network approach operates on a heterogeneous graph with genes and gene variants. This graph is created by assigning gene variants to genes and connecting genes with an existing gene-gene interaction network. The invention improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction. The prediction of effects of new observed gene variants is possible with very high accuracy.
Thus, on the basis of the invention a method and system are provided which realize a very high prediction accuracy by simple means.
According to an embodiment of the invention the provided or collected benign and pathogenic gene variants can be provided or collected from one or more databases comprising data or features of gene variants. There can be a definable communication between the data processing system and one or more database for the step of providing or collecting the gene variants. Thus, a large amount of gene variants can be used for realizing a high prediction accuracy by simple means.
Within a further embodiment labeling for each variant to which gene or genes it belongs can be based on suitable coordinates. Benign and pathogenic gene variants can be assigned to the closest gene or genes in a related genome. This simplifies the method and provides a realization of a high prediction accuracy.
According to a further embodiment the pre-definable rule can comprise connecting each gene to every other gene. Alternatively, the pre-definable rule can comprise connecting each gene to one or more other genes which is or are connected to said gene based on one or more predefined biological interactions. Preferably, the one or more predefined biological interactions can simply be retrieved from a biological database or from a gene-gene interaction graph of a biological database.
Within a further embodiment at least one feature can be collected for at least one or each gene variant, wherein preferably the at least one feature can be the output of another variant prediction model that does not use a graph. According to a further embodiment and for providing a simple and effective representation the at least one or each gene variant can be represented by a feature vector.
Within a further embodiment at least one feature can be collected for at least one or each gene. As a result, at least one or each gene can be specified by such a feature.
According to a further embodiment at least one or each gene can be represented by an N-dimensional vector, wherein N is an integer. This provides a very simple and clear representation.
Within a further embodiment the N-dimensional vector can be a randomly initialized vector, which is optimized in the training step. Such a type of vector is very suitable for effectively performing the method.
Within a further embodiment the N-dimensional vector can comprise at least one collected feature and/or be a concatenation of a randomly initialized vector, which is trainable, with one or more collected gene features. Such a type of vector is also very suitable for effectively performing the method.
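The concatenated gene representation can be illustrated with a short sketch. Everything named here is an assumption for illustration: the function `init_gene_vectors`, the embedding size, the Gaussian initialization, and the two example gene features are hypothetical and are not specified by the text.

```python
import random


def init_gene_vectors(gene_features, embed_dim, seed=0):
    """Build one N-dimensional vector per gene.

    Each vector is a randomly initialized part (which would be trainable)
    concatenated with the collected gene features, so that
    N = embed_dim + number of collected features.
    """
    rng = random.Random(seed)
    vectors = {}
    for gene, feats in gene_features.items():
        trainable = [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
        vectors[gene] = trainable + list(feats)
    return vectors


# Hypothetical collected features (two per gene).
gene_features = {"geneA": [0.8, 1.0], "geneB": [0.1, 0.0]}
vectors = init_gene_vectors(gene_features, embed_dim=4)
print(len(vectors["geneA"]))
```

In a training loop only the first `embed_dim` entries would be updated, while the collected features stay fixed.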
According to a further embodiment, for a training set in the training step, each gene variant in the training set can have a definable label, e.g. 0 for benign and 1 for pathogenic. By means of such a label very efficient and structured prediction with high prediction accuracy is possible.
Within a further embodiment in the training step one or more parameters of the graph neural network can be updated using gradient descent. This proceeding supports an increase of the likelihood for gene variants in the training set to obtain the correct label from the network.
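The effect of gradient descent on labelled variants can be shown with a deliberately simplified sketch. As an assumption made purely for brevity, a single-layer logistic classifier over variant feature vectors stands in for the graph neural network, and full-batch rather than stochastic gradient descent is used; the feature values, learning rate, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability that a variant with features x is pathogenic."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def cross_entropy(weights, bias, X, y):
    """Mean binary cross-entropy over the training variants."""
    total = 0.0
    for x, label in zip(X, y):
        p = predict(weights, bias, x)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(X)


def train(X, y, lr=0.5, steps=200):
    """Full-batch gradient descent on the cross-entropy loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict(weights, bias, x) - label  # d(loss)/d(logit)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Hypothetical variant features; labels: 0 = benign, 1 = pathogenic.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
loss_before = cross_entropy([0.0, 0.0], 0.0, X, y)
weights, bias = train(X, y)
loss_after = cross_entropy(weights, bias, X, y)
print(loss_before, loss_after)
```

Each update lowers the loss, which is equivalent to raising the likelihood that training variants receive their correct label.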
According to a further embodiment and for further enhancing the prediction accuracy an explanation for the prediction of a gene variant or variants can be provided by returning which other gene variant or gene variants and/or which gene or genes the graph neural network has utilized to arrive at the prediction, wherein preferably the impact can be provided, for example to an expert, that the gene variant or gene variants and/or gene or genes had on the prediction. This proceeding provides a high degree of information to a user of the method.
Advantages and aspects of embodiments of the present invention are summarized and further explained as follows:
According to embodiments of the present invention a graph neural network approach is proposed that operates on a heterogeneous graph with genes and gene variants. The graph can be created by assigning variants to genes and connecting genes with an existing gene-gene interaction network. The graph neural network can be trained to aggregate information between genes, and between genes and gene variants. Gene variants can exchange information via the genes they connect to. This method improves the prediction accuracy and allows experts to interpret the prediction by inspecting which gene variants and genes had a large effect on a prediction.
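How gene variants can exchange information via the genes they connect to can be illustrated with a toy message-passing round. This sketch makes a strong simplifying assumption: an unweighted mean replaces the learnt aggregation, whereas a trained GNN would apply learnt weights. The node names, the one-dimensional hidden states, and the function `mean_aggregate` are hypothetical.

```python
def mean_aggregate(hidden, neighbors):
    """One message-passing round: average a node's state with its neighbors' states."""
    updated = {}
    for node, h in hidden.items():
        msgs = [hidden[n] for n in neighbors.get(node, [])] + [h]
        updated[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated


# Two variants attached to the same gene; only v1 starts with a non-zero signal.
hidden = {"v1": [1.0], "v2": [0.0], "geneA": [0.0]}
neighbors = {"v1": ["geneA"], "v2": ["geneA"], "geneA": ["v1", "v2"]}

round1 = mean_aggregate(hidden, neighbors)
round2 = mean_aggregate(round1, neighbors)
print(round1["v2"], round2["v2"])
```

After one round the signal from v1 has reached geneA but not v2; after the second round it has reached v2, which is the two-hop path variant to gene to variant.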
Generally, all embodiments of the present invention provide a variant effect prediction with graph neural network, VEGN.
Predicting variant effects is a long-standing problem in genetics. Previous Artificial Intelligence, AI, systems that perform variant effect prediction do this by predicting each gene variant in isolation. Even though work has been done to interpret variant effects in the context of biological regulatory networks, no existing method can effectively integrate gene-variant and gene-gene networks together to predict the effect of newly observed gene variants. In embodiments of the present invention, we learn the high order relationship between variants and between genes and variants with a graph neural network. Compared to existing approaches, VEGN has two advantages. First, VEGN considers variant effects in the context of a biological network instead of as isolated events. Such an approach enables VEGN to capture potential remote - trans - effects of variants on indirectly connected genes. Second, existing annotations of disease variants are sparse and are focused on some well-studied genes. VEGN enables information flow from well-studied genes to less-studied genes through the biological network. It also considers the correlated disease-causing status of variants within the same functional module in a biological network. The present approach greatly improves the prediction accuracy compared to previous methods.
In embodiments of this invention, we formulate variant effect prediction as a graph. A graph can consist of a set of nodes and a set of edges, where each edge holds between two nodes. A gene variant or variant is a genetic variation in a genome that differs from the reference genome. Such a variant can be identified to belong to a certain gene or genes by assigning it to the nearest gene - or genes in the case of equal distance - in genome coordinates. Given a set of variants and the set of genes they belong to, the union of these sets is the set of nodes in a graph. For each variant there is an edge to the genes it belongs to. For edges between genes, we consider two options: (1) the edges are given as input, e.g. a domain expert labelled the edges; (2) we assume an edge exists from each gene to every other gene.
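The graph construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: each gene is reduced to a single hypothetical coordinate on one chromosome, and the names `nearest_genes` and `build_graph` as well as the toy coordinates are invented for this example and do not come from the original disclosure.

```python
from itertools import permutations

def nearest_genes(variant_pos, gene_positions):
    """Return every gene at minimal distance to the variant (ties keep all genes)."""
    distances = {g: abs(pos - variant_pos) for g, pos in gene_positions.items()}
    best = min(distances.values())
    return sorted(g for g, d in distances.items() if d == best)

def build_graph(variant_positions, gene_positions, gene_edges=None):
    """Build the node and edge sets of a variant-gene graph.

    gene_edges: optional iterable of (gene, gene) pairs given as input (option 1).
    If None, every gene is connected to every other gene (option 2).
    """
    variant_gene = [(v, g)
                    for v, pos in sorted(variant_positions.items())
                    for g in nearest_genes(pos, gene_positions)]
    genes = sorted({g for _, g in variant_gene})
    if gene_edges is None:
        gene_gene = list(permutations(genes, 2))
    else:
        gene_gene = [(a, b) for a, b in gene_edges if a in genes and b in genes]
    nodes = sorted(variant_positions) + genes
    return nodes, variant_gene, gene_gene

# Hypothetical toy coordinates; v2 is equally distant from G1 and G2.
genes = {"G1": 100, "G2": 200, "G3": 900}
variants = {"v1": 110, "v2": 150, "v3": 880}
nodes, vg_edges, gg_edges = build_graph(variants, genes)
```

In this toy input the variant `v2` lies at equal distance from `G1` and `G2` and is therefore attached to both, which corresponds to the equal-distance case mentioned in the text.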
Based on this graph, a graph neural network, GNN, with weights w can be trained. For this, we assume we are given a training set where each variant has a feature vector - e.g. predicted variant effect on transcription factor binding, on splicing, conservation score - and a classification label, e.g. 0 or 1 for benign or pathogenic. For each variant in the training data, we can utilize the associated classification label to define a loss - e.g. binary cross entropy loss - and use - stochastic - gradient descent and backpropagation to update the weights w of the GNN. Once trained, new variants can be added to the graph and applying the GNN will classify the variant, e.g. as benign or pathogenic.
The GNN itself can take various forms, e.g. it could be a graph attention network, see Velickovic et al. 2018, Graph Attention Networks, International Conference on Learning Representations, ICLR. Furthermore, we can learn one joint GNN or we could learn a different GNN depending on the edge type, e.g. a different network is learnt for gene-variant, variant-gene and gene-gene edges. Furthermore, if we assume that each gene has an edge to every other gene, then we learn the strength of each edge. This can be done with a fully connected neural network, e.g. using a Transformer, see Vaswani et al. 2017, Attention is All you Need, Neural Information Processing Systems, NeurIPS. The fully connected neural network can then be used for the edge type gene-gene, whereas a GNN can be used for other edge types. This allows us to combine a given graph structure and a learnt graph structure in one joint neural network.
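As an illustration of learning the strength of each gene-gene edge when every gene is connected to every other gene, the following sketch computes scaled dot-product attention weights between gene vectors, in the spirit of a Transformer. It is a simplification and an assumption of this example: the learned query and key projections of a real Transformer are omitted, and the function name `edge_strengths` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def edge_strengths(embeddings):
    """Attention weight for every ordered pair of distinct genes.

    For each source gene, the weights over all other genes sum to 1.
    """
    names = sorted(embeddings)
    dim = len(embeddings[names[0]])
    strengths = {}
    for src in names:
        others = [g for g in names if g != src]
        scores = [sum(a * b for a, b in zip(embeddings[src], embeddings[dst]))
                  / math.sqrt(dim)
                  for dst in others]
        for dst, weight in zip(others, softmax(scores)):
            strengths[(src, dst)] = weight
    return strengths

# Hypothetical gene vectors; in training these would be updated by gradient descent.
gene_vectors = {"G1": [1.0, 0.0], "G2": [0.9, 0.1], "G3": [-1.0, 0.5]}
strengths = edge_strengths(gene_vectors)
```

Because the weights are a differentiable function of the gene vectors, they could be trained jointly with the remaining network, which is one way to obtain a learnt gene-gene structure.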
When classifying a new variant, information flows via the graph neural network from the variant’s feature vector to the gene the variant is attached to and from there to other parts of the graph. This enables us to track which genes and other variants had an influence on the prediction. This can be done by using an explanation method suitable for GNNs, e.g. GNNExplainer, see Ying et al. 2019, GNNExplainer: Generating Explanations for Graph Neural Networks, Neural Information Processing Systems, NeurIPS. This is a powerful advantage of embodiments of our invention because it allows us to explain the model’s variant effect prediction to a domain expert by providing the information on which variants and genes had an impact on the prediction. This information may help the domain experts to discover new disease associated genes or non-additive effects of variant combinations.
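The idea of reporting which genes and variants influenced a prediction can be illustrated with a much simpler technique than GNNExplainer, namely leave-one-out occlusion: each neighbouring node is removed in turn and the change in the predicted probability is recorded. The sketch below is not GNNExplainer and not the method of the disclosure; the toy predictor `predict`, the scalar features and all names are assumptions made for illustration only.

```python
import math

def predict(target, features, neighbors, weight=1.0):
    """Toy stand-in for a trained GNN: sigmoid of the target's own feature
    plus the mean feature of its neighbours."""
    msgs = [features[n] for n in neighbors]
    agg = sum(msgs) / len(msgs) if msgs else 0.0
    return 1.0 / (1.0 + math.exp(-weight * (features[target] + agg)))

def occlusion_impacts(target, features, neighbors):
    """Impact of a neighbour = full prediction minus prediction without it."""
    full = predict(target, features, neighbors)
    return {n: full - predict(target, features, [m for m in neighbors if m != n])
            for n in neighbors}

# Hypothetical scalar features for a new variant, its gene and two other variants.
features = {"v_new": 0.2, "G1": 1.5, "v1": -0.5, "v2": 2.0}
impacts = occlusion_impacts("v_new", features, ["G1", "v1", "v2"])
```

A positive impact means the node pushed the prediction towards pathogenic, a negative impact means it pushed the prediction towards benign. Such a ranking is the kind of information that could be shown to a domain expert.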
For each variant v VEGN predicts a probability of the variant to be disease-causing (pathogenic): P (pathogenic). The graph neural network model with weights w can be trained with standard stochastic gradient descent and a cross entropy loss function:
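$$\mathcal{L}(w) = -\sum_{i=1}^{m} \left[ y_i \log P_i(\text{pathogenic}) + (1 - y_i) \log\left(1 - P_i(\text{pathogenic})\right) \right],$$
where $y_i$ is the label of the variant $v_i$ in the training data, pathogenic being 1 and benign being 0, $P_i(\text{pathogenic})$ is the prediction for $v_i$, $i$ is an integer index and $m$ is the number of variants in the training data.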
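To make the loss and the gradient-based update concrete, the following sketch evaluates a binary cross entropy summed over the training variants, with the conventional leading minus sign, and performs plain gradient descent. As an assumption of this example, the graph neural network is replaced by a one-feature logistic model p = sigmoid(w·x + b), so that the gradient can be written by hand; the function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Summed binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gradient_step(w, b, xs, labels, lr=0.1):
    """One gradient-descent step for the stand-in model p = sigmoid(w*x + b)."""
    grad_w = grad_b = 0.0
    for x, y in zip(xs, labels):
        p = sigmoid(w * x + b)
        grad_w += (p - y) * x   # derivative of the loss with respect to w
        grad_b += (p - y)       # derivative of the loss with respect to b
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical one-dimensional variant features with labels 0 (benign), 1 (pathogenic).
xs, labels = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w, b = 0.0, 0.0
initial_loss = binary_cross_entropy(labels, [sigmoid(w * x + b) for x in xs])
for _ in range(50):
    w, b = gradient_step(w, b, xs, labels)
final_loss = binary_cross_entropy(labels, [sigmoid(w * x + b) for x in xs])
```

In a real setting the gradients with respect to the GNN weights would be obtained by backpropagation rather than by a hand-written formula, and the updates would typically be computed on mini-batches.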
Validating our method empirically shows large improvements over the previous state of the art, both in terms of average precision and area under the curve:
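[Table: average precision and area under the curve of the present method compared with the previous state of the art; provided as image imgf000010_0001 in the original publication.]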
Further advantages and aspects of embodiments of the present invention can be summarized as follows:
1) Embodiments can formulate variant effect prediction as a graph via gene attachments and can learn a graph neural network.
2) Embodiments can learn an application specific gene-gene interaction graph.
3) Embodiments can combine a given graph structure with a learnt graph structure in one joint neural network.
4) Embodiments can explain a prediction of a variant by providing the variants and genes that contributed to the prediction and the impact they had on it.
Further aspects and advantages of embodiments of the method and data processing system according to the present invention can be summarized as follows:
An embodiment can comprise a method for predicting what effect a human’s gene variant will have on their body. The method can comprise the steps of:
1) Collecting existing benign and pathogenic variants from databases.
2) Labeling for each variant to which genes it belongs, based on the coordinate or coordinates of the genes. Variants can be assigned to the closest genes in the genome.
3) Optional: For each gene, labeling which other genes are connected to it based on biological interactions, e.g. retrieved from a gene-gene interaction graph of a biological database.
4) Creating a variant-gene graph, where:
   a. each variant can be connected to one or more genes based on step 2)
   b. each gene can be either
      i. connected to every other gene
      ii. connected to the genes identified in step 3) if step 3) is present.
5) Collecting features for each variant. For example, the feature could be the output of another variant prediction model that does not use a graph.
6) Optional: Collecting features for each gene.
7) Each variant can be represented by the feature vector collected in step 5).
8) Each gene can be represented by an N dimensional vector, which may be either one of the below or a concatenation:
   a. A randomly initialized vector, which can be optimized in the training process.
   b. The gene features collected in step 6).
   c. A concatenation of the randomly initialized vector, which is trainable, with gene features collected in step 6).
9) Training a graph neural network model on the graph defined in step 4), where
   a. for a training set, each variant in the training set can have a label, e.g. 0 for benign and 1 for pathogenic.
   b. the model’s parameters can be updated using gradient descent in order to increase the likelihood for variants in the training set to obtain the correct label from the network.
10) Once the model is trained, giving a new, previously unseen variant to the model to have the model predict whether the variant is benign or pathogenic.
11) Optional: Providing an explanation for the prediction by returning which other variants and which genes the model utilized to arrive at the prediction.
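The gene representation of step 8), option c, can be sketched as follows: a randomly initialized part, which would be updated during training, is concatenated with the collected gene features. This is a minimal sketch; the function name `init_gene_vectors`, the Gaussian initialization with standard deviation 0.1 and the toy feature values are assumptions of this example.

```python
import random

def init_gene_vectors(gene_features, embed_dim, seed=0):
    """Return one N dimensional vector per gene, with N = embed_dim + number of
    collected features. The first embed_dim entries are randomly initialized
    and meant to be trainable; the remaining entries are the fixed features."""
    rng = random.Random(seed)
    vectors = {}
    for gene in sorted(gene_features):
        trainable = [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
        vectors[gene] = trainable + list(gene_features[gene])
    return vectors

# Hypothetical gene features collected in step 6), two per gene.
gene_features = {"G1": [0.3, 1.0], "G2": [0.7, 0.0]}
gene_vectors = init_gene_vectors(gene_features, embed_dim=4)
```

With `embed_dim=0` the same function yields option b (features only), and with empty feature lists it yields option a (a purely trainable vector).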
Previous methods classify each variant in isolation. By treating the problem as a graph where variants are linked to each other via genes and by automatically learning a gene-gene network, embodiments of the present method can learn a graph neural network that greatly improves the accuracy of the variant prediction.
There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the following explanation of examples of embodiments of the invention, illustrated by the drawing. In the drawing
Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention,
Fig. 2 shows in a diagram a further embodiment of the present invention,
Fig. 3 shows in a block diagram a further embodiment of the present invention, and
Fig. 4 shows in a block diagram a further embodiment of the present invention.
Fig. 1 shows in a diagram the overall architecture of an embodiment of the present invention, concretely a VEGN. The goal is to classify gene variants - in short form: variants - which are denoted by triangles. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. New variants are added to the graph via the gene they attach to. Given a new variant’s feature vector, the GNN classifies the new variants and can give an explanation of which other variants and genes were relevant for the classification.
Fig. 2 shows in a diagram a further embodiment of the present invention. Here, a concrete instantiation with a different GNN for each edge type is shown: The goal is to classify variants, which are denoted by triangles, e.g. as benign 0 or pathogenic 1. Variants are associated with a gene, denoted by circles, and a gene-gene network is either given or learnt. Based on this, a GNN can be learnt. This can either be one joint GNN or different GNNs can be learnt for different edges. E.g. for the three different edge types - “gene has variant”, “gene interacts with gene” and “variant in gene” - separate GNN layers are instantiated and learnt. Arrows within a layer indicate the direction of information flow, where the hidden representation of the arrow's source is used to update the hidden representation of the arrow's target. Within a layer the arrows represent the weights of the GNN that is learnt and these weights are shared within this layer, i.e. for “variant in gene”, each variant has its own feature vector and to this the same GNN layer's weights are applied to update the target hidden representation. The hidden representations of each layer are aggregated, e.g. by sum. Finally, there is one further GNN layer where information flows from the gene to the variant. Based on this update, a classification layer, e.g. via a sigmoid function, determines the likelihood of a variant being benign or pathogenic. During training, the true label of a variant v is observed and weights can be updated via a loss function and backpropagation. During test time, new variants can be added to the graph via the gene they attach to. Based on the features associated with the variant, the learnt weights can be applied in a forward pass to derive a prediction.
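The forward pass described for Fig. 2 can be sketched in plain Python as follows. This is only one possible reading under several assumptions that are not taken from the disclosure: variants and genes share the same hidden dimension, messages are averaged over incoming edges, each node keeps its own representation when the per-edge-type results are summed, and a tanh nonlinearity is applied. All names and toy values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def message_pass(W, edges, h_src, targets, dim):
    """One edge-type layer with shared weights W: each target node receives
    the mean of W @ h_src[s] over its incoming edges (s, target)."""
    out = {}
    for t in targets:
        msgs = [matvec(W, h_src[s]) for s, dst in edges if dst == t]
        out[t] = ([sum(col) / len(msgs) for col in zip(*msgs)]
                  if msgs else [0.0] * dim)
    return out

def forward(h_var, h_gene, variant_in_gene, gene_gene, weights):
    dim = len(next(iter(h_gene.values())))
    gene_has_variant = [(g, v) for v, g in variant_in_gene]
    # One layer per edge type; weights are shared across edges of that type.
    from_vars = message_pass(weights["variant_in_gene"], variant_in_gene, h_var, h_gene, dim)
    from_genes = message_pass(weights["gene_gene"], gene_gene, h_gene, h_gene, dim)
    to_vars = message_pass(weights["gene_has_variant"], gene_has_variant, h_gene, h_var, dim)
    # Aggregate the per-edge-type representations by sum.
    new_gene = {g: [math.tanh(x) for x in
                    vec_add(vec_add(h_gene[g], from_vars[g]), from_genes[g])]
                for g in h_gene}
    new_var = {v: [math.tanh(x) for x in vec_add(h_var[v], to_vars[v])]
               for v in h_var}
    # Final layer: information flows from the updated genes to the variants.
    final = message_pass(weights["final"], gene_has_variant, new_gene, new_var, dim)
    probs = {}
    for v in new_var:
        h = vec_add(new_var[v], final[v])
        logit = sum(w * x for w, x in zip(weights["out"], h))
        probs[v] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid classification layer
    return probs

identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"variant_in_gene": identity, "gene_gene": identity,
           "gene_has_variant": identity, "final": identity, "out": [1.0, -1.0]}
h_var = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
h_gene = {"G1": [0.1, 0.1], "G2": [-0.1, 0.2]}
probs = forward(h_var, h_gene, [("v1", "G1"), ("v2", "G2")],
                [("G1", "G2"), ("G2", "G1")], weights)
```

At test time a new variant would be added by inserting its feature vector into `h_var` and its gene attachment into the `variant_in_gene` edge list before calling `forward` with the learnt weights.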
Further embodiments:
1. Genetic diagnostics for patients
Each individual has millions of genetic variants. Even though such variants can be identified with high-throughput sequencing and bioinformatics variant calling methods, it is challenging to prioritize potential disease-causing variants. VEGN or embodiments of the present invention can be used to prioritize a short list of variants for a clinician to manually inspect.
Fig. 3 shows in a block diagram such a further embodiment of the present invention. In genetic diagnosis, patients first have their genome sequenced with whole genome sequencing or whole exome sequencing. A list of variants is generated through variant calling on the sequencing data. VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic). The variants can then be sorted based on the score in descending order. The top k variants, wherein k is an integer, are selected for further manual investigation by domain experts. The value of k depends on the available resources.
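The prioritization step described above amounts to sorting the variants by their predicted score in descending order and keeping the first k. A minimal sketch, assuming the scores are available as a dictionary; the function name `top_k_variants` and the toy scores are hypothetical, and ties are broken by variant identifier purely to make the result deterministic.

```python
def top_k_variants(scores, k):
    """Return the k variants with the highest P(pathogenic), highest first."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical disease-relevance scores for four called variants.
scores = {"v1": 0.12, "v2": 0.93, "v3": 0.55, "v4": 0.93}
shortlist = top_k_variants(scores, k=2)
```

The value of k would be chosen according to how many variants the domain experts can inspect manually.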
2. Neoantigen selection
Neoantigens are antigens found specifically in tumor samples. They are products of tumor-specific variants. Due to the tumor-specificity of neoantigens, they are frequently used as targets for immunotherapy. Existing neoantigen selection pipelines typically do not consider the effects of variants. VEGN or embodiments of the present invention can help to prioritize and select the most biologically relevant variants.
Fig. 4 shows in a block diagram such a further embodiment of the present invention. In neoantigen discovery, tumor samples are whole genome sequenced or whole exome sequenced. A list of missense variants is generated through variant calling on the sequencing data. VEGN or embodiments of the present invention can be applied to each of the variants to predict a disease-relevance score P(pathogenic). The variants can then be sorted based on the score in descending order. The predicted disease-causing probabilities are combined with other evidence in an existing neoantigen discovery pipeline to select neoantigens.
Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A method for predicting an effect of a gene variant on an organism by means of a data processing system, comprising the following steps:
- providing or collecting benign and pathogenic gene variants;
- creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;
- training a graph neural network, GNN, on the variant-gene graph; and
- feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.
2. A method according to claim 1, wherein the provided or collected benign and pathogenic gene variants are provided or collected from one or more databases.
3. A method according to claim 1 or 2, wherein the benign and pathogenic gene variants are assigned to the closest gene or genes in a related genome.
4. A method according to one of claims 1 to 3, wherein the pre-definable rule comprises connecting each gene to every other gene or connecting each gene to one or more other genes which is or are connected to said gene based on one or more predefined biological interactions, wherein preferably the one or more predefined biological interactions are retrieved from a biological database or from a gene-gene interaction graph of a biological database.
5. A method according to one of claims 1 to 4, wherein at least one feature is collected for at least one or each gene variant, wherein preferably the at least one feature is the output of another variant prediction model that does not use a graph.
6. A method according to one of claims 1 to 5, wherein at least one or each gene variant is represented by a feature vector.
7. A method according to one of claims 1 to 6, wherein at least one feature is collected for at least one or each gene.
8. A method according to one of claims 1 to 7, wherein at least one or each gene is represented by a N dimensional vector.
9. A method according to claim 8, wherein the N dimensional vector is a randomly initialized vector, which is optimized in the training step.
10. A method according to claim 8 or 9, wherein the N dimensional vector comprises at least one collected feature.
11. A method according to one of claims 8 to 10, wherein the N dimensional vector is a concatenation of a randomly initialized vector, which is trainable, with one or more collected gene features.
12. A method according to one of claims 1 to 11, wherein for a training set in the training step, each gene variant in the training set has a definable label, e.g. 0 for benign and 1 for pathogenic.
13. A method according to one of claims 1 to 12, wherein in the training step one or more parameters of the graph neural network are updated using gradient descent.
14. A method according to one of claims 1 to 13, wherein an explanation for the prediction of a gene variant or variants is provided by returning which other gene variant or gene variants and/or which gene or genes the graph neural network has utilized to arrive at the prediction, wherein preferably the impact is provided that the gene variant or gene variants and/or gene or genes had on the prediction.
15. A data processing system for carrying out the method for predicting an effect of a gene variant on an organism according to any one of claims 1 to 14, comprising:
- providing or collecting means for providing or collecting benign and pathogenic gene variants;
- creating means for creating a variant-gene graph by connecting each gene variant to one or more genes to which said gene variant belongs and by connecting each gene to one or more other genes according to a pre-definable rule;
- training means for training a graph neural network, GNN, on the variant-gene graph; and
- feeding means for feeding a new gene variant to the graph neural network for predicting by the graph neural network whether the new gene variant is benign or pathogenic.
PCT/EP2021/059567 2021-04-13 2021-04-13 A method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system WO2022218509A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/059567 WO2022218509A1 (en) 2021-04-13 2021-04-13 A method for predicting an effect of a gene variant on an organism by means of a data processing system and a corresponding data processing system

Publications (1)

Publication Number Publication Date
WO2022218509A1 true WO2022218509A1 (en) 2022-10-20

Family

ID=75674774

Country Status (1)

Country Link
WO (1) WO2022218509A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3514798A1 (en) 2011-10-31 2019-07-24 The Scripps Research Institute Systems and methods for genomic annotation and distributed variant interpretation
US20150066378A1 (en) 2013-08-27 2015-03-05 Tute Genomics Identifying Possible Disease-Causing Genetic Variants by Machine Learning Classification
US20160357903A1 (en) 2013-09-20 2016-12-08 University Of Washington Through Its Center For Commercialization A framework for determining the relative effect of genetic variants
WO2016172464A1 (en) 2015-04-22 2016-10-27 Genepeeks, Inc. Device, system and method for assessing risk of variant-specific gene dysfunction
US20160371431A1 (en) 2015-06-22 2016-12-22 Counsyl, Inc. Methods of predicting pathogenicity of genetic sequence variants
WO2016209999A1 (en) * 2015-06-22 2016-12-29 Counsyl, Inc. Methods of predicting pathogenicity of genetic sequence variants
US20190139622A1 (en) 2017-08-03 2019-05-09 Zymergen, Inc. Graph neural networks for representing microorganisms
US20190114547A1 (en) 2017-10-16 2019-04-18 Illumina, Inc. Deep Learning-Based Splice Site Classification

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
CHEREDA HRYHORII ET AL: "Explaining decisions of graph convolutional neural networks: patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer", GENOME MEDICINE, vol. 13, no. 1, 11 March 2021 (2021-03-11), pages 42, XP055872471, Retrieved from the Internet <URL:https://genomemedicine.biomedcentral.com/track/pdf/10.1186/s13073-021-00845-7.pdf> [retrieved on 20211214], DOI: 10.1186/s13073-021-00845-7 *
ERASLAN GÖKCEN ET AL: "Deep learning: new computational modelling techniques for genomics", NATURE REVIEWS GENETICS, NATURE PUBLISHING GROUP, GB, vol. 20, no. 7, 10 April 2019 (2019-04-10), pages 389 - 403, XP036813365, ISSN: 1471-0056, [retrieved on 20190410], DOI: 10.1038/S41576-019-0122-6 *
KRZYSZTOF CHOROMANSKI ET AL: "Rethinking Attention with Performers", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 March 2021 (2021-03-09), XP081897794 *
PETAR VELIČKOVIĆ ET AL: "GRAPH ATTENTION NETWORKS", 4 February 2018 (2018-02-04), XP055703475, Retrieved from the Internet <URL:https://arxiv.org/pdf/1710.10903.pdf> [retrieved on 20200610] *
SCHULTE-SASSE ROMAN ET AL: "Graph Convolutional Networks Improve the Prediction of Cancer Driver Genes", 9 September 2019, ADVANCES IN DATABASES AND INFORMATION SYSTEMS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 658 - 668, ISBN: 978-3-319-10403-4, XP047520829 *
SHARAD VIKRAM ET AL: "SSCM: A method to analyze and predict the pathogenicity of sequence variants", BIORXIV, 26 June 2015 (2015-06-26), XP055546969, Retrieved from the Internet <URL:https://www.biorxiv.org/content/biorxiv/early/2015/06/26/021527.full.pdf> [retrieved on 20211214], DOI: 10.1101/021527 *
SUNDARAM LAKSSHMAN ET AL: "Predicting the clinical impact of human mutation with deep neural networks", NATURE GENETICS, NATURE PUBLISHING GROUP US, NEW YORK, vol. 50, no. 8, 23 July 2018 (2018-07-23), pages 1161 - 1170, XP036902750, ISSN: 1061-4036, [retrieved on 20180723], DOI: 10.1038/S41588-018-0167-Z *
TIANWEI YUE ET AL: "Deep Learning for Genomics: A Concise Overview", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 February 2018 (2018-02-02), XP080857057 *
VASWANI ET AL.: "Attention is All you Need", NEURAL INFORMATION PROCESSING SYSTEMS, NEURIPS, 2017
VELICKOVIC ET AL.: "Graph Attention Networks", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, ICLR, 2018
YING ET AL.: "GNNExplainer: Generating Explanations for Graph Neural Networks", NEURAL INFORMATION PROCESSING SYSTEMS, NEURIPS, 2019

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21721407

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21721407

Country of ref document: EP

Kind code of ref document: A1