CN111489788A

CN111489788A - Deep association kernel learning technology for explaining the genetic relationships of complex diseases

Info

Publication number: CN111489788A
Application number: CN202010229815.2A
Authority: CN
Inventors: 邓岳; 鲍峰; 王勃
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2020-08-04
Anticipated expiration: 2040-03-27
Also published as: CN111489788B

Abstract

The invention discloses a deep association kernel learning (DAK) technology for explaining the genetic relationships of complex diseases, comprising a pathway grouping module, a gene encoding module, a convolutional layer module, a kernel machine regression module, and a gradient backpropagation algorithm module. The pathway grouping module groups variants that belong to the same biological pathway; the gene encoding module one-hot encodes the alleles in each group of variants; the convolutional layer module identifies causal regions in the alleles and encodes causal loci as deep features; the kernel machine regression module performs regression by a deep learning method, extracts pathway-level features, and computes their statistical significance; and the gradient backpropagation algorithm module optimizes and updates the deep network parameters and feeds them back to the convolutional layer module. The invention overcomes limitations of conventional techniques, enables detection of complex associations between genes, and enhances the interpretability of GWAS.

Description

Deep association kernel learning technology for explaining the genetic relationships of complex diseases

Technical Field

The invention relates to the technical field of genetic engineering, and in particular to a deep association kernel learning technology for explaining the genetic relationships of complex diseases.

Background

Genetic mutations cause complex diseases in many different ways, and comprehensively determining genetic causality can provide valuable insight into the development and treatment of disease. However, existing genome-wide association study (GWAS) methods generally rely on linear assumptions and simple disease models, which limits their generality in discovering complex causal relationships. Meanwhile, with the development of deep learning, deep learning methods are commonly used in genomics as "black box" tools to solve problems that conventional techniques cannot, but the principles underlying their behavior remain unexplained.

Because existing GWAS methods rely on linear assumptions and simple disease models, they are effective only when a variant has a strong, direct association with the disease. In addition, some existing techniques encode genotypes manually according to a preset genetic model. In practice, however, the genetic effect of a disease is unknown and difficult to model in advance, so a method that does not depend on a genetic model is needed to reasonably model the internal relationship between genes and their representations.

Therefore, how to provide a deep association kernel learning technology for explaining the genetic relationships of complex diseases is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of the above, the present invention provides a deep association kernel learning technology for explaining the genetic relationships of complex diseases.

In order to achieve the purpose, the invention adopts the following technical scheme:

a deep association kernel learning technique for explaining the genetic relationships of complex diseases, comprising: a pathway grouping module, and a gene encoding module, a convolutional layer module, a kernel machine regression module, and a gradient backpropagation algorithm module connected in sequence to the pathway grouping module;

the pathway grouping module groups variants in the same biological pathway;

the gene encoding module one-hot encodes and models the alleles in each group of variants;

the convolutional layer module identifies causal regions in the alleles and encodes causal loci as deep features;

the kernel machine regression module performs regression by a deep learning method, extracts pathway-level features, and computes the statistical significance of those features;

and the gradient backpropagation algorithm module optimizes and updates the deep network parameters and feeds the parameters back to the convolutional layer module.

Preferably, the pathway grouping module strengthens the combined effect of multiple SNPs in a pathway by grouping them into a pathway-level gene set.

Preferably, the genotypes of an SNP include the major homozygous, heterozygous, and minor homozygous genotypes.

Preferably, the convolutional layer module identifies causal regions by increasing the difference in convolution output between causal and non-causal regions and by increasing the similarity between samples carrying causal alleles, thereby improving the detectability of causal pathways.

Preferably, the kernel machine regression module performs regression by a deep learning method to determine the relationship between genetic causality and complex diseases.

Preferably, the kernel machine regression module includes regression coefficients for environmental factors and genotype features.

Preferably, the gradient backpropagation algorithm module optimizes the deep network parameters using TensorFlow.

Preferably, the kernel machine regression module performs the association statistical test using the SKAT framework.

Preferably, the gradient backpropagation algorithm module performs backpropagation through multi-instance learning and optimizes the parameters of the whole SKAT framework end to end.

Compared with the prior art, the invention discloses a deep association kernel learning (DAK) technology for explaining the genetic relationships of complex diseases, which uses the capacity of deep learning to automatically infer complex, nonlinear, and diverse causal loci from pathway-level gene sequences. In addition, the invention attempts to open the "black box" of the deep learning method used in DAK, so that association detection performance can be greatly improved; the accompanying analysis further enhances the interpretability of deep learning models for GWAS research and deepens the understanding of the principles of deep learning. By designing a method that does not depend on a genetic model and that reasonably models the internal relationships among gene kernel representations, the invention overcomes some limitations of conventional techniques, enables detection of complex associations between genes, and enhances the interpretability of GWAS.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below are merely embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a schematic structural diagram provided by the present invention.

FIG. 2 is a bar graph comparing the error rates of seven methods, provided by the present invention.

FIG. 3 is a DAK training curve provided by the present invention.

FIG. 4a is a bar chart of the effects under the five genetic models provided by the present invention.

FIG. 4b is a graph of the power of each method under the five genetic models for different genotypes, provided by the present invention.

FIG. 4c is a line graph of the power of each method under the five genetic models for different genotypes with 5000 samples, provided by the present invention.

FIG. 4d is a line graph of the power of each method under the five genetic models for different genotypes with 3000 samples, provided by the present invention.

FIG. 5a is a graph of the power of each method under the five genetic models under polygenic effects, provided by the present invention.

FIG. 5b is a line graph of the power of each method under the five genetic models under polygenic effects with 5000 samples, provided by the present invention.

FIG. 5c is a line graph of the power of each method under the five genetic models under polygenic effects with 3000 samples, provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a deep association kernel learning technology for explaining the genetic relationships of complex diseases, comprising: a pathway grouping module 1, and a gene encoding module 2, a convolutional layer module 3, a kernel machine regression module 4, and a gradient backpropagation algorithm module 5 connected in sequence to the pathway grouping module 1;

the pathway grouping module 1 groups variants in the same biological pathway;

the gene encoding module 2 one-hot encodes and models the alleles in each group of variants;

the convolutional layer module 3 identifies causal regions in the alleles and encodes causal loci as deep features;

the kernel machine regression module 4 performs regression by a deep learning method, extracts pathway-level features, and computes the statistical significance of those features;

the gradient backpropagation algorithm module 5 optimizes and updates the deep network parameters and feeds the parameters back to the convolutional layer module 3.

To further optimize the above technical solution, the pathway grouping module 1 strengthens the combined effect of multiple SNPs in a pathway by grouping them into a pathway-level gene set.

To further optimize the above technical solution, the genotypes of an SNP include the major homozygous, heterozygous, and minor homozygous genotypes.

To further optimize the above technical solution, the convolutional layer module 3 identifies causal regions by increasing the difference in convolution output between causal and non-causal regions and by increasing the similarity between samples carrying causal alleles.

To further optimize the above technical solution, the kernel machine regression module 4 performs regression by a deep learning method to determine the relationship between genetic causality and complex diseases.

To further optimize the above technical solution, the kernel machine regression module 4 includes regression coefficients for environmental factors and genotype features.

To further optimize the above technical solution, the gradient backpropagation algorithm module 5 optimizes the deep network parameters using TensorFlow.

To further optimize the above technical solution, the kernel machine regression module 4 performs the association statistical test using the SKAT framework.

To further optimize the above technical solution, the gradient backpropagation algorithm module 5 performs backpropagation through multi-instance learning and optimizes the parameters of the whole SKAT framework end to end.
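As a minimal illustrative sketch of the pathway grouping step, and not a definitive implementation of module 1, the following Python code collects SNPs into pathway-level groups. The annotation dictionaries `snp_to_gene` and `gene_to_pathways`, the function name, and all identifiers such as `rs1` or `GENE_A` are hypothetical and are assumed only for illustration.

```python
from collections import defaultdict


def group_snps_by_pathway(snp_to_gene, gene_to_pathways):
    """Group SNP identifiers by the biological pathways of their genes.

    snp_to_gene: dict mapping SNP id -> gene symbol (hypothetical annotation).
    gene_to_pathways: dict mapping gene symbol -> list of pathway names.
    SNPs whose gene belongs to no known pathway are left out.
    """
    pathways = defaultdict(list)
    for snp, gene in snp_to_gene.items():
        for pathway in gene_to_pathways.get(gene, []):
            pathways[pathway].append(snp)
    # Sort so that the SNP order inside each pathway is deterministic.
    return {name: sorted(snps) for name, snps in pathways.items()}


snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B", "rs4": "GENE_C"}
gene_to_pathways = {"GENE_A": ["pathway_1"], "GENE_B": ["pathway_1", "pathway_2"]}
groups = group_snps_by_pathway(snp_to_gene, gene_to_pathways)
```

In this sketch an SNP may appear in more than one pathway when its gene does, which is one possible design choice rather than a requirement stated in the embodiment.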

In each simulation experiment, the invention simulates data sets under either the null hypothesis (no causal relationship) or the alternative hypothesis (the disease is caused by different genetic associations). The performance of the different methods is evaluated over 100 replicates using the type I error rate (under the null hypothesis) and the empirical power (under the alternative hypothesis).

Type I errors are reported first. When no causal loci are present in any pathway (the null hypothesis), all methods show low error rates, as shown in FIG. 2, and changing the sample size has little impact on the results. FIG. 3 shows the training curve, in which DAK converges within several iterations.
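The two evaluation measures can be computed in the same way from the P values of the replicates. The sketch below is illustrative only: the significance threshold of 0.05 and the example P values are assumptions, since the description does not state the threshold used.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose P value falls below the threshold.

    On data simulated under the null hypothesis this is the empirical
    type I error rate; on data simulated under the alternative hypothesis
    it is the empirical power. The default alpha is an assumed value.
    """
    if not p_values:
        raise ValueError("at least one replicate is required")
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Made-up P values from four hypothetical replicates.
rate = rejection_rate([0.01, 0.2, 0.03, 0.5])
```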

Next, the disease in the data set is assumed to be caused by a single common variant. To reflect the different functional routes by which genes cause disease, the alleles of the causal locus are assumed to act under five different genetic models: 1) additive model, in which the minor homozygous genotype has twice the effect of the heterozygous genotype; 2) dominant model, in which the heterozygous and minor homozygous genotypes have the same effect size; 3) multiplicative model, in which each minor allele multiplies disease risk; 4) recessive model, in which only the minor homozygous genotype has an effect; and 5) heterozygous model, in which only the heterozygous genotype has an effect, as shown in FIG. 4a. Under the most widely used additive disease model, all methods identify pathways containing disease loci with reasonable accuracy, as shown in FIG. 4b. However, the power of all comparison methods drops sharply when the underlying genetic model changes, whereas the DAK technique of the present invention maintains reliable performance with the best power under all conditions. In particular, for the relatively difficult recessive genetic model, the accuracy of all comparison methods is greatly reduced and falls far below that of DAK. As also shown in FIG. 4b, when the sample size is increased to 5,000, the power of all methods increases, while DAK remains the best.
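The five genetic models can be illustrated by how each maps a genotype to a relative risk factor. The following sketch is one plausible parameterization and is not taken from the description: the function name, the baseline factor of 1.0, and the effect size of 0.5 are assumptions chosen only to make the differences between the models visible.

```python
def genotype_risk_factor(minor_allele_count, model, effect=0.5):
    """Relative risk factor of a genotype under one of five genetic models.

    minor_allele_count: 0 (major homozygote), 1 (heterozygote),
    or 2 (minor homozygote). The parameterization is assumed.
    """
    g = minor_allele_count
    if g not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, or 2")
    if model == "additive":        # minor homozygote has twice the heterozygote effect
        return 1.0 + effect * g
    if model == "dominant":        # one minor allele is enough for the full effect
        return 1.0 + effect * (1 if g >= 1 else 0)
    if model == "multiplicative":  # each minor allele multiplies the risk
        return (1.0 + effect) ** g
    if model == "recessive":       # only the minor homozygote is affected
        return 1.0 + effect * (1 if g == 2 else 0)
    if model == "heterozygous":    # only the heterozygote is affected
        return 1.0 + effect * (1 if g == 1 else 0)
    raise ValueError("unknown genetic model: " + model)


models = ["additive", "dominant", "multiplicative", "recessive", "heterozygous"]
table = {m: [genotype_risk_factor(g, m) for g in (0, 1, 2)] for m in models}
```

A method that encodes genotypes as 0, 1, 2 implicitly assumes the additive row of this table, which suggests why such methods can lose power under the recessive or heterozygous models.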

Because of their low allele frequency, rare variants (minor allele frequency below 0.5%) are difficult to detect in GWAS. To examine this setting, the invention simulates a rare-variant data set of 5000 samples in which the disease is caused by a single rare variant under the five genetic models. As shown in FIG. 4c, DAK achieves higher performance than the other methods on the recessive and multiplicative genetic models. Furthermore, as shown in FIG. 4d, DAK can find causal rare variants with a power of around 0.8 even with only 3000 samples, which is a difficult task for several other methods. These experiments illustrate the considerable advantages of the technology designed by the present invention for detecting disease-associated genetic variants.

Moreover, most diseases result from the joint action of multiple genes, yet identifying combined and mixed effect signals from causal variants in multiple genes has long been very difficult. To simulate such combined effects, three causal common variants are randomly assigned and phenotypes are generated under the five genetic models; the method of the present invention and the other methods are then tested and compared. All methods perform considerably worse than in the single-variant experiments described above. Nevertheless, in all multi-variant experiments DAK remains far superior to the other methods and has the most stable performance, as shown in FIG. 5a. The advantage of DAK is even more pronounced when the causal loci are rare variants, as shown in FIGS. 5b and 5c.

DAK has also been tested on four real data sets covering cancer and mental illness, further demonstrating that DAK can find previously undetected but meaningful pathways. In particular, an interesting causal relationship between schizophrenia and the dilated cardiomyopathy pathway was discovered.

The DAK architecture adopts the following mathematical model:

For the i-th individual among a total of N samples, $y_i$ denotes the phenotype (e.g., disease or control), and $x_i \in \mathbb{R}^{K}$ is an adjustment vector consisting of K context-dependent factors (e.g., gender, stratification, and bias). The genotype of each SNP falls into one of three categories: major homozygote, heterozygote, or minor homozygote. It is therefore natural to represent the genotype of each SNP with a one-hot vector whose non-zero entry indicates the specific genotype. Combining all $l^{(p)}$ SNPs on the p-th pathway of individual i yields the corresponding pathway-level genotype matrix $G_i^{(p)} \in \{0,1\}^{3 \times l^{(p)}}$.

After pathway assembly, a total of P pathways is obtained for every sample.
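The one-hot genotype encoding described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the gene encoding module: genotypes are assumed to be given as minor-allele counts (0, 1, or 2), and the function names are hypothetical.

```python
def one_hot_genotype(minor_allele_count):
    """One-hot vector over (major homozygote, heterozygote, minor homozygote)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    vector = [0, 0, 0]
    vector[minor_allele_count] = 1
    return vector


def pathway_genotype_matrix(genotypes):
    """Stack the one-hot vectors of all SNPs in a pathway as columns.

    Returns a matrix with 3 rows (genotype classes) and one column per SNP.
    """
    columns = [one_hot_genotype(g) for g in genotypes]
    return [[column[row] for column in columns] for row in range(3)]


# One individual, one pathway with four SNPs.
matrix = pathway_genotype_matrix([0, 1, 2, 0])
```

Because every column contains exactly one non-zero entry, no ordering or effect size is imposed on the three genotypes, which leaves the genetic model to be learned from the data.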

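A convolutional layer $\mathrm{Conv}(\cdot \mid \Theta_c)$ with M convolution operators converts each pathway-level genotype matrix $G_i^{(p)}$ into a deep feature representation

$$h_i^{(p)} = \mathrm{Conv}\big(G_i^{(p)} \mid \Theta_c\big), \qquad h_i^{(p)}[j] = \big[\, G_i^{(p)} * \theta_j \,\big], \quad j = 1, \dots, M,$$

where $\theta_j$ denotes the parameters of the j-th convolution operator, $[\cdot]$ denotes the max-pooling operator, and $\Theta_c = \{\theta_j\}_{j=1}^{M}$ denotes all learnable parameters of the convolutional layer.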
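A convolution followed by max pooling over a pathway-level genotype matrix can be sketched in plain Python as shown below. The sketch assumes that each filter spans all three genotype rows, slides along the SNP axis with stride 1, and is followed by a global maximum; these choices, the function name, and the hand-set filter weights are illustrative assumptions, since in the invention the filter parameters are learned.

```python
def conv_max_pool(genotype_matrix, filters):
    """Slide each filter along the SNP axis and keep its maximum response.

    genotype_matrix: one-hot matrix with 3 rows and one column per SNP.
    filters: list of M filters, each with 3 rows and w weight columns.
    Returns a list of M pooled features.
    """
    n_snps = len(genotype_matrix[0])
    features = []
    for filt in filters:
        width = len(filt[0])
        if width > n_snps:
            raise ValueError("filter is wider than the pathway")
        responses = []
        for start in range(n_snps - width + 1):
            total = 0.0
            for row in range(3):
                for offset in range(width):
                    total += filt[row][offset] * genotype_matrix[row][start + offset]
            responses.append(total)
        features.append(max(responses))  # global max pooling
    return features


genotype_matrix = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
filters = [
    [[0], [0], [1]],           # responds to a minor homozygote at any SNP
    [[0, 0], [1, 0], [0, 1]],  # heterozygote followed by a minor homozygote
]
features = conv_max_pool(genotype_matrix, filters)
```

Max pooling makes the feature depend on whether a pattern occurs anywhere in the pathway rather than on where it occurs.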

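Applying a kernel function to the outputs $h_i^{(p)}$ of the convolutional layer yields a kernel representation of the p-th pathway of the i-th individual,

$$\kappa_i^{(p)} = \big[\, k\big(h_i^{(p)}, h_1^{(p)}\big),\; \dots,\; k\big(h_i^{(p)}, h_N^{(p)}\big) \,\big],$$

where $k(\cdot, \cdot)$ is a kernel function and N is the number of samples.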
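The kernel representation of one individual can be pictured as the similarities between that individual's convolutional features and those of all N samples for the same pathway. The sketch below assumes a Gaussian (RBF) kernel with a hypothetical bandwidth parameter `gamma`; the description does not state which kernel function is used, so this choice is an assumption.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel between two feature vectors (assumed kernel choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def kernel_representation(features, i, kernel=rbf_kernel):
    """Similarities between sample i and all N samples for one pathway.

    features: list of N feature vectors, one per sample.
    """
    return [kernel(features[i], other) for other in features]


# Three samples with two pooled convolutional features each.
features = [[1.0, 2.0], [1.0, 2.0], [0.0, 0.0]]
kappa_0 = kernel_representation(features, 0)
```

Samples with identical features receive a similarity of exactly 1, and the similarity decays as the features diverge.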

A pathway-level kernel regression function is then defined with learnable parameters $\omega = \{\alpha, \beta\}$, the regression coefficients for the environmental factors and the genotype features, respectively. With this function, a prediction for individual i can be obtained from each of the P pathways.
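One plausible form of such a pathway-level regression combines a linear term over the adjustment vector with a linear term over the kernel representation. In the sketch below, the logistic link, the function name, and the coefficient names `alpha` and `beta` are assumptions made for illustration; the exact regression function of the invention is not recoverable from the text.

```python
import math


def pathway_prediction(x, kappa, alpha, beta):
    """Predicted disease probability for one individual and one pathway.

    x: adjustment vector of K environmental factors.
    kappa: kernel representation of the individual for this pathway.
    alpha, beta: coefficients for x and kappa (assumed parameterization).
    """
    score = sum(a * v for a, v in zip(alpha, x))
    score += sum(b * k for b, k in zip(beta, kappa))
    return 1.0 / (1.0 + math.exp(-score))  # logistic link (assumed)


x = [1.0, 0.0]          # e.g., a bias term and one covariate
kappa = [1.0, 1.0, 0.0]
neutral = pathway_prediction(x, kappa, alpha=[0.0, 0.0], beta=[0.0, 0.0, 0.0])
raised = pathway_prediction(x, kappa, alpha=[0.0, 0.0], beta=[0.5, 0.5, 0.0])
```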

Labels (disease or non-disease) are provided only at the individual level, not at the level of each single pathway. A multi-instance learning loss is therefore used, and the individual-level label for sample i is defined as the maximum over its pathway-level predictions:

$$\ell_i = \max_{p = 1, \dots, P} \ell_i^{(p)}.$$

This multi-instance formulation has a natural interpretation in GWAS: a sample is considered a patient if at least one of its pathways is associated with the disease. The training loss is defined on these individual-level labels and is optimized in TensorFlow in mini-batches.
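The multi-instance idea, in which an individual is predicted to be a patient if at least one pathway indicates disease, can be sketched as follows. Taking the maximum over pathway predictions and using a binary cross-entropy loss are assumptions of this sketch, as are the function names; the code uses plain Python rather than TensorFlow so that it stays self-contained.

```python
import math


def individual_prediction(pathway_predictions):
    """Individual-level prediction: the strongest pathway-level prediction."""
    return max(pathway_predictions)


def mil_cross_entropy(labels, predictions_per_sample, eps=1e-7):
    """Mean binary cross-entropy on individual-level predictions.

    labels: list of 0/1 phenotypes, one per individual.
    predictions_per_sample: for each individual, a list of P pathway
    probabilities. The cross-entropy form is an assumed choice.
    """
    total = 0.0
    for y, predictions in zip(labels, predictions_per_sample):
        p = min(max(individual_prediction(predictions), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = [1, 0]
predictions_per_sample = [[0.1, 0.9], [0.2, 0.1]]
loss = mil_cross_entropy(labels, predictions_per_sample)
```

Only the pathway with the largest prediction contributes a gradient for each individual, so pathways unrelated to the disease are not forced to predict the individual-level label.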

After training, a score test is performed to quantify the statistical significance of each pathway, using the same method as in SKAT [12]. For each pathway P, the score statistic is computed from the kernel similarity matrix $\kappa^{(P)}$ as

$$Q_P = (L - Y)^{\top} \kappa^{(P)} (L - Y),$$

where $L$ is the vector of predicted disease states for pathway P across the N samples and $Y = (y_1, \dots, y_N)^{\top}$ is the vector of observed phenotypes. As described for SKAT, $Q_P$ is compared against a $\chi^2$ distribution to obtain a P value.
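The quadratic form of the score statistic can be evaluated directly, as in the following sketch. The function name and the numerical values are made up for illustration, and the conversion of the statistic into a P value is omitted because it depends on the reference distribution used by SKAT.

```python
def score_statistic(predicted, observed, kernel_matrix):
    """Compute Q = (L - Y)^T K (L - Y) for one pathway.

    predicted: predicted disease states L for the N samples.
    observed: observed phenotypes Y for the N samples.
    kernel_matrix: N x N kernel similarity matrix of the pathway.
    """
    residual = [l - y for l, y in zip(predicted, observed)]
    n = len(residual)
    return sum(
        residual[a] * kernel_matrix[a][b] * residual[b]
        for a in range(n)
        for b in range(n)
    )


predicted = [0.8, 0.3]
observed = [1, 0]
q_identity = score_statistic(predicted, observed, [[1.0, 0.0], [0.0, 1.0]])
q_similar = score_statistic(predicted, observed, [[1.0, 0.5], [0.5, 1.0]])
```

With an identity kernel the statistic reduces to the sum of squared residuals; off-diagonal similarities add the products of residuals from similar samples.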

The deep association kernel learning (DAK) technique disclosed herein enables automatic causal genotype encoding for GWAS at the pathway level. DAK can detect common and rare variants with complex genetic effects that existing methods cannot detect. The "black box" of the deep learning model is explained, and the reason why it greatly improves association detection performance is discussed. When applied to real-world GWAS data, the DAK analysis designed by the present invention finds potential pathways that may be explained by other biological studies. The invention reasonably models the internal relationships among gene kernel representations, overcomes some limitations of conventional techniques, enables detection of complex associations between genes, and enhances the interpretability of GWAS.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A deep association kernel learning technique for explaining the genetic relationships of complex diseases, comprising: a pathway grouping module (1), and a gene encoding module (2), a convolutional layer module (3), a kernel machine regression module (4), and a gradient backpropagation algorithm module (5) connected in sequence to the pathway grouping module (1);

the pathway grouping module (1) groups variants in the same biological pathway;

the gene encoding module (2) one-hot encodes and models the alleles in each group of variants;

the convolutional layer module (3) identifies causal regions in the alleles and encodes causal loci as deep features;

the kernel machine regression module (4) performs regression by a deep learning method, extracts pathway-level features, and computes the statistical significance of those features;

the gradient backpropagation algorithm module (5) optimizes and updates the deep network parameters and feeds the parameters back to the convolutional layer module (3).

2. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the pathway grouping module (1) strengthens the combined effect of a plurality of SNPs in a pathway by grouping them into a pathway-level gene set.

3. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 2, wherein the genotypes of the SNPs include major homozygous, heterozygous, and minor homozygous genotypes.

4. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the convolutional layer module (3) identifies causal regions by increasing the difference in convolution output between causal and non-causal regions and by increasing the similarity between samples carrying causal alleles.

5. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the kernel machine regression module (4) performs regression by a deep learning method to determine the relationship between genetic causality and complex diseases.

6. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the kernel machine regression module (4) includes regression coefficients for environmental factors and genotype features.

7. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the gradient backpropagation algorithm module (5) optimizes the deep network parameters using TensorFlow.

8. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the kernel machine regression module (4) performs the association statistical test using the SKAT framework.

9. The deep association kernel learning technique for explaining the genetic relationships of complex diseases according to claim 1, wherein the gradient backpropagation algorithm module (5) performs backpropagation through multi-instance learning to optimize the parameters of the whole SKAT framework end to end.