WO2018053647A1 - Tuning of associations for predictive gene scoring - Google Patents

Tuning of associations for predictive gene scoring

Info

Publication number
WO2018053647A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
value
associations
tuning
association
Prior art date
Application number
PCT/CA2017/051126
Other languages
English (en)
Inventor
Guillaume PARE
Shihong MAO
Wei Qxi DENG
Original Assignee
McMaster University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by McMaster University
Priority to US16/336,406 (US20190228838A1)
Priority to EP17852044.1A (EP3516565A4)
Priority to CA3001257A (CA3001257C)
Publication of WO2018053647A1

Classifications

    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/30 Detection of binding sites or motifs
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G16B20/50 Mutagenesis
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B99/00 Subject matter not provided for in other groups of this subclass
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for calculating health indices; for individual health risk assessment
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q2600/124 Animal traits, i.e. production traits, including athletic performance or the like
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • the present disclosure relates to predictive gene scoring, and more particularly to tuning of associations for use in predictive gene scoring.
  • Machine learning encompasses a class of methods widely used to solve complex prediction problems.
  • machine learning is a process in which computer algorithms are used to develop a predictive model based on known "training data", which model can then be applied to new data to make predictions.
  • Machine learning has proven particularly useful when prediction is dependent on the integration of a large number of predictor variables, including higher-order interactions, and when sizeable training datasets are available for model fitting.
  • One application of machine learning is to develop models for predicting biological characteristics based on variations in a particular genome.
  • One type of variation is a single nucleotide polymorphism (SNP). Where a single nucleotide has variations at a particular position in the genome and each of those variations is present in a population beyond a negligible degree, this is considered an SNP; the variations are referred to as "alleles" for that position in the genome.
  • Linkage disequilibrium occurs when alleles at different locations on the genome are associated with one another at a different frequency (either higher or lower) than one would expect if the associations were random.
  • Linkage disequilibrium interferes with the ability of a machine-learning-generated predictive gene score to make generalized predictions because the model may be fitted to genetic variants which are correlated with, and hence representative of, the variables sought to be predicted and therefore overfitted.
  • a computer-implemented method for tuning associations between genetic variants of sequence elements of a subject genome and a target biological characteristic notionally partitions the subject genome into discrete contiguous genome segments and derives a tuning function for each genome segment to tune the associations to the target population.
  • Derivation of the tuning function for each genome segment excludes sequence elements in that same genome segment to avoid overfitting.
  • a computer-implemented method for developing a predictive gene score for a target biological characteristic in a prediction target population comprises receiving a baseline dataset and receiving a tuning dataset.
  • the baseline dataset comprises a first plurality of associations between genetic variants of sequence elements of a subject genome and the target biological characteristic.
  • the tuning dataset comprises, for a representative sample of the prediction target population, a second plurality of associations between genetic variants of the sequence elements of the subject genome and the target biological characteristic.
  • the associations in the baseline dataset and the tuning dataset are genotypic weightings representing contributions of the respective genetic variants to a value of the target biological characteristic.
  • the method further comprises notionally partitioning the subject genome into s discrete contiguous genome segments where s > 2 so that, for each genome segment, the subject genome notionally comprises that genome segment and the remainder of the subject genome, other than and excluding that genome segment.
  • the method obtains adjusted associations for each genome segment without overfitting by, for each genome segment, using only the associations for the sequence elements in the remainder of the subject genome to derive a tuning function for that genome segment.
  • the tuning function maps the associations in the baseline dataset for the remainder of the subject genome to respective corresponding ones of the associations in the tuning dataset for the remainder of the subject genome.
  • the method applies the tuning function to the associations in the baseline dataset to obtain adjusted associations for the sequence elements in that genome segment, with the adjusted associations being tuned to the prediction target population. The method uses the adjusted associations to form the predictive gene score for the target biological characteristic in the prediction target population.
  • the associations for the sequence elements in the remainder of the subject genome may be used to derive a tuning function for that genome segment by deriving regression trees representing the tuning function.
  • the method may further comprise receiving at least one annotation, wherein each annotation is associated with a respective sequence element, and, for each genome segment for which the remainder of the subject genome includes a sequence element with which one of the at least one annotation is associated, using each such annotation in deriving the tuning function for that genome segment.
  • a wide range of annotations may be used.
  • Deriving the tuning function may comprise deriving regression trees representing the tuning function.
  • the present disclosure is also directed to computer-usable media embodying instructions for implementing the methods, and to computer systems programmed to implement the method.
BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGURE 1 is a flow chart showing an exemplary method for tuning associations between genetic variants of sequence elements of a subject genome and a target biological characteristic;
  • FIGURE 2 is a flow chart representing a first exemplary computer-implemented method for developing a predictive gene score for a target biological characteristic in a prediction target population;
  • FIGURE 2A is a flow chart representing a second exemplary computer-implemented method for developing a predictive gene score for a target biological characteristic in a prediction target population;
  • FIGURE 3 schematically illustrates notional partition of a subject genome into discrete contiguous genome segments;
  • FIGURES 4A, 4B and 4C show performance of LD-corrected gene scores simulated using 1000 Genomes Project phased haplotypes for gene score prediction R², variance and covariance, respectively;
  • FIGURES 5A to 5D are graphs showing comparisons among various methodologies for prediction R² for height in HRS, BMI in HRS, height in GENEVA and BMI in GENEVA, respectively;
  • FIGURE 6 is a series of bar graphs showing relative improvement in prediction R² for various methodologies, as compared to unadjusted gene scores, for height in HRS, BMI in HRS, height in GENEVA and BMI in GENEVA;
  • FIGURE 7 is a block diagram of an illustrative computer system in respect of which the technology herein described may be implemented.
  • Figure 1 is a flow chart showing, in broad outline, an exemplary method, denoted generally by reference 100, for tuning associations between genetic variants of sequence elements of a subject genome and a target biological characteristic.
  • the subject genome may be the human genome, a non-human animal genome, or a plant genome, and the sequence elements may be, for example, single nucleotides, nucleotide sequences, or some combination thereof.
  • the target biological characteristic may be any biological characteristic of the flora or fauna whose genome is being considered.
  • the associations between the genetic variants of the sequence elements and the target biological characteristics are typically genotypic weightings representing contributions of the respective genetic variants to a value of the target biological characteristic.
  • the method 100 notionally partitions the subject genome into discrete contiguous genome segments, and at step 104, the method 100 derives a tuning function for each genome segment.
  • the tuning function derived at step 104 tunes the associations to the target population; the same genetic variant may have different contributions to the same biological characteristic depending on the population. For example, a particular genetic variant may have a greater contribution to height for a population of Pacific Islanders than for a population of American Indians. Importantly, derivation of the tuning function for each genome segment at step 104 excludes sequence elements in that genome segment to avoid overfitting.
  • the method 100 ends; the tuning functions derived at step 104 can be used to develop a predictive gene score.
  • FIG. 2 shows a flow chart representing an exemplary computer-implemented method 200 for developing a predictive gene score for a target biological characteristic in a prediction target population.
  • the method 200 receives a baseline dataset.
  • the baseline dataset comprises a first plurality of associations between genetic variants of sequence elements of the subject genome and the target biological characteristic.
  • the method 200 receives a tuning dataset.
  • the tuning dataset comprises, for a representative sample of the prediction target population, a second plurality of associations between genetic variants of the sequence elements of the subject genome and the target biological characteristic.
  • the subject genome may be the human genome or that of a non-human animal or a plant
  • the sequence elements may be single nucleotides, nucleotide sequences, or some combination thereof.
  • the target biological characteristic may be any biological characteristic of interest, and the associations between the genetic variants of the sequence elements and the target biological characteristics are genotypic weightings that represent the contributions of the respective genetic variants to the relevant value of the target biological characteristic.
  • the method 200 notionally partitions the subject genome into s discrete contiguous genome segments where s > 2 so that, for each genome segment, the subject genome notionally comprises (a) that genome segment; and (b) the remainder of the subject genome other than and excluding that genome segment.
  • for the first genome segment 302A, the remainder 304A of the subject genome 300 consists of the other four genome segments 302B, 302C, 302D and 302E;
  • for the second genome segment 302B, the remainder 304B of the subject genome 300 consists of the other four genome segments 302A, 302C, 302D and 302E;
  • for the third genome segment 302C, the remainder 304C of the subject genome 300 consists of the other four genome segments 302A, 302B, 302D and 302E;
  • for the fourth genome segment 302D, the remainder 304D of the subject genome 300 consists of the other four genome segments 302A, 302B, 302C and 302E; and
  • for the fifth genome segment 302E, the remainder 304E of the subject genome 300 consists of the other four genome segments 302A, 302B, 302C and 302D.
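
The notional partition and its per-segment remainders can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sequence elements are represented by their integer positions in genome order, segments are made as equal in size as possible, and the function names `partition_genome` and `remainder_of` are hypothetical. The text does not prescribe how segment boundaries are chosen, and the lower bound on s stated in the text is left to the caller.

```python
def partition_genome(n_elements, s):
    """Split indices 0..n_elements-1 into s contiguous, non-overlapping segments.

    Assumption: near-equal segment sizes; the source does not fix the boundaries.
    """
    if not 1 <= s <= n_elements:
        raise ValueError("s must be between 1 and n_elements")
    base, extra = divmod(n_elements, s)
    segments, start = [], 0
    for i in range(s):
        size = base + (1 if i < extra else 0)
        segments.append(list(range(start, start + size)))
        start += size
    return segments


def remainder_of(segments, i):
    """Return every index outside segment i, i.e. the remainder for that segment."""
    return [idx for j, seg in enumerate(segments) if j != i for idx in seg]


# Five segments, mirroring the five-segment illustration in Figure 3.
segments = partition_genome(n_elements=10, s=5)
remainders = [remainder_of(segments, i) for i in range(len(segments))]
```

Each segment and its remainder together cover the whole index range exactly once, which is the property the later tuning step relies on.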
  • Steps 202, 204 and 208 may be carried out in any order, or substantially simultaneously.
  • step 208 (notionally partitioning the subject genome into s discrete contiguous genome segments where s > 2) is carried out by way of a predefined partition scheme defined prior to receiving the baseline dataset or the tuning dataset.
  • the method 200 obtains adjusted associations for each genome segment (e.g. the genome segments 302A, 302B, 302C, 302D and 302E shown in Figure 3) without overfitting. This is done by, at step 210, for each genome segment, using only the associations for the sequence elements in the remainder of the subject genome to derive a tuning function for that genome segment.
  • the remainder 304A of the subject genome 300, that is, only the other four genome segments 302B, 302C, 302D and 302E, would be used to derive the tuning function for that genome segment 302A.
  • each tuning function derived at step 210, that is, the tuning function for each genome segment (e.g. the genome segments 302A, 302B, 302C, 302D and 302E shown in Figure 3), maps the associations in the baseline dataset for the remainder 304A, 304B, 304C, 304D and 304E of the subject genome to respective corresponding ones of the associations in the tuning dataset for the remainder 304A, 304B, 304C, 304D and 304E of the subject genome.
  • the tuning functions may be derived by, for example, deriving regression trees representing the tuning functions, as described further below.
  • the subject genome (and hence the SNPs) will be divided into distinct contiguous sets of SNPs. More particularly, since LD is relatively localized, the notional partition of the genome makes it less likely that the genetic variants in the remainder of the subject genome will correlate with the genome segment for which the tuning function is being derived, thereby limiting potential LD spillover and reducing the resulting overfitting effects. Weights of SNPs in each set are calculated using the prediction models trained on the remaining sets.
  • For example, if the first set comprises SNPs from chromosomes 1, 2 and part of 3, then SNPs from the remaining part of chromosome 3 and chromosomes 4 to 22 would be used to derive prediction models for SNPs in that first set.
  • the observed regression coefficient of any single SNP in the prediction target population is thus never used directly or indirectly to derive its own gene score weight.
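
The leave-one-segment-out logic described above can be illustrated with a deliberately simple stand-in. The sketch below assumes a straight-line least-squares fit as the tuning function, which is a substitution made for brevity and is not the regression-tree approach the text describes; the names `fit_line` and `tune_associations` and the toy weights are hypothetical. What it is meant to show is only the data flow: the function for a segment is fitted on the remainder and then applied to the baseline weights inside the segment.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in tuning function, not the source's)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def tune_associations(baseline, tuning, segments):
    """For each segment, fit on the remainder only, then apply to that segment."""
    adjusted = [None] * len(baseline)
    for i, segment in enumerate(segments):
        rest = [k for j, other in enumerate(segments) if j != i for k in other]
        tuning_function = fit_line([baseline[k] for k in rest],
                                   [tuning[k] for k in rest])
        for k in segment:
            # The tuning-dataset value tuning[k] is never consulted here.
            adjusted[k] = tuning_function(baseline[k])
    return adjusted


# Toy data: the tuning weights are exactly half of the baseline weights.
baseline = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
tuning = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
segments = [[0, 1], [2, 3], [4, 5]]
adjusted = tune_associations(baseline, tuning, segments)
```

Because each adjusted weight depends only on tuning-dataset values from other segments, perturbing the tuning value of a sequence element leaves that element's own adjusted weight unchanged.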
  • the notional division of the subject genome into five genome segments is merely exemplary, and the subject genome may be notionally divided into any number of genome segments, i.e. s > 2. If the value of s is set too high, the sequence elements in adjacent genome segments will be very close to one another, which may undermine the desired reduction of LD spillover since there may be LD between the sequence elements in adjacent or nearby genome segments. Therefore, the value of s is preferably less than 2000, more preferably less than 1000, still more preferably less than 100, still even more preferably less than 50 and yet still even more preferably less than 20.
  • At step 214, for each genome segment (e.g. the genome segments 302A, 302B, 302C, 302D and 302E shown in Figure 3), the method 200 applies the tuning function to the associations in the baseline dataset to obtain adjusted associations for the sequence elements in that genome segment.
  • These adjusted associations are genotypic weightings that represent the contributions of the respective genetic variants to the relevant value of the target biological characteristic, and which genotypic weightings are tuned to the prediction target population.
  • At step 216, the method 200 uses the adjusted associations to form the predictive gene score for the target biological characteristic in the prediction target population. While step 210 above will limit the effect of linkage disequilibrium in deriving the tuning functions, where there are multiple genetic variants in a predictive gene score, it is advantageous to take linkage disequilibrium into account when using the adjusted associations to form the predictive gene score. There is a distinction between avoiding the overfitting effects of linkage disequilibrium when deriving the tuning functions and correcting the genotypic weightings (resulting from the tuning functions) to account for linkage disequilibrium. Therefore, in preferred embodiments, step 216 includes adjustment based on linkage disequilibrium, as described further below. After step 216, the method 200 ends.
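
As a rough illustration of how adjusted associations might be combined into a score for one individual, the sketch below computes a plain weighted sum. It assumes additive genotype coding (0, 1 or 2 copies of the effect allele), which the text does not specify, and it omits the linkage disequilibrium adjustment of the preferred embodiments. The function name `gene_score` and the numbers are hypothetical.

```python
def gene_score(weights, dosages):
    """Weighted sum of genotype dosages; one weight per genetic variant.

    Assumption: additive 0/1/2 allele coding, no LD correction applied.
    """
    if len(weights) != len(dosages):
        raise ValueError("weights and dosages must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


adjusted_weights = [0.12, -0.05, 0.30]  # hypothetical tuned genotypic weightings
individual = [2, 1, 0]                  # hypothetical allele counts for one person
score = gene_score(adjusted_weights, individual)
```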
  • the most popular linkage disequilibrium adjustment heuristic is based on linkage disequilibrium (LD) pruning of SNPs.
  • the LD pruning heuristic prioritizes the most significant associations up to an empirically determined p-value threshold and prunes the remaining SNPs based on LD (International Schizophrenia Consortium, Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O'Donovan, M.C., Sullivan, P.F., and Sklar, P. (2009).
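
A greedy version of the pruning heuristic described above might look like the following. This is a sketch under assumptions: pairwise r² values are supplied in a dictionary keyed by unordered SNP pairs, missing pairs are treated as r² = 0, and the identifiers, thresholds and the name `ld_prune` are hypothetical. Real pruning tools typically restrict the comparison to a physical window, which is not modelled here.

```python
def ld_prune(snps, r2, p_threshold, r2_threshold):
    """Greedy LD pruning.

    snps: list of (snp_id, p_value) pairs.
    r2:   dict mapping frozenset({id_a, id_b}) to pairwise r-squared;
          pairs absent from the dict are treated as unlinked (assumption).
    """
    kept = []
    for snp_id, p_value in sorted(snps, key=lambda item: item[1]):
        if p_value > p_threshold:
            break  # remaining SNPs are all less significant than the threshold
        in_ld = any(r2.get(frozenset((snp_id, other)), 0.0) >= r2_threshold
                    for other in kept)
        if not in_ld:
            kept.append(snp_id)
    return kept


snps = [("snpA", 1e-8), ("snpB", 1e-6), ("snpC", 1e-4), ("snpD", 0.2)]
r2 = {frozenset(("snpA", "snpB")): 0.9, frozenset(("snpA", "snpC")): 0.05}
kept = ld_prune(snps, r2, p_threshold=0.01, r2_threshold=0.2)
```

In this toy input the second SNP is dropped because it is in strong LD with the most significant one, and the last SNP is dropped because it fails the p-value threshold.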
  • FIG. 2A shows a flow chart representing a further exemplary computer-implemented method 200A for developing a predictive gene score for a target biological characteristic in a prediction target population.
  • the method 200A shown in Figure 2A represents a particular implementation of the method 200 shown in Figure 2, and hence the same reference numerals, with the additional suffix "A", are used to denote corresponding steps.
  • the method 200A shown in Figure 2A differs from the method 200 shown in Figure 2 in that the method 200A makes use of SNP annotations.
  • the method 200A shown in Figure 2A includes an additional step 206 A of receiving at least one annotation, with each annotation being associated with a respective sequence element in the subject genome.
  • the annotation(s) may be, for example, one or more of the following:
  • association of genetic variants with other traits, including but not exclusive to association with health outcomes (e.g. coronary artery disease, asthma, cancer, etc.), individual characteristics (e.g. height, weight, blood pressure, hair colour, etc.), gene level of expression (eQTL) in relevant tissue(s), DNA methylation or other biomarkers;
  • disequilibrium i.e. correlated with it, including but not restricted to:
  • Coding variant i.e. inducing a change in the amino acid sequence of a protein
  • Steps 202A, 204A, 206A and 208A may be performed in any order, or substantially simultaneously.
  • step 210A includes a sub-step 212A of, for each genome segment for which the remainder of the subject genome includes a sequence element with which one of the annotations is associated, using each such annotation in deriving the tuning function for that genome segment.
  • Step 212A is shown as a distinct sub-step only for purposes of schematic illustration, and typically the use of the annotations to derive the tuning functions will be fully integrated into the overall calculations used to derive the tuning functions at step 210A.
  • One exemplary implementation of the method 200A uses the univariate regression coefficients from external meta-analysis summary association statistics (i.e. the baseline dataset) as a starting point, and uses the tuning dataset to update these external univariate regression coefficients with respect to a target population by way of boosted regression trees, integrating a variety of SNP annotations.
  • Boosted regression trees are powerful and versatile methods for continuous outcome prediction (Hastie, T., Tibshirani, R., and Friedman, J.H. (2009). The elements of statistical learning: data mining, inference, and prediction. (New York, NY: Springer).) and thus well-suited to updating the SNP weights in a gene score.
  • the exemplary method described herein uses boosted regression trees to adjust summary association statistics regression coefficients in order to improve the prediction R² in a prediction target population. Regression coefficients from large meta-analyses are implicitly assumed to provide the best initial estimates and regression trees "tune" them based on the regression coefficients observed in the prediction target population as well as relevant SNP annotations; thus, the regression trees serve as tuning functions.
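
To make the idea of boosted regression trees acting as a tuning function more concrete, here is a minimal pure-Python sketch of gradient boosting with squared-error loss. It is heavily simplified and rests on assumptions: each tree is a single-split stump, the only predictor is the baseline regression coefficient (no SNP annotations), and the names `fit_stump`, `boost` and `mse`, the learning rate and the toy data are all hypothetical. It is not the implementation used in the disclosure, and it does not show the exclusion of a segment's own SNPs during fitting.

```python
def fit_stump(xs, residuals):
    """Single-split regression tree on one feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    initial = sum(ys) / len(ys)
    predictions = [initial] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: initial + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: baseline coefficients and the coefficients seen in the target sample.
baseline = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4]
observed = [-0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2]
constant_model = boost(baseline, observed, n_trees=0)
boosted_model = boost(baseline, observed, n_trees=50)
```

With zero trees the model is just the mean of the observed coefficients; adding stumps lowers the training error, which is the sense in which the trees "tune" the baseline weights.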
  • the adjusted associations (i.e. the tuned genotypic weightings) are corrected for LD to produce the final predictive gene score.
  • the resulting gene score provides an unbiased estimate of the underlying genetic variance although at a tradeoff of increased gene score variance as compared to the "true" unobserved genetic model (see Figure 4).
  • the loss in prediction R² resulting from increased gene score variance was estimated at approximately 12% in simulations using phased haplotypes from the 1000 Genomes (1000G) Project, as described further below.
  • each column vector representing the coded genotypes for an individual.
  • each column of X (i.e. the genotypes for a single SNP)
  • is a vector of true genetic effects that are fixed across individuals but random across SNPs, with mean 0 and a covariance matrix such that the total expected genetic variance
  • r²_dk denotes the pairwise linkage disequilibrium (r²) between the d-th and k-th SNPs.
  • b_d* is the regression coefficient commonly reported in genome-wide association studies (GWAS) meta-analysis (assumed to have been standardized for allele frequency).
  • gene score in the target population is expressed as:
  • the expected value can be approximated by:
  • the sample covariance of the gene score with the observed y in the target sample is given by:
  • e* and e are the residual error in the unobserved population used to derive summary association statistics (i.e. the baseline dataset) and the prediction target population, respectively.
  • pairwise r² LD is either 0 or 1 and summary association statistics are derived from an asymptotically large sample.
  • partial LD reflecting the loss of information when, for example, two SNPs are in partial LD
  • HRS: Health and Retirement Study (dbGaP Study Accession: phs000428.v1.p1), Human Omni2.5-Quad BeadChip
  • the final dataset included 7,776 European participants genotyped for 681,516 SNPs. Height and BMI were adjusted for age and sex in all analyses; to further mitigate the effect of outliers, values outside the 1st to 99th percentile range were removed. All analyses were adjusted for the first 10 genetic principal components unless stated otherwise. HRS was not part of the GIANT meta-analysis of height and BMI (Berndt, S.I., Gustafsson, S., Magi, R., Ganna, A., Wheeler, E., Feitosa, M.F., Justice, A.E., Monda, K.L., Croteau-Chonka, D.C., Day, F.R., et al. (2013). Genome-wide meta-analysis identifies 11 new loci for
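
The outlier step mentioned above (dropping values outside the 1st to 99th percentile range) can be sketched with the Python standard library. The percentile convention used here is the default of `statistics.quantiles`, which may differ slightly from whatever convention the original analysis used; the function name `trim_outliers` and the toy values are hypothetical, and the age and sex adjustment is not shown.

```python
import statistics


def trim_outliers(values, lower_pct=1, upper_pct=99):
    """Keep only values within the [lower_pct, upper_pct] percentile range."""
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(values, n=100)
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if low <= v <= high]


phenotype = [float(v) for v in range(1, 1001)]  # toy phenotype values
kept = trim_outliers(phenotype)
```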
  • HRS is not part of the GIANT consortium
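  • The outlier filter described above (removing values outside the 1st to 99th percentile range) can be sketched as follows. This is only an illustration: the percentile definition used in the original analysis is not stated, so the default method of Python's `statistics.quantiles` is an assumption, and the function name and toy values are hypothetical. The age and sex adjustment is not shown.

```python
from statistics import quantiles  # available in Python 3.8+


def trim_outliers(values, lower_pct=1, upper_pct=99):
    """Keep only values inside the [lower_pct, upper_pct] percentile range."""
    cuts = quantiles(values, n=100)   # 99 cut points; cuts[0] is the 1st percentile
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if lo <= v <= hi]


# Hypothetical trait values (e.g. height in cm)
heights = [150.0 + 0.05 * i for i in range(1000)]
kept = trim_outliers(heights)
```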
  • the reference and target populations i.e. baseline dataset and tuning dataset, respectively
  • Cho recently proposed Cho, C.Y., Han, J., Hunter, D.J., Kraft, P., and Price, A.L. (2015).
  • Table 1 shows the relative influence of predictor variables for height and BMI.
  • the top five predictor variables for the gene score weights of height and BMI in HRS are listed, along with their relative influences on the regression trees models in terms of square error loss. Results are given for the first SNP set, corresponding to all SNPs from the start of the first chromosome to SNP rs1032726 on chromosome 3, and are representative of the four other SNP sets.
  • An advantage of boosted regression trees models is the potential to uncover unexpected relationships between a trait of interest and predictor variables. Indeed, it is possible to evaluate the relative influence of predictor variables on squared error reduction, even in the presence of higher-order interactions.
  • the most influential variable for the prediction of height was BMI regional genetic variance, a measure of regional genetic association that has been recently described (Pare, G., Mao, S., and Deng, W.Q. (2016). A method to estimate the contribution of regional genetic associations to complex traits from summary association statistics. Sci Rep 6, 27644.).
  • LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47, 291-295.), a measure of LD with neighboring SNPs in 1000G data (i.e. not restricted to SNPs included in the gene score), was the fourth most predictive variable, as would be expected in polygenic models of inheritance.
  • the performance of an implementation of the method 200A using boosted regression trees was also tested for prediction of body mass index (BMI) in the HRS.
  • the boosted regression trees models included 31 predictor variables which were used to update gene score weights.
  • the resulting gene score had a prediction R² of 0.071, outperforming the prediction R² of unadjusted gene score (0.065), P+T (0.062) and LDpred (0.067) (Figure 5B).
  • the boosted regression tree implementation of the method 200A accounted for 44.4% of the total polygenic variance, which was estimated at 0.160 for BMI in HRS.
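  • As a check on the figure above, the 44.4% appears to be the ratio of the gene score's prediction R² (0.071) to the estimated polygenic variance (0.160). This reading is inferred from the reported numbers rather than from a formula stated in the text, and the symbol names below are illustrative:

```latex
\frac{R^2_{\text{gene score}}}{\sigma^2_{\text{polygenic}}}
  = \frac{0.071}{0.160}
  \approx 0.444
  = 44.4\%
```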
  • the most influential variable was the association of each SNP itself with BMI in GIANT (see Table 1 above).
  • LDL cholesterol, total cholesterol and systolic blood pressure regional genetic variance were the second, third and fifth most influential predictor variables, respectively, while LDscore came fourth.
  • GENEVA Gene-Environment Association Studies
  • Imputed SNP genotypes from 2,557 participants of the GENEVA Diabetes study are available (Qi, Q., Liang, L., Doria, A., Hu, F.B., and Qi, L. (2012). Genetic predisposition to dyslipidemia and type 2 diabetes risk in two prospective cohorts. Diabetes 61, 745-752.) and were downloaded from dbGaP (phs000091.v2.p1). Briefly, genotyping was performed using the Affymetrix Human SNP Array 6.0 and Birdseed calling algorithm, with standard quality control including sex chromosome abnormalities, sample identity, Hardy-Weinberg equilibrium (p-value > 1×10⁻⁴), and call rate (>98%).
  • Pruning and thresholding (P+T) polygenic scores were derived using the "clump" function of PLINK (Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559-575.)
  • LD r² threshold of 0.2
  • testing p-value thresholds in a continuous manner from the most to the least significant association.
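  • The pruning and thresholding idea above can be illustrated with a greedy clumping sketch. This is not PLINK's implementation (it ignores, for example, the physical distance window that PLINK applies); the function names, the `>=` comparison at the r² threshold, and the toy data are assumptions made for illustration.

```python
def clump(pvalues, r2, r2_threshold=0.2):
    """Greedy LD clumping: walk SNPs from most to least significant and keep a
    SNP only if its r2 with every already-kept SNP is below the threshold."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    kept = []
    for i in order:
        if all(r2[i][j] < r2_threshold for j in kept):
            kept.append(i)
    return kept


def apply_threshold(kept, pvalues, p_cutoff):
    """Retain clumped SNPs whose association p-value passes the cutoff."""
    return [i for i in kept if pvalues[i] <= p_cutoff]


# Hypothetical toy data: SNP 0 and SNP 1 are in strong LD (r2 = 0.8)
p = [1e-8, 1e-6, 0.03, 0.4]
ld = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]
index_snps = clump(p, ld)                       # SNP 1 is dropped in favour of SNP 0
selected = apply_threshold(index_snps, p, 0.05)
```

Testing thresholds "in a continuous manner" then amounts to repeating `apply_threshold` for each distinct p-value among the clumped SNPs and keeping the cutoff that predicts best.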
  • LDpred adjusts genome-wide association studies (GWAS) summary statistics for the effects of linkage disequilibrium, providing re-weighted effect estimates that are then used in gene scores
  • LDpred was run as recommended by the authors, including both the data synchronization and LDpred steps. LDpred requires specification of the fraction of SNPs assumed to be causal. For each model, causal fractions of 1 (infinitesimal), 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001 were tested as recommended. Results are presented using the causal fraction giving the best results only. A heritability estimate is also required by the algorithm and is estimated from summary association statistics by LDpred. As a sensitivity analysis, heritability estimates given by the variance component models in HRS were additionally used. Results were consistent and only the default option is shown.
  • Polygenic genetic variance (i.e. narrow sense heritability) was estimated for height and BMI in HRS and GENEVA using variance components, as implemented in GCTA (Yang, J., Lee, S.H. et al., supra). All LD measures or related estimates described herein were derived from HRS genotypes.
  • the exemplary boosted regression trees implementation of the method 200A led to significant improvements in prediction R² as compared to existing methods.
  • the methods 200 and 200A leverage the large number of genetic variants reported in genome-wide association studies (GWAS) to train boosted regression trees models.
  • Regression trees can capture nonlinear effects and higher-order interactions while the boosting algorithm combines individually weak predictors to produce a strong classifier that enables a better prediction of genetic effects.
  • Boosted regression trees are powerful and versatile methods that combine otherwise weak classifiers to produce a strong learner for continuous outcome prediction (Hastie et al. , supra). They are thus well-suited for prediction of SNP gene score weights (where
  • Boosted regression trees can be expressed in terms of a regression function f of trees over the input variables. The gradient boost algorithm aims to minimize the expected square error loss with respect to f iteratively on weighted versions of the training data.
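  • To make the boosting step concrete, here is a minimal single-feature sketch in which each new tree is a one-split stump fitted to the current residuals, which is the usual form of gradient boosting under squared error loss. It is not the implementation used for the reported results; the names `fit_stump`, `boost` and `predict`, the learning rate, and the toy data are all assumptions.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes squared error of the residuals.
    Requires at least two distinct values in x."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1], best[2], best[3]


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then add shrunken stumps fitted to
    the residuals (the negative gradient of squared error loss)."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [pi + learning_rate * (lm if xi <= t else rm)
                for xi, pi in zip(x, pred)]
    return f0, learning_rate, stumps


def predict(model, xi):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(lm if xi <= t else rm for t, lm, rm in stumps)


# Hypothetical toy data: a step function of one predictor variable
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = boost(x, y)
```

In the setting described here, each training row would be a SNP, `x` would be replaced by the full set of SNP-level predictor variables, and `y` by the quantity used to derive the gene score weights.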
  • SNP annotations are included as inputs ) and their contributions are
  • SNPs are divided into 5 distinct sets of contiguous SNPs (to avoid LD spillover) and fitted w pred which are used in calculation of the actual gene scores derived using the regression trees models trained on the remaining 4 sets (the SNPs may be divided into more or fewer distinct sets as well).
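  • The partitioning scheme above (contiguous SNP sets, with the model for each set trained only on the other sets) can be sketched as follows. The function names are hypothetical, and SNPs are assumed to be indexed in genome order.

```python
def contiguous_sets(n_snps, n_sets=5):
    """Split SNP indices 0..n_snps-1 (in genome order) into contiguous blocks."""
    bounds = [(k * n_snps) // n_sets for k in range(n_sets + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(n_sets)]


def leave_one_set_out(n_snps, n_sets=5):
    """Yield (training indices, held-out indices) pairs: the model that
    predicts weights for one block never sees the SNPs in that block."""
    sets = contiguous_sets(n_snps, n_sets)
    for k, held_out in enumerate(sets):
        train = [i for j, s in enumerate(sets) if j != k for i in s]
        yield train, held_out


splits = list(leave_one_set_out(12, n_sets=5))
```

Keeping each block contiguous, rather than assigning SNPs to sets at random, means that a held-out SNP has LD partners in the training data only near the block boundaries.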
  • Regional associations were derived from the 100 (herein referred to as “regional association”) or 50 (herein referred to as “small regional association”) SNPs upstream and downstream of each SNP according to the method previously described (Pare et al., supra). Regional associations reflect the overall impact of a set of contiguous SNPs on genetic variance. The models also included measures of tissue-specific eQTL, which were calculated by summing the -log(p-value) of association of each SNP with genes in each tissue (limiting to significant eQTL associations). Finally, general SNP annotations such as allele frequency in Europeans (from GIANT), LDScore (Bulik-Sullivan et al.
  • regulome score (Boyle, A.P., Hong, E.L., Hariharan, M., Cheng, Y., Schaub, M.A., Kasowski, M., Karczewski, K.J., Park, J., Hitz, B.C., Weng, S., et al. (2012). Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 22, 1790-1797) or linkage disequilibrium with putative disruptive coding mutations (based on CADD score (Kircher, M., Witten, D.M., Jain, P., O'Roak, B.J., Cooper, G.M., and Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics 46, 310-315.)) were included. A full list of independent variables is provided in Table 2 below:
  • each predictor variable was first tested for association with in standard linear regression
  • the methods 200 and 200A described above are based on the premise that SNPs contribute additively to genetic variance. While empirical evidence suggests this holds true in most cases, the effectiveness of the methods 200 and 200A may be hampered in genomic regions where strong genetic interactions are present (e.g. HLA), and alternative methods such as LDpred might be better suited (Vilhjalmsson et al., supra). Second, there is a possibility that gene scores derived using the methods 200 and 200A are inherently population-specific.
  • the above-described correction method for LD also has several benefits such as simplicity, use of summary association statistics and intrinsic robustness to minor misspecification of LD or association strength.
  • the above-described correction method for LD is merely one exemplary method for correcting the genotypic weightings in a predictive gene score for LD; other approaches to correcting for LD, whether now known or hereafter developed, may also be used.
  • the exemplary methods 200 and 200A described above may be applied to improve the prediction of polygenic traits using gene scores. The test results show that for the classic polygenic traits height and BMI, 46.6% and 44.4% of the estimated polygenic genetic variance can be captured by gene scores generated using boosted regression tree
  • the methods for developing predictive gene scores as described herein represent significantly more than merely using categories to organize, store and transmit information and organizing information through mathematical correlations.
  • the methods for developing predictive gene scores are in fact an improvement to the technology of eukaryote genomic prediction, as they define logical structures and processes that provide an improvement in computer capabilities in tuning predictive associations to a target population when developing predictive gene scores.
  • the methods described herein require linear partitioning of sequentially arranged data, these methods are limited to operations on the eukaryote genome, which possesses the required linear, sequential arrangement of data.
  • the present technology is confined to eukaryote genomic prediction applications, and more particularly to computerized implementations of eukaryote genomic prediction in machine learning applications.
  • the present technology may be embodied within a system, a method, a computer program product or any combination thereof.
  • the computer program product may include a computer readable storage medium or media having computer readable program instructions thereon for causing a processor to carry out aspects of the present technology.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present technology may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language or a conventional procedural programming language.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on one or more remote computers or entirely on the remote computer(s) or server(s).
  • the remote computer(s) may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present technology.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures.
  • each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams can be implemented by computer program instructions.
  • These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • An illustrative computer system in respect of which the technology herein described may be implemented is presented as a block diagram in Figure 7.
  • the illustrative computer system is denoted generally by reference numeral 700 and includes a display 702, input devices in the form of keyboard 704A and pointing device 704B, computer 706 and external devices 708. While pointing device 704B is depicted as a mouse, it will be appreciated that other types of pointing device, including for example a touch-screen interface, may also be used.
  • the computer 706 may contain one or more processors or microprocessors, such as a central processing unit (CPU) 710.
  • the CPU 710 performs arithmetic calculations and control functions to execute software stored in an internal memory 712, preferably random access memory (RAM) and/or read only memory (ROM), and possibly additional memory 714.
  • the additional memory 714 may include, for example, mass memory storage, hard disk drives, optical disk drives (including CD and DVD drives), magnetic disk drives, magnetic tape drives (including LTO, DLT, DAT and DCC), flash drives, program cartridges and cartridge interfaces such as those found in video game devices, removable memory chips such as EPROM or PROM, emerging storage media, such as holographic storage, or similar storage media as known in the art.
  • This additional memory 714 may be physically internal to the computer 706, or external as shown in Figure 7, or both.
  • the computer system 700 may also include other similar means for allowing computer programs or other instructions to be loaded.
  • Such means can include, for example, a communications interface 716 which allows software and data to be transferred between the computer system 700 and external systems and networks.
  • communications interface 716 can include a modem, a network interface such as an Ethernet card, a wireless communication interface, or a serial or parallel communications port.
  • Software and data transferred via communications interface 716 are in the form of signals which can be electronic, acoustic, electromagnetic, optical or other signals capable of being received by communications interface 716. Multiple interfaces, of course, can be provided on a single computer system 700.
  • I/O interface 718 administers control of the display 702, keyboard 704A, external devices 708 and other such components of the computer system 700.
  • the computer 706 also includes a graphical processing unit (GPU) 720. The latter may also be used for computational purposes as an adjunct to, or instead of, the CPU 710, for mathematical calculations.
  • the various components of the computer system 700 are coupled to one another either directly or by coupling to suitable buses.
  • the term "computer system” and related terms, as used herein, is not limited to any particular type of computer system and encompasses servers, desktop computers, laptop computers, networked mobile wireless telecommunication computing devices such as smartphones, tablet computers, as well as other types of computer systems.
  • computer readable program code for implementing aspects of the technology described herein may be contained or stored in the memory 712 of the computer 706, or on a computer usable or computer readable medium external to the computer 706, or on any combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Physiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Ecology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a computer-implemented method for tuning associations between genetic variants of sequence elements of a subject genome and a target biological characteristic to a target population. The method notionally partitions the subject genome into distinct contiguous genome segments and derives a tuning function for each genome segment. The tuning function tunes the associations to the target population, and the derivation of the tuning function for each genome segment excludes sequence elements in the genome segment concerned, to avoid overfitting.
PCT/CA2017/051126 2016-09-26 2017-09-25 Tuning of associations for predictive gene scoring WO2018053647A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/336,406 US20190228838A1 (en) 2016-09-26 2017-09-25 Tuning of Associations For Predictive Gene Scoring
EP17852044.1A EP3516565A4 (fr) 2016-09-26 2017-09-25 Tuning of associations for predictive gene scoring
CA3001257A CA3001257C (fr) 2016-09-26 2017-09-25 Tuning of associations for predictive gene scoring

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662399783P 2016-09-26 2016-09-26
US62/399,783 2016-09-26

Publications (1)

Publication Number Publication Date
WO2018053647A1 true WO2018053647A1 (fr) 2018-03-29

Family

ID=61690105

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2017/051126 WO2018053647A1 (fr) 2016-09-26 2017-09-25 Tuning of associations for predictive gene scoring

Country Status (4)

Country Link
US (1) US20190228838A1 (fr)
EP (1) EP3516565A4 (fr)
CA (1) CA3001257C (fr)
WO (1) WO2018053647A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4150624A4 (fr) * 2020-05-15 2024-06-12 The Scripps Research Institute Adjusted polygenic risk scores and calculation method

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
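CN110910955B (zh) * 2019-10-21 2024-03-01 中山大学 (Sun Yat-sen University) Method for establishing a longitudinal analysis model of rare variant sites in susceptibility genes
CN111489788B (zh) * 2020-03-27 2022-05-20 北京航空航天大学 (Beihang University) Deep association kernel learning system for explaining the genetic relationships of complex diseases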
US20210375392A1 (en) * 2020-05-27 2021-12-02 23Andme, Inc. Machine learning platform for generating risk models

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080163824A1 (en) * 2006-09-01 2008-07-10 Innovative Dairy Products Pty Ltd, An Australian Company, Acn 098 382 784 Whole genome based genetic evaluation and selection process

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016172464A1 (fr) * 2015-04-22 2016-10-27 Genepeeks, Inc. Device, system and method for assessing risk of variant-specific gene dysfunction

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080163824A1 (en) * 2006-09-01 2008-07-10 Innovative Dairy Products Pty Ltd, An Australian Company, Acn 098 382 784 Whole genome based genetic evaluation and selection process

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PARE ET AL.: "A machine-learning heuristic to improve gene score prediction of polygenic traits", SCIENTIFIC REPORTS, 10 April 2017 (2017-04-10), XP055500228, Retrieved from the Internet <URL:https://www.nature.com/articles/s41598-017-13056-1> [retrieved on 20170712] *
See also references of EP3516565A4 *

Also Published As

Publication number Publication date
EP3516565A4 (fr) 2020-06-10
CA3001257C (fr) 2020-04-14
US20190228838A1 (en) 2019-07-25
EP3516565A1 (fr) 2019-07-31
CA3001257A1 (fr) 2018-03-29

Similar Documents

Publication Publication Date Title
Sefid Dashti et al. A practical guide to filtering and prioritizing genetic variants
EP3621080B1 (fr) Error reduction in predicted genetic relationships
MacLellan et al. Systems-based approaches to cardiovascular disease
EP2773954B1 (fr) Systems and methods for genomic annotation and distributed variant interpretation
Shabalin et al. Merging two gene-expression studies via cross-platform normalization
US10235496B2 (en) Systems and methods for genomic annotation and distributed variant interpretation
CA3001257C (fr) Tuning of associations for predictive gene scoring
Yamamoto et al. Tissue-specific impacts of aging and genetics on gene expression patterns in humans
Mughal et al. Localizing and classifying adaptive targets with trend filtered regression
Sul et al. Accurate and fast multiple-testing correction in eQTL studies
van Der Graaf et al. Mendelian randomization while jointly modeling cis genetics identifies causal relationships between gene expression and lipids
Kristmundsdóttir et al. popSTR: population-scale detection of STR variants
Kim et al. Strelka2: fast and accurate variant calling for clinical sequencing applications
Kao et al. naiveBayesCall: an efficient model-based base-calling algorithm for high-throughput sequencing
US20190311785A1 (en) Systems and methods for genomic annotation and distributed variant interpretation
Bao et al. Genome-wide association studies using a penalized moving-window regression
Le et al. Nearest-neighbor Projected-Distance Regression (NPDR) for detecting network interactions with adjustments for multiple tests and confounding
Kyriazopoulou-Panagiotopoulou et al. Reconstruction of genealogical relationships with applications to Phase III of HapMap
Zu et al. Ultra-high dimensional quantile regression for longitudinal data: an application to blood pressure analysis
Hu et al. Group-combined P-values with applications to genetic association studies
Sung et al. An efficient gene–gene interaction test for genome-wide association studies in trio families
Zou et al. Inferring parental genomic ancestries using pooled semi-Markov processes
Yang et al. Analysis of homozygosity disequilibrium using whole-genome sequencing data
Zhou et al. Boosting gene mapping power and efficiency with efficient exact variance component tests of single nucleotide polymorphism sets
Yorgov et al. Use of admixture and association for detection of quantitative trait loci in the Type 2 Diabetes Genetic Exploration by Next-Generation Sequencing in Ethnic Samples (T2D-GENES) study

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 3001257

Country of ref document: CA

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852044

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2017852044

Country of ref document: EP