NZ788045A - Deep convolutional neural networks for variant classification
- Publication number: NZ788045A
- Authority: NZ (New Zealand)
- Prior art keywords: variants, human, missense, variant, benign
Abstract
The technology disclosed relates to constructing a convolutional neural network-based classifier for variant classification. In particular, it relates to training a convolutional neural network-based classifier on training data using a backpropagation-based gradient update technique that progressively matches outputs of the convolutional neural network-based classifier with corresponding ground truth labels. The convolutional neural network-based classifier comprises groups of residual blocks. Each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a convolution window size of the residual blocks, and an atrous convolution rate of the residual blocks; the convolution window size varies between groups of residual blocks, and the atrous convolution rate varies between groups of residual blocks. The training data includes benign training examples and pathogenic training examples of translated sequence pairs generated from benign variants and pathogenic variants.
Description
DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION

APPENDIX
The Appendix includes a bibliography of potentially relevant references listed in a paper authored by the inventors. The subject matter of the paper is covered in the US provisional applications to which this application claims priority or benefit. These references can be made available by the Counsel upon request or may be accessible via Global Dossier. The paper is the first listed reference.
PRIORITY APPLICATIONS
This application claims priority to or the benefit of US Provisional Patent Application No. 62/573,144, titled "Training a Deep Pathogenicity Classifier Using Large-Scale Benign Training Data," by Hong Gao, Kai-How Farh, Laksshman Sundaram and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IPPRV); US Provisional Patent Application No. 62/573,149, titled "Pathogenicity Classifier Based On Deep Convolutional Neural Networks (CNNs)," by Kai-How Farh, Laksshman Sundaram, Samskruthi Reddy Padigepati and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IPPRV); US Provisional Patent Application No. 62/573,153, titled "Deep Semi-Supervised Learning that Generates Large-Scale Pathogenic Training Data," by Hong Gao, Kai-How Farh, Laksshman Sundaram and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IPPRV); and US Provisional Patent Application No. 62/582,898, titled "Pathogenicity Classification of Genomic Data Using Deep Convolutional Neural Networks (CNNs)," by Hong Gao, Kai-How Farh and Laksshman Sundaram, filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IPPRV). These provisional applications are hereby incorporated by reference for all purposes.
INCORPORATIONS
The following are incorporated by reference for all purposes as if fully set forth herein:
PCT Patent Application No. PCT/US18/55840, titled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," by Hong Gao, Kai-How Farh, Laksshman Sundaram and Jeremy Francis McRae, filed on October 15, 2018 (Attorney Docket No. ILLM 1000-8/IPPCT), subsequently published as PCT Publication No. WO ____________.
PCT Patent Application No. PCT/US18/________, titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," by Laksshman Sundaram, Kai-How Farh, Hong Gao and Jeremy Francis McRae, filed contemporaneously on October 15, 2018 (Attorney Docket No. ILLM 1000-10/IPPCT), subsequently published as PCT Publication No. WO ____________.
US Nonprovisional Patent Application titled "DEEP LEARNING-BASED QUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," by Hong Gao, Kai-How Farh, man Sundaram and Jeremy Francis McRae, ney Docket No. ILLM /IPUS) filed contemporaneously.
US Nonprovisional Patent Application titled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," by Laksshman Sundaram, w Farh, Hong Gao and Jeremy Francis McRae, (Attorney Docket No. ILLM 1000-6/IPUS) filed contemporaneously.
US Nonprovisional Patent Application titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," by Laksshman Sundaram, Kai-How Farh, Hong Gao and Jeremy Francis McRae (Attorney Docket No. ILLM 1000-7/IPUS), filed contemporaneously.
Document 1 – A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO," arXiv:1609.03499, 2016;
Document 2 – S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta and M. Shoeybi, "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH," arXiv:1702.07825, 2017;
Document 3 – F. Yu and V. Koltun, "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS," arXiv:1511.07122, 2016;
Document 4 – K. He, X. Zhang, S. Ren, and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION," arXiv:1512.03385, 2015;
Document 5 – R. K. Srivastava, K. Greff, and J. Schmidhuber, "HIGHWAY NETWORKS," arXiv:1505.00387, 2015;
Document 6 – G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger, "DENSELY CONNECTED CONVOLUTIONAL NETWORKS," arXiv:1608.06993, 2017;
Document 7 – C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "GOING DEEPER WITH CONVOLUTIONS," arXiv:1409.4842, 2014;
Document 8 – S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT," arXiv:1502.03167, 2015;
Document 9 – J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE," arXiv:1704.03669, 2017;
Document 10 – L. C. Piqueras, "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION," Tampere University of Technology, 2016;
Document 11 – J. Wu, "Introduction to Convolutional Neural Networks," Nanjing University, 2017;
Document 12 – I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "CONVOLUTIONAL NETWORKS," Deep Learning, MIT Press, 2016; and
Document 13 – J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang, "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS," arXiv:1512.07108, 2017.
Document 1 describes deep convolutional neural network architectures that use groups of residual blocks with convolution filters having the same convolution window size, batch normalization layers, rectified linear unit (abbreviated ReLU) layers, dimensionality altering layers, atrous convolution layers with exponentially growing atrous convolution rates, skip connections, and a softmax classification layer to accept an input sequence and produce an output sequence that scores entries in the input sequence. The technology disclosed uses neural network components and parameters described in Document 1. In one implementation, the technology disclosed modifies the parameters of the neural network components described in Document 1. For instance, unlike in Document 1, the atrous convolution rate in the technology disclosed progresses non-exponentially from a lower residual block group to a higher residual block group. In another example, unlike in Document 1, the convolution window size in the technology disclosed varies between groups of residual blocks.
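For illustration only, the following is a minimal sketch of this group-wise parameterization, assuming Keras-style layers and invented hyperparameters; it is not the PrimateAI architecture itself (example architecture details appear in Supplementary Table 16), but it shows residual block groups whose convolution window size varies between groups and whose atrous convolution rate grows non-exponentially:

```python
# Minimal sketch of residual block groups parameterized by filter count,
# convolution window size, and atrous (dilation) rate. Hyperparameters and the
# input shape are illustrative assumptions, not values from the disclosure.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, window_size, atrous_rate):
    """Batch norm -> ReLU -> atrous 1D convolution, twice, plus a skip connection."""
    shortcut = x
    for _ in range(2):
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Conv1D(filters, window_size,
                          dilation_rate=atrous_rate, padding="same")(x)
    return layers.Add()([shortcut, x])  # residual connection

inputs = tf.keras.Input(shape=(51, 20))           # e.g., 51 residues, 20-dim encoding
x = layers.Conv1D(32, 1, padding="same")(inputs)  # dimensionality altering 1x1 conv
# Window size and atrous rate vary between groups; the atrous rate progresses
# non-exponentially (1, 2, 3) rather than exponentially (1, 2, 4).
for window_size, atrous_rate in [(5, 1), (5, 2), (3, 3)]:
    x = residual_block(x, filters=32, window_size=window_size,
                       atrous_rate=atrous_rate)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```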
Document 2 describes details of the deep convolutional neural network architectures described in Document 1.
Document 3 describes atrous convolutions used by the technology disclosed. As used herein, atrous convolutions are also referred to as "dilated convolutions". Atrous/dilated convolutions allow for large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution where the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel so that neighboring input entries (e.g., nucleotides, amino acids) at larger intervals are considered when a convolution operation is performed. This enables incorporation of long-range contextual dependencies in the input. The atrous convolutions conserve partial convolution calculations for reuse as adjacent nucleotides are processed.
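A small worked sketch of this skipping behavior follows, assuming a plain NumPy implementation without padding; with kernel length k and dilation factor d, each output entry sees (k - 1) * d + 1 input positions:

```python
# Illustrative NumPy sketch of a 1D atrous/dilated convolution (no padding).
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    k = len(kernel)
    span = (k - 1) * dilation + 1                       # receptive field per output
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start : start + span : dilation]  # every d-th input entry
        out.append(np.dot(taps, kernel))
    return np.array(out), span

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
for d in (1, 2, 3):
    _, span = dilated_conv1d(signal, kernel, d)
    print(f"dilation {d}: each output sees {span} input positions")
# dilation 1 -> 3, dilation 2 -> 5, dilation 3 -> 7, with the same 3 weights
```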
Document 4 describes residual blocks and residual connections used by the technology disclosed.
Document 5 describes skip connections used by the technology disclosed. As used herein, skip connections are also referred to as "highway networks".
Document 6 describes densely connected convolutional network architectures used by the technology disclosed.
Document 7 describes dimensionality altering convolution layers and modules-based processing pipelines used by the technology disclosed. One example of a dimensionality altering convolution is a 1 x 1 convolution.
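As an illustrative sketch (the shapes are assumptions), a 1 x 1 convolution mixes channels at each position without looking at neighbors, so it can shrink or grow the channel depth:

```python
# A 1x1 convolution altering channel dimensionality from 256 to 64.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 51, 256))            # (batch, positions, channels)
reduced = layers.Conv1D(64, kernel_size=1)(x)
print(reduced.shape)                          # (1, 51, 64)
```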
Document 8 describes batch normalization layers used by the technology disclosed. Document 9 also describes atrous/dilated convolutions used by the technology disclosed.
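A minimal NumPy sketch of the batch normalization transform from Document 8 follows (gamma and beta are learned scale and shift parameters; the batch here is a placeholder):

```python
# Batch normalization forward pass: normalize each feature over the mini-batch,
# then scale and shift with learned parameters gamma and beta.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps guards against divide-by-zero
    return gamma * x_hat + beta

batch = np.random.randn(8, 4)                 # 8 examples, 4 features
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1
```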
Document 10 describes various architectures of deep neural networks that can be used by the technology disclosed, including convolutional neural networks, deep convolutional neural networks, and deep convolutional neural networks with atrous/dilated convolutions.
Document 11 describes details of a convolutional neural network that can be used by the technology disclosed, including algorithms for training a convolutional neural network with subsampling layers (e.g., pooling) and fully-connected layers.
Document 12 describes details of various convolution operations that can be used by the technology disclosed.
Document 13 describes various architectures of convolutional neural networks that can be used by the technology disclosed.
INCORPORATION BY REFERENCE OF TABLES SUBMITTED ELECTRONICALLY WITH APPLICATION
The following table files in ASCII text format are submitted with this application and incorporated by reference. The names, creation dates and sizes of the files are:

SupplementaryTable1.txt October 2, 2018 13 KB
SupplementaryTable2.txt October 2, 2018 13 KB
SupplementaryTable3.txt October 2, 2018 11 KB
SupplementaryTable4.txt October 2, 2018 13 KB
SupplementaryTable6.txt October 2, 2018 12 KB
SupplementaryTable7.txt October 2, 2018 44 KB
SupplementaryTable13.txt October 2, 2018 119 KB
SupplementaryTable18.txt October 2, 2018 35 KB
SupplementaryTable20.txt October 2, 2018 1027 KB
SupplementaryTable20Summary.txt October 2, 2018 9 KB
SupplementaryTable21.txt October 2, 2018 24 KB
SupplementaryTable18.txt October 4, 2018 35 KB
DataFileS1.txt October 4, 2018 138 MB
DataFileS2.txt October 4, 2018 980 MB
DataFileS3.txt October 4, 2018 1.01 MB
DataFileS4.txt October 4, 2018 834 KB
Pathogenicity_prediction_model.txt October 4, 2018 8.24 KB

Supplementary Table 1: Details of the variants from each species used in the analysis. The table includes the intermediate results in the pipeline for each of these data sources. Note this table is provided in SupplementaryTable1.txt.
Supplementary Table 2: Depletion of missense variants present in other species at common human allele frequencies. The depletion was calculated based on the missense:synonymous ratio in common variants (> 0.1%) compared to rare variants (< 0.1%), using variants that were identical-by-state between human and the other species. Note this table is provided in SupplementaryTable2.txt.
Supplementary Table 3: Depletion of missense variants present in other species at common human allele frequencies, restricted only to genes with > 50% mean nucleotide conservation between human and other mammals. The depletion was calculated based on the missense:synonymous ratio in common variants (> 0.1%) compared to rare variants (< 0.1%), using variants that were identical-by-state between human and the other species.
Note this table is provided in SupplementaryTable3.txt.
Supplementary Table 4: Depletion of missense variants present as fixed substitutions in related species pairs at common human allele frequencies. The depletion was calculated based on the missense:synonymous ratio in common variants (> 0.1%) compared to rare variants (< 0.1%), using variants that were identical-by-state between human and the related species pair. Note this table is provided in SupplementaryTable4.txt.
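For clarity, here is a hedged worked example of the depletion calculation used in Supplementary Tables 2-4, assuming depletion is the fractional drop of the missense:synonymous ratio at common allele frequencies (> 0.1%) relative to rare ones (< 0.1%); the counts are invented for illustration:

```python
# Worked example of the depletion metric; all counts are hypothetical.
rare_missense, rare_synonymous = 1200, 1000
common_missense, common_synonymous = 800, 1000

msr_rare = rare_missense / rare_synonymous        # 1.20
msr_common = common_missense / common_synonymous  # 0.80
depletion = 1 - msr_common / msr_rare             # ~0.33
print(f"depletion of missense variants: {depletion:.0%}")  # 33%
```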
Supplementary Table 6: Domain-specific annotation of the SCN2A gene. Wilcoxon rank-sum p-values indicate the divergence of PrimateAI scores in the specific domain in comparison with the entire protein. The domains highlighted in bold cover approximately 7% of the protein but have the majority of ClinVar pathogenic annotations. This correlates well with the average PrimateAI scores for the domains, which are the top 3 pathogenic domains according to the PrimateAI model. Note this table is provided in SupplementaryTable6.txt.
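As an illustrative sketch of such a comparison (the score arrays below are invented placeholders, not data from the study), a Wilcoxon rank-sum test can be run with scipy.stats.ranksums:

```python
# Compare PrimateAI-style scores inside one domain against the whole protein.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
whole_protein_scores = rng.uniform(0.0, 1.0, size=2000)  # all positions
domain_scores = rng.uniform(0.4, 1.0, size=140)          # one annotated domain

stat, p_value = ranksums(domain_scores, whole_protein_scores)
print(f"rank-sum statistic = {stat:.2f}, p = {p_value:.2e}")
```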
Supplementary Table 7: Raw counts used in calculating the effect of allele frequency on the expected missense:synonymous ratio. Expected counts of synonymous and missense variants were estimated based on variants in intronic regions, using trinucleotide context to control for mutational rate and gene conversion. Note this table is provided in SupplementaryTable7.txt.
Supplementary Table 13: List of protein names from the Protein DataBank (PDB) used for training the deep learning models for 3-state secondary structure and 3-state solvent accessibility prediction. The label column indicates if the proteins are used in the training/validation/testing phases of model training. Note this table is provided in SupplementaryTable13.txt.
Supplementary Table 18: List of 605 genes that were nominally significant for disease association in the DDD study, calculated from protein-truncating variation only (p < 0.05). Note this table is provided in SupplementaryTable18.txt.
Supplementary Table 20: Results of testing for enrichment of de novo mutations (DNMs) per gene, for all genes with at least one observed DNM. P-values are provided when including all DNMs and after removing missense DNMs with PrimateAI scores < 0.803. FDR-corrected P-values are similarly provided. Counts of observed protein truncating (PTV) and missense DNMs are included, from just the DDD cohort and from the full meta-analysis cohort. Similar counts of observed and expected missense DNMs are also included, firstly when including all missense DNMs, and secondly after removing all missense DNMs with a PrimateAI score < 0.803. Note this table is provided in SupplementaryTable20.txt and SupplementaryTable20Summary.txt.
Supplementary Table 21: Results from testing enrichment of de novo mutations in genes with FDR < 0.1. Counts of observed protein truncating (PTV) de novo mutations, and counts of other protein-altering de novo mutations, once with all missense de novo mutations and once with only damaging missense mutations, are included. P-values when including all missense sites versus P-values after excluding low-scoring missense sites are provided. Note this table is provided in SupplementaryTable21.txt.
DataFileS1: List of all variants present in other species. The ClinVar Significance column contains the available non-conflicting ClinVar annotations. Note this table is provided in DataFileS1.txt.
DataFileS2: List of all fixed substitutions from related species pairs. Note this table is provided in DataFileS2.txt.
DataFileS3: List of withheld benign test variants IBS with primates. The benign test variants are non-common human variants that are IBS with >= 1 primate species. Note this table is provided in DataFileS3.txt.
DataFileS4: List of unlabeled variants IBS with primates matched to withheld benign test variants. The unlabeled variants are matched with benign test variants for mutational rate, coverage biases and alignability with primate genomes. Note this table is provided in DataFileS4.txt.
Pathogenicity_prediction_model: Code in the Python programming language that enables the technology disclosed according to one implementation. Note this code file is provided in Pathogenicity_prediction_model.txt.
FIELD OF THE TECHNOLOGY DISCLOSED
The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
In particular, the technology disclosed relates to using deep learning-based techniques for training deep convolutional neural networks.
BACKGROUND
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Machine Learning
In machine learning, input variables are used to predict an output variable. The input variables are often called features and are denoted by X = (X1, X2, ..., Xk), where each Xi, i ∈ 1, ..., k is a feature. The output variable is often called the response or dependent variable and is denoted by the variable Yi. The relationship between Y and the corresponding X can be written in a general form:

Y = f(X) + ε

In the equation above, f is a function of the features (X1, X2, ..., Xk) and ε is the random error term.
The error term is independent of X and has a mean value of zero.
In practice, the features X are available without having Y or knowing the exact relation between X and Y. Since the error term has a mean value of zero, the goal is to estimate f.
Ŷ = f̂(X)

In the equation above, f̂ is the estimate of f, which is often considered a black box, meaning that only the relation between the input and output of f̂ is known, but the question why it works remains unanswered.
The function f̂ is found using learning. Supervised learning and unsupervised learning are two ways used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (= labels), the function f̂ is optimized such that it approximates the output. In unsupervised learning, the goal is to find a hidden structure from unlabeled data. The algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning.
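As a minimal sketch of supervised estimation of f (the data are synthetic; an ordinary least-squares fit stands in for whatever learner is used):

```python
# Supervised learning: fit f_hat from labeled pairs (X, Y), where Y = f(X) + error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # features X1..X3
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=200)   # error term has mean zero

# Estimate f_hat from the labeled data; the error term itself is never observed.
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("recovered weights:", w_hat.round(2))        # close to [2.0, -1.0, 0.5]
```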
Neural Networks
The drawings show one implementation of a fully connected neural network with multiple layers. A neural network is a system of interconnected artificial neurons (e.g., a1, a2, a3) that exchange messages between each other.
The illustrated neural network has three inputs, two neurons in the hidden layer and two neurons in the output layer.
The hidden layer has an activation function f(•) and the output layer has an activation function g(•). The connections have numeric weights (e.g., w11, w21, w12, w31, w22, w32, v11, v22) that are tuned during the training process, so that a properly trained network responds correctly when fed an image to recognize. The input layer processes the raw input, and the hidden layer processes the output from the input layer based on the weights of the connections between the input layer and the hidden layer. The output layer takes the output from the hidden layer and processes it based on the weights of the connections between the hidden layer and the output layer. The network includes multiple layers of feature-detecting neurons. Each layer has many neurons that respond to different combinations of inputs from the previous layers. These layers are constructed so that the first layer detects a set of primitive patterns in the input image data, the second layer detects patterns of patterns, and the third layer detects patterns of those patterns.
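A minimal NumPy sketch of this forward pass follows, assuming the illustrated shape (three inputs, two hidden neurons, two output neurons) and arbitrary weight values; the weight names follow the text:

```python
# Forward pass of a small fully connected network with activations f and g.
import numpy as np

f = np.tanh                                  # hidden-layer activation f(.)
g = lambda z: 1.0 / (1.0 + np.exp(-z))       # output-layer activation g(.)

x = np.array([0.5, -1.0, 2.0])               # three inputs
W = np.array([[0.1, 0.4],                    # w11, w12
              [0.2, 0.5],                    # w21, w22
              [0.3, 0.6]])                   # w31, w32
V = np.array([[0.7, 0.0],                    # v11 (remaining entries assumed 0)
              [0.0, 0.8]])                   #             v22

hidden = f(x @ W)                            # hidden-layer outputs
output = g(hidden @ V)                       # network outputs
print(output)
```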
A survey of application of deep learning in genomics can be found in the following publications:
• T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017;
• Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep Learning For Computational Biology. Mol Syst Biol. 2016;12:878;
• Park Y, Kellis M. 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825–826. (doi:10.1038/nbt.3313);
• Min, S., Lee, B. & Yoon, S. Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016);
• Leung MK, Delong A, Alipanahi B et al. Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets 2016; and
• Libbrecht MW, Noble WS. Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015;16(6):321-32.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
- shows one implementation of a feed-forward neural network with multiple layers.
- depicts one implementation of workings of a convolutional neural network.
- depicts a block diagram of training a convolutional neural network in accordance with one implementation of the technology disclosed.
- is one implementation of sub-sampling layers (average/max pooling) in accordance with one implementation of the technology disclosed.
- shows one implementation of a ReLU non-linear layer in accordance with one implementation of the technology disclosed.
- depicts one implementation of a two-layer convolution of the convolution layers.
- depicts a residual connection that reinjects prior information downstream via feature-map addition.
- depicts one implementation of residual blocks and skip-connections.
- shows the batch normalization forward pass.
- illustrates the batch normalization transform at test time.
- shows the batch normalization backward pass.
- depicts use of a batch normalization layer after and before a convolutional or densely connected layer.
- shows one implementation of 1D convolution.
- illustrates how global average pooling (GAP) works.
- illustrates dilated convolutions.
- shows one implementation of stacked dilated convolutions.
- shows an example computing environment in which the technology disclosed can be operated.
- shows an example architecture of a deep residual network for pathogenicity prediction, referred to herein as "PrimateAI".
- depicts a schematic illustration of PrimateAI, the deep learning network architecture for pathogenicity classification.
FIGs. 4A, 4B, and 4C are Supplementary Table 16 that show example model architecture details of the pathogenicity prediction deep learning model PrimateAI.
FIGs. 5 and 6 illustrate the deep learning network architecture used for predicting secondary structure and solvent accessibility of proteins.
FIGs. 7A and 7B are Supplementary Table 11 that show example model architecture details for the 3-state secondary structure prediction deep learning (DL) model.
FIGs. 8A and 8B are Supplementary Table 12 that show example model architecture details for the 3-state solvent accessibility prediction deep learning model.
- depicts one implementation of generating reference and alternative protein sequences from benign and pathogenic variants.
- shows one implementation of aligning reference and alternative protein sequences.
- is one implementation of generating position frequency matrices (abbreviated PFMs), also called position weight matrices (abbreviated PWMs) or position-specific scoring matrices (abbreviated PSSMs).
FIGs. 12, 13, 14, and 15 show processing of the secondary structure and solvent accessibility subnetworks.
depicts operation of a variant pathogenicity classifier. As used herein, the term variant also refers to single nucleotide polymorphisms (abbreviated SNPs) and generally to single nucleotide variants (abbreviated SNVs).
illustrates a residual block.
depicts a neural network architecture of the secondary structure and solvent accessibility subnetworks.
shows a neural network architecture of the variant pathogenicity classifier.
shows predicted pathogenicity score at each amino acid position in the SCN2A gene, annotated for key functional domains.
D shows a comparison of classifiers at predicting benign consequence for a test set of 10,000 common primate variants that were withheld from training.
E illustrates distributions of PrimateAI prediction scores for de novo missense variants occurring in Deciphering Developmental Disorders (DDD) patients compared to unaffected siblings, with corresponding Wilcoxon rank-sum P value.
F depicts comparison of classifiers at separating de novo missense variants in DDD cases versus controls. Wilcoxon rank-sum test P values are shown for each classifier.
A shows enrichment of de novo missense mutations over expectation in affected individuals from the DDD cohort within 605 associated genes that were significant for de novo protein truncating variation.
B depicts distributions of PrimateAI prediction scores for de novo missense variants occurring in DDD patients versus unaffected siblings within the 605 associated genes, with corresponding Wilcoxon rank-sum P value.
C shows comparison of various classifiers at separating de novo missense variants in cases versus controls within the 605 genes.
D depicts comparison of various classifiers, shown on a receiver operator characteristic curve, with area under curve (AUC) indicated for each classifier.
E illustrates classification accuracy and area under curve (AUC) for each classifier.
FIGs. 23A, 23B, 23C, and 23D show impact of data used for training on classification accuracy.
illustrates correcting for the effect of sequencing coverage on the ascertainment of common primate variants.
FIGs. 25A, 25B, 25C, and 26 depict recognition of protein motifs by the disclosed neural networks.
includes a line plot showing the effect of perturbing each position in and around the variant on the predicted deep learning score for the variant.
illustrates correlation patterns of weights that mimic BLOSUM62 and Grantham score matrices.
FIGs. 28A, 28B, and 28C show performance evaluation of the deep learning network PrimateAI and other classifiers.
FIGs. 29A and 29B illustrate distribution of the prediction scores of four classifiers.
FIGs. 30A, 30B, and 30C compare the accuracy of the PrimateAI network and other classifiers at separating pathogenic and benign variants in 605 disease-associated genes.
FIGs. 31A and 31B illustrate correlation between classifier performance on human expert-curated ClinVar variants and performance on empirical datasets.
is Supplementary Table 14 that shows performance of the 3-state secondary structure and 3-state solvent accessibility prediction models on annotated samples from the Protein DataBank.
is Supplementary Table 15 that shows performance comparison of the deep learning network using annotated secondary structure labels of human proteins from the DSSP database.
is Supplementary Table 17 that shows the accuracy values on the 10,000 withheld primate variants and the p-values for de novo variants in DDD cases vs controls for each of the 20 classifiers we evaluated.
is Supplementary Table 19 that shows comparison of the performance of different classifiers on de novo variants in the DDD case vs control dataset, restricted to 605 disease-associated genes.
shows a computing environment of the disclosed semi-supervised learner.
FIGs. 37, 38, 39, 40, and 41 show various cycles of the disclosed semi-supervised learning.
is an illustration of the iterative balanced sampling process.
illustrates one implementation of a computing environment used to generate the benign dataset.
depicts one implementation of generating benign human missense SNPs.
shows one implementation of human orthologous missense SNPs. A missense SNP in a non-human species that has matching reference and alternative codons with humans.
depicts one implementation of classifying, as benign, SNPs of a non-human primate species (e.g., Chimpanzee) with matching reference codons with humans.
depicts one implementation of calculating enrichment scores and comparing them.
depicts one implementation of the benign SNP dataset.
FIGs. 49A, 49B, 49C, 49D, and 49E depict missense/synonymous ratios across the human allele frequency spectrum.
FIGs. 50A, 50B, 50C, and 50D show purifying selection on missense variants identical-by-state with other species.
shows expected missense:synonymous ratios across the human allele frequency spectrum in the absence of purifying selection.
FIGs. 52A, 52B, 52C, and 52D depict missense:synonymous ratios for CpG and non-CpG variants.
FIGs. 53, 54, and 55 illustrate missense:synonymous ratios of human variants identical by state with six primates.
is a simulation showing saturation of new common missense variants discovered by increasing the size of the human cohorts surveyed.
shows accuracy of PrimateAI across different conservation profiles in the genome.
is Supplementary Table 5 that shows contributions to the labeled benign training set from common human variants and variants present in non-human primates.
is Supplementary Table 8 that shows effect of allele frequency on expected missense:synonymous ratio.
is Supplementary Table 9 that shows ClinVar analysis.
is Supplementary Table 10 that shows the number of missense variants from other species found in ClinVar, according to one implementation.
is Table 1 that shows one implementation of discovery of 14 additional candidate genes in intellectual disability.
is Table 2 that shows one implementation of the mean difference in Grantham score between pathogenic and benign variants in ClinVar.
shows one implementation of per-gene enrichment analysis.
shows one implementation of genome-wide enrichment analysis.
is a simplified block diagram of a computer system that can be used to implement the technology disclosed.
DETAILED DESCRIPTION
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Introduction
Convolutional Neural Networks
A convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: Dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant and (2) they can learn spatial hierarchies of patterns.
Regarding the first, after learning a certain pattern in the lower-right corner of a picture, a convolution layer can recognize it anywhere: for example, in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convolutional neural networks data efficient because they need fewer training samples to learn representations that have generalization power.
Regarding the second, a first convolution layer can learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on. This allows convolutional neural networks to efficiently learn increasingly complex and abstract visual concepts.
A convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons arranged in many different layers with activation functions that make the layers dependent. It includes one or more convolutional layers, interspersed with one or more sub-sampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer. The convolutional neural network learns concurrently because the neurons in the same feature map have identical weights. These local shared weights reduce the complexity of the network such that when multi-dimensional input data enters the network, the convolutional neural network avoids the complexity of data reconstruction in the feature extraction and regression or classification process.
Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept "presence of a face in the input," for instance.
For example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.
Convolutions are defined by two key parameters: (1) size of the patches extracted from the inputs – these are typically 1 × 1, 3 × 3 or 5 × 5 and (2) depth of the output feature map – the number of filters computed by the convolution. Often these start with a depth of 32, continue to a depth of 64, and terminate with a depth of 128 or 256. A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the 3D input feature map, stopping at every location, and extracting the 3D patch of surrounding features (shape (window_height, window_width, depth)). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output_depth,). All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3 × 3 windows, the vector output[i, j, :] comes from the 3D patch input[i-1:i+1, j-1:j+1, :]. The full process is detailed in the corresponding figure.
The convolutional neural network comprises convolution layers which perform the convolution operation between the input values and convolution filters (matrices of weights) that are learned over many gradient update iterations during training. Let (m, n) be the filter size and W be the matrix of weights; then a convolution layer performs a convolution of W with the input X by calculating the dot product W · x + b, where x is an instance of X and b is the bias. The step size by which the convolution filters slide across the input is called the stride, and the filter area (m × n) is called the receptive field. The same convolution filter is applied across different positions of the input, which reduces the number of weights learned. It also allows location invariant learning, i.e., if an important pattern exists in the input, the convolution filters learn it no matter where it is in the sequence.
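To make the arithmetic above concrete, the following is a minimal numpy sketch of a single-filter 2D convolution; the array sizes, stride handling, and variable names are illustrative choices, not details of the disclosed network:

```python
import numpy as np

def conv2d(X, W, b, stride=1):
    """Slide an (m, n) filter W over input X with the given stride,
    computing the dot product W . x + b at each receptive field."""
    m, n = W.shape
    H, Wd = X.shape
    out_h = (H - m) // stride + 1
    out_w = (Wd - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i*stride:i*stride+m, j*stride:j*stride+n]
            out[i, j] = np.sum(W * patch) + b   # same filter at every position
    return out

X = np.random.randn(28, 28)       # e.g. a 28 x 28 single-channel input
W = np.random.randn(3, 3)         # one 3 x 3 convolution filter
print(conv2d(X, W, b=0.1).shape)  # -> (26, 26), matching the (28, 28) example above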
Training a Convolutional Neural Network
depicts a block diagram of training a convolutional neural network in accordance with one implementation of the technology disclosed. The convolutional neural network is adjusted or trained so that the input data leads to a specific output estimate. The convolutional neural network is adjusted using back propagation based on a comparison of the output estimate and the ground truth until the output estimate progressively matches or approaches the ground truth.
The convolutional neural network is trained by adjusting the weights between the neurons based on the difference between the ground truth and the actual output. This is mathematically described as:

$$\Delta w_i = x_i \delta \quad \text{where } \delta = (\text{ground truth}) - (\text{actual output})$$

In one implementation, the training rule is defined as:

$$w_{nm} \leftarrow w_{nm} + \alpha (t_m - \varphi_m) a_n$$

In the equation above: the arrow indicates an update of the value; $t_m$ is the target value of neuron $m$; $\varphi_m$ is the computed current output of neuron $m$; $a_n$ is input $n$; and $\alpha$ is the learning rate.
The intermediary step in the training includes generating a feature vector from the input data using the convolution layers. The gradient with respect to the weights in each layer, starting at the output, is calculated. This is referred to as the backward pass, or going backwards. The weights in the network are updated using a combination of the negative gradient and previous weights.
In one implementation, the convolutional neural network uses a stochastic gradient update algorithm (such as ADAM) that performs backward propagation of errors by means of gradient descent. One example of a sigmoid function based back propagation algorithm is described below:

$$\varphi = f(h) = \frac{1}{1 + e^{-h}}$$

In the sigmoid function above, $h$ is the weighted sum computed by a neuron. The sigmoid function has the following derivative:

$$\frac{\partial \varphi}{\partial h} = \varphi (1 - \varphi)$$

The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass. The activation of neuron $m$ in the hidden layers is described as:

$$\varphi_m = \frac{1}{1 + e^{-h_m}}, \quad h_m = \sum_n a_n w_{nm}$$

This is done for all the hidden layers to get the activation described as:

$$\varphi_k = \frac{1}{1 + e^{-h_k}}, \quad h_k = \sum_m \varphi_m v_{mk}$$

Then, the error and the correct weights are calculated per layer. The error at the output is computed as:

$$\delta_{ok} = (t_k - \varphi_k)\, \varphi_k (1 - \varphi_k)$$

The error in the hidden layers is calculated as:

$$\delta_{hm} = \varphi_m (1 - \varphi_m) \sum_k v_{mk}\, \delta_{ok}$$

The weights of the output layer are updated as:

$$v_{mk} \leftarrow v_{mk} + \alpha\, \delta_{ok}\, \varphi_m$$

The weights of the hidden layers are updated using the learning rate $\alpha$ as:

$$w_{nm} \leftarrow w_{nm} + \alpha\, \delta_{hm}\, a_n$$

In one implementation, the convolutional neural network uses a gradient descent optimization to compute the error across all the layers. In such an optimization, for an input feature vector $x$ and the predicted output $\hat{y}$, the loss function is defined as $l$ for the cost of predicting $\hat{y}$ when the target is $y$, i.e. $l(\hat{y}, y)$. The predicted output $\hat{y}$ is transformed from the input feature vector $x$ using function $f$. Function $f$ is parameterized by the weights of the convolutional neural network, i.e. $\hat{y} = f_w(x)$. The loss function is described as $l(\hat{y}, y) = l(f_w(x), y)$, or $Q(z, w) = l(f_w(x), y)$ where $z$ is an input and output data pair $(x, y)$. The gradient descent optimization is performed by updating the weights according to:

$$v_{t+1} = \mu v_t - \alpha \frac{1}{N} \sum_{i=1}^{N} \nabla_w Q(z_i, w_t)$$

$$w_{t+1} = w_t + v_{t+1}$$

In the equations above, $\alpha$ is the learning rate. Also, the loss is computed as the average over a set of $N$ data pairs. The computation is terminated when the learning rate $\alpha$ is small enough upon linear convergence. In other implementations, the gradient is calculated using only selected data pairs fed to a Nesterov's accelerated gradient and an adaptive gradient to inject computation efficiency.
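The sigmoid backpropagation equations above can be traced in a few lines of numpy. This is a toy sketch of one update step for a one-hidden-layer network; the shapes, random data, targets, and learning rate are invented for illustration:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
a = rng.standard_normal(4)          # input activations a_n
W = rng.standard_normal((4, 3))     # input-to-hidden weights w_nm
V = rng.standard_normal((3, 2))     # hidden-to-output weights v_mk
t = np.array([0.0, 1.0])            # target values t_k
alpha = 0.1                         # learning rate

# Forward pass: phi_m = sigmoid(sum_n a_n w_nm), phi_k = sigmoid(sum_m phi_m v_mk)
phi_m = sigmoid(a @ W)
phi_k = sigmoid(phi_m @ V)

# Backward pass: delta_ok = (t_k - phi_k) phi_k (1 - phi_k)
delta_ok = (t - phi_k) * phi_k * (1 - phi_k)
# delta_hm = phi_m (1 - phi_m) sum_k v_mk delta_ok
delta_hm = phi_m * (1 - phi_m) * (V @ delta_ok)

# Weight updates: v_mk += alpha delta_ok phi_m ; w_nm += alpha delta_hm a_n
V += alpha * np.outer(phi_m, delta_ok)
W += alpha * np.outer(a, delta_hm)
```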
In one implementation, the convolutional neural network uses a stochastic gradient descent (SGD) to calculate the cost function. A SGD approximates the gradient with respect to the weights in the loss function by computing it from only one, randomized, data pair $z_t$, described as:

$$v_{t+1} = \mu v_t - \alpha \nabla_w Q(z_t, w_t)$$

$$w_{t+1} = w_t + v_{t+1}$$

In the equations above: $\alpha$ is the learning rate; $\mu$ is the momentum; and $t$ is the current weight state before updating. The convergence speed of SGD is approximately $O(1/t)$ when the learning rate $\alpha$ is reduced both fast and slow enough. In other implementations, the convolutional neural network uses different loss functions such as Euclidean loss and softmax loss. In a further implementation, an Adam stochastic optimizer is used by the convolutional neural network.
Convolution Layers
The convolution layers of the convolutional neural network serve as feature extractors. Convolution layers act as adaptive feature extractors capable of learning and decomposing the input data into hierarchical features. In one implementation, the convolution layers take two images as input and produce a third image as output. In such an implementation, convolution operates on two images in two dimensions (2D), with one image being the input image and the other image, called the "kernel", applied as a filter on the input image, producing an output image. Thus, for an input vector $f$ of length $n$ and a kernel $g$ of length $m$, the convolution $f * g$ of $f$ and $g$ is defined as:

$$(f * g)(i) = \sum_{j=1}^{m} g(j) \cdot f(i - j + m/2)$$

The convolution operation includes sliding the kernel over the input image. For each position of the kernel, the overlapping values of the kernel and the input image are multiplied and the results are added. The sum of products is the value of the output image at the point in the input image where the kernel is centered. The resulting different outputs from many kernels are called feature maps.
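As a hedged illustration of the formula above, the following numpy sketch evaluates (f * g)(i) (in a zero-based variant of the sum over j) at every position where the kernel fully overlaps the input; the example signal and kernel are arbitrary:

```python
import numpy as np

def conv1d(f, g):
    """(f * g)(i) = sum_j g(j) * f(i - j + m/2), evaluated where the
    kernel g (length m) fully overlaps the input f (length n)."""
    n, m = len(f), len(g)
    half = m // 2
    out = []
    for i in range(half, n - half):
        out.append(sum(g[j] * f[i - j + half] for j in range(m)))
    return np.array(out)

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([1.0, 0.0, -1.0])          # a simple edge-detecting kernel
print(conv1d(f, g))                     # compare: np.convolve(f, g, mode='valid')
```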
Once the convolutional layers are trained, they are applied to perform recognition tasks on new inference data. Since the convolutional layers learn from the training data, they avoid explicit feature extraction and implicitly learn from the training data. Convolution layers use convolution filter kernel weights, which are determined and updated as part of the training process. The convolution layers extract different features of the input, which are combined at higher layers. The convolutional neural network uses a various number of convolution layers, each with different convolving parameters such as kernel size, strides, padding, number of feature maps and weights.
Sub-Sampling Layers
is one implementation of sub-sampling layers in accordance with one implementation of the technology disclosed. Sub-sampling layers reduce the resolution of the features extracted by the convolution layers to make the extracted features or feature maps robust against noise and distortion. In one implementation, sub-sampling layers employ two types of pooling operations, average pooling and max pooling. The pooling operations divide the input into non-overlapping two-dimensional regions. For average pooling, the average of the four values in the region is calculated. For max pooling, the maximum value of the four values is selected.
In one implementation, the sub-sampling layers include pooling operations on a set of neurons in the previous layer by mapping its output to only one of the inputs in max pooling and by mapping its output to the average of the input in average pooling. In max pooling, the output of the pooling neuron is the maximum value that resides within the input, as described by:

$$\varphi_o = \max(\varphi_1, \varphi_2, \ldots, \varphi_N)$$

In the equation above, $N$ is the total number of elements within a neuron set.
In average pooling, the output of the pooling neuron is the average value of the input values that reside within the input neuron set, as described by:

$$\varphi_o = \frac{1}{N} \sum_{n=1}^{N} \varphi_n$$

In the equation above, $N$ is the total number of elements within the input neuron set.
In the figure, the input is of size 4 × 4. For 2 × 2 sub-sampling, a 4 × 4 image is divided into four non-overlapping matrices of size 2 × 2. For average pooling, the average of the four values is the whole-integer output.
For max pooling, the maximum value of the four values in the 2 × 2 matrix is the whole-integer output.
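A small numpy sketch of both pooling modes on a 4 × 4 input as discussed above; the input values are invented for illustration:

```python
import numpy as np

def pool2x2(X, mode="max"):
    """2 x 2 sub-sampling: split X into non-overlapping 2 x 2 regions and
    keep either the maximum or the average of each region."""
    H, W = X.shape
    blocks = X.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

X = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 3],
              [4, 6, 5, 7]], dtype=float)
print(pool2x2(X, "max"))   # [[7. 8.] [9. 7.]]
print(pool2x2(X, "avg"))   # [[4. 5.] [5.25 4.]]
```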
Non-Linear Layers
shows one implementation of non-linear layers in accordance with one implementation of the technology disclosed. Non-linear layers use different non-linear trigger functions to signal distinct identification of likely features on each hidden layer. Non-linear layers use a variety of specific functions to implement the non-linear triggering, including the rectified linear units (ReLUs), hyperbolic tangent, absolute of hyperbolic tangent, sigmoid and continuous trigger (non-linear) functions. In one implementation, a ReLU activation implements the function y = max(x, 0) and keeps the input and output sizes of a layer the same. The advantage of using ReLU is that the convolutional neural network is trained many times faster. ReLU is a non-continuous, non-saturating activation function that is linear with respect to the input if the input values are larger than zero and zero otherwise.
Mathematically, a ReLU activation function is described as:

$$\varphi(h) = \max(h, 0) = \begin{cases} h & \text{if } h > 0 \\ 0 & \text{if } h \le 0 \end{cases}$$

In other implementations, the convolutional neural network uses a power unit activation function, which is a continuous, non-saturating function described by:

$$\varphi(h) = (a + bh)^c$$

In the equation above, $a$, $b$ and $c$ are parameters controlling the shift, scale and power respectively.
The power activation function is able to yield x- and y-antisymmetric activation if c is odd and y-symmetric activation if c is even. In some implementations, the unit yields a non-rectified linear activation.
In yet other implementations, the convolutional neural network uses a sigmoid unit activation function, which is a continuous, saturating function described by the following logistic function:

$$\varphi(h) = \frac{1}{1 + e^{-\beta h}}$$

In the equation above, $\beta = 1$. The sigmoid unit activation function does not yield negative activation and is only antisymmetric with respect to the y-axis.
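The three activation functions discussed above can be written directly; the parameter defaults below are arbitrary choices for illustration:

```python
import numpy as np

def relu(h):
    """ReLU: linear for h > 0, zero otherwise."""
    return np.maximum(h, 0)

def power_unit(h, a=0.0, b=1.0, c=2):
    """Power unit: (a + b*h)^c, with shift a, scale b and power c."""
    return (a + b * h) ** c

def sigmoid_unit(h, beta=1.0):
    """Sigmoid unit: 1 / (1 + exp(-beta * h))."""
    return 1.0 / (1.0 + np.exp(-beta * h))

h = np.linspace(-2, 2, 5)   # [-2, -1, 0, 1, 2]
print(relu(h))              # [0. 0. 0. 1. 2.]
print(power_unit(h))        # [4. 1. 0. 1. 4.], y-symmetric since c is even
print(sigmoid_unit(h))      # values in (0, 1), never negative
```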
Convolution Examples
depicts one implementation of a two-layer convolution of the convolution layers. In the figure, an input of size 2048 dimensions is convolved. At convolution 1, the input is convolved by a convolutional layer comprising two channels of sixteen kernels of size 3 × 3. The resulting sixteen feature maps are then rectified by means of the ReLU activation function at ReLU1 and then pooled in Pool 1 by means of average pooling using a sixteen-channel pooling layer with kernels of size 3 × 3. At convolution 2, the output of Pool 1 is then convolved by another convolutional layer comprising sixteen channels of thirty kernels with a size of 3 × 3. This is followed by yet another ReLU2 and average pooling in Pool 2 with a kernel size of 2 × 2. The convolution layers use varying numbers of strides and padding, for example, zero, one, two and three. The resulting feature vector is five hundred and twelve (512) dimensions, according to one implementation.
In other implementations, the convolutional neural network uses different numbers of convolution layers, sub-sampling layers, non-linear layers and fully connected layers. In one implementation, the convolutional neural network is a shallow network with fewer layers and more neurons per layer, for example, one, two or three fully connected layers with one hundred (100) to two hundred (200) neurons per layer. In another implementation, the convolutional neural network is a deep network with more layers and fewer neurons per layer, for example, five (5), six (6) or eight (8) fully connected layers with thirty (30) to fifty (50) neurons per layer.
Forward Pass
The output of a neuron of row $x$, column $y$ in the $l$th convolution layer and $k$th feature map for $f$ number of convolution cores in a feature map is determined by the following equation:

$$O_{x,y}^{(l,k)} = \tanh\left( \sum_{t=0}^{f-1} \sum_{r=0}^{k_h} \sum_{c=0}^{k_w} W_{(r,c)}^{(k,t)}\, O_{(x+r,\, y+c)}^{(l-1,t)} + Bias^{(l,k)} \right)$$

The output of a neuron of row $x$, column $y$ in the $l$th sub-sample layer and $k$th feature map is determined by the following equation:

$$O_{x,y}^{(l,k)} = \tanh\left( W^{(k)} \sum_{r=0}^{S_h} \sum_{c=0}^{S_w} O_{(x \times S_h + r,\, y \times S_w + c)}^{(l-1,k)} + Bias^{(l,k)} \right)$$

The output of an $i$th neuron of the $l$th output layer is determined by the following equation:

$$O^{(l,i)} = \tanh\left( \sum_{j} O^{(l-1,j)}\, W_{(i,j)}^{(l)} + Bias^{(l,i)} \right)$$

Backpropagation
The output deviation of a $k$th neuron in the output layer is determined by the following equation:

$$d(O_k^o) = y_k - t_k$$

The input deviation of a $k$th neuron in the output layer is determined by the following equation:

$$d(I_k^o) = (y_k - t_k)\, \varphi'(v_k) = \varphi'(v_k)\, d(O_k^o)$$

The weight and bias variation of a $k$th neuron in the output layer is determined by the following equation:

$$\Delta W_{k,x}^{o} = d(I_k^o)\, y_{k,x}, \quad \Delta Bias_k^{o} = d(I_k^o)$$

The output bias of a $k$th neuron in the hidden layer is determined by the following equation:

$$d(O_k^H) = \sum_{i} d(I_i^o)\, W_{i,k}$$

The input bias of a $k$th neuron in the hidden layer is determined by the following equation:

$$d(I_k^H) = \varphi'(v_k)\, d(O_k^H)$$

The weight and bias variation in row $x$, column $y$ in an $m$th feature map of a prior layer receiving input from $k$ neurons in the hidden layer is determined by the following equation:

$$\Delta W_{m,x,y}^{H,k} = d(I_k^H)\, O_{m,x,y}, \quad \Delta Bias_k^{H} = d(I_k^H)$$

The output bias of row $x$, column $y$ in an $m$th feature map of sub-sample layer $S$ is determined by the following equation:

$$d(O_{m,x,y}^{S}) = \sum_{k} d(I_k^H)\, W_{m,x,y}^{H,k}$$

The input bias of row $x$, column $y$ in an $m$th feature map of sub-sample layer $S$ is determined by the following equation:

$$d(I_{m,x,y}^{S}) = \varphi'(v)\, d(O_{m,x,y}^{S})$$

The weight and bias variation in row $x$, column $y$ in an $m$th feature map of sub-sample layer $S$ and convolution layer $C$ is determined by the following equation:

$$\Delta W^{S,m} = \sum_{x=0}^{f_h} \sum_{y=0}^{f_w} d(I_{m,[x/2],[y/2]}^{S})\, O_{x,y}^{C,m}, \quad \Delta Bias^{S,m} = \sum_{x=0}^{f_h} \sum_{y=0}^{f_w} d(O_{m,x,y}^{S})$$

The output bias of row $x$, column $y$ in a $k$th feature map of convolution layer $C$ is determined by the following equation:

$$d(O_{x,y}^{C,k}) = d(I_{k,[x/2],[y/2]}^{S})\, W^{S,k}$$

The input bias of row $x$, column $y$ in a $k$th feature map of convolution layer $C$ is determined by the following equation:

$$d(I_{x,y}^{C,k}) = \varphi'(v)\, d(O_{x,y}^{C,k})$$

The weight and bias variation in row $r$, column $c$ in an $m$th convolution core of a $k$th feature map of the $l$th convolution layer $C$ is determined by the following equation:

$$\Delta W_{r,c}^{k,m} = \sum_{x=0}^{f_h} \sum_{y=0}^{f_w} d(I_{x,y}^{C,k})\, O_{x+r,\, y+c}^{l-1,m}, \quad \Delta Bias^{C,k} = \sum_{x=0}^{f_h} \sum_{y=0}^{f_w} d(I_{x,y}^{C,k})$$

Residual Connections
depicts a residual connection that reinjects prior information downstream via feature-map addition. A residual connection comprises reinjecting previous representations into the downstream flow of data by adding a past output tensor to a later output tensor, which helps prevent information loss along the data-processing flow. Residual connections tackle two common problems that plague any large-scale deep-learning model: vanishing gradients and representational bottlenecks. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial. As discussed above, a residual connection comprises making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network.
Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size. If they are of different sizes, a linear transformation to reshape the earlier activation into the target shape can be used. Additional information about residual connections can be found in K. He, X. Zhang, S. Ren, and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION," arXiv:1512.03385, 2015, which is hereby incorporated by reference for all purposes as if fully set forth herein.
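A minimal sketch of this summation, assuming numpy arrays as stand-ins for activation tensors; the projection matrix plays the role of the linear reshaping transformation mentioned above, and all shapes are illustrative:

```python
import numpy as np

def residual_connect(earlier, later, W_proj=None):
    """Sum an earlier output tensor with a later activation.
    If the shapes differ, a linear projection reshapes the earlier
    activation into the target shape first."""
    if earlier.shape != later.shape:
        assert W_proj is not None, "shape mismatch needs a projection"
        earlier = earlier @ W_proj
    return earlier + later   # feature-map addition, not concatenation

x = np.random.randn(8, 64)        # earlier layer output
y = np.random.randn(8, 32)        # later activation of a different width
W = np.random.randn(64, 32)       # learned projection into the target shape
out = residual_connect(x, y, W)   # shape (8, 32)
```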
Residual Learning and Skip-Connections
depicts one implementation of residual blocks and skip-connections. The main idea of residual learning is that the residual mapping is much easier to learn than the original mapping. A residual network stacks a number of residual units to alleviate the degradation of training accuracy. Residual blocks make use of special additive skip connections to combat vanishing gradients in deep neural networks. At the beginning of a residual block, the data flow is separated into two streams: the first carries the unchanged input of the block, while the second applies weights and non-linearities. At the end of the block, the two streams are merged using an element-wise sum. The main advantage of such constructs is to allow the gradient to flow through the network more easily. Additional information about residual blocks and skip-connections can be found in A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO," arXiv:1609.03499, 2016.
Benefiting from residual networks, deep convolutional neural networks (CNNs) can be easily trained and improved accuracy has been achieved for image classification and object detection. Convolutional feed-forward networks connect the output of the $l$th layer as input to the $(l+1)$th layer, which gives rise to the following layer transition: $x_l = H_l(x_{l-1})$. Residual blocks add a skip-connection that bypasses the non-linear transformations with an identity function: $x_l = H_l(x_{l-1}) + x_{l-1}$. An advantage of residual blocks is that the gradient can flow directly through the identity function from later layers to the earlier layers. However, the identity function and the output of $H_l$ are combined by summation, which may impede the information flow in the network.
Dilated Convolutions
illustrates dilated convolutions. Dilated convolutions are sometimes called atrous convolutions, which literally means "with holes". The French name has its origins in the algorithme à trous, which computes the fast dyadic wavelet transform. In this type of convolutional layer, the inputs corresponding to the receptive field of the filters are not neighboring points. This is illustrated in the figure. The distance between the inputs is dependent on the dilation factor.
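The following toy numpy function shows how a dilation factor spaces out the filter taps so that the inputs seen by the filter are not neighboring points; sizes and values are illustrative:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1D convolution whose filter taps are spaced `dilation` apart,
    so the receptive field grows without adding weights."""
    m = len(w)
    span = (m - 1) * dilation + 1          # distance covered by the filter
    out = []
    for i in range(len(x) - span + 1):
        out.append(sum(w[j] * x[i + j * dilation] for j in range(m)))
    return np.array(out)

x = np.arange(16, dtype=float)
w = np.array([1.0, 1.0])
print(dilated_conv1d(x, w, dilation=1))   # adjacent inputs
print(dilated_conv1d(x, w, dilation=4))   # inputs 4 apart: non-neighboring taps
```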
WaveNet
The WaveNet is a deep neural network for generating raw audio waveforms. The WaveNet distinguishes itself from other convolutional networks since it is able to take relatively large 'visual fields' at low cost. Moreover, it is able to add conditioning of the signals locally and globally, which allows the WaveNet to be used as a text-to-speech (TTS) engine with multiple voices, where the text gives the local conditioning and the particular voice the global conditioning.
The main building blocks of the WaveNet are the causal dilated convolutions. As an extension on the causal dilated convolutions, the WaveNet also allows stacks of these convolutions, as shown in the corresponding figure. To obtain the same receptive field with dilated convolutions in this figure, another dilation layer is required. The stacks are a repetition of the dilated convolutions, connecting the outputs of the dilated convolution layers to a single output. This enables the WaveNet to get a large 'visual' field of one output node at a relatively low computational cost. For comparison, to get a visual field of 512 inputs, a fully convolutional network (FCN) would require 511 layers. In the case of a dilated convolutional network, we would need eight layers. The stacked dilated convolutions only need seven layers with two stacks or six layers with four stacks. To get an idea of the differences in computational power required for covering the same visual field, the following table shows the number of weights required in the network with the assumption of one filter per layer and a filter width of two. Furthermore, it is assumed that the network is using binary encoding of the 8 bits.
Network type    No. stacks    No. weights per channel    Total No. of weights
FCN             1             2.6 · 10^5                 2.6 · 10^6
WN              1             1022                       8176
WN              2             1022                       8176
WN              4             508                        4064

The WaveNet adds a skip connection before the residual connection is made, which bypasses all the following residual blocks. Each of these skip connections is summed before passing them through a series of activation functions and convolutions. Intuitively, this is the sum of the information extracted in each layer.
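Under the text's assumption of a filter width of two, the receptive-field arithmetic behind these comparisons can be sketched as below. The dilation schedules shown are common doubling patterns chosen for illustration and are not asserted to be the exact configurations behind the table:

```python
def receptive_field(dilations, filter_width=2):
    """Receptive field of stacked (causal) dilated convolutions:
    each layer adds (filter_width - 1) * dilation inputs."""
    rf = 1
    for d in dilations:
        rf += (filter_width - 1) * d
    return rf

# Plain convolutions (dilation 1 everywhere) grow the field by 1 per layer.
print(receptive_field([1] * 511))                   # 512, needs 511 layers
# Doubling dilations grow the field exponentially with depth.
print(receptive_field([2 ** i for i in range(9)]))  # 1 + 1 + 2 + ... + 256 = 512
```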
Batch Normalization
Batch normalization is a method for accelerating deep network training by making data standardization an integral part of the network architecture. Batch normalization can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect of batch normalization is that it helps with gradient propagation – much like residual connections – and thus allows for deep networks. Some very deep networks can only be trained if they include multiple Batch Normalization layers. Additional information about batch normalization can be found in S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT," arXiv:1502.03167, 2015, which is hereby incorporated by reference for all purposes as if fully set forth herein.
Batch normalization can be seen as yet another layer that can be inserted into the model architecture, just like the fully connected or convolutional layer. The BatchNormalization layer is typically used after a convolutional or densely connected layer. It can also be used before a convolutional or densely connected layer.
Both implementations can be used by the technology disclosed and are shown in the corresponding figure. The BatchNormalization layer takes an axis argument, which specifies the feature axis that should be normalized. This argument defaults to -1, the last axis in the input tensor. This is the correct value when using Dense layers, Conv1D layers, RNN layers, and Conv2D layers with data_format set to "channels_last". But in the niche use case of Conv2D layers with data_format set to "channels_first", the features axis is axis 1; the axis argument in BatchNormalization can be set to 1. Batch normalization provides a definition for feed-forwarding the input and computing the gradients with respect to the parameters and its own input via a backward pass. In practice, batch normalization layers are inserted after a convolutional or fully connected layer, but before the outputs are fed into an activation function. For convolutional layers, the different elements of the same feature map – i.e. the activations – at different locations are normalized in the same way in order to obey the convolutional property. Thus, all activations in a mini-batch are normalized over all locations, rather than per activation.
The internal covariate shift is the major reason why deep architectures have been notoriously slow to train. This stems from the fact that deep networks do not only have to learn a new representation at each layer, but also have to account for the change in their distribution.
The covariate shift in general is a known problem in the deep learning domain and frequently occurs in real-world problems. A common covariate shift problem is the difference in the distribution of the training and test set, which can lead to suboptimal generalization performance. This problem is usually handled with a standardization or whitening preprocessing step. However, especially the whitening operation is computationally expensive and thus impractical in an online setting, especially if the covariate shift occurs throughout different layers.
The internal covariate shift is the phenomenon where the distribution of network activations changes across layers due to the change in network parameters during training. Ideally, each layer should be transformed into a space where they have the same distribution but the functional relationship stays the same. In order to avoid costly calculations of covariance matrices to decorrelate and whiten the data at every layer and step, we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one.

Forward Pass
During the forward pass, the mini-batch mean and variance are calculated. With these mini-batch statistics, the data is normalized by subtracting the mean and dividing by the standard deviation. Finally, the data is scaled and shifted with the learned scale and shift parameters. The batch normalization forward pass $f_{BN}$ is depicted in the corresponding figure.
In the figure, $\mu_\beta$ is the batch mean and $\sigma_\beta^2$ is the batch variance, respectively. The learned scale and shift parameters are denoted by $\gamma$ and $\beta$, respectively. For clarity, the batch normalization procedure is described herein per activation and omits the corresponding indices.
Since normalization is a differentiable transform, the errors are propagated into these learned parameters and are thus able to restore the representational power of the network by learning the identity transform.
Conversely, by learning scale and shift parameters that are identical to the corresponding batch statistics, the batch normalization transform would have no effect on the network, if that was the optimal operation to perform. At test time, the batch mean and variance are replaced by the respective population statistics since the input does not depend on other samples from a mini-batch. Another method is to keep running averages of the batch statistics during training and to use these to compute the network output at test time. At test time, the batch normalization transform can be expressed as illustrated in the corresponding figure, where $\mu_D$ and $\sigma_D^2$ denote the population mean and variance, rather than the batch statistics, respectively.
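A compact numpy sketch of the forward behavior just described, with running averages standing in for the population statistics at test time; the epsilon, momentum, and shapes are illustrative defaults, not values from the disclosure:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, momentum=0.9,
                      eps=1e-5, training=True):
    """Normalize each feature over the mini-batch, then scale and shift
    with the learned parameters gamma and beta. At test time the
    running (population) statistics replace the batch statistics."""
    if training:
        mu = x.mean(axis=0)                 # batch mean, mu_beta
        var = x.var(axis=0)                 # batch variance, sigma^2_beta
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mean"], running["var"]   # population statistics
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8)                          # mini-batch of 32, 8 features
running = {"mean": np.zeros(8), "var": np.ones(8)}
y = batchnorm_forward(x, np.ones(8), np.zeros(8), running)                    # training
y_test = batchnorm_forward(x, np.ones(8), np.zeros(8), running, training=False)  # test
```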
Backward Pass
Since normalization is a differentiable operation, the backward pass can be computed as depicted in the corresponding figure.

1D Convolution
1D convolutions extract local 1D patches or subsequences from sequences, as shown in the corresponding figure. 1D convolution obtains each output timestep from a temporal patch in the input sequence. 1D convolution layers recognize local patterns in a sequence. Because the same input transformation is performed on every patch, a pattern learned at a certain position in the input sequences can be later recognized at a different position, making 1D convolution layers translation invariant for temporal translations. For instance, a 1D convolution layer processing sequences of bases using convolution windows of size 5 should be able to learn bases or base sequences of length 5 or less, and it should be able to recognize the base motifs in any context in an input sequence. A base-level 1D convolution is thus able to learn about base morphology.
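To connect this to the genomic setting, here is a hedged toy example of a window-5 1D convolution over a one-hot encoded DNA sequence; the motif, filter, and sequence are invented for illustration:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA sequence as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES[b]] = 1.0
    return x

def conv1d_seq(x, W):
    """Slide a (window, 4) filter along the sequence axis; the same
    filter fires wherever its motif occurs, i.e. translation invariance."""
    window = W.shape[0]
    return np.array([np.sum(W * x[i:i + window])
                     for i in range(len(x) - window + 1)])

# A hypothetical window-5 filter that responds to the motif "GATAA".
W = one_hot("GATAA")
scores = conv1d_seq(one_hot("TTGATAACCGATAAGT"), W)
print(np.where(scores == 5.0)[0])   # motif found at positions 2 and 9
```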
Global Average Pooling
illustrates how global average pooling (GAP) works. Global average pooling can be used to replace fully connected (FC) layers for classification, by taking the spatial average of features in the last layer for scoring. This reduces the training load and bypasses overfitting issues. Global average pooling applies a structural prior to the model and is equivalent to a linear transformation with predefined weights. Global average pooling reduces the number of parameters and eliminates the fully connected layer. Fully connected layers are typically the most parameter- and connection-intensive layers, and global average pooling provides a much lower-cost approach to achieve similar results. The main idea of global average pooling is to generate the average value from each last-layer feature map as the confidence factor for scoring, feeding directly into the softmax layer.
Global average pooling has three benefits: (1) there are no extra parameters in global average pooling layers, thus overfitting is avoided at global average pooling layers; (2) since the output of global average pooling is the average of the whole feature map, global average pooling will be more robust to spatial translations; and (3) because of the huge number of parameters in fully connected layers, which usually take over 50% of all the parameters of the whole network, replacing them by global average pooling layers can significantly reduce the size of the model, and this makes global average pooling very useful in model compression.
Global average pooling makes sense, since stronger features in the last layer are expected to have a higher average value. In some implementations, global average pooling can be used as a proxy for the classification score. The feature maps under global average pooling can be interpreted as confidence maps, and force correspondence between the feature maps and the categories. Global average pooling can be particularly effective if the last-layer features are at a sufficient abstraction for direct classification; however, global average pooling alone is not enough if multi-level features should be combined into groups like parts models, which is best performed by adding a simple fully connected layer or other classifier after the global average pooling.
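A minimal sketch of global average pooling feeding a softmax, assuming one last-layer feature map per class; all sizes are illustrative:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each (H, W) feature map of the last layer to its spatial
    average, giving one confidence value per map for scoring."""
    return feature_maps.mean(axis=(0, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

last_layer = np.random.randn(6, 6, 10)     # 10 feature maps, one per class
scores = global_average_pool(last_layer)   # shape (10,), no extra parameters
print(softmax(scores))                     # class probabilities
```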
Deep Learning in Genomics
Genetic variations can help explain many diseases. Every human being has a unique genetic code and there are lots of genetic variants within a group of individuals. Most of the deleterious genetic variants have been depleted from genomes by natural selection. It is important to identify which genetic variations are likely to be pathogenic or deleterious. This will help researchers focus on the likely pathogenic genetic variants and accelerate the pace of diagnosis and cure of many diseases.
Modeling the properties and functional effects (e.g., pathogenicity) of variants is an important but challenging task in the field of genomics. Despite the rapid advancement of functional genomic sequencing technologies, interpretation of the functional consequences of variants remains a great challenge due to the complexity of cell type-specific transcription regulation systems.
Advances in biochemical technologies over the past decades have given rise to next generation sequencing (NGS) platforms that quickly produce genomic data at much lower costs than ever before. Such overwhelmingly large amounts of sequenced DNA remain difficult to annotate. Supervised machine learning algorithms typically perform well when large amounts of labeled data are available. In bioinformatics and many other data-rich disciplines, the process of labeling instances is costly; however, unlabeled instances are inexpensive and readily available. For a scenario in which the amount of labeled data is relatively small and the amount of unlabeled data is substantially larger, semi-supervised learning represents a cost-effective alternative to manual labeling.
An opportunity arises to use semi-supervised algorithms to construct deep learning-based pathogenicity classifiers that accurately predict pathogenicity of variants. Databases of pathogenic variants that are free from human ascertainment bias may result.
Regarding pathogenicity classifiers, deep neural networks are a type of artificial neural networks that use multiple nonlinear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation which carries the difference between observed and predicted output to adjust parameters. Deep neural networks have evolved with the availability of large training datasets, the power of parallel and distributed computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains such as computer vision, speech recognition, and natural language processing.
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are components of deep neural networks. Convolutional neural networks have succeeded particularly in image recognition with an architecture that comprises convolution layers, nonlinear layers, and pooling layers. Recurrent neural networks are designed to utilize sequential information of input data with cyclic connections among building blocks like perceptrons, long short-term memory units, and gated recurrent units. In addition, many other emergent deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multi-dimensional recurrent neural networks, and convolutional auto-encoders.
The goal of training deep neural networks is optimization of the weight parameters in each layer, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from data. A single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass sequentially computes the output in each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures error between the inferenced outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network.
Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent.
Whereas batch gradient descent performs parameter updates for each complete dataset, stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples. Several optimization algorithms stem from stochastic gradient descent. For example, the Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.
Another core concept in the training of deep neural networks is regularization, which refers to strategies intended to avoid overfitting and thus achieve good generalization performance. For example, weight decay adds a penalty term to the objective loss function so that weight parameters converge to smaller absolute values. Dropout randomly removes hidden units from neural networks during training and can be considered an ensemble of possible subnetworks. To enhance the capabilities of dropout, a new activation function, maxout, and a variant of dropout for recurrent neural networks called rnnDrop have been proposed. Furthermore, batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters.
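Two of the regularizers mentioned above are easy to sketch in numpy; the dropout rate and decay coefficient below are arbitrary illustrative values:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=np.random.default_rng()):
    """Randomly remove hidden units during training (inverted dropout,
    so no rescaling is needed at test time)."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

def weight_decay_grad(W, grad, lam=1e-4):
    """Weight decay adds a penalty term so weights converge to smaller
    absolute values: the gradient of (lam/2) * ||W||^2 is lam * W."""
    return grad + lam * W

h = np.random.randn(16)
print(dropout(h, rate=0.5))   # roughly half the units zeroed out
```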
Given that sequenced data are multi- and high-dimensional, deep neural networks have great promise for bioinformatics research because of their broad applicability and enhanced prediction power. Convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions. A hallmark of convolutional neural networks is the use of convolution filters. Unlike traditional classification approaches that are based on elaborately-designed and manually-crafted features, convolution filters perform adaptive learning of features, analogous to a process of mapping raw input data to the informative representation of knowledge. In this sense, the convolution filters serve as a series of motif scanners, since a set of such filters is capable of recognizing relevant patterns in the input and updating themselves during the training procedure. Recurrent neural networks can capture long-range dependencies in sequential data of varying lengths, such as protein or DNA sequences.
Therefore, a powerful computational model for predicting the pathogenicity of variants can have enormous benefits for both basic science and translational research.
Common polymorphisms represent natural experiments whose fitness has been tested by generations of natural selection. Comparing the allele frequency distributions for human missense and synonymous substitutions, we find that the presence of a missense variant at high allele frequencies in a non-human primate species reliably predicts that the variant is also under neutral selection in the human population. In contrast, common variants in more distant species experience negative selection as evolutionary distance increases.
We employ common variation from six non-human primate species to train a semi-supervised deep learning network that accurately classifies clinical de novo missense mutations using sequence alone. With over 500 known species, the primate lineage contains sufficient common variation to systematically model the effects of most human variants of unknown significance.
The human reference genome harbors more than 70 million potential protein-altering missense substitutions, the vast majority of which are rare mutations whose effects on human health have not been characterized. These variants of unknown significance present a challenge for genome interpretation in clinical applications, and are a roadblock to the long-term adoption of sequencing for population-wide screening and individualized medicine. Cataloguing common variation across diverse human populations is an effective strategy for identifying clinically benign variation, but the common variation available in modern day humans is limited by bottleneck events in our species' distant past. Humans and chimpanzees share 99% sequence identity, suggesting that natural selection operating on chimpanzee variants has the potential to model the effects of variants that are identical-by-state in human. The mean coalescence time for neutral polymorphisms in the human population is a fraction of the species' divergence time; hence, naturally occurring chimpanzee variation largely explores mutational space that is non-overlapping with human variation, aside from rare instances of haplotypes maintained by balancing selection.
The recent availability of aggregated exome data from 60,706 humans enables us to test this hypothesis by comparing the allele frequency spectra for missense and synonymous mutations. Singleton variants in ExAC closely match the expected 2.2:1 missense:synonymous ratio predicted by de novo mutation after adjusting for mutational rate using trinucleotide context, but at higher allele frequencies the number of observed missense variants decreases due to the filtering out of deleterious variants by natural selection. The pattern of missense:synonymous ratios across the allele frequency spectrum indicates that a large fraction of missense variants with population frequency < 0.1% are mildly deleterious, that is, neither pathogenic enough to warrant immediate removal from the population, nor neutral enough to be allowed to exist at high allele frequencies, consistent with prior observations on more limited population data. These findings support the widespread empirical practice by diagnostic labs of filtering out variants with greater than 0.1%~1% allele frequency as likely benign for penetrant genetic disease, aside from a handful of well-documented exceptions caused by balancing selection and founder effects.

Repeating this analysis with the subset of human variants that are identical-by-state with common chimpanzee variants (observed more than once in chimpanzee population sequencing), we find that the missense:synonymous ratio is largely constant across the allele frequency spectrum. The high allele frequency of these variants in the chimpanzee population indicates that they have already been through the sieve of natural selection in chimpanzee, and their neutral impact on fitness in human populations provides compelling evidence that the selective pressures on missense variants are highly concordant in the two species. The lower missense:synonymous ratio observed in chimpanzee is consistent with the larger effective population size in ancestral chimpanzee populations enabling more efficient filtering of mildly deleterious variants.
In contrast, rare chimpanzee variants (observed only once in chimpanzee population sequencing) show a modest decrease in missense:synonymous ratio at higher allele frequencies. Selecting an identically sized cohort from human variation data, we estimate that only 64% of the variants observed once in a cohort of this size would have an allele frequency greater than 0.1% in the general population, compared to 99.8% for the variants seen multiple times in the cohort, suggesting that not all of the rare chimpanzee variants have been through the sieve of selection. Overall, we estimate that 16% of the ascertained chimpanzee missense variants have an allele frequency less than 0.1% in the general population, and would be subject to negative selection at higher allele frequencies.
We next characterize human variants that are identical-by-state with variation observed in other non-human primate species (Bonobo, Gorilla, Orangutan, Rhesus, and Marmoset). Similar to chimpanzee, we observe that the missense:synonymous ratios are roughly equivalent across the allele frequency spectrum, other than a slight depletion of missense variation at high allele frequencies, which would be anticipated due to the inclusion of a small number of rare variants (~5-15%). These results imply that the selective forces on missense variants are largely concordant within the primate lineage at least out to new world monkeys, which are estimated to have diverged from the human ancestral lineage ~35 million years ago.
Human missense variants that are identical-by-state with variants in other primates are strongly enriched for benign consequence in ClinVar. After excluding variants with unknown or conflicting annotations, we observe that human variants with primate orthologs are approximately 95% likely to be annotated as Benign or Likely Benign in ClinVar, compared to 45% for missense variation in general. The small fraction of ClinVar variants from non-human primates that are classified as pathogenic is comparable to the fraction of pathogenic ClinVar variants that would be observed by ascertaining rare variants from a similar sized cohort of healthy humans.
A substantial fraction of these variants annotated as Pathogenic or Likely Pathogenic received their classifications prior to the advent of large allele frequency databases, and might be curated differently today.
The field of human genetics has long relied upon model organisms to infer the clinical impact of human mutations, but the long evolutionary distance to most genetically tractable animal models raises concerns about the extent to which these findings are generalizable back to human. To examine the concordance of natural selection on missense variants in human and more distant species, we extend our analysis beyond the primate lineage to include largely common variation from four additional mammalian species (mouse, pig, goat, cow) and two species of more distant vertebrates (chicken, zebrafish). In contrast to the prior primate analyses, we observe that missense variation is markedly depleted at common allele frequencies compared to rare allele frequencies, especially at greater evolutionary distances, indicating that a substantial fraction of common missense variation in more distant species would experience negative selection in human populations. Nonetheless, the observation of a missense variant in more distant vertebrates still increases the likelihood of benign consequence, as the fraction of common missense variants depleted by natural selection is far less than the ~50% depletion for human missense variants at baseline. Consistent with these results, we find that human missense variants that have been observed in mouse, dog, pig, and cow are approximately 85% likely to be annotated as Benign or Likely Benign in ClinVar, compared to 95% for primate variation and 45% for the ClinVar database as a whole.
The existence of closely related pairs of species at varying evolutionary distances also provides an opportunity to evaluate the functional consequences of fixed missense substitutions in human populations. Within closely related pairs of species (branch length < 0.1) on the mammalian family tree, we observe that fixed missense variation is depleted at common allele frequencies compared to rare allele frequencies, indicating that a substantial fraction of inter-species fixed substitutions would be non-neutral in human, even within the primate lineage. A comparison of the magnitude of missense depletion indicates that inter-species fixed substitutions are significantly less neutral than within-species polymorphisms. Intriguingly, inter-species variation between closely related mammals is not substantially more pathogenic in ClinVar (83% likely to be annotated as Benign or Likely Benign) compared to within-species common polymorphisms, suggesting that these changes do not abrogate protein function, but rather reflect tuning of protein function that confers species-specific adaptive advantages.
The large number of possible variants of unknown significance and the crucial importance of accurate variant classification for clinical applications have inspired multiple attempts to tackle the problem with machine learning, but these efforts have largely been limited by the insufficient quantity of common human variants and the dubious quality of annotations in curated databases. Variation from the six non-human primates contributes over 300,000 unique missense variants that are non-overlapping with common human variation and largely of benign consequence, greatly enlarging the size of the training dataset that can be used for machine learning approaches.
Unlike earlier models which employ a large number of human-engineered features and meta-classifiers, we apply a simple deep learning residual network which takes as input only the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species. To provide the network with information about protein structure, we train two separate networks to learn secondary structure and solvent accessibility from sequence alone, and incorporate these as sub-networks in the larger deep learning network to predict effects on protein structure. Using sequence as a starting point avoids potential biases in protein structure and functional domain annotation, which may be incompletely ascertained or inconsistently applied.
We use semi-supervised learning to overcome the problem of the training set containing only variants with benign labels, by initially training an ensemble of networks to separate likely benign primate variants from random unknown variants that are matched for mutation rate and sequencing coverage. This ensemble of networks is used to score the complete set of unknown variants and influence the selection of unknown variants to seed the next iteration of the classifier, biasing towards unknown variants with more pathogenic predicted consequence and taking gradual steps at each iteration to prevent the model from prematurely converging to a suboptimal result.
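A schematic sketch of this iterative loop follows. The scoring model, seed size, and bias schedule are placeholder assumptions for illustration only, not the actual implementation.

```python
# Schematic sketch of the iterative semi-supervised loop described above.
# `train_ensemble` is a placeholder scorer; the bias schedule (`step`) is an
# illustrative assumption.
import numpy as np

def train_ensemble(benign, unknown_seed):
    """Placeholder: train networks to separate benign vs. seeded unknowns,
    returning a scorer for unknown variants (higher = more pathogenic)."""
    center = benign.mean(axis=0)
    return lambda v: np.linalg.norm(v - center, axis=1)

benign = np.random.randn(1000, 8)    # stand-in benign feature vectors
unknown = np.random.randn(5000, 8)   # stand-in unknown variants
seed = unknown[np.random.choice(len(unknown), 1000, replace=False)]

for iteration in range(5):
    score = train_ensemble(benign, seed)
    pathogenicity = score(unknown)
    # Gradually bias the seed toward higher-scoring (more pathogenic)
    # unknowns, taking small steps to avoid premature convergence.
    step = 0.1 * (iteration + 1)
    weights = np.exp(step * pathogenicity)
    weights /= weights.sum()
    idx = np.random.choice(len(unknown), 1000, replace=False, p=weights)
    seed = unknown[idx]
```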
Common primate variation also provides a clean validation dataset for evaluating existing methods that is completely independent of previously used training data, which has been hard to evaluate objectively because of the proliferation of meta-classifiers. We evaluated the performance of our model, along with four other popular classification algorithms (Sift, Polyphen2, CADD, …), using 10,000 held-out primate common variants.
Because roughly 50% of all human missense variants would be removed by natural selection at common allele frequencies, we calculated the 50th-percentile score for each classifier on a set of randomly picked missense variants that were matched to the 10,000 held-out primate common variants by mutational rate, and used that threshold to evaluate the held-out primate common variants. The accuracy of our deep learning model was significantly better than the other classifiers on this independent validation set, using either deep learning networks that were trained only on human common variants, or using both human common variants and primate variants.
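A minimal sketch of this thresholding procedure follows, assuming scores are oriented so that higher means more pathogenic; the simulated scores are stand-ins.

```python
# Take the median (50th-percentile) score on mutation-rate-matched random
# variants, then measure what fraction of held-out primate common variants
# falls on the benign side of that threshold.
import numpy as np

def evaluate(scores_random_matched, scores_primate_heldout):
    threshold = np.percentile(scores_random_matched, 50)
    # Accuracy = fraction of held-out primate common variants called benign
    return np.mean(scores_primate_heldout < threshold)

rng = np.random.default_rng(0)
random_matched = rng.normal(0.5, 0.2, 10_000)   # stand-in scores
primate_heldout = rng.normal(0.3, 0.2, 10_000)
print(f"accuracy: {evaluate(random_matched, primate_heldout):.3f}")
```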
Recent trio sequencing studies have catalogued thousands of de novo mutations in patients with neurodevelopmental disorders and their healthy siblings, enabling assessment of the strength of various classification algorithms in separating de novo missense mutations in cases versus controls. For each of the four classification algorithms, we scored each de novo missense variant in cases versus controls, and report the p-value from the Wilcoxon rank-sum test of the difference between the two distributions, showing that the deep learning method trained on primate variants (p ~ 10⁻³³) performed far better than the other classifiers (p ~ 10⁻¹³ to 10⁻¹⁹) on this clinical scenario. From the ~1.3-fold enrichment of de novo missense variants over expectation previously reported for this cohort, and prior estimates that ~20% of missense variants produce loss-of-function effects, we would expect a perfect classifier to separate the two classes with a p-value of p ~ 10⁻⁴⁰, indicating that our classifier still has room for improvement.
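The rank-sum comparison can be reproduced schematically with scipy; the scores below are simulated stand-ins, not the cohort data.

```python
# Score de novo missense variants in affected and unaffected individuals and
# test the separation of the two distributions with the Wilcoxon rank-sum test.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
case_scores = rng.normal(0.60, 0.2, 4000)      # de novo missense in probands
control_scores = rng.normal(0.50, 0.2, 2000)   # de novo missense in siblings

stat, p_value = ranksums(case_scores, control_scores)
print(f"Wilcoxon rank-sum statistic={stat:.2f}, p={p_value:.2e}")
```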
The accuracy of the deep learning classifier scales with the size of the training dataset, and variation data from each of the six primate species independently contributes to boosting the accuracy of the classifier. The large number and diversity of extant non-human primate species, along with evidence showing that the selective pressures on protein-altering variants are largely concordant within the primate lineage, suggest systematic primate population sequencing as an effective strategy to classify the millions of human variants of unknown significance that currently limit clinical genome interpretation. Of the 504 known non-human primate species, roughly 60% face extinction due to hunting and habitat loss, motivating urgency for a worldwide conservation effort that would benefit both these unique and irreplaceable species and our own.
Although not as much aggregate whole genome data is available as exome data, limiting the power to detect the impact of natural selection in deep intronic regions, we were also able to evaluate the observed versus expected counts of cryptic splice mutations far from exonic regions. Overall, we observe a 60% depletion in cryptic splice mutations at a distance > 50 nt from an exon-intron boundary. The attenuated signal is likely a combination of the smaller sample size with whole genome data compared to exome data, and the greater difficulty of predicting the impact of deep intronic variants.
Terminology

All literature and similar material cited in this application, including, but not limited to, patents, patent applications, articles, books, treatises, and web pages, regardless of the format of such literature and similar materials, are expressly incorporated by reference in their entirety. In the event that one or more of the incorporated literature and similar materials differs from or contradicts this application, including but not limited to defined terms, term usage, described techniques, or the like, this application controls.
As used herein, the following terms have the meanings indicated.
A base refers to a nucleotide base or nucleotide: A (adenine), C (cytosine), T (thymine), or G (guanine).
This application uses the terms "protein" and "translated sequence" interchangeably.
This application uses the terms "codon" and "base triplet" interchangeably.
This application uses the terms "amino acid" and "translated unit" interchangeably.
This application uses the phrases "variant pathogenicity classifier", "convolutional neural network-based classifier for variant classification", and "deep convolutional neural network-based classifier for variant classification" interchangeably.
The term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands sing DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed .
The term "site" refers to a unique position (e.g., chromosome ID, chromosome position and orientation) on a reference genome. In some implementations, a site may be a residue, a sequence tag, or a segment’s position on a sequence. The term "locus" may be used to refer to the specific location of a nucleic acid sequence or rphism on a reference some.
The term "sample" herein refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism containing a nucleic acid or a mixture of nucleic acids ning at least one nucleic acid ce that is to be sequenced and/or phased. Such samples include, but are not limited to sputum/oral fluid, amniotic fluid, blood, a blood on, fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, tissue explant, organ culture and any other tissue or cell preparation, or fraction or derivative thereof or isolated therefrom. Although the sample is often taken from a human subject (e.g., patient), samples can be taken from any organism having chromosomes, including, but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth. s of pretreatment may also involve, but are not limited to, tion, precipitation, dilution, distillation, , centrifugation, freezing, lization, concentration, amplification, nucleic acid fragmentation, vation of interfering components, the addition of reagents, lysing, The term "sequence" includes or represents a strand of nucleotides coupled to each other. The nucleotides may be based on DNA or RNA. It should be understood that one sequence may include multiple subsequences.
For example, a single sequence (e.g., of a PCR amplicon) may have 350 nucleotides. The sample read may include multiple sub-sequences within these 350 nucleotides. For instance, the sample read may include first and second flanking sub-sequences having, for example, 20-50 nucleotides. The first and second flanking sub-sequences may be located on either side of a repetitive segment having a corresponding sub-sequence (e.g., 40-100 nucleotides). Each of the flanking sub-sequences may include (or include portions of) a primer sub-sequence (e.g., -30 nucleotides). For ease of reading, the term "sub-sequence" will be referred to as "sequence," but it is understood that two sequences are not necessarily separate from each other on a common strand. To differentiate the various sequences described herein, the sequences may be given different labels (e.g., target sequence, primer sequence, flanking sequence, reference sequence, and the like). Other terms, such as "allele," may be given different labels to differentiate between like objects.
The term "paired-end sequencing" refers to sequencing methods that sequence both ends of a target fragment. Paired-end sequencing may facilitate detection of genomic rearrangements and repetitive segments, as well as gene fusions and novel transcripts. Methodology for paired-end cing are described in PCT publication WO07010252, PCT ation Serial No. PCTGB2007/003798 and US patent application publication US 2009/0088327, each of which is incorporated by reference herein. In one example, a series of operations may be performed as follows; (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first sequencing primer and carry out repeated cycles of extension, scanning and deblocking, as set forth above; (d) "invert" the target nucleic acids on the flow cell surface by synthesizing a complimentary copy; (e) linearize the hesized ; and (f) hybridize a second sequencing primer and carry out repeated cycles of extension, scanning and deblocking, as set forth above. The inversion operation can be carried out be delivering reagents as set forth above for a single cycle of bridge amplification.
The term "reference genome" or "reference sequence" refers to any particular known genome sequence, whether l or complete, of any organism which may be used to reference identified ces from a subject. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an sm or virus, expressed in nucleic acid sequences. A genome includes both the genes and the noncoding sequences of the DNA. The reference sequence may be larger than the reads that are aligned to it. For e, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 105 times larger, or at least about 106 times larger, or at least about 107 times larger. In one example, the reference genome sequence is that of a full length human genome. In another example, the reference genome sequence is limited to a specific human chromosome such as chromosome 13. In some implementations, a reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences, although the term nce genome is intended to cover such sequences. Other examples of reference sequences include genomes of other s, as well as chromosomes, sub- chromosomal regions (such as strands), etc., of any species. In various implementations, the reference genome is a consensus sequence or other combination derived from multiple individuals. However, in certain ations, the reference sequence may be taken from a particular individual.
The term "read" refers to a collection of sequence data that bes a fragment of a nucleotide sample or reference. The term "read" may refer to a sample read and/or a reference read. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample or reference. The read may be represented symbolically by the base pair sequence (in ATCG) of the sample or reference fragment. It may be stored in a memory device and processed as appropriate to determine whether the read matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence ation concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome or genomic region or gene.
Next-generation sequencing methods include, for example, sequencing by synthesis technology (Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences) and sequencing by ligation (SOLiD sequencing). Depending on the sequencing methods, the length of each read may vary from about 30 bp to more than 10,000 bp. For example, the sequencing method using a SOLiD sequencer generates nucleic acid reads of about 50 bp. For another example, Ion Torrent sequencing generates nucleic acid reads of up to 400 bp and 454 pyrosequencing generates nucleic acid reads of about 700 bp. For yet another example, single-molecule real-time sequencing methods may generate reads of 10,000 bp to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence reads have a length of 30-100 bp, 50-200 bp, or 50-400 bp.
The terms "sample read", e sequence" or "sample fragment" refer to sequence data for a c sequence of interest from a sample. For example, the sample read comprises sequence data from a PCR amplicon having a forward and reverse primer ce. The sequence data can be obtained from any select sequence methodology. The sample read can be, for example, from a sequencing-by-synthesis (SBS) reaction, a sequencing-by-ligation reaction, or any other suitable sequencing ology for which it is desired to determine the length and/or identity of a tive element. The sample read can be a consensus (e.g., averaged or weighted) sequence derived from multiple sample reads. In certain implementations, providing a reference sequence comprises fying a locus-of-interest based upon the primer sequence of the PCR amplicon.
The term "raw fragment" refers to ce data for a portion of a genomic sequence of interest that at least partially ps a designated position or secondary position of interest within a sample read or sample fragment. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex ed nt, a duplex un-stitched fragment and a simplex un-stitched fragment. The term "raw" is used to indicate that the raw fragment includes sequence data having some relation to the sequence data in a sample read, regardless of whether the raw fragment exhibits a supporting variant that ponds to and authenticates or confirms a potential variant in a sample read. The term "raw fragment" does not indicate that the fragment necessarily includes a supporting variant that validates a variant call in a sample read. For example, when a sample read is determined by a variant call ation to exhibit a first variant, the variant call ation may determine that one or more raw fragments lack a corresponding type of "supporting" variant that may otherwise be expected to occur given the variant in the sample read.
The terms "mapping", "aligned," "alignment," or "aligning" refer to the process of comparing a read or tag to a reference sequence and thereby determining whether the nce sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain implementations, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the nce sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell whether the read is present in the reference ce for chromosome 13. A tool that provides this information may be called a set ship tester. In some cases, an alignment onally indicates a location in the reference sequence where the read or tag maps to. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13.
The term "indel" refers to the insertion and/or the deletion of bases in the DNA of an organism. A micro-indel represents an indel that results in a net change of 1 to 50 nucleotides. In coding regions of the genome, unless the length of an indel is a multiple of 3, it will produce a frameshift mutation. Indels can be contrasted with point ons. An indel inserts and s nucleotides from a ce, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the l number in the DNA. Indels can also be contrasted with a Tandem Base Mutation (TBM), which may be defined as substitution at adjacent nucleotides (primarily substitutions at two adjacent nucleotides, but substitutions at three adjacent nucleotides have been observed).
The term "variant" refers to a nucleic acid sequence that is different from a nucleic acid reference.
Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphisms (SNP), short deletion and insertion polymorphisms (Indel), copy number variation (CNV), microsatellite markers or short tandem repeats and structural variation. Somatic variant calling is the effort to identify variants present at low frequency in the DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an accumulation of mutations in DNA. A DNA sample from a tumor is generally heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations). Because of this heterogeneity, when sequencing a tumor (e.g., from an FFPE sample), somatic mutations will often appear at a low frequency. For example, a SNV might be seen in only 10% of the reads covering a given base. A variant that is to be classified as somatic or germline by the variant classifier is also referred to herein as the "variant under test".
The term "noise" refers to a mistaken variant call resulting from one or more errors in the sequencing process and/or in the variant call application.
The term "variant frequency" represents the relative frequency of an allele (variant of a gene) at a particular locus in a population, sed as a fraction or percentage. For example, the fraction or tage may be the fraction of all chromosomes in the population that carry that allele. By way of example, sample variant frequency represents the relative frequency of an allele/variant at a particular locus/position along a genomic sequence of interest over a "population" corresponding to the number of reads and/or s obtained for the genomic sequence of st from an individual. As another example, a baseline variant frequency represents the relative frequency of an allele/variant at a particular locus/position along one or more baseline c ces where the "population" ponding to the number of reads and/or samples obtained for the one or more baseline c ces from a population of normal individuals.
The term "variant allele frequency (VAF)" refers to the percentage of sequenced reads observed matching the variant divided by the overall coverage at the target position. VAF is a measure of the proportion of sequenced reads carrying the variant.
The terms "position", "designated position", and "locus" refer to a location or coordinate of one or more nucleotides within a sequence of tides. The terms "position", "designated position", and "locus" also refer to a on or coordinate of one or more base pairs in a sequence of nucleotides.
The term "haplotype" refers to a combination of s at adjacent sites on a chromosome that are inherited together. A haplotype may be one locus, several loci, or an entire chromosome ing on the number of recombination events that have occurred between a given set of loci, if any occurred.
The term "threshold" herein refers to a numeric or non-numeric value that is used as a cutoff to characterize a sample, a nucleic acid, or portion f (e.g., a read). A threshold may be varied based upon cal analysis. The threshold may be ed to a measured or calculated value to determine whether the source giving rise to such value suggests should be classified in a particular manner. Threshold values can be identified empirically or analytically. The choice of a threshold is dependent on the level of ence that the user wishes to have to make the classification. The threshold may be chosen for a particular purpose (e.g., to e sensitivity and selectivity). As used herein, the term "threshold" indicates a point at which a course of analysis may be changed and/or a point at which an action may be triggered. A threshold is not required to be a predetermined number. Instead, the threshold may be, for instance, a function that is based on a plurality of factors. The threshold may be adaptive to the circumstances. Moreover, a old may indicate an upper limit, a lower limit, or a range between limits.
In some implementations, a metric or score that is based on sequencing data may be compared to the threshold. As used herein, the terms "metric" or "score" may include values or results that were determined from the sequencing data or may include functions that are based on the values or results that were determined from the sequencing data. Like a threshold, the metric or score may be adaptive to the circumstances. For instance, the metric or score may be a normalized value. As an example of a score or metric, one or more implementations may use count scores when analyzing the data. A count score may be based on the number of sample reads. The sample reads may have undergone one or more filtering stages such that the sample reads have at least one common characteristic or quality. For example, each of the sample reads that are used to determine a count score may have been aligned with a reference sequence or may be assigned as a potential allele. The number of sample reads having a common characteristic may be counted to determine a read count. Count scores may be based on the read count. In some implementations, the count score may be a value that is equal to the read count. In other implementations, the count score may be based on the read count and other information. For example, a count score may be based on the read count for a particular allele of a genetic locus and a total number of reads for the genetic locus. In some implementations, the count score may be based on the read count and previously-obtained data for the genetic locus.
In some implementations, the count scores may be normalized scores between predetermined values. The count score may also be a function of read counts from other loci of a sample or a function of read counts from other samples that were concurrently run with the sample-of-interest. For instance, the count score may be a function of the read count of a particular allele and the read counts of other loci in the sample and/or the read counts from other samples. As one example, the read counts from other loci and/or the read counts from other samples may be used to normalize the count score for the particular allele.
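One possible count-score normalization along these lines is sketched below; it is a hypothetical illustration, and the exact normalization used in any given implementation may differ.

```python
# Hedged sketch: the read count for an allele normalized by the total reads
# at its locus, optionally scaled against the depth of concurrently run
# samples. All choices here are illustrative assumptions.
def count_score(allele_reads: int, locus_reads: int,
                other_sample_totals: list[int] | None = None) -> float:
    score = allele_reads / max(locus_reads, 1)   # allele fraction at locus
    if other_sample_totals:
        # normalize against the mean depth of concurrently run samples
        mean_other = sum(other_sample_totals) / len(other_sample_totals)
        score *= locus_reads / max(mean_other, 1)
    return score

print(count_score(45, 100))                    # 0.45
print(count_score(45, 100, [80, 120, 100]))    # 0.45 (comparable depth)
```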
The terms "coverage" or "fragment coverage" refer to a count or other measure of a number of sample reads for the same fragment of a sequence. A read count may represent a count of the number of reads that cover a corresponding fragment. Alternatively, the coverage may be determined by multiplying the read count by a designated factor that is based on historical knowledge, knowledge of the sample, knowledge of the locus, etc.
The term "read depth" ntionally a number followed by "×") refers to the number of sequenced reads with overlapping alignment at the target position. This is often expressed as an average or percentage exceeding a cutoff over a set of intervals (such as exons, genes, or panels). For example, a clinical report might say that a panel average coverage is 1,105× with 98% of targeted bases covered >100×.
The terms "base call quality score" or "Q score" refer to a PHRED-scaled probability ranging from 0- inversely proportional to the probability that a single sequenced base is correct. For example, a T base call with Q of 20 is considered likely correct with a confidence P-value of 0.01. Any base call with Q<20 should be ered low quality, and any variant identified where a ntial proportion of sequenced reads supporting the t are of low y should be considered potentially false ve.
The terms "variant reads" or "variant read number" refer to the number of sequenced reads ting the presence of the variant.
Sequencing Process

Implementations set forth herein may be applicable to analyzing nucleic acid sequences to identify sequence variations. Implementations may be used to analyze potential variants/alleles of a genetic position/locus and determine a genotype of the genetic locus or, in other words, provide a genotype call for the locus. By way of example, nucleic acid sequences may be analyzed in accordance with the methods and systems described in US Patent Application Publication No. 2016/0085910 and US Patent Application Publication No. 2013/0296175, the complete subject matter of which are expressly incorporated by reference herein in their entirety.
In one implementation, a sequencing process includes receiving a sample that includes or is suspected of including nucleic acids, such as DNA. The sample may be from a known or unknown source, such as an animal (e.g., human), plant, bacteria, or fungus. The sample may be taken directly from the source. For instance, blood or saliva may be taken directly from an individual. Alternatively, the sample may not be obtained directly from the source. Then, one or more processors direct the system to prepare the sample for sequencing. The preparation may include removing extraneous material and/or isolating certain material (e.g., DNA). The biological sample may be prepared to include features for a particular assay. For example, the biological sample may be prepared for sequencing-by-synthesis (SBS). In certain implementations, the preparing may include amplification of certain regions of a genome. For instance, the preparing may include amplifying predetermined genetic loci that are known to include STRs and/or SNPs. The genetic loci may be amplified using predetermined primer sequences.
Next, the one or more processors direct the system to sequence the sample. The sequencing may be performed through a variety of known sequencing protocols. In particular implementations, the sequencing includes SBS. In SBS, a plurality of fluorescently-labeled nucleotides are used to sequence a plurality of clusters of amplified DNA (possibly millions of clusters) present on the surface of an optical substrate (e.g., a surface that at least partially defines a channel in a flow cell). The flow cells may contain nucleic acid samples for sequencing where the flow cells are placed within the appropriate flow cell holders.
The nucleic acids can be prepared such that they comprise a known primer sequence that is adjacent to an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides, and DNA polymerase, etc., can be flowed into/through the flow cell by a fluid flow subsystem. Either a single type of nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to possess a reversible termination property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of several types of labeled nucleotides (e.g., A, C, T, G). The nucleotides can include detectable label moieties such as fluorophores. Where the four nucleotides are mixed together, the polymerase is able to select the correct base to incorporate and each sequence is extended by a single base. Non-incorporated nucleotides can be washed away by flowing a wash solution through the flow cell. One or more lasers may excite the nucleic acids and induce fluorescence. The fluorescence emitted from the nucleic acids is based upon the fluorophores of the incorporated base, and different fluorophores may emit different wavelengths of emission light.
A deblocking reagent can be added to the flow cell to remove reversible terminator groups from the DNA strands that were extended and detected. The deblocking reagent can then be washed away by flowing a wash solution through the flow cell. The flow cell is then ready for a further cycle of sequencing starting with introduction of a labeled nucleotide as set forth above. The fluidic and detection operations can be repeated several times to complete a sequencing run. Example sequencing methods are described, for example, in Bentley et al., Nature 456:53-59 (2008); International Publication No. WO 04/018497; U.S. Pat. No. 7,057,026; International Publication No. WO 91/06678; International Publication No. WO 07/123744; U.S. Pat. No. 7,329,492; U.S. Patent No. 7,211,414; U.S. Patent No. 7,315,019; U.S. Patent No. 7,405,281; and U.S. Patent Application Publication No. 2008/0108082, each of which is incorporated herein by reference.
In some implementations, nucleic acids can be attached to a surface and amplified prior to or during sequencing. For example, amplification can be carried out using bridge amplification to form nucleic acid clusters on a surface. Useful bridge amplification methods are described, for example, in U.S. Patent No. 5,641,658; U.S. Patent Application Publication No. 2002/0055100; U.S. Patent No. 7,115,400; U.S. Patent Application Publication No. 2004/0096853; U.S. Patent Application Publication No. 2004/0002090; U.S. Patent Application Publication No. 2007/0128624; and U.S. Patent Application Publication No. 2008/0009420, each of which is incorporated herein by reference in its entirety. Another useful method for amplifying nucleic acids on a surface is rolling circle amplification (RCA), for example, as described in Lizardi et al., Nat. Genet. 19:225-232 (1998) and U.S. Patent Application Publication No. 2007/0099208 A1, each of which is incorporated herein by reference.
One example SBS protocol exploits modified nucleotides having removable 3' blocks, for example, as described in International Publication No. WO 04/018497, U.S. Patent Application Publication No. 2007/0166705A1, and U.S. Patent No. 7,057,026, each of which is incorporated herein by reference. For example, repeated cycles of SBS reagents can be delivered to a flow cell having target nucleic acids attached thereto, for example, as a result of the bridge amplification protocol. The nucleic acid clusters can be converted to single stranded form using a linearization solution. The linearization solution can contain, for example, a restriction endonuclease capable of cleaving one strand of each cluster. Other methods of cleavage can be used as an alternative to restriction enzymes or nicking enzymes, including inter alia chemical cleavage (e.g., cleavage of a diol linkage with periodate), cleavage of abasic sites by cleavage with endonuclease (for example 'USER', as supplied by NEB, Ipswich, Mass., USA, part number M5505S), by exposure to heat or alkali, cleavage of ribonucleotides incorporated into amplification products otherwise comprised of deoxyribonucleotides, photochemical cleavage or cleavage of a peptide linker. After the linearization operation, a sequencing primer can be delivered to the flow cell under conditions for hybridization of the sequencing primer to the target nucleic acids that are to be sequenced.
A flow cell can then be contacted with an SBS extension reagent having modified nucleotides with removable 3' blocks and fluorescent labels under conditions to extend a primer hybridized to each target nucleic acid by a single nucleotide addition. Only a single nucleotide is added to each primer because once the modified nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template being sequenced there is no free 3'-OH group available to direct further sequence extension and therefore the polymerase cannot add further nucleotides. The SBS extension reagent can be removed and replaced with scan reagent containing components that protect the sample under excitation with radiation. Example components for scan reagent are described in U.S. Patent Application Publication No. 2008/0280773 A1 and U.S. Patent Application No. 13/018,255, each of which is incorporated herein by reference. The extended nucleic acids can then be fluorescently detected in the presence of scan reagent. Once the fluorescence has been detected, the 3' block may be removed using a deblock reagent that is appropriate to the blocking group used. Example deblock reagents that are useful for respective blocking groups are described in WO 04/018497, US 2007/0166705A1 and U.S. Patent No. 7,057,026, each of which is incorporated herein by reference. The deblock reagent can be washed away, leaving target nucleic acids hybridized to extended primers having 3'-OH groups that are now competent for addition of a further nucleotide. Accordingly, the cycles of adding extension reagent, scan reagent, and deblock reagent, with optional washes between one or more of the operations, can be repeated until a desired sequence is obtained. The above cycles can be carried out using a single extension reagent delivery operation per cycle when each of the modified nucleotides has a different label attached thereto, known to correspond to the particular base. The different labels facilitate discrimination between the nucleotides added during each incorporation operation. Alternatively, each cycle can include separate operations of extension reagent delivery followed by separate operations of scan reagent delivery and detection, in which case two or more of the nucleotides can have the same label and can be distinguished based on the known order of delivery. Although the sequencing operation has been discussed above with respect to a particular SBS protocol, it will be understood that other protocols for sequencing any of a variety of other molecular analyses can be carried out as desired.
Then, the one or more processors of the system receive the sequencing data for subsequent analysis.
The sequencing data may be formatted in various manners, such as in a .BAM file. The sequencing data may include, for example, a number of sample reads. The sequencing data may include a plurality of sample reads that have corresponding sample sequences of the nucleotides. Although only one sample read is discussed, it should be understood that the sequencing data may include, for example, hundreds, thousands, hundreds of thousands, or millions of sample reads. Different sample reads may have different numbers of nucleotides. For example, a sample read may range between 10 nucleotides to about 500 nucleotides or more. The sample reads may span the entire genome of the source(s). As one example, the sample reads are directed toward predetermined genetic loci, such as those genetic loci having suspected STRs or suspected SNPs.
Each sample read may include a sequence of nucleotides, which may be referred to as a sample sequence, sample fragment or a target sequence. The sample sequence may include, for example, primer sequences, flanking sequences, and a target sequence. The number of nucleotides within the sample sequence may include 30, 40, 50, 60, 70, 80, 90, 100 or more. In some implementations, one or more of the sample reads (or sample sequences) includes at least 150 nucleotides, 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides, or more. In some implementations, the sample reads may include more than 1000 nucleotides, 2000 nucleotides, or more. The sample reads (or the sample sequences) may include primer sequences at one or both ends.
Next, the one or more processors analyze the sequencing data to obtain potential variant call(s) and a sample variant frequency of the sample variant call(s). The operation may also be referred to as a variant call application or variant caller. Thus, the variant caller identifies or detects variants and the variant classifier classifies the detected variants as somatic or germline. Alternative variant callers may be utilized in accordance with implementations herein, wherein different variant callers may be used based on the type of sequencing operation being performed, based on features of the sample that are of interest, and the like. One non-limiting example of a variant call application is the Pisces™ application by Illumina Inc. (San Diego, CA), hosted at https://github.com/Illumina/Pisces and described in the article Dunn, Tamsen & Berry, Gwenn & Emig-Agius, Dorothea & Jiang, Yu & Iyer, Anita & Udar, Nitin & Strömberg, Michael. (2017). Pisces: An Accurate and Versatile Single Sample Somatic and Germline Variant Caller. 595-595. 10.1145/3107411.3108203, the complete subject matter of which is expressly incorporated herein by reference in its entirety.
Such a variant call application can comprise four sequentially executed modules: (1) Pisces Read Stitcher: Reduces noise by stitching paired reads in a BAM (read one and read two of the same molecule) into consensus reads. The output is a stitched BAM. (2) Pisces Variant Caller: Calls small SNVs, insertions and deletions. Pisces includes a variant-collapsing algorithm to coalesce variants broken up by read boundaries, basic filtering algorithms, and a simple Poisson-based variant confidence-scoring algorithm. The output is a VCF. (3) Pisces Variant Quality Recalibrator (VQR): In the event that the variant calls overwhelmingly follow a pattern associated with thermal damage or FFPE deamination, the VQR step will downgrade the variant Q score of the suspect variant calls. The output is an adjusted VCF. (4) Pisces Variant Phaser (Scylla): Uses a read-backed greedy clustering method to assemble small variants into complex alleles from clonal subpopulations. This allows for the more accurate determination of functional consequence by downstream tools. The output is an adjusted VCF.
Additionally or alternatively, the operation may utilize the variant call application Strelka™ application by Illumina Inc., hosted at https://github.com/Illumina/strelka and described in the article Saunders, Christopher T. & Wong, Wendy & Swamy, Sajani & Becq, Jennifer & Murray, Lisa J. & Cheetham, Keira. (2012).
Strelka: Accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England). 28. 1811-7. 10.1093/bioinformatics/bts271, the complete subject matter of which is expressly incorporated herein by reference in its entirety. Furthermore, additionally or alternatively, the operation may utilize the variant call application Strelka2™ application by Illumina Inc., hosted at https://github.com/Illumina/strelka and described in the article Kim, S., Scheffler, K., Halpern, A.L., Bekritsky, M.A., Noh, E., Källberg, M., Chen, X., Beyter, D., Krusche, P., and Saunders, C.T. (2017). Strelka2: Fast and accurate variant calling for clinical sequencing applications, the complete subject matter of which is expressly incorporated herein by reference in its entirety. Moreover, additionally or alternatively, the operation may utilize a variant annotation/call tool, such as the Nirvana™ application by Illumina Inc., hosted at https://github.com/Illumina/Nirvana/wiki and described in the article Stromberg, Michael & Roy, Rajat & Lajugie, Julien & Jiang, Yu & Li, Haochen & Margulies, Elliott. (2017).
Nirvana: Clinical Grade Variant Annotator. 596-596. 10.1145/3107411.3108204, the complete subject matter of which is expressly incorporated herein by reference in its entirety.
Such a variant annotation/call tool can apply different algorithmic techniques such as those disclosed in Nirvana:

a. Identifying all overlapping transcripts with an Interval Array: For functional annotation, we can identify all transcripts overlapping a variant, and an interval tree can be used. However, since the set of intervals can be static, we were able to further optimize it to an Interval Array. An interval tree returns all overlapping transcripts in O(min(n, k lg n)) time, where n is the number of intervals in the tree and k is the number of overlapping intervals. In practice, since k is really small compared to n for most variants, the effective runtime on an interval tree would be O(k lg n). We improved this to O(lg n + k) by creating an interval array where all intervals are stored in a sorted array, so that we only need to find the first overlapping interval and then enumerate through the remaining (k-1) (see the sketch after this list).

b. CNVs/SVs (Yu): Annotations for Copy Number Variation and Structural Variants can be provided. Similar to the annotation of small variants, transcripts overlapping with the SV, and also previously reported structural variants in online databases, can be annotated. Unlike the small variants, not all overlapping transcripts need be annotated, since too many transcripts will overlap with a large SV. Instead, all overlapping transcripts can be annotated that belong to a partially overlapping gene. Specifically, for these transcripts, the impacted introns, exons and the consequences caused by the structural variants can be reported. An option to allow output of all overlapping transcripts is available, but only the basic information for these transcripts is reported, such as gene symbol and a flag indicating whether the overlap with the transcript is canonical or partial. For each SV/CNV, it is also of interest to know whether these variants have been reported and their frequencies in different populations. Hence, we reported overlapping SVs in external databases, such as 1000 Genomes, DGV and ClinGen. To avoid using an arbitrary cutoff to determine which SV is overlapped, all overlapping SVs can instead be used and the reciprocal overlap can be calculated, i.e., the overlapping length divided by the minimum of the lengths of the two SVs.

c. Reporting supplementary annotations: Supplementary annotations are of two types: small variants and structural variants (SVs). SVs can be modeled as intervals and use the interval array discussed above to identify overlapping SVs. Small variants are modeled as points and matched by position and (optionally) allele. As such, they are searched using a binary-search-like algorithm. Since the supplementary annotation database can be quite large, a much smaller index is created to map chromosome positions to file locations where the supplementary annotation resides. The index is a sorted array of objects (made up of a position and a file location) that can be binary searched using position. To keep the index size small, multiple positions (up to a certain max count) are compressed to one object that stores the values for the first position and only deltas for subsequent positions.
Since we use binary search, the lookup is O(lg n), where n is the number of items in the database.

d. VEP cache files.

e. Transcript Database: The Transcript Cache (cache) and Supplementary database (SAdb) files are serialized dumps of data objects such as transcripts and supplementary annotations. We use the Ensembl VEP cache as our data source for the cache. To create the cache, all transcripts are inserted in an interval array and the final state of the array is stored in the cache files. Thus, during annotation, we only need to load a pre-computed interval array and perform searches on it. Since the cache is loaded up in memory and searching is very fast (described above), finding overlapping transcripts is extremely quick in Nirvana (profiled to less than 1% of total runtime).

f. Supplementary Database: The data sources for SAdb are listed under supplementary material. The SAdb for small variants is produced by a k-way merge of all data sources such that each object in the database (identified by reference name and position) holds all relevant supplementary annotations. Issues encountered during parsing data source files have been documented in detail in Nirvana's home page. To limit memory usage, only the SA index is loaded up in memory. This index allows a quick lookup of the file location for a supplementary annotation. However, since the data has to be fetched from disk, adding supplementary annotation has been identified as Nirvana's largest bottleneck (profiled at ~30% of total runtime).

g. Consequence and Sequence Ontology: Nirvana's functional annotation (when provided) follows the Sequence Ontology (SO) (http://www.sequenceontology.org/) guidelines. On occasions, we had the opportunity to identify issues in the current SO and collaborate with the SO team to improve the state of annotation.
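As referenced in item (a) above, a simplified sketch of the Interval Array lookup follows. It assumes intervals sorted by start position with non-decreasing end positions (i.e., no containment), a simplification made here for brevity and not required by Nirvana itself.

```python
# With intervals sorted by start, binary search finds the last candidate in
# O(lg n) and the k overlaps are enumerated in O(lg n + k).
import bisect

# (start, end) transcript intervals, sorted by start; coordinates illustrative
intervals = [(100, 200), (150, 300), (400, 500), (600, 900)]
starts = [s for s, _ in intervals]

def overlapping(query_start: int, query_end: int):
    """Return all intervals overlapping [query_start, query_end]."""
    hits = []
    # last interval whose start could still overlap the query
    i = bisect.bisect_right(starts, query_end) - 1
    # walk left while intervals still reach the query (no-containment assumption)
    while i >= 0 and intervals[i][1] >= query_start:
        hits.append(intervals[i])
        i -= 1
    return hits[::-1]

print(overlapping(180, 450))  # [(100, 200), (150, 300), (400, 500)]
```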
Such a variant annotation tool can include pre-processing. For example, Nirvana included a large number of annotations from external data sources, like ExAC, EVS, 1000 Genomes Project, dbSNP, ClinVar, Cosmic, DGV and ClinGen. To make full use of these databases, we have to sanitize the information from them. We implemented different strategies to deal with the different conflicts that exist among different data sources. For example, in case of multiple dbSNP entries for the same position and alternate allele, we join all ids into a comma separated list of ids; if there are multiple entries with different CAF values for the same allele, we use the first CAF value. For conflicting ExAC and EVS entries, we consider the number of sample counts, and the entry with the higher sample count is used. In 1000 Genomes Projects, we removed the allele frequency of the conflicting allele. Another issue is inaccurate information. We mainly extracted the allele frequency information from 1000 Genomes Projects; however, we noticed that the allele frequency reported in the info field did not exclude samples with genotype not available, leading to deflated frequencies for variants which are not available for all samples. To guarantee the accuracy of our annotation, we use all of the individual-level genotypes to compute the true allele frequencies. As we know, the same variants can have different representations based on different alignments. To make sure we can accurately report the information for already identified variants, we have to preprocess the variants from different sources to make them have consistent representation. For all external data sources, we trimmed variants to remove duplicated nucleotides in both the reference allele and the alternative allele. For ClinVar, we directly parsed the xml file and performed a five-prime alignment for all variants, which is often used in vcf files.
Different databases can contain the same set of information. To avoid unnecessary duplicates, we removed some duplicated information. For example, we removed variants in DGV whose data source is the 1000 Genomes Project, since we already reported these variants in 1000 Genomes with more detailed information.
In accordance with at least some implementations, the variant call application provides calls for low frequency variants, germline calling and the like. As a non-limiting example, the variant call application may run on tumor-only samples and/or tumor-normal paired samples. The variant call application may search for single nucleotide variations (SNV), multiple nucleotide variations (MNV), indels and the like. The variant call application identifies variants, while accounting for mismatches due to sequencing or sample preparation errors. For each variant, the variant caller identifies the reference sequence, a position of the variant, and the potential variant sequence(s) (e.g., A to C SNV, or AG to A deletion). The variant call application identifies the sample sequence (or sample fragment), a reference sequence/fragment, and a variant call as an indication that a variant is present. The variant call application may identify raw fragments, and output a designation of the raw fragments, a count of the number of raw fragments that verify the potential variant call, the position within the raw fragment at which a supporting variant occurred, and other relevant information. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex stitched fragment, a duplex un-stitched fragment and a simplex un-stitched fragment.
The variant call application may output the calls in various formats, such as in a .VCF or .GVCF file.
By way of example only, the variant call application may be included in a MiSeqReporter pipeline (e.g., when implemented on the MiSeq® sequencer instrument). Optionally, the application may be implemented with various workflows. The analysis may include a single protocol or a combination of protocols that analyze the sample reads in a designated manner to obtain desired information.
Then, the one or more processors perform a validation operation in connection with the potential variant call. The validation operation may be based on a quality score, and/or a hierarchy of tiered tests, as explained hereafter. When the validation operation authenticates or verifies the potential variant call, the validation operation passes the variant call information (from the variant call application) to the sample report generator. Alternatively, when the validation operation invalidates or disqualifies the potential variant call, the validation operation passes a corresponding indication (e.g., a negative indicator, a no call indicator, an in-valid call indicator) to the sample report generator. The validation operation also may pass a confidence score related to a degree of confidence that the variant call is correct or the in-valid call designation is correct.
Next, the one or more processors generate and store a sample report. The sample report may include, for example, information regarding a plurality of genetic loci with respect to the sample. For example, for each genetic locus of a predetermined set of genetic loci, the sample report may at least one of: provide a genotype call; indicate that a genotype call cannot be made; provide a confidence score on a certainty of the genotype call; or indicate potential problems with an assay regarding one or more genetic loci. The sample report may also indicate a gender of an individual that provided a sample and/or indicate that the sample includes multiple sources. As used herein, a "sample report" may include digital data (e.g., a data file) of a genetic locus or predetermined set of genetic loci and/or a printed report of the genetic locus or the set of genetic loci. Thus, generating or providing may include creating a data file and/or printing the sample report, or displaying the sample report.
The sample report may indicate that a variant call was determined but was not validated. When a variant call is determined invalid, the sample report may indicate additional information regarding the basis for the determination to not validate the variant call. For example, the additional information in the report may include a description of the raw fragments and an extent (e.g., a count) to which the raw fragments support or contradict the variant call. Additionally or alternatively, the additional information in the report may include the quality score obtained in accordance with implementations described herein.
Variant Call Application

Implementations disclosed herein include analyzing sequencing data to identify potential variant calls. Variant calling may be performed upon stored data for a previously performed sequencing operation. Additionally or alternatively, it may be performed in real time while a sequencing operation is being performed. Each of the sample reads is assigned to corresponding genetic loci. The sample reads may be assigned to corresponding genetic loci based on the sequence of the nucleotides of the sample read or, in other words, the order of nucleotides within the sample read (e.g., A, C, G, T). Based on this analysis, the sample read may be designated as including a possible variant/allele of a particular genetic locus. The sample read may be collected (or aggregated or binned) with other sample reads that have been designated as including possible variants/alleles of the genetic locus. The assigning operation may also be referred to as a calling operation in which the sample read is identified as being possibly associated with a particular genetic position/locus. The sample reads may be analyzed to locate one or more identifying sequences (e.g., primer sequences) of nucleotides that differentiate the sample read from other sample reads. More specifically, the identifying sequence(s) may identify the sample read from other sample reads as being associated with a particular genetic locus.
The assigning operation may include analyzing the series of n nucleotides of the identifying sequence to determine if the series of n nucleotides of the identifying sequence effectively matches one or more of the select sequences. In particular implementations, the assigning operation may include analyzing the first n nucleotides of the sample sequence to determine if the first n nucleotides of the sample sequence effectively match one or more of the select sequences. The number n may have a variety of values, which may be programmed into the protocol or entered by a user. For example, the number n may be defined as the number of nucleotides of the shortest select sequence within the database. The number n may be a predetermined number. The predetermined number may be, for example, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides. However, fewer or more nucleotides may be used in other implementations. The number n may also be selected by an individual, such as a user of the system. The number n may be based on one or more conditions. For instance, the number n may be defined as the number of nucleotides of the shortest primer sequence within the database or a designated number, whichever is the smaller number. In some implementations, a minimum value for n may be used, such as 15, such that any primer sequence that is less than 15 nucleotides may be designated as an exception.
In some cases, the series of n nucleotides of an identifying sequence may not precisely match the nucleotides of the select sequence. Nonetheless, the identifying sequence may effectively match the select sequence if the identifying sequence is nearly identical to the select sequence. For example, the sample read may be called for a genetic locus if the series of n nucleotides (e.g., the first n nucleotides) of the identifying sequence match a select sequence with no more than a ated number of mismatches (e.g., 3) and/or a designated number of shifts (e.g., 2). Rules may be established such that each mismatch or shift may count as a difference between the sample read and the primer sequence. If the number of differences is less than a designated number, then the sample read may be called for the corresponding genetic locus (i.e., assigned to the corresponding genetic locus). In some implementations, a matching score may be determined that is based on the number of differences between the identifying sequence of the sample read and the select sequence associated with a genetic locus. If the matching score passes a designated matching old, then the genetic locus that corresponds to the select sequence may be designated as a potential locus for the sample read. In some entations, subsequent analysis may be performed to determine whether the sample read is called for the genetic locus.
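A sketch of the effective-match test under these example cutoffs follows; for brevity it counts only mismatches, not shifts, and all names and values are illustrative.

```python
# Compare the first n nucleotides of a sample read to a select (e.g., primer)
# sequence and call an effective match when the mismatch count stays within
# the designated cutoff (3 in the example above).
def effectively_matches(read: str, select: str, n: int = 15,
                        max_mismatches: int = 3) -> bool:
    series = read[:n]
    if len(series) < n or len(select) < n:
        return False
    mismatches = sum(1 for a, b in zip(series, select[:n]) if a != b)
    return mismatches <= max_mismatches

print(effectively_matches("ACGTACGTACGTACG", "ACGTACGTACGTACG"))  # True (exact)
print(effectively_matches("ACGTACGTACGTACG", "TTTTACGTACGTACG"))  # True (3 mismatches)
print(effectively_matches("ACGTACGTACGTACG", "TTTTTCGTACGTACG"))  # False (4 mismatches)
```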
If the sample read effectively matches one of the select sequences in the database (i.e., exactly matches or nearly matches as described above), then the sample read is assigned or designated to the genetic locus that correlates to the select sequence. This may be referred to as locus calling or provisional-locus calling, wherein the sample read is called for the genetic locus that correlates to the select sequence. However, as discussed above, a sample read may be called for more than one genetic locus. In such implementations, further analysis may be performed to call or assign the sample read for only one of the potential genetic loci. In some implementations, the sample read that is compared to the database of reference sequences is the first read from paired-end sequencing.
When performing paired-end sequencing, a second read (representing a raw fragment) is obtained that correlates to the sample read. After assigning, the subsequent analysis that is performed with the assigned reads may be based on the type of genetic locus that has been called for the assigned read.
Next, the sample reads are analyzed to identify potential variant calls. Among other things, the results of the analysis identify the potential variant call, a sample variant frequency, a reference sequence and a position within the genomic sequence of interest at which the variant occurred. For example, if a genetic locus is known for including SNPs, then the assigned reads that have been called for the genetic locus may undergo analysis to identify the SNPs of the assigned reads. If the genetic locus is known for including polymorphic repetitive DNA elements, then the assigned reads may be analyzed to identify or characterize the polymorphic repetitive DNA elements within the sample reads. In some implementations, if an assigned read effectively matches with a STR locus and a SNP locus, a warning or flag may be assigned to the sample read. The sample read may be designated as both a STR locus and an SNP locus. The analyzing may include aligning the assigned reads in accordance with an alignment protocol to determine sequences and/or lengths of the assigned reads. The alignment protocol may include the method described in International Patent Application No. (Publication No. ) filed on March 15, 2013, which is herein incorporated by reference in its entirety.
Then, the one or more processors analyze raw fragments to determine whether supporting variants exist at corresponding positions within the raw fragments. Various types of raw fragments may be identified. For example, the variant caller may identify a type of raw fragment that exhibits a variant that validates the original variant call. For example, the type of raw fragment may represent a duplex stitched fragment, a simplex stitched fragment, a duplex un-stitched fragment or a simplex un-stitched fragment. Optionally, other raw fragments may be identified instead of or in addition to the foregoing examples. In connection with identifying each type of raw fragment, the variant caller also identifies the position, within the raw fragment, at which the supporting variant occurred, as well as a count of the number of raw fragments that exhibited the supporting variant. For example, the variant caller may output an indication that 10 reads of raw fragments were identified to represent duplex stitched fragments having a supporting variant at a particular position X. The variant caller may also output an indication that five reads of raw fragments were identified to represent simplex un-stitched fragments having a supporting variant at a particular position Y. The variant caller may also output a number of raw fragments that corresponded to reference sequences and thus did not include a supporting variant that would otherwise provide evidence validating the potential variant call at the genomic sequence of interest.
Next, a count is maintained of the raw fragments that include supporting variants, as well as the position at which the supporting variant occurred. Additionally or alternatively, a count may be maintained of the raw fragments that did not include supporting variants at the position of interest (relative to the position of the potential variant call in the sample read or sample fragment). Additionally or alternatively, a count may be maintained of raw fragments that correspond to a reference sequence and do not authenticate or confirm the potential variant call. The information determined is output to the variant call validation application, including a count and type of the raw fragments that support the potential variant call, positions of the supporting variants in the raw fragments, a count of the raw fragments that do not support the potential variant call, and the like.
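The bookkeeping described above can be pictured as a simple tally keyed by fragment type and position; the sketch below, with invented counts, is only meant to make the data flow concrete:

```python
from collections import Counter

# Tally of raw fragments supporting the potential variant call, keyed by
# (fragment type, position), plus a count of reference-matching fragments.
support = Counter()
support[("duplex stitched", "X")] += 10     # e.g., 10 supporting reads at position X
support[("simplex un-stitched", "Y")] += 5  # e.g., 5 supporting reads at position Y
reference_only = 42                         # fragments matching the reference sequence

# These counts are what gets passed to the variant call validation application.
print(dict(support), reference_only)
```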
When a potential variant call is identified, the process outputs an indication of the potential variant call, the variant sequence, the variant position and a reference sequence associated therewith. The variant call is designated to represent a "potential" variant as errors may cause the call process to identify a false variant. In accordance with implementations herein, the potential variant call is analyzed to reduce and eliminate false variants or false positives. Additionally or alternatively, the process analyzes one or more raw fragments associated with a sample read and outputs a corresponding variant call associated with the raw fragments.
Benign Training Set Generation Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in human, enabling pathogenic mutations to be systematically identified by the process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we train a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.
The clinical actionability of diagnostic sequencing is limited by the difficulty of interpreting rare genetic variants in human populations and inferring their impact on disease risk. Because of their deleterious effects on fitness, clinically significant genetic variants tend to be extremely rare in the population and, for the vast majority, their effects on human health have not been determined. The large number and rarity of these variants of uncertain clinical significance present a formidable obstacle to the adoption of sequencing for individualized medicine and population-wide health screening.
Most penetrant Mendelian diseases have very low prevalence in the population, hence the observation of a variant at high frequencies in the population is strong evidence in favor of benign consequence. Assaying common variation across diverse human populations is an effective strategy for cataloguing benign variants, but the total amount of common variation in present-day humans is limited due to bottleneck events in our species' recent history, during which a large fraction of ancestral diversity was lost. Population studies of present-day humans show a remarkable expansion from an effective population size (Ne) of less than 10,000 individuals within the last 15,000–65,000 years, and the small pool of common polymorphisms traces back to the limited capacity for variation in a population of this size. Out of more than 70 million potential protein-altering missense substitutions in the reference genome, only roughly 1 in 1,000 are present at greater than 0.1% overall population allele frequency.
Outside of modern human populations, chimpanzees comprise the next closest extant species, and share 99.4% amino acid sequence identity. The near-identity of the protein-coding sequence in humans and chimpanzees suggests that purifying selection acting on chimpanzee protein-coding variants might also model the consequences on fitness of human mutations that are identical-by-state.
Because the mean time for neutral polymorphisms to persist in the ancestral human lineage (~4Ne generations) is a fraction of the species' divergence time (~6 million years ago), naturally occurring chimpanzee variation explores mutational space that is largely non-overlapping except by chance, aside from rare instances of haplotypes maintained by balancing selection. If polymorphisms that are identical-by-state similarly affect fitness in the two species, the presence of a variant at high allele frequencies in chimpanzee populations should indicate benign consequence in human, expanding the catalog of known variants whose benign consequence has been established by purifying selection.
Results – Common Variants in Other Primates are Largely Benign in Human The recent availability of aggregated exome data, comprising 123,136 humans collected in the Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD), allows us to measure the impact of natural selection on missense and synonymous mutations across the allele frequency spectrum. Rare singleton variants that are observed only once in the cohort closely match the expected 2.2/1 missense/synonymous ratio predicted by de novo mutation after adjusting for the effects of trinucleotide context on mutational rate (A and FIGs. 52A, 52B, 52C, and 52D), but at higher allele frequencies the number of observed missense variants decreases due to the purging of deleterious mutations by natural selection. The gradual decrease of missense/synonymous ratios with increasing allele frequency is consistent with a substantial proportion of missense variants of population frequency < 0.1% having a mildly deleterious consequence despite being observed in healthy individuals. These findings support the widespread empirical practice by diagnostic laboratories of filtering out variants with greater than 0.1% to ~1% allele frequency as probably benign for penetrant genetic disease, aside from a handful of well-documented exceptions due to balancing selection and founder effects.
We identified common chimpanzee variants that were sampled two or more times in a cohort of 24 unrelated individuals; we estimate that 99.8% of these variants are common in the general chimpanzee population (allele frequency (AF) > 0.1%), indicating that these variants have already passed through the sieve of purifying selection. We examined the human allele frequency spectrum for the corresponding identical-by-state human variants (B), excluding the extended major histocompatibility complex region as a known region of balancing selection, along with variants lacking a one-to-one mapping in the multiple sequence alignment. For human variants that are identical-by-state with common chimpanzee variants, the missense/synonymous ratio is largely constant across the human allele frequency spectrum (P > 0.5 by Chi-squared (χ2) test), which is consistent with the absence of negative selection against common chimpanzee variants in the human population and concordant selection coefficients on missense variants in the two species. The low missense/synonymous ratio observed in human variants that are identical-by-state with common chimpanzee variants is consistent with the larger effective population size in chimpanzee (Ne ~ 73,000), which enables more efficient filtering of mildly deleterious variation.
In contrast, for singleton chimpanzee variants (sampled only once in the cohort), we observe a significant decrease in the missense/synonymous ratio at common allele frequencies (P < 5.8 × 10−6; C), indicating that 24% of singleton chimpanzee missense variants would be filtered by purifying selection in human populations at allele frequencies greater than 0.1%. This depletion indicates that a significant fraction of the chimpanzee singleton variants are rare deleterious mutations whose damaging effects on fitness have prevented them from reaching common allele frequencies in either species. We estimate that only 69% of singleton variants are common (AF > 0.1%) in the general chimpanzee population.
We next identified human variants that are identical-by-state with variation observed in at least one of six non-human primate species. Variation in each of the six species was ascertained either from the great ape genome project (chimp, bonobo, gorilla, and orangutan) or from submissions to the Single Nucleotide Polymorphism Database (dbSNP) from the primate genome projects (rhesus, marmoset), and largely represents common variants, based on the limited number of individuals sequenced and the low missense:synonymous ratios observed for each species (Supplementary Table 1). Similar to chimpanzee, we found that the missense/synonymous ratios for variants from the six non-human primate species are roughly equal across the human allele frequency spectrum, other than a mild depletion of missense variation at common allele frequencies (D, FIGs. 53, 54, and 55, Supplementary Data File 1), which is expected due to the inclusion of a minority of rare variants (~16% with under 0.1% allele frequency in chimpanzee, and less in other species due to fewer individuals sequenced). These results suggest that the selection coefficients on identical-by-state missense variants are concordant within the primate lineage at least out to New World monkeys, which are estimated to have diverged from the human ancestral lineage ~35 million years ago.
We found that human missense variants that are identical-by-state with observed primate variants are strongly enriched for benign consequence in the ClinVar database. After excluding variants of uncertain significance and those with conflicting annotations, ClinVar variants that are present in at least one non-human primate species are annotated as benign or likely benign 90% of the time on average, compared to 35% for ClinVar missense variants in general (P < 10-40; E). The pathogenicity of ClinVar annotations for primate variants is slightly greater than that observed from sampling a similarly sized cohort of healthy humans (~95% benign or likely benign consequence, P = 0.07), excluding human variants with greater than 1% allele frequency to reduce curation bias.
The field of human genetics has long relied on model organisms to infer the clinical impact of human mutations, but the long evolutionary distance to most genetically tractable animal models raises concerns about the extent to which findings in model organisms are generalizable back to human. We extended our analysis beyond the primate lineage to include largely common variation from four additional mammalian species (mouse, pig, goat, and cow) and two species of more distant vertebrates (chicken and zebrafish). We selected species with sufficient genome-wide ascertainment of variation in dbSNP, and confirmed that these are largely common variants, based on missense/synonymous ratios being much lower than 2.2/1. In contrast to our primate analyses, human missense mutations that are identical-by-state with variation in more distant species are markedly depleted at common allele frequencies (A), and the magnitude of this depletion increases at longer evolutionary distances (B and Supplementary Tables 2 and 3).
The missense mutations that are deleterious in human, yet tolerated at high allele frequencies in more distant species, indicate that the coefficients of selection for identical-by-state missense mutations have diverged substantially between humans and more distant species. Nevertheless, the presence of a missense variant in more distant mammals still increases the likelihood of benign consequence, as the fraction of missense variants depleted by natural selection at common allele frequencies is less than the ~50% depletion observed for human missense variants in general (A). Consistent with these results, we found that ClinVar missense variants that have been observed in mouse, pig, goat, and cow are 73% likely to be annotated with benign or likely benign consequence, compared to 90% for primate variation (P < 2 × 10-8; C), and 35% for the ClinVar database overall.
To confirm that evolutionary distance, and not domestication artifact, is the primary driving force for the divergence of the selection coefficients, we repeated the analysis using fixed substitutions between pairs of closely related species in lieu of intraspecies polymorphisms across a broad range of evolutionary distances (D, Supplementary Table 4 and Supplementary Data File 2). We found that the depletion of human missense variants that are identical-by-state with inter-species fixed substitutions increases with evolutionary branch length, with no discernible difference for wild species compared to those exposed to domestication. This agrees with earlier work in fly and yeast, which found that the number of identical-by-state fixed missense substitutions was lower than expected by chance in divergent lineages.
Deep Learning Network for Variant Pathogenicity Classification The technology disclosed provides a deep learning network for variant pathogenicity classification.
The importance of variant classification for clinical applications has inspired numerous attempts to use supervised machine learning to address the problem, but these efforts have been hindered by the lack of an adequately sized truth dataset containing confidently labeled benign and pathogenic variants for training.
Existing databases of human expert curated variants do not represent the entire genome, with ~50% of the variants in the ClinVar database coming from only 200 genes (~1% of human protein-coding genes). Moreover, systematic studies identify that many human expert annotations have questionable supporting evidence, underscoring the difficulty of interpreting rare variants that may be observed in only a single patient. Although human expert interpretation has become increasingly rigorous, classification guidelines are largely formulated around consensus practices and are at risk of reinforcing existing tendencies. To reduce human interpretation biases, recent classifiers have been trained on common human polymorphisms or fixed human–chimpanzee substitutions, but these classifiers also use as their input the prediction scores of earlier classifiers that were trained on human curated databases. Objective benchmarking of the performance of these various methods has been elusive in the absence of an independent, bias-free truth dataset.
Variation from the six non-human primates (chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset) contributes over 300,000 unique missense variants that are non-overlapping with common human variation, and largely represents common variants of benign consequence that have passed through the sieve of purifying selection, greatly enlarging the training dataset available for machine learning approaches. On average, each primate species contributes more variants than the whole of the ClinVar database (~42,000 missense variants as of November 2017, after excluding variants of uncertain significance and those with conflicting annotations).
Additionally, this content is free from biases in human interpretation.
Using a dataset comprising common human variants (AF > 0.1%) and primate variation (Supplementary Table 5), we trained a novel deep residual network, PrimateAI, which takes as input the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species. Unlike existing classifiers that employ human-engineered features, our deep learning network learns to extract features directly from the primary sequence. To incorporate information about protein structure, we trained separate networks to predict the secondary structure and solvent accessibility from the sequence alone, and then included these as subnetworks in the full model. Given the small number of human proteins that have been successfully crystallized, inferring structure from the primary sequence has the advantage of avoiding biases due to incomplete protein structure and functional domain annotation. The total depth of the network, with protein structure included, was 36 layers of convolutions, comprising roughly 400,000 trainable parameters.
To train a classifier using only variants with benign labels, we framed the prediction problem as whether a given mutation is likely to be observed as a common variant in the population. Several factors influence the probability of observing a variant at high allele frequencies, of which we are interested only in deleteriousness; other factors include mutation rate, technical artifacts such as sequencing coverage, and factors impacting neutral genetic drift such as gene conversion.
We matched each variant in the benign training set with a missense mutation that was absent in 123,136 exomes from the ExAC database, controlling for each of these confounding factors, and trained the deep learning network to distinguish between benign variants and matched controls. As the number of unlabeled variants greatly exceeds the size of the labeled benign training dataset, we trained eight networks in parallel, each using a different set of unlabeled variants matched to the benign training dataset, to obtain a consensus prediction.
Using only the primary amino acid sequence as its input, the deep learning network accurately assigns high pathogenicity scores to residues at critical protein functional domains, as shown for the voltage-gated sodium channel SCN2A, a major disease gene in epilepsy, autism, and intellectual disability. The structure of SCN2A comprises four homologous repeats, each containing six transmembrane helices (S1–S6). On membrane depolarization, the positively charged S4 transmembrane helix moves towards the extracellular side of the membrane, causing the S5/S6 pore-forming domains to open via the S4–S5 linker. Mutations in the S4, S4–S5 linker, and S5 domains, which are clinically associated with early onset epileptic encephalopathy, are predicted by the network to have the highest pathogenicity scores in the gene, and are depleted for variants in the healthy population (Supplementary Table 6). We also found that the network recognizes important amino acid positions within domains, and assigns the highest pathogenicity scores to mutations at these positions, such as the DNA-contacting residues of transcription factors and the catalytic residues of enzymes (FIGs. 25A, 25B, 25C, and 26).
To better understand how the deep learning network derives insights into protein structure and function from the primary sequence, we visualized the trainable parameters from the first three layers of the network. Within these layers, we observed that the network learns correlations between the weights of different amino acids that approximate existing measurements of amino acid distance such as Grantham score. The outputs of these initial layers become the inputs for later layers, enabling the deep learning network to construct progressively higher order representations of the data.
We compared the performance of our network with existing classification algorithms, using 10,000 common primate variants that were withheld from training. Because ~50% of all newly arising human missense variants are filtered by purifying selection at common allele frequencies (A), we determined the 50th-percentile score for each classifier on a set of 10,000 randomly selected variants that were matched to the 10,000 common primate variants by mutational rate and sequencing coverage, and evaluated the accuracy of each classifier at that threshold (D, A, and Supplementary Data File 4). Our deep learning network (91% accuracy) surpassed the performance of other classifiers (80% accuracy for the next best model) at assigning benign consequence to the 10,000 withheld common primate variants.
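A minimal sketch of this threshold-normalization step, with synthetic score distributions standing in for a real classifier's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
matched_scores = rng.beta(2, 2, size=10_000)    # scores on matched random variants
primate_scores = rng.beta(1.5, 3, size=10_000)  # scores on withheld primate variants

# Each classifier's cutoff is its own 50th-percentile score on the matched set,
# so classifiers with very different score scales can be compared fairly.
threshold = np.percentile(matched_scores, 50)
benign_fraction = np.mean(primate_scores < threshold)  # below cutoff -> benign
print(f"cutoff = {threshold:.3f}, benign fraction = {benign_fraction:.1%}")
```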
Roughly half the improvement over existing methods comes from using the deep learning network and half comes from augmenting the training dataset with primate variation, as compared to the accuracy of the network trained with human variation data only (D). To test the classification of variants of uncertain significance in a clinical scenario, we evaluated the ability of the deep learning network to distinguish between de novo mutations occurring in patients with neurodevelopmental disorders versus healthy controls. By prevalence, neurodevelopmental disorders constitute one of the largest categories of rare genetic diseases, and recent trio sequencing studies have implicated the central role of de novo missense and protein truncating mutations.
We classified each confidently called de novo missense variant in 4,293 affected individuals from the Deciphering Developmental Disorders (DDD) cohort versus de novo missense variants from 2,517 unaffected siblings in the Simons Simplex Collection (SSC) cohort, and measured the difference in prediction scores between the two distributions with the Wilcoxon rank-sum test (E and FIGs. 29A and 29B). The deep learning network clearly outperforms other classifiers on this task (P < 10-28; F and B). Moreover, the performance of the various classifiers on the withheld primate variant dataset and the DDD cases versus controls dataset was correlated (Spearman ρ = 0.57, P < 0.01), indicating good agreement between the two datasets for evaluating pathogenicity, despite using entirely different sources and methodologies (A).
We next sought to estimate the accuracy of the deep learning network at classifying benign versus pathogenic mutations within the same gene. Given that the DDD population largely comprises index cases of affected children without affected first-degree relatives, it is important to show that the classifier has not inflated its accuracy by favoring pathogenicity in genes with de novo dominant modes of inheritance. We restricted the analysis to 605 genes that were nominally significant for disease association in the DDD study, calculated from protein-truncating variation only (P < 0.05). Within these genes, de novo missense mutations are enriched 3/1 compared to expectation (A), indicating that ~67% are pathogenic.
The deep learning network was able to discriminate pathogenic and benign de novo variants within the same set of genes (P < 10-15; B), outperforming other methods by a large margin (FIGs. 22C and 28C). At a binary cutoff of ≥ 0.803 (FIGs. 22D and 30B), 65% of de novo missense mutations in cases are classified by the deep learning network as pathogenic, compared to 14% of de novo missense mutations in controls, corresponding to a classification accuracy of 88% (FIGs. 22E and 30C). Given frequent incomplete penetrance and variable expressivity in neurodevelopmental disorders, this figure probably underestimates the accuracy of our classifier due to the inclusion of partially penetrant pathogenic variants in controls.
Novel Candidate Gene Discovery Applying a threshold of ≥ 0.803 to stratify pathogenic missense mutations increases the enrichment of de novo missense mutations in DDD patients from 1.5-fold to 2.2-fold, close to protein-truncating mutations (2.5-fold), while relinquishing less than one-third of the total number of variants enriched above expectation. This substantially improves statistical power, enabling discovery of 14 additional candidate genes in intellectual disability, which had previously not reached the genome-wide significance threshold in the original DDD study (Table 1).
Comparison with Human Expert Curation We examined the performance of various classifiers on recent human expert-curated variants from the ClinVar database, but found that the performance of classifiers on the ClinVar dataset was not significantly correlated with either the withheld primate variant dataset or the DDD case versus control dataset (P = 0.12 and P = 0.34, respectively) (FIGs. 31A and 31B). We hypothesize that existing classifiers have biases from human expert curation, and while these human heuristics tend to be in the right direction, they may not be optimal. One example is the mean difference in Grantham score between pathogenic and benign variants in ClinVar, which is twice as large as the difference between de novo variants in DDD cases versus controls within the 605 disease-associated genes (Table 2). In comparison, human expert curation appears to underutilize protein structure, especially the importance of the residue being exposed at the surface where it can be available to interact with other molecules. We observe that both ClinVar pathogenic mutations and DDD de novo mutations are associated with predicted solvent-exposed residues, but the difference in solvent accessibility between benign and pathogenic ClinVar variants is only half that seen for DDD cases versus controls. These findings are suggestive of ascertainment bias in favor of factors that are more straightforward for a human expert to interpret, such as Grantham score and conservation. Machine learning classifiers trained on human curated databases would be expected to reinforce these tendencies.
Our results suggest that systematic primate population sequencing is an effective strategy to classify the millions of human variants of uncertain significance that currently limit clinical genome interpretation. The accuracy of our deep learning network on both withheld common primate variants and clinical variants increases with the number of benign variants used to train the network (A). Moreover, training on variants from each of the six non-human primate species independently contributes to an increase in the performance of the network, whereas training on variants from more distant mammals negatively impacts the performance of the network (FIGs. 23B and 23C). These results support the assertion that common primate variants are largely benign in human with respect to penetrant Mendelian disease, while the same cannot be said of variation in more distant species.
Although the number of non-human primate genomes examined in this study is small compared to the number of human genomes and exomes that have been sequenced, it is important to note that these additional primates contribute a disproportionate amount of information about common benign variation. Simulations with ExAC show that discovery of common human variants (> 0.1% allele frequency) plateaus quickly after only a few hundred individuals, and further healthy population sequencing into the millions mainly contributes additional rare variants. Unlike common variants, which are known to be largely clinically benign based on allele frequency, rare variants in healthy populations may cause recessive genetic diseases or dominant genetic diseases with incomplete penetrance. Because each primate species carries a different pool of common variants, sequencing several dozen members of each species is an effective strategy to systematically catalog benign missense variation in the primate lineage. Indeed, the 134 individuals from six non-human primate species examined in this study contribute nearly four times as many common missense variants as the 123,136 humans from the ExAC study (Supplementary Table 5). Primate population sequencing studies involving hundreds of individuals may be practical even with the relatively small numbers of unrelated individuals residing in wildlife sanctuaries and zoos, thus minimizing the disturbance to wild populations, which is important from the standpoint of conservation and ethical treatment of non-human primates.
Present-day human populations carry much lower genetic diversity than most non-human primate species, with roughly half the number of single nucleotide variants per individual as chimpanzee, gorilla, and gibbon, and one-third as many variants per individual as orangutan. Although genetic diversity levels for the majority of non-human primate species are not known, the large number of extant non-human primate species allows us to extrapolate that the majority of possible benign human missense mutations are likely to be covered by a common variant in at least one primate species, enabling pathogenic variants to be systematically identified by the process of elimination (D). Even with only a subset of these species sequenced, increasing the training data size will enable more accurate prediction of missense consequence with machine learning. Finally, while our findings focus on missense variation, this strategy may also be applicable for inferring the consequences of noncoding variation, particularly in conserved regulatory regions where there is sufficient alignment between human and primate genomes to unambiguously determine whether a variant is identical-by-state.
Of the 504 known non-human primate species, roughly 60% face extinction due to poaching and widespread habitat loss. The reduction in population size and potential extinction of these species represents an irreplaceable loss in genetic diversity, motivating urgency for a worldwide conservation effort that would benefit both these unique and irreplaceable species and our own.
Data Generation and Alignment Coordinates in the application refer to human genome build UCSC hg19/GRCh37, including the coordinates for variants in other species mapped to hg19 using multiple sequence alignments. Canonical transcripts for protein-coding DNA sequence and multiple sequence alignments of 99 vertebrate genomes and branch lengths were downloaded from the UCSC genome browser.
We obtained human exome polymorphism data from the Exome Aggregation Consortium (ExAC)/Genome Aggregation Database (gnomAD exomes) v2.0. We obtained primate variation data from the great ape genome sequencing project, which comprises whole genome sequencing data and genotypes for 24 chimpanzees, 13 bonobos, 27 gorillas, and 10 orangutans. We also included variation from 35 chimpanzees from a separate study of chimpanzees and bonobos, but because of differences in variant calling methodology, we excluded these from the population analysis, and used them only for training the deep learning model. In addition, 16 rhesus individuals and 9 marmoset individuals were used to assay variation in the original genome projects for these species, but individual-level information was not available. We obtained variation data for rhesus, marmoset, pig, cow, goat, mouse, chicken, and zebrafish from dbSNP. dbSNP also included additional orangutan variants, which we only used for training the deep learning model, since individual genotype information was not available for the population analysis. To avoid effects due to balancing selection, we also excluded variants from within the extended major histocompatibility complex region (chr6: 28,477,797-33,448,354) for the population analysis.
We used the multiple species alignment of 99 vertebrates to ensure orthologous one-to-one mapping to human protein-coding regions and prevent mapping to pseudogenes. We accepted variants as identical-by-state if they occurred in either reference/alternative orientation. To ensure that the variant had the same predicted protein-coding consequence in both human and the other species, we required that the other two nucleotides in the codon were identical between the species, for both missense and synonymous variants. Polymorphisms from each species included in the analysis are listed in Supplementary Data File 1 and detailed metrics are shown in Supplementary Table 1.
For each of the four allele frequency categories (A), we used variation in intronic regions to estimate the expected number of synonymous and missense variants in each of 96 possible trinucleotide contexts and to correct for mutational rate (Supplementary Tables 7 and 8). We also separately analyzed identical-by-state CpG dinucleotide and non-CpG dinucleotide variants, and verified that the missense/synonymous ratio was flat across the allele frequency spectrum for both classes, indicating that our analysis holds for both CpG and non-CpG variants, despite the large difference in their mutation rate (FIGs. 52A, 52B, 52C, and 52D).
Depletion of Human Missense Variants that are Identical-by-State with Polymorphisms in other Species To evaluate whether variants present in other species would be tolerated at common allele frequencies (> 0.1%) in human, we identified human variants that were identical-by-state with variation in the other species. For each of the variants, we assigned them to one of the four categories based on their allele frequencies in human populations (singleton, more than singleton to ~0.01%, 0.01% to ~0.1%, > 0.1%), and estimated the decrease in missense/synonymous ratios (MSR) between the rare (< 0.1%) and common (> 0.1%) variants. The depletion of identical-by-state missense variants at common human allele frequencies (> 0.1%) indicates the fraction of variants from the other species that are sufficiently deleterious that they would be filtered out by natural selection at common allele frequencies in human: %depletion = (MSRrare − MSRcommon) / MSRrare. The missense/synonymous ratios and the percentages of depletion are computed per species and are shown in B and Supplementary Table 2. In addition, for chimpanzee common variants (B), chimpanzee singleton variants (C), and mammal variants (A), we performed the Chi-squared (χ2) test of homogeneity on the 2×2 contingency table to test if the differences in missense/synonymous ratios between rare and common variants were significant. Because sequencing was only performed on limited numbers of individuals from the great ape genome project, we used the human allele frequency spectrum from ExAC to estimate the fraction of sampled variants that were rare (< 0.1%) or common (> 0.1%) in the general chimpanzee population. We sampled a cohort of 24 humans based on the ExAC allele frequencies and identified missense variants that were observed either once, or more than once, in this cohort. Variants that were observed more than once had a 99.8% chance of being common (> 0.1%) in the general population, whereas variants that were observed only once in the cohort had a 69% chance of being common in the general population. To verify that the observed depletion for missense variants in more distant species was not due to a confounding effect of genes that are better conserved, and hence more accurately aligned, we repeated the above analysis, restricting only to genes with >50% average nucleotide identity in the multiple sequence alignment of 11 primates and 50 mammals compared with human (see Supplementary Table 3).
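The depletion formula and homogeneity test above can be sketched as follows, using illustrative counts rather than the published per-species data:

```python
from scipy.stats import chi2_contingency

rare = {"missense": 12_000, "synonymous": 9_000}   # < 0.1% allele frequency (invented)
common = {"missense": 3_000, "synonymous": 3_000}  # > 0.1% allele frequency (invented)

# %depletion = (MSR_rare - MSR_common) / MSR_rare
msr_rare = rare["missense"] / rare["synonymous"]
msr_common = common["missense"] / common["synonymous"]
depletion = (msr_rare - msr_common) / msr_rare
print(f"depletion = {depletion:.1%}")

# Chi-squared test of homogeneity on the 2x2 contingency table.
table = [[rare["missense"], rare["synonymous"]],
         [common["missense"], common["synonymous"]]]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p_value:.2e}")
```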
This removed ~7% of human protein-coding genes from the analysis, without substantially affecting the results. Additionally, to ensure that our results were not affected by issues with variant calling, or domestication artifacts (since most of the species selected from dbSNP were domesticated), we repeated the analyses using fixed substitutions from pairs of closely related species in lieu of intraspecies polymorphisms (D, Supplementary Table 4, and Supplementary Data File 2).
ClinVar Analysis of Polymorphism Data for Human, Primates, Mammals, and other Vertebrates To examine the clinical impact of variants that are identical-by-state with other species, we downloaded the ClinVar database, excluding those variants that had conflicting annotations of pathogenicity or were only labeled as variants of uncertain significance. Following the filtering steps shown in Supplementary Table 9, there are a total of 24,853 missense variants in the pathogenic category and 17,775 missense variants in the benign category.
We counted the number of pathogenic and benign ClinVar variants that were identical-by-state with variation in human, non-human primates, mammals, and other vertebrates. For human, we simulated a cohort of 30 humans, sampled from ExAC allele frequencies. The numbers of benign and pathogenic variants for each species are shown in Supplementary Table 10.
Generation of Benign and Unlabeled Variants for Model Training We constructed a benign training dataset of largely common benign missense variants from human and non-human primates for machine learning. The dataset comprises common human variants (> 0.1% allele frequency; 83,546 variants), and variants from chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset (301,690 unique primate variants). The number of benign training variants contributed by each source is shown in Supplementary Table 5.
We trained the deep learning network to discriminate between a set of labeled benign variants and an unlabeled set of variants that were matched to control for trinucleotide context, sequencing coverage, and alignability between the species and human. To obtain an unlabeled training dataset, we started with all possible missense variants in canonical coding regions. We excluded variants that were observed in the 123,136 exomes from ExAC, and variants in start or stop codons. In total, 68,258,623 unlabeled missense variants were generated. This was filtered to correct for regions of poor sequencing coverage, and regions where there was not a one-to-one alignment between human and primate genomes when selecting matched unlabeled variants for the primate variants.
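A simplified sketch of matching unlabeled variants to the benign set by trinucleotide context (the coverage and alignability adjustments described above are omitted; the variant representation and function names are hypothetical):

```python
import random
from collections import defaultdict

def sample_matched_controls(benign, unlabeled, seed=0):
    """For each benign variant, draw an unlabeled variant with the same
    trinucleotide context, sampling without replacement."""
    rng = random.Random(seed)
    by_context = defaultdict(list)
    for variant in unlabeled:
        by_context[variant[0]].append(variant)  # variant = (context, identifier)
    matched = []
    for context, _ in benign:
        pool = by_context[context]
        matched.append(pool.pop(rng.randrange(len(pool))))
    return matched

benign = [("ACG", "b1"), ("TCG", "b2")]
unlabeled = [("ACG", "u1"), ("ACG", "u2"), ("TCG", "u3"), ("GGG", "u4")]
print(sample_matched_controls(benign, unlabeled))  # one control per benign variant
```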
We obtained a consensus prediction by training eight models that use the same set of labeled benign variants and eight randomly sampled sets of unlabeled variants, and taking the average of their predictions. We also set aside two randomly sampled sets of 10,000 primate variants for validation and testing, which we withheld from training (Supplementary Data File 3). For each of these sets, we sampled 10,000 unlabeled variants that were matched by trinucleotide context, which we used to normalize the threshold of each classifier when comparing between different classification algorithms (Supplementary Data File 4). In other implementations, fewer or additional models can be used in the ensemble, ranging from 2 to 500.
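The consensus step reduces to averaging per-variant scores across the ensemble; a minimal sketch, with callables standing in for the eight trained networks:

```python
import numpy as np

def consensus_score(models, variants):
    """Average pathogenicity scores across an ensemble of trained models."""
    scores = np.stack([model(variants) for model in models])  # (n_models, n_variants)
    return scores.mean(axis=0)

# Stand-ins for eight trained networks, each returning per-variant scores.
rng = np.random.default_rng(1)
models = [lambda v, b=b: rng.beta(2 + b, 2, size=len(v)) for b in range(8)]
print(consensus_score(models, ["chr1:12345A>G", "chr2:6789C>T"]))
```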
We assessed the classification accuracy of two versions of the deep learning network, one trained with common human variants only, and one trained with the full benign labeled dataset including both common human variants and primate variants.
Architecture of the Deep Learning Network For each variant, the pathogenicity prediction network takes as input the 51-length amino acid sequence centered at the variant of interest and the outputs of the secondary structure and solvent accessibility networks, with the missense variant substituted in at the central position. Three 51-length position frequency matrices are generated from multiple sequence alignments of 99 vertebrates, including one for 11 primates, one for 50 mammals excluding primates, and one for 38 vertebrates excluding primates and mammals.
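For concreteness, a sketch of the one-hot encoding of the 51-residue window (the three position frequency matrices from the primate, mammal, and vertebrate alignments share the same 51 × 20 shape); the window strings here are synthetic:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def one_hot(window: str) -> np.ndarray:
    """Encode a 51-residue window as a 51 x 20 matrix."""
    assert len(window) == 51, "window must be 51 amino acids"
    mat = np.zeros((51, 20), dtype=np.float32)
    for i, aa in enumerate(window):
        mat[i, AMINO_ACIDS.index(aa)] = 1.0
    return mat

ref_window = "A" * 25 + "R" + "A" * 25                # reference residue R at center
alt_window = ref_window[:25] + "W" + ref_window[26:]  # missense substitution R->W
ref, alt = one_hot(ref_window), one_hot(alt_window)
print(ref.shape, alt.shape, int((ref != alt).sum()))  # (51, 20) (51, 20) 2
```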
The secondary structure deep learning network predicts a three-state secondary structure at each amino acid position: alpha helix (H), beta sheet (B), and coil (C) (Supplementary Table 11). The solvent accessibility network predicts a three-state solvent accessibility at each amino acid position: buried (B), intermediate (I), and exposed (E) (Supplementary Table 12). Both networks only take the flanking amino acid sequence as their inputs, and were trained using labels from known non-redundant crystal structures in the Protein DataBank (Supplementary Table 13). For the input to the pre-trained three-state secondary structure and three-state solvent accessibility networks, we used a single position frequency matrix generated from the multiple sequence alignments for all 99 vertebrates, also with length 51 and depth 20. After pre-training the networks on known crystal structures from the Protein DataBank, the final two layers for the secondary structure and solvent models were removed and the output of the network was directly connected to the input of the pathogenicity model. The best testing accuracy achieved for the three-state secondary structure prediction model was 79.86% (Supplementary Table 14). There was no substantial difference when comparing the predictions of the neural network when using DSSP-annotated (Define Secondary Structure of Proteins) structure labels for the approximately ~4,000 human proteins that had crystal structures versus using predicted structure labels only (Supplementary Table 15).
Both our deep learning network for pathogenicity prediction (PrimateAI) and our deep learning networks for predicting secondary structure and solvent accessibility adopted the architecture of residual blocks. The detailed architecture for PrimateAI is described in Supplementary Table 16 (FIGs. 4A, 4B, and 4C). The detailed architecture for the networks for predicting secondary structure and solvent accessibility is described in Supplementary Tables 11 (FIGs. 7A and 7B) and 12 (FIGs. 8A and 8B).
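A minimal PyTorch sketch of a 1D residual block of the kind these architectures are built from (pre-activation batch norm → ReLU → convolution, twice, with a skip connection); the channel count and kernel size are illustrative, not the published hyperparameters:

```python
import torch
from torch import nn

class ResidualBlock1D(nn.Module):
    """Pre-activation residual block over a 51-position sequence axis."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # keep sequence length unchanged
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out  # skip connection

x = torch.randn(8, 40, 51)  # (batch, channels, 51-residue window)
print(ResidualBlock1D(40)(x).shape)  # torch.Size([8, 40, 51])
```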
Benchmarking of Classifier Performance on a Withheld Test Set of 10,000 Primate Variants We used the 10,000 withheld primate variants in the test dataset to benchmark the deep learning network as well as the other 20 previously published classifiers, for which we obtained prediction scores from the database dbNSFP. The performance for each of the classifiers on the 10,000 withheld primate variant test set is also provided in A. Because the different classifiers had widely varying score distributions, we used 10,000 randomly selected unlabeled variants that were matched to the test set by trinucleotide context to identify the 50th-percentile threshold for each classifier. We benchmarked each classifier on the fraction of variants in the 10,000 withheld primate variant test set that were classified as benign at the 50th-percentile threshold for that classifier, to ensure fair comparison between the methods.
For each of the classifiers, the fraction of withheld primate test variants predicted as benign using the 50th-percentile threshold is also shown in A and Supplementary Table 17. We also show that the performance of PrimateAI is robust with respect to the number of aligned species at the variant position, and generally performs well as long as sufficient conservation information from mammals is available, which is true for most protein-coding sequence.
Analysis of de novo Variants from the DDD Study We obtained published de novo variants from the DDD study, and de novo variants from the healthy sibling controls in the SSC autism study. The DDD study provides a confidence level for de novo variants, and we excluded variants from the DDD dataset with a threshold of <0.1 as potential false positives due to variant calling errors. In one implementation, in total, we had 3,512 missense de novo variants from DDD affected individuals and 1,208 missense de novo variants from healthy controls. The canonical transcript annotations used by UCSC for the 99-vertebrate multiple-sequence alignment differed slightly from the transcript annotations used by DDD, resulting in a small difference in the total counts of missense variants. We evaluated the classification methods on their ability to discriminate between de novo missense variants in the DDD affected individuals versus de novo missense variants in unaffected sibling controls from the autism studies. For each classifier, we reported the P value from the Wilcoxon rank-sum test of the difference between the prediction scores for the two distributions (Supplementary Table 17).
To measure the accuracy of various classifiers at distinguishing benign and pathogenic variation within the same disease gene, we performed the analysis on a subset of 605 genes that were enriched for de novo protein-truncating variation in the DDD cohort (P < 0.05, Poisson exact test) (Supplementary Table 18). Within these 605 genes, we estimated that two-thirds of the de novo variants in the DDD dataset were pathogenic and one-third were benign, based on the 3/1 enrichment of de novo missense mutations over expectation. We assumed minimal incomplete penetrance and that the de novo missense mutations in the healthy controls were benign. For each classifier, we identified the threshold that produced the same number of benign or pathogenic predictions as the empirical proportions observed in these datasets, and used this threshold as a binary cutoff to estimate the accuracy of each classifier at distinguishing de novo mutations in cases versus controls. To construct a receiver operator characteristic curve, we treated pathogenic classification of de novo DDD variants as true positive calls, and treated classification of de novo variants in healthy controls as pathogenic as being false positive calls. Because the DDD dataset contains one-third benign de novo variants, the area under the curve (AUC) for a theoretically perfect classifier is less than one. Hence, a classifier with perfect separation of benign and pathogenic variants would classify 67% of de novo variants in the DDD patients as true positives, 33% of de novo variants in the DDD patients as false negatives, and 100% of de novo variants in controls as true negatives, yielding a maximum possible AUC of 0.837 (FIGs. 29A and 29B and Supplementary Table 19).
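The quoted ceiling follows from a short calculation: under perfect separation, the truly pathogenic fraction of DDD variants ranks above all controls (contributing area 1.0), while the benign remainder is indistinguishable from controls (contributing area 0.5). A back-of-the-envelope check:

```python
# Maximum achievable AUC when a fraction of the "positives" is actually benign.
# 0.674 is the pathogenic fraction implied by the reported 0.837; the 3/1
# enrichment quoted above gives roughly two-thirds.
frac_pathogenic = 0.674
max_auc = frac_pathogenic * 1.0 + (1 - frac_pathogenic) * 0.5
print(round(max_auc, 3))  # 0.837, matching the value reported above
```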
Novel Candidate Gene Discovery We tested enrichment of de novo mutations in genes by comparing the observed number of de novo mutations to the number expected under a null mutation model. We repeated the enrichment analysis performed in the DDD study, and report genes that are newly genome-wide significant when only counting de novo missense mutations with a PrimateAI score of >0.803. We adjusted the genome-wide expectation for de novo damaging missense variation by the fraction of missense variants that meet the PrimateAI threshold of >0.803 (roughly one-fifth of all possible missense mutations genome-wide). As per the DDD study, each gene required four tests, one testing protein-truncating enrichment and one testing enrichment of protein-altering de novo mutations, with both tested for just the DDD cohort and for a larger meta-analysis of neurodevelopmental trio sequencing cohorts. The enrichment of protein-altering de novo mutations was combined by Fisher's method with a test of the clustering of missense de novo mutations within the coding sequence (Supplementary Tables 20, 21). The P value for each gene was taken from the minimum of the four tests, and genome-wide significance was determined as P < 6.757 × 10-7 (α = 0.05, 18,500 genes with four tests).
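A sketch of one such per-gene enrichment test and the multiple-testing threshold, with invented counts (the actual null mutation model is gene-specific):

```python
from scipy.stats import poisson

observed = 8    # de novo missense mutations above the PrimateAI cutoff (invented)
expected = 0.9  # expected count under the null mutation model (invented)

p_value = poisson.sf(observed - 1, expected)  # P(X >= observed)
threshold = 0.05 / (18_500 * 4)               # Bonferroni: 6.757e-7, as in the text
print(f"P = {p_value:.2e}, genome-wide significant: {p_value < threshold}")
```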
ClinVar Classification Accuracy Since most of the existing classifiers are trained either directly or indirectly on ClinVar content, such as using prediction scores from classifiers that are trained on ClinVar, we limited analysis of the ClinVar dataset so that we only used ClinVar variants that had been added since 2017. There was substantial overlap among the recent ClinVar variants and other databases, and hence we further filtered to remove variants found at common allele frequencies (>0.1%) in ExAC, or present in HGMD (Human Gene Mutation Database), LOVD (Leiden Open Variation Database), or Uniprot (Universal Protein Resource). After excluding variants annotated only as uncertain significance and those with conflicting annotations, we were left with 177 missense variants with benign annotation and 969 missense variants with pathogenic annotation. We scored these ClinVar variants using both the deep learning network and the other classification methods. For each classifier, we identified the threshold that produced the same number of benign or pathogenic predictions as the empirical proportions observed in these datasets, and used this threshold as a binary cutoff to estimate the accuracy of each classifier (FIGs. 31A and 31B).
Impact of Increasing Training Data Size and Using Different Sources of Training Data To evaluate the impact of training data size on the performance of the deep learning network, we randomly sampled a subset of variants from the labeled benign training set of 385,236 primate and common human variants, and kept the underlying deep learning network architecture the same. To show that variants from each individual primate species contribute to classification accuracy whereas variants from each individual mammal species lower classification accuracy, we trained deep learning networks using a training dataset comprising 83,546 human variants plus a constant number of randomly selected variants for each species, again keeping the underlying network architecture the same, according to one implementation. The constant number of variants we added to the training set (23,380) was the total number of variants available in the species with the lowest number of missense variants, i.e., bonobo. We repeated the training procedures five times to get the median performance of each classifier.
Saturation of All Possible Human Missense Mutations with Increasing Number of Primate Populations Sequenced We investigated the expected saturation of all ~70 million possible human missense mutations by common variants present in the 504 extant primate species, by simulating variants based on the trinucleotide context of human common missense variants (>0.1% allele frequency) observed in ExAC. For each primate species, we simulated four times the number of common missense variants observed in human (~83,500 missense variants with allele frequency >0.1%), because humans have roughly half the number of variants per individual as other primate species, and about ~50% of human missense variants have been filtered out by purifying selection at >0.1% allele frequency (A).
To model the fraction of human common missense variants (>0.1% allele frequency) discovered with increasing size of human cohorts surveyed, we sampled genotypes according to ExAC allele frequencies and report the fraction of common variants that were observed at least once in these simulated cohorts.
In one implementation, for practical application of PrimateAI scores, a threshold of >0.8 is preferred for likely pathogenic classification, <0.6 for likely benign, and 0.6–0.8 as intermediate in genes with dominant modes of inheritance, on the basis of the enrichment of de novo variants in cases as compared to controls (D), and a threshold of >0.7 for likely pathogenic and <0.5 for likely benign in genes with recessive modes of inheritance. An example architecture of a deep residual network for pathogenicity prediction, referred to herein as "PrimateAI", is shown, wherein 1D refers to a 1-dimensional convolutional layer. Predicted pathogenicity is on a scale from 0 (benign) to 1 (pathogenic). The network takes as input the human amino acid (AA) reference and alternate sequence (51 AAs) centered at the variant, the position weight matrix (PWM) conservation profiles calculated from 99 vertebrate species, and the outputs of secondary structure and solvent accessibility prediction deep learning networks, which predict three-state protein secondary structure (helix—H, beta sheet—B, and coil—C) and three-state solvent accessibility (buried—B, intermediate—I, and exposed—E). Also depicted is a schematic illustration of PrimateAI, the deep learning network architecture for pathogenicity classification. The inputs to the model include 51 amino acids (AA) of flanking sequence for both the reference sequence and the sequence with the variant substituted in, conservation represented by three 51-AA-length position-weighted matrices from primate, mammal, and vertebrate alignments, and the outputs of the pre-trained secondary structure network and solvent accessibility network (also 51 AA in length).
FIGs. 4A, 4B, and 4C are Supplementary Table 16, which shows example model architecture details of the pathogenicity prediction deep learning model PrimateAI. The shape specifies the shape of the output tensor at each layer of the model and the activation is the activation given to the neurons of the layer. Inputs to the model are the position-specific frequency matrices (51 AA length, 20 depth) for the flanking amino acid sequence around the variant, the one-hot encoded human reference and alternate sequences (51 AA length, 20 depth), and the output from the secondary structure and solvent accessibility models (51 AA length, 40 depth).
The illustrated example uses 1D convolutions. In other implementations, the model can use different types of convolutions such as 2D convolutions, 3D convolutions, dilated or atrous convolutions, transposed convolutions, separable convolutions, and depthwise separable convolutions. Some layers also use the ReLU activation function, which greatly accelerates the convergence of stochastic gradient descent compared to saturating nonlinearities such as sigmoid or hyperbolic tangent. Other examples of activation functions that can be used by the technology disclosed include parametric ReLU, leaky ReLU, and exponential linear unit (ELU).
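To make the convolution variants concrete, a small PyTorch comparison of a standard 1D convolution with a dilated (atrous) one; the layer sizes are arbitrary:

```python
import torch
from torch import nn

x = torch.randn(1, 20, 51)  # (batch, channels, sequence length)

standard = nn.Conv1d(20, 32, kernel_size=3, padding=1)             # sees 3 positions
dilated = nn.Conv1d(20, 32, kernel_size=3, padding=2, dilation=2)  # spans 5 positions

# Both preserve the 51-position length; dilation widens the receptive field
# without adding parameters.
print(standard(x).shape, dilated(x).shape)  # torch.Size([1, 32, 51]) twice
```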
Some layers also use batch normalization (Ioffe and Szegedy 2015). Regarding batch normalization, the distribution of each layer's activations in a convolutional neural network (CNN) changes during training and varies from one layer to another, which reduces the convergence speed of the optimization algorithm. Batch normalization is a technique to overcome this problem. Denoting the input of a batch normalization layer with x and its output with z, batch normalization applies the following transformation on x: z = γ · (x − μ) / √(σ² + ε) + β. Batch normalization applies mean-variance normalization on the input x using μ and σ and linearly scales and shifts it using γ and β. The normalization parameters μ and σ are computed for the current layer over the training set using a method called exponential moving average. In other words, they are not trainable parameters. In contrast, γ and β are trainable parameters. The values for μ and σ calculated during training are used in the forward pass during inference.
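A minimal NumPy sketch of the transformation above (training-time statistics are taken from the current mini-batch; the exponential moving average used at inference is omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """z = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta, per feature."""
    mu = x.mean(axis=0)                     # batch mean
    var = x.var(axis=0)                     # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # mean-variance normalization
    return gamma * x_hat + beta             # trainable scale and shift

x = np.random.randn(32, 8) * 3.0 + 5.0      # a mini-batch of 32 activations
z = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(z.mean(axis=0).round(3), z.std(axis=0).round(3))  # ~0 and ~1 per feature
```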
FIGs. 5 and 6 illustrate the deep learning network architecture used for predicting secondary structure and solvent accessibility of proteins. The input to the model is a position-weighted matrix using conservation generated by the RaptorX software (for training on Protein Data Bank sequences) or the 99-vertebrate alignments (for training and inference on human protein sequences). The output of the second-to-last layer, which is 51 AAs in length, becomes the input for the deep learning network for pathogenicity classification.
FIGs. 7A and 7B are Supplementary Table 11, which shows example model architecture details for the 3-state secondary structure prediction deep learning (DL) model. The shape specifies the shape of the output tensor at each layer of the model and the activation is the activation given to the neurons of the layer. Inputs to the model were the position-specific frequency matrices (51 AA length, 20 depth) for the flanking amino acid sequence around the variant.
FIGs. 8A and 8B are Supplementary Table 12, which shows example model architecture details for the 3-state solvent accessibility prediction deep learning model. The shape specifies the shape of the output tensor at each layer of the model and the activation is the activation given to the neurons of the layer. Inputs to the model were the position-specific frequency matrices (51 AA length, 20 depth) for the flanking amino acid sequence around the variant. Also shown is the predicted pathogenicity score at each amino acid position in the SCN2A gene, annotated for key functional domains. Plotted along the gene is the average PrimateAI score for missense substitutions at each amino acid position.
D shows a comparison of classifiers at predicting benign consequence for a test set of 10,000 common primate variants that were withheld from training. The y axis represents the percentage of primate variants correctly classified as benign, after normalizing the threshold of each classifier to its 50th-percentile score on a set of 10,000 random variants that were matched for mutational rate.
E illustrates distributions of PrimateAI prediction scores for de novo missense variants occurring in Deciphering Developmental Disorders (DDD) patients compared to unaffected siblings, with the corresponding Wilcoxon rank-sum P value.
F depicts a comparison of classifiers at separating de novo missense variants in DDD cases versus controls. Wilcoxon rank-sum test P values are shown for each classifier.
FIGs. 22A, 22B, 22C, 22D, and 22E illustrate classification accuracy within 605 DDD genes with P < 0.05. A shows enrichment of de novo missense mutations over expectation in affected individuals from the DDD cohort within 605 associated genes that were significant for de novo protein-truncating variation (P < 0.05).
B shows distributions of PrimateAI prediction scores for de novo missense variants occurring in DDD patients versus unaffected siblings within the 605 associated genes, with the corresponding Wilcoxon rank-sum P value.
C shows comparison of various classifiers at separating de novo missense variants in cases versus controls within the 605 genes. The y axis shows the P values of the Wilcoxon rank-sum test for each classifier.
D depicts comparison of various classifiers, shown on a receiver operator characteristic curve, with AUC indicated for each classifier.
E illustrates classification accuracy and AUC for each classifier. The classification accuracy shown is the average of the true positive and true negative error rates, using the threshold where the classifier would predict the same number of pathogenic and benign variants as expected based on the enrichment shown in A. To take into account the fact that 33% of the DDD de novo missense variants represent background, the maximum achievable AUC for a perfect classifier is indicated with a dotted line.
FIGs. 23A, 23B, 23C, and 23D show impact of data used for training on classification accuracy. Deep learning networks were trained with increasing numbers of primate and human common variants, up to the full dataset (385,236 variants). In A, classification performance for each of the networks is benchmarked on accuracy for the 10,000 withheld primate variants and de novo variants in DDD cases versus controls.
FIGs. 23B and 23C show performance of networks trained using datasets comprising 83,546 human common variants plus 23,380 variants from a single primate or mammal species, according to one implementation. Results are shown for each network trained with different sources of common variation, benchmarked on 10,000 withheld primate variants (B), and on de novo missense variants (C) in DDD cases versus controls.
D shows expected saturation of all possible human benign missense positions by identical-by-state common variants (>0.1%) in the 504 extant primate species. The y axis shows the fraction of human missense variants observed in at least one primate species, with CpG missense variants indicated in green, and all missense variants indicated in blue. To simulate the common variants in each primate species, we sampled from the set of all possible single nucleotide substitutions with replacement, matching the trinucleotide context distribution observed for common human variants (>0.1% allele frequency) in ExAC.

Another figure illustrates correcting for the effect of sequencing coverage on the ascertainment of common primate variants. The probability of observing a given variant in a nonhuman primate species is positively correlated with the sequencing depth at that position in the ExAC/gnomAD exome dataset. In contrast, lower gnomAD read depth did not affect the probability of observing a common human variant at that position (>0.1% allele frequency), because the large number of human exomes sequenced makes ascertainment of common variation almost guaranteed. When picking matched variants for each of the primate variants for training the network, the probability of picking a variant was adjusted for the effects of sequencing depth, in addition to matching for trinucleotide context to control for mutational rate and gene conversion.
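A minimal sketch, assuming a simple per-candidate depth weighting (the patent does not specify the exact adjustment), of how a matched variant might be picked with a probability adjusted for sequencing depth; all names are illustrative:

import numpy as np

def sample_matched_variant(contexts, depths, target_context, rng):
    # Pick one unlabeled candidate matched to a primate variant's trinucleotide
    # context, with the picking probability weighted by sequencing depth so
    # that ascertainment differences are mirrored (weighting is illustrative).
    # Returns the index of the chosen candidate, or None if no context match.
    idx = np.flatnonzero(np.asarray(contexts) == target_context)
    if idx.size == 0:
        return None
    weights = np.asarray(depths, dtype=float)[idx]
    return int(rng.choice(idx, p=weights / weights.sum()))

rng = np.random.default_rng(0)
contexts = ["ACG>ATG", "CCG>CTG", "ACG>ATG"]
print(sample_matched_variant(contexts, depths=[30, 50, 10],
                             target_context="ACG>ATG", rng=rng))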
FIGs. 25A, 25B, 25C, and 26 depict recognition of protein motifs by the disclosed neural networks.
Regarding FIGs. 25A, 25B, and 25C, to illustrate the neural networks' recognition of protein domains, we show the average PrimateAI scores for variants at each amino acid position in three different protein domains. In A, the collagen strand of COL1A2 is shown, with glycine in the repeating GXX motif highlighted. Clinically identified mutations in collagen genes are largely due to missense mutations in glycine in GXX repeats, as these interfere with the normal assembly of collagen and exert strong dominant-negative effects. In B, the active site of the IDS sulfatase enzyme is highlighted, which contains a cysteine at the active site that is post-translationally modified to formylglycine. In C, the basic helix-loop-helix leucine-zipper domain of the MYC transcription factor is shown. The basic domain contacts DNA via positively charged arginine and lysine residues (highlighted) that interact with the negatively charged phosphate backbone. The leucine-zipper domain comprises leucine residues spaced seven amino acids apart (highlighted), which are crucial for dimerization.

Another figure includes a line plot showing the effect of perturbing each position in and around the variant on the predicted deep learning score for the variant. We systematically zeroed out the inputs at nearby amino acids (positions –25 to +25) around the variant and measured the change in the neural network's predicted pathogenicity of the variant. The plot shows the average change in predicted pathogenicity score for perturbations at each nearby amino acid position for 5,000 randomly selected variants.

A further figure illustrates that correlation patterns of weights mimic the BLOSUM62 and Grantham score matrices. Correlation patterns of weights from the first three layers of the secondary-structure deep learning network show correlations between amino acids that are similar to BLOSUM62 and Grantham score matrices. The left heat map shows the correlation of parameter weights, between amino acids encoded using a one-hot representation, from the first convolutional layer following the two initial upsampling layers of the secondary-structure deep learning network. The middle heat map shows BLOSUM62 scores between pairs of amino acids. The right heat map shows Grantham distance between amino acids. The Pearson correlation between deep learning weights and BLOSUM62 scores is 0.63 (P = 3.55 × 10–9). The correlation between deep learning weights and Grantham scores is –0.59 (P = 4.36 × 10–8). The correlation between BLOSUM62 and Grantham scores is –0.72 (P = 8.09 × 10–13).
FIGs. 28A, 28B, and 28C show performance evaluation of the deep learning network PrimateAI and other classifiers. A depicts accuracy of the deep learning network PrimateAI at predicting a benign consequence for a test set of 10,000 primate variants that were withheld from training and comparison with other classifiers, including SIFT, PolyPhen-2, CADD, REVEL, M-CAP, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST3, MetaSVM, MetaLR, MutPred, DANN, FATHMM-MKL_coding, Eigen, GenoCanyon, integrated_fitCons, and GERP. The y axis represents the percentage of primate variants classified as benign, based on normalizing the threshold for each classifier to its 50th-percentile score using a set of 10,000 randomly selected variants that were matched to the primate variants for trinucleotide context to control for mutational rate and gene conversion.
B depicts comparison of the performance of the PrimateAI network in separating de novo missense variants in DDD cases versus controls, along with the 20 existing methods listed above. The y axis shows the P values of the Wilcoxon rank-sum test for each classifier.
C depicts comparison of the performance of the PrimateAI network in separating de novo missense variants in DDD cases versus unaffected controls within 605 disease-associated genes, with the 20 methods listed above. The y axis shows the P values of the Wilcoxon rank-sum test for each classifier.
FIGs. 29A and 29B illustrate distribution of the prediction scores of four classifiers. Histograms of the prediction scores of four classifiers, including SIFT, PolyPhen-2, CADD, and REVEL, are shown for de novo missense variants occurring in DDD cases versus unaffected controls, with corresponding Wilcoxon rank-sum P values.
FIGs. 30A, 30B, and 30C compare the accuracy of the PrimateAI network and other classifiers at separating pathogenic and benign variants in 605 disease-associated genes. The scatterplot in A shows the performance of each of the classifiers on DDD cases versus controls (y axis) and benign prediction accuracy on the withheld primate dataset (x axis). B compares different classifiers at separating de novo missense variants in cases versus controls within the 605 genes, shown on a receiver operator characteristic (ROC) curve, with area under the curve (AUC) indicated for each classifier. C shows classification accuracy and AUC for the PrimateAI network and the 20 classifiers listed in FIGs. 28A, 28B, and 28C. The classification accuracy shown is the average of the true positive and true negative rates, using the threshold where the classifier would predict the same number of pathogenic and benign variants as expected based on the enrichment in A. The maximum achievable AUC for a perfect classifier is indicated with a dotted line, assuming that de novo missense variants in DDD cases are 67% pathogenic variants and 33% benign, and de novo missense variants in controls are 100% benign.
FIGs. 31A and 31B illustrate correlation between classifier performance on human expert-curated ClinVar variants and performance on empirical datasets. The scatterplot in A shows the classification accuracy on ClinVar variants (y axis) versus accuracy on the 10,000 withheld primate variants (x axis) for each of the 20 other classifiers and the PrimateAI network trained with human-only or human + primates data. Shown are the Spearman correlation coefficient rho and associated P value. To limit the evaluation to data that were not used for training the classifiers, we only used ClinVar variants that were added between January 2017 and November 2017, and excluded common human variants from ExAC/gnomAD (>0.1% allele frequency). The ClinVar classification accuracy shown is the average of the true positive and true negative rates, using the threshold where the classifier would predict the same number of pathogenic and benign variants as observed in the ClinVar dataset.
The scatterplot in B shows the classification accuracy on ClinVar variants (y axis) versus performance on the full DDD cases-versus-controls dataset (x axis) for each of the 20 other classifiers and the PrimateAI network trained with human-only or human + primates data.

Another figure is Supplementary Table 14, which shows performance of the 3-state secondary structure and 3-state solvent accessibility prediction models on annotated samples from the Protein DataBank, using 6,367 unrelated protein sequences for training, 400 for validation, and 500 for testing. Only proteins with <25% sequence similarity were included from the Protein DataBank. We report the accuracy of the deep learning networks as a performance metric, as the three classes are not grossly unbalanced for either secondary structure or solvent accessibility.

Another figure is Supplementary Table 15, which shows performance comparison of the deep learning network using annotated secondary structure labels of human proteins from the DSSP database, when available, with the deep learning network using predicted secondary structure labels.

Another figure is Supplementary Table 17, which shows the accuracy values on the 10,000 withheld primate variants and the P values for de novo variants in DDD cases versus controls for each of the 20 classifiers we evaluated.
The PrimateAI model with Human data only is our deep learning network that was trained using a labeled benign training dataset comprising only common human variants (83.5K variants with > 0.1% allele frequency in the population), while the PrimateAI model with Human + Primates data is our deep learning network trained on the full set of 385K labeled benign variants, comprising both common human variants and primate variants.

Another figure is Supplementary Table 19, which shows comparison of the performance of different classifiers on de novo variants in the DDD case vs control dataset, restricted to 605 disease-associated genes. To normalize between the different methods, for each classifier we identified the threshold where the classifier would predict the same number of pathogenic and benign variants as expected based on the enrichment in DDD and the control set.
Classification accuracy shown is the average of the true positive and true negative error rates at this threshold.
FIGs. 49A, 49B, 49C, 49D, and 49E depict missense/synonymous ratios across the human allele frequency spectrum. A shows missense and synonymous variants observed in 123,136 humans from the ExAC/gnomAD database, divided into four categories by allele frequency. Shaded grey bars represent counts of synonymous variants in each category; dark green bars represent missense variants. The height of each bar is scaled to the number of synonymous variants in each allele frequency category, and the missense/synonymous counts and ratios are displayed after adjusting for mutation rate. FIGs. 49B and 49C illustrate the allele frequency spectrum for human missense and synonymous variants that are identical-by-state (IBS) with chimpanzee common variants (B) and chimpanzee singleton variants (C). The depletion of chimpanzee missense variants at common human allele frequencies (>0.1%) compared to rare human allele frequencies (<0.1%) is indicated by the red box, along with accompanying Chi-squared (χ2) test P values.
D shows human variants that are observed in at least one non-human primate species.
E illustrates counts of benign and pathogenic missense variants in the overall ClinVar database (top row), compared to ClinVar variants in a cohort of 30 humans sampled from ExAC/gnomAD allele frequencies (middle row), compared to variants observed in primates (bottom row). Conflicting benign and pathogenic assertions and variants annotated only with uncertain significance were excluded.
FIGs. 50A, 50B, 50C, and 50D show purifying selection on missense variants identical-by-state with other species. A depicts allele frequency spectrum for human missense and synonymous variants that are identical-by-state with variants present in four non-primate mammalian species (mouse, pig, goat, and cow). The depletion of missense variants at common human allele frequencies (>0.1%) is indicated by the red box, along with the accompanying Chi-squared (χ2) test P value.
B is a scatter plot showing the depletion of missense variants observed in other species at common human allele frequencies (>0.1%) versus the species' evolutionary distance from human, expressed in units of branch length (mean number of substitutions per nucleotide position). The total branch length between each species and human is indicated next to the species' name. Depletion values for singleton and common variants are shown for species where variant frequencies were available, with the exception of gorilla, which contained related individuals.
C illustrates counts of benign and pathogenic missense variants in a cohort of 30 humans sampled from ExAC/gnomAD allele frequencies (top row), compared to variants observed in primates (middle row), and compared to variants observed in mouse, pig, goat, and cow (bottom row). Conflicting benign and pathogenic assertions and variants annotated only with uncertain significance were excluded.
D is a scatter plot showing the depletion of fixed missense substitutions observed in pairs of closely related species at common human allele frequencies (>0.1%) versus the species' evolutionary distance from human (expressed in units of mean branch length).

Another figure shows expected missense:synonymous ratios across the human allele frequency spectrum in the absence of purifying selection. Shaded gray bars represent the number of synonymous variants, and dark green bars represent the number of missense variants. The dotted line shows the baseline formed by synonymous variants.
Missense:synonymous ratios are indicated for each allele frequency category. According to one implementation, the expected missense and synonymous counts in each allele frequency category were calculated by taking intronic variants from the ExAC/gnomAD dataset comprising 123,136 exomes and using them to estimate the fraction of variants expected to fall into each of the four allele frequency categories, based on the trinucleotide context of the variant, which controls for mutational rate and GC bias in gene conversion.
FIGs. 52A, 52B, 52C, and 52D depict missense:synonymous ratios for CpG and non-CpG variants.
FIGs. 52A and 52B show missense:synonymous ratios for CpG variants (A) and non-CpG variants (B) across the human allele frequency spectrum, using all variants from the ExAC/gnomAD exomes. FIGs. 52C and 52D show missense:synonymous ratios for CpG variants (C) and non-CpG variants (D) across the human allele frequency spectrum, restricted to only human variants that are identical by state with chimpanzee common polymorphisms.
FIGs. 53, 54, and 55 illustrate missense:synonymous ratios of human variants identical by state with six primates. Patterns of missense:synonymous ratios across the human allele frequency spectrum are shown for ExAC/gnomAD variants that are identical by state with variation present in chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset.

Another figure is a simulation showing saturation of new common missense variants discovered by increasing the size of the human cohorts surveyed. In simulations, the genotypes of each sample were sampled according to gnomAD allele frequencies. The fraction of gnomAD common variants discovered is averaged across 100 simulations for each sample size from 10 to 100,000.

A further figure shows accuracy of PrimateAI across different conservation profiles in the genome. The x axis represents the percentage alignability of the 51 AA around a sequence with the 99-vertebrate alignments. The y axis represents the classification performance of PrimateAI for variants in each of the conservation bins, benchmarked on the test set of 10,000 withheld primate variants.

Another figure is Supplementary Table 5, which shows contributions to the labeled benign training dataset from common human variants and variants present in non-human primates.

Another figure is Supplementary Table 8, which shows the effect of allele frequency on the expected missense:synonymous ratio. Expected counts of synonymous and missense variants were calculated based on the allele frequency spectrum of variants in intronic regions at least 20-30nt away from exon boundaries, using trinucleotide context to control for mutational rate and gene conversion biases.

Another figure is Supplementary Table 9, which shows the ClinVar analysis. According to one implementation, variants downloaded from the Nov. 2017 build of the ClinVar database were filtered to remove missense variants with conflicting annotations and exclude variants of uncertain significance, leaving 17,775 benign variants and 24,853 pathogenic variants.

Another figure is Supplementary Table 10, which shows the number of missense variants from other species found in ClinVar, according to one implementation. Variants were required to be identical-by-state with the corresponding human variant, and to have identical nucleotides at the other two positions in the reading frame to ensure the same coding consequence.

Another figure is Table 1, which shows one implementation of the discovery of 14 additional candidate genes in intellectual disability, which had previously not reached the genome-wide significance threshold in the original DDD study.

Another figure is Table 2, which shows one implementation of the mean difference in Grantham score between pathogenic and benign variants in ClinVar, which is twice as large as the difference between de novo variants in DDD cases versus controls within the 605 disease-associated genes.
Data Generation

All coordinates used in the paper refer to human genome build UCSC hg19/GRCh37, including the coordinates for variants in other species, which were mapped to hg19 using multiple sequence alignments using the procedure described in this section. Protein-coding DNA sequence and multiple sequence alignments of 99 vertebrate genomes with human were downloaded from the UCSC genome browser for the hg19 build (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/alignments/knownCanonical.exonNuc.fa.gz). For genes with multiple canonical gene annotations, the longest coding transcript was selected.
We downloaded human exome polymorphism data from the Exome Aggregation Consortium (ExAC)/genome Aggregation Database (gnomAD) v2.0, which collected the whole-exome sequencing (WES) data of 123,136 individuals from eight subpopulations around the world (http://gnomad.broadinstitute.org/). We excluded variants that failed the default quality control filters as annotated in the ExAC VCF file or fell outside canonical coding regions. To avoid biases due to balancing selection, we also excluded variants from within the extended MHC region (chr6: 28,477,797-33,448,354) for the primate analyses. The great ape genome sequencing project provides whole genome sequencing data and genotypes for 24 chimpanzees, 13 bonobos, 27 gorillas and 10 orangutans (including 5 from the Sumatran subspecies and 5 from the Bornean subspecies, which we collapsed for the downstream analyses). The study on chimpanzee and bonobos provides genome sequences of an additional 35 chimpanzees. However, as variants for these additional chimpanzees were not called using the same methods as the great ape genome sequencing project, we excluded them from the allele frequency spectrum analysis, and used them only for training the deep learning model. Variation from these primate diversity studies was already mapped to the human reference (hg19). In addition, for marmoset and rhesus, 16 rhesus individuals and 9 marmoset individuals were used to assay variation in the original sequencing of these species' genomes, but individual-level information is not available.
To compare with other primates and mammals, we also downloaded SNPs of other species from dbSNP, including rhesus, marmoset, pig, cow, goat, mouse, chicken, and zebrafish. dbSNP also included additional orangutan variants, which we only used for training the deep learning model, since individual genotype information was not available for the allele frequency spectrum analysis. We excluded other species, such as dog, cat, or sheep, as dbSNP provides limited numbers of variants for those species.
To map the variants to human, we used the multiple species alignment of 99 vertebrates to ensure orthologous 1:1 mapping to human protein-coding regions. Mapping variants using the orthologous multiple species alignments was essential to remove artifacts caused by pseudogene or retrotransposed sequences, which occur when directly mapping SNPs between species using tools such as liftOver that allow many-to-1 mappings. In cases where the genome build of the species in dbSNP did not match the genome build of the species in the 99-vertebrate multiple sequence alignment, we used liftOver to update the variants to the genome build used in the multiple sequence alignment. We accepted variants as identical-by-state if they occurred in either reference/alternative orientation; for example, if the human reference was G and the alternative allele was A, this was considered to be identical-by-state with a variant in another species where the reference was A and the alternative allele was G. To ensure that the variant had the same predicted protein-coding consequence in both human and the other species, we required that the other two nucleotides in the codon are identical between the species, for both missense and synonymous variants. Polymorphisms from each species included in the analysis are listed in Supplementary Data File 1 and detailed metrics are shown in Supplementary Table 1.
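A minimal sketch of the identical-by-state test in the either-orientation sense described above (the function name is ours; the codon-context check is omitted):

def identical_by_state(human_ref, human_alt, other_ref, other_alt):
    # A variant pair is accepted as identical-by-state if the two alleles
    # match in either reference/alternative orientation, e.g. a human G>A
    # variant matches another species' A>G variant at the same position.
    return {human_ref, human_alt} == {other_ref, other_alt}

print(identical_by_state("G", "A", "A", "G"))  # True: swapped orientation
print(identical_by_state("G", "A", "G", "C"))  # False: different alleles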
To ensure that the variants from each dbSNP submitter batch were of high quality and aligned correctly to human, we calculated the missense:synonymous ratio for each batch, confirming that this was less than the expected ratio of 2.2:1; most species had ratios lower than 1:1, especially zebrafish and mouse, which would be expected to have very large effective population sizes. We excluded two batches of SNPs from cow which had unusually high missense:synonymous ratios from further analysis (snpBatch_1000_BULL_GENOMES_1059190.gz with ratio of 1.391 and snpBatch_COFACTOR_GENOMICS_1059634.gz with ratio of 2.568). The mean missense:synonymous ratio for the remaining cow batches was 0.8:1.
Correction for the Effect of Allele Frequency on Missense:Synonymous Ratio, Mutational Rate, Genetic Drift, and GC-Biased Gene Conversion

In addition to the action of purifying selection, the observed depletion of human missense variants at high allele frequencies could also be affected by factors not related to natural selection. The probability of a neutral mutation appearing at a particular allele frequency in the population is a function of mutational rate, gene conversion, and genetic drift, and these factors could potentially introduce a bias in the missense:synonymous ratio across the allele frequency spectrum even in the absence of selective forces.
To compute the expected missense:synonymous ratios in each allele frequency category in the absence of protein-coding selection, we selected variants within intronic regions 31-50bp upstream and downstream of each exon. These regions were chosen to be distant enough to avoid effects of the extended splice motifs.
Because these regions are near the edges of the exome capture sequence for ExAC/gnomAD, to ensure fair ascertainment of variants we removed any chrX regions and excluded regions with mean read depth < 30. Each variant and its immediately upstream and downstream nucleotides fall within one of 64 trinucleotide contexts. If we mutate the middle nucleotide into the three other bases, in total 64 × 3 = 192 tri-nucleotide configurations are possible.
As tri-nucleotide configurations and their reverse complements are equivalent, there are effectively 96 tri-nucleotide contexts. We observed that tri-nucleotide context has a very strong effect on mutational rate, and a smaller effect on GC-biased gene conversion, making tri-nucleotide context effective for modeling these variables.
Within these intronic regions, we took each variant from the 123,136 ExAC/gnomAD exomes and separated them into 4 × 192 categories, based on four categories of allele frequency (singleton, more than singleton~0.01%, 0.01%~0.1%, > 0.1%) and 192 tri-nucleotide contexts. We normalized the number of variants observed in each of the 4 × 192 categories (allele frequency × tri-nucleotide context) by dividing by the total number of possible variants with that tri-nucleotide context (obtained by substituting each nucleotide in the intronic sequence three different ways). For each of the 192 tri-nucleotide contexts, we had thereby obtained the expected fraction of variants to fall in each of the 4 allele frequency categories in the absence of protein-coding selection.
This implicitly models the effects of mutational rate, GC-biased gene conversion, and genetic drift that are due to differences in tri-nucleotide context (Supplementary Table 7).
To get the expected missense:synonymous ratio in each allele frequency category, we counted the total number of possible synonymous and missense mutations in the human genome accessible by single nucleotide substitutions, and assigned each of them to one of the 192 tri-nucleotide contexts. For each context, we used the 4 × 192 table to calculate the number of variants expected to fall within each of the 4 allele frequency categories. Finally, we summed up the number of synonymous and missense variants across the 192 tri-nucleotide contexts, to obtain the total expected numbers of synonymous and missense variants in each of the four allele frequency categories (Supplementary Table 8).
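As an illustrative sketch (not the patent's code), the per-context possible-mutation counts can be distributed across allele frequency categories as follows; the dictionary layout and names are assumptions:

from collections import defaultdict

def expected_counts(possible_by_context, af_fraction_by_context):
    # Distribute the total number of possible synonymous/missense mutations
    # in each trinucleotide context across the four allele-frequency
    # categories, using per-context fractions estimated from intronic variants.
    # possible_by_context: {context: {"synonymous": n, "missense": n}}
    # af_fraction_by_context: {context: [f_singleton, f_low, f_mid, f_common]}
    totals = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for context, counts in possible_by_context.items():
        for consequence, n in counts.items():
            for cat, frac in enumerate(af_fraction_by_context[context]):
                totals[consequence][cat] += n * frac
    return dict(totals)

toy = expected_counts(
    {"ACG>ATG": {"synonymous": 1000, "missense": 2200}},
    {"ACG>ATG": [0.5, 0.3, 0.15, 0.05]},
)
print(toy["missense"][3] / toy["synonymous"][3])  # expected ratio at >0.1%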
The expected missense:synonymous ratios were nearly constant across the allele frequency spectrum and close to the ratio of 2.23:1 that would be expected for de novo variants in the absence of natural selection, with the exception of singleton variants, whose expected missense:synonymous ratio was 2.46:1. This indicates that, due to the action of factors independent of protein-coding selective pressures (mutational rate, gene conversion, genetic drift), variants in the singleton allele frequency category in ExAC/gnomAD are expected to have a missense:synonymous ratio about 10% higher than that of de novo mutations by default. To correct for this, we adjusted down the missense:synonymous ratio for singletons by 10% in the allele frequency analyses (FIGs. 49A, 49B, 49C, 49D, and 49E and FIGs. 50A, 50B, 50C, and 50D). This small adjustment reduced the estimated missense depletion for common human variants present in primates and other mammals (shown in FIGs. 49A, 49B, 49C, 49D, and 49E and FIGs. 50A, 50B, 50C, and 50D) by roughly ~3.8%. The higher missense:synonymous ratio for singleton variants is due to transition mutations (which are more likely to create synonymous changes) having higher allele frequencies, owing to their higher mutation rate, than transversion mutations (which are more likely to create missense changes).
Moreover, this explains the observed missense:synonymous ratio of 2.33:1 for singleton variants in ExAC/gnomAD, which exceeds the expected ratio for de novo mutations of 2.23:1. After factoring in the effects of the allele frequency spectrum on missense:synonymous ratio, this actually reflects a 5.3% depletion of singleton variants compared to expectation, which would presumably be due to selection against pathogenic missense mutations with de novo dominant modes of inheritance. Indeed, when we consider only haploinsufficient genes with high probability of loss of function (pLI > 0.9), the missense:synonymous ratio for ExAC/gnomAD singleton variants is 2.04:1, indicating a depletion of around ~17% within haploinsufficient genes. This result is concordant with prior estimates that 20% of missense mutations are equivalent to loss of function mutations, assuming some degree of incomplete penetrance.
We also specifically examined missense:synonymous ratios for CpG and non-CpG variants across the human allele frequency spectrum, due to the large differences in their mutation rates (FIGs. 52A, 52B, 52C, and 52D). We confirmed that for both CpG and non-CpG mutations, human variants that are identical-by-state with chimpanzee common polymorphisms have nearly constant missense:synonymous ratios across the allele frequency spectrum.
Depletion of Human Missense Variants that are Identical-by-State with Polymorphisms in Other Species

To evaluate whether variants from other species would be tolerated at common allele frequencies (> 0.1%) in human, we identified human variants that were identical-by-state with variation in the other species. For each of the variants, we assigned them to one of the four categories based on their allele frequencies in human populations (singleton, more than singleton~0.01%, 0.01%~0.1%, > 0.1%), and estimated the decrease in missense:synonymous ratios (MSR) between the rare (< 0.1%) and common (> 0.1%) variants. The depletion of identical-by-state missense variants at common human allele frequencies (> 0.1%) indicates the fraction of variants from the other species that are sufficiently deleterious that they would be filtered out by natural selection at common allele frequencies in human.
Depletion = (MSR_rare − MSR_common) / MSR_rare

The missense:synonymous ratios and the percentages of depletion were computed per species and are shown in B and Supplementary Table 2. In addition, for chimpanzee common variants (A), chimpanzee singleton variants (C), and mammal variants (A), we performed the Chi-squared (χ2) test of homogeneity on a 2×2 contingency table to test whether the differences in missense:synonymous ratios between rare and common variants were significant.
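A short illustrative sketch of the depletion metric and the accompanying chi-squared test of homogeneity, using SciPy's chi2_contingency and toy counts (all numbers below are placeholders, not data from the study):

from scipy.stats import chi2_contingency

def depletion(msr_rare, msr_common):
    # Depletion = (MSR_rare - MSR_common) / MSR_rare
    return (msr_rare - msr_common) / msr_rare

# chi-squared test of homogeneity on a 2x2 table of
# (missense, synonymous) counts for rare vs common variants
table = [[1200, 900],   # rare (<0.1%):   missense, synonymous
         [400, 600]]    # common (>0.1%): missense, synonymous
chi2, p, dof, _ = chi2_contingency(table)
print(depletion(1200 / 900, 400 / 600), p)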
Because sequencing was only performed on limited numbers of individuals from the great ape diversity project, we used the human allele frequency spectrum from ExAC/gnomAD to estimate the fraction of sampled variants that were rare (< 0.1%) or common (> 0.1%) in the general chimpanzee population. We sampled a cohort of 24 individuals based on the ExAC/gnomAD allele frequencies, and identified missense variants that were observed either once, or more than once, in this cohort. Variants that were observed more than once had a 99.8% chance of being common (> 0.1%) in the general population, whereas variants that were observed only once in the cohort had a 69% chance of being common in the general population. In FIGs. 49B and 49C, we show that, as a consequence of some of the chimpanzee singleton variants being rare deleterious mutations, we observe depletion of singleton chimpanzee variants at high allele frequencies in human, but not for the common chimpanzee variants. Roughly half of the chimpanzee variants observed in the cohort of 24 individuals were observed only once, and roughly half were observed more than once.
To confirm that the observed depletion for missense variants in more distant mammals was not due to a confounding effect of genes that are better conserved, and hence more accurately aligned, we repeated the above analysis, restricting only to genes with > 50% average nucleotide identity in the multiple sequence alignment of 11 primates and 50 mammals compared with human (see Supplementary Table 3). This removed ~7% of human protein-coding genes from the analysis, without substantially affecting the results.
Fixed Substitutions Among Primates, Mammals and Distant Vertebrates

To ensure that our results using dbSNP variation were not affected by issues with the variant data, or domestication artifacts (since most of the species selected from dbSNP were domesticated), we also repeated the analyses using fixed substitutions from pairs of closely related species in lieu of intra-species polymorphisms. We downloaded the phylogenetic tree of 100 vertebrate species (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/hg19.100way.commonNames.nh) from the UCSC genome browser, with phylogenetic distance measured in branch length (mean number of nucleotide substitutions per position). We selected closely related species pairs (branch length < 0.25) for further analysis. To identify fixed substitutions between closely related species pairs, we downloaded coding regions for the multiple sequence alignments of 99 vertebrate genomes with human, as well as the alignments of 19 mammalian (16 primates) genomes with human, from the UCSC genome browser. The additional 19-mammal multiple species alignment was necessary because some of the primate species, such as bonobo, were absent in the 99 vertebrate alignment (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz20way/alignments/knownCanonical.exonNuc.fa.gz). In total, we obtained 15 pairs of closely related species, including five primate pairs, as listed in D and Supplementary Table 4.
We took the multiple sequence alignments of 19 mammalian or 99 vertebrate genomes with human within the canonical coding regions and obtained nucleotide substitutions between each selected pair of vertebrates, listed in Supplementary Data File 2. These substitutions were mapped to the human genome, requiring that the other two nucleotides in the codon were unchanged between human and the other species, and accepting the variant in either reference or alternative orientation. Using human variants that were identical-by-state with the fixed substitutions from related species pairs, we calculated missense:synonymous ratios for variants in the rare (< 0.1%) and common (> 0.1%) allele frequency categories to obtain the fraction of fixed substitutions under negative selection, as shown in Supplementary Table 4.
ClinVar Analysis of Polymorphism Data for Human, Primates, Mammals, and Other Vertebrates

To examine the clinical impact of variants that are identical-by-state with other species, we downloaded the release variant summary for the ClinVar database (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/clinvar_20171029.vcf.gz, released on 02-Nov-2017). The database contained 324,698 variants on the hg19 genome build, of which 122,884 were missense single nucleotide variants mapping to our list of protein-coding genes (Supplementary Table 9). Most of the variants in the ClinVar database were not of missense consequence, and were excluded. Next, we filtered variants with conflicting interpretations of pathogenicity, and kept only those with benign, likely benign, pathogenic, and likely pathogenic annotations. We merged variants with benign or likely benign annotations into a single category, as well as merging variants with pathogenic or likely pathogenic annotations. Following the filtering steps shown in Supplementary Table 9, there are a total of 24,853 variants in the pathogenic category and 17,775 variants in the benign category; the remainder were excluded because they are variants of unknown significance or have conflicting annotations.
To get a baseline for ClinVar missense variants in the human population, we examined ClinVar missense variants in a cohort of 30 individuals sampled from ExAC/gnomAD allele frequencies. This cohort size was chosen to reflect roughly the number of individuals sequenced in the primate diversity project study. We report the average number of pathogenic and benign variants in a cohort of 30 humans (E) from 100 such simulations. Because curators have systematically classified common human variants as benign in ClinVar, we excluded variants with greater than 1% allele frequency to avoid this curation bias.
We analyzed ClinVar variants that were identical-by-state with variation in primates, mammals and other vertebrates. The numbers of benign and pathogenic variants for each species are shown in Supplementary Table 10. A summary of the number of ClinVar variants that were present in humans, primates, and more distant mammals is shown in FIGs. 49E and 50B, along with the results from the Chi-squared (χ2) test of homogeneity for differences in the ratio of benign to pathogenic variants.
Generation of Benign Variants for Model Training

Variants that are common in the human population are largely neutral, aside from rare instances of founder effects or balancing selection, making them suitable as a benign training dataset for machine learning that is unaffected by biases in human interpretation. We used allele frequency data from 123,136 exomes from the ExAC/gnomAD database (release v2.0), excluding variants that did not pass filters, which left us with 83,546 missense variants with overall population allele frequency >= 0.1% within canonical protein-coding transcripts.
Based on our earlier results showing that variants present in primates are largely benign in human, we created a benign training dataset for machine learning comprising common human variants (> 0.1% allele frequency); variants from chimpanzee, bonobo, gorilla, and orangutan from the great ape diversity project and additional primate sequencing; and rhesus, orangutan, and marmoset variants from dbSNP. In total, 301,690 unique primate variants were added to the benign training set, according to one implementation. The number of benign training variants contributed by each source is shown in Supplementary Table 5.
A caveat is that while most primate variants are common in their respective populations, a minority of them are rare variants. Because the non-human primate species have had limited numbers of individuals sequenced, we would expect that the set of ascertained variants generally represents common variation. Indeed, we find that the missense:synonymous ratio for variants from each of the primate species is less than half of the expected 2.23:1 ratio for de novo mutation, indicating that these are mostly common variants that have already passed through the sieve of selection. Moreover, for the chimpanzee cohort, we have estimated that ~84% of the ascertained variants are present at common allele frequencies (> 0.1%) in their respective populations. Since ~50% of newly arising missense mutations are filtered by purifying selection at common human allele frequencies (> 0.1%) (A), this figure is consistent with ~16% rare variants accounting for the observed depletion of 8.8% of human missense variants that are identical-by-state with observed primate variation (D).
Applying the estimate that ~20% of human missense mutations are loss-of-function equivalent, primate variants would be expected to comprise 3.2% fully pathogenic mutations, 91.2% benign mutations (tolerated at > 0.1% allele frequency), and 5.6% intermediate mutations that do not fully abrogate gene function, yet are deleterious enough to be filtered out at common allele frequencies (> 0.1%). Despite the known imperfections in this training dataset, the classification accuracy of the deep learning network was much better when trained on a benign training dataset comprising both common human variants and primate variants, compared to common human variants alone.
Therefore, it appears that at the current classification accuracy, the amount of training data available is the stronger limitation. As larger numbers of individuals are sequenced in each primate species, it will be possible to prepare training datasets that contain a higher fraction of primate common variants, reducing contamination from pathogenic variants in the training dataset and further improving classification performance.
Generation of Unlabeled Variants to Complement the Benign Training Dataset

All possible missense variants were generated from each base position of canonical coding regions by substituting the nucleotide at the position with the other three nucleotides. We excluded variants that were observed in the 123,136 exomes from ExAC/gnomAD, and variants in start or stop codons. In total, 68,258,623 unlabeled variants were generated. We assigned each of the unlabeled variants to one of the 96 different tri-nucleotide context categories. We trained the deep learning network using a semi-supervised approach, by sampling variants from this unlabeled dataset that match the variants in the benign dataset by tri-nucleotide context, and training the classifier to discriminate between benign and unlabeled training examples.
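A minimal sketch of the enumeration step (illustrative only; missense-consequence and start/stop-codon filtering are omitted, and the observed-variant representation is an assumption):

BASES = "ACGT"

def candidate_substitutions(coding_sequence, observed):
    # Enumerate single-nucleotide substitutions at each coding position,
    # skipping alleles already observed in the ExAC/gnomAD exomes.
    # observed: set of (position, alt_base) pairs to exclude.
    for pos, ref in enumerate(coding_sequence):
        for alt in BASES:
            if alt != ref and (pos, alt) not in observed:
                yield pos, ref, alt

print(list(candidate_substitutions("ACG", observed={(0, "C")})))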
Additional Filtering of Unlabeled Variants

By presenting examples of benign and unlabeled variants along with the flanking amino acid sequence, the deep learning network learns regions of proteins that are highly intolerant of mutations. However, the absence of common variants within a region of the protein sequence may be due to strong purifying selection, or it may be due to technical artifacts that prevent variants from being called in the region. To correct for the latter, we removed variants from both the benign and unlabeled datasets from regions where the ExAC/gnomAD dataset had mean coverage < 1. Similarly, when matching unlabeled variants to primate variants in the benign dataset during training, we excluded unlabeled variants from regions where that primate did not have orthologous alignable sequence with human in the multiple sequence alignment.
Withheld Primate Variants for Validation and Testing, and de novo Variants from Affected and Unaffected Individuals

For validation and testing of the deep learning network, we randomly sampled two sets of 10,000 primate variants, which we withheld from training. The rest of the primate variants, along with the common human variants (> 0.1% allele frequency), were used as the benign dataset for training the deep learning network. In addition, we also sampled two sets of 10,000 unlabeled variants that were matched to the withheld primate variants for the validation set and the test set.
We used the 10,000 withheld primate variants in the validation set and the matched 10,000 unlabeled variants to monitor the performance of the deep learning network during the course of training, by measuring the ability of the network to discriminate between variants in the two sets. This enabled us to determine a stopping point for training and avoid overfitting, once the performance of the network had saturated.
We used the 10,000 withheld primate variants in the test dataset to benchmark the deep learning network as well as the other 20 classifiers. Because the different classifiers had widely varying score distributions, we used these unlabeled variants to identify the 50th-percentile threshold for each classifier. We benchmarked each classifier on the fraction of variants in the 10,000 withheld primate variant test set that were classified as benign at the 50th-percentile threshold for that classifier, to ensure fair comparison between the methods.
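An illustrative sketch of the 50th-percentile normalization (the names and the score orientation are assumptions; some classifiers score benign variants high rather than low):

import numpy as np

def benign_fraction_at_median(test_scores, matched_unlabeled_scores):
    # Fraction of withheld primate test variants falling on the benign side
    # of a classifier's 50th-percentile threshold, where the threshold is
    # taken over matched unlabeled variants. Assumes lower scores are more
    # benign; flip the comparison for the opposite convention.
    threshold = np.percentile(matched_unlabeled_scores, 50)
    return float(np.mean(np.asarray(test_scores) <= threshold))

rng = np.random.default_rng(1)
print(benign_fraction_at_median(rng.normal(0.3, 0.1, 10_000),
                                rng.normal(0.5, 0.15, 10_000)))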
To evaluate the performance of the deep learning network in clinical settings, using de novo variants in affected individuals with neurodevelopmental disorders and de novo variants in healthy controls, we downloaded the de novo variants from the Deciphering Developmental Disorders (DDD) study, and de novo variants from the healthy sibling controls in the Simons Simplex Collection (SSC) autism study. The DDD study provides a confidence level for de novo variants, and we excluded variants from the DDD dataset with a confidence below the threshold of 0.1 as potential false positives due to variant calling errors. In total, we had 3,512 missense de novo variants from DDD affected individuals and 1,208 missense de novo variants from healthy controls.
To better model the real-world clinical scenario of distinguishing between benign and pathogenic variants of uncertain significance within a panel of candidate disease genes, we limited the analysis to only de novo variants within 605 genes that were associated with disease in the DDD study (p<0.05), calculated only from protein-truncating variation (Supplementary Table 18). We evaluated the gene-specific enrichment of protein-truncating de novo mutations by computing the statistical significance under a null hypothesis of the expected number of de novo mutations given the gene-specific mutation rate and the number of considered chromosomes. We selected the 605 genes with a nominal P value < 0.05. We calculated the excess of synonymous and missense de novo mutations within the 605 genes (A) as the ratio of counts of observed de novo mutations versus expected de novo mutations, as well as the difference of observed de novo mutations minus expected de novo mutations. Within these 605 genes, we observed 380 de novo missense mutations from the DDD affected individuals (A). For each of the classifiers, including our own, a small fraction of variants did not have predictions, generally because they did not map to the same transcript models used by the classifier. Hence, for our deep learning network, we performed the downstream analyses in FIGs. 22A, 22B, 22C, 22D, and 22E using 362 de novo missense mutations from DDD affected individuals and 65 de novo missense mutations from healthy controls.
Saturation of All Possible Human Missense Mutations with Increasing Number of Primate Populations Sequenced

We investigated the expected saturation of all ~70M possible human missense mutations by common variants present in the 504 extant primate species. For each primate species, we simulated 4 times the number of common missense variants observed in human (~83,500 missense variants with allele frequency > 0.1%), because humans appear to have roughly half the number of variants per individual as other primate species, and about ~50% of human missense variants have been filtered out by purifying selection at > 0.1% allele frequency (A). We assigned simulated variants based on the observed distribution of human common missense variants in the 96 trinucleotide contexts. For example, if 2% of human common missense variants were from the CCG>CTG trinucleotide context, we would require that 2% of the simulated variants were randomly sampled CCG>CTG mutations. This has the effect of controlling for the effects of mutational rate, genetic drift, and gene conversion bias, using tri-nucleotide context.
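A toy sketch of this saturation simulation under stated assumptions (the context probabilities and site counts below are placeholders; the actual simulation operates on the ~70M real sites):

import numpy as np

def simulate_saturation(n_species, per_species, context_probs, sites, rng):
    # Cumulative fraction of possible missense sites touched as species are
    # added. Each species contributes per_species draws with replacement,
    # matched to the human common-variant trinucleotide distribution.
    # context_probs: {context: probability}; sites: {context: n_possible}.
    contexts = list(context_probs)
    probs = np.array([context_probs[c] for c in contexts])
    total = sum(sites.values())
    seen, curve = set(), []
    for _ in range(n_species):
        for c in rng.choice(len(contexts), size=per_species, p=probs):
            ctx = contexts[c]
            seen.add((ctx, int(rng.integers(sites[ctx]))))
        curve.append(len(seen) / total)
    return curve

rng = np.random.default_rng(0)
print(simulate_saturation(5, 1000, {"CpG": 0.2, "nonCpG": 0.8},
                          {"CpG": 2000, "nonCpG": 20000}, rng)[-1])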
The curves in D show the cumulative saturation of the ~70M possible human missense mutations by common variants present in any of the 504 primate species, assuming that we ascertain all common variants in each primate species (> 0.1% allele frequency). From A, roughly ~50% of human missense mutations are sufficiently deleterious in both human and other primates to prevent them from rising to common allele frequencies (> 0.1%), and therefore the curves in D represent the fraction of non-deleterious human missense mutations saturated by common primate variation as the number of primate species grows. We show that with 504 primate species, the majority of non-deleterious human missense mutations will be saturated, with non-deleterious CpG mutations saturated with a much smaller number of species due to their higher mutation rate.
To model the fraction of human common missense variants (> 0.1% allele frequency) discovered with increasing size of the human cohorts surveyed, we sampled genotypes according to gnomAD allele frequencies. The fraction of gnomAD common missense variants discovered was averaged across 100 simulations for each sample size from 10 to 100K.
Prediction of Secondary Structure and Solvent Accessibility

The deep learning network for pathogenicity prediction contains 36 total convolutional layers, including 19 convolutional layers for the secondary structure and solvent accessibility prediction networks, and 17 for the main pathogenicity prediction network, which takes as input the results of the secondary structure and solvent accessibility networks. Because the crystal structures of most human proteins are unknown, we trained two models to enable the network to learn protein structure from primary sequence. Both models used the same network architecture and inputs. The inputs to the secondary structure and solvent accessibility networks are a 51-length × 20 amino acid position frequency matrix encoding conservation information from the multiple sequence alignment of human with 99 other vertebrates.
The secondary structure network is trained to predict 3-state secondary structure: alpha helix (H), beta sheet (B), and coil (C). The solvent accessibility network is trained to predict 3-state solvent accessibility: buried (B), intermediate (I), and exposed (E). Both networks take only primary sequence as their inputs, and were trained using labels from known crystal structures in the Protein DataBank. The models predict one state for each amino acid residue.
Data Preparation for Prediction of Secondary Structure and Solvent Accessibility

We used annotated crystal structures from the Protein DataBank for training the models. Amino acid sequences with more than 25% sequence similarity were removed. In total, 6,367 protein sequences were used for training, 400 for validation, and 500 for testing (Supplementary Table 13). The data used for training, including the amino acid sequence and the secondary structure and solvent accessibility labels, is available from the RaptorX website: http://raptorx.uchicago.edu/download/.
Most solved crystal structures are of non-human proteins; hence, for pre-training the secondary structure and solvent accessibility models, we used the RaptorX suite (based on PSI-BLAST) to obtain related sequences, since human-based multiple sequence alignments were generally not available. We generated multiple sequence alignments for the proteins using the CNFsearch1.66_release tool from RaptorX, and counted the amino acids at each position from the 99 closest alignments to form the position frequency matrix. For example, the specific commands using RaptorX to retrieve the multiple sequence alignment for the 1u7lA.fasta protein were as follows:

% ./buildFeature -i 1u7lA.fasta -c 10 -o ./TGT/1u7lA.tgt
% ./CNFsearch -a 30 -q 1u7lA

For each amino acid position in the dataset, we took a window from the position frequency matrix corresponding to the flanking 51 amino acids, and used this to predict the label for either secondary structure or solvent accessibility for the amino acid at the center of the 51-length amino acid sequence. The labels for secondary structure and relative solvent accessibility were directly obtained from the known 3D crystal structure of the protein using the DSSP software and did not require prediction from primary sequence. To incorporate the secondary structure and solvent accessibility networks as part of the pathogenicity prediction network, we computed position frequency matrices from the human-based 99 vertebrate multiple sequence alignments. Although the conservation matrices generated from these two methods are generally similar, we enabled backpropagation through the secondary structure and solvent accessibility models during training for pathogenicity prediction to allow fine-tuning of the parameter weights.
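An illustrative sketch of extracting the 51-amino-acid window from a position frequency matrix, with zero-padding at the protein ends (the padding behavior and names are our assumptions):

import numpy as np

def pfm_window(pfm, center, flank=25):
    # Extract the 51-row window (25 residues each side) of a position
    # frequency matrix used to predict the label of the central residue;
    # positions beyond the protein ends are zero-padded.
    length, depth = pfm.shape
    window = np.zeros((2 * flank + 1, depth), dtype=pfm.dtype)
    lo, hi = max(0, center - flank), min(length, center + flank + 1)
    window[lo - (center - flank): hi - (center - flank)] = pfm[lo:hi]
    return window

pfm = np.random.default_rng(0).random((120, 20))  # toy 120-residue PFM
print(pfm_window(pfm, center=3).shape)  # (51, 20)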
Model Architecture and Training

We trained two separate deep convolutional neural network models to predict secondary structure and relative solvent accessibility of proteins. The two models have identical architecture and input data, but differ in the prediction states. We conducted a detailed hyperparameter search to optimize the models for best performance. Both our deep learning network for pathogenicity prediction and the deep learning networks for predicting secondary structure and solvent accessibility adopted the architecture of residual blocks, which has become widely adopted due to its success in image classification. The residual blocks comprise repeating units of convolution, interspersed with skip connections that allow information from earlier layers to skip over residual blocks. In each residual block, the input layer is first batch normalized, followed by an activation layer using rectified linear units (ReLU). The activation is then passed through a 1D convolution layer. This intermediate output from the 1D convolution layer is again batch normalized and ReLU activated, followed by another 1D convolution layer. At the end of the second 1D convolution, we summed its output with the original input into the residual block, which acts as a skip connection by allowing the original input information to bypass the residual block. In such an architecture, termed a deep residual learning network by its authors, the input is preserved in its original state and the residual connections are kept free of nonlinear activations from the model, allowing effective training of deeper networks. The detailed architecture is provided in Supplementary Tables 11 (FIGs. 7A and 7B) and 12 (FIGs. 8A and 8B).
Following the residual blocks, the softmax layer computes probabilities of the three states for each amino acid, among which the largest softmax probability determines the state of the amino acid. The model is trained with an accumulated categorical cross-entropy loss function for the whole protein sequence using the ADAM optimizer. Once the networks had been pretrained on secondary structure and solvent accessibility, instead of directly taking the output of the networks as inputs for the pathogenicity prediction network, we took the layer before the softmax layer, so that more information would pass through to the pathogenicity prediction network.
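A minimal sketch of one such residual block and the training setup, written with the Keras functional API; the filter counts, kernel sizes, and dilations below are placeholders, with the actual hyperparameters given in Supplementary Tables 11 and 12:

from tensorflow.keras import Model, layers

def residual_block(x, filters, kernel_size=3, dilation=1):
    # BN -> ReLU -> Conv1D, twice, plus a skip connection that adds the
    # block input to the second convolution's output. Assumes the input
    # already has `filters` channels so the addition is shape-compatible.
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, dilation_rate=dilation,
                      padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, dilation_rate=dilation,
                      padding="same")(y)
    return layers.Add()([x, y])

inputs = layers.Input(shape=(51, 20))            # 51-AA position frequency matrix
h = layers.Conv1D(40, 1, padding="same")(inputs)  # project to 40 channels
h = residual_block(h, 40)
out = layers.Dense(3, activation="softmax")(h)    # 3 states per residue
model = Model(inputs, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")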
The best testing accuracy achieved for the 3-state secondary structure prediction model is 79.86% (Supplementary Table 14), similar to the state-of-the-art accuracy predicted by the DeepCNF model. The best testing accuracy for the 3-state solvent accessibility prediction model is 60.31% (Supplementary Table 14), similar to the current best accuracy predicted by RaptorX on a similar training dataset. We also compared the predictions of the neural network when using DSSP-annotated structure labels for the approximately ~4,000 human proteins that had crystal structures, versus using the standard PrimateAI model, which uses predicted structure labels only. We did not obtain further improvement in pathogenicity prediction accuracy when using the DSSP-annotated labels (Supplementary Table 15).
Input Features for Deep Learning Models for Pathogenicity Prediction

The training dataset for the pathogenicity prediction network contains 385,236 labeled benign variants and 68,258,623 unlabeled variants after filtering. For each variant, we generated the following input features. The first input feature of each variant is its 51-length flanking amino acid sequence, i.e., 25 amino acids to each side of the variant obtained from the reference sequence of hg19, to provide the deep learning models the sequence context of the variant. In total, this flanking reference sequence is 51 amino acids in length. Through empirical observation, we found that the amino acid representation of the protein sequence was more effective than representing the protein-coding sequence using nucleotides.
The second feature is the 51-length flanking human amino acid sequence with the alternative amino acid substituted at the central position by the variant. The alternative flanking sequence is the same as the reference flanking sequence in the first feature, except that the middle position of the sequence contains the alternative amino acid, instead of the reference amino acid. Both reference and alternative human amino acid sequences were converted to one-hot encoded vectors of length 51 × 20, where each amino acid is represented by a vector of 19 amino acids with value 0, and a single amino acid with value 1.
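A short illustrative sketch of the one-hot encoding of the 51-length reference and alternative sequences (the amino acid ordering is an assumption):

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 canonical residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    # Encode a 51-residue flanking sequence as a 51 x 20 matrix in which
    # each row is 1 at the residue's index and 0 at the other 19 positions.
    m = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(seq):
        m[i, AA_INDEX[aa]] = 1.0
    return m

ref = "A" * 25 + "G" + "A" * 25        # toy reference context, glycine center
alt = ref[:25] + "R" + ref[26:]        # arginine substituted at the center
print(one_hot(ref).shape, AMINO_ACIDS[one_hot(alt)[25].argmax()])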
Three position frequency matrices (PFMs) are generated from multiple sequence alignments of 99 vertebrates for the variant, including one for 11 primates, one for 50 mammals excluding primates, and one for 38 vertebrates excluding primates and mammals. Each PFM has dimensions L × 20, where L is the length of flanking sequences around the variant (in our case, L represents 51 amino acids).
For the input to the pre-trained 3-state secondary structure and 3-state solvent accessibility networks, we used a single PFM matrix generated from the multiple sequence alignments for all 99 vertebrates, also with length 51 and depth 20. After pre-training the networks on known crystal structures from the Protein DataBank, the final two layers of the secondary structure and solvent accessibility models were removed (the global maxpool layer and the output layer), and the 51 × 40 shape of the previous layer's output was used as input for the pathogenicity prediction network. We allowed backpropagation through the structural layers of the network to fine-tune parameters.
Semi-Supervised Learning Because semi-supervised learning algorithms use both labeled and unlabeled instances in the training process, they can produce classifiers that achieve better performance than completely supervised learning algorithms that have only a small amount of labeled data available for training. The principle behind semi-supervised learning is that intrinsic knowledge within unlabeled data can be leveraged to strengthen the prediction capability of a supervised model that only uses labeled instances, thereby providing a potential advantage for semi-supervised learning. Model parameters learned by a supervised classifier from a small amount of labeled data may be steered towards a more realistic distribution (which more closely resembles the distribution of the test data) by the unlabeled data.
Another challenge that is prevalent in bioinformatics is the data imbalance problem. The data imbalance phenomenon arises when one of the classes to be predicted is underrepresented in the data because instances belonging to that class are rare (noteworthy cases) or hard to obtain. Ironically, minority classes are typically the most important to learn, because they may be associated with critical cases.
An algorithmic approach to handle imbalanced data distributions is based on ensembles of classifiers. Limited amounts of labeled data naturally lead to weaker classifiers, but ensembles of weak classifiers tend to surpass the performance of any single constituent classifier. Moreover, ensembles typically improve the prediction accuracy obtained from a single classifier by a factor that validates the effort and cost associated with training multiple models. Intuitively, aggregating several classifiers leads to better overfitting control, since averaging the high variability of individual classifiers also averages the classifiers' overfitting.
We adopted a semi-supervised learning strategy because of the lack of adequately sized datasets of confidently labeled pathogenic variants. Although the ClinVar database has over 300,000 entries, after removing variants of uncertain significance, there were only ~42,000 missense variants remaining with non-conflicting interpretations of pathogenicity.
Systematic reviews have also found that these entries often have insufficient clinical evidence to support their annotated pathogenicity. Moreover, most of the variants in human curated databases tend to be within a very small set of genes, making them mismatched for variants in benign training datasets, which are ascertained genome-wide using human common variants or chimpanzee-human fixed substitutions. Given how differently the datasets were ascertained, training a supervised learning model with human-curated variants as the pathogenic set and genome-wide common variants as the benign set is likely to introduce significant biases.
We trained the deep learning network to discriminate between a set of labeled benign variants and an unlabeled set of variants that were carefully matched to remove biases. According to one implementation, the set of 385,236 labeled benign variants comprised human common variants (> 0.1% allele frequency) from the ExAC/gnomAD database and variants from six species of non-human primates.
We sampled a set of unlabeled variants, requiring matching with the benign variants on trinucleotide context (to control for mutational rate, genetic drift, and gene conversion), and adjusting for the impact of alignability and sequence coverage on variant ascertainment. Because the number of unlabeled variants greatly exceeds the number of labeled benign variants, we obtained a consensus prediction by training eight models that use the same set of labeled benign variants and eight randomly sampled sets of unlabeled variants, and taking the average of their predictions.
The motivation for choosing semi-supervised learning is that human-curated variant databases are unreliable and noisy, and in particular lack reliable pathogenic variants. We obtained a set of reliable benign variants from human common variants from gnomAD and primate variants. For pathogenic variants, we adopted an iterative balanced sampling approach, initially sampling pathogenic variants from a set of unknown variants (VUS variants without annotated clinical significance).
To reduce the sampling bias, we trained an ensemble of eight models that use the same set of benign training variants and eight different sets of pathogenic variants. Initially, we randomly sampled unknown variants to represent pathogenic variants. Next, iteratively, the ensemble of models was used to score a set of unknown variants that were not involved in the prior training cycle. The top-scored pathogenic variants were then obtained to replace 5% of the random unknown variants in the prior cycle. Note that we retained 25% more top-scored pathogenic variants than needed, so that we could sample eight different sets of scored pathogenic variants to replace unknown variants, which increases randomness across the eight models. Then the new pathogenic training set is formed and a new cycle of training is executed. This process is repeated until the initial randomly-sampled unknown variants are all replaced by highly confident pathogenic variants predicted by the ensemble models. The iterative balanced sampling process is illustrated in the accompanying figure.
Balancing the Benign and Unknown Training Sets The sampling scheme of unknown variants that match benign variants is useful in reducing the bias of our model training. When the unknown variants are sampled randomly, deep learning models often extract biased information and present trivial solutions. For example, if an amino acid substitution K->M occurs more frequently in unknown variants than benign variants, the deep learning models tend to always classify K->M substitutions as pathogenic. Thus, it is important to balance the distribution of amino acid substitutions between the two training sets.
Highly mutable classes like CpG transitions have a huge representation bias in the benign common variants. The orthologous variants from other primates also follow the human mutation rates, implying enrichment of the highly mutable classes in the overall benign training set. If the sampling procedure of unknown variants is not well controlled and balanced, deep learning models tend to be more likely to classify CpG transitions as benign, compared to a less represented class such as transversions or non-CpG transitions.
To prevent deep learning models from converging to a trivial, non-biological solution, we balance the trinucleotide contexts of benign and unknown variants. The trinucleotide is formed by the base before the variant, the reference base of the variant, and the base after the variant. The reference base of the variant can be changed into the other three nucleotides. In total, there are 64 × 3 trinucleotide contexts.
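The 64 × 3 = 192 contexts can be enumerated directly; a small sketch:

```python
from itertools import product

BASES = "ACGT"

def trinucleotide_contexts():
    """Enumerate the 192 (reference trinucleotide, alternate base) contexts."""
    contexts = []
    for left, ref, right in product(BASES, repeat=3):   # 64 trinucleotides
        for alt in BASES:
            if alt != ref:                              # center base mutates to one of 3 others
                contexts.append((left + ref + right, alt))
    return contexts

assert len(trinucleotide_contexts()) == 192
```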
Iterative Balanced Sampling Cycle 1 We sampled the unknown variants to match the exact number of benign variants for each trinucleotide context. In other words, at the first cycle, we matched the benign and pathogenic training sets in terms of the trinucleotide contexts of the variants. The intuition behind such a sampling methodology is that there is equal representation of variants with identical mutation rates between the benign and unknown sets. This prevents the model from converging to a trivial solution based on mutation rates.
Cycle 2 to Cycle 20 For cycle 2, we applied the trained model from cycle 1 to score a set of unknown variants that were not involved in cycle 1, and replaced 5% of the unknown variants with the top predicted pathogenic variants. This set is purely generated by the model, and we applied no balancing for trinucleotide contexts in this set. The remaining 95% of the unknown variants required for training are sampled to be 95% of the counts of each trinucleotide context in the benign variants.
The intuition is that since cycle 1 uses completely matched training sets, the top predicted pathogenic variants are generated without any mutation rate bias. Hence, there is no need to correct for any bias in this set.
The remaining 95% of the data is still controlled for the trinucleotide context mutation rate to prevent the model from converging to a trivial solution.
For each cycle, the percentage of the replaced unknown variants increases by 5%. For cycle 3, we replaced a further 5% of the unknown variants with the top predicted pathogenic variants from the model of cycle 2.
Cumulatively, the fraction of pathogenic variants increases to 10% and that of the trinucleotide-context-mirrored unknown variants is reduced to 90%. The sampling process is similar for the remaining cycles.
Cycle 21 For cycle 21, the last cycle, the entire pathogenic training set comprises purely the top pathogenic variants predicted by the deep learning models. Since we have explicitly controlled for the mutation rate bias at every cycle, the pathogenic variants are reliable to use as training data and are not affected by the mutation rate bias.
Thus, the last cycle of training produces the final deep learning model for pathogenicity prediction.
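The overall schedule can be summarized in a simplified Python sketch of the iterative balanced sampling loop described above. This is not the authors' implementation: `train_fn` and `score_fn` are hypothetical stand-ins for ensemble training and scoring, each variant object is assumed to carry a `context` attribute, and details such as retaining 25% extra top-scored variants for the eight models are omitted.

```python
import random
from collections import defaultdict

def sample_matched(pool, benign, fraction):
    """Sample unknown variants so that each trinucleotide context holds
    `fraction` times the benign count for that context (mutation-rate matching)."""
    by_ctx = defaultdict(list)
    for v in pool:
        by_ctx[v.context].append(v)
    counts = defaultdict(int)
    for v in benign:
        counts[v.context] += 1
    sample = []
    for ctx, n in counts.items():
        sample += random.sample(by_ctx[ctx], int(fraction * n))
    return sample

def iterative_balanced_sampling(benign, unknown_pool, train_fn, score_fn, n_cycles=21):
    pathogenic = []                       # grows by 5% of the unknown set per cycle
    matched = sample_matched(unknown_pool, benign, fraction=1.0)   # cycle 1: fully matched
    for cycle in range(1, n_cycles + 1):
        model = train_fn(benign, pathogenic + matched)
        if cycle == n_cycles:             # cycle 21 trains purely on top-scored variants
            return model
        fresh = [v for v in unknown_pool if v not in matched]      # unseen this cycle
        ranked = sorted(fresh, key=lambda v: score_fn(model, v), reverse=True)
        pathogenic = ranked[: int(0.05 * cycle * len(benign))]     # cumulative 5% per cycle
        matched = sample_matched(unknown_pool, benign, fraction=1.0 - 0.05 * cycle)
```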
Matching the Labeled Benign and Unlabeled Training Sets Balanced sampling of unlabeled variants is crucial for removing biases that are unrelated to the deleteriousness of the variant. In the absence of proper control of confounding effects, deep learning can easily pick up on inadvertently introduced biases to discriminate between the classes. Human common variants tend to be enriched with variants from highly mutable classes, such as those on CpG islands. Likewise, primate polymorphisms also follow human mutational rates, implying enrichment of highly mutable variants in the overall benign training set. If the sampling procedure of unlabeled variants is not well controlled and balanced, deep learning networks tend to rely on mutational rate bias to classify variants; thus they are more likely to classify CpG transitions as benign, compared to less represented classes such as transversions or non-CpG transitions. We sampled exactly the same number of unlabeled variants as labeled benign variants in each of the 96 trinucleotide contexts (discussed previously).
When matching unlabeled variants for the primate variants in the labeled benign dataset, we disallowed unlabeled variants from being selected from regions of the human genome where that primate species was not aligned in the multiple sequence alignment, since it was not possible to call a variant in that primate species at that position.
Within each of the 96 trinucleotide contexts, we corrected for sequencing coverage for primate variants. Because of the large number of humans sequenced, common variants in the human population are observed frequently enough that they are well-ascertained even in areas with low sequencing coverage. The same is not true for primate variants, since only a small number of individuals were sequenced. We split the genome into 10 bins based on sequencing coverage in ExAC/gnomAD exomes. For each bin, we measured the fraction of primate variants in the labeled benign dataset versus the unlabeled dataset. We calculated the probability that a primate variant is a member of the labeled benign dataset, based only on sequencing coverage, using linear regression. When selecting unlabeled variants to match the primate variants in the labeled benign dataset, we weighted the probability of sampling a variant based on the sequencing coverage at that position using the regression coefficients.
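In code, the coverage correction might look like the following NumPy sketch; the per-bin benign fractions are illustrative placeholders, not the study's measured values.

```python
import numpy as np

bin_centers = np.arange(10)                     # 10 sequencing-coverage bins
benign_fraction = np.linspace(0.2, 0.8, 10)     # hypothetical measured fractions per bin

# Linear regression of benign-set membership probability on coverage bin.
slope, intercept = np.polyfit(bin_centers, benign_fraction, deg=1)

def sampling_weight(coverage_bin):
    """Weight for drawing an unlabeled variant, proportional to the regression
    estimate of P(labeled benign | coverage) at that variant's position."""
    return intercept + slope * coverage_bin
```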
Generation of Benign and Unknown Variants Common Variants in Human Population Recent studies demonstrated that common variants in human populations are generally benign. gnomAD provides 90,958 nonsynonymous SNPs with minor allele frequency (MAF) >= 0.1% within canonical coding regions, according to one implementation. Variants that passed filters are retained. Indels are excluded.
Variants that occur in the start or stop codons are removed, as well as protein-truncating variants. Scrutinizing subpopulations, the total number of missense variants with MAF >= 0.1% within each subpopulation increases to 245,360, according to one implementation. These variants form part of the training set of benign variants.
Common Polymorphisms in Great Apes As coding regions are known to be highly conserved, it is straightforward to assume that if a polymorphism is segregating in a great ape population at high frequency, it may also have only a mild impact on human fitness. Polymorphism data of bonobo, chimpanzee, gorilla, and orangutan from great ape genome projects and other studies were merged with SNPs of rhesus and marmoset from dbSNP.
Generation of Unknown Variants All possible variants are generated from each base position of the canonical coding regions by substituting the nucleotide at the position with the other three nucleotides. New codons are formed, leading to potential changes of amino acids at those positions. Synonymous changes are filtered out.
Variants observed in the gnomAD dataset are removed. Variants that occur in the start or stop codons are removed, as well as variants that form stop codons. For SNPs with multiple gene annotations, the canonical gene annotation is selected to represent the SNP's annotation. In total, 68,258,623 unknown variants are generated, according to one implementation.
Additional Filtering of Variants In some regions of the human genome, it is known to be difficult to align reads. Inclusion of those regions causes confounding effects on the training and testing datasets. For example, regions under high selective pressure tend to have limited numbers of polymorphisms. Likewise, regions that are hard to sequence also have fewer polymorphisms. To avoid such confounding inputs to our models, we removed variants from genes that were not sequenced by gnomAD.
Generally, benign variants are discovered in well-sequenced regions, which tend to be conserved across multiple species, while unknown variants are randomly sampled across genomes, which include some poorly-covered regions. This causes ascertainment bias between the benign and unknown sets. To reduce the bias, we filtered out variants with read depths < 10 in gnomAD. We also filtered out all variants which have more than % missing data in the flanking sequence alignments across all the mammal species.
Data for Validation and Testing For validating and testing the pathogenicity models, we randomly sampled from the large pool of benign variants two sets of 10,000 benign variants for validation and testing, respectively, according to one implementation.
The rest of the benign variants are used for training the deep learning models. These variants are specifically sampled from orthologous primate variants to ensure fair comparison between methods, as some methods are trained on human common variants. We also randomly sampled two sets of 10,000 unknown variants for validation and testing, separately, according to one implementation. We ensured that the number of unknown variants within each of the 192 trinucleotide contexts matched that of the benign variants for the validation and testing sets, respectively.
We evaluated the performance of multiple methods in clinical settings using de novo variants of affected children with autism or developmental disability disorder (DDD) and their unaffected siblings. In total, according to one implementation, there are 3,821 missense de novo variants from the cases with DDD and 2,736 missense de novo variants from the cases with autism. There are 1,231 missense de novo variants from the unaffected siblings, according to one implementation.
Deep Learning Network Architecture The pathogenicity prediction network receives five direct inputs and two indirect inputs via the secondary structure and solvent accessibility networks. The five direct inputs are 51-length amino acid sequences × 20-depth (encoding the 20 different amino acids), and comprise the reference human amino acid sequence without the variant (1a), the alternative human amino acid sequence with the variant substituted in (1b), the PFM from the multiple sequence alignment of primate species (1c), the PFM from the multiple sequence alignment of mammalian species (1d), and the PFM from the multiple sequence alignment of more distant vertebrate species (1e). The secondary structure and solvent accessibility networks each receive as input a PFM from the multiple sequence alignment (1f) and (1g), and provide their outputs as inputs into the main pathogenicity prediction network. The secondary structure and solvent accessibility networks were pre-trained on known protein crystal structures from the Protein DataBank, and allow backpropagation during pathogenicity model training.
The five direct input channels are passed through an upsampling convolution layer of 40 kernels with linear activations. The human reference amino acid sequence (1a) is merged with the PFMs from the primate, mammal, and vertebrate multiple sequence alignments (Merge 1a). Similarly, the human alternative amino acid sequence (1b) is merged with the PFMs from the primate, mammal, and vertebrate multiple sequence alignments (Merge 1b). This creates two parallel tracks, one for the reference sequence and one for the alternate sequence with the variant substituted in.
The merged feature maps of both the reference channel and the alternative channel (Merge 1a and Merge 1b) are passed through a series of six residual blocks (Layers 2a to 7a, Merge 2a and Layers 2b to 7b, Merge 2b).
The outputs of the residual blocks (Merge 2a and Merge 2b) are concatenated together to form a feature map of size (51, 80) (Merge 3a, Merge 3b), which fully mixes the data from the reference and alternative channels. Next, the data has two paths for passing through the network in parallel, either through a series of six residual blocks containing two convolutional layers each, as defined in section 2.1 (Merge 3 to 9, Layers 9 to 46 excluding layers 21 and 34), or via skip connections, which connect the output of every two residual blocks after passing through a 1D convolution (Layer 21, Layer 37, Layer 47). Finally, the merged activations (Merge 10) are fed to another residual block (Layers 48 to 53, Merge 11). The activations from Merge 11 are given to a 1D convolution with filter size 1 and sigmoid activation (Layer 54), then passed through a global max pooling layer that picks a single value representing the network's prediction for variant pathogenicity. A schematic illustration of the model is shown in FIGs. 4A, 4B, and 4C and Supplementary Table 16.
Model Overview We developed semi-supervised deep convolutional neural network (CNN) models to predict the pathogenicity of variants. The input features to the models include protein sequences and conservation profiles flanking the variants, and the depletion of missense variants in specific gene regions. We also predicted the changes caused by variants to secondary structure and solvent accessibility with deep learning models and integrated those predictions into our pathogenicity prediction model. For training the model, we collected benign variants from common variants of human subpopulations and orthologous variants from primates. However, we still lacked reliable sources of pathogenic variants. We initially trained the model with benign and unknown variants, and then used a semi-supervised iterative balanced sampling (IBS) algorithm to gradually replace unknown variants with a set of pathogenic variants predicted with high confidence. Finally, we demonstrated that our model outperformed the existing methods in distinguishing de novo variants causing developmental disability disorder in humans from benign ones.
Adoption of Residual Blocks Both our deep learning model for pathogenicity prediction and the deep learning models for predicting secondary structure and solvent accessibility adopt the definition of residual blocks first described in the original work on residual networks. The structure of a residual block is shown in the figure below. The input layer is first batch normalized, followed by the nonlinear activation ReLU. The activation is then passed through a 1D convolution layer. This intermediate output from the 1D convolution layer is again batch normalized and ReLU activated, followed by another 1D convolution layer. At the end of the second 1D convolution, we merge its output with the original input. In such an architecture, the input is preserved in its original state and the residual connections are kept free of nonlinear activations of the model. Atrous/dilated convolutions allow for large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution where the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel so that neighboring input entries (e.g., nucleotides, amino acids) at larger intervals are considered when a convolution operation is performed. This enables the incorporation of long-range contextual dependencies in the input. The atrous convolutions preserve partial convolution calculations for reuse as adjacent nucleotides are processed.
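A residual block of this form, with the atrous rate exposed as a dilation parameter, can be sketched in Keras as follows. This is a minimal sketch using `tensorflow.keras`; the 40-kernel, window-5 sizing follows the convention used elsewhere in this document.

```python
from tensorflow.keras import layers

def residual_block(x, kernels=40, window=5, atrous_rate=1):
    """Pre-activation residual block: BN -> ReLU -> Conv1D, repeated twice,
    with the input added back unchanged at the end.  `dilation_rate`
    implements the atrous/dilated convolution."""
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv1D(kernels, window, padding="same", dilation_rate=atrous_rate)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(kernels, window, padding="same", dilation_rate=atrous_rate)(y)
    return layers.Add()([shortcut, y])   # identity path carries no nonlinearity
```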
Novelty of Our Model Our method differs from the existing methods for predicting pathogenicity of variants in three ways.
First, our method adopts a novel architecture of semi-supervised deep convolutional neural networks. Second, reliable benign variants are obtained from human common variants from gnomAD and primate variants, while the highly confident pathogenic training set is generated through iterative balanced sampling and training, to avoid circular training and testing of models using the identical human-curated variant databases. Third, deep learning models for secondary structure and solvent accessibility are integrated into the architecture of our pathogenicity model. The information obtained from the structure and solvent models is not limited to label prediction for specific amino acid residues. Rather, the readout layer is removed from the structure and solvent models, and the pre-trained models are merged with the pathogenicity model. While training the pathogenicity model, the structure and solvent pretrained layers also backpropagate to minimize the error. This helps the pre-trained structure and solvent models to focus on the pathogenicity prediction problem. Training Secondary Structure and Solvent Accessibility Models Data Preparation We trained deep convolutional neural networks to predict the 3-state secondary structure and the 3-state solvent accessibility of proteins. Protein annotations from PDB are used for training the models. Sequences with more than 25% similarity with their sequence profile are removed, according to one implementation. In total, 6,293 protein sequences are used for training, 392 for validation, and 499 for testing, according to one implementation.
The position-specific scoring matrix (PSSM) conservation profiles for the proteins are generated by running PSI-BLAST with an E-value threshold of 0.001 and 3 iterations to search UniRef90. Any unknown amino acid is set as blank, as well as its secondary structure. We also ran PSI-BLAST with similar parameter settings for all human genes to collect their PSSM conservation profiles. These matrices are used for integrating the structure model into pathogenicity prediction. The amino acids of the protein sequences are then converted into one-hot encoding vectors, and the protein sequences and PSSM matrices are reshaped to a matrix of L × 20, where L is the length of a protein. The three predicted labels for secondary structure are helix (H), beta sheet (B), and coil (C). The three labels for solvent accessibility are buried (B), intermediate (I), and exposed (E). One label corresponds to one amino acid residue. The labels are coded as one-hot encoding vectors of dimension 3.
Model Architecture and Training We trained two end-to-end deep convolutional neural network models to predict the 3-state secondary structure and 3-state solvent accessibility of proteins, respectively. The two models have similar configurations, including two input channels, one for protein sequences and the other for protein conservation profiles. Each input channel has the dimension L × 20, where L denotes the length of a protein.
Each of the input channels is passed through a 1D convolution layer (layers 1a and 1b) with 40 kernels and linear activations. This layer is used to upsample the input dimensions from 20 to 40. Note that all the other layers throughout the model use 40 kernels. The activations of the two layers (1a and 1b) are merged together by summing the values across each of the 40 dimensions (i.e., merge mode = 'sum'). The output of the merge node is passed through a single layer of 1D convolution (layer 2) followed by linear activation.
The activations from layer 2 are passed through a series of 9 residual blocks (layers 3 to 11) as defined above. The activations of layer 3 are fed to layer 4, layer 4's activations are fed to layer 5, and so on. There are also skip connections that directly sum the output of every 3rd residual block (layers 5, 8, and 11). The merged activations are then fed to two 1D convolutions (layers 12 and 13) with ReLU activations. The activations from layer 13 are given to the softmax readout layer. The softmax computes the probabilities of the three-class outputs for the given input.
For the best secondary structure model, the 1D convolutions have an atrous rate of 1. For the solvent accessibility model, the last 3 residual blocks (layers 9, 10, and 11) have an atrous rate of 2 to increase the coverage of the kernels. The secondary structure of a protein is strongly dependent on the interactions of amino acids in close proximity, so models with higher kernel coverage improve the performance only slightly. On the other hand, solvent accessibility is affected by long-range interactions between amino acids. Therefore, for the model with high coverage of kernels using atrous convolutions, its accuracy is more than 2% higher than that of the short-coverage models.
The table below provides details about activations and parameters for each layer of the 3-state secondary structure prediction model, according to one implementation.
Layer | Type | Kernels, Window | Shape | Atrous rate | Activation
Input Sequence (layer 1a) | Convolution 1D | 40, 1 | (L, 40) | 1 | Linear
Input PSSM (layer 1b) | Convolution 1D | 40, 1 | (L, 40) | 1 | Linear
Merging Sequence + PSSM | Merge (mode = Concatenate) | - | (L, 80) | - | -
Layer 2 | Convolution 1D | 40, 5 | (L, 40) | 1 | Linear
Layer 3 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 4 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 5 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 6 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 7 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 8 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 9 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 10 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 11 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Merge activations (layers 5, 8, 11; mode = 'sum') | Merge | - | (L, 40) | - | -
Layer 12 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 13 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Output layer | Convolution 1D | 1, 3 | (L, 3) | - | Softmax

The details of the solvent accessibility model are shown in the table below, according to one implementation.
Layer | Type | Kernels, Window | Shape | Atrous rate | Activation
Input Sequence (layer 1a) | Convolution 1D | 40, 1 | (L, 40) | 1 | Linear
Input PSSM (layer 1b) | Convolution 1D | 40, 1 | (L, 40) | 1 | Linear
Merging Sequence + PSSM | Merge (mode = Concatenate) | - | (L, 80) | - | -
Layer 2 | Convolution 1D | 40, 5 | (L, 40) | 1 | Linear
Layer 3 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 4 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 5 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 6 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 7 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 8 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 9 | Convolution 1D | 40, 5 | (L, 40) | 2 | ReLU
Layer 10 | Convolution 1D | 40, 5 | (L, 40) | 2 | ReLU
Layer 11 | Convolution 1D | 40, 5 | (L, 40) | 2 | ReLU
Merge activations (layers 5, 8, 11; mode = 'sum') | Merge | - | (L, 40) | - | -
Layer 12 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Layer 13 | Convolution 1D | 40, 5 | (L, 40) | 1 | ReLU
Output layer | Convolution 1D | 1, 3 | (L, 3) | - | Softmax

The secondary structure class of a specific amino acid residue is determined by the largest predicted softmax probability. The model is trained with an accumulated categorical cross entropy loss function for the whole protein sequence using the ADAM optimizer for optimizing the backpropagation.
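Putting the tables together, a compact Keras sketch of the two models follows. It is an approximation, not the exact implementation: the tables show the sequence and PSSM channels being concatenated to (L, 80), while the body text describes a summed merge; the concatenated form is used here, and `atrous_last3=2` reproduces the solvent accessibility variant.

```python
from tensorflow.keras import layers, models, optimizers

def build_structure_model(atrous_last3=1):
    """3-state per-residue classifier; set atrous_last3=2 for the solvent model."""
    seq_in = layers.Input(shape=(None, 20))    # one-hot protein sequence
    pssm_in = layers.Input(shape=(None, 20))   # PSSM conservation profile
    a = layers.Conv1D(40, 1)(seq_in)           # layer 1a: linear upsampling to 40 channels
    b = layers.Conv1D(40, 1)(pssm_in)          # layer 1b
    x = layers.Concatenate()([a, b])           # (L, 80), as in the tables above
    x = layers.Conv1D(40, 5, padding="same")(x)                  # layer 2, linear
    skips = []
    for i in range(9):                         # layers 3..11
        rate = atrous_last3 if i >= 6 else 1   # layers 9-11 dilated in the solvent model
        x = layers.Conv1D(40, 5, padding="same", dilation_rate=rate,
                          activation="relu")(x)
        if i % 3 == 2:
            skips.append(x)                    # outputs of layers 5, 8, and 11
    x = layers.Add()(skips)                    # skip connections, merge mode = 'sum'
    x = layers.Conv1D(40, 5, padding="same", activation="relu")(x)   # layer 12
    x = layers.Conv1D(40, 5, padding="same", activation="relu")(x)   # layer 13
    out = layers.Conv1D(3, 1, activation="softmax")(x)               # 3-state readout
    model = models.Model([seq_in, pssm_in], out)
    model.compile(optimizer=optimizers.Adam(), loss="categorical_crossentropy")
    return model
```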
The best testing accuracy for the 3-state secondary structure prediction model is 80.32%, similar to the state-of-the-art accuracy predicted by the DeepCNF model on a similar training dataset.
The best testing accuracy for the 3-state solvent accessibility prediction model is 64.83%, similar to the current best accuracy predicted by RaptorX on a similar training dataset.
We integrated the pretrained 3-state secondary structure and solvent accessibility prediction models into our pathogenicity prediction model as explained below.
Training Models to Predict Pathogenicity of Variants Input Features for Pathogenicity Prediction Model As discussed above, for the pathogenicity prediction problem, there is a benign variant training set and an unknown variant training set for training the pathogenicity model. For each variant, we prepared the following input features to feed into our model.
The first input feature of each variant is its flanking amino acid sequence, i.e., 25 amino acids on each side of the variant obtained from the reference sequence of hg19, to provide the deep learning models with the sequence context of the variant. In total, this flanking reference sequence is 51 amino acids in length.
The second feature is the alternative amino acid created by the variant. Instead of providing the reference-alternative amino acid pair directly, we provide the alternative flanking sequence to the model. The alternative flanking sequence is the same as the reference flanking sequence in the first feature, except that the middle position of the sequence contains the alternative amino acid, instead of the reference amino acid.
Both sequences are then converted to one-hot encoded vectors of length 51 × 20, where each amino acid is represented by a length-20 vector of zeros with a single one.
Then three position weight matrices (PWMs) are generated from multiple sequence alignments (MSAs) of 99 vertebrates for the variant, including one for 12 primates, one for 47 mammals excluding primates, and one for 40 vertebrates excluding primates and mammals. Each PWM has the dimension L × 20, where L is the length of the flanking sequences around the variant (in our case, L represents 51 amino acids). It comprises counts of amino acids seen in each category of species.
We also generate the PSSM matrices for the variant-flanking sequences of 51 amino acids from PSI-BLAST. These are used for the integration of the 3-state secondary structure and solvent accessibility prediction models into pathogenicity prediction.
We train the pathogenicity model with the reference sequence (input1), alternative sequence (input2), PWM matrices for primates (input3), mammals (input4), and vertebrates (input5), and information from the 3-state secondary structure and solvent accessibility models.
Deep Learning Model Training A block diagram providing an overview of the deep learning model workflow is shown in the accompanying figure. The pathogenicity training model comprises five direct inputs and four indirect inputs. The five direct input features are the reference sequence (1a), alternative sequence (1b), primate conservation (1c), mammal conservation (1d), and vertebrate conservation (1e). The indirect inputs are the reference-sequence-based secondary structure (1f), alternative-sequence-based secondary structure (1g), reference-sequence-based solvent accessibility (1h), and alternative-sequence-based solvent accessibility (1i).
For indirect inputs 1f and 1g, we load the pretrained layers of the secondary structure prediction model, excluding the softmax layer. For input 1f, the pretrained layers are based on the human reference sequence for the variant along with the PSSM generated by PSI-BLAST for the variant. Likewise, for input 1g, the pretrained layers of the secondary structure prediction model take the human alternative sequence as input along with the PSSM matrix. Inputs 1h and 1i correspond to the analogous pretrained channels containing the solvent accessibility information for the reference and alternative sequences of the variant, respectively.
The five direct input channels are passed through an upsampling convolution layer of 40 kernels with linear activations. Layers 1a, 1c, 1d, and 1e are merged with values summed across the 40 feature dimensions to produce layer 2a. In other words, the feature map of the reference sequence is merged with the three types of conservation feature maps. Similarly, 1b, 1c, 1d, and 1e are merged with values summed across the 40 feature dimensions to generate layer 2b, i.e., the features of the alternative sequence are merged with the three types of conservation features.
Layers 2a and 2b are batch normalized with ReLU activation and each passed through a 1D convolution layer of filter size 40 (3a and 3b). The outputs of layers 3a and 3b are merged with 1f, 1g, 1h, and 1i, with the feature maps concatenated to each other. In other words, the feature maps of the reference sequence with its conservation profile and the alternative sequence with its conservation profile are merged with the secondary structure feature maps of the reference and alternative sequences and the solvent accessibility feature maps of the reference and alternative sequences (layer 4).
The outputs of layer 4 are passed through six residual blocks (layers 5, 6, 7, 8, 9, 10). The last residual blocks have an atrous rate of 2 for the 1D convolutions, to give higher coverage for the kernels. The output of layer 10 is passed through a 1D convolution of filter size 1 and sigmoid activation (layer 11). The output of layer 11 is passed through a global maxpool that picks a single value for the variant. This value represents the pathogenicity of the variant. The details of one implementation of the pathogenicity prediction model are shown in the table below.
Layer | Type | Kernels, Window size | Shape | Atrous rate | Activation
Reference sequence (1a) | Convolution 1D | 40, 1 | (51, 40) | 1 | Linear
Alternative sequence (1b) | Convolution 1D | 40, 1 | (51, 40) | 1 | Linear
Primate conservation (1c) | Convolution 1D | 40, 1 | (51, 40) | 1 | Linear
Mammal conservation (1d) | Convolution 1D | 40, 1 | (51, 40) | 1 | Linear
Vertebrate conservation (1e) | Convolution 1D | 40, 1 | (51, 40) | 1 | Linear
Reference-sequence-based secondary structure (1f) | Input layer | - | (51, 40) | - | -
Alternative-sequence-based secondary structure (1g) | Input layer | - | (51, 40) | - | -
Reference-sequence-based solvent accessibility (1h) | Input layer | - | (51, 40) | - | -
Alternative-sequence-based solvent accessibility (1i) | Input layer | - | (51, 40) | - | -
Reference sequence merge (2a) | Merge (mode = Sum) of (1a, 1c, 1d, 1e) | - | (51, 40) | - | -
Alternative sequence merge (2b) | Merge (mode = Sum) of (1b, 1c, 1d, 1e) | - | (51, 40) | - | -
3a | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
3b | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
4 | Merge (mode = Concatenate) of (3a, 3b, 1f, 1g, 1h, 1i) | - | (51, 240) | - | -
5 | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
6 | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
7 | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
8 | Convolution 1D | 40, 5 | (51, 40) | 1 | ReLU
9 | Convolution 1D | 40, 5 | (51, 40) | 2 | ReLU
10 | Convolution 1D | 40, 5 | (51, 40) | 2 | ReLU
11 | Convolution 1D | 1, 1 | (51, 1) | 2 | Sigmoid
12 | Global max pooling | - | - | - | -
Output layer | Activation layer | - | - | - | Sigmoid

Ensembles In one implementation, for each cycle of our method, we ran eight different models that train on the same benign data set and eight different unknown data sets, and averaged the predictions on the evaluation data sets across the eight models. Sampling bias can be reduced and well controlled when multiple randomly-sampled sets of unknown variants are presented to the model.
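Returning to the architecture tabulated above, it can likewise be sketched in Keras. This is a simplified rendering under a few assumptions: the four structure/solvent inputs stand in for the pretrained networks with their readout layers removed, layers 5-10 are written as plain convolutions following the table (the prose describes them as residual blocks), and layer 11 uses a single kernel so that the (51, 1) output shape holds.

```python
from tensorflow.keras import layers, models

def build_pathogenicity_model():
    direct = [layers.Input(shape=(51, 20)) for _ in range(5)]    # inputs 1a-1e
    up = [layers.Conv1D(40, 1)(d) for d in direct]               # linear upsampling
    ref, alt, primate, mammal, vert = up
    indirect = [layers.Input(shape=(51, 40)) for _ in range(4)]  # 1f-1i (pretrained)
    m2a = layers.Add()([ref, primate, mammal, vert])             # merge 2a (sum)
    m2b = layers.Add()([alt, primate, mammal, vert])             # merge 2b (sum)

    def bn_relu_conv(t):                                         # layers 3a / 3b
        t = layers.BatchNormalization()(t)
        t = layers.ReLU()(t)
        return layers.Conv1D(40, 5, padding="same")(t)

    # Layer 4: concatenate to (51, 240).
    x = layers.Concatenate()([bn_relu_conv(m2a), bn_relu_conv(m2b)] + indirect)
    for rate in [1, 1, 1, 1, 2, 2]:                              # layers 5-10
        x = layers.Conv1D(40, 5, padding="same", dilation_rate=rate,
                          activation="relu")(x)
    x = layers.Conv1D(1, 1, activation="sigmoid")(x)             # layer 11: (51, 1)
    out = layers.GlobalMaxPooling1D()(x)                         # single pathogenicity score
    return models.Model(direct + indirect, out)
```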
Furthermore, adoption of an ensemble-of-ensembles approach can improve the performance of our model on our evaluation dataset. CADD uses an ensemble of 10 models and takes the average score across all 10 models to score a variant. Here we attempted to use a similar ensemble approach. We benchmarked the results using one ensemble and then increased the number of ensembles to assess the performance gain. Note that each ensemble has eight models that train on the same benign data set and eight different unknown data sets. For different ensembles, the seed values of the random number generator are distinct, so that the random variant sets are drawn differently from each other.
The detailed results according to one implementation are shown in the table below.
Number of Ensembles | -log(p-value) on DDD dataset (mean of ensemble means) | -log(p-value) on DDD dataset (median of ensemble means)
1 | 3.4e-27 | 3.4e-27
 | 2.74e-29 | 3.8e-29
 | 2.98e-29 | 2.55e-29
 | 4.06e-29 | 3.88e-29
 | 3.116e-29 | 3.05e-29
 | 3.77e-29 | 3.31e-29
 | 3.81e-29 | 3.34e-29

Compared with one ensemble, 5 ensembles and 10 ensembles produced more significant p-values when evaluated on the DDD dataset, but increasing the number of ensembles further fails to improve the performance, indicating saturation. The ensembles reduce sampling bias through a large variety of unknown variants. However, we also required matching the 192 trinucleotide contexts between the benign and pathogenic classes, which limits our sampling space substantially, leading to the quick saturation. We conclude that the ensemble-of-ensembles approach improves the model performance significantly and further enriches our understanding of the models.
Early Stopping for Training the Pathogenicity Model Because reliable annotated pathogenic variant samples are lacking, it is challenging to define stopping criteria for model training. To avoid using pathogenic variants in model evaluation, in one implementation, we used the 10,000 benign validation variants from orthologous primates and 10,000 trinucleotide-context-matched unknown variants. After training each epoch of the model, we evaluated the benign validation variants and the unknown validation variants. We used the Wilcoxon rank-sum test to assess the difference between the probability distributions of the two validation variant sets.
The p-value of the test becomes more significant with improvements in the model's ability to distinguish benign variants from a set of unknown variants. We stop the training if no improvement is observed in the model's ability to distinguish between the two distributions during any five consecutive epochs of model training.
Earlier, we had set aside two separate sets of 10,000 withheld primate variants from training, which we termed the validation set and the test set. We used the validation set of 10,000 withheld primate variants and 10,000 unlabeled variants that were matched for trinucleotide context for evaluating early stopping during model training.
After each training epoch, we evaluated the ability of the deep neural network to discriminate between variants in the labeled benign validation set and the unlabeled matched controls, measuring the difference in the distributions of the predicted scores using the Wilcoxon rank-sum test. We stopped the training once no further improvement was observed after five consecutive training epochs, to prevent overfitting.
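A sketch of this stopping rule, assuming per-epoch score arrays for the two validation sets and using `scipy.stats.ranksums` for the Wilcoxon rank-sum test:

```python
from scipy.stats import ranksums

def epoch_pvalue(benign_scores, unknown_scores):
    """Wilcoxon rank-sum p-value comparing predicted scores on the benign
    validation variants vs. the trinucleotide-matched unknown variants."""
    return ranksums(benign_scores, unknown_scores).pvalue

def should_stop(pvalue_history, patience=5):
    """Stop when the p-value has not improved (decreased) in `patience`
    consecutive epochs relative to the best p-value seen before them."""
    if len(pvalue_history) <= patience:
        return False
    best_before = min(pvalue_history[:-patience])
    return all(p >= best_before for p in pvalue_history[-patience:])
```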
Benchmarking of Classifier Performance We assessed the classification accuracy of two versions of the deep learning network, one trained with common human variants only, and one trained with the full benign labeled dataset including both common human variants and primate variants, in addition to the following classifiers: SIFT, PolyPhen-2, CADD, REVEL, M-CAP, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST3, MetaSVM, MetaLR, MutPred, DANN, FATHMM-MKL_coding, Eigen, GenoCanyon, and GERP++ 13,32-48. To obtain the scores for each of the other classifiers, we downloaded the scores for all missense variants from dbNSFP 49 (https://sites.google.com/site/jpopgen/dbNSFP), and benchmarked the methods on the 10,000 withheld primate variant test set, and on de novo variants in DDD cases versus controls. We selected SIFT, PolyPhen-2, and CADD for inclusion in the main paper because they are among the most widely used methods, and REVEL because, across the different modes of evaluation, it stood out as one of the best of the 20 existing classifiers we evaluated.
The performance for all classifiers we evaluated is provided in A.
For evaluating the impact of available training data size on the performance of the deep learning network, we trained deep learning networks at each data point by randomly sampling from the labeled benign training set of 385,236 primate and common human variants. To reduce random noise in the performance of the classifiers, we performed this training procedure five times, each time using a random instantiation of initial parameter weights, and showed the median performance on both the 10,000 withheld primate variants and the DDD cases vs. controls dataset. By chance, the performance of the median classifier with the full dataset of 385,236 labeled benign variants was slightly better than the one we used for the rest of the paper on the DDD dataset (P < 10^-29 instead of P < 10^-28 by Wilcoxon rank-sum test). To show that variants from each individual primate species contribute to classification accuracy, whereas variants from each individual mammal species lower classification accuracy, we trained deep learning networks using a training dataset comprising 83,546 human variants plus a constant number of randomly selected variants for each species, according to one implementation.
According to one implementation, the constant number of variants we added to the training set (23,380) is the total number of variants available in the species with the lowest number of missense variants, i.e., bonobo. To reduce noise, we again repeated the training procedure five times, and reported the median performance of the classifier.
Model Evaluation In one implementation, we trained 21 cycles of deep learning models following the iterative balanced sampling procedure. We performed two types of evaluations to assess the performance of our classifiers. We also compared our models with PolyPhen-2, SIFT, and CADD on the two evaluations and assessed the potential application of our models for clinical annotation.
Method 1: Benign Test Set Accuracy In one implementation, we evaluated the 10,000 benign variants and unknown variants by computing their predicted probabilities using an ensemble of eight different trained models. We also obtained their predicted probabilities as scored by the other existing methods mentioned above.
We then obtained the median of the predicted probabilities across the unknown test variants for each of the methods used in the evaluation. Using the median score, we found the number of benign variants that scored above or below the median, depending on the annotation of benign and pathogenic variants used by each of the methods.
SIFT, CADD, and our method label pathogenic variants as 1 and benign variants as 0; thus, we counted the number of benign variants that scored below the median. PolyPhen uses the opposite annotation, and we counted the number of benign variants above the median. The ratio of the number of benign variants scored above/below the median divided by the total number of benign variants represents the prediction accuracy of benign variants.
Benign accuracy = Total number of benign variants above (below*) the median ÷ Total number of benign variants. Our reasoning behind this evaluation method relies on the analysis of the selective pressure on variants in gnomAD. For the singletons in gnomAD, the ratio of missense variants to synonymous variants is ~2.26:1, while for the common variants (MAF > 0.1%) in gnomAD, the missense-to-synonymous ratio is ~1.06:1. This indicates that, from a set of random unknown variants, approximately 50% are expected to be purged by natural selection, and the remaining 50% tend to be mild and likely to become common in the population.
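This accuracy can be computed directly from the score arrays; a minimal sketch, where `pathogenic_is_high` distinguishes the two annotation conventions described above:

```python
import numpy as np

def benign_accuracy(benign_scores, unknown_scores, pathogenic_is_high=True):
    """Fraction of benign test variants falling on the benign side of the
    median score of the matched unknown test variants (50th-percentile threshold)."""
    threshold = np.median(unknown_scores)
    if pathogenic_is_high:      # SIFT/CADD/our model convention: pathogenic ~ 1
        return np.mean(np.asarray(benign_scores) < threshold)
    return np.mean(np.asarray(benign_scores) > threshold)
```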
Method | Accuracy on withheld benign set
PolyPhen | 0.74
SIFT | 0.69
CADD | 0.77
Our Model | 0.85

As shown in the table above, our method outperforms the second best method, CADD, by more than 8%. This shows a significant improvement in our model's ability to classify benign variants. While this demonstrates the ability of our model, the following Method 2 shows the usefulness of our model on clinical datasets for clinical interpretation.
Method 2: Clinical Dataset Evaluation In one implementation, we evaluated these pathogenicity prediction methods on clinical datasets, including the developmental disability disorder (DDD) case-control dataset. The DDD dataset comprises 3,821 de novo missense variants from affected children and 1,231 de novo missense variants from their unaffected siblings. Our hypothesis is that the de novo variants from affected children tend to be more deleterious than the de novo variants from their unaffected siblings.
As clinical test datasets do not clearly label pathogenic variants, we used the separation between the two sets of de novo variants (from affected and unaffected individuals) to gauge the performance of those methods. We applied the Wilcoxon rank-sum test to evaluate how well these two sets of de novo variants are separated.
Method | -log10(p-value) on DDD dataset
PolyPhen | 15.02
SIFT | 13.52
CADD | 13.83
DL | 28.35

According to the table above, our semi-supervised deep learning models perform significantly better in distinguishing the affected set of de novo variants from the unaffected set. This shows that our model is more appropriate for clinical interpretation than the existing methods. It also validates that the general approach of extracting features from genome sequences and conservation profiles is superior to manually-crafted features based on human-curated datasets.
Benign Prediction Accuracy on a Withheld Test Set of 10,000 Primate Variants We used the 10,000 withheld primate variants in the test dataset to benchmark the deep learning network as well as the other 20 classifiers. Because the different classifiers had widely varying score distributions, we used 10,000 randomly selected unlabeled variants that were matched to the test set by trinucleotide context to identify the 50th-percentile threshold for each classifier. To ensure fair comparison between the methods, we benchmarked each classifier on the fraction of variants in the 10,000 withheld primate variant test set that were classified as benign at the 50th-percentile threshold for that classifier.
Our reasoning behind using the 50th percentile to identify benign variants is based on the selective pressure observed for missense variants in the ExAC/gnomAD dataset. For variants occurring at singleton allele frequency, the missense:synonymous ratio is ~2.2:1, whereas for common variants (> 0.1% allele frequency), the missense:synonymous ratio is ~1.06:1. This indicates that approximately 50% of missense variants are expected to be purged by natural selection at common allele frequencies, and the remaining 50% are mild enough to have the potential to become common in the population via genetic drift.
For each of the classifiers, the fraction of withheld primate test variants predicted as benign using the 50th-percentile threshold is shown (A and Supplementary Table 17).
Analysis of de novo Variants from the DDD Study We benchmarked the classification methods on their ability to discriminate between de novo missense variants in DDD affected individuals versus de novo missense variants in unaffected sibling controls. For each classifier, we reported the p-value from the Wilcoxon rank-sum test of the difference between the prediction scores for the two distributions (FIGs. 28B and 28C and Supplementary Table 17).
Given that our two metrics for analyzing the performance of the model are derived from different sources and methodologies, we tested whether the performance of the classifiers on the two different metrics was correlated. Indeed, we found that these two metrics were correlated, with a Spearman ρ = 0.57 (P < 0.01) between benign classification accuracy on the withheld primate test set and the Wilcoxon rank-sum p-value for de novo missense variants in DDD cases vs. controls. This shows that there is good agreement between the withheld primate test set accuracy and the DDD case vs. control p-value for benchmarking the classifiers (A).
Furthermore, we tested whether the deep learning network could assist in the discovery of genes associated with disease. We tested enrichment of de novo mutations in genes by comparing the observed number of de novo mutations to the number expected under a null mutation model.
We examined the performance of the deep learning network by comparing results from all missense de novo mutations versus results from missense mutations with a score > 0.803. Testing all missense de novos used the default missense rate, whereas testing filtered missense de novos used missense mutation rates calculated from sites with scores > 0.803. Each gene required four tests: one testing protein-truncating enrichment and one testing enrichment of protein-altering de novo mutations, each tested for just the DDD cohort and for a larger meta-analysis of neurodevelopmental trio sequencing cohorts. The enrichment of protein-altering de novo mutations was combined by Fisher's method with a test of the clustering of missense de novo mutations within the coding sequence (Supplementary Tables 20 and 21). The p-value for each gene was taken from the minimum of the four tests, and genome-wide significance was determined as P < 6.757 x 10^-7 (α = 0.05, 18,500 genes with four tests).
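The significance cutoff is simply the Bonferroni correction over all tests performed:

```python
alpha, n_genes, tests_per_gene = 0.05, 18_500, 4
threshold = alpha / (n_genes * tests_per_gene)
print(f"{threshold:.3e}")   # 6.757e-07, the genome-wide cutoff quoted above
```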
Computing Receiver Operating Characteristic Curves and Classification Accuracy within 605 DDD Disease-Associated Genes To test whether the deep learning network was indeed discriminating between pathogenic and benign variants within the same gene, rather than favoring pathogenicity in genes with a de novo dominant mode of inheritance, we identified a set of 605 genes that were associated with neurodevelopmental disease with p-value < 0.05 in the DDD cohort (calculated using de novo protein-truncating variation alone) (Supplementary Table 18). We report the Wilcoxon rank-sum p-value for all the classifiers in their ability to separate the probability distributions of variants in the 605 genes in the DDD and control datasets (C and Supplementary Table 19).
Within this set of 605 genes, we observe an enrichment ratio for de novo missense variants that is three times what is expected by mutational rate alone. This indicates that de novo missense variants in DDD affected patients comprise approximately 67% pathogenic variants and 33% background variants, while the de novo missense variants in healthy controls are comprised largely of background variants, except for instances of incomplete penetrance.
To calculate the maximum achievable AUC for a classifier that perfectly discriminates between pathogenic and benign variants, we took into account that only 67% of the de novo missense variants in affected individuals within the 605 genes were pathogenic, and the rest were background. To construct a receiver operating characteristic curve, we treated classification of de novo DDD variants as pathogenic as true positive calls, and treated classification of de novo variants in healthy controls as pathogenic as false positive calls. Hence, a perfect classifier would classify 67% of de novo variants in the DDD patients as true positives, 33% of de novo variants in the DDD patients as false negatives, and 100% of de novo variants in controls as true negatives.
Visualization of the receiver operating characteristic curve would show only a single point with 67% True Positive Rate and 0% False Positive Rate, connected to the (0%, 0%) and (100%, 100%) corners of the plot by straight lines, yielding a maximum AUC of 0.837 for a classifier with perfect discrimination of benign and pathogenic mutations (B and Supplementary Table 19).
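The maximum AUC follows from the geometry of that curve: with a truly pathogenic fraction p among case variants, the ideal ROC runs (0, 0) → (0, p) → (1, 1), so the area is (1 + p)/2. A quick check (note that the rounded 67% gives ≈0.835; the reported 0.837 corresponds to an exact pathogenic fraction of about 0.674):

```python
def max_auc(pathogenic_fraction):
    # Area under the polyline (0,0) -> (0,p) -> (1,1): a trapezoid of width 1
    # with parallel sides p and 1, i.e. (1 + p) / 2.
    return (1.0 + pathogenic_fraction) / 2.0

print(max_auc(0.67))    # 0.835
print(max_auc(0.674))   # 0.837
```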
We calculated the classification accuracy of the deep learning network for separating pathogenic and benign variants at a binary threshold by estimating the expected fraction of pathogenic variants within the 605 genes in the combined DDD and healthy control datasets. Since the DDD dataset contained 379 de novo variants with an excess of 249 de novo missense variants over expectation, and the control dataset contained 65 de novo variants, we expected 249 pathogenic variants out of 444 total variants (A). We selected the threshold for each classifier that separated the 444 de novo missense variants into benign or pathogenic categories according to this expected proportion, and used this as a binary cutoff to evaluate the accuracy of each classifier. For our deep learning model, this threshold was attained at a cutoff of ≥ 0.803, with a True Positive Rate of 65% and a False Positive Rate of 14%. To calculate the classification accuracy adjusted for the presence of ~33% background variants in the DDD individuals, we assumed that the 33% of de novo DDD variants which were background would be classified at the same False Positive Rate as we observed in the healthy controls. This corresponds to 14% × 0.33 = 4.6% of the true positive classification events in the DDD dataset actually being False Positives from background variants. We estimate the adjusted True Positive Rate for the deep learning network to be (65% - 4.6%) / 67% = 90%. We report the average of the True Positive Rate and the True Negative Rate, which is 88% for the deep learning network (C and Supplementary Table 19). This estimate is likely to underestimate the true accuracy of the classifier, due to the high prevalence of incomplete penetrance in neurodevelopmental disorders.
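The adjustment reduces to a few lines of arithmetic, reproduced here for clarity:

```python
tpr_raw, fpr, background_fraction, pathogenic_fraction = 0.65, 0.14, 0.33, 0.67

# Background de novo variants in DDD cases are assumed to be flagged at the
# control False Positive Rate, inflating the raw True Positive Rate.
false_pos_share = fpr * background_fraction                        # 0.0462, ~4.6%
tpr_adjusted = (tpr_raw - false_pos_share) / pathogenic_fraction   # ~0.90
tnr = 1.0 - fpr                                                    # 0.86

print(round(tpr_adjusted, 2), round((tpr_adjusted + tnr) / 2, 2))  # 0.9 0.88
```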
ClinVar Classification Accuracy Most of the existing classifiers are trained on ClinVar; even classifiers that do not directly train on ClinVar may be affected by using prediction scores from classifiers that are trained on ClinVar. In addition, common human variants are highly enriched for benign ClinVar consequence, because allele frequency is part of the criteria for assigning benign consequence to a variant.
We tried to minimize the circularity in the ClinVar dataset to make it suitable for analysis by using only ClinVar variants that were added in 2017, as the other classification methods were published in previous years.
Even among the 2017 ClinVar variants, we excluded any variant present at common allele frequencies (> 0.1%) in ExAC, or present in HGMD, LSDB, or UniProt. After filtering all such variants, and excluding variants of uncertain significance and those with conflicting interpretations, we were left with 177 variants with benign annotation and 969 variants with pathogenic annotation in ClinVar.
We scored all ClinVar variants using both the deep learning network and the existing methods. We selected the threshold for each classifier that separated the ClinVar variants into benign or pathogenic categories according to the observed proportion of benign and pathogenic variants within this dataset, and used this as a binary cutoff to evaluate the accuracy of each classifier. We report the average of the True Positive Rate and the True Negative Rate for each classifier (FIGs. 31A and 31B). The performance of classifiers on the ClinVar dataset was not significantly correlated with the performance of the classifiers on either the classification accuracy on the 10,000 withheld primate variants or the Wilcoxon rank-sum p-value for the DDD cases vs. controls dataset (FIGs. 31A and 31B).
We hypothesize that existing classifiers accurately model the behavior of human experts, but that human heuristics may not be fully optimal for discriminating between pathogenic and benign mutations in empirical data. One such example is the Grantham score, which provides a distance metric for characterizing the similarity or dissimilarity of amino acid substitutions. We computed the mean Grantham score for the pathogenic and benign variants within the complete ClinVar dataset (~42,000 variants), and compared this with the mean Grantham score for de novo variants in DDD affected and unaffected individuals within the 605 genes. To correct for the presence of ~33% background variants in the DDD affected individuals, we increased the difference in the Grantham score between DDD cases vs. controls by 50%, which was still smaller than the difference between pathogenic and benign variants in ClinVar. One possibility is that human experts place too much weight on measures that are easy to measure, such as amino acid substitution distance, while underweighting factors such as protein structure, which are more difficult for a human expert to quantify.
Interpreting the Deep Learning Models Understanding the means by which machine learning algorithms solve problems is often difficult. We visualized the initial layers of the deep learning network to understand the features that it had learned to extract in order to predict variant pathogenicity. We calculated the correlation coefficients for different amino acids within the first three layers (the first convolutional layer following two upsampling layers) of the pretrained 3-state secondary structure prediction network, and showed that the weights of the convolutional layers learn features very similar to the BLOSUM62 matrix or Grantham distance.
To compute the correlation coefficients between the different amino acids, we started with the weights of the first convolutional layer preceded by the upsampling layers (layers 1a, 1b, and 1c) in the secondary structure model. We performed matrix multiplication between the three layers, resulting in a matrix with dimensions (20, 5, 40), where 20 is the number of amino acids, 5 is the window size of the convolutional layer, and 40 is the number of kernels. We reshaped the matrix to have the dimension (20, 200) by flattening the last two dimensions, obtaining a matrix in which the weights operating on each of the 20 amino acids were represented as a 200-length vector. We then calculated the correlation matrix between the 20 amino acids. Since each dimension represents an amino acid, by calculating the correlation coefficient matrix we are calculating the correlation between the amino acids, and how similar they appear to the deep learning network, based on what it has learned from the training data. The visualization of the correlation coefficient matrix (amino acids sorted by BLOSUM62 matrix order) shows two prominent clusters, comprising the hydrophobic amino acids (Methionine, Isoleucine, Leucine, Valine, Phenylalanine, Tyrosine, Tryptophan) and the hydrophilic amino acids (Asparagine, Aspartic Acid, Glutamic Acid, Glutamine, Arginine, and Lysine). The outputs of these initial layers become the inputs for later layers, enabling the deep learning network to construct increasingly complex hierarchical representations of the data.
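The computation can be reproduced with a few NumPy operations; here the layer weights are random stand-ins with the stated shapes (in practice they would be read out of the trained secondary structure model):

```python
import numpy as np

w_up = np.random.randn(20, 40)       # composed upsampling weights: amino acid -> 40 channels
w_conv = np.random.randn(5, 40, 40)  # first Conv1D weights: (window, in_channels, kernels)

# Effective weights applied to each amino acid, then flattened to (20, 200).
effective = np.einsum("ak,wkj->awj", w_up, w_conv)   # (20, 5, 40)
flat = effective.reshape(20, -1)

# (20, 20) correlation between amino-acid weight vectors, as visualized above.
corr = np.corrcoef(flat)
```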
To illustrate the window of amino acid sequence used by the neural network in its predictions, we perturbed each position in and around 5,000 randomly selected variants to measure its effects on the predicted PrimateAI score for the variant (B). We systematically zeroed out the inputs at each nearby amino acid position (-25 to +25) around the variant, measured the change in the neural network's predicted pathogenicity of the variant, and plotted the average absolute value of the change across the 5,000 variants. Amino acids near the variant have the greatest effect, in a roughly symmetric distribution, gradually tailing off as distance from the variant increases. Importantly, the model makes its predictions based not only on the amino acid at the position of the variant, but by using information from a wider window, as would be needed to recognize protein motifs. Consistent with the relatively compact size of protein subdomains, we empirically observed that extending the size of the window to more than 51 amino acids did not further improve accuracy.
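This perturbation analysis is essentially an occlusion scan. A minimal sketch follows, where `predict_fn` (the model's scoring function) and the shape of the encoded input are assumptions made for illustration.

```python
import numpy as np

def positional_sensitivity(predict_fn, encoded_input, variant_pos, window=25):
    """Zero out the input encoding at each position around the variant and
    record the absolute change in the predicted pathogenicity score.

    predict_fn: assumed model scoring function returning a scalar.
    encoded_input: array of shape (sequence_length, channels).
    variant_pos: index of the variant within the sequence.
    """
    baseline = predict_fn(encoded_input)
    offsets = list(range(-window, window + 1))
    deltas = []
    for off in offsets:
        pos = variant_pos + off
        if not (0 <= pos < encoded_input.shape[0]):
            deltas.append(0.0)        # position falls outside the sequence
            continue
        perturbed = encoded_input.copy()
        perturbed[pos, :] = 0.0       # occlude this amino acid position
        deltas.append(abs(predict_fn(perturbed) - baseline))
    return offsets, deltas            # averaged over variants when plotting
```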
To evaluate the sensitivity of the deep learning classifier to alignment depth, we evaluated the effects of the depth of the alignment on the accuracy of variant classification as follows. We split the data into five bins based on the number of species in the alignment, and evaluated the accuracy of the network in each bin. We found that the accuracy of the network at separating a set of withheld benign mutations from randomly selected mutations that were matched for trinucleotide context (as in D, but performed separately for each bin) is strongest in the top three bins and noticeably weaker in the bottom two bins. The 99-vertebrate multi-species alignment comprises 11 non-human primates, 50 mammals, and 38 vertebrates, with the bottom two bins representing proteins that have sparse alignment information from other non-primate mammals. The deep learning network is robust and accurate when alignment information extends throughout primates and mammals, with conservation information from more distant vertebrates being less important.
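A sketch of the binning procedure follows, with all input names assumed: parallel arrays give each variant's alignment depth, classifier score (higher assumed to mean more benign here), and label (1 for a withheld benign variant, 0 for a matched control mutation). Rank-based AUC via the Mann-Whitney U statistic is used as one reasonable separation measure.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def separation_by_alignment_depth(depths, scores, labels, n_bins=5):
    """Bin variants by the number of species in their alignment and, within
    each bin, measure how well the scores separate withheld benign variants
    (labels == 1) from matched control mutations (labels == 0)."""
    depths, scores, labels = map(np.asarray, (depths, scores, labels))
    edges = np.quantile(depths, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.digitize(depths, edges[1:-1]), 0, n_bins - 1)
    aucs = []
    for b in range(n_bins):
        m = bin_ids == b
        benign = scores[m & (labels == 1)]
        control = scores[m & (labels == 0)]
        u, _ = mannwhitneyu(benign, control)
        aucs.append(u / (len(benign) * len(control)))  # rank AUC in this bin
    return aucs
```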
Definition of Canonical Coding Regions

To define canonical coding regions, multiple alignments of 99 vertebrate genomes with human for coding DNA sequence (CDS) regions (knownCanonical.exonNuc.fa.gz) were downloaded from the UCSC genome browser. For human, the coordinates of exons are under Build hg19. Exons were merged to form genes, and genes on autosomes and chrX were retained. Nonhomologous genes were removed, where the list of homologous genes was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data). For SNPs with multiple gene annotations, the longest transcript was selected to represent the SNP's annotation.
Human, Ape, and Mammal Polymorphism Data

We downloaded human exome polymorphism data from a recent large-scale study, the genome Aggregation Database (gnomAD), which collected whole-exome sequencing data from 123,136 individuals across 8 subpopulations worldwide. We then extracted variants that pass the filters and fall within the canonical coding regions.
The great ape genome sequencing project provides whole genome sequencing data for 24 chimpanzees, 13 bonobos, 27 gorillas, and 10 orangutans (including 5 Sumatran and 5 Bornean orangutans). A study on chimpanzees and bonobos provides WGS for an additional 25 apes. As all the sequencing data were mapped to hg19, we downloaded the VCF files from these studies and directly extracted variants within the canonical coding regions.
To compare with other apes and mammals, we also downloaded SNPs of a few other species from dbSNP, including rhesus, marmoset, pig, cow, goat, mouse, and chicken. We discarded other species, such as dog, cat, or sheep, as dbSNP provides limited numbers of variants for those species. We initially lifted over the SNPs of each species to hg19, and found that about 20% of variants mapped to pseudogene regions. We therefore obtained the exon coordinates of each species from the multiple alignment file of 100 vertebrates for canonical coding regions, extracted variants within those exons, and then lifted those extracted SNPs over to hg19. If variants were on a different genome build of the species from the alignment, we first lifted the variants to the genome build of the alignment.
As cow SNP data comes from various studies, we downloaded from dbSNP all the large batches of cow variants (16 batches with VCF files > 100 MB), and evaluated the quality of the different batches of cow SNPs by computing missense-to-synonymous ratios for each batch. The median of the missense-to-synonymous ratios is 0.781 and the median absolute deviation (MAD) is 0.160 (mean 0.879, SD 0.496). Two batches with outlier ratios (snpBatch_1000_BULL_GENOMES_1059190.gz with a ratio of 1.391 and snpBatch_COFACTOR_GENOMICS_1059634.gz with a ratio of 2.568) were excluded from further analysis.
Evaluation of Properties of Polymorphisms in Apes and Mammals

To demonstrate the utility of great ape SNPs, we devised an enrichment score that measures the ratio of the number of singletons to the number of common SNPs (allele frequency (AF) > 0.1%). Synonymous variants are known to be benign and generally evolve neutrally without selection pressure. Deleterious missense variants are gradually purged by natural selection, so their allele frequency distribution tends to show an excess of rare variants compared with synonymous variants.
We focused on those gnomAD SNPs that overlap with SNPs observed in primates, mammals, and poultry. We counted the number of synonymous and missense variants per species. Missense variants were further classified into two types: those that share identical amino acid changes in another species, termed "missense identical", and those that have different amino acid changes in another species, termed "missense different". The enrichment scores were then computed per species as the ratio of the number of singletons to the number of common variants.
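In code, the per-class score is just a singleton-to-common ratio; the counts in the example below are illustrative placeholders, not values from the study.

```python
def enrichment_scores(counts):
    """counts: mapping of variant class -> (n_singletons, n_common), e.g.
    {"synonymous": (120, 300),
     "missense_identical": (80, 190),
     "missense_different": (240, 310)}   # illustrative numbers only
    Returns the singleton/common enrichment score per class.
    """
    return {cls: singletons / common for cls, (singletons, common) in counts.items()}
```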
In addition, we performed a Chi-squared (χ2) test of homogeneity on 2x2 contingency tables to compare enrichment scores between synonymous and missense identical variants for each species. All the primates demonstrate no significant difference in enrichment scores between synonymous and missense identical variants, while cow, mouse, and chicken show significant differences.
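A minimal sketch of this test using SciPy, assuming a 2x2 table of (singleton, common) counts for the two variant classes in one species; the numbers are placeholders.

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows are variant classes, columns are
# (singleton, common) counts. Counts are illustrative only.
table = [[120, 300],   # synonymous: singletons, common
         [80, 190]]    # missense identical: singletons, common

chi2, p_value, dof, expected = chi2_contingency(table)
# A non-significant p-value means the two classes show the same
# singleton/common ratio, i.e. the missense-identical variants behave
# like neutral synonymous variants in that species.
```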
The results revealed that SNPs that share identical amino acid changes in great apes tend to have very similar enrichment scores to synonymous SNPs, implying that they tend to have mild effects on human health.
In contrast, those that have different amino acid changes, or that are absent in great apes, have enrichment scores that significantly deviate from those of synonymous SNPs. Missense polymorphisms from non-primate species also have an allele frequency distribution that differs from that of synonymous variants. The conclusion is that SNPs that share identical amino acid changes in great apes can be added to the training set of benign variants.
Our assumption is that most variants are independently derived, not generated by identity-by-descent (IBD). Thus, we performed an enrichment analysis of rare variants among IBD SNPs to evaluate whether their enrichment scores behave differently. IBD SNPs are defined as human SNPs that appear in both human and two or more great ape species, including chimpanzee, bonobo, gorilla, S. orangutan, and B. orangutan. The enrichment scores, defined as the number of singletons divided by the number of common variants (AF > 0.1%), were then calculated separately for missense variants and for synonymous variants, which are considered neutral and serve as the baseline for comparison.
Fixed Substitutions Among Mammal Species

Enrichment Analysis of Fixed Substitutions

We also performed a rare variant enrichment analysis of inter-species substitutions. We downloaded the phylogenetic tree of 100 vertebrate species from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/hg19.100way.commonNames.nh). We then calculated pairwise phylogenetic distances and selected closely related species pairs (distance < 0.3). To obtain primate species pairs, we downloaded alignments (hg38) of 19 mammalian (16 primate) genomes with human for CDS regions from the UCSC genome browser. Four primate pairs were added to 13 vertebrate pairs. The following table shows the genetic distance of multiple pairs of closely related species, according to one implementation.
Species1 | Species2 | Distance
---|---|---
Chimp | Bonobo | 9195
Rhesus | Crab-eating macaque | 0.00576807
Marmoset | Squirrel monkey | 0.0680206
Proboscis monkey | Golden snub-nosed monkey | 0.01339514
Baboon | Green-monkey | 0.045652
Horse | White-rhinoceros | 97
Sheep | Domestic-goat | 0.1
Mouse | Rat | 0.176098
Chinchilla | Brush-tailed-rat | 0.16
Cape-golden-mole | Tenrec | 0.265936
Chinese-hamster | Golden-hamster | 0.08
David-myotis-bat | Microbat | 0.08254
Dolphin | Killer-whale | 0.074368
Saker-falcon | Peregrine-falcon | 0.2
Chicken | Turkey | 0.126972
Green-sea-turtle | Painted-turtle | 0.2
Chinese-softshell-turtle | Spiny-softshell-turtle | 0.2

We took the multiple alignments of the 19 mammalian or 99 vertebrate genomes with human within canonical coding regions and obtained nucleotide substitutions between each selected pair of vertebrates. These substitutions were mapped to human exome SNPs from gnomAD, requiring identical codon changes between the species pair and the human variants. We classified variants into three types: synonymous variants, missense variants that share identical amino acid changes in another species, and missense variants that have different amino acid changes in another species. The enrichment scores were computed for each class per species pair.
Comparison of Intraspecies and Interspecies Polymorphisms

Six species were selected for the comparison of intraspecies and interspecies polymorphisms, including chimpanzee, rhesus, marmoset, goat, mouse, and chicken, as both intraspecies and interspecies variants are available for these species. Comparing enrichment scores of intraspecies and interspecies variants is similar to comparing the odds ratios of two 2x2 contingency tables. The Woolf test is commonly applied to evaluate the homogeneity of odds ratios between contingency tables. Therefore, we utilized the Woolf test to evaluate the difference in enrichment scores between intraspecies and interspecies polymorphisms.
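The Woolf test pools log odds ratios weighted by their inverse variances and tests their homogeneity against a chi-squared distribution. The implementation below is a standard textbook version written for this setting, not code from the disclosure; the table layout is an assumption.

```python
import numpy as np
from scipy.stats import chi2

def woolf_test(tables):
    """Woolf test of homogeneity of odds ratios across 2x2 tables.

    tables: list of 2x2 arrays [[a, b], [c, d]]; here each table would hold
    (singleton, common) counts for missense vs synonymous variants, one
    table for intraspecies and one for interspecies variants.
    A 0.5 continuity correction is added to every cell.
    """
    log_or, weights = [], []
    for t in tables:
        a, b, c, d = (np.asarray(t, dtype=float) + 0.5).ravel()
        log_or.append(np.log(a * d / (b * c)))
        weights.append(1.0 / (1 / a + 1 / b + 1 / c + 1 / d))
    log_or, weights = np.array(log_or), np.array(weights)
    pooled = np.sum(weights * log_or) / np.sum(weights)
    stat = np.sum(weights * (log_or - pooled) ** 2)  # ~ chi2 with k-1 df
    p_value = chi2.sf(stat, df=len(tables) - 1)
    return stat, p_value
```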
Per-Gene Enrichment Analysis

One implementation of per-gene enrichment analysis is shown in the accompanying figure. In one implementation, the deep convolutional neural network-based variant pathogenicity classifier is further configured to implement a per-gene enrichment analysis which confirms the pathogenicity of variants that have been determined to be pathogenic. For a particular gene sampled from a cohort of individuals with a genetic disorder, the per-gene enrichment analysis includes applying the deep convolutional neural network-based variant pathogenicity classifier to identify candidate variants in the particular gene that are pathogenic, determining a baseline number of mutations for the particular gene based on summing observed trinucleotide mutation rates of the candidate variants and multiplying the sum by a transmission count and a size of the cohort, applying the deep convolutional neural network-based variant pathogenicity classifier to identify de novo missense variants in the particular gene that are pathogenic, and comparing the baseline number of mutations with a count of the de novo missense variants. Based on an output of the comparison, the per-gene enrichment analysis confirms that the particular gene is associated with the genetic disorder and that the de novo missense variants are pathogenic. In some implementations, the genetic disorder is autism spectrum disorder (abbreviated ASD). In other implementations, the genetic disorder is developmental delay disorder (abbreviated DDD).
In the example shown in the figure, five candidate variants in a particular gene have been classified as pathogenic by the deep convolutional neural network-based variant pathogenicity classifier. These five candidate variants have respective observed trinucleotide mutation rates of 10^-8, 10^-2, 10^-1, 10^5, and 10^1. The baseline number of mutations for the particular gene is determined to be 10^-5 based on summing the respective observed trinucleotide mutation rates of the five candidate variants and multiplying the sum by a transmission/chromosome count (2) and a size of the cohort (1000). This is then compared with the de novo variant count (3).
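The comparison step only requires a statistical test producing a p-value (as noted below); a one-sided Poisson test against the baseline expectation is one plausible instantiation, sketched here with illustrative numbers rather than values from the figure.

```python
from scipy.stats import poisson

def per_gene_enrichment(trinucleotide_rates, transmission_count, cohort_size,
                        de_novo_count):
    """Baseline expected mutation count for a gene: sum of the candidate
    variants' observed trinucleotide mutation rates, multiplied by the
    transmission/chromosome count and the cohort size. The observed de novo
    missense count is compared against this expectation; a one-sided
    Poisson test is one plausible choice of statistical test.
    """
    expected = sum(trinucleotide_rates) * transmission_count * cohort_size
    p_value = poisson.sf(de_novo_count - 1, expected)  # P(X >= observed)
    return expected, p_value

# Illustrative call: five candidate rates, transmission count 2,
# a cohort of 1000 individuals, and 3 observed de novo missense variants.
expected, p = per_gene_enrichment([1e-8, 1e-7, 1e-8, 1e-9, 1e-8], 2, 1000, 3)
```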
In some implementations, the deep convolutional neural network-based variant pathogenicity classifier is further configured to perform the comparison using a statistical test that produces a p-value as the output.
In other implementations, the deep convolutional neural network-based variant pathogenicity classifier is further configured to compare the baseline number of mutations with the count of the de novo missense variants and, based on the output of the comparison, confirm that the particular gene is not associated with the genetic disorder and that the de novo missense variants are benign.
Genome-Wide Enrichment Analysis

One implementation of genome-wide enrichment analysis is shown in the accompanying figure. In another implementation, the deep convolutional neural network-based variant pathogenicity classifier is further configured to implement a genome-wide enrichment analysis which confirms the pathogenicity of variants that have been determined to be pathogenic. The genome-wide enrichment analysis includes applying the deep convolutional neural network-based variant pathogenicity classifier to identify a first set of de novo missense variants that are pathogenic in a plurality of genes sampled from a cohort of healthy individuals, applying the deep convolutional neural network-based variant pathogenicity classifier to identify a second set of de novo missense variants that are pathogenic in the plurality of genes sampled from a cohort of individuals with a genetic disorder, comparing respective counts of the first and second sets, and, based on an output of the comparison, confirming that the second set of de novo missense variants is enriched in the cohort of individuals with the genetic disorder and therefore pathogenic. In some implementations, the genetic disorder is autism spectrum disorder (abbreviated ASD). In other implementations, the genetic disorder is developmental delay disorder (abbreviated DDD).
In some implementations, the deep convolutional neural network-based variant pathogenicity classifier is further configured to perform the comparison using a statistical test that produces a p-value as the output. In one implementation, the comparison is further parametrized by the respective cohort sizes.
In some implementations, the deep convolutional neural network-based variant pathogenicity classifier is further configured to compare the respective counts of the first and second sets and, based on the output of the comparison, confirm that the second set of de novo missense variants is not enriched in the cohort of individuals with the genetic disorder and is therefore benign.
In the example shown in the figure, the mutation rate in the healthy cohort (0.001) and the mutation rate in the affected cohort (0.004) are illustrated, along with the per-individual mutation ratio (4).
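The cohort-size-parametrized comparison can be sketched as follows; the one-sided Poisson test is one plausible instantiation of the p-value-producing test, and the function and argument names are assumptions.

```python
from scipy.stats import poisson

def genome_wide_enrichment(affected_count, affected_n, healthy_count, healthy_n):
    """Compare de novo missense counts between cohorts, parametrized by the
    respective cohort sizes. The healthy cohort's per-individual rate sets
    the null expectation for the affected cohort.
    """
    null_rate = healthy_count / healthy_n
    expected_affected = null_rate * affected_n
    p_value = poisson.sf(affected_count - 1, expected_affected)  # P(X >= observed)
    fold = (affected_count / affected_n) / null_rate             # per-individual ratio
    return fold, p_value

# Mirrors the figure's rates: 0.004 vs 0.001 gives a per-individual ratio of 4.
fold, p = genome_wide_enrichment(400, 100_000, 100, 100_000)
```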
Particular Implementations

We describe systems, methods, and articles of manufacture for constructing a variant pathogenicity classifier. One or more features of an implementation can be combined with the base implementation.
Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections – these recitations are hereby incorporated forward by reference into each of the following implementations.
A system implementation of the technology disclosed includes one or more processors coupled to the memory. The memory is loaded with computer instructions to train a splice site detector that identifies splice sites in genomic sequences (e.g., nucleotide sequences).
As shown in FIGs. 48 and 19, the system trains a convolutional neural network-based variant pathogenicity classifier, which runs on numerous processors coupled to memory. The system uses benign training examples and pathogenic training examples of protein sequence pairs generated from benign variants and pathogenic variants. The benign variants include common human missense variants and non-human primate missense variants occurring on alternative non-human primate codon sequences that share matching reference codon sequences with humans. The phrase "protein sequence pairs" refers to a reference protein sequence and an alternative protein sequence, wherein the reference protein sequence comprises reference amino acids formed by reference triplet nucleotide bases (reference codons) and the alternative protein sequence comprises alternative amino acids formed by alternative triplet nucleotide bases (alternative codons), such that the alternative protein sequence is produced as a result of a variant occurring in the reference triplet nucleotide bases (reference codons) forming the reference amino acids of the reference protein sequence. The variant can be a SNP, an insertion, or a deletion.
This system implementation and other systems disclosed optionally include one or more of the following features. The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
As shown in the figures, the common human missense variants have a minor allele frequency (abbreviated MAF) greater than 0.1% across a human population variant dataset sampled from at least 100,000 humans.
As shown in the figures, the sampled humans belong to different human subpopulations, and the common human missense variants have a MAF greater than 0.1% within respective human subpopulation variant datasets.
The human subpopulations include African/African American (abbreviated AFR), Admixed American (abbreviated AMR), Ashkenazi Jewish (abbreviated ASJ), East Asian (abbreviated EAS), Finnish (abbreviated FIN), Non-Finnish European (abbreviated NFE), South Asian (abbreviated SAS), and Others (abbreviated OTH).
As shown in FIGs. 43 and 44, the non-human primate missense variants include missense variants from a plurality of non-human primate species, including Chimpanzee, Bonobo, Gorilla, B. Orangutan, S. Orangutan, Rhesus, and Marmoset.
As shown in FIGs. 45 and 46, based on an enrichment analysis, the system accepts a particular non-human primate species for inclusion of missense variants of the particular non-human primate species among the benign variants. The enrichment analysis includes, for the particular non-human primate species, comparing a first enrichment score of synonymous variants of the particular non-human primate species to a second enrichment score of missense identical variants of the particular non-human primate species. The figures show one implementation of human orthologous missense SNPs: a missense SNP in a non-human species that has matching reference and alternative codons with humans. As shown in the figures, the missense identical variants are missense variants that share matching reference and alternative codon sequences with humans.
As shown in FIGs. 46 and 47, the first enrichment score is produced by determining a ratio of rare synonymous variants with a MAF less than 0.1% over common synonymous variants with a MAF greater than 0.1%. The second enrichment score is produced by determining a ratio of rare missense identical variants with a MAF less than 0.1% over common missense identical variants with a MAF greater than 0.1%. Rare variants include singleton variants.
As shown in FIGs. 46 and 47, when a difference between the first enrichment score and the second enrichment score is within a predetermined range, the system accepts the particular non-human primate species for inclusion of missense variants of the particular non-human primate among the benign variants. The difference being in the predetermined range indicates that the missense identical variants are under the same degree of natural selection as the synonymous variants and are therefore as benign as the synonymous variants.
As shown in the figures, the system repeatedly applies the enrichment analysis to accept a plurality of non-human primate species for inclusion of missense variants of those non-human primate species among the benign variants. The system further includes a chi-squared test of homogeneity to compare a first enrichment score of synonymous variants and a second enrichment score of missense identical variants for each of the non-human primate species.
As shown in the figures, a count of the non-human primate missense variants is at least 100,000. In one implementation, the count of the non-human primate missense variants is 385,236. A count of the common human missense variants is at least 50,000. In one implementation, the count of the common human missense variants is 83,546.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above. Yet another implementation may include a method performing actions of the system described above.
Another system implementation of the technology disclosed includes constructing a single nucleotide polymorphism (abbreviated SNP) pathogenicity classifier. The system trains a convolutional neural network-based SNP pathogenicity classifier, which runs on numerous processors coupled to memory, using benign training examples and pathogenic training examples of amino acid sequences expressed by benign SNPs and pathogenic SNPs. The benign training examples include first and second sets of nucleotide sequences, expressed as amino acid sequence pairs, each amino acid sequence including a central amino acid flanked by upstream and downstream amino acids. Each amino acid sequence pair includes a reference sequence of amino acids expressed by a reference nucleotide sequence and an alternative sequence of amino acids expressed by an alternative nucleotide sequence containing a SNP.
As shown in the figures, the first set comprises human nucleotide sequence pairs, with each pair including a human alternative nucleotide sequence containing a SNP and having a minor allele frequency (abbreviated MAF) deemed to be common within a human population. The second set comprises a non-human primate reference nucleotide sequence paired with a non-human primate alternative nucleotide sequence. The non-human primate reference nucleotide sequence has an orthologous human nucleotide reference sequence. The non-human primate alternative nucleotide sequence contains a SNP.
Each of the features discussed in this particular implementation section for the first system implementation applies equally to this system implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above. Yet another implementation may include a method performing actions of the system described above.
As shown in FIGs. 48 and 19, a first method implementation of the technology disclosed includes constructing a variant pathogenicity classifier. The method includes training a convolutional neural network-based variant pathogenicity classifier, which runs on numerous processors coupled to memory, using benign training examples and pathogenic training examples of protein sequence pairs generated from benign variants and pathogenic variants. The benign variants include common human missense variants and non-human primate missense variants occurring on alternative non-human primate codon sequences that share matching reference codon sequences with humans.
Each of the features discussed in this particular implementation section for the first system implementation applies equally to this method implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
As shown in FIGs. 48 and 19, a second method implementation of the technology disclosed includes constructing a single nucleotide polymorphism (abbreviated SNP) pathogenicity classifier. The method further includes training a convolutional neural network-based SNP pathogenicity classifier, which runs on numerous processors coupled to memory, using benign training examples and pathogenic training examples of amino acid sequences expressed by benign SNPs and pathogenic SNPs. The benign training examples include first and second sets of nucleotide sequences, expressed as amino acid sequence pairs, each amino acid sequence including a central amino acid flanked by upstream and downstream amino acids, and each amino acid sequence pair including a reference sequence of amino acids expressed by a reference nucleotide sequence and an alternative sequence of amino acids expressed by an alternative nucleotide sequence containing a SNP. The first set comprises human nucleotide sequence pairs, with each pair including a human alternative nucleotide sequence containing a SNP and having a minor allele frequency (abbreviated MAF) deemed to be common within a human population. The second set comprises a non-human primate reference nucleotide sequence paired with a non-human primate alternative nucleotide sequence. The non-human primate reference nucleotide sequence has an orthologous human nucleotide reference sequence and the non-human primate alternative nucleotide sequence contains a SNP.
Each of the features discussed in this particular implementation section for the second system implementation applies equally to this method implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
We describe systems, methods, and articles of manufacture for using a deep convolutional neural network-based variant pathogenicity classifier with secondary structure and solvent accessibility classifiers. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections – these recitations are hereby incorporated forward by reference into each of the following implementations.
A system implementation of the technology disclosed includes one or more processors coupled to the memory. The memory is loaded with computer instructions to run a deep convolutional neural network-based variant pathogenicity classifier with secondary structure and solvent accessibility classifiers.
The system comprises a first secondary structure subnetwork, running on numerous processors coupled to memory, trained to predict a three-state secondary structure at amino acid locations within a protein sequence.
The system further includes a second solvent accessibility subnetwork, running on numerous processors coupled to memory, trained to predict a three-state solvent accessibility at amino acid locations within a protein sequence.
The three-state secondary structure refers to one of a plurality of protein secondary structure states: alpha helix (H), beta sheet (B), and coil (C).
The three-state solvent accessibility refers to one of a plurality of protein solvent accessibility states: buried (B), intermediate (I), and exposed (E).
The system includes a positional frequency matrix (abbreviated PFM) generator, running on at least one of the numerous processors, applied to three sequence groups of primates, mammals, and vertebrates excluding primates and mammals to generate a primate PFM, a mammal PFM, and a vertebrate PFM.
In other words, this includes applying the PFM generator to the primates sequence data to generate a primate PFM, applying the PFM generator to the mammals sequence data to generate a mammal PFM, and applying the PFM generator to the vertebrates sequence data not including the primates and mammals sequence data to generate a vertebrate PFM.
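A position frequency matrix of this kind can be built directly from an aligned sequence group. The sketch below is a minimal illustration; the amino acid alphabet ordering, gap handling, and normalization are assumptions, since the disclosure does not specify them here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def position_frequency_matrix(aligned_sequences):
    """Build a 20 x L position frequency matrix from equal-length aligned
    sequences of one group (primates, mammals, or vertebrates excluding
    primates and mammals). Gaps and unknown residues are ignored; each
    column is normalized to frequencies.
    """
    length = len(aligned_sequences[0])
    counts = np.zeros((len(AMINO_ACIDS), length))
    for seq in aligned_sequences:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:           # skip gaps ('-') and unknowns ('X')
                counts[AA_INDEX[aa], pos] += 1
    totals = counts.sum(axis=0, keepdims=True)
    return counts / np.maximum(totals, 1)  # avoid division by zero
```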
The system includes an input processor that accepts a variant amino acid sequence with a target variant amino acid flanked upstream and downstream by at least 25 amino acids in each direction, wherein a single nucleotide variant produces the target variant amino acid. It also includes a supplemental data allocator, running on at least one of the numerous processors, that allocates a reference amino acid sequence with a target reference amino acid flanked upstream and downstream by at least 25 amino acids in each direction, aligned with the variant amino acid sequence. Following this, it allocates reference state classifications produced by the first and second subnetworks for the reference amino acid sequence.
After this, the supplemental data allocator allocates variant state classifications produced by the first and second subnetworks for the variant amino acid sequence. Finally, it allocates primate, mammal, and vertebrate PFMs aligned with the reference amino acid sequence.
In the context of this application, the phrase "aligned with" refers to position-wise determining primate, mammal, and vertebrate PFMs for each amino acid position in the reference amino acid sequence or the alternative amino acid sequence, and encoding and storing the results of the determination on a position-wise or ordinal-position basis in the same order as the amino acid positions occur in the reference amino acid sequence or the alternative amino acid sequence.
The system also includes a deep convolutional neural network, running on the numerous processors, trained to classify the variant amino acid sequence as benign or pathogenic based on processing the variant amino acid sequence, the allocated reference amino acid sequence, the allocated reference and variant state classifications, and the allocated PFMs. The system includes an output processor that reports at least a pathogenicity score for the variant amino acid sequence.
This system implementation and other systems disclosed optionally include one or more of the following features. The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
The system, comprising the deep convolutional neural network-based variant pathogenicity classifier, is further configured to classify the single nucleotide variant as benign or pathogenic based on the pathogenicity score.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/573,153 | 2017-10-16 | ||
US62/573,149 | 2017-10-16 | ||
US62/573,144 | 2017-10-16 | ||
US62/582,898 | 2017-11-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ788045A true NZ788045A (en) | 2022-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2021290303B2 (en) | Semi-supervised learning for training an ensemble of deep convolutional neural networks | |
AU2021269351B2 (en) | Deep learning-based techniques for pre-training deep convolutional neural networks | |
NZ788045A (en) | Deep convolutional neural networks for variant classification | |
HK40026566B (en) | Semi-supervised learning for training an ensemble of deep convolutional neural networks | |
HK40026566A (en) | Semi-supervised learning for training an ensemble of deep convolutional neural networks | |
NZ788839A (en) | Deep learning-based techniques for pre-training deep convolutional neural networks |