WO2023059750A1 - Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples - Google Patents
- Publication number
- WO2023059750A1 (PCT/US2022/045823)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- amino acid
- gapped
- pathogenicity
- protein
- computer
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
Definitions
- the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
- Genomics in the broad sense, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling and proteomics.
- Genomics arose as a data-driven science — it operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses.
- Applications of genomics include finding associations between genotype and phenotype, discovering biomarkers for patient stratification, predicting the function of genes, and charting biochemically active genomic regions such as transcriptional enhancers.
- Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models and to make predictions.
- machine learning algorithms are designed to automatically detect patterns in data.
- machine learning algorithms are suited to data-driven sciences and, in particular, to genomics.
- The performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For instance, to classify a tumor as malignant or benign from a fluorescence microscopy image, a preprocessing algorithm could detect cells, identify the cell type, and generate a list of cell counts for each cell type.
- a machine learning model can take the estimated cell counts, which are examples of handcrafted features, as input features to classify the tumor.
- a central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy.
- Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models. This outcome has been realized through the development of deep neural networks, machine learning models that comprise successive elementary operations, which compute increasingly complex features by taking the results of preceding operations as input.
- Deep neural networks are able to improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and spatial organization of cells in the above example.
- The construction and training of deep neural networks have been enabled by the explosion of data, algorithmic advances, and substantial increases in computational capacity, particularly through the use of graphics processing units (GPUs).
- the goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable.
- An example of a supervised learning problem is one that predicts whether an intron is spliced out or not (the target) given features on the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint or intron length.
- Training a machine learning model refers to learning its parameters, which commonly involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data.
- the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions.
- Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as deoxyribonucleic acid (DNA) sequence into k-mer counts) using a process called feature extraction to fit a tabular representation.
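- The feature extraction step mentioned above, turning a DNA sequence into k-mer counts, can be sketched as follows. This is a minimal illustration and not the method of the disclosure; the function name `kmer_counts` and the example sequence are assumptions made for this sketch.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int) -> dict[str, int]:
    """Count every overlapping k-mer in a DNA sequence.

    The result is a fixed-length feature table with one entry per possible
    k-mer over the alphabet ACGT (4**k entries), including zero counts.
    Windows containing other symbols (for example N) are not reported.
    """
    sequence = sequence.upper()
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


# Hypothetical example sequence, chosen only for illustration.
features = kmer_counts("ACGTACGA", k=2)
print(features["AC"], features["CG"], features["TT"])
```

- Because every sequence maps to the same 4**k columns, sequences of different lengths end up as rows of one table, which is the tabular representation described above.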
- the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint and the intron length can be preprocessed features collected in a tabular format.
- Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression, to more flexible nonlinear models, such as neural networks and many others.
- Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features mapped to the [0,1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or other linear classifiers that use different activation functions, are the weights in the weighted sum. Linear classifiers fail when the classes, for instance, that of an intron spliced out or not, cannot be well discriminated with a weighted sum of input features. To improve predictive performance, new input features can be manually added by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.
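- The logistic regression computation described above, and the way a manually added pairwise product can rescue a linear classifier, can be sketched as follows. The weights and bias are hypothetical values picked by hand for this illustration, not learned parameters, and the helper names are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to the [0, 1] interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of a weighted sum."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def add_pairwise_products(features):
    """Append the product of every pair of features as a new feature."""
    n = len(features)
    products = [features[i] * features[j] for i in range(n) for j in range(i + 1, n)]
    return list(features) + products


# An XOR-like target cannot be separated by a weighted sum of x1 and x2 alone,
# but becomes separable once the product x1 * x2 is added as a third feature.
weights, bias = [2.0, 2.0, -4.0], -1.0  # hypothetical, hand-picked values
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = predict_proba(add_pairwise_products([x1, x2]), weights, bias)
    print((x1, x2), round(p, 3))
```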
- Neural networks use hidden layers to learn these nonlinear feature transformations automatically.
- Each hidden layer can be thought of as multiple linear models with their output transformed by a nonlinear activation function, such as the sigmoid function or the more popular rectified-linear unit (ReLU).
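- The view of a hidden layer as several linear models followed by a nonlinear activation can be sketched as follows. The layer sizes, weights, and biases are hypothetical values chosen so the arithmetic is easy to follow; `dense_layer` is an illustrative name.

```python
def relu(z: float) -> float:
    """Rectified-linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation=relu):
    """Fully-connected layer.

    Each row of `weights` together with one bias is one linear model over all
    inputs; the activation is applied to each linear output independently.
    """
    return [
        activation(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]


# Two inputs feeding three hidden neurons (hypothetical parameters).
hidden = dense_layer(
    inputs=[1.0, -2.0],
    weights=[[1.0, 1.0], [-1.0, -1.0], [0.5, -0.5]],
    biases=[0.0, 0.0, 0.5],
)
print(hidden)
```

- Stacking several such calls, each taking the previous output as its input, gives the deep, fully-connected arrangement.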
- Deep neural networks use many hidden layers, and a layer is said to be fully-connected when each neuron receives inputs from all neurons of the preceding layer.
- Neural networks are commonly trained using stochastic gradient descent, an algorithm suited to training models on very large data sets.
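- Stochastic gradient descent can be sketched on the simplest possible model, a one-feature logistic regression, where each update uses a single training example. The toy data set, learning rate, and epoch count are assumptions made for this sketch; real training would use mini-batches, a deep learning framework, and far more data.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, weights, bias):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def mean_log_loss(data, weights, bias):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = predict(x, weights, bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def sgd_train(data, n_features, learning_rate=0.1, epochs=50, seed=0):
    """Visit examples in random order and step against each one's gradient."""
    rng = random.Random(seed)
    weights, bias = [0.0] * n_features, 0.0
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            error = predict(x, weights, bias) - y  # d(loss)/d(score) for log loss
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical, linearly separable toy data: (features, label).
toy_data = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
initial_loss = mean_log_loss(toy_data, [0.0], 0.0)
weights, bias = sgd_train(toy_data, n_features=1)
final_loss = mean_log_loss(toy_data, weights, bias)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```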
- Implementation of neural networks using modern deep learning frameworks enables rapid prototyping with different architectures and data sets.
- Fully-connected neural networks can be used for a number of genomics applications, which include predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation; prioritizing potential disease-causing genetic variants; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression and evolutionary conservation.
- Local dependencies in spatial and longitudinal data must be considered for effective predictions. For example, shuffling a DNA sequence or the pixels of an image severely disrupts informative patterns. These local dependencies set spatial or longitudinal data apart from tabular data, for which the ordering of the features is arbitrary.
- A convolutional layer is a special form of fully-connected layer in which the same fully-connected layer is applied locally, for example, in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple position weight matrices (PWMs), for example, for transcription factors GATA1 and TAL1. By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training.
- Each convolutional layer scans the sequence with several filters by producing a scalar value at every position, which quantifies the match between the filter and the sequence.
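- Scanning a sequence with one filter to produce a scalar match score at every position can be sketched as follows. The filter here simply rewards the bases G-A-T-A and is a hypothetical stand-in for a learned filter; no padding is used, so the output is shorter than the input.

```python
BASES = "ACGT"


def one_hot(sequence: str):
    """Encode each base as a 4-channel indicator vector (unknown bases are all zero)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def conv1d_scan(encoded, filt):
    """Slide one filter along the sequence; one score per window start position."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            filt[i][c] * encoded[start + i][c]
            for i in range(width)
            for c in range(len(BASES))
        )
        scores.append(score)
    return scores


# Hypothetical filter: weight 1 for the matching base at each of 4 positions.
gata_filter = one_hot("GATA")
scores = conv1d_scan(one_hot("TTGATAAG"), gata_filter)
print(scores)  # the highest score marks where the motif starts
```

- The same four-position filter is reused at every window, which is the parameter sharing that lets a motif be detected wherever it occurs.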
- A nonlinear activation function, commonly ReLU, is then applied.
- a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, commonly taking the maximal or average activation for each channel. Pooling reduces the effective sequence length and coarsens the signal.
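- The pooling step can be sketched for a single channel as follows. The bin size and activation values are hypothetical; with several channels, the same function would be applied to each channel separately.

```python
def pool1d(activations, size, mode="max"):
    """Aggregate activations in contiguous, non-overlapping bins along the sequence.

    Trailing positions that do not fill a whole bin are dropped.
    """
    if mode not in ("max", "average"):
        raise ValueError(f"unknown pooling mode: {mode}")
    pooled = []
    for start in range(0, len(activations) - size + 1, size):
        window = activations[start:start + size]
        pooled.append(max(window) if mode == "max" else sum(window) / size)
    return pooled


channel = [1.0, 0.0, 4.0, 1.0, 1.0, 0.0]  # hypothetical filter activations
print(pool1d(channel, size=2, mode="max"))
print(pool1d(channel, size=2, mode="average"))
```

- Six positions become three, which shows how pooling shortens the effective sequence length while keeping the strongest (or mean) signal in each bin.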
- The subsequent convolutional layer composes the output of the previous layer and is able to detect whether a GATA1 motif and a TAL1 motif were present at some distance range.
- The output of the convolutional layers can be used as input to a fully-connected neural network to perform the final prediction task.
- different types of neural network layers, e.g., fully-connected layers and convolutional layers
- Convolutional neural networks can predict various molecular phenotypes on the basis of DNA sequence alone. Applications include classifying transcription factor binding sites and predicting molecular phenotypes such as chromatin features, DNA contact maps, DNA methylation, gene expression, translation efficiency, RNA-binding protein (RBP) binding, and microRNA (miRNA) targets. In addition to predicting molecular phenotypes from the sequence, convolutional neural networks can be applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines. For example, convolutional neural networks can predict the specificity of guide RNA, denoise ChIP-seq, enhance Hi-C data resolution, predict the laboratory of origin from DNA sequences and call genetic variants.
- Convolutional neural networks have also been employed to model long-range dependencies in the genome. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modelling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter-enhancer looping. This is achieved by using dilated convolutions, which have a receptive field of up to 32 kb.
- Dilated convolutions also allow splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns (See Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019)).
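- How dilation widens the receptive field without adding parameters can be sketched with the standard formula for a stack of stride-1 convolutions, 1 + sum((kernel - 1) * dilation). The layer configuration below is hypothetical and is not the architecture of the cited work.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in input positions, of stacked stride-1 convolutions."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("one dilation rate is required per layer")
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


# Four layers with kernel size 3: undilated versus doubling dilation rates.
plain = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])
dilated = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])
print(plain, dilated)
```

- Both stacks have the same number of weights, yet the dilated one sees more than three times as many input positions; repeating the pattern over more layers is what makes kilobase-scale receptive fields practical.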
- Recurrent neural networks are an alternative to convolutional neural networks for processing sequential data, such as DNA sequences or time series, that implement a different parameter-sharing scheme.
- Recurrent neural networks apply the same operation to each sequence element. The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or is directly used as model predictions.
- recurrent neural networks are invariant to the position index in the processed sequence. For example, a recurrent neural network can detect an open reading frame in a DNA sequence regardless of the position in the sequence. This task requires the recognition of a certain series of inputs, such as the start codon followed by an in-frame stop codon.
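- The shape of the recurrent computation, one update applied to every element that reads the previous memory and the new input, can be sketched with a hand-written rule for the open reading frame example. This is not a trained recurrent neural network: the update rule `recurrent_step` is written by hand for illustration, whereas a real network would learn its update from data. It also tracks only the first start codon it encounters.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def recurrent_step(memory, base):
    """One update: (previous memory, new input) -> (new memory, output).

    Memory holds the previous two bases and the frame offset since the start
    codon (None while no start codon has been seen).
    """
    previous_two, offset = memory
    window = (previous_two + base)[-3:]
    output = False
    if offset is None:
        if window == "ATG":
            offset = 0  # start codon seen; begin counting the reading frame
    else:
        offset = (offset + 1) % 3
        if offset == 0 and window in STOP_CODONS:
            output = True  # an in-frame stop codon closes the reading frame
            offset = None
    return (window[-2:], offset), output


def contains_orf(sequence: str) -> bool:
    memory = ("", None)
    for base in sequence.upper():
        memory, output = recurrent_step(memory, base)
        if output:
            return True
    return False


# Hypothetical sequences: the same ORF at two positions, then an out-of-frame stop.
print(contains_orf("CCATGAAATAGCC"), contains_orf("GGGCCATGAAATAGCC"))
print(contains_orf("ATGAATAGC"))
```

- Because the same step is applied at every position, the result does not depend on where the reading frame begins, and sequences of any length are handled by the same loop.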
- recurrent neural networks can naturally process sequences of widely varying length, such as mRNA sequences.
- convolutional neural networks combined with various tricks can reach comparable or even better performances than recurrent neural networks on sequence-modelling tasks, such as audio synthesis and machine translation.
- Recurrent neural networks can aggregate the outputs of convolutional neural networks for predicting single-cell DNA methylation states, RBP binding, transcription factor binding, and DNA accessibility.
- Because recurrent neural networks apply a sequential operation, they cannot be easily parallelized and are hence much slower to compute than convolutional neural networks.
- Each human has a unique genetic code, though a large portion of the human genetic code is common for all humans.
- a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population.
- a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise same specific sequence.
- Genetic variants may be pathogenic, leading to diseases. Though most such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change the amino acid of a protein. However, not all missense mutations are pathogenic.
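- What makes a single nucleotide change a missense mutation can be sketched with the standard genetic code: the substitution is missense when the altered codon encodes a different amino acid. The function name `classify_snv` and the example codons are assumptions for this sketch, and the classification says nothing about pathogenicity.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(codon: str, position: int, alt_base: str):
    """Classify a single-base substitution within one codon (position is 0, 1, or 2)."""
    ref_aa = CODON_TABLE[codon]
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    elif ref_aa == "*":
        kind = "stop-lost"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


print(classify_snv("GCT", 1, "T"))  # alanine codon, middle base changed
print(classify_snv("GCT", 2, "C"))  # alanine codon, third base changed
print(classify_snv("TGG", 2, "A"))  # tryptophan codon, third base changed
```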
- Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation and have emerged as new methods for quantitative trait loci identification and variant prioritization.
- These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it challenging to estimate their effects and contribution to phenotypes.
- linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants.
- sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to find potential drivers of complex phenotypes.
- One example includes predicting the effect of non-coding single-nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility or gene expression predictions.
- Another example includes predicting novel splice site creation from sequence or quantitative effects of genetic variants on splicing.
- End-to-end deep learning approaches for variant effect predictions are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (See Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”).
- PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information.
- PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks.
- Such an approach which utilizes the protein sequences for pathogenicity prediction is promising because it can avoid the circularity problem and overfitting to previous knowledge.
- PrimateAI outperforms prior methods when trained directly upon sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de-novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that may lessen the reliance of clinical reporting on prior knowledge.
- Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role.
- a site can be defined by a three-dimensional (3D) location and a local neighborhood around this location in which the structure or function exists.
- Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determination of the structural and functional roles of individual amino acids within a protein provides information to help engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts such as site-directed mutagenesis for altering targeted protein functional properties. Alternatively, this knowledge can help avoid engineering designs that would abolish a desired function.
- Figure 1 is a flow diagram that illustrates a process of a system for determining pathogenicity of variants, according to various implementations of the technology disclosed.
- Figure 2 schematically illustrates an example reference amino acid sequence of a protein and an alternative amino acid sequence of the protein, in accordance with one implementation of the technology disclosed.
- Figure 3 illustrates amino acid-wise classification of atoms of amino acids in the reference amino acid sequence of Figure 2, in accordance with one implementation of the technology disclosed.
- Figure 4 illustrates amino acid-wise attribution of 3D atomic coordinates of the alpha-carbon atoms classified in Figure 3 on an amino acid-basis, in accordance with one implementation of the technology disclosed.
- Figure 5 schematically illustrates a process of determining voxel-wise distance values, in accordance with one implementation of the technology disclosed.
- Figure 6 shows an example of twenty-one amino acid-wise distance channels, in accordance with one implementation of the technology disclosed.
- Figure 7 is a schematic diagram of a distance channel tensor, in accordance with one implementation of the technology disclosed.
- Figure 8 shows one-hot encodings of the reference amino acid and the alternative amino acid from Figure 2, in accordance with one implementation of the technology disclosed.
- Figure 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid and a voxelized one-hot encoded variant/alternative amino acid, in accordance with one implementation of the technology disclosed.
- Figure 10 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of Figure 7 and a reference allele tensor, in accordance with one implementation of the technology disclosed.
- Figure 11 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of Figure 7, the reference allele tensor of Figure 10, and an alternative allele tensor, in accordance with one implementation of the technology disclosed.
- Figure 12 is a flow diagram that illustrates a process of a system for determining and assigning pan-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed.
- Figure 13 illustrates voxels-to-nearest amino acids, in accordance with one implementation of the technology disclosed.
- Figure 14 shows an example multi-sequence alignment of the reference amino acid sequence across ninety-nine species, in accordance with one implementation of the technology disclosed.
- Figure 15 shows an example of determining a pan-amino acid conservation frequencies sequence for a particular voxel, in accordance with one implementation of the technology disclosed.
- Figure 16 shows respective pan-amino acid conservation frequencies determined for respective voxels using the position frequency logic described in Figure 15, in accordance with one implementation of the technology disclosed.
- Figure 17 illustrates voxelized per-voxel evolutionary profiles, in accordance with one implementation of the technology disclosed.
- Figure 18 depicts an example of an evolutionary profiles tensor, in accordance with one implementation of the technology disclosed.
- Figure 19 is a flow diagram that illustrates a process of a system for determining and assigning per-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed.
- Figure 20 shows various examples of voxelized annotation channels that are concatenated with the distance channel tensor, in accordance with one implementation of the technology disclosed.
- Figure 21 illustrates different combinations and permutations of input channels that can be provided as inputs to a pathogenicity classifier for pathogenicity determination of a target variant, in accordance with one implementation of the technology disclosed.
- Figure 22 shows different methods of calculating the disclosed distance channels, in accordance with various implementations of the technology disclosed.
- Figure 23 shows different examples of the evolutionary channels, in accordance with various implementations of the technology disclosed.
- Figure 24 shows different examples of the annotations channels, in accordance with various implementations of the technology disclosed.
- Figure 25 shows different examples of the structure confidence channels, in accordance with various implementations of the technology disclosed.
- Figure 26 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the technology disclosed.
- Figure 27 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the technology disclosed.
- Figures 28, 29, 30, 31A and 31B use PrimateAI as a benchmark model to demonstrate the disclosed PrimateAI 3D’s classification superiority over PrimateAI.
- Figures 32A and 32B show the disclosed efficient voxelization process, in accordance with various implementations of the technology disclosed.
- Figure 33 depicts how atoms are associated with voxels that contain the atoms, in accordance with one implementation of the technology disclosed.
- Figure 34 shows generating voxel-to-atoms mapping from atom-to-voxels mapping to identify nearest atoms on a voxel-by-voxel basis, in accordance with one implementation of the technology disclosed.
- Figures 35A and 35B illustrate how the disclosed efficient voxelization has a runtime complexity of O(#atoms) versus the runtime complexity of O(#atoms * #voxels) without the use of disclosed efficient voxelization.
- Figure 36 shows an example computer system that can be used to implement the technology disclosed.
- Figure 37 illustrates one implementation of determining variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation.
- Figure 38 shows an example of a spatial representation of a protein.
- Figure 39 shows an example of a gapped spatial representation of the protein illustrated in Figure 38.
- Figure 40 shows an example of an atomic spatial representation of the protein illustrated in Figure 38.
- Figure 41 shows an example of a gapped atomic spatial representation of the protein illustrated in Figure 38.
- Figure 42 illustrates one implementation of a pathogenicity classifier determining variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation and an alternate amino acid representation of the target alternate amino acid.
- Figure 43 depicts one implementation of training data used to train the pathogenicity classifier.
- Figure 44 illustrates one implementation of generating gapped spatial representations for reference proteins samples by using reference amino acids as gap amino acids.
- Figure 45 shows one implementation of training the pathogenicity classifier on benign protein samples.
- Figure 46 shows one implementation of training the pathogenicity classifier on pathogenic protein samples.
- Figure 47 shows how certain unreachable amino acid classes are masked during training.
- Figure 48 illustrates one implementation of determining a final pathogenicity score.
- Figure 49A shows that a variant pathogenicity determination is made for a target alternate amino acid filling a vacancy created by a reference gap amino acid at a given position in a protein.
- Figure 49B shows that respective variant pathogenicity determinations are made for amino acids of respective amino acid classes filling the vacancy created by the reference gap amino acid at the given position in the protein.
- Figure 50 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.
- Figure 51 illustrates one implementation of the pathogenicity classifier determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.
- Figure 52 illustrates one implementation of concurrently training the pathogenicity classifier on benign and pathogenic protein samples.
- Figure 53 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternate amino acids.
- Figure 54 shows the evolutionary conservation determiner in operation, in accordance with one implementation.
- Figure 55 illustrates one implementation of determining pathogenicity based on predicted evolutionary scores.
- Figure 56 illustrates one implementation of training data used to train the evolutionary conservation determiner.
- Figure 57 illustrates one implementation of concurrently training the evolutionary conservation determiner on benign and pathogenic protein samples.
- Figure 58 depicts different implementations of ground truth label encodings used to train the evolutionary conservation determiner.
- Figure 59 illustrates an example position-specific frequency matrix (PSFM).
- Figure 60 depicts an example position-specific scoring matrix (PSSM).
- Figure 61 shows one implementation of generating the PSFM and the PSSM.
- Figure 62 illustrates an example PSFM encoding.
- Figure 63 depicts an example PSSM encoding.
- Figure 64 illustrates two datasets on which the models disclosed herein can be trained.
- Figures 65A-65B illustrate one implementation of combined learning of the models disclosed herein.
- Figures 66A-66B illustrate one implementation of using transfer learning to train the models disclosed herein using the two datasets shown in Figure 64.
- Figure 67 shows one implementation of generating training data and labels to train the models disclosed herein.
- Figure 68 illustrates one implementation of a method of determining pathogenicity of nucleotide variants.
- Figure 69 illustrates one implementation of a system to predict structural tolerability of amino acid substitutes.
- Figures 70A, 70B and 70C depict performance results that demonstrate objective indicia of non-obviousness and inventiveness.
- modules can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved.
- the modules in the figures can also be thought of as flowchart steps in a method.
- a module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
- FIG. 1 is a flow diagram that illustrates a process 100 of a system for determining pathogenicity of variants.
- a sequence accessor 104 of the system accesses reference and alternative amino acid sequences.
- a 3D structure generator 114 of the system generates 3D protein structures for a reference amino acid sequence.
- the 3D protein structures are homology models of human proteins.
- a so-called SwissModel homology modelling pipeline provides a public repository of predicted human protein structures.
- a so-called HHpred homology modelling uses a tool called Modeller to predict the structure of a target protein from template structures.
- Proteins are represented by a collection of atoms and their coordinates in 3D space.
- An amino acid can have a variety of atoms, such as carbon atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms.
- the atoms can be further classified as side chain atoms and backbone atoms.
- the backbone carbon atoms can include alpha-carbon (Cα) atoms and beta-carbon (Cβ) atoms.
- a coordinate classifier 124 of the system classifies 3D atomic coordinates of the 3D protein structures on an amino acid-basis.
- the amino acid-wise classification involves attributing the 3D atomic coordinates to the twenty-one amino acid categories (including stop or gap amino acid category).
- an amino acid-wise classification of alpha-carbon atoms can respectively list alpha-carbon atoms under each of the twenty-one amino acid categories.
- an amino acid-wise classification of beta-carbon atoms can respectively list beta-carbon atoms under each of the twenty-one amino acid categories.
- an amino acid-wise classification of oxygen atoms can respectively list oxygen atoms under each of the twenty-one amino acid categories.
- an amino acid-wise classification of nitrogen atoms can respectively list nitrogen atoms under each of the twenty-one amino acid categories.
- an amino acid-wise classification of hydrogen atoms can respectively list hydrogen atoms under each of the twenty-one amino acid categories.
- the amino acid-wise classification can include a subset of the twenty-one amino acid categories and a subset of the different atomic elements.
- a voxel grid generator 134 of the system instantiates a voxel grid.
- the voxel grid can have any resolution, for example, 3x3x3, 5x5x5, 7x7x7, and so on.
- Voxels in the voxel grid can be of any size, for example, one angstrom (Å) on each side, two Å on each side, three Å on each side, and so on.
- these example dimensions refer to cubic dimensions because voxels are cubes.
- these example dimensions are nonlimiting, and the voxels can have any cubic dimensions.
- a voxel grid centerer 144 of the system centers the voxel grid at the reference amino acid experiencing a target variant at the amino acid level.
- the voxel grid is centered at an atomic coordinate of a particular atom of the reference amino acid experiencing the target variant, for example, the 3D atomic coordinate of the alpha-carbon atom of the reference amino acid experiencing the target variant.
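- As a non-limiting illustration, instantiating a voxel grid and centering it at an atomic coordinate can be sketched in Python as follows. The function name `voxel_centers`, the use of NumPy, the parameter defaults, and the example alpha-carbon coordinate are assumptions made only for this sketch and are not taken from the disclosed system.

```python
import numpy as np

def voxel_centers(center, resolution=3, voxel_size=2.0):
    """Return an array of shape (resolution, resolution, resolution, 3)
    holding the 3D coordinates of every voxel center."""
    center = np.asarray(center, dtype=float)
    # Offsets of voxel centers along one axis, symmetric around zero.
    # For resolution=3 and voxel_size=2.0 this is [-2., 0., 2.].
    offsets = (np.arange(resolution) - (resolution - 1) / 2.0) * voxel_size
    dx, dy, dz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    # Shift the zero-centered grid so that its middle sits on `center`.
    return center + np.stack([dx, dy, dz], axis=-1)

# Hypothetical alpha-carbon coordinate of the reference amino acid
# experiencing the target variant (values are made up).
ca_coordinate = (10.0, -4.0, 7.5)
grid = voxel_centers(ca_coordinate, resolution=3, voxel_size=2.0)
```

- With an odd resolution, the middle voxel of the sketched grid coincides with the centering coordinate, which matches the centering described above.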
- the voxels in the voxel grid can have a plurality of channels (or features).
- the voxels in the voxel grid have a plurality of distance channels (e.g., twenty-one distance channels for the twenty-one amino acid categories, respectively (including stop or gap amino acid category)).
- a distance channel generator 154 of the system generates amino acid-wise distance channels for the voxels in the voxel grid. The distance channels are independently generated for each of the twenty-one amino acid categories.
- an Alanine distance channel includes twenty-seven distance values for the twenty-seven voxels in the voxel grid, respectively.
- the twenty-seven distance values in the Alanine distance channel are measured from respective centers of the twenty-seven voxels in the voxel grid to respective nearest atoms in the Alanine amino acid category.
- the Alanine amino acid category includes only alpha-carbon atoms and therefore the nearest atoms are those Alanine alpha-carbon atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.
- the Alanine amino acid category includes only beta-carbon atoms and therefore the nearest atoms are those Alanine beta-carbon atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.
- the Alanine amino acid category includes only oxygen atoms and therefore the nearest atoms are those Alanine oxygen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.
- the Alanine amino acid category includes only nitrogen atoms and therefore the nearest atoms are those Alanine nitrogen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.
- the Alanine amino acid category includes only hydrogen atoms and therefore the nearest atoms are those Alanine hydrogen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.
- As with the Alanine distance channel, the distance channel generator 154 generates a distance channel (i.e., a set of voxel-wise distance values) for each of the remaining amino acid categories. In other implementations, the distance channel generator 154 generates distance channels only for a subset of the twenty-one amino acid categories.
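- One way to picture a single amino acid-wise distance channel is the following minimal Python sketch, which measures, for every voxel center, the Euclidean distance to the nearest atom of one category. The function name `distance_channel`, the brute-force pairwise search, the toy coordinates, and the choice to return infinity for a category with no atoms are all assumptions for illustration, not the disclosed implementation.

```python
import numpy as np

def distance_channel(voxel_centers, atom_coords):
    """Distance from each voxel center to its nearest atom in one category."""
    centers = np.asarray(voxel_centers, dtype=float).reshape(-1, 3)
    atoms = np.asarray(atom_coords, dtype=float).reshape(-1, 3)
    if atoms.shape[0] == 0:
        # Assumption: a category with no atoms gets an infinite distance.
        return np.full(centers.shape[0], np.inf)
    # Pairwise Euclidean distances, shape (num_voxels, num_atoms).
    pairwise = np.linalg.norm(centers[:, None, :] - atoms[None, :, :], axis=-1)
    # Keep only the distance to the nearest atom for each voxel.
    return pairwise.min(axis=1)

# A 3x3x3 grid of two-angstrom voxels centered at the origin (toy setup).
offsets = np.array([-2.0, 0.0, 2.0])
centers = np.stack(np.meshgrid(offsets, offsets, offsets, indexing="ij"), axis=-1)

# Hypothetical alpha-carbon coordinates for three of the categories.
atoms_by_category = {
    "A": [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    "R": [(2.0, 2.0, 2.0)],
    "-": [],
}
channels = {aa: distance_channel(centers, coords)
            for aa, coords in atoms_by_category.items()}
```

- Each resulting channel holds twenty-seven values, one per voxel, in the flattened order of the voxel grid.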
- the selection of the nearest atoms is not confined to a particular atom type. That is, within a subject amino acid category, the nearest atom to a particular voxel is selected, irrespective of the atomic element of the nearest atom, and the distance value for the particular voxel calculated for inclusion in the distance channel for the subject amino acid category.
- the distance channels are generated on an atomic element-basis. Instead of or in addition to having the distance channels for the amino acid categories, distance values can be generated for atom element categories, irrespective of the amino acids to which the atoms belong.
- the atoms of amino acids in the reference amino acid sequence span seven atomic elements: carbon, oxygen, nitrogen, hydrogen, calcium, iodine, and sulfur.
- the voxels in the voxel grid are configured to have seven distance channels, such that each of the seven distance channels has twenty-seven voxel-wise distance values that specify distances to nearest atoms only within a corresponding atomic element category.
- distance channels for only a subset of the seven atomic elements can be generated.
- the atomic element categories and the distance channel generation can be further stratified into variations of a same atomic element, for example, alpha-carbon (Cα) atoms and beta-carbon (Cβ) atoms.
- the distance channels can be generated on an atom type-basis, for example, distance channels only for side chain atoms and distance channels only for backbone atoms.
- the nearest atoms can be searched within a predefined maximum scan radius from the voxel centers (e.g., six angstroms (Å)). Also, multiple atoms can be nearest to a same voxel in the voxel grid.
- the distances are calculated between 3D coordinates of the voxel centers and 3D atomic coordinates of the atoms. Also, the distance channels are generated with the voxel grid centered at a same location (e.g., centered at the 3D atomic coordinate of the alpha-carbon atom of the reference amino acid experiencing the target variant).
- the distances can be Euclidean distances. Also, the distances can be parameterized by atom size (or atom influence) (e.g., by using Lennard-Jones potential and/or Van der Waals atom radius of the atom in question). Also, the distance values can be normalized by the maximum scan radius, or by a maximum observed distance value of the furthest nearest atom within a subject amino acid category or a subject atomic element category or a subject atom type category. In some implementations, the distances between the voxels and the atoms are calculated based on polar coordinates of the voxels and the atoms. The polar coordinates are parameterized by angles between the voxels and the atoms.
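- The two normalization options mentioned above can be illustrated with the short Python sketch below. The function names, the six-angstrom default, and the decision to cap distances at the scan radius before dividing are assumptions made for this example; the second function additionally assumes at least one positive distance.

```python
import numpy as np

def normalize_by_scan_radius(distances, max_scan_radius=6.0):
    """Divide nearest-atom distances by the maximum scan radius.
    Assumption: distances beyond the radius are capped at the radius."""
    d = np.asarray(distances, dtype=float)
    return np.minimum(d, max_scan_radius) / max_scan_radius

def normalize_by_observed_max(distances):
    """Divide by the largest nearest-atom distance observed in the category."""
    d = np.asarray(distances, dtype=float)
    return d / d.max()

# Toy nearest-atom distances in angstroms.
raw = np.array([0.0, 1.5, 3.0, 6.0, 9.0])
by_radius = normalize_by_scan_radius(raw)
by_observed = normalize_by_observed_max(raw)
```

- Both variants map the distance values into the range from zero to one, which keeps the distance channels on a common scale.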
- this angle information is used to generate an angle channel for the voxels (i.e., independent of the distance channels).
- angles between a nearest atom and neighboring atoms (e.g., backbone atoms)
- the voxels in the voxel grid can also have reference allele and alternative allele channels.
- a one-hot encoder 164 of the system generates a reference one-hot encoding of a reference amino acid in the reference amino acid sequence and an alternative one-hot encoding of an alternative amino acid in an alternative amino acid sequence.
- the reference amino acid experiences the target variant.
- the alternative amino acid is the target variant.
- the reference amino acid and the alternative amino acid are located at a same position respectively in the reference amino acid sequence and the alternative amino acid sequence.
- the reference amino acid sequence and the alternative amino acid sequence have the same position-wise amino acid composition with one exception. The exception is the position that has the reference amino acid in the reference amino acid sequence and the alternative amino acid in the alternative amino acid sequence.
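- A minimal Python sketch of the reference and alternative one-hot encodings follows. The ordering of the twenty-one categories in `AMINO_ACIDS`, the helper names, and the twenty-residue toy sequence are hypothetical; only the Glycine-to-Alanine substitution at position 16 mirrors the example used in this description.

```python
import numpy as np

# Twenty amino acids plus a gap/stop category; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(amino_acid):
    """Return a length-21 one-hot vector for a one-letter amino acid code."""
    vec = np.zeros(len(AMINO_ACIDS), dtype=np.float32)
    vec[AMINO_ACIDS.index(amino_acid)] = 1.0
    return vec

def apply_variant(reference_sequence, position, alternative_aa):
    """Swap in the alternative amino acid at a 1-based position."""
    idx = position - 1
    return reference_sequence[:idx] + alternative_aa + reference_sequence[idx + 1:]

# Hypothetical sequence with Phenylalanine at position 1 and Glycine at 16.
reference_sequence = "FARNDCQEHILKMPSGTWYL"
alternative_sequence = apply_variant(reference_sequence, 16, "A")

ref_one_hot = one_hot(reference_sequence[15])    # Glycine (G)
alt_one_hot = one_hot(alternative_sequence[15])  # Alanine (A)
```

- The two sequences in the sketch differ at exactly one position, which is the position whose reference and alternative amino acids are one-hot encoded.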
- a concatenator 174 of the system concatenates the amino acid-wise distance channels and the reference and alternative one-hot encodings.
- the concatenator 174 concatenates the atomic element-wise distance channels and the reference and alternative one-hot encodings.
- the concatenator 174 concatenates the atomic type-wise distance channels and the reference and alternative one-hot encodings.
- runtime logic 184 of the system processes the concatenated amino acid-wise/atomic element-wise/atomic type-wise distance channels and the reference and alternative one-hot encodings through a pathogenicity classifier (pathogenicity determination engine) to determine a pathogenicity of the target variant, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the target variant at the amino acid level.
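- The voxel-wise concatenation that produces the classifier input can be sketched as follows. The channel-first tensor layout, the function names, and the random stand-in for the distance channels are assumptions for illustration; the pathogenicity classifier itself is not modeled here.

```python
import numpy as np

NUM_CATEGORIES = 21  # twenty amino acids plus the stop/gap category

def voxelize_one_hot(one_hot_vector, grid_shape):
    """Repeat a one-hot vector at every voxel: (21,) -> (21, X, Y, Z)."""
    vec = np.asarray(one_hot_vector, dtype=np.float32)
    target_shape = (vec.shape[0], *grid_shape)
    return np.broadcast_to(vec[:, None, None, None], target_shape).copy()

def concatenate_channels(distance_tensor, ref_one_hot, alt_one_hot):
    """Stack distance, reference, and alternative channels per voxel."""
    grid_shape = distance_tensor.shape[1:]
    return np.concatenate(
        [
            distance_tensor.astype(np.float32),
            voxelize_one_hot(ref_one_hot, grid_shape),
            voxelize_one_hot(alt_one_hot, grid_shape),
        ],
        axis=0,
    )

# Stand-in values: random distances and arbitrary one-hot indices.
rng = np.random.default_rng(0)
distance_tensor = rng.random((NUM_CATEGORIES, 3, 3, 3))
ref_one_hot = np.eye(NUM_CATEGORIES)[5]
alt_one_hot = np.eye(NUM_CATEGORIES)[0]
model_input = concatenate_channels(distance_tensor, ref_one_hot, alt_one_hot)
```

- Under these assumptions, each voxel of a 3x3x3 grid carries sixty-three features: twenty-one distance values, twenty-one reference allele values, and twenty-one alternative allele values.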
- The pathogenicity classifier is trained using labelled datasets of benign and pathogenic variants, for example, using the backpropagation algorithm. Additional details about the labelled datasets of benign and pathogenic variants and example architectures and training of the pathogenicity classifier can be found in commonly owned US Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149.
- FIG. 2 schematically illustrates a reference amino acid sequence 202 of a protein 200 and an alternative amino acid sequence 212 of the protein 200.
- the protein 200 comprises N amino acids. Positions of the amino acids in the protein 200 are labelled 1, 2, 3 ...N.
- position 16 is the location that experiences an amino acid variant 214 (mutation) caused by an underlying nucleotide variant.
- in the reference amino acid sequence 202, position 1 has reference amino acid Phenylalanine (F), position 16 has reference amino acid Glycine (G) 204, and position N (e.g., the last amino acid of the sequence 202) has reference amino acid Leucine (L).
- Figure 3 illustrates amino acid-wise classification of atoms of amino acids in the reference amino acid sequence 202, also referred to herein as “atom classification 300.” Specific types of amino acids, among the twenty natural amino acids listed in column 302, may repeat in a protein. That is, a particular type of amino acid may occur more than once in a protein. Proteins may also have some undetermined amino acids that are categorized by a twenty-first stop or gap amino acid category.
- the right column in Figure 3 contains counts of alpha-carbon (Cα) atoms from different amino acids.
- Figure 3 shows amino acid-wise classification of alpha-carbon (Cα) atoms of the amino acids in the reference amino acid sequence 202.
- Column 308 of Figure 3 lists the total number of alpha-carbon atoms observed for the reference amino acid sequence 202 in each of the twenty-one amino acid categories. For example, column 308 lists eleven alpha-carbon atoms observed for the Alanine (A) amino acid category. Since each amino acid has only one alpha-carbon atom, this means that Alanine occurs eleven times in the reference amino acid sequence 202. In another example, Arginine (R) occurs thirty-five times in the reference amino acid sequence 202. The total number of alpha-carbon atoms across the twenty-one amino acid categories is eight hundred and twenty-eight.
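- Because each amino acid contributes exactly one alpha-carbon atom, the per-category counts of the kind listed in column 308 equal the per-category residue counts. A small Python sketch of this tally is given below; the function name, the category ordering, the mapping of undetermined residues to a `-` gap symbol, and the toy sequence are assumptions for illustration.

```python
from collections import Counter

STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # twenty natural amino acids
GAP = "-"                          # twenty-first stop/gap category (assumed symbol)

def alpha_carbon_counts(sequence):
    """Count alpha-carbon atoms per amino acid category.
    Any residue outside the twenty standard codes is treated as gap."""
    counts = Counter(aa if aa in STANDARD else GAP for aa in sequence)
    return {aa: counts.get(aa, 0) for aa in STANDARD + GAP}

# Toy sequence; "X" stands for an undetermined residue.
toy_sequence = "FARNDAXRGA"
counts = alpha_carbon_counts(toy_sequence)
```

- The counts across all twenty-one categories sum to the sequence length, in the same way that the per-category totals in Figure 3 sum to the total number of alpha-carbon atoms.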
- Figure 4 illustrates amino acid-wise attribution of 3D atomic coordinates of the alpha-carbon atoms of the reference amino acid sequence 202 based on the atom classification 300 in Figure 3. This is referred to herein as “atomic coordinates bucketing 400.”
- lists 404-440 tabulate the 3D atomic coordinates of the alpha-carbon atoms bucketed to each of the twenty-one amino acid categories.
- the bucketing 400 in Figure 4 follows the classification 300 of Figure 3. For example, in Figure 3, the Alanine amino acid category has eleven alpha-carbon atoms, and therefore, in Figure 4, the Alanine amino acid category has eleven 3D atomic coordinates of the corresponding eleven alpha-carbon atoms from Figure 3.
- This classification-to-bucketing logic flows from Figure 3 to Figure 4 for other amino acid categories too.
- this classification-to-bucketing logic is only for representational purposes, and, in other implementations, the technology disclosed need not perform the classification 300 and the bucketing 400 to locate the voxel-wise nearest atoms, and may perform fewer, additional, or different steps.
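- For representational purposes, the classification-to-bucketing logic can be sketched in Python as a grouping of alpha-carbon coordinates by amino acid category. The input format (pairs of one-letter code and coordinate), the function name, and the coordinate values are hypothetical.

```python
from collections import defaultdict

def bucket_alpha_carbons(residues):
    """Group alpha-carbon 3D coordinates by amino acid category.

    `residues` is an iterable of (one_letter_code, (x, y, z)) pairs,
    one pair per residue of the reference amino acid sequence.
    """
    buckets = defaultdict(list)
    for amino_acid, coord in residues:
        buckets[amino_acid].append(tuple(coord))
    return dict(buckets)

# Made-up residues: two Alanines, one Phenylalanine, one Glycine.
residues = [
    ("F", (1.0, 2.0, 3.0)),
    ("A", (4.0, 1.0, 0.5)),
    ("G", (6.5, 0.0, 2.0)),
    ("A", (9.0, 3.5, 1.0)),
]
buckets = bucket_alpha_carbons(residues)
```

- A category that occurs k times in the sequence ends up with k coordinates in its bucket, mirroring the relationship between Figure 3 and Figure 4.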
- the technology disclosed can locate the voxel-wise nearest atoms by using a sort and search algorithm that returns the voxel-wise nearest atoms from one or more databases in response to a search query configured to accept query parameters like sort criteria (e.g., amino acid-wise, atomic element-wise, atom type-wise), the predefined maximum scan radius, and the type of distances (e.g., Euclidean, Mahalanobis, normalized, unnormalized).
- a plurality of sort and search algorithms from the current or future technical field can analogously be used by a person skilled in the art to locate the voxel-wise nearest atoms.
- one or more databases may include information regarding the 3D atomic coordinates of the alpha-carbon atoms and other atoms of amino acids in proteins. Such databases may be searchable by specific proteins.
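- A query of the kind described above can be pictured with the following Python sketch, in which an in-memory list of atom records stands in for a database. The record keys, the function name `nearest_atoms`, the parameter names, and the capping of distances at the scan radius are all hypothetical, and only Euclidean distances are covered.

```python
import numpy as np

def nearest_atoms(atoms, voxel_centers, sort_by="amino_acid",
                  max_scan_radius=6.0, normalized=False):
    """Return, per group, the nearest-atom distance for every voxel center.

    `atoms` is a list of dicts with the keys 'amino_acid', 'element',
    'atom_type', and 'coord'. `sort_by` selects the grouping criterion.
    """
    centers = np.asarray(voxel_centers, dtype=float).reshape(-1, 3)
    groups = {}
    for atom in atoms:
        groups.setdefault(atom[sort_by], []).append(atom["coord"])
    result = {}
    for key, coords in groups.items():
        coords = np.asarray(coords, dtype=float)
        pairwise = np.linalg.norm(
            centers[:, None, :] - coords[None, :, :], axis=-1)
        # Assumption: atoms beyond the scan radius count as being at the radius.
        nearest = np.minimum(pairwise.min(axis=1), max_scan_radius)
        result[key] = nearest / max_scan_radius if normalized else nearest
    return result

# Made-up atom records and two voxel centers.
atoms = [
    {"amino_acid": "A", "element": "C", "atom_type": "backbone", "coord": (0, 0, 0)},
    {"amino_acid": "A", "element": "O", "atom_type": "backbone", "coord": (3, 0, 0)},
    {"amino_acid": "R", "element": "C", "atom_type": "side_chain", "coord": (0, 4, 0)},
]
centers = [(0, 0, 0), (0, 10, 0)]
by_amino_acid = nearest_atoms(atoms, centers, sort_by="amino_acid")
by_element = nearest_atoms(atoms, centers, sort_by="element", normalized=True)
```

- Changing `sort_by` switches between amino acid-wise, atomic element-wise, and atom type-wise groupings without altering the distance computation.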
- the voxels and the voxel grid are 3D entities.
- the drawings depict, and the description discusses, the voxels and the voxel grid in a two-dimensional (2D) format.
- a 3x3x3 voxel grid of twenty-seven voxels is depicted and described herein as a 3x3 2D pixel grid with nine 2D pixels.
- the 2D format is used only for representational purposes and is intended to cover the 3D counterparts (i.e., 2D pixels represent 3D voxels and 2D pixel grid represents 3D voxel grid).
- the drawings are also not to scale. For example, voxels of size two angstroms (Å) are depicted using a single pixel.
- FIG. 5 schematically illustrates a process of determining voxel-wise distance values, also referred to herein as “voxel-wise distance calculation 500.”
- the voxel-wise distance values are calculated only for the Alanine (A) distance channel.
- the same distance calculation logic is executed for each of the twenty-one amino acid categories to generate twenty-one amino acid-wise distance channels and can be further expanded to other atom types like beta-carbon atoms and other atomic elements like oxygen, nitrogen, and hydrogen, as discussed above with respect to Figure 1.
- the atoms are randomly rotated prior to the distance calculation to make the training of the pathogenicity classifier invariant to atom orientation.
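- A random rotation of the atoms prior to the distance calculation can be sketched as below. Drawing the rotation from the QR decomposition of a Gaussian matrix and rotating about the voxel grid center are assumptions of this sketch; the description above does not specify how the rotation is sampled or about which point it is applied.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    # Fix column signs so the factorization is unique.
    q = q * np.sign(np.diag(r))
    # Flip one axis if needed to turn a reflection into a proper rotation.
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def rotate_atoms(atom_coords, center, rng):
    """Rotate atom coordinates about `center` by a random rotation."""
    rotation = random_rotation_matrix(rng)
    coords = np.asarray(atom_coords, dtype=float)
    return (coords - center) @ rotation.T + center

rng = np.random.default_rng(0)
# Hypothetical grid center and atom coordinates.
center = np.array([10.0, -4.0, 7.5])
atoms = np.array([[11.0, -4.0, 7.5], [10.0, -1.0, 7.5], [8.0, -4.0, 9.5]])
rotated = rotate_atoms(atoms, center, rng)
```

- A rotation about the grid center leaves every atom's distance to that center unchanged while changing which voxels the atoms fall nearest to, which is what exposes the classifier to varied orientations during training.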
- a voxel grid 522 has nine voxels 514 identified with indices (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3).
- the voxel grid 522 is centered, for example, at the 3D atomic coordinate 532 of the alpha-carbon atom of the Glycine (G) amino acid at position 16 in the reference amino acid sequence 202 because, in the alternative amino acid sequence 212, the position 16 experiences the variant that mutates the Glycine (G) amino acid to the Alanine (A) amino acid, as discussed above with respect to Figure 2.
- the center of the voxel grid 522 coincides with the center of voxel (2, 2).
- the centered voxel grid 522 is used for the voxel-wise distance calculation for each of the twenty-one amino acid-wise distance channels. Starting, for example, with the Alanine (A) distance channel, distances between the 3D coordinates of respective centers of the nine voxels 514 and the 3D atomic coordinates 402 of the eleven Alanine alpha-carbon atoms are measured to locate a nearest Alanine alpha-carbon atom for each of the nine voxels 514.
- the resulting Alanine distance channel arranges the nine Alanine distance values in the same order as the nine voxels 514 in the voxel grid 522.
- the above process is executed for each of the twenty-one amino acid categories.
- the centered voxel grid 522 is similarly used to calculate the Arginine (R) distance channel, such that distances between the 3D coordinates of respective centers of the nine voxels 514 and the 3D atomic coordinates 404 of the thirty-five Arginine alpha-carbon atoms are measured to locate a nearest Arginine alpha-carbon atom for each of the nine voxels 514.
- nine distance values for nine distances between the nine voxels 514 and the respective nearest Arginine alpha-carbon atoms are used to construct the Arginine distance channel.
- the resulting Arginine distance channel arranges the nine Arginine distance values in the same order as the nine voxels 514 in the voxel grid 522.
- the twenty-one amino acid-wise distance channels are voxel-wise encoded to form a distance channel tensor.
- a distance 512 is between the center of voxel (1, 1) of voxel grid 522 and the nearest alpha-carbon (Cα) atom, which is the Cα A ' atom in list 402. Accordingly, the value assigned to voxel (1, 1) is the distance 512.
- the Cα A4 atom is the nearest Cα atom to the center of voxel (1, 2). Accordingly, the value assigned to voxel (1, 2) is the distance between the center of voxel (1, 2) and the Cα A4 atom.
- the Cα A6 atom is the nearest Cα atom to the center of voxel (2, 1).
- the value assigned to voxel (2, 1) is the distance between the center of voxel (2, 1) and the Cα A6 atom.
- the Cα A6 atom is also the nearest Cα atom to the center of voxels (3, 2) and (3, 3).
- the value assigned to voxel (3, 2) is the distance between the center of voxel (3, 2) and the Cα A6 atom and the value assigned to voxel (3, 3) is the distance between the center of voxel (3, 3) and the Cα A6 atom.
- the distance values assigned to the voxels 514 may be normalized distances.
- the distance value assigned to voxel (1, 1) may be the distance 512 divided by a maximum distance 502 (predefined maximum scan radius).
- the nearest-atom distances may be Euclidean distances and may be normalized by dividing the Euclidean distances by a maximum nearest-atom distance (e.g., the maximum distance 502).
- the distances may be nearest-alpha-carbon atom distances from corresponding voxel centers to nearest alpha-carbon atoms of the corresponding amino acids.
- the distances may be nearest-beta-carbon atom distances from corresponding voxel centers to nearest beta-carbon atoms of the corresponding amino acids.
- the distances may be nearest-backbone atom distances from corresponding voxel centers to nearest backbone atoms of the corresponding amino acids.
- the distances may be nearest-sidechain atom distances from corresponding voxel centers to nearest sidechain atoms of the corresponding amino acids.
- the distances additionally/alternatively can include distances to second, third, fourth nearest atoms, and so on.
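- The voxel-wise nearest-atom distance calculation and its normalization by a maximum scan radius can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the voxel size, clipping at the maximum radius, and assigning the maximum value when a category has no atoms are all illustrative choices not specified in the text.

```python
import numpy as np

def voxel_centers(center, n=3, size=1.0):
    """Centers of an n x n x n grid of cubic voxels whose middle coincides with `center`."""
    offsets = (np.arange(n) - (n - 1) / 2) * size
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1) + center  # shape (n, n, n, 3)

def distance_channel(centers, atom_coords, max_radius):
    """Normalized distance from each voxel center to the nearest atom of one category."""
    if len(atom_coords) == 0:
        # Assumption: a category with no atoms is encoded as the maximum distance.
        return np.ones(centers.shape[:-1])
    diff = centers[..., None, :] - atom_coords            # (n, n, n, A, 3)
    nearest = np.linalg.norm(diff, axis=-1).min(axis=-1)  # (n, n, n)
    # Assumption: distances beyond the scan radius are clipped before normalizing.
    return np.minimum(nearest, max_radius) / max_radius

def distance_channel_tensor(centers, atoms_by_category, max_radius):
    """Stack one distance channel per amino acid category: (categories, n, n, n)."""
    return np.stack([distance_channel(centers, a, max_radius) for a in atoms_by_category])

centers = voxel_centers(np.zeros(3), n=3, size=1.0)
alanine_channel = distance_channel(centers, np.array([[0.0, 0.0, 0.0]]), max_radius=4.0)
tensor = distance_channel_tensor(
    centers, [np.array([[0.0, 0.0, 0.0]]), np.zeros((0, 3))], max_radius=4.0
)
```

- With twenty-one lists of atom coordinates, `distance_channel_tensor` would return the 21x3x3x3 layout described for the distance channel tensor.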
- Figure 6 shows an example of twenty-one amino acid-wise distance channels 600. Each column in Figure 6 corresponds to a respective one of the twenty-one amino acid-wise distance channels 602-642.
- Each amino acid-wise distance channel comprises a distance value for each of the voxels 514 of the voxel grid 522.
- the amino acid-wise distance channel 602 for Alanine (A) comprises distance values for respective ones of the voxels 514 of the voxel grid 522.
- the voxel grid 522 is a 3D grid of volume 3x3x3 and comprises twenty-seven voxels.
- each amino acid-wise distance channel may comprise twenty-seven voxel-wise distance values for the 3x3x3 voxel grid.
- the technology disclosed uses a directionality parameter to specify the directionality of the reference amino acids in the reference amino acid sequence 202. In some implementations, the technology disclosed uses the directionality parameter to specify the directionality of the alternative amino acids in the alternative amino acid sequence 212. In some implementations, the technology disclosed uses the directionality parameter to specify the position in the protein 200 that experiences the target variant at the amino acid level.
- all the distance values in the twenty-one amino acid-wise distance channels 602-642 are measured from respective nearest atoms to the voxels 514 in the voxel grid 522.
- These nearest atoms originate from one of the reference amino acids in the reference amino acid sequence 202.
- These originating reference amino acids, which contain the nearest atoms, can be classified into two categories: (1) those originating reference amino acids that precede the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202 and (2) those originating reference amino acids that succeed the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202.
- the originating reference amino acids in the first category can be called preceding reference amino acids.
- the originating reference amino acids in the second category can be called succeeding reference amino acids.
- the directionality parameter is applied to those distance values in the twenty-one amino acid-wise distance channels 602-642 that are measured from those nearest atoms that originate from the preceding reference amino acids. In one implementation, the directionality parameter is multiplied with such distance values.
- the directionality parameter can be any number, such as -1.
- the twenty-one amino acid-wise distance channels 600 include some distance values that indicate to the pathogenicity classifier which end of the protein 200 is the start terminal and which end is the end terminal. This also allows the pathogenicity classifier to reconstruct a protein sequence from the 3D protein structure information supplied by the distance channels and the reference and allele channels.
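- The directionality parameter can be illustrated with a short sketch. The function name `apply_directionality` and its arguments are hypothetical; the sketch assumes that, for each distance value, the sequence position of the residue containing the nearest atom is known.

```python
import numpy as np

def apply_directionality(distances, nearest_positions, variant_position, direction=-1.0):
    """Multiply by `direction` the distances whose nearest atom belongs to a
    residue that precedes the variant position in the reference sequence."""
    distances = np.asarray(distances, dtype=float)
    preceding = np.asarray(nearest_positions) < variant_position
    return np.where(preceding, direction * distances, distances)

# Toy values: three voxels whose nearest atoms sit at positions 10, 16, and 20,
# with the variant at position 16.
signed = apply_directionality([0.2, 0.5, 0.7], [10, 16, 20], variant_position=16)
```

- With a parameter of -1, the sign of a distance value tells the classifier whether the nearest atom lies before the variant in the sequence.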
- FIG. 7 is a schematic diagram of a distance channel tensor 700.
- Distance channel tensor 700 is a voxelized representation of the amino acid-wise distance channels 600 from Figure 6.
- the twenty-one amino acid-wise distance channels 602-642 are concatenated voxel-wise, like RGB channels of a color image.
- the voxelized dimensionality of the distance channel tensor 700 is 21x3x3x3 (where 21 denotes the twenty-one amino acid categories and 3x3x3 denotes the 3D voxel grid with twenty-seven voxels), although Figure 7 is a 2D depiction of dimensionality 21x3x3.
- Figure 8 shows one-hot encodings 800 of the reference amino acid 204 and the alternative amino acid 214.
- left column is a one-hot encoding 802 of the reference amino acid Glycine (G) 204, with one for the Glycine amino acid category and zeros for all other amino acid categories.
- right column is a one-hot encoding 804 of the variant/alternative amino acid Alanine (A) 214, with one for the Alanine amino acid category and zeros for all other amino acid categories.
- Figure 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid 902 and a voxelized one-hot encoded variant/alternative amino acid 912.
- the voxelized one-hot encoded reference amino acid 902 is a voxelized representation of the one-hot encoding 802 of the reference amino acid Glycine (G) 204 from Figure 8.
- the voxelized one-hot encoded alternative amino acid 912 is a voxelized representation of the one-hot encoding 804 of the variant/alternative amino acid Alanine (A) 214 from Figure 8.
- the voxelized dimensionality of the voxelized one-hot encoded reference amino acid 902 is 21x1x1x1 (where 21 denotes the twenty-one amino acid categories); although Figure 9 is a 2D depiction of dimensionality 21x1x1.
- the voxelized dimensionality of the voxelized one-hot encoded alternative amino acid 912 is 21x1x1x1 (where 21 denotes the twenty-one amino acid categories); although Figure 9 is a 2D depiction of dimensionality 21x1x1.
- Figure 10 schematically illustrates a concatenation process 1000 that voxel-wise concatenates the distance channel tensor 700 of Figure 7 and a reference allele tensor 1004.
- the reference allele tensor 1004 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded reference amino acid 902 from Figure 9.
- multiple copies of the voxelized one-hot encoded reference amino acid 902 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 522, such that the reference allele tensor 1004 has a corresponding copy of the voxelized one-hot encoded reference amino acid 902 for each of the voxels 514 in the voxel grid 522.
- the concatenation process 1000 produces a concatenated tensor 1010.
- the voxelized dimensionality of the reference allele tensor 1004 is 21x3x3x3 (where 21 denotes the twenty-one amino acid categories and 3x3x3 denotes the 3D voxel grid with twenty-seven voxels); although Figure 10 is a 2D depiction of the reference allele tensor 1004 having dimensionality 21x3x3.
- the voxelized dimensionality of the concatenated tensor 1010 is 42x3x3x3, although Figure 10 is a 2D depiction of the concatenated tensor 1010 having dimensionality 42x3x3.
- Figure 11 schematically illustrates a concatenation process 1100 that voxel-wise concatenates the distance channel tensor 700 of Figure 7, the reference allele tensor 1004 of Figure 10, and an alternative allele tensor 1104.
- the alternative allele tensor 1104 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded alternative amino acid 912 from Figure 9.
- multiple copies of the voxelized one-hot encoded alternative amino acid 912 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 522, such that the alternative allele tensor 1104 has a corresponding copy of the voxelized one-hot encoded alternative amino acid 912 for each of the voxels 514 in the voxel grid 522.
- the concatenation process 1100 produces a concatenated tensor 1110.
- the voxelized dimensionality of the alternative allele tensor 1104 is 21x3x3x3 (where 21 denotes the twenty-one amino acid categories and 3x3x3 denotes the 3D voxel grid with twenty-seven voxels); although Figure 11 is a 2D depiction of the alternative allele tensor 1104 having dimensionality 21x3x3.
- the voxelized dimensionality of the concatenated tensor 1110 is 63x3x3x3; although Figure 11 is a 2D depiction of the concatenated tensor 1110 having dimensionality 63x3x3.
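- The one-hot encoding, voxel-wise replication, and concatenation steps of Figures 8 to 11 can be sketched as follows. This is an illustrative sketch only: the ordering of the categories and the placeholder `X` for the twenty-first category are assumptions, and the distance channel tensor is filled with zeros as a stand-in.

```python
import numpy as np

# Assumption: twenty standard amino acids plus one placeholder category "X".
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]

def one_hot(amino_acid):
    """One-hot vector of length 21 for a single amino acid category."""
    vector = np.zeros(len(AMINO_ACIDS))
    vector[AMINO_ACIDS.index(amino_acid)] = 1.0
    return vector

def allele_tensor(amino_acid, grid_shape=(3, 3, 3)):
    """Replicate the one-hot vector at every voxel: shape (21, 3, 3, 3)."""
    encoded = one_hot(amino_acid)[:, None, None, None]
    return np.broadcast_to(encoded, (len(AMINO_ACIDS), *grid_shape)).copy()

distance_tensor = np.zeros((21, 3, 3, 3))  # stand-in for the distance channel tensor
reference_tensor = allele_tensor("G")      # reference amino acid Glycine
alternative_tensor = allele_tensor("A")    # alternative amino acid Alanine

# Voxel-wise concatenation along the channel axis gives 21 + 21 + 21 = 63 channels.
concatenated = np.concatenate(
    [distance_tensor, reference_tensor, alternative_tensor], axis=0
)
```

- Concatenating along the first axis keeps the 3x3x3 spatial layout intact, so every voxel carries its own distances together with the same reference and alternative allele encoding.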
- the runtime logic 184 processes the concatenated tensor 1110 through the pathogenicity classifier to determine a pathogenicity of the variant/alternative amino acid Alanine (A) 214, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the variant/alternative amino acid Alanine (A) 214.
- the technology disclosed concatenates the distance channel tensor 700, the reference allele tensor 1004, and the alternative allele tensor 1104 with evolutionary channels.
- One example of the evolutionary channels is pan-amino acid conservation frequencies.
- Another example of the evolutionary channels is per-amino acid conservation frequencies.
- the evolutionary channels are constructed using position weight matrices (PWMs). In other implementations, the evolutionary channels are constructed using position specific frequency matrices (PSFMs). In yet other implementations, the evolutionary channels are constructed using computational tools like SIFT, PolyPhen, and PANTHER-PSEC. In yet other implementations, the evolutionary channels are preservation channels based on evolutionary preservation. Preservation is related to conservation, as it also reflects the effect of negative selection that has acted to prevent evolutionary change at a given site in a protein.
- Figure 12 is a flow diagram that illustrates a process 1200 of a system for determining and assigning pan-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed.
- Figures 12, 13, 14, 15, 16, 17, and 18 are discussed in tandem.
- a similar sequence finder 1204 of the system retrieves amino acid sequences that are similar (homologous) to the reference amino acid sequence 202.
- the similar amino acid sequences can be selected from multiple species like primates, mammals, and vertebrates.
- an aligner 1214 of the system position-wise aligns the reference amino acid sequence 202 with the similar amino acid sequences, i.e., the aligner 1214 performs a multi-sequence alignment.
- Figure 14 shows an example multi-sequence alignment 1400 of the reference amino acid sequence 202 across ninety-nine species.
- the multi-sequence alignment 1400 can be partitioned, for example, to generate a first position frequency matrix 1402 for primates, a second position frequency matrix 1412 for mammals, and a third position frequency matrix 1422 for vertebrates.
- a single position frequency matrix is generated across the ninety-nine species.
- a pan-amino acid conservation frequency calculator 1224 of the system uses the multi-sequence alignment to determine pan-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.
- a nearest atom finder 1234 of the system finds nearest atoms to the voxels 514 in the voxel grid 522.
- the search for the voxel-wise nearest atoms may not be confined to any particular amino acid category or atom type. That is, the voxel-wise nearest atoms can be selected across the amino acid categories and the atom types, as long as they are the most proximate atoms to the respective voxel centers.
- the search for the voxel-wise nearest atoms may be confined to only a particular atom category, such as only to a particular atomic element like oxygen, nitrogen, and hydrogen, or only to alpha-carbon atoms, or only to beta-carbon atoms, or only to sidechain atoms, or only to backbone atoms.
- an amino acid selector 1244 of the system selects those reference amino acids in the reference amino acid sequence 202 that contain the nearest atoms identified at the step 1232.
- Such reference amino acids can be called nearest reference amino acids.
- Figure 13 shows an example of locating nearest atoms 1302 to the voxels 514 in the voxel grid 522 and respectively mapping nearest reference amino acids 1312 that contain the nearest atoms 1302 to the voxels 514 in the voxel grid 522. This is identified in Figure 13 as “voxels-to-nearest amino acids mapping 1300.”
- a voxelizer 1254 of the system voxelizes pan-amino acid conservation frequencies of the nearest reference amino acids.
- Figure 15 shows an example of determining a pan-amino acid conservation frequencies sequence for the first voxel (1, 1) in the voxel grid 522, also referred to herein as “per-voxel evolutionary profile determination 1500.”
- the nearest reference amino acid that was mapped to the first voxel (1, 1) is Aspartic acid (D) amino acid at position 15 in the reference amino acid sequence 202.
- the multi-sequence alignment of the reference amino acid sequence 202 with, for example, ninety-nine homologous amino acid sequences of the ninety-nine species is analyzed at position 15.
- Such a position-specific and cross-species analysis reveals how many instances of amino acids from each of the twenty-one amino acid categories are found at position 15 across the hundred aligned amino acid sequences (i.e., the reference amino acid sequence 202 plus the ninety-nine homologous amino acid sequences).
- the Aspartic acid (D) amino acid is found at position 15 in ninety-six out of the hundred aligned amino acid sequences. So, the Aspartic acid amino acid category 1504 is assigned a pan-amino acid conservation frequency of 0.96.
- the Valine (V) amino acid is found at position 15 in four out of the hundred aligned amino acid sequences. So, the Valine amino acid category 1514 is assigned a pan-amino acid conservation frequency of 0.04. Since no instances of amino acids from other amino acid categories are detected at position 15, the remaining amino acid categories are assigned a pan-amino acid conservation frequency of zero. This way, each of the twenty-one amino acid categories is assigned a respective pan-amino acid conservation frequency, which can be encoded in the pan-amino acid conservation frequencies sequence 1502 for the first voxel (1, 1).
- Figure 16 shows respective pan-amino acid conservation frequencies 1612-1692 determined for respective ones of the voxels 514 in the voxel grid 522 using the position frequency logic described in Figure 15, also referred to herein as “voxels-to-evolutionary profiles mapping 1600.”
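- The position-specific frequency calculation in this example can be reproduced with a short sketch. The function name `conservation_frequencies`, the category ordering, and the placeholder `X` for the twenty-first category are assumptions; the toy alignment contains only the single column of interest, and the code uses 0-based column indices whereas the text counts positions from 1.

```python
from collections import Counter

# Assumption: twenty standard amino acids plus one placeholder category "X".
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]

def conservation_frequencies(aligned_sequences, column):
    """Fraction of aligned sequences carrying each amino acid category at one column."""
    residues = [sequence[column] for sequence in aligned_sequences]
    counts = Counter(residues)
    return [counts.get(aa, 0) / len(residues) for aa in AMINO_ACIDS]

# Toy alignment of one hundred sequences, reduced to the column of interest:
# ninety-six carry Aspartic acid (D) and four carry Valine (V).
alignment = ["D"] * 96 + ["V"] * 4
frequencies = conservation_frequencies(alignment, column=0)
```

- The resulting list has one entry per amino acid category, which is the per-voxel sequence that is later voxelized.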
- Per-voxel evolutionary profiles 1602 are then used by the voxelizer 1254 to generate voxelized per-voxel evolutionary profiles 1700, illustrated in Figure 17.
- each of the voxels 514 in the voxel grid 522 has a different pan-amino acid conservation frequencies sequence and therefore a different voxelized per-voxel evolutionary profile because the voxels are regularly mapped to different nearest atoms and therefore to different nearest reference amino acids.
- Figure 18 depicts an example of an evolutionary profiles tensor 1800 in which the voxelized per-voxel evolutionary profiles 1700 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 522.
- the voxelized dimensionality of the evolutionary profiles tensor 1800 is 21x3x3x3 (where 21 denotes the twenty-one amino acid categories and 3x3x3 denotes the 3D voxel grid with twenty-seven voxels); although Figure 18 is a 2D depiction of the evolutionary profiles tensor 1800 having dimensionality 21x3x3.
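- The mapping from voxels to nearest reference amino acids and then to an evolutionary profiles tensor can be sketched as follows. The function name and argument layout are hypothetical; the sketch assumes that each atom is annotated with the index of the residue that contains it and that a per-position frequency table has already been computed from the alignment.

```python
import numpy as np

def evolutionary_profiles_tensor(centers, atom_coords, atom_residue_index, profiles):
    """Assign to every voxel the conservation profile of the residue that
    contains the atom nearest to the voxel center.

    centers:            (n, n, n, 3) voxel center coordinates
    atom_coords:        (A, 3) atom coordinates
    atom_residue_index: (A,) index of the residue containing each atom
    profiles:           (L, 21) conservation frequencies per sequence position
    returns:            (21, n, n, n)
    """
    distances = np.linalg.norm(centers[..., None, :] - atom_coords, axis=-1)
    nearest_atom = distances.argmin(axis=-1)            # (n, n, n)
    nearest_residue = atom_residue_index[nearest_atom]  # (n, n, n)
    per_voxel = profiles[nearest_residue]                # (n, n, n, 21)
    return np.moveaxis(per_voxel, -1, 0)

# Toy case: two voxels, two atoms, two residues with distinct profiles.
toy_centers = np.zeros((1, 1, 2, 3))
toy_centers[0, 0, 1] = [10.0, 0.0, 0.0]
toy_atoms = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
toy_residue_index = np.array([0, 1])
toy_profiles = np.eye(2, 21)
profile_tensor = evolutionary_profiles_tensor(
    toy_centers, toy_atoms, toy_residue_index, toy_profiles
)
```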
- the concatenator 174 voxel-wise concatenates the evolutionary profiles tensor 1800 with the distance channel tensor 700.
- the evolutionary profiles tensor 1800 is voxel-wise concatenated with the concatenated tensor 1110 to generate a further concatenated tensor of dimensionality 84x3x3x3 (not shown).
- the runtime logic 184 processes the further concatenated tensor of dimensionality 84x3x3x3 through the pathogenicity classifier to determine the pathogenicity of the target variant, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the target variant at the amino acid level.
- Figure 19 is a flow diagram that illustrates a process 1900 of a system for determining and assigning per-amino acid conservation frequencies of nearest atoms to voxels (voxelizing).
- the steps 1202 and 1212 are the same as in Figure 12.
- a per-amino acid conservation frequency calculator 1924 of the system uses the multi-sequence alignment to determine per-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.
- a nearest atom finder 1934 of the system finds, for each of the voxels 514 in the voxel grid 522, twenty-one nearest atoms across each of the twenty-one amino acid categories. Each of the twenty-one nearest atoms is different from each other because they are selected from different amino acid categories. This leads to the selection of twenty-one unique nearest reference amino acids for a particular voxel, which in turn leads to generation of twenty-one unique position frequency matrices for the particular voxel, and which in turn leads to determination of twenty-one unique per-amino acid conservation frequencies for the particular voxel.
- an amino acid selector 1944 of the system selects, for each of the voxels 514 in the voxel grid 522, twenty-one reference amino acids in the reference amino acid sequence 202 that contain the twenty-one nearest atoms identified at the step 1932.
- Such reference amino acids can be called nearest reference amino acids.
- a voxelizer 1954 of the system voxelizes per-amino acid conservation frequencies of the twenty-one nearest reference amino acids identified for the particular voxel at the step 1942.
- the twenty-one nearest reference amino acids are necessarily located at twenty-one different positions in the reference amino acid sequence 202 because they correspond to different underlying nearest atoms. Accordingly, for the particular voxel, twenty-one position frequency matrices can be generated for the twenty-one nearest reference amino acids.
- the twenty-one position frequency matrices can be generated across multiple species whose homologous amino acid sequences are position-wise aligned with the reference amino acid sequence 202, as discussed above with respect to Figures 12 to 15.
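- One possible reading of the per-amino acid procedure is sketched below for a single voxel. The function name and arguments are hypothetical, and two points are assumptions: the frequency taken for each category is that category's own frequency at the position of its nearest residue, and categories absent from the structure are assigned zero.

```python
import numpy as np

def per_amino_acid_frequencies(center, atom_coords, atom_residue_index,
                               residue_category, profiles):
    """For one voxel, find the nearest atom within each amino acid category and
    look up that category's conservation frequency at the residue's position.

    center:             (3,) voxel center
    atom_coords:        (A, 3) atom coordinates
    atom_residue_index: (A,) index of the residue containing each atom
    residue_category:   (L,) category index of each reference residue
    profiles:           (L, C) conservation frequencies per position and category
    returns:            (C,)
    """
    n_categories = profiles.shape[1]
    result = np.zeros(n_categories)
    distances = np.linalg.norm(atom_coords - center, axis=1)
    atom_category = residue_category[atom_residue_index]
    for category in range(n_categories):
        mask = atom_category == category
        if mask.any():
            nearest_atom = np.flatnonzero(mask)[distances[mask].argmin()]
            position = atom_residue_index[nearest_atom]
            result[category] = profiles[position, category]
    return result

# Toy case: three residues (categories 0, 0, 1) with one atom each.
toy_atoms = np.array([[5.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
toy_residue_index = np.array([0, 1, 2])
toy_category = np.array([0, 0, 1])
toy_profiles = np.zeros((3, 21))
toy_profiles[0, 0], toy_profiles[1, 0], toy_profiles[2, 1] = 0.5, 0.9, 0.7
voxel_frequencies = per_amino_acid_frequencies(
    np.zeros(3), toy_atoms, toy_residue_index, toy_category, toy_profiles
)
```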
- Figure 20 shows various examples of voxelized annotation channels 2000 that are concatenated with the distance channel tensor 700.
- the voxelized annotation channels are one-hot indicators for different protein annotations, for example whether an amino acid (residue) is part of a transmembrane region, a signal peptide, an active site, or any other binding site, or whether the residue is subject to posttranslational modifications, PathRatio (See Pei P, Zhang A: A Topological Measurement for Weighted Protein Interaction Network. CSB 2005, 268-278.), etc. Additional examples of the annotation channels can be found below in the Particular Implementations section and in the Claims.
- the voxelized annotation channels are arranged voxel-wise such that the voxels can have a same annotation sequence like the voxelized reference allele and alternative allele sequences (e.g., annotation channels 2002, 2004, 2006), or the voxels can have respective annotation sequences like the voxelized per-voxel evolutionary profiles 1700 (e.g., annotation channels 2012, 2014, 2016 (as indicated by different colors)).
- annotation channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similar to the pan-amino acid conservation frequencies discussed with respect to Figures 12 to 18.
- the technology disclosed can also concatenate various voxelized structural confidence channels with the distance channel tensor 700.
- the TM-scores provide
- the voxelized structural confidence channels are arranged voxel-wise such that the voxels can have a same structural confidence sequence like the voxelized reference allele and alternative allele sequences, or the voxels can have respective structural confidence sequences like the voxelized per-voxel evolutionary profiles 1700.
- the structural confidence channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similar to the pan-amino acid conservation frequencies discussed with respect to Figures 12 to 18.
- Figure 21 illustrates different combinations and permutations of input channels that can be provided as inputs 2102 to a pathogenicity classifier 2108 for a pathogenicity determination 2106 of a target variant.
- One of the inputs 2102 can be distance channels 2104 generated by a distance channels generator 2272.
- Figure 22 shows different methods of calculating the distance channels 2104.
- the distance channels 2104 are generated based on distances 2202 between voxel centers and atoms across a plurality of atomic elements irrespective of amino acids.
- the distances 2202 are normalized by a maximum scan radius to generate normalized distances 2202a.
- the distance channels 2104 are generated based on distances 2212 between voxel centers and alpha-carbon atoms on an amino acid-basis. In some implementations, the distances 2212 are normalized by the maximum scan radius to generate normalized distances 2212a. In yet another implementation, the distance channels 2104 are generated based on distances 2222 between voxel centers and beta-carbon atoms on an amino acid-basis. In some implementations, the distances 2222 are normalized by the maximum scan radius to generate normalized distances 2222a. In yet another implementation, the distance channels 2104 are generated based on distances 2232 between voxel centers and side chain atoms on an amino acid-basis.
- the distances 2232 are normalized by the maximum scan radius to generate normalized distances 2232a.
- the distance channels 2104 are generated based on distances 2242 between voxel centers and backbone atoms on an amino acid-basis.
- the distances 2242 are normalized by the maximum scan radius to generate normalized distances 2242a.
- the distance channels 2104 are generated based on distances 2252 (one feature) between voxel centers and the respective nearest atoms irrespective of atom type and amino acid type.
- the distance channels 2104 are generated based on distances 2262 (one feature) between voxel centers and atoms from non-standard amino acids.
- the distances between the voxels and the atoms are calculated based on polar coordinates of the voxels and the atoms.
- the polar coordinates are parameterized by angles between the voxels and the atoms.
- this angle information is used to generate an angle channel for the voxels (i.e., independent of the distance channels).
- angles between a nearest atom and neighboring atoms e.g., backbone atoms
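- As a hedged illustration of the polar-coordinate parameterization, the offset from a voxel center to an atom can be expressed as a radial distance plus two angles. The spherical convention used here (polar angle from the z-axis, azimuth in the x-y plane) and the function name `spherical_offset` are assumptions; the text does not specify a convention.

```python
import numpy as np

def spherical_offset(voxel_center, atom_coord):
    """Return (r, theta, phi) of an atom relative to a voxel center.

    r is the Euclidean distance, theta the polar angle from the z-axis, and
    phi the azimuth in the x-y plane (a conventional choice, assumed here).
    """
    delta = np.asarray(atom_coord, dtype=float) - np.asarray(voxel_center, dtype=float)
    r = float(np.linalg.norm(delta))
    theta = float(np.arccos(delta[2] / r)) if r > 0 else 0.0
    phi = float(np.arctan2(delta[1], delta[0]))
    return r, theta, phi

r, theta, phi = spherical_offset([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

- The radial component could populate a distance channel while the two angles populate separate angle channels.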
- Another one of the inputs 2102 can be a feature 2114 indicating missing atoms within a specified radius.
- Another one of the inputs 2102 can be one-hot encoding 2124 of the reference amino acid. Another one of the inputs 2102 can be one-hot encoding 2134 of the variant/alternative amino acid.
- Another one of the inputs 2102 can be evolutionary channels 2144 generated by an evolutionary profiles generator 2372, shown in Figure 23. In one implementation, the evolutionary channels 2144 can be generated based on pan-amino acid conservation frequencies 2302. In another implementation, the evolutionary channels 2144 can be generated based on per-amino acid conservation frequencies 2312.
- Another one of the inputs 2102 can be a feature 2154 indicating missing residue or missing evolutionary profile.
- Another one of the inputs 2102 can be annotations channels 2164 generated by an annotations generator 2472, shown in Figure 24.
- the annotations channels 2164 can be generated based on molecular processing annotations 2402.
- the annotations channels 2164 can be generated based on regions annotations 2412.
- the annotations channels 2164 can be generated based on sites annotations 2422.
- the annotations channels 2164 can be generated based on amino acid modifications annotations 2432.
- the annotations channels 2164 can be generated based on secondary structure annotations 2442.
- the annotations channels 2164 can be generated based on experimental information annotations 2452.
- Another one of the inputs 2102 can be structure confidence channels 2174 generated by a structure confidence generator 2572, shown in Figure 25.
- the structure confidence channels 2174 can be generated based on global model quality estimations (GMQEs) 2502.
- the structure confidence channels 2174 can be generated based on qualitative model energy analysis (QMEAN) scores 2512.
- the structure confidence channels 2174 can be generated based on temperature factors 2522.
- the structure confidence channels 2174 can be generated based on template modeling scores 2542. Examples of the template modeling scores 2542 include minimum template modeling scores 2542a, mean template modeling scores 2542b, and maximum template modeling scores 2542c.
- any permutation and combination of the input channels can be concatenated into an input for processing through the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant.
- only a subset of the input channels may be concatenated.
- the input channels can be concatenated in any order.
- the input channels can be concatenated into a single tensor by a tensor generator (input encoder) 2110. This single tensor can then be provided as input to the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant.
- the pathogenicity classifier 2108 uses convolutional neural networks (CNNs) with a plurality of convolution layers.
- the pathogenicity classifier 2108 uses recurrent neural networks (RNNs) such as long short-term memory networks (LSTMs), bidirectional LSTMs (Bi-LSTMs), and gated recurrent units (GRUs).
- the pathogenicity classifier 2108 uses both the CNNs and the RNNs.
- the pathogenicity classifier 2108 uses graph-convolutional neural networks that model dependencies in graph-structured data.
- the pathogenicity classifier 2108 uses variational autoencoders (VAEs).
- the pathogenicity classifier 2108 uses generative adversarial networks (GANs).
- the pathogenicity classifier 2108 can also be a language model based, for example, on self-attention such as the one implemented by Transformers and BERTs.
- the pathogenicity classifier 2108 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1 x 1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions.
- It can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD).
- It can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, attention mechanisms, and Gaussian error linear units.
- the pathogenicity classifier 2108 is trained using backpropagation-based gradient update techniques.
- Example gradient descent techniques that can be used for training the pathogenicity classifier 2108 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent.
- Some examples of gradient descent optimization algorithms that can be used to train the pathogenicity classifier 2108 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
- the pathogenicity classifier 2108 can be trained by unsupervised learning, semi-supervised learning, self-learning, reinforcement learning, multitask learning, multimodal learning, transfer learning, knowledge distillation, and so on.
- FIG. 26 shows an example processing architecture 2600 of the pathogenicity classifier 2108, in accordance with one implementation of the technology disclosed.
- the processing architecture 2600 includes a cascade of processing modules 2606, 2610, 2614, 2618, 2622, 2626, 2630, 2634, 2638, and 2642, each of which can include 1D convolutions (1x1x1 CONV), 3D convolutions (3x3x3 CONV), ReLU non-linearity, and batch normalization (BN).
- Other examples of the processing modules include fully-connected (FC) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class.
- In Figure 26, “64” denotes the number of convolution filters applied by a particular processing module.
- the size of an input voxel 2602 is 15x15x15x8.
- Figure 26 also shows respective volumetric dimensionalities of the intermediate inputs 2604, 2608, 2612, 2616, 2620, 2624, 2628, 2632, 2636, and 2640 generated by the processing architecture 2600.
- FIG. 27 shows an example processing architecture 2700 of the pathogenicity classifier 2108, in accordance with one implementation of the technology disclosed.
- the processing architecture 2700 includes a cascade of processing modules 2708, 2714, 2720, 2726, 2732, 2738, 2744, 2750, 2756, 2762, 2768, 2774, and 2780 such as 1D convolutions (CONV 1D), 3D convolutions (CONV 3D), ReLU non-linearity, and batch normalization (BN).
- Other examples of the processing modules include fully-connected (dense) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class.
- Figure 27 denotes a number of convolution filters applied by a particular processing module.
- the size of an input voxel 2704 supplied by an input layer 2702 is 7x7x7x108.
- Figure 27 also shows respective volumetric dimensionalities of the intermediate inputs 2710, 2716, 2722, 2728, 2734, 2740, 2746, 2752, 2758, 2764, 2770, 2776, and 2782 and the resulting intermediate outputs 2706, 2712, 2718, 2724, 2730, 2736, 2742, 2748, 2754, 2760, 2766, 2772, 2778, and 2784 generated by the processing architecture 2700.
- the variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as “PrimateAI 3D.”
- “PrimateAI” is a commonly owned and previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned US Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149 and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).
- Figures 28, 29, 30, 31A use PrimateAI as a benchmark model to demonstrate PrimateAI 3D’s classification superiority over PrimateAI.
- the performance results in Figures 28, 29, 30, 31A, and 31B are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets.
- PrimateAI 3D is trained on training sets that are different from the plurality of validation sets.
- PrimateAI 3D is trained on common human variants and variants from primates, which are used as a benign dataset, while simulated variants based on trinucleotide context are used as an unlabeled or pseudo-pathogenic dataset.
- New developmental delay disorder (new DDD) is one example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI.
- the new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with the DDD as benign.
- a similar labelling scheme is used with an autism spectrum disorder (ASD) validation set shown in Figures 31A and 31B.
- BRCA1 is another example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI.
- the BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants.
- a similar labelling scheme is used with different validation sets of the TP53 gene, TP53S3 gene and its variants, and other genes and their variants shown in Figures 31A and 31B.
- Figure 28 identifies performance of the benchmark PrimateAI model with blue horizontal bars and performance of the disclosed PrimateAI 3D model with orange horizontal bars. Green horizontal bars depict pathogenicity predictions derived by combining respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model.
- ens 10 denotes an ensemble of ten PrimateAI 3D models, each trained with a different seed training dataset and randomly initialized with different weights and biases.
- 7x7x7x2 depicts the size of the voxel grid used to encode the input channels during the training of the ensemble of ten PrimateAI 3D models.
- the ensemble of ten PrimateAI 3D models respectively generates ten pathogenicity predictions, which are subsequently combined (e.g., by averaging) to generate a final pathogenicity prediction for the given variant.
- This logic applies analogously to ensembles of different group sizes.
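Combining the per-model predictions by averaging, as mentioned above, can be sketched as follows. The ten score values and the function name `ensemble_prediction` are hypothetical and serve only to illustrate the arithmetic.

```python
def ensemble_prediction(model_scores):
    """Average the pathogenicity scores produced by the ensemble members."""
    if not model_scores:
        raise ValueError("at least one model score is required")
    return sum(model_scores) / len(model_scores)


# Hypothetical scores from ten independently seeded models for one variant.
scores = [0.91, 0.87, 0.93, 0.89, 0.90, 0.88, 0.92, 0.86, 0.94, 0.90]
final_score = ensemble_prediction(scores)
```

The same function applies unchanged to an ensemble of any size, for example twenty models.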
- the y-axis has the different validation sets and the x-axis has p-values. Greater p-values, i.e., longer horizontal bars, denote greater accuracy in differentiating benign variants from pathogenic variants.
- PrimateAI 3D outperforms PrimateAI across most of the validation sets (only exception being the tp53s3_A549 validation set). That is, the orange horizontal bars for PrimateAI 3D are consistently longer than the blue horizontal bars for PrimateAI.
- a “mean” category along the y-axis calculates the mean of the p-values determined for each of the validation sets.
- PrimateAI 3D outperforms PrimateAI.
- PrimateAI is represented by blue horizontal bars
- an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 3x3x3 is represented by red horizontal bars
- an ensemble of ten PrimateAI 3D models trained with a voxel grid of size 7x7x7x2 is represented by purple horizontal bars
- an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 7x7x7x2 is represented by brown horizontal bars
- an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 17x17x17x2 is represented by purple horizontal bars.
- the y-axis has the different validation sets and the x-axis has p-values.
- greater p-values, i.e., longer horizontal bars, denote greater accuracy in differentiating benign variants from pathogenic variants.
- a “mean” category along the y-axis calculates the mean of the p-values determined for each of the validation sets. In the mean category as well, the different configurations of PrimateAI 3D outperform PrimateAI.
- the red vertical bars represent PrimateAI
- the cyan vertical bars represent PrimateAI 3D.
- the y-axis has p-values
- the x-axis has the different validation sets.
- PrimateAI 3D consistently outperforms PrimateAI across all of the validation sets. That is, the cyan vertical bars for PrimateAI 3D are always longer than the red vertical bars for PrimateAI.
- Figures 31A and 31B identify performance of the benchmark PrimateAI model with blue vertical bars and performance of the disclosed PrimateAI 3D model with orange vertical bars. Green vertical bars depict pathogenicity predictions derived by combining respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model.
- the y-axis has p-values
- the x-axis has the different validation sets.
- PrimateAI 3D outperforms PrimateAI across most of the validation sets (only exception being the tp53s3_A549_p53NULL_Nutlin-3 validation set). That is, the orange vertical bars for PrimateAI 3D are consistently longer than the blue vertical bars for PrimateAI.
- the mean statistics may be biased by outliers.
- a separate “method ranks” chart is also depicted in Figures 31A and 31B. Higher rank denotes poorer classification accuracy.
- PrimateAI 3D outperforms PrimateAI by having more counts of lower ranks 1 and 2 versus PrimateAI having all 3s.
- Figure 32 is a flowchart illustrating an efficient voxelization process 3200 that efficiently identifies nearest atoms on a voxel-by-voxel basis.
- the reference amino acid sequence 202 can contain different types of atoms, such as alpha-carbon atoms, beta-carbon atoms, oxygen atoms, nitrogen atoms, hydrogen atoms, and so on. Accordingly, as discussed above, the distance channels can be arranged by nearest alpha-carbon atoms, nearest beta-carbon atoms, nearest oxygen atoms, nearest nitrogen atoms, nearest hydrogen atoms, and so on. For example, in Figure 6, each of the nine voxels 514 has twenty-one amino acid-wise distance channels for nearest alpha-carbon atoms.
- Figure 6 can be further expanded for each of the nine voxels 514 to also have twenty-one amino acid-wise distance channels for nearest beta-carbon atoms, and for each of the nine voxels 514 to also have a nearest generic atom distance channel for a nearest atom irrespective of the type of the atom and the type of the amino acid. This way, each of the nine voxels 514 can have forty-three distance channels.
- the size of the data for 32 million voxelizations is too big to fit in main memory (e.g., >20 TB for a 15x15x15 voxel grid).
- the memory footprint of the voxelization process gets too big to be stored on disk, making the voxelization process a part of the model training and not a precomputation step.
- the technology disclosed provides an efficient voxelization process that achieves up to ~100x speedup over the runtime complexity of O(#atoms * #voxels).
- the disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms).
- the disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms * #attributes).
- the voxelization process becomes as fast as model training, shifting the computational bottleneck from voxelization back to computing neural network weights on processors such as GPUs, ASICs, TPUs, FPGAs, CGRAs, etc.
- the runtime complexity for a single protein voxelization is O(#atoms + voxels) and O(#atoms * #attributes + voxels) for the case of different features or channels per voxel.
- the “+ voxels” complexity is observed when the number of atoms is minuscule compared to the number of voxels, for example, when there is one atom in a 100x100x100 voxel grid (i.e., one million voxels per atom).
- the runtime is dominated by the overhead of the huge number of voxels, for example, allocating the memory for one million voxels, initializing one million voxels to zero, etc.
- each atom (e.g., each of the 828 alpha-carbon atoms and each of the 828 beta-carbon atoms) is associated with a voxel that contains the atom (e.g., one of the nine voxels 514).
- the term “contains” refers to the 3D atomic coordinates of the atom being located in the voxel.
- the voxel that contains the atom is also referred to herein as “the atom-containing voxel.”
- Figures 32B and 33 describe how a voxel that contains a particular atom is selected.
- Figure 33 uses 2D atomic coordinates as representative of 3D atomic coordinates. Note that the voxel grid 522 is regularly spaced with each of the voxels 514 having a same step size (e.g., 1 angstrom (Å) or 2 Å).
- the voxel grid 522 has magenta indices [0, 1, 2] along a first dimension (e.g., x-axis) and cyan indices [0, 1, 2] along a second dimension (e.g., y-axis).
- the respective voxels 514 in the voxel grid 522 are identified by green voxel indices [Voxel 0, Voxel 1, ..., Voxel 8] and by black voxel center indices [(1, 1), (1, 2), ..., (3, 3)].
- center coordinates of the voxel centers along the first dimension i.e., first dimension voxel coordinates
- center coordinates of the voxel centers along the second dimension i.e., second dimension voxel coordinates
- At step 3202a (Step 1 in Figure 33), 3D atomic coordinates (1.7456, 2.14323) of the particular atom are quantized to generate quantized 3D atomic coordinates (1.7, 2.1).
- the quantization can be achieved by rounding or truncation of bits.
- voxel coordinates (or voxel centers or voxel center coordinates) of the voxels 514 are assigned to the quantized 3D atomic coordinates on a dimension-basis.
- the quantized atomic coordinate 1.7 is assigned to Voxel 1 because it covers first dimension voxel coordinates ranging from 1 to 2 and is centered at 1.5 in the first dimension.
- Voxel 1 has index 1 along the first dimension, in contrast to having index 0 along the second dimension.
- the voxel grid 522 is traversed along the second dimension.
- At step 3202c (Step 3 in Figure 33), dimension indices corresponding to the assigned voxel coordinates are selected. That is, for Voxel 1, index 1 is selected along the first dimension, and, for Voxel 7, index 2 is selected along the second dimension.
- an accumulated sum is generated based on position-wise weighting the selected dimension indices by powers of a radix.
- The general idea behind positional numbering systems is that a numeric value is represented through increasing powers of the radix (or base), for example, binary is base two, ternary is base three, octal is base eight, and hexadecimal is base sixteen. This is often referred to as a weighted numbering system because each position is weighted by a power of the radix.
- the set of valid numerals for a positional numbering system is equal in size to the radix of that system.
- For example, there are ten digits in the decimal system, zero through nine, and three digits in the ternary system, zero, one, and two.
- the largest valid number in a radix system is one smaller than the radix (so eight is not a valid numerical in any radix system smaller than nine). Any decimal integer can be expressed exactly in any other integral base system, and vice-versa.
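The positional weighting described above can be illustrated with a short round-trip conversion between an integer and its digits in an arbitrary radix. The function names `to_digits` and `from_digits` are hypothetical helpers introduced for this sketch.

```python
def to_digits(value, radix):
    """Digits of a non-negative integer in the given radix, most significant first."""
    if radix < 2 or value < 0:
        raise ValueError("radix must be >= 2 and value must be non-negative")
    digits = []
    while True:
        value, remainder = divmod(value, radix)
        digits.append(remainder)
        if value == 0:
            break
    return digits[::-1]


def from_digits(digits, radix):
    """Accumulate digits weighted by powers of the radix into one integer."""
    total = 0
    for digit in digits:
        if not 0 <= digit < radix:
            raise ValueError("every digit must be smaller than the radix")
        total = total * radix + digit
    return total


ternary_digits = to_digits(5, 3)       # [1, 2], since 5 = 1 * 3 + 2 * 1
decimal_value = from_digits([1, 2], 3)  # 5
```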
- the selected dimension indices 1 and 2 are converted to a single integer by position-wise multiplying them with respective powers of base three and summing the results of the position-wise multiplications.
- Base three is selected here because the 3D atomic coordinates have three dimensions (although Figure 33 shows only 2D atomic coordinates along two dimensions for simplicity’s sake).
- Since index 2 is positioned at the rightmost digit (i.e., the least significant digit), it is multiplied by three to the power of zero to yield two. Since index 1 is positioned at the second rightmost digit (i.e., the second least significant digit), it is multiplied by three to the power of one to yield three. This results in the accumulated sum being five.
- At step 3202e (Step 5 in Figure 33), based on the accumulated sum, a voxel index of the voxel containing the particular atom is selected. That is, the accumulated sum is interpreted as the voxel index of the voxel containing the particular atom.
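The five steps above can be sketched end to end for the 2D example of Figure 33. This is a minimal sketch under stated assumptions: the grid is assumed to start at coordinate zero with a step size of 1, quantization is done by rounding to one decimal place, and the function name `containing_voxel_index` is hypothetical. The radix is three as in the text, which in this example also equals the number of voxels per dimension.

```python
import math


def containing_voxel_index(coords, step=1.0, radix=3, decimals=1):
    """Flat index of the voxel containing an atom, found without distance calculations."""
    # Step 1: quantize the atomic coordinates.
    quantized = [round(c, decimals) for c in coords]
    # Steps 2 and 3: select, per dimension, the index of the voxel whose
    # range covers the quantized coordinate (assumes the grid starts at 0).
    indices = [math.floor(q / step) for q in quantized]
    if any(not 0 <= i < radix for i in indices):
        raise ValueError("atom lies outside the voxel grid")
    # Step 4: accumulate the indices, weighting each position by a power of
    # the radix; the last dimension is the least significant position.
    accumulated = 0
    for i in indices:
        accumulated = accumulated * radix + i
    # Step 5: the accumulated sum is interpreted as the voxel index.
    return accumulated


voxel_index = containing_voxel_index((1.7456, 2.14323))  # 1 * 3 + 2 * 1 = 5
```

Because each atom is handled with a constant number of arithmetic operations, the cost of locating atom-containing voxels grows with the number of atoms rather than with the number of atom-voxel pairs.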
- each atom is further associated with one or more voxels that are in a neighborhood of the atom-containing voxel, also referred to herein as “neighborhood voxels.”
- the neighborhood voxels can be selected based on being within a predefined radius of the atom-containing voxel (e.g., 5 angstrom (Å)). In other implementations, the neighborhood voxels can be selected based on being contiguously adjacent to the atom-containing voxel (e.g., top, bottom, right, left adjacent voxels).
- a first alpha-carbon atom is associated with a first subset of voxels 3404 that includes an atom-containing voxel and neighborhood voxels for the first alpha-carbon atom.
- a second alpha-carbon atom is associated with a second subset of voxels 3406 that includes an atom-containing voxel and neighborhood voxels for the second alpha-carbon atom.
- the atom-containing voxel is selected by virtue of the spatial arrangement of the voxels that allows assignment of quantized 3D atomic coordinates to corresponding regularly spaced voxel centers in the voxel grid (without using any distance calculations).
- the neighborhood voxels are selected by virtue of being spatially contiguous to the atom-containing voxel in the voxel grid (again without using any distance calculations).
- each voxel is mapped to atoms to which it was associated at steps 3202 and 3212.
- this mapping is encoded in a voxel-to-atoms mapping 3412, which is generated based on the atom-to-voxels mapping 3402 (e.g., by applying a voxel-based sorting key on the atom-to-voxels mapping 3402).
- the voxel-to-atoms mapping 3412 is also referred to herein as “cell-to-elements mapping.”
- a first voxel is mapped to a first subset of alpha-carbon atoms 3414 that includes alpha-carbon atoms associated with the first voxel at steps 3202 and 3212.
- a second voxel is mapped to a second subset of alpha-carbon atoms 3416 that includes alpha-carbon atoms associated with the second voxel at steps 3202 and 3212.
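The two mappings described above can be sketched as follows. The sketch makes several assumptions: the grid starts at coordinate zero, atoms lie inside the grid, neighborhood voxels are the face-adjacent voxels, and the atom identifiers and coordinates are invented for illustration. The text mentions a voxel-based sorting key for producing the voxel-to-atoms mapping; here a dictionary grouping is used instead, which yields the same grouping of atoms per voxel.

```python
import math
from collections import defaultdict


def atom_to_voxels(atoms, grid_size, step=1.0):
    """Map each atom to its containing voxel plus face-adjacent neighborhood voxels."""
    mapping = {}
    for atom_id, coords in atoms.items():
        home = tuple(math.floor(c / step) for c in coords)
        voxels = [home]  # the atom-containing voxel always comes first
        for dim in range(len(home)):
            for delta in (-1, 1):
                neighbor = list(home)
                neighbor[dim] += delta
                if 0 <= neighbor[dim] < grid_size:  # stay inside the grid
                    voxels.append(tuple(neighbor))
        mapping[atom_id] = voxels
    return mapping


def voxel_to_atoms(atom_voxels):
    """Invert the atom-to-voxels mapping into a voxel-to-atoms mapping."""
    inverted = defaultdict(list)
    for atom_id, voxels in atom_voxels.items():
        for voxel in voxels:
            inverted[voxel].append(atom_id)
    return dict(inverted)


# Hypothetical alpha-carbon atoms in a 3x3x3 grid.
atoms = {"CA_1": (0.4, 0.6, 0.5), "CA_2": (2.2, 2.7, 2.1)}
a2v = atom_to_voxels(atoms, grid_size=3)
v2a = voxel_to_atoms(a2v)
```

No distance is computed in either function; the associations follow purely from integer voxel indices.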
- At step 3232, for each voxel, distances are calculated between the voxel and atoms mapped to the voxel at step 3222.
- Step 3232 has a runtime complexity of O(#atoms) because the distance to a particular atom is measured only once, from the respective voxel to which the particular atom is uniquely mapped in the voxel-to-atoms mapping 3412. This is true when no neighboring voxels are considered. Without neighbors, the constant factor that is implied in the big-O notation is 1. With neighbors, the constant factor is equal to the number of neighbors + 1; since the number of neighbors is constant for each voxel, the runtime complexity of O(#atoms) remains true. In contrast, in Figure 35A, distances to a particular atom are redundantly measured as many times as the number of voxels (e.g., 27 distances for a particular atom due to 27 voxels).
- each voxel is mapped to a respective subset of the 828 atoms (not including distance calculations to neighborhood voxels), as illustrated by respective ovals for respective voxels.
- the respective subsets are largely non-overlapping, with some exceptions. Insignificant overlap exists due to some instances when multiple atoms are mapped to a same voxel, as indicated in Figure 35B by the prime symbol and the yellow overlap between the ovals. This minimal overlap has an additive effect on the runtime complexity of O(#atoms) and not a multiplicative effect.
- This overlap is a result of considering neighboring voxels, after determining the voxel that contains the atom. Without neighboring voxels, there can be no overlap, because an atom is only associated with one voxel. Considering neighbors, however, each neighbor could potentially be associated with the same atom (as long as there is no other atom of the same amino acid that is closer).
- a nearest atom to the voxel is identified.
- this identification is encoded in a voxel-to-nearest atom mapping 3422, also referred to herein as “cell-to-nearest element mapping.”
- the first voxel is mapped to a second alpha-carbon atom as its nearest alpha-carbon atom 3424.
- the second voxel is mapped to a thirty-first alpha-carbon atom as its nearest alpha-carbon atom 3426.
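The distance calculation and nearest-atom selection can be sketched as follows, using a 2D toy grid for brevity. The voxel-to-atoms mapping, the atom identifiers, and the coordinates are hypothetical inputs, and voxel centers are assumed to sit at (index + 0.5) times the step size. The counter is included only to show that the number of distance evaluations equals the number of voxel-atom associations.

```python
import math


def nearest_atoms(voxel_atoms, atom_coords, step=1.0):
    """For each voxel, find the nearest of the atoms mapped to it."""
    nearest = {}
    evaluations = 0
    for voxel, atom_ids in voxel_atoms.items():
        center = tuple((i + 0.5) * step for i in voxel)
        best_id, best_distance = None, math.inf
        for atom_id in atom_ids:  # only atoms mapped to this voxel are visited
            distance = math.dist(center, atom_coords[atom_id])
            evaluations += 1
            if distance < best_distance:
                best_id, best_distance = atom_id, distance
        nearest[voxel] = (best_id, best_distance)
    return nearest, evaluations


# Hypothetical 2D example: two atoms and two voxels.
atom_coords = {"CA_1": (0.4, 0.6), "CA_2": (1.2, 0.5)}
voxel_atoms = {(0, 0): ["CA_1", "CA_2"], (1, 0): ["CA_2"]}
nearest, evaluations = nearest_atoms(voxel_atoms, atom_coords)
```

Here three distances are evaluated, whereas measuring every atom against every voxel of a larger grid would scale with the product of the two counts.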
- the atom-type and amino acid-type categorization of the atoms and the corresponding distance values are stored to generate categorized distance channels.
- Figure 36 shows an example computer system 3600 that can be used to implement the technology disclosed.
- Computer system 3600 includes at least one central processing unit (CPU) 3672 that communicates with a number of peripheral devices via bus subsystem 3655.
- peripheral devices can include a storage subsystem 3610 including, for example, memory devices and a file storage subsystem 3636, user interface input devices 3638, user interface output devices 3676, and a network interface subsystem 3674.
- the input and output devices allow user interaction with computer system 3600.
- Network interface subsystem 3674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
- the pathogenicity classifier 2108 is communicably linked to the storage subsystem 3610 and the user interface input devices 3638.
- User interface input devices 3638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
- use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 3600.
- User interface output devices 3676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem can also provide a non-visual display such as audio output devices.
- use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 3600 to the user or to another machine or computer system.
- Storage subsystem 3610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 3678.
- Processors 3678 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs).
- Processors 3678 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
- processors 3678 include Google’s Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm’s Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
- Memory subsystem 3622 used in the storage subsystem 3610 can include a number of memories including a main random access memory (RAM) 3632 for storage of instructions and data during program execution and a read only memory (ROM) 3634 in which fixed instructions are stored.
- a file storage subsystem 3636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations can be stored by file storage subsystem 3636 in the storage subsystem 3610, or in other machines accessible by the processor.
- Bus subsystem 3655 provides a mechanism for letting the various components and subsystems of computer system 3600 communicate with each other as intended. Although bus subsystem 3655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
- Computer system 3600 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 3600 depicted in Figure 36 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 3600 are possible having more or fewer components than the computer system depicted in Figure 36.
- Protein language models trained with the masked language modeling objective are supervised to output the probability that an amino acid occurs at a position in a protein given the surrounding context.
- Proteins are linear polymers that fold into various specific conformations to function.
- Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role.
- a site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists.
- Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determination of the structural and functional roles of individual amino acids within a protein provides information to help engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts such as site-directed mutagenesis for altering targeted protein functional properties.
- the technology disclosed relates to predicting spatial tolerability of amino acid substitutes.
- the technology disclosed includes a gapping logic and a substitution logic.
- the gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein.
- the substitution logic is configured to process the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling/fitting the amino acid vacancy.
- the substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural (or spatial) compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy (e.g., the right and left flanking amino acids).
- the substitution logic evaluates the extent to which an amino acid “fits” its surrounding protein environment and shows that mutations that disrupt strong amino acid preferences are more likely to be deleterious.
- the substitution logic is a convolutional neural network
- the weights of the convolutional filters are optimized to detect local spatial patterns that best capture the local biochemical features to separate the 20 amino acid microenvironments.
- filters in convolution layers of the convolutional neural network are activated when the desired features are present at some spatial position in the input.
- the structural (or spatial) compatibility can be defined by changes to or impact on protein functionality.
- if a substitute amino acid, after substitution at a specific location within a protein structure, causes changes in the functionality of a protein, then the substitute amino acid is considered structurally (or spatially) incompatible.
- if a substitute amino acid, after substitution at the specific location within the protein structure, does not cause changes in the functionality of a protein, then the substitute amino acid is considered structurally (or spatially) compatible.
- the structural (or spatial) compatibility can be defined by a spatial deviation measured by a distance metric.
- a pre-insertion spatial measurement of a protein structure can be determined, for example, by measuring distances between amino acids in the protein structure prior to the amino acid substitution at a particular position. The distances can be atomic distances based on atomic coordinates of the atoms of the amino acids. The distances can be measured between pairs of amino acids.
- a post-insertion spatial measurement of the protein structure can be determined, for example, by remeasuring the distances between the amino acids in the protein structure after the amino acid substitution at the particular position.
- the substitute amino acid is considered structurally (or spatially) incompatible.
- the substitute amino acid is considered structurally (or spatially) compatible.
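One way to picture the distance-based definition is to compare pairwise distances measured before and after a substitution. The sketch below is illustrative only: the root-mean-square deviation, the threshold value of 0.5, the function names, and the coordinates are all assumptions made for the example, since the text does not specify the deviation measure or the cutoff.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Distances between every pair of amino acid positions."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def spatial_deviation(pre, post):
    """Root-mean-square change in pairwise distances (an assumed metric)."""
    before, after = pairwise_distances(pre), pairwise_distances(post)
    if len(before) != len(after) or not before:
        raise ValueError("both measurements need the same two or more positions")
    squared = sum((x - y) ** 2 for x, y in zip(before, after))
    return math.sqrt(squared / len(before))


def is_compatible(pre, post, threshold=0.5):
    """Hypothetical rule: compatible if the deviation stays within the threshold."""
    return spatial_deviation(pre, post) <= threshold


# Hypothetical representative coordinates of three amino acids.
pre_insertion = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
post_insertion = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
compatible = is_compatible(pre_insertion, post_insertion)
```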
- the technology disclosed relates to predicting evolutionary conservation of amino acid substitutes.
- the technology disclosed includes a gapping logic and a substitution logic.
- the gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein.
- the substitution logic is configured to process the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
- the substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural (or spatial) compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy (e.g., the right and left flanking amino acids).
- the evolutionary conservation is scored using evolutionary conservation frequencies.
- the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM).
- the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM).
- evolutionary conservation scores of the substitute amino acids are rank-ordered by magnitude.
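The notions of a position-specific frequency matrix column and of rank-ordering by magnitude can be illustrated with a toy alignment. In the disclosure the conservation scores are produced by the substitution logic; in this sketch they are simply counted from four invented sequences, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def psfm_column(alignment, position):
    """Frequency of each amino acid at one alignment column (gaps are ignored)."""
    residues = [seq[position] for seq in alignment if seq[position] in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def rank_substitutes(frequencies):
    """Rank-order candidate amino acids by conservation frequency, highest first."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)


# Toy alignment (assumption): four homologous sequences of length four.
alignment = ["MKLV", "MKIV", "MRLV", "MKLA"]
ranked = rank_substitutes(psfm_column(alignment, 2))
```

At the third column the alignment contains L three times and I once, so L ranks first with frequency 0.75 and I second with 0.25.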
- the technology disclosed relates to predicting evolutionary conservation of amino acid substitutes.
- the technology disclosed includes a gapping logic and an evolutionary conservation prediction logic.
- the gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein.
- the evolutionary conservation prediction logic is configured to process the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
Gapped Protein Spatial Representation-Based Pathogenicity Determination for a Target Alternate
- Figure 37 illustrates one implementation of determining 3700 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation.
- a protein is a sequence of amino acids. A particular amino acid in the protein that is removed or masked from the protein is called a “gap amino acid.” The resulting protein that lacks the gap amino acid is called a “gapped protein” or a “vacancy-containing protein.”
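At the sequence level, creating a gapped protein can be sketched as follows. The placeholder character, the zero-based position convention, the example sequence, and the function name `gap_protein` are assumptions made for illustration.

```python
GAP = "?"  # hypothetical placeholder marking the amino acid vacancy


def gap_protein(sequence, position):
    """Remove the amino acid at a zero-based position and mark the vacancy."""
    if not 0 <= position < len(sequence):
        raise IndexError("position lies outside the protein")
    gap_amino_acid = sequence[position]
    gapped = sequence[:position] + GAP + sequence[position + 1:]
    return gapped, gap_amino_acid


gapped_protein, removed = gap_protein("MKTAYIAK", 3)  # ("MKT?YIAK", "A")
```

Keeping the removed amino acid alongside the gapped protein makes it available later, for example as a label when scoring candidate substitutes.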
- a “spatial representation” of a protein characterizes structural information about amino acids in the protein.
- the spatial representation of the protein can be based on shape, location, position, patterns, and/or arrangement of the amino acids in the protein.
- the spatial representation of the protein can be one-dimensional (1D), two-dimensional (2D), three-dimensional (3D), or n-dimensional (nD) information.
- the spatial representation of the protein includes the amino acid-wise distance channels discussed above, for example, the amino acid-wise distance channels 600 described above with respect to Figure 6.
- the spatial representation of the protein includes the distance channel tensor discussed above, for example, the distance channel tensor 700 described above with respect to Figure 7.
- the spatial representation of the protein includes the evolutionary profiles tensor discussed above, for example, the evolutionary profiles tensor 1800 described above with respect to Figure 18.
- the spatial representation of the protein includes the voxelized annotation channels discussed above, for example, the voxelized annotation channels 2000 described above with respect to Figure 20.
- the spatial representation of the protein includes the structure confidence channels discussed above.
- the spatial representation can include other channels as well.
- a “gapped spatial representation” of a protein is such a spatial representation of the protein that excludes at least one gap amino acid in the protein.
- a gap amino acid is excluded by excluding (or not considering or ignoring) one or more atoms or atom-types of the gap amino acid when generating the gapped spatial representation.
- the atoms of the gap amino acid can be excluded from the calculations (or selections or computations) that produce the distance channels, the evolutionary profiles, the annotation channels, and/or the structure confidence channels.
- the gapped spatial representation can be generated by excluding the gap amino acid from other feature channels as well.
- a protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.
- a gap amino acid specifier 3714 specifies a particular amino acid at a particular position in the protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids.
- the particular amino acid is a reference amino acid that is a major allele of the protein.
- a gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid.
- the spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels.
- Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels.
- the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.
- the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
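- As an illustrative, non-limiting sketch of the gapped distance channels described above, the following Python code computes, for each voxel and each amino acid class, the distance to the nearest atom of that class while disregarding atoms of the gap amino acid. The function name `gapped_distance_channels`, the alphabetical class ordering, the array layout, and the use of infinity for classes with no remaining atoms are assumptions made for illustration and are not taken from the disclosure.

```python
import numpy as np

# Assumed ordering of the twenty amino acid classes (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_distance_channels(voxel_centers, atom_coords, atom_residue_idx,
                             atom_aa_class, gap_residue_idx):
    """Return an array of shape (20, n_voxels) of nearest-atom distances.

    Atoms that belong to the residue at `gap_residue_idx` are disregarded,
    so the gap amino acid leaves no trace in any channel.
    """
    voxel_centers = np.asarray(voxel_centers, dtype=float)
    atom_coords = np.asarray(atom_coords, dtype=float)
    atom_residue_idx = np.asarray(atom_residue_idx)
    atom_aa_class = np.asarray(atom_aa_class)

    # Keep only atoms of non-gap amino acids.
    keep = atom_residue_idx != gap_residue_idx

    # Classes with no remaining atoms keep an infinite distance (assumption).
    channels = np.full((len(AMINO_ACIDS), len(voxel_centers)), np.inf)
    for c, aa in enumerate(AMINO_ACIDS):
        selected = keep & (atom_aa_class == aa)
        if not selected.any():
            continue
        # Pairwise voxel-to-atom distances, then the minimum per voxel.
        diffs = voxel_centers[:, None, :] - atom_coords[selected][None, :, :]
        channels[c] = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return channels
```

In this sketch, removing the gap amino acid changes a channel only for voxels whose nearest atom of that class belonged to the gap residue; all other voxel-wise distance values are unchanged.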
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as annotation channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels.
- the spatial configurations of the non-gap amino acids are encoded as structural confidence channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
- the spatial configurations of the non-gap amino acids are encoded as additional input channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
- a pathogenicity determiner 3734 determines a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
- the representation of the alternate amino acid can be a one-hot encoding of the alternate amino acid (e.g., see Figure 8).
- the alternate amino acid is an amino acid that is same as the reference amino acid. In other implementations, the alternate amino acid is an amino acid that is different from the reference amino acid.
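- By way of a minimal sketch, a one-hot encoding of the alternate amino acid can be formed as shown below. The alphabetical class ordering and the name `one_hot_amino_acid` are assumptions for illustration; the disclosure does not fix a particular ordering here.

```python
import numpy as np

# Assumed ordering of the twenty amino acid classes (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_amino_acid(aa):
    """Return a length-20 vector with a single 1.0 at the class of `aa`."""
    vec = np.zeros(len(AMINO_ACIDS), dtype=np.float32)
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec
```

The same encoding serves both cases above: when the alternate amino acid is the same as the reference amino acid and when it is different.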
- Figure 38 shows an example of a spatial representation 3800 of a protein.
- the protein contains an amino acid sequence 3804.
- An aspartic acid (D) amino acid at the 22nd position in the amino acid sequence 3804 is selected as a gap amino acid 3802.
- Figure 39 shows an example of a gapped spatial representation 3900 of the protein illustrated in Figure 38. In Figure 39, the gap amino acid 3802 is removed from the gapped spatial representation 3900. Also in Figure 39, the absence of the gap amino acid 3802 is illustrated as a missing gap amino acid 3902.
- Figure 40 shows an example of an atomic spatial representation 4000 of the protein illustrated in Figure 38.
- Figure 40 also depicts atoms 4002 of the gap amino acid 3802.
- Figure 41 shows an example of a gapped atomic spatial representation 4100 of the protein illustrated in Figure 38.
- the atoms 4002 of the gap amino acid 3802 are removed from the gapped atomic spatial representation 4100.
- the absence of the atoms 4002 of the gap amino acid 3802 is illustrated as missing atoms 4102 of the gap amino acid 3802.
- Figure 42 illustrates one implementation of a pathogenicity classifier 2108/2600/2700 determining 4200 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation 4202 and an alternate amino acid representation 4212 of the target alternate amino acid.
- the pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the nucleotide variant by processing, as input, the gapped spatial representation 4202 and the representation 4212 of the alternate amino acid, and generating, as output, a pathogenicity score 4208 for the alternate amino acid.
- Figure 43 depicts one implementation of training data 4300 used to train the pathogenicity classifier 2108/2600/2700.
- the pathogenicity classifier 2108/2600/2700 is trained on a benign training set 4302.
- the benign training set 4302 has respective benign protein samples 4322, 4342, and 4362 for respective reference amino acids at respective positions 4312, 4332, and 4352 in a proteome.
- the reference amino acids are major allele amino acids of the proteome.
- the proteome has ten million positions, and therefore the benign training set 4302 has ten million benign protein samples.
- the respective benign protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids.
- the respective benign protein samples have respective representations of the respective reference amino acids as respective alternate amino acids.
- the proteome includes a human proteome and a non-human proteome, including a non-human primate proteome.
- Figure 44 illustrates one implementation of generating 4400 gapped spatial representations 4322G, 4342G, and 4362G for reference protein samples 4322, 4342, and 4362 by using reference amino acids 4402, 4412, and 4422 as gap amino acids, respectively.
- Figure 45 shows one implementation of training the pathogenicity classifier 2108/2600/2700 on benign protein samples 4500.
- the pathogenicity classifier 2108/2600/2700 trains on a particular benign protein sample and estimates a pathogenicity of a particular reference amino acid at a particular position in the particular benign protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular benign protein sample, and (ii) a representation 4402 (e.g., a one-hot encoding) of the particular reference amino acid as a particular alternate amino acid, and generating, as output, a pathogenicity score for the particular reference amino acid.
- the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular benign protein sample as non-gap amino acids.
- Each of the benign protein samples has a ground truth benignness label 4506 that indicates absolute benignness of the benign protein samples.
- the ground truth benignness label is zero, one, or minus one.
- the pathogenicity score 4502 for the particular reference amino acid is compared against the ground truth benignness label to determine an error 4504, and to improve coefficients of the pathogenicity classifier 2108/2600/2700 based on the error using a training technique (e.g., backpropagation 4512).
- the pathogenicity classifier 2108/2600/2700 is trained on a pathogenic training set 4308.
- the pathogenic training set 4308 has respective pathogenic protein samples 4322A-N, 4342A-N, and 4362A-N for respective combinatorically generated amino acid substitutions for each of the reference amino acids 4312, 4332, and 4352 at each of the respective positions 4318, 4338, and 4358 in the proteome.
- the respective combinatorically generated amino acid substitutions are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of unreachable alternate amino acid classes.
- the combinatorically generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternate amino acids of respective amino acid classes that are different from the particular amino acid class.
- the proteome has the ten million positions, wherein there are nineteen combinatorically generated amino acid substitutions for each of the ten million positions, and therefore the pathogenic training set 4308 has one hundred and ninety million pathogenic protein samples.
- the respective pathogenic protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids.
- the respective pathogenic protein samples have respective representations of the respective combinatorically generated amino acid substitutions as respective alternate amino acids created by respective combinatorically generated nucleotide variants at the respective positions in the proteome.
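- The construction of the benign and pathogenic training sets described above can be sketched as follows, assuming one benign sample per position (the reference amino acid used as its own alternate) and nineteen pathogenic samples per position (every other amino acid class). The generator name `enumerate_samples` and the label values 0 and 1 are illustrative assumptions; SNP reachability filtering is deliberately not applied in this sketch.

```python
# Assumed ordering of the twenty amino acid classes (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed label values; the disclosure permits other choices.
BENIGN, PATHOGENIC = 0, 1


def enumerate_samples(reference_sequence):
    """Yield (position, reference, alternate, label) tuples.

    Each sample shares the gapped representation of its position, in which
    the reference amino acid is the gap amino acid.
    """
    for pos, ref in enumerate(reference_sequence):
        # Benign sample: the reference amino acid fills its own vacancy.
        yield pos, ref, ref, BENIGN
        # Pathogenic samples: the nineteen other amino acid classes.
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield pos, ref, alt, PATHOGENIC
```

Under these assumptions, a proteome of ten million positions yields 10,000,000 benign samples and 10,000,000 × 19 = 190,000,000 pathogenic samples, matching the counts stated above.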
- Figure 46 shows one implementation of training the pathogenicity classifier 2108/2600/2700 on pathogenic protein samples 4600.
- the pathogenicity classifier 2108/2600/2700 trains on a particular pathogenic protein sample and estimates a pathogenicity of a particular combinatorically generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular pathogenic protein sample, and (ii) a representation 4622 (e.g., a one-hot encoding) of the particular combinatorically generated amino acid substitution as a particular alternate amino acid, and generating, as output, a pathogenicity score for the particular combinatorically generated amino acid substitution.
- the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular pathogenic protein sample as non-gap amino acids.
- Each of the pathogenic protein samples has a ground truth pathogenicity label that indicates absolute pathogenicity of the pathogenic protein samples.
- the ground truth pathogenicity label is one, zero, or minus one, as long as it is different from (e.g., opposite to) the ground truth benignness label.
- the pathogenicity score 4602 for the particular combinatorically generated amino acid substitution is compared against the ground truth pathogenicity label 4606 to determine an error 4604, and to improve the coefficients of the pathogenicity classifier 2108/2600/2700 based on the error using the training technique (e.g., backpropagation 4612).
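- The compare-and-update loop of Figures 45 and 46 can be sketched with a deliberately simple stand-in model. The code below assumes a single-layer logistic model over a flattened feature vector (for example, a flattened gapped spatial representation concatenated with the one-hot alternate amino acid representation), a binary cross-entropy error, and plain gradient descent. The classifier 2108/2600/2700 itself is not a logistic model; `train_step` and its parameters are hypothetical names for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued logit to a score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_step(weights, features, label, learning_rate=0.1):
    """Perform one gradient step and return (new_weights, score).

    `label` is the ground truth label (assumed 0 for benign, 1 for
    pathogenic). For binary cross-entropy, the gradient with respect to
    the logit is simply (score - label).
    """
    score = sigmoid(features @ weights)
    error = score - label
    new_weights = weights - learning_rate * error * features
    return new_weights, score
```

In a deep network the same error signal would be propagated through all layers by backpropagation rather than applied to a single weight vector.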
- the pathogenicity classifier 2108/2600/2700 is trained on two hundred million training iterations.
- the two hundred million training iterations include ten million training iterations with the ten million benign protein samples, and one hundred and ninety million iterations with the one hundred and ninety million pathogenic protein samples.
- the proteome has one million to ten million positions, and therefore the benign training set has one million to ten million benign protein samples.
- the pathogenicity classifier 2108/2600/2700 is trained on twenty million to two hundred million training iterations.
- the twenty million to two hundred million training iterations include one million to ten million training iterations with the one million to ten million benign protein samples, and nineteen million to one hundred and ninety million iterations with the nineteen million to one hundred and ninety million pathogenic protein samples.
- Figure 47 shows how certain unreachable amino acid classes are masked 4700 during training.
- those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
- the masked amino acid classes result in zero loss and do not contribute to gradient updates.
- the masked amino acid classes are identified in a lookup table.
- the lookup table identifies a set of masked amino acids classes for each reference amino acid position.
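- One way such a lookup table and the resulting zero-loss masking could be realized is sketched below. The sketch assumes the standard genetic code, treats stop codons as not belonging to any amino acid class, and uses a masked binary cross-entropy averaged over unmasked classes. The function names and the choice of loss are assumptions for illustration.

```python
import numpy as np

BASES = "TCAG"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
_CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: _CODE[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reachable_amino_acids(ref_codon):
    """Amino acids, other than the reference, that one SNP can produce."""
    ref_aa = CODON_TABLE[ref_codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue
            alt = CODON_TABLE[ref_codon[:pos] + base + ref_codon[pos + 1:]]
            if alt != "*" and alt != ref_aa:
                reachable.add(alt)
    return reachable


def loss_mask(ref_codon):
    """1.0 for the reference and SNP-reachable classes, 0.0 for masked ones."""
    keep = reachable_amino_acids(ref_codon) | {CODON_TABLE[ref_codon]}
    return np.array([1.0 if aa in keep else 0.0 for aa in AMINO_ACIDS])


def masked_bce(scores, labels, mask, eps=1e-7):
    """Binary cross-entropy in which masked classes contribute zero loss."""
    scores = np.clip(scores, eps, 1.0 - eps)
    per_class = -(labels * np.log(scores) + (1.0 - labels) * np.log(1.0 - scores))
    return float((mask * per_class).sum() / mask.sum())
```

Because masked entries are multiplied by zero before summation, their scores have no effect on the loss and therefore produce no gradient.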
- Figure 48 illustrates one implementation of determining a final pathogenicity score.
- the pathogenicity classifier 2108/2600/2700 generates a first pathogenicity score for a first alternate amino acid that is same as a first reference amino acid.
- the pathogenicity classifier 2108/2600/2700 generates a second pathogenicity score for a second alternate amino acid that is different from the first reference amino acid.
- a final pathogenicity score for the second alternate amino acid is the second pathogenicity score for the second alternate amino acid.
- the final pathogenicity score for the second alternate amino acid is based on a combination of the first pathogenicity score and the second pathogenicity score.
- the final pathogenicity score for the second alternate amino acid is a ratio of the second pathogenicity score over a sum of the first pathogenicity score and the second pathogenicity score.
- the final pathogenicity score for the second alternate amino acid is determined by subtracting the first pathogenicity score from the second pathogenicity score.
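- The alternatives for the final pathogenicity score listed above can be summarized in a short sketch. The function name `final_pathogenicity_score` and the `mode` strings are hypothetical labels for the three stated options.

```python
def final_pathogenicity_score(ref_score, alt_score, mode="alt"):
    """Combine the first (reference) and second (alternate) scores.

    mode="alt":        use the alternate score unchanged.
    mode="ratio":      alt / (ref + alt).
    mode="difference": alt - ref.
    """
    if mode == "alt":
        return alt_score
    if mode == "ratio":
        return alt_score / (ref_score + alt_score)
    if mode == "difference":
        return alt_score - ref_score
    raise ValueError(f"unknown mode: {mode}")
```

For example, with a reference score of 0.25 and an alternate score of 0.75, the ratio option gives 0.75 and the difference option gives 0.5.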
- Figure 49A shows that a variant pathogenicity determination is made for a target alternate amino acid 4922 filling a vacancy created by a reference gap amino acid 4902 at a given position in a protein 4912.
- this analysis is done by spatially representing the protein 4912 and the vacancy in a 3D format, for example, by using voxelized amino acid category-wise distance calculations that exclude the reference gap amino acid 4902 (or atoms thereof).
- Figure 49B shows that respective variant pathogenicity determinations are made for amino acids of respective amino acid classes 4916 filling the vacancy created by the reference gap amino acid 4902 at the given position in the protein 4912.
- the inputs in Figures 49A and 49B are the same, as are the spatial representations of the protein 4912 and the vacancy in the 3D format; only the output is different.
- In Figure 49A, only one pathogenicity score is generated, whereas in Figure 49B a pathogenicity score is generated for each of the twenty amino acid classes/categories (e.g., by using a 20-way softmax classification).
- Figure 50 illustrates one implementation of determining 5000 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.
- the protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.
- the gap amino acid specifier 3714 specifies a particular amino acid at a particular position in the protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids.
- the particular amino acid is a reference amino acid that is a major allele of the protein.
- the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid.
- the spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels.
- Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels.
- the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.
- the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as annotation channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels.
- the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
- the spatial configurations of the non-gap amino acids are encoded as additional input channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
- the pathogenicity determiner 3734 determines, based at least in part on the gapped spatial representation, a pathogenicity of respective alternate amino acids at the particular position.
- the respective alternate amino acids are respective combinatorically generated alternate amino acids created by respective combinatorically generated nucleotide variants at the particular position.
- Figure 51 illustrates one implementation of the pathogenicity classifier 2108/2600/2700 determining 5100 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation 5102.
- the pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the respective alternate amino acids by processing, as input, the gapped spatial representation 5102, and generating, as output, respective pathogenicity scores 1-20 for respective amino acid classes.
- the respective amino acid classes correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acid classes correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids. In one implementation, the output is displayed with the respective rankings of the respective pathogenicity scores 1-20 for respective amino acid classes.
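- Displaying the output with rankings, as mentioned above, amounts to sorting the per-class scores. The sketch below assumes an alphabetical class ordering and descending order by score; both are illustrative choices.

```python
# Assumed ordering of the twenty amino acid classes (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rank_amino_acid_classes(scores):
    """Return (amino_acid, score) pairs sorted from highest to lowest score."""
    if len(scores) != len(AMINO_ACIDS):
        raise ValueError("expected one score per amino acid class")
    return sorted(zip(AMINO_ACIDS, scores), key=lambda item: item[1], reverse=True)
```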
- Figure 52 illustrates one implementation of concurrently training 5200 the pathogenicity classifier 2108/2600/2700 on benign and pathogenic protein samples.
- the pathogenicity classifier 2108/2600/2700 is trained on a training set.
- the training set has respective protein samples for respective positions in the proteome.
- the proteome has ten million positions, and therefore the training set has ten million protein samples.
- the respective protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions in the proteome as respective gap amino acids.
- the reference amino acids are major allele amino acids of the proteome.
- the pathogenicity classifier 2108/2600/2700 trains on a particular protein sample and estimates a pathogenicity of respective alternate amino acids for a particular reference amino acid at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation 5202 of the particular protein sample, and generating, as output, respective pathogenicity scores 1-20 for the respective amino acid classes.
- the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids.
- Each of the protein samples has respective ground truth labels for the respective amino acid classes.
- the respective ground truth labels include an absolute benignness label for a reference amino acid class in the respective amino acid classes, and include respective absolute pathogenicity labels for respective alternate amino acid classes in the respective amino acid classes.
- the absolute benignness label is zero.
- the absolute pathogenicity labels are same across the respective alternate amino acid classes. In one implementation, the absolute pathogenicity labels are one.
- an error 5204 is determined based on a comparison of a pathogenicity score for the reference amino acid class against the absolute benignness label (e.g., pathogenicity score 8 for reference gap amino acid 5212 in Figure 52), and respective comparisons of respective pathogenicity scores for the respective alternate amino acid classes against the respective absolute pathogenicity labels (e.g., pathogenicity scores 1-7 and 9-20 in Figure 52).
- coefficients of the pathogenicity classifier 2108/2600/2700 are improved based on the error using a training technique (e.g., backpropagation 5224).
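- The per-sample ground truth labels and error for concurrent training can be sketched as follows, assuming a benignness label of zero for the reference amino acid class, a pathogenicity label of one for each of the nineteen alternate classes, and a mean squared error. The error function is an assumption; the disclosure does not specify one at this point.

```python
import numpy as np

# Assumed ordering of the twenty amino acid classes (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ground_truth_labels(reference_aa, benign=0.0, pathogenic=1.0):
    """Return 20 labels: benign for the reference class, pathogenic otherwise."""
    labels = np.full(len(AMINO_ACIDS), pathogenic)
    labels[AMINO_ACIDS.index(reference_aa)] = benign
    return labels


def classification_error(scores, labels):
    """Mean squared error across the twenty per-class comparisons."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean((scores - labels) ** 2))
```

A single protein sample thus supplies one benign comparison and nineteen pathogenic comparisons in the same training iteration.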
- the pathogenicity classifier 2108/2600/2700 is trained on ten million training iterations with the ten million protein samples.
- the proteome has one million to ten million positions, and therefore the training set has one million to ten million protein samples.
- the pathogenicity classifier 2108/2600/2700 is trained on one million to ten million training iterations with the one million to ten million protein samples.
- the pathogenicity classifier 2108/2600/2700 generates a reference pathogenicity score for a first alternate amino acid of the reference amino acid class. In one implementation, the pathogenicity classifier 2108/2600/2700 generates respective alternate pathogenicity scores for respective alternate amino acids of the respective alternate amino acid classes.
- respective final alternate pathogenicity scores for the respective alternate amino acids are the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are based on respective combinations of the reference pathogenicity score and the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are respective ratios of the respective alternate pathogenicity scores over a sum of the reference pathogenicity score and the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are determined by respectively subtracting the reference pathogenicity score from the respective alternate pathogenicity scores.
- the pathogenicity classifier 2108/2600/2700 has an output layer that generates the respective pathogenicity scores.
- the output layer is a normalization layer.
- the respective pathogenicity scores are normalized.
- the output layer is a softmax layer.
- the respective pathogenicity scores are exponentially normalized.
- the output layer has respective sigmoid units that respectively generate the respective pathogenicity scores.
- the respective pathogenicity scores are unnormalized.
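- The difference between the softmax and sigmoid output layers can be illustrated with a small sketch. The function names are illustrative; the input `logits` stands for the twenty pre-activation values of the output layer.

```python
import numpy as np


def softmax(logits):
    """Exponentially normalized scores that sum to one across classes."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


def sigmoid_units(logits):
    """Independent per-class scores in (0, 1); they need not sum to one."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
```

With a softmax layer, raising one class score necessarily lowers the others, whereas sigmoid units allow several alternate amino acid classes to receive high scores at once.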
- Evolutionary conservation refers to the presence of similar genes, portions of genes, or chromosome segments in different species, reflecting both the common origin of species and an important functional property of the conserved element. Mutations occur spontaneously in each generation, randomly changing an amino acid here and there in a protein. Individuals with mutations that impair critical functions of proteins may have resulting problems that make them less able to reproduce. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Since the harmful mutations are lost, the amino acids critical for the function of a protein are conserved in the gene pool. In contrast, harmless (or very rare beneficial) mutations are kept in the gene pool, producing variability in non-critical amino acids.
- Evolutionary conservation in proteins is identified by aligning the amino acid sequences of proteins with the same function from different taxa (orthologs). Predicting the functional consequences of variants relies at least in part on the assumption that crucial amino acids for protein families are conserved through evolution due to negative selection (i.e., amino acid changes at these sites were deleterious in the past), and that mutations at these sites have an increased likelihood of being pathogenic (causing disease) in humans.
- homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of different amino acids observed in the target position in the alignment.
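- A minimal sketch of computing weighted amino acid frequencies at one alignment column is given below. It assumes that each homologous sequence carries a non-negative weight (for example, to down-weight near-duplicate sequences) and that alignment gaps are written as "-" and ignored. The function name and weighting scheme are assumptions; a conservation metric could then be derived from these frequencies, for instance by reading off the frequency of a given amino acid.

```python
from collections import defaultdict


def weighted_frequencies(column, weights):
    """Return {amino_acid: weighted frequency} for one alignment column.

    `column` holds the residue from each aligned sequence at the target
    position; `weights` holds the corresponding sequence weights.
    """
    totals = defaultdict(float)
    for aa, weight in zip(column, weights):
        if aa != "-":  # skip alignment gaps
            totals[aa] += weight
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {aa: total / norm for aa, total in totals.items()}
```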
- Figure 53 illustrates one implementation of determining 5300 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternate amino acids.
- the gap amino acid specifier 3714 specifies a particular amino acid at a particular position in a protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids.
- the particular amino acid is a reference amino acid that is a major allele of the protein.
- the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid.
- the spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels.
- Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels.
- the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.
- the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
- the spatial configurations of the non-gap amino acids are encoded as annotation channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels.
- the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
- the spatial configurations of the non-gap amino acids are encoded as additional input channels.
- the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
- an evolutionary conservation determiner 5324 determines an evolutionary conservation at the particular position of respective amino acids of respective amino acid classes based at least in part on the gapped spatial representation.
- Figure 54 shows the evolutionary conservation determiner 5324 in operation 5400, in accordance with one implementation.
- the evolutionary conservation determiner 5324, in some implementations, has the same architecture as the pathogenicity classifier 2108/2600/2700.
- the evolutionary conservation determiner 5324 determines the evolutionary conservation by processing, as input, the gapped spatial representation 5402, and generating, as output, respective evolutionary conservation scores 5406 for the respective amino acids 5408.
- the respective evolutionary conservation scores are rankable by magnitude.
- a “classifier”, “determiner”, “insert term here” can include one or more software modules, one or more hardware modules, or any combination thereof.
- the pathogenicity determiner 3734 determines a pathogenicity of respective nucleotide variants that respectively substitute the particular amino acid with the respective amino acids 5408 in alternate representations of the protein.
- Figure 55 illustrates one implementation of determining pathogenicity based on predicted evolutionary scores.
- a classifier 5516 classifies a nucleotide variant as pathogenic 5508 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is below a threshold.
- the classifier 5516 classifies a nucleotide variant as pathogenic 5508 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is zero (i.e., indication of non-conservation).
- the classifier 5516 classifies a nucleotide variant as benign 5528 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is above a threshold. In one implementation, the classifier 5516 classifies a nucleotide variant as benign 5528 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is non-zero (i.e., indication of conservation).
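The thresholding rule described for Figure 55 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `classify_variant`, the default threshold of 0.5, the handling of a score exactly equal to the threshold, and the example scores are all assumptions made for the example.

```python
from typing import Dict


def classify_variant(conservation_score: float, threshold: float = 0.5) -> str:
    """Label a variant from the conservation score of its amino acid substitution.

    Scores below the threshold are treated as non-conserved (pathogenic).
    Scores at or above it are treated as conserved (benign); the text only
    says "below" and "above", so the tie handling here is an assumption.
    """
    return "pathogenic" if conservation_score < threshold else "benign"


# Hypothetical conservation scores for four substitutions at one position.
scores: Dict[str, float] = {"A": 0.62, "C": 0.08, "F": 0.91, "W": 0.0}
calls = {aa: classify_variant(score) for aa, score in scores.items()}
```

Setting `threshold` to a very small positive value approximates the zero versus non-zero variant of the rule.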
- Figure 56 illustrates one implementation of training data 5600 used to train the evolutionary conservation determiner 5324.
- the evolutionary conservation determiner 5324 is trained on a conserved training set and a non-conserved training set.
- the conserved training set has respective conserved protein samples 5602 for respective conserved amino acids at respective positions in a proteome.
- the non-conserved training set has respective non-conserved (or unconserved) protein samples 5608 for respective non-conserved amino acids at the respective positions.
- the proteome includes human proteome and non-human proteome, including non-human primate proteome.
- Each of the respective positions has a set of conserved amino acids and a set of nonconserved amino acids.
- a particular set of conserved amino acids for a particular position in a particular protein in the proteome includes at least one major allele amino acid observed at the particular position across a plurality of species.
- the major allele amino acid is a reference amino acid (e.g., REF allele 5612 spanning benign protein sample 5622, and REF allele 5662 spanning benign protein sample 5682).
- the particular set of conserved amino acids includes one or more minor allele amino acids observed at the particular position across the plurality of species (e.g., observed ALT alleles 5632 spanning benign protein samples 5642, 5652, 5662, and observed ALT alleles 5692 spanning benign protein samples 5695, 5696).
- a particular set of non-conserved amino acids for the particular position includes amino acids not in the particular set of conserved amino acids (e.g., unobserved ALT alleles 5618 spanning pathogenic protein samples 5622A-N, and unobserved ALT alleles 5668 spanning pathogenic protein samples 5682A-N).
- each of the respective positions has C conserved amino acids in the set of conserved amino acids.
- the C ranges from one to ten.
- the C varies across the respective positions.
- the C is the same for some of the respective positions.
- the proteome has one to ten million positions.
- each of the one to ten million positions has the C conserved amino acids in the set of conserved amino acids.
- the evolutionary conservation determiner 5324 is trained on twenty million to two hundred million training iterations.
- the twenty million to two hundred million training iterations include one million to ten million training iterations with the one million to ten million conserved protein samples, and nineteen million to one hundred and ninety million iterations with the nineteen million to one hundred and ninety million non-conserved protein samples.
- the proteome has one million to ten million positions, and therefore the training set has one million to ten million protein samples.
- the evolutionary conservation determiner 5324 is trained on one million to ten million training iterations with the one million to ten million protein samples.
- the respective conserved and non-conserved protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions as respective gap amino acids.
- the evolutionary conservation determiner 5324 trains on a particular conserved protein sample and estimates an evolutionary conservation of a particular conserved amino acid at a particular position in the particular conserved protein sample by processing, as input, a particular gapped spatial representation of the particular conserved protein sample, and generating, as output, an evolutionary conservation score for the particular conserved amino acid.
- the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular conserved protein sample as non-gap amino acids.
- Each of the conserved protein samples has a ground truth conserved label.
- the ground truth conserved label is an evolutionary conservation frequency.
- the ground truth conserved label is one.
- the evolutionary conservation for the particular conserved amino acid is compared against the ground truth conserved label to determine an error, and to improve coefficients of the evolutionary conservation determiner 5324 based on the error using a training technique.
- the training technique is a loss function-based gradient update technique (e.g., backpropagation).
- the ground truth conserved label is masked and not used to determine the error when the particular conserved amino acid is the particular reference amino acid.
- the masking causes the evolutionary conservation determiner 5324 to not overfit on the particular reference amino acid.
- the evolutionary conservation determiner 5324 trains on a particular non-conserved protein sample and estimates an evolutionary conservation of a particular non-conserved amino acid at a particular position in the particular non-conserved protein sample by processing, as input, a particular gapped spatial representation of the particular non-conserved protein sample, and generating, as output, an evolutionary conservation score for the particular non-conserved amino acid.
- the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular non-conserved protein sample as non-gap amino acids.
- Each of the non-conserved protein samples has a ground truth non-conserved label.
- the ground truth non-conserved label is an evolutionary conservation frequency. In one implementation, the ground truth non-conserved label is zero.
- the evolutionary conservation score for the particular non-conserved amino acid is compared against the ground truth non-conserved label to determine an error, and to improve the coefficients of the evolutionary conservation determiner 5324 based on the error using the training technique (e.g., backpropagation).
- the evolutionary conservation determiner 5324 is trained on a training set.
- the training set has respective protein samples for the respective positions in the proteome.
- the respective protein samples have respective gapped spatial representations generated by using the respective reference amino acids at the respective positions as the respective gap amino acids.
- Figure 57 illustrates one implementation of concurrently training 5700 the evolutionary conservation determiner on benign and pathogenic protein samples.
- the evolutionary conservation determiner 5324 trains on a particular protein sample and estimates an evolutionary conservation of respective amino acids of respective amino acid classes at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation 5722 of the particular protein sample, and generating, as output, respective evolutionary conservation scores 1-20 for the respective amino acids.
- the particular gapped spatial representation 5722 is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids.
- Each of the protein samples has respective ground truth labels for the respective amino acids.
- the respective ground truth labels include one or more conserved (benign) labels for one or more conserved amino acids 5732, 5702, 5712, in the respective amino acids, and include one or more non-conserved (pathogenic) labels for one or more non-conserved amino acids in the respective amino acids.
- the conserved labels and the non-conserved labels have respective evolutionary conservation frequencies. The respective evolutionary conservation frequencies are rankable according to magnitude. In one implementation, the conserved labels are ones, and the non-conserved labels are zeros.
- an error 5704 is determined based on respective comparisons of respective evolutionary conservation scores for the respective conserved amino acids against the respective conserved labels, and respective comparisons of respective evolutionary conservation scores for the respective non-conserved amino acids against the respective non-conserved labels.
- the coefficients of the evolutionary conservation determiner 5324 are improved based on the error using the training technique (e.g., backpropagation 5744).
- the conserved amino acids include the particular reference amino acid, and a conserved label for the particular reference amino acid is masked and not used to determine the error.
- the masking causes the evolutionary conservation determiner 5324 to not overfit on the particular reference amino acid.
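The effect of masking the reference amino acid's label during training can be illustrated with a masked loss. This is a sketch under stated assumptions: the text does not name a loss function, so binary cross-entropy is used here only as an example, and the amino acid ordering, the function name `masked_bce`, and the choice of A, C, F as conserved classes with F as the reference are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes


def masked_bce(scores, labels, mask, eps=1e-7):
    """Mean binary cross-entropy over the unmasked amino acid classes only."""
    scores = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    mask = np.asarray(mask, dtype=float)
    per_class = -(labels * np.log(scores) + (1.0 - labels) * np.log(1.0 - scores))
    # Classes with mask == 0 are multiplied out and add nothing to the error.
    return float((per_class * mask).sum() / max(mask.sum(), 1.0))


# Hypothetical position: A, C, F are conserved (label 1), the rest are not (label 0).
labels = np.zeros(20)
for aa in "ACF":
    labels[AMINO_ACIDS.index(aa)] = 1.0

# The reference amino acid F is masked out of the error.
mask = np.ones(20)
mask[AMINO_ACIDS.index("F")] = 0.0

scores_a = np.full(20, 0.5)
scores_b = scores_a.copy()
scores_b[AMINO_ACIDS.index("F")] = 0.01  # a poor prediction on the masked class

loss_a = masked_bce(scores_a, labels, mask)
loss_b = masked_bce(scores_b, labels, mask)  # identical: F does not contribute
```

Because the masked class contributes zero loss, it also contributes zero gradient, which is how the masking keeps the model from simply learning to favor the reference amino acid.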
- Synonymous mutations are point mutations, meaning they are just a miscopied DNA nucleotide that only changes one base pair in the RNA copy of the DNA.
- a codon in RNA is a set of three nucleotides that encode a specific amino acid. Most amino acids have several RNA codons that translate into that particular amino acid. Most of the time, if the third nucleotide is the one with the mutation, it will result in coding for the same amino acid. This is called a synonymous mutation because, like a synonym in grammar, the mutated codon has the same meaning as the original codon and therefore does not change the amino acid. If the amino acid does not change, then the protein is also unaffected. Synonymous mutations do not change anything, and no changes are made. That means they have no real role in the evolution of species since the gene or protein is not changed in any way. Synonymous mutations are actually fairly common, but since they have no effect, then they are not noticed.
- Nonsynonymous mutations have a much greater effect on an individual than a synonymous mutation.
- In a nonsynonymous mutation, there is usually an insertion or deletion of a single nucleotide in the sequence during transcription when the messenger RNA is copying the DNA.
- This single missing or added nucleotide causes a frameshift mutation which throws off the entire reading frame of the amino acid sequence and mixes up the codons.
- the severity of this kind of mutation depends on how early in the amino acid sequence it happens. If it happens near the beginning and the entire protein is changed, this could become a lethal mutation.
- Another way a nonsynonymous mutation can occur is if the point mutation changes the single nucleotide into a codon that does not translate into the same amino acid. A lot of times, the single amino acid change does not affect the protein very much and is still viable. If it happens early in the sequence and the codon is changed to translate into a stop signal, then the protein will not be made, and it could cause serious consequences. Sometimes nonsynonymous mutations are actually positive changes. Natural selection may favor this new expression of the gene and the individual may have developed a favorable adaptation from the mutation. If that mutation occurs in the gametes, this adaptation will be passed down to the next generation of offspring. Nonsynonymous mutations increase the diversity in the gene pool for natural selection to work on and drive evolution on a microevolutionary level.
- the nucleotide triplet that encodes an amino acid is called a codon.
- Each group of three nucleotides encodes one amino acid. Since there are 64 combinations of 4 nucleotides taken three at a time and only 20 amino acids, the code is degenerate (more than one codon per amino acid, in most cases).
- One example of the unreachable alternate amino acid classes are those alternate amino acid classes that are not coded by synonymous SNPs.
- Another example of the unreachable alternate amino acid classes are those alternate amino acid classes that are restricted by the number of triplet nucleotide mutant combinations deviated away by single nucleotide polymorphisms (SNPs) at the triplet nucleotide positions from an initial codon.
- those unreachable alternate amino acid classes that are confined by reachability of SNPs to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
- masked amino acid classes result in zero loss and do not contribute to gradient updates.
- the masked amino acid classes are identified in a lookup table.
- the lookup table identifies a set of masked amino acids classes for each reference amino acid position.
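SNP reachability can be computed directly from the standard genetic code. The sketch below enumerates every single-nucleotide change to a reference codon and collects the amino acids that cannot be reached, which are the candidates for masking. The function names, the choice to key the lookup table by reference codon, and the choice to exclude stop codons and the reference amino acid from the reachable set are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_STRING)}
AMINO_ACIDS = sorted(set(AA_STRING) - {"*"})  # the 20 amino acid classes


def reachable_amino_acids(ref_codon):
    """Alternate amino acids one single-nucleotide substitution away from ref_codon."""
    ref_aa = CODON_TABLE[ref_codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue
            mutant = ref_codon[:pos] + base + ref_codon[pos + 1:]
            reachable.add(CODON_TABLE[mutant])
    # Drop synonymous outcomes and stop codons (an assumption of this sketch).
    return reachable - {ref_aa, "*"}


def masked_classes(ref_codon):
    """Alternate amino acid classes that no single SNP can produce from ref_codon."""
    ref_aa = CODON_TABLE[ref_codon]
    return set(AMINO_ACIDS) - reachable_amino_acids(ref_codon) - {ref_aa}


# Hypothetical lookup table keyed by reference codon (sense codons only).
lookup = {codon: masked_classes(codon) for codon, aa in CODON_TABLE.items() if aa != "*"}
```

For example, from the methionine codon ATG, single-nucleotide changes reach only L, V, K, T, R and I, so the other thirteen alternate classes would be masked at that position.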
- the particular set of conserved amino acids and the particular set of non-conserved amino acids are identified based on evolutionary conservation profiles of homologous proteins of the plurality of species.
- the evolutionary conservation profiles of the homologous proteins are determined using a position-specific frequency matrix (PSFM).
- the evolutionary conservation profiles of the homologous proteins are determined using a position-specific scoring matrix (PSSM).
- Figure 58 depicts different implementations of ground truth label encodings used to train the evolutionary conservation determiner 5324.
- Ground truth label encoding 5802 uses evolutionary conservation frequencies (e.g., PSFM or PSSM) to label the conserved amino acid classes A, C, F, and uses a “zero value” to label the remaining non-conserved amino acid classes.
- PSFM evolutionary conservation frequencies
- Ground truth label encoding 5812 is the same as the ground truth label encoding 5802 except that the ground truth label encoding 5812 “masks out” the REF major allele/most-conserved amino acid class F such that the REF major allele/most-conserved amino acid class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing-out the loss calculated by the loss function for the REF major allele/most-conserved amino acid class F).
- Ground truth label encoding 5822 uses a “one value” to label the conserved amino acid classes A, C, F, and uses a “zero value” to label the remaining non-conserved amino acid classes.
- Ground truth label encoding 5832 is the same as the ground truth label encoding 5822 except that the ground truth label encoding 5832 “masks out” the REF major allele/most-conserved amino acid class F such that the REF major allele/most-conserved amino acid class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing-out the loss calculated by the loss function for the REF major allele/most-conserved amino acid class F).
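The four encodings of Figure 58 differ along two axes: frequency versus one-valued labels, and whether the REF class is masked. A compact way to express that is a single helper with two flags. The helper name `encode_labels`, the amino acid ordering, and the frequency values for classes A, C and F are hypothetical; only the structure of the encodings follows the description.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes


def encode_labels(conserved_freqs, ref_aa, use_frequencies, mask_ref):
    """Return (labels, mask) arrays of length 20 for one position.

    conserved_freqs maps each conserved amino acid to its conservation frequency.
    Non-conserved classes receive a zero label. A mask value of 0 means the class
    is excluded from the loss.
    """
    labels = np.zeros(len(AMINO_ACIDS))
    mask = np.ones(len(AMINO_ACIDS))
    for aa, freq in conserved_freqs.items():
        labels[AMINO_ACIDS.index(aa)] = freq if use_frequencies else 1.0
    if mask_ref:
        mask[AMINO_ACIDS.index(ref_aa)] = 0.0
    return labels, mask


# Hypothetical conservation frequencies for conserved classes A, C, F (F is REF).
conserved = {"A": 0.2, "C": 0.1, "F": 0.7}

enc_5802 = encode_labels(conserved, "F", use_frequencies=True, mask_ref=False)
enc_5812 = encode_labels(conserved, "F", use_frequencies=True, mask_ref=True)
enc_5822 = encode_labels(conserved, "F", use_frequencies=False, mask_ref=False)
enc_5832 = encode_labels(conserved, "F", use_frequencies=False, mask_ref=True)
```

Keeping the mask separate from the labels lets the same loss routine serve all four encodings.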
- Figure 59 illustrates an example PSFM 5900.
- Figure 60 depicts an example PSSM 6000.
- Figure 61 shows one implementation of generating the PSFM and the PSSM.
- Figure 62 illustrates an example PSFM 6200 encoding.
- Figure 63 depicts an example PSSM 6300 encoding.
- Multiple sequence alignment (MSA) is a sequence alignment of multiple homologous protein sequences to a target protein.
- MSA is an important step in comparative analyses and property prediction of biological sequences since a lot of information, for example, evolution and coevolution clusters, is generated from the MSA and can be mapped to the target sequence of choice or onto the protein structure.
- Sequence profiles of a protein sequence X of length L form an L x 20 matrix, either in the form of a PSSM or a PSFM. The columns of a PSSM and a PSFM are indexed by the alphabet of amino acids and each row corresponds to a position in the protein sequence.
- PSSMs and PSFMs contain the substitution scores and the frequencies, respectively, of the amino acids at different positions in the protein sequence.
- Each row of a PSFM is normalized to sum to 1.
- sequence profiles of the protein sequence X are computed by aligning X with multiple sequences in a protein database that have statistically significant sequence similarities with X. Therefore, the sequence profiles contain more general evolutionary and structural information of the protein family that protein sequence X belongs to, and thus, provide valuable information for remote homology detection and fold recognition.
- a protein sequence (called query sequence, e.g., a reference amino acid sequence of a protein) can be used as a seed to search and align homologous sequences from a protein database (e.g., SWISSPROT) using, for example, a PSI-BLAST program.
- the aligned sequences share some homologous segments and belong to the same protein family.
- the aligned sequences are further converted into two profiles to express their homolog information: PSSM and PSFM.
- PSSM and PSFM are matrices with 20 rows and L columns, where L is the total number of amino acids in the query sequence.
- Each column of a PSSM represents the log-likelihood of the residue substitutions at the corresponding positions in the query sequence.
- the (i, j)-th entry of the PSSM matrix represents the chance of the amino acid in the j -th position of the query sequence being mutated to amino acid type i during the evolution process.
- a PSFM contains the weighted observation frequencies of each position of the aligned sequences. Specifically, the (i, j ) -th entry of the PSFM matrix represents the possibility of having amino acid type i in position j of the query sequence.
- FIG. 61 shows the procedures of obtaining the sequence profile by using the PSI-BLAST program.
- the parameters h and j for PSI-BLAST are usually set to 0.001 and 3, respectively.
- the sequence profile of a protein encapsulates its homolog information pertaining to a query protein sequence.
- the homolog information is represented by two matrices: the PSFM and the PSSM. Examples of the PSFM and the PSSM are shown in Figures 62 and 63, respectively.
- In the PSFM, the $(l, u)$-th element ($l \in \{1, 2, \ldots, L_i\}$, $u \in \{1, 2, \ldots, 20\}$) represents the chance of having the $u$-th amino acid in the $l$-th position of the query protein.
- the chance of having the amino acid M in the 1st position of the query protein is 0.36.
- In the PSSM, the $(l, u)$-th element ($l \in \{1, 2, \ldots, L_i\}$, $u \in \{1, 2, \ldots, 20\}$) represents the likelihood score of the amino acid in the $l$-th position of the query protein being mutated to the $u$-th amino acid during the evolution process.
- the score for the amino acid V in the 1st position of the query protein being mutated to H during the evolution process is -3, while that in the 8th position is -4.
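To make the relationship between an alignment, a PSFM and a PSSM concrete, the sketch below builds both matrices from a toy alignment in the L x 20 layout (rows are positions, columns are amino acids). It is a deliberate simplification: PSI-BLAST applies sequence weighting, data-dependent pseudocounts and score scaling, none of which are reproduced here. The uniform pseudocount, the uniform background distribution, the base-2 logarithm, the function names and the four aligned sequences are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column ordering


def psfm_from_msa(aligned, pseudocount=1.0):
    """Build an L x 20 frequency matrix; each row is normalized to sum to 1.

    Characters outside the 20-letter alphabet (e.g. the gap symbol '-') are ignored.
    """
    length = len(aligned[0])
    counts = np.full((length, len(AMINO_ACIDS)), pseudocount, dtype=float)
    for seq in aligned:
        for pos, aa in enumerate(seq):
            if aa in AMINO_ACIDS:
                counts[pos, AMINO_ACIDS.index(aa)] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


def pssm_from_psfm(psfm, background=None):
    """Convert frequencies to log-odds scores against a background distribution."""
    if background is None:
        background = np.full(len(AMINO_ACIDS), 1.0 / len(AMINO_ACIDS))
    return np.log2(psfm / np.asarray(background, dtype=float))


# Hypothetical alignment of four homologous sequences of length 4.
msa = ["MKVA", "MKIA", "MRVA", "LKVG"]
psfm = psfm_from_msa(msa)
pssm = pssm_from_psfm(psfm)
```

Amino acids seen more often than the background at a position receive positive scores and those seen less often receive negative scores, which matches the sign convention of the example PSSM values quoted above.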
- Figure 64 illustrates two datasets on which the models disclosed herein can be trained, for example, by way of combined learning ( Figures 65A-B), or by way of transfer learning ( Figures 66A-B).
- the first training dataset is called JigsawAI dataset 6406.
- the second training dataset is called PrimateAI dataset 6408.
- the JigsawAI dataset 6406 is characterized by a voxel input 6412 with a missing central residue identified as a gap amino acid, as discussed above.
- the PrimateAI dataset 6408 is characterized by the voxel input 6412 with no missing residues and complete input.
- ground truth labels 6422 have a missing or masked label 6426 for the gap amino acid (e.g., the REF amino acid).
- the ground truth labels 6422 have nineteen missing or masked labels 6436 for those remaining amino acids that are different from the alternate amino acid-under-analysis (benign or pathogenic).
- the number of samples 6432 is 10 million 6436 in the JigsawAI dataset 6406 and 1 million 6438 in the PrimateAI dataset 6408.
- Figures 65A-B illustrate one implementation of combined learning 6500 of the models disclosed herein.
- a gapped training set is accessed.
- the gapped training set is also referred to herein as the JigsawAI dataset 6406.
- the gapped training set includes respective gapped protein samples for respective positions in a proteome.
- the respective gapped protein samples are labelled with respective gapped ground truth sequences.
- a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein, and has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position.
- a non-gapped training set is accessed.
- the non-gapped training set is also referred to herein as the PrimateAI dataset 6408.
- the non-gapped training set includes non-gapped benign protein samples and non-gapped pathogenic protein samples.
- a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant.
- a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant.
- the particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid.
- the particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid.
- the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein is masked.
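The two kinds of ground truth sequences described above can be written as a label vector plus a mask vector over the twenty amino acid classes. This is an illustrative sketch: the numeric encoding (0 for benign, 1 for pathogenic), the amino acid ordering and the function names are assumptions, not part of the description.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes
BENIGN, PATHOGENIC = 0.0, 1.0         # assumed numeric encoding of the labels


def gapped_ground_truth(ref_aa, mask_ref=False):
    """Gapped sample: REF class is benign, the 19 alternate classes are pathogenic."""
    labels = np.full(len(AMINO_ACIDS), PATHOGENIC)
    mask = np.ones(len(AMINO_ACIDS))
    ref = AMINO_ACIDS.index(ref_aa)
    labels[ref] = BENIGN
    if mask_ref:
        mask[ref] = 0.0  # optional masking of the REF label
    return labels, mask


def non_gapped_ground_truth(alt_aa, is_pathogenic):
    """Non-gapped sample: one labelled class (the ALT), 19 masked classes."""
    labels = np.zeros(len(AMINO_ACIDS))
    mask = np.zeros(len(AMINO_ACIDS))
    alt = AMINO_ACIDS.index(alt_aa)
    labels[alt] = PATHOGENIC if is_pathogenic else BENIGN
    mask[alt] = 1.0
    return labels, mask


g_labels, g_mask = gapped_ground_truth("F", mask_ref=True)
n_labels, n_mask = non_gapped_ground_truth("W", is_pathogenic=True)
```

With this layout a gapped sample supervises up to twenty outputs at once, while a non-gapped sample supervises exactly one, so both can be fed to the same twenty-output classifier.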
- the non-gapped benign protein samples are derived from common human and nonhuman primate nucleotide variants.
- the non-gapped pathogenic protein samples are derived from combinatorially simulated nucleotide variants.
- respective gapped spatial representations for the gapped protein samples are generated, and respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples are generated.
- the pathogenicity classifier 2108/2600/2700 is trained over one or more training cycles, and a trained pathogenicity classifier 2108/2600/2700 is generated as a result of parameters/coefficients/weights of the trained pathogenicity classifier 2108/2600/2700 being optimized.
- Each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations, and non-gapped spatial representations from the respective non-gapped spatial representations.
- the trained pathogenicity classifier 2108/2600/2700 is used to determine pathogenicity of variants.
- a sample indicator is used to indicate to the pathogenicity classifier 2108/2600/2700 whether a current training example is a gapped spatial representation for a gapped protein sample, or a non-gapped spatial representation for a non-gapped protein sample.
- the pathogenicity classifier 2108/2600/2700 generates an amino acid class-wise output sequence in response to processing a training example.
- the amino acid class-wise output sequence has amino acid class-wise pathogenicity scores.
- a performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a validation set.
- the validation set includes a pair of gapped and non-gapped spatial representations for each held-out protein sample.
- the trained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for the gapped spatial representation in the pair, and a second amino acid class-wise output sequence for the non-gapped spatial representation in the pair.
- a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences.
- the final pathogenicity score is based on an average of the first and second pathogenicity scores.
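Combining the two output sequences for a held-out sample reduces to reading the same amino acid class from each and averaging. In this sketch the function name, the amino acid ordering and the example score values are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes


def final_pathogenicity_score(gapped_output, non_gapped_output, alt_aa):
    """Average the scores for the substituted amino acid from both output sequences."""
    idx = AMINO_ACIDS.index(alt_aa)
    return 0.5 * (float(gapped_output[idx]) + float(non_gapped_output[idx]))


# Hypothetical class-wise outputs for the gapped and non-gapped representations.
first_output = np.full(20, 0.1)
second_output = np.full(20, 0.1)
first_output[AMINO_ACIDS.index("W")] = 0.8
second_output[AMINO_ACIDS.index("W")] = 0.6

score = final_pathogenicity_score(first_output, second_output, "W")  # about 0.7
```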
- At least some of the training cycles use a same number of gapped spatial representations and non-gapped spatial representations. In other implementations, at least some of the training cycles use batches of training examples that have a same number of gapped spatial representations and non-gapped spatial representations.
- a masked label does not contribute to error determination, and therefore does not contribute to training of the pathogenicity classifier 2108/2600/2700.
- the masked label is zeroed-out.
- the gapped spatial representations are weighted differently from the non-gapped spatial representations, such that a contribution of the gapped spatial representations to gradient updates applied to parameters of the pathogenicity classifier 2108/2600/2700 in response to the pathogenicity classifier 2108/2600/2700 processing the gapped spatial representations varies from a contribution of the non-gapped spatial representations to gradient updates applied to the parameters of the pathogenicity classifier 2108/2600/2700 in response to the pathogenicity classifier 2108/2600/2700 processing the non-gapped spatial representations.
- the variation is determined by pre-defined weights.
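One simple way to realize different contributions for gapped and non-gapped examples is to scale each example's loss by a pre-defined weight chosen from the sample indicator. The sketch below assumes per-example losses have already been computed (for instance with a masked loss); the function name, the weight values of 0.5 and 1.0, and the normalization by the sum of weights are illustrative assumptions.

```python
import numpy as np


def weighted_batch_loss(per_example_losses, is_gapped,
                        gapped_weight=0.5, non_gapped_weight=1.0):
    """Weighted mean of per-example losses using the gapped/non-gapped indicator."""
    losses = np.asarray(per_example_losses, dtype=float)
    indicator = np.asarray(is_gapped, dtype=bool)
    weights = np.where(indicator, gapped_weight, non_gapped_weight)
    return float((weights * losses).sum() / weights.sum())


# Hypothetical batch with one gapped and one non-gapped example.
batch_loss = weighted_batch_loss([1.0, 3.0], [True, False])
```

Because gradients scale linearly with the loss, down-weighting an example's loss down-weights its share of the gradient update by the same factor.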
- Figures 66A-B illustrate one implementation of using transfer learning 6600 to train the models disclosed herein using the two datasets shown in Figure 64.
- the pathogenicity classifier 2108/2600/2700 is first trained on the gapped training set (i.e., the JigsawAI dataset 6406) to generate the trained pathogenicity classifier 2108/2600/2700.
- the trained pathogenicity classifier 2108/2600/2700 is further trained on the non-gapped training set (i.e., the PrimateAI dataset 6408) to generate a retrained pathogenicity classifier 2108/2600/2700.
- the retrained pathogenicity classifier 2108/2600/2700 is used to determine pathogenicity of variants.
- performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a first validation set that includes only non-gapped spatial representations of held-out protein samples.
- performance of the retrained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a second validation set that includes gapped spatial representations and non-gapped spatial representations of held-out protein samples.
- the retrained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for the pair in response to processing the pair.
- a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.
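The two-stage schedule of transfer learning can be sketched with a toy model. Everything below is a stand-in: a linear layer with sigmoid outputs replaces the pathogenicity classifier, random feature vectors replace the voxelized spatial representations, and the labels are random. The loss (masked binary cross-entropy), learning rates, epoch counts and array sizes are assumptions. Only the order of operations, gapped data first and then further training of the same weights on non-gapped data, reflects the description.

```python
import numpy as np

NUM_CLASSES = 20   # one output per amino acid class
NUM_FEATURES = 8   # hypothetical size of a flattened input representation


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train(weights, features, labels, mask, epochs=50, lr=0.1):
    """Gradient descent on a masked mean binary cross-entropy; returns new weights."""
    w = weights.copy()
    for _ in range(epochs):
        preds = sigmoid(features @ w)
        # Gradient with respect to the logits; masked entries contribute zero.
        grad_logits = (preds - labels) * mask / max(mask.sum(), 1.0)
        w -= lr * (features.T @ grad_logits)
    return w


rng = np.random.default_rng(0)

# Stage 1 data: gapped samples with labels for all 20 classes (toy values).
gapped_x = rng.normal(size=(64, NUM_FEATURES))
gapped_y = rng.integers(0, 2, size=(64, NUM_CLASSES)).astype(float)
gapped_mask = np.ones((64, NUM_CLASSES))

# Stage 2 data: non-gapped samples with one labelled class and 19 masked classes.
non_gapped_x = rng.normal(size=(16, NUM_FEATURES))
non_gapped_y = np.zeros((16, NUM_CLASSES))
non_gapped_mask = np.zeros((16, NUM_CLASSES))
alt_idx = rng.integers(0, NUM_CLASSES, size=16)
non_gapped_mask[np.arange(16), alt_idx] = 1.0
non_gapped_y[np.arange(16), alt_idx] = rng.integers(0, 2, size=16)

initial = np.zeros((NUM_FEATURES, NUM_CLASSES))
trained = train(initial, gapped_x, gapped_y, gapped_mask)            # stage 1
retrained = train(trained, non_gapped_x, non_gapped_y,
                  non_gapped_mask, lr=0.05)                          # stage 2
```

Starting stage 2 from the stage 1 weights, rather than from `initial`, is what distinguishes this schedule from training on the non-gapped set alone.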
- Figure 67 shows one implementation of generating 6700 training data and labels to train the models disclosed herein.
- a proteome accessor 6704 accesses a multitude of amino acid positions in a proteome with a plurality of proteins.
- a reference specifier 6714 specifies major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins.
- a benign labeler 6724, for each amino acid position in the multitude of amino acid positions, classifies those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein.
- a pathogenic labeler 6734, for each amino acid position in the multitude of amino acid positions, classifies those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position.
- the alternate amino acids are different from the particular reference amino acid.
- a trainer 6744 trains a variant pathogenicity classifier 2108/2600/2700 on training data comprising spatial representations of protein samples, such that the spatial representations are assigned ground truth benign labels that correspond to the benign variants, and ground truth pathogenic labels that correspond to the pathogenic variants.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective substitutions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the insertion. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective insertions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a spatial tolerance score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective spatial tolerance scores for the respective substitutions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a spatial tolerance score for the insertion. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective spatial tolerance scores for the respective insertions.
- the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate an evolutionary conservation score for the substitution.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective evolutionary conservation scores for the respective substitutions.
- the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate an evolutionary conservation score for the insertion.
- the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective evolutionary conservation scores for the respective insertions.
- the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
- spatial tolerance corresponds to structural tolerance
- spatial intolerance corresponds to structural intolerance
- the multitude of amino acid positions ranges from one million to ten million amino acid positions. In different implementations, the multitude of amino acid positions ranges from ten million to one hundred million amino acid positions. In different implementations, the multitude of amino acid positions ranges from one hundred million to one billion amino acid positions. In different implementations, the multitude of amino acid positions ranges from one to one million amino acid positions.
- those alternate amino acid classes that are unreachable, because no single nucleotide polymorphism (SNP) can transform a reference codon of a reference amino acid into an alternate amino acid of those classes, are masked in the ground truth labels.
- masked amino acid classes result in zero loss and do not contribute to gradient updates.
- the masked amino acid classes are identified in a lookup table.
- the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
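The masking described above can be illustrated with a minimal sketch. It assumes the standard genetic code, keys the lookup table by reference codon (a per-position table would map each position to its codon first), and uses a per-class binary cross-entropy as the loss. The names `reachable_classes`, `masked_classes`, `MASK_LOOKUP`, and `masked_bce` are hypothetical and are not taken from the disclosure.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_BY_CODON)
}
AA_CLASSES = sorted(set(AA_BY_CODON) - {"*"})  # the twenty amino acid classes


def reachable_classes(ref_codon):
    """Alternate amino acid classes reachable by one nucleotide change."""
    reached = set()
    for pos, base in product(range(3), BASES):
        if base == ref_codon[pos]:
            continue
        alt_codon = ref_codon[:pos] + base + ref_codon[pos + 1:]
        reached.add(CODON_TABLE[alt_codon])
    # Stop codons and synonymous changes are not alternate classes.
    return reached - {"*", CODON_TABLE[ref_codon]}


def masked_classes(ref_codon):
    """Classes that are neither the reference nor SNP-reachable."""
    ref_aa = CODON_TABLE[ref_codon]
    return set(AA_CLASSES) - reachable_classes(ref_codon) - {ref_aa}


# Assumption: lookup table keyed by reference codon (sense codons only).
MASK_LOOKUP = {
    codon: masked_classes(codon)
    for codon, aa in CODON_TABLE.items()
    if aa != "*"
}


def masked_bce(pred, labels, masked):
    """Binary cross-entropy summed over unmasked classes only."""
    loss = 0.0
    for aa in AA_CLASSES:
        if aa in masked:
            continue  # masked class: zero loss, so no gradient contribution
        p = min(max(pred[aa], 1e-7), 1.0 - 1e-7)
        y = labels[aa]
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss
```

Because masked classes are skipped entirely, changing the predicted score of a masked class leaves the loss unchanged, which is the behavior the text describes as zero loss and no gradient update.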
- the spatial representations are structural representations of protein structures of the protein samples.
- the spatial representations are encoded using voxelization.
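As a rough illustration of voxelization, the sketch below bins atom coordinates into a cubic grid centered on a point of interest and counts atoms per voxel. This occupancy-count encoding, the grid size, and the voxel width are assumptions chosen for brevity; the function name `voxelize` is hypothetical and the disclosure's own encoding may use different channels.

```python
def voxelize(atoms, center, grid_size=3, voxel_width=2.0):
    """Count atoms per voxel in a cubic grid centered on `center`.

    atoms: iterable of (x, y, z) coordinates.
    Returns a sparse grid as {(i, j, k): atom_count}.
    """
    half_extent = grid_size * voxel_width / 2.0
    grid = {}
    for atom in atoms:
        index = []
        for coord, origin in zip(atom, center):
            # Shift so the grid spans [0, grid_size * voxel_width).
            offset = coord - origin + half_extent
            index.append(int(offset // voxel_width))
        # Atoms outside the grid are ignored.
        if all(0 <= i < grid_size for i in index):
            key = tuple(index)
            grid[key] = grid.get(key, 0) + 1
    return grid
```

For example, with the default 3x3x3 grid of 2.0-wide voxels, two atoms near the center fall into the middle voxel `(1, 1, 1)` and an atom 10 units away is dropped.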
- Figure 68 illustrates one implementation of a method 6800 of determining pathogenicity of nucleotide variants.
- the method includes, at action 6802, accessing a spatial representation of a protein.
- the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein.
- the method includes, at action 6812, removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein.
- the removal of the particular spatial configuration is implemented (or automated) by a script.
- the method includes, at action 6822, determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
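Actions 6802 to 6822 can be sketched as follows, assuming a spatial representation stored as a dictionary from position to an amino acid letter and its atom coordinates. The data layout and the helpers `make_gapped`, `one_hot`, and `build_model_input` are hypothetical; the pathogenicity predictor that would consume the resulting pair is not shown.

```python
AA_CLASSES = "ACDEFGHIKLMNPQRSTVWY"


def make_gapped(spatial_rep, position):
    """Return a copy of the representation without the residue at `position`."""
    if position not in spatial_rep:
        raise KeyError(f"no residue at position {position}")
    return {pos: res for pos, res in spatial_rep.items() if pos != position}


def one_hot(amino_acid):
    """One-hot representation of an alternate amino acid."""
    return [1.0 if aa == amino_acid else 0.0 for aa in AA_CLASSES]


def build_model_input(spatial_rep, position, alt_amino_acid):
    """Pair the gapped representation with the alternate amino acid."""
    return make_gapped(spatial_rep, position), one_hot(alt_amino_acid)


# Toy protein: position -> (amino acid, list of atom coordinates).
protein = {
    1: ("M", [(0.0, 0.0, 0.0)]),
    2: ("K", [(1.5, 0.0, 0.0)]),
    3: ("L", [(3.0, 0.0, 0.0)]),
}
gapped, alt = build_model_input(protein, position=2, alt_amino_acid="E")
```

The original representation is left untouched, so the same protein can be gapped at each position in turn.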
- Figure 69 illustrates one implementation of a system 6900 to predict structural tolerability of amino acid substitutes.
- a gapping logic is configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein.
- a structural tolerability prediction logic is configured to process the spatial representation of the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.
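To make the idea of ranking by neighborhood co-occurrence concrete, the sketch below replaces the trained prediction logic with a simple count-based stand-in: it tallies how often each amino acid is observed next to each neighbor and scores a candidate by the smoothed log-frequency of the neighbors around the vacancy. The scoring rule, the pseudocount, and the names `cooccurrence_counts` and `rank_substitutes` are assumptions for illustration only.

```python
import math
from collections import Counter


def cooccurrence_counts(examples):
    """examples: iterable of (center_aa, [neighbor_aa, ...]) observations."""
    counts = Counter()
    for center, neighbors in examples:
        for neighbor in neighbors:
            counts[(center, neighbor)] += 1
    return counts


def rank_substitutes(counts, neighbors, candidates, pseudo=1.0, n_classes=20):
    """Rank candidates for the vacancy, most tolerated first."""
    scores = {}
    for cand in candidates:
        total = sum(c for (center, _), c in counts.items() if center == cand)
        denom = total + pseudo * n_classes
        # Sum of smoothed log-frequencies of each observed neighbor.
        scores[cand] = sum(
            math.log((counts[(cand, n)] + pseudo) / denom) for n in neighbors
        )
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


# Hypothetical observations: hydrophobic L near I/V, charged D near K/R.
observed = [("L", ["I", "V", "L"]), ("L", ["I", "V"]),
            ("D", ["K", "R"]), ("D", ["K"])]
ranking = rank_substitutes(cooccurrence_counts(observed), ["I", "V"], ["D", "L"])
```

With these toy observations, a vacancy surrounded by I and V ranks L ahead of D.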
- the variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as “PrimateAI 3D.”
- “PrimateAI” is a commonly owned and previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned US Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149 and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).
- FIGS. 70A, 70B and 70C are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets.
- New developmental delay disorder (new DDD) is one example of a validation set used to compare the classification accuracy of Transfer Learning against Combined Learning against PrimateAI 3D against PrimateAI.
- the new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with the DDD as benign.
- a similar labelling scheme is used with an autism spectrum disorder (ASD) validation set.
- BRCA1 is another example of a validation set used to compare the classification accuracy of Transfer Learning against Combined Learning against PrimateAI 3D against PrimateAI.
- the BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants.
- a similar labelling scheme is used with different validation sets of the TP53 gene, TP53S3 gene and its variants, and other genes and their variants shown in Figures 70A, 70B and 70C.
- a separate “mean” chart calculates the mean of the p-values determined for each of the validation sets.
- Combined Learning generally outperforms other approaches, followed by Transfer Learning, which is in turn followed by PrimateAI 3D, as indicated by the horizontal bars for Combined Learning being consistently longer than the horizontal bars for other approaches.
- the mean statistics may be biased by outliers.
- a separate “method ranks” chart is also depicted in Figures 70A, 70B and 70C. Higher rank denotes poorer classification accuracy.
- Combined Learning generally outperforms other approaches, followed by Transfer Learning, which is in turn followed by PrimateAI 3D.
- having more counts of the lower ranks 1 and 2 is better than having more counts of the higher rank 3.
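The difference between the “mean” chart and the “method ranks” chart can be sketched with a small tally. The code assumes one score per method per validation set where a higher score is better, breaks ties by sort order, and uses placeholder method names and made-up numbers; none of the values come from Figures 70A, 70B and 70C.

```python
from collections import Counter


def method_ranks(scores_by_set):
    """Count how often each method attains each rank (1 = best)."""
    tallies = {}
    for scores in scores_by_set.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            tallies.setdefault(method, Counter())[rank] += 1
    return tallies


def mean_scores(scores_by_set):
    """Mean score of each method across validation sets."""
    sums = {}
    for scores in scores_by_set.values():
        for method, score in scores.items():
            sums[method] = sums.get(method, 0.0) + score
    return {m: total / len(scores_by_set) for m, total in sums.items()}


# Hypothetical scores for three placeholder methods on three validation sets.
scores = {
    "set_1": {"A": 5.0, "B": 4.0, "C": 3.0},
    "set_2": {"A": 6.0, "B": 5.0, "C": 4.0},
    "set_3": {"A": 2.0, "B": 30.0, "C": 1.0},  # outlier for method B
}
ranks = method_ranks(scores)
means = mean_scores(scores)
```

Here method B has the highest mean only because of a single outlier, while the rank tally shows method A ranked first on two of the three sets. This is why a rank chart is a useful complement to a mean that may be biased by outliers.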
- the technology disclosed can be practiced as a system, method, or article of manufacture.
- One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable.
- One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections - these recitations are hereby incorporated forward by reference into each of the following implementations.
- One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
- one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
- clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section.
- implementations of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; and determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
- a pathogenicity predictor determines the pathogenicity of the nucleotide variant by processing, as input, the gapped spatial representation, and the representation of the alternate amino acid; and generating, as output, a pathogenicity score for the alternate amino acid.
- a particular gapped spatial representation of the particular benign protein sample wherein the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular benign protein sample as non-gap amino acids, and
- each of the benign protein samples has a ground truth benignness label that indicates absolute benignness of the benign protein samples.
- a particular gapped spatial representation of the particular pathogenic protein sample wherein the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular pathogenic protein sample as non-gap amino acids, and
- proteome includes human proteome and non-human proteome, including non-human primate proteome.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid of a particular amino acid class at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; and based at least in part on the gapped spatial representation, determining a pathogenicity of respective alternate amino acids at the particular position.
- a pathogenicity predictor determines the pathogenicity of the respective alternate amino acids by processing, as input, the gapped spatial representation; and generating, as output, respective pathogenicity scores for respective amino acid classes.
- the pathogenicity predictor trains on a particular protein sample and estimates a pathogenicity of respective alternate amino acids for a particular reference amino acid at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation of the particular protein sample, wherein the particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids; and generating, as output, respective pathogenicity scores for the respective amino acid classes.
- proteome includes human proteome and non-human proteome, including non-human primate proteome.
- a computer-implemented method of generating training data for training a variant pathogenicity classifier including: accessing a multitude of amino acid positions in a proteome with a plurality of proteins; specifying major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins; for each amino acid position in the multitude of amino acid positions, classifying those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein, and classifying those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position, wherein the alternate amino acids are different from the particular reference amino acid; and training a variant pathogenicity classifier using the benign variants and the pathogenic variants as training data.
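The labelling rule in this clause, where the reference amino acid at a position is treated as benign and every other amino acid class as pathogenic, can be sketched as below. The 0/1 label encoding and the helpers `ground_truth_labels` and `training_labels` are hypothetical, and the toy sequence stands in for major allele reference amino acids.

```python
AA_CLASSES = "ACDEFGHIKLMNPQRSTVWY"
BENIGN, PATHOGENIC = 0, 1  # assumed label encoding


def ground_truth_labels(ref_amino_acid):
    """Benign for the reference class, pathogenic for every alternate class."""
    return {
        aa: BENIGN if aa == ref_amino_acid else PATHOGENIC
        for aa in AA_CLASSES
    }


def training_labels(reference_sequence):
    """One (position, reference, labels) record per amino acid position."""
    return [
        (position, ref, ground_truth_labels(ref))
        for position, ref in enumerate(reference_sequence, start=1)
    ]


records = training_labels("MKL")  # toy reference sequence
```

Each position yields one benign label and nineteen pathogenic labels, before any masking of unreachable classes is applied.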
- a computer-implemented method of determining pathogenicity of nucleotide variants including: specifying a particular amino acid at a particular position in a protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; determining an evolutionary conservation at the particular position of respective amino acids of respective amino acid classes based at least in part on the gapped spatial representation; and based at least in part on the evolutionary conservation of the respective amino acids, determining a pathogenicity of respective nucleotide variants that respectively substitute the particular amino acid with the respective amino acids in alternate representations of the protein.
- each of the respective positions has a set of conserved amino acids and a set of non-conserved amino acids.
- the evolutionary conservation predictor trains on a particular conserved protein sample and estimates an evolutionary conservation of a particular conserved amino acid at a particular position in the particular conserved protein sample by processing, as input, a particular gapped spatial representation of the particular conserved protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular conserved protein sample as non-gap amino acids; and generating, as output, an evolutionary conservation score for the particular conserved amino acid.
- the evolutionary conservation predictor trains on a particular non-conserved protein sample and estimates an evolutionary conservation of a particular non-conserved amino acid at a particular position in the particular non-conserved protein sample by processing, as input, a particular gapped spatial representation of the particular non-conserved protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular non-conserved protein sample as non-gap amino acids; and generating, as output, an evolutionary conservation score for the particular non-conserved amino acid.
- the evolutionary conservation predictor trains on a particular protein sample and estimates an evolutionary conservation of respective amino acids of respective amino acid classes at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation of the particular protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids; and generating, as output, respective evolutionary conservation scores for the respective amino acids.
- a computer-implemented method of training a pathogenicity predictor including: accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome; accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples; generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples; training a pathogenicity predictor over one or more training cycles and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations; and using the trained pathogenicity classifier to determine pathogenicity of variants.
- a computer-implemented method of training a pathogenicity predictor including: starting with training a pathogenicity classifier on a gapped training set and generating a trained pathogenicity classifier; further training the trained pathogenicity classifier on a non-gapped training set and generating a retrained pathogenicity classifier; and using the retrained pathogenicity classifier to determine pathogenicity of variants.
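The two preceding clauses differ in how training examples are scheduled: in the first, each training cycle draws on both gapped and non-gapped examples (Combined Learning), and in the second, training on the gapped set comes first and training on the non-gapped set follows (Transfer Learning). The sketch below shows only that scheduling, with the model update left out. The generator names, the shuffling, and the cycle counts are assumptions.

```python
import random


def combined_schedule(gapped, non_gapped, cycles, seed=0):
    """Each cycle mixes gapped and non-gapped examples."""
    rng = random.Random(seed)
    for _ in range(cycles):
        batch = list(gapped) + list(non_gapped)
        rng.shuffle(batch)
        yield batch


def transfer_schedule(gapped, non_gapped, pretrain_cycles, finetune_cycles):
    """Train on gapped examples first, then continue on non-gapped ones."""
    for _ in range(pretrain_cycles):
        yield list(gapped)
    for _ in range(finetune_cycles):
        yield list(non_gapped)


# Toy examples tagged by the training set they come from.
gapped_set = [("gapped", i) for i in range(3)]
non_gapped_set = [("non_gapped", i) for i in range(2)]

combined = list(combined_schedule(gapped_set, non_gapped_set, cycles=2))
transfer = list(transfer_schedule(gapped_set, non_gapped_set, 2, 1))
```

In a real training loop, each yielded batch would be passed to an optimizer step; with the transfer schedule, the weights learned from the gapped cycles serve as the starting point for the non-gapped cycles.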
- a computer-implemented method of training a pathogenicity predictor including: accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome, wherein the respective gapped protein samples are labelled with respective gapped ground truth sequences, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein, and has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position; accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant,
- a computer-implemented method of generating training data for training a variant pathogenicity classifier including: accessing a multitude of amino acid positions in a proteome with a plurality of proteins; specifying major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins; for each amino acid position in the multitude of amino acid positions, classifying those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein, and classifying those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position, wherein the alternate amino acids are different from the particular reference amino acid; and training a variant pathogenicity classifier on training data comprising spatial representations of protein samples, such that the spatial representations are assigned ground truth benign labels that correspond to the benign variants, and ground truth pathogenic labels that correspond to the pathogenic variants.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a spatial representation of a protein, wherein the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein; removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: removing, from a protein, a particular amino acid at a particular position, thereby generating a gapped protein; and determining a pathogenicity of a nucleotide variant based at least in part on the gapped protein and an alternate amino acid created by the nucleotide variant at the particular position.
- a system to predict spatial tolerability of amino acid substitutes comprising: gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and substitution logic configured to process the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.
- substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid of a particular amino acid class at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; and based at least in part on the gapped spatial representation, determining a pathogenicity of respective alternate amino acids at the particular position, wherein the respective alternate amino acids have respective amino acid classes that are different from the particular amino acid class.
- a system to predict evolutionary conservation of amino acid substitutes comprising: gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and substitution logic configured to process the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
- substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
- a system to predict evolutionary conservation of amino acid substitutes comprising: gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and evolutionary conservation prediction logic configured to process the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
- a system to predict structural tolerability of amino acid substitutes comprising: gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and structural tolerability prediction logic configured to process the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; determining an evolutionary conservation of an alternate amino acid at the particular position based at least in part on the gapped spatial representation, and a representation of the alternate amino acid; and determining a pathogenicity of a nucleotide variant that creates the alternate amino acid based at least in part on the evolutionary conservation.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a spatial representation of a protein, wherein the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein; removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: removing, from a spatial representation of a protein, a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation of the protein and an alternate amino acid created by the nucleotide variant at the particular position.
- a system to predict spatial tolerability of amino acid substitutes comprising: gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and substitution logic configured to process the spatial representation of the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.
- substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid of a particular amino acid class at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; and based at least in part on the gapped spatial representation, determining a pathogenicity of respective alternate amino acids at the particular position, wherein the respective alternate amino acids have respective amino acid classes that are different from the particular amino acid class.
- a system to predict evolutionary conservation of amino acid substitutes comprising: gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and substitution logic configured to process the spatial representation of the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
- substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
- a system to predict evolutionary conservation of amino acid substitutes comprising: gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and evolutionary conservation prediction logic configured to process the spatial representation of the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
- a system to predict structural tolerability of amino acid substitutes comprising: gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and structural tolerability prediction logic configured to process the spatial representation of the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.
- a computer-implemented method of determining pathogenicity of nucleotide variants including: accessing a protein that has respective amino acids at respective positions; specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids; generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid; determining an evolutionary conservation of an alternate amino acid at the particular position based at least in part on the gapped spatial representation, and a representation of the alternate amino acid; and determining a pathogenicity of a nucleotide variant that creates the alternate amino acid based at least in part on the evolutionary conservation.
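The gapping step that recurs in the items above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Residue` type, the single-coordinate-per-residue representation, and the function names `gap_protein`, `candidate_substitutes`, and `neighborhood` are all hypothetical, chosen only for illustration. The sketch removes one residue to create the vacancy, lists the alternate amino acid classes that could fill it, and collects the non-gap residues near the vacancy. The scoring of substitutes (tolerability, evolutionary conservation, pathogenicity) would require a trained model and is deliberately left out.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# The 20 standard amino acid classes (one-letter codes).
AMINO_ACID_CLASSES = "ACDEFGHIKLMNPQRSTVWY"


@dataclass(frozen=True)
class Residue:
    """Hypothetical, simplified spatial record for one amino acid."""
    position: int                        # index in the protein sequence
    amino_acid: str                      # one-letter amino acid class
    coords: Tuple[float, float, float]   # assumed single 3D coordinate


def gap_protein(residues: List[Residue], gap_position: int):
    """Split a protein into (non-gap residues, gap residue).

    The returned list is the gapped spatial representation: it keeps the
    spatial configuration of every non-gap amino acid and excludes the
    gap amino acid.
    """
    gap = [r for r in residues if r.position == gap_position]
    if len(gap) != 1:
        raise ValueError(f"expected exactly one residue at {gap_position}")
    gapped = [r for r in residues if r.position != gap_position]
    return gapped, gap[0]


def candidate_substitutes(gap_residue: Residue) -> List[str]:
    """Alternate amino acid classes that differ from the removed one."""
    return [aa for aa in AMINO_ACID_CLASSES if aa != gap_residue.amino_acid]


def neighborhood(gapped: List[Residue], center, radius: float) -> List[Residue]:
    """Non-gap residues within `radius` of the vacancy location."""
    return [r for r in gapped if math.dist(r.coords, center) <= radius]


# Toy four-residue protein with made-up coordinates.
protein = [
    Residue(0, "M", (0.0, 0.0, 0.0)),
    Residue(1, "L", (3.8, 0.0, 0.0)),
    Residue(2, "K", (7.6, 0.0, 0.0)),
    Residue(3, "V", (20.0, 0.0, 0.0)),
]

gapped, gap = gap_protein(protein, gap_position=1)
subs = candidate_substitutes(gap)
near = neighborhood(gapped, gap.coords, radius=5.0)
```

In this sketch a downstream model would receive `gapped` (or the `near` subset) together with a representation of each entry in `subs` and return a score per substitute; how that model is built and trained is outside the scope of the example.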
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22800024.6A EP4413575A1 (fr) | 2021-10-06 | 2022-10-05 | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
CN202280046352.3A CN117581302A (zh) | 2021-10-06 | 2022-10-05 | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
KR1020237045483A KR20240088641A (ko) | 2021-10-06 | 2022-10-05 | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
Applications Claiming Priority (12)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163253122P | 2021-10-06 | 2021-10-06 | |
US63/253,122 | 2021-10-06 | ||
US202163281592P | 2021-11-19 | 2021-11-19 | |
US202163281579P | 2021-11-19 | 2021-11-19 | |
US63/281,592 | 2021-11-19 | ||
US63/281,579 | 2021-11-19 | ||
US17/533,091 US11538555B1 (en) | 2021-10-06 | 2021-11-22 | Protein structure-based protein language models |
US17/533,091 | 2021-11-22 | ||
US17/953,293 US20230108368A1 (en) | 2021-10-06 | 2022-09-26 | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
US17/953,286 US20230108241A1 (en) | 2021-10-06 | 2022-09-26 | Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3d) protein structure voxels |
US17/953,286 | 2022-09-26 | ||
US17/953,293 | 2022-09-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023059750A1 (fr) | 2023-04-13 |
Family
ID=84053342
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/045825 WO2023059752A1 (fr) | 2021-10-06 | 2022-10-05 | Protein structure-based protein language models |
PCT/US2022/045823 WO2023059750A1 (fr) | 2021-10-06 | 2022-10-05 | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
PCT/US2022/045824 WO2023059751A1 (fr) | 2021-10-06 | 2022-10-05 | Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3D) protein structure voxels |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/045825 WO2023059752A1 (fr) | 2021-10-06 | 2022-10-05 | Protein structure-based protein language models |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/045824 WO2023059751A1 (fr) | 2021-10-06 | 2022-10-05 | Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3D) protein structure voxels |
Country Status (1)
Country | Link |
---|---|
WO (3) | WO2023059752A1 (fr) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US17953A (en) | 1857-08-04 | feger |
2022
- 2022-10-05 WO PCT/US2022/045825 patent/WO2023059752A1/fr active Application Filing
- 2022-10-05 WO PCT/US2022/045823 patent/WO2023059750A1/fr active Application Filing
- 2022-10-05 WO PCT/US2022/045824 patent/WO2023059751A1/fr active Application Filing
Non-Patent Citations (4)
Title |
---|
JAGANATHAN, K ET AL.: "Predicting splicing from primary sequence with deep learning", CELL, vol. 176, 2019, pages 535 - 548 |
PEI, P; ZHANG, A: "A Topological Measurement for Weighted Protein Interaction Network", CSB, 2005, pages 268 - 278, XP010831154, DOI: 10.1109/CSB.2005.8 |
SUNDARAM LAKSSHMAN ET AL: "Predicting the clinical impact of human mutation with deep neural networks", NATURE GENETICS, NATURE PUBLISHING GROUP US, NEW YORK, vol. 50, no. 8, 23 July 2018 (2018-07-23), pages 1161 - 1170, XP036902750, ISSN: 1061-4036, [retrieved on 20180723], DOI: 10.1038/S41588-018-0167-Z * |
SUNDARAM, L ET AL.: "Predicting the clinical impact of human mutation with deep neural networks", NAT. GENET., vol. 50, 2018, pages 1161 - 1170, XP036902750, DOI: 10.1038/s41588-018-0167-z |
Also Published As
Publication number | Publication date |
---|---|
WO2023059751A1 (fr) | 2023-04-13 |
WO2023059752A1 (fr) | 2023-04-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230045003A1 (en) | Deep learning-based use of protein contact maps for variant pathogenicity prediction | |
WO2023014912A1 (fr) | Transfer learning-based use of protein contact maps for variant pathogenicity prediction | |
US20230108368A1 (en) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
US20220336057A1 (en) | Efficient voxelization for deep learning | |
US11515010B2 (en) | Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures | |
AU2022259667A1 (en) | Efficient voxelization for deep learning | |
WO2022221587A1 (fr) | Artificial intelligence-based analysis of three-dimensional (3D) protein structures | |
US20230343413A1 (en) | Protein structure-based protein language models | |
US20230047347A1 (en) | Deep neural network-based variant pathogenicity prediction | |
WO2023059750A1 (fr) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
EP4413575A1 (fr) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
JP2024538478A (ja) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
JP2024538477A (ja) | Protein structure-based protein language models | |
JP2024538475A (ja) | Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3D) protein structure voxels | |
CN117581302A (zh) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
CN117178327A (zh) | Multi-channel protein voxelization to predict variant pathogenicity using deep convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22800024; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2023580573; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 202280046352.3; Country of ref document: CN |
| | WWE | WIPO information: entry into national phase | Ref document number: 2022800024; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2022800024; Country of ref document: EP; Effective date: 20240506 |