WO2023129957A1 - Deep learning network for evolutionary conservation determination of nucleotide sequences - Google Patents

Deep learning network for evolutionary conservation determination of nucleotide sequences

Info

Publication number
WO2023129957A1
Authority
WO
WIPO (PCT)
Prior art keywords
species
model
base
query
masked
Prior art date
Application number
PCT/US2022/082467
Other languages
French (fr)
Inventor
Sabrina RASHID
Kai-How FARH
Original Assignee
Illumina, Inc.
Priority claimed from US17/947,049 (published as US20230207054A1)
Application filed by Illumina, Inc.
Publication of WO2023129957A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B 20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B 10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G16B 30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B 30/10 Sequence alignment; Homology search
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20 Supervised data analysis

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • the technology disclosed relates to using deep convolutional neural networks to analyze ordered data.
  • Under natural selection, a phenotype is favored over other phenotypes, causing an increase in the frequency of alleles that provide fitness advantages and their conservation through speciation events (i.e., lineage-splitting events that produce two or more separate species) and genome duplications. Consequently, conservation analysis relying on sequence similarities among different species has been exploited.
  • the primate lineage comprises 521 species that are separated by a fraction of this time.
  • the total phylogenetic branch length within these primates is only 10% that of the placental mammal alignment.
  • Because sequencing data is multi- and high-dimensional, deep neural networks hold great promise for analyzing it, owing to their broad applicability and enhanced prediction power.
  • Different types of neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. More importantly, an opportunity arises to use deep learning models to detect conservation at short evolutionary timescales.
  • Figure 1 illustrates an example system of detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed.
  • Figure 2 illustrates an example of multiple sequence alignment, in accordance with one implementation of the technology disclosed.
  • Figure 3 is an example ideogram of the human genome depicting the average number of species covering each position in the multiple sequence alignment, in accordance with one implementation of the technology disclosed.
  • Figure 4A illustrates a genome-wide distribution of primate assembly coverage for each 100kb segment of the human genome, in accordance with one implementation of the technology disclosed.
  • Figure 4B illustrates per base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies, in accordance with one implementation of the technology disclosed.
  • Figure 5 is a schematic representation of an encoder-decoder architecture, in accordance with one implementation of the technology disclosed.
  • Figure 6 shows an overview of an attention mechanism added onto an RNN encoder-decoder architecture, in accordance with one implementation of the technology disclosed.
  • Figure 7 is a schematic representation of the calculation of self-attention showing one attention head, in accordance with one implementation of the technology disclosed.
  • Figure 8 illustrates several attention heads in a Transformer block, in accordance with one implementation of the technology disclosed.
  • Figure 9 illustrates parallel execution of multi-head attention logics, in accordance with one implementation of the technology disclosed.
  • Figure 10 illustrates one encoder layer of a Transformer network, in accordance with one implementation of the technology disclosed.
  • Figure 11 illustrates a schematic overview of a Transformer model, in accordance with one implementation of the technology disclosed.
  • FIG. 12A shows a Vision Transformer (ViT), in accordance with one implementation of the technology disclosed.
  • Figure 12B shows a Transformer block used by the Vision Transformer, in accordance with one implementation of the technology disclosed.
  • Figures 13A, 13B, 13C, and 13D show details of the Transformer block of Figure 12B, in accordance with one implementation of the technology disclosed.
  • Figure 14 shows an example source code implementing the Vision Transformer, in accordance with one implementation of the technology disclosed.
  • Figure 15 illustrates an example of the first model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.
  • Figure 16 illustrates an example architecture of the first model with layer dimensions, in accordance with one implementation of the technology disclosed.
  • Figure 17 illustrates an example of the second model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.
  • Figure 18 illustrates an example architecture of the second model with layer dimensions, in accordance with one implementation of the technology disclosed.
  • Figure 19 illustrates training and testing curves for the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), in accordance with one implementation of the technology disclosed.
  • Figure 20 is a table showing the training and testing accuracy in predicting the masked nucleotide using the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), respectively, in accordance with one implementation of the technology disclosed.
  • Figure 21 illustrates comparison results of mean conservation scores at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue), predicted by an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.
  • Figures 22A and 22B illustrate the distribution of conservation scores in protein-coding exons at codon position 2 (blue) vs codon position 3 (orange) using an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.
  • Figure 23 illustrates the enrichment of highly conserved nucleotides (> 95th percentile of scores) falling in codon position 2 relative to codon position 3, where the highly conserved nucleotides are identified by an example system in accordance with one implementation of the technology disclosed and the PhyloP method, respectively.
  • Figure 24 is an example computer system that can be used to implement the technology disclosed.
  • modules can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved.
  • the modules in the figures can also be thought of as flowchart steps in a method.
  • a module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
  • Rule-based logic as used herein can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein.
  • the “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
  • the rule-based evolutionary conservation determination logic can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function.
  • the logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these.
  • a computer program product can embody the computer program and configuration file portions of the logic.
  • FIG. 1 illustrates an example system 100 for detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed.
  • the first input 102 includes a multiple sequence alignment (MSA) that aligns a query sequence to a plurality of non-query sequences.
  • the query sequence includes a masked base at a target position that is flanked by right and left bases.
  • the query sequence can be a 5 nucleotide (nt) sequence, where the masked base represented by “?” is flanked by two right and two left bases.
  • the second input 112 can include MSA that aligns a query sequence to a plurality of non-query sequences.
  • the number of non-query sequences in the second input 112 can be larger than the number of non-query sequences in the first input 102.
  • the selected query sequence and non-query sequences can indicate ancestral alleles (e.g., sequences found in the last common ancestor of two species). Knowing which allele is ancestral is important for understanding genome evolution. The ancestral allele information is crucial for increasing accuracy when estimating allele ages and can provide a better understanding of genomic signatures due to selection pressures. In addition, knowing the ancestral alleles of variants can provide specific explanations regarding the formation of linkage disequilibrium patterns in the genome. Ancestral allele information is also potentially helpful for understanding the rise and extinction of disease-causing variants and disease etiology.
  • the technology disclosed does not intend to limit the number of non-query sequences in the MSA as the first or second input, which can be 3, 5, 10, 50, 100, 150, 200, 250, 300 and so on. Further, the technology disclosed does not intend to limit the length of query sequences and non-query sequences, which can be 1-10 nt, 11-50 nt, 51-100 nt, 101-200 nt, 201-500 nt and so on.
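  • As an illustration only, the following is a minimal Python/NumPy sketch of how an MSA window with a masked query base could be vectorized for input to the first or second model. The symbol vocabulary, the mask token "?", and the species labels are assumptions for the example and are not prescribed by the disclosure.

```python
import numpy as np

# Hypothetical vocabulary: four bases, unknown (X), missing (-), and a mask token (?).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "X": 4, "-": 5, "?": 6}

def encode_msa(msa_rows):
    """One-hot encode an MSA window given as equal-length strings.

    The query sequence is row 0 and carries a '?' at the masked target position.
    Returns an array of shape (num_sequences, seq_len, vocab_size).
    """
    n, length = len(msa_rows), len(msa_rows[0])
    onehot = np.zeros((n, length, len(VOCAB)), dtype=np.float32)
    for i, row in enumerate(msa_rows):
        for j, symbol in enumerate(row.upper()):
            onehot[i, j, VOCAB.get(symbol, VOCAB["X"])] = 1.0
    return onehot

# 5-nt query (e.g., human) with the center base masked, aligned to four non-query sequences.
window = ["AC?GT",   # query: masked base at the target position
          "ACTGT",   # non-query sequence 1
          "ACTGT",   # non-query sequence 2
          "ACCGT",   # non-query sequence 3
          "AC-GT"]   # non-query sequence 4 (missing base)
x = encode_msa(window)
print(x.shape)  # (5, 5, 7)
```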
  • the first model 104 processes the first input 102 and predicts an identity of the masked base 106.
  • the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
  • the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
  • the probability distribution is a softmax distribution.
  • the first model 104 is a background mutation estimation model (also called neutral substitution rate model) configured to encode a background mutation probability estimation in the identity of the masked base at the target position.
  • Background mutation probability estimation measures random mutation experienced in genes.
  • the first model 104 can estimate a background mutation probability by predicting the identity of a masked base at a target position of a query sequence from human genomes aligned with genomes from other species that are closely related to human.
  • the second model 114 processes the second input 112 and predicts an identity of the masked base 116.
  • the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T and G.
  • the predicted identity is a probability distribution that specifies basewise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
  • the probability distribution is a softmax distribution.
  • the second model 114 is a group-specific mutation probability estimation model configured to encode a group-specific mutation probability estimation in the identity of the masked base at the target position.
  • the second model 114 can measure epistasis-driven mutations experienced in genes of a particular group of species. Epistasis occurs when the combined effect of two or more mutations differs from the sum of their individual effects. In other words, the effect of the mutation is dependent on the genetic context in which it appears. Such dependencies on genetic background or context can cause phenotypic diversity within individuals and, on evolutionary timescales, incompatibilities between different species. For example, some mutations that cause diseases in humans are fixed in the genomes of other species without any apparent deleterious consequences.
  • the measurement of epistasis-driven mutations in a particular group of species allows the prediction of evolutionary outcomes, which will have profound implications for disease diagnosis and treatment.
  • sickle-cell anemia is a multigenic disease at the phenotypic level. It is clinically heterogeneous: of the fifteen potential complications, each patient exhibits three to five, with significant inter-individual variation in intensity. It is caused by three types of genes: the gene housing the primary mutation; pleiotropic genes (genes involved in secondary pathophysiologic events beyond the primary mutation); and epistatic genes (polymorphisms in pleiotropic genes) that significantly modulate the pathophysiology of the disease in a particular patient.
  • the modulations of epistatic genes are an important part of the phenotype and explain the intense inter-individual differences in the severity of the disease, in spite of all the patients having the same primary mutation.
  • the identification of epistasis-driven mutations and evaluation of their potential effects can provide information for diagnosing diseases, in light of the complexity of patient-specific presentations and disease-related phenotypes, and potentially for patient-specific treatment.
  • the first model 104 and the second model 114 are neural networks, for example, feed-forward neural networks that are trained using loss functions via backpropagation.
  • the first model 104 and the second model 114 can be convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, time series data processing models using time series data as input (e.g., sequencing-by-synthesis (SBS) sequencing data). More example architectures of the first model 104 and the second model 114 will be described in detail below in accordance with Figures 5-14.
  • the first model 104 and the second model 114 can estimate evolutionary conservation- related information that is implicitly encoded in the MSA as second-order signals.
  • the term second-order signals refers to information that is implicitly encoded and can be derived from the MSA via the processing of the first model 104 and the second model 114.
  • Evolutionary conservation-related information can include evolutionary distances and length of the branches in the phylogenetic trees. Evolutionary distances refer to the number of nucleotide substitutions per site between two homologous DNA sequences or the number of amino acid substitutions per site between two homologous protein sequences. Estimation of evolutionary distances between sequences is important for constructing phylogenetic trees, dating species’ divergences, and understanding the mechanisms of evolution of genes, proteins, and populations.
  • branch length indicates genetic change. The longer the branch, the more genetic change (or divergence) has occurred during evolution.
  • One way of measuring the extent of genetic change is to estimate the average number of nucleotide or amino acid substitutions per site.
  • the first model 104 and the second model 114 can estimate evolutionary distances and branch lengths that are implicitly encoded across the MSA.
  • the first model 104 and the second model 114 can predict alternative splicing at the masked base in response to processing the MSA that aligns sequences of a given length (e.g., > 5nt).
  • Alternative splicing is a process of selecting different combinations of splice sites within a messenger RNA precursor (pre-mRNA) to produce variably spliced mRNAs.
  • the first model 104 and the second model 114 can be used to identify protein-coding genes that are conserved only in primates and frequently alternatively spliced.
  • the first model 104 and the second model 114 can be configured to detect conservation-driving genomic features.
  • Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites.
  • the first model 104 and the second model 114 can identify hundreds of thousands of noncoding regulatory elements at short branch lengths that are conserved only in primates and not across placental mammals. Considering coding sequences only comprise 1% of the human genome, the emergence of human and primates from other placental mammals can be largely driven by novel regulatory sequence elements rather than innovations in protein-coding sequences.
  • the first model 104 and the second model 114 can also identify human DNase-I hypersensitivity sites that are conserved exclusively in primates at short branch lengths. Existing methods (e.g., the PhyloP method for detecting nonneutral substitution rates on mammalian phylogenies) rely on long branch lengths and are thus incapable of detecting these conserved sites. The first model 104 and the second model 114 identify additional noncoding regulatory elements, suggesting that a significant portion of elements previously believed to lack conservation may represent recent evolutionary adaptations, with sequence conservation detectable in primates.
  • the first model 104 and the second model 114 can be configured to determine a function or feature of the masked base in dependence upon the evolutionary conservation. Hundreds of thousands of genetic variants are determined via genome-wide association studies to be associated with human diseases, the majority of which map to noncoding cis-regulatory elements (e.g., active enhancers and DNase-I hypersensitivity sites). The first model 104 and the second model 114 can predict the identities of the masked base, as regulatory elements. These identified disease-associated regulatory elements and their corresponding conservations can be used to analyze whether the evolutionary sequence adaptations that occurred even closer to the origin of the human species would yield additional conserved cis-regulatory elements associated with functions.
  • the primate species can be subdivided into subsets of monkey and ape species that diverged about 37 million years ago, less than half the divergence time of the overall primate lineage.
  • the first model 104 and the second model 114 can identify additional noncoding regulatory elements that are conserved exclusively in monkey and ape species but not across all primates or mammals. These elements, identified by the two models based on their corresponding evolutionary conservation, suggest that a substantial fraction of disease-associated genetic variation is attributable to functionally conserved, disease-associated regulatory elements with recent evolutionary origins.
  • the system 100 includes an evolutionary conservation determination logic 108 that measures evolutionary conservation of the masked base 110, based on the identities of the masked base at the target position generated from the first and second models, respectively.
  • the evolutionary conservation can be measured as a frequency of occurrence of a particular base at a particular position across the MSA.
  • the evolutionary conservation can be measured as a frequency of substitution of a particular base at a particular position across the MSA.
  • the evolutionary conservation determination logic 108 can measure the evolutionary conservation of the masked base 110 by comparing the two identities (106 and 116) of the masked base predicted from the first and second models. For example, when the identities of the masked base are represented by probability distributions that specify base-wise likelihoods of the masked base, the evolutionary conservation determination logic 108 defines a metric from the probability outputs of the two models, where the metric is representative of an evolutionary conservation score of the masked base. In other implementations, the evolutionary conservation determination logic 108 compares the two identities of the masked base predicted from the first and second models and measures a difference thereof, using, for example, a t-test, KL divergence, or average loss. The measured difference can represent the evolutionary conservation of the masked base.
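  • As a minimal sketch of one of the comparison options named above (KL divergence), assuming the two model outputs are softmax distributions over the four bases, a conservation-style score could be computed as follows; the probability values and the choice of divergence direction are illustrative only, not the disclosed scoring function.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same alphabet."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Softmax outputs over (A, C, G, T) for the same masked base:
p_first = [0.30, 0.25, 0.25, 0.20]    # first model (neutral substitution rate)
p_second = [0.92, 0.03, 0.03, 0.02]   # second model (conservation-adjusted)

# A larger divergence between the background and group-specific predictions is
# read here as stronger evidence of conservation at the masked base.
conservation_score = kl_divergence(p_second, p_first)
print(round(conservation_score, 3))
```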
  • the first model is a neutral substitution rate model
  • the first identity of the masked base 106 represents a neutral substitution rate of the masked base.
  • the second model is a conservation-adjusted substitution rate model
  • the second identity represents a conservation-adjusted substitution rate of the base.
  • the conservation-adjusted substitution rate is compared with the neutral substitution rate. For each individual nucleotide x, the predicted probabilities of the ancestral allele from the neutral substitution rate model (n_x) and the conservation-adjusted substitution rate model (c_x) are extracted.
  • the first model takes as input a query sequence from human and four non-query sequences from four species that are closest to human in terms of evolutionary distance
  • the second model takes as input the query sequence and non-query sequences from 236 primate species.
  • the evolutionary conservation determination logic 108 incorporates an ensemble approach to compute the significance of the difference between c_x and n_x.
  • an MSA includes M sequences from M respective species, each sequence corresponding to one species.
  • Both the neutral substitution rate model and conservation-adjusted substitution rate model are trained centered on each of the M species.
  • the corresponding sequence from each species can be a query sequence, with a center nucleotide on the query sequence being masked.
  • the evolutionary conservation determination logic 108 can extract the predicted probabilities of the ancestral allele, p_x^i and q_x^i, for each query species i. Extracting the ensemble probabilities for the neutral substitution rate model and the conservation-adjusted substitution rate model allows the computation of the significance of the difference between the predicted probabilities of the ancestral allele, p_x^i and q_x^i. The difference can represent the evolutionary conservation of the masked base 110.
  • p_x is the p-value of the t-test statistic at base x
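  • A minimal sketch of this ensemble comparison, assuming the per-species ancestral-allele probabilities p_x^i (neutral model) and q_x^i (conservation-adjusted model) are available as arrays; SciPy's paired t-test is used here as one plausible realization of the t-test statistic, and the numbers are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical ensemble probabilities of the ancestral allele at base x,
# one entry per query species i in the M-species MSA.
p_x = np.array([0.31, 0.28, 0.35, 0.30, 0.27, 0.33])  # neutral substitution rate model
q_x = np.array([0.88, 0.91, 0.84, 0.90, 0.87, 0.93])  # conservation-adjusted model

# Paired t-test across the M species; a small p-value indicates that the
# conservation-adjusted probabilities differ significantly from neutral.
statistic, p_value = ttest_rel(q_x, p_x)
print(f"t = {statistic:.2f}, p_x = {p_value:.3g}")
```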
  • the capability of identifying conserved regions of nucleotides without being limited by the short evolutionary distances among species, as described in various embodiments below, can be particularly desirable. It allows the identification and conservation analysis of regions of nucleotides, for example, non-coding regulatory elements, protein-coding exons, and epistasis-driven mutations among closely related species, which were previously missed by existing methods. These newly identified genomic elements in the primate lineage are the prime candidates for understanding the changes that have contributed to the uniqueness of our own species.
  • the use of an MSA including 236 primate species substantially expands the available sequence data representing evolutionary conservation at short evolutionary distances. The use of two models with different MSA inputs improves the sensitivity and accuracy in identifying regions of conserved nucleotides and their corresponding conservation.
  • the technology disclosed brings an improved interpretation of the genome and the importance of the genomic features.
  • Various embodiments of the technology disclosed can be used to identify evolutionary features crucial to a group of closely related species, for example, microRNAs that are specific to primates, alternatively spliced exons conserved in primates but not in other branches, primate-conserved noncoding cis-regulatory elements, and epistasis-driven mutations experienced in genes of a particular group of species.
  • the total number of sequences in the first input 102 is less than 10 (e.g., 5).
  • the total number of sequences in the second input 112 is more than the total number in the first input 102, for example, 20, 50, 100, 150, 200, 210, 220, 230, 240 and so on.
  • the non-query sequences in the second input 112 can include part or all of the non-query sequences in the first input 102. In another implementation, all non-query sequences in the first and/or second input are closest homologues to their respective query sequences.
  • the query sequence in the first input 102 and/or the second input 112 can belong to one species of interest, for example, human.
  • the non-query sequences in the first input 102 and/or the second input 112 can belong to non-human primates.
  • the non-query sequences in the first input 102 can belong to a group of species that shares a family (e.g., hominids) with the species of interest.
  • the group of species shares an order (e.g., primates) with the species of interest.
  • the non-query sequences in the second input 112 can belong to another group of species that shares a class (e.g., mammals) with the species of interest.
  • the non-query sequences in the second input 112 can share a phylum (e.g., chordates) with the species of interest. In another implementation, the non-query sequences in the second input 112 can share a kingdom (e.g., animals) with the species of interest.
  • an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the first input 102 is less than an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the second input 112.
  • Evolutionary distances can refer to the number of substitutions per site separating a pair of homologous sequences since the sequences diverged from their common ancestral sequence, and can be computed by aligning the pair of homologous sequences.
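  • As a simple illustration of the "substitutions per site" notion, the sketch below computes the uncorrected p-distance between a pair of aligned homologous sequences, skipping gap and unknown positions; real pipelines typically also apply a model-based correction (e.g., Jukes-Cantor), which is not shown and is not part of the disclosure.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped sites at which two homologous sequences differ."""
    assert len(seq_a) == len(seq_b)
    compared = substitutions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":      # skip gaps and unknown bases
            compared += 1
            substitutions += (a != b)
    return substitutions / compared if compared else float("nan")

print(p_distance("ACGTACGTAC", "ACGTTCGTAC"))  # 0.1 -> one substitution over ten sites
```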
  • the two groups of species for non-query sequences in the first and second input have overlap. That is, some species are present in both groups. The non-overlapping species can be more evolutionarily distant from the species of interest than the overlapping species.
  • the query sequence with a masked base at a target position is aligned with a plurality of non-query sequences from other species in the MSA.
  • MSA aligns multiple homologous sequences with a target and is an important step in comparative analyses and property prediction of biological sequences, since a lot of information, for example, evolution and coevolution clusters, is generated from the alignment and can be mapped to the target sequence of choice.
  • MSA also provides information on evolutionary conservation by showing conserved regions across multiple species.
  • FIG. 2 illustrates an example of MSA 200 including aligned genomic sequences of a variety of species including human, chimpanzee, bonobo, gorilla, etc.
  • MSA 200 can include more sequences.
  • MSA 200 aligns at least hundreds of sequences.
  • MSA 200 aligns thousands of sequences.
  • the query sequence and non-query sequences in the MSA are not limited to genomic sequences.
  • sequences of residues including amino acids, nucleotides, carbohydrates, lipids, or other types of biological polymers can be used. Each residue can be a position/ element in the sequences, which are arranged as multiple sequence alignments.
  • an MSA of 50 primate reference genomes is augmented using sequence data from 556 individuals drawn from an additional 193 primate species.
  • an approximate reference genome is derived from combining its sequence data with the reference genome of the species in the initial MSA that is evolutionarily closest to it (i.e., base species).
  • Read pairs from each sample are first assembled with Megahit (i.e., a next-generation sequencing assembler optimized for metagenomes; version 1.2.9, default parameters), resulting in assemblies with an average contig N50 of 34 Kb. The assembly sizes range from 2.2 Gb to 3.7 Gb, thus falling within the typical range of reported genome sizes for primates.
  • the resulting contigs are then aligned to the reference genome of the base species with minimap2 (version 2.22, parameters -ax asm10).
  • variants are called with the paftools utility that accompanies minimap2, only considering alignments from contigs at least 1000 bp in length (parameter -L 1000).
  • assembling reads prior to alignment is better placed to detect more complex variation between the derived and initial species than resequencing by alignment of individual read pairs.
  • the discovered variants are applied to the reference genome of the associated base species with bcftools (i.e., a set of tools that manipulate variant calls in the variant call format and its binary counterpart) to build an approximate consensus for the derived species.
  • This procedure can be validated by applying it to species for which a well- characterized reference sequence is already available, for example aligning contigs assembled from a gorilla sample to the chimpanzee reference sequence to create an approximation of the gorilla genome.
  • Error rates are computed based only on those fragments of the approximate and “true” references that participate in the MSA, since it is only these regions that are relevant to the subsequent analysis. An acceptable error rate is considered to be one that approximates the known rate of heterozygosity between individuals of the species.
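  • For illustration, the following is a hedged Python sketch of the kind of short-read-to-consensus pipeline described above, shelling out to the named tools. Only the parameters stated in the text (Megahit defaults, the minimap2 asm10 preset, and paftools -L 1000) come from the disclosure; the file names, the PAF/cs-tag variant-calling form taken from the minimap2 documentation, and the bgzip/tabix indexing step required by bcftools consensus are assumptions, not the patent's exact invocation.

```python
import subprocess

def run(cmd):
    """Run one pipeline stage in a shell (pipes are used below)."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Assemble read pairs from one sample with Megahit (default parameters).
run("megahit -1 sample_R1.fq.gz -2 sample_R2.fq.gz -o asm_out")

# 2. Align the contigs to the reference genome of the base species and call
#    variants with the paftools utility that accompanies minimap2, keeping
#    only alignments from contigs of at least 1000 bp (-L 1000).
run("minimap2 -cx asm10 --cs base_ref.fa asm_out/final.contigs.fa "
    "| sort -k6,6 -k8,8n "
    "| paftools.js call -L 1000 -f base_ref.fa - > derived.vcf")

# 3. Apply the called variants to the base reference with bcftools to build an
#    approximate consensus genome for the derived species.
run("bgzip -f derived.vcf && tabix -p vcf derived.vcf.gz")
run("bcftools consensus -f base_ref.fa derived.vcf.gz > derived_consensus.fa")
```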
  • to identify primate-specific conserved elements, multiple individuals from 211 primate species are sequenced, which, combined with other earlier reference assemblies, brings the total to 236 of the 521 extant primate species.
  • a PCR-free whole genome sequencing is performed with 150bp paired end reads, obtaining an average of 57X whole-genome coverage per species.
  • the existing MSA of 50 primate reference assemblies is expanded to a 236-species MSA, by aligning each short read contig to the closest available reference species.
  • Figure 3 is an example ideogram of the human genome depicting the average number of species covering each position in the MSA. As illustrated, the human genome is well covered by primate alignments except for telomeric, centromeric, and heterochromatic regions, which are highly repetitive in the human reference. Compared to existing models that use 30 primate species, the constructed 236-species MSA used as input to the second model 114 significantly expands the sequence data from a far greater number of primate species, and facilitates the identification of the masked base and the corresponding conservation at short evolutionary distances.
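  • A small sketch, assuming the MSA is held as a list of gap-padded strings, of how the per-position species coverage summarized in such an ideogram could be tallied (counting non-gap, non-unknown symbols per alignment column); the toy alignment is purely illustrative.

```python
import numpy as np

def species_coverage(msa_rows):
    """Number of species with an aligned (non-gap, non-unknown) base at each column."""
    columns = np.array([list(row.upper()) for row in msa_rows])
    return np.sum(np.isin(columns, list("ACGT")), axis=0)

msa = ["ACGTAC",
       "ACGT-C",
       "A-GTAC",
       "ACGTXC"]
print(species_coverage(msa))  # [4 3 4 4 2 4]
```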
  • Figure 4A illustrates a genome-wide distribution of primate assembly coverage for each 100kb segment of the human genome. Over 2.8 Gb of the human genome are covered by at least 100 primate species.
  • the sequence data of a set of 25 species within the 236 species is analyzed to ensure the per base error rate in the de novo assemblies is sufficiently low for subsequent conservation analysis.
  • the data representing both the short read contigs and reference genome assemblies is compared with long read assemblies that had been generated.
  • Figure 4B illustrates the per base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies. It is found that a majority of the mismatch rate (e.g., 82%) can be explained by allelic variation within the species. When the mismatch rate caused by heterozygosity is excluded, the residual per base error rates are below 0.1%. When the intraspecific variations are further excluded, the average remaining mismatch rates attributable to assembly and sequencing errors are reduced to 0.04%.
  • the constructed MSA including 236 primate species increases the phylogenetic branch length 2.8-fold over the previously available 30 primate species alignment from the Zoonomia study.
  • This MSA can be used as, for example, the second input 112 to the second model 114, such that it can detect conservation at short evolutionary distances using a far greater number of primate species.
  • the following describes example machine learning architectures that can be used to implement the first model and the second model.
  • These example machine learning architectures can take as input machine-processable or vectorized representations of genomic data, for example, one-hot encodings of nucleotides and/or amino acids, process the machine-processable representations through a plurality of hidden layers and weights of the machine learning architectures, produce learned or alternative or intermediate or compressed representations of the machine-processable representations, and generate one or more outputs based on the learned or alternative or intermediate or compressed representations.
  • These outputs can be genotype predictions identifying one or more attributes or identities of the genomic data, such as the identity of the nucleotides and/or amino acids, evolutionary conservation states of the nucleotides and/or amino acids, the pathogenicity of the nucleotides and/or amino acids, and so on.
  • the first model 104 and the second model 114 are a multilayer perceptron (MLP).
  • the first model 104 and the second model 114 are neural networks, for example, feedforward neural networks.
  • the first model 104 and the second model 114 are fully-connected neural networks, for example, fully-connected convolution neural networks.
  • the first model 104 and the second model 114 are semantic segmentation neural networks.
  • the first model 104 and the second model 114 are a generative adversarial network (GAN) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN).
  • GAN generative adversarial network
  • the first model 104 and the second model 114 are transformer-based networks, for example, Transformer-based CNNs.
  • the first model 104 and the second model 114 are convolution neural networks (CNNs) with a plurality of convolution layers.
  • the first model 104 and the second model 114 are recurrent neural networks (RNNs) such as long short-term memory networks (LSTM), bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU).
  • RNNs recurrent neural networks
  • LSTM long short-term memory networks
  • Bi-LSTM bi-directional LSTM
  • GRU gated recurrent unit
  • the first model 104 and the second model 114 include both a CNN and an RNN.
  • the first model 104 and the second model 114 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1 x 1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions.
  • the first model 104 and the second model 114 can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss.
  • the first model 104 and the second model 114 can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD).
  • the first model 104 and the second model 114 can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., nonlinear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms (e.g., self-attention).
  • the first model 104 and the second model 114 can be a rule-based model, linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, and a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes).
  • the first model 104 and the second model 114 can be an ensemble of multiple models, in some implementations.
  • the first model 104 and the second model 114 can be trained using backpropagation-based gradient update techniques.
  • Example gradient descent techniques that can be used for training the first model 104 and the second model 114 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent.
  • Some examples of gradient descent optimization algorithms that can be used to train the first model 104 and the second model 114 can include Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
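  • A minimal, hypothetical PyTorch sketch of one such training step: a small feed-forward masked-base classifier trained with a cross-entropy loss, backpropagation, and the Adam optimizer. The layer sizes, toy data, and four-class output are illustrative assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn

# Toy dimensions: 5 aligned sequences x 5 positions x 7 symbols, flattened.
NUM_SEQ, SEQ_LEN, VOCAB = 5, 5, 7
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(NUM_SEQ * SEQ_LEN * VOCAB, 128),
    nn.ReLU(),
    nn.Linear(128, 4),                           # logits over A, C, G, T for the masked base
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a random toy batch (stand-in for one-hot MSA windows).
x = torch.rand(32, NUM_SEQ, SEQ_LEN, VOCAB)      # batch of encoded MSA windows
y = torch.randint(0, 4, (32,))                   # true identity of the masked base
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                  # backpropagation
optimizer.step()                                 # gradient update (Adam)
print(float(loss))
```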
  • generative models including transformer-based models and encoder-decoder models are described in the following.
  • Generative models are often used in natural language processing.
  • generative models have emerged as powerful machine learning tools to discover evolutionary and functional information of biological sequences (e.g., genomic sequences and protein sequences).
  • Biological sequences can be transformed into vector representations using word embedding.
  • the vector representations can then be used effectively for genome identification.
  • the first model 104 and the second model 114 include self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT-2, GPT-3, BERT, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT-Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B
  • Machine learning is the use and development of computer systems that can learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.
  • Some of the state-of-the-art models use Transformers, a more powerful and faster model than neural networks alone.
  • Neural networks process input in series (e.g., time series data including sequencing-by-synthesis (SBS) sequencing data) and weight relationships by distance in the series.
  • Transformers can process input in parallel and do not necessarily weight by distance.
  • Transformers can be used in addition to neural networks. This architecture is described here.
  • Figure 5 is a schematic representation of an encoder-decoder architecture.
  • This architecture is often used for time-series data processing (e.g., sequencing data generated via sequencing-by-synthesis) and has two main building blocks.
  • the first building block is the encoder that encodes an input into a fixed-size vector.
  • the encoder is based on a recurrent neural network (RNN).
  • at each time step t, the hidden state of time step t-1 is combined with the input value at time step t to compute the hidden state at time step t.
  • the hidden state at the last time step, encoded in a context vector, contains relationships encoded at all previous time steps.
  • the context vector is then passed to the second building block, the decoder.
  • the decoder For translation, the decoder has been trained on a second language. Conditioned on the input context vector, the decoder generates an output sequence. At each time step, t, the decoder is fed the hidden state of time step, t-1, and the output generated at time step, t-1.
  • the first hidden state in the decoder is the context vector, generated by the encoder.
  • the context vector is used by the decoder to perform the translation.
  • the whole model is optimized end-to-end by using backpropagation, a method of training a neural network in which the initial system output is compared to the desired output and the system is adjusted until the difference is minimized.
  • through backpropagation, the encoder is trained to extract the right information from the input sequence
  • the decoder is trained to capture the grammar and vocabulary of the output language. This results in a fluent model that uses context and generalizes well.
  • during training, the real output sequence is used to train the model, to prevent mistakes from stacking.
  • during inference, the previously predicted output value is used to predict the next one.
  • FIG. 6 shows an overview of an attention mechanism added onto an RNN encoder-decoder architecture.
  • the decoder is given an attention score, e, for each encoder hidden state.
  • the decoder is given weights for each relationship between words in a sentence.
  • the decoder uses the attention score concatenated with the context vector during decoding.
  • the output of the decoder at time step t is based on all encoder hidden states and the attention outputs.
  • the attention output captures the relevant context for time step t from the original sentence.
  • words at the end of a sentence may now have a strong relationship with words at the beginning of the sentence.
  • for example, "fox" and "dog" can be closely related despite being far apart in a complex sentence.
  • a dot product between the decoder hidden state of the current time step, and all encoder hidden states is calculated. This results in an attention score for every encoder hidden state.
  • the attention scores are higher for those encoder hidden states that are similar to the decoder hidden state of the current time step. Higher values for the dot product indicate the vectors are pointing more closely in the same direction.
  • the attention scores are converted to fractions that sum to one using the SoftMax function.
  • the SoftMax scores provide an attention distribution.
  • the x-axis of the distribution is the position in a sentence.
  • the y-axis is attention weight.
  • the scores show which encoder hidden states are most closely related.
  • the SoftMax scores specify which encoder hidden states are the most relevant for the decoder hidden state of the current time step.
  • the elements of the attention distribution are used as weights to calculate a weighted sum over the different encoder hidden states.
  • the outcome of the weighted sum is called the attention output.
  • the attention output is used to predict the output, often in combination (concatenation) with the decoder hidden states. Thus, both information about the inputs, as well as the already generated outputs, can be used to predict the next outputs.
  • the attention mechanism solves the vanishing gradient problem.
  • information flows more directly to the decoder. It does not pass through many hidden states.
  • Interpreting the attention step can give insights into the data. Attention can be thought of as a soft alignment. The words in the input sequence with a high attention score align with the current target word.
  • Attention describes long-range dependencies better than RNN alone. This enables the analysis of longer, more complex sentences.
  • the attention mechanism can be generalized as: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the vector values, dependent on the vector query.
  • the vector values are the encoder hidden states, and the vector query is the decoder hidden state at the current time step.
  • the weighted sum can be considered a selective summary of the information present in the vector values.
  • the vector query determines on which of the vector values to focus. Thus, a fixed-size representation of the vector values can be created, in dependence upon the vector query.
  • the attention scores can be calculated by the dot product, or by weighting the different values (multiplicative attention).
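  • A small NumPy sketch of the generalized attention computation just described: dot-product scores between a query vector and a set of vector values, SoftMax normalization, and a weighted sum; the dimensions and random data are arbitrary placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_hidden = rng.normal(size=(6, 8))     # six encoder hidden states (the "values")
decoder_hidden = rng.normal(size=(8,))       # current decoder hidden state (the "query")

scores = encoder_hidden @ decoder_hidden     # dot-product attention score per state
weights = softmax(scores)                    # attention distribution (sums to one)
attention_output = weights @ encoder_hidden  # weighted sum: a fixed-size selective summary
print(weights.round(2), attention_output.shape)
```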
  • the input to the model needs to be numerical.
  • the input to a translation model is a sentence, and words are not numerical.
  • Embeddings can be created by using one-hot encoding.
  • the one-hot vector representing the symbols has the same length as the total number of possible different symbols.
  • Each position in the one-hot vector corresponds to a specific symbol.
  • for example, with colors, the length of the one-hot vector would be the total number of different colors present in the dataset.
  • the location corresponding to the color of that value is one, whereas all the other locations are valued at zero.
  • a second way of creating embeddings is by creating feature vectors. Every symbol has its specific vector representation, based on features. With colors, a vector of three elements could be used, where the elements represent the amount of yellow, red, and/or blue needed to create the color. Thus, all colors can be represented by only using a vector of three elements. Also, similar colors, have similar representation vectors.
  • Embedding based on context can be trained. Words with similar meanings occur in similar contexts. At the nucleotide level, particular combinations of three DNA or RNA nucleotides correspond to specific amino acids or stop signals during protein synthesis. In addition, homologous proteins or genes have sequence similarity that reflects common ancestry. Different methods take the context into account. For natural language processing, some methods, like GloVe, base their context embedding on co-occurrence statistics from corpora (large texts) such as Wikipedia. Words with similar co-occurrence statistics have similar word embeddings. Other methods use neural networks to train the embeddings.
  • Transformer models are based on the principle of self-attention.
  • Self-attention allows each element of the input sequence to look at all other elements in the input sequence and search for clues that can help it to create a more meaningful encoding. It is a way to look at which other sequence elements are relevant for the current element.
  • the Transformer can grab context from both before and after the currently processed element.
  • self-attention scores are calculated.
  • the dot products between the query vector of this element and the key vectors of all other input elements are calculated.
  • these self-attention scores are divided by the square root of the size of the vectors. This has the effect of reducing the magnitude of the scores, thus emphasizing the direction of the vectors rather than their length.
  • these scores are normalized with a SoftMax layer. This attention distribution is then used to calculate a weighted sum of the value vectors, resulting in a vector z for every input element.
  • whereas in the attention mechanism above the same vector was used both to calculate attention scores and to perform the weighted sum, in self-attention two different vectors (a key vector and a value vector) are created and used.
  • because the self-attention needs to be calculated for all elements (thus a query for every element), one formula can be created to calculate a Z matrix.
  • the rows of this Z matrix are the z vectors for every sequence input element, giving the matrix a size of sequence length by the QKV dimension.
  • FIG. 7 is a schematic representation of the calculation of self-attention showing one attention head.
  • different weight matrices are trained to calculate Q, K, and V. Every attention head outputs a matrix Z.
  • Different attention heads can capture different types of information.
  • the different Z matrices of the different attention heads are concatenated. This matrix can become large when multiple attention heads are used.
  • an extra weight matrix W is trained to condense the different attention heads into a matrix with the same size as one Z matrix. This way, the amount of data given to the next step does not enlarge every time self-attention is performed.
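  • A compact NumPy sketch of multi-head scaled dot-product self-attention as described above: per-head Q/K/V projections, scaling by the square root of d_k, SoftMax, a weighted sum of values, concatenation of the heads, and a final learned projection that condenses them. The sizes and random weights are placeholders standing in for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, heads = 5, 16, 4              # sequence length, model dimension, number of heads
d_k = d // heads
X = rng.normal(size=(n, d))         # embedded input sequence

Z_heads = []
for _ in range(heads):
    W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)         # scaled dot-product attention scores
    Z_heads.append(softmax(scores) @ V)     # per-head output Z, shape (n, d_k)

W_O = rng.normal(size=(heads * d_k, d))     # condenses the concatenated heads back to d
Z = np.concatenate(Z_heads, axis=1) @ W_O
print(Z.shape)                              # (5, 16)
```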
  • indexes represent information about what is stored in the database.
  • indexes can be thought of as keys. So instead of running the query against values directly, the query is first executed on the indexes to retrieve where the relevant values or weights are stored. Lastly, these weights are run against the original values to retrieve data that are most relevant to the initial query.
  • Figure 8 depicts several attention heads in a Transformer block. We can see that the outputs of queries and keys dot products in different attention heads are differently colored. This depicts the capability of the multi-head attention to focus on different aspects of the input and aggregate the obtained information by multiplying the input with different attention weights.
  • Examples of attention calculation include scaled dot-product attention and additive attention. There are several reasons why scaled dot-product attention is used in Transformers. Firstly, the scaled dot-product attention is relatively fast to compute, since its main parts are matrix operations that can be run on modern hardware accelerators. Secondly, it performs similarly well for smaller dimensions of the K matrix, d_k, as the additive attention.
  • for larger dimensions d_k, dot-product attention without scaling performs a bit worse, because large dot products can cause the vanishing gradient problem. This is compensated via the scaling factor, which is defined as the square root of d_k.
  • the attention function takes as input three objects: key, value, and query.
  • these objects are matrices of shape (n, d), where n is the number of elements in the input sequence and d is the hidden representation of each element (also called the hidden vector). Attention is then computed as: Attention(Q, K, V) = SoftMax(QK^T / √d_k) V, with Q = X W_Q, K = X W_K, V = X W_V, where X is the input matrix and W_Q, W_K, W_V are learned weights used to project the input matrix into the Q, K, and V representations.
  • the dot products appearing in the attention function are exploited for their geometrical interpretation where higher values of their results mean that the inputs are more similar, i.e., pointing in the geometrical space into the same direction. Since the attention function now works with matrices, the dot product becomes matrix multiplication.
  • the SoftMax function is used to normalize the attention weights so that they sum to 1 prior to being multiplied by the values matrix. The resulting matrix is used either as input into another layer of attention or becomes the output of the Transformer.
  • the multi-head attention is defined as: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i).
  • the matrix X is of shape (n, d) where n is the number of patches and d is the hidden vector dimension.
  • the weights W_Q, W_K, W_V are all of shape (d, d). Omitting the constant factor 3, the resulting complexity of computing Q, K, and V is: n · d²
  • the matrices Q and K are both of shape (n, d).
  • the transposition operation does not influence the asymptotic complexity of computing the dot product of matrices of shapes (n, d) · (d, n), therefore the complexity of the attention function is: n² · d.
  • the final asymptotic complexity of scaled dot-product attention is obtained by summing the complexities of computing Q, K, and V and of the attention function: n · d² + n² · d. An illustrative multi-head sketch follows below.
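  • Continuing the NumPy sketch above (and reusing scaled_dot_product_attention, rng, and X from it), the following illustrates multi-head attention with the concatenation and the extra weight matrix; the number of heads and the sizes are illustrative assumptions.

    def multi_head_attention(X, heads, W_o):
        # heads is a list of (W_q, W_k, W_v) triples, one per attention head.
        # Computing Q, K, V for each head costs on the order of n * d^2;
        # each head's attention function costs on the order of n^2 * d.
        Z_heads = [scaled_dot_product_attention(X, W_q, W_k, W_v)
                   for (W_q, W_k, W_v) in heads]
        # Concatenate the per-head Z matrices and condense them back to shape (n, d)
        # with the extra weight matrix W_o, so the output size does not grow.
        return np.concatenate(Z_heads, axis=-1) @ W_o

    # Toy usage with h = 2 heads: the concatenated width is h * d, so W_o is (h*d, d).
    heads = [tuple(rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(2)]
    W_o = rng.normal(size=(16, 8))
    Z = multi_head_attention(X, heads, W_o)  # Z has shape (4, 8)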
  • Transformer models often have the encoder-decoder architecture, although this is not necessarily the case.
  • the encoder is built out of different encoder layers which are all constructed in the same way.
  • the positional encodings are added to the embedding vectors. Afterward, self-attention is performed.
  • Figure 10 portrays one encoder layer of a Transformer network. Every self-attention layer is surrounded by a residual connection, summing up the output and input of the self-attention. This sum is normalized, and the normalized vectors are fed to a feed-forward layer. Every z vector is fed separately to this feed-forward layer. The feed-forward layer is wrapped in a residual connection and the outcome is normalized too. Often, numerous encoder layers are stacked to form the encoder. The output of the encoder is a fixed-size vector for every element of the input sequence. An illustrative sketch of one such encoder layer is provided below.
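  • As a non-limiting illustration, a minimal Keras-style sketch of one encoder layer (self-attention wrapped in a residual connection and normalization, followed by a position-wise feed-forward layer that is also wrapped and normalized); the layer sizes are illustrative assumptions, not the disclosed implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    def encoder_layer(d_model=64, num_heads=4, d_ff=128):
        # Input: a sequence of embedding vectors of size d_model (positional
        # encodings are assumed to have been added already).
        inputs = tf.keras.Input(shape=(None, d_model))
        # Self-attention, wrapped in a residual connection and then normalized.
        attn = layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=d_model // num_heads)(inputs, inputs)
        x = layers.LayerNormalization()(inputs + attn)
        # Feed-forward layer applied to every z vector separately,
        # again wrapped in a residual connection and normalized.
        ff = layers.Dense(d_ff, activation="relu")(x)
        ff = layers.Dense(d_model)(ff)
        x = layers.LayerNormalization()(x + ff)
        return tf.keras.Model(inputs, x)

    enc = encoder_layer()
    out = enc(tf.random.normal((1, 10, 64)))  # (batch, sequence length, d_model)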
  • the decoder is built from different decoder layers.
  • a modified version of self-attention takes place.
  • the query vector is only compared to the keys of previous output sequence elements. The elements further in the sequence are not known yet, as they still must be predicted. No information about these output elements may be used.
  • FIG 11 shows a schematic overview of a Transformer model.
  • a layer of encoder-decoder attention is present in the decoder, in which the decoder can examine the last Z vectors of the encoder, allowing information to flow from the encoder to the decoder.
  • the final decoder layer is a feed-forward layer. All layers are wrapped in a residual connection. This allows the decoder to examine all previously predicted outputs and all encoded input vectors to predict the next output. Thus, information from the encoder is provided to the decoder, which could improve the predictive capacity.
  • the output vectors of the last decoder layer need to be processed to form the output of the entire system. This is done by a combination of a feedforward layer and a SoftMax function. The output corresponding to the highest probability is the predicted output value for a subject time step.
  • the encoded input vectors are the input of the feed-forward layer and the SoftMax layer.
  • Transformer models have been extensively applied in time-series data processing fields. These models have applications in the field of biology as well for predicting protein structure and function and labeling DNA sequences.
  • transformers in vision
  • popular recognition tasks e.g., image classification, object detection, action recognition, and segmentation
  • generative modeling e.g., multi-modal tasks
  • video processing e.g., activity recognition, video forecasting
  • low-level vision e.g., image super-resolution, image enhancement, and colorization
  • 3D analysis e.g., point cloud classification and segmentation
  • the computations of the ViT architecture can be summarized as follows.
  • the first layer of a ViT extracts a fixed number of patches from an input image ( Figure 12A).
  • the patches are then projected to linear embeddings.
  • a special class token vector is added to the sequence of embedding vectors to include all representative information of all tokens through the multi-layer encoding procedure.
  • the class vector is unique to each image.
  • Vectors containing positional information are combined with the embeddings and the class token.
  • the sequence of embedding vectors is passed into the Transformer blocks.
  • the class token vector is extracted from the output of the last Transformer block and is passed into a multilayer perceptron (MLP) head whose output is the final classification.
  • MLP multilayer perceptron
  • the perceptron takes the normalized input and places the output in categories. It classifies the images. This procedure directly translates into the Python Keras code shown in Figure 14.
  • a single Transformer block comprises several layers.
  • the first layer implements Layer Normalization, followed by the multi-head attention that is responsible for the performance of ViTs.
  • Layer Normalization
  • In the depiction of a Transformer block in Figure 9B, we can see two arrows. These are residual skip connections. Including skip connections can simplify training and improve the results.
  • the output of the multi-head attention is followed again by Layer Normalization.
  • the output layer is an MLP (Multi-Layer Perceptron) with the GELU (Gaussian Error Linear Unit) activation function.
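  • As a non-limiting illustration, a minimal Keras-style sketch of a single ViT Transformer block as laid out above (Layer Normalization, multi-head attention, a residual skip connection, a second Layer Normalization, and an MLP with GELU activation); the layer sizes are illustrative assumptions, and this is not the Figure 14 code.

    import tensorflow as tf
    from tensorflow.keras import layers

    def vit_block(d_model=64, num_heads=4, mlp_dim=128):
        # Input: the sequence of patch embeddings plus the class token.
        inputs = tf.keras.Input(shape=(None, d_model))
        # First sub-block: Layer Normalization, multi-head attention, residual skip.
        x = layers.LayerNormalization()(inputs)
        x = layers.MultiHeadAttention(num_heads=num_heads,
                                      key_dim=d_model // num_heads)(x, x)
        x = layers.Add()([inputs, x])
        # Second sub-block: Layer Normalization, MLP with GELU activation, residual skip.
        y = layers.LayerNormalization()(x)
        y = layers.Dense(mlp_dim, activation="gelu")(y)
        y = layers.Dense(d_model)(y)
        return tf.keras.Model(inputs, layers.Add()([x, y]))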
  • ViTs can be pretrained and fine-tuned. Pretraining is generally done on a large dataset. Fine-tuning is done on a domain specific dataset.
  • ViTs
  • Domain-specific architectures, like convolutional neural networks (CNNs) or long short-term memory networks (LSTMs), have been derived from the usual architecture of MLPs and may suffer from so-called inductive biases that predispose the networks towards a certain output. ViTs stepped in the opposite direction from CNNs and LSTMs and became more general architectures by eliminating inductive biases. A ViT can be seen as a generalization of MLPs because MLPs, after being trained, do not change their weights for different inputs. On the other hand, ViTs compute their attention weights at runtime based on the particular input.
  • CNNs convolutional neural networks
  • LSTMs long short-term memory networks
  • the Transformer models include convolutional layers that can detect local patterns, and thereby enhance the detection of nucleotide motifs.
  • the Transformer models process a genome sequence in consecutive segments of length l. Every input nucleotide x ∈ {A, C, G, T} is first transformed into a vector embedding h(0), after which it is transformed k times through addition (residual connection) with another vector, obtained by the multi-head attention function present in each layer (h(0) → ... → h(k)).
  • a set of fully connected layers transforms h(k) into a model output y(k). For each residual block, the vector that is summed with the input (to obtain h(1), ..., h(k)) is calculated using the hidden states of l upstream positions.
  • the denominator is used to stabilize the scores based on the dimension of q, k, and v (dhead).
  • the multiplication of the query vector with all the key vectors results in a vector of scores that is normalized over all input values using the softmax function. These scores are multiplied with the v vectors for the calculation of z (i.e., a linear combination).
  • the attention scores denote the relevance of information present between two positions, where the multiplication of the q and k vectors function as a lock and key encoding, which returns goodness-of-fit scores for the information embedded in two hidden states (defined by v).
  • in each residual block, multiple attention heads are present (hence, multi-head attention), each featuring its own unique set of model weights to calculate q, k, and v.
  • multiple types of information can be extracted from the input hidden states.
  • the outcome of different attention heads within the same layer is further processed into a single vector, which is summed with h to obtain the hidden state of the next layer (e.g., h(1) → h(2)).
  • Contextual information embedded within the hidden states derived from single nucleotides is limited. Motifs formed from multiple neighboring nucleotides are deemed of greater importance towards biological processes.
  • the addition of a convolutional layer allows the q, k, and v vectors to be derived from multiple neighboring hidden states without affecting the input/output resolution. Thereby, the retrieval of relevant information using attention is improved, resulting in improved predictive performances on a variety of tasks.
  • Positional information is used within the vectors q, k, and v by superimposing (i.e., through summation) a positional encoding vector to h.
  • the added signal is a function of the vector index and the relative positioning with respect to the other input hidden states.
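  • As a non-limiting illustration, one common choice for such a positional signal is the sinusoidal encoding of the original Transformer, which depends on the vector index and lets relative positions be expressed through fixed transformations; the following NumPy sketch assumes that choice, and the segment length and dmodel are illustrative.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # Sinusoids whose frequency depends on the embedding index, so that the
        # added signal encodes both the absolute position and relative offsets.
        positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
        dims = np.arange(d_model)[None, :]                 # (1, d_model)
        angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
        return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

    # Superimpose (sum) the encoding onto the hidden states h of a segment.
    H = np.random.normal(size=(512, 64))                   # 512 positions, d_model = 64
    H = H + sinusoidal_positional_encoding(512, 64)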
  • the annotation of DNA is a sequence labeling task that has correspondences in natural language processing.
  • the DNA sequence is a data set of n nucleotides, i.e., X = {x(1), x(2), ..., x(n)}, where x ∈ {A, C, T, G}; the task comprises predicting a label y ∈ {0, 1} for each position x, where a positive label denotes the occurrence of an event at that position.
  • the Transformer models process the genome in sequential segments of l nucleotides.
  • a non-linear transformation function E is optimized that maps the input classes {A, C, T, G} to a vector embedding h of length dmodel.
  • h = E(x(i)), x(i) ∈ {A, T, C, G}, where h ∈ ℝ^dmodel.
  • multi-head attention is calculated for each hidden state.
  • the output of the multi-head attention step (MultiHead) is summed with the input, i.e., a residual connection, with the final step being layer normalization.
  • the calculations of the output for all hidden states h in layer t at position m of segment s are performed in parallel: h(s,t+1,m) = LayerNorm(h(s,t,m) + MultiHead(H(s,t))), or
  • H(s,t+1) = LayerNorm(H(s,t) + MultiHead(H(s,t))), where t ∈ [0, k) and m ∈ [1, l].
  • a final linear combination reduces the dimension of the output hidden state (dmodel) to the number of output classes.
  • dmodel: the dimension of the output hidden state.
  • a softmax layer is applied before obtaining the prediction value y_i for nucleotide x_i; an illustrative sketch of this output head is provided below.
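  • As a non-limiting illustration, a minimal Keras-style sketch of this per-nucleotide output head (a final linear combination from dmodel to the output classes followed by a softmax); the segment length, dmodel, and number of classes are illustrative assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    d_model, segment_len, num_classes = 64, 512, 2   # illustrative; y in {0, 1}

    # Hidden states H(s, k) for one segment after the k attention layers.
    hidden = tf.keras.Input(shape=(segment_len, d_model))
    # A final linear combination reduces d_model to the number of output classes,
    # and a softmax yields a prediction y_i for every nucleotide x_i in the segment.
    logits = layers.Dense(num_classes)(hidden)
    probs = layers.Softmax(axis=-1)(logits)
    head = tf.keras.Model(hidden, probs)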
  • Figure 15 illustrates an example system 1500 including the first model 1520 that predicts an identity of a masked base on the query sequence aligned with non-query sequences.
  • the first input 1510 is an MSA of five 5nt sequences.
  • the query sequence is from human with a masked base flanked by two right and two left bases.
  • Four non-query sequences are from the four species closest to human, namely, chimpanzee, gorilla, gray slender loris, and Sumatran orangutan.
  • the four species that are closest to human can be Chimpanzee, bonobo, the western lowland gorilla, and the eastern gorilla in terms of evolutionary distance.
  • the assumption is that since these species are close to human in the phylogeny, with a mean branch length of only 0.008, the predicted probability from this deep learning model should account for the neutral substitution rate in the genome.
  • the first model 1520 in this implementation is a neutral substitution rate model which employs only information from the closest species to specify the ancestral allele.
  • the first model 1520 can be a convolutional neural network including a plurality of 1D convolution layers 1522 and 1524 and one or more dense layers 1526.
  • One-hot-encoded sequences in the form of vectors are convolved with kernels of the 1D convolution layers via multiplication and summation operations.
  • the dense layer of the one or more dense layers 1526 is deeply connected to its preceding layer and applies activation functions to the output from the preceding layer.
  • Figure 16 illustrates an example architecture of the first model with layer dimensions.
  • 1D convolutional layers are used here as core building blocks of the example architecture.
  • the ℝ^(5×5)-dimensional input MSA (ℝ^25 when flattened) is first passed through a masking layer that masks the center nucleotide in the human sequence.
  • the MSA with the masked nucleotide is then passed through an embedding layer that embeds the input MSA into a 50-dimensional embedding space.
  • This embedded MSA is next passed through 32 1D convolutional units with a kernel size of 5 and a stride of 5.
  • This 1D convolutional layer is designed to process sequence information from each species.
  • the next layer of 1D convolutional units also contains 32 units and has a kernel size of 5 with a stride of 1.
  • This layer is designed to process sequence information across the species. Hence, by stacking the convolutional layers, the DL model is designed to process sequence information both along the rows of the MSA and along its columns. The output of the last convolutional layer is then passed through a dense layer. The final layer is a six-dimensional SoftMax layer. Except for the final layer, all the intermediate layers use ReLU activation. An illustrative sketch of this architecture is provided below.
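  • As a non-limiting illustration, the following Keras sketch is consistent with the layer description above; the masking mechanism (here a dedicated mask token applied upstream), the vocabulary size, the padding, the dense-layer width, and the optimizer are illustrative assumptions rather than the disclosed implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Integer-encoded, flattened 5 x 5 MSA: 5 species x 5 nt = 25 positions, over a
    # vocabulary of A, C, G, T, unknown (X), missing (-), plus an assumed mask token.
    inputs = tf.keras.Input(shape=(25,), dtype="int32")
    x = layers.Embedding(input_dim=7, output_dim=50)(inputs)   # 50-dimensional embedding space
    # Kernel size 5 with stride 5: each window covers one species' 5 nt row.
    x = layers.Conv1D(32, kernel_size=5, strides=5, activation="relu")(x)
    # Kernel size 5 with stride 1: mixes information across the species rows.
    x = layers.Conv1D(32, kernel_size=5, strides=1, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)                 # intermediate layers use ReLU
    # Six-dimensional SoftMax layer over A, C, G, T, X, and the missing base '-'.
    outputs = layers.Dense(6, activation="softmax")(x)
    first_model = tf.keras.Model(inputs, outputs)
    first_model.compile(optimizer="adam", loss="categorical_crossentropy",
                        metrics=["accuracy"])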
  • the first output 1530 is a predicted identity of the masked base in the query sequence (labeled by “?”), which can be a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
  • Figure 15 illustrates two examples of the predicted identity as a probability distribution of the masked base.
  • the output 1540 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G.
  • the masked base is predicted to be base C with a highest likelihood (p ⁇ 0.9).
  • the output 1550 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). Consistent with the output 1540, the masked base is predicted to be base C with a highest likelihood (log p > 10⁻¹).
  • Figure 17 illustrates an example system 1700 including the second model 1720 that predicts an identity of a masked base on the query sequence aligned with non-query sequences.
  • the second input 1710 is a full alignment of 236 primate species, including all of the non-query sequences in the first input 1510.
  • the query sequence in the second input 1710 can be identical to the query sequence in the first input 1510.
  • the second model 1720 in this implementation is a conservation-adjusted substitution rate model which utilizes information from the full MSA of 236 primate species.
  • the second model 1720 can be a convolutional neural network, including a plurality of 1D convolution layers 1722 and 1724 as building blocks of the architecture, and one or more dense layers 1726.
  • the number of 1D convolution layers (1722 and 1724) and the number of dense layers of the one or more dense layers 1726 can be different from (e.g., larger than) the number of layers in the first model.
  • the second model 1720 generates a second output 1730. Similar to the two types of output from the first model, here, the output 1740 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G. The masked base is predicted to be base C with a highest likelihood (p ~ 1.0). The output 1750 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). The masked base is predicted to be base C with a highest likelihood (log p ≈ 10⁰).
  • Figure 18 illustrates an example architecture of the second model with layer dimensions.
  • the second model uses a larger number of convolutional layers and dense layers as compared to the first model.
  • the input for the second model has a dimension of ℝ^(230×5) (ℝ^1150 when flattened).
  • the input passes through the masking and embedding layers.
  • several dropout layers are used throughout the architecture of the conservation-adjusted substitution rate model; an illustrative sketch of such an architecture is provided below.
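  • As a non-limiting illustration, the following Keras sketch shows one way such a deeper architecture with dropout could look; the number of filters, the dense-layer width, and the dropout rates are illustrative assumptions, not the disclosed implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Integer-encoded, flattened 230 x 5 MSA (1150 positions), same vocabulary as above.
    inputs = tf.keras.Input(shape=(1150,), dtype="int32")
    x = layers.Embedding(input_dim=7, output_dim=50)(inputs)
    # Kernel size 5 with stride 5 processes each species' 5 nt row.
    x = layers.Conv1D(64, kernel_size=5, strides=5, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    # Subsequent convolutions mix information across the species rows.
    x = layers.Conv1D(64, kernel_size=5, strides=1, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(6, activation="softmax")(x)
    second_model = tf.keras.Model(inputs, outputs)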
  • the second model (e.g., conservation-adjusted substitution rate model) predicts that it is less likely for this masked human nucleotide to change from its ancestral state due to evolutionary constraint. In the absence of conservation, the two models will converge on a higher substitution probability consistent with the rate of change in neutral sequence.
  • the evolutionary conservation determination logic can be used to determine how significantly the two models diverge in their predictions for each position in the genome based on the predicted identities (e.g., represented by probability functions) of the masked base. Based on the divergence, the evolutionary conservation determination logic can calculate a per base conservation score to represent the respective evolutionary conservation.
  • the divergence can be determined using t-statistics, which measures the ratio of the departure of an estimated value of a parameter from its hypothesized value to its standard error.
  • the divergence can be determined using the Kullback–Leibler (KL) divergence, DKL(P ∥ Q), which measures how a probability distribution Q is different from a reference probability distribution P; a toy illustration is provided below.
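  • As a non-limiting illustration, the following NumPy sketch computes the KL divergence between the two models' predicted base distributions for one position; how the divergence is converted into the final per-base conservation score is not specified here, so the scoring step and the example probabilities are purely illustrative assumptions.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # D_KL(P || Q): how the reference distribution P differs from Q,
        # summed over the base alphabet (A, C, G, T, X, -).
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    # Toy per-base comparison of the two predicted distributions.
    p_second = [0.01, 0.96, 0.01, 0.01, 0.005, 0.005]  # conservation-adjusted model
    q_first = [0.05, 0.80, 0.05, 0.05, 0.025, 0.025]   # neutral substitution rate model
    divergence = kl_divergence(p_second, q_first)       # larger -> more conserved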
Training of the First Model and the Second Model
  • Both the first and second models can be trained on ~2.5 million randomly sampled 5nt wide MSAs from the entire human genome.
  • we trained both models using a query sequence from human. That is, the first model is trained by using the query sequence from human and non-query sequences from four of the closest species to human in the MSA.
  • the second model is trained by using the query sequence from human and non-query sequences from all 236 primate species. The center nucleotide on the query sequence is masked in both of the models.
  • FIG. 19 illustrates training and testing curves for the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model).
  • the x axis shows the number of epochs
  • the y axis shows the categorical cross entropy loss of the deep learning networks.
  • the trained model weights with the lowest test loss can be chosen (e.g., weights at epoch 14 for the conservation-adjusted substitution rate model and weights at epoch 18 for the neutral substitution rate model). Since the conservation-adjusted substitution rate model leverages sequence context from a larger number of species, it achieves 11% lower test loss as compared to the neutral substitution rate model. The accuracy of predicting the masked nucleotide in both models is shown in Figure 20. Even though the model only uses 5nt wide MSAs, it achieves test accuracy > 99%. This suggests that for the task of masked nucleotide prediction from an MSA, the 5nt window length can be sufficient. An illustrative training sketch is provided below.
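  • As a non-limiting illustration, the following Keras sketch trains the first-model sketch shown earlier with categorical cross-entropy and keeps the weights with the lowest test loss; the random toy data, epoch count, batch size, and checkpoint file name are placeholders, not the disclosed training setup.

    import numpy as np
    import tensorflow as tf

    # Toy data: random integer-encoded MSAs and one-hot labels over 6 classes,
    # standing in for the ~2.5 million sampled MSAs (purely illustrative).
    rng = np.random.default_rng(0)
    x_train = rng.integers(0, 7, size=(1024, 25))
    y_train = tf.keras.utils.to_categorical(rng.integers(0, 6, size=1024), 6)
    x_test = rng.integers(0, 7, size=(256, 25))
    y_test = tf.keras.utils.to_categorical(rng.integers(0, 6, size=256), 6)

    # `first_model` is the compiled Keras sketch shown earlier; keep the weights
    # with the lowest test (validation) loss, as described above.
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_weights.h5", monitor="val_loss", save_best_only=True)
    history = first_model.fit(x_train, y_train,
                              validation_data=(x_test, y_test),
                              epochs=20, batch_size=256,
                              callbacks=[checkpoint])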
  • deep learning architectures using Transformers instead of 1D convolutional layers are also tested.
  • the Transformer model achieves performance similar to that of the model with 1D convolutional layers.
  • the conservation-adjusted substitution rate model achieved a lowest test loss of 0.0413 and with transformers the model achieved a lowest test loss of 0.0414.
  • transformers, Long Short-Term Memory (LSTM), or any other recurrent deep learning neural networks as described above can be used as the first and/or second model to predict the evolutionary conservation on a resolution of single nucleotides in the MSA.
  • LSTM Long Short-Term Memory
  • the performance of the technology disclosed is evaluated by comparing it with the PhyloP method, a test used for detecting nonneutral substitution rates on mammalian phylogenies.
  • For the PhyloP comparison, the base-wise conservation scores are generated from an MSA containing 230 primate species using the phyloFit and phyloP functions from the PHAST package. Both the PhyloP method and the technology disclosed are able to detect conservation in protein-coding exons over surrounding intronic sequence.
  • Figure 21 illustrates the mean conservation scores plotted for the deep learning system (top) described herein and PhyloP method (bottom), at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue).
  • Figures 22A and 22B illustrate the distribution of conservation scores in protein-coding exons at codon position 2 (blue) compared to codon position 3 (orange), measured by deep learning conservation scores (Figure 22A) and PhyloP scores ( Figure 22B), respectively.
  • the x axis of Figures 22A and 22B is a score distribution and the y axis is the score density.
  • the top conserved nucleotides identified by the deep learning system are highly enriched at codon position 2 (where mutations always alter the protein) relative to codon position 3 (where mutations are often synonymous), compared to weaker enrichment for PhyloP.
  • Figure 23 is a bar chart showing the enrichment of highly conserved nucleotides (> 95 th percentile of scores) falling in codon position 2 relative to codon position 3.
  • the x axis is the percentile while the y axis is an enrichment ratio between codon position 2 and codon position 3 for the most highly conserved nucleotides identified by the deep learning system (blue) and PhyloP method (orange).
  • the highly conserved nucleotides identified by the deep learning system are highly enriched at codon position 2 compared to codon position 3.
  • The PhyloP method relies on a prior of evolutionary distance between the species in the MSA. Hence, PhyloP is sensitive to the distribution of branch lengths in the evolutionary tree. While it works well for measuring evolutionary conservation on a diverse group of species (e.g., mammals) with large evolutionary distances between species, it is computationally underpowered when measuring conservation on a closely related group of species (e.g., primates) with short evolutionary distances between each other. As a result, when PhyloP was used to calculate per base conservation genome-wide, no individual bases are found to be significantly conserved over chance (P < 0.01), due to an insufficient branch length within the primate species.
  • the deep learning system as described herein can identify sequence elements that are exclusively conserved in primates and can detect conservation at short evolutionary distances.
  • the deep learning system’s improved sensitivity can be attributed to more accurate modeling of substitution probabilities at nucleotides with high intrinsic mutation rates, which provide sufficient information to identify conserved bases even within the short evolutionary timescales in the primate lineage.
  • the deep learning system can identify protein-coding exons and genes that are conserved only in primates. Genome-wide, only 0.5% of human protein-coding exons (1001 exons) are significantly conserved in primate, but not in mammals. The deep learning system is essential for improving sensitivity to detect primate-conserved exons, enabling the identification of 8,973 additional exons that are deemed to be not significant using PhyloP method. Most of these newly identified exons (85%) are conserved in mammals but are missed by primate PhyloP method due to short branch length in primates.
  • the deep learning system can identify primate-conserved noncoding cis- regulatory elements.
  • Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites.
  • the deep learning system can identify primate-conserved DNase-I hypersensitivity sites at short branch lengths, adding 57,496 DNase-I hypersensitivity sites that are missed by the primate PhyloP method. These identifications also indicate that a significant fraction of cis-regulatory elements that were previously believed to lack conservation represent recent evolutionary adaptations, with sequence conservation detectable in primates.
  • Figure 24 is an example computer system 2400 that can be used to implement the technology disclosed.
  • Computer system 2400 includes at least one central processing unit (CPU) 2472 that communicates with a number of peripheral devices via bus subsystem 2455.
  • peripheral devices can include a storage subsystem 2410 including, for example, memory devices and a file storage subsystem 2436, user interface input devices 2438, user interface output devices 2476, and a network interface subsystem 2474.
  • the input and output devices allow user interaction with computer system 2400.
  • Network interface subsystem 2474 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • the first model 104, the second model 114 and the evolutionary conservation determination logic 108 are communicably linked to the storage subsystem 2410 and the user interface input devices 2438.
  • User interface input devices 2438 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • pointing devices such as a mouse, trackball, touchpad, or graphics tablet
  • audio input devices such as voice recognition systems and microphones
  • use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2400.
  • User interface output devices 2476 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • output device is intended to include all possible types of devices and ways to output information from computer system 2400 to the user or to another machine or computer system.
  • Storage subsystem 2410 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2478.
  • Processors 2478 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs).
  • GPUs graphics processing units
  • FPGAs field-programmable gate arrays
  • ASICs application-specific integrated circuits
  • CGRAs coarse-grained reconfigurable architectures
  • Processors 2478 can be hosted by a deep learning cloud platform such as Google Cloud PlatformTM, XilinxTM, and CirrascaleTM.
  • processors 2478 include Google's Tensor Processing Unit (TPU)TM, rackmount solutions like GX4 Rackmount SeriesTM, GX36 Rackmount SeriesTM, NVIDIA DGX-1TM, Microsoft's Stratix V FPGATM, Graphcore's Intelligent Processor Unit (IPU)TM, Qualcomm's Zeroth PlatformTM with Snapdragon processorsTM, NVIDIA's VoltaTM, NVIDIA's DRIVE PXTM, NVIDIA's JETSON TX1/TX2 MODULETM, Intel's NirvanaTM, Movidius VPUTM, Fujitsu DPITM, ARM's DynamicIQTM, IBM TrueNorthTM, Lambda GPU Server with Tesla V100sTM, and others.
  • TPU Tensor Processing Unit
  • rackmount solutions like GX4 Rackmount SeriesTM, GX36 Rackmount SeriesTM, NVIDIA DGX-1TM, Microsoft's Stratix V FPGATM, Graphcore's Intelligent Processor Unit (IPU)TM
  • Memory subsystem 2422 used in the storage subsystem 2410 can include a number of memories including a main random access memory (RAM) 2432 for storage of instructions and data during program execution and a read only memory (ROM) 2434 in which fixed instructions are stored.
  • a file storage subsystem 2436 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 2436 in the storage subsystem 2410, or in other machines accessible by the processor.
  • Bus subsystem 2455 provides a mechanism for letting the various components and subsystems of computer system 2400 communicate with each other as intended. Although bus subsystem 2455 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 2400 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2400 depicted in Figure 24 is intended only as a specific example for purposes of illustrating the preferred implementations of the technology disclosed. Many other configurations of computer system 2400 are possible having more or less components than the computer system depicted in Figure 24.
  • One or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
  • one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
  • clauses described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section.
  • implementations of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
  • a system comprising: a first model configured to process a first multiple sequence alignment that aligns a query sequence to N non-query sequences, wherein the query sequence includes a masked base at a target position that is flanked by right and left bases, and predict a first identity of the masked base at the target position; a second model configured to process a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M > N, and predict a second identity of the masked base at the target position; and an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.
  • the first model is further configured to predict the first identity as a first probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
  • the first probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
  • the second model is further configured to predict the second identity as a second probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G.

Abstract

The technology disclosed relates to a deep learning network system for evolutionary conservation prediction. In one implementation, the system includes a first model for processing a first multiple sequence alignment that aligns a query sequence with a masked base at a target position to N non-query sequences and predicting a first identity of the masked base at the target position. The system also includes a second model for processing a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M > N, and predicting a second identity of the masked base at the target position. The system further includes an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.

Description

DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION DETERMINATION OF NUCLEOTIDE SEQUENCES
PRIORITY APPLICATIONS
[0001] This application claims the benefit of and priority to the following:
[0002] U.S. Nonprovisional Patent Application No. 17/947,049, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed September 16, 2022 (Attorney Docket No. ILLM 1066-2/IP-2299-US);
[0003] U.S. Provisional Patent Application No. 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed December 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV);
[0004] U.S. Provisional Patent Application No. 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed December 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV);
[0005] U. S. Provisional Patent Application No. 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NONHUMAN PRIMATES,” filed December 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298- PRV);
[0006] U.S. Provisional Patent Application No. 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed December 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV);
[0007] U.S. Provisional Patent Application No. 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed December 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); and
[0008] U.S. Provisional Patent Application No. 63/294,830, titled “SPECIES- DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed December 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV).
[0009] The priority applications are incorporated by reference as if fully set forth herein.
FIELD OF THE TECHNOLOGY DISCLOSED
[0010] The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep convolutional neural networks to analyze ordered data.
INCORPORATIONS
[0011] The following are incorporated by reference for all purposes as if fully set forth herein, and should be considered part of, this nonprovisional patent filing:
[0012] Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);
[0013] Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning.
Cell 176, 535-548 (2019);
[0014] US Patent Application No. 62/573,144, titled “TRAINING A DEEP
PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);
[0015] US Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER
BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);
[0016] US Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED
LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);
[0017] US Patent Application No. 62/582,898, titled “PATHOGENICITY
CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618- PRV);
[0018] US Patent Application No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);
[0019] US Patent Application No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);
[0020] US Patent Application No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);
[0021] US Patent Application No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on October 15, 2018 (Attorney Docket No. ILLM 1001-4/IP- 1680-US);
[0022] US Patent Application No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);
[0023] US Patent Application No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on April 15, 2021 (Atty. Docket No. ILLM 1037-2/IP-2051-US);
[0024] US Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on April 15, 2021 (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);
[0025] US Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on April 16, 2021 (Atty. Docket No. ILLM 1048-1/IP-2143-PRV);
[0026] US Patent Application No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on September 7, 2021 (Atty. Docket No. ILLM 1037-3/IP-2051A-US);
[0027] U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed October 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV);
[0028] U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed November 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV); and
[0029] U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed November 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).
BACKGROUND
[0030] Mutations occur spontaneously in each generation. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Harmless (or very rarely, beneficial) mutations are kept in the gene pool, producing variability. Genes that are functionally critical are conserved in the gene pool. The degree of evolutionary conservation of genes reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and essential functions. The presence of similarly conserved genes, portions of genes, or chromosome segments in different species reflects both the common origin of the species and important functional property of the conserved genomic elements. Furthermore, in directional selections, a phenotype is favored over other phenotypes, causing an increase in the frequency of alleles that provide fitness advantages and their conservation through speciation events (i.e., lineage-splitting events that produce two or more separate species) and genome duplications. Consequently, conservation analysis relying on sequence similarities among different species has been exploited.
[0031] Novel functional genomic elements in the primate lineage are prime candidates to understand the changes that have contributed to the uniqueness of our own species. Although comparisons between the human genome with those of other mammal and vertebrate species have revealed an extensive catalog of conserved genes and regulatory elements, the conserved sequence elements that are specific to primates have been challenging to identify due to the short evolutionary distances between primate species. At these short evolutionary timescales, it is unclear whether the absence of changes between species is due to evolutionary conservation, or simply because there has not been an opportunity for random mutations to arise. Consequently, the genomic adaptations along the phylogenetic branch from which the human species ultimately emerge remain largely undiscovered.
[0032] Compared to the mammalian lineage, which includes over 6000 species separated by ~200 million years of evolution, the primate lineage comprises 521 species that are separated by a fraction of this time. Thus, despite 43 primate species having been sequenced and aligned to date in the recent Zoonomia study of 241 placental mammals, the total phylogenetic branch length within these primates is only 10% that of the placental mammal alignment. These challenges make it clear that a successful strategy to identify sequence elements that are exclusively conserved in primates will require improved algorithms for detecting conservation at short evolutionary distances, as well as sequencing data from a far greater number of primate species.
[0033] Given that sequencing data is multi- and high-dimensional, deep neural networks have great promise in analyzing sequenced data because of their broad applicability and enhanced prediction power. Different types of neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. More importantly, an opportunity arises to use deep learning models to detect conservation at short evolutionary timescales.
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0035] The color drawings also may be available in PAIR via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
[0036] Figure 1 illustrates an example system of detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed.
[0037] Figure 2 illustrates an example of multiple sequence alignment, in accordance with one implementation of the technology disclosed.
[0038] Figure 3 is an example ideogram of the human genome depicting the average number of species covering each position in the multiple sequence alignment, in accordance with one implementation of the technology disclosed.
[0039] Figure 4A illustrates a genome-wide distribution of primate assembly coverage for each 100 kb segment of human genome, in accordance with one implementation of the technology disclosed.
[0040] Figure 4B illustrates per base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies, in accordance with one implementation of the technology disclosed.
[0041] Figure 5 is a schematic representation of an encoder-decoder architecture, in accordance with one implementation of the technology disclosed.
[0042] Figure 6 shows an overview of an attention mechanism added onto an RNN encoderdecoder architecture, in accordance with one implementation of the technology disclosed.
[0043] Figure 7 is a schematic representation of the calculation of self-attention showing one attention head, in accordance with one implementation of the technology disclosed.
[0044] Figure 8 illustrates several attention heads in a Transformer block, in accordance with one implementation of the technology disclosed.
[0045] Figure 9 illustrates parallel execution of multi-head attention logics, in accordance with one implementation of the technology disclosed.
[0046] Figure 10 illustrates one encoder layer of a Transformer network, in accordance with one implementation of the technology disclosed.
[0047] Figure 11 illustrates a schematic overview of a Transformer model, in accordance with one implementation of the technology disclosed.
[0048] Figure 12A shows a Vision Transformer (ViT), in accordance with one implementation of the technology disclosed.
[0049] Figure 12B shows a Transformer block used by the Vision Transformer, in accordance with one implementation of the technology disclosed.
[0050] Figures 13A, 13B, 13C, and 13D show details of the Transformer block of Figure 12B, in accordance with one implementation of the technology disclosed.
[0051] Figure 14 shows an example source code implementing the Vision Transformer, in accordance with one implementation of the technology disclosed.
[0052] Figure 15 illustrates an example of the first model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.
[0053] Figure 16 illustrates an example architecture of the first model with layer dimensions, in accordance with one implementation of the technology disclosed.
[0054] Figure 17 illustrates an example of the second model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.
[0055] Figure 18 illustrates an example architecture of the second model with layer dimensions, in accordance with one implementation of the technology disclosed.
[0056] Figure 19 illustrates training and testing curves for the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), in accordance with one implementation of the technology disclosed.
[0057] Figure 20 is a table showing the training and testing accuracy in predicting the masked nucleotide using the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), respectively, in accordance with one implementation of the technology disclosed.
[0058] Figure 21 illustrates comparison results of mean conservation scores at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue), predicted by an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.
[0059] Figures 22A and 22B illustrate the distribution of conservation scores in protein-coding exons at codon position 2 (blue) vs codon position 3 (orange) using an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.
[0060] Figure 23 illustrates the enrichment of highly conserved nucleotides (> 95th percentile of scores) falling in codon position 2 relative to codon position 3, where the highly conserved nucleotides are identified by an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.
[0061] Figure 24 is an example computer system that can be used to implement the technology disclosed.
DETAILED DESCRIPTION
[0062] The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[0063] The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general-purpose signal processor or a block of random-access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may function in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.
[0064] The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
[0065] Rule-based logic as used herein (e.g., evolutionary conservation determination logic), can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The rule-based evolutionary conservation determination logic can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.
System of Deep Learning Networks for Evolutionary Conservation Determination
[0066] Figure 1 illustrates an example system 100 for detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed. The first input 102 includes a multiple sequence alignment (MSA) that aligns a query sequence to a plurality of non-query sequences. The query sequence includes a masked base at a target position that is flanked by right and left bases. For example, the query sequence can be a 5 nucleotide (nt) sequence, where the masked base represented by “?” is flanked by two right and two left bases. Four 5nt non-query sequences are aligned with the query sequence in the MSA. Similarly, the second input 112 can include an MSA that aligns a query sequence to a plurality of non-query sequences. The number of non-query sequences in the second input 112 can be larger than the number of non-query sequences in the first input 102.
[0067] As an example, the selected query sequence and non-query sequences can indicate ancestral alleles (e.g., sequences found in the last common ancestor of two species). Knowing which allele is ancestral is important for understanding genome evolution. The ancestral allele information is crucial for increasing accuracy when estimating allele ages and can provide a better understanding of genomic signatures due to selection pressures. In addition, knowing the ancestral alleles of variants can provide specific explanations regarding the formation of linkage disequilibrium patterns in the genome. Ancestral allele information is also potentially helpful for understanding the rise and extinction of disease-causing variants and disease etiology.
[0068] It should also be noted that the technology disclosed does not intend to limit the number of non-query sequences in the MSA as the first or second input, which can be 3, 5, 10, 50, 100, 150, 200, 250, 300 and so on. Further, the technology disclosed does not intend to limit the length of query sequences and non-query sequences, which can be l-10nt, 1 l-50nt, 51-100nt, 101-200nt, 201-500nt and so on.
[0069] The first model 104 processes the first input 102 and predicts an identity of the masked base 106. In some implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G). In other implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). In some implementations, the probability distribution is a softmax distribution.
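As a non-limiting illustration, the following Python sketch shows one way the first input 102 could be integer-encoded and masked before being passed to the first model 104; the encoding scheme, the mask token, and the example sequences are illustrative assumptions rather than the disclosed implementation.

    import numpy as np

    # Illustrative encoding: A, C, G, T, unknown (X), missing (-), plus a mask token '?'.
    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "X": 4, "-": 5, "?": 6}

    def encode_msa(rows, mask_row=0, mask_col=None):
        # rows: aligned 5 nt strings, with the query (e.g., human) sequence first.
        msa = np.array([[VOCAB[base] for base in row] for row in rows], dtype=np.int32)
        if mask_col is None:
            mask_col = msa.shape[1] // 2        # center position of the query sequence
        msa[mask_row, mask_col] = VOCAB["?"]    # hide the target base from the model
        return msa.reshape(-1)                  # flattened input for the network

    first_input = encode_msa(["ACGTA",          # query sequence; center base gets masked
                              "ACGTA", "ACCTA", "ACGTA", "ACGTA"])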
[0070] In one implementation, the first model 104 is a background mutation estimation model (also called neutral substitution rate model) configured to encode a background mutation probability estimation in the identity of the masked base at the target position. Background mutation probability estimation measures random mutation experienced in genes. The first model 104 can estimate a background mutation probability by predicting the identity of a masked base at a target position of a query sequence from human genomes aligned with genomes from other species that are closely related to human.
[0071] The second model 114 processes the second input 112 and predicts an identity of the masked base 116. In some implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T and G. In other implementations, the predicted identity is a probability distribution that specifies basewise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). In some implementations, the probability distribution is a softmax distribution.
[0072] In one implementation, the second model 114 is a group-specific mutation probability estimation model configured to encode a group-specific mutation probability estimation in the identity of the masked base at the target position. In one implementation, the second model 114 can measure epistasis-driven mutations experienced in genes of a particular group of species. Epistasis occurs when the combined effect of two or more mutations differs from the sum of their individual effects. In other words, the effect of the mutation is dependent on the genetic context in which it appears. Such dependencies on genetic background or context can cause phenotypic diversity within individuals and, on evolutionary timescales, incompatibilities between different species. For example, some mutations that cause diseases in humans are fixed in the genomes of other species without any apparent deleterious consequences. Hence, the measurement of epistasis-driven mutations in a particular group of species allows the prediction of evolutionary outcomes, which will have profound implications for disease diagnosis and treatment. As an example, sickle-cell anemia is a multigenic disease at the phenotypic level. It is clinically evident because, out of fifteen potential complications, each patient exhibits three to five of them, with significant inter-individual variations in intensity. It is caused by three types of genes: the gene housing the primary mutation; pleiotropic genes (genes involved in secondary pathophysiologic events beyond the primary mutation); and epistatic genes (polymorphism in pleiotropic genes) that significantly modulate the pathophysiology of the disease in a particular patient. The modulations of epistatic genes are an important part of the phenotype and explain the intense inter-individual differences in the severity of the disease, in spite of all the patients having the same primary mutation. R.L. Nagel, C. R. Biologies 328, 606-615 (2005). The identification of epistasis-driven mutations and evaluation of their potential effects can provide information for diagnosing diseases, in light of the complexity of the different complications in patients and disease-related phenotypes, and, potentially, information for patient-specific treatment.
[0073] In some implementations, the first model 104 and the second model 114 are neural networks, for example, feed-forward neural networks that are trained using loss functions via backpropagation. Without limiting the scope of the technology disclosed, the first model 104 and the second model 114 can be convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, time series data processing models using time series data as input (e.g., sequencing-by-synthesis (SBS) sequencing data). More example architectures of the first model 104 and the second model 114 will be described in detail below in accordance with Figures 5-14.
[0074] The first model 104 and the second model 114 can estimate evolutionary conservation-related information that is implicitly encoded in the MSA as second-order signals. The term second-order signals refers to information that is implicitly encoded in and can be derived from the MSA via the processing of the first model 104 and the second model 114. Evolutionary conservation-related information can include evolutionary distances and the lengths of branches in phylogenetic trees. Evolutionary distances refer to the number of nucleotide substitutions per site between two homologous DNA sequences or the number of amino acid substitutions per site between two homologous protein sequences. Estimation of evolutionary distances between sequences is important for constructing phylogenetic trees, dating species' divergences, and understanding the mechanisms of evolution of genes, proteins, and populations. Furthermore, branch length indicates genetic change. The longer the branch, the more genetic change (or divergence) has occurred during evolution. One way of measuring the extent of genetic change is to estimate the average number of nucleotide or amino acid substitutions per site. Thus, by measuring the frequency of the substitution of bases at particular positions across the MSA, the first model 104 and the second model 114 can estimate evolutionary distances and branch lengths that are implicitly encoded across the MSA.
[0075] The first model 104 and the second model 114 can predict alternative splicing at the masked base in response to processing the MSA that aligns sequences of a given length (e.g., > 5nt). Alternative splicing is a process of selecting different combinations of splice sites within a messenger RNA precursor (pre-mRNA) to produce variably spliced mRNAs. For example, the first model 104 and the second model 114 can be used to identify protein-coding genes that are conserved only in primates and frequently alternatively spliced. These conservations that are exclusive to primates, together with the frequent alternative splicing, suggest that, as novel exons emerge, they are initially utilized in a limited capacity while exploring sequence space for more functions. [0076] The first model 104 and the second model 114 can be configured to detect conservation-driving genomic features. Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites. The first model 104 and the second model 114 can identify hundreds of thousands of noncoding regulatory elements at short branch lengths that are conserved only in primates and not across placental mammals. Considering that coding sequences comprise only about 1% of the human genome, the emergence of humans and primates from other placental mammals can be largely driven by novel regulatory sequence elements rather than innovations in protein-coding sequences. The first model 104 and the second model 114 can also identify human DNase-I hypersensitivity sites that are conserved exclusively in primates at short branch lengths. Existing methods (e.g., the PhyloP method for detecting nonneutral substitution rates on mammalian phylogenies) rely on long branch lengths and thus are incapable of capturing these conservations. The first model 104 and the second model 114 add newly identified noncoding regulatory elements, suggesting that a significant portion of the elements previously believed to lack conservation may represent recent evolutionary adaptations, with sequence conservation detectable in primates.
[0077] The first model 104 and the second model 114 can be configured to determine a function or feature of the masked base in dependence upon the evolutionary conservation. Hundreds of thousands of genetic variants are determined via genome-wide association studies to be associated with human diseases, the majority of which map to noncoding cis-regulatory elements (e.g., active enhancers and DNase-I hypersensitivity sites). The first model 104 and the second model 114 can predict the identities of the masked base, as regulatory elements. These identified disease-associated regulatory elements and their corresponding conservations can be used to analyze whether the evolutionary sequence adaptations that occurred even closer to the origin of the human species would yield additional conserved cis-regulatory elements associated with functions. For example, the primate species can be subdivided into subsets of monkey and ape species that diverged about 37 million years ago, less than half the divergence time of the overall primate lineage. The first model 104 and the second model 114 can identify additional noncoding regulatory elements that are conserved exclusively in monkey and ape species but not across all primates or mammals. These elements identified by the two models, based on their corresponding evolutionary conservations, suggest that a substantial fraction of disease-associated genetic variation is attributable to functionally conserved, disease-associated regulatory elements with recent evolutionary origins.
[0078] As further illustrated in Figure 1, the system 100 includes an evolutionary conservation determination logic 108 that measures evolutionary conservation of the masked base 110, based on the identities of the masked base at the target position generated from the first and second models, respectively. The evolutionary conservation can be measured as a frequency of occurrence of a particular base at a particular position across the MSA. Alternatively, the evolutionary conservation can be measured as a frequency of substitution of a particular base at a particular position across the MSA.
[0079] The evolutionary conservation determination logic 108 can measure the evolutionary conservation of the masked base 110 by comparing the two identities (106 and 116) of the masked base predicted from the first and second models. For example, when the identities of the masked base are represented by probability distributions that specify base-wise likelihoods of the masked base, the evolutionary conservation determination logic 108 defines a metric from the probability outputs of the two models, where the metric is representative of an evolutionary conservation score of the masked base. In other implementations, the evolutionary conservation determination logic 108 compares the two identities of the masked base predicted from the first and second models and measures a difference thereof, using, for example, a t-test, KL divergence, or average loss. The measured difference can represent the evolutionary conservation of the masked base.
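As a concrete illustration of one such comparison, the sketch below computes a KL divergence between the two predicted base-wise distributions. This is only one of the comparison options mentioned above (t-test, KL divergence, average loss), and the variable names and toy values are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two base-wise probability distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Predicted distributions over (A, C, G, T) for the same masked position,
# from the first (background) and second (group-specific) models.
p_first = np.array([0.70, 0.10, 0.10, 0.10])
p_second = np.array([0.95, 0.02, 0.02, 0.01])

# In this toy setting, a larger divergence indicates that the two models
# disagree more strongly about the masked base, which could be interpreted
# as a conservation-related signal.
score = kl_divergence(p_second, p_first)
print(f"divergence-based signal: {score:.3f}")
```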
[0080] In one implementation, the first model is a neutral substitution rate model, and the first identity of the masked base 106 represents a neutral substitution rate of the masked base. The second model is a conservation-adjusted substitution rate model, and the second identity represents a conservation-adjusted substitution rate of the base. To estimate conservation at the individual nucleotide level, the conservation-adjusted substitution rate is compared with the neutral substitution rate. For each individual nucleotide x, the predicted probabilities of the ancestral allele from both the neutral substitution rate model (n_x) and the conservation-adjusted substitution rate model (c_x) are extracted. In one embodiment, the first model takes as input a query sequence from human and four non-query sequences from four species that are closest to human in terms of evolutionary distance, whereas the second model takes as input the query sequence and non-query sequences from 236 primate species. When a nucleotide is highly conserved, the second model (e.g., the conservation-adjusted substitution rate model) predicts that it is less likely for the nucleotide to change from its ancestral state due to evolutionary constraint and thus, c_x can be significantly higher than n_x.
[0081] In one implementation, the evolutionary conservation determination logic 108 incorporates an ensemble approach to compute the significance of the difference between cx and nx. Consider a MSA including M sequences from M respective species, each sequence corresponding to one species. Both the neutral substitution rate model and conservation-adjusted substitution rate model are trained centered on each of the M species. In particular, the corresponding sequence from each species can be a query sequence, with a center nucleotide on the query sequence being masked.
[0082] For the neutral substitution rate model, when the sequence of a given species among the M species works as a query sequence, sequences from four species that are closest to the given species are selected as non-query sequences. For the conservation-adjusted substitution rate model, when the sequence of a given species among the M species works as a query sequence, all other M-1 sequences are non-query sequences.
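The input selection just described can be sketched as follows, assuming a precomputed matrix of pairwise evolutionary distances. The function and the toy matrix are hypothetical and only illustrate the closest-four versus all-other-M-1 split between the two models.

```python
import numpy as np

def select_non_query_species(dist, query_idx, k=4):
    """Return indices of non-query species for one model.

    dist: (M, M) matrix of pairwise evolutionary distances.
    k=4 mimics the neutral substitution rate model (closest species only);
    k=None mimics the conservation-adjusted model (all other M-1 species).
    """
    order = np.argsort(dist[query_idx])             # closest species first
    others = [i for i in order if i != query_idx]
    return others if k is None else others[:k]

# Toy 5-species distance matrix (symmetric, zeros on the diagonal).
dist = np.array([
    [0.0, 0.1, 0.2, 0.4, 0.9],
    [0.1, 0.0, 0.2, 0.4, 0.9],
    [0.2, 0.2, 0.0, 0.3, 0.8],
    [0.4, 0.4, 0.3, 0.0, 0.7],
    [0.9, 0.9, 0.8, 0.7, 0.0],
])
print(select_non_query_species(dist, query_idx=0, k=4))     # neutral model input
print(select_non_query_species(dist, query_idx=0, k=None))  # conservation-adjusted input
```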
[0083] From these M pairs of trained models, for each species i, the evolutionary conservation determination logic 108 can extract the predicted probabilities of the ancestral allele, n_x^i and c_x^i. Extracting the ensemble probabilities for the neutral substitution rate model and the conservation-adjusted substitution rate model allows the computation of the significance of the difference between the predicted probabilities of the ancestral allele, n_x^i and c_x^i. The difference can represent the evolutionary conservation of the masked base 110.
[0084] The t-test statistics between N_x = {n_x^1, n_x^2, ..., n_x^M} and C_x = {c_x^1, c_x^2, ..., c_x^M}, assuming equal variance, can be computed by the evolutionary conservation determination logic 108. When p_x is the p-value of the t-test statistic at base x, a conservation score S_x of each individual nucleotide x can also be calculated, where S_x = -log(p_x). It should be noted that only the predicted probabilities from species with alignments present at the position of interest are considered when computing the t-test statistics, since alignments for all species are not always available at each position of the human genome. [0085] We now turn to the advantages of the technology disclosed by using two models to predict evolutionary conservation at the resolution of single nucleotides in the genome. By leveraging MSA from a group of species, the technology disclosed can identify a nucleotide or a region of nucleotides in the genome that has been conserved across evolution, even when the group of species has short evolutionary distances. The capability of identifying conserved regions of nucleotides without being limited by the short evolutionary distances among species, as described in various embodiments below, can be particularly desirable. It allows the identification and conservation analysis of regions of nucleotides, for example, non-coding regulatory elements, protein-coding exons, and epistasis-driven mutations among closely related species, when these regions were previously missed by existing methods. These newly identified genomic elements in the primate lineage are the prime candidates for understanding the changes that have contributed to the uniqueness of our own species. The use of MSA including 236 primate species substantially expands the available sequence data representing evolutionary conservation at short evolutionary distances. The use of two models with different MSA inputs improves the sensitivity and accuracy in identifying regions of conserved nucleotides and their corresponding conservation. The technology disclosed brings an improved interpretation of the genome and the importance of genomic features. Various embodiments of the technology disclosed can be used to identify evolutionary features crucial to a group of closely related species, for example, microRNAs that are specific to primates, alternatively spliced exons conserved in primates but not in other branches, primate-conserved noncoding cis-regulatory elements, and epistasis-driven mutations experienced in genes of a particular group of species.
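The t-test based scoring of paragraph [0084] can be illustrated with the following sketch. It assumes SciPy's two-sample t-test and a natural logarithm for S_x = -log(p_x); the base of the logarithm is not specified above, so that choice is an assumption, and the toy probabilities are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

def conservation_score(n_x, c_x):
    """Per-base conservation score S_x = -log(p) from a two-sample t-test
    (equal variance) between ensemble probabilities of the ancestral allele.

    n_x, c_x: arrays of predicted ancestral-allele probabilities at base x,
    one entry per species whose alignment is present at that position.
    """
    n_x = np.asarray(n_x, dtype=float)
    c_x = np.asarray(c_x, dtype=float)
    _, p_value = ttest_ind(c_x, n_x, equal_var=True)
    return -np.log(p_value)

# Toy example: the conservation-adjusted model is consistently more confident
# about the ancestral allele than the neutral model, yielding a high score.
neutral = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59]
adjusted = [0.93, 0.95, 0.91, 0.96, 0.94, 0.92]
print(conservation_score(neutral, adjusted))
```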
[0086] Unlike existing models that are highly sensitive to the distribution of branch lengths in the phylogenetic tree and thus only work well when measuring evolutionary conservation among groups of species with large evolutionary distances, various embodiments of the technology disclosed do not require explicit tree structures or branch lengths. In other words, the technology disclosed relaxes the need to explicitly depend on branch lengths. The technology disclosed identifies nucleotide positions that have remained conserved during evolution and detects evolutionary constraints in the lineage of interest at increased resolution. Hence, it allows the investigation of the genomics of shared and specialized traits in groups of species.
Input to First Model and Second Model
[0087] In one implementation, the total number of sequences in the first input 102 is less than 10 (e.g., 5). The total number of sequences in the second input 112 is more than the total number in the first input 102, for example, 20, 50, 100, 150, 200, 210, 220, 230, 240, and so on. The non-query sequences in the second input 112 can include part or all of the non-query sequences in the first input 102. In another implementation, all non-query sequences in the first and/or second input are closest homologues to their respective query sequences.
[0088] The query sequence in the first input 102 and/or the second input 112 can belong to one species of interest, for example, human. The non-query sequences in the first input 102 and/or the second input 112 can belong to non-human primates. In one implementation, the non-query sequences in the first input 102 can belong to a group of species that shares a family (e.g., hominids) with the species of interest. In another implementation, the group of species shares an order (e.g., primates) with the species of interest. The non-query sequences in the second input 112 can belong to another group of species that shares a class (e.g., mammals) with the species of interest. In one implementation, the non-query sequences in the second input 112 can share a phylum (e.g., chordates) with the species of interest. In another implementation, the non-query sequences in the second input 112 can share a kingdom (e.g., animals) with the species of interest.
[0089] In some implementations, an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the first input 102 is less than an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the second input 112. Evolutionary distances can refer to the number of substitutions per site separating a pair of homologous sequences since the sequences diverged from their common ancestral sequence, and can be computed by aligning the pair of homologous sequences. In other implementations, the two groups of species for non-query sequences in the first and second input overlap. That is, some species are present in both groups. The non-overlapping species can be more evolutionarily distant from the species of interest than the overlapping species. Including evolutionarily distant species as input to the second model but not the first model, and comparing the predicted identities of the masked base from the two models, can reveal evolutionary conservation information about the masked base that takes into consideration both the species closest to the species of interest and the evolutionarily distant species with larger evolutionary distances.
[0090] As described above, the query sequence with a masked base at a target position is aligned with a plurality of non-query sequences from other species in the MSA. MSA aligns multiple homologous sequences with a target and is an important step in comparative analyses and property prediction of biological sequences, since a lot of information, for example, evolution and coevolution clusters, is generated from the alignment and can be mapped to the target sequence of choice. MSA also provides information on evolutionary conservation by showing conserved regions across multiple species.
[0091] Figure 2 illustrates an example of MSA 200 including aligned genomic sequences of a variety of species including human, chimpanzee, bonobo, gorilla, etc. A person skilled in the art will appreciate that MSA 200 can include more sequences. In one implementation, MSA 200 aligns at least hundreds of sequences. In another implementation, MSA 200 aligns thousands of sequences. It is also noted that the query sequence and non-query sequences in the MSA are not limited to genomic sequences. Various sequences of residues including amino acids, nucleotides, carbohydrates, lipids, or other types of biological polymers can be used. Each residue can be a position/element in the sequences, which are arranged as multiple sequence alignments.
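One plausible way to encode such an MSA for a model, with the target position masked in the query row, is sketched below. The alphabet and the convention of zeroing out the masked position are assumptions made for illustration only.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3, "X": 4, "-": 5}  # illustrative alphabet

def encode_msa(msa_rows, mask_position):
    """One-hot encode an MSA (one row per species) and zero out the target
    position in the query row (row 0) to represent the masked base."""
    depth = len(BASES)
    encoded = np.zeros((len(msa_rows), len(msa_rows[0]), depth), dtype=np.float32)
    for r, row in enumerate(msa_rows):
        for c, base in enumerate(row.upper()):
            encoded[r, c, BASES.get(base, BASES["X"])] = 1.0
    encoded[0, mask_position, :] = 0.0  # masked base in the query sequence
    return encoded

# Toy MSA: human query plus three non-query primate rows, masking the center base.
msa = ["ACGTA", "ACGTA", "ACCTA", "AC-TA"]
x = encode_msa(msa, mask_position=2)
print(x.shape)   # (4, 5, 6)
print(x[0, 2])   # all zeros at the masked target position
```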
[0092] In one implementation, a MSA of 50 primate reference genomes is augmented using sequence data from 556 individuals drawn from an additional 193 primate species. For each new species, an approximate reference genome is derived by combining its sequence data with the reference genome of the species in the initial MSA that is evolutionarily closest to it (i.e., the base species). Read pairs from each sample are first assembled with Megahit (version 1.2.9, default parameters), resulting in assemblies with an average contig N50 of 34 Kb. The assembly sizes range from 2.2 Gb to 3.7 Gb, thus falling within the typical range of reported genome sizes for primates. The resulting contigs are then aligned to the reference genome of the base species with minimap2 (version 2.22, parameters -ax asm10). Finally, variants are called with the paftools utility that accompanies minimap2, only considering alignments from contigs at least 1000 bp in length (parameter -L 1000). Assembling reads prior to alignment is better placed to detect more complex variation between the derived and initial species than resequencing by alignment of individual read pairs.
[0093] For each derived species, the discovered variants are moved into the reference genome of its associated base species with bcftools (i.e., a set of tools that manipulate variant calls in the variant call format and its binary counterpart) to build an approximate consensus for the derived species. Given the aligned positions of the base genome in the initial MSA, the same sites are extracted from the derived genome (taking into consideration coordinate changes caused by insertions and deletions) to create a set of alignment lines for the derived species, which are then inserted into the MSA for the expanded species set.
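The derived-genome construction described in paragraphs [0092]-[0093] can be sketched as a pipeline of external tools. The sketch below assumes standard command-line conventions for Megahit, minimap2, paftools, and bcftools; it routes minimap2 output through the documented PAF/--cs path into paftools call rather than the SAM output implied by -ax, and the exact flags used in practice may differ from those quoted in the text.

```python
import subprocess

def run(cmd):
    """Run a shell command, raising if it fails."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def build_derived_genome(reads_1, reads_2, base_ref_fa, out_prefix):
    """Sketch of the derived-genome workflow: assemble reads, align contigs to
    the base species reference, call variants, and build an approximate
    consensus for the derived species."""
    # 1. De novo assembly of the read pairs (Megahit, default parameters).
    run(f"megahit -1 {reads_1} -2 {reads_2} -o {out_prefix}_asm")
    contigs = f"{out_prefix}_asm/final.contigs.fa"

    # 2. Align contigs to the base species reference (minimap2, assembly preset).
    run(f"minimap2 -c --cs -x asm10 {base_ref_fa} {contigs} > {out_prefix}.paf")

    # 3. Call variants with paftools, keeping alignments >= 1000 bp as in the text.
    run(f"sort -k6,6 -k8,8n {out_prefix}.paf | "
        f"paftools.js call -L 1000 -f {base_ref_fa} - > {out_prefix}.vcf")

    # 4. Apply the variants to the base reference to approximate the derived genome.
    run(f"bgzip -f {out_prefix}.vcf && bcftools index -f {out_prefix}.vcf.gz")
    run(f"bcftools consensus -f {base_ref_fa} {out_prefix}.vcf.gz > {out_prefix}_consensus.fa")
```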
[0094] This procedure can be validated by applying it to species for which a well-characterized reference sequence is already available, for example, by aligning contigs assembled from a gorilla sample to the chimpanzee reference sequence to create an approximation of the gorilla genome. Error rates are computed based only on those fragments of the approximate and “true” references that participate in the MSA, since it is only these regions that are relevant to the subsequent analysis. An acceptable rate of error is considered to be one that approximates the known rate of heterozygosity between individuals of the species.
[0095] In one implementation, to discover primate-specific conserved elements, multiple individuals from 211 primate species are sequenced, which, combined with other earlier reference assemblies, brings the total to 236 out of the 521 extant primate species. For each species, PCR-free whole genome sequencing is performed with 150 bp paired-end reads, obtaining an average of 57X whole-genome coverage per species. Megahit (i.e., a next-generation sequencing assembler optimized for metagenomes) is used to perform a de novo assembly of the reads into contigs. Thus, the existing MSA of 50 primate reference assemblies is expanded to a 236-species MSA by aligning each short read contig to the closest available reference species. Figure 3 is an example ideogram of the human genome depicting the average number of species covering each position in the MSA. As illustrated, the human genome is well covered by primate alignments except for telomeric, centromeric, and heterochromatic regions, which are highly repetitive in the human reference. Compared to existing models that use 30 primate species, the constructed 236-species MSA used as input to the second model 114 significantly expands the sequence data from a far greater number of primate species, and facilitates the identification of the masked base and the corresponding conservation at short evolutionary distances.
[0096] To quantify the conservation across the human genome: on average, each base in the human genome is covered by 185 other primate species, and 99.2% of the euchromatic regions of the human genome are covered by at least 100 other primate species. Figure 4A illustrates a genome-wide distribution of primate assembly coverage for each 100 kb segment of the human genome. Over 2.8 Gb of the human genome are covered by at least 100 primate species. The sequence data of a set of 25 species within the 236 species is analyzed to ensure the per-base error rate in the de novo assemblies is sufficiently low for subsequent conservation analysis. The data representing both the short read contigs and reference genome assemblies is compared with long read assemblies that had been generated. The rates of mismatches between these assembly pairs range between 0.02-0.5% across these 25 species and are largely explained by differences in the species' heterozygosity. Figure 4B illustrates the per-base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies. It is found that a majority of the mismatch rate (e.g., 82%) can be explained by allelic variation within the species. When the mismatch rate caused by heterozygosity is excluded, the residual per-base error rates are below 0.1%. When the intraspecific variations are further excluded, the average remaining mismatch rates attributable to assembly and sequencing errors are reduced to 0.04%.
[0097] Thus, the constructed MSA including 236 primate species increases the phylogenetic branch length 2.8-fold over the previously available 30 primate species alignment from the Zoonomia study. This MSA can be used as, for example, the second input 112 to the second model 114, such that it can detect conservation at short evolutionary distances using a far greater number of primate species.
[0098] Next, we turn to the first and the second models 104 and 114, and example architectures thereof.
Architectures of First Model and Second Model
[0099] The following discussion provides different examples of machine learning architectures that can be used to implement the first model and the second model. These example machine learning architectures can take as input machine-processable or vectorized representations of genomic data, for example, one-hot encodings of nucleotides and/or amino acids, process the machine-processable representations through a plurality of hidden layers and weights of the machine learning architectures, produce learned or alternative or intermediate or compressed representations of the machine-processable representations, and generate one or more outputs based on the learned or alternative or intermediate or compressed representations. These outputs can be genotype predictions identifying one or more attributes or identities of the genomic data, such as the identity of the nucleotides and/or amino acids, evolutionary conservation states of the nucleotides and/or amino acids, the pathogenicity of the nucleotides and/or amino acids, and so on.
[0100] In one implementation, the first model 104 and the second model 114 are a multilayer perceptron (MLP). In another implementation, the first model 104 and the second model 114 are neural networks, for example, feedforward neural networks. In yet another implementation, the first model 104 and the second model 114 are fully-connected neural networks, for example, fully-connected convolution neural networks. In yet further implementation, the first model 104 and the second model 114 are semantic segmentation neural networks. In yet another further implementation, the first model 104 and the second model 114 are a generative adversarial network (GAN) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN). In yet another further implementation, the first model 104 and the second model 114 are transformer-based networks, for example, Transformer-based CNNs.
[0101] In one implementation, the first model 104 and the second model 114 are convolution neural networks (CNNs) with a plurality of convolution layers. In another implementation, the first model 104 and the second model 114 are recurrent neural networks (RNNs) such as long short-term memory networks (LSTM), bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU). In yet another implementation, the first model 104 and the second model 114 include both a CNN and an RNN.
[0102] In yet other implementations, the first model 104 and the second model 114 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1 x 1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. The first model 104 and the second model 114 can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. The first model 104 and the second model 114 can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). The first model 104 and the second model 114 can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., nonlinear transformation functions like rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms (e.g., self-attention).
[0103] The first model 104 and the second model 114 can be a rule-based model, linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, and a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes). The first model 104 and the second model 114 can be an ensemble of multiple models, in some implementations.
[0104] The first model 104 and the second model 114 can be trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the first model 104 and the second model 114 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the first model 104 and the second model 114 include Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
[0105] Examples of generative models, including transformer-based models and encoder-decoder models, are described in the following. Generative models are often used in natural language processing. In recent years, generative models have emerged as powerful machine-learning tools to discover evolutionary and functional information of biological sequences (e.g., genomic sequences and protein sequences). Biological sequences can be transformed into vector representations using word embedding. The vector representations can be used efficaciously for genome identification.
Transformer-Based Models
[0106] In different implementations, the first model 104 and the second model 114 include self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT-2, GPT-3, BERT, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT-Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B, Twins-SVT-L, Shuffle-T, Shuffle-S, Shuffle-B, XCiT-S12/16, CMT-S, CMT-B, VOLO-D1, VOLO-D2, VOLO-D3, VOLO-D4, MoCo v3, ACT, TSP, Max-DeepLab, VisTR, SETR, Hand-Transformer, HOT-Net, METRO, Image Transformer, Taming transformer, TransGAN, IPT, TTSR, STTN, Masked Transformer, CLIP, DALL-E, Cogview, UniT, ASH, TinyBert, FullyQT, ConvBert, FCOS, Faster R-CNN + FPN, DETR-DC5, TSP-FCOS, TSP-RCNN, ACT+MKDD (L=32), ACT+MKDD (L=16), SMCA, Efficient DETR, UP-DETR, UP-DETR, ViT-B/16-FRCNN, ViT-B/16-FRCNN, PVT-Small+RetinaNet, Swin-T+RetinaNet, Swin-T+ATSS, PVT-Small+DETR, TNT-S+DETR, YOLOS-Ti, YOLOS-S, and YOLOS-B.
Transformer Logic
[0107] Machine learning is the use and development of computer systems that can learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. Some of the state-of-the-art models use Transformers, a more powerful and faster model than neural networks alone. Neural networks process input in series (e.g., time series data including sequencing-by-synthesis (SBS) sequencing data) and weight relationships by distance in the series. Transformers can process input in parallel and do not necessarily weight by distance. Transformers can be used in addition to neural networks. This architecture is described here.
Encoder-Decoder Architecture
[0108] Figure 5 is a schematic representation of an encoder-decoder architecture. This architecture is often used for time-series data processing (e.g., sequencing data generated via sequencing-by-synthesis) and has two main building blocks. The first building block is the encoder that encodes an input into a fixed-size vector. In the system described herein, the encoder is based on a recurrent neural network (RNN). At each time step, t, a hidden state of time step, t- 1, is combined with the input value at time step t to compute the hidden state at timestep t. The hidden state at the last time step, encoded in a context vector, contains relationships encoded at all previous time steps. [0109] The context vector is then passed to the second building block, the decoder. For translation, the decoder has been trained on a second language. Conditioned on the input context vector, the decoder generates an output sequence. At each time step, t, the decoder is fed the hidden state of time step, t-1, and the output generated at time step, t-1. The first hidden state in the decoder is the context vector, generated by the encoder. The context vector is used by the decoder to perform the translation.
[0110] The whole model is optimized end-to-end by using backpropagation, a method of training a neural network in which the initial system output is compared to the desired output and the system is adjusted until the difference is minimized. In backpropagation, the encoder is trained to extract the right information from the input sequence, and the decoder is trained to capture the grammar and vocabulary of the output language. This results in a fluent model that uses context and generalizes well. When training an encoder-decoder model, the real output sequence is used to train the model to prevent mistakes from stacking. When testing the model, the previously predicted output value is used to predict the next one.
[0111] When performing a translation task using the encoder-decoder architecture, all information about the input sequence is forced into one vector, the context vector. Information connecting the beginning of the sentence with the end is lost; this is the vanishing gradient problem. Also, different parts of the input sequence are important for different parts of the output sequence, information that cannot be learned using only RNNs in an encoder-decoder architecture.
Attention Mechanism
[0112] Attention mechanisms distinguish Transformers from other machine learning models. The attention mechanism provides a solution for the vanishing gradient problem. Figure 6 shows an overview of an attention mechanism added onto an RNN encoder-decoder architecture. At every step, the decoder is given an attention score, e, for each encoder hidden state. In other words, the decoder is given weights for each relationship between words in a sentence. The decoder uses the attention score concatenated with the context vector during decoding. The output of the decoder at time step t is based on all encoder hidden states and the attention outputs. The attention output captures the relevant context for time step t from the original sentence. Thus, words at the end of a sentence may now have a strong relationship with words at the beginning of the sentence. In the sentence “The quick brown fox, upon arriving at the doghouse, jumped over the lazy dog,” fox and dog can be closely related despite being far apart in this complex sentence.
[0113] To weight encoder hidden states, a dot product between the decoder hidden state of the current time step, and all encoder hidden states, is calculated. This results in an attention score for every encoder hidden state. The attention scores are higher for those encoder hidden states that are similar to the decoder hidden state of the current time step. Higher values for the dot product indicate the vectors are pointing more closely in the same direction. The attention scores are converted to fractions that sum to one using the SoftMax function.
[0114] The SoftMax scores provide an attention distribution. The x-axis of the distribution is the position in a sentence. The y-axis is attention weight. The scores show which encoder hidden states are most closely related. The SoftMax scores specify which encoder hidden states are the most relevant for the decoder hidden state of the current time step.
[0115] The elements of the attention distribution are used as weights to calculate a weighted sum over the different encoder hidden states. The outcome of the weighted sum is called the attention output. The attention output is used to predict the output, often in combination (concatenation) with the decoder hidden states. Thus, both information about the inputs, as well as the already generated outputs, can be used to predict the next outputs.
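A minimal sketch of this dot-product attention for a single decoder time step is shown below, using plain NumPy; all shapes, names, and toy values are illustrative rather than taken from any particular implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_output(decoder_state, encoder_states):
    """Dot-product attention for one decoder time step.

    decoder_state: (d,) hidden state of the decoder at the current step.
    encoder_states: (T, d) hidden states of the encoder, one per input position.
    Returns the attention weights and the weighted sum (attention output).
    """
    scores = encoder_states @ decoder_state   # one score per encoder hidden state
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context

# Toy example with 5 encoder positions and hidden size 4.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))
dec = rng.normal(size=(4,))
w, ctx = attention_output(dec, enc)
print(w.round(3), ctx.round(3))
```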
[0116] By making it possible to focus on specific parts of the input in every decoder step, the attention mechanism solves the vanishing gradient problem. By using attention, information flows more directly to the decoder. It does not pass through many hidden states. Interpreting the attention step can give insights into the data. Attention can be thought of as a soft alignment. The words in the input sequence with a high attention score align with the current target word.
Attention describes long-range dependencies better than RNN alone. This enables the analysis of longer, more complex sentences.
[0117] The attention mechanism can be generalized as: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the vector values, dependent on the vector query. The vector values are the encoder hidden states, and the vector query is the decoder hidden state at the current time step.
[0118] The weighted sum can be considered a selective summary of the information present in the vector values. The vector query determines on which of the vector values to focus. Thus, a fixed-size representation of the vector values can be created, in dependence upon the vector query.
[0119] The attention scores can be calculated by the dot product, or by weighting the different values (multiplicative attention).
Embeddings
[0120] For most machine learning models, the input to the model needs to be numerical. The input to a translation model is a sentence, and words are not numerical. Multiple methods exist for the conversion of words into numerical vectors. These numerical vectors are called the embeddings of the words. Embeddings can be used to convert any type of symbolic representation into a numerical one.
[0121] Embeddings can be created by using one-hot encoding. The one-hot vector representing the symbols has the same length as the total number of possible different symbols. Each position in the one-hot vector corresponds to a specific symbol. For example, when converting colors to a numerical vector, the length of the one-hot vector would be the total number of different colors present in the dataset. For each input, the location corresponding to the color of that value is one, whereas all the other locations are valued at zero. This works well for working with images. For natural language processing (NLP), this becomes problematic, because the number of words in a language is very large. This results in enormous models and the need for a lot of computational power. Furthermore, no specific information is captured with one-hot encoding. From the numerical representation, it is not clear that orange and red are more similar than orange and green. For this reason, other methods exist.
[0122] A second way of creating embeddings is by creating feature vectors. Every symbol has its specific vector representation, based on features. With colors, a vector of three elements could be used, where the elements represent the amount of yellow, red, and/or blue needed to create the color. Thus, all colors can be represented by only using a vector of three elements. Also, similar colors have similar representation vectors.
[0123] Embeddings based on context can be trained. Words with similar meanings occur in similar contexts. At the nucleotide level, particular combinations of three DNA or RNA nucleotides correspond to specific amino acids or stop signals during protein synthesis. In addition, homologous proteins or genes have sequence similarity that reflects common ancestry. Different methods take the context into account. For natural language processing, some methods, like GloVe, base their context embeddings on co-occurrence statistics from corpora (large texts) such as Wikipedia. Words with similar co-occurrence statistics have similar word embeddings. Other methods use neural networks to train the embeddings. For example, they train their embeddings to predict the word based on the context (Common Bag of Words), and/or to predict the context based on the word (Skip-Gram). Training these contextual embeddings is time intensive. For this reason, pre-trained libraries exist. Other deep learning methods can be used to create embeddings. For example, the latent space of a variational autoencoder (VAE) can be used as the embedding of the input. Another method is to use 1D convolutions to create embeddings. This causes a sparse, high-dimensional input space to be converted to a denser, low-dimensional feature space.
Self-Attention: Queries (Q), Keys (K), Values (V)
[0124] Transformer models are based on the principle of self-attention. Self-attention allows each element of the input sequence to look at all other elements in the input sequence and search for clues that can help it to create a more meaningful encoding. It is a way to look at which other sequence elements are relevant for the current element. The Transformer can grab context from both before and after the currently processed element.
[0125] When performing self-attention, three vectors need to be created for each element of the encoder input: the query vector (Q), the key vector (K), and the value vector (V). These vectors are created by multiplying the input embedding vector with three unique weight matrices.
[0126] After this, self-attention scores are calculated. When calculating self-attention scores for a given element, the dot products between the query vector of this element and the key vectors of all other input elements are calculated. To make the model mathematically more stable, these self-attention scores are divided by the root of the size of the vectors. This has the effect of reducing the importance of the scalar, thus emphasizing the importance of the direction of the vector. Just as before, these scores are normalized with a SoftMax layer. This attention distribution is then used to calculate a weighted sum of the value vectors, resulting in a vector z for every input element. In the attention principle explained above, the vector used to calculate attention scores and to perform the weighted sum was the same; in self-attention, two different vectors are created and used. As the self-attention needs to be calculated for all elements (thus a query for every element), one formula can be created to calculate a Z matrix. The rows of this Z matrix are the z vectors for every sequence input element, giving the matrix a size of sequence length by the dimension of the q/k/v vectors.
[0127] Multi-headed attention is executed in the Transformer. Figure 7 is a schematic representation of the calculation of self-attention showing one attention head. For every attention head, different weight matrices are trained to calculate Q, K, and V. Every attention head outputs a matrix Z. Different attention heads can capture different types of information. The different Z matrices of the different attention heads are concatenated. This matrix can become large when multiple attention heads are used. To reduce dimensionality, an extra weight matrix W is trained to condense the different attention heads into a matrix with the same size as one Z matrix. This way, the amount of data given to the next step does not enlarge every time self-attention is performed.
[0128] When performing self-attention, information about the order of the different elements within the sequence is lost. To address this problem, positional encodings are added to the embedding vectors. Every position has its unique positional encoding vector. These vectors follow a specific pattern, which the Transformer model can learn to recognize. This way, the model can consider distances between the different elements. [0129] As discussed above, in the core of self-attention are three objects: queries (Q), keys (K), and values (V). Each of these objects has an inner semantic meaning of their purpose. One can think of these as analogous to databases. We have a user-defined query of what the user wants to know. Then we have the relations in the database, i.e., the values which are the weights. More advanced database management systems create some apt representation of its relations to retrieve values more efficiently from the relations. This can be achieved by using indexes, which represent information about what is stored in the database. In the context of attention, indexes can be thought of as keys. So instead of running the query against values directly, the query is first executed on the indexes to retrieve where the relevant values or weights are stored. Lastly, these weights are run against the original values to retrieve data that are most relevant to the initial query.
[0130] Figure 8 depicts several attention heads in a Transformer block. We can see that the outputs of queries and keys dot products in different attention heads are differently colored. This depicts the capability of the multi-head attention to focus on different aspects of the input and aggregate the obtained information by multiplying the input with different attention weights. [0131] Examples of attention calculation include scaled dot-product attention and additive attention. There are several reasons why scaled dot-product attention is used in the Transformers. Firstly, the scaled dot-product attention is relatively fast to compute, since its main parts are matrix operations that can be run on modern hardware accelerators. Secondly, it performs similarly well for smaller dimensions of the K matrix, d_k, as the additive attention. For larger d_k, the scaled dot-product attention performs a bit worse because dot products can cause the vanishing gradient problem. This is compensated via the scaling factor, which is defined as √d_k. [0132] As discussed above, the attention function takes as input three objects: key, value, and query. In the context of Transformers, these objects are matrices of shapes (n, d), where n is the number of elements in the input sequence and d is the hidden representation of each element (also called the hidden vector). Attention is then computed as:
Attention(Q, K, V) = SoftMax(Q·K^T / √d_k)·V, where Q, K, V are computed as: Q = X·W_Q, K = X·W_K, V = X·W_V
[0133] X is the input matrix and W_Q, W_K, W_V are learned weights to project the input matrix into the representations. The dot products appearing in the attention function are exploited for their geometrical interpretation where higher values of their results mean that the inputs are more similar, i.e., pointing in the geometrical space into the same direction. Since the attention function now works with matrices, the dot product becomes matrix multiplication. The SoftMax function is used to normalize the attention weights so that they sum to 1 prior to being multiplied by the values matrix. The resulting matrix is used either as input into another layer of attention or becomes the output of the Transformer.
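The formula above can be sketched directly in NumPy as follows; the weight matrices are random stand-ins for learned parameters, and the shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d) attended values

# Project an input sequence X (n tokens, hidden size d) with stand-in weights.
rng = np.random.default_rng(1)
n, d = 6, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)  # (6, 8)
```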
Multi-Head Attention
[0134] Transformers become even more powerful when multi-head attention is used. Queries, keys, and values are computed the same way as above, though they are now projected into h different representations of smaller dimensions using a set of h learned weights. Each representation is passed into a different scaled dot-product attention block called a head. The head then computes its output using the same procedure as described above.
[0135] Formally, the multi-head attention is defined as
MultiHeadAttention(Q, K, V) = [head_1, ..., head_h]·W_O, where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
[0136] The outputs of all heads are concatenated together and projected again using the learned weight matrix W_O to match the dimensions expected by the next block of heads or the output of the Transformer. Using the multi-head attention instead of the simpler scaled dot-product attention enables Transformers to jointly attend to information from different representation subspaces at different positions.
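A corresponding multi-head sketch, again with random stand-ins for the learned per-head and output projections, is shown below; the head count and dimensions are illustrative.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, weights_per_head, W_O):
    """Compute [head_1, ..., head_h] W_O, where each head runs scaled
    dot-product attention on its own projections of the input X."""
    heads = []
    for W_Q, W_K, W_V in weights_per_head:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        heads.append(softmax_rows(scores) @ V)
    return np.concatenate(heads, axis=-1) @ W_O   # project back to model size

# Toy setup: 6 tokens, model size 8, h = 2 heads of size 4.
rng = np.random.default_rng(2)
n, d, h = 6, 8, 2
X = rng.normal(size=(n, d))
per_head = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(d, d))
print(multi_head_attention(X, per_head, W_O).shape)  # (6, 8)
```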
[0137] As shown in Figure 9, one can use multiple workers to compute the multi-head attention in parallel, as the respective heads compute their outputs independently of one another. Parallel processing is one of the advantages of Transformers over RNNs.
[0138] Assume the naive matrix multiplication algorithm, which has a complexity of a·b·c for matrices of shapes (a, b) and (b, c).
[0139] To obtain the values Q, K, V, we need to compute the operations:
Q = X·W_Q, K = X·W_K, V = X·W_V
[0140] The matrix X is of shape (n, d), where n is the number of patches and d is the hidden vector dimension. The weights W_Q, W_K, W_V are all of shape (d, d). Omitting the constant factor 3, the resulting complexity is: n·d^2
[0141] We can proceed to the estimation of the complexity of the attention function itself, i.e., SoftMax(Q·K^T / √d_k)·V. The matrices Q and K are both of shape (n, d). The transposition operation K^T does not influence the asymptotic complexity of computing the dot product of matrices of shapes (n, d)·(d, n); therefore its complexity is: n^2·d
[0142] Scaling by a constant factor of √d_k, where d_k is the dimension of the keys vector, as well as applying the SoftMax function, both have the complexity of a·b for a matrix of shape (a, b), hence they do not influence the asymptotic complexity. Lastly, the dot product SoftMax·V is between matrices of shapes (n, n) and (n, d) and so its complexity is: n^2·d
[0143] The final asymptotic complexity of scaled dot-product attention is obtained by summing the complexities of computing Q, K, V, and of the attention function: n·d^2 + n^2·d.
[0144] The asymptotic complexity of multi-head attention is the same since the original input matrix X is projected into h matrices of shapes (n, d/h), where h is the number of heads. From the view of asymptotic complexity, h is constant, therefore we would arrive at the same estimate of asymptotic complexity using a similar approach as for the scaled dot-product attention.
[0145] Transformer models often have the encoder-decoder architecture, although this is not necessarily the case. The encoder is built out of different encoder layers which are all constructed in the same way. The positional encodings are added to the embedding vectors. Afterward, self-attention is performed.
Encoder Block of Transformer
[0146] Figure 10 portrays one encoder layer of a Transformer network. Every self-attention layer is surrounded by a residual connection, summing up the output and input of the self-attention. This sum is normalized, and the normalized vectors are fed to a feed-forward layer. Every z vector is fed separately to this feed-forward layer. The feed-forward layer is wrapped in a residual connection and the outcome is normalized too. Often, numerous encoder layers are piled to form the encoder. The output of the encoder is a fixed-size vector for every element of the input sequence.
[0147] Just like the encoder, the decoder is built from different decoder layers. In the decoder, a modified version of self-attention takes place. The query vector is only compared to the keys of previous output sequence elements. The elements further in the sequence are not known yet, as they still must be predicted. No information about these output elements may be used.
Encoder-Decoder Blocks of Transformer
[0148] Figure 11 shows a schematic overview of a Transformer model. Next to a self-attention layer, a layer of encoder-decoder attention is present in the decoder, in which the decoder can examine the last Z vectors of the encoder, providing fluent information transmission. The ultimate decoder layer is a feed-forward layer. All layers are packed in a residual connection. This allows the decoder to examine all previously predicted outputs and all encoded input vectors to predict the next output. Thus, information from the encoder is provided to the decoder, which could improve the predictive capacity. The output vectors of the last decoder layer need to be processed to form the output of the entire system. This is done by a combination of a feedforward layer and a SoftMax function. The output corresponding to the highest probability is the predicted output value for a subject time step.
[0149] For some tasks other than translation, only an encoder is needed. This is true for both document classification and named entity recognition. In these cases, the encoded input vectors are the input of the feed-forward layer and the SoftMax layer. Transformer models have been extensively applied in time-series data processing fields. These models have applications in the field of biology as well, for predicting protein structure and function and labeling DNA sequences.
Vision Transformer
[0150] There are extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation).
[0151] In image classification, we often have a single input image in which the pixels are in a sequence. To reduce the computation required, Vision Transformers (ViTs) cut the input image into a set of fixed-sized patches of pixels. The patches are often 16 x 16 pixels. ViTs are depicted in Figures 12A, 12B, 13A, 13B, 13C, and 13D. Unfortunately, important positional information is lost because the set of patches is position-invariant. This problem is solved by adding a learned positional encoding into the image patches.
[0152] The computations of the ViT architecture can be summarized as follows. The first layer of a ViT extracts a fixed number of patches from an input image (Figure 12A). The patches are then projected to linear embeddings. A special class token vector is added to the sequence of embedding vectors to include all representative information of all tokens through the multi-layer encoding procedure. The class vector is unique to each image. Vectors containing positional information are combined with the embeddings and the class token. The sequence of embedding vectors is passed into the Transformer blocks. The class token vector is extracted from the output of the last Transformer block and is passed into a multilayer perceptron (MLP) head whose output is the final classification. The perceptron takes the normalized input and places the output in categories. It classifies the images. This procedure directly translates into the Python Keras code shown in Figure 14.
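The patch-embedding front end of a ViT can be sketched as follows. This is a simplified NumPy illustration of the steps described above (patch extraction, linear projection, class token, positional encoding), not the Keras code of Figure 14, and all dimensions and weights are illustrative stand-ins.

```python
import numpy as np

def vit_embed(image, patch_size, W_proj, class_token, pos_embed):
    """Sketch of the ViT input pipeline: split an image into fixed-size patches,
    project each flattened patch to an embedding, prepend a class token, and
    add positional encodings."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
    patches = np.stack(patches)                 # (num_patches, patch_size^2 * C)
    tokens = patches @ W_proj                   # linear patch embeddings
    tokens = np.vstack([class_token, tokens])   # prepend the class token
    return tokens + pos_embed                   # add positional information

# Toy example: a 32x32 RGB image, 16x16 patches, embedding size 64.
rng = np.random.default_rng(3)
img = rng.normal(size=(32, 32, 3))
p, d = 16, 64
W_proj = rng.normal(size=(p * p * 3, d))
cls = rng.normal(size=(1, d))
pos = rng.normal(size=((32 // p) ** 2 + 1, d))
print(vit_embed(img, p, W_proj, cls, pos).shape)  # (5, 64)
```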
[0153] When the input image is split into patches, a fixed patch size is specified before instantiating a ViT. Given the quadratic complexity of attention, patch size has a large effect on the length of training and inference time. A single Transformer block comprises several layers. The first layer implements Layer Normalization, followed by the multi-head attention that is responsible for the performance of ViTs. In the depiction of a Transformer block in Figure 9B, we can see two arrows. These are residual skip connections. Including skip connection data can simplify the output and improve the results. The output of the multi-head attention is followed again by Layer Normalization. And finally, the output layer is an MLP (Multi-Layer Perceptron) with the GELU (Gaussian Error Linear Unit) activation function.
[0154] ViTs can be pretrained and fine-tuned. Pretraining is generally done on a large dataset. Fine-tuning is done on a domain specific dataset.
[0155] Domain-specific architectures, like convolutional neural networks (CNNs) or long short-term memory networks (LSTMs), have been derived from the usual architecture of MLPs and may suffer from so-called inductive biases that predispose the networks towards a certain output. ViTs stepped in the opposite direction from CNNs and LSTMs and became more general architectures by eliminating inductive biases. A ViT can be seen as a generalization of MLPs because MLPs, after being trained, do not change their weights for different inputs. On the other hand, ViTs compute their attention weights at runtime based on the particular input.
Transformer Models as applied to Genomics
[0156] The following discussion describes some implementations of how the Transformer models process a genomic sequence and produce position-wise nucleotide classification of the genomic sequence.
[0157] The Transformer models include convolutional layers that can detect local patterns, and thereby enhance the detection of nucleotide motifs. The Transformer models process a genome sequence in consecutive segments of length l. Every input nucleotide x ∈ {A, C, G, T} is first transformed into a vector embedding h^(0), after which it is transformed k times through addition (residual connection) with another vector, obtained by the multi-head attention function present in each layer (h^(0) → ... → h^(k)).
[0158] A set of fully connected layers transforms h(k) into a model output y(k). For each residual block, the vector that is summed with the input (to obtain h(1), . . . , h(k)) is calculated using the hidden states of the l upstream positions.
[0159] The multi-head attention applied in each residual block is methodologically identical. From each input hidden state h, a query (q), key (k), and value (v) vector of equal shapes are calculated. The output z of the attention head, applied on the hidden state at position n, is calculated as follows:
z(n) = softmax( q(n) K^T / √d_head ) V,
where K, V ∈ ℝ^(l × d_head) are the matrices that are composed from the l upstream hidden states (e.g., K = [k(n−l+1), . . . , k(n)]).
[0160] The denominator is used to stabilize the scores based on the dimension of q, k, and v (d_head). The multiplication of the query vector with all the key vectors results in a vector of scores that is normalized over all input values using the softmax function. These scores are multiplied with the v vectors to calculate z (i.e., a linear combination). The attention scores denote the relevance of information present between two positions, where the multiplication of the q and k vectors functions as a lock-and-key encoding, which returns goodness-of-fit scores for the information embedded in two hidden states (defined by v).
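For illustration, the following NumPy sketch implements a single attention head over the l upstream hidden states exactly as in the formula above; the array names, sizes, and random inputs are assumptions of this sketch rather than the disclosed implementation.

```python
import numpy as np

def attention_head(H, Wq, Wk, Wv):
    """Single attention head over a window of hidden states.

    H  : (l, d_model) hidden states of the l upstream positions (last row = position n).
    Wq, Wk, Wv : (d_model, d_head) learned projection matrices.
    Returns z(n), the attention output for position n.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # (l, d_head) each
    q_n = Q[-1]                               # query vector of the current position n
    scores = (K @ q_n) / np.sqrt(K.shape[1])  # q(n) . k(i) / sqrt(d_head) for each upstream i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the l positions
    return weights @ V                        # linear combination of the value vectors

# Illustrative usage with assumed sizes: l = 8 positions, d_model = 32, d_head = 16.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 32))
Wq, Wk, Wv = (rng.normal(size=(32, 16)) for _ in range(3))
z_n = attention_head(H, Wq, Wk, Wv)           # shape (16,)
```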
[0161] In each residual block, multiple attention heads are present (hence, multi-head attention), each featuring its own unique set of model weights to calculate q, k, and v. As such, multiple types of information can be extracted from the input hidden states. The outcome of the different attention heads within the same layer is further processed into a single vector, which is summed with h to obtain the hidden state of the next layer (e.g., h(1) → h(2)).
[0162] Contextual information embedded within the hidden states derived from single nucleotides is limited. Motifs formed from multiple neighboring nucleotides are deemed of greater importance towards biological processes. The addition of a convolutional layer allows the q, k, and v vectors to be derived from multiple neighboring hidden states without affecting the input/output resolution. Thereby, the retrieval of relevant information using attention is improved, resulting in improved predictive performances on a variety of tasks.
[0163] Positional information is incorporated into the vectors q, k, and v by superimposing (i.e., through summation) a positional encoding vector onto h. The added signal is a function of the vector index and the relative positioning with respect to the other input hidden states.
[0164] The annotation of DNA is a sequence labeling task that has correspondences in natural language processing. The DNA sequence is a data set of n nucleotides, i.e., X = {x(1), x(2), ..., x(n)}, where x ∈ {A, C, T, G}. The task comprises predicting a label y ∈ {0, 1} for each position x, where a positive label denotes the occurrence of an event at that position.
[0165] The Transformer models process the genome in sequential segments of l nucleotides. During training, a non-linear transformation function E is optimized that maps the input classes {A, C, T, G} to a vector embedding h of length d_model. For nucleotide x(i) on the genome: h = E(x(i)), x(i) ∈ {A, T, C, G}, where h ∈ ℝ^d_model.
[0166] The hidden states of each segment, H ∈ ℝ^(l × d_model), H = [h(1), . . . , h(l)], are processed through k layers. As such, the data propagation through the network for any input x follows multiple transformations: x → h(0) → h(1) → . . . → h(k) → y.
[0167] Within each layer, multi-head attention is calculated for each hidden state. Next, for each hidden state of H, the output of the multi-head attention step (MultiHead) is summed with the input, i.e., a residual connection, with the final step being layer normalization. The calculations of the output for all hidden states h in layer t at position m of segment s are performed in parallel:
h(s, t+1, m) = LayerNorm(h(s, t, m) + MultiHead(H(s, t))), or
H(s, t+1) = LayerNorm(H(s, t) + MultiHead(H(s, t))), where t ∈ [0, k[ and m ∈ [1, l].
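For illustration, a minimal Keras sketch of this per-layer update is shown below; the layer sizes, number of heads, and segment length are assumed values, not those of the disclosed models.

```python
import tensorflow as tf
from tensorflow.keras import layers

d_model, num_heads, k_layers, segment_len = 32, 4, 6, 512  # assumed hyperparameters

mha = [layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
       for _ in range(k_layers)]
norm = [layers.LayerNormalization() for _ in range(k_layers)]

def encode_segment(H):                       # H: (batch, segment_len, d_model)
    for t in range(k_layers):                # H(s, t+1) = LayerNorm(H(s, t) + MultiHead(H(s, t)))
        H = norm[t](H + mha[t](H, H))
    return H

H0 = tf.random.normal((1, segment_len, d_model))
Hk = encode_segment(H0)                      # hidden states after k layers
```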
[0168] After a forward pass through k layers, a final linear combination reduces the dimension of the output hidden state (d_model) to the number of output classes. In one implementation, only binary classification is performed. In another implementation, a softmax layer is applied before obtaining the prediction value y(i) for nucleotide x(i).
Particular Implementations of Deep Learning Networks for Evolutionary Conservation
[0169] Figure 15 illustrates an example system 1500 including the first model 1520 that predicts an identity of a masked base on the query sequence aligned with non-query sequences. The first input 1510 is an MSA of five 5nt sequences. The query sequence is from human, with a masked base flanked by two right and two left bases. The four non-query sequences are from the four species closest to human, namely, chimpanzee, gorilla, gray slender loris, and Sumatran orangutan.
[0170] In other implementations, the four species that are closest to human in terms of evolutionary distance can be the chimpanzee, the bonobo, the western lowland gorilla, and the eastern gorilla. The assumption is that, since these species are close to human in the phylogeny, with a mean branch length of only 0.008, the predicted probability from this deep learning model should account for the neutral substitution rate in the genome.
[0171] The first model 1520 in this implementation is a neutral substitution rate model, which employs only information from the closest species to specify the ancestral allele. The first model 1520 can be a convolutional neural network including a plurality of 1D convolution layers 1522 and 1524 and one or more dense layers 1526. One-hot-encoded sequences in the form of vectors are convolved with kernels of the 1D convolution layers, via multiplication and summation operations. The dense layer of the one or more dense layers 1526 is deeply connected to its preceding layer and applies activation functions to the output from the preceding layer.
[0172] Figure 16 illustrates an example architecture of the first model with layer dimensions. As one of the most useful and efficient techniques for analyzing sequence data in deep learning architectures, 1D convolutional layers are used here as the core building blocks of the example architecture. The ℝ^(5×5)-dimensional input MSA (ℝ^25 when flattened) is first passed through a masking layer that masks the center nucleotide in the human sequence. The MSA with the masked nucleotide is then passed through an embedding layer that embeds the input MSA into a 50-dimensional embedding space. This embedded MSA is next passed through 32 1D convolutional units with a kernel size of 5 and a stride of 5. This 1D convolutional layer is designed to process sequence information from each species. The next layer of 1D convolutional units also contains 32 units and has a kernel size of 5 with a stride of 1. This layer is designed to process sequence information across the species. Hence, by stacking the convolutional layers, the DL model is designed to process sequence information both along the rows of the MSA and along the columns of the MSA. The output of the last convolutional layer is then passed through a dense layer. The final layer is a six-dimensional SoftMax layer. Except for the final layer, all the intermediate layers use ReLU activation.
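For illustration, the Keras sketch below follows the layer description above. The token vocabulary, the mask-token index, the width of the intermediate dense layer, the class ordering of the six-way output, and the row-wise flattening of the MSA are assumptions of this sketch, not the disclosed implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed token vocabulary: A, C, G, T, gap (-), unknown (X), plus a mask token.
VOCAB_SIZE, EMBED_DIM, MASK_TOKEN = 7, 50, 6

def build_first_model():
    inputs = layers.Input(shape=(25,), dtype="int32")      # 5 species x 5 nt, flattened row-wise
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)    # (25, 50) embedded MSA
    # Kernel 5, stride 5: one window per species row, i.e., within-species sequence patterns.
    x = layers.Conv1D(32, kernel_size=5, strides=5, activation="relu")(x)   # (5, 32)
    # Kernel 5, stride 1: combines the per-species summaries across species (MSA columns).
    x = layers.Conv1D(32, kernel_size=5, strides=1, activation="relu")(x)   # (1, 32)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)              # width of this dense layer is assumed
    outputs = layers.Dense(6, activation="softmax")(x)      # assumed order: A, C, G, T, -, X
    return models.Model(inputs, outputs)

model = build_first_model()
# Illustrative masking: overwrite the center position of the human (first) row with the mask token.
msa = tf.random.uniform((1, 25), maxval=6, dtype=tf.int32)
msa = tf.tensor_scatter_nd_update(msa, indices=[[0, 2]], updates=[MASK_TOKEN])
probs = model(msa)                                          # six-way distribution for the masked base
```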
[0173] The first output 1530 is a predicted identity of the masked base in the query sequence (labeled by "?"), which can be a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). For example, the aforementioned six-dimensional SoftMax layer can output the predicted probabilities of the masked nucleotide in the neutral substitution rate (NR) model, Y_NR = {y_A, y_C, y_G, y_T, y_-, y_X}. Alternatively, a different SoftMax layer can output the predicted probabilities of the masked nucleotide, Y_NR = {y_A, y_C, y_G, y_T}. Figure 15 illustrates two examples of the predicted identity as a probability distribution of the masked base. The output 1540 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G. The masked base is predicted to be base C with the highest likelihood (p ≈ 0.9). The output 1550 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). Consistent with the output 1540, the masked base is predicted to be base C with the highest likelihood (log p > 10^-1).
[0174] Figure 17 illustrates an example system 1700 including the second model 1720 that predicts an identity of a masked base on the query sequence aligned with non-query sequences. Unlike the first input 1510, which includes sequences from the 5 closest species, the second input 1710 is a full alignment of 236 primate species, including all of the non-query sequences in the first input 1510. The query sequence in the second input 1710 can be identical to the query sequence in the first input 1510.
[0175] The second model 1720 in this implementation is a conservation-adjusted substitution rate model, which utilizes information from the full MSA of 236 primate species. The second model 1720 can be a convolutional neural network, including a plurality of 1D convolution layers 1722 and 1724 as building blocks of the architecture, and one or more dense layers 1726. The number of 1D convolution layers (1722 and 1724) and dense layers of the one or more dense layers 1726 can be different from (e.g., larger than) the number of layers in the first model.
[0176] As shown, the second model 1720 generates a second output 1730. Similar to the two types of output from the first model, here, the output 1740 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G. The masked base is predicted to be base C with the highest likelihood (p ≈ 1.0). The output 1750 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). The masked base is predicted to be base C with the highest likelihood (log p ≈ 10^0).
[0177] Figure 18 illustrates an example architecture of the second model with layer dimensions. In one implementation, to harness information from all 230 species, the second model uses a larger number of convolutional layers and dense layers as compared to the first model. The input for the second model has a dimension of ℝ^(230×5) (ℝ^1150 when flattened). Like the first model, the input passes through the masking and embedding layers. To reduce overfitting, several dropout layers are used throughout the architecture of the conservation-adjusted substitution rate model. The final SoftMax layer outputs the predicted probabilities of the masked nucleotide in the conservation-adjusted substitution rate (CR) model, Y_CR = {y_A, y_C, y_G, y_T, y_-, y_X}. The underlying assumption is that, since the model is predicting the masked nucleotide given the sequence context of all 230 species, the output probabilities should account for the evolutionary conservation-adjusted substitution rate.
[0178] If a nucleotide is highly conserved in the MSA, the second model (e.g., conservation-adjusted substitution rate model) predicts that it is less likely for this masked human nucleotide to change from its ancestral state due to evolutionary constraint. In the absence of conservation, the two models will converge on a higher substitution probability consistent with the rate of change in neutral sequence.
[0179] The evolutionary conservation determination logic can be used to determine how significantly the two models diverge in their predictions for each position in the genome based on the predicted identities (e.g., represented by probability functions) of the masked base. Based on the divergence, the evolutionary conservation determination logic can calculate a per base conservation score to represent the respective evolutionary conservation.
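For illustration, one way the per base conservation score could be computed from the two predicted distributions is sketched below, using the KL divergence discussed in the next paragraph; the choice of reference distribution, the six-class ordering, and the use of KL rather than a t-statistic are assumptions of this sketch.

```python
import numpy as np

def conservation_score(p_neutral, p_conserved, eps=1e-12):
    """Per-base divergence between the neutral-rate and conservation-adjusted predictions.

    p_neutral, p_conserved : arrays of shape (num_classes,) summing to 1
                             (e.g., probabilities over A, C, G, T, -, X).
    Returns D_KL(p_conserved || p_neutral); larger values suggest stronger conservation.
    """
    p = np.clip(p_conserved, eps, 1.0)
    q = np.clip(p_neutral, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Illustrative example: the second model is far more certain the masked base is C.
p_nr = np.array([0.05, 0.80, 0.05, 0.08, 0.01, 0.01])       # first model (neutral rate)
p_cr = np.array([0.002, 0.99, 0.002, 0.004, 0.001, 0.001])  # second model (conservation-adjusted)
score = conservation_score(p_nr, p_cr)
```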
[0180] The divergence can be determined using t-statistics, which measure the ratio of the departure of an estimated value of a parameter from its hypothesized value to its standard error. Alternatively, the divergence can be determined using the KL divergence, D_KL(P ∥ Q), which measures how a probability distribution Q is different from a reference probability distribution P.
Training of the First Model and the Second Model
[0181] Both the first and second models can be trained on ~2.5 million randomly sampled 5nt wide MSAs from the entire human genome. In one implementation, we trained both models using a query sequence from human. That is, the first model is trained by using the query sequence from human and non-query sequences from four of the closest species to human in the MSA. The second model is trained by using the query sequence from human and non-query sequences from all 236 primate species. The center nucleotide on the query sequence is masked in both of the models.
[0182] To train these models, we can minimize the categorical cross entropy loss between the true nucleotide and the predicted masked nucleotide using the Adam optimizer. The models are trained withholding 20% of the data as a test set. In one implementation, the models are trained for 30 epochs with a batch size of 1000 samples. Figure 19 illustrates training and testing curves for the first model (e.g., the neutral substitution rate model) and the second model (e.g., the conservation-adjusted substitution rate model). The x axis shows the number of epochs, and the y axis shows the categorical cross entropy loss of the deep learning networks.
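For illustration, the training setup described above might look as follows in Keras. The stand-in model and synthetic data arrays, and the use of sparse (integer-label) categorical cross entropy, are assumptions of this sketch; the real architecture from the first-model sketch would be substituted here.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal stand-in with the same input/output contract as the first-model sketch
# (25 integer tokens in, six-way softmax out); swap in build_first_model() if defined.
inp = layers.Input(shape=(25,), dtype="int32")
h = layers.Flatten()(layers.Embedding(7, 50)(inp))
out = layers.Dense(6, activation="softmax")(layers.Dense(64, activation="relu")(h))
model = models.Model(inp, out)

# Illustrative stand-in data; in the described setup these would be ~2.5 million masked
# 5nt MSAs sampled genome-wide, labeled with the true identity of the masked base.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(10_000, 25)).astype("int32")
y = rng.integers(0, 6, size=(10_000,)).astype("int32")

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy",   # categorical cross entropy over the six classes
              metrics=["accuracy"])
history = model.fit(X, y,
                    validation_split=0.2,               # withhold 20% of the data as a test set
                    epochs=30, batch_size=1000, verbose=0)

# Keep the weights from the epoch with the lowest test loss for genome-wide scoring.
best_epoch = int(np.argmin(history.history["val_loss"])) + 1
```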
[0183] To calculate prediction probabilities for all bases genome-wide, the trained model weights with the lowest test loss can be chosen (e.g., weights at epoch 14 for the conservation-adjusted substitution rate model and weights at epoch 18 for the neutral substitution rate model). Since the conservation-adjusted substitution rate model leverages sequence context from a larger number of species, it achieves 11% lower test loss as compared to the neutral substitution rate model. The accuracy of predicting the masked nucleotide in both models is shown in Figure 20. Even though the model only uses 5nt wide MSAs, it achieves test accuracy > 99%. This suggests that for the task of masked nucleotide prediction from an MSA, the 5nt window length can be sufficient.
[0184] In some implementations, deep learning architectures using Transformers instead of 1D convolutional layers are tested. When using a small-length MSA, the Transformer model achieves similar performance as that of the model with 1D convolutional layers. For example, with 1D convolutional layers the conservation-adjusted substitution rate model achieved a lowest test loss of 0.0413, and with Transformers the model achieved a lowest test loss of 0.0414. In other implementations, Transformers, Long Short-Term Memory (LSTM) networks, or any other recurrent deep learning neural networks as described above can be used as the first and/or second model to predict the evolutionary conservation at single-nucleotide resolution in the MSA.
Objective Indicia of Inventiveness and Non-Obviousness
Evaluation of the Deep Learning Networks for Evolutionary Conservation
[0185] The performance of the technology disclosed is evaluated by comparison with the PhyloP method, a test used for detecting nonneutral substitution rates on mammalian phylogenies. For the PhyloP scores, the base-wise conservation scores are generated from an MSA containing 230 primate species using the phyloFit and phyloP functions from the PHAST package. Both the PhyloP method and the technology disclosed are able to detect conservation in protein-coding exons over the surrounding intronic sequence. Figure 21 illustrates the mean conservation scores plotted for the deep learning system described herein (top) and the PhyloP method (bottom), at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue).
[0186] However, the deep learning conservation scores generated from various implementations of the technology disclosed are capable of finer grained distinctions, compared to the step functions seen in the distribution of PhyloP scores due to the discontinuities in phylogenetic branch length. Figures 22A and 22B illustrate the distribution of conservation scores in protein-coding exons at codon position 2 (blue) compared to codon position 3 (orange), measured by deep learning conservation scores (Figure 22A) and PhyloP scores (Figure 22B), respectively. The x axis of Figures 22A and 22B is a score distribution and the y axis is the score density. Consequently, when comparing the nucleotides that are most highly scored by each method (> 95th percentile), the top conserved nucleotides identified by the deep learning system are highly enriched at codon position 2 (where mutations always alter the protein) relative to codon position 3 (where mutations are often synonymous), compared to weaker enrichment for PhyloP.
[0187] Figure 23 is a bar chart showing the enrichment of highly conserved nucleotides (> 95th percentile of scores) falling in codon position 2 relative to codon position 3. The x axis is the percentile while the y axis is an enrichment ratio between codon position 2 and codon position 3 for the most highly conserved nucleotides identified by the deep learning system (blue) and PhyloP method (orange). The highly conserved nucleotides identified by the deep learning system are highly enriched at codon position 2 compared to codon position 3.
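For illustration, the enrichment ratio of Figure 23 could be computed along the following lines; the array names, the percentile handling, and the synthetic data are assumptions of this sketch, not the disclosed evaluation code.

```python
import numpy as np

def enrichment_ratio(scores, codon_pos, percentile=95):
    """Ratio of highly conserved bases at codon position 2 versus codon position 3.

    scores    : per-base conservation scores for coding nucleotides.
    codon_pos : matching array of codon positions (1, 2, or 3).
    """
    cutoff = np.percentile(scores, percentile)
    top = scores > cutoff
    n_pos2 = np.count_nonzero(top & (codon_pos == 2))
    n_pos3 = np.count_nonzero(top & (codon_pos == 3))
    return n_pos2 / max(n_pos3, 1)

# Illustrative data: position-2 scores drawn from a slightly higher distribution.
rng = np.random.default_rng(0)
codon_pos = rng.integers(1, 4, size=30_000)
scores = rng.normal(loc=(codon_pos == 2) * 0.5, scale=1.0)
ratio = enrichment_ratio(scores, codon_pos)   # > 1 indicates enrichment at codon position 2
```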
[0188] The PhyloP method relies on a prior of evolutionary distance between the species in the MSA. Hence, PhyloP is sensitive to the distribution of branch lengths in the evolutionary tree. While it works well for measuring evolutionary conservation on a diverse group of species (e.g., mammals) with large evolutionary distances between species, it is computationally underpowered when measuring conservation on a closely related group of species (e.g., primates) with short evolutionary distances between each other. As a result, when PhyloP was used to calculate per base conservation genome-wide, no individual bases were found to be significantly conserved over chance (P < 0.01), due to an insufficient branch length within the primate species.
[0189] On the other hand, the deep learning system as described here can identify sequence elements that are exclusively conserved in primates and can detect conservation at short evolutionary distances. The deep learning system's improved sensitivity can be attributed to more accurate modeling of substitution probabilities at nucleotides with high intrinsic mutation rates, which provides sufficient information to identify conserved bases even within the short evolutionary timescales of the primate lineage.
[0190] The deep learning system can identify protein-coding exons and genes that are conserved only in primates. Genome-wide, only 0.5% of human protein-coding exons (1001 exons) are significantly conserved in primates but not in mammals. The deep learning system is essential for improving the sensitivity to detect primate-conserved exons, enabling the identification of 8,973 additional exons that are deemed not significant using the PhyloP method. Most of these newly identified exons (85%) are conserved in mammals but are missed by the primate PhyloP method due to the short branch lengths in primates.
[0191] Furthermore, the deep learning system can identify primate-conserved noncoding cis-regulatory elements. Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites. The deep learning system can identify primate-conserved DNase-I hypersensitivity sites at short branch lengths, adding 57,496 DNase-I hypersensitivity sites that are missed by the primate PhyloP method. These identifications also indicate that a significant fraction of cis-regulatory elements that were previously believed to lack conservation represent recent evolutionary adaptations, with sequence conservation detectable in primates.
Computer System
[0192] Figure 24 is an example computer system 2400 that can be used to implement the technology disclosed. Computer system 2400 includes at least one central processing unit (CPU) 2472 that communicates with a number of peripheral devices via bus subsystem 2455. These peripheral devices can include a storage subsystem 2410 including, for example, memory devices and a file storage subsystem 2436, user interface input devices 2438, user interface output devices 2476, and a network interface subsystem 2474. The input and output devices allow user interaction with computer system 2400. Network interface subsystem 2474 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
[0193] In one implementation, the first model 104, the second model 114, and the evolutionary conservation determination logic 108 are communicably linked to the storage subsystem 2410 and the user interface input devices 2438.
[0194] User interface input devices 2438 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2400.
[0195] User interface output devices 2476 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 2400 to the user or to another machine or computer system.
[0196] Storage subsystem 2410 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2478.
[0197] Processors 2478 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 2478 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 2478 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
[0198] Memory subsystem 2422 used in the storage subsystem 2410 can include a number of memories including a main random access memory (RAM) 2432 for storage of instructions and data during program execution and a read only memory (ROM) 2434 in which fixed instructions are stored. A file storage subsystem 2436 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 2436 in the storage subsystem 2410, or in other machines accessible by the processor.
[0199] Bus subsystem 2455 provides a mechanism for letting the various components and subsystems of computer system 2400 communicate with each other as intended. Although bus subsystem 2455 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
[0200] Computer system 2400 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2400 depicted in Figure 24 is intended only as a specific example for purposes of illustrating the preferred implementations of the technology disclosed. Many other configurations of computer system 2400 are possible having more or fewer components than the computer system depicted in Figure 24.
Clauses
[0201] The technology disclosed, in particular the clauses disclosed in this section, can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections - these recitations are hereby incorporated forward by reference into each of the following implementations.
[0202] One or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
[0203] The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.
[0204] Other implementations of the clauses described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
[0205] We disclose the following clauses:
1. A system, comprising: a first model configured to process a first multiple sequence alignment that aligns a query sequence to N non-query sequences, wherein the query sequence includes a masked base at a target position that is flanked by right and left bases, and predict a first identity of the masked base at the target position; a second model configured to process a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M > N, predict a second identity of the masked base at the target position; and an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.
2. The system of clause 1, wherein the N non-query sequences are N closest homologues to the query sequence.
3. The system of clause 1, wherein the M non-query sequences are M closest homologues to the query sequence.
4. The system of clause 1, wherein the M non-query sequences include the N non-query sequences.
5. The system of clause 1, wherein the query sequence belongs to a first species.
6. The system of clause 5, wherein the first species is human.
7. The system of clause 5, wherein the N non-query sequences are non-human primates.
8. The system of clause 7, wherein the N non-query sequences belong to a first group of species that shares a family with the first species.
9. The system of clause 8, wherein the family is hominids.
10. The system of clause 8, wherein the first group of species shares an order with the first species.
11. The system of clause 10, wherein the order is primates.
12. The system of clause 5, wherein the M non-query sequences belong to a second group of species that shares a class with the first species.
13. The system of clause 12, wherein the class is mammals.
14. The system of clause 12, wherein the second group of species shares a phylum with the first species.
15. The system of clause 14, wherein the phylum is chordates.
16. The system of clause 12, wherein the second group of species shares a kingdom with the first species.
17. The system of clause 16, wherein the kingdom is animals.
18. The system of clause 1, wherein a first average evolutionary distance determined from evolutionary distances between species in the first group of species to which the N non-query sequences belong is less than a second average evolutionary distance determined from evolutionary distances between species in the second group of species to which the M non-query sequences belong.
19. The system of clause 18, wherein non-overlapping species between the first and second groups of species are more evolutionarily distant from the first species than overlapping species between the first and second groups of species.
20. The system of clause 1, wherein the first model is further configured to predict the first identity as a first probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
21. The system of clause 20, wherein the first probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
22. The system of clause 20, wherein the second model is further configured to predict the second identity as a second probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G.
23. The system of clause 22, wherein the second probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, -, and X.
24. The system of clause 22, wherein the evolutionary conservation determination logic is further configured to measure the evolutionary conservation of the masked base at the target position based on the first probability distribution and the second probability distribution.
25. The system of clause 24, wherein the evolutionary conservation determination logic is further configured to use t-test statistics to measure the evolutionary conservation of the masked base at the target position.
26. The system of clause 1, wherein the first model is further configured to encode a background mutation probability estimation in the first identity of the masked base at the target position.
27. The system of clause 1, wherein the second model is further configured to encode a groupspecific mutation probability estimation in the second identity of the masked base at the target position.
28. The system of clause 1, wherein the first and second models are neural networks.
29. The system of clause 28, wherein the first and second models are feed-forward neural networks.
30. The system of clause 28, wherein the first and second models are recurrent neural networks.
31. The system of clause 30, wherein the first and second models are long short-term memory (LSTM) networks.
32. The system of clause 28, wherein the first and second models are convolutional neural networks (CNNs).
33. The system of clause 32, wherein the first and second models are Transformer-based CNNs.
34. The system of clause 24, wherein the first and second probability distributions are first and second softmax distributions.
35. The system of clause 1, wherein the first and second models are trained using a categorical cross-entropy loss function.
36. The system of clause 1, wherein the background mutation probability estimation measures random mutation experienced in genes.
37. The system of clause 1, wherein the group-specific mutation probability estimation measures epistasis-driven mutations experienced in genes of a particular group of species.
38. The system of clause 1, wherein the evolutionary conservation is measured as a frequency of occurrence of a particular base at a particular position across the multiple sequence alignment.
39. The system of clause 1, wherein the evolutionary conservation is measured as a frequency of substitution of a particular base at a particular position across the multiple sequence alignment.
40. The system of clause 1, wherein the first and second models are configured to detect evolutionary distances and branch lengths implicitly encoded in the multiple sequence alignment as second-order signals.
41. The system of clause 1, wherein the masked base is flanked by two right bases and two left bases.
42. The system of clause 1, wherein the first and second models are configured to predict alternative splicing at the masked base in response to processing the multiple sequence alignment that aligns sequences of length greater than K.
43. The system of clause 42, wherein K is greater than 5.
44. The system of clause 1, wherein the first and second models are configured to detect conservation-driving genomic features.
45. The system of clause 1, further configured to determine a function or feature of the masked base in dependence upon the evolutionary conservation.
[0206] What we claim is:

Claims

1. A system, comprising: a first model configured to process a first multiple sequence alignment that aligns a query sequence to N non-query sequences, wherein the query sequence includes a masked base at a target position that is flanked by right and left bases, and predict a first identity of the masked base at the target position; a second model configured to process a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M > N, predict a second identity of the masked base at the target position; and an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.
2. The system of claim 1, wherein the N non-query sequences are N closest homologues to the query sequence.
3. The system of claim 1 , wherein the M non-query sequences are M closest homologues to the query sequence.
4. The system of claim 1, wherein the M non-query sequences include the N non-query sequences.
5. The system of claim 1, wherein the query sequence belongs to a first species.
6. The system of claim 5, wherein the first species is human.
7. The system of claim 5, wherein the N non-query sequences are non-human primates.
8. The system of claim 7, wherein the N non-query sequences belong to a first group of species that shares a family with the first species.
9. The system of claim 8, wherein the family is hominids.
10. The system of claim 8, wherein the first group of species shares an order with the first species.
11. The system of claim 10, wherein the order is primates.
12. The system of claim 5, wherein the M non-query sequences belong to a second group of species that shares a class with the first species.
13. The system of claim 12, wherein the class is mammals.
14. The system of claim 12, wherein the second group of species shares a phylum with the first species.
15. The system of claim 14, wherein the phylum is chordates.
16. The system of claim 12, wherein the second group of species shares a kingdom with the first species.
17. The system of claim 16, wherein the kingdom is animals.
18. The system of claim 1, wherein a first average evolutionary distance determined from evolutionary distances between species in a first group of species to which the N non-query sequences belong is less than a second average evolutionary distance determined from evolutionary distances between species in a second group of species to which the M non-query sequences belong.
19. The system of claim 18, wherein non-overlapping species between the first and second groups of species are more evolutionarily distant from the first species than overlapping species between the first and second groups of species.
20. The system of claim 1, wherein the first model is further configured to predict the first identity as a first probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
21. The system of claim 20, wherein the first probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
22. The system of claim 20, wherein the second model is further configured to predict the second identity as a second probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G.
23. The system of claim 22, wherein the second probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, -, and X.
24. The system of claim 22, wherein the evolutionary conservation determination logic is further configured to measure the evolutionary conservation of the masked base at the target position based on the first probability distribution and the second probability distribution.
25. The system of claim 24, wherein the evolutionary conservation determination logic is further configured to use t-test statistics to measure the evolutionary conservation of the masked base at the target position.
26. The system of claim 1, wherein the first model is further configured to encode a background mutation probability estimation in the first identity of the masked base at the target position.
27. The system of claim 1, wherein the second model is further configured to encode a groupspecific mutation probability estimation in the second identity of the masked base at the target position.
PCT/US2022/082467 2021-12-29 2022-12-28 Deep learning network for evolutionary conservation determination of nucleotide sequences WO2023129957A1 (en)

Applications Claiming Priority (14)

Application Number Priority Date Filing Date Title
US202163294820P 2021-12-29 2021-12-29
US202163294827P 2021-12-29 2021-12-29
US202163294830P 2021-12-29 2021-12-29
US202163294813P 2021-12-29 2021-12-29
US202163294828P 2021-12-29 2021-12-29
US202163294816P 2021-12-29 2021-12-29
US63/294,813 2021-12-29
US63/294,820 2021-12-29
US63/294,816 2021-12-29
US63/294,827 2021-12-29
US63/294,830 2021-12-29
US63/294,828 2021-12-29
US17/947,049 US20230207054A1 (en) 2021-12-29 2022-09-16 Deep learning network for evolutionary conservation
US17/947,049 2022-09-16

Publications (1)

Publication Number Publication Date
WO2023129957A1 true WO2023129957A1 (en) 2023-07-06

Family

ID=85172825

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/082467 WO2023129957A1 (en) 2021-12-29 2022-12-28 Deep learning network for evolutionary conservation determination of nucleotide sequences

Country Status (1)

Country Link
WO (1) WO2023129957A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012034030A1 (en) * 2010-09-09 2012-03-15 Omicia, Inc. Variant annotation, analysis and selection tool
US20150363546A1 (en) * 2014-06-17 2015-12-17 Genepeeks, Inc. Evolutionary models of multiple sequence alignments to predict offspring fitness prior to conception
US20200097835A1 (en) * 2014-06-17 2020-03-26 Ancestry.Com Dna, Llc Device, system and method for assessing risk of variant-specific gene dysfunction
US10354747B1 (en) * 2016-05-06 2019-07-16 Verily Life Sciences Llc Deep learning analysis pipeline for next generation sequencing
US20190318806A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant Classifier Based on Deep Neural Networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JAGANATHAN, K. ET AL.: "Predicting splicing from primary sequence with deep learning", CELL, vol. 176, 2019, pages 535 - 548
R.L. NAGEL, C. R. BIOLOGIES, vol. 328, 2005, pages 606 - 615
SMITH BENJAMIN ET AL: "Sequence Imputation of HPV16 Genomes for Genetic Association Studies", PLOS ONE, vol. 6, no. 6, 23 June 2011 (2011-06-23), pages e21375, XP093039157, Retrieved from the Internet <URL:https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0021375&type=printable> [retrieved on 20230421], DOI: 10.1371/journal.pone.0021375 *
SUNDARAM, L. ET AL.: "Predicting the clinical impact of human mutation with deep neural networks", NAT. GENET., vol. 50, 2018, pages 1161 - 1170, XP036902750, DOI: 10.1038/s41588-018-0167-z

Similar Documents

Publication Publication Date Title
US20230207054A1 (en) Deep learning network for evolutionary conservation
Dalla-Torre et al. The nucleotide transformer: Building and evaluating robust foundation models for human genomics
CN116072227B (en) Marine nutrient biosynthesis pathway excavation method, apparatus, device and medium
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
Oriol Sabat et al. SALAI-Net: species-agnostic local ancestry inference network
Huang et al. Harnessing deep learning for population genetic inference
WO2023129955A1 (en) Inter-model prediction score recalibration
Dias et al. Rapid, Reference-Free human genotype imputation with denoising autoencoders
US20220336057A1 (en) Efficient voxelization for deep learning
Zheng et al. MaskDNA-PGD: An innovative deep learning model for detecting DNA methylation by integrating mask sequences and adversarial PGD training as a data augmentation method
US11515010B2 (en) Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures
WO2023129957A1 (en) Deep learning network for evolutionary conservation determination of nucleotide sequences
WO2023014913A1 (en) Deep learning-based use of protein contact maps for variant pathogenicity prediction
KR20230170680A (en) Multi-channel protein voxelization to predict variant pathogenicity using deep convolutional neural networks
KR20230171930A (en) Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures
Gupta et al. DAVI: Deep learning-based tool for alignment and single nucleotide variant identification
US20230223100A1 (en) Inter-model prediction score recalibration
US20240112751A1 (en) Copy number variation (cnv) breakpoint detection
US20230045003A1 (en) Deep learning-based use of protein contact maps for variant pathogenicity prediction
US11538555B1 (en) Protein structure-based protein language models
WO2024030606A1 (en) Artificial intelligence-based detection of gene conservation and expression preservation at base resolution
US20230343413A1 (en) Protein structure-based protein language models
US20230047347A1 (en) Deep neural network-based variant pathogenicity prediction
WO2024030278A1 (en) Computer-implemented methods of identifying rare variants that cause extreme levels of gene expression
KR20240041876A (en) Deep learning-based use of protein contact maps for variant pathogenicity prediction.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22854393

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)