JP2024513995A - Multichannel protein voxelization to predict variant pathogenicity using deep convolutional neural networks

Info

Publication number: JP2024513995A
Application number: JP2023563033A
Authority: JP
Inventors: Tobias Hamp; Hong Gao; Kai-How Farh
Original assignee: Illumina, Inc.; Illumina Cambridge Limited
Priority date: 2021-04-15
Filing date: 2022-04-14
Publication date: 2024-03-27
Also published as: CA3215520A1; AU2022258691A1; EP4323989A1; EP4323991A1; CA3215514A1; IL307667A; KR20230170679A; KR20230170680A; WO2022221591A1; BR112023021343A2; BR112023021266A2; IL307661A; AU2022259667A1; WO2022221593A1; JP2024514894A

Abstract

A system includes at least a voxelizer, an alternative allele encoder, an evolutionary conservation encoder, and a convolutional neural network. The voxelizer accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels on atoms in the three-dimensional structure on an amino acid basis to generate amino acid-wise distance channels. The alternative allele encoder encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels. The evolutionary conservation encoder encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels. The convolutional neural network applies three-dimensional convolutions to a tensor that includes the amino acid-wise distance channels encoded with the alternative allele sequence and the respective evolutionary conservation sequences, and determines a pathogenicity of the variant nucleotide based at least in part on the tensor.
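The voxelizer and alternative allele encoder summarized above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each distance channel stores the distance from a voxel center to the nearest atom of one amino acid type. The 3x3x3 grid, the 2 Å voxel spacing, the three-letter toy alphabet, and all function names are hypothetical and are not taken from the disclosure; the evolutionary conservation channels and the convolutional neural network are omitted.

```python
import math

# Toy alphabet (assumption); a real system would cover all 20 amino acids.
AMINO_ACIDS = ["A", "G", "S"]

def voxel_centers(center, size=3, spacing=2.0):
    """Centers of a size^3 grid of voxels centered on `center`."""
    half = (size - 1) / 2
    return [
        (center[0] + (i - half) * spacing,
         center[1] + (j - half) * spacing,
         center[2] + (k - half) * spacing)
        for i in range(size) for j in range(size) for k in range(size)
    ]

def distance_channels(atoms, center, size=3, spacing=2.0):
    """One channel per amino acid: distance from each voxel center to the
    nearest atom of that amino acid (infinity if none is present).
    `atoms` is a list of (amino_acid, (x, y, z)) tuples."""
    centers = voxel_centers(center, size, spacing)
    channels = {}
    for aa in AMINO_ACIDS:
        coords = [xyz for a, xyz in atoms if a == aa]
        channels[aa] = [
            min((math.dist(c, xyz) for xyz in coords), default=math.inf)
            for c in centers
        ]
    return channels

def encode_alt_allele(channels, alt_aa):
    """Per-voxel feature vector: the distance channels followed by a one-hot
    encoding of the alternative allele amino acid, repeated in every voxel."""
    one_hot = [1.0 if aa == alt_aa else 0.0 for aa in AMINO_ACIDS]
    n_voxels = len(next(iter(channels.values())))
    return [
        [channels[aa][v] for aa in AMINO_ACIDS] + one_hot
        for v in range(n_voxels)
    ]

atoms = [("A", (0.0, 0.0, 0.0)), ("G", (2.0, 0.0, 0.0))]
channels = distance_channels(atoms, center=(0.0, 0.0, 0.0))
tensor = encode_alt_allele(channels, alt_aa="S")
```

Here `tensor` holds 27 voxels with six features each; in a full pipeline the per-voxel vectors would be extended with conservation values and passed to three-dimensional convolutions.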

Description

(Priority Applications)
This application claims priority to U.S. Nonprovisional Patent Application No. 17/703,935, entitled "Multi-channel Protein Voxelization To Predict Variant Pathogenicity Using Deep Convolutional Neural Networks," filed on March 24, 2022 (Attorney Docket No. ILLM 1047-2/IP-2142-US), which claims priority to U.S. Provisional Patent Application No. 63/175,495, entitled "Multi-channel Protein Voxelization To Predict Variant Pathogenicity Using Deep Convolutional Neural Networks," filed on April 15, 2021 (Attorney Docket No. ILLM 1047-1/IP-2142-PRV).

This application also claims priority to U.S. Nonprovisional Patent Application No. 17/703,958, entitled "Efficient Voxelization For Deep Learning," filed on March 24, 2022 (Attorney Docket No. ILLM 1048-2/IP-2143-US), which claims priority to U.S. Provisional Patent Application No. 63/175,767, entitled "Efficient Voxelization For Deep Learning," filed on April 16, 2021 (Attorney Docket No. ILLM 1048-1/IP-2143-PRV).

The priority applications are incorporated herein by reference for all purposes.

(Related Application)
This application is related to a concurrently filed PCT patent application entitled "Efficient Voxelization For Deep Learning" (Attorney Docket No. ILLM 1048-3/IP-2143-PCT). The related application is incorporated herein by reference for all purposes.

(Field of the Invention)
The disclosed technology relates to artificial intelligence type computers and digital data processing systems, and to corresponding data processing methods and products for emulating intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the disclosed technology relates to using deep convolutional neural networks to analyze multi-channel voxelized data.

(Incorporations)
The following are incorporated by reference for all purposes as if fully set forth herein:
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);
U.S. Provisional Patent Application No. 62/573,144, entitled "TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA," filed on October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);
U.S. Provisional Patent Application No. 62/573,149, entitled "PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)," filed on October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);
U.S. Provisional Patent Application No. 62/573,153, entitled "DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA," filed on October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);
U.S. Provisional Patent Application No. 62/582,898, entitled "PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)," filed on November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);
U.S. Nonprovisional Patent Application No. 16/160,903, entitled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," filed on October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);
U.S. Nonprovisional Patent Application No. 16/160,986, entitled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," filed on October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);
U.S. Nonprovisional Patent Application No. 16/160,968, entitled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," filed on October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US); and
U.S. Nonprovisional Patent Application No. 16/407,149, entitled "DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," filed on May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US).

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

Genomics in the broad sense, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling, and proteomics. Genomics arose as a data-driven science: it operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses. Applications of genomics include finding associations between genotype and phenotype, discovering biomarkers for patient stratification, predicting the function of genes, and charting biochemically active genomic regions such as transcriptional enhancers.

Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models, and to make predictions. Unlike some algorithms, in which assumptions and domain expertise are hard-coded, machine learning algorithms are designed to automatically detect patterns in data. Hence, machine learning algorithms are suited to data-driven sciences and, in particular, to genomics. However, the performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For instance, to classify a tumor as malignant or benign from a fluorescent microscopy image, a preprocessing algorithm could detect cells, identify the cell type, and generate a list of cell counts for each cell type.

A machine learning model can take the estimated cell counts, which are examples of handcrafted features, as input features to classify the tumor. A central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells, or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy.

Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models. This outcome has been realized through the development of deep neural networks, machine learning models that consist of successive elementary operations, which compute increasingly complex features by taking the results of preceding operations as input. Deep neural networks can improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and spatial organization of cells in the above example. The construction and training of deep neural networks have been enabled by the explosion of data, algorithmic advances, and substantial increases in computational capacity, particularly through the use of graphics processing units (GPUs).

The goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable. An example of a supervised learning problem is that of predicting whether an intron is spliced out or not (the target) given features on the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint, or the intron length. Training a machine learning model refers to learning its parameters, which commonly involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data.

For many supervised learning problems in computational biology, the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions. Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as deoxyribonucleic acid (DNA) sequence into k-mer counts) using a process called feature extraction to fit a tabular representation. For the intron-splicing prediction problem, the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint, and the intron length can be preprocessed features collected in a tabular format. Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression, to more flexible nonlinear models, such as neural networks, and many others.
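As an illustration of the feature extraction step described above, the following minimal Python sketch converts a DNA sequence into a row of k-mer counts. The function names and the choice of k = 2 are illustrative assumptions, not part of the disclosure.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def to_feature_row(sequence, kmers, k):
    """Tabular representation: one column per k-mer of interest."""
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in kmers]

# Columns AC, CG, GT, TA for the sequence ACGTACG.
row = to_feature_row("ACGTACG", ["AC", "CG", "GT", "TA"], k=2)
print(row)  # [2, 2, 1, 1]
```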

Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features and mapping it to the [0,1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or of other linear classifiers that use different activation functions, are the weights in the weighted sum. Linear classifiers fail when the classes, for instance, that of an intron spliced out or not, cannot be well discriminated with a weighted sum of the input features. To improve predictive performance, new input features can be manually added by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.
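The prediction rule described above can be sketched in a few lines of Python. The feature vectors, weights, and helper names are hypothetical; the second helper shows one way of manually adding pairwise-product features.

```python
import math

def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def add_pairwise_products(features):
    """Manually engineered features: append all pairwise products."""
    n = len(features)
    products = [features[i] * features[j]
                for i in range(n) for j in range(i + 1, n)]
    return list(features) + products

p = predict_proba([0.0, 0.0], weights=[1.0, 1.0], bias=0.0)
expanded = add_pairwise_products([2, 3, 4])
```

With all-zero features and zero bias the weighted sum is zero, so `p` is 0.5; `expanded` contains the three original features followed by the products 6, 8, and 12.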

Neural networks use hidden layers to learn these nonlinear feature transformations automatically. Each hidden layer can be thought of as multiple linear models whose outputs are transformed by a nonlinear activation function, such as the sigmoid function or the more widely used rectified linear unit (ReLU). Together, these layers compose the input features into relevant complex patterns, which facilitates the task of distinguishing two classes.
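A small worked example may help show why a hidden layer matters. The exclusive-or (XOR) function cannot be separated by a single weighted sum of its two inputs, but one hidden layer of two ReLU units suffices. In the Python sketch below the weights are set by hand for illustration rather than learned, and all names are hypothetical.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: several linear models followed by an activation."""
    return [activation(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def xor_network(x1, x2):
    # Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
    hidden = dense([x1, x2], [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    # Output layer: h1 - 2 * h2 (identity activation).
    (out,) = dense(hidden, [[1.0, -2.0]], [0.0], lambda z: z)
    return out

outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```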

Deep neural networks use many hidden layers, and a layer is said to be fully connected when each neuron receives inputs from all neurons of the preceding layer. Neural networks are commonly trained using stochastic gradient descent, an algorithm suited to training models on very large data sets. Implementations of neural networks using modern deep learning frameworks enable rapid prototyping with different architectures and data sets. Fully connected neural networks can be used for a number of genomics applications, which include predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation; prioritizing potential disease-causing genetic variants; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression, and evolutionary conservation.
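Stochastic gradient descent can be sketched for the simplest case, a one-feature logistic model, as follows. The toy data set, the learning rate, the number of epochs, and the fixed random seed are all illustrative assumptions; deep learning frameworks apply the same idea to networks with many layers.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(data, lr=0.5, epochs=200, seed=0):
    """Fit weight w and bias b by updating on one example at a time."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a random order
        for x, y in data:
            # Gradient of the log loss with respect to the weighted sum.
            error = sigmoid(w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b

# Toy, linearly separable data: negative inputs are class 0.
w, b = sgd_logistic([(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)])
```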

Local dependencies in spatial and longitudinal data must be considered for effective predictions. For example, shuffling a DNA sequence or the pixels of an image severely disrupts informative patterns. These local dependencies set spatial or longitudinal data apart from tabular data, for which the ordering of the features is arbitrary. Consider the problem of classifying genomic regions as bound versus unbound by a particular transcription factor, in which bound regions are defined as high-confidence binding events in chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Transcription factors bind to DNA by recognizing sequence motifs. A fully connected layer based on sequence-derived features, such as the number of k-mer instances or position weight matrix (PWM) matches in the sequence, can be used for this task. Because k-mer or PWM instance frequencies are robust to shifting motifs within the sequence, such models can generalize well to sequences with the same motifs located at different positions. However, they fail to recognize patterns in which transcription factor binding depends on a combination of multiple motifs with well-defined spacing. Furthermore, the number of possible k-mers increases exponentially with k-mer length, which poses both storage and overfitting challenges.

A convolutional layer is a special form of fully connected layer in which the same fully connected layer is applied locally, for example, in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple PWMs, for example, for the transcription factors GATA1 and TAL1. By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training. Each convolutional layer scans the sequence with several filters by producing, at every position, a scalar value that quantifies the match between the filter and the sequence. As in fully connected neural networks, a nonlinear activation function (commonly ReLU) is applied at each layer. Next, a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, commonly taking the maximal or average activation for each channel. Pooling reduces the effective sequence length and coarsens the signal. A subsequent convolutional layer composes the output of the previous layer and is able to detect whether a GATA1 motif and a TAL1 motif were present at some distance range. Finally, the output of the convolutional layers can be used as input to a fully connected neural network to perform the final prediction task. Hence, different types of neural network layers (e.g., fully connected layers and convolutional layers) can be combined within a single neural network.
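The scanning, activation, and pooling steps described above can be sketched as follows. The filter is a hand-set exact-match detector for the string GATA (standing in for a learned, PWM-like filter), and the bias, pooling size, and function names are illustrative assumptions.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-element indicator vector."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv1d(encoded, filt, bias=0.0):
    """Apply the same filter at every position, followed by ReLU."""
    width = len(filt)
    out = []
    for start in range(len(encoded) - width + 1):
        score = bias + sum(filt[offset][c] * encoded[start + offset][c]
                           for offset in range(width) for c in range(4))
        out.append(max(0.0, score))  # ReLU
    return out

def max_pool(values, size):
    """Maximum activation in contiguous, non-overlapping bins."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

gata_filter = one_hot("GATA")  # scores +1 per matching base
acts = conv1d(one_hot("CCGATACC"), gata_filter, bias=-3.0)
pooled = max_pool(acts, 2)
print(acts)    # [0.0, 0.0, 1.0, 0.0, 0.0]
print(pooled)  # [0.0, 1.0, 0.0]
```

Because the same filter is reused at every position, the motif is detected wherever it occurs; shifting GATA to the start of the sequence moves the nonzero activation to index 0 without any change to the parameters.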

Convolutional neural networks (CNNs) can predict various molecular phenotypes on the basis of DNA sequence alone. Applications include classifying transcription factor binding sites and predicting molecular phenotypes such as chromatin features, DNA contact maps, DNA methylation, gene expression, translation efficiency, RBP binding, and microRNA (miRNA) targets. In addition to predicting molecular phenotypes from sequence, convolutional neural networks can be applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines. For example, convolutional neural networks can predict the specificity of guide RNA, denoise ChIP-seq, enhance Hi-C data resolution, predict the laboratory of origin from DNA sequences, and call genetic variants. Convolutional neural networks have also been employed to model long-range dependencies in the genome. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modeling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter-enhancer looping. This is achieved by using dilated convolutions, which have a receptive field of up to 32 kb. Dilated convolutions also allow splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns (see Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019)).
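To make the effect of dilation concrete, the Python sketch below implements a one-dimensional dilated convolution and computes the receptive field of a stack of stride-1 layers using the standard formula 1 + sum((kernel size - 1) * dilation). The layer configurations are illustrative assumptions and are not those of the models cited above.

```python
def dilated_conv1d(values, kernel, dilation):
    """Convolve with gaps of `dilation` positions between kernel taps."""
    span = (len(kernel) - 1) * dilation
    return [sum(w * values[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(values) - span)]

def receptive_field(layers):
    """Input positions seen by one output of stacked stride-1 layers.
    `layers` is a list of (kernel_size, dilation) pairs."""
    return 1 + sum((k - 1) * d for k, d in layers)

plain = receptive_field([(3, 1)] * 4)                      # no dilation
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])
print(plain, dilated)  # 9 31
```

With the same number of layers and parameters, doubling the dilation at each layer grows the receptive field exponentially rather than linearly, which is what makes kilobase-scale context practical.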

Different types of neural networks can be characterized by their parameter-sharing schemes. For example, fully connected layers have no parameter sharing, whereas convolutional layers impose translational invariance by applying the same filters at every position of their input. Recurrent neural networks (RNNs) are an alternative to convolutional neural networks for processing sequential data, such as DNA sequences or time series, that implement a different parameter-sharing scheme. Recurrent neural networks apply the same operation to each sequence element. The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or used directly as a model prediction. By applying the same model at each sequence element, recurrent neural networks are invariant to the position index in the processed sequence. For example, a recurrent neural network can detect an open reading frame in a DNA sequence regardless of its position in the sequence. This task requires the recognition of a certain series of inputs, such as a start codon followed by an in-frame stop codon.
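The recurrence described above (the same operation applied at each element, taking the previous memory and the new input) can be illustrated with a hand-written update rule for the open reading frame example. This is not a trained recurrent neural network: the update rule, the memory layout, and the function names are assumptions chosen to show how memory carries in-frame information along the sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def step(memory, base):
    """One recurrent update, identical for every sequence element."""
    last_two, open_frames, position, found = memory
    codon = last_two + base
    if len(codon) == 3:
        frame = position % 3  # reading frame of the codon ending here
        if codon == "ATG":
            open_frames = open_frames | {frame}
        elif codon in STOP_CODONS and frame in open_frames:
            found = True
            open_frames = open_frames - {frame}
    return (codon[-2:], open_frames, position + 1, found)

def has_orf(sequence):
    """True if a start codon is followed by an in-frame stop codon."""
    memory = ("", frozenset(), 0, False)
    for base in sequence:
        memory = step(memory, base)
    return memory[3]

print(has_orf("ATGAAATAA"))    # True
print(has_orf("CCATGAAATAA"))  # True: same ORF, shifted by two bases
print(has_orf("ATGATAA"))      # False: the stop codons are out of frame
```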

The main advantage of recurrent neural networks over convolutional neural networks is that they are, in theory, able to carry information through infinitely long sequences via memory. Furthermore, recurrent neural networks can naturally process sequences of widely varying length, such as mRNA sequences. However, convolutional neural networks combined with various tricks (such as dilated convolutions) can reach comparable or even better performance than recurrent neural networks on sequence-modeling tasks, such as audio synthesis and machine translation. Recurrent neural networks can aggregate the outputs of convolutional neural networks for predicting single-cell DNA methylation states, RBP binding, transcription factor binding, and DNA accessibility. Moreover, because recurrent neural networks apply a sequential operation, they cannot be easily parallelized and are hence much slower to compute than convolutional neural networks.

Each human has a unique genetic code, though a large portion of the human genetic code is common to all humans. In some cases, a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population. For example, a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise identical sequence.

Genetic variants may be pathogenic, leading to diseases. Though most such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single-nucleotide missense mutations that change an amino acid of a protein. However, not all missense mutations are pathogenic.
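The distinction between a missense mutation and other single-nucleotide changes can be made concrete with a short Python sketch. The codon table below is a six-entry subset of the standard genetic code, sufficient for the example only, and the function name is hypothetical.

```python
# Subset of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {
    "GAA": "E", "GAG": "E",  # glutamic acid
    "GTA": "V", "GTG": "V",  # valine
    "TAA": "*", "TAG": "*",  # stop
}

def classify_snv(ref_codon, position, alt_base):
    """Classify a single-nucleotide change within one codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

print(classify_snv("GAG", 1, "T"))  # missense: GAG (E) -> GTG (V)
print(classify_snv("GAG", 2, "A"))  # synonymous: GAG (E) -> GAA (E)
print(classify_snv("GAG", 0, "T"))  # nonsense: GAG (E) -> TAG (stop)
```

Whether a given missense change is pathogenic cannot be read off the codon table; that is the prediction problem the remainder of this description addresses.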

Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation, and have emerged as new methods for quantitative trait loci identification and variant prioritization. These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it challenging to estimate their effects and their contribution to phenotypes. Moreover, linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants. Thus, sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to finding potential drivers of complex phenotypes. One example includes predicting the effect of non-coding single-nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility, or gene expression predictions. Another example includes predicting novel splice site creation from sequence, or the quantitative effects of genetic variants on splicing.
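In silico perturbation amounts to scoring the reference and the alternative sequence with the same model and taking the difference. In the Python sketch below, `toy_binding_model` is a hypothetical stand-in for a trained sequence model (it simply reports the best fractional match to the string GATA); only the difference-of-predictions pattern is the point of the example.

```python
def toy_binding_model(sequence):
    """Stand-in for a trained model: best fractional match to GATA."""
    motif = "GATA"
    return max(
        sum(a == b for a, b in zip(sequence[i:i + 4], motif)) / 4
        for i in range(len(sequence) - 3)
    )

def apply_snv(sequence, position, alt_base):
    """Return the sequence with one base substituted."""
    return sequence[:position] + alt_base + sequence[position + 1:]

def variant_effect(model, ref_sequence, position, alt_base):
    """Predicted effect: model(alternative) minus model(reference)."""
    alt_sequence = apply_snv(ref_sequence, position, alt_base)
    return model(alt_sequence) - model(ref_sequence)

inside = variant_effect(toy_binding_model, "CCGATACC", 3, "C")
outside = variant_effect(toy_binding_model, "CCGATACC", 0, "T")
print(inside, outside)  # -0.25 0.0
```

A substitution inside the motif lowers the predicted score, whereas one outside it leaves the score unchanged, which is how such a model can rank co-inherited variants by predicted impact.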

An end-to-end deep learning approach for variant effect prediction has been applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (see Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as "PrimateAI"). PrimateAI uses deep neural networks trained on variants of known pathogenicity, with data augmentation using cross-species information. In particular, PrimateAI uses the sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach, which utilizes protein sequences for pathogenicity prediction, is promising because it can avoid the circularity problem and overfitting to previous knowledge. However, compared with the amount of data needed to train deep neural networks effectively, the amount of clinical data available in ClinVar is relatively small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while simulated variants based on trinucleotide context are used as unlabeled data.

PrimateAI outperforms prior methods when trained directly on sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that can lessen the reliance of clinical reporting on prior knowledge.

Central to protein biology is the understanding of how structural elements give rise to observed function. The abundance of protein structure data enables the development of computational methods that systematically derive the rules governing structure-function relationships. However, the performance of these methods depends critically on the choice of protein structure representation.

Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional (3D) location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determining the structural and functional roles of individual amino acids within a protein provides information that helps in engineering and altering protein function. Identifying functionally or structurally important amino acids allows focused engineering efforts, such as site-directed mutagenesis, for altering the functional properties of a target protein. Alternatively, this knowledge can help avoid engineering designs that would abolish a desired function.

Since it is established that structure is far more conserved than sequence, the increase in protein structure data provides an opportunity to systematically study the underlying patterns governing structure-function relationships using data-driven approaches. A fundamental aspect of any computational protein analysis is how protein structure information is represented. The performance of machine learning methods often depends more on the choice of data representation than on the machine learning algorithm employed. Good representations efficiently capture the most critical information, while poor representations create a noisy distribution with no underlying pattern.

The abundance of protein structures and the recent success of deep learning algorithms provide an opportunity to develop tools that automatically extract task-specific representations of protein structures. An opportunity therefore arises to predict variant pathogenicity using multichannel voxelized representations of 3D protein structures as input to deep neural networks.

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

The color drawings may also be available in PAIR (patent application information retrieval) via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. The drawings are not necessarily to scale, with emphasis instead placed on illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following drawings.

FIG. 1 is a flow diagram illustrating a process of a system for determining the pathogenicity of variants, according to various implementations of the disclosed technology.
FIG. 2 schematically illustrates an example reference amino acid sequence of a protein and an alternative amino acid sequence of the protein, according to one implementation of the disclosed technology.
FIG. 3 illustrates the amino acid-wise classification of atoms of the amino acids in the reference amino acid sequence of FIG. 2, according to one implementation of the disclosed technology.
FIG. 4 illustrates the amino acid-wise attribution of the 3D atomic coordinates of the alpha carbon atoms classified in FIG. 3 on an amino acid basis, according to one implementation of the disclosed technology.
FIG. 5 schematically illustrates a process of determining voxel-wise distance values, according to one implementation of the disclosed technology.
FIG. 6 illustrates an example of 21 amino acid-wise distance channels, according to one implementation of the disclosed technology.
FIG. 7 is a schematic diagram of a distance channel tensor, according to one implementation of the disclosed technology.
FIG. 8 illustrates one-hot encodings of the reference amino acid and the alternative amino acid from FIG. 2, according to one implementation of the disclosed technology.
FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid and a voxelized one-hot encoded variant/alternative amino acid, according to one implementation of the disclosed technology.
FIG. 10 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7 and a reference allele tensor, according to one implementation of the disclosed technology.
FIG. 11 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7, the reference allele tensor of FIG. 10, and an alternative allele tensor, according to one implementation of the disclosed technology.
FIG. 12 is a flow diagram illustrating a process of a system for determining pan-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), according to one implementation of the disclosed technology.
FIG. 13 illustrates the nearest amino acids to voxels, according to one implementation of the disclosed technology.
FIG. 14 illustrates an example multiple sequence alignment of the reference amino acid sequence across 99 species, according to one implementation of the disclosed technology.
FIG. 15 illustrates an example of determining a pan-amino acid conservation frequency sequence for a particular voxel, according to one implementation of the disclosed technology.
FIG. 16 illustrates respective pan-amino acid conservation frequencies determined for respective voxels using the position frequency logic described in FIG. 15, according to one implementation of the disclosed technology.
FIG. 17 illustrates voxelized per-voxel evolutionary profiles, according to one implementation of the disclosed technology.
FIG. 18 illustrates an example of an evolutionary profile tensor, according to one implementation of the disclosed technology.
FIG. 19 is a flow diagram illustrating a process of a system for determining per-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), according to one implementation of the disclosed technology.
FIG. 20 illustrates various examples of voxelized annotation channels that are concatenated with the distance channel tensor, according to one implementation of the disclosed technology.
FIG. 21 illustrates different combinations and permutations of input channels that can be provided as input to a pathogenicity classifier for determining the pathogenicity of a target variant, according to one implementation of the disclosed technology.
FIG. 22 illustrates different methods of calculating the disclosed distance channels, according to various implementations of the disclosed technology.
FIG. 23 illustrates different examples of evolutionary channels, according to various implementations of the disclosed technology.
FIG. 24 illustrates different examples of annotation channels, according to various implementations of the disclosed technology.
FIG. 25 illustrates different examples of structure confidence channels, according to various implementations of the disclosed technology.
FIGS. 26 and 27 each illustrate an example processing architecture of the pathogenicity classifier, according to one implementation of the disclosed technology.
FIGS. 28, 29, 30, and 31 use PrimateAI as a benchmark model to demonstrate the classification superiority of the disclosed PrimateAI 3D over PrimateAI.
FIGS. 32A and 32B illustrate the disclosed efficient voxelization process, according to various implementations of the disclosed technology.
FIG. 33 illustrates how atoms are associated with the voxels that contain them, according to one implementation of the disclosed technology.
FIG. 34 illustrates generating a voxel-to-atoms mapping from an atom-to-voxels mapping to identify the nearest atom for each voxel, according to one implementation of the disclosed technology.
FIGS. 35A and 35B illustrate how the disclosed efficient voxelization has a runtime complexity of O(number of atoms), versus a runtime complexity of O(number of atoms × number of voxels) without the disclosed efficient voxelization.
FIG. 36 illustrates an example computer system that can be used to implement the disclosed technology.

The following discussion is presented to enable any person skilled in the art to make and use the disclosed technology, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of the various implementations is better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks do not necessarily indicate a division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general-purpose signal processor, a block of random access memory, a hard disk, or the like) or in multiple pieces of hardware. Similarly, a program may be a standalone program, may be incorporated as a subroutine in an operating system, may be a function in an installed software package, and so on. It should be understood that the various implementations are not limited to the arrangements and instrumentalities shown in the drawings.

The processing engines and databases in the figures, designated as modules, can be implemented in hardware or software, and need not be divided into exactly the same blocks as shown in the figures. Some of the modules can be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not have all of its code disposed contiguously in memory; some parts of the code can be separated from other parts, with code from other modules or other functions disposed in between.

Determination of Pathogenicity Based on Protein Structure
FIG. 1 is a flow diagram illustrating a process 100 of a system for determining the pathogenicity of variants. At step 102, a sequence accessor 104 of the system accesses a reference amino acid sequence and an alternative amino acid sequence. At step 112, a 3D structure generator 114 of the system generates a 3D protein structure of the reference amino acid sequence. In some implementations, the 3D protein structure is a homology model of a human protein. In one implementation, the so-called SwissModel homology modeling pipeline provides a public repository of predicted human protein structures. In another implementation, so-called HHpred homology modeling uses a tool called Modeller to predict the structure of a target protein from template structures.

A protein is represented by a collection of atoms and their coordinates in 3D space. An amino acid can have a variety of atoms, such as carbon (C) atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms. The atoms can be further classified as side chain atoms and backbone atoms. The backbone carbon atoms can include alpha carbon (C_α) atoms and beta carbon (C_β) atoms.

At step 122, a coordinate classifier 124 of the system classifies the 3D atomic coordinates of the 3D protein structure on an amino acid basis. In one implementation, the amino acid-wise classification involves attributing the 3D atomic coordinates to 21 amino acid categories (including a stop or gap amino acid category). In one example, an amino acid-wise classification of alpha carbon atoms lists the alpha carbon atoms under each of the 21 amino acid categories. In another example, an amino acid-wise classification of beta carbon atoms lists the beta carbon atoms under each of the 21 amino acid categories.

In yet another example, an amino acid-wise classification of oxygen atoms lists the oxygen atoms under each of the 21 amino acid categories. In yet another example, an amino acid-wise classification of nitrogen atoms lists the nitrogen atoms under each of the 21 amino acid categories. In yet another example, an amino acid-wise classification of hydrogen atoms lists the hydrogen atoms under each of the 21 amino acid categories.

A person skilled in the art will appreciate that, in various implementations, the amino acid-wise classification can include a subset of the 21 amino acid categories and a subset of the different atomic elements.
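The amino acid-wise classification of step 122 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the record layout `(residue, atom_name, x, y, z)`, the `AMINO_ACIDS` ordering, the use of `-` for the stop or gap category, and the PDB-style atom name `CA` for the alpha carbon are all assumptions made for the example.

```python
# 20 standard amino acids plus one stop-or-gap category ("-"): 21 categories.
# The ordering and the "-" symbol are illustrative assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def classify_atoms(atoms, atom_name="CA"):
    """Group 3D coordinates by amino acid category for one atom name.

    `atoms` is an iterable of (residue, atom_name, x, y, z) records
    (a hypothetical layout chosen for this sketch).
    """
    by_category = {aa: [] for aa in AMINO_ACIDS}
    for residue, name, x, y, z in atoms:
        if name == atom_name and residue in by_category:
            by_category[residue].append((x, y, z))
    return by_category


# Toy structure: coordinates are made up.
atoms = [
    ("F", "CA", 0.0, 0.0, 0.0),
    ("F", "CB", 1.5, 0.0, 0.0),
    ("G", "CA", 3.8, 0.0, 0.0),
    ("A", "CA", 3.8, 3.8, 0.0),
]
alpha_carbons = classify_atoms(atoms, atom_name="CA")
```

Calling `classify_atoms(atoms, atom_name="CB")` would give the corresponding beta carbon classification; categories with no matching atom simply hold an empty list.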

At step 132, a voxel grid generator 134 of the system instantiates a voxel grid. The voxel grid can have any resolution, for example, 3×3×3, 5×5×5, 7×7×7, and so on. The voxels in the voxel grid can be of any size, for example, 1 angstrom (Å) on each side, 2 Å on each side, 3 Å on each side, and so on. A person skilled in the art will appreciate that these example dimensions refer to cubic dimensions because voxels are cubes. A person skilled in the art will also appreciate that these example dimensions are non-limiting and that the voxels can have any cubic dimensions.

At step 142, a voxel grid centerer 144 of the system centers the voxel grid at the reference amino acid that experiences a target variant at the amino acid level. In one implementation, the voxel grid is centered at the atomic coordinates of a particular atom of the reference amino acid that experiences the target variant, for example, the 3D atomic coordinates of the alpha carbon atom of that reference amino acid.
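Steps 132 and 142 can be illustrated with a short sketch that computes the center of every voxel in a cubic grid centered on a given 3D coordinate, such as the alpha carbon of the reference amino acid. The function name `voxel_centers`, its parameters, and the sample coordinate are hypothetical and chosen only for the example; the grid and voxel sizes are taken from the ranges mentioned above.

```python
from itertools import product


def voxel_centers(center, grid_size=3, voxel_size=2.0):
    """Return the 3D centers of a grid_size^3 voxel grid centered on `center`.

    `voxel_size` is the edge length of each cubic voxel, in angstroms.
    """
    half = (grid_size - 1) / 2.0
    # Offsets of voxel centers from the grid center along one axis.
    offsets = [(i - half) * voxel_size for i in range(grid_size)]
    cx, cy, cz = center
    return [(cx + dx, cy + dy, cz + dz) for dx, dy, dz in product(offsets, repeat=3)]


# Hypothetical alpha carbon coordinate of the reference amino acid.
centers = voxel_centers(center=(10.0, 5.0, -2.0), grid_size=3, voxel_size=2.0)
```

With an odd grid size, the middle voxel of the grid is centered exactly on the chosen atom, which is why odd resolutions such as 3×3×3 or 7×7×7 are convenient in this sketch.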

Distance Channels
The voxels in the voxel grid can have a plurality of channels (or features). In one implementation, the voxels in the voxel grid have a plurality of distance channels (e.g., 21 distance channels, one for each of the 21 amino acid categories, including the stop or gap amino acid category). At step 152, a distance channel generator 154 of the system generates amino acid-wise distance channels for the voxels in the voxel grid. The distance channels are generated independently for each of the 21 amino acid categories.

Consider, for example, the alanine (A) amino acid category. Consider further, for example, that the voxel grid is of size 3×3×3 and has 27 voxels. Then, in one implementation, the alanine distance channel includes 27 distance values, one for each of the 27 voxels in the voxel grid. The 27 distance values in the alanine distance channel are measured from the respective centers of the 27 voxels to the respective nearest atoms in the alanine amino acid category.

In one example, the alanine amino acid category includes only alpha carbon atoms, and therefore the nearest atoms are the alanine alpha carbon atoms that are closest to the respective 27 voxels in the voxel grid. In another example, the alanine amino acid category includes only beta carbon atoms, and therefore the nearest atoms are the alanine beta carbon atoms that are closest to the respective 27 voxels in the voxel grid.

In yet another example, the alanine amino acid category includes only oxygen atoms, and therefore the nearest atoms are the alanine oxygen atoms that are closest to the respective 27 voxels in the voxel grid. In yet another example, the alanine amino acid category includes only nitrogen atoms, and therefore the nearest atoms are the alanine nitrogen atoms that are closest to the respective 27 voxels. In yet another example, the alanine amino acid category includes only hydrogen atoms, and therefore the nearest atoms are the alanine hydrogen atoms that are closest to the respective 27 voxels.

As with the alanine distance channel, the distance channel generator 154 generates a distance channel (i.e., a set of voxel-wise distance values) for each of the remaining amino acid categories. In other implementations, the distance channel generator 154 generates distance channels for only a subset of the 21 amino acid categories.
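The per-category distance channel described above can be sketched as a brute-force nearest-atom search. This is only an illustration under stated assumptions: the function name `distance_channel`, the toy coordinates, and the choice of `inf` as the value for a category with no atoms are not taken from the disclosure.

```python
import math


def distance_channel(voxel_centers, atom_coords, missing=float("inf")):
    """Return one distance value per voxel for a single amino acid category.

    Each value is the Euclidean distance from the voxel center to the nearest
    atom in `atom_coords`. `missing` is used when the category has no atoms
    (an assumption made for this sketch).
    """
    channel = []
    for center in voxel_centers:
        nearest = min((math.dist(center, atom) for atom in atom_coords), default=missing)
        channel.append(nearest)
    return channel


# Two voxel centers and three categories with made-up atom coordinates.
voxels = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atoms_by_category = {
    "A": [(0.0, 3.0, 4.0), (2.0, 0.0, 1.0)],
    "G": [(0.0, 0.0, 2.0)],
    "W": [],
}
# One independent distance channel per amino acid category.
channels = {aa: distance_channel(voxels, coords) for aa, coords in atoms_by_category.items()}
```

For a 3×3×3 grid and 21 categories, the same loop would yield 21 channels of 27 values each. This brute-force version visits every voxel-atom pair within a category.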

In other implementations, the selection of the nearest atoms is not confined to a particular atom type. That is, within a subject amino acid category, the nearest atom to a particular voxel is selected irrespective of the atomic element of that atom, and the distance value for the particular voxel is calculated for inclusion in the distance channel of the subject amino acid category.

In yet other implementations, the distance channels are generated on an atomic element basis. Instead of, or in addition to, having distance channels for the amino acid categories, distance values can be generated for atomic element categories, irrespective of the amino acids to which the atoms belong. Consider, for example, that the atoms of the amino acids in the reference amino acid sequence span seven atomic elements: carbon, oxygen, nitrogen, hydrogen, calcium, iodine, and sulfur. The voxels in the voxel grid are then configured to have seven distance channels, such that each of the seven distance channels has 27 voxel-wise distance values specifying the distances to the nearest atoms within the corresponding atomic element category only. In other implementations, distance channels can be generated for only a subset of the seven atomic elements. In yet other implementations, the atomic element categories and the distance channel generation can be further stratified into variations of the same atomic element, for example, alpha carbon (C_α) atoms and beta carbon (C_β) atoms.

In yet other implementations, the distance channels can be generated on an atom type basis, for example, distance channels for side chain atoms only and distance channels for backbone atoms only.
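The element-wise and atom type-wise variants differ from the amino acid-wise channels only in how atoms are grouped before the nearest-atom search. The sketch below makes that grouping a parameter. The atom record layout, the helper name `channels_by`, and the use of PDB-style backbone atom names (`N`, `CA`, `C`, `O`) to separate backbone from side chain atoms are assumptions for illustration, not details from the disclosure.

```python
import math

# PDB-style names of protein backbone atoms (assumption for this sketch).
BACKBONE_NAMES = {"N", "CA", "C", "O"}


def channels_by(atoms, voxel_centers, key):
    """Build one distance channel per category, where `key(atom)` names the category."""
    groups = {}
    for atom in atoms:
        groups.setdefault(key(atom), []).append(atom["xyz"])
    return {
        category: [min(math.dist(c, xyz) for xyz in coords) for c in voxel_centers]
        for category, coords in groups.items()
    }


# Made-up atoms and a single voxel center.
atoms = [
    {"element": "N", "name": "N", "xyz": (0.0, 0.0, 1.0)},
    {"element": "C", "name": "CA", "xyz": (0.0, 2.0, 0.0)},
    {"element": "C", "name": "CB", "xyz": (3.0, 0.0, 0.0)},
    {"element": "S", "name": "SG", "xyz": (0.0, 0.0, 4.0)},
]
voxels = [(0.0, 0.0, 0.0)]

# Atomic element-wise channels: the owning amino acid is ignored.
by_element = channels_by(atoms, voxels, key=lambda a: a["element"])
# Atom type-wise channels: backbone atoms versus side chain atoms.
by_type = channels_by(
    atoms, voxels, key=lambda a: "backbone" if a["name"] in BACKBONE_NAMES else "side_chain"
)
```

Passing a key that returns the residue letter would recover the amino acid-wise channels, so the three schemes can share one code path in a design like this.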

The nearest atoms can be searched for within a predefined maximum scan radius from the voxel centers (e.g., 6 angstroms (Å)). In addition, multiple atoms can be nearest to the same voxel in the voxel grid.

The distances are calculated between the 3D coordinates of the voxel centers and the 3D atomic coordinates of the atoms. In addition, the distance channels are all generated with the voxel grid centered at the same location (e.g., centered at the 3D atomic coordinates of the alpha carbon atom of the reference amino acid that experiences the target variant).

The distances can be Euclidean distances. The distances can also be parameterized by atom size (or atom influence), for example, by using the Lennard-Jones potential and/or the van der Waals atomic radius of the atom in question. The distance values can also be normalized by the maximum scan radius, or by the maximum observed distance value of the furthest nearest atom within a subject amino acid category, a subject atomic element category, or a subject atom type category. In some implementations, the distances between the voxels and the atoms are calculated based on the polar coordinates of the voxels and the atoms. The polar coordinates are parameterized by the angles between the voxels and the atoms. In one implementation, this angle information is used to generate an angle channel for the voxels (i.e., independent of the distance channels). In some implementations, the angles between a nearest atom and its neighboring atoms (e.g., backbone atoms) can be used as features that are encoded with the voxels.
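The scan radius and normalization described above can be combined in a small sketch. It assumes a Euclidean distance, normalization by the maximum scan radius, and a fill value of 1.0 for voxels with no atom inside the radius. The fill value and the function name `normalized_distance` are assumptions for illustration; the 6 Å radius is the example value given above.

```python
import math


def normalized_distance(center, atoms, max_radius=6.0):
    """Distance from `center` to the nearest atom within `max_radius`, scaled to [0, 1].

    Atoms beyond the scan radius are ignored. If no atom is in range, the
    voxel gets 1.0 (an assumed fill value, not taken from the disclosure).
    """
    in_range = [d for d in (math.dist(center, atom) for atom in atoms) if d <= max_radius]
    if not in_range:
        return 1.0
    return min(in_range) / max_radius


near = normalized_distance((0.0, 0.0, 0.0), [(3.0, 0.0, 0.0), (0.0, 0.0, 9.0)])
none_in_range = normalized_distance((0.0, 0.0, 0.0), [(0.0, 0.0, 9.0)])
```

Normalizing by the scan radius keeps every channel on the same scale regardless of category, whereas normalizing by the maximum observed distance would rescale each category separately.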

Reference Allele and Alternative Allele Channels
The voxels in the voxel grid can also have reference allele and alternative allele channels. At step 162, a one-hot encoder 164 of the system generates a reference one-hot encoding of the reference amino acid in the reference amino acid sequence and an alternative one-hot encoding of the alternative amino acid in the alternative amino acid sequence. The reference amino acid experiences the target variant. The alternative amino acid is the target variant. The reference amino acid and the alternative amino acid are located at the same position in the reference amino acid sequence and the alternative amino acid sequence, respectively. The reference amino acid sequence and the alternative amino acid sequence have the same position-wise amino acid composition with one exception: the position that has the reference amino acid in the reference amino acid sequence and the alternative amino acid in the alternative amino acid sequence.
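A one-hot encoding over the 21 amino acid categories can be sketched as below. The category ordering in `AMINO_ACIDS`, the `-` symbol for the stop or gap category, and the example pair of glycine (reference) and alanine (alternative) are assumptions chosen for the illustration.

```python
# 21 categories: 20 standard amino acids plus a stop-or-gap category ("-").
# The ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(amino_acid):
    """Return a 21-element vector with a single 1 at the amino acid's index."""
    if amino_acid not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid category: {amino_acid!r}")
    return [1 if aa == amino_acid else 0 for aa in AMINO_ACIDS]


ref_encoding = one_hot("G")  # reference amino acid, e.g. glycine
alt_encoding = one_hot("A")  # alternative amino acid, e.g. alanine
```

Because the two sequences differ at exactly one position, the pair of encodings fully describes the substitution at that position.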

At step 172, a concatenator 174 of the system concatenates the amino acid-wise distance channels with the reference and alternative one-hot encodings. In another implementation, the concatenator 174 concatenates the atomic element-wise distance channels with the reference and alternative one-hot encodings. In yet another implementation, the concatenator 174 concatenates the atom type-wise distance channels with the reference and alternative one-hot encodings.
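The voxel-wise concatenation of step 172 can be sketched by appending the same reference and alternative one-hot vectors to the distance values of every voxel. This sketch assumes 27 voxels, 21 amino acid-wise distance channels, and that the one-hot encodings are repeated for each voxel; the nested-list layout (one feature vector per voxel), the zero placeholder distances, and the function name are illustrative choices.

```python
def concatenate_channels(distance_channels, ref_one_hot, alt_one_hot):
    """Concatenate per-voxel distance values with the ref and alt encodings.

    `distance_channels` holds one list of distance values per voxel. The same
    one-hot vectors are appended to every voxel.
    """
    return [list(dist) + list(ref_one_hot) + list(alt_one_hot) for dist in distance_channels]


num_voxels, num_categories = 27, 21
# Placeholder distance values; a real tensor would hold computed distances.
distance_channels = [[0.0] * num_categories for _ in range(num_voxels)]

ref = [0] * num_categories
ref[5] = 1  # hypothetical index of the reference amino acid
alt = [0] * num_categories
alt[0] = 1  # hypothetical index of the alternative amino acid

tensor = concatenate_channels(distance_channels, ref, alt)
```

Under these assumptions each voxel carries 21 + 21 + 21 = 63 features, and the result can be reshaped to a 3×3×3×63 array before being passed to a 3D convolutional network.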

At step 182, runtime logic 184 of the system processes the concatenated per-amino-acid, per-atomic-element, or per-atom-type distance channels, together with the reference and alternative one-hot encodings, through a pathogenicity classifier (pathogenicity determination engine) to determine the pathogenicity of the target variant. That determination is in turn inferred as the pathogenicity determination for the underlying nucleotide variant that produces the target variant at the amino acid level. The pathogenicity classifier is trained on labeled datasets of benign and pathogenic variants, for example, using a backpropagation algorithm. Further details about the labeled datasets of benign and pathogenic variants, and about exemplary architectures and training of the pathogenicity classifier, can be found in commonly owned U.S. Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149.

FIG. 2 schematically depicts a reference amino acid sequence 202 of a protein 200 and an alternative amino acid sequence 212 of the protein 200. The protein 200 comprises N amino acids, whose positions are labeled 1, 2, 3, ..., N. In the illustrated example, position 16 is the position that experiences an amino acid variant 214 (mutation) caused by an underlying nucleotide variant. For the reference amino acid sequence 202, position 1 has the reference amino acid phenylalanine (F), position 16 has the reference amino acid glycine (G) 204, and position N (i.e., the last amino acid of the sequence 202) has the reference amino acid leucine (L). Although not illustrated for clarity, the remaining positions in the reference amino acid sequence 202 contain various amino acids in an order specific to the protein 200. The alternative amino acid sequence 212 is identical to the reference amino acid sequence 202 except for the variant 214 at position 16, which contains the alternative amino acid alanine (A) 214 in place of the reference amino acid glycine (G) 204.
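The relationship between the two sequences can be illustrated with a minimal sketch. The function name `apply_missense` and the 20-residue toy sequence are hypothetical; only the F at position 1, the G-to-A change at position 16, and the L at position N come from the example above.

```python
def apply_missense(reference: str, position: int, alt: str) -> str:
    """Return the alternative sequence: the reference with one residue replaced.

    `position` is 1-based, matching the 1..N labeling used in the text.
    """
    if not 1 <= position <= len(reference):
        raise ValueError("position lies outside the sequence")
    if len(alt) != 1:
        raise ValueError("alt must be a single amino acid letter")
    return reference[:position - 1] + alt + reference[position:]


# Toy sequence (hypothetical filler residues): F at 1, G at 16, L at N = 20.
reference = "F" + "A" * 14 + "G" + "A" * 3 + "L"
alternative = apply_missense(reference, 16, "A")

# 1-based positions at which the two sequences differ.
differing = [i + 1 for i, (r, a) in enumerate(zip(reference, alternative)) if r != a]
```

The two strings have the same length and differ at exactly one position, which is the property the one-hot encoder relies on.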

FIG. 3 shows a per-amino-acid classification of the atoms of the amino acids in the reference amino acid sequence 202, also referred to herein as "atom classification 300." A particular type of amino acid among the 20 natural amino acids listed in column 302 can repeat in a protein; that is, it can occur more than once. A protein can also have some undetermined amino acids, which are categorized under a 21st stop or gap amino acid category. The right column of FIG. 3 contains counts of alpha carbon (Cα) atoms from the different amino acids.

Specifically, FIG. 3 shows a per-amino-acid classification of the alpha carbon (Cα) atoms of the amino acids in the reference amino acid sequence 202. Column 308 of FIG. 3 lists the total number of alpha carbon atoms observed for the reference amino acid sequence 202 in each of the 21 amino acid categories. For example, column 308 lists 11 alpha carbon atoms for the alanine (A) amino acid category. Because each amino acid has exactly one alpha carbon atom, this means that alanine occurs 11 times in the reference amino acid sequence 202. In another example, arginine (R) occurs 35 times in the reference amino acid sequence 202. The total number of alpha carbon atoms across the 21 amino acid categories is 828.
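Because each residue contributes one alpha carbon atom, the per-category counts in column 308 amount to residue counts. The sketch below shows this under stated assumptions: the category ordering, the `-` symbol for the 21st stop/gap category, and the helper name `alpha_carbon_counts` are illustrative choices, not taken from the patent.

```python
from collections import Counter

# Assumed ordering of the 20 natural amino acids; "-" stands in for the
# 21st stop/gap category (both are illustrative assumptions).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"


def alpha_carbon_counts(sequence: str) -> dict:
    """Count residues (one alpha carbon each) in each of the 21 categories.

    Any symbol outside the 20 natural amino acids is folded into the
    stop/gap category.
    """
    counts = Counter(aa if aa in AMINO_ACIDS else GAP for aa in sequence)
    return {cat: counts.get(cat, 0) for cat in AMINO_ACIDS + GAP}


counts = alpha_carbon_counts("AARGX-A")
```

The counts always sum to the sequence length, mirroring the way the 21 category totals in FIG. 3 sum to the protein's residue count.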

FIG. 4 shows a per-amino-acid attribution of the 3D atomic coordinates of the alpha carbon atoms of the reference amino acid sequence 202, based on the atom classification 300 of FIG. 3. This is referred to herein as "atomic coordinate bucketing 400." In FIG. 4, lists 404-440 tabulate the 3D atomic coordinates of the alpha carbon atoms bucketed into each of the 21 amino acid categories.

In the illustrated implementation, the bucketing 400 of FIG. 4 follows the classification 300 of FIG. 3. For example, in FIG. 3 the alanine amino acid category has 11 alpha carbon atoms, and therefore in FIG. 4 the alanine amino acid category has 11 3D atomic coordinates, one for each of the corresponding 11 alpha carbon atoms from FIG. 3. The same classification-to-bucketing logic flows from FIG. 3 to FIG. 4 for the other amino acid categories. This logic is for representational purposes only, however. In other implementations, the disclosed technology need not perform the classification 300 and the bucketing 400 to locate the voxel-wise nearest atoms, and may perform fewer, additional, or different steps. For example, in some implementations, the disclosed technology can locate the voxel-wise nearest atoms by using a sort and search algorithm that returns them from one or more databases in response to a search query configured to accept query parameters such as sort criteria (e.g., per amino acid, per atomic element, per atom type), a predetermined maximum scan radius, and a type of distance (e.g., Euclidean, Mahalanobis, normalized, unnormalized). In various implementations of the disclosed technology, a person skilled in the art may analogously use any of a plurality of sort and search algorithms, from the current or future technical field, to locate the voxel-wise nearest atoms.
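The classification-to-bucketing step can be sketched as a simple grouping of coordinates by amino acid letter. This is an illustrative sketch only: `bucket_alpha_carbons`, the four-residue toy protein, and its coordinates are assumptions made for the example, and the patent itself notes that other sort and search strategies may be used instead.

```python
from collections import defaultdict

import numpy as np


def bucket_alpha_carbons(residues, coords):
    """Group alpha-carbon coordinates by amino acid category.

    residues: sequence of one-letter codes, one per residue.
    coords:   array-like of shape (len(residues), 3) with x, y, z values.
    Returns a dict mapping each category present to a (k, 3) array.
    """
    coords = np.asarray(coords, dtype=float)
    if coords.shape != (len(residues), 3):
        raise ValueError("expected one (x, y, z) triple per residue")
    buckets = defaultdict(list)
    for aa, xyz in zip(residues, coords):
        buckets[aa].append(xyz)
    return {aa: np.vstack(points) for aa, points in buckets.items()}


# Hypothetical four-residue toy protein with made-up coordinates.
residues = "GAAR"
coords = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]]
buckets = bucket_alpha_carbons(residues, coords)
```

Each bucket holds as many coordinate rows as the category has residues, which is the 11-atoms-to-11-coordinates correspondence described for alanine.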

In FIG. 4, the 3D atomic coordinates are expressed as Cartesian coordinates x, y, z, but any type of coordinate system, such as spherical or cylindrical coordinates, may be used, and the claimed subject matter is not limited in this respect. In some implementations, one or more databases may contain information about the 3D atomic coordinates of the alpha carbon atoms and other atoms of the amino acids in proteins. Such databases may be searchable by specific protein.

As discussed above, voxels and the voxel grid are 3D entities. For clarity, however, the drawings depict, and the description discusses, the voxels and the voxel grid in a two-dimensional (2D) format. For example, a 3×3×3 voxel grid of 27 voxels is depicted and described herein as a 3×3 2D pixel grid with 9 2D pixels. A person skilled in the art will appreciate that the 2D format is used only for representational purposes and is intended to cover the 3D counterparts (i.e., 2D pixels represent 3D voxels and the 2D pixel grid represents the 3D voxel grid). The drawings are also not to scale. For example, a voxel of size 2 angstroms (Å) is depicted using a single pixel.

Voxel-wise Distance Calculation
FIG. 5 schematically illustrates a process of determining voxel-wise distance values, also referred to herein as "voxel-wise distance calculation 500." In the illustrated example, the voxel-wise distance values are calculated only for the alanine (A) distance channel. As discussed above with respect to FIG. 1, however, the same distance calculation logic can be executed for each of the 21 amino acid categories to generate 21 per-amino-acid distance channels, and can be further extended to other atom types, such as beta carbon atoms, and to other atomic elements, such as oxygen, nitrogen, and hydrogen. In some implementations, the atoms are randomly rotated prior to the distance calculation to make the training of the pathogenicity classifier invariant to atom orientation.
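The random rotation mentioned above can be sketched as follows. The text does not say how the rotation is drawn or about which point the atoms are rotated, so both choices here are assumptions: the rotation matrix is obtained from a QR decomposition of a Gaussian matrix, and the atoms are rotated about a chosen center (for instance, the point on which the voxel grid is centered). The names `random_rotation` and `rotate_about` are hypothetical.

```python
import numpy as np


def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the column signs of the QR factor
    if np.linalg.det(q) < 0:         # turn a reflection into a proper rotation
        q[:, 0] = -q[:, 0]
    return q


def rotate_about(coords, center, rotation):
    """Rotate atom coordinates about `center`, leaving `center` fixed."""
    coords = np.asarray(coords, dtype=float)
    center = np.asarray(center, dtype=float)
    return (coords - center) @ rotation.T + center


rng = np.random.default_rng(0)
rotation = random_rotation(rng)
atoms = rng.normal(size=(5, 3)) * 10.0   # made-up atom coordinates
center = atoms[0]                        # assumed rotation center
rotated = rotate_about(atoms, center, rotation)
```

A rotation about the grid center preserves every atom's distance to that center, so only the orientation seen by the voxel grid changes between training samples.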

In FIG. 5, a voxel grid 522 has nine voxels 514, identified by the indices (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), and (3,3). The voxel grid 522 is centered, for example, at the 3D atomic coordinate 532 of the alpha carbon atom of the glycine (G) amino acid at position 16 of the reference amino acid sequence 202, because, as discussed above with respect to FIG. 2, position 16 in the alternative amino acid sequence 212 experiences the variant that mutates the glycine (G) amino acid to the alanine (A) amino acid. The center of the voxel grid 522 also coincides with the center of voxel (2,2).
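Centering a grid on the variant's alpha carbon atom can be sketched as below. The 3×3×3 grid and the 2 Å voxel size are taken from examples elsewhere in the text; the function name `voxel_centers`, the sample coordinate, and the zero-based array indexing (the text uses one-based indices such as (2,2)) are assumptions of this sketch.

```python
import numpy as np


def voxel_centers(center, n=3, voxel_size=2.0):
    """Return an (n, n, n, 3) array of voxel-center coordinates.

    The grid is centered on `center`, so for odd n the middle voxel's
    center coincides with `center` itself.
    """
    offsets = (np.arange(n) - (n - 1) / 2.0) * voxel_size
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1) + np.asarray(center, dtype=float)


# Made-up coordinate standing in for the variant residue's alpha carbon.
variant_ca = np.array([12.5, -3.0, 7.25])
centers = voxel_centers(variant_ca)
```

With zero-based indexing, `centers[1, 1, 1]` plays the role of voxel (2,2) in the 2D figure.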

The centered voxel grid 522 is used for the voxel-wise distance calculation for each of the 21 per-amino-acid distance channels. For example, starting with the alanine (A) distance channel, the distances between the 3D coordinates of the center of each of the nine voxels 514 and the 3D atomic coordinates 402 of the 11 alanine alpha carbon atoms are measured to locate the nearest alanine alpha carbon atom for each of the nine voxels 514. The nine distance values for the nine distances between the nine voxels 514 and their respective nearest alanine alpha carbon atoms are then used to construct the alanine distance channel. The resulting alanine distance channel arranges the nine alanine distance values in the same order as the nine voxels 514 in the voxel grid 522.

The above process is executed for each of the 21 amino acid categories. For example, the centered voxel grid 522 is similarly used to calculate the arginine (R) distance channel: the distances between the 3D coordinates of the center of each of the nine voxels 514 and the 3D atomic coordinates 404 of the 35 arginine alpha carbon atoms are measured to locate the nearest arginine alpha carbon atom for each of the nine voxels 514. The nine distance values for the nine distances between the nine voxels 514 and their respective nearest arginine alpha carbon atoms are then used to construct the arginine distance channel. The resulting arginine distance channel arranges the nine arginine distance values in the same order as the nine voxels 514 in the voxel grid 522. The 21 per-amino-acid distance channels are then voxel-wise encoded to form a distance channel tensor.
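The per-channel calculation described above can be sketched as a brute-force nearest-atom search. This sketch assumes Euclidean distance and a 3×3×3 grid of 2 Å voxels centered at the origin; `nearest_atom_distances` and the coordinates are illustrative. Returning infinity for a category with no atoms is also an assumption, since the text does not say how empty categories are handled.

```python
import numpy as np


def nearest_atom_distances(centers, atom_coords):
    """Distance from every voxel center to its nearest atom in one category.

    centers:     (..., 3) array of voxel-center coordinates.
    atom_coords: (k, 3) array-like of atom coordinates for the category.
    Returns an array with the grid's shape; all inf if the category is empty.
    """
    centers = np.asarray(centers, dtype=float)
    atom_coords = np.asarray(atom_coords, dtype=float).reshape(-1, 3)
    grid_shape = centers.shape[:-1]
    if atom_coords.shape[0] == 0:
        return np.full(grid_shape, np.inf)
    flat = centers.reshape(-1, 3)
    pairwise = np.linalg.norm(flat[:, None, :] - atom_coords[None, :, :], axis=-1)
    return pairwise.min(axis=1).reshape(grid_shape)


# 3x3x3 grid of 2 angstrom voxels centered at the origin.
offsets = np.array([-2.0, 0.0, 2.0])
centers = np.stack(np.meshgrid(offsets, offsets, offsets, indexing="ij"), axis=-1)

# Made-up alanine alpha-carbon coordinates.
alanine_ca = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
alanine_channel = nearest_atom_distances(centers, alanine_ca)
```

Calling the same function once per category bucket yields the 21 channels; for large proteins a spatial index such as a k-d tree would be a natural substitute for the pairwise search.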

Specifically, in the illustrated example, distance 512 is the distance between the center of voxel (1,1) of the voxel grid 522 and the nearest alpha carbon (Cα) atom, which is the Cα^A5 atom in list 402. Accordingly, the value assigned to voxel (1,1) is the distance 512. In another example, the Cα^A4 atom is the nearest Cα atom to the center of voxel (1,2), so the value assigned to voxel (1,2) is the distance between the center of voxel (1,2) and the Cα^A4 atom. In yet another example, the Cα^A6 atom is the nearest Cα atom to the center of voxel (2,1), so the value assigned to voxel (2,1) is the distance between the center of voxel (2,1) and the Cα^A6 atom. The Cα^A6 atom is also the nearest Cα atom to the centers of voxels (3,2) and (3,3). Accordingly, the value assigned to voxel (3,2) is the distance between the center of voxel (3,2) and the Cα^A6 atom, and the value assigned to voxel (3,3) is the distance between the center of voxel (3,3) and the Cα^A6 atom. In some implementations, the distance values assigned to the voxels 514 may be normalized distances. For example, the distance value assigned to voxel (1,1) may be the distance 512 divided by a maximum distance 502 (a predetermined maximum scan radius). In some implementations, the nearest-atom distances may be Euclidean distances, and may be normalized by dividing the Euclidean distances by a maximum nearest-atom distance (e.g., the maximum distance 502).
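The normalization described above divides each nearest-atom distance by the maximum scan radius. The sketch below additionally clips distances at that radius so that values stay within [0, 1]; the clipping, the function name `normalize_distances`, and the 6 Å radius are assumptions made for illustration and are not stated in the text.

```python
import numpy as np


def normalize_distances(distances, max_radius):
    """Divide nearest-atom distances by the maximum scan radius.

    Distances beyond the radius (including inf for "no atom found") are
    clipped to the radius first, which is an assumption of this sketch.
    """
    if max_radius <= 0:
        raise ValueError("max_radius must be positive")
    distances = np.asarray(distances, dtype=float)
    return np.minimum(distances, max_radius) / max_radius


normalized = normalize_distances([0.0, 3.0, 12.0, np.inf], max_radius=6.0)
```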

As described above, for amino acids having alpha carbon atoms, the distance can be the nearest-alpha-carbon distance, measured from the corresponding voxel center to the nearest alpha carbon atom of the corresponding amino acid. For amino acids having beta carbon atoms, the distance can be the nearest-beta-carbon distance, measured from the corresponding voxel center to the nearest beta carbon atom of the corresponding amino acid. Similarly, for amino acids having backbone atoms, the distance can be the nearest-backbone-atom distance, measured from the corresponding voxel center to the nearest backbone atom of the corresponding amino acid, and for amino acids having side chain atoms, the distance can be the nearest-side-chain-atom distance, measured from the corresponding voxel center to the nearest side chain atom of the corresponding amino acid. In some implementations, the distances can additionally or alternatively include distances to the second, third, fourth nearest atoms, and so on.

Per-Amino-Acid Distance Channels
FIG. 6 shows an example of 21 per-amino-acid distance channels 600. Each column of FIG. 6 corresponds to a respective one of the 21 per-amino-acid distance channels 602-642. Each per-amino-acid distance channel comprises a distance value for each of the voxels 514 of the voxel grid 522. For example, the per-amino-acid distance channel 602 for alanine (A) comprises a distance value for each of the voxels 514 of the voxel grid 522. As discussed above, the voxel grid 522 is a 3D grid of size 3×3×3 and comprises 27 voxels. Accordingly, although FIG. 6 depicts the voxels 514 in two dimensions (e.g., nine voxels in a 3×3 grid), each per-amino-acid distance channel may comprise 27 voxel-wise distance values for the 3×3×3 voxel grid.

Directionality Encoding
In some implementations, the disclosed technology uses a directionality parameter to specify the directionality of the reference amino acids in the reference amino acid sequence 202. In some implementations, the disclosed technology uses a directionality parameter to specify the directionality of the alternative amino acids in the alternative amino acid sequence 212. In some implementations, the disclosed technology uses a directionality parameter to specify the position in the protein 200 that experiences the target variant at the amino acid level.

As discussed above, all the distance values in the 21 per-amino-acid distance channels 602-642 are measured from respective nearest atoms to the voxels 514 in the voxel grid 522. Each of these nearest atoms originates from one of the reference amino acids in the reference amino acid sequence 202. The originating reference amino acids that contain the nearest atoms can be classified into two categories: (1) those that precede the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202, and (2) those that succeed the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202. The originating reference amino acids in the first category can be called preceding reference amino acids, and those in the second category can be called succeeding reference amino acids.

The directionality parameter is applied to those distance values in the 21 per-amino-acid distance channels 602-642 that are measured from nearest atoms originating from the preceding reference amino acids. In one implementation, the directionality parameter is multiplied with such distance values. The directionality parameter can be any number, such as -1.
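Applying the directionality parameter can be sketched as a masked multiplication. The sketch assumes that, alongside each distance value, the sequence position of the residue that owns the nearest atom has been recorded; that bookkeeping array and the name `apply_directionality` are assumptions, while the multiplier of -1 is the example value given above.

```python
import numpy as np


def apply_directionality(distances, nearest_positions, variant_position,
                         direction=-1.0):
    """Multiply distances whose nearest atom comes from a preceding residue.

    distances:         per-voxel distance values for one channel.
    nearest_positions: sequence position of the residue owning each voxel's
                       nearest atom (same shape as `distances`).
    variant_position:  sequence position of the variant residue.
    """
    distances = np.asarray(distances, dtype=float)
    nearest_positions = np.asarray(nearest_positions)
    precedes = nearest_positions < variant_position
    return np.where(precedes, distances * direction, distances)


signed = apply_directionality([1.0, 2.0, 3.0], [3, 16, 40], variant_position=16)
```

After this step, the sign of a distance value tells the classifier on which side of the variant, along the sequence, the nearest atom's residue lies.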

As a result of applying the directionality parameter, the 21 per-amino-acid distance channels 600 include some distance values that indicate to the pathogenicity classifier which end of the protein 200 is the start terminal and which end is the end terminal. This also allows the pathogenicity classifier to reconstruct the protein sequence from the 3D protein structure information supplied by the distance channels and by the reference and allele channels.

Distance Channel Tensor
FIG. 7 is a schematic diagram of a distance channel tensor 700. The distance channel tensor 700 is a voxelized representation of the per-amino-acid distance channels 600 from FIG. 6. In the distance channel tensor 700, the 21 per-amino-acid distance channels 602-642 are concatenated voxel-wise, like the RGB channels of a color image. The voxelized dimensionality of the distance channel tensor 700 is 21×3×3×3 (where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels), although FIG. 7 is a 2D depiction of dimensionality 21×3×3.
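The voxel-wise concatenation of the 21 channels corresponds to stacking them along a leading channel axis. The sketch below uses random placeholder values in place of real distance channels; the channels-first layout matches the 21×3×3×3 dimensionality stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: one 3x3x3 array of distance values per amino acid category.
channels = [rng.random((3, 3, 3)) for _ in range(21)]

# Stack along a new leading axis, analogous to the RGB channels of an image.
distance_tensor = np.stack(channels, axis=0)
```

Indexing `distance_tensor[:, i, j, k]` then returns the 21 distance values that belong to a single voxel.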

One-Hot Encodings
FIG. 8 shows one-hot encodings 800 of the reference amino acid 204 and the alternative amino acid 214. In FIG. 8, the left column is a one-hot encoding 802 of the reference amino acid glycine (G) 204, with a one for the glycine amino acid category and zeros for all the other amino acid categories. The right column is a one-hot encoding 804 of the variant/alternative amino acid alanine (A) 214, with a one for the alanine amino acid category and zeros for all the other amino acid categories.
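A one-hot encoding over the 21 categories can be sketched as follows. The category ordering and the `-` symbol for the stop/gap category are assumptions of this sketch; any fixed ordering works as long as it is used consistently for the reference and alternative encodings.

```python
import numpy as np

# Assumed ordering: 20 natural amino acids plus "-" for the stop/gap category.
CATEGORIES = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(amino_acid: str) -> np.ndarray:
    """Return a length-21 vector with a one at the amino acid's category."""
    vec = np.zeros(len(CATEGORIES), dtype=np.float32)
    vec[CATEGORIES.index(amino_acid)] = 1.0   # raises ValueError if unknown
    return vec


ref_encoding = one_hot("G")   # reference amino acid glycine
alt_encoding = one_hot("A")   # alternative amino acid alanine
```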

FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid 902 and a voxelized one-hot encoded variant/alternative amino acid 912. The voxelized one-hot encoded reference amino acid 902 is a voxelized representation of the one-hot encoding 802 of the reference amino acid glycine (G) 204 from FIG. 8. The voxelized one-hot encoded alternative amino acid 912 is a voxelized representation of the one-hot encoding 804 of the variant/alternative amino acid alanine (A) 214 from FIG. 8. The voxelized dimensionality of the voxelized one-hot encoded reference amino acid 902 is 21×1×1×1 (where 21 denotes the 21 amino acid categories), although FIG. 9 is a 2D depiction of dimensionality 21×1×1. Similarly, the voxelized dimensionality of the voxelized one-hot encoded alternative amino acid 912 is 21×1×1×1, again depicted in FIG. 9 in 2D with dimensionality 21×1×1.

Reference Allele Tensor
FIG. 10 schematically illustrates a concatenation process 1000 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7 with a reference allele tensor 1004. The reference allele tensor 1004 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded reference amino acid 902 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded reference amino acid 902 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 522, such that the reference allele tensor 1004 has a corresponding copy of the voxelized one-hot encoded reference amino acid 910 for each of the voxels 514 in the voxel grid 522.

The concatenation process 1000 produces a concatenated tensor 1010. The voxelized dimensionality of the reference allele tensor 1004 is 21×3×3×3 (where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels), although FIG. 10 is a 2D depiction of the reference allele tensor 1004 with dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1010 is 42×3×3×3, although FIG. 10 is a 2D depiction of the concatenated tensor 1010 with dimensionality 42×3×3.

Alternative Allele Tensor
FIG. 11 schematically illustrates a concatenation process 1100 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7, the reference allele tensor 1004 of FIG. 10, and an alternative allele tensor 1104. The alternative allele tensor 1104 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded alternative amino acid 912 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded alternative amino acid 912 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 522, such that the alternative allele tensor 1104 has a corresponding copy of the voxelized one-hot encoded alternative amino acid 910 for each of the voxels 514 in the voxel grid 522.

The concatenation process 1100 produces a concatenated tensor 1110. The voxelized dimensionality of the alternative allele tensor 1104 is 21×3×3×3 (where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels), although FIG. 11 is a 2D depiction of the alternative allele tensor 1104 with dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1110 is 63×3×3×3, although FIG. 11 is a 2D depiction of the concatenated tensor 1110 with dimensionality 63×3×3.
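The replication of a one-hot vector across the grid and the subsequent channel-wise concatenation can be sketched as below. The helper name `broadcast_allele`, the zero-filled stand-in for the distance channel tensor, and the particular one-hot indices are illustrative assumptions; the 21×3×3×3 and 63×3×3×3 shapes follow the dimensionalities stated above.

```python
import numpy as np


def broadcast_allele(one_hot_vec, grid_shape):
    """Copy a length-C one-hot vector to every voxel: (C,) -> (C, *grid_shape)."""
    vec = np.asarray(one_hot_vec, dtype=float)
    target = (vec.size, *grid_shape)
    return np.broadcast_to(vec.reshape(-1, 1, 1, 1), target).copy()


grid_shape = (3, 3, 3)
distance_tensor = np.zeros((21, *grid_shape))   # stand-in for real distances

ref = np.zeros(21)
ref[5] = 1.0    # arbitrary index standing in for the reference category
alt = np.zeros(21)
alt[0] = 1.0    # arbitrary index standing in for the alternative category

ref_tensor = broadcast_allele(ref, grid_shape)
alt_tensor = broadcast_allele(alt, grid_shape)

# Concatenate along the channel axis: 21 + 21 + 21 = 63 channels.
concatenated = np.concatenate([distance_tensor, ref_tensor, alt_tensor], axis=0)
```

Every voxel therefore sees the same reference and alternative encoding, while the distance channels vary from voxel to voxel.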

In some implementations, the runtime logic 184 processes the concatenated tensor 1110 through the pathogenicity classifier to determine the pathogenicity of the variant/alternative amino acid alanine (A) 214, which is in turn inferred as the pathogenicity determination for the underlying nucleotide variant that produces the variant/alternative amino acid alanine (A) 214.

Evolutionary Conservation Channels
Predicting the functional consequences of variants relies, at least in part, on the assumption that amino acids critical to a protein family are conserved through evolution due to negative selection (i.e., amino acid changes at these sites were deleterious in the past), and on the assumption that mutations at these sites have an increased likelihood of being pathogenic (disease-causing) in humans. In general, homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of the different amino acids observed at the target position in the alignment.

Accordingly, the disclosed technology concatenates the distance channel tensor 700, the reference allele tensor 1004, and the alternative allele tensor 1104 with evolutionary channels. One example of the evolutionary channels is pan-amino acid conservation frequencies. Another example of the evolutionary channels is per-amino-acid conservation frequencies.

In some implementations, the evolutionary channels are constructed using position weight matrices (PWMs). In other implementations, the evolutionary channels are constructed using position-specific frequency matrices (PSFMs). In yet other implementations, the evolutionary channels are constructed using computational tools such as SIFT, PolyPhen, and PANTHER-PSEC. In yet other implementations, the evolutionary channels are preservation channels based on evolutionary preservation. Preservation is related to conservation because it also reflects the effect of negative selection that has acted to prevent evolutionary change at a given site in a protein.

Pan-Amino Acid Evolutionary Profiles
FIG. 12 is a flowchart of a process 1200 of a system for determining pan-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), in accordance with one implementation of the disclosed technology. FIGS. 12 through 18 are described together.

At step 1202, a similar sequence finder 1204 of the system retrieves amino acid sequences that are similar (homologous) to the reference amino acid sequence 202. The similar amino acid sequences can be selected from multiple species, such as primates, mammals, and vertebrates.

At step 1212, an aligner 1214 of the system aligns the reference amino acid sequence 202 position by position with the similar amino acid sequences; that is, the aligner 1214 performs a multiple sequence alignment. FIG. 14 shows an example multiple sequence alignment 1400 of the reference amino acid sequence 202 across 99 species. In some implementations, the multiple sequence alignment 1400 can be partitioned to generate, for example, a first position frequency matrix 1402 for primates, a second position frequency matrix 1412 for mammals, and a third position frequency matrix 1422 for primates. In other implementations, a single position frequency matrix is generated across the 99 species.

At step 1222, a pan-amino acid conservation frequency calculator 1224 of the system uses the multiple sequence alignment to determine pan-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1232, a nearest atom finder 1234 of the system finds the nearest atoms to the voxels 514 in the voxel grid 522. In some implementations, the search for the voxel-wise nearest atoms need not be confined to any particular amino acid category or atom type. That is, the voxel-wise nearest atoms can be selected across amino acid categories and atom types, as long as they are the atoms closest to the respective voxel centers. In other implementations, the search for the voxel-wise nearest atoms can be confined to a particular atom category, such as only particular atomic elements (for example, oxygen, nitrogen, and hydrogen), only alpha-carbon atoms, only beta-carbon atoms, only side chain atoms, or only backbone atoms.

At step 1242, an amino acid selector 1244 of the system selects those reference amino acids in the reference amino acid sequence 202 that contain the nearest atoms identified at step 1232. Such reference amino acids can be referred to as the nearest reference amino acids. FIG. 13 shows an example of locating the nearest atoms 1302 to the voxels 514 in the voxel grid 522 and mapping each voxel to the nearest reference amino acid 1312 that contains its nearest atom 1302. This is identified in FIG. 13 as "voxels-to-nearest amino acids mapping 1300."
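As a minimal sketch of steps 1232 and 1242 for a single voxel, the nearest atom can be found with a brute-force distance search, and the residue position carried by that atom then identifies the nearest reference amino acid. The atom record layout and the function name `nearest_residue` are hypothetical; a practical system would likely use a spatial index rather than a linear scan.

```python
import math


def nearest_residue(voxel_center, atoms):
    """Return (residue_position, distance) for the atom closest to voxel_center.

    `atoms` is a list of (residue_position, (x, y, z)) pairs (hypothetical layout).
    """
    best = min(atoms, key=lambda atom: math.dist(voxel_center, atom[1]))
    return best[0], math.dist(voxel_center, best[1])


# Hypothetical atoms belonging to residues at positions 14, 15, and 16.
atoms = [
    (14, (0.0, 0.0, 5.0)),
    (15, (1.0, 0.0, 0.0)),
    (16, (0.0, 4.0, 0.0)),
]

position, distance = nearest_residue((0.0, 0.0, 0.0), atoms)
```

Restricting the search to a particular atom category (for example, only alpha-carbon atoms) amounts to filtering `atoms` before calling the function.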

At step 1252, a voxelizer 1254 of the system voxelizes the pan-amino acid conservation frequencies of the nearest reference amino acids. FIG. 15 shows an example of determining the pan-amino acid conservation frequency sequence for the first voxel (1, 1) in the voxel grid 522, also referred to herein as "per-voxel evolutionary profile determination 1500."

Referring to FIG. 13, the nearest reference amino acid mapped to the first voxel (1, 1) is the aspartic acid (D) amino acid at position 15 in the reference amino acid sequence 202. The multiple sequence alignment of the reference amino acid sequence 202 with, for example, 99 homologous amino acid sequences from 99 species is then analyzed at position 15. This position-specific, cross-species analysis reveals how many instances of amino acids from each of the 21 amino acid categories are found at position 15 across the 100 aligned amino acid sequences (that is, the reference amino acid sequence 202 plus the 99 homologous amino acid sequences).

In the example shown in FIG. 15, the aspartic acid (D) amino acid is found at position 15 in 96 of the 100 aligned amino acid sequences. Accordingly, the aspartic acid amino acid category 1504 is assigned a pan-amino acid conservation frequency of 0.96. Similarly, in the illustrated example, the valine (V) amino acid is found at position 15 in 4 of the 100 aligned amino acid sequences. Accordingly, the valine amino acid category 1514 is assigned a pan-amino acid conservation frequency of 0.04. Because no instances of amino acids from the other amino acid categories are detected at position 15, the remaining amino acid categories are assigned a pan-amino acid conservation frequency of 0. In this way, each of the 21 amino acid categories is assigned a respective pan-amino acid conservation frequency, and these frequencies can be encoded in the pan-amino acid conservation frequency sequence 1502 for the first voxel (1, 1).
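The arithmetic of this example can be reproduced with a short sketch. The ordering of the categories and the use of a gap symbol as the 21st category are assumptions of the sketch, as is the function name `pan_amino_acid_frequencies`.

```python
from collections import Counter

# Assumption: 20 standard amino acids plus a gap symbol as the 21st category.
CATEGORIES = "ACDEFGHIKLMNPQRSTVWY-"


def pan_amino_acid_frequencies(column):
    """Return the 21 per-category frequencies for one alignment column."""
    counts = Counter(column)
    return [counts[aa] / len(column) for aa in CATEGORIES]


# Position 15 across the 100 aligned sequences: 96 carry D and 4 carry V.
column_15 = "D" * 96 + "V" * 4
freqs = pan_amino_acid_frequencies(column_15)

d_freq = freqs[CATEGORIES.index("D")]  # 96 / 100
v_freq = freqs[CATEGORIES.index("V")]  # 4 / 100
```

The resulting list of 21 values plays the role of the frequency sequence assigned to the voxel whose nearest reference amino acid sits at position 15.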

FIG. 16 shows the respective pan-amino acid conservation frequencies 1612 to 1692 determined for respective ones of the voxels 514 in the voxel grid 522 using the position frequency logic described with respect to FIG. 15, also referred to herein as "voxels-to-evolutionary profiles mapping 1600."

The per-voxel evolutionary profiles 1602 are then used by the voxelizer 1254 to generate the voxelized per-voxel evolutionary profiles 1700 shown in FIG. 17. Often, each of the voxels 514 in the voxel grid 522 has a different pan-amino acid conservation frequency sequence, and therefore a different voxelized per-voxel evolutionary profile, because the voxels are regularly mapped to different nearest atoms and therefore to different nearest reference amino acids. Of course, when two or more voxels have the same nearest atom, and thereby the same nearest reference amino acid, the same pan-amino acid conservation frequency sequence and the same voxelized per-voxel evolutionary profile are assigned to each of the two or more voxels.

FIG. 18 shows an example of an evolutionary profile tensor 1800 in which the voxelized per-voxel evolutionary profiles 1700 are concatenated voxel-wise with one another according to the spatial arrangement of the voxels 514 in the voxel grid 522. The voxelized dimensionality of the evolutionary profile tensor 1800 is 21×3×3×3 (where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels). FIG. 18, however, is a 2D depiction of the evolutionary profile tensor 1800 with dimensionality 21×3×3.
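One possible way to picture this arrangement is as a channel-first nested array in which the first index selects the amino acid category and the remaining three indices select the voxel. The sketch below is illustrative only; the dictionary layout of `profiles` and the placeholder profile values are hypothetical.

```python
GRID = (3, 3, 3)  # 27 voxels
CHANNELS = 21     # one channel per amino acid category


def evolutionary_profile_tensor(voxel_profiles):
    """Arrange per-voxel profiles as tensor[channel][i][j][k]."""
    nx, ny, nz = GRID
    return [
        [[[voxel_profiles[(i, j, k)][c] for k in range(nz)]
          for j in range(ny)]
         for i in range(nx)]
        for c in range(CHANNELS)
    ]


# Hypothetical placeholder profiles: each voxel has frequency 1.0 in one channel.
profiles = {}
for i in range(3):
    for j in range(3):
        for k in range(3):
            profile = [0.0] * CHANNELS
            profile[(i + 3 * j + 9 * k) % CHANNELS] = 1.0
            profiles[(i, j, k)] = profile

tensor = evolutionary_profile_tensor(profiles)
dims = (len(tensor), len(tensor[0]), len(tensor[0][0]), len(tensor[0][0][0]))
```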

At step 1262, the concatenator 174 concatenates the evolutionary profile tensor 1800 voxel-wise with the distance channel tensor 700. In some implementations, the evolutionary profile tensor 1800 is concatenated voxel-wise with the concatenated tensor 1110 to generate a further concatenated tensor (not shown) of dimensionality 84×3×3×3.
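The channel arithmetic can be checked with a small sketch: concatenating a 21-channel tensor with another tensor on the same 3×3×3 grid yields 84 channels only if the other tensor has 63 channels, so the sketch assumes a 63-channel stand-in for the concatenated tensor 1110. The helper names `shape`, `zeros`, and `concat_channels` are hypothetical.

```python
def shape(tensor):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def zeros(channels, n=3):
    """Create a channel-first tensor of zeros on an n x n x n voxel grid."""
    return [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(channels)]


def concat_channels(*tensors):
    """Concatenate channel-first tensors voxel-wise, i.e. along the channel axis."""
    if len({shape(t)[1:] for t in tensors}) != 1:
        raise ValueError("all tensors must share the same voxel grid")
    return [channel for t in tensors for channel in t]


prior_tensor = zeros(63)          # assumed 63-channel stand-in for tensor 1110
evolutionary_profile = zeros(21)  # stand-in for tensor 1800
combined = concat_channels(prior_tensor, evolutionary_profile)
```

Because the concatenation is along the channel axis, every voxel keeps its spatial location and simply gains additional features.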

At step 1272, the runtime logic 184 processes the further concatenated tensor of dimensionality 84×3×3×3 through the pathogenicity classifier to determine a pathogenicity of the target variant, which is in turn inferred as the pathogenicity determination of the underlying nucleotide variant that produces the target variant at the amino acid level.

Per-Amino Acid Evolutionary Profiles
FIG. 19 is a flowchart illustrating a process 1900 of a system for determining per-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing). In FIG. 19, steps 1202 and 1212 are the same as in FIG. 12.

At step 1922, a per-amino acid conservation frequency calculator 1924 of the system uses the multiple sequence alignment to determine per-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1932, a nearest atom finder 1934 of the system finds, for each of the voxels 514 in the voxel grid 522, 21 nearest atoms, one for each of the 21 amino acid categories. The 21 nearest atoms differ from one another because they are selected from different amino acid categories. This leads to the selection of 21 unique nearest reference amino acids for a particular voxel, which in turn leads to the generation of 21 unique position frequency matrices for the particular voxel, which in turn leads to the determination of 21 unique per-amino acid conservation frequencies for the particular voxel.

At step 1942, an amino acid selector 1944 of the system selects, for each of the voxels 514 in the voxel grid 522, the 21 reference amino acids in the reference amino acid sequence 202 that contain the 21 nearest atoms identified at step 1932. Such reference amino acids can be referred to as the nearest reference amino acids.

At step 1952, a voxelizer 1954 of the system voxelizes the per-amino acid conservation frequencies of the 21 nearest reference amino acids identified for a particular voxel at step 1942. Because the 21 nearest reference amino acids correspond to different underlying nearest atoms, they are necessarily located at 21 different positions in the reference amino acid sequence 202. Accordingly, for the particular voxel, 21 position frequency matrices can be generated for the 21 nearest reference amino acids. The 21 position frequency matrices can be generated across multiple species whose homologous amino acid sequences are aligned position by position with the reference amino acid sequence 202, as described above with respect to FIGS. 12 to 15.

The 21 position frequency matrices can then be used to calculate 21 position-specific conservation scores for the 21 nearest reference amino acids identified for the particular voxel. These 21 position-specific conservation scores form the per-amino acid conservation frequencies for the particular voxel, analogous to the pan-amino acid conservation frequency sequence 1502 in FIG. 15. The two differ, however: the sequence 1502 has many zero entries, whereas each element (feature) in the per-amino acid conservation frequency sequence has a value (for example, a floating point number), because the 21 nearest reference amino acids across the 21 amino acid categories necessarily have different positions, which yield different position frequency matrices and thereby different per-amino acid conservation frequencies.
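A hedged sketch of this per-category lookup for one voxel follows. It assumes, purely for illustration, that the position-specific conservation score of a nearest reference amino acid is the frequency of that amino acid's own category at its alignment position, and that a category with no atom available contributes 0.0; neither detail is specified above. Only three categories are used to keep the example short, and all names and values are hypothetical.

```python
import math


def per_amino_acid_frequencies(voxel_center, atoms, pfm, categories):
    """Return one conservation value per amino acid category for a single voxel.

    For each category, the nearest atom belonging to a residue of that category
    is located, and the conservation of that residue's position is read from
    the position frequency data `pfm` (position -> list of per-category values).
    """
    features = []
    for col, aa in enumerate(categories):
        candidates = [a for a in atoms if a["residue"] == aa]
        if not candidates:
            features.append(0.0)  # assumption: no atom of this category is available
            continue
        nearest = min(candidates, key=lambda a: math.dist(voxel_center, a["xyz"]))
        features.append(pfm[nearest["position"]][col])
    return features


categories = "DVK"  # shortened from 21 categories for illustration
atoms = [
    {"residue": "D", "position": 15, "xyz": (1.0, 0.0, 0.0)},
    {"residue": "D", "position": 40, "xyz": (3.0, 0.0, 0.0)},
    {"residue": "V", "position": 22, "xyz": (0.0, 2.0, 0.0)},
]
pfm = {
    15: [0.96, 0.04, 0.0],
    40: [0.50, 0.00, 0.5],
    22: [0.10, 0.90, 0.0],
}

features = per_amino_acid_frequencies((0.0, 0.0, 0.0), atoms, pfm, categories)
```

Unlike the pan-amino acid case, each feature here is drawn from a different alignment position, which is why the resulting vector is typically dense.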

The above process is executed for each of the voxels 514 in the voxel grid 522, and the resulting per-voxel per-amino acid conservation frequencies are voxelized, tensorized, concatenated, and processed for pathogenicity determination in the same manner as the pan-amino acid conservation frequencies described with respect to FIGS. 12 to 18.

Annotation Channels
FIG. 20 shows various examples of voxelized annotation channels 2000 that are concatenated with the distance channel tensor 700. In some implementations, the voxelized annotation channels are one-hot indicators for different protein annotations, for example, whether an amino acid (residue) is part of a transmembrane region, a signal peptide, an active site, or any other binding site, or whether the residue is subject to post-translational modifications, PathRatio (see Pei P, Zhang A: A Topological Measurement for Weighted Protein Interaction Network. CSB 2005, 268-278), and so on. Additional examples of the annotation channels can be found below in the Particular Implementations section and in the claims.

The voxelized annotation channels are arranged voxel-wise such that the voxels can either share the same annotation sequence, like the voxelized reference allele and alternative allele sequences (for example, annotation channels 2002, 2004, and 2006), or have their own respective annotation sequences, like the voxelized per-voxel evolutionary profiles 1700 (for example, annotation channels 2012, 2014, and 2016, as indicated by different colors).

The annotation channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination in the same manner as the pan-amino acid conservation frequencies described with respect to FIGS. 12 to 18.

Structure Confidence Channels
The disclosed technology can also concatenate various voxelized structure confidence channels with the distance channel tensor 700. Some examples of the structure confidence channels include: the GMQE score (provided by SwissModel); the B-factor; the temperature factor column of homology models (which indicates how well a residue satisfies the (physical) constraints in the protein structure); the normalized number of aligned template proteins for the residue nearest to the center of a voxel (alignments are provided by HHpred; for example, if a voxel is nearest to a residue at which 3 of 6 template structures align, the feature has the value 3/6 = 0.5); the minimum, maximum, and mean TM scores; and the predicted TM scores of the template protein structures that align to the residue nearest to a voxel (continuing the example above, if the 3 aligned template structures have TM scores of 0.5, 0.5, and 1.0, the minimum is 0.5, the mean is 2/3, and the maximum is 1.0). The TM scores can be provided per protein template by HHpred. Additional examples of the structure confidence channels can be found below in the Particular Implementations section and in the claims.

The voxelized structure confidence channels are arranged voxel-wise such that the voxels can either share the same structure confidence sequence, like the voxelized reference allele and alternative allele sequences, or have their own respective structure confidence sequences, like the voxelized per-voxel evolutionary profiles 1700.

The structure confidence channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination in the same manner as the pan-amino acid conservation frequencies described with respect to FIGS. 12 to 18.

Pathogenicity Classifier
FIG. 21 illustrates different combinations and permutations of input channels that can be provided as inputs 2102 to a pathogenicity classifier 2108 for a pathogenicity determination 2106 of a target variant. One of the inputs 2102 can be distance channels 2104 generated by a distance channel generator 2272. FIG. 22 shows different methods of calculating the distance channels 2104.

In one implementation, the distance channels 2104 are generated based on distances 2202 between voxel centers and atoms across a plurality of atomic elements, irrespective of amino acids. In some implementations, the distances 2202 are normalized by a maximum scan radius to generate normalized distances 2202a. In another implementation, the distance channels 2104 are generated based on distances 2212 between voxel centers and alpha-carbon atoms on an amino acid basis. In some implementations, the distances 2212 are normalized by the maximum scan radius to generate normalized distances 2212a. In another implementation, the distance channels 2104 are generated based on distances 2222 between voxel centers and beta-carbon atoms on an amino acid basis. In some implementations, the distances 2222 are normalized by the maximum scan radius to generate normalized distances 2222a. In another implementation, the distance channels 2104 are generated based on distances 2232 between voxel centers and side chain atoms on an amino acid basis. In some implementations, the distances 2232 are normalized by the maximum scan radius to generate normalized distances 2232a. In another implementation, the distance channels 2104 are generated based on distances 2242 between voxel centers and backbone atoms on an amino acid basis. In some implementations, the distances 2242 are normalized by the maximum scan radius to generate normalized distances 2242a. In yet another implementation, the distance channels 2104 are generated based on distances 2252 (one feature) between voxel centers and the respective nearest atoms, irrespective of atom type and amino acid type. In yet another implementation, the distance channels 2104 are generated based on distances 2262 (one feature) between voxel centers and atoms from non-standard amino acids.

In some implementations, the distances between the voxels and the atoms are calculated based on polar coordinates of the voxels and the atoms. The polar coordinates are parameterized by angles between the voxels and the atoms. In one implementation, this angle information is used to generate an angle channel for the voxels (that is, independent of the distance channels). In some implementations, the angles between a nearest atom and its neighboring atoms (for example, backbone atoms) can be used as features that are encoded with the voxels.
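As an illustration of the normalization by a maximum scan radius, the sketch below computes, for one voxel, a normalized distance to the nearest alpha-carbon atom of each amino acid category. Clipping distances at the scan radius, and assigning 1.0 to a category with no atom in range, are assumptions of the sketch; the function name and the three-category example are hypothetical.

```python
import math


def distance_channels(voxel_center, alpha_carbons, categories, max_scan_radius):
    """Return one normalized nearest-atom distance per amino acid category.

    `alpha_carbons` is a list of (amino_acid, (x, y, z)) pairs.
    """
    channels = []
    for aa in categories:
        dists = [math.dist(voxel_center, xyz) for res, xyz in alpha_carbons if res == aa]
        nearest = min(dists, default=max_scan_radius)
        # Assumption: distances beyond the scan radius are clipped to the radius.
        channels.append(min(nearest, max_scan_radius) / max_scan_radius)
    return channels


alpha_carbons = [
    ("D", (3.0, 4.0, 0.0)),   # 5 units from the voxel center
    ("V", (0.0, 0.0, 12.0)),  # outside the scan radius
]
channels = distance_channels((0.0, 0.0, 0.0), alpha_carbons, "DVK", max_scan_radius=10.0)
```

With this convention every channel value lies in [0, 1], which keeps the distance features on a scale comparable to one-hot and frequency features.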

Another one of the inputs 2102 can be a feature 2114 indicating missing atoms within a specified radius.

Another one of the inputs 2102 can be a one-hot encoding 2124 of the reference amino acid. Another one of the inputs 2102 can be a one-hot encoding 2134 of the variant/alternative amino acid.
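A one-hot encoding of this kind can be sketched in a few lines. The category ordering, the gap symbol used as the 21st category, and the example substitution of aspartic acid (D) by valine (V) are assumptions chosen for illustration.

```python
# Assumption: 20 standard amino acids plus a gap symbol as the 21st category.
CATEGORIES = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(amino_acid):
    """Return a 21-element vector with a single 1.0 at the amino acid's category."""
    if amino_acid not in CATEGORIES:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return [1.0 if aa == amino_acid else 0.0 for aa in CATEGORIES]


reference_encoding = one_hot("D")    # hypothetical reference amino acid
alternative_encoding = one_hot("V")  # hypothetical alternative amino acid
```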

Another one of the inputs 2102 can be evolutionary channels 2144 generated by an evolutionary profile generator 2372 shown in FIG. 23. In one implementation, the evolutionary channels 2144 can be generated based on pan-amino acid conservation frequencies 2302. In another implementation, the evolutionary channels 2144 can be generated based on per-amino acid conservation frequencies 2312.

Another one of the inputs 2102 can be a feature 2154 indicating missing residues or missing evolutionary profiles.

Another one of the inputs 2102 can be annotation channels 2164 generated by an annotation generator 2472 shown in FIG. 24. In one implementation, the annotation channels 2164 can be generated based on molecular processing annotations 2402. In another implementation, the annotation channels 2164 can be generated based on region annotations 2412. In yet another implementation, the annotation channels 2164 can be generated based on site annotations 2422. In yet another implementation, the annotation channels 2164 can be generated based on amino acid modification annotations 2432. In yet another implementation, the annotation channels 2164 can be generated based on secondary structure annotations 2442. In yet another implementation, the annotation channels 2164 can be generated based on experimental information annotations 2452.

Another one of the inputs 2102 can be structure confidence channels 2174 generated by a structure confidence generator 2572 shown in FIG. 25. In one implementation, the structure confidence channels 2174 can be generated based on a global model quality estimation (GMQE) 2502. In another implementation, the structure confidence channels 2174 can be generated based on a qualitative model energy analysis (QMEAN) score 2512. In yet another implementation, the structure confidence channels 2174 can be generated based on temperature factors 2522. In yet another implementation, the structure confidence channels 2174 can be generated based on template modeling scores 2542. Examples of the template modeling scores 2542 include a minimum template modeling score 2542a, a mean template modeling score 2542b, and a maximum template modeling score 2542c.

A person skilled in the art will appreciate that any permutation and combination of the input channels can be concatenated into an input for processing through the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant. In some implementations, only a subset of the input channels is concatenated. The input channels can be concatenated in any order. In one implementation, the input channels can be concatenated into a single tensor by a tensor generator (input encoder) 2110. This single tensor can then be provided as input to the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant.

In one implementation, the pathogenicity classifier 2108 uses a convolutional neural network (CNN) with a plurality of convolution layers. In another implementation, the pathogenicity classifier 2108 uses recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), bidirectional LSTMs (Bi-LSTMs), and gated recurrent units (GRUs). In yet another implementation, the pathogenicity classifier 2108 uses both CNNs and RNNs. In yet another implementation, the pathogenicity classifier 2108 uses graph convolutional neural networks that model dependencies in graph-structured data. In yet another implementation, the pathogenicity classifier 2108 uses variational autoencoders (VAEs). In yet another implementation, the pathogenicity classifier 2108 uses generative adversarial networks (GANs). In yet another implementation, the pathogenicity classifier 2108 can also be a language model based on self-attention, such as those implemented by Transformers and BERT.

In still other implementations, the pathogenicity classifier 2108 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transposed convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatially separable convolutions, and deconvolutions. The pathogenicity classifier 2108 can use one or more loss functions, such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes, such as TFRecords, compressed encoding (for example, PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). It can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (such as an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (for example, nonlinear transformation functions such as the rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid, and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (for example, max or average pooling), global average pooling layers, attention mechanisms, and Gaussian error linear units.

The pathogenicity classifier 2108 is trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the pathogenicity classifier 2108 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the pathogenicity classifier 2108 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad. In other implementations, the pathogenicity classifier 2108 can be trained by unsupervised learning, semi-supervised learning, self-learning, reinforcement learning, multitask learning, multimodal learning, transfer learning, knowledge distillation, and so on.

FIG. 26 shows an example processing architecture 2600 of the pathogenicity classifier 2108, according to one implementation of the disclosed technology. The processing architecture 2600 includes a cascade of processing modules 2606, 2610, 2614, 2618, 2622, 2626, 2630, 2634, 2638, and 2642, each of which can include 1D convolutions (1×1×1 CONV), 3D convolutions (3×3×3 CONV), ReLU nonlinearity, and batch normalization (BN). Other examples of the processing modules include fully connected (FC) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In FIG. 26, "64" denotes the number of convolution filters applied by a particular processing module. In FIG. 26, the size of the input voxel 2602 is 15×15×15×8. FIG. 26 also shows the respective volume dimensionalities of the intermediate inputs 2604, 2608, 2612, 2616, 2620, 2624, 2628, 2632, 2636, and 2640 generated by the processing architecture 2600.
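How the volume dimensionality can change from module to module may be illustrated with the standard convolution output-size formula. The padding and stride settings used in FIG. 26 are not stated above, so the sketch below assumes stride 1 and shows both the unpadded and the padded case; the function name `conv3d_output_shape` is hypothetical, and the printed shapes are not a reproduction of the figure.

```python
def conv3d_output_shape(input_shape, kernel, filters, padding=0, stride=1):
    """Return the output shape of a cubic 3D convolution on a channels-last input."""
    *spatial, _channels = input_shape
    out = [(n + 2 * padding - kernel) // stride + 1 for n in spatial]
    return (*out, filters)


input_shape = (15, 15, 15, 8)  # input voxel size given for FIG. 26

# A 1x1x1 convolution with 64 filters changes only the channel count.
after_pointwise = conv3d_output_shape(input_shape, kernel=1, filters=64)

# A 3x3x3 convolution without padding shrinks each spatial axis by 2 ...
after_valid = conv3d_output_shape(after_pointwise, kernel=3, filters=64)

# ... whereas padding of 1 preserves the spatial size.
after_same = conv3d_output_shape(after_pointwise, kernel=3, filters=64, padding=1)
```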

FIG. 27 shows an example processing architecture 2700 of the pathogenicity classifier 2108, in accordance with one implementation of the technology disclosed. The processing architecture 2700 includes a cascade of processing modules 2708, 2714, 2720, 2726, 2732, 2738, 2744, 2750, 2756, 2762, 2768, 2774, and 2780, such as 1D convolutions (CONV 1D), 3D convolutions (CONV 3D), ReLU nonlinearities, and batch normalization (BN). Other examples of the processing modules include fully connected (dense) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In FIG. 27, "64" and "32" denote the number of convolution filters applied by a particular processing module. In FIG. 27, the size of an input voxel 2704 supplied by an input layer 2702 is 7×7×7×108. FIG. 27 also shows the respective volume dimensions of the intermediate inputs 2710, 2716, 2722, 2728, 2734, 2740, 2746, 2752, 2758, 2764, 2770, 2776, and 2782 generated by the processing architecture 2700, and of the resulting intermediate outputs 2706, 2712, 2718, 2724, 2730, 2736, 2742, 2748, 2754, 2760, 2766, 2772, 2778, and 2784.

A person skilled in the art will appreciate that other current and future artificial intelligence, machine learning, and deep learning models, datasets, and training techniques can be incorporated into the disclosed variant pathogenicity classifier without deviating from the spirit of the technology disclosed.

Performance Results as Objective Indicia of Inventiveness and Non-Obviousness
The variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as "PrimateAI 3D." "PrimateAI" is a commonly owned, previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned U.S. patent application Ser. Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149, and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).

FIGS. 28, 29, 30, and 31 use PrimateAI as a benchmark model to demonstrate PrimateAI 3D's classification superiority over PrimateAI. The performance results in FIGS. 28, 29, 30, and 31 are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets. PrimateAI 3D is trained on a training set that is different from the plurality of validation sets. PrimateAI 3D is trained on common human variants and variants from primates, which are used as the benign dataset, while simulated variants based on trinucleotide context are used as the unlabeled or pseudo-pathogenic dataset.

New developmental delay disorders (new DDD) is one example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with DDD as benign. A similar labeling scheme is used with the autism spectrum disorder (ASD) validation set shown in FIG. 31.

BRCA1 is another example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants, and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants. A similar labeling scheme is used with the other validation sets for the TP53 gene, the TP53S3 gene and its variants, and the other genes and their variants shown in FIG. 31.

FIG. 28 identifies the performance of the benchmark PrimateAI model with blue horizontal bars and the performance of the disclosed PrimateAI 3D model with orange horizontal bars. Green horizontal bars depict pathogenicity predictions derived by combining the respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In the legend, "ens10" denotes an ensemble of 10 PrimateAI 3D models, each trained with a different seed training dataset and randomly initialized with different weights and biases. Also, "7×7×7×2" denotes the size of the voxel grid used to encode the input channels during the training of the ensemble of 10 PrimateAI 3D models. For a given variant, the ensemble of 10 PrimateAI 3D models generates 10 respective pathogenicity predictions, which are subsequently combined (e.g., by averaging) to generate a final pathogenicity prediction for the given variant. This logic applies analogously to ensembles of different group sizes.
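The combination step described above can be sketched as follows. This is a minimal illustration assuming that each ensemble member emits one scalar pathogenicity score per variant and that the combination is a plain arithmetic mean; the function name `ensemble_prediction` and the ten scores are hypothetical and are not taken from the figures.

```python
from statistics import fmean


def ensemble_prediction(member_scores):
    """Combine per-model pathogenicity scores for one variant by averaging."""
    if not member_scores:
        raise ValueError("at least one model score is required")
    return fmean(member_scores)


# Hypothetical scores from an ensemble of 10 models for a single variant.
member_scores = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94, 0.90, 0.86]
final_score = ensemble_prediction(member_scores)
```

The same function applies unchanged to ensembles of other sizes, such as 20 models.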

Also in FIG. 28, the y-axis has the different validation sets and the x-axis has p-values. A greater p-value, i.e., a longer horizontal bar, denotes greater accuracy in differentiating benign variants from pathogenic variants. As demonstrated by the p-values in FIG. 28, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549 validation set). That is, the orange horizontal bars for PrimateAI 3D are consistently longer than the blue horizontal bars for PrimateAI.

Also in FIG. 28, the "mean" category along the y-axis shows the mean of the p-values determined for each of the validation sets. In the mean category as well, PrimateAI 3D outperforms PrimateAI.

In FIG. 29, PrimateAI is represented by blue horizontal bars; an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 3×3×3 is represented by red horizontal bars; an ensemble of 10 PrimateAI 3D models trained with a voxel grid of size 7×7×7 is represented by purple horizontal bars; an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 7×7×7 is represented by brown horizontal bars; and an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 17×17×17 is represented by pink horizontal bars.

Also in FIG. 29, the y-axis has the different validation sets and the x-axis has p-values. As before, a greater p-value, i.e., a longer horizontal bar, denotes greater accuracy in differentiating benign variants from pathogenic variants. As demonstrated by the p-values in FIG. 29, the different configurations of PrimateAI 3D outperform PrimateAI across most of the validation sets. That is, the red, purple, brown, and pink horizontal bars for PrimateAI 3D are mostly longer than the blue horizontal bars for PrimateAI.

Also in FIG. 29, the "mean" category along the y-axis shows the mean of the p-values determined for each of the validation sets. In the mean category as well, the different configurations of PrimateAI 3D outperform PrimateAI.

In FIG. 30, the red vertical bars represent PrimateAI and the cyan vertical bars represent PrimateAI 3D. In FIG. 30, the y-axis has p-values and the x-axis has the different validation sets. In FIG. 30, without exception, PrimateAI 3D outperforms PrimateAI across all of the validation sets. That is, the cyan vertical bars for PrimateAI 3D are always longer than the red vertical bars for PrimateAI.

FIG. 31 identifies the performance of the benchmark PrimateAI model with blue vertical bars and the performance of the disclosed PrimateAI 3D model with orange vertical bars. Green vertical bars depict pathogenicity predictions derived by combining the respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In FIG. 31, the y-axis has p-values and the x-axis has the different validation sets.

As demonstrated by the p-values in FIG. 31, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549_p53NULL_Nutlin-3 validation set). That is, the orange vertical bars for PrimateAI 3D are consistently longer than the blue vertical bars for PrimateAI.

Also in FIG. 31, a separate "mean" chart shows the mean of the p-values determined for each of the validation sets. In the mean chart as well, PrimateAI 3D outperforms PrimateAI.

The mean statistic can be biased by outliers. To address this, a separate "method ranks" chart is also depicted in FIG. 31. A higher rank denotes poorer classification accuracy. In the method ranks chart as well, PrimateAI 3D outperforms PrimateAI by having more counts of the lower ranks 1 and 2, whereas PrimateAI has all 3s.
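A rank-based summary of this kind can be sketched as follows. The sketch assumes that, within each validation set, methods are ordered by their plotted value with the largest value receiving rank 1, and that the chart then counts how often each method attains each rank. The function name `rank_counts`, the validation set names, and all numeric values are made up for illustration and do not reproduce FIG. 31.

```python
from collections import Counter


def rank_counts(scores_by_set):
    """Count, per method, how often it attains each rank (1 = best)."""
    counts = {}
    for scores in scores_by_set.values():
        # Larger score means better separation, so sort in descending order.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            counts.setdefault(method, Counter())[rank] += 1
    return counts


# Hypothetical per-validation-set scores for three methods.
scores_by_set = {
    "set_a": {"PrimateAI": 2.0, "PrimateAI 3D": 5.0, "combined": 6.0},
    "set_b": {"PrimateAI": 1.0, "PrimateAI 3D": 4.0, "combined": 3.0},
}
counts = rank_counts(scores_by_set)
```

Because only the ordering within each validation set matters, a single validation set with an extreme value cannot dominate the summary the way it can dominate a mean.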

It is also evident from FIGS. 28 to 31 that combining PrimateAI 3D with PrimateAI yields superior classification accuracy. That is, a protein can be fed to PrimateAI as an amino acid sequence to generate a first output, the same protein can be fed to PrimateAI 3D as a 3D voxelized protein structure to generate a second output, and the first and second outputs can be combined, or analyzed in aggregate, to generate a final pathogenicity prediction for a variant experienced by the protein.

Efficient Voxelization
FIG. 32 is a flowchart illustrating an efficient voxelization process 3200 that efficiently identifies the nearest atoms on a voxel-by-voxel basis.

The discussion now revisits the distance channels. As discussed above, the reference amino acid sequence 202 can contain different types of atoms, such as alpha-carbon atoms, beta-carbon atoms, oxygen atoms, nitrogen atoms, hydrogen atoms, and so on. Accordingly, as discussed above, the distance channels can be arranged by nearest alpha-carbon atoms, nearest beta-carbon atoms, nearest oxygen atoms, nearest nitrogen atoms, nearest hydrogen atoms, and so on. For example, in FIG. 6, each of the nine voxels 514 has 21 amino acid-wise distance channels for the nearest alpha-carbon atoms. FIG. 6 can be further expanded so that each of the nine voxels 514 also has 21 amino acid-wise distance channels for the nearest beta-carbon atoms, and so that each of the nine voxels 514 also has a nearest-generic-atom distance channel for the nearest atom irrespective of the type of the atom and the type of the amino acid. In this way, each of the nine voxels 514 can have 43 distance channels.
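The count of 43 distance channels follows from 21 alpha-carbon channels, 21 beta-carbon channels, and 1 generic nearest-atom channel. The sketch below enumerates such a channel layout. The channel names, the ordering, and the placeholder category labels `aa_0` to `aa_20` are assumptions for illustration; the text does not specify how the channels are named or ordered.

```python
# Placeholder labels for the 21 amino acid categories (hypothetical names).
AMINO_ACID_CATEGORIES = [f"aa_{i}" for i in range(21)]


def distance_channel_names(atom_types=("alpha_carbon", "beta_carbon")):
    """List one channel per (atom type, amino acid category) plus a generic channel."""
    names = [
        f"{atom_type}_{category}"
        for atom_type in atom_types
        for category in AMINO_ACID_CATEGORIES
    ]
    # One extra channel: nearest atom regardless of atom type and amino acid type.
    names.append("nearest_any_atom")
    return names


channels = distance_channel_names()
```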

The discussion now turns to the number of distance calculations required to identify the nearest atoms on a voxel-by-voxel basis for inclusion in the distance channels. Consider the example in FIG. 3, which depicts a total of 828 alpha-carbon atoms distributed across the 21 amino acid categories. To calculate the amino acid-wise distance channels 602-642 in FIG. 6, i.e., to determine the 189 distance values, distances are measured from each of the nine voxels 514 to each of the 828 alpha-carbon atoms, resulting in 9 * 828 = 7,452 distance calculations. In the 3D case of 27 voxels, this results in 27 * 828 = 22,356 distance calculations. When the 828 beta-carbon atoms are also included, this number increases to 27 * 1,656 = 44,712 distance calculations.
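The arithmetic above can be checked with a short calculation. The helper name `brute_force_distance_count` is a hypothetical label for the product of voxel count and atom count; the numeric inputs are the ones given in the text.

```python
def brute_force_distance_count(n_voxels, n_atoms):
    """Number of voxel-to-atom distances when every pair is measured."""
    return n_voxels * n_atoms


N_ALPHA_CARBONS = 828
N_BETA_CARBONS = 828

# 9 voxels (2D illustration) against the alpha-carbon atoms.
count_2d_alpha = brute_force_distance_count(9, N_ALPHA_CARBONS)
# 27 voxels (3D case) against the alpha-carbon atoms.
count_3d_alpha = brute_force_distance_count(27, N_ALPHA_CARBONS)
# 27 voxels against alpha-carbon and beta-carbon atoms together.
count_3d_both = brute_force_distance_count(27, N_ALPHA_CARBONS + N_BETA_CARBONS)
# Distance values retained: one per voxel per amino acid category.
retained_values_2d = 9 * 21
```

Only 189 of the 7,452 measured distances are retained in the 2D illustration, which is the inefficiency the later steps address.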

This means that the runtime complexity of identifying the nearest atoms on a voxel-by-voxel basis for a single protein voxelization is O(number of atoms * number of voxels), as illustrated by FIG. 35A. Furthermore, when the distance channels are calculated across a variety of attributes (e.g., different features or channels per voxel, such as annotation channels and structural confidence channels), the runtime complexity of a single protein voxelization increases to O(number of atoms * number of voxels * number of attributes).
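The O(number of atoms * number of voxels) baseline can be sketched as a double loop that measures every voxel-to-atom distance. This is an illustrative baseline only, with a call counter added to make the cost visible; the function name, the 3×3 grid of unit voxels, and the second atom coordinate are assumptions (the first coordinate is the example atom used later in the text).

```python
import math


def nearest_atoms_brute_force(voxel_centers, atom_coords):
    """For each voxel center, find the nearest atom by checking every atom."""
    nearest = []
    n_distance_calls = 0
    for center in voxel_centers:
        best_index, best_distance = None, math.inf
        for i, atom in enumerate(atom_coords):
            d = math.dist(center, atom)  # Euclidean distance
            n_distance_calls += 1
            if d < best_distance:
                best_index, best_distance = i, d
        nearest.append((best_index, best_distance))
    return nearest, n_distance_calls


# A 3x3 grid of unit voxels, listed row by row, with centers at x.5 coordinates.
centers = [(x + 0.5, y + 0.5) for y in range(3) for x in range(3)]
atoms = [(1.7456, 2.14323), (0.2, 0.3)]
nearest, calls = nearest_atoms_brute_force(centers, atoms)
```

The counter equals the number of voxels times the number of atoms, which is the quantity the efficient process is designed to avoid.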

As a result, the distance calculations can become the most compute-intensive part of the voxelization process, taking valuable compute resources away from critical runtime tasks like model training and model inference. Consider, for example, the case of model training with a training dataset of 7,000 proteins. Generating distance channels for a plurality of voxels across a plurality of amino acids, atoms, and attributes can involve over 100 voxelizations per protein, resulting in about 800,000 voxelizations in a single training iteration (epoch). A training run of 20 to 40 epochs, with rotation of the atomic coordinates in each epoch, can result in as many as 32 million voxelizations.

In addition to the high compute cost, the size of the data for 32 million voxelizations is too large to fit in main memory (e.g., more than 20 TB for a 15×15×15 voxel grid). Considering the repeated training runs needed for parameter optimization and ensemble learning, the memory footprint of the voxelization process becomes too large to store on disk, which makes the voxelization process a part of the model training rather than a precomputation step.

The technology disclosed provides an efficient voxelization process that achieves a speedup of up to about 100 times over the runtime complexity of O(number of atoms * number of voxels). The disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(number of atoms). In the case of different features or channels per voxel, the disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(number of atoms * number of attributes). As a result, the voxelization process becomes as fast as model training, shifting the computational bottleneck from voxelization back to computing neural network weights on processors such as GPUs, ASICs, TPUs, FPGAs, CGRAs, and so on.

In some implementations of the disclosed efficient voxelization process involving large voxel grids, the runtime complexity for a single protein voxelization is O(number of atoms + voxels), and O(number of atoms * number of attributes + voxels) in the case of different features or channels per voxel. The "+ voxels" complexity is observed when the number of atoms is very small compared to the number of voxels, for example, when there is one atom in a 100×100×100 voxel grid (i.e., one million voxels per atom). In such a scenario, the runtime is dominated by the overhead of the huge number of voxels, for example, allocating memory for one million voxels, initializing one million voxels to zero, and so on.

The discussion now turns to the details of the disclosed efficient voxelization process. FIGS. 32A, 32B, 33, 34, and 35B are discussed in tandem.

Starting with FIG. 32A, at step 3202, each atom (e.g., each of the 828 alpha-carbon atoms and each of the 828 beta-carbon atoms) is associated with a voxel that contains the atom (e.g., one of the nine voxels 514). The term "contains" refers to the 3D atomic coordinates of the atom being located within the voxel. The voxel that contains the atom is also referred to herein as the "atom-containing voxel."

FIGS. 32B and 33 describe how a voxel that contains a particular atom is selected. FIG. 33 uses 2D atomic coordinates as representative of 3D atomic coordinates. Note that, in the voxel grid 522, the voxels 514 are regularly spaced, each having the same step size (e.g., 1 angstrom (Å) or 2 Å).

Also in FIG. 33, the voxel grid 522 has magenta indices [0, 1, 2] along a first dimension (e.g., the x-axis) and cyan indices [0, 1, 2] along a second dimension (e.g., the y-axis). Also in FIG. 33, the respective voxels 514 in the voxel grid 522 are identified by green voxel indices [voxel 0, voxel 1, ..., voxel 8] and by black voxel center indices [(1, 1), (1, 2), ..., (3, 3)].

Also in FIG. 33, the center coordinates of the voxel centers along the first dimension, i.e., the first dimension voxel coordinates, are identified in orange. Also in FIG. 33, the center coordinates of the voxel centers along the second dimension, i.e., the second dimension voxel coordinates, are identified in red.

First, at step 3202a (step 1 in FIG. 33), the 3D atomic coordinates (1.7456, 2.14323) of a particular atom are quantized to generate the quantized 3D atomic coordinates (1.7, 2.1). The quantization can be achieved by rounding or by truncation of bits.
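The quantization step can be sketched as truncation to a fixed number of decimal places. Keeping one decimal place is an assumption inferred from the example values (1.7456, 2.14323) and (1.7, 2.1); the text itself speaks of rounding or truncating bits and does not fix the precision. The function name `quantize` is hypothetical.

```python
import math


def quantize(coords, decimals=1):
    """Truncate each coordinate to the given number of decimal places."""
    scale = 10 ** decimals
    # math.trunc drops the fractional part after scaling (truncation toward zero).
    return tuple(math.trunc(c * scale) / scale for c in coords)


quantized = quantize((1.7456, 2.14323))
```

For this example, rounding to one decimal place would give the same result; the two differ only when the dropped digits are at or above the midpoint.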

Then, at step 3202b (step 2 in FIG. 33), the voxel coordinates (or voxel centers, or voxel center coordinates) of the voxels 514 are assigned to the quantized 3D atomic coordinates on a dimension-by-dimension basis. For the first dimension, the quantized atomic coordinate 1.7 is assigned to voxel 1, because voxel 1 covers first dimension voxel coordinates ranging from 1 to 2 and is centered at 1.5 in the first dimension. Note that voxel 1 has index 1 along the first dimension, as opposed to index 0 along the second dimension.

For the second dimension, starting from voxel 1, the voxel grid 522 is traversed along the second dimension. This results in the quantized atomic coordinate 2.1 being assigned to voxel 7, because voxel 7 covers second dimension voxel coordinates ranging from 2 to 3 and is centered at 2.5 in the second dimension. Note that voxel 7 has index 2 along the second dimension, as opposed to index 1 along the first dimension.

Next, at step 3202c (step 3 in FIG. 33), the dimension indices corresponding to the assigned voxel coordinates are selected. That is, for voxel 1, index 1 is selected along the first dimension, and for voxel 7, index 2 is selected along the second dimension. A person skilled in the art will appreciate that the above steps can be executed analogously for a third dimension to select a dimension index along the third dimension.
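Because the voxels are regularly spaced, the per-dimension assignment in steps 3202b and 3202c can be sketched with integer division rather than a search. The sketch assumes a grid that starts at coordinate 0 with a step size of 1 and three voxels per dimension, which matches the ranges quoted for FIG. 33 (the voxel with index 1 covers 1 to 2, the voxel with index 2 covers 2 to 3). The function name and parameters are hypothetical.

```python
import math


def dimension_indices(quantized_coords, origin=0.0, step=1.0, size=3):
    """Map each quantized coordinate to the index of the voxel covering it."""
    indices = []
    for coordinate in quantized_coords:
        # The voxel with index i covers [origin + i*step, origin + (i+1)*step).
        index = math.floor((coordinate - origin) / step)
        if not 0 <= index < size:
            raise ValueError(f"coordinate {coordinate} lies outside the grid")
        indices.append(index)
    return tuple(indices)


indices = dimension_indices((1.7, 2.1))
```

The same call works for three coordinates, yielding one index per dimension, and no distance is computed at any point.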

Next, at step 3202d (step 4 in FIG. 33), an accumulated sum is generated by position-wise weighting the selected dimension indices by powers of a radix. The general idea behind positional numbering systems is that a numeric value is represented through increasing powers of a base (or radix); for example, binary is base 2, ternary is base 3, octal is base 8, and hexadecimal is base 16. This is often referred to as a weighted numbering system, because each position is weighted by a power of the radix. The set of valid digits for a positional numbering system is equal in size to the radix of that system. For example, the decimal system has ten digits, 0 through 9, and the ternary system has three digits, 0, 1, and 2. The largest valid digit in a radix system is one smaller than the radix (so 8 is not a valid digit in any radix system smaller than 9). Any decimal integer can be expressed exactly in any other integer radix system, and vice versa.

Returning to the example in FIG. 33, the selected dimension indices 1 and 2 are converted into a single integer by position-wise multiplying them by respective powers of base 3 and summing the results of the position-wise multiplications. Base 3 is selected here because the 3D atomic coordinates have three dimensions (although FIG. 33 shows only 2D atomic coordinates along two dimensions, for simplicity).

Since index 2 is in the rightmost (i.e., least significant) position, it is multiplied by 3 to the power of 0, yielding 2. Since index 1 is in the second-rightmost (i.e., second least significant) position, it is multiplied by 3 to the power of 1, yielding 3. The resulting accumulated sum is 5.

Then, at step 3202e (step 5 in FIG. 33), based on the accumulated sum, the voxel index of the voxel containing the particular atom is selected. That is, the accumulated sum is interpreted as the voxel index of the voxel containing the particular atom.
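The position-wise weighting in steps 3202d and 3202e can be sketched as follows, treating the last dimension index as the least significant digit, as in the worked example. The function name `accumulated_sum` is hypothetical, and the default base of 3 is taken from the example. As a general property of positional notation, the resulting integer identifies the index tuple uniquely only when the base is larger than every index.

```python
def accumulated_sum(dimension_indices, base=3):
    """Fold per-dimension indices into one integer using powers of the base."""
    total = 0
    # The rightmost index is weighted by base**0, the next by base**1, and so on.
    for position, index in enumerate(reversed(dimension_indices)):
        if not 0 <= index < base:
            raise ValueError("each index must be a valid digit for the base")
        total += index * base ** position
    return total


voxel_index = accumulated_sum((1, 2))  # 1 * 3**1 + 2 * 3**0
```

With three dimension indices, the leftmost index is weighted by the base squared, so the same function extends to the 3D case.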

At step 3212, after each atom is associated with its atom-containing voxel, each atom is further associated with one or more voxels that are in the neighborhood of the atom-containing voxel, also referred to herein as "neighborhood voxels." The neighborhood voxels can be selected based on being within a predefined radius of the atom-containing voxel (e.g., 5 angstroms (Å)). In other implementations, the neighborhood voxels can be selected based on being contiguously adjacent to the atom-containing voxel (e.g., top, bottom, right, and left adjacent voxels). The resulting associations, which associate each atom with its atom-containing voxel and neighborhood voxels, are encoded in an atom-to-voxels mapping 3402, also referred to herein as an element-to-cells mapping. In one example, a first alpha-carbon atom is associated with a first subset of voxels 3404 that includes the atom-containing voxel and the neighborhood voxels for the first alpha-carbon atom. In another example, a second alpha-carbon atom is associated with a second subset of voxels 3406 that includes the atom-containing voxel and the neighborhood voxels for the second alpha-carbon atom.
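An atom-to-voxels mapping of this kind can be sketched for the contiguous-adjacency variant, where the neighborhood consists of the voxels that differ by one step along a single axis. The radius-based variant is not shown. The function names, the atom labels, and the representation of voxels as index tuples are assumptions for illustration.

```python
def neighborhood_voxels(voxel, grid_shape):
    """Return the in-grid voxels that are directly adjacent along one axis."""
    neighbors = []
    for axis in range(len(voxel)):
        for delta in (-1, 1):
            candidate = list(voxel)
            candidate[axis] += delta
            # Discard neighbors that would fall outside the grid.
            if 0 <= candidate[axis] < grid_shape[axis]:
                neighbors.append(tuple(candidate))
    return neighbors


def atom_to_voxels(atom_containing_voxel, grid_shape):
    """Map each atom to its atom-containing voxel followed by its neighbors."""
    return {
        atom: [voxel] + neighborhood_voxels(voxel, grid_shape)
        for atom, voxel in atom_containing_voxel.items()
    }


# Hypothetical atoms with already-determined atom-containing voxels (2D grid).
containing = {"CA_1": (1, 2), "CA_2": (0, 0)}
mapping = atom_to_voxels(containing, grid_shape=(3, 3))
```

Both functions work on index arithmetic alone, so building the mapping involves no distance calculation.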

Note that no distance calculations are made to determine the atom-containing voxels and the neighborhood voxels. The atom-containing voxels are selected by virtue of the spatial arrangement of the voxels, which allows the quantized 3D atomic coordinates to be assigned to the corresponding regularly spaced voxel centers in the voxel grid (without using any distance calculations). Likewise, the neighborhood voxels are selected by virtue of being spatially contiguous to the atom-containing voxels in the voxel grid (again, without using any distance calculations).

At step 3222, each voxel is mapped to the atoms with which it was associated at steps 3202 and 3212. In one implementation, this mapping is encoded in a voxel-to-atoms mapping 3412, which is generated based on the atom-to-voxels mapping 3402 (e.g., by applying a voxel-based sorting key to the atom-to-voxels mapping 3402). The voxel-to-atoms mapping 3412 is also referred to herein as a "cell-to-elements mapping." In one example, a first voxel is mapped to a first subset of alpha-carbon atoms 3414 that includes the alpha-carbon atoms associated with the first voxel at steps 3202 and 3212. In another example, a second voxel is mapped to a second subset of alpha-carbon atoms 3416 that includes the alpha-carbon atoms associated with the second voxel at steps 3202 and 3212.
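One way to derive a voxel-to-atoms mapping from an atom-to-voxels mapping is to flatten the associations into (voxel, atom) pairs, sort them by voxel, and group consecutive pairs. The sketch below illustrates that idea with a voxel-based sort key; it is an assumption about how such an inversion could be done, not a statement of the patented implementation. The function name, atom labels, and integer voxel indices are hypothetical.

```python
from itertools import groupby


def voxel_to_atoms(atom_to_voxels):
    """Invert an atom-to-voxels mapping by sorting the pairs on the voxel key."""
    pairs = [
        (voxel, atom)
        for atom, voxels in atom_to_voxels.items()
        for voxel in voxels
    ]
    # Sorting brings all pairs that share a voxel next to each other.
    pairs.sort()
    return {
        voxel: [atom for _, atom in group]
        for voxel, group in groupby(pairs, key=lambda pair: pair[0])
    }


# Hypothetical associations: two atoms that share voxel 4.
inverted = voxel_to_atoms({"a1": [4, 1], "a2": [4, 5]})
```

Voxels that no atom is associated with simply do not appear in the result, so downstream steps skip them.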

At step 3232, for each voxel, distances are calculated between the voxel and the atoms mapped to the voxel at step 3222. Step 3232 has a runtime complexity of O(number of atoms), because the distance to a particular atom is measured only once, from the respective voxel to which the particular atom is uniquely mapped in the voxel-to-atoms mapping 3412. This is true when no neighborhood voxels are considered. Without neighbors, the constant factor implied by the big-O notation is 1. With neighbors, the constant factor equals the number of neighbors plus 1; since the number of neighbors is constant for each voxel, the runtime complexity remains O(number of atoms). In contrast, in FIG. 35A, the distances to a particular atom are measured redundantly, as many times as there are voxels (e.g., 27 distances for a particular atom due to 27 voxels).
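The restricted distance calculation can be sketched as a loop over the voxel-to-atoms mapping only, with a counter showing that the number of distance calls equals the number of (voxel, atom) associations rather than voxels times atoms. The function name, the voxel index tuples used as keys, the voxel center coordinates, and the atom label are assumptions for illustration.

```python
import math


def distances_for_mapped_atoms(voxel_centers, atom_coords, voxel_to_atoms):
    """Measure distances only between a voxel and the atoms mapped to it."""
    distances = {}
    n_distance_calls = 0
    for voxel, atoms in voxel_to_atoms.items():
        distances[voxel] = {}
        for atom in atoms:
            distances[voxel][atom] = math.dist(voxel_centers[voxel], atom_coords[atom])
            n_distance_calls += 1
    return distances, n_distance_calls


# Hypothetical 2D inputs: one atom mapped to its containing voxel and one neighbor.
voxel_centers = {(1, 2): (1.5, 2.5), (1, 1): (1.5, 1.5)}
atom_coords = {"CA_1": (1.7456, 2.14323)}
mapping = {(1, 2): ["CA_1"], (1, 1): ["CA_1"]}
distances, calls = distances_for_mapped_atoms(voxel_centers, atom_coords, mapping)
```

Here the atom contributes two distance calls (its containing voxel plus one neighbor), independent of how many other voxels the grid has.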

In FIG. 35B, based on the voxel-to-atom mapping 3412, each voxel is mapped to a respective subset of the 828 atoms (not counting distance calculations to neighboring voxels), as illustrated by a respective ellipse for each voxel. The subsets are largely non-overlapping, with some exceptions. A small overlap exists due to some instances in which multiple atoms are mapped to the same voxel, as indicated in FIG. 35B by the prime symbol "'" and the yellow overlap between ellipses. This minimal overlap has an additive effect, not a multiplicative effect, on the O(number of atoms) runtime complexity. The overlap results from considering neighboring voxels after the voxel that contains an atom has been determined. Without neighboring voxels, no overlap can exist, because an atom is associated with only one voxel. When neighbors are considered, however, each neighbor can potentially be associated with the same atom (as long as no other atom of the same amino acid is closer).
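The difference between the two approaches can be made concrete by counting distance evaluations. The sketch below uses the 828 atoms and 27 voxels mentioned in the text; the neighbor count of 6 is a hypothetical value chosen only for illustration, and the function name is likewise an assumption.

```python
def distance_evaluations(num_atoms, num_voxels, neighbors_per_voxel=0):
    """Return (brute_force, mapped) counts of distance evaluations."""
    # Brute force: every atom is measured from every voxel.
    brute_force = num_atoms * num_voxels
    # Mapping-based: each atom is measured from its own voxel plus a
    # fixed number of neighboring voxels (constant factor = neighbors + 1).
    mapped = num_atoms * (neighbors_per_voxel + 1)
    return brute_force, mapped


brute, mapped = distance_evaluations(num_atoms=828, num_voxels=27)
_, mapped_with_neighbors = distance_evaluations(828, 27, neighbors_per_voxel=6)
print(brute, mapped, mapped_with_neighbors)
```

The mapped count grows only with the number of atoms, whereas the brute-force count also grows with the size of the voxel grid.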

In step 3242, for each voxel, the atom nearest to the voxel is identified based on the distances calculated in step 3232. In one implementation, this identification is encoded in a voxel-to-nearest-atom mapping 3422, also referred to herein as a "cell-to-nearest-element mapping." In one example, the first voxel is mapped to the second alpha-carbon atom as its nearest alpha-carbon atom 3424. In another example, the second voxel is mapped to the thirty-first alpha-carbon atom as its nearest alpha-carbon atom 3426.
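A minimal sketch of the distance calculation and nearest-atom selection follows. It assumes voxel centers and atom coordinates are given as 3-tuples and that a voxel-to-atoms mapping is already available as a dictionary; the names and data layout are hypothetical, not the disclosed data structures.

```python
import math


def nearest_atom_per_voxel(voxel_centers, atom_coords, voxel_to_atoms):
    """Return {voxel: (nearest_atom_id, distance)} using only mapped atoms."""
    nearest = {}
    for voxel, atoms in voxel_to_atoms.items():
        center = voxel_centers[voxel]
        # Distances are computed only for the atoms mapped to this voxel,
        # so each atom is visited once per voxel it is mapped to.
        distances = {atom: math.dist(center, atom_coords[atom]) for atom in atoms}
        best = min(distances, key=distances.get)
        nearest[voxel] = (best, distances[best])
    return nearest


voxel_centers = {0: (0.0, 0.0, 0.0), 1: (2.0, 0.0, 0.0)}
atom_coords = [(0.0, 0.0, 1.0), (0.0, 3.0, 0.0), (2.0, 0.0, 0.5)]
voxel_to_atoms = {0: [0, 1], 1: [2]}
nearest = nearest_atom_per_voxel(voxel_centers, atom_coords, voxel_to_atoms)
print(nearest)
```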

Furthermore, as the voxel-wise distances are calculated using the techniques described above, the atom-type and amino-acid-type categorizations of the atoms, together with the corresponding distance values, are stored to generate categorized distance channels.

Once the distances to the nearest atoms have been identified using the techniques described above, these distances can be encoded in the distance channels for voxelization and subsequent processing by the pathogenicity classifier 2108.

Computer System
FIG. 36 shows an example computer system 3600 that can be used to implement the technology disclosed. Computer system 3600 includes at least one central processing unit (CPU) 3672 that communicates with a number of peripheral devices via bus subsystem 3655. These peripheral devices can include, for example, a storage subsystem 3610 including memory devices and a file storage subsystem 3636, user interface input devices 3638, user interface output devices 3676, and a network interface subsystem 3674. The input and output devices allow user interaction with computer system 3600. Network interface subsystem 3674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the pathogenicity classifier 2108 is communicably linked to the storage subsystem 3610 and the user interface input devices 3638.

User interface input devices 3638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into computer system 3600.

User interface output devices 3676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 3600 to the user or to another machine or computer system.

Storage subsystem 3610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 3678.

Processors 3678 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 3678 can be hosted by a deep learning cloud platform such as Google Cloud Platform(TM), Xilinx(TM), and Cirrascale(TM). Examples of processors 3678 include Google's Tensor Processing Unit (TPU)(TM), rackmount solutions like GX4 Rackmount Series(TM) and GX36 Rackmount Series(TM), NVIDIA DGX-1(TM), Microsoft's Stratix V FPGA(TM), Graphcore's Intelligent Processor Unit (IPU)(TM), Qualcomm's Zeroth Platform(TM) with Snapdragon processors(TM), NVIDIA's Volta(TM), NVIDIA's DRIVE PX(TM), NVIDIA's JETSON TX1/TX2 MODULE(TM), Intel's Nirvana(TM), Movidius VPU(TM), Fujitsu DPI(TM), ARM's DynamicIQ(TM), IBM TrueNorth(TM), Lambda GPU Server with Tesla V100s(TM), and others.

Memory subsystem 3622 used in the storage subsystem 3610 can include a number of memories, including a main random access memory (RAM) 3632 for storage of instructions and data during program execution and a read-only memory (ROM) 3634 in which fixed instructions are stored. File storage subsystem 3636 can provide persistent storage for program and data files, and can include a hard disk drive, associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 3636 in the storage subsystem 3610, or in other machines accessible by the processor.

Bus subsystem 3655 provides a mechanism for letting the various components and subsystems of computer system 3600 communicate with each other as intended. Although bus subsystem 3655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.

Computer system 3600 itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a loosely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 3600 depicted in FIG. 36 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 3600 are possible, having more or fewer components than the computer system depicted in FIG. 36.

Specific Implementations 1
The following implementations can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections. These recitations are hereby incorporated by reference into each of the following implementations.

Although the technology disclosed uses 3D data as input, other implementations can analogously use 1D data, 2D data (e.g., pixels and 2D atomic coordinates), 4D data, 5D data, and so on.

In some implementations, a system includes memory that stores per-amino acid distance channels for a plurality of amino acids in a protein. Each of the per-amino acid distance channels has voxel-wise distance values for voxels in a plurality of voxels. The voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of corresponding amino acids in the plurality of amino acids. The system further includes a pathogenicity determination engine configured to process a tensor that includes the per-amino acid distance channels and an alternative allele of the protein expressed by a variant. The pathogenicity determination engine can also be configured to determine a pathogenicity of the variant based at least in part on the tensor.

In some implementations, the system further includes a distance channel generator that centers a voxel grid of the voxels on an alpha-carbon atom of a respective residue of the amino acids. The distance channel generator can center the voxel grid on the alpha-carbon atom of the residue of a particular amino acid that is positioned at a variant amino acid in the protein.

The system can be configured to encode, in the tensor, a directionality of the amino acids and a position of the particular amino acid by multiplying, by a directionality parameter, the voxel-wise distance values for those amino acids that precede the particular amino acid. The distances can be nearest-atom distances from corresponding voxel centers in the voxel grid to the nearest atoms of the corresponding amino acids. In some implementations, the nearest-atom distances can be Euclidean distances. The nearest-atom distances can be normalized by dividing the Euclidean distances by a maximum nearest-atom distance. The amino acids can have alpha-carbon atoms and, in some implementations, the distances can be nearest-alpha-carbon-atom distances from the corresponding voxel centers to the nearest alpha-carbon atoms of the corresponding amino acids. The amino acids can have beta-carbon atoms and, in some implementations, the distances can be nearest-beta-carbon-atom distances from the corresponding voxel centers to the nearest beta-carbon atoms of the corresponding amino acids. The amino acids can have backbone atoms and, in some implementations, the distances can be nearest-backbone-atom distances from the corresponding voxel centers to the nearest backbone atoms of the corresponding amino acids. The amino acids have side-chain atoms and, in some implementations, the distances can be nearest-side-chain-atom distances from the corresponding voxel centers to the nearest side-chain atoms of the corresponding amino acids.
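The normalization and the directionality encoding described above can be sketched as follows. This is an illustrative sketch only: it assumes that, for each voxel, the sequence position of the residue owning the nearest atom is known, and it uses -1.0 as a hypothetical directionality parameter; the function names are likewise assumptions.

```python
def normalize_distances(distances):
    """Divide each distance by the maximum nearest-atom distance."""
    max_distance = max(distances)
    if max_distance == 0:
        return [0.0 for _ in distances]
    return [d / max_distance for d in distances]


def encode_directionality(distances, residue_positions, variant_position,
                          directionality=-1.0):
    """Scale distances whose nearest atom belongs to a residue that
    precedes the variant position in the sequence."""
    return [
        d * directionality if position < variant_position else d
        for d, position in zip(distances, residue_positions)
    ]


raw = [2.0, 4.0, 1.0]                 # per-voxel nearest-atom distances
normalized = normalize_distances(raw)
residue_positions = [10, 57, 42]      # residue owning each nearest atom
encoded = encode_directionality(normalized, residue_positions, variant_position=50)
print(normalized, encoded)
```

With a negative parameter, the sign of a value indicates on which side of the variant the nearest residue lies, while the magnitude still carries the normalized distance.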

The system can be further configured to encode, in the tensor, a nearest-atom channel that specifies a distance from each voxel to its nearest atom. The nearest atom can be selected irrespective of the amino acids and the atomic elements of the amino acids. In some implementations, the distance is a Euclidean distance. The distance can be normalized by dividing the Euclidean distance by a maximum distance. The amino acids can include non-standard amino acids. The tensor can include an absent-atom channel that specifies atoms not found within a predefined radius of a voxel center, and the absent-atom channel can be one-hot encoded. In some implementations, the tensor can further include a one-hot encoding of the alternative allele that is voxel-wise encoded to each of the per-amino acid distance channels. The tensor can further include a reference allele of the protein. In some implementations, the tensor can further include a one-hot encoding of the reference allele that is voxel-wise encoded to each of the per-amino acid distance channels. The tensor can further include evolutionary profiles that specify conservation levels of the amino acids across a plurality of species.
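As a hedged illustration of the voxel-wise one-hot encoding, the sketch below builds a one-hot vector for an allele and repeats it for every voxel. The 20-letter alphabet and its ordering are assumptions made for the example (the text notes that non-standard amino acids can also be included), and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering, standard residues only


def one_hot(amino_acid):
    """Return a one-hot vector for a single-letter amino acid code."""
    if amino_acid not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return [1.0 if amino_acid == letter else 0.0 for letter in AMINO_ACIDS]


def broadcast_to_voxels(vector, num_voxels):
    """Repeat the same encoding at every voxel of the grid."""
    return [list(vector) for _ in range(num_voxels)]


alt_allele = one_hot("R")
alt_channels = broadcast_to_voxels(alt_allele, num_voxels=27)
print(len(alt_channels), len(alt_channels[0]))
```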

The system can further include an evolutionary profile generator that, for each of the voxels, selects the nearest atom across the amino acids and the atom categories, selects a pan-amino acid conservation frequency sequence for the residue of the amino acid that includes the nearest atom, and makes the pan-amino acid conservation frequency sequence available as one of the evolutionary profiles. The pan-amino acid conservation frequency sequence can be configured for the particular position of the residue as observed in the plurality of species. The pan-amino acid conservation frequency sequence can specify whether a conservation frequency is missing for a particular amino acid. In some implementations, the evolutionary profile generator can, for each of the voxels, select a respective nearest atom in each respective one of the amino acids, select a respective per-amino acid conservation frequency for the respective residue of the amino acid that includes the nearest atom, and make the per-amino acid conservation frequencies available as one of the evolutionary profiles. The per-amino acid conservation frequencies can be configured for the particular position of the residue as observed in the plurality of species. The per-amino acid conservation frequencies can specify whether a conservation frequency is missing for a particular amino acid.
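The selection of a conservation profile per voxel can be sketched as a lookup keyed by the residue that owns each voxel's nearest atom. The sketch assumes that the nearest residue per voxel has already been determined and represents a missing profile with zeros; both the zero fill and the names are hypothetical choices for illustration, and a three-letter alphabet is used only to keep the example short.

```python
def voxel_conservation_profiles(nearest_residue_per_voxel,
                                conservation_by_residue,
                                num_amino_acids):
    """Return one conservation-frequency vector per voxel."""
    # Assumed convention: an all-zero vector marks a missing profile.
    missing = [0.0] * num_amino_acids
    return [
        list(conservation_by_residue.get(residue, missing))
        for residue in nearest_residue_per_voxel
    ]


conservation_by_residue = {
    41: [0.7, 0.2, 0.1],
    42: [0.1, 0.1, 0.8],
}
profiles = voxel_conservation_profiles([42, 41, 99], conservation_by_residue, 3)
print(profiles)
```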

In some implementations of the system, the tensor can further include annotation channels for the amino acids. The annotation channels can be one-hot encoded in the tensor. The annotation channels can be molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide. The annotation channels can be region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias. The annotation channels can be site annotations that include active site, metal binding, binding site, and site. The annotation channels can be amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link. The annotation channels can be secondary structure annotations that include helix, turn, and beta strand. The annotation channels can be experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.

In some implementations of the system, the tensor further includes structure confidence channels for the amino acids that specify the quality of the respective structures of the amino acids. The structure confidence channels can be global model quality estimations (GMQEs). The structure confidence channels can include qualitative model energy analysis (QMEAN) scores. The structure confidence channels can be temperature factors that specify the degree to which the residues satisfy the physical constraints of the respective protein structures. The structure confidence channels can be template structure alignments that specify the degree to which the residues of the atoms nearest to the voxels have aligned template structures. The structure confidence channels can be template modeling scores of the aligned template structures. The structure confidence channels can be a minimum of the template modeling scores, a mean of the template modeling scores, and a maximum of the template modeling scores.

In some implementations, the system can further include a tensor generator that voxel-wise concatenates the per-amino acid distance channels for the alpha-carbon atoms with the one-hot encoding of the alternative allele to generate the tensor. The tensor generator can voxel-wise concatenate the per-amino acid distance channels for the beta-carbon atoms with the one-hot encoding of the alternative allele to generate the tensor. The tensor generator can voxel-wise concatenate the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, and the one-hot encoding of the alternative allele to generate the tensor.

In further implementations, the tensor generator can voxel-wise concatenate the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, and the one-hot encoding of the alternative allele with any one of the following sets of additional channels to generate the tensor: (1) the pan-amino acid conservation frequencies; (2) the pan-amino acid conservation frequencies and the annotation channels; (3) the pan-amino acid conservation frequencies, the annotation channels, and the structure confidence channels; (4) the per-amino acid conservation frequencies for each of the amino acids; (5) the per-amino acid conservation frequencies for each of the amino acids and the annotation channels; (6) the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels; (7) the one-hot encoding of the reference allele; (8) the one-hot encoding of the reference allele and the pan-amino acid conservation frequencies; (9) the one-hot encoding of the reference allele, the pan-amino acid conservation frequencies, and the annotation channels; (10) the one-hot encoding of the reference allele, the pan-amino acid conservation frequencies, the annotation channels, and the structure confidence channels; (11) the one-hot encoding of the reference allele and the per-amino acid conservation frequencies for each of the amino acids; (12) the one-hot encoding of the reference allele, the per-amino acid conservation frequencies for each of the amino acids, and the annotation channels; and (13) the one-hot encoding of the reference allele, the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels.
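Voxel-wise concatenation can be sketched as joining, for each voxel, the feature vectors contributed by every channel group. The sketch assumes each group is a list with one feature vector per voxel, flattened over the grid; the function name and this layout are hypothetical and chosen only for illustration.

```python
def concatenate_voxelwise(*channel_groups):
    """Concatenate the per-voxel feature vectors of several channel groups."""
    num_voxels = len(channel_groups[0])
    if any(len(group) != num_voxels for group in channel_groups):
        raise ValueError("all channel groups must cover the same voxels")
    return [
        [value for group in channel_groups for value in group[voxel]]
        for voxel in range(num_voxels)
    ]


# Two voxels; tiny channel counts keep the example readable.
alpha_carbon_distances = [[0.1, 0.2], [0.3, 0.4]]
beta_carbon_distances = [[0.5], [0.6]]
alt_allele_one_hot = [[1.0, 0.0], [1.0, 0.0]]

tensor = concatenate_voxelwise(
    alpha_carbon_distances, beta_carbon_distances, alt_allele_one_hot
)
print(tensor)
```

Adding conservation, annotation, or structure confidence channels amounts to passing further groups to the same call, so the channel dimension grows while the voxel dimension stays fixed.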

In some implementations, the system can further include an atom rotation engine that rotates the atoms of the amino acids before the per-amino acid distance channels are generated. The pathogenicity determination engine can be a neural network. In particular implementations, the pathogenicity determination engine can be a convolutional neural network. The convolutional neural network can use 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit (ReLU) activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer. The 1×1×1 convolutions and the 3×3×3 convolutions can be three-dimensional convolutions.

In some implementations, a layer of the 1×1×1 convolutions can process the tensor and produce an intermediate output that is a convolved representation of the tensor. A sequence of layers of the 3×3×3 convolutions can process the intermediate output and produce a flattened output. The fully connected layer can process the flattened output and produce unnormalized outputs. The softmax classification layer can process the unnormalized outputs and produce exponentially normalized outputs that identify the likelihoods that the variant is pathogenic and benign. A sigmoid layer can process the unnormalized outputs and produce a normalized output that identifies the likelihood that the variant is pathogenic. The voxels, the atoms, and the distances can have three-dimensional coordinates. The tensor can have at least three dimensions, the intermediate output can have at least three dimensions, and the flattened output can have one dimension.
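The two output layers mentioned above can be sketched with the standard softmax and sigmoid functions. The example logits are hypothetical values standing in for the unnormalized outputs of the fully connected layer; this is an illustration of the normalization step only, not of the full network.

```python
import math


def softmax(logits):
    """Exponentially normalize unnormalized outputs into probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Map a single unnormalized output to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical unnormalized outputs for the classes (pathogenic, benign).
probabilities = softmax([2.0, 0.0])
pathogenic_likelihood = sigmoid(2.0)
print(probabilities, pathogenic_likelihood)
```

For two classes, the softmax probability of the first class equals the sigmoid of the difference between the two logits, which is why either layer can serve as the final normalization.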

In some implementations, the pathogenicity determination engine is a recurrent neural network. In other implementations, the pathogenicity determination engine is an attention-based neural network. In yet other implementations, the pathogenicity determination engine is a gradient-boosted tree. In yet other implementations, the pathogenicity determination engine is a state vector machine.

In other implementations, the system can include memory that stores per-atom-category distance channels for amino acids in a protein. The amino acids can have atoms of a plurality of atom categories, and the atom categories in the plurality of atom categories can specify the atomic elements of the amino acids. Each of the per-atom-category distance channels can have per-voxel distance values for voxels in a plurality of voxels. The per-voxel distance values can specify distances from corresponding voxels in the plurality of voxels to atoms in corresponding atom categories in the plurality of atom categories. The system can further include a pathogenicity determination engine configured to process a tensor that includes the per-atom-category distance channels and an alternative allele of the protein expressed by a variant, and to determine the pathogenicity of the variant based at least in part on the tensor.
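A per-atom-category distance channel can be sketched as follows. This is an assumption-laden illustration: atoms are grouped by element symbol, each voxel receives the Euclidean distance to the nearest atom in that category, and the function name `category_distance_channels` and all coordinates are made up for the example.

```python
import math

def category_distance_channels(voxel_centers, atoms):
    """Return {atom category: [distance per voxel]}.

    `atoms` is a list of (element, (x, y, z)) pairs. Each channel holds, for
    every voxel center, the distance to the nearest atom of that category.
    """
    by_category = {}
    for element, xyz in atoms:
        by_category.setdefault(element, []).append(xyz)
    channels = {}
    for element, coords in by_category.items():
        channels[element] = [
            min(math.dist(center, atom) for atom in coords)
            for center in voxel_centers
        ]
    return channels

# Two voxel centers and four atoms in three categories (made-up values).
voxel_centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atoms = [
    ("C", (0.0, 0.0, 1.0)),
    ("C", (2.0, 0.0, 2.0)),
    ("N", (1.0, 0.0, 0.0)),
    ("O", (0.0, 3.0, 0.0)),
]
channels = category_distance_channels(voxel_centers, atoms)
```

Each dictionary entry corresponds to one channel of the tensor, with one distance value per voxel.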

The system can further include a distance channel generator that centers a voxel grid of the voxels on respective atoms of respective atom categories in the plurality of atom categories. The distance channel generator can center the voxel grid on the alpha-carbon atom of a residue of at least one variant amino acid in the protein. The distances can be nearest-atom distances from corresponding voxel centers in the voxel grid to the nearest atoms in the corresponding atom categories. These nearest-atom distances can be Euclidean distances, and they can be normalized by dividing the Euclidean distances by a maximum nearest-atom distance. The distances can also be nearest-atom distances from the corresponding voxel centers in the voxel grid to the nearest atoms irrespective of the amino acids and of the atom categories of the amino acids. These nearest-atom distances can likewise be Euclidean distances, normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
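The grid centering and normalization described above can be sketched as follows. The grid size, the voxel spacing, and the reading of "maximum nearest-atom distance" as the largest nearest-atom distance observed over the grid are assumptions; the helper names are hypothetical.

```python
import math
from itertools import product

def voxel_centers(center, size, spacing):
    """Centers of a size x size x size voxel grid centered on `center`."""
    offsets = [(i - (size - 1) / 2) * spacing for i in range(size)]
    cx, cy, cz = center
    return [(cx + dx, cy + dy, cz + dz)
            for dx, dy, dz in product(offsets, repeat=3)]

def normalized_nearest_distances(centers, atoms):
    """Nearest-atom Euclidean distance per voxel, divided by the largest one."""
    nearest = [min(math.dist(c, a) for a in atoms) for c in centers]
    largest = max(nearest)
    if largest == 0.0:
        return nearest  # every voxel sits on an atom; nothing to scale
    return [d / largest for d in nearest]

# Assumed example: a 3 x 3 x 3 grid with spacing 2.0, centered on the
# alpha-carbon atom of the variant residue (made-up coordinates).
alpha_carbon = (10.0, 10.0, 10.0)
atoms = [alpha_carbon, (11.5, 10.0, 10.0)]
centers = voxel_centers(alpha_carbon, size=3, spacing=2.0)
channel = normalized_nearest_distances(centers, atoms)
```

Because the grid is centered on the alpha-carbon atom, the middle voxel has distance 0, and after normalization every value lies between 0 and 1.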

Other implementations of the methods described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the methods described above. Yet another implementation of the methods described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.

Clause Set 1
1. A computer-implemented method, comprising:
storing per-amino acid distance channels for a plurality of amino acids in a protein,
wherein each of the per-amino acid distance channels has per-voxel distance values for voxels in a plurality of voxels, and
wherein the per-voxel distance values specify distances from corresponding voxels in the plurality of voxels to atoms of corresponding amino acids in the plurality of amino acids;
processing a tensor that includes the per-amino acid distance channels and an alternative allele of the protein expressed by a variant; and
determining a pathogenicity of the variant based at least in part on the tensor.
2. The computer-implemented method of clause 1, further comprising centering a voxel grid of the voxels on alpha-carbon atoms of respective residues of the amino acids.
3. The computer-implemented method of clause 2, further comprising centering the voxel grid on the alpha-carbon atom of a residue of a particular amino acid that corresponds to at least one variant amino acid in the protein.
4. The computer-implemented method of clause 3, further comprising encoding, in the tensor, a directionality of the amino acids and a position of the particular amino acid by multiplying, by a directionality parameter, the per-voxel distance values for those amino acids that precede the particular amino acid.
5. The computer-implemented method of clause 3, wherein the distances are nearest-atom distances from corresponding voxel centers in the voxel grid to the nearest atoms of the corresponding amino acids.
6. The computer-implemented method of clause 5, wherein the nearest-atom distances are Euclidean distances.
7. The computer-implemented method of clause 6, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
8. The computer-implemented method of clause 5, wherein the amino acids have alpha-carbon atoms, and the distances are nearest-alpha-carbon-atom distances from the corresponding voxel centers to the nearest alpha-carbon atoms of the corresponding amino acids.
9. The computer-implemented method of clause 5, wherein the amino acids have beta-carbon atoms, and the distances are nearest-beta-carbon-atom distances from the corresponding voxel centers to the nearest beta-carbon atoms of the corresponding amino acids.
10. The computer-implemented method of clause 5, wherein the amino acids have backbone atoms, and the distances are nearest-backbone-atom distances from the corresponding voxel centers to the nearest backbone atoms of the corresponding amino acids.
11. The computer-implemented method of clause 5, wherein the amino acids have side-chain atoms, and the distances are nearest-side-chain-atom distances from the corresponding voxel centers to the nearest side-chain atoms of the corresponding amino acids.
12. The computer-implemented method of clause 3, further comprising encoding, in the tensor, a nearest-atom channel that specifies a distance from each voxel to a nearest atom, the nearest atom being selected irrespective of the amino acids and of the atomic elements of the amino acids.
13. The computer-implemented method of clause 12, wherein the distance is a Euclidean distance.
14. The computer-implemented method of clause 13, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.
15. The computer-implemented method of clause 12, wherein the amino acids include non-standard amino acids.
16. The computer-implemented method of clause 1, wherein the tensor further includes an absent-atom channel that specifies atoms not found within a predetermined radius of a voxel center, the absent-atom channel being one-hot encoded.
17. The computer-implemented method of clause 1, wherein the tensor further includes a one-hot encoding of the alternative allele that is encoded voxel-wise to each of the per-amino acid distance channels.
18. The computer-implemented method of clause 1, wherein the tensor further includes a reference allele of the protein.
19. The computer-implemented method of clause 18, wherein the tensor further includes a one-hot encoding of the reference allele that is encoded voxel-wise to each of the per-amino acid distance channels.
20. The computer-implemented method of clause 1, wherein the tensor further includes evolutionary profiles that specify conservation levels of the amino acids across a plurality of species.
21. The computer-implemented method of clause 20, further comprising, for each of the voxels:
selecting a nearest atom across the amino acids and the atom categories;
selecting a pan-amino acid conservation frequency sequence for a residue of the amino acid that contains the nearest atom; and
making the pan-amino acid conservation frequency sequence available as one of the evolutionary profiles.
22. The computer-implemented method of clause 21, wherein the pan-amino acid conservation frequency sequence is constructed for a particular position of the residue as observed in the plurality of species.
23. The computer-implemented method of clause 21, wherein the pan-amino acid conservation frequency sequence specifies whether a conservation frequency is missing for a particular amino acid.
24. The computer-implemented method of clause 21, further comprising, for each of the voxels:
selecting respective nearest atoms in respective ones of the amino acids;
selecting respective per-amino acid conservation frequencies for respective residues of the amino acids that contain the nearest atoms; and
making the per-amino acid conservation frequencies available as one of the evolutionary profiles.
25. The computer-implemented method of clause 24, wherein the per-amino acid conservation frequencies are constructed for a particular position of the residues as observed in the plurality of species.
26. The computer-implemented method of clause 24, wherein the per-amino acid conservation frequencies specify whether a conservation frequency is missing for a particular amino acid.
27. The computer-implemented method of clause 1, wherein the tensor further includes annotation channels for the amino acids, the annotation channels being one-hot encoded in the tensor.
28. The computer-implemented method of clause 27, wherein the annotation channels are molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide.
29. The computer-implemented method of clause 27, wherein the annotation channels are region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.
30. The computer-implemented method of clause 27, wherein the annotation channels are site annotations that include active site, metal binding, binding site, and site.
31. The computer-implemented method of clause 27, wherein the annotation channels are amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link.
32. The computer-implemented method of clause 27, wherein the annotation channels are secondary structure annotations that include helix, turn, and beta strand.
33. The computer-implemented method of clause 27, wherein the annotation channels are experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.
34. The computer-implemented method of clause 1, wherein the tensor further includes structure confidence channels for the amino acids that specify the quality of the respective structures of the amino acids.
35. The computer-implemented method of clause 34, wherein the structure confidence channels are global model quality estimations (GMQEs).
36. The computer-implemented method of clause 34, wherein the structure confidence channels include qualitative model energy analysis (QMEAN) scores.
37. The computer-implemented method of clause 34, wherein the structure confidence channels are temperature factors that specify the degree to which the residues satisfy the physical constraints of the respective protein structures.
38. The computer-implemented method of clause 34, wherein the structure confidence channels are template structure alignments that specify the degree to which the residues of the atoms nearest to the voxels have aligned template structures.
39. The computer-implemented method of clause 38, wherein the structure confidence channels are template modeling scores of the aligned template structures.
40. The computer-implemented method of clause 39, wherein the structure confidence channels are a minimum one of the template modeling scores, a mean of the template modeling scores, and a maximum one of the template modeling scores.
41. The computer-implemented method of clause 1, further comprising voxel-wise concatenating the per-amino acid distance channels for alpha-carbon atoms with a one-hot encoding of the alternative allele to generate the tensor.
42. The computer-implemented method of clause 41, further comprising voxel-wise concatenating the per-amino acid distance channels for beta-carbon atoms with the one-hot encoding of the alternative allele to generate the tensor.
43. The computer-implemented method of clause 42, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, and the one-hot encoding of the alternative allele to generate the tensor.
44. The computer-implemented method of clause 43, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and the pan-amino acid conservation frequency sequences to generate the tensor.
45. The computer-implemented method of clause 44, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the pan-amino acid conservation frequency sequences, and the annotation channels to generate the tensor.
46. The computer-implemented method of clause 45, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the pan-amino acid conservation frequency sequences, the annotation channels, and the structure confidence channels to generate the tensor.
47. The computer-implemented method of clause 46, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and the per-amino acid conservation frequencies for each of the amino acids to generate the tensor.
48. The computer-implemented method of clause 47, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the per-amino acid conservation frequencies for each of the amino acids, and the annotation channels to generate the tensor.
49. The computer-implemented method of clause 48, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels to generate the tensor.
50. The computer-implemented method of clause 49, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and a one-hot encoding of the reference allele to generate the tensor.
51. The computer-implemented method of clause 50, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, and the pan-amino acid conservation frequency sequences to generate the tensor.
52. The computer-implemented method of clause 51, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequences, and the annotation channels to generate the tensor.
53. The computer-implemented method of clause 52, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequences, the annotation channels, and the structure confidence channels to generate the tensor.
54. The computer-implemented method of clause 53, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, and the per-amino acid conservation frequencies for each of the amino acids to generate the tensor.
55. The computer-implemented method of clause 54, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the per-amino acid conservation frequencies for each of the amino acids, and the annotation channels to generate the tensor.
56. The computer-implemented method of clause 55, further comprising voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels to generate the tensor.
57. The computer-implemented method of clause 1, further comprising rotating the atoms of the amino acids before the per-amino acid distance channels are generated.
58. The computer-implemented method of clause 1, further comprising using, in a convolutional neural network, 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit (ReLU) activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer.
59. The computer-implemented method of clause 58, wherein the 1×1×1 convolutions and the 3×3×3 convolutions are three-dimensional convolutions.
60. The computer-implemented method of clause 58, wherein a layer of the 1×1×1 convolutions processes the tensor and produces an intermediate output that is a convolved representation of the tensor, a sequence of layers of the 3×3×3 convolutions processes the intermediate output and produces a flattened output, the fully connected layer processes the flattened output and produces unnormalized outputs, and the softmax classification layer processes the unnormalized outputs and produces exponentially normalized outputs that identify the likelihoods of the variant being pathogenic and benign.
61. The computer-implemented method of clause 60, wherein a sigmoid layer processes the unnormalized outputs and produces a normalized output that identifies the likelihood of the variant being pathogenic.
62. The computer-implemented method of clause 60, wherein the voxels, the atoms, and the distances have three-dimensional coordinates, the tensor has at least three dimensions, the intermediate output has at least three dimensions, and the flattened output has one dimension.
63. A computer-implemented method, comprising:
storing per-atom-category distance channels for amino acids in a protein,
wherein the amino acids have atoms of a plurality of atom categories,
wherein the atom categories in the plurality of atom categories specify atomic elements of the amino acids,
wherein each of the per-atom-category distance channels has per-voxel distance values for voxels in a plurality of voxels, and
wherein the per-voxel distance values specify distances from corresponding voxels in the plurality of voxels to atoms in corresponding atom categories in the plurality of atom categories;
processing a tensor that includes the per-atom-category distance channels and an alternative allele of the protein expressed by a variant; and
determining a pathogenicity of the variant based at least in part on the tensor.
64. The computer-implemented method of clause 63, further comprising centering a voxel grid of the voxels on respective atoms of respective atom categories in the plurality of atom categories.
65. The computer-implemented method of clause 64, further comprising centering the voxel grid on the alpha-carbon atom of a residue of at least one variant amino acid in the protein.
66. The computer-implemented method of clause 65, wherein the distances are nearest-atom distances from corresponding voxel centers in the voxel grid to the nearest atoms in the corresponding atom categories.
67. The computer-implemented method of clause 66, wherein the nearest-atom distances are Euclidean distances.
68. The computer-implemented method of clause 67, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
69. The computer-implemented method of clause 68, wherein the distances are nearest-atom distances from the corresponding voxel centers in the voxel grid to the nearest atoms irrespective of the amino acids and of the atom categories of the amino acids.
70. The computer-implemented method of clause 69, wherein the nearest-atom distances are Euclidean distances.
71. The computer-implemented method of clause 70, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
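The voxel-wise concatenation recited in clauses 17 and 41 can be sketched as follows, under stated assumptions: the 20 standard amino acids in a fixed alphabetical order of one-letter codes, a flat list of voxels, and the same one-hot vector appended to every voxel. The names `AMINO_ACIDS`, `one_hot`, and `concat_voxelwise` are hypothetical, as are the distance values.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(amino_acid):
    """One-hot encode a single amino acid over the assumed alphabet."""
    return [1.0 if amino_acid == a else 0.0 for a in AMINO_ACIDS]

def concat_voxelwise(distance_channels, alt_amino_acid):
    """Append the alternative allele's one-hot vector to every voxel.

    `distance_channels` is a list of channels, each a list with one distance
    value per voxel. The result has one feature vector per voxel.
    """
    n_voxels = len(distance_channels[0])
    alt = one_hot(alt_amino_acid)
    return [[channel[v] for channel in distance_channels] + alt
            for v in range(n_voxels)]

# Two made-up distance channels over three voxels; alternative allele valine.
distance_channels = [[0.1, 0.5, 0.9], [0.3, 0.2, 0.7]]
tensor = concat_voxelwise(distance_channels, "V")
```

Each voxel's feature vector here holds the two distance values followed by the 20-element one-hot vector, so additional channels such as a reference allele encoding could be appended the same way.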

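Clause Set 2
1. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
storing per-amino acid distance channels for a plurality of amino acids in a protein, wherein each of the per-amino acid distance channels has per-voxel distance values for voxels in a plurality of voxels, and
wherein the per-voxel distance values specify distances from corresponding voxels in the plurality of voxels to atoms of corresponding amino acids in the plurality of amino acids;
processing a tensor that includes the per-amino acid distance channels and an alternative allele of the protein expressed by a variant; and
determining a pathogenicity of the variant based at least in part on the tensor.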
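2. The computer-readable media of clause 1, wherein the operations further comprise centering a voxel grid of the voxels on alpha-carbon atoms of respective residues of the amino acids.
3. The computer-readable media of clause 2, wherein the operations further comprise centering the voxel grid on the alpha-carbon atom of a residue of a particular amino acid that corresponds to at least one variant amino acid in the protein.
4. The computer-readable media of clause 3, wherein the operations further comprise encoding, in the tensor, a directionality of the amino acids and a position of the particular amino acid by multiplying, by a directionality parameter, the per-voxel distance values for those amino acids that precede the particular amino acid.
5. The computer-readable media of clause 3, wherein the distances are nearest-atom distances from corresponding voxel centers in the voxel grid to the nearest atoms of the corresponding amino acids.
6. The computer-readable media of clause 5, wherein the nearest-atom distances are Euclidean distances.
7. The computer-readable media of clause 6, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
8. The computer-readable media of clause 5, wherein the amino acids have alpha-carbon atoms, and the distances are nearest-alpha-carbon-atom distances from the corresponding voxel centers to the nearest alpha-carbon atoms of the corresponding amino acids.
9. The computer-readable media of clause 5, wherein the amino acids have beta-carbon atoms, and the distances are nearest-beta-carbon-atom distances from the corresponding voxel centers to the nearest beta-carbon atoms of the corresponding amino acids.
10. The computer-readable media of clause 5, wherein the amino acids have backbone atoms, and the distances are nearest-backbone-atom distances from the corresponding voxel centers to the nearest backbone atoms of the corresponding amino acids.
11. The computer-readable media of clause 5, wherein the amino acids have side-chain atoms, and the distances are nearest-side-chain-atom distances from the corresponding voxel centers to the nearest side-chain atoms of the corresponding amino acids.
12. The computer-readable media of clause 3, wherein the operations further comprise encoding, in the tensor, a nearest-atom channel that specifies a distance from each voxel to a nearest atom, the nearest atom being selected irrespective of the amino acids and of the atomic elements of the amino acids.
13. The computer-readable media of clause 12, wherein the distance is a Euclidean distance.
14. The computer-readable media of clause 13, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.
15. The computer-readable media of clause 12, wherein the amino acids include non-standard amino acids.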
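16. The computer-readable media of clause 1, wherein the tensor further includes an absent-atom channel that specifies atoms not found within a predetermined radius of a voxel center, the absent-atom channel being one-hot encoded.
17. The computer-readable media of clause 1, wherein the tensor further includes a one-hot encoding of the alternative allele that is encoded voxel-wise to each of the per-amino acid distance channels.
18. The computer-readable media of clause 1, wherein the tensor further includes a reference allele of the protein.
19. The computer-readable media of clause 18, wherein the tensor further includes a one-hot encoding of the reference allele that is encoded voxel-wise to each of the per-amino acid distance channels.
20. The computer-readable media of clause 1, wherein the tensor further includes evolutionary profiles that specify conservation levels of the amino acids across a plurality of species.
21. The computer-readable media of clause 20, wherein the operations further comprise, for each of the voxels:
selecting a nearest atom across the amino acids and the atom categories;
selecting a pan-amino acid conservation frequency sequence for a residue of the amino acid that contains the nearest atom; and
making the pan-amino acid conservation frequency sequence available as one of the evolutionary profiles.
22. The computer-readable media of clause 21, wherein the pan-amino acid conservation frequency sequence is constructed for a particular position of the residue as observed in the plurality of species.
23. The computer-readable media of clause 21, wherein the pan-amino acid conservation frequency sequence specifies whether a conservation frequency is missing for a particular amino acid.
24. The computer-readable media of clause 21, wherein the operations further comprise, for each of the voxels:
selecting respective nearest atoms in respective ones of the amino acids;
selecting respective per-amino acid conservation frequencies for respective residues of the amino acids that contain the nearest atoms; and
making the per-amino acid conservation frequencies available as one of the evolutionary profiles.
25. The computer-readable media of clause 24, wherein the per-amino acid conservation frequencies are constructed for a particular position of the residues as observed in the plurality of species.
26. The computer-readable media of clause 24, wherein the per-amino acid conservation frequencies specify whether a conservation frequency is missing for a particular amino acid.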
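27. The computer-readable media of clause 1, wherein the tensor further includes annotation channels for the amino acids, the annotation channels being one-hot encoded in the tensor.
28. The computer-readable media of clause 27, wherein the annotation channels are molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide.
29. The computer-readable media of clause 27, wherein the annotation channels are region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.
30. The computer-readable media of clause 27, wherein the annotation channels are site annotations that include active site, metal binding, binding site, and site.
31. The computer-readable media of clause 27, wherein the annotation channels are amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link.
32. The computer-readable media of clause 27, wherein the annotation channels are secondary structure annotations that include helix, turn, and beta strand.
33. The computer-readable media of clause 27, wherein the annotation channels are experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.
34. The computer-readable media of clause 1, wherein the tensor further includes structure confidence channels for the amino acids that specify the quality of the respective structures of the amino acids.
35. The computer-readable media of clause 34, wherein the structure confidence channels are global model quality estimations (GMQEs).
36. The computer-readable media of clause 34, wherein the structure confidence channels include qualitative model energy analysis (QMEAN) scores.
37. The computer-readable media of clause 34, wherein the structure confidence channels are temperature factors that specify the degree to which the residues satisfy the physical constraints of the respective protein structures.
38. The computer-readable media of clause 34, wherein the structure confidence channels are template structure alignments that specify the degree to which the residues of the atoms nearest to the voxels have aligned template structures.
39. The computer-readable media of clause 38, wherein the structure confidence channels are template modeling scores of the aligned template structures.
40. The computer-readable media of clause 39, wherein the structure confidence channels are a minimum one of the template modeling scores, a mean of the template modeling scores, and a maximum one of the template modeling scores.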
４１．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネルを代替対立遺伝子のワンホット符号化とボクセルごとに連結して、テンソルを生成することを更に含む、条項１に記載のコンピュータ可読媒体。
４２．動作が、ベータ炭素原子についてのアミノ酸ごとの距離チャネルを代替対立遺伝子のワンホット符号化とボクセルごとに連結して、テンソルを生成することを更に含む、条項４１に記載のコンピュータ可読媒体。
４３．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、及び代替対立遺伝子のワンホット符号化をボクセルごとに連結してテンソルを生成することを更に含む、条項４２のコンピュータ可読媒体。
４４．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、及び汎アミノ酸保存頻度配列をボクセルごとに連結して、テンソルを生成することを更に含む、条項４３のコンピュータ可読媒体。
４５．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、及びアノテーションチャネルをボクセルごとに連結して、テンソルを生成することを更に含む、条項４４に記載のコンピュータ可読媒体。
４６．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、アノテーションチャネル、及び構造信頼度チャネルをボクセルごとに連結して、テンソルを生成することを更に含む、条項４５に記載のコンピュータ可読媒体。
４７．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、及びアミノ酸の各々についてのアミノ酸ごとの保存頻度をボクセルごとに連結して、テンソルを生成することを更に含む、条項４６に記載のコンピュータ可読媒体。
４８．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、アミノ酸の各々についてのアミノ酸ごとの保存頻度、及びアノテーションチャネルをボクセルごとに連結して、テンソルを生成することを更に含む、条項４７に記載のコンピュータ可読媒体。
４９．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、アミノ酸の各々についてのアミノ酸ごとの保存頻度、アノテーションチャネル、及び構造信頼度チャネルをボクセルごとに連結して、テンソルを生成することを更に含む、条項４８に記載のコンピュータ可読媒体。
５０．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、及び参照対立遺伝子のワンホット符号化をボクセルごとに連結してテンソルを生成することを更に含む、条項４９に記載のコンピュータ可読媒体。
５１．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、及び汎アミノ酸保存頻度配列をボクセルごとに連結してテンソルを生成することを更に含む、条項５０に記載のコンピュータ可読媒体。
５２．操作が、テンソルを生成するために、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、及びアノテーションチャネルをボクセルごとに連結することを更に含む、条項５１に記載のコンピュータ可読媒体。
５３．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、アノテーションチャネル、及び構造信頼度チャネルをボクセルごとに連結してテンソルを生成することを更に含む、条項５２に記載のコンピュータ可読媒体。
５４．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、及びアミノ酸の各々についてのアミノ酸ごとの保存頻度をボクセルごとに連結してテンソルを生成することを更に含む、条項５３に記載のコンピュータ可読媒体。
５５．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、汎アミノ酸保存頻度配列、アミノ酸の各々についてのアミノ酸ごとの保存頻度、及びアノテーションチャネルをボクセルごとに連結してテンソルを生成することを更に含む、条項５４に記載のコンピュータ可読媒体。
５６．動作が、アルファ炭素原子についてのアミノ酸ごとの距離チャネル、ベータ炭素原子についてのアミノ酸ごとの距離チャネル、代替対立遺伝子のワンホット符号化、参照対立遺伝子のワンホット符号化、アミノ酸の各々についてのアミノ酸ごとの保存頻度、アノテーションチャネル、及び構造信頼度チャネルをボクセルごとに連結してテンソルを生成することを更に含む、条項５５に記載のコンピュータ可読媒体。
５７．動作が、アミノ酸ごとの距離チャネルが生成される前に、アミノ酸の原子を回転させることを更に含む、条項１に記載のコンピュータ可読媒体。
５８．動作が、畳み込みニューラルネットワークにおいて、１×１×１畳み込み、３×３×３畳み込み、正規化線形ユニット活性化層、バッチ正規化層、全結合層、ドロップアウト正則化層、及びソフトマックス分類層を使用することを更に含む、条項１に記載のコンピュータ可読媒体。
５９．１×１×１畳み込み及び３×３×３畳み込みが、３次元畳み込みである、条項５８に記載のコンピュータ可読媒体。
６０．１×１×１畳み込みの層が、テンソルを処理し、テンソルの畳み込み表現である中間出力を生成し、３×３×３畳み込み層の配列が、中間出力を処理し、平坦化された出力を生成し、全結合層が、平坦化された出力を処理し、非正規化出力を生成し、ソフトマックス分類層が、非正規化出力を処理し、変異体が病原性及び良性である尤度を特定する指数関数的に正規化された出力を生成する、条項５８に記載のコンピュータ可読媒体。
６１．シグモイド層が、非正規化出力を処理し、変異体が病原性である尤度を特定する正規化出力を生成する、条項６０に記載のコンピュータ可読媒体。
６２．ボクセル、原子、及び距離が３次元座標を有し、テンソルが少なくとも３次元を有し、中間出力が少なくとも３次元を有し、平坦化された出力が１次元を有する、条項６０に記載のコンピュータ可読媒体。
６３．１つ以上のプロセッサ上で実行されると、
タンパク質中のアミノ酸についての、原子カテゴリごとの距離チャネルを記憶することであって、
アミノ酸が、複数の原子カテゴリの原子を有し、
複数の原子カテゴリのうちの原子カテゴリが、アミノ酸の原子エレメントを指定し、
原子カテゴリごとの距離チャネルの各々が、複数のボクセル内のボクセルに対するボクセルごとの距離値を有し、
ボクセルごとの距離値が、複数のボクセル内の対応するボクセルから複数の原子カテゴリ内の対応する原子カテゴリ内の原子までの距離を指定する、記憶することと、
原子カテゴリごとの距離チャネルと、変異体によって発現されるタンパク質の代替対立遺伝子とを含むテンソルを処理することと、
テンソルに少なくとも部分的に基づいて、変異体の病原性を決定することと、を含む動作を実行するようにコンピュータを構成するコンピュータ実行可能命令を記憶する、１つ以上のコンピュータ可読媒体。
６４．動作が、複数の原子カテゴリ内のそれぞれの原子カテゴリのそれぞれの原子上にボクセルのボクセルグリッドを中心付けることを更に含む、条項６３に記載のコンピュータ可読媒体。
６５．動作が、タンパク質中の少なくとも１つの変異体アミノ酸の残基のアルファ炭素原子上にボクセルグリッドを中心付けることを更に含む、条項６４に記載のコンピュータ可読媒体。
６６．距離が、ボクセルグリッド内の対応するボクセル中心から対応する原子カテゴリ内の最も近い原子までの最も近い原子の距離である、条項６５に記載のコンピュータ可読媒体。
６７．最も近い原子の距離が、ユークリッド距離である、条項６６に記載のコンピュータ可読媒体。
６８．最も近い原子の距離が、ユークリッド距離を最大の最も近い原子の距離で除算することによって正規化される、条項６７に記載のコンピュータ可読媒体。
６９．距離が、アミノ酸及びアミノ酸の原子カテゴリに関係なく、ボクセルグリッド内の対応するボクセル中心から最も近い原子までの最も近い原子の距離である、条項６８に記載のコンピュータ可読媒体。
７０．最も近い原子の距離が、ユークリッド距離である、条項６９に記載のコンピュータ可読媒体。
７１．最も近い原子の距離が、ユークリッド距離を最大の最も近い原子の距離で除算することによって正規化される、条項７０に記載のコンピュータ可読媒体。 clause set 2
1. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
storing per-amino acid distance channels for a plurality of amino acids in a protein, wherein each of the per-amino acid distance channels has per-voxel distance values for voxels in a plurality of voxels, and wherein the per-voxel distance values specify distances from corresponding voxels in the plurality of voxels to atoms of corresponding amino acids in the plurality of amino acids;
processing a tensor that includes the per-amino acid distance channels and an alternative allele of the protein expressed by a variant; and
determining a pathogenicity of the variant based at least in part on the tensor.
2. The computer-readable medium of clause 1, wherein the operations further comprise centering a voxel grid of the voxels on alpha-carbon atoms of respective residues of the amino acids.
3. The computer-readable medium of clause 2, wherein the operations further comprise centering the voxel grid on the alpha-carbon atom of a residue of a particular amino acid that corresponds to at least one variant amino acid in the protein.
4. The computer-readable medium of clause 3, wherein the operations further comprise encoding, in the tensor, a directionality of the amino acids and a position of the particular amino acid by multiplying the per-voxel distance values for those amino acids that precede the particular amino acid by a directionality parameter.
5. The computer-readable medium of clause 3, wherein the distances are nearest-atom distances from corresponding voxel centers in the voxel grid to nearest atoms of the corresponding amino acids.
6. The computer-readable medium of clause 5, wherein the nearest-atom distances are Euclidean distances.
7. The computer-readable medium of clause 6, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
8. The computer-readable medium of clause 5, wherein the amino acids have alpha-carbon atoms, and the distances are nearest-alpha-carbon-atom distances from the corresponding voxel centers to nearest alpha-carbon atoms of the corresponding amino acids.
9. The computer-readable medium of clause 5, wherein the amino acids have beta-carbon atoms, and the distances are nearest-beta-carbon-atom distances from the corresponding voxel centers to nearest beta-carbon atoms of the corresponding amino acids.
10. The computer-readable medium of clause 5, wherein the amino acids have backbone atoms, and the distances are nearest-backbone-atom distances from the corresponding voxel centers to nearest backbone atoms of the corresponding amino acids.
11. The computer-readable medium of clause 5, wherein the amino acids have side-chain atoms, and the distances are nearest-side-chain-atom distances from the corresponding voxel centers to nearest side-chain atoms of the corresponding amino acids.
12. The computer-readable medium of clause 3, wherein the operations further comprise encoding, in the tensor, a nearest-atom channel that specifies a distance from each voxel to a nearest atom, the nearest atom being selected irrespective of the amino acids and of the atomic elements of the amino acids.
13. The computer-readable medium of clause 12, wherein the distance is a Euclidean distance.
14. The computer-readable medium of clause 13, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.
15. The computer-readable medium of clause 12, wherein the amino acids include non-standard amino acids.
16. The computer-readable medium of clause 1, wherein the tensor further includes an absent-atom channel that specifies atoms not found within a predetermined radius of a voxel center, the absent-atom channel being one-hot encoded.
17. The computer-readable medium of clause 1, wherein the tensor further includes a one-hot encoding of the alternative allele that is voxel-wise encoded to each of the per-amino acid distance channels.
18. The computer-readable medium of clause 1, wherein the tensor further includes a reference allele of the protein.
19. The computer-readable medium of clause 18, wherein the tensor further includes a one-hot encoding of the reference allele that is voxel-wise encoded to each of the per-amino acid distance channels.
20. The computer-readable medium of clause 1, wherein the tensor further includes evolutionary profiles that specify conservation levels of the amino acids across a plurality of species.
21. The computer-readable medium of clause 20, wherein the operations further comprise, for each of the voxels:
selecting a nearest atom across the amino acids and the atom categories;
selecting a pan-amino acid conservation frequency sequence for the residue of the amino acid that includes the nearest atom; and
making the pan-amino acid conservation frequency sequence available as one of the evolutionary profiles.
22. The computer-readable medium of clause 21, wherein the pan-amino acid conservation frequency sequence is configured for a particular position of the residue as observed in the plurality of species.
23. The computer-readable medium of clause 21, wherein the pan-amino acid conservation frequency sequence specifies whether a conservation frequency is missing for a particular amino acid.
24. The computer-readable medium of clause 21, wherein the operations further comprise, for each of the voxels:
selecting respective nearest atoms in respective ones of the amino acids;
selecting respective per-amino acid conservation frequencies for respective residues of the amino acids that include the nearest atoms; and
making the per-amino acid conservation frequencies available as one of the evolutionary profiles.
25. The computer-readable medium of clause 24, wherein the per-amino acid conservation frequencies are configured for a particular position of the residues as observed in the plurality of species.
26. The computer-readable medium of clause 24, wherein the per-amino acid conservation frequencies specify whether a conservation frequency is missing for a particular amino acid.
27. The computer-readable medium of clause 1, wherein the tensor further includes annotation channels for the amino acids, the annotation channels being one-hot encoded in the tensor.
28. The computer-readable medium of clause 27, wherein the annotation channels are molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide.
29. The computer-readable medium of clause 27, wherein the annotation channels are region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.
30. The computer-readable medium of clause 27, wherein the annotation channels are site annotations that include active site, metal binding, binding site, and site.
31. The computer-readable medium of clause 27, wherein the annotation channels are amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link.
32. The computer-readable medium of clause 27, wherein the annotation channels are secondary structure annotations that include helix, turn, and beta strand.
33. The computer-readable medium of clause 27, wherein the annotation channels are experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.
34. The computer-readable medium of clause 1, wherein the tensor further includes structure confidence channels for the amino acids that specify the quality of the respective structures of the amino acids.
35. The computer-readable medium of clause 34, wherein the structure confidence channels are global model quality estimations (GMQEs).
36. The computer-readable medium of clause 34, wherein the structure confidence channels include qualitative model energy analysis (QMEAN) scores.
37. The computer-readable medium of clause 34, wherein the structure confidence channels are temperature factors that specify a degree to which the residues satisfy physical constraints of respective protein structures.
38. The computer-readable medium of clause 34, wherein the structure confidence channels are template structure alignments that specify a degree to which residues of atoms nearest to the voxels have aligned template structures.
39. The computer-readable medium of clause 38, wherein the structure confidence channels are template modeling scores of the aligned template structures.
40. The computer-readable medium of clause 39, wherein the structure confidence channels are a minimum of the template modeling scores, a mean of the template modeling scores, and a maximum of the template modeling scores.
41. The computer-readable medium of clause 1, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for alpha-carbon atoms with a one-hot encoding of the alternative allele to generate the tensor.
42. The computer-readable medium of clause 41, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for beta-carbon atoms with the one-hot encoding of the alternative allele to generate the tensor.
43. The computer-readable medium of clause 42, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, and the one-hot encoding of the alternative allele to generate the tensor.
44. The computer-readable medium of clause 43, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and a pan-amino acid conservation frequency sequence to generate the tensor.
45. The computer-readable medium of clause 44, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the pan-amino acid conservation frequency sequence, and annotation channels to generate the tensor.
46. The computer-readable medium of clause 45, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the pan-amino acid conservation frequency sequence, the annotation channels, and structure confidence channels to generate the tensor.
47. The computer-readable medium of clause 46, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and per-amino acid conservation frequencies for each of the amino acids to generate the tensor.
48. The computer-readable medium of clause 47, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the per-amino acid conservation frequencies for each of the amino acids, and the annotation channels to generate the tensor.
49. The computer-readable medium of clause 48, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels to generate the tensor.
50. The computer-readable medium of clause 49, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, and a one-hot encoding of a reference allele to generate the tensor.
51. The computer-readable medium of clause 50, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, and the pan-amino acid conservation frequency sequence to generate the tensor.
52. The computer-readable medium of clause 51, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequence, and the annotation channels to generate the tensor.
53. The computer-readable medium of clause 52, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequence, the annotation channels, and the structure confidence channels to generate the tensor.
54. The computer-readable medium of clause 53, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequence, and the per-amino acid conservation frequencies for each of the amino acids to generate the tensor.
55. The computer-readable medium of clause 54, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the pan-amino acid conservation frequency sequence, the per-amino acid conservation frequencies for each of the amino acids, and the annotation channels to generate the tensor.
56. The computer-readable medium of clause 55, wherein the operations further comprise voxel-wise concatenating the per-amino acid distance channels for the alpha-carbon atoms, the per-amino acid distance channels for the beta-carbon atoms, the one-hot encoding of the alternative allele, the one-hot encoding of the reference allele, the per-amino acid conservation frequencies for each of the amino acids, the annotation channels, and the structure confidence channels to generate the tensor.
57. The computer-readable medium of clause 1, wherein the operations further comprise rotating the atoms of the amino acids before the per-amino acid distance channels are generated.
58. The computer-readable medium of clause 1, wherein the operations further comprise using, in a convolutional neural network, 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit (ReLU) activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer.
59. The computer-readable medium of clause 58, wherein the 1×1×1 convolutions and the 3×3×3 convolutions are three-dimensional convolutions.
60. The computer-readable medium of clause 58, wherein a layer of the 1×1×1 convolutions processes the tensor and produces an intermediate output that is a convolved representation of the tensor, a sequence of layers of the 3×3×3 convolutions processes the intermediate output and produces a flattened output, the fully connected layer processes the flattened output and produces unnormalized outputs, and the softmax classification layer processes the unnormalized outputs and produces exponentially normalized outputs that identify likelihoods of the variant being pathogenic and benign.
61. The computer-readable medium of clause 60, wherein a sigmoid layer processes the unnormalized outputs and produces a normalized output that identifies a likelihood of the variant being pathogenic.
62. The computer-readable medium of clause 60, wherein the voxels, the atoms, and the distances have three-dimensional coordinates, the tensor has at least three dimensions, the intermediate output has at least three dimensions, and the flattened output has one dimension.
63. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
storing per-atom category distance channels for amino acids in a protein, wherein
the amino acids have atoms of a plurality of atom categories,
atom categories in the plurality of atom categories specify atomic elements of the amino acids,
each of the per-atom category distance channels has per-voxel distance values for voxels in a plurality of voxels, and
the per-voxel distance values specify distances from corresponding voxels in the plurality of voxels to atoms in corresponding atom categories in the plurality of atom categories;
processing a tensor that includes the per-atom category distance channels and an alternative allele of the protein expressed by a variant; and
determining a pathogenicity of the variant based at least in part on the tensor.
64. The computer-readable medium of clause 63, wherein the operations further comprise centering a voxel grid of the voxels on respective atoms of respective atom categories in the plurality of atom categories.
65. The computer-readable medium of clause 64, wherein the operations further comprise centering the voxel grid on the alpha-carbon atom of a residue of at least one variant amino acid in the protein.
66. The computer-readable medium of clause 65, wherein the distances are nearest-atom distances from corresponding voxel centers in the voxel grid to nearest atoms in the corresponding atom categories.
67. The computer-readable medium of clause 66, wherein the nearest-atom distances are Euclidean distances.
68. The computer-readable medium of clause 67, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
69. The computer-readable medium of clause 68, wherein the distances are nearest-atom distances from the corresponding voxel centers in the voxel grid to nearest atoms, irrespective of the amino acids and of the atom categories of the amino acids.
70. The computer-readable medium of clause 69, wherein the nearest-atom distances are Euclidean distances.
71. The computer-readable medium of clause 70, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
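
For illustration only, the output stage recited in clauses 58 to 61 can be sketched in plain Python. The sketch assumes a tensor stored as a list of voxels, each holding a list of channel values. The function names `conv_1x1x1`, `softmax`, and `sigmoid`, and the weights used in the usage note, are hypothetical and do not come from the disclosure. A 1×1×1 convolution is shown as a per-voxel linear mix of channels, and the softmax and sigmoid functions turn unnormalized outputs into likelihoods.

```python
import math

def conv_1x1x1(tensor, weights, bias):
    """Apply a 1x1x1 convolution: mix the channels of each voxel independently.

    tensor  : list of voxels, each a list of C_in channel values
    weights : C_out rows, each with C_in weights (hypothetical values)
    bias    : C_out bias terms
    """
    return [
        [sum(w * x for w, x in zip(row, voxel)) + b
         for row, b in zip(weights, bias)]
        for voxel in tensor
    ]

def softmax(logits):
    """Exponentially normalize unnormalized outputs so they sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Map a single unnormalized output to a value between 0 and 1."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

With two unnormalized outputs ordered as (pathogenic, benign), `softmax([2.0, 0.0])[0]` equals `sigmoid(2.0)`, which is why a sigmoid over a single output can stand in for a two-class softmax.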

Specific embodiment 2
In some embodiments, a system includes a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-amino acid distance channels. Each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels. The three-dimensional distance value specifies a distance from the corresponding voxel in the three-dimensional grid of voxels to atoms of the corresponding reference amino acid in the reference amino acid sequence. The system further includes an alternative allele encoder that encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels. The alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide. The system further includes an evolutionary conservation encoder that encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels. The evolutionary conservation sequence can be a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species. The amino acid-specific conservation frequencies can be selected depending on amino acid proximity to the corresponding voxel. The system further includes a convolutional neural network configured to apply three-dimensional convolutions to a tensor that includes the per-amino acid distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences. The convolutional neural network can also be configured to determine a pathogenicity of the variant nucleotide based at least in part on the tensor.

The voxelizer can center the three-dimensional grid of voxels on the alpha-carbon atoms of respective residues of the reference amino acids in the reference amino acid sequence. The voxelizer can center the three-dimensional grid of voxels on the alpha-carbon atom of a particular reference amino acid residue located at the position of the variant amino acid residue.

In some implementations, the system can be further configured to encode, in the tensor, a directionality of the reference amino acids in the reference amino acid sequence and a position of the particular reference amino acid by multiplying the three-dimensional distance values for those reference amino acids that precede the particular reference amino acid by a directionality parameter. The distances can be nearest-atom distances from corresponding voxel centers in the three-dimensional grid of voxels to nearest atoms of the corresponding reference amino acids. The nearest-atom distances can be Euclidean distances and can be normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
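
As a minimal sketch of the centering, nearest-atom distance, normalization, and directionality steps described above, the following plain-Python example builds one per-amino acid distance channel. Everything concrete in it is an assumption made for illustration: the 3×3×3 grid, the 2.0 spacing, the maximum distance of 6.0, the directionality parameter of -1.0, the clipping of distances at the maximum, the value 1.0 used when no residue of the requested amino acid exists, and the names `voxel_centers` and `distance_channel`.

```python
import math

def voxel_centers(center, size=3, spacing=2.0):
    """Return the centers of a size^3 voxel grid centered on `center`
    (for example, the alpha-carbon atom of the variant residue)."""
    half = (size - 1) / 2.0
    offsets = [(i - half) * spacing for i in range(size)]
    return [(center[0] + dx, center[1] + dy, center[2] + dz)
            for dx in offsets for dy in offsets for dz in offsets]

def distance_channel(centers, residues, amino_acid, variant_index,
                     max_distance=6.0, directionality=-1.0):
    """Build one per-amino acid distance channel over the voxel grid.

    residues: list of dicts with keys "index" (sequence position),
    "aa" (one-letter code), and "ca" (alpha-carbon coordinates).
    """
    atoms = [(r["index"], r["ca"]) for r in residues if r["aa"] == amino_acid]
    channel = []
    for c in centers:
        if not atoms:
            channel.append(1.0)  # assumption: treat "no such atom" as maximal distance
            continue
        # Nearest alpha-carbon atom of this amino acid (Euclidean distance).
        idx, pos = min(atoms, key=lambda a: math.dist(c, a[1]))
        # Normalize by the maximum distance (clipping is an assumption).
        value = min(math.dist(c, pos), max_distance) / max_distance
        # Encode directionality: residues preceding the variant get the sign flip.
        if idx < variant_index:
            value *= directionality
        channel.append(value)
    return channel
```

For a variant at sequence position 1 with its alpha carbon at the origin, `distance_channel(voxel_centers((0.0, 0.0, 0.0)), residues, "A", variant_index=1)` yields 27 values in [-1, 1], where negative values mark voxels whose nearest alanine precedes the variant in the sequence.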

In some embodiments, the reference amino acids can have alpha-carbon atoms, and the distances can be nearest-alpha-carbon-atom distances from the corresponding voxel centers to the nearest alpha-carbon atoms of the corresponding reference amino acids. In some embodiments, the reference amino acids can have beta-carbon atoms, and the distances can be nearest-beta-carbon-atom distances from the corresponding voxel centers to the nearest beta-carbon atoms of the corresponding reference amino acids. In some embodiments, the reference amino acids can have backbone atoms, and the distances can be nearest-backbone-atom distances from the corresponding voxel centers to the nearest backbone atoms of the corresponding reference amino acids. In some embodiments, the amino acids can have side-chain atoms, and the distances can be nearest-side-chain-atom distances from the corresponding voxel centers to the nearest side-chain atoms of the corresponding reference amino acids.

In some embodiments, the system can be further configured to encode, in the tensor, a nearest-atom channel that specifies a distance from each voxel to a nearest atom. The nearest atom can be selected irrespective of the amino acids and of the atomic elements of the amino acids. The distance can be a Euclidean distance and can be normalized by dividing the Euclidean distance by a maximum distance. The amino acids can include non-standard amino acids. The tensor can further include an absent-atom channel that specifies atoms not found within a predetermined radius of a voxel center. The absent-atom channel can be one-hot encoded.
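
The nearest-atom channel and the absent-atom channel can be sketched for a single voxel as follows. This is an illustrative reading, not the disclosed implementation: the absent-atom channel is interpreted here as one binary flag per atom category, and the names `nearest_atom_value` and `absent_atom_flags`, the category labels, the radius of 6.0, and the clipping of distances are all assumptions.

```python
import math

def nearest_atom_value(voxel_center, atoms, max_distance=6.0):
    """Normalized Euclidean distance from a voxel center to the nearest atom,
    chosen irrespective of amino acid and atom category.

    atoms: list of (category, (x, y, z)) tuples.
    """
    nearest = min(math.dist(voxel_center, xyz) for _, xyz in atoms)
    return min(nearest, max_distance) / max_distance  # clipping is an assumption

def absent_atom_flags(voxel_center, atoms, categories, radius=6.0):
    """Return one flag per category: 1 if no atom of that category lies
    within `radius` of the voxel center, otherwise 0."""
    present = {cat for cat, xyz in atoms
               if math.dist(voxel_center, xyz) <= radius}
    return [0 if cat in present else 1 for cat in categories]
```

For atoms `[("CA", (0, 0, 0)), ("CB", (1.5, 0, 0)), ("N", (10, 0, 0))]` and a voxel centered at `(0, 0, 3)`, the nearest-atom value is 0.5 and the flags for the categories `["CA", "CB", "N", "O"]` are `[0, 0, 1, 1]`.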

In some embodiments, the system can further include a reference allele encoder that voxel-wise encodes a reference allele amino acid to each three-dimensional distance value on an amino acid position basis. The reference allele amino acid can be a three-dimensional representation of a one-hot encoding of the reference amino acid sequence. The amino acid-specific conservation frequencies can specify conservation levels of the respective amino acids across the plurality of species.
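
One way to picture the voxel-wise encoding of the reference and alternative alleles is to repeat each one-hot vector at every voxel and append it to that voxel's distance values. The sketch below assumes a 20-letter alphabet of standard amino acids and a tensor stored as a list of voxels, each a list of channel values. The names `one_hot`, `broadcast_to_voxels`, and `concat_voxelwise` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard amino acids

def one_hot(aa, alphabet=AMINO_ACIDS):
    """One-hot encode a single amino acid given by its one-letter code."""
    if len(aa) != 1 or aa not in alphabet:
        raise ValueError(f"unknown amino acid: {aa!r}")
    return [1.0 if a == aa else 0.0 for a in alphabet]

def broadcast_to_voxels(vector, n_voxels):
    """Repeat the same feature vector at every voxel of the grid."""
    return [list(vector) for _ in range(n_voxels)]

def concat_voxelwise(*features):
    """Concatenate per-voxel feature lists along the channel axis."""
    n_voxels = len(features[0])
    if any(len(f) != n_voxels for f in features):
        raise ValueError("all features must cover the same number of voxels")
    return [sum((f[v] for f in features), []) for v in range(n_voxels)]
```

With 27 voxels and two distance channels per voxel, concatenating the distance values with the broadcast reference and alternative one-hot vectors gives 2 + 20 + 20 = 42 channels per voxel.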

In some embodiments, the evolutionary conservation encoder can select a nearest atom to the corresponding voxel across the reference amino acids and the atom categories, can select pan-amino acid conservation frequencies for the residue of the reference amino acid that includes the nearest atom, and can use a three-dimensional representation of the pan-amino acid conservation frequencies as the evolutionary conservation sequence. The pan-amino acid conservation frequencies can be configured for a particular position of the residue as observed in the plurality of species. The pan-amino acid conservation frequencies can specify whether a conservation frequency is missing for a particular reference amino acid.
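
The selection described above can be sketched for a single voxel: find the nearest atom of any residue, then look up that residue's conservation frequencies. The data layout, the function name `pan_amino_acid_conservation`, and the trailing flag that marks a missing profile are assumptions made for this example.

```python
import math

def pan_amino_acid_conservation(voxel_center, atoms, conservation,
                                n_amino_acids=20):
    """Return the conservation frequencies of the residue nearest to a voxel.

    atoms        : list of (residue_index, (x, y, z)) over all residues
                   and atom categories
    conservation : dict mapping residue_index to a list of per-amino acid
                   frequencies observed across species
    The last element of the result is 1.0 when no profile is available.
    """
    residue_index, _ = min(atoms, key=lambda a: math.dist(voxel_center, a[1]))
    freqs = conservation.get(residue_index)
    if freqs is None:
        return [0.0] * n_amino_acids + [1.0]  # missing-profile flag set
    return list(freqs) + [0.0]
```

Every voxel thus receives the full frequency vector of one residue, so voxels near conserved positions carry sharply peaked profiles while voxels near residues without alignment data carry only the missing flag.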

In some implementations, the evolutionary conservation encoder can select respective nearest atoms to the corresponding voxel in respective ones of the reference amino acids, can select respective per-amino acid conservation frequencies for respective residues of the reference amino acids that contain the nearest atoms, and can use a three-dimensional representation of the per-amino acid conservation frequencies as the evolutionary conservation sequence. The per-amino acid conservation frequencies can be configured for a specific position of the residues as observed in a plurality of species. The per-amino acid conservation frequencies can specify whether a conservation frequency is missing for a particular reference amino acid.
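One possible reading of the per-amino acid selection is sketched below for a single voxel: for each amino acid type, the nearest atom belonging to a residue of that type is found, and that amino acid's conservation frequency at the owning residue is used. The function name, the data layout, the 20-letter alphabet, and the value 0.0 for amino acids that do not occur near the voxel are assumptions, not details confirmed by the source.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def per_amino_acid_conservation(center, atoms, residue_freqs):
    """Return one conservation value per amino acid type for one voxel.

    atoms:         list of (residue_index, amino_acid_letter, (x, y, z))
    residue_freqs: dict residue_index -> dict amino_acid_letter -> frequency
    """
    values = []
    for aa in AMINO_ACIDS:
        candidates = [atom for atom in atoms if atom[1] == aa]
        if not candidates:
            values.append(0.0)  # assumption: no residue of this type present
            continue
        residue_index, _, _ = min(
            candidates, key=lambda atom: math.dist(center, atom[2]))
        values.append(residue_freqs[residue_index][aa])
    return values


# Hypothetical toy input: two alanines and one glycine.
atoms = [(0, "A", (0.0, 0.0, 0.0)),
         (1, "A", (5.0, 0.0, 0.0)),
         (2, "G", (1.0, 0.0, 0.0))]
residue_freqs = {0: {"A": 0.9}, 1: {"A": 0.4}, 2: {"G": 0.7}}
values = per_amino_acid_conservation((4.0, 0.0, 0.0), atoms, residue_freqs)
```

Unlike the pan-amino acid variant, each of the 20 values here can come from a different residue.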

In some implementations, the system can further include an annotation encoder that voxel-wise encodes one or more annotation channels to each three-dimensional distance value. The annotation channels can be three-dimensional representations of a one-hot encoding of residue annotations, and can be molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide. In some implementations, the annotation channels can be region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias, or can be site annotations that include active site, metal binding, binding site, and site. In some implementations, the annotation channels can be amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link, or can be secondary structure annotations that include helix, turn, and beta strand. The annotation channels can be experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.

In some implementations, the system can further include a structure confidence encoder that voxel-wise encodes one or more structure confidence channels to each three-dimensional distance value. The structure confidence channels can be three-dimensional representations of confidence scores that specify the quality of respective residue structures. The structure confidence channels can be global model quality estimations (GMQEs), can be qualitative model energy analysis (QMEAN) scores, can be temperature factors that specify a degree to which the residues satisfy physical constraints of respective protein structures, can be template structure alignments that specify a degree to which residues of atoms nearest to the voxels have aligned template structures, can be template modeling scores of the aligned template structures, or can be a minimum one of the template modeling scores, a mean of the template modeling scores, and a maximum one of the template modeling scores.

In some implementations, the system can further include an atom rotation engine that rotates the atoms before the per-amino acid distance channels are generated.
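The text does not state which rotations the atom rotation engine applies. As a minimal sketch under that caveat, the following hypothetical helper rotates atom coordinates about the z-axis through a chosen center (for example, the grid center) before any distances are computed. The choice of axis and the function name are assumptions.

```python
import math


def rotate_about_z(atoms, center, angle_rad):
    """Rotate (x, y, z) atom coordinates by `angle_rad` about the z-axis
    passing through `center`. Distances between atoms are unchanged."""
    cx, cy, _ = center
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    rotated = []
    for x, y, z in atoms:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_a * dx - sin_a * dy,
                        cy + sin_a * dx + cos_a * dy,
                        z))
    return rotated


# Hypothetical toy input: a quarter turn about the origin.
atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
rotated = rotate_about_z(atoms, (0.0, 0.0, 0.0), math.pi / 2)
```

Since a rotation preserves interatomic distances, only the placement of the atoms relative to the fixed voxel grid changes, and the distance channels are then recomputed from the rotated coordinates.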

The convolutional neural network can use 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit (ReLU) activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer. The 1×1×1 convolutions and the 3×3×3 convolutions can be three-dimensional convolutions. In some implementations, a layer of the 1×1×1 convolutions can process the tensor and produce an intermediate output that is a convolved representation of the tensor. A sequence of layers of the 3×3×3 convolutions can process the intermediate output and produce a flattened output. The fully connected layer can process the flattened output and produce unnormalized outputs. The softmax classification layer can process the unnormalized outputs and produce exponentially normalized outputs that identify likelihoods of the variant nucleotide being pathogenic and benign.
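A 1×1×1 convolution has no spatial extent, so it amounts to applying the same linear map across the channels of every voxel independently. The sketch below shows this in plain Python on a flattened grid, followed by a ReLU activation. The function names, weights, and channel counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def conv1x1x1(voxels, weights, bias):
    """Apply a 1x1x1 convolution to a flattened voxel grid.

    voxels:  list of per-voxel input channel vectors (length C_in each)
    weights: C_out rows of C_in weights, shared by every voxel
    bias:    C_out bias terms
    """
    return [[sum(w * x for w, x in zip(row, voxel)) + b
             for row, b in zip(weights, bias)]
            for voxel in voxels]


def relu(voxels):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, x) for x in voxel] for voxel in voxels]


# Hypothetical toy input: 2 voxels, 2 input channels, 3 output channels.
voxels = [[1.0, 2.0], [0.0, -1.0]]
weights = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
bias = [0.0, 0.0, -1.0]
hidden = relu(conv1x1x1(voxels, weights, bias))
```

The 3×3×3 convolutions differ in that each output voxel also depends on its 26 spatial neighbors.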

In some implementations, a sigmoid layer can process the unnormalized outputs and produce a normalized output that identifies a likelihood of the variant nucleotide being pathogenic. The convolutional neural network can be an attention-based neural network. The tensor can include the per-amino acid distance channels further encoded with the reference allele amino acid, can include the per-amino acid distance channels further encoded with the annotation channels, or can include the per-amino acid distance channels further encoded with the structure confidence channels.
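The softmax classification layer and the sigmoid layer both normalize the unnormalized outputs of the fully connected layer. As a brief sketch, with hypothetical function names and a two-class (pathogenic, benign) ordering that is assumed here, they can be written as follows.

```python
import math


def softmax(logits):
    """Exponentially normalize a list of unnormalized outputs.

    Subtracting the maximum keeps math.exp from overflowing."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Map a single unnormalized output to (0, 1) without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Hypothetical unnormalized outputs, ordered as [pathogenic, benign].
p_pathogenic, p_benign = softmax([1.3, 0.0])
p_single = sigmoid(1.3)
```

For two classes, the softmax probability of the first class equals the sigmoid of the difference between the two unnormalized outputs, which is why a single sigmoid output can stand in for the two-way softmax.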

In some implementations, the system can include a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-atom category distance channels. The atoms span a plurality of atom categories that specify atomic elements of the amino acids. Each of the per-atom category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels. The three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories. The system further includes an alternative allele encoder that encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels. The alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide. The system further includes an evolutionary conservation encoder that encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels. The evolutionary conservation sequence can be a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species. The amino acid-specific conservation frequencies can be selected in dependence upon amino acid proximity to the corresponding voxel. The system further includes a convolutional neural network configured to apply three-dimensional convolutions to a tensor that includes the per-atom category distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences, and to determine a pathogenicity of the variant nucleotide based at least in part on the tensor.
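The per-atom category distance channels can be sketched as one nearest-distance channel per category. In this illustration the categories are element symbols, distances are divided by a maximum distance, and a category with no atoms receives the maximum value. The function name, the normalization, and the fallback for empty categories are assumptions rather than details from the source.

```python
import math


def atom_category_distance_channels(centers, atoms, categories, max_distance):
    """Build one distance channel per atom category.

    centers:    list of voxel centers (x, y, z), i.e. a flattened grid
    atoms:      list of (category, (x, y, z)) tuples (assumed layout)
    categories: ordered list of category labels, e.g. element symbols
    Returns a dict mapping category -> list of normalized distances.
    """
    channels = {}
    for category in categories:
        coords = [xyz for label, xyz in atoms if label == category]
        channel = []
        for c in centers:
            # Fall back to max_distance when the category has no atoms.
            d = min((math.dist(c, a) for a in coords), default=max_distance)
            channel.append(min(d, max_distance) / max_distance)
        channels[category] = channel
    return channels


# Hypothetical toy input: one voxel, two carbons, one nitrogen, no oxygen.
atoms = [("C", (3.0, 0.0, 0.0)), ("N", (0.0, 4.0, 0.0)), ("C", (0.0, 0.0, 1.0))]
channels = atom_category_distance_channels(
    [(0.0, 0.0, 0.0)], atoms, ["C", "N", "O"], max_distance=5.0)
```

Grouping by atom category instead of by amino acid changes only the key used to partition the atoms, and the distance computation is otherwise the same.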

In some implementations, the system includes a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-amino acid distance channels. Each of the per-amino acid distance channels can have a three-dimensional distance value for each voxel in the three-dimensional grid of voxels. The three-dimensional distance value can specify a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence. The system further includes an alternative allele encoder that encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels. The alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide. The system further includes an evolutionary conservation encoder that encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels. The evolutionary conservation sequence can be a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species. The amino acid-specific conservation frequencies can be selected in dependence upon amino acid proximity to the corresponding voxel. The system further includes a tensor generator configured to generate a tensor that includes the per-amino acid distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences.
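The role of the tensor generator can be illustrated by concatenating, for every voxel, the distance values, the one-hot encoding of the variant amino acid, and the conservation frequencies. This is a sketch under stated assumptions: the function names, the 20-letter alphabet, the channel counts, and the flat per-voxel list layout are hypothetical, and the same one-hot vector is repeated at each voxel.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def one_hot(amino_acid):
    """One-hot encode a single amino acid letter."""
    if amino_acid not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return [1.0 if amino_acid == a else 0.0 for a in AMINO_ACIDS]


def build_tensor(distance_channels, alt_amino_acid, conservation_channels):
    """Concatenate per-voxel features into one channel vector per voxel.

    distance_channels:     per-voxel lists of distance values
    alt_amino_acid:        letter of the variant (alternative allele) amino acid
    conservation_channels: per-voxel lists of conservation frequencies
    """
    if len(distance_channels) != len(conservation_channels):
        raise ValueError("distance and conservation grids differ in size")
    alt = one_hot(alt_amino_acid)
    return [list(d) + alt + list(c)
            for d, c in zip(distance_channels, conservation_channels)]


# Hypothetical toy input: 2 voxels with 20 distance and 20 conservation values.
distances = [[0.5] * 20, [1.0] * 20]
conservation = [[0.05] * 20, [0.0] * 20]
tensor = build_tensor(distances, "V", conservation)
```

With these assumed sizes each voxel carries 60 channels, and a real grid would add three spatial axes in place of the flat voxel list.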

In some implementations, the system includes a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-atom category distance channels. The atoms can span a plurality of atom categories that specify atomic elements of the amino acids. Each of the per-atom category distance channels can have a three-dimensional distance value for each voxel in the three-dimensional grid of voxels. The three-dimensional distance value can specify a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories. The system further includes an alternative allele encoder that encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels. The alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide. The system further includes an evolutionary conservation encoder that encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels. The evolutionary conservation sequence can be a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species. The amino acid-specific conservation frequencies can be selected in dependence upon amino acid proximity to the corresponding voxel. The system further includes a tensor generator configured to generate a tensor that includes the per-atom category distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences.

Clause Set 1
1. A computer-implemented method, comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein, and fitting a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-amino acid distance channels,
wherein each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding an alternative allele channel to each voxel in the three-dimensional grid of voxels, wherein the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding an evolutionary conservation channel to each sequence of three-dimensional distance values across the per-amino acid distance channels on a voxel position basis,
wherein the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
wherein the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the per-amino acid distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.
2. The computer-implemented method of clause 1, further comprising centering the three-dimensional grid of voxels on alpha-carbon atoms of respective residues of the reference amino acids in the reference amino acid sequence.
3. The computer-implemented method of clause 2, further comprising centering the three-dimensional grid of voxels on the alpha-carbon atom of the residue of a particular reference amino acid that corresponds to the variant amino acid.
4. The computer-implemented method of clause 3, further comprising encoding, in the tensor, a directionality of the reference amino acids in the reference amino acid sequence and a position of the particular reference amino acid by multiplying, by a directionality parameter, the three-dimensional distance values for those reference amino acids that precede the particular reference amino acid.
5. The computer-implemented method of clause 4, wherein the distances are nearest-atom distances from corresponding voxel centers in the three-dimensional grid of voxels to nearest atoms of the corresponding reference amino acids.
6. The computer-implemented method of clause 5, wherein the nearest-atom distances are Euclidean distances.
7. The computer-implemented method of clause 6, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
8. The computer-implemented method of clause 5, wherein the reference amino acids have alpha-carbon atoms, and the distances are nearest-alpha-carbon-atom distances from the corresponding voxel centers to nearest alpha-carbon atoms of the corresponding reference amino acids.
9. The computer-implemented method of clause 5, wherein the reference amino acids have beta-carbon atoms, and the distances are nearest-beta-carbon-atom distances from the corresponding voxel centers to nearest beta-carbon atoms of the corresponding reference amino acids.
10. The computer-implemented method of clause 5, wherein the reference amino acids have backbone atoms, and the distances are nearest-backbone-atom distances from the corresponding voxel centers to nearest backbone atoms of the corresponding reference amino acids.
11. The computer-implemented method of clause 5, wherein the amino acids have side chain atoms, and the distances are nearest-side-chain-atom distances from the corresponding voxel centers to nearest side chain atoms of the corresponding reference amino acids.
12. The computer-implemented method of clause 3, further comprising encoding, in the tensor, a nearest-atom channel that specifies a distance from each voxel to a nearest atom, wherein the nearest atom is selected irrespective of the amino acids and the atomic elements of the amino acids.
13. The computer-implemented method of clause 12, wherein the distance is a Euclidean distance.
14. The computer-implemented method of clause 13, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.
15. The computer-implemented method of clause 12, wherein the amino acids include non-standard amino acids.
16. The computer-implemented method of clause 1, wherein the tensor further includes an absent atom channel that specifies atoms not found within a predetermined radius of a voxel center.
17. The computer-implemented method of clause 16, wherein the absent atom channel is one-hot encoded.
18. The computer-implemented method of clause 1, further comprising voxel-wise encoding a reference allele channel to each voxel in the three-dimensional grid of voxels.
19. The computer-implemented method of clause 18, wherein the reference allele amino acid is a three-dimensional representation of a one-hot encoding of the reference amino acid that experiences the variant amino acid.
20. The computer-implemented method of clause 1, wherein the amino acid-specific conservation frequencies specify conservation levels of respective amino acids across the plurality of species.
21. The computer-implemented method of clause 20, further comprising:
selecting a nearest atom to the corresponding voxel across the reference amino acids and the atom categories;
selecting pan-amino acid conservation frequencies for a residue of the reference amino acid that contains the nearest atom; and
using a three-dimensional representation of the pan-amino acid conservation frequencies as the evolutionary conservation channel.
22. The computer-implemented method of clause 21, wherein the pan-amino acid conservation frequencies are configured for a specific position of the residue as observed in the plurality of species.
23. The computer-implemented method of clause 21, wherein the pan-amino acid conservation frequencies specify whether a conservation frequency is missing for a particular reference amino acid.
24. The computer-implemented method of clause 21, further comprising:
selecting respective nearest atoms to the corresponding voxel in respective ones of the reference amino acids;
selecting respective per-amino acid conservation frequencies for respective residues of the reference amino acids that contain the nearest atoms; and
using a three-dimensional representation of the per-amino acid conservation frequencies as the evolutionary conservation channel.
25. The computer-implemented method of clause 24, wherein the per-amino acid conservation frequencies are configured for a specific position of the residues as observed in the plurality of species.
26. The computer-implemented method of clause 24, wherein the per-amino acid conservation frequencies specify whether a conservation frequency is missing for a particular reference amino acid.
27. The computer-implemented method of clause 1, further comprising voxel-wise encoding, to each voxel in the three-dimensional grid of voxels, one or more annotation channels that are three-dimensional representations of a one-hot encoding of residue annotations.
28. The computer-implemented method of clause 27, wherein the annotation channels are molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide.
29. The computer-implemented method of clause 27, wherein the annotation channels are region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.
30. The computer-implemented method of clause 27, wherein the annotation channels are site annotations that include active site, metal binding, binding site, and site.
31. The computer-implemented method of clause 27, wherein the annotation channels are amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link.
32. The computer-implemented method of clause 27, wherein the annotation channels are secondary structure annotations that include helix, turn, and beta strand.
33. The computer-implemented method of clause 27, wherein the annotation channels are experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.
34. The computer-implemented method of clause 1, further comprising voxel-wise encoding, to each voxel in the three-dimensional grid of voxels, one or more structure confidence channels that are three-dimensional representations of confidence scores that specify the quality of respective residue structures.
35. The computer-implemented method of clause 34, wherein the structure confidence channels are global model quality estimations (GMQEs).
36. The computer-implemented method of clause 34, wherein the structure confidence channels are qualitative model energy analysis (QMEAN) scores.
37. The computer-implemented method of clause 34, wherein the structure confidence channels are temperature factors that specify a degree to which the residues satisfy physical constraints of respective protein structures.
38. The computer-implemented method of clause 34, wherein the structure confidence channels are template structure alignments that specify a degree to which residues of atoms nearest to the voxels have aligned template structures.
39. The computer-implemented method of clause 38, wherein the structure confidence channels are template modeling scores of the aligned template structures.
40. The computer-implemented method of clause 39, wherein the structure confidence channels are a minimum one of the template modeling scores, a mean of the template modeling scores, and a maximum one of the template modeling scores.
41. The computer-implemented method of clause 1, further comprising rotating the atoms before the per-amino acid distance channels are generated.
42. The computer-implemented method of clause 1, further comprising using, in a convolutional neural network, 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit (ReLU) activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer.
43. The computer-implemented method of clause 42, wherein the 1×1×1 convolutions and the 3×3×3 convolutions are three-dimensional convolutions.
44. The computer-implemented method of clause 42, wherein a layer of the 1×1×1 convolutions processes the tensor and produces an intermediate output that is a convolved representation of the tensor, a sequence of layers of the 3×3×3 convolutions processes the intermediate output and produces a flattened output, the fully connected layer processes the flattened output and produces unnormalized outputs, and the softmax classification layer processes the unnormalized outputs and produces exponentially normalized outputs that identify likelihoods of the variant nucleotide being pathogenic and benign.
45. The computer-implemented method of clause 44, wherein a sigmoid layer processes the unnormalized outputs and produces a normalized output that identifies a likelihood of the variant nucleotide being pathogenic.
46. The computer-implemented method of clause 1, wherein the convolutional neural network is an attention-based neural network.
47. The computer-implemented method of clause 1, wherein the tensor includes the per-amino acid distance channels further encoded with a reference allele channel.
48. The computer-implemented method of clause 1, wherein the tensor includes the per-amino acid distance channels further encoded with an annotation channel.
49. The computer-implemented method of clause 1, wherein the tensor includes the per-amino acid distance channels further encoded with a structural confidence channel.
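Clauses 44 and 45 describe two ways of normalizing the unnormalized output of the fully connected layer. The following is a minimal sketch in plain Python, not the claimed implementation: the function names and the score values are illustrative assumptions. It shows a softmax exponentially normalizing a pair of scores into pathogenic and benign likelihoods, and a sigmoid mapping a single score to a pathogenic likelihood.

```python
import math

def softmax(logits):
    """Exponentially normalize unnormalized scores so that they sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Map a single unnormalized score to a likelihood in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical unnormalized outputs of the fully connected layer.
pathogenic_logit, benign_logit = 1.2, -0.3

# Two-class softmax: likelihoods of pathogenic and benign.
p_pathogenic, p_benign = softmax([pathogenic_logit, benign_logit])

# Sigmoid on the score difference: a single pathogenic likelihood.
p_single = sigmoid(pathogenic_logit - benign_logit)
```

For two classes the two normalizations agree when the sigmoid is applied to the difference of the two scores, which is why a single-output sigmoid head can stand in for a two-output softmax head.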
50. A computer-implemented method, the method comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure that span a plurality of atom categories, to generate per-atom category distance channels, wherein
the atoms span the plurality of atom categories,
an atom category of the plurality of atom categories specifies an atomic element of the amino acids,
each of the per-atom category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-atom category distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the per-atom category distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.
51. A computer-implemented method, the method comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure to generate per-amino acid distance channels, wherein
each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-amino acid distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel; and
generating a tensor that includes the per-amino acid distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels.
52. A computer-implemented method, the method comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure that span a plurality of atom categories, to generate per-atom category distance channels, wherein
the atoms span the plurality of atom categories,
an atom category of the plurality of atom categories specifies an atomic element of the amino acids,
each of the per-atom category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-atom category distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel; and
generating a tensor that includes the per-atom category distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels.
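The per-amino acid distance channels recited in these clauses can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation: the grid size, voxel spacing, maximum distance, and the choice to assign the maximum distance to amino acids with no atoms in the structure are all assumptions made for illustration, as are the names `voxel_centers` and `distance_channels`.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one channel each

def voxel_centers(center, size=3, spacing=2.0):
    """Centers of a size^3 voxel grid centered on `center` (e.g. an alpha carbon)."""
    half = (size - 1) / 2.0
    return [
        (center[0] + (i - half) * spacing,
         center[1] + (j - half) * spacing,
         center[2] + (k - half) * spacing)
        for i in range(size) for j in range(size) for k in range(size)
    ]

def distance_channels(atoms, centers, max_dist=6.0):
    """atoms: list of (amino_acid_letter, (x, y, z)).

    Returns {amino_acid: [normalized nearest-atom distance per voxel]}.
    Distances are Euclidean, clipped at max_dist, and divided by max_dist.
    """
    channels = {}
    for aa in AMINO_ACIDS:
        coords = [xyz for letter, xyz in atoms if letter == aa]
        channel = []
        for c in centers:
            # Assumption: an amino acid with no atoms gets the maximum distance.
            nearest = min((math.dist(c, xyz) for xyz in coords), default=max_dist)
            channel.append(min(nearest, max_dist) / max_dist)
        channels[aa] = channel
    return channels

# Hypothetical toy structure: two alanine atoms and one glycine atom.
atoms = [("A", (0.0, 0.0, 0.0)), ("G", (2.0, 0.0, 0.0)), ("A", (0.0, 4.0, 0.0))]
centers = voxel_centers(center=(0.0, 0.0, 0.0))
channels = distance_channels(atoms, centers)
```

Each channel holds one value per voxel, so stacking the 20 channels gives the distance portion of the tensor, to which the alternative allele and evolutionary conservation channels are then added.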

Clause Set 2
1. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure to generate per-amino acid distance channels, wherein
each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-amino acid distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the per-amino acid distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.
2. The computer-readable medium of clause 1, wherein the operations further comprise centering the three-dimensional grid of voxels on an alpha-carbon atom of a respective residue of the reference amino acids in the reference amino acid sequence.
3. The computer-readable medium of clause 2, wherein the operations further comprise centering the three-dimensional grid of voxels on the alpha-carbon atom of a residue of a particular reference amino acid that corresponds to the variant amino acid.
4. The computer-readable medium of clause 3, wherein the operations further comprise encoding, in the tensor, a directionality of the reference amino acids in the reference amino acid sequence and a position of the particular reference amino acid by multiplying the three-dimensional distance values for those reference amino acids that precede the particular reference amino acid by a directionality parameter.
5. The computer-readable medium of clause 4, wherein the distances are nearest-atom distances from corresponding voxel centers in the three-dimensional grid of voxels to nearest atoms of the corresponding reference amino acids.
6. The computer-readable medium of clause 5, wherein the nearest-atom distances are Euclidean distances.
7. The computer-readable medium of clause 6, wherein the nearest-atom distances are normalized by dividing the Euclidean distances by a maximum nearest-atom distance.
8. The computer-readable medium of clause 5, wherein the reference amino acids have alpha-carbon atoms, and the distances are nearest alpha-carbon atom distances from the corresponding voxel centers to nearest alpha-carbon atoms of the corresponding reference amino acids.
9. The computer-readable medium of clause 5, wherein the reference amino acids have beta-carbon atoms, and the distances are nearest beta-carbon atom distances from the corresponding voxel centers to nearest beta-carbon atoms of the corresponding reference amino acids.
10. The computer-readable medium of clause 5, wherein the reference amino acids have backbone atoms, and the distances are nearest backbone atom distances from the corresponding voxel centers to nearest backbone atoms of the corresponding reference amino acids.
11. The computer-readable medium of clause 5, wherein the amino acids have side-chain atoms, and the distances are nearest side-chain atom distances from the corresponding voxel centers to nearest side-chain atoms of the corresponding reference amino acids.
12. The computer-readable medium of clause 3, wherein the operations further comprise encoding, in the tensor, a nearest atom channel that specifies a distance from each voxel to a nearest atom, the nearest atom being selected irrespective of the amino acids and the atomic elements of the amino acids.
13. The computer-readable medium of clause 12, wherein the distance is a Euclidean distance.
14. The computer-readable medium of clause 13, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.
15. The computer-readable medium of clause 12, wherein the amino acids include non-standard amino acids.
16. The computer-readable medium of clause 1, wherein the tensor further includes an absent atom channel that specifies atoms not found within a predetermined radius of a voxel center.
17. The computer-readable medium of clause 16, wherein the absent atom channel is one-hot encoded.
18. The computer-readable medium of clause 1, wherein the operations further comprise voxel-wise encoding a reference allele channel into each voxel in the three-dimensional grid of voxels.
19. The computer-readable medium of clause 18, wherein the reference allele amino acid is a three-dimensional representation of a one-hot encoding of the reference amino acid that experiences the variant amino acid.
20. The computer-readable medium of clause 1, wherein the amino acid-specific conservation frequencies specify conservation levels of respective amino acids across the plurality of species.
21. The computer-readable medium of clause 20, wherein the operations further comprise:
selecting a nearest atom to the corresponding voxel across the reference amino acids and the atom categories;
selecting pan-amino acid conservation frequencies for a residue of a reference amino acid that contains the nearest atom; and
using a three-dimensional representation of the pan-amino acid conservation frequencies as the evolutionary conservation channel.
22. The computer-readable medium of clause 21, wherein the pan-amino acid conservation frequencies are configured for a particular position of the residue as observed in the plurality of species.
23. The computer-readable medium of clause 21, wherein the pan-amino acid conservation frequencies specify whether there is a missing conservation frequency for a particular reference amino acid.
24. The computer-readable medium of clause 21, wherein the operations further comprise:
selecting respective nearest atoms to the corresponding voxel in respective ones of the reference amino acids;
selecting respective per-amino acid conservation frequencies for respective residues of the reference amino acids that contain the nearest atoms; and
using a three-dimensional representation of the per-amino acid conservation frequencies as the evolutionary conservation channel.
25. The computer-readable medium of clause 24, wherein the per-amino acid conservation frequencies are configured for a particular position of the residue as observed in the plurality of species.
26. The computer-readable medium of clause 24, wherein the per-amino acid conservation frequencies specify whether there is a missing conservation frequency for a particular reference amino acid.
27. The computer-readable medium of clause 1, wherein the operations further comprise voxel-wise encoding, into each voxel in the three-dimensional grid of voxels, one or more annotation channels that are three-dimensional representations of one-hot encodings of residue annotations.
28. The computer-readable medium of clause 27, wherein the annotation channels are molecular processing annotations that include initiator methionine, signal, transit peptide, propeptide, chain, and peptide.
29. The computer-readable medium of clause 27, wherein the annotation channels are region annotations that include topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.
30. The computer-readable medium of clause 27, wherein the annotation channels are site annotations that include active site, metal binding, binding site, and site.
31. The computer-readable medium of clause 27, wherein the annotation channels are amino acid modification annotations that include non-standard residue, modified residue, lipidation, glycosylation, disulfide bond, and cross-link.
32. The computer-readable medium of clause 27, wherein the annotation channels are secondary structure annotations that include helix, turn, and beta strand.
33. The computer-readable medium of clause 27, wherein the annotation channels are experimental information annotations that include mutagenesis, sequence uncertainty, sequence conflict, non-adjacent residues, and non-terminal residue.
34. The computer-readable medium of clause 1, wherein the operations further comprise voxel-wise encoding, into each voxel in the three-dimensional grid of voxels, one or more structural confidence channels that are three-dimensional representations of confidence scores specifying the quality of respective residue structures.
35. The computer-readable medium of clause 34, wherein the structural confidence channel is a global model quality estimation (GMQE).
36. The computer-readable medium of clause 34, wherein the structural confidence channel is a qualitative model energy analysis (QMEAN) score.
37. The computer-readable medium of clause 34, wherein the structural confidence channel is a temperature factor that specifies the degree to which the residues satisfy the physical constraints of the respective protein structures.
38. The computer-readable medium of clause 34, wherein the structural confidence channel is a template structure alignment that specifies the degree to which the residues of the atoms nearest to the voxels have aligned template structures.
39. The computer-readable medium of clause 38, wherein the structural confidence channel is a template modeling score of the aligned template structures.
40. The computer-readable medium of clause 39, wherein the structural confidence channels are a minimum of the template modeling scores, a mean of the template modeling scores, and a maximum of the template modeling scores.
41. The computer-readable medium of clause 1, wherein the operations further comprise rotating the atoms before the per-amino acid distance channels are generated.
42. The computer-readable medium of clause 1, wherein the operations further comprise using, in the convolutional neural network, 1x1x1 convolutions, 3x3x3 convolutions, rectified linear unit activation layers, batch normalization layers, fully connected layers, dropout regularization layers, and softmax classification layers.
43. The computer-readable medium of clause 42, wherein the 1x1x1 convolution and the 3x3x3 convolution are three-dimensional convolutions.
44. The computer-readable medium of clause 42, wherein a 1x1x1 convolution layer processes the tensor and produces an intermediate output that is a convolved representation of the tensor, a sequence of 3x3x3 convolution layers processes the intermediate output and produces a flattened output, a fully connected layer processes the flattened output and produces an unnormalized output, and a softmax classification layer processes the unnormalized output and produces an exponentially normalized output that identifies likelihoods that the variant nucleotide is pathogenic and benign.
45. The computer-readable medium of clause 44, wherein a sigmoid layer processes the unnormalized output and produces a normalized output that identifies a likelihood that the variant nucleotide is pathogenic.
46. The computer-readable medium of clause 1, wherein the convolutional neural network is an attention-based neural network.
47. The computer-readable medium of clause 1, wherein the tensor includes the per-amino acid distance channels further encoded with a reference allele channel.
48. The computer-readable medium of clause 1, wherein the tensor includes a per-amino acid distance channel further encoded with an annotation channel.
49. The computer-readable medium of clause 1, wherein the tensor includes the per-amino acid distance channels further encoded with a structural confidence channel.
50. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure that span a plurality of atom categories, to generate per-atom category distance channels, wherein
the atoms span the plurality of atom categories,
an atom category of the plurality of atom categories specifies an atomic element of the amino acids,
each of the per-atom category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-atom category distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the per-atom category distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.
51. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure to generate per-amino acid distance channels, wherein
each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding, on an amino acid position basis, an alternative allele channel into each three-dimensional distance value in each of the per-amino acid distance channels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-amino acid distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel; and
generating a tensor that includes the per-amino acid distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels.
52. One or more computer-readable media storing computer-executable instructions that, when executed on one or more processors, configure a computer to perform operations comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein and fitting, on an amino acid basis, a three-dimensional grid of voxels to atoms in the three-dimensional structure that span a plurality of atom categories, to generate per-atom category distance channels, wherein
the atoms span the plurality of atom categories,
an atom category of the plurality of atom categories specifies an atomic element of the amino acids,
each of the per-atom category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding atom category in the plurality of atom categories;
encoding an alternative allele channel into each voxel in the three-dimensional grid of voxels, wherein
the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding, on a voxel position basis, an evolutionary conservation channel into each array of three-dimensional distance values across the per-atom category distance channels, wherein
the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and
the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel; and
generating a tensor that includes the per-atom category distance channels encoded with the alternative allele channel and the respective evolutionary conservation channels.
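The two encoding steps recited above, the alternative allele channel and the evolutionary conservation channel selected by proximity, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the pan-amino acid variant of clause 21 is used, the function names, residue indices, coordinates, and conservation profiles are hypothetical, and the specification's actual data layout may differ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(letter):
    """20-dimensional one-hot vector for an amino acid letter."""
    return [1.0 if aa == letter else 0.0 for aa in AMINO_ACIDS]

def alt_allele_channel(variant_aa, n_voxels):
    """Encode the same one-hot vector of the variant amino acid into every voxel."""
    vec = one_hot(variant_aa)
    return [list(vec) for _ in range(n_voxels)]

def conservation_channel(centers, residues, profiles):
    """Per voxel, take the conservation profile of the residue owning the nearest atom.

    residues: list of (residue_index, [atom (x, y, z), ...])
    profiles: {residue_index: 20 conservation frequencies}
    """
    channel = []
    for c in centers:
        nearest_residue = min(
            residues,
            key=lambda r: min(math.dist(c, xyz) for xyz in r[1]),
        )[0]
        channel.append(list(profiles[nearest_residue]))
    return channel

# Hypothetical inputs: two voxel centers, two residues, two profiles.
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
residues = [
    (10, [(0.5, 0.0, 0.0)]),
    (11, [(4.0, 0.0, 0.0), (6.0, 1.0, 0.0)]),
]
profiles = {10: one_hot("A"), 11: [0.05] * 20}

alt = alt_allele_channel("W", len(centers))
cons = conservation_channel(centers, residues, profiles)
```

The alternative allele channel is identical at every voxel, whereas the conservation channel varies with voxel position because each voxel inherits the frequencies of whichever residue lies closest to it.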

Other implementations of the methods described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the methods described above. Yet another implementation of the methods described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.

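Particular Implementations 3
Clause Set 1
1. A computer-implemented method of efficiently determining which elements of a sequence, having element coordinates, are nearest to cells that are uniformly spaced in a grid and have per-dimension cell indices and cell coordinates, the method comprising:
generating an element-to-cells mapping that maps each of the elements to a subset of the cells, wherein
the subset of the cells mapped to a particular element in the sequence includes a nearest cell in the grid and one or more neighborhood cells in the grid,
the nearest cell is selected based on matching element coordinates of the particular element to the cell coordinates, and
the neighborhood cells are contiguously adjacent to the nearest cell and are selected based on being within a distance proximity range from the particular element;
generating a cell-to-elements mapping that maps each of the cells to a subset of the elements, wherein
the subset of the elements mapped to a particular cell in the grid includes those elements in the sequence that are mapped to the particular cell by the element-to-cells mapping; and
using the cell-to-elements mapping to determine, for each of the cells, a nearest element in the sequence, wherein
the nearest element to the particular cell is determined based on distances between the particular cell and the elements in the subset of the elements.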
２．特定のエレメントのエレメント座標をセル座標に一致させることが、エレメント座標の小数部分を切り捨てて切り捨てられたエレメント座標を生成することを更に含む、条項１に記載のコンピュータ実装方法。
３．特定のエレメントのエレメント座標をセル座標に一致させることが、
第１の次元について、切り捨てられたエレメント座標内の第１の切り捨てられたエレメント座標をグリッド内の第１のセルの第１のセル座標に一致させ、第１のセルの第１の次元インデックスを選択することと、
第２の次元について、切り捨てられたエレメント座標内の第２の切り捨てられたエレメント座標をグリッド内の第２のセルの第２のセル座標に一致させ、第２のセルの第２の次元インデックスを選択することと、
第３の次元について、切り捨てられたエレメント座標内の第３の切り捨てられたエレメント座標をグリッド内の第３のセルの第３のセル座標に一致させ、第３のセルの第３の次元インデックスを選択することと、
選択された第１、第２、及び第３の次元インデックスを使用して、基数の累乗によって選択された第１、第２、及び第３の次元インデックスを位置ごとに重み付けすることに基づいて累積和を生成することと、
累積和を、最も近いセルの選択のためのセルインデックスとして使用することと、を更に含む、条項２に記載のコンピュータ実装方法。
４．距離が、特定のセルのセル座標とエレメントのサブセット内のエレメントのエレメント座標との間で計算される、条項１に記載のコンピュータ実装方法。
５．配列が、アミノ酸のタンパク質配列である、条項１に記載のコンピュータ実装方法。
６．エレメントがアミノ酸の原子である、条項５に記載のコンピュータ実装方法。
７．エレメントからセルへのマッピングを生成するステップと、セルからエレメントへのマッピングを生成するステップと、セルからエレメントへのマッピングを使用して、セルの各々について、最も近いエレメントを決定するステップが、Ｏ（ａ^＊ｆ＋ｖ）のランタイム複雑度を有し、
ａが原子の数であり、
ｆがアミノ酸の数であり、
ｖがセルの個数であり、
^＊が乗算演算である、条項６に記載のコンピュータ実装方法。
８．原子がアルファ炭素原子を含む、条項７に記載のコンピュータ実装方法。
９．原子がベータ炭素原子を含む、条項７に記載のコンピュータ実装方法。
１０．原子が非炭素原子を含む、条項７に記載のコンピュータ実装方法。
１１．セルが３次元ボクセルである、条項１に記載のコンピュータ実装方法。
１２．セル座標が３次元座標である、条項１１に記載のコンピュータ実装方法。
１３．エレメント座標が３次元座標である、条項１２に記載のコンピュータ実装方法。
１４．近傍セルが、最も近いセルからのインデックス隣接範囲内にあることに基づいて選択される、条項１に記載のコンピュータ実装方法。
１５．近傍セルが、最も近いセルを含むグリッド内のセル近傍内にあることに基づいて選択される、条項１に記載のコンピュータ実装方法。
１６．配列がＭ個のエレメントを含み、エレメントのサブセットがＮ個のエレメントを含み、Ｍ＞＞Ｎである、条項１に記載のコンピュータ実装方法。
１７．グリッド内の３Ｄボクセル座標を有するボクセルに最も近い、タンパク質中の３次元（３Ｄ）原子座標を有する原子を効率的に決定するコンピュータ実装方法であって、
タンパク質の特定の原子の３Ｄ原子座標をグリッド内の３Ｄボクセル座標に一致させることに基づいて選択された包含ボクセルを原子の各々にマッピングする、原子からボクセルへのマッピングを生成することと、
ボクセルの各々に原子のサブセットをマッピングする、ボクセルから原子へのマッピングを生成することであって、グリッド内の特定のボクセルにマッピングされる原子のサブセットが、原子からボクセルへのマッピングによって特定のボクセルにマッピングされるタンパク質中の原子を含む、ことと、ボクセルから原子へのマッピングを使用して、ボクセルの各々について、タンパク質中の最も近い原子を決定することと、を含む、コンピュータ実装方法。
１８．条項１７のステップが、０（原子の数）のランタイム複雑度を有する、条項１７に記載のコンピュータ実装方法。 Particular embodiment 3
Clause Set 1
1. A computer-implemented method of efficiently determining which elements of a sequence are nearest to uniformly spaced cells in a grid, wherein the elements have element coordinates and the cells have per-dimension cell indices and cell coordinates, the method comprising:
generating an element-to-cell mapping that maps each of the elements to a subset of the cells,
wherein the subset of the cells mapped to a particular element in the sequence includes a nearest cell in the grid and one or more neighboring cells in the grid,
wherein the nearest cell is selected based on matching the element coordinates of the particular element to the cell coordinates, and
wherein the neighboring cells are contiguously adjacent to the nearest cell and are selected based on being within a distance proximity range from the particular element;
generating a cell-to-element mapping that maps each of the cells to a subset of the elements,
wherein the subset of the elements mapped to a particular cell in the grid includes the elements in the sequence that are mapped to the particular cell by the element-to-cell mapping; and
using the cell-to-element mapping to determine, for each of the cells, a nearest element in the sequence,
wherein the nearest element to the particular cell is determined based on distances between the particular cell and the elements in the subset of the elements.
2. The computer-implemented method of clause 1, wherein matching element coordinates of a particular element to cell coordinates further comprises truncating a fractional portion of the element coordinates to generate truncated element coordinates.
3. The computer-implemented method of clause 2, wherein matching the element coordinates of the particular element to the cell coordinates further comprises:
for a first dimension, matching a first truncated element coordinate in the truncated element coordinates to a first cell coordinate of a first cell in the grid and selecting a first dimension index of the first cell;
for a second dimension, matching a second truncated element coordinate in the truncated element coordinates to a second cell coordinate of a second cell in the grid and selecting a second dimension index of the second cell;
for a third dimension, matching a third truncated element coordinate in the truncated element coordinates to a third cell coordinate of a third cell in the grid and selecting a third dimension index of the third cell;
using the selected first, second, and third dimension indices to generate a cumulative sum based on positionally weighting the selected first, second, and third dimension indices by powers of a radix; and
using the cumulative sum as the cell index for selecting the nearest cell.
4. The computer-implemented method of clause 1, wherein the distances are calculated between the cell coordinates of the particular cell and the element coordinates of the elements in the subset of the elements.
5. The computer-implemented method of clause 1, wherein the sequence is a protein sequence of amino acids.
6. The computer-implemented method of clause 5, wherein the elements are atoms of the amino acids.
7. The computer-implemented method of clause 6, wherein the steps of generating the element-to-cell mapping, generating the cell-to-element mapping, and using the cell-to-element mapping to determine, for each of the cells, the nearest element have a runtime complexity of O(a*f+v), where
a is the number of atoms,
f is the number of amino acids,
v is the number of cells, and
* is the multiplication operation.
8. The computer-implemented method of clause 7, wherein the atoms include alpha carbon atoms.
9. The computer-implemented method of clause 7, wherein the atoms include beta carbon atoms.
10. The computer-implemented method of clause 7, wherein the atoms include non-carbon atoms.
11. The computer-implemented method of clause 1, wherein the cells are three-dimensional voxels.
12. The computer-implemented method of clause 11, wherein the cell coordinates are three-dimensional coordinates.
13. The computer-implemented method of clause 12, wherein the element coordinates are three-dimensional coordinates.
14. The computer-implemented method of clause 1, wherein the neighboring cells are selected based on being within an index adjacency range from the nearest cell.
15. The computer-implemented method of clause 1, wherein the neighboring cells are selected based on being within a cell neighborhood in the grid that includes the nearest cell.
16. The computer-implemented method of clause 1, wherein the sequence includes M elements and the subset of the elements includes N elements, where M>>N.
17. A computer-implemented method of efficiently determining which atoms in a protein, having three-dimensional (3D) atom coordinates, are nearest to voxels in a grid having 3D voxel coordinates, the method comprising:
generating an atom-to-voxel mapping that maps each of the atoms to a containing voxel selected based on matching the 3D atom coordinates of a particular atom of the protein to the 3D voxel coordinates in the grid;
generating a voxel-to-atom mapping that maps each of the voxels to a subset of the atoms, wherein the subset of the atoms mapped to a particular voxel in the grid includes the atoms in the protein that are mapped to the particular voxel by the atom-to-voxel mapping; and
using the voxel-to-atom mapping to determine, for each of the voxels, a nearest atom in the protein.
18. The computer-implemented method of clause 17, wherein the steps of clause 17 have a runtime complexity of O(number of atoms).
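
The two mappings described in the clauses above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation: the function names, the `reach` parameter (an index-based neighborhood standing in for the distance proximity range), and the use of cell centers for the distance calculation are all assumptions made for the example. The linear index uses the grid side length as the radix, which matches a cumulative sum of indices weighted by powers of the radix when the grid is cubic.

```python
import math
from collections import defaultdict
from itertools import product

def containing_cell(coord, origin, size):
    """Drop the fractional part of the grid-relative coordinate (floor) per dimension."""
    return tuple(int(math.floor((c - o) / size)) for c, o in zip(coord, origin))

def linear_index(idx, shape):
    """Weight per-dimension indices positionally, like digits of a number."""
    total = 0
    for i, n in zip(idx, shape):
        total = total * n + i
    return total

def nearest_elements(elements, origin, size, shape, reach=1):
    """Return {cell index: (distance, element id)} for cells with a candidate."""
    # Element-to-cell mapping: the containing cell plus neighbors within
    # `reach` cells (hypothetical parameter). It is inverted on the fly
    # into the cell-to-element mapping.
    cell_to_elements = defaultdict(list)
    for eid, coord in enumerate(elements):
        base = containing_cell(coord, origin, size)
        for offset in product(range(-reach, reach + 1), repeat=3):
            idx = tuple(b + d for b, d in zip(base, offset))
            if all(0 <= i < n for i, n in zip(idx, shape)):
                cell_to_elements[linear_index(idx, shape)].append(eid)
    # For each cell, only its own short candidate list is searched.
    result = {}
    for idx in product(*(range(n) for n in shape)):
        key = linear_index(idx, shape)
        center = tuple(o + (i + 0.5) * size for o, i in zip(origin, idx))
        candidates = cell_to_elements.get(key, [])
        if candidates:
            result[key] = min((math.dist(center, elements[e]), e) for e in candidates)
    return result
```

Because each element touches a fixed number of cells and each cell inspects only its own candidates, the work in this sketch grows with the number of elements plus the number of cells rather than with their product.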

Although the invention has been disclosed with reference to the preferred embodiments and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, and that such modifications and combinations will be within the spirit of the invention and the scope of the appended claims.

104 Sequence accessor
114 3D structure generator
124 Coordinate classifier
134 Voxel grid generator
144 Voxel grid centerer
154 Distance channel generator
164 One-hot encoder
174 Concatenator
184 Runtime logic
1204 Similar sequence finder
1214 Aligner
1224 Pan-amino acid conservation frequency calculator
1234 Nearest atom finder
1244 Amino acid selector
1254 Voxelizer
1924 Per-amino acid conservation frequency calculator
1934 Nearest atom finder
1944 Amino acid selector
1954 Voxelizer
2108 Pathogenicity classifier
3600 Computer system
3610 Storage subsystem
3622 Memory subsystem
3632 RAM
3634 ROM
3636 File storage subsystem
3638 User interface input devices
3655 Bus subsystem
3672 CPU
3674 Network interface subsystem
3676 User interface output devices
3678 Processors (GPU, FPGA, CGRA)

Claims

1. A system, comprising:
a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-amino acid distance channels,
wherein each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to an atom of a corresponding reference amino acid in the reference amino acid sequence;
an alternative allele encoder that encodes an alternative allele amino acid into each voxel in the three-dimensional grid of voxels,
wherein the alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
an evolutionary conservation encoder that encodes an evolutionary conservation sequence into each voxel in the three-dimensional grid of voxels,
wherein the evolutionary conservation sequence is a three-dimensional representation of amino acid-specific conservation frequencies across multiple species, and
wherein the amino acid-specific conservation frequencies are selected according to amino acid proximity to the corresponding voxel; and
a convolutional neural network configured to
apply three-dimensional convolutions to a tensor that includes the per-amino acid distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences, and
determine a pathogenicity of the variant nucleotide based at least in part on the tensor.
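
As an illustration of the one-hot alternative allele encoding recited above, the following Python sketch repeats a single one-hot vector at every voxel of a small grid. The residue ordering in `AMINO_ACIDS`, the function names, and the grid size used in the example are assumptions made for illustration and are not taken from the specification.

```python
# Assumed ordering: the 20 standard residues by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Return a 20-element vector with a single 1.0 at the residue's position."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

def alt_allele_channels(residue, grid_size):
    """Repeat the same one-hot vector at every voxel: result[x][y][z] has 20 values."""
    vec = one_hot(residue)
    return [[[list(vec) for _ in range(grid_size)]
             for _ in range(grid_size)]
            for _ in range(grid_size)]
```

For a hypothetical 3×3×3 grid and a variant amino acid `"C"`, `alt_allele_channels("C", 3)` gives every voxel the same 20 values, which is what lets the variant be presented to a 3D convolution alongside the spatial channels.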

2. The system of claim 1, wherein the voxelizer centers the three-dimensional grid of voxels on an alpha carbon atom of a respective residue of the reference amino acids in the reference amino acid sequence.

3. The system of claim 1 or 2, wherein the voxelizer centers the three-dimensional grid of voxels on the alpha carbon atom of a residue of a particular reference amino acid located at the position of the variant amino acid.

4. The system of any one of claims 1 to 3, further configured to encode, in the tensor, a directionality of the reference amino acids and a position of the particular reference amino acid within the reference amino acid sequence by multiplying, by a directionality parameter, the three-dimensional distance values for those reference amino acids that precede the particular reference amino acid.
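
The directionality encoding described above can be pictured with a minimal Python sketch. The default parameter value of -1.0 and the flat list layout are assumptions made for illustration; the claim only states that distance values for preceding reference amino acids are multiplied by a directionality parameter.

```python
def apply_directionality(distances, residue_positions, center_position, parameter=-1.0):
    """Scale distances that come from residues preceding the central residue.

    distances[i] is the distance value contributed by the residue at
    sequence position residue_positions[i]; `parameter` is a hypothetical
    default chosen only for this example.
    """
    return [d * parameter if pos < center_position else d
            for d, pos in zip(distances, residue_positions)]
```

With a negative parameter, the sign of a value tells a downstream layer whether the contributing residue lies before or after the central residue in the sequence.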

5. The system of any one of claims 1 to 4, wherein the distance is the nearest atomic distance from the corresponding voxel center in the three-dimensional grid of voxels to the nearest atom of the corresponding reference amino acid.

6. The system of claim 5, wherein the closest atomic distance is a Euclidean distance.

7. The system of claim 5 or 6, wherein the nearest atomic distance is normalized by dividing the Euclidean distance by the largest nearest atomic distance.
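
A normalized nearest-atom distance of the kind recited above could be computed as in the following sketch. The clipping at `max_distance`, the value 1.0 returned when an amino acid has no atoms, and the dictionary layout are assumptions for illustration rather than details taken from the specification.

```python
import math

def nearest_atom_distance(voxel_center, atoms, max_distance):
    """Euclidean distance to the nearest atom, clipped and scaled into [0, 1]."""
    if not atoms:
        return 1.0  # assumed convention: no atom means maximal distance
    nearest = min(math.dist(voxel_center, atom) for atom in atoms)
    return min(nearest, max_distance) / max_distance

def distance_channels(voxel_centers, atoms_by_amino_acid, max_distance):
    """One list of normalized distances per amino acid, one entry per voxel."""
    return {aa: [nearest_atom_distance(v, atoms, max_distance) for v in voxel_centers]
            for aa, atoms in atoms_by_amino_acid.items()}
```

Scaling by a fixed maximum keeps every channel in the same range, so channels for different amino acids are directly comparable.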

8. The system of any one of claims 1 to 7, wherein the reference amino acid has alpha carbon atoms, and the distance is a nearest alpha carbon atom distance from the corresponding voxel center to the nearest alpha carbon atom of the corresponding reference amino acid.

9. The system of any one of claims 1 to 7, wherein the reference amino acid has beta carbon atoms, and the distance is a nearest beta carbon atom distance from the corresponding voxel center to the nearest beta carbon atom of the corresponding reference amino acid.

10. The system of any one of claims 1 to 7, wherein the reference amino acid has backbone atoms, and the distance is a nearest backbone atom distance from the corresponding voxel center to the nearest backbone atom of the corresponding reference amino acid.

11. The system of any one of claims 1 to 7, wherein the reference amino acid has side chain atoms, and the distance is a nearest side chain atom distance from the corresponding voxel center to the nearest side chain atom of the corresponding reference amino acid.

12. The system of any one of claims 1 to 11, further configured to encode, in the tensor, a nearest atom channel that specifies a distance from each voxel to a nearest atom, wherein the nearest atom is selected regardless of the amino acids and the atomic elements of the amino acids.

13. A system according to any preceding claim, wherein the distance is a Euclidean distance.

14. A system according to any preceding claim, wherein the distance is normalized by dividing the Euclidean distance by a maximum distance.

15. The system according to any one of claims 1 to 14, wherein the amino acids include non-standard amino acids.

16. The system of any one of claims 1 to 15, wherein the tensor further includes an absent atom channel that specifies atoms not found within a predetermined radius of a voxel center.

17. The system of claim 16, wherein the absent atom channel is one-hot encoded.
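
One way to picture an absent atom channel is as a binary flag per atom category for each voxel. The sketch below assumes a dictionary of atom coordinates keyed by category and a hypothetical `radius` argument; neither the layout nor the names come from the specification.

```python
import math

def absent_atom_flags(voxel_center, atoms_by_category, radius):
    """Return 1 for each category with no atom within `radius` of the voxel center, else 0."""
    return {category: int(all(math.dist(voxel_center, atom) > radius for atom in atoms))
            for category, atoms in atoms_by_category.items()}
```

A category with no atoms at all is flagged as absent, because `all` over an empty sequence is true.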

18. The system of any one of claims 1 to 17, further comprising a reference allele encoder for encoding reference allele amino acids on a voxel-by-voxel basis in each voxel in the three-dimensional grid of voxels.

19. The system of claim 18, wherein the reference allele amino acid is a one-hot encoded three-dimensional representation of the reference amino acid that experiences the variant amino acid.

20. The system of any one of claims 1-19, wherein the amino acid-specific conservation frequency specifies the level of conservation of each amino acid across the plurality of species.

21. The system of any one of claims 1 to 20, wherein the evolutionary conservation encoder:
selects a nearest atom to the corresponding voxel across the reference amino acids and the atom categories;
selects pan-amino acid conservation frequencies for a residue of the reference amino acid that contains the nearest atom; and
uses a three-dimensional representation of the pan-amino acid conservation frequencies as the evolutionary conservation sequence.
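
The selection of conservation frequencies by nearest atom can be sketched as follows. The data layout (a list of `(residue_index, coordinates)` pairs and a dictionary of frequency vectors keyed by residue index) is an assumption made for illustration only.

```python
import math

def conservation_for_voxel(voxel_center, atoms, frequencies):
    """Pick the frequency vector of the residue that owns the nearest atom.

    atoms: list of (residue_index, (x, y, z)) pairs, regardless of atom category.
    frequencies: dict mapping residue_index to its conservation frequency vector.
    """
    residue_index, _ = min(atoms, key=lambda atom: math.dist(voxel_center, atom[1]))
    return frequencies[residue_index]
```

In a full pipeline each frequency vector would hold one value per amino acid; the example below uses two values only to keep it short.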

22. The system of claim 21, wherein the pan-amino acid conservation frequency is constructed for a particular position of the residue as observed in the plurality of species.

23. The system of claim 21 or 22, wherein the pan-amino acid conservation frequency identifies whether a missing conservation frequency exists for a particular reference amino acid.

24. The system of any one of claims 1 to 23, wherein the evolutionary conservation encoder:
selects respective nearest atoms to the corresponding voxel in respective ones of the reference amino acids;
selects respective per-amino acid conservation frequencies for respective residues of the reference amino acids that contain the nearest atoms; and
uses a three-dimensional representation of the per-amino acid conservation frequencies as the evolutionary conservation sequence.

25. The system of claim 24, wherein the per-amino acid conservation frequencies are constructed for a particular position of the residues as observed in the plurality of species.

26. The system of claim 24 or 25, wherein the per-amino acid conservation frequencies identify whether a missing conservation frequency exists for a particular reference amino acid.

27. The system of any one of claims 1 to 26, further comprising an annotation encoder that encodes one or more annotation channels into each voxel in the three-dimensional grid of voxels on a voxel-by-voxel basis, wherein the annotation channels are three-dimensional representations of one-hot encodings of residue annotations.

28. The system of claim 27, wherein the annotation channel is a molecular processing annotation that includes initiator methionine, signal, transit peptide, propeptide, chain, and peptide.

29. The system of claim 27 or 28, wherein the annotation channel is a region annotation including topological domain, transmembrane, intramembrane, domain, repeat, calcium binding, zinc finger, deoxyribonucleic acid (DNA) binding, nucleotide binding, region, coiled coil, motif, and compositional bias.

30. The system of any one of claims 27 to 29, wherein the annotation channel is a site annotation comprising an active site, a metal binding, a binding site, and a site.

31. The system of any one of claims 27-30, wherein the annotation channel is an amino acid modification annotation including non-standard residues, modified residues, lipidations, glycosylation, disulfide bonds, and crosslinks.

32. The system of any one of claims 27 to 31, wherein the annotation channel is a secondary structure annotation including helices, turns, and beta strands.

33. The system of any one of claims 27 to 32, wherein the annotation channel is an experimental information annotation including mutagenesis, sequence uncertainty, sequence conflicts, non-adjacent residues, and non-terminal residues.

34. The system of claim 1, further comprising a structure confidence encoder that encodes one or more structure confidence channels into each voxel in the three-dimensional grid of voxels on a voxel-by-voxel basis, wherein the structure confidence channels are three-dimensional representations of confidence scores that specify the quality of respective residue structures.

35. The system of claim 34, wherein the structural confidence channel is global model quality estimation (GMQE).

36. The system of claim 34 or 35, wherein the structural confidence channel is a Qualitative Model Energy Analysis (QMEAN) score.

37. The system of any one of claims 34-36, wherein the structure confidence channel is a temperature factor that specifies the degree to which the residues meet the physical constraints of their respective protein structure.

38. The system of any one of claims 34-37, wherein the structure confidence channel is a template structure alignment that specifies the degree to which the closest atomic residues to the voxel have an aligned template structure.

39. The system of any one of claims 34-38, wherein the structure confidence channel is a template modeling score of the aligned template structure.

40. The system of claim 39, wherein the structure confidence channels are a minimum of the template modeling scores, a mean of the template modeling scores, and a maximum of the template modeling scores.

41. The system of any one of claims 1-40, further comprising an atomic rotation engine that rotates the atoms before the per-amino acid distance channels are generated.
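
Rotating the atoms before the distance channels are generated can be sketched with a plain rotation about one axis. The choice of the z axis, the `center` argument, and the function name are assumptions for illustration; the claim does not specify how the rotation is parameterized.

```python
import math

def rotate_z(atoms, angle, center=(0.0, 0.0, 0.0)):
    """Rotate (x, y, z) atom coordinates by `angle` radians about a z-parallel axis through `center`."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    cx, cy, _ = center
    rotated = []
    for x, y, z in atoms:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_a * dx - sin_a * dy,
                        cy + sin_a * dx + cos_a * dy,
                        z))
    return rotated
```

A rigid rotation leaves all inter-atom distances unchanged, so only the orientation of the structure relative to the voxel grid differs after the call.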

42. The system of any one of claims 1 to 41, wherein the convolutional neural network uses 1×1×1 convolutions, 3×3×3 convolutions, rectified linear unit activation layers, batch normalization layers, a fully connected layer, a dropout regularization layer, and a softmax classification layer.

43. The system of claim 42, wherein the 1×1×1 convolutions and the 3×3×3 convolutions are the three-dimensional convolutions.

44. The system of claim 42 or 43, wherein a layer of the 1×1×1 convolutions processes the tensor and produces an intermediate output that is a convolved representation of the tensor, a sequence of layers of the 3×3×3 convolutions produces a flattened output, the fully connected layer processes the flattened output and produces unnormalized outputs, and the softmax classification layer processes the unnormalized outputs and produces exponentially normalized outputs that identify likelihoods of the variant nucleotide being pathogenic and benign.

45. The system of claim 44, wherein a sigmoid layer processes the unnormalized outputs and produces a normalized output that identifies a likelihood of the variant nucleotide being pathogenic.
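
The two output normalizations mentioned above, softmax over two classes and a sigmoid over a single score, can be written out as a short sketch. The function names and the example scores are illustrative only.

```python
import math

def softmax(scores):
    """Exponentially normalize raw scores so that they sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map a single raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))
```

For two classes, the softmax probability of the first class equals the sigmoid of the difference between the two raw scores, which is why either layer can express a likelihood of pathogenicity.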

46. The system of any one of claims 1 to 45, wherein the convolutional neural network is an attention-based neural network.

47. The system of any one of claims 1 to 46, wherein the tensor includes the per-amino acid distance channels further encoded with the reference allele amino acid.

48. The system of any one of claims 27 to 47, wherein the tensor includes the per-amino acid distance channels further encoded with the annotation channels.

49. The system of any one of claims 34 to 48, wherein the tensor includes the per-amino acid distance channels further encoded with the structure confidence channels.
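
Encoding one group of channels with another can be read as a voxel-wise concatenation along the channel dimension. The sketch below assumes each group is a list with one list of channel values per voxel; this layout and the function name are hypothetical.

```python
def concatenate_channels(*channel_groups):
    """Join channel groups voxel by voxel along the channel dimension.

    Each group is a list with one entry per voxel, and each entry is
    the list of channel values that the group contributes at that voxel.
    """
    n_voxels = len(channel_groups[0])
    if any(len(group) != n_voxels for group in channel_groups):
        raise ValueError("all channel groups must cover the same voxels")
    return [sum((group[v] for group in channel_groups), [])
            for v in range(n_voxels)]
```

The resulting per-voxel vectors keep the spatial layout intact, so a three-dimensional convolution can see distance, allele, and conservation values for the same location together.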

50. A system, comprising:
a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-atom-category distance channels,
wherein the atoms span a plurality of atom categories,
wherein an atom category in the plurality of atom categories specifies an atomic element of the amino acids,
wherein each of the per-atom-category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to an atom of a corresponding atom category in the plurality of atom categories;
an alternative allele encoder that encodes an alternative allele amino acid into each voxel in the three-dimensional grid of voxels,
wherein the alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
an evolutionary conservation encoder that encodes an evolutionary conservation sequence into each voxel in the three-dimensional grid of voxels,
wherein the evolutionary conservation sequence is a three-dimensional representation of amino acid-specific conservation frequencies across multiple species, and
wherein the amino acid-specific conservation frequencies are selected according to amino acid proximity to the corresponding voxel; and
a convolutional neural network configured to
apply three-dimensional convolutions to a tensor that includes the per-atom-category distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences, and
determine a pathogenicity of the variant nucleotide based at least in part on the tensor.

51. A system, comprising:
a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-amino acid distance channels,
wherein each of the per-amino acid distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to an atom of a corresponding reference amino acid in the reference amino acid sequence;
an alternative allele encoder that encodes an alternative allele amino acid into each voxel in the three-dimensional grid of voxels,
wherein the alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
an evolutionary conservation encoder that encodes an evolutionary conservation sequence into each voxel in the three-dimensional grid of voxels,
wherein the evolutionary conservation sequence is a three-dimensional representation of amino acid-specific conservation frequencies across multiple species, and
wherein the amino acid-specific conservation frequencies are selected according to amino acid proximity to the corresponding voxel; and
a tensor generator configured to generate a tensor that includes the per-amino acid distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences.

52. A system, comprising:
a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein and fits a three-dimensional grid of voxels to atoms in the three-dimensional structure on an amino acid basis to generate per-atom-category distance channels,
wherein the atoms span a plurality of atom categories,
wherein an atom category in the plurality of atom categories specifies an atomic element of the amino acids,
wherein each of the per-atom-category distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and
wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to an atom of a corresponding atom category in the plurality of atom categories;
an alternative allele encoder that encodes an alternative allele amino acid into each voxel in the three-dimensional grid of voxels,
wherein the alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
an evolutionary conservation encoder that encodes an evolutionary conservation sequence into each voxel in the three-dimensional grid of voxels,
wherein the evolutionary conservation sequence is a three-dimensional representation of amino acid-specific conservation frequencies across multiple species, and
wherein the amino acid-specific conservation frequencies are selected according to amino acid proximity to the corresponding voxel; and
a tensor generator configured to generate a tensor that includes the per-atom-category distance channels encoded with the alternative allele amino acid and the respective evolutionary conservation sequences.