KR20240082270A

KR20240082270A - Protein language model based on protein structure

Info

Publication number: KR20240082270A
Application number: KR1020237045482A
Authority: KR
Inventors: Tobias Hamp; Hong Gao; Kai-How Farh
Original assignee: Illumina, Inc.
Priority date: 2021-10-06
Filing date: 2022-10-05
Publication date: 2024-06-10

Abstract

The disclosed technology relates to determining the pathogenicity of nucleotide variants. In particular, the disclosed technology relates to designating a particular amino acid at a particular position in a protein as a gap amino acid, and designating the amino acids at the remaining positions in the protein as non-gap amino acids. The disclosed technology further relates to generating a gapped spatial representation of the protein that includes the spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid; and to determining the pathogenicity of a nucleotide variant based, at least in part, on the gapped spatial representation and on a representation of the alternate amino acid created by the nucleotide variant at the particular position.

Description

Protein language model based on protein structure

Priority Applications

This application claims priority to and the benefit of the following US applications. The priority applications are incorporated herein by reference for all purposes.

U.S. Nonprovisional Patent Application No. 17/533,091, entitled "Protein Structure-Based Protein Language Models," filed November 22, 2021 (Attorney Docket No. ILLM 1050-2/IP-2164-US), which claims priority to U.S. Provisional Patent Application No. 63/253,122, filed October 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV), U.S. Provisional Patent Application No. 63/281,579, filed November 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV), and U.S. Provisional Patent Application No. 63/281,592, filed November 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV);

U.S. Nonprovisional Patent Application No. 17/953,286, entitled "Predicting Variant Pathogenicity From Evolutionary Conservation Using Three-Dimensional (3D) Protein Structure Voxels," filed September 26, 2022 (Attorney Docket No. ILLM 1060-2/IP-2270-US), which claims priority to U.S. Provisional Patent Application No. 63/253,122, filed October 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV), U.S. Provisional Patent Application No. 63/281,579, filed November 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV), and U.S. Provisional Patent Application No. 63/281,592, filed November 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV); and

U.S. Nonprovisional Patent Application No. 17/953,293, entitled "Combined And Transfer Learning of a Variant Pathogenicity Predictor Using Gapped and Non-Gapped Protein Samples," filed September 26, 2022 (Attorney Docket No. ILLM 1061-2/IP-2271-US), which claims priority to U.S. Provisional Patent Application No. 63/253,122, filed October 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV), U.S. Provisional Patent Application No. 63/281,579, filed November 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV), and U.S. Provisional Patent Application No. 63/281,592, filed November 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).

Technical Field

The disclosed technology relates to artificial intelligence type computers and digital data processing systems, and to corresponding data processing methods and products for the emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the disclosed technology relates to using deep convolutional neural networks to analyze multi-channel voxelized data.

References

The following are incorporated by reference for all purposes as if fully set forth herein:

Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);

Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);

U.S. Patent Application No. 62/573,144, entitled "Training a Deep Pathogenicity Classifier Using Large-Scale Benign Training Data," filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);

U.S. Patent Application No. 62/573,149, entitled "Pathogenicity Classifier Based on Deep Convolutional Neural Networks (CNNs)," filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);

U.S. Patent Application No. 62/573,153, entitled "Deep Semi-Supervised Learning That Generates Large-Scale Pathogenic Training Data," filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);

U.S. Patent Application No. 62/582,898, entitled "Pathogenicity Classification of Genomic Data Using Deep Convolutional Neural Networks (CNNs)," filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);

U.S. Patent Application No. 16/160,903, entitled "Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks," filed October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);

U.S. Patent Application No. 16/160,986, entitled "Deep Convolutional Neural Networks for Variant Classification," filed October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);

U.S. Patent Application No. 16/160,968, entitled "Semi-Supervised Learning for Training an Ensemble of Deep Convolutional Neural Networks," filed October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);

U.S. Patent Application No. 16/407,149, entitled "Deep Learning-Based Techniques for Pre-Training Deep Convolutional Neural Networks," filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);

U.S. Patent Application No. 17/232,056, entitled "Deep Convolutional Neural Networks to Predict Variant Pathogenicity Using Three-Dimensional (3d) Protein Structures," filed April 15, 2021 (Attorney Docket No. ILLM 1037-2/IP-2051-US);

U.S. Patent Application No. 63/175,495, entitled "Multi-Channel Protein Voxelization to Predict Variant Pathogenicity Using Deep Convolutional Neural Networks," filed April 15, 2021 (Attorney Docket No. ILLM 1047-1/IP-2142-PRV);

U.S. Patent Application No. 63/175,767, entitled "Efficient Voxelization for Deep Learning," filed April 16, 2021 (Attorney Docket No. ILLM 1048-1/IP-2143-PRV);

U.S. Patent Application No. 17/468,411, entitled "Artificial Intelligence-Based Analysis of Protein Three-Dimensional (3d) Structures," filed September 7, 2021 (Attorney Docket No. ILLM 1037-3/IP-2051A-US).

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

In a broad sense, genomics, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling, and proteomics. Genomics arose as a data-driven science: it operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses. Applications of genomics include finding associations between genotype and phenotype, discovering biomarkers for patient stratification, predicting the function of genes, and charting biochemically active genomic regions such as transcriptional enhancers.

Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models, and to make predictions. Unlike some algorithms, in which assumptions and domain expertise are hard coded, machine learning algorithms are designed to automatically detect patterns in data. Hence, machine learning algorithms are suited to data-driven sciences and, in particular, to genomics. However, the performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For example, to classify a tumor as malignant or benign from a fluorescence microscopy image, a preprocessing algorithm could detect cells, identify the cell type, and generate a list of cell counts for each cell type.

A machine learning model can take the estimated cell counts, which are examples of handcrafted features, as input features for classifying the tumor. A central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells, or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy.

Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models. This outcome has been realized through the development of deep neural networks, machine learning models that comprise successive elementary operations, which compute increasingly complex features by taking the results of preceding operations as input. Deep neural networks are able to improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and the spatial organization of cells in the above example. The construction and training of deep neural networks have been enabled by the explosion of data, by algorithmic advances, and by substantial increases in computational capacity, particularly through the use of graphics processing units (GPUs).

The goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable. An example of a supervised learning problem is one that predicts whether an intron is spliced out or not (the target) given features of the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint, or the intron length. Training a machine learning model refers to learning its parameters, which typically involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data.

For many supervised learning problems in computational biology, the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions. Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as deoxyribonucleic acid (DNA) sequence into k-mer counts) using a process called feature extraction to fit a tabular representation. For the intron-splicing prediction problem, the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint, and the intron length can be preprocessed features collected in a tabular format. Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression, to more flexible nonlinear models, such as neural networks and many others.
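As a minimal sketch of the feature extraction step described above, the following Python fragment converts a DNA sequence into one row of a k-mer count table. The function name `kmer_counts` and the example sequence are assumptions chosen for illustration and are not taken from the disclosure.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Return a count for every possible DNA k-mer, in a fixed column order."""
    alphabet = "ACGT"
    # One column per possible k-mer, so every sequence maps to the same table layout.
    columns = ["".join(letters) for letters in product(alphabet, repeat=k)]
    # Slide a window of width k across the sequence and tally what is observed.
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: observed.get(kmer, 0) for kmer in columns}


# One tabular row for a short, made-up sequence.
row = kmer_counts("GATAGATA", 2)
```

Each call yields a row with the same columns, which is what allows sequences of different lengths to be stacked into a single table.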

Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features and mapping it to the [0,1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or of other linear classifiers that use different activation functions, are the weights of the weighted sum. Linear classifiers fail when the classes, for example introns that are spliced out versus introns that are not, cannot be well separated by a weighted sum of the input features. To improve predictive performance, new input features can be added manually by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.
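The prediction rule of logistic regression can be sketched in a few lines of Python. The feature values, weights, and bias below are hypothetical numbers chosen only to illustrate the computation; they are not learned from data and do not come from the disclosure.

```python
import math


def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_positive_probability(features, weights, bias):
    """Weighted sum of the input features, passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical intron features: canonical splice site present (1.0),
# scaled branchpoint location, scaled intron length.
features = [1.0, 0.25, 0.8]
weights = [2.0, -1.0, 0.5]  # illustrative weights, not trained
bias = -1.0

probability = predict_positive_probability(features, weights, bias)
```

The only trainable quantities are `weights` and `bias`, which is why the model cannot separate classes that are not distinguishable by a single weighted sum.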

Neural networks use hidden layers to learn these nonlinear feature transformations automatically. Each hidden layer can be thought of as multiple linear models whose outputs are transformed by a nonlinear activation function, such as the sigmoid function or the more popular rectified-linear unit (ReLU). Together, these layers compose the input features into relevant complex patterns, which facilitates the task of distinguishing two classes.
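To illustrate why a hidden layer helps, the sketch below hand-sets the weights of a tiny network with one ReLU hidden layer so that it computes the exclusive-or of two binary inputs, a target that no single weighted sum of the inputs can separate. The weights are chosen by hand for illustration; in practice they would be learned, and the function names are assumptions.

```python
def relu(z):
    """Rectified-linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def dense_layer(inputs, weight_rows, biases, activation):
    """Several linear models sharing the same inputs, each followed by an activation."""
    return [
        activation(bias + sum(w * x for w, x in zip(row, inputs)))
        for row, bias in zip(weight_rows, biases)
    ]


def xor_network(x1, x2):
    # Hidden layer: two linear models of the inputs, each transformed by ReLU.
    hidden = dense_layer([x1, x2], [[1, 1], [1, 1]], [0, -1], relu)
    # Output layer: a weighted sum of the hidden features (identity activation).
    return dense_layer(hidden, [[1, -2]], [0], lambda z: z)[0]


outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The hidden units play the role of the manually engineered features mentioned above, except that their weights can be fitted to data.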

Deep neural networks use many hidden layers, and a layer is said to be fully connected when each neuron receives inputs from all neurons of the preceding layer. Neural networks are commonly trained using stochastic gradient descent, an algorithm suited to training models on very large data sets. Implementations of neural networks using modern deep learning frameworks enable rapid prototyping with different architectures and data sets. Fully connected neural networks can be used for a number of genomics applications, which include predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation; prioritizing potential disease-causing genetic variants; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression, and evolutionary conservation.

Local dependencies in spatial and longitudinal data must be considered for effective predictions. For example, shuffling a DNA sequence or the pixels of an image severely disrupts informative patterns. These local dependencies set spatial or longitudinal data apart from tabular data, for which the ordering of the features is arbitrary. Consider the problem of classifying genomic regions as bound versus unbound by a particular transcription factor, in which bound regions are defined as high-confidence binding events in chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Transcription factors bind to DNA by recognizing sequence motifs. A fully connected layer based on sequence-derived features, such as the number of k-mer instances or the position weight matrix (PWM) matches in the sequence, can be used for this task. Because k-mer or PWM instance frequencies are robust to shifts of motifs within the sequence, such models can generalize well to sequences with the same motifs located at different positions. However, they would fail to recognize patterns in which transcription factor binding depends on a combination of multiple motifs with well-defined spacing. Furthermore, the number of possible k-mers increases exponentially with k-mer length, which poses both storage and overfitting challenges.
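The exponential growth noted above follows from the four-letter DNA alphabet: there are 4^k distinct k-mers of length k. A short sketch, with the function name `num_possible_kmers` assumed for illustration:

```python
def num_possible_kmers(k, alphabet_size=4):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


# Size of a k-mer count table (number of feature columns) for a few lengths.
table_widths = {k: num_possible_kmers(k) for k in (2, 6, 10)}
```

Even at k = 10 the feature table has more than a million columns, so most columns are empty for any single sequence, which is the storage and overfitting concern raised in the text.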

A convolutional layer is a special form of fully connected layer in which the same fully connected layer is applied locally, for example in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple PWMs, for example, for the transcription factors GATA1 and TAL1. By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training. Each convolutional layer scans the sequence with several filters by producing a scalar value at every position, which quantifies the match between the filter and the sequence. As in fully connected neural networks, a nonlinear activation function (commonly ReLU) is applied at each layer. Next, a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, commonly taking the maximal or average activation for each channel. Pooling reduces the effective sequence length and coarsens the signal. The subsequent convolutional layer composes the output of the previous layer and is able to detect, for example, whether a GATA1 motif and a TAL1 motif are present at some distance range. Finally, the output of the convolutional layers can be used as input to a fully connected neural network to perform the final prediction task. Hence, different types of neural network layers (e.g., fully connected layers and convolutional layers) can be combined within a single neural network.
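The scan, activation, and pooling steps described above can be sketched in plain Python for a single filter. The filter here is hand-built from the motif string "GATA" with a bias that lets only a perfect match produce a positive activation; this is an illustrative assumption, whereas a real convolutional layer would learn many filters from data.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element indicator vector."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv1d_scan(encoded, filt, bias=0.0):
    """Apply the same filter at every position and return one score per window."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = bias
        for offset in range(width):
            score += sum(w * x for w, x in zip(filt[offset], encoded[start + offset]))
        scores.append(score)
    return scores


def relu(z):
    return max(0.0, z)


def max_pool(values, size):
    """Keep the maximal activation in each contiguous bin of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


motif_filter = one_hot("GATA")  # hand-built filter, not learned
activations = [relu(s) for s in conv1d_scan(one_hot("TTGATATT"), motif_filter, bias=-3.0)]
pooled = max_pool(activations, 2)

# The same filter fires when the motif sits at a different position.
shifted = [relu(s) for s in conv1d_scan(one_hot("GATATTTT"), motif_filter, bias=-3.0)]
```

Because the filter weights are shared across positions, the motif is detected in both sequences even though it occupies different windows.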

Convolutional neural networks (CNNs) can predict various molecular phenotypes on the basis of DNA sequence alone. Applications include classifying transcription factor binding sites and predicting molecular phenotypes such as chromatin features, DNA contact maps, DNA methylation, gene expression, translation efficiency, RNA-binding protein (RBP) binding, and microRNA (miRNA) targets. In addition to predicting molecular phenotypes from the sequence, convolutional neural networks can be applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines. For example, convolutional neural networks can predict the specificity of guide RNA, denoise ChIP-seq, enhance Hi-C data resolution, predict the laboratory of origin from DNA sequences, and call genetic variants.

Convolutional neural networks have also been employed to model long-range dependencies in the genome. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modeling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter-enhancer looping. This is achieved by using dilated convolutions, which have a receptive field of up to 32 kb. Dilated convolutions also allow splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns (see Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019)).
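The way dilation enlarges the receptive field can be worked out with a short calculation. For a stack of stride-1 one-dimensional convolutions, each layer adds (kernel size - 1) x dilation positions to the receptive field. The layer configurations below are hypothetical and are not those of the cited models.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 1D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs, one per layer.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Four layers of width-3 filters without dilation...
plain = receptive_field([(3, 1)] * 4)
# ...versus the same four layers with dilation doubling at each layer.
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])
```

With the same number of layers and parameters, doubling the dilation at each layer makes the receptive field grow exponentially with depth rather than linearly, which is how kilobase-scale context becomes tractable.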

Different types of neural network can be characterized by their parameter-sharing schemes. For example, fully connected layers have no parameter sharing, whereas convolutional layers impose translational invariance by applying the same filters at every position of their input. Recurrent neural networks (RNNs) are an alternative to convolutional neural networks for processing sequential data, such as DNA sequences or time series, that implement a different parameter-sharing scheme. Recurrent neural networks apply the same operation to each sequence element. The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or used directly as model predictions. By applying the same model at each sequence element, recurrent neural networks are invariant to the position index in the processed sequence. For example, a recurrent neural network can detect an open reading frame in a DNA sequence regardless of its position in the sequence. This task requires the recognition of a certain series of inputs, such as the start codon followed by an in-frame stop codon.
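The recurrent pattern described above (one shared operation that reads the previous memory and the new input, updates the memory, and optionally emits an output) can be sketched with a hand-written update rule for the open reading frame example. This is not a trained recurrent neural network; the `step` function and its state layout are assumptions that only illustrate the control flow.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def step(state, base):
    """Shared operation applied to every sequence element.

    state = (last two bases, phase of the current position modulo 3,
             set of phases in which a start codon is currently open).
    Returns the updated state and whether an in-frame stop closed an open frame.
    """
    last_two, phase, open_frames = state
    codon = last_two + base
    emitted = False
    if len(codon) == 3:
        if codon == "ATG":
            open_frames = open_frames | {phase}
        elif codon in STOP_CODONS and phase in open_frames:
            open_frames = open_frames - {phase}
            emitted = True
    return (codon[-2:], (phase + 1) % 3, open_frames), emitted


def detect_orfs(sequence):
    """Run the same step over the whole sequence, collecting per-position outputs."""
    state = ("", 0, frozenset())
    outputs = []
    for base in sequence:
        state, emitted = step(state, base)
        outputs.append(emitted)
    return outputs


in_frame = detect_orfs("ATGAAATAA")        # start codon, then an in-frame stop
shifted = detect_orfs("CC" + "ATGAAATAA")  # same frame, moved two bases downstream
out_of_frame = detect_orfs("ATGAATAA")     # stop codon is not in frame with the start
```

Because the state carries only the phase modulo 3 rather than the absolute position, the same rule recognizes the reading frame wherever it begins in the sequence.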

The main advantage of recurrent neural networks over convolutional neural networks is that they are, in theory, able to carry information through infinitely long sequences via memory.

Furthermore, recurrent neural networks can naturally process sequences of widely varying length, such as mRNA sequences. However, convolutional neural networks combined with various techniques (such as dilated convolutions) can reach comparable or even better performance than recurrent neural networks on sequence-modeling tasks such as audio synthesis and machine translation. Recurrent neural networks can aggregate the outputs of convolutional neural networks for predicting single-cell DNA methylation states, RBP binding, transcription factor binding, and DNA accessibility. Moreover, because recurrent neural networks apply a sequential operation, they cannot be easily parallelized and are therefore much slower to compute than convolutional neural networks.

Each human has a unique genetic code, though a large portion of the human genetic code is common to all humans. In some cases, a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population. For example, a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise identical sequence.

Genetic variants may be pathogenic, leading to diseases. Though most such genetic variants have been depleted from genomes by natural selection, the ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these variants to gain an understanding of the corresponding diseases and their diagnosis, treatment, or cure. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change an amino acid of a protein. However, not all missense mutations are pathogenic.
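As a small illustration of what a single nucleotide missense mutation is, the Python sketch below classifies a single-base substitution within one codon. It is not taken from the disclosed technology: the function name classify_snv is hypothetical, and CODON_TABLE is deliberately partial, listing only the four standard genetic code entries needed for the example. The sketch says nothing about pathogenicity; it only shows how one changed nucleotide may or may not change the encoded amino acid.

```python
# Partial codon table (standard genetic code); "*" marks a stop codon.
# Only the entries needed for this illustration are listed.
CODON_TABLE = {
    "GAG": "E",  # glutamate
    "GAA": "E",  # glutamate
    "GTG": "V",  # valine
    "TAG": "*",  # stop
}


def classify_snv(ref_codon, position, alt_base):
    """Classify a single-base change at `position` (0, 1, or 2) of a codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        kind = "synonymous"   # same amino acid, protein sequence unchanged
    elif alt_aa == "*":
        kind = "nonsense"     # premature stop codon
    else:
        kind = "missense"     # a different amino acid is encoded
    return ref_aa, alt_aa, kind


missense = classify_snv("GAG", 1, "T")    # GAG -> GTG
synonymous = classify_snv("GAG", 2, "A")  # GAG -> GAA
nonsense = classify_snv("GAG", 0, "T")    # GAG -> TAG
```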

Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation, and have emerged as new methods for quantitative trait loci identification and variant prioritization. These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it difficult to estimate their effects and contribution to phenotypes. Moreover, linkage disequilibrium results in blocks of variants being co-inherited, which makes it difficult to pinpoint individual causal variants. Thus, sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to finding potential drivers of complex phenotypes. One example includes predicting the effect of non-coding single nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility, or gene expression predictions. Another example includes predicting novel splice site creation from sequence, or the quantitative effects of genetic variants on splicing.

End-to-end deep learning approaches for variant effect prediction are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (see Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as "PrimateAI"). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. In particular, PrimateAI uses the sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach, which utilizes protein sequences for pathogenicity prediction, is promising because it can avoid the circularity problem and overfitting to previous knowledge. However, compared to the amount of data needed to train deep neural networks effectively, the amount of clinical data available in ClinVar is relatively small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while simulated variants based on trinucleotide context are used as unlabeled data.

PrimateAI outperforms prior methods when trained directly on sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that may lessen the reliance of clinical reporting on prior knowledge.

Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables the development of computational methods to systematically derive the rules governing structural-functional relationships. However, the performance of these methods depends critically on the choice of protein structural representation.

Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional (3D) location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determining the structural and functional roles of individual amino acids within a protein provides information that helps to engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts, such as site-directed mutagenesis, for altering targeted protein functional properties. Alternatively, this knowledge can help avoid engineering designs that would abolish a desired function.

Since it has been established that structure is far more conserved than sequence, the increase in protein structural data provides an opportunity to systematically study the underlying patterns governing structural-functional relationships using data-driven approaches. A fundamental aspect of any computational protein analysis is how protein structural information is represented. The performance of machine learning methods often depends more on the choice of data representation than on the machine learning algorithm employed. Good representations efficiently capture the most critical information, while poor representations create a noisy distribution with no underlying patterns.

The surfeit of protein structures and the recent success of deep learning algorithms provide an opportunity to develop tools for automatically extracting task-specific representations of protein structures. Therefore, an opportunity arises to predict variant pathogenicity using multi-channel voxelized representations of 3D protein structures as input to deep neural networks.

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following drawings.
FIG. 1 is a flow diagram that illustrates a process of a system for determining pathogenicity of variants, in accordance with various implementations of the disclosed technology.
FIG. 2 schematically illustrates an example reference amino acid sequence of a protein and an alternative amino acid sequence of the protein, in accordance with one implementation of the disclosed technology.
FIG. 3 illustrates amino acid-wise classification of atoms of amino acids in the reference amino acid sequence of FIG. 2, in accordance with one implementation of the disclosed technology.
FIG. 4 illustrates amino acid-wise attribution of 3D atomic coordinates of the alpha-carbon atoms classified in FIG. 3 on an amino acid basis, in accordance with one implementation of the disclosed technology.
FIG. 5 schematically illustrates a process of determining voxel-wise distance values, in accordance with one implementation of the disclosed technology.
FIG. 6 shows an example of 21 amino acid-wise distance channels, in accordance with one implementation of the disclosed technology.
FIG. 7 is a schematic diagram of a distance channel tensor, in accordance with one implementation of the disclosed technology.
FIG. 8 shows one-hot encodings of the reference amino acid and the alternative amino acid from FIG. 2, in accordance with one implementation of the disclosed technology.
FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid and a voxelized one-hot encoded variant/alternative amino acid, in accordance with one implementation of the disclosed technology.
FIG. 10 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7 and a reference allele tensor, in accordance with one implementation of the disclosed technology.
FIG. 11 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7, the reference allele tensor of FIG. 10, and an alternative allele tensor, in accordance with one implementation of the disclosed technology.
FIG. 12 is a flow diagram that illustrates a process of a system for determining pan-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), in accordance with one implementation of the disclosed technology.
FIG. 13 illustrates voxels-to-nearest amino acids, in accordance with one implementation of the disclosed technology.
FIG. 14 shows an example multiple sequence alignment of the reference amino acid sequence across 99 species, in accordance with one implementation of the disclosed technology.
FIG. 15 shows an example of determining a pan-amino acid conservation frequency sequence for a particular voxel, in accordance with one implementation of the disclosed technology.
FIG. 16 shows respective pan-amino acid conservation frequencies determined for respective voxels using the position frequency logic described in FIG. 15, in accordance with one implementation of the disclosed technology.
FIG. 17 illustrates voxelized per-voxel evolutionary profiles, in accordance with one implementation of the disclosed technology.
FIG. 18 depicts an example of an evolutionary profile tensor, in accordance with one implementation of the disclosed technology.
FIG. 19 is a flow diagram that illustrates a process of a system for determining per-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), in accordance with one implementation of the disclosed technology.
FIG. 20 illustrates various examples of voxelized annotation channels that are concatenated with the distance channel tensor, in accordance with one implementation of the disclosed technology.
FIG. 21 illustrates different combinations and permutations of input channels that can be provided as inputs to a pathogenicity classifier for a pathogenicity determination of a target variant, in accordance with one implementation of the disclosed technology.
FIG. 22 shows different methods of calculating the disclosed distance channels, in accordance with various implementations of the disclosed technology.
FIG. 23 shows different examples of the evolutionary channels, in accordance with various implementations of the disclosed technology.
FIG. 24 shows different examples of the annotation channels, in accordance with various implementations of the disclosed technology.
FIG. 25 shows different examples of the structure confidence channels, in accordance with various implementations of the disclosed technology.
FIG. 26 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the disclosed technology.
FIG. 27 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the disclosed technology.
FIGS. 28, 29, 30, 31A, and 31B use PrimateAI as a benchmark model to demonstrate the classification superiority of the disclosed PrimateAI 3D over PrimateAI.
FIGS. 32A and 32B show the disclosed efficient voxelization process, in accordance with various implementations of the disclosed technology.
FIG. 33 depicts how atoms are associated with the voxels that contain them, in accordance with one implementation of the disclosed technology.
FIG. 34 shows generating voxel-to-atoms mappings from atom-to-voxels mappings to identify the nearest atoms on a voxel-by-voxel basis, in accordance with one implementation of the disclosed technology.
FIGS. 35A and 35B illustrate how the disclosed efficient voxelization has a runtime complexity of O(#atoms), versus a runtime complexity of O(#atoms * #voxels) without the disclosed efficient voxelization.
FIG. 36 shows an example computer system that can be used to implement the disclosed technology.
FIG. 37 illustrates one implementation of determining variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation.
FIG. 38 shows an example of a spatial representation of a protein.
FIG. 39 shows an example of a gapped spatial representation of the protein illustrated in FIG. 38.
FIG. 40 shows an example of an atom spatial representation of the protein illustrated in FIG. 38.
FIG. 41 shows an example of a gapped atom spatial representation of the protein illustrated in FIG. 38.
FIG. 42 illustrates one implementation of a pathogenicity classifier that determines variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation and an alternate amino acid representation of the target alternate amino acid.
FIG. 43 shows one implementation of the training data used to train the pathogenicity classifier.
FIG. 44 illustrates one implementation of generating a gapped spatial representation for a reference protein sample by using a reference amino acid as a gap amino acid.
FIG. 45 illustrates one implementation of training the pathogenicity classifier on a benign protein sample.
FIG. 46 illustrates one implementation of training the pathogenicity classifier on a pathogenic protein sample.
FIG. 47 shows how certain unreachable amino acid classes are masked during training.
FIG. 48 illustrates one implementation of determining a final pathogenicity score.
FIG. 49A shows that a variant pathogenicity determination is made for a target alternate amino acid that fills the vacancy created by a reference gap amino acid at a given position in a protein.
FIG. 49B shows that respective variant pathogenicity determinations are made for respective amino acids of respective amino acid classes that fill the vacancy created by a reference gap amino acid at a given position in a protein.
FIG. 50 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.
FIG. 51 illustrates one implementation of a pathogenicity classifier that determines variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.
FIG. 52 illustrates one implementation of concurrently training the pathogenicity classifier on benign and pathogenic protein samples.
FIG. 53 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternate amino acids.
FIG. 54 shows the evolutionary conservation determiner in operation, in accordance with one implementation.
FIG. 55 illustrates one implementation of determining pathogenicity based on predicted evolutionary scores.
FIG. 56 shows one implementation of the training data used to train the evolutionary conservation determiner.
FIG. 57 illustrates one implementation of concurrently training the evolutionary conservation determiner on benign and pathogenic protein samples.
FIG. 58 shows various implementations of ground truth label encodings used to train the evolutionary conservation determiner.
FIG. 59 shows an example position-specific frequency matrix (PSFM).
FIG. 60 shows an example position-specific scoring matrix (PSSM).
FIG. 61 shows one implementation of generating the PSFM and the PSSM.
FIG. 62 shows an example PSFM encoding.
FIG. 63 shows an example PSSM encoding.
FIG. 64 shows two datasets on which the models disclosed herein can be trained.
FIGS. 65A and 65B illustrate one implementation of combined learning of the models disclosed herein.
FIGS. 66A and 66B illustrate one implementation of using transfer learning to train the models disclosed herein on the two datasets shown in FIG. 64.
FIG. 67 illustrates one implementation of generating training data and labels to train the models disclosed herein.
FIG. 68 illustrates one implementation of a method of determining pathogenicity of nucleotide variants.
FIG. 69 illustrates one implementation of a system for predicting structural tolerability of amino acid substitutions.
FIGS. 70A, 70B, and 70C show performance results that demonstrate objective indicia of inventiveness and non-obviousness.

The following discussion is presented to enable any person skilled in the art to make and use the disclosed technology, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, a hard disk, or the like) or in multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and so on. It will be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.

The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel, or operated in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all of its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code, with code from other modules or other functions disposed in between.

Protein structure-based pathogenicity determination

FIG. 1 is a flow diagram that illustrates a process 100 of a system for determining pathogenicity of variants. At step 102, a sequence accessor 104 of the system accesses reference and alternative amino acid sequences. At step 112, a 3D structure generator 114 of the system generates a 3D protein structure for the reference amino acid sequence. In some implementations, the 3D protein structures are homology models of human proteins. In one implementation, the so-called SwissModel homology modeling pipeline provides a public repository of predicted human protein structures. In another implementation, so-called HHpred homology modeling uses a tool called Modeller to predict the structure of a target protein from template structures.

Proteins are represented by a collection of atoms and their coordinates in 3D space. An amino acid can have a variety of atoms, such as carbon atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms. The atoms can be further classified as side chain atoms and backbone atoms. The backbone carbon atoms can include alpha-carbon (Cα) atoms and beta-carbon (Cβ) atoms.

At step 122, a coordinate classifier 124 of the system classifies the 3D atomic coordinates of the 3D protein structure on an amino acid basis. In one implementation, the amino acid-wise classification involves attributing the 3D atomic coordinates to 21 amino acid categories (including a stop or gap amino acid category). In one example, an amino acid-wise classification of alpha-carbon atoms can list the alpha-carbon atoms under each of the 21 amino acid categories, respectively. In another example, an amino acid-wise classification of beta-carbon atoms can list the beta-carbon atoms under each of the 21 amino acid categories, respectively.

In yet another example, an amino acid-wise classification of oxygen atoms can list the oxygen atoms under each of the 21 amino acid categories, respectively. In yet another example, an amino acid-wise classification of nitrogen atoms can list the nitrogen atoms under each of the 21 amino acid categories, respectively. In yet another example, an amino acid-wise classification of hydrogen atoms can list the hydrogen atoms under each of the 21 amino acid categories, respectively.

A person skilled in the art will appreciate that, in various implementations, the amino acid-wise classification can include a subset of the 21 amino acid categories and a subset of the different atomic elements.
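The amino acid-wise classification described above can be sketched as a simple grouping step. The Python example below is a minimal sketch under stated assumptions, not the coordinate classifier 124 itself: the input format (tuples of one-letter residue code, atom name, and 3D coordinate), the atom names "CA" and "CB" for alpha- and beta-carbon atoms, the use of "-" for the stop or gap category, and the function name classify_by_amino_acid are all illustrative choices.

```python
# 20 standard amino acids plus one stop/gap category gives 21 categories.
# The "-" symbol for the stop/gap category is an assumption of this sketch.
AMINO_ACID_CATEGORIES = list("ACDEFGHIKLMNPQRSTVWY") + ["-"]


def classify_by_amino_acid(atoms, atom_name="CA"):
    """Group the 3D coordinates of one atom type by amino acid category.

    atoms: iterable of (residue_letter, atom_name, (x, y, z)) tuples.
    Returns a dict mapping each of the 21 categories to a list of coordinates.
    """
    by_category = {category: [] for category in AMINO_ACID_CATEGORIES}
    for residue, name, xyz in atoms:
        if name == atom_name:
            by_category[residue].append(xyz)
    return by_category


# Hypothetical toy structure: two alanines and one glycine.
toy_atoms = [
    ("A", "CA", (1.0, 2.0, 3.0)),
    ("A", "CB", (1.5, 2.5, 3.5)),
    ("G", "CA", (4.0, 5.0, 6.0)),
    ("A", "CA", (7.0, 8.0, 9.0)),
]

alpha_carbons = classify_by_amino_acid(toy_atoms, atom_name="CA")
beta_carbons = classify_by_amino_acid(toy_atoms, atom_name="CB")
```

Categories with no atoms of the requested type simply map to empty lists, so every one of the 21 categories is always present in the result.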

At step 132, a voxel grid generator 134 of the system instantiates a voxel grid. The voxel grid can have any resolution, for example, 3×3×3, 5×5×5, 7×7×7, and so on. Voxels in the voxel grid can be of any size, for example, one angstrom (Å) on each side, two Å on each side, three Å on each side, and so on. A person skilled in the art will appreciate that these example dimensions refer to cubic dimensions because voxels are cubes. A person skilled in the art will also appreciate that these example dimensions are non-limiting, and that the voxels can have any cubic dimensions.

단계(142)에서, 시스템의 복셀 그리드 센터러(144)가 아미노산 수준에서 표적 변이체를 경험하는 기준 아미노산에 복셀 그리드를 중심설정한다. 하나의 구현예에서, 복셀 그리드는 표적 변이체를 경험하는 기준 아미노산의 특정 원자의 원자 좌표, 예를 들어, 표적 변이체를 경험하는 기준 아미노산의 알파-탄소 원자의 3D 원자 좌표에 중심설정된다.At step 142, the system's voxel grid centerer 144 centers the voxel grid on the reference amino acid that experiences the target variant at the amino acid level. In one implementation, the voxel grid is centered on the atomic coordinates of a particular atom of the reference amino acid experiencing the target variant, e.g., the 3D atomic coordinates of the alpha-carbon atom of that reference amino acid.
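
Steps 132 and 142 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `voxel_centers`, the 2 Å voxel size, and the example alpha-carbon coordinate are all hypothetical choices made here for concreteness.

```python
import numpy as np

def voxel_centers(center, grid_size=3, voxel_size=2.0):
    """Return a (grid_size, grid_size, grid_size, 3) array of voxel-center coordinates.

    The grid is centered on `center`, so for an odd grid_size the middle
    voxel's center coincides with `center`.
    """
    center = np.asarray(center, dtype=float)
    # Offsets of voxel centers from the grid center along one axis,
    # e.g. [-2, 0, 2] for a 3-voxel axis with 2 Å voxels.
    offsets = (np.arange(grid_size) - (grid_size - 1) / 2.0) * voxel_size
    dx, dy, dz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return center + np.stack([dx, dy, dz], axis=-1)

# Hypothetical alpha-carbon coordinate (in Å) of the reference amino acid
# that experiences the target variant.
ca_ref = np.array([10.0, -4.0, 7.5])
grid = voxel_centers(ca_ref, grid_size=3, voxel_size=2.0)
```

Keeping the voxel centers as an explicit coordinate array makes every later distance calculation a plain array operation.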

거리 채널Distance channels

복셀 그리드 내의 복셀은 복수의 채널(또는 특징부)을 가질 수 있다. 하나의 구현예에서, 복셀 그리드 내의 복셀은 복수의 거리 채널(예를 들어, 각각, 21개의 아미노산 카테고리(정지 또는 갭 아미노산 카테고리를 포함함)에 대한 21개의 거리 채널)을 갖는다. 단계(152)에서, 시스템의 거리 채널 생성자(154)는 복셀 그리드 내의 복셀에 대한 아미노산별 거리 채널을 생성한다. 거리 채널은 21개의 아미노산 카테고리 각각에 대해 독립적으로 생성된다.Voxels within a voxel grid may have multiple channels (or features). In one implementation, the voxels within the voxel grid have a plurality of distance channels (e.g., 21 distance channels each for 21 amino acid categories (including stop or gap amino acid categories)). At step 152, the system's distance channel generator 154 creates per-amino acid distance channels for voxels in the voxel grid. Distance channels are created independently for each of the 21 amino acid categories.

예를 들어, 알라닌(A) 아미노산 카테고리를 고려한다. 예를 들어, 복셀 그리드가 크기 3×3×3의 것이고 27개의 복셀을 갖는다는 것을 추가로 고려한다. 이어서, 하나의 구현예에서, 알라닌 거리 채널이 복셀 그리드 내의 27개의 복셀에 대한 27개의 거리 값을 각각 포함한다. 알라닌 거리 채널에서의 27개의 거리 값은 복셀 그리드 내의 27개의 복셀의 각자의 중심으로부터 알라닌 아미노산 카테고리 내의 각자의 가장 가까운 원자까지 측정된다.For example, consider the alanine (A) amino acid category. Further consider that the voxel grid is of size 3×3×3 and has 27 voxels. Then, in one implementation, the alanine distance channel contains 27 distance values, one for each of the 27 voxels in the voxel grid. The 27 distance values in the alanine distance channel are measured from the respective centers of the 27 voxels to the respective nearest atoms in the alanine amino acid category.

하나의 예에서, 알라닌 아미노산 카테고리는 알파-탄소 원자만을 포함하고, 따라서, 가장 가까운 원자는 각각 복셀 그리드 내의 27개의 복셀에 가장 근접한 그러한 알라닌 알파-탄소 원자이다. 다른 예에서, 알라닌 아미노산 카테고리는 베타-탄소 원자만을 포함하고, 따라서, 가장 가까운 원자는 각각 복셀 그리드 내의 27개의 복셀에 가장 근접한 그러한 알라닌 베타-탄소 원자이다.In one example, the alanine amino acid category contains only alpha-carbon atoms, so the nearest atom is that alanine alpha-carbon atom that is closest to each of the 27 voxels in the voxel grid. In another example, the alanine amino acid category contains only beta-carbon atoms, and therefore the nearest atom is that alanine beta-carbon atom that is closest to each of the 27 voxels in the voxel grid.

또 다른 예에서, 알라닌 아미노산 카테고리는 산소 원자만을 포함하고, 따라서, 가장 가까운 원자는 각각 복셀 그리드 내의 27개의 복셀에 가장 근접한 그러한 알라닌 산소 원자이다. 또 다른 예에서, 알라닌 아미노산 카테고리는 질소 원자만을 포함하고, 따라서, 가장 가까운 원자는 각각 복셀 그리드 내의 27개의 복셀에 가장 근접한 그러한 알라닌 질소 원자이다. 또 다른 예에서, 알라닌 아미노산 카테고리는 수소 원자만을 포함하고, 따라서, 가장 가까운 원자는 각각 복셀 그리드 내의 27개의 복셀에 가장 근접한 그러한 알라닌 수소 원자이다.In another example, the alanine amino acid category contains only oxygen atoms, and therefore the nearest atom is that alanine oxygen atom that is closest to each of the 27 voxels in the voxel grid. In another example, the alanine amino acid category contains only nitrogen atoms, and therefore the nearest atom is that alanine nitrogen atom that is closest to each of the 27 voxels in the voxel grid. In another example, the alanine amino acid category contains only hydrogen atoms, and therefore the nearest atom is that alanine hydrogen atom that is closest to each of the 27 voxels in the voxel grid.

알라닌 거리 채널과 마찬가지로, 거리 채널 생성자(154)는 나머지 아미노산 카테고리 각각에 대한 거리 채널(즉, 복셀별 거리 값의 세트)을 생성한다. 다른 구현예에서, 거리 채널 생성자(154)는 21개의 아미노산 카테고리의 서브세트에 대해서만 거리 채널을 생성한다.Like the alanine distance channel, distance channel generator 154 creates a distance channel (i.e., a set of voxel-wise distance values) for each of the remaining amino acid categories. In another implementation, distance channel generator 154 creates distance channels for only a subset of the 21 amino acid categories.
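
The per-category distance channel described above can be sketched as a nearest-atom search. The function name `distance_channel`, the toy coordinates, and the convention of returning infinity for an empty category are assumptions made for this illustration only.

```python
import numpy as np

def distance_channel(voxel_xyz, atom_xyz):
    """Distance from each voxel center to its nearest atom in one category.

    voxel_xyz: (..., 3) voxel-center coordinates (flattened internally).
    atom_xyz:  (n_atoms, 3) coordinates of the atoms in one category,
               e.g. all alanine alpha-carbon atoms.
    Returns a 1D array with one distance per voxel.
    """
    voxels = np.asarray(voxel_xyz, dtype=float).reshape(-1, 3)
    atoms = np.asarray(atom_xyz, dtype=float).reshape(-1, 3)
    if atoms.shape[0] == 0:
        # Assumption: an empty category yields "infinitely far" values.
        return np.full(voxels.shape[0], np.inf)
    # Pairwise Euclidean distances, shape (n_voxels, n_atoms).
    pairwise = np.linalg.norm(voxels[:, None, :] - atoms[None, :, :], axis=-1)
    return pairwise.min(axis=1)

# Toy example: two voxel centers and two alanine alpha-carbon atoms.
voxels = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
alanine_ca = np.array([[0.0, 0.0, 1.0], [5.0, 0.0, 0.0]])
alanine_channel = distance_channel(voxels, alanine_ca)
```

Calling the same function once per amino acid category would yield the remaining channels.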

다른 구현예에서, 가장 가까운 원자의 선택은 특정 원자 유형으로 한정되지 않는다. 즉, 대상 아미노산 카테고리 내에서, 특정 복셀에 대해 가장 가까운 원자가, 가장 가까운 원자의 원자 원소, 및 대상 아미노산 카테고리에 대한 거리 채널에 포함시키기 위해 계산된 특정 복셀에 대한 거리 값과 관계없이 선택된다.In other implementations, the selection of the nearest atom is not confined to a particular atom type. That is, within the target amino acid category, the atom closest to a particular voxel is selected regardless of its atomic element, and the distance value for that voxel is calculated for inclusion in the distance channel for the target amino acid category.

또 다른 구현예에서, 거리 채널은 원자 원소 단위로 생성된다. 아미노산 카테고리에 대한 거리 채널을 갖는 대신에 또는 그에 더하여, 원자가 속하는 아미노산에 관계없이 원자 원소 카테고리에 대해 거리 값이 생성될 수 있다. 예를 들어, 기준 아미노산 서열 내의 아미노산의 원자는 7개의 원자 원소, 즉 탄소, 산소, 질소, 수소, 칼슘, 요오드, 및 황에 걸쳐 있음을 고려한다. 이어서, 복셀 그리드 내의 복셀은 7개의 거리 채널을 갖도록 구성되고, 따라서, 7개의 거리 채널 각각은 상응하는 원자 원소 카테고리 내의 가장 가까운 원자까지만의 거리를 특정하는 27개의 복셀별 거리 값을 갖는다. 다른 구현예에서, 7개의 원자 원소의 서브세트만을 위한 거리 채널이 생성될 수 있다. 또 다른 구현예에서, 원자 원소 카테고리 및 거리 채널 생성은 동일한 원자 원소, 예를 들어, 알파-탄소(C_α) 원자 및 베타-탄소(C_β) 원자의 변이로 추가로 계층화될 수 있다.In another implementation, the distance channels are generated on an atomic-element basis. Instead of, or in addition to, having distance channels for the amino acid categories, distance values can be generated for atomic element categories irrespective of the amino acids to which the atoms belong. For example, consider that the atoms of the amino acids in the reference amino acid sequence span seven atomic elements: carbon, oxygen, nitrogen, hydrogen, calcium, iodine, and sulfur. The voxels in the voxel grid are then configured to have seven distance channels, such that each of the seven distance channels has 27 voxel-wise distance values that specify distances only to the nearest atoms within the corresponding atomic element category. In other implementations, distance channels can be generated for only a subset of the seven atomic elements. In yet other implementations, the atomic element categories and the distance channel generation can be further stratified into variations of the same atomic element, e.g., alpha-carbon (C_α) atoms and beta-carbon (C_β) atoms.

또 다른 구현예에서, 거리 채널은 원자 유형 단위로 생성되는데, 예를 들어, 측쇄 원자만에 대한 거리 채널 및 백본 원자만에 대한 거리 채널이 생성될 수 있다.In another implementation, distance channels can be created on a per atom type basis, for example, a distance channel for only side chain atoms and a distance channel for only backbone atoms.

가장 가까운 원자는 복셀 중심으로부터 미리정의된 최대 스캔 반경(예컨대, 6 옹스트롬(Å)) 내에서 검색될 수 있다. 또한, 다수의 원자가 복셀 그리드 내의 동일한 복셀에 가장 가까울 수 있다.The nearest atom can be searched within a predefined maximum scan radius (e.g., 6 angstroms (Å)) from the voxel center. Additionally, multiple atoms may be closest to the same voxel within the voxel grid.

거리는 복셀 중심의 3D 좌표와 원자의 3D 원자 좌표 사이에서 계산된다. 또한, 거리 채널은 동일한 위치에 중심설정된(예를 들어, 표적 변이체를 경험하는 기준 아미노산의 알파-탄소 원자의 3D 원자 좌표에 중심설정된) 복셀 그리드로 생성된다.The distance is calculated between the 3D coordinates of the voxel center and the 3D atomic coordinates of the atoms. Additionally, the distance channel is created as a grid of voxels centered on the same location (e.g., centered on the 3D atomic coordinates of the alpha-carbon atom of the reference amino acid experiencing the target variant).

거리는 유클리드 거리일 수 있다. 또한, 거리는 (예를 들어, 해당 원자의 Lennard-Jones 전위 및/또는 Van der Waals 원자 반경을 사용하여) 원자 크기(또는 원자 영향)에 의해 파라미터화될 수 있다. 또한, 거리 값은 최대 스캔 반경에 의해, 또는 대상 아미노산 카테고리 또는 대상 원자 원소 카테고리 또는 대상 원자 유형 카테고리 내의 최대한 가장 가까운 원자의 최대 관찰된 거리 값에 의해 정규화될 수 있다. 일부 구현예에서, 복셀과 원자 사이의 거리는 복셀 및 원자의 극좌표에 기초하여 계산된다. 극좌표는 복셀과 원자 사이의 각도에 의해 파라미터화된다. 하나의 구현예에서, 이러한 각도 정보는 복셀에 대한 각도 채널을 생성하는 데 사용된다(즉, 거리 채널로부터 독립적임). 일부 구현예에서, 가장 가까운 원자와 이웃 원자(예를 들어, 백본 원자) 사이의 각도는 복셀로 인코딩되는 특징부로서 사용될 수 있다.The distance may be a Euclidean distance. Additionally, the distance may be parameterized by atomic size (or atomic influence) (e.g., using the Lennard-Jones potential and/or Van der Waals atomic radius of that atom). Additionally, the distance value may be normalized by the maximum scan radius, or by the maximum observed distance value of the closest possible atom within the target amino acid category or target atom element category or target atom type category. In some implementations, the distance between a voxel and an atom is calculated based on the polar coordinates of the voxel and the atom. Polar coordinates are parameterized by the angle between voxels and atoms. In one implementation, this angular information is used to create an angular channel for a voxel (i.e., independent from the distance channel). In some implementations, the angle between the nearest atom and a neighboring atom (e.g., a backbone atom) can be used as a feature encoded into a voxel.
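
Normalization by the maximum scan radius can be sketched as below. The text does not say which value a voxel receives when no atom lies within the scan radius; this sketch assumes such distances are clipped to the radius, so they normalize to 1.0. The function name and the 6 Å default are illustrative.

```python
import numpy as np

def normalize_by_scan_radius(distances, max_radius=6.0):
    """Scale nearest-atom distances into [0, 1] using the maximum scan radius.

    Assumption (not stated in the text): distances beyond the scan radius,
    including infinite ones, are clipped to the radius before dividing.
    """
    d = np.asarray(distances, dtype=float)
    return np.minimum(d, max_radius) / max_radius

# Toy distances in Å: inside the radius, on it, beyond it, and "no atom found".
raw = np.array([3.0, 6.0, 9.0, np.inf])
normalized = normalize_by_scan_radius(raw, max_radius=6.0)
```

Normalizing by the largest observed nearest-atom distance within a category, the other option the text mentions, would only change the divisor.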

기준 대립유전자 및 대안 대립유전자 채널Reference allele and alternative allele channels

복셀 그리드 내의 복셀은 또한, 기준 대립유전자 및 대안 대립유전자 채널을 가질 수 있다. 단계(162)에서, 시스템의 원-핫 인코더(164)가 기준 아미노산 서열 내의 기준 아미노산의 기준 원-핫 인코딩 및 대안 아미노산 서열 내의 대안 아미노산의 대안 원-핫 인코딩을 생성한다. 기준 아미노산은 표적 변이체를 경험한다. 대안 아미노산은 표적 변이체이다. 기준 아미노산 및 대안 아미노산은 기준 아미노산 서열 및 대안 아미노산 서열에서 각각 동일한 위치에 위치한다. 기준 아미노산 서열 및 대안 아미노산 서열은 하나의 예외를 갖는 동일한 위치별 아미노산 조성을 갖는다. 예외는, 기준 아미노산 서열에서는 기준 아미노산을 갖고 대안 아미노산 서열에서는 대안 아미노산을 갖는 위치이다.Voxels within the voxel grid may also have reference allele and alternative allele channels. At step 162, the system's one-hot encoder 164 generates a reference one-hot encoding of the reference amino acid in the reference amino acid sequence and an alternative one-hot encoding of the alternative amino acid in the alternative amino acid sequence. The reference amino acid experiences the target variant. The alternative amino acid is the target variant. The reference amino acid and the alternative amino acid are located at the same position in the reference amino acid sequence and the alternative amino acid sequence, respectively. The reference amino acid sequence and the alternative amino acid sequence have the same position-wise amino acid composition with one exception. The exception is the position that has the reference amino acid in the reference amino acid sequence and the alternative amino acid in the alternative amino acid sequence.

단계(172)에서, 시스템의 연결자(174)가 아미노산별 거리 채널과 기준 및 대안 원-핫 인코딩을 연결한다. 다른 구현예에서, 연결자(174)는 원자 원소별 거리 채널과 기준 및 대안 원-핫 인코딩을 연결한다. 또 다른 구현예에서, 연결자(174)는 원자 유형별 거리 채널과 기준 및 대안 원-핫 인코딩을 연결한다.At step 172, the system's concatenator 174 concatenates the amino acid-wise distance channels with the reference and alternative one-hot encodings. In another implementation, concatenator 174 concatenates the atomic element-wise distance channels with the reference and alternative one-hot encodings. In yet another implementation, concatenator 174 concatenates the atom type-wise distance channels with the reference and alternative one-hot encodings.
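
One way to picture step 172 is as a channel-wise concatenation. The sketch below assumes that the one-hot encodings are repeated identically at every voxel so they can be stacked with the distance channels along the channel axis; this passage does not spell out that layout, and the function name `concat_channels` and the index chosen for glycine and alanine are hypothetical.

```python
import numpy as np

def concat_channels(dist, ref_onehot, alt_onehot):
    """Concatenate distance channels with reference/alternative one-hot encodings.

    dist:       (C, X, Y, Z) distance channels.
    ref_onehot: (K,) one-hot vector of the reference amino acid.
    alt_onehot: (K,) one-hot vector of the alternative amino acid.
    Returns an array of shape (C + 2K, X, Y, Z).
    """
    dist = np.asarray(dist, dtype=float)
    spatial = dist.shape[1:]

    def tile(vec):
        vec = np.asarray(vec, dtype=float)
        # Assumption: the same encoding is repeated at every voxel.
        expanded = vec.reshape((-1,) + (1,) * len(spatial))
        return np.broadcast_to(expanded, (vec.size,) + spatial)

    return np.concatenate([dist, tile(ref_onehot), tile(alt_onehot)], axis=0)

# Toy inputs: 21 distance channels on a 3x3x3 grid; hypothetical category indices.
dist = np.zeros((21, 3, 3, 3))
ref = np.eye(21)[5]   # hypothetical index for glycine
alt = np.eye(21)[0]   # hypothetical index for alanine
features = concat_channels(dist, ref, alt)
```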

단계(182)에서, 시스템의 런타임 로직(184)은 병원성 분류자(병원성 결정 엔진)를 통해, 연결된 아미노산별/원자 원소별/원자 유형별 거리 채널과 기준 및 대안 원-핫 인코딩을 처리하여 표적 변이체의 병원성을 결정하는데, 이는 결국, 아미노산 수준에서 표적 변이체를 생성하는 기본 뉴클레오티드 변이체의 병원성 결정으로서 추론된다. 병원성 분류자는, 예를 들어 역전파 알고리즘을 사용하여, 양성 및 병원성 변이체의 라벨링된 데이터세트를 사용하여 훈련된다. 양성 및 병원성 변이체의 라벨링된 데이터세트 및 병원성 분류자의 예시적인 아키텍처 및 훈련에 관한 추가적인 세부사항은 공동 소유의 미국 특허 출원 제16/160,903호; 제16/160,986호; 제16/160,968호; 및 제16/407,149호에서 찾을 수 있다.At step 182, the system's runtime logic 184 processes the concatenated amino acid-wise/atomic element-wise/atom type-wise distance channels and the reference and alternative one-hot encodings through a pathogenicity classifier (pathogenicity determination engine) to determine the pathogenicity of the target variant, which in turn is inferred as the pathogenicity determination of the underlying nucleotide variant that produces the target variant at the amino acid level. The pathogenicity classifier is trained using labeled datasets of benign and pathogenic variants, for example using the backpropagation algorithm. Additional details regarding the labeled datasets of benign and pathogenic variants and exemplary architectures and training of the pathogenicity classifier can be found in commonly owned U.S. Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149.

도 2는 단백질(200)의 기준 아미노산 서열(202) 및 단백질(200)의 대안 아미노산 서열(212)을 개략적으로 도시한다. 단백질(200)은 N개의 아미노산을 포함한다. 단백질(200) 내의 아미노산의 위치는 1, 2, 3...N으로 라벨링된다. 예시된 예에서, 위치 16은 기본 뉴클레오티드 변이체에 의해 야기되는 아미노산 변이체(214)(돌연변이)를 경험하는 위치이다. 예를 들어, 기준 아미노산 서열(202)의 경우, 위치 1은 기준 아미노산 페닐알라닌(F)을 갖고, 위치 16은 기준 아미노산 글리신(G)(204)을 갖고, 위치 N(예컨대, 서열(202)의 마지막 아미노산)은 기준 아미노산 류신(L)을 갖는다. 명확성을 위해 예시되지 않았지만, 기준 아미노산 서열(202) 내의 나머지 위치는 단백질(200)에 특정적인 순서로 다양한 아미노산을 함유한다. 대안 아미노산 서열(212)은 위치 16에서의 변이체(214)를 제외하면 기준 아미노산 서열(202)과 동일한데, 이는 기준 아미노산 글리신(G)(204) 대신에 대안 아미노산 알라닌(A)(214)을 함유한다.FIG. 2 schematically depicts a reference amino acid sequence 202 of a protein 200 and an alternative amino acid sequence 212 of the protein 200. Protein 200 contains N amino acids. The positions of the amino acids in protein 200 are labeled 1, 2, 3...N. In the illustrated example, position 16 is the position that experiences amino acid variant 214 (a mutation) caused by an underlying nucleotide variant. For example, for the reference amino acid sequence 202, position 1 has the reference amino acid phenylalanine (F), position 16 has the reference amino acid glycine (G) 204, and position N (e.g., the last amino acid of sequence 202) has the reference amino acid leucine (L). Although not illustrated for clarity, the remaining positions in the reference amino acid sequence 202 contain various amino acids in an order specific to protein 200. The alternative amino acid sequence 212 is identical to the reference amino acid sequence 202 except for the variant 214 at position 16, where it contains the alternative amino acid alanine (A) 214 in place of the reference amino acid glycine (G) 204.

도 3은 본 명세서에서 "원자 분류(300)"로도 지칭되는, 기준 아미노산 서열(202) 내의 아미노산의 원자의 아미노산별 분류를 도시한다. 열(302)에 열거된 20개의 천연 아미노산 중에서, 특정 유형의 아미노산이 단백질에서 반복될 수 있다. 즉, 특정 유형의 아미노산이 단백질에서 1회 초과로 발생할 수 있다. 단백질은 또한, 21번째 정지 또는 갭 아미노산 카테고리에 의해 카테고리화되는 일부 결정되지 않은 아미노산을 가질 수 있다. 도 3의 우측 열은 상이한 아미노산으로부터의 알파-탄소(C_α) 원자의 카운트를 함유한다.FIG. 3 shows an amino acid-wise classification of the atoms of the amino acids in the reference amino acid sequence 202, also referred to herein as "atom classification 300." Of the 20 natural amino acids listed in column 302, a particular type of amino acid may be repeated in a protein. That is, a particular type of amino acid may occur more than once in a protein. Proteins may also have some undetermined amino acids that are categorized by a 21st stop or gap amino acid category. The right column of FIG. 3 contains the counts of alpha-carbon (C_α) atoms from the different amino acids.

구체적으로, 도 3은 기준 아미노산 서열(202) 내의 아미노산의 알파-탄소(C_α) 원자의 아미노산별 분류를 도시한다. 도 3의 열(308)은 21개의 아미노산 카테고리 각각에서 기준 아미노산 서열(202)에 대해 관찰된 알파-탄소 원자의 총 수를 열거한다. 예를 들어, 열(308)은 알라닌(A) 아미노산 카테고리에 대해 관찰된 11개의 알파-탄소 원자를 열거한다. 각각의 아미노산은 단지 하나의 알파-탄소 원자만을 갖기 때문에, 이것은 알라닌이 기준 아미노산 서열(202)에서 11회 발생함을 의미한다. 다른 예에서, 아르기닌(R)은 기준 아미노산 서열(202)에서 35회 발생한다. 21개의 아미노산 카테고리에 걸친 알파-탄소 원자의 총 수는 828이다.Specifically, FIG. 3 shows an amino acid-wise classification of the alpha-carbon (C_α) atoms of the amino acids in the reference amino acid sequence 202. Column 308 of FIG. 3 lists the total number of alpha-carbon atoms observed for the reference amino acid sequence 202 in each of the 21 amino acid categories. For example, column 308 lists 11 alpha-carbon atoms observed for the alanine (A) amino acid category. Since each amino acid has only one alpha-carbon atom, this means that alanine occurs 11 times in the reference amino acid sequence 202. In another example, arginine (R) occurs 35 times in the reference amino acid sequence 202. The total number of alpha-carbon atoms across the 21 amino acid categories is 828.

도 4는 도 3의 원자 분류(300)에 기초한 기준 아미노산 서열(202)의 알파-탄소 원자의 3D 원자 좌표의 아미노산별 속성을 도시한다. 이것은 본 명세서에서 "원자 좌표 버킷팅(bucketing)(400)"으로 지칭된다. 도 4에서, 목록(404 내지 440)은 21개의 아미노산 카테고리 각각에 버킷팅된 알파-탄소 원자의 3D 원자 좌표를 표로 나타낸다.FIG. 4 shows the amino acid-wise attribution of the 3D atomic coordinates of the alpha-carbon atoms of the reference amino acid sequence 202, based on the atom classification 300 of FIG. 3. This is referred to herein as "atomic coordinate bucketing 400." In FIG. 4, lists 404 to 440 tabulate the 3D atomic coordinates of the alpha-carbon atoms bucketed into each of the 21 amino acid categories.
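
The bucketing of FIG. 4 amounts to grouping coordinates by amino acid category. The sketch below uses one-letter codes as category keys and a short made-up residue list; both are assumptions for illustration and are not taken from the figures.

```python
from collections import defaultdict

def bucket_coordinates(residues):
    """Group alpha-carbon coordinates by amino acid category.

    residues: iterable of (one_letter_code, (x, y, z)) pairs, in sequence order.
    Returns a dict mapping each category to its list of coordinates.
    """
    buckets = defaultdict(list)
    for amino_acid, xyz in residues:
        buckets[amino_acid].append(tuple(xyz))
    return dict(buckets)

# Made-up residues: two alanines, one glycine, one arginine.
residues = [
    ("A", (1.0, 2.0, 3.0)),
    ("G", (4.0, 5.0, 6.0)),
    ("A", (7.0, 8.0, 9.0)),
    ("R", (0.5, 0.5, 0.5)),
]
buckets = bucket_coordinates(residues)
# Per-category counts correspond to the counts column of the classification.
counts = {aa: len(coords) for aa, coords in buckets.items()}
```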

도시된 구현예에서, 도 4의 버킷팅(400)은 도 3의 분류(300)를 따른다. 예를 들어, 도 3에서, 알라닌 아미노산 카테고리는 11개의 알파-탄소 원자를 갖고, 따라서, 도 4에서, 알라닌 아미노산 카테고리는 도 3으로부터의 상응하는 11개의 알파-탄소 원자의 11개의 3D 원자 좌표를 갖는다. 이러한 분류-버킷팅 로직은 역시 다른 아미노산 카테고리에 대해서도 도 3으로부터 도 4로 흐른다. 그러나, 이러한 분류-버킷팅 로직은 단지 구상적인 목적만을 위한 것이며, 다른 구현예에서, 개시된 기술은 복셀별로 가장 가까운 원자를 위치시키기 위해 분류(300) 및 버킷팅(400)을 수행할 필요가 없고, 더 적은, 추가적인, 또는 상이한 단계를 수행할 수 있다. 예를 들어, 일부 구현예에서, 개시된 기술은 분류 기준(예컨대, 아미노산별, 원자 원소별, 원자 유형별), 미리정의된 최대 스캔 반경, 및 거리들의 유형(예컨대, Euclidean, Mahalanobis, 정규화, 비정규화)과 같은 질의 파라미터를 수용하도록 구성된 검색 질의에 응답하여 하나 이상의 데이터 베이스로부터 복셀별로 가장 가까운 원자를 복귀시키는 분류 및 검색 알고리즘을 사용하여 복셀별로 가장 가까운 원자를 위치확인할 수 있다. 개시된 기술의 다양한 구현예에서, 현재 또는 미래의 기술분야로부터의 복수의 분류 및 검색 알고리즘은 당업자에 의해, 복셀별로 가장 가까운 원자를 위치확인하기 위해 유사하게 사용될 수 있다.In the illustrated implementation, the bucketing 400 of FIG. 4 follows the classification 300 of FIG. 3. For example, in FIG. 3, the alanine amino acid category has 11 alpha-carbon atoms, and therefore, in FIG. 4, the alanine amino acid category has 11 3D atomic coordinates for the corresponding 11 alpha-carbon atoms from FIG. 3. This classification-to-bucketing logic flows from FIG. 3 to FIG. 4 for the other amino acid categories as well. However, this classification-to-bucketing logic is for representational purposes only; in other implementations, the disclosed technology need not perform the classification 300 and the bucketing 400 to locate the voxel-wise nearest atoms, and may perform fewer, additional, or different steps. For example, in some implementations, the disclosed technology can locate the voxel-wise nearest atoms using a sort and search algorithm that returns the voxel-wise nearest atoms from one or more databases in response to a search query configured to accept query parameters such as the classification criteria (e.g., by amino acid, by atomic element, by atom type), the predefined maximum scan radius, and the type of distances (e.g., Euclidean, Mahalanobis, normalized, unnormalized). In various implementations of the disclosed technology, a plurality of sort and search algorithms from the current or future technical field can similarly be used by a person skilled in the art to locate the voxel-wise nearest atoms.

도 4에서, 3D 원자 좌표는 직교 좌표 x, y, z에 의해 표현되지만, 구형 또는 원통형 좌표와 같은 임의의 유형의 좌표계가 사용될 수 있고, 청구된 주제는 이러한 점에서 제한되지 않는다. 일부 구현예에서, 하나 이상의 데이터베이스가 단백질 내의 알파-탄소 원자 및 아미노산의 다른 원자의 3D 원자 좌표에 관한 정보를 포함할 수 있다. 그러한 데이터베이스는 특정 단백질에 의해 검색가능할 수 있다.In Figure 4, 3D atomic coordinates are represented by Cartesian coordinates x, y, z, but any type of coordinate system, such as spherical or cylindrical coordinates, may be used, and the claimed subject matter is not limited in this respect. In some embodiments, one or more databases may contain information regarding 3D atomic coordinates of alpha-carbon atoms and other atoms of amino acids in proteins. Such databases may be searchable by specific proteins.

위에서 논의된 바와 같이, 복셀 및 복셀 그리드는 3D 엔티티이다. 그러나, 명확성을 위해, 도면은 복셀 및 복셀 그리드를 2차원(2D) 포맷으로 도시하고, 설명은 이를 논의한다. 예를 들어, 27개의 복셀의 3×3×3 복셀 그리드가 본 명세서에서 9개의 2D 픽셀을 갖는 3×3 2D 픽셀 그리드로서 도시되고 설명된다. 당업자는, 2D 포맷이 단지 구상적인 목적만을 위해 사용되고 3D 대응물(즉, 2D 픽셀이 3D 복셀을 표현하고, 2D 픽셀 그리드가 3D 복셀 그리드를 표현함)을 커버하도록 의도됨을 이해할 것이다. 또한, 도면은 또한 축척대로 된 것은 아니다. 예를 들어, 크기 2 옹스트롬(Å)의 복셀이 단일 픽셀을 사용하여 묘사된다.As discussed above, voxels and voxel grids are 3D entities. However, for clarity, the figures show voxels and voxel grids in a two-dimensional (2D) format, and the description discusses this. For example, a 3x3x3 voxel grid of 27 voxels is shown and described herein as a 3x3 2D pixel grid with 9 2D pixels. Those skilled in the art will understand that the 2D format is used for representational purposes only and is intended to cover its 3D counterpart (i.e., 2D pixels represent 3D voxels, and 2D pixel grids represent 3D voxel grids). Additionally, the drawings are also not to scale. For example, a voxel of size 2 Angstroms (Å) is depicted using a single pixel.

복셀별 거리 계산Distance calculation per voxel

도 5는 본 명세서에서 "복셀별 거리 계산(500)"으로도 지칭되는 복셀별 거리 값을 결정하는 프로세스를 개략적으로 도시한다. 도시된 예에서, 복셀별 거리 값은 알라닌(A) 거리 채널에 대해서만 계산된다. 그러나, 동일한 거리 계산 로직이 21개의 아미노산 카테고리 각각에 대해 실행되어 21개의 아미노산별 거리 채널을 생성하고, 도 1과 관련하여 위에서 논의된 바와 같이, 베타-탄소 원자 및 산소, 질소 및 수소와 같은 다른 원자 원소와 같은 다른 원자 유형으로 추가로 확장될 수 있다. 일부 구현예에서, 원자는 병원성 분류자의 훈련을 원자 배향에 대해 불변이 되게 하기 위해 거리 계산 전에 랜덤하게 회전된다.FIG. 5 schematically illustrates a process of determining voxel-wise distance values, also referred to herein as "voxel-wise distance calculation 500." In the illustrated example, the voxel-wise distance values are calculated only for the alanine (A) distance channel. However, the same distance calculation logic is executed for each of the 21 amino acid categories to generate the 21 amino acid-wise distance channels and, as discussed above with respect to FIG. 1, can be further extended to other atom types such as beta-carbon atoms and to other atomic elements such as oxygen, nitrogen, and hydrogen. In some implementations, the atoms are randomly rotated prior to the distance calculation to make the training of the pathogenicity classifier invariant to atom orientation.
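
The random rotation mentioned above could be realized as follows. This sketch assumes the atoms are rotated rigidly about the voxel grid center, which the text does not specify, and it draws the rotation from the QR decomposition of a Gaussian matrix, one common way to sample a random 3D rotation. All names and coordinates are hypothetical.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Sample a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    # Fix the column signs so the factorization is unique.
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        # Flip one axis to turn a reflection into a proper rotation.
        q[:, 0] = -q[:, 0]
    return q

def rotate_about(atom_xyz, center, rot):
    """Rotate atom coordinates rigidly about `center`."""
    atoms = np.asarray(atom_xyz, dtype=float)
    center = np.asarray(center, dtype=float)
    return (atoms - center) @ rot.T + center

rng = np.random.default_rng(0)
rot = random_rotation_matrix(rng)
grid_center = np.array([10.0, -4.0, 7.5])       # hypothetical grid center
atoms = np.array([[11.0, -4.0, 7.5], [10.0, 0.0, 9.5]])
rotated = rotate_about(atoms, grid_center, rot)
```

Because the rotation is rigid, each atom keeps its distance to the grid center while its position relative to the individual voxels changes.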

도 5에서, 복셀 그리드(522)가 인덱스 (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), 및 (3, 3)로 식별된 9개의 복셀(514)을 갖는다. 복셀 그리드(522)는, 예를 들어, 기준 아미노산 서열(202) 내의 위치 16에 있는 글리신(G) 아미노산의 알파-탄소 원자의 3D 원자 좌표(532)에 중심설정되는데, 그 이유는 도 2와 관련하여 위에서 논의된 바와 같이, 대안 아미노산 서열(212)에서, 위치 16이 글리신(G) 아미노산을 알라닌(A) 아미노산으로 돌연변이시킨 변이체를 경험하기 때문이다. 또한, 복셀 그리드(522)의 중심은 복셀 (2, 2)의 중심과 일치한다.In Figure 5, the voxel grid 522 has indices (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, It has 9 voxels 514, identified as 1), (3, 2), and (3, 3). The voxel grid 522 is centered, for example, at the 3D atomic coordinates 532 of the alpha-carbon atom of the glycine (G) amino acid at position 16 in the reference amino acid sequence 202 because As discussed above in connection, in the alternative amino acid sequence 212, position 16 experiences a variant that mutates the glycine (G) amino acid to an alanine (A) amino acid. Additionally, the center of voxel grid 522 coincides with the center of voxel (2, 2).

중심설정된 복셀 그리드(522)는 21개의 아미노산별 거리 채널 각각에 대한 복셀별 거리 계산에 사용된다. 예를 들어 알라닌(A) 거리 채널로 시작하여, 9개의 복셀(514) 각각에 대한 가장 가까운 알라닌 알파-탄소 원자를 위치확인하기 위해 9개의 복셀(514)의 각자의 중심의 3D 좌표와 11개의 알라닌 알파-탄소 원자의 3D 원자 좌표(402) 사이의 거리가 측정된다. 이어서, 9개의 복셀(514)과 각자의 가장 가까운 알라닌 알파-탄소 원자 사이의 9개의 거리에 대한 9개의 거리 값이 알라닌 거리 채널을 구성하는 데 사용된다. 생성된 알라닌 거리 채널은 복셀 그리드(522) 내의 9개의 복셀(514)과 동일한 순서로 9개의 알라닌 거리 값을 배열한다.The centered voxel grid 522 is used to calculate the distance per voxel for each of the 21 distance channels for each amino acid. For example, starting with the alanine (A) distance channel, the 3D coordinates of the centroids of each of the nine voxels 514 and the 11 The distance between the 3D atomic coordinates 402 of the alanine alpha-carbon atoms is measured. The nine distance values for the nine distances between the nine voxels 514 and their respective nearest alanine alpha-carbon atoms are then used to construct the alanine distance channel. The generated alanine distance channel arranges nine alanine distance values in the same order as the nine voxels 514 in the voxel grid 522.

위의 프로세스는 21개의 아미노산 카테고리 각각에 대해 실행된다. 예를 들어, 중심설정된 복셀 그리드(522)는 아르기닌(R) 거리 채널을 계산하는 데 유사하게 사용되어, 9개의 복셀(514)의 각자의 중심의 3D 좌표와 35개의 아르기닌 알파-탄소 원자의 3D 원자 좌표(404) 사이의 거리가 측정되어 9개의 복셀(514) 각각에 대한 가장 가까운 아르기닌 알파-탄소 원자를 위치확인하게 한다. 이어서, 9개의 복셀(514)과 각자의 가장 가까운 아르기닌 알파-탄소 원자 사이의 9개의 거리에 대한 9개의 거리 값이 아르기닌 거리 채널을 구성하는 데 사용된다. 생성된 아르기닌 거리 채널은 복셀 그리드(522) 내의 9개의 복셀(514)과 동일한 순서로 9개의 아르기닌 거리 값을 배열한다. 21개의 아미노산별 거리 채널은 복셀별로 인코딩되어 거리 채널 텐서를 형성한다.The above process is run for each of the 21 amino acid categories. For example, the centered voxel grid 522 can be similarly used to calculate the arginine (R) distance channel, providing the 3D coordinates of the respective centroids of the nine voxels 514 and the 3D coordinates of the 35 arginine alpha-carbon atoms. Distances between atomic coordinates 404 are measured to locate the nearest arginine alpha-carbon atom for each of the nine voxels 514. The nine distance values for the nine distances between the nine voxels 514 and their respective nearest arginine alpha-carbon atoms are then used to construct the arginine distance channel. The created arginine distance channel arranges nine arginine distance values in the same order as the nine voxels 514 in the voxel grid 522. The 21 amino acid-specific distance channels are encoded on a voxel-by-voxel basis to form a distance channel tensor.

구체적으로, 예시된 예에서, 거리(512)는 복셀 그리드(522)의 복셀 (1, 1)의 중심과 목록(402) 내의 Cα^A5 원자인 가장 가까운 알파-탄소(C_α) 원자 사이의 것이다. 따라서, 복셀 (1, 1)에 할당된 값은 거리(512)이다. 다른 예에서, Cα^A4 원자는 복셀 (1, 2)의 중심에 대해 가장 가까운 C_α 원자이다. 따라서, 복셀 (1, 2)에 할당된 값은 복셀 (1, 2)의 중심과 Cα^A4 원자 사이의 거리이다. 또 다른 예에서, Cα^A6 원자는 복셀 (2, 1)의 중심에 대해 가장 가까운 C_α 원자이다. 따라서, 복셀 (2, 1)에 할당된 값은 복셀 (2, 1)의 중심과 Cα^A6 원자 사이의 거리이다. 또 다른 예에서, Cα^A6 원자는 또한, 복셀 (3, 2) 및 (3, 3)의 중심에 대해 가장 가까운 C_α 원자이다. 따라서, 복셀 (3, 2)에 할당된 값은 복셀 (3, 2)의 중심과 Cα^A6 사이의 거리이고, 복셀 (3, 3)에 할당된 값은 복셀 (3, 3)의 중심과 Cα^A6 원자 사이의 거리이다. 일부 구현예에서, 복셀(514)에 할당된 거리 값은 정규화된 거리일 수 있다. 예를 들어, 복셀 (1, 1)에 할당된 거리 값은 거리(512)를 최대 거리(502)(미리정의된 최대 스캔 반경)로 나눈 것일 수 있다. 일부 구현예에서, 가장 가까운 원자 거리는 유클리드 거리일 수 있고 가장 가까운 원자 거리는 유클리드 거리를 (예컨대, 최대 거리(502)와 같은) 최대 가장 가까운 원자 거리로 나눔으로써 정규화될 수 있다.Specifically, in the illustrated example, distance 512 is between the center of voxel (1, 1) of voxel grid 522 and the nearest alpha-carbon (C_α) atom, which is the Cα^A5 atom in list 402. Accordingly, the value assigned to voxel (1, 1) is distance 512. In another example, the Cα^A4 atom is the nearest C_α atom to the center of voxel (1, 2). Accordingly, the value assigned to voxel (1, 2) is the distance between the center of voxel (1, 2) and the Cα^A4 atom. In yet another example, the Cα^A6 atom is the nearest C_α atom to the center of voxel (2, 1). Accordingly, the value assigned to voxel (2, 1) is the distance between the center of voxel (2, 1) and the Cα^A6 atom. In yet another example, the Cα^A6 atom is also the nearest C_α atom to the centers of voxels (3, 2) and (3, 3). Accordingly, the value assigned to voxel (3, 2) is the distance between the center of voxel (3, 2) and the Cα^A6 atom, and the value assigned to voxel (3, 3) is the distance between the center of voxel (3, 3) and the Cα^A6 atom. In some implementations, the distance values assigned to voxels 514 may be normalized distances. For example, the distance value assigned to voxel (1, 1) may be distance 512 divided by maximum distance 502 (the predefined maximum scan radius). In some implementations, the nearest-atom distance may be a Euclidean distance and may be normalized by dividing the Euclidean distance by a maximum nearest-atom distance (e.g., maximum distance 502).

전술된 바와 같이, 알파-탄소 원자를 갖는 아미노산의 경우, 거리는 상응하는 복셀 중심으로부터 상응하는 아미노산의 가장 가까운 알파-탄소 원자까지의 가장 가까운 알파-탄소 원자 거리일 수 있다. 추가적으로, 베타-탄소 원자를 갖는 아미노산의 경우, 거리는 상응하는 복셀 중심으로부터 상응하는 아미노산의 가장 가까운 베타-탄소 원자까지의 가장 가까운 베타-탄소 원자 거리일 수 있다. 유사하게, 백본 원자를 갖는 아미노산의 경우, 거리는 상응하는 복셀 중심으로부터 상응하는 아미노산의 가장 가까운 백본 원자까지의 가장 가까운 백본 원자 거리일 수 있다. 유사하게, 측쇄 원자를 갖는 아미노산의 경우, 거리는 상응하는 복셀 중심으로부터 상응하는 아미노산의 가장 가까운 측쇄 원자까지의 가장 가까운 측쇄 원자 거리일 수 있다. 일부 구현예에서, 거리는 추가적으로/대안적으로, 두 번째, 세 번째, 네 번째 가장 가까운 원자까지의 거리 등을 포함할 수 있다.As described above, for amino acids with alpha-carbon atoms, the distance may be the distance of the nearest alpha-carbon atom from the center of the corresponding voxel to the nearest alpha-carbon atom of the corresponding amino acid. Additionally, for amino acids with beta-carbon atoms, the distance may be the nearest beta-carbon atom distance from the center of the corresponding voxel to the nearest beta-carbon atom of the corresponding amino acid. Similarly, for amino acids with backbone atoms, the distance may be the nearest backbone atom distance from the center of the corresponding voxel to the nearest backbone atom of the corresponding amino acid. Similarly, for amino acids with side chain atoms, the distance may be the distance of the nearest side chain atom from the center of the corresponding voxel to the nearest side chain atom of the corresponding amino acid. In some embodiments, the distance may additionally/alternatively include the distance to the second, third, fourth nearest atom, etc.
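
The option of also recording the second, third, and further nearest atoms can be sketched by sorting the per-voxel distances. Padding with infinity when a category has fewer than k atoms is an assumption of this sketch, as are the function name and the toy coordinates.

```python
import numpy as np

def k_nearest_distances(voxel_xyz, atom_xyz, k=3):
    """Distances from each voxel center to its k nearest atoms, ascending.

    Returns an array of shape (n_voxels, k). If fewer than k atoms exist,
    the remaining entries are padded with infinity (an assumption here).
    """
    voxels = np.asarray(voxel_xyz, dtype=float).reshape(-1, 3)
    atoms = np.asarray(atom_xyz, dtype=float).reshape(-1, 3)
    pairwise = np.linalg.norm(voxels[:, None, :] - atoms[None, :, :], axis=-1)
    pairwise.sort(axis=1)  # ascending distances per voxel
    out = np.full((voxels.shape[0], k), np.inf)
    m = min(k, atoms.shape[0])
    out[:, :m] = pairwise[:, :m]
    return out

# One voxel at the origin and three atoms at 3 Å, 1 Å, and 2 Å along x.
voxel = np.array([[0.0, 0.0, 0.0]])
atoms = np.array([[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
nearest_three = k_nearest_distances(voxel, atoms, k=3)
nearest_four = k_nearest_distances(voxel, atoms, k=4)
```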

아미노산별 거리 채널Distance channels for each amino acid

도 6은 21개의 아미노산별 거리 채널(600)의 일례를 도시한다. 도 6의 각각의 열은 21개의 아미노산별 거리 채널(602 내지 642) 중 각자의 것에 상응한다. 각각의 아미노산별 거리 채널은 복셀 그리드(522)의 복셀(514) 각각에 대한 거리 값을 포함한다. 예를 들어, 알라닌(A)에 대한 아미노산별 거리 채널(602)은 복셀 그리드(522)의 복셀(514) 중 각자의 것에 대한 거리 값을 포함한다. 위에서 언급된 바와 같이, 복셀 그리드(522)는 체적 3×3×3의 3D 그리드이고, 27개의 복셀을 포함한다. 마찬가지로, 도 6이 2개의 차원으로 복셀(514)(예컨대, 3×3 그리드의 9개의 복셀)을 도시하지만, 각각의 아미노산별 거리 채널은 3×3×3 복셀 그리드에 대한 27개의 복셀별 거리 값을 포함할 수 있다.FIG. 6 shows an example of the 21 amino acid-wise distance channels 600. Each column in FIG. 6 corresponds to a respective one of the 21 amino acid-wise distance channels 602 to 642. Each amino acid-wise distance channel includes a distance value for each of the voxels 514 of voxel grid 522. For example, the amino acid-wise distance channel 602 for alanine (A) includes distance values for respective ones of the voxels 514 of voxel grid 522. As mentioned above, voxel grid 522 is a 3D grid of volume 3×3×3 and contains 27 voxels. Likewise, although FIG. 6 shows the voxels 514 in two dimensions (e.g., the nine voxels of a 3×3 grid), each amino acid-wise distance channel can include 27 voxel-wise distance values for the 3×3×3 voxel grid.

방향성 인코딩Directional encoding

일부 구현예에서, 개시된 기술은 방향성 파라미터를 사용하여, 기준 아미노산 서열(202) 내의 기준 아미노산의 방향성을 특정한다. 일부 구현예에서, 개시된 기술은 방향성 파라미터를 사용하여, 대안 아미노산 서열(212) 내의 대안 아미노산의 방향성을 특정한다. 일부 구현예에서, 개시된 기술은 방향성 파라미터를 사용하여, 아미노산 수준에서 표적 변이체를 경험하는 단백질(200)의 위치를 특정한다.In some embodiments, the disclosed techniques use an orientation parameter to specify the orientation of a reference amino acid within the reference amino acid sequence 202. In some embodiments, the disclosed techniques use orientation parameters to specify the orientation of alternative amino acids within alternative amino acid sequence 212. In some embodiments, the disclosed technology uses orientation parameters to specify the location of a protein 200 that experiences a target variant at the amino acid level.

위에서 논의된 바와 같이, 21개의 아미노산 거리 채널(602 내지 642)의 모든 거리 값은 각자의 가장 가까운 원자로부터 복셀 그리드(522) 내의 복셀(514)까지 측정된다. 이러한 가장 가까운 원자는 기준 아미노산 서열(202) 내의 기준 아미노산 중 하나로부터 유래한다. 가장 가까운 원자를 함유하는 이러한 유래하는 기준 아미노산은 2개의 카테고리로 분류될 수 있다: (1) 기준 아미노산 서열(202) 내의 변이체 경험 기준 아미노산(204)에 선행하는 그러한 유래하는 기준 아미노산 및 (2) 기준 아미노산 서열(202) 내의 변이체 경험 기준 아미노산(204)에 후행하는 그러한 유래하는 기준 아미노산. 제1 카테고리 내의 유래하는 기준 아미노산은 선행 기준 아미노산으로 불릴 수 있다. 제2 카테고리 내의 유래하는 기준 아미노산은 후행 기준 아미노산으로 불릴 수 있다.As discussed above, all distance values in the 21 amino acid-wise distance channels 602 to 642 are measured from the respective nearest atoms to the voxels 514 in voxel grid 522. Each of these nearest atoms originates from one of the reference amino acids in the reference amino acid sequence 202. These originating reference amino acids, which contain the nearest atoms, can be classified into two categories: (1) those originating reference amino acids that precede the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202, and (2) those originating reference amino acids that follow the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202. The originating reference amino acids in the first category can be called preceding reference amino acids. The originating reference amino acids in the second category can be called following reference amino acids.

방향성 파라미터는 선행 기준 아미노산으로부터 유래하는 그러한 가장 가까운 원자로부터 측정되는 21개의 아미노산별 거리 채널(602 내지 642)에서 그러한 거리 값에 적용된다. 하나의 구현예에서, 방향성 파라미터는 그러한 거리 값과 곱해진다. 방향성 파라미터는 임의의 수, 예컨대 -1일 수 있다.Directionality parameters are applied to those distance values in the 21 amino acid-specific distance channels 602 to 642 measured from those nearest atoms originating from the preceding reference amino acid. In one implementation, the directionality parameter is multiplied by that distance value. The directionality parameter can be any number, such as -1.
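
Applying the directionality parameter can be sketched as a sign flip on the distances whose nearest atom comes from a preceding reference amino acid. The function name, the per-atom sequence-position array, and the toy coordinates are hypothetical; the value -1 follows the example given in the text.

```python
import numpy as np

def signed_distance_channel(voxel_xyz, atom_xyz, atom_positions,
                            variant_position, directionality=-1.0):
    """Nearest-atom distances with the directionality parameter applied.

    atom_positions: sequence position of the amino acid each atom belongs to.
    Distances whose nearest atom originates from an amino acid preceding
    `variant_position` are multiplied by `directionality`.
    """
    voxels = np.asarray(voxel_xyz, dtype=float).reshape(-1, 3)
    atoms = np.asarray(atom_xyz, dtype=float).reshape(-1, 3)
    positions = np.asarray(atom_positions)
    pairwise = np.linalg.norm(voxels[:, None, :] - atoms[None, :, :], axis=-1)
    nearest = pairwise.argmin(axis=1)
    dist = pairwise[np.arange(voxels.shape[0]), nearest]
    precedes = positions[nearest] < variant_position
    return np.where(precedes, directionality * dist, dist)

# Two voxels; one atom from position 5 (preceding) and one from position 20 (following).
voxels = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
atoms = np.array([[1.0, 0.0, 0.0], [12.0, 0.0, 0.0]])
signed = signed_distance_channel(voxels, atoms, [5, 20], variant_position=16)
```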

As a result of applying the directionality parameter, the 21 amino acid-wise distance channels 600 include some distance values that indicate to the pathogenicity classifier which end of the protein 200 is the start terminal and which end is the end terminal. This also allows the pathogenicity classifier to reconstruct the protein sequence from the 3D protein structure information supplied by the distance channels and the reference and allele channels.
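The sign convention described above can be illustrated with a minimal sketch. It assumes that, for each voxel, the sequence position of the residue owning the nearest atom is already known; the function name `apply_directionality`, its parameters, and the sample values are hypothetical and are not taken from the disclosure.

```python
def apply_directionality(distances, nearest_positions, variant_position,
                         directionality=-1.0):
    """Scale distances whose nearest atom belongs to a preceding residue.

    distances: distance values of one channel, one per voxel.
    nearest_positions: sequence position of the residue that owns the
        nearest atom for each voxel (same order as distances).
    variant_position: sequence position of the variant-experiencing residue.
    """
    return [
        d * directionality if pos < variant_position else d
        for d, pos in zip(distances, nearest_positions)
    ]


# Hypothetical distances for four voxels and the positions of the residues
# that own their nearest atoms; the variant is assumed to sit at position 30.
distances = [1.5, 2.0, 3.25, 0.5]
nearest_positions = [12, 40, 7, 31]
signed = apply_directionality(distances, nearest_positions, variant_position=30)
```

With a parameter of -1, distances that come from preceding residues become negative while the others stay unchanged, which is one way the classifier could tell the two sides of the sequence apart.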

Distance Channel Tensor

FIG. 7 is a schematic diagram of a distance channel tensor 700. The distance channel tensor 700 is a voxelized representation of the amino acid-wise distance channels 600 from FIG. 6. In the distance channel tensor 700, the 21 amino acid-wise distance channels 602-642 are concatenated voxel-wise, like the RGB channels of a color image. The voxelized dimensionality of the distance channel tensor 700 is 21×3×3×3, where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels; FIG. 7, however, is a 2D depiction of dimensionality 21×3×3.

One-Hot Encodings

FIG. 8 shows one-hot encodings 800 of the reference amino acid 204 and the alternative amino acid 214. In FIG. 8, the left column is a one-hot encoding 802 of the reference amino acid glycine (G) 204, with a one for the glycine amino acid category and zeros for all other amino acid categories. The right column is a one-hot encoding 804 of the variant/alternative amino acid alanine (A) 214, with a one for the alanine amino acid category and zeros for all other amino acid categories.

FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid 902 and a voxelized one-hot encoded variant/alternative amino acid 912. The voxelized one-hot encoded reference amino acid 902 is a voxelized representation of the one-hot encoding 802 of the reference amino acid glycine (G) 204 from FIG. 8. The voxelized one-hot encoded alternative amino acid 912 is a voxelized representation of the one-hot encoding 804 of the variant/alternative amino acid alanine (A) 214 from FIG. 8. The voxelized dimensionality of each of the voxelized one-hot encoded reference amino acid 902 and the voxelized one-hot encoded alternative amino acid 912 is 21×1×1×1, where 21 denotes the 21 amino acid categories; FIG. 9, however, is a 2D depiction of dimensionality 21×1×1.
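The one-hot encodings of FIG. 8 can be sketched as follows. The ordering of the 21 categories is an assumption made only for illustration (the 20 standard amino acids in alphabetical one-letter order plus one extra placeholder category `"X"`); the text above states only that there are 21 categories.

```python
# Assumed category order: 20 standard amino acids plus one extra category.
# Both the order and the "X" placeholder are hypothetical.
AMINO_ACID_CATEGORIES = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]


def one_hot(amino_acid):
    """Return a 21-element vector with a one at the amino acid's category."""
    if amino_acid not in AMINO_ACID_CATEGORIES:
        raise ValueError(f"unknown amino acid category: {amino_acid!r}")
    return [1 if category == amino_acid else 0
            for category in AMINO_ACID_CATEGORIES]


ref_encoding = one_hot("G")  # reference amino acid glycine
alt_encoding = one_hot("A")  # alternative amino acid alanine
```

Each vector has length 21 and contains exactly one nonzero entry, matching the 21×1×1×1 voxelized dimensionality once it is treated as a single-voxel tensor.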

Reference Allele Tensor

FIG. 10 schematically illustrates a concatenation process 1000 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7 with a reference allele tensor 1004. The reference allele tensor 1004 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded reference amino acid 902 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded reference amino acid 902 are voxel-wise concatenated with each other in accordance with the spatial arrangement of the voxels 514 in the voxel grid 522, so that the reference allele tensor 1004 has a corresponding copy of the voxelized one-hot encoded reference amino acid 902 for each of the voxels 514 in the voxel grid 522.

The concatenation process 1000 produces a concatenated tensor 1010. The voxelized dimensionality of the reference allele tensor 1004 is 21×3×3×3, where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels; FIG. 10, however, is a 2D depiction of the reference allele tensor 1004 with dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1010 is 42×3×3×3; FIG. 10, however, is a 2D depiction of the concatenated tensor 1010 with dimensionality 42×3×3.

Alternative Allele Tensor

FIG. 11 schematically illustrates a concatenation process 1100 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7, the reference allele tensor 1004 of FIG. 10, and an alternative allele tensor 1104. The alternative allele tensor 1104 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded alternative amino acid 912 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded alternative amino acid 912 are voxel-wise concatenated with each other in accordance with the spatial arrangement of the voxels 514 in the voxel grid 522, so that the alternative allele tensor 1104 has a corresponding copy of the voxelized one-hot encoded alternative amino acid 912 for each of the voxels 514 in the voxel grid 522.

The concatenation process 1100 produces a concatenated tensor 1110. The voxelized dimensionality of the alternative allele tensor 1104 is 21×3×3×3, where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels; FIG. 11, however, is a 2D depiction of the alternative allele tensor 1104 with dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1110 is 63×3×3×3; FIG. 11, however, is a 2D depiction of the concatenated tensor 1110 with dimensionality 63×3×3.
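The replication and channel-wise concatenation described above can be sketched with plain nested lists indexed as `[channel][x][y][z]`. The helper names, the placeholder distance values, and the category indices (5 for glycine, 0 for alanine) are assumptions for illustration only.

```python
NUM_CATEGORIES = 21
GRID_SIZE = 3


def broadcast_to_grid(channel_values, size=GRID_SIZE):
    """Copy one value per channel to every voxel of a size^3 grid."""
    return [
        [[[value for _ in range(size)] for _ in range(size)]
         for _ in range(size)]
        for value in channel_values
    ]


def concatenate_channels(*tensors):
    """Concatenate tensors along the channel (first) axis."""
    combined = []
    for tensor in tensors:
        combined.extend(tensor)
    return combined


def shape(tensor):
    """Return (channels, x, y, z) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]),
            len(tensor[0][0][0]))


# Placeholder distance tensor (all zeros) standing in for tensor 700.
distance_tensor = broadcast_to_grid([0.0] * NUM_CATEGORIES)
# One-hot vectors with assumed category indices: 5 for G, 0 for A.
ref_one_hot = [1 if i == 5 else 0 for i in range(NUM_CATEGORIES)]
alt_one_hot = [1 if i == 0 else 0 for i in range(NUM_CATEGORIES)]
ref_tensor = broadcast_to_grid(ref_one_hot)  # one copy per voxel
alt_tensor = broadcast_to_grid(alt_one_hot)  # one copy per voxel
concatenated = concatenate_channels(distance_tensor, ref_tensor, alt_tensor)
```

Here `shape(concatenated)` is `(63, 3, 3, 3)`: three blocks of 21 channels stacked over the same 27 voxels, analogous to stacking extra color planes onto an image.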

In some implementations, the runtime logic 184 processes the concatenated tensor 1110 through the pathogenicity classifier to determine the pathogenicity of the variant/alternative amino acid alanine (A) 214, which is in turn inferred as the pathogenicity determination for the underlying nucleotide variant that creates the variant/alternative amino acid alanine (A) 214.

Evolutionary Conservation Channels

Predicting the functional consequences of variants relies, at least in part, on the assumption that amino acids critical to a protein family are conserved through evolution due to negative selection (i.e., amino acid changes at these sites were deleterious in the past), and that mutations at these sites have an increased likelihood of being pathogenic (disease-causing) in humans. In general, homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of the different amino acids observed at the target position in the alignment.

Accordingly, the disclosed technology concatenates the distance channel tensor 700, the reference allele tensor 1004, and the alternative allele tensor 1104 with evolutionary channels. One example of the evolutionary channels is pan-amino acid conservation frequencies. Another example of the evolutionary channels is per-amino acid conservation frequencies.

In some implementations, the evolutionary channels are constructed using position weight matrices (PWMs). In other implementations, the evolutionary channels are constructed using position-specific frequency matrices (PSFMs). In yet other implementations, the evolutionary channels are constructed using computational tools such as SIFT, PolyPhen, and PANTHER-PSEC. In yet other implementations, the evolutionary channels are preservation channels based on evolutionary preservation. Preservation is related to conservation, as it also reflects the effect of negative selection that has acted to prevent evolutionary change at a given site in a protein.

Pan-Amino Acid Evolutionary Profiles

FIG. 12 is a flowchart illustrating a process 1200 of a system for determining pan-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing), in accordance with one implementation of the disclosed technology. FIGS. 12, 13, 14, 15, 16, 17, and 18 are discussed in tandem.

At step 1202, a similar sequence finder 1204 of the system retrieves amino acid sequences that are similar (homologous) to the reference amino acid sequence 202. The similar amino acid sequences can be selected from multiple species, such as primates, mammals, and vertebrates.

At step 1212, an aligner 1214 of the system position-wise aligns the reference amino acid sequence 202 with the similar amino acid sequences, i.e., the aligner 1214 performs a multiple sequence alignment. FIG. 14 shows an example multiple sequence alignment 1400 of the reference amino acid sequence 202 across 99 species. In some implementations, the multiple sequence alignment 1400 can be partitioned to generate, for example, a first position frequency matrix 1402 for primates, a second position frequency matrix 1412 for mammals, and a third position frequency matrix 1422 for vertebrates. In other implementations, a single position frequency matrix is generated across the 99 species.

At step 1222, a pan-amino acid conservation frequency calculator 1224 of the system uses the multiple sequence alignment to determine pan-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1232, a nearest atom finder 1234 of the system finds the nearest atoms to the voxels 514 in the voxel grid 522. In some implementations, the search for the voxel-wise nearest atoms is not confined to any particular amino acid category or atom type. That is, the voxel-wise nearest atoms can be selected across amino acid categories and atom types, as long as they are the most proximate atoms to the respective voxel centers. In other implementations, the search for the voxel-wise nearest atoms can be confined to a particular atom category, for example only to particular atomic elements such as oxygen, nitrogen, and hydrogen, or only to alpha-carbon atoms, or only to beta-carbon atoms, or only to side-chain atoms, or only to backbone atoms.

At step 1242, an amino acid selector 1244 of the system selects those reference amino acids in the reference amino acid sequence 202 that contain the nearest atoms identified at step 1232. Such reference amino acids can be called nearest reference amino acids. FIG. 13 shows an example of locating the nearest atoms 1302 to the voxels 514 in the voxel grid 522 and mapping each voxel to the nearest reference amino acid 1312 that contains its nearest atom. This is identified in FIG. 13 as "voxel-to-nearest amino acid mapping 1300."
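Steps 1232 and 1242 amount to a nearest-neighbor lookup followed by a lookup of the owning residue. The sketch below assumes atoms are given as `(residue_position, (x, y, z))` pairs and uses a brute-force search; the function name, data layout, and coordinates are hypothetical.

```python
import math


def nearest_residue_per_voxel(voxel_centers, atoms):
    """Map each voxel center to the residue that owns its nearest atom.

    voxel_centers: list of (x, y, z) tuples.
    atoms: list of (residue_position, (x, y, z)) tuples.
    Returns one residue position per voxel center, in the same order.
    """
    mapping = []
    for center in voxel_centers:
        position, _ = min(atoms, key=lambda atom: math.dist(center, atom[1]))
        mapping.append(position)
    return mapping


# Hypothetical coordinates for two voxel centers and three atoms.
voxel_centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atoms = [
    (15, (0.5, 0.0, 0.0)),
    (16, (2.5, 0.5, 0.0)),
    (40, (5.0, 5.0, 5.0)),
]
voxel_to_residue = nearest_residue_per_voxel(voxel_centers, atoms)
```

Restricting the search to a particular atom category, as in the alternative implementations above, would only require filtering `atoms` before the lookup.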

At step 1252, a voxelizer 1254 of the system voxelizes the pan-amino acid conservation frequencies of the nearest reference amino acids. FIG. 15 shows an example of determining the pan-amino acid conservation frequency sequence for a first voxel (1, 1) in the voxel grid 522, also referred to herein as "per-voxel evolutionary profile determination 1500."

Turning to FIG. 13, the nearest reference amino acid that was mapped to the first voxel (1, 1) is the aspartic acid (D) amino acid at position 15 in the reference amino acid sequence 202. Then, the multiple sequence alignment of the reference amino acid sequence 202 with, for example, 99 homologous amino acid sequences from 99 species is analyzed at position 15. Such a position-specific, cross-species analysis reveals how many instances of amino acids from each of the 21 amino acid categories are found at position 15 across the 100 aligned amino acid sequences (i.e., the reference amino acid sequence 202 plus the 99 homologous amino acid sequences).

In the example illustrated in FIG. 15, the aspartic acid (D) amino acid is found at position 15 in 96 out of the 100 aligned amino acid sequences. Accordingly, the aspartic acid amino acid category 1504 is assigned a pan-amino acid conservation frequency of 0.96. Similarly, in the illustrated example, the valine (V) amino acid is found at position 15 in 4 out of the 100 aligned amino acid sequences. Accordingly, the valine amino acid category 1514 is assigned a pan-amino acid conservation frequency of 0.04. Since no instances of amino acids from the other amino acid categories are detected at position 15, the remaining amino acid categories are assigned a pan-amino acid conservation frequency of zero. In this manner, each of the 21 amino acid categories is assigned a respective pan-amino acid conservation frequency, which can be encoded in the pan-amino acid conservation frequency sequence 1502 for the first voxel (1, 1).
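The counting in this example can be reproduced with a short sketch that tallies one alignment column. The category ordering (20 standard amino acids plus a placeholder `"X"`) and the function name are assumptions; only the counts 96 and 4 out of 100 come from the example above.

```python
from collections import Counter

# Assumed category order; the "X" placeholder is hypothetical.
AMINO_ACID_CATEGORIES = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]


def pan_amino_acid_frequencies(alignment_column):
    """Return the fraction of aligned sequences showing each category.

    alignment_column: the amino acids observed at one alignment position,
    one entry per aligned sequence (reference plus homologs).
    """
    counts = Counter(alignment_column)
    total = len(alignment_column)
    return [counts[category] / total for category in AMINO_ACID_CATEGORIES]


# Position 15 in the example: 96 sequences carry D and 4 carry V.
column_15 = ["D"] * 96 + ["V"] * 4
frequencies = pan_amino_acid_frequencies(column_15)
d_index = AMINO_ACID_CATEGORIES.index("D")
v_index = AMINO_ACID_CATEGORIES.index("V")
```

The resulting 21-element list holds 0.96 in the aspartic acid slot, 0.04 in the valine slot, and zero everywhere else. The sketch uses unweighted counts, whereas the text also mentions weighted frequencies as a possible conservation metric.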

FIG. 16 shows the respective pan-amino acid conservation frequencies 1612-1692 determined for respective ones of the voxels 514 in the voxel grid 522 using the position frequency logic described in FIG. 15, also referred to herein as "voxel-to-evolutionary profile mapping 1600."

The per-voxel evolutionary profiles 1602 are then used by the voxelizer 1254 to generate the voxelized per-voxel evolutionary profiles 1700 shown in FIG. 17. Often, each of the voxels 514 in the voxel grid 522 has a different pan-amino acid conservation frequency sequence, and therefore a different voxelized per-voxel evolutionary profile, because the voxels are regularly mapped to different nearest atoms and thus to different nearest reference amino acids. Of course, when two or more voxels have the same nearest atom and thereby the same nearest reference amino acid, the same pan-amino acid conservation frequency sequence and the same voxelized per-voxel evolutionary profile are assigned to each of those voxels.

FIG. 18 shows an example of an evolutionary profile tensor 1800 in which the voxelized per-voxel evolutionary profiles 1700 are voxel-wise concatenated with each other in accordance with the spatial arrangement of the voxels 514 in the voxel grid 522. The voxelized dimensionality of the evolutionary profile tensor 1800 is 21×3×3×3, where 21 denotes the 21 amino acid categories and 3×3×3 denotes the 3D voxel grid with 27 voxels; FIG. 18, however, is a 2D depiction of the evolutionary profile tensor 1800 with dimensionality 21×3×3.
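Assembling per-voxel profiles into a 21×3×3×3 tensor can be sketched as follows, assuming a precomputed mapping from voxel indices to sequence positions and a 21-element frequency profile per position. All names, the category indices (2 for D, 17 for V, 5 for G), and the sample mapping are hypothetical.

```python
from itertools import product

NUM_CATEGORIES = 21
GRID_SIZE = 3


def build_profile_tensor(voxel_to_position, profiles_by_position):
    """Place each voxel's 21-element profile along the channel axis.

    voxel_to_position: {(x, y, z): sequence position of nearest residue}.
    profiles_by_position: {position: list of 21 conservation frequencies}.
    Returns a nested list indexed as [channel][x][y][z].
    """
    tensor = [
        [[[0.0] * GRID_SIZE for _ in range(GRID_SIZE)]
         for _ in range(GRID_SIZE)]
        for _ in range(NUM_CATEGORIES)
    ]
    for (x, y, z), position in voxel_to_position.items():
        profile = profiles_by_position[position]
        for channel in range(NUM_CATEGORIES):
            tensor[channel][x][y][z] = profile[channel]
    return tensor


# Two hypothetical profiles: position 15 (D=0.96, V=0.04), position 16 (G=1.0).
profile_15 = [0.0] * NUM_CATEGORIES
profile_15[2], profile_15[17] = 0.96, 0.04
profile_16 = [0.0] * NUM_CATEGORIES
profile_16[5] = 1.0
profiles = {15: profile_15, 16: profile_16}

# Every voxel maps to position 15 except the corner voxel (0, 0, 0).
voxel_to_position = {v: 15 for v in product(range(GRID_SIZE), repeat=3)}
voxel_to_position[(0, 0, 0)] = 16
profile_tensor = build_profile_tensor(voxel_to_position, profiles)
```

Voxels that share a nearest residue end up with identical channel vectors, which mirrors the observation above about voxels having the same nearest atom.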

At step 1262, the concatenator 174 voxel-wise concatenates the evolutionary profile tensor 1800 with the distance channel tensor 700. In some implementations, the evolutionary profile tensor 1800 is voxel-wise concatenated with the concatenated tensor 1110 to produce a further concatenated tensor (not shown) of dimensionality 84×3×3×3.

At step 1272, the runtime logic 184 processes the further concatenated tensor of dimensionality 84×3×3×3 through the pathogenicity classifier to determine the pathogenicity of the target variant, which is in turn inferred as the pathogenicity determination for the underlying nucleotide variant that creates the target variant at the amino acid level.

Per-Amino Acid Evolutionary Profiles

FIG. 19 is a flowchart illustrating a process 1900 of a system for determining per-amino acid conservation frequencies of nearest atoms and assigning them to voxels (voxelizing). In FIG. 19, steps 1202 and 1212 are the same as in FIG. 12.

At step 1922, a per-amino acid conservation frequency calculator 1924 of the system uses the multiple sequence alignment to determine per-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1932, a nearest atom finder 1934 of the system finds, for each of the voxels 514 in the voxel grid 522, 21 nearest atoms, one for each of the 21 amino acid categories. The 21 nearest atoms are different from each other because they are selected from different amino acid categories. This leads to the selection of 21 unique nearest reference amino acids for a particular voxel, which in turn leads to the generation of 21 unique position frequency matrices for the particular voxel, and ultimately to the determination of 21 unique per-amino acid conservation frequencies for the particular voxel.

At step 1942, an amino acid selector 1944 of the system selects, for each of the voxels 514 in the voxel grid 522, the 21 reference amino acids in the reference amino acid sequence 202 that contain the 21 nearest atoms identified at step 1932. Such reference amino acids can be called nearest reference amino acids.

At step 1952, a voxelizer 1954 of the system voxelizes the per-amino acid conservation frequencies of the 21 nearest reference amino acids identified for the particular voxel at step 1942. The 21 nearest reference amino acids are necessarily located at 21 different positions in the reference amino acid sequence 202 because they correspond to different underlying nearest atoms. Accordingly, for the particular voxel, 21 position frequency matrices can be generated for the 21 nearest reference amino acids. The 21 position frequency matrices can be generated across multiple species whose homologous amino acid sequences are position-wise aligned with the reference amino acid sequence 202, as discussed above with respect to FIGS. 12 to 15.

Then, using the 21 position frequency matrices, 21 position-specific conservation scores can be computed for the 21 nearest reference amino acids identified for the particular voxel. These 21 position-specific conservation scores form the per-amino acid conservation frequency sequence for the particular voxel, analogous to the pan-amino acid conservation frequency sequence 1502 in FIG. 15. The difference is that the sequence 1502 has many zero entries, whereas each element (feature) in a per-amino acid conservation frequency sequence has a value (e.g., a floating point number), because the 21 nearest reference amino acids across the 21 amino acid categories necessarily have different positions, which yield different position frequency matrices and thereby different per-amino acid conservation frequencies.

The above process is executed for each of the voxels 514 in the voxel grid 522, and the resulting voxel-wise per-amino acid conservation frequencies are voxelized, tensorized, concatenated, and processed for pathogenicity determination similarly to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.
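One possible reading of the per-amino acid procedure is sketched below for a single voxel: for each category, find the nearest atom belonging to a residue of that category, then look up how often that category is observed at the residue's position in the alignment. The interpretation of the conservation score as this frequency, the zero fallback for absent categories, and all names and values are assumptions rather than details taken from the disclosure.

```python
import math


def per_amino_acid_frequencies(voxel_center, atoms, frequency_by_position,
                               categories):
    """Return one conservation frequency per amino acid category.

    atoms: list of (category, residue_position, (x, y, z)) tuples.
    frequency_by_position: {position: {category: frequency}} derived from
        a multiple sequence alignment.
    categories: ordered list of amino acid categories.
    """
    result = []
    for category in categories:
        candidates = [atom for atom in atoms if atom[0] == category]
        if not candidates:
            # Assumption: categories with no atom available contribute 0.
            result.append(0.0)
            continue
        _, position, _ = min(
            candidates, key=lambda atom: math.dist(voxel_center, atom[2]))
        result.append(frequency_by_position[position].get(category, 0.0))
    return result


# Hypothetical data with three categories for brevity (21 in the text).
categories = ["D", "V", "G"]
atoms = [
    ("D", 15, (0.5, 0.0, 0.0)),
    ("D", 80, (4.0, 0.0, 0.0)),
    ("V", 22, (1.0, 1.0, 0.0)),
]
frequency_by_position = {
    15: {"D": 0.96, "V": 0.04},
    80: {"D": 0.5, "E": 0.5},
    22: {"V": 0.9, "I": 0.1},
}
voxel_profile = per_amino_acid_frequencies(
    (0.0, 0.0, 0.0), atoms, frequency_by_position, categories)
```

Because each category draws its value from a different sequence position, the resulting vector is typically dense, in contrast to the mostly zero pan-amino acid sequence.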

Annotation Channels

FIG. 20 shows various examples of voxelized annotation channels 2000 that are concatenated with the distance channel tensor 700. In some implementations, the voxelized annotation channels are one-hot indicators for different protein annotations, for example, whether an amino acid (residue) is part of a transmembrane region, a signal peptide, an active site, or any other binding site, or whether the residue is subject to posttranslational modifications, PathRatio (see Pei P, Zhang A: A Topological Measurement for Weighted Protein Interaction Network. CSB 2005, 268-278), and so on. Additional examples of the annotation channels can be found below in the Particular Implementations section and in the claims.

The voxelized annotation channels are arranged voxel-wise such that either the voxels share the same annotation sequence, like the voxelized reference allele and alternative allele sequences (e.g., annotation channels 2002, 2004, 2006), or the voxels have respective annotation sequences, like the voxelized per-voxel evolutionary profiles 1700 (e.g., annotation channels 2012, 2014, 2016, as indicated by the different colors).

The annotation channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similarly to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.

Structural Confidence Channels

The disclosed technology can also concatenate various voxelized structural confidence channels with the distance channel tensor 700. Some examples of the structural confidence channels include: GMQE score (provided by SwissModel); B-factor; temperature factor column of homology models (indicating how well a residue satisfies (physical) constraints in the protein structure); normalized number of template proteins that align at the residue nearest to the center of the voxel (alignments provided by HHpred; e.g., if the voxel is nearest to a residue at which 3 out of 6 template structures align, the feature has the value 3/6 = 0.5); minimum, maximum, and mean TM-scores; and predicted TM-scores of the template protein structures that align to the residue nearest to the voxel (continuing the example above, if the 3 aligning template structures have TM-scores of 0.5, 0.5, and 1.5, then the minimum is 0.5, the mean is 2.5/3 ≈ 0.83, and the maximum is 1.5). The TM-scores can be provided per protein template by HHpred. Additional examples of the structural confidence channels can be found below in the Particular Implementations section and in the claims.
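The template-based confidence features can be computed with simple summary statistics. The sketch below uses its own hypothetical TM-scores (0.4, 0.6, 0.8 for 3 aligning templates out of 6) rather than the values in the text; the function name, the dictionary keys, and the zero fallback for the no-template case are assumptions.

```python
def template_confidence_features(aligned_tm_scores, total_templates):
    """Summarize template support for the residue nearest to a voxel.

    aligned_tm_scores: TM-scores of the templates that align at the residue.
    total_templates: number of template structures considered overall.
    """
    if not aligned_tm_scores:
        # Assumption: with no aligning template, all features default to 0.
        return {"aligned_fraction": 0.0, "tm_min": 0.0,
                "tm_mean": 0.0, "tm_max": 0.0}
    return {
        "aligned_fraction": len(aligned_tm_scores) / total_templates,
        "tm_min": min(aligned_tm_scores),
        "tm_mean": sum(aligned_tm_scores) / len(aligned_tm_scores),
        "tm_max": max(aligned_tm_scores),
    }


# Hypothetical case: 3 of 6 templates align, with the TM-scores below.
features = template_confidence_features([0.4, 0.6, 0.8], total_templates=6)
```

Each of the four values could populate its own channel for the voxel, either shared across voxels or assigned per voxel depending on the arrangement chosen.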

The voxelized structural confidence channels are arranged voxel-wise such that either the voxels share the same structural confidence sequence, like the voxelized reference allele and alternative allele sequences, or the voxels have respective structural confidence sequences, like the voxelized per-voxel evolutionary profiles 1700.

The structural confidence channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similarly to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.

Pathogenicity Classifier

Figure 21 shows different combinations and permutations of input channels that can be provided as inputs 2102 to a pathogenicity classifier 2108 for a pathogenicity determination 2106 of a target variant. One of the inputs 2102 can be a distance channel 2104 generated by a distance channel generator 2272. Figure 22 shows different methods of calculating the distance channel 2104. In one implementation, the distance channel 2104 is generated based on distances 2202 between voxel centers and atoms across a plurality of atomic elements, irrespective of amino acid. In some implementations, the distances 2202 are normalized by a maximum scan radius to generate normalized distances 2202a. In another implementation, the distance channel 2104 is generated based on distances 2212 between voxel centers and alpha-carbon atoms on an amino acid-by-amino acid basis. In some implementations, the distances 2212 are normalized by the maximum scan radius to generate normalized distances 2212a. In yet another implementation, the distance channel 2104 is generated based on distances 2222 between voxel centers and beta-carbon atoms on an amino acid-by-amino acid basis. In some implementations, the distances 2222 are normalized by the maximum scan radius to generate normalized distances 2222a. In yet another implementation, the distance channel 2104 is generated based on distances 2232 between voxel centers and side chain atoms on an amino acid-by-amino acid basis. In some implementations, the distances 2232 are normalized by the maximum scan radius to generate normalized distances 2232a. In yet another implementation, the distance channel 2104 is generated based on distances 2242 between voxel centers and backbone atoms on an amino acid-by-amino acid basis. In some implementations, the distances 2242 are normalized by the maximum scan radius to generate normalized distances 2242a. In yet another implementation, the distance channel 2104 is generated based on distances 2252 (one feature) between voxel centers and their respective nearest atoms, irrespective of atom type and amino acid type. In yet another implementation, the distance channel 2104 is generated based on distances 2262 (one feature) between voxel centers and atoms from non-standard amino acids. In some implementations, the distance between a voxel and an atom is calculated based on polar coordinates of the voxel and the atom. The polar coordinates are parameterized by the angle between the voxel and the atom. In one implementation, this angle information is used to generate an angle channel for the voxel (i.e., independent of the distance channel). In some implementations, the angle between a nearest atom and a neighboring atom (e.g., a backbone atom) can be used as a feature that is encoded with the voxel.
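The nearest-atom distance and its normalization by the maximum scan radius can be illustrated with a minimal sketch. The function name, the example coordinates, and the choice to cap out-of-range distances at the scan radius (so that normalized values fall in [0, 1]) are assumptions made for illustration and are not taken from the disclosure.

```python
import math

def nearest_atom_distances(voxel_centers, atoms, max_scan_radius):
    """Return, for each voxel center, the distance to its nearest atom,
    divided by max_scan_radius. Distances beyond the scan radius are
    capped at the radius (an assumption of this sketch)."""
    normalized = []
    for center in voxel_centers:
        # math.inf is used when the atom list is empty
        nearest = min((math.dist(center, atom) for atom in atoms),
                      default=math.inf)
        normalized.append(min(nearest, max_scan_radius) / max_scan_radius)
    return normalized

# Hypothetical 3D coordinates for two voxel centers and two alpha-carbon atoms
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
alpha_carbons = [(3.0, 4.0, 0.0), (2.0, 0.0, 1.0)]
print(nearest_atom_distances(centers, alpha_carbons, max_scan_radius=6.0))
```

Restricting the `atoms` argument to one amino acid type, or to one atom class such as beta-carbon or backbone atoms, would yield the corresponding per-category channel under the same assumptions.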

Another one of the inputs 2102 can be a feature 2114 that indicates missing atoms within a specified radius.

Another one of the inputs 2102 can be a one-hot encoding 2124 of the reference amino acid. Another one of the inputs 2102 can be a one-hot encoding 2134 of the variant/alternative amino acid.
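A one-hot encoding of a reference or alternative amino acid can be sketched as follows. The 20-letter alphabet and its ordering are assumptions of this sketch; an implementation that tracks additional categories (for example a 21st category) would simply use a longer alphabet.

```python
# Assumed alphabet: the 20 standard amino acids in one-letter code
AMINO_ACIDS = tuple("ACDEFGHIKLMNPQRSTVWY")

def one_hot(residue):
    """Return a vector with 1.0 at the position of `residue` and 0.0 elsewhere."""
    if residue not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {residue!r}")
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

reference_encoding = one_hot("A")    # hypothetical reference amino acid
alternative_encoding = one_hot("V")  # hypothetical alternative amino acid
print(reference_encoding.index(1.0), alternative_encoding.index(1.0))
```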

Another one of the inputs 2102 can be an evolutionary channel 2144 generated by an evolutionary profile generator 2372, shown in Figure 23. In one implementation, the evolutionary channel 2144 can be generated based on pan-amino acid conservation frequencies 2302. In another implementation, the evolutionary channel 2144 can be generated based on pan-amino acid conservation frequencies 2312.

Another one of the inputs 2102 can be a feature 2154 that indicates missing residues or missing evolutionary profiles.

Another one of the inputs 2102 can be an annotation channel 2164 generated by an annotation generator 2472, shown in Figure 24. In one implementation, the annotation channel 2164 can be generated based on molecular processing annotations 2402. In another implementation, the annotation channel 2164 can be generated based on region annotations 2412. In yet another implementation, the annotation channel 2164 can be generated based on site annotations 2422. In yet another implementation, the annotation channel 2164 can be generated based on amino acid modification annotations 2432. In yet another implementation, the annotation channel 2164 can be generated based on secondary structure annotations 2442. In yet another implementation, the annotation channel 2164 can be generated based on experimental information annotations 2452.

Another one of the inputs 2102 can be a structure confidence channel 2174 generated by a structure confidence generator 2572, shown in Figure 25. In one implementation, the structure confidence 2174 can be generated based on global model quality estimation (GMQE) 2502. In another implementation, the structure confidence 2174 can be generated based on qualitative model energy analysis (QMEAN) scores 2512. In yet another implementation, the structure confidence 2174 can be generated based on temperature factors 2522. In yet another implementation, the structure confidence 2174 can be generated based on template modeling scores 2542. Examples of the template modeling scores 2542 include minimum template modeling scores 2542a, mean template modeling scores 2542b, and maximum template modeling scores 2542c.

A person skilled in the art will appreciate that any permutation and combination of the input channels can be concatenated into an input for processing through the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant. In some implementations, only a subset of the input channels is concatenated. The input channels can be concatenated in any order. In one implementation, the input channels are concatenated into a single tensor by a tensor generator (input encoder) 2110. This single tensor is then provided as input to the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant.
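The concatenation of channel groups into a single per-voxel feature vector can be sketched as below. The function name, the list-of-lists layout (one feature list per voxel), and the example values are assumptions for illustration; a practical implementation would typically use a tensor library rather than nested lists.

```python
def concatenate_channels(*channel_groups):
    """Concatenate channel groups voxel by voxel.

    Each group is a list holding one feature list per voxel. The result has
    one combined feature list per voxel, in the order the groups were given."""
    if not channel_groups:
        raise ValueError("at least one channel group is required")
    n_voxels = len(channel_groups[0])
    if any(len(group) != n_voxels for group in channel_groups):
        raise ValueError("all channel groups must cover the same voxels")
    return [
        [value for group in channel_groups for value in group[v]]
        for v in range(n_voxels)
    ]

# Hypothetical values for two voxels
distance = [[0.2, 0.5], [0.9, 0.1]]   # two distance features per voxel
evolution = [[0.7], [0.3]]            # one evolutionary feature per voxel
confidence = [[1.0], [0.8]]           # one structure confidence feature per voxel
print(concatenate_channels(distance, evolution, confidence))
# [[0.2, 0.5, 0.7, 1.0], [0.9, 0.1, 0.3, 0.8]]
```

Passing only some of the groups, or passing them in a different order, corresponds to the subset and ordering freedom described above.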

In one implementation, the pathogenicity classifier 2108 uses a convolutional neural network (CNN) with a plurality of convolution layers. In another implementation, the pathogenicity classifier 2108 uses a recurrent neural network (RNN) such as a long short-term memory network (LSTM), a bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU). In yet another implementation, the pathogenicity classifier 2108 uses both a CNN and an RNN. In yet another implementation, the pathogenicity classifier 2108 uses a graph convolutional neural network that models dependencies in graph-structured data. In yet another implementation, the pathogenicity classifier 2108 uses a variational autoencoder (VAE). In yet another implementation, the pathogenicity classifier 2108 uses a generative adversarial network (GAN). In yet another implementation, the pathogenicity classifier 2108 can also be a language model based on self-attention, such as the one implemented by Transformers and BERT.

In yet other implementations, the pathogenicity classifier 2108 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transposed convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatially separable convolutions, and deconvolutions. It can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). It can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (such as an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions such as rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid, and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, attention mechanisms, and Gaussian error linear units.

The pathogenicity classifier 2108 is trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the pathogenicity classifier 2108 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the pathogenicity classifier 2108 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad. In other implementations, the pathogenicity classifier 2108 can be trained by unsupervised learning, semi-supervised learning, self-learning, reinforcement learning, multitask learning, multimodal learning, transfer learning, knowledge distillation, and so on.

Figure 26 shows an example processing architecture 2600 of the pathogenicity classifier 2108, in accordance with one implementation of the disclosed technology. The processing architecture 2600 includes a cascade of processing modules 2606, 2610, 2614, 2618, 2622, 2626, 2630, 2634, 2638, and 2642, each of which can include 1D convolutions (1×1×1 CONV), 3D convolutions (3×3×3 CONV), ReLU non-linearity, and batch normalization (BN). Other examples of processing modules include fully connected (FC) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In Figure 26, "64" denotes the number of convolution filters applied by a particular processing module. In Figure 26, the size of the input voxel 2602 is 15×15×15×8. Figure 26 also shows the respective volume dimensionalities of the intermediate inputs 2604, 2608, 2612, 2616, 2620, 2624, 2628, 2632, 2636, and 2640 generated by the processing architecture 2600.

Figure 27 shows an example processing architecture 2700 of the pathogenicity classifier 2108, in accordance with one implementation of the disclosed technology. The processing architecture 2700 includes a cascade of processing modules 2708, 2714, 2720, 2726, 2732, 2738, 2744, 2750, 2756, 2762, 2768, 2774, and 2780, such as 1D convolutions (CONV 1D), 3D convolutions (CONV 3D), ReLU non-linearity, and batch normalization (BN). Other examples of processing modules include fully connected (dense) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In Figure 27, "64" and "32" denote the number of convolution filters applied by a particular processing module. In Figure 27, the size of the input voxel 2704 supplied by an input layer 2702 is 7×7×7×108. Figure 27 also shows the respective volume dimensionalities of the intermediate inputs 2710, 2716, 2722, 2728, 2734, 2740, 2746, 2752, 2758, 2764, 2770, 2776, and 2782 and the resulting intermediate outputs 2706, 2712, 2718, 2724, 2730, 2736, 2742, 2748, 2754, 2760, 2766, 2772, 2778, and 2784 generated by the processing architecture 2700.
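The exponentially normalized scores produced by a final softmax layer can be illustrated with a short sketch. The two-element logit list and the ordering (benign first, pathogenic second) are assumptions of this example; the sketch shows only the normalization, not the architectures of Figures 26 and 27.

```python
import math

def softmax(logits):
    """Exponentially normalize raw scores so that they sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the (benign, pathogenic) classes
benign_score, pathogenic_score = softmax([0.0, 0.0])
print(benign_score, pathogenic_score)  # 0.5 0.5
```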

A person skilled in the art will appreciate that other current and future artificial intelligence, machine learning, and deep learning models, datasets, and training techniques can be incorporated into the disclosed variant pathogenicity classifier without departing from the spirit of the disclosed technology.

Performance results as objective indicators of inventiveness and non-obviousness

The variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as "PrimateAI 3D." "PrimateAI" is a commonly owned and previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned U.S. Patent Application Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149 and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).

Figures 28, 29, 30, and 31A use PrimateAI as a benchmark model to demonstrate the classification superiority of PrimateAI 3D over PrimateAI. The performance results in Figures 28, 29, 30, 31A, and 31B are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets. PrimateAI 3D is trained on training sets that are different from the plurality of validation sets. PrimateAI 3D is trained on common human variants and variants from primates, which are used as the benign dataset, and on simulated variants based on trinucleotide contexts, which are used as the unlabeled or pseudo-pathogenic dataset.

New developmental delay disorders (new DDD) is one example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with DDD as benign. A similar labeling scheme is used with the autism spectrum disorder (ASD) validation set shown in Figures 31A and 31B.

BRCA1 is another example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants. A similar labeling scheme is used with the different validation sets for the TP53 gene, the TP53S3 gene and its variants, and other genes and their variants shown in Figures 31A and 31B.

Figure 28 identifies the performance of the benchmark PrimateAI model with blue horizontal bars and the performance of the disclosed PrimateAI 3D model with orange horizontal bars. Green horizontal bars depict pathogenicity predictions derived by combining the respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In the legend, "ens10" denotes an ensemble of 10 PrimateAI 3D models, each trained with a different seed training dataset and randomly initialized with different weights and biases. Also, "7×7×7×2" denotes the size of the voxel grid used to encode the input channels during training of the ensemble of 10 PrimateAI 3D models. For a given variant, the 10 PrimateAI 3D models in the ensemble generate 10 respective pathogenicity predictions, which are subsequently combined (e.g., by averaging) to generate a final pathogenicity prediction for the given variant. The same logic applies to ensembles of other sizes.
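Combining the per-model predictions of an ensemble by averaging can be sketched as follows. The function name and the example scores are hypothetical; averaging is only the example combination named in the text, and other combinations are possible.

```python
from statistics import fmean

def ensemble_prediction(member_scores):
    """Combine per-model pathogenicity scores into one score by averaging."""
    if not member_scores:
        raise ValueError("the ensemble must contain at least one score")
    return fmean(member_scores)

# Hypothetical scores from an ensemble of 10 models for one variant
scores = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.66, 0.75, 0.72]
print(ensemble_prediction(scores))
```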

Also in Figure 28, the y-axis has the different validation sets and the x-axis has p-values. Greater p-values, i.e., longer horizontal bars, denote greater accuracy in distinguishing benign variants from pathogenic variants. As demonstrated by the p-values in Figure 28, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549 validation set). That is, apart from that one validation set, the orange horizontal bars for PrimateAI 3D are consistently longer than the blue horizontal bars for PrimateAI.

Also in Figure 28, the "mean" category along the y-axis reports the mean of the p-values determined for each of the validation sets. In the mean category too, PrimateAI 3D outperforms PrimateAI.

In Figure 29, PrimateAI is represented by blue horizontal bars, an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 3×3×3 is represented by red horizontal bars, an ensemble of 10 PrimateAI 3D models trained with a voxel grid of size 7×7×7×2 is represented by purple horizontal bars, an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 7×7×7×2 is represented by brown horizontal bars, and an ensemble of 20 PrimateAI 3D models trained with a voxel grid of size 17×17×17×2 is represented by purple horizontal bars.

Also in Figure 29, the y-axis has the different validation sets and the x-axis has p-values. As before, greater p-values, i.e., longer horizontal bars, denote greater accuracy in distinguishing benign variants from pathogenic variants. As demonstrated by the p-values in Figure 29, the different configurations of PrimateAI 3D outperform PrimateAI across most of the validation sets. That is, the red, purple, brown, and pink horizontal bars for PrimateAI 3D are mostly longer than the blue horizontal bars for PrimateAI.

Also in Figure 29, the "mean" category along the y-axis reports the mean of the p-values determined for each of the validation sets. In the mean category too, the different configurations of PrimateAI 3D outperform PrimateAI.

In Figure 30, red vertical bars represent PrimateAI and cyan vertical bars represent PrimateAI 3D. In Figure 30, the y-axis has p-values and the x-axis has the different validation sets. In Figure 30, without exception, PrimateAI 3D outperforms PrimateAI across all of the validation sets. That is, the cyan vertical bars for PrimateAI 3D are always longer than the red vertical bars for PrimateAI.

Figures 31A and 31B identify the performance of the benchmark PrimateAI model with blue vertical bars and the performance of the disclosed PrimateAI 3D model with orange vertical bars. Green vertical bars depict pathogenicity predictions derived by combining the respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In Figures 31A and 31B, the y-axis has p-values and the x-axis has the different validation sets.

As demonstrated by the p-values in Figures 31A and 31B, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549_p53NULL_Nutlin-3 validation set). That is, apart from that one validation set, the orange vertical bars for PrimateAI 3D are consistently longer than the blue vertical bars for PrimateAI.

Also in Figures 31A and 31B, a separate "mean" chart reports the mean of the p-values determined for each of the validation sets. In the mean chart too, PrimateAI 3D outperforms PrimateAI.

The mean statistic can be biased by outliers. To address this, a separate "method rank" chart is also shown in Figures 31A and 31B. A higher rank number denotes poorer classification accuracy. In the method rank chart too, PrimateAI 3D outperforms PrimateAI: PrimateAI 3D has more counts of the lower ranks 1 and 2, whereas PrimateAI's ranks are all 3.
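A rank-based summary of this kind can be sketched as follows: within each validation set the methods are ranked by score (rank 1 is best), and the ranks are then counted per method. The function name, the method and set names, and the scores are hypothetical placeholders and do not reproduce any result from the figures.

```python
from collections import Counter

def rank_counts(scores_by_set):
    """Count, per method, how often it achieves each rank across validation sets.

    scores_by_set maps a validation set name to {method name: score}, where a
    higher score is better. Ties are broken by insertion order in this sketch."""
    counts = {}
    for scores in scores_by_set.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            counts.setdefault(method, Counter())[rank] += 1
    return counts

# Placeholder scores for three methods on two validation sets
example = {
    "set_1": {"method_a": 3.0, "method_b": 2.0, "method_c": 1.0},
    "set_2": {"method_a": 1.5, "method_b": 2.5, "method_c": 0.5},
}
for method, ranks in rank_counts(example).items():
    print(method, dict(sorted(ranks.items())))
```

Because each validation set contributes exactly one rank per method, a single outlier set cannot dominate the summary the way it can dominate a mean.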

It is also apparent from Figures 28, 29, 30, 31A, and 31B that combining PrimateAI 3D with PrimateAI produces superior classification accuracy. That is, a protein can be fed as an amino acid sequence to PrimateAI to generate a first output, the same protein can be fed as a 3D, voxelized protein structure to PrimateAI 3D to generate a second output, and the first and second outputs can be combined or analyzed in aggregate to produce a final pathogenicity prediction for a variant experienced by the protein.

Efficient voxelization

Figure 32 is a flowchart illustrating an efficient voxelization process 3200 that efficiently identifies nearest atoms on a voxel-by-voxel basis.

The discussion now revisits the distance channels. As discussed above, the reference amino acid sequence 202 can contain different types of atoms, such as alpha-carbon atoms, beta-carbon atoms, oxygen atoms, nitrogen atoms, hydrogen atoms, and so on. Accordingly, as discussed above, the distance channels can be arranged by nearest alpha-carbon atoms, nearest beta-carbon atoms, nearest oxygen atoms, nearest nitrogen atoms, nearest hydrogen atoms, and so on.

For example, in Figure 6, each of the nine voxels 514 has 21 amino acid-wise distance channels for nearest alpha-carbon atoms. Figure 6 can be further expanded so that each of the nine voxels 514 also has 21 amino acid-wise distance channels for nearest beta-carbon atoms, and so that each of the nine voxels 514 also has a nearest generic atom distance channel for the nearest atom irrespective of the type of the atom and the type of the amino acid. In this way, each of the nine voxels 514 can have 43 distance channels (21 + 21 + 1).

The discussion now turns to the number of distance calculations required to identify the nearest atoms on a voxel-by-voxel basis for inclusion in the distance channels. Consider the example in Figure 3, which depicts a total of 828 alpha-carbon atoms distributed across 21 amino acid categories. To calculate the amino acid-wise distance channels 602 to 642 in Figure 6, i.e., to determine 189 distance values (9 voxels × 21 channels), distances are measured from each of the nine voxels 514 to each of the 828 alpha-carbon atoms, resulting in 9 * 828 = 7,452 distance calculations. In the 3D case of 27 voxels, this results in 27 * 828 = 22,356 distance calculations. When the 828 beta-carbon atoms are also included, this number increases to 27 * 1,656 = 44,712 distance calculations.
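The counts above follow from multiplying the number of voxels by the number of atoms, which the following short check reproduces. The function name is chosen for illustration only.

```python
def distance_calculations(n_voxels, n_atoms):
    """Number of voxel-to-atom distances when every voxel is compared to every atom."""
    return n_voxels * n_atoms

print(distance_calculations(9, 828))        # 7452: 9 voxels, alpha-carbon atoms
print(distance_calculations(27, 828))       # 22356: 27 voxels (3x3x3), alpha-carbon atoms
print(distance_calculations(27, 2 * 828))   # 44712: alpha- and beta-carbon atoms
```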

This means that the runtime complexity of identifying the nearest atoms on a voxel-by-voxel basis for a single protein voxelization is O(#atoms * #voxels), as shown in Figure 35A. Furthermore, the runtime complexity for a single protein voxelization increases to O(#atoms * #voxels * #attributes) when the distance channels are calculated across a variety of attributes (e.g., different features or channels per voxel, such as annotation channels and structure confidence channels).

Consequently, the distance calculations can become the most computationally expensive part of the voxelization process, taking valuable compute resources away from critical runtime tasks such as model training and model inference. Consider, for example, model training with a training dataset of 7,000 proteins. Generating distance channels for a plurality of voxels across a plurality of amino acids, atoms, and attributes can involve more than 100 voxelizations per protein, resulting in about 800,000 voxelizations in a single training iteration (epoch). A training run of 20 to 40 epochs, with rotation of the atomic coordinates in each epoch, can result in as many as 32 million voxelizations.
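The upper figure quoted above follows directly from the per-epoch count, as the following arithmetic check shows. The constant and function names are illustrative, and the per-epoch figure is the approximate value stated in the text.

```python
VOXELIZATIONS_PER_EPOCH = 800_000  # approximate figure stated in the text

def total_voxelizations(epochs, per_epoch=VOXELIZATIONS_PER_EPOCH):
    """Voxelizations over a training run when every epoch re-voxelizes
    the rotated structures."""
    return epochs * per_epoch

for epochs in (20, 40):
    print(epochs, total_voxelizations(epochs))
# 20 16000000
# 40 32000000
```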

In addition to the high compute cost, the size of the data for 32 million voxelizations is too large to fit in main memory (e.g., >20 TB for a 15×15×15 voxel grid). Considering repeated training runs for parameter optimization and ensemble learning, the memory footprint of the voxelization process becomes too large to be stored even on disk, which makes the voxelization process a part of model training rather than a precomputation step.

The disclosed technology provides an efficient voxelization process that achieves up to an approximately 100x speedup over the runtime complexity of O(#atoms * #voxels). The disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms). In the case of different features or channels per voxel, the disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms * #attributes). As a result, the voxelization process becomes as fast as model training, shifting the computational bottleneck from voxelization back to computing neural network weights on processors such as GPUs, ASICs, TPUs, FPGAs, CGRAs, and the like.

In some implementations of the disclosed efficient voxelization process that involve large voxel grids, the runtime complexity for a single protein voxelization is O(#atoms + voxels), and O(#atoms * #attributes + voxels) in the case of different features or channels per voxel. The "+ voxels" complexity is observed when the number of atoms is minuscule compared to the number of voxels, for example, when there is one atom in a 100×100×100 voxel grid (i.e., one million voxels per atom). In such a scenario, the runtime is dominated by the overhead of the sheer number of voxels, for example, allocating memory for one million voxels, initializing one million voxels to zero, and so on.

The details of the disclosed efficient voxelization process are now discussed. Figures 32A, 32B, 33, 34, and 35B are discussed in tandem.

Starting with Figure 32A, at step 3202, each atom (e.g., each of the 828 alpha-carbon atoms and each of the 828 beta-carbon atoms) is associated with the voxel that contains the atom (e.g., one of the nine voxels 514). The term "contains" refers to the 3D atomic coordinates of the atom being located within the voxel. The voxel that contains the atom is also referred to herein as the "atom-containing voxel."

Figures 32B and 33 describe how the voxel that contains a particular atom is selected. Figure 33 uses 2D atomic coordinates as representative of 3D atomic coordinates. Note that the voxel grid 522 is regularly spaced, with each of the voxels 514 having the same step size (e.g., 1 angstrom (Å) or 2 Å).

Also in Figure 33, the voxel grid 522 has magenta indices [0, 1, 2] along a first dimension (e.g., the x-axis) and cyan indices [0, 1, 2] along a second dimension (e.g., the y-axis). Also in Figure 33, the respective voxels 514 among the voxels 512 are identified by green voxel indices [voxel 0, voxel 1, ..., voxel 8] and by black voxel center indices [(1, 1), (1, 2), ..., (3, 3)].

Also in Figure 33, the center coordinates of the voxel centers along the first dimension, i.e., the first-dimension voxel coordinates, are identified in orange. The center coordinates of the voxel centers along the second dimension, i.e., the second-dimension voxel coordinates, are identified in red.

First, at step 3202a (step 1 in Figure 33), the 3D atomic coordinates (1.7456, 2.14323) of a particular atom are quantized to generate quantized 3D atomic coordinates (1.7, 2.1). The quantization can be achieved by rounding or by truncation of bits.

Then, at step 3202b (step 2 in Figure 33), voxel coordinates (or voxel centers, or voxel center coordinates) of the voxels 514 are assigned to the quantized 3D atomic coordinates on a dimension-wise basis. For the first dimension, the quantized atomic coordinate 1.7 is assigned to voxel 1 because voxel 1 covers first-dimension voxel coordinates ranging from 1 to 2 and is centered at 1.5 in the first dimension. Note that voxel 1 has index 1 along the first dimension, in contrast to index 0 along the second dimension.

For the second dimension, starting from voxel 1, the voxel grid 522 is traversed along the second dimension. This results in the quantized atomic coordinate 2.1 being assigned to voxel 7 because voxel 7 covers second-dimension voxel coordinates ranging from 2 to 3 and is centered at 2.5 in the second dimension. Note that voxel 7 has index 2 along the second dimension, in contrast to index 1 along the first dimension.

Then, at step 3202c (step 3 in Figure 33), the dimension indices corresponding to the assigned voxel coordinates are selected. That is, for voxel 1, index 1 is selected along the first dimension, and for voxel 7, index 2 is selected along the second dimension. A person skilled in the art will appreciate that the above steps can be executed analogously for a third dimension to select a dimension index along the third dimension.

Then, at step 3202d (step 4 in Figure 33), an accumulated sum is generated based on position-wise weighting of the selected dimension indices by powers of a radix. The general idea behind positional numbering systems is that a numeric value is represented through increasing powers of a radix (or base); for example, binary is radix 2, ternary is radix 3, octal is radix 8, and hexadecimal is radix 16. This is often referred to as a weighted numbering system because each position is weighted by a power of the radix. The set of valid numerals for a positional numbering system is equal in size to the radix of that system. For example, there are ten numerals in the decimal system, 0 through 9, and three numerals in the ternary system, 0, 1, and 2. The largest valid numeral in a radix system is one smaller than the radix (so 8 is not a valid numeral in any radix system smaller than 9). Any decimal integer can be expressed exactly in any other integral base system, and vice versa.
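The positional weighting described above can be illustrated with a short sketch that converts an integer to its digits in a given radix and back. The helper names `to_digits` and `from_digits` are hypothetical and are not part of the disclosure.

```python
def to_digits(value, radix):
    """Return the digits of a non-negative integer, most significant first."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits[::-1]

def from_digits(digits, radix):
    """Sum each digit weighted by the radix raised to its position."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        total += digit * radix ** position
    return total

print(to_digits(5, 3))         # [1, 2]
print(from_digits([1, 2], 3))  # 5
```

The round trip is exact for any non-negative integer and any radix of at least 2, which is the property the last sentence of the paragraph relies on.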

Returning to the example in Figure 33, the selected dimension indices 1 and 2 are converted into a single integer by position-wise multiplying them with respective powers of radix 3 and summing the results of the position-wise multiplications. Radix 3 is selected here because the 3D atomic coordinates have three dimensions (although Figure 33 shows only 2D atomic coordinates along two dimensions for simplicity).

Since index 2 is positioned at the rightmost bit (i.e., the least significant bit), it is multiplied by 3 to the power of 0 to yield 2. Since index 1 is positioned at the second rightmost bit (i.e., the second least significant bit), it is multiplied by 3 to the power of 1 to yield 3. This results in an accumulated sum of 5.

Then, at step 3202e (step 5 in Figure 33), the voxel index of the voxel that contains the particular atom is selected based on the accumulated sum. That is, the accumulated sum is interpreted as the voxel index of the voxel that contains the particular atom.
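Steps 1 through 5 can be sketched as a single function. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `containing_voxel_index` and its parameters are hypothetical, quantization is done by truncation to one decimal place, the grid is assumed to start at coordinate 0, and the last dimension index is treated as the least significant digit, as in the worked arithmetic above. For the accumulated sum to identify a voxel uniquely, the radix must be at least the number of voxels per dimension, which holds in the 3×3 example.

```python
import math

def containing_voxel_index(coords, step=1.0, radix=3, decimals=1):
    """Return (quantized coordinates, per-dimension indices, accumulated sum)."""
    # Step 1: quantize the atomic coordinates (truncation is assumed here).
    scale = 10 ** decimals
    quantized = [math.floor(c * scale) / scale for c in coords]

    # Steps 2-3: per dimension, pick the index of the voxel whose range
    # covers the quantized coordinate (grid assumed to start at 0).
    indices = [int(q // step) for q in quantized]

    # Step 4: weight each index by a power of the radix and accumulate.
    # The last index ends up multiplied by radix**0.
    accumulated = 0
    for index in indices:
        accumulated = accumulated * radix + index

    # Step 5: the accumulated sum is read as the voxel index.
    return quantized, indices, accumulated

print(containing_voxel_index((1.7456, 2.14323)))
# ([1.7, 2.1], [1, 2], 5)
```

No distance is computed at any point; the containing voxel follows directly from integer division by the grid step.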

At step 3212, after each atom is associated with its atom-containing voxel, each atom is further associated with one or more voxels in the neighborhood of the atom-containing voxel, also referred to herein as "neighboring voxels." The neighboring voxels can be selected based on being within a predefined radius (e.g., 5 angstroms (Å)) of the atom-containing voxel. In other implementations, the neighboring voxels can be selected based on being contiguously adjacent to the atom-containing voxel (e.g., top, bottom, right, and left adjacent voxels). The resulting associations, which associate each atom with its atom-containing voxel and neighboring voxels, are encoded in an atom-to-voxel mapping 3402, also referred to herein as an element-to-cell mapping. In one example, a first alpha-carbon atom is associated with a first subset of voxels 3404 that includes the atom-containing voxel and the neighboring voxels for the first alpha-carbon atom. In another example, a second alpha-carbon atom is associated with a second subset of voxels 3406 that includes the atom-containing voxel and the neighboring voxels for the second alpha-carbon atom.

Note that no distance calculations are made to determine the atom-containing voxels and the neighboring voxels. The atom-containing voxel is selected by virtue of the spatial arrangement of the voxels, which allows the quantized 3D atomic coordinates to be assigned to the corresponding regularly spaced voxel centers in the voxel grid (without using any distance calculations). Likewise, the neighboring voxels are selected by virtue of being spatially adjacent to the atom-containing voxel in the voxel grid (again without using any distance calculations).
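One way to build such an atom-to-voxel mapping using only index arithmetic is sketched below. The function name `atom_to_voxel_mapping` and the `reach` parameter are hypothetical; a cube of index offsets is used as a simple stand-in for the neighborhood, which is an assumption rather than the disclosed selection rule.

```python
from itertools import product

def atom_to_voxel_mapping(atom_voxel_indices, grid_size, reach=1):
    """Map each atom to its containing voxel plus nearby voxels.

    atom_voxel_indices: per-atom tuple of per-dimension voxel indices.
    reach: how many voxels to extend in each dimension (hypothetical).
    """
    mapping = {}
    for atom_id, home in enumerate(atom_voxel_indices):
        voxels = []
        # Enumerate index offsets around the containing voxel; no distances.
        for offset in product(range(-reach, reach + 1), repeat=len(home)):
            candidate = tuple(h + o for h, o in zip(home, offset))
            # Keep only voxels that lie inside the grid.
            if all(0 <= c < grid_size for c in candidate):
                voxels.append(candidate)
        mapping[atom_id] = voxels
    return mapping

mapping = atom_to_voxel_mapping([(1, 2), (0, 0)], grid_size=3)
print(len(mapping[0]))  # 6
print(mapping[1])       # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the number of offsets is fixed by `reach`, each atom contributes a bounded number of entries regardless of the grid size.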

At step 3222, each voxel is mapped to the atoms with which it was associated at steps 3202 and 3212. In one implementation, this mapping is encoded in a voxel-to-atom mapping 3412, which is generated based on the atom-to-voxel mapping 3402 (e.g., by applying a voxel-based sort key to the atom-to-voxel mapping 3402). The voxel-to-atom mapping 3412 is also referred to herein as a "cell-to-element mapping." In one example, a first voxel is mapped to a first subset of alpha-carbon atoms 3414 that includes the alpha-carbon atoms associated with the first voxel at steps 3202 and 3212. In another example, a second voxel is mapped to a second subset of alpha-carbon atoms 3416 that includes the alpha-carbon atoms associated with the second voxel at steps 3202 and 3212.
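Reading the voxel-based key as a sort key, the inversion of the atom-to-voxel mapping can be sketched as follows. The function name `voxel_to_atom_mapping` and the sample mapping are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def voxel_to_atom_mapping(atom_to_voxels):
    """Invert an atom-to-voxel mapping by sorting (voxel, atom) pairs by voxel."""
    pairs = [(voxel, atom)
             for atom, voxels in atom_to_voxels.items()
             for voxel in voxels]
    # Sorting on the voxel places all atoms of one voxel next to each other.
    pairs.sort(key=itemgetter(0))
    return {voxel: [atom for _, atom in group]
            for voxel, group in groupby(pairs, key=itemgetter(0))}

# Hypothetical mapping: atom id -> voxel indices it is associated with.
atom_to_voxels = {0: [5, 4, 8], 1: [4, 1], 2: [8]}
print(voxel_to_atom_mapping(atom_to_voxels))
# {1: [1], 4: [0, 1], 5: [0], 8: [0, 2]}
```

Voxels with no associated atoms never appear in the result, so later steps do no work for them.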

At step 3232, for each voxel, the distances between the voxel and the atoms mapped to the voxel at step 3222 are calculated. Step 3232 has a runtime complexity of O(#atoms) because the distance to a particular atom is measured only once, from the respective voxel to which that atom is uniquely mapped in the voxel-to-atom mapping 3412. This holds when no neighboring voxels are considered. Without neighbors, the constant factor implied by the big-O notation is 1. With neighbors, the constant factor equals the number of neighbors plus one; because the number of neighbors is constant for each voxel, the runtime complexity of O(#atoms) remains unchanged. In contrast, in Figure 35A, the distance to a particular atom is measured redundantly as many times as there are voxels (e.g., 27 distances for a particular atom because of 27 voxels).
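As a rough illustration of the difference in the number of distance evaluations, let $A$ denote the number of atoms, $V$ the number of voxels, and $k$ the constant number of neighboring voxels per atom-containing voxel. These symbols are introduced here for illustration only and do not appear in the figures.

$$
C_{\text{naive}} = A \cdot V, \qquad C_{\text{efficient}} = A \cdot (k + 1)
$$

Taking, purely as an example, $A = 828$ atoms and $V = 27$ voxels with no neighbors ($k = 0$), the counts are $828 \cdot 27 = 22{,}356$ evaluations versus $828$. Since $k$ does not grow with $A$ or $V$, $C_{\text{efficient}}$ stays linear in the number of atoms.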

In Figure 35B, based on the voxel-to-atom mapping 3412, each voxel is mapped to a respective subset of the 828 atoms (not including the distance calculations for neighboring voxels), as illustrated by a respective ellipse for each voxel. The respective subsets are largely non-overlapping, with some exceptions. Minor overlap exists in some cases when multiple atoms are mapped to the same voxel, as indicated in Figure 35B by the prime symbol "'" and the yellow overlap between ellipses. This minimal overlap has an additive effect, not a multiplicative effect, on the runtime complexity of O(#atoms). The overlap is a result of considering the neighboring voxels after determining the voxel that contains an atom. Without neighboring voxels, there can be no overlap because an atom is associated with only one voxel. With neighbors considered, however, each neighbor can potentially be associated with the same atom (as long as there is no other atom of the same amino acid that is closer).

At step 3242, for each voxel, the atom nearest to the voxel is identified based on the distances calculated at step 3232. In one implementation, this identification is encoded in a voxel-to-nearest-atom mapping 3422, also referred to herein as a "cell-to-nearest-element mapping." In one example, a first voxel is mapped to the second alpha-carbon atom as its nearest alpha-carbon atom 3424. In another example, a second voxel is mapped to the 31st alpha-carbon atom as its nearest alpha-carbon atom 3426.
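Steps 3232 and 3242 can be sketched together: distances are computed only for the atoms mapped to each voxel, and the minimum is kept. The function name `nearest_atom_per_voxel`, the voxel centers, and the atom coordinates are hypothetical, and 2D coordinates stand in for 3D ones.

```python
import math

def nearest_atom_per_voxel(voxel_to_atoms, voxel_centers, atom_coords):
    """Return {voxel: (nearest atom, distance)} and the evaluation count."""
    nearest = {}
    evaluations = 0
    for voxel, atoms in voxel_to_atoms.items():
        best_atom, best_distance = None, math.inf
        # Only the atoms mapped to this voxel are measured.
        for atom in atoms:
            distance = math.dist(voxel_centers[voxel], atom_coords[atom])
            evaluations += 1
            if distance < best_distance:
                best_atom, best_distance = atom, distance
        nearest[voxel] = (best_atom, best_distance)
    return nearest, evaluations

# Hypothetical 2D data: three voxels and two atoms.
voxel_centers = {4: (1.5, 1.5), 5: (2.5, 1.5), 7: (1.5, 2.5)}
atom_coords = {0: (1.7, 2.1), 1: (1.4, 1.6)}
voxel_to_atoms = {4: [0, 1], 5: [1], 7: [0]}

nearest, evaluations = nearest_atom_per_voxel(
    voxel_to_atoms, voxel_centers, atom_coords)
print(nearest[4][0])  # 1
print(evaluations)    # 4
```

The evaluation count equals the total number of voxel-atom associations, not the product of the voxel and atom counts.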

In addition, as the voxel-wise distances are calculated using the technique discussed above, the atom type and amino acid type categorizations of the atoms, together with the corresponding distance values, are stored to generate categorized distance channels.

Once the distances to the nearest atoms are identified using the technique discussed above, these distances can be encoded in the distance channels for voxelization and subsequent processing by the pathogenicity classifier 2108.

Computer System

Figure 36 shows an example computer system 3600 that can be used to implement the disclosed technology. The computer system 3600 includes at least one central processing unit (CPU) 3672 that communicates with a number of peripheral devices via a bus subsystem 3655. These peripheral devices can include a storage subsystem 3610 including, for example, memory devices and a file storage subsystem 3636, user interface input devices 3638, user interface output devices 3676, and a network interface subsystem 3674. The input and output devices allow user interaction with the computer system 3600. The network interface subsystem 3674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the pathogenicity classifier 2108 is communicably linked to the storage subsystem 3610 and the user interface input devices 3638.

The user interface input devices 3638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into the computer system 3600.

The user interface output devices 3676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display, such as audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from the computer system 3600 to the user or to another machine or computer system.

The storage subsystem 3610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by the processors 3678.

The processors 3678 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). The processors 3678 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of the processors 3678 include Google's Tensor Processing Unit (TPU)™, rackmount solutions such as GX4 Rackmount Series™ and GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.

The memory subsystem 3622 used in the storage subsystem 3610 can include a number of memories, including a main random access memory (RAM) 3632 for storage of instructions and data during program execution and a read-only memory (ROM) 3634 in which fixed instructions are stored. The file storage subsystem 3636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by the file storage subsystem 3636 in the storage subsystem 3610, or in other machines accessible by the processor.

The bus subsystem 3655 provides a mechanism for letting the various components and subsystems of the computer system 3600 communicate with each other as intended. Although the bus subsystem 3655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.

The computer system 3600 itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of the computer system 3600 depicted in Figure 36 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of the computer system 3600 are possible, having more or fewer components than the computer system depicted in Figure 36.

Amino Acid Sequence Prediction

A protein language model trained with a masked language modeling objective is supervised to output the probability that an amino acid occurs at a particular position in a protein given the surrounding context. Proteins are linear polymers that fold into a variety of specific conformations in order to function. The incredibly diverse three-dimensional (3D) structures, determined by the combination and order in which the 20 amino acids thread the protein polymer chain (the sequence of the protein), enable the sophisticated functions of proteins that are responsible for most biological activities. Obtaining the structures of proteins is therefore of paramount importance both for understanding the fundamental biology of health and disease and for developing therapeutic molecules. While protein structure is primarily determined by sophisticated experimental techniques such as X-ray crystallography, NMR spectroscopy, and, increasingly, cryo-electron microscopy, computational structure prediction from the genetically encoded amino acid sequence of a protein has been used as an alternative when experimental approaches are limited.

Computational methods have been used to predict the structure of proteins, to explain the mechanisms of biological processes, and to determine the properties of proteins. Moreover, all naturally occurring proteins are the result of an evolutionary process of random variants arising under various selective pressures. Through this process, nature has explored only a small subset of the theoretically possible protein sequence space. Advances in machine learning, and in deep learning in particular, are catalyzing a revolution in the paradigm of scientific research. Some deep learning-based approaches, especially in the field of structure prediction, now outperform conventional methods, often in combination with higher-resolution physical modeling. Challenges remain in experimental validation, benchmarking, leveraging known physics and interpreting models, and extending to other biomolecules and contexts.

Protein sites are microenvironments within a protein structure that are distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determining the structural and functional roles of individual amino acids within a protein provides information that helps to engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts, such as site-directed mutagenesis, for altering targeted protein functional properties.

In one implementation, the disclosed technology relates to predicting the spatial tolerability of amino acid substitutions. In such an implementation, the disclosed technology includes gapping logic and substitution logic. The gapping logic is configured to remove a particular amino acid at a particular position from a protein and to create an amino acid vacancy at the particular position in the protein. The substitution logic is configured to process the protein with the amino acid vacancy and to evaluate the tolerability of substitute amino acids that are candidates for filling or fitting the amino acid vacancy. The substitution logic is further configured to score the tolerability of the substitute amino acids based, at least in part, on the structural (or spatial) compatibility between the substitute amino acids and the adjacent amino acids in the vicinity of the amino acid vacancy (e.g., the right and left flanking amino acids). The substitution logic evaluates how well an amino acid "fits" into its surrounding protein environment and shows that mutations that disrupt strong amino acid preferences are more likely to be deleterious.

When the substitution logic is a convolutional neural network, the weights of the convolutional filters are optimized during training to detect the local spatial patterns that best capture the local biochemical features needed to separate the 20 amino acid microenvironments. After training, the filters in the convolution layers of the convolutional neural network are activated when the desired features are present at some spatial location in the input.

Structural (or spatial) compatibility can be defined in terms of a change in, or an effect on, protein function. A substitute amino acid is considered structurally (or spatially) incompatible if it causes a change in the function of the protein after being substituted at the particular position in the protein structure. A substitute amino acid is considered structurally (or spatially) compatible if it does not cause a change in the function of the protein after being substituted at the particular position in the protein structure.

Structural (or spatial) compatibility can also be defined as a spatial deviation measured by a distance metric. First, a pre-insertion spatial measurement of the protein structure can be determined, for example, by measuring the distances between amino acids in the protein structure prior to the amino acid substitution at the particular position. The distances can be atomic distances based on the atomic coordinates of the atoms of the amino acids. Distances between pairs of amino acids can be measured. Then, a post-insertion spatial measurement of the protein structure is determined, for example, by re-measuring the distances between the amino acids in the protein structure after the amino acid substitution at the particular position. A substitute amino acid is considered structurally (or spatially) incompatible if the spatial deviation between the pre-insertion spatial measurement and the post-insertion spatial measurement exceeds a threshold. A substitute amino acid is considered structurally (or spatially) compatible if the spatial deviation between the pre-insertion spatial measurement and the post-insertion spatial measurement does not exceed the threshold.
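The distance-based definition of compatibility can be sketched as a comparison of pairwise distances before and after a substitution. This is a minimal illustration under stated assumptions: the text does not specify how the spatial deviation is aggregated, so the mean absolute difference of pairwise distances is used here as one possible choice, and the function names, coordinates, and threshold are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Distances between every pair of representative atom coordinates."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def is_spatially_compatible(pre_coords, post_coords, threshold):
    """Compare pre- and post-insertion distances against a threshold.

    The deviation is the mean absolute change in pairwise distance, which
    is an assumed metric chosen for this sketch.
    """
    pre = pairwise_distances(pre_coords)
    post = pairwise_distances(post_coords)
    deviation = sum(abs(p - q) for p, q in zip(pre, post)) / len(pre)
    return deviation <= threshold, deviation

# Hypothetical coordinates (in angstroms) for three residues.
pre = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
post = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]

compatible, deviation = is_spatially_compatible(pre, post, threshold=0.25)
print(compatible)  # False
```

An unchanged structure gives a deviation of zero and is therefore compatible for any non-negative threshold.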

In another implementation, the disclosed technology relates to predicting the evolutionary conservation of amino acid substitutions. In such an implementation, the disclosed technology includes gapping logic and substitution logic. The gapping logic is configured to remove a particular amino acid at a particular position from a protein and thereby create an amino acid vacancy at that position. The substitution logic is configured to process the protein with the amino acid vacancy and to score the evolutionary conservation of substitute amino acids that are candidates for filling the vacancy. The substitution logic is further configured to score the evolutionary conservation of a substitute amino acid based, at least in part, on the structural (or spatial) compatibility between the substitute amino acid and the adjacent amino acids (e.g., the right and left flanking amino acids) near the vacancy. In some implementations, evolutionary conservation is scored using evolutionary conservation frequencies. In one implementation, the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM). In another implementation, the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM). In one implementation, the evolutionary conservation scores of the substitute amino acids are ranked by magnitude.
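By way of example only, ranking substitute amino acids by magnitude of conservation frequency could look like the sketch below. It assumes a single PSFM column holding 20 frequencies in a fixed alphabetical order; the alphabet ordering, the function name, and the toy frequencies are illustrative assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the PSFM


def rank_by_conservation(psfm_column):
    """Rank amino acids from most to least conserved at one position."""
    pairs = zip(AMINO_ACIDS, psfm_column)
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Toy PSFM column: aspartic acid dominates, glutamic acid is second.
column = [0.01] * 20
column[AMINO_ACIDS.index("D")] = 0.60
column[AMINO_ACIDS.index("E")] = 0.22

ranking = rank_by_conservation(column)
print(ranking[:2])
```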

In yet another implementation, the disclosed technology relates to predicting the evolutionary conservation of amino acid substitutions. In such an implementation, the disclosed technology includes gapping logic and evolutionary conservation prediction logic. The gapping logic is configured to remove a particular amino acid at a particular position from a protein and thereby create an amino acid vacancy at that position. The evolutionary conservation prediction logic is configured to process the protein with the amino acid vacancy and to rank the evolutionary conservation of substitute amino acids that are candidates for filling the vacancy.

Gapped Protein Spatial Representation-Based Pathogenicity Determination for a Target Alternate Amino Acid

FIG. 37 shows one implementation of determining 3700 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation. A protein is a sequence of amino acids. A particular amino acid that is removed or masked from the protein is referred to as a “gap amino acid.” The resulting protein, which lacks the gap amino acid, is referred to as a “gapped protein” or a “vacancy-containing protein.”

A “spatial representation” of a protein characterizes structural information about the amino acids of the protein. The spatial representation of the protein can be based on the shape, location, position, pattern, and/or arrangement of the amino acids within the protein. The spatial representation of the protein can be one-dimensional (1D), two-dimensional (2D), three-dimensional (3D), or n-dimensional (nD) information.

In one implementation, the spatial representation of the protein includes the amino acid-wise distance channels discussed above, for example, the amino acid-wise distance channels 600 described above with respect to FIG. 6. In another implementation, the spatial representation of the protein includes the distance channel tensor discussed above, for example, the distance channel tensor 700 described above with respect to FIG. 7. In yet another implementation, the spatial representation of the protein includes the evolutionary profile tensor discussed above, for example, the evolutionary profile tensor 1800 described above with respect to FIG. 18. In yet another implementation, the spatial representation of the protein includes the voxelized annotation channels discussed above, for example, the voxelized annotation channels 2000 described above with respect to FIG. 20. In yet another implementation, the spatial representation of the protein includes the structure confidence channels discussed above. In other implementations, the spatial representation can include other channels as well.

A “gapped spatial representation” of a protein is a spatial representation of the protein that excludes at least one gap amino acid of the protein. In one implementation, the gap amino acid is excluded by excluding (or not considering, or ignoring) one or more atoms or atom types of the gap amino acid when generating the gapped spatial representation. For example, the atoms of the gap amino acid can be excluded from the calculations (or selections) that generate the distance channels, the evolutionary profiles, the annotation channels, and/or the structure confidence channels. In other implementations, the gapped spatial representation can be generated by excluding the gap amino acid from other feature channels as well.

Consider the following example, which generates a gapped spatial representation of a protein by excluding the atoms of a gap amino acid from the calculation of the amino acid-wise distance channels. In FIG. 5, the Cα^A5 atom belongs to the alanine amino acid at position 5 of the protein. Now assume that the alanine amino acid at the fifth position is selected as the gap amino acid. The gapped spatial representation is then generated by calculating the distance channel without considering the distance 512 between the voxel center (1, 1) of the voxel grid 522 and its nearest alpha-carbon (Cα) atom, which is the Cα^A5 atom of the gap amino acid, i.e., the alanine amino acid at the fifth position.

It is noted that this application uses “spatial representation of a protein” and “protein structure” interchangeably. It is also noted that this application uses “gapped spatial representation of a protein” and “gapped protein structure” interchangeably.

Referring to FIG. 37, at operation 3702, a protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.

At operation 3712, a gap amino acid designator 3714 designates a particular amino acid at a particular position of the protein as a gap amino acid, and designates the amino acids at the remaining positions of the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is the major allele amino acid of the protein.

At operation 3722, a gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes the spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each amino acid class-wise distance channel has voxel-wise distance values for the voxels in a plurality of voxels. A voxel-wise distance value specifies the distance from the corresponding voxel in the plurality of voxels to an atom of a non-gap amino acid. The spatial configurations of the non-gap amino acids are determined based on the spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the distances from the corresponding voxels to the atoms of the gap amino acid when determining the voxel-wise distance values, that is, by ignoring the spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
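As a non-limiting sketch, the class-wise distance channels with the gap amino acid excluded could be computed as shown below. The sketch assumes that each atom is tagged with its residue position and amino acid class, that a voxel's value is the distance to the nearest remaining atom of that class, and that a hypothetical fill value `max_dist` is used when no atom of a class remains; none of these names comes from this description.

```python
import math


def gapped_distance_channels(voxel_centers, atoms, gap_position,
                             classes, max_dist=10.0):
    """Nearest-atom distance per voxel and amino acid class.

    atoms: list of (residue_position, amino_acid_class, (x, y, z)).
    Atoms of the residue at gap_position are ignored entirely.
    """
    kept = [a for a in atoms if a[0] != gap_position]
    channels = {}
    for cls in classes:
        coords = [xyz for _, c, xyz in kept if c == cls]
        channels[cls] = [
            # max_dist is an assumed fill value for empty classes.
            min((math.dist(v, xyz) for xyz in coords), default=max_dist)
            for v in voxel_centers
        ]
    return channels


# Toy input: the alanine at position 5 sits exactly on the voxel center.
atoms = [(4, "G", (2.0, 0.0, 0.0)),
         (5, "A", (0.0, 0.0, 0.0)),
         (6, "A", (3.0, 0.0, 0.0))]
voxels = [(0.0, 0.0, 0.0)]

ungapped = gapped_distance_channels(voxels, atoms, None, "AG")
gapped = gapped_distance_channels(voxels, atoms, 5, "AG")
print(ungapped["A"], gapped["A"])
```

With position 5 gapped, the alanine channel falls back to the next nearest alanine atom, which mirrors the Cα^A5 example given with respect to FIG. 5.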

The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on the pan-amino acid conservation frequencies of the amino acids whose atoms are nearest to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.

The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on the per-amino acid conservation frequencies of the respective amino acids whose atoms are nearest to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.

The spatial configurations of the non-gap amino acids are encoded as annotation channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the annotation channels.

The spatial configurations of the non-gap amino acids are encoded as structure confidence channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the structure confidence channels.

The spatial configurations of the non-gap amino acids are encoded as additional input channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the additional input channels.
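For illustration only, the pan-amino acid evolutionary profile channel with the gap amino acid ignored might be assembled as in the following sketch. It assumes that each voxel receives the conservation frequency row of the residue owning its nearest non-gap atom; the data layout, the function name, and the shortened three-value frequency rows are hypothetical simplifications.

```python
import math


def gapped_profile_channel(voxel_centers, atoms, psfm, gap_position):
    """Conservation row of the nearest non-gap residue, per voxel.

    atoms: list of (residue_position, (x, y, z)).
    psfm:  {residue_position: [conservation frequencies]}.
    Assumes at least one non-gap atom exists.
    """
    kept = [a for a in atoms if a[0] != gap_position]
    rows = []
    for v in voxel_centers:
        pos, _ = min(kept, key=lambda a: math.dist(v, a[1]))
        rows.append(psfm[pos])
    return rows


# Toy data: rows are shortened to three frequencies for readability.
atoms = [(21, (4.0, 0.0, 0.0)), (22, (0.5, 0.0, 0.0)), (23, (2.0, 0.0, 0.0))]
psfm = {21: [0.7, 0.2, 0.1], 22: [0.1, 0.8, 0.1], 23: [0.3, 0.3, 0.4]}
voxels = [(0.0, 0.0, 0.0)]

profile = gapped_profile_channel(voxels, atoms, psfm, gap_position=22)
print(profile)
```

Here the atom of residue 22 is nearest to the voxel, but because residue 22 is the gap amino acid the voxel takes the frequencies of residue 23 instead.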

At operation 3732, a pathogenicity determiner 3734 determines the pathogenicity of a nucleotide variant based, at least in part, on the gapped spatial representation and a representation of the alternate amino acid created by the nucleotide variant at the particular position. The representation of the alternate amino acid can be a one-hot encoding of the alternate amino acid (see, e.g., FIG. 8). In some implementations, the alternate amino acid is the same amino acid as the reference amino acid. In other implementations, the alternate amino acid is a different amino acid from the reference amino acid.
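A one-hot encoding of the alternate amino acid can be sketched as follows. The alphabetical ordering of the 20 amino acids is an assumption made for this example; the ordering used in FIG. 8 is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes


def one_hot(amino_acid):
    """Return a 20-element vector with a single 1.0 at the amino acid's index."""
    if len(amino_acid) != 1 or amino_acid not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return [1.0 if a == amino_acid else 0.0 for a in AMINO_ACIDS]


encoding = one_hot("D")
print(encoding)
```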

FIG. 38 shows an example of a spatial representation 3800 of a protein. The protein contains an amino acid sequence 3804. The aspartic acid (D) amino acid at position 22 of the amino acid sequence 3804 is selected as the gap amino acid 3802. FIG. 39 shows an example of a gapped spatial representation 3900 of the protein illustrated in FIG. 38. In FIG. 39, the gap amino acid 3802 is removed from the gapped spatial representation 3900, and its absence is depicted as the missing gap amino acid 3902.

FIG. 40 shows an example of an atomic spatial representation 4000 of the protein illustrated in FIG. 38. FIG. 40 also shows the atoms 4002 of the gap amino acid 3802. FIG. 41 shows an example of a gapped atomic spatial representation 4100 of the protein illustrated in FIG. 38. In FIG. 41, the atoms 4002 of the gap amino acid 3802 are removed from the gapped atomic spatial representation 4100, and their absence is depicted as the missing atoms 4102 of the gap amino acid 3802.

It is also noted that this application uses the terms “pathogenicity determiner,” “pathogenicity predictor,” “pathogenicity classifier,” “variant pathogenicity classifier,” “evolutionary conservation predictor,” and “evolutionary conservation determiner” interchangeably.

FIG. 42 shows one implementation of the pathogenicity classifier 2108/2600/2700 determining 4200 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation 4202 and an alternate amino acid representation 4212 of the target alternate amino acid.

The pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the nucleotide variant by processing, as input, the gapped spatial representation 4202 and the representation 4212 of the alternate amino acid, and by generating, as output, a pathogenicity score 4208 for the alternate amino acid.

FIG. 43 shows one implementation of training data 4300 used to train the pathogenicity classifier 2108/2600/2700. The pathogenicity classifier 2108/2600/2700 is trained on a benign training set 4302. The benign training set 4302 has respective benign protein samples 4322, 4342, and 4362 for respective reference amino acids at respective positions 4312, 4332, and 4352 in a proteome. The reference amino acids are the major allele amino acids of the proteome. In one implementation, the proteome has ten million positions, and therefore the benign training set 4302 has ten million benign protein samples. Each benign protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid. Each benign protein sample also has a representation of the respective reference amino acid as the respective alternate amino acid. In various implementations, the proteome includes a human proteome and non-human proteomes, including non-human primate proteomes.

FIG. 44 shows one implementation of generating 4400 gapped spatial representations 4322G, 4342G, and 4362G for the reference protein samples 4322, 4342, and 4362 by using the reference amino acids 4402, 4412, and 4422, respectively, as the gap amino acids. FIG. 45 shows one implementation of training 4500 the pathogenicity classifier 2108/2600/2700 on a benign protein sample.

The pathogenicity classifier 2108/2600/2700 is trained on a particular benign protein sample and estimates the pathogenicity of a particular reference amino acid at a particular position in the particular benign protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular benign protein sample, and (ii) a representation 4402 (e.g., a one-hot encoding) of the particular reference amino acid as a particular alternate amino acid, and by generating, as output, a pathogenicity score for the particular reference amino acid. The particular gapped spatial representation is generated by using the particular reference amino acid as the gap amino acid and the amino acids at the remaining positions in the particular benign protein sample as the non-gap amino acids.

Each benign protein sample has a ground truth benign label 4506 that indicates the absolute benignness of the benign protein sample. In one implementation, the ground truth benign label is zero, one, or minus one. The pathogenicity score 4502 for the particular reference amino acid is compared against the ground truth benign label 4506 to determine an error 4504, and a training technique (e.g., backpropagation 4512) is used to improve the coefficients of the pathogenicity classifier 2108/2600/2700 based on the error.

The pathogenicity classifier 2108/2600/2700 is also trained on a pathogenic training set 4308. The pathogenic training set 4308 has respective pathogenic protein samples 4322A-N, 4342A-N, and 4362A-N for respective combinatorially generated amino acid substitutions for the respective reference amino acids 4312, 4332, and 4352 at the respective positions 4318, 4338, and 4358 in the proteome. In one implementation, the respective combinatorially generated amino acid substitutions are confined by the reachability of single nucleotide polymorphisms (SNPs), i.e., by whether a single-nucleotide change can transform the reference codon of a reference amino acid into an alternate amino acid; alternate amino acid classes that cannot be reached in this way are unreachable alternate amino acid classes. The combinatorially generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternate amino acids of the respective amino acid classes that differ from the particular amino acid class.
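SNP reachability can be illustrated with the standard genetic code, as in the sketch below. It enumerates the nine codons that differ from a reference codon by one nucleotide and collects the amino acids they encode; excluding stop codons and the reference amino acid itself is a simplifying assumption of this example, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AA_STRING = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
             "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AA_STRING[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
ALL_AMINO_ACIDS = set(AA_STRING) - {"*"}


def reachable_amino_acids(codon):
    """Amino acids encoded by codons one nucleotide away from `codon`."""
    reference = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            amino_acid = CODON_TABLE[mutant]
            # Assumption: stop codons and synonymous changes are skipped.
            if amino_acid != "*" and amino_acid != reference:
                reachable.add(amino_acid)
    return reachable


def unreachable_amino_acids(codon):
    """Alternate amino acid classes no single-nucleotide change can produce."""
    reference = CODON_TABLE[codon]
    return ALL_AMINO_ACIDS - reachable_amino_acids(codon) - {reference}


print(sorted(reachable_amino_acids("GAT")))  # GAT encodes aspartic acid (D)
```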

In one implementation, the proteome has ten million positions, and there are nineteen combinatorially generated amino acid substitutions for each of the ten million positions, so the pathogenic training set 4308 has 190 million pathogenic protein samples.

Each pathogenic protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid. Each pathogenic protein sample also has a representation of the respective combinatorially generated amino acid substitution as the respective alternate amino acid created by a respective combinatorially generated nucleotide variant at the respective position in the proteome.

FIG. 46 shows one implementation of training 4600 the pathogenicity classifier 2108/2600/2700 on a pathogenic protein sample. The pathogenicity classifier 2108/2600/2700 is trained on a particular pathogenic protein sample and estimates the pathogenicity of a particular combinatorially generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular pathogenic protein sample, and (ii) a representation 4622 (e.g., a one-hot encoding) of the particular combinatorially generated amino acid substitution as a particular alternate amino acid, and by generating, as output, a pathogenicity score for the particular combinatorially generated amino acid substitution. The particular gapped spatial representation is generated by using the particular reference amino acid as the gap amino acid and the amino acids at the remaining positions in the particular pathogenic protein sample as the non-gap amino acids.

Each pathogenic protein sample has a ground truth pathogenic label 4606 that indicates the absolute pathogenicity of the pathogenic protein sample. In one implementation, the ground truth pathogenic label is one, zero, or minus one, so long as it differs from (e.g., is the opposite of) the ground truth benign label. The pathogenicity score 4602 for the particular combinatorially generated amino acid substitution is compared against the ground truth pathogenic label 4606 to determine an error 4604, and a training technique (e.g., backpropagation 4612) is used to improve the coefficients of the pathogenicity classifier 2108/2600/2700 based on the error.
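As a non-limiting illustration, the benign and pathogenic samples for a single proteome position could be enumerated as in the sketch below. It assumes a benign label of 0 and a pathogenic label of 1, which is one of the label choices permitted above; the dictionary layout and function name are hypothetical, and the gapped spatial representation itself is only referenced by its gap position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BENIGN, PATHOGENIC = 0, 1  # assumed label values; other pairs are possible


def samples_for_position(position, reference_aa):
    """One benign sample (reference as alternate) plus 19 pathogenic samples.

    Every sample shares the same gap position, since the reference amino
    acid is used as the gap amino acid in all of them.
    """
    samples = []
    for aa in AMINO_ACIDS:
        label = BENIGN if aa == reference_aa else PATHOGENIC
        samples.append(
            {"gap_position": position, "alternate": aa, "label": label}
        )
    return samples


samples = samples_for_position(position=22, reference_aa="D")
print(sum(s["label"] for s in samples))
```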

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained over 200 million training iterations. In such an implementation, the 200 million training iterations include ten million training iterations on the ten million benign protein samples and 190 million training iterations on the 190 million pathogenic protein samples. In one implementation, the proteome has one million to ten million positions, so the benign training set has one million to ten million benign protein samples. In such an implementation, there are nineteen combinatorially generated amino acid substitutions for each of the one million to ten million positions, so the pathogenic training set has 19 million to 190 million pathogenic protein samples.

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained over 20 million to 200 million training iterations. In such an implementation, the 20 million to 200 million training iterations include one million to ten million training iterations on the one million to ten million benign protein samples and 19 million to 190 million training iterations on the 19 million to 190 million pathogenic protein samples.
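The sample and iteration counts above follow from simple arithmetic, sketched here under the assumption of one training iteration per sample and 20 amino acid classes per position; the function name is hypothetical.

```python
def training_set_sizes(num_positions, num_classes=20):
    """Return (benign, pathogenic, total) sample counts for a proteome.

    Each position contributes one benign sample (the reference amino acid)
    and num_classes - 1 pathogenic samples (every other amino acid).
    """
    benign = num_positions
    pathogenic = num_positions * (num_classes - 1)
    return benign, pathogenic, benign + pathogenic


print(training_set_sizes(10_000_000))
print(training_set_sizes(1_000_000))
```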

FIG. 47 shows how certain unreachable amino acid classes are masked 4700 during training. At operation 4702, the unreachable alternate amino acid classes, i.e., those that single nucleotide polymorphisms (SNPs) cannot reach when transforming the reference codon of a reference amino acid into an alternate amino acid, are masked in the ground truth labels. At operation 4712, the masked amino acid classes incur zero loss and do not contribute to gradient updates. At operation 4722, the masked amino acid classes are identified in a lookup table. At operation 4723, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
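One way to make masked classes incur zero loss is sketched below. The sketch assumes a per-class binary cross-entropy loss averaged over the unmasked classes; the loss function, the mask layout, and the function name are assumptions made for illustration and are not stated in this description.

```python
import math


def masked_bce(predictions, labels, mask):
    """Mean binary cross-entropy over unmasked classes only.

    mask[i] == 0 marks class i as unreachable: it adds nothing to the
    loss and therefore nothing to the gradient.
    """
    eps = 1e-12  # guards against log(0)
    total, count = 0.0, 0
    for p, y, m in zip(predictions, labels, mask):
        if not m:
            continue
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        count += 1
    return total / count if count else 0.0


# The second class is masked, so its badly wrong prediction is ignored.
loss = masked_bce([0.5, 0.99], [0, 0], [1, 0])
print(loss)
```

In practice, the mask for each position could be read from a lookup table keyed by the reference amino acid position, as described with respect to operations 4722 and 4723.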

FIG. 48 shows one implementation of determining a final pathogenicity score. At operation 4802, in one implementation, the pathogenicity classifier 2108/2600/2700 generates a first pathogenicity score for a first alternate amino acid that is the same as a first reference amino acid. At operation 4812, in one implementation, the pathogenicity classifier 2108/2600/2700 generates a second pathogenicity score for a second alternate amino acid that is different from the first reference amino acid. At operation 4822, in one implementation, the final pathogenicity score for the second alternate amino acid is the second pathogenicity score.

In other alternatives, the final pathogenicity score for the second alternate amino acid is based on a combination of the first pathogenicity score and the second pathogenicity score. In a first alternative 4822a, in one implementation, the final pathogenicity score for the second alternate amino acid is the ratio of the second pathogenicity score to the sum of the first pathogenicity score and the second pathogenicity score. In a second alternative 4822b, in one implementation, the final pathogenicity score for the second alternate amino acid is determined by subtracting the first pathogenicity score from the second pathogenicity score.
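The two combination alternatives can be written out directly, as in the following sketch. The function names are hypothetical, and the handling of a zero denominator is an assumption of this example.

```python
def final_score_ratio(first_score, second_score):
    """Alternative 4822a: second / (first + second)."""
    total = first_score + second_score
    if total == 0:
        raise ValueError("scores sum to zero; the ratio is undefined")
    return second_score / total


def final_score_difference(first_score, second_score):
    """Alternative 4822b: second - first."""
    return second_score - first_score


# first_score: reference amino acid; second_score: alternate amino acid.
print(final_score_ratio(0.1, 0.3))
print(final_score_difference(0.1, 0.3))
```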

The discussion so far has covered what is shown in FIG. 49A. FIG. 49A shows that a variant pathogenicity determination is made for a target alternate amino acid 4922 that fills the vacancy created by the reference gap amino acid 4902 at a given position in the protein 4912. In particular, this analysis is performed by spatially representing the protein 4912 and the vacancy in a 3D format, for example, by using voxelized amino acid category-wise distance calculations that exclude the reference gap amino acid 4902 (or its atoms).

The discussion now turns to FIG. 49B. FIG. 49B shows that respective variant pathogenicity determinations are made for the amino acids of the respective amino acid classes 4916 that fill the vacancy created by the reference gap amino acid 4902 at a given position in the protein 4912. The inputs in FIGS. 49A and 49B are the same, namely the spatial representation of the protein 4912 and the vacancy in a 3D format; only the outputs differ. In FIG. 49A, only one pathogenicity score is generated, whereas in FIG. 49B a pathogenicity score is generated for each of the 20 amino acid classes/categories (e.g., by using a 20-way softmax classification).
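A 20-way softmax over per-class outputs can be sketched as follows. The raw scores (logits) are made-up values, and treating the classifier's final layer as producing one logit per amino acid class is an assumption of this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits, one per amino acid class, with the largest at "D".
logits = [0.0] * 20
logits[AMINO_ACIDS.index("D")] = 2.0

scores = softmax(logits)
print(dict(zip(AMINO_ACIDS, scores))["D"])
```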

Gapped Protein Spatial Representation-Based Pathogenicity Determination for Multiple Alternate Amino Acids

FIG. 50 shows one implementation of determining 5000 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation. At operation 5002, the protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.

At operation 5012, the gap amino acid designator 3714 designates a particular amino acid at a particular position of the protein as a gap amino acid, and designates the amino acids at the remaining positions of the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is the major allele amino acid of the protein.

At operation 5022, the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes the spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each amino acid class-wise distance channel has voxel-wise distance values for the voxels in a plurality of voxels. The voxel-wise distance values specify distances from the corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids. The spatial configurations of the non-gap amino acids are determined based on the spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values. Equivalently, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
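The encoding in operation 5022 can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name `gapped_distance_channels`, the input layout, the nearest-atom rule, and the `max_distance` fill value are all assumptions made for illustration. The only behavior taken from the text is that atoms of the gap amino acid are skipped when the voxel-wise distance values are computed.

```python
import math

# One-letter codes for the 20 naturally occurring amino acid classes.
AMINO_ACID_CLASSES = "ACDEFGHIKLMNPQRSTVWY"


def gapped_distance_channels(residues, voxel_centers, gap_position, max_distance=10.0):
    """Build one distance channel per amino acid class, ignoring the gap residue.

    residues: list of (position, amino_acid_class, [(x, y, z), ...]) tuples
              (hypothetical input layout).
    voxel_centers: list of (x, y, z) voxel center coordinates.
    gap_position: position of the gap amino acid, or None for no gap.
    max_distance: assumed fill value for voxels with no atom of a class nearby.
    """
    channels = {aa: [max_distance] * len(voxel_centers) for aa in AMINO_ACID_CLASSES}
    for position, aa, atoms in residues:
        if position == gap_position:
            continue  # atoms of the gap amino acid contribute no distances
        for v, center in enumerate(voxel_centers):
            for atom in atoms:
                d = math.dist(center, atom)
                if d < channels[aa][v]:
                    channels[aa][v] = d  # keep the nearest-atom distance (assumption)
    return channels
```

Calling the function once with `gap_position` set and once with `gap_position=None` shows the difference between the gapped and the ungapped representation for the same protein.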

At operation 5032, the pathogenicity determiner 3734 determines the pathogenicity of respective alternative amino acids at the particular position based, at least in part, on the gapped spatial representation. The respective alternative amino acids are combinatorially generated alternative amino acids produced by respective combinatorially generated nucleotide variants at the particular position.

Figure 51 depicts one implementation of the pathogenicity classifier 2108/2600/2700 determining (5100) variant pathogenicity for multiple alternative amino acids based on processing a gapped protein spatial representation 5102. The pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the respective alternative amino acids by processing the gapped spatial representation 5102 as input and generating, as output, respective pathogenicity scores 1-20 for the respective amino acid classes. In some implementations, the respective amino acid classes correspond to the respective twenty naturally occurring amino acids. In other implementations, the respective amino acid classes correspond to respective naturally occurring amino acids from a subset of the twenty naturally occurring amino acids. In one implementation, the output is displayed with respective rankings of the respective pathogenicity scores 1-20 for the respective amino acid classes.

Figure 52 depicts one implementation of simultaneously training (5200) the pathogenicity classifier 2108/2600/2700 on benign and pathogenic protein samples. The pathogenicity classifier 2108/2600/2700 is trained on a training set. The training set has a respective protein sample for each position in a proteome. The proteome has ten million positions, so the training set has ten million protein samples. Each protein sample has a respective gapped spatial representation generated by using the reference amino acid at the respective position in the proteome as the gap amino acid. The reference amino acid is the major allele amino acid of the proteome.

The pathogenicity classifier 2108/2600/2700 trains on a particular protein sample and estimates the pathogenicity of the respective alternative amino acids relative to a particular reference amino acid at a particular position in the particular protein sample by processing a particular gapped spatial representation 5202 of the particular protein sample as input and generating, as output, respective pathogenicity scores 1-20 for the respective amino acid classes. The particular gapped spatial representation is generated by using the particular reference amino acid as the gap amino acid and the amino acids at the remaining positions in the particular protein sample as the non-gap amino acids.

Each protein sample has respective ground truth labels for the respective amino acid classes. The ground truth labels include an absolute benign label for the reference amino acid class among the amino acid classes, and respective absolute pathogenic labels for the respective alternative amino acid classes among the amino acid classes. In one implementation, the absolute benign label is 0. The absolute pathogenic labels are the same across the respective alternative amino acid classes. In one implementation, the absolute pathogenic label is 1.

In one implementation, the error 5204 is determined based on a comparison of the pathogenicity score for the reference amino acid class against the absolute benign label (e.g., pathogenicity score 8 for the reference gap amino acid 5212 in Figure 52), and on respective comparisons of the respective pathogenicity scores for the respective alternative amino acid classes against the respective absolute pathogenic labels (e.g., pathogenicity scores 1-7 and 9-20 in Figure 52). In one implementation, the coefficients of the pathogenicity classifier 2108/2600/2700 are refined based on the error using a training technique (e.g., backpropagation 5224).
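As a sketch of how the ground truth labels and the error might be formed for one protein sample: the label values 0 and 1 come from the text, while the choice of binary cross-entropy as the error measure and the helper names `ground_truth_labels` and `binary_cross_entropy` are assumptions made for illustration. The text does not name a specific loss function.

```python
import math

# One-letter codes for the 20 amino acid classes (ordering is an assumption).
AMINO_ACID_CLASSES = "ACDEFGHIKLMNPQRSTVWY"


def ground_truth_labels(reference_aa, benign_label=0.0, pathogenic_label=1.0):
    """Absolute benign label for the reference class, pathogenic label elsewhere."""
    return [benign_label if aa == reference_aa else pathogenic_label
            for aa in AMINO_ACID_CLASSES]


def binary_cross_entropy(scores, labels, eps=1e-7):
    """Mean per-class binary cross-entropy (assumed loss, not from the source)."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return total / len(scores)
```

For a sample whose reference amino acid is histidine, `ground_truth_labels("H")` yields one 0 and nineteen 1s; the resulting error would then drive the coefficient update during backpropagation.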

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained over ten million training iterations using ten million protein samples. In some implementations, the proteome has one million to ten million positions, so the training set has one million to ten million protein samples. In one implementation, the pathogenicity classifier 2108/2600/2700 is trained over one million to ten million training iterations using one million to ten million protein samples.

In one implementation, the pathogenicity classifier 2108/2600/2700 generates a reference pathogenicity score for a first alternative amino acid of the reference amino acid class. In one implementation, the pathogenicity classifier 2108/2600/2700 generates respective alternative pathogenicity scores for the respective alternative amino acids of the respective alternative amino acid classes.

In one implementation, the respective final alternative pathogenicity scores for the respective alternative amino acids are the respective alternative pathogenicity scores themselves. In one implementation, the respective final alternative pathogenicity scores are based on respective combinations of the reference pathogenicity score and the respective alternative pathogenicity scores. In one implementation, each final alternative pathogenicity score is the ratio of the respective alternative pathogenicity score to the sum of the reference pathogenicity score and the respective alternative pathogenicity score. In one implementation, each final alternative pathogenicity score is determined by subtracting the reference pathogenicity score from the respective alternative pathogenicity score.
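The three variants of the final alternative pathogenicity score described above can be written out as a short sketch. The function name `final_alternative_scores` and the `mode` argument are hypothetical; the three formulas (identity, ratio to the sum, and difference) follow the text.

```python
def final_alternative_scores(reference_score, alternative_scores, mode="identity"):
    """Combine a reference score with per-amino-acid alternative scores.

    alternative_scores: dict mapping amino acid code -> alternative score.
    mode: "identity"   -> final score is the alternative score itself
          "ratio"      -> alternative / (reference + alternative)
          "difference" -> alternative - reference
    """
    if mode == "identity":
        return dict(alternative_scores)
    if mode == "ratio":
        # Undefined when reference + alternative is zero; not handled in this sketch.
        return {aa: s / (reference_score + s) for aa, s in alternative_scores.items()}
    if mode == "difference":
        return {aa: s - reference_score for aa, s in alternative_scores.items()}
    raise ValueError(f"unknown mode: {mode}")
```

With a reference score of 0.2 and an alternative score of 0.6, the ratio variant gives 0.6 / 0.8 = 0.75 and the difference variant gives 0.4.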

In one implementation, the pathogenicity classifier 2108/2600/2700 has an output layer that generates the respective pathogenicity scores. In some implementations, the output layer is a normalization layer; in such implementations, the respective pathogenicity scores are normalized. In one implementation, the output layer is a softmax layer; in such an implementation, the respective pathogenicity scores are exponentially normalized. In another implementation, the output layer has respective sigmoid units that respectively generate the respective pathogenicity scores. In yet another implementation, the respective pathogenicity scores are unnormalized.
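The difference between the softmax and sigmoid output-layer variants can be seen in a small standalone sketch (plain Python, independent of any particular deep learning framework; the function names are illustrative). A softmax exponentially normalizes the 20 raw outputs so that they sum to one, while independent sigmoid units squash each raw output separately and impose no constraint on the sum.

```python
import math


def softmax(logits):
    """Exponentially normalize raw outputs so that they sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent per-class squashing to (0, 1); outputs need not sum to 1."""
    return 1.0 / (1.0 + math.exp(-x))


raw_outputs = [0.0] * 20                        # 20 amino acid classes, equal logits
softmax_scores = softmax(raw_outputs)           # each score is 1/20
sigmoid_scores = [sigmoid(x) for x in raw_outputs]  # each score is 0.5
```

The sigmoid variant fits the label scheme above, in which nineteen of the twenty classes carry the label 1 at once, something a set of scores constrained to sum to one cannot reproduce exactly.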

Gapped Protein Spatial Representation and Evolutionary Conservation-Based Pathogenicity Determination for Multiple Alternative Amino Acids

Evolutionary conservation refers to the presence of similar genes, portions of genes, or chromosome segments in different species, reflecting both the common origin of the species and an important functional property of the conserved element. Mutations occur spontaneously in each generation, randomly changing an amino acid here and there in a protein. Individuals with mutations that impair critical functions of a protein may have problems that reduce their ability to reproduce. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Because the harmful mutations are lost, the amino acids critical for the function of a protein are conserved in the gene pool. In contrast, harmless (or, very rarely, beneficial) mutations are kept in the gene pool, producing variability in non-critical amino acids. Evolutionary conservation in proteins is identified by aligning the amino acid sequences of proteins with the same function from different taxa (homologs). Predicting the functional consequences of a variant relies, at least in part, on the assumption that amino acids critical to a protein family are conserved through evolution because of negative selection (i.e., amino acid changes at these sites were deleterious in the past), and that mutations at these sites have an increased likelihood of being pathogenic (disease-causing) in humans. In general, homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of the different amino acids observed at the target position in the alignment.

Figure 53 depicts one implementation of determining (5300) variant pathogenicity for multiple alternative amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternative amino acids. At operation 5302, the gap amino acid designator 3714 designates a particular amino acid at a particular position in the protein as a gap amino acid, and designates the amino acids at the remaining positions in the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is the major allele of the protein.

At operation 5312, the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes the spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each amino acid class-wise distance channel has voxel-wise distance values for the voxels in a plurality of voxels. The voxel-wise distance values specify distances from the corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids. The spatial configurations of the non-gap amino acids are determined based on the spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values. Equivalently, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the spatial proximity between the corresponding voxels and the atoms of the gap amino acid.

At operation 5322, the evolutionary conservation determiner 5324 determines the evolutionary conservation, at the particular position, of respective amino acids of the respective amino acid classes based, at least in part, on the gapped spatial representation.

Figure 54 shows the evolutionary conservation determiner 5324 in operation (5400), in accordance with one implementation. In some implementations, the evolutionary conservation determiner 5324 has the same architecture as the pathogenicity classifier 2108/2600/2700. The evolutionary conservation determiner 5324 determines the evolutionary conservation by processing the gapped spatial representation 5402 as input and generating, as output, respective evolutionary conservation scores 5406 for the respective amino acids 5408. The respective evolutionary conservation scores are rankable by magnitude. For the purposes of this disclosure, terms such as "classifier" and "determiner" can include one or more software modules, one or more hardware modules, or any combination thereof.

At operation 5332, the pathogenicity determiner 3734 determines, based at least in part on the evolutionary conservation of the respective amino acids 5408, the pathogenicity of respective nucleotide variants that respectively substitute the particular amino acid with the respective amino acids 5408 in alternative representations of the protein.

Figure 55 depicts one implementation of determining pathogenicity based on predicted evolutionary conservation scores. The classifier 5516 classifies a nucleotide variant as pathogenic 5508 when the evolutionary conservation score generated by the evolutionary conservation determiner 5324 for the corresponding amino acid substitution is below a threshold. In one implementation, the classifier 5516 classifies a nucleotide variant as pathogenic 5508 when the evolutionary conservation score generated by the evolutionary conservation determiner 5324 for the corresponding amino acid substitution is zero (i.e., an indication of non-conservation).

The classifier 5516 classifies a nucleotide variant as benign 5528 when the evolutionary conservation score generated by the evolutionary conservation determiner 5324 for the corresponding amino acid substitution is above the threshold. In one implementation, the classifier 5516 classifies a nucleotide variant as benign 5528 when the evolutionary conservation score generated by the evolutionary conservation determiner 5324 for the corresponding amino acid substitution is non-zero (i.e., an indication of conservation).
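The thresholding rule of Figure 55 reduces to a comparison, sketched below. The function name `classify_variant` and the default threshold of 0.5 are hypothetical. The text does not say how a score exactly equal to the threshold is handled; this sketch assigns it to the pathogenic side, which keeps the rule consistent with the zero/non-zero implementation when the threshold is set to 0.

```python
def classify_variant(conservation_score, threshold=0.5):
    """Label a variant from the conservation score of its amino acid substitution.

    Scores above the threshold are treated as conserved (benign); all other
    scores, including ties (an assumption), are treated as pathogenic.
    """
    return "benign" if conservation_score > threshold else "pathogenic"


# Zero/non-zero variant of the rule: use a threshold of 0.
zero_rule_low = classify_variant(0.0, threshold=0.0)    # "pathogenic"
zero_rule_high = classify_variant(0.03, threshold=0.0)  # "benign"
```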

Figure 56 depicts one implementation of the training data 5600 used to train the evolutionary conservation determiner 5324. The evolutionary conservation determiner 5324 is trained on a conserved training set and a non-conserved training set. The conserved training set has respective conserved protein samples 5602 for respective conserved amino acids at respective positions in a proteome. The non-conserved training set has respective non-conserved protein samples 5608 for respective non-conserved amino acids at the respective positions. In various implementations, the proteome includes a human proteome and non-human proteomes, including non-human primate proteomes.

Each of the respective positions has a set of conserved amino acids and a set of non-conserved amino acids. A particular set of conserved amino acids for a particular position in a particular protein in the proteome includes at least one major allele amino acid observed at the particular position across a plurality of species. In one implementation, the major allele amino acid is the reference amino acid (e.g., REF allele 5612 across benign protein sample 5622, and REF allele 5662 across benign protein sample 5682). The particular set of conserved amino acids also includes one or more minor allele amino acids observed at the particular position across the plurality of species (e.g., observed ALT alleles 5632 across benign protein samples 5642, 5652, and 5662, and observed ALT alleles 5692 across benign protein samples 5695 and 5696).

A particular set of non-conserved amino acids for the particular position includes the amino acids that are not in the particular set of conserved amino acids (e.g., unobserved ALT alleles 5618 across pathogenic protein samples 5622A-N, and unobserved ALT alleles 5668 across pathogenic protein samples 5682A-N).

In one implementation, each of the respective positions has C conserved amino acids in its set of conserved amino acids. In such an implementation, each of the respective positions has NC non-conserved amino acids in its set of non-conserved amino acids, where NC = 20 - C. The conserved training set has CP conserved protein samples, where CP = (number of positions) * C. The non-conserved training set has NCP non-conserved protein samples, where NCP = (number of positions) * (20 - C). In one implementation, C ranges from 1 to 10. In another implementation, C varies across the respective positions. In yet another implementation, C is the same for some of the respective positions.

In one implementation, the proteome has one million to ten million positions. In such an implementation, each of the one million to ten million positions has C conserved amino acids in its set of conserved amino acids, and NC non-conserved amino acids in its set of non-conserved amino acids, where NC = 20 - C. The conserved training set has CP conserved protein samples, where CP = (one million to ten million) * C. The non-conserved training set has NCP non-conserved protein samples, where NCP = (one million to ten million) * (20 - C).

In one implementation, the evolutionary conservation determiner 5324 is trained over twenty million to two hundred million training iterations. In such an implementation, the twenty million to two hundred million training iterations include one million to ten million training iterations with one million to ten million conserved protein samples, and nineteen million to one hundred ninety million training iterations with nineteen million to one hundred ninety million non-conserved protein samples.
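The sample counts follow directly from the relations CP = (number of positions) * C and NCP = (number of positions) * (20 - C). The short check below, with a hypothetical helper name `training_set_sizes`, shows that the quoted range of twenty million to two hundred million iterations is what those relations give for C = 1, i.e., one conserved amino acid per position, assuming one training iteration per sample.

```python
def training_set_sizes(num_positions, conserved_per_position, num_classes=20):
    """Return (CP, NCP): conserved and non-conserved protein sample counts."""
    cp = num_positions * conserved_per_position
    ncp = num_positions * (num_classes - conserved_per_position)
    return cp, ncp


# Lower and upper ends of the proteome size range, with C = 1.
low_cp, low_ncp = training_set_sizes(1_000_000, 1)     # 1,000,000 and 19,000,000
high_cp, high_ncp = training_set_sizes(10_000_000, 1)  # 10,000,000 and 190,000,000
total_low = low_cp + low_ncp     # 20,000,000 iterations
total_high = high_cp + high_ncp  # 200,000,000 iterations
```

For any C, CP + NCP equals 20 times the number of positions, so the total number of samples does not depend on how many amino acids per position are treated as conserved; only the split between the two training sets changes.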

In yet another implementation, the proteome has one million to ten million positions, so the training set has one million to ten million protein samples. In such an implementation, the evolutionary conservation determiner 5324 is trained over one million to ten million training iterations using the one million to ten million protein samples.

Each of the conserved and non-conserved protein samples has a respective gapped spatial representation generated by using the reference amino acid at the respective position as the gap amino acid. The evolutionary conservation determiner 5324 trains on a particular conserved protein sample and estimates the evolutionary conservation of a particular conserved amino acid at a particular position in the particular conserved protein sample by processing a particular gapped spatial representation of the particular conserved protein sample as input and generating, as output, an evolutionary conservation score for the particular conserved amino acid. The particular gapped spatial representation is generated by using the particular reference amino acid at the particular position as the gap amino acid and the amino acids at the remaining positions in the particular conserved protein sample as the non-gap amino acids.

Each conserved protein sample has a ground truth conserved label. The ground truth conserved label is an evolutionary conservation frequency. In one implementation, the ground truth conserved label is 1. The evolutionary conservation score for the particular conserved amino acid is compared against the ground truth conserved label to determine an error, and the coefficients of the evolutionary conservation determiner 5324 are refined based on the error using a training technique. In one implementation, the training technique is a loss function-based gradient update technique (e.g., backpropagation).

In some implementations, when the particular conserved amino acid is the particular reference amino acid, the ground truth conserved label is masked and not used to determine the error. In such implementations, the masking keeps the evolutionary conservation determiner 5324 from overfitting on the particular reference amino acid.

The evolutionary conservation determiner 5324 trains on a particular non-conserved protein sample and estimates the evolutionary conservation of a particular non-conserved amino acid at a particular position in the particular non-conserved protein sample by processing a particular gapped spatial representation of the particular non-conserved protein sample as input and generating, as output, an evolutionary conservation score for the particular non-conserved amino acid. The particular gapped spatial representation is generated by using the particular reference amino acid at the particular position as the gap amino acid and the amino acids at the remaining positions in the particular non-conserved protein sample as the non-gap amino acids.

Each non-conserved protein sample has a ground truth non-conserved label. The ground truth non-conserved label is an evolutionary conservation frequency. In one implementation, the ground truth non-conserved label is 0. The evolutionary conservation score for the particular non-conserved amino acid is compared against the ground truth non-conserved label to determine an error, and the coefficients of the evolutionary conservation determiner 5324 are refined based on the error using a training technique (e.g., backpropagation).

The evolutionary conservation determiner 5324 is trained on a training set. The training set has a respective protein sample for each position in the proteome. Each protein sample has a respective gapped spatial representation generated by using the reference amino acid at the respective position as the gap amino acid.

Figure 57 depicts one implementation of simultaneously training (5700) the evolutionary conservation determiner on benign and pathogenic protein samples. The evolutionary conservation determiner 5324 trains on a particular protein sample and estimates the evolutionary conservation of respective amino acids of the respective amino acid classes at a particular position in the particular protein sample by processing a particular gapped spatial representation 5722 of the particular protein sample as input and generating, as output, respective evolutionary conservation scores 1-20 for the respective amino acids. The particular gapped spatial representation 5722 is generated by using the particular reference amino acid at the particular position as the gap amino acid and the amino acids at the remaining positions in the particular protein sample as the non-gap amino acids.

Each protein sample has respective ground truth labels for the respective amino acids. The ground truth labels include one or more conserved (benign) labels for one or more conserved amino acids (5732, 5702, 5712) among the respective amino acids, and one or more non-conserved (pathogenic) labels for one or more non-conserved amino acids among the respective amino acids. The conserved labels and the non-conserved labels each have a respective evolutionary conservation frequency. The respective evolutionary conservation frequencies are rankable by magnitude. In one implementation, the conserved label is 1 and the non-conserved label is 0.

In one implementation, the error 5704 is determined based on respective comparisons of the respective evolutionary conservation scores for the respective conserved amino acids against the respective conserved labels, and on respective comparisons of the respective evolutionary conservation scores for the respective non-conserved amino acids against the respective non-conserved labels. The coefficients of the evolutionary conservation determiner 5324 are refined based on the error using a training technique (e.g., backpropagation 5744).

In one implementation, the conserved amino acids include the particular reference amino acid, and the conserved label for the particular reference amino acid is masked and not used to determine the error. The masking keeps the evolutionary conservation determiner 5324 from overfitting on the particular reference amino acid.
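Masking the reference amino acid amounts to leaving its class out of the error. A minimal sketch follows; the squared-error loss and the helper name `masked_squared_error` are assumptions, since the text only says that the masked label is not used to determine the error.

```python
def masked_squared_error(scores, labels, mask):
    """Mean squared error over the classes whose mask entry is True.

    scores, labels: per-amino-acid conservation scores and ground truth labels.
    mask: per-amino-acid booleans; False marks the masked reference amino acid.
    """
    kept = [(s - y) ** 2 for s, y, keep in zip(scores, labels, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Three-class toy example; the first class plays the role of the reference.
scores = [0.9, 0.1, 0.5]
labels = [1.0, 0.0, 1.0]
masked = masked_squared_error(scores, labels, [False, True, True])
unmasked = masked_squared_error(scores, labels, [True, True, True])
```

Because the reference class never contributes to the error, the determiner receives no gradient signal that rewards it for simply predicting the reference amino acid as conserved.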

A synonymous mutation is a point mutation: a single miscopied DNA nucleotide that changes only one base pair in the RNA copy of the DNA. A codon in RNA is a set of three nucleotides that encodes a particular amino acid. Most amino acids are encoded by several RNA codons. In most cases, a mutation in the third nucleotide of a codon still codes for the same amino acid. Such a mutation is called synonymous because, like a synonym in grammar, the mutated codon has the same meaning as the original codon and the amino acid does not change. If the amino acid does not change, the protein is not affected either. Because synonymous mutations leave the gene product and the protein unchanged, they play no real role in the evolution of species. Synonymous mutations are in fact quite common, but they go unnoticed because they have no effect.

Non-synonymous mutations have a much greater effect on an individual than synonymous mutations. In a non-synonymous mutation, a single nucleotide is usually inserted into or deleted from the sequence during transcription, when messenger RNA copies the DNA. This single missing or added nucleotide causes a frameshift mutation, which throws off the entire reading frame of the amino acid sequence and scrambles the codons. This usually affects the amino acids that are encoded and alters the resulting protein that is expressed. The severity of this kind of mutation depends on how early in the amino acid sequence it occurs. If it occurs near the beginning and the entire protein is altered, it can be a lethal mutation. A non-synonymous mutation can also arise when a point mutation changes a single nucleotide so that the codon no longer translates into the same amino acid. In many cases, a single amino acid change does not greatly affect the protein, which remains viable. If the change occurs early in the sequence and converts the codon into a stop signal, the protein is not made, which can have serious consequences. Sometimes non-synonymous mutations are actually positive changes. Natural selection may favor the new expression of the gene, and the individual may have developed a favorable adaptation from the mutation. If the mutation occurs in the gametes, the adaptation is passed on to the next generation of offspring. Non-synonymous mutations increase the diversity of the gene pool, giving natural selection variation to act on and driving evolution at the microevolutionary level.

The nucleotide triplets that encode amino acids are called codons. Each group of three nucleotides encodes one amino acid. Because there are 64 combinations of four nucleotides taken three at a time and only 20 amino acids, the code is degenerate (in most cases, more than one codon per amino acid). One example of unreachable alternative amino acid classes is those alternative amino acid classes that are not encoded by synonymous SNPs. Another example of unreachable alternative amino acid classes is those alternative amino acid classes that are limited by the number of nucleotide-triplet mutation combinations that lie one single nucleotide polymorphism (SNP) away from the starting codon at one of its triplet positions.

In one implementation, the unreachable alternative amino acid classes, that is, the classes that no SNP can reach by converting the reference codon of the reference amino acid into an alternative amino acid of that class, are masked in the ground-truth labels. In such an implementation, the masked amino acid classes incur zero loss and do not contribute to gradient updates. In one implementation, the masked amino acid classes are identified in a lookup table. In one implementation, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
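The SNP-reachability masking described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the standard genetic code (NCBI translation table 1), DNA codons, and hypothetical helper names (`reachable_amino_acids`, `unreachable_mask`). It enumerates the nine codons that lie one nucleotide away from a reference codon and marks as masked every amino acid class that none of them encodes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), CODE)
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def reachable_amino_acids(ref_codon):
    """Amino acids encoded by codons exactly one nucleotide away from ref_codon."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue  # not a substitution
            alt_codon = ref_codon[:pos] + base + ref_codon[pos + 1:]
            aa = CODON_TABLE[alt_codon]
            if aa != "*":  # stop codons are not amino acid classes
                reachable.add(aa)
    return reachable


def unreachable_mask(ref_codon):
    """Map each amino acid class to True if it should be masked (assumed rule:
    mask classes that are neither the reference nor reachable by one SNP)."""
    reachable = reachable_amino_acids(ref_codon)
    ref_aa = CODON_TABLE[ref_codon]
    return {aa: (aa != ref_aa and aa not in reachable) for aa in AMINO_ACIDS}


# Hypothetical lookup table keyed by reference codon, as one way to realize
# the "lookup table" mentioned in the text.
MASK_LOOKUP = {codon: unreachable_mask(codon)
               for codon, aa in CODON_TABLE.items() if aa != "*"}
```

For the methionine codon `ATG`, for instance, only six alternative amino acids are one SNP away, so the remaining thirteen classes would be masked under this rule. Whether the reference class itself is also masked is a separate choice and is left out of this sketch.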

The particular set of conserved amino acids and the particular set of non-conserved amino acids are identified based on evolutionary conservation profiles of homologous proteins from multiple species. In one implementation, the evolutionary conservation profile of the homologous proteins is determined using a position-specific frequency matrix (PSFM). In another implementation, the evolutionary conservation profile of the homologous proteins is determined using a position-specific scoring matrix (PSSM).

Figure 58 shows various implementations of ground-truth label encoding used to train the evolutionary conservation determiner 5324. Ground-truth label encoding 5802 labels the conserved amino acid classes A, C, and F with their evolutionary conservation frequencies (e.g., from a PSFM or PSSM) and labels the remaining, non-conserved amino acid classes with zero values. Ground-truth label encoding 5812 is the same as ground-truth label encoding 5802, except that it masks the REF major allele/most conserved amino acid class F, so that class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing out the loss that the loss function computes for class F).

Ground-truth label encoding 5822 labels the conserved amino acid classes A, C, and F with one values and labels the remaining, non-conserved amino acid classes with zero values. Ground-truth label encoding 5832 is the same as ground-truth label encoding 5822, except that it masks the REF major allele/most conserved amino acid class F, so that class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing out the loss that the loss function computes for class F).
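The four encodings of Figure 58 can be illustrated with a short sketch. The conservation frequencies for classes A, C, and F below are made-up values, the function names are hypothetical, and a squared error stands in for the loss function, which the text does not specify. Masking is represented by a per-class loss mask that zeroes the loss term of the REF class.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_labels(conserved_freqs, ref_aa, use_frequencies, mask_ref):
    """Return (labels, loss_mask), each a list over the 20 amino acid classes.

    conserved_freqs: dict mapping conserved amino acids to conservation frequencies.
    use_frequencies: True -> frequency labels (5802/5812); False -> one values (5822/5832).
    mask_ref: True -> the REF class gets loss mask 0 (5812/5832).
    """
    labels, loss_mask = [], []
    for aa in AMINO_ACIDS:
        if aa in conserved_freqs:
            labels.append(conserved_freqs[aa] if use_frequencies else 1.0)
        else:
            labels.append(0.0)  # non-conserved classes get zero values
        loss_mask.append(0.0 if (mask_ref and aa == ref_aa) else 1.0)
    return labels, loss_mask


def masked_squared_error(scores, labels, loss_mask):
    """Assumed stand-in loss: masked classes contribute exactly zero."""
    return sum(m * (s - y) ** 2 for s, y, m in zip(scores, labels, loss_mask))


# Illustrative (invented) frequencies; F is the REF / most conserved class.
freqs = {"A": 0.1, "C": 0.3, "F": 0.6}
enc_5802 = encode_labels(freqs, "F", use_frequencies=True, mask_ref=False)
enc_5812 = encode_labels(freqs, "F", use_frequencies=True, mask_ref=True)
enc_5822 = encode_labels(freqs, "F", use_frequencies=False, mask_ref=False)
enc_5832 = encode_labels(freqs, "F", use_frequencies=False, mask_ref=True)
```

With all-zero predicted scores, the masked variants yield a smaller error than their unmasked counterparts because the term for class F is dropped, which is the sense in which the masked class does not contribute to training.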

Figure 59 shows an example PSFM 5900. Figure 60 shows an example PSSM 6000. Figure 61 shows one implementation of generating the PSFM and the PSSM. Figure 62 shows an example PSFM encoding 6200. Figure 63 shows an example PSSM encoding 6300.

A multiple sequence alignment (MSA) is a sequence alignment of multiple homologous protein sequences to a target protein. An MSA is an important step in comparative analysis and property prediction of biological sequences, because much information, such as evolution and coevolution clusters, can be derived from the MSA and mapped onto a chosen target sequence or protein structure.

The sequence profile of a protein sequence X of length L is an L × 20 matrix, in the form of either a PSSM or a PSFM. The columns of the PSSM and the PSFM are indexed by the amino acid alphabet, and each row corresponds to a position in the protein sequence. The PSSM and the PSFM contain, respectively, the substitution scores and the frequencies of the amino acids at the different positions of the protein sequence. Each row of the PSFM is normalized to sum to 1. The sequence profile of protein sequence X is computed by aligning X with multiple sequences in a protein database that have statistically significant sequence similarity to X. The sequence profile therefore contains more general evolutionary and structural information about the protein family to which X belongs, and thus provides information useful for remote homology detection and fold recognition.
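As a rough illustration of the L × 20 profile layout, the sketch below derives a PSFM and a log-odds PSSM from a toy alignment. It is a simplification under stated assumptions: unweighted counts, a fixed pseudocount, and a uniform background frequency of 1/20. PSI-BLAST itself uses sequence weighting, data-dependent pseudocounts, and empirical background frequencies, so the numbers here will not match its output. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def psfm_from_alignment(aligned_seqs, pseudocount=1.0):
    """Return an L x 20 frequency matrix; every row sums to 1.

    aligned_seqs: equal-length aligned sequences; gap characters and
    non-standard residues are ignored when counting.
    """
    length = len(aligned_seqs[0])
    psfm = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned_seqs:
            if seq[pos] in counts:
                counts[seq[pos]] += 1.0
        total = sum(counts.values())
        psfm.append([counts[aa] / total for aa in AMINO_ACIDS])
    return psfm


def pssm_from_psfm(psfm, background=1.0 / 20):
    """Log-odds scores (base 2) against an assumed uniform background.

    Requires strictly positive frequencies, hence the pseudocount above.
    """
    return [[math.log2(freq / background) for freq in row] for row in psfm]


# Toy alignment of a three-residue query against three homologs.
alignment = ["MKV", "MRV", "MKI", "-KV"]
psfm = psfm_from_alignment(alignment)
pssm = pssm_from_psfm(psfm)
```

In this toy case the score for M at the first position is positive (M is enriched relative to the background), while the score for A at the same position is negative.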

A protein sequence (called the query sequence, e.g., the reference amino acid sequence of a protein) can be used as a seed to search for and align homologous sequences from a protein database (e.g., SWISSPROT), for example using the PSI-BLAST program. The aligned sequences share some homologous segments and belong to the same protein family. The aligned sequences are further converted into two profiles, the PSSM and the PSFM, to express their homology information. In this description, both the PSSM and the PSFM are matrices with 20 rows and L columns, where L is the total number of amino acids in the query sequence (the transpose of the L × 20 layout described above). Each column of the PSSM represents the log-likelihood of residue substitutions at the corresponding position in the query sequence. The (i, j)-th entry of the PSSM represents the likelihood that the amino acid at the j-th position of the query sequence mutates to amino acid type i during evolution. The PSFM contains the weighted observed frequencies for each position of the aligned sequences. Specifically, the (i, j)-th entry of the PSFM represents the likelihood of having amino acid type i at position j of the query sequence.

Given a query sequence, its sequence profile is obtained by first presenting it to PSI-BLAST, which searches for and aligns homologous protein sequences from a protein database (e.g., the Swiss-Prot database). Figure 61 shows the procedure of obtaining the sequence profile using the PSI-BLAST program. The PSI-BLAST parameters h and j (the E-value threshold for inclusion and the number of iterations in the legacy command line) are usually set to 0.001 and 3, respectively. The sequence profile of a protein encapsulates its homolog information with respect to the query protein sequence. In PSI-BLAST, the homolog information is represented by two matrices: the PSFM and the PSSM. Examples of the PSFM and the PSSM are shown in Figures 62 and 63, respectively.

In Figure 62, the (l, u)-th element (l ∈ {1, 2, ..., L_i}, u ∈ {1, 2, ..., 20}) represents the likelihood of the u-th amino acid occurring at the l-th position of the query protein. For example, the likelihood of amino acid M occurring at the first position of the query protein is 0.36.

In Figure 63, the (l, u)-th element (l ∈ {1, 2, ..., L_i}, u ∈ {1, 2, ..., 20}) represents the likelihood score of the amino acid at the l-th position of the query protein mutating to the u-th amino acid during evolution. For example, the score for amino acid V at the first position of the query protein mutating to H during evolution is -3, whereas the score for amino acid V at the eighth position is -4.

Combined Learning and Transfer Learning

Figure 64 shows two datasets on which the models disclosed herein can be trained, for example through combined learning (Figures 65A and 65B) or transfer learning (Figures 66A and 66B). The first training dataset is referred to as the JigsawAI dataset 6406. The second training dataset is referred to as the PrimateAI dataset 6408. The JigsawAI dataset 6406 features voxel inputs 6412 in which the central residue, identified as the gap amino acid as discussed above, is missing. The PrimateAI dataset 6408 features voxel inputs 6412 that are complete, with no missing residue.

For the JigsawAI dataset 6406, the ground-truth labels 6422 have a missing or masked label 6426 for the gap amino acid (e.g., the reference amino acid). For the PrimateAI dataset 6408, the ground-truth labels 6422 have nineteen missing or masked labels 6436 for the remaining amino acids that differ from the alternative amino acid under analysis (benign or pathogenic). In one implementation, the number of samples 6432 in the JigsawAI dataset 6406 is ten million (6436), and the number of samples in the PrimateAI dataset 6408 is one million (6438).

Figures 65A and 65B show one implementation of combined learning 6500 of the models disclosed herein. At operation 6502, a gapped training set is accessed. The gapped training set is also referred to herein as the JigsawAI dataset 6406. The gapped training set includes a respective gapped protein sample for each position in a proteome. Each gapped protein sample is labeled with a respective gapped ground-truth sequence. A particular gapped ground-truth sequence for a particular gapped protein sample has a benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein sample, and respective pathogenic labels for the respective remaining amino acid classes that correspond to alternative amino acids at the particular position.

At operation 6512, a non-gapped training set is accessed. The non-gapped training set is also referred to herein as the PrimateAI dataset 6408. The non-gapped training set includes non-gapped benign protein samples and non-gapped pathogenic protein samples. A particular non-gapped benign protein sample includes a benign alternative amino acid at a particular position, substituted by a benign nucleotide variant. A particular non-gapped pathogenic protein sample includes a pathogenic alternative amino acid at a particular position, substituted by a pathogenic nucleotide variant. The particular non-gapped benign protein sample is labeled with a benign ground-truth sequence that has a benign label for the particular amino acid class that corresponds to the benign alternative amino acid, and respective masked labels for the respective remaining amino acid classes that correspond to amino acids different from the benign alternative amino acid. The particular non-gapped pathogenic protein sample is labeled with a pathogenic ground-truth sequence that has a pathogenic label for the particular amino acid class that corresponds to the pathogenic alternative amino acid, and respective masked labels for the respective remaining amino acid classes that correspond to amino acids different from the pathogenic alternative amino acid.

In one implementation, the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein sample is masked. In one implementation, the non-gapped benign protein samples are derived from common human and non-human primate nucleotide variants. In one implementation, the non-gapped pathogenic protein samples are derived from combinatorially simulated nucleotide variants.
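The two kinds of ground-truth sequences can be pictured as 20-element label vectors. The sketch below is illustrative only: the numeric encoding (0.0 for benign, 1.0 for pathogenic), the use of `None` for a masked label, and the function names are assumptions made for this example rather than details given in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed encoding for illustration: 0.0 = benign, 1.0 = pathogenic, None = masked.
BENIGN, PATHOGENIC, MASKED = 0.0, 1.0, None


def gapped_ground_truth(ref_aa, mask_ref=True):
    """Gapped (JigsawAI-style) sample: the reference class is benign, or masked
    when mask_ref is True, and all 19 alternative classes are pathogenic."""
    labels = []
    for aa in AMINO_ACIDS:
        if aa == ref_aa:
            labels.append(MASKED if mask_ref else BENIGN)
        else:
            labels.append(PATHOGENIC)
    return labels


def non_gapped_ground_truth(alt_aa, is_pathogenic):
    """Non-gapped (PrimateAI-style) sample: only the alternative amino acid
    under analysis is labeled; the other 19 classes are masked."""
    label = PATHOGENIC if is_pathogenic else BENIGN
    return [label if aa == alt_aa else MASKED for aa in AMINO_ACIDS]
```

For example, `gapped_ground_truth("A")` contains one masked entry and nineteen pathogenic entries, whereas `non_gapped_ground_truth("K", is_pathogenic=False)` contains one benign entry and nineteen masked entries.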

At operation 6522, respective gapped spatial representations are generated for the gapped protein samples, and respective non-gapped spatial representations are generated for the non-gapped benign protein samples and the non-gapped pathogenic protein samples.

At operation 6532, the pathogenicity classifier 2108/2600/2700 is trained over one or more training cycles, producing a trained pathogenicity classifier 2108/2600/2700 whose parameters/coefficients/weights have been optimized. Each training cycle uses, as training examples, gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations.

At operation 6542, the trained pathogenicity classifier 2108/2600/2700 is used to determine the pathogenicity of variants.

In one implementation, a sample indicator is used to indicate to the pathogenicity classifier 2108/2600/2700 whether the current training example is a gapped spatial representation of a gapped protein sample or a non-gapped spatial representation of a non-gapped protein sample.

In one implementation, the pathogenicity classifier 2108/2600/2700 generates an amino acid class-wise output sequence in response to processing a training example. The amino acid class-wise output sequence has an amino acid class-wise pathogenicity score for each amino acid class.

In one implementation, the performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles on a validation set. In some implementations, the validation set includes, for each held-out protein sample, a pair of a gapped spatial representation and a non-gapped spatial representation.

In one implementation, the trained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for the gapped spatial representation of the pair and a second amino acid class-wise output sequence for the non-gapped spatial representation of the pair. In some implementations, a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in the held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences. In other implementations, the final pathogenicity score is based on the average of the first and second pathogenicity scores.
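The averaging variant can be written in a few lines. In this sketch, the two output sequences are assumed to be 20-element lists of per-class pathogenicity scores ordered by a fixed amino acid alphabet; the function name and the example scores are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def final_pathogenicity_score(gapped_scores, non_gapped_scores, alt_aa):
    """Average the scores that the two class-wise output sequences assign to
    the substituted (alternative) amino acid class."""
    idx = AMINO_ACIDS.index(alt_aa)
    return 0.5 * (gapped_scores[idx] + non_gapped_scores[idx])


# Invented scores for a substitution to lysine (K).
first_output = [0.0] * 20
second_output = [0.0] * 20
first_output[AMINO_ACIDS.index("K")] = 0.75
second_output[AMINO_ACIDS.index("K")] = 0.25
score = final_pathogenicity_score(first_output, second_output, "K")
```

Other combinations (for example, a weighted average) would fit the broader "combination" language equally well; the plain mean is shown only because the text names it explicitly.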

In some implementations, at least some of the training cycles use the same number of gapped spatial representations and non-gapped spatial representations. In other implementations, at least some of the training cycles use batches of training examples that contain the same number of gapped spatial representations and non-gapped spatial representations.
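One way to build such balanced batches is sketched below. The generator name, the fixed seed, and the boolean flag attached to each example (which plays the role of a sample indicator) are assumptions for illustration. The sketch expects an even `batch_size` of at least 2 and drops leftover examples from the larger set.

```python
import random


def balanced_batches(gapped, non_gapped, batch_size, seed=0):
    """Yield batches holding equal numbers of gapped and non-gapped examples.

    Each batch element is (example, is_gapped). Examples beyond what the
    smaller set can pair up are dropped.
    """
    half = batch_size // 2
    rng = random.Random(seed)
    gapped, non_gapped = list(gapped), list(non_gapped)
    rng.shuffle(gapped)
    rng.shuffle(non_gapped)
    num_batches = min(len(gapped), len(non_gapped)) // half
    for b in range(num_batches):
        lo, hi = b * half, (b + 1) * half
        batch = [(x, True) for x in gapped[lo:hi]]
        batch += [(x, False) for x in non_gapped[lo:hi]]
        rng.shuffle(batch)  # interleave the two sample types within the batch
        yield batch


batches = list(balanced_batches(range(10), range(100, 104), batch_size=4))
```

Because the gapped set is described as much larger than the non-gapped set, a practical pipeline would probably resample or cycle the smaller set rather than discard data; that choice is outside this sketch.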

In one implementation, the masked labels do not contribute to the error determination and therefore do not contribute to the training of the pathogenicity classifier 2108/2600/2700. In some implementations, the masked labels are zeroed out.

In some implementations, the gapped spatial representations are weighted differently from the non-gapped spatial representations, so that the contribution of the gapped spatial representations to the gradient updates applied to the parameters of the pathogenicity classifier 2108/2600/2700 in response to processing the gapped spatial representations differs from the contribution of the non-gapped spatial representations to the gradient updates applied in response to processing the non-gapped spatial representations. In one implementation, the difference is determined by predefined weights.
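A minimal sketch of how masking and per-type weighting could enter a loss is given below. It assumes a binary cross-entropy loss, `None` as the marker for a masked label, and hypothetical weight parameters; none of these specifics come from the text. Scaling a sample's loss by a weight scales its gradient contribution by the same factor, which is how the predefined weights would change the relative influence of gapped and non-gapped examples.

```python
import math


def masked_bce(scores, labels):
    """Mean binary cross-entropy over labeled classes; None labels are skipped,
    so masked classes add nothing to the error. Scores must lie in (0, 1)."""
    terms = [
        -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        for p, y in zip(scores, labels)
        if y is not None
    ]
    return sum(terms) / len(terms) if terms else 0.0


def combined_batch_loss(batch, gapped_weight=1.0, non_gapped_weight=1.0):
    """batch: list of (scores, labels, is_gapped) triples.

    Each example's loss is scaled by the weight of its sample type before
    averaging over the batch.
    """
    total = 0.0
    for scores, labels, is_gapped in batch:
        weight = gapped_weight if is_gapped else non_gapped_weight
        total += weight * masked_bce(scores, labels)
    return total / len(batch)
```

With `gapped_weight=2.0` and `non_gapped_weight=1.0`, two otherwise identical examples of different types produce a batch loss 1.5 times the single-example loss, with the gapped example accounting for two thirds of it.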

Figures 66A and 66B show one implementation of using transfer learning 6600 to train the models disclosed herein with the two datasets shown in Figure 64. At operation 6602, the pathogenicity classifier 2108/2600/2700 is first trained on the gapped training set (i.e., the JigsawAI dataset 6406) to produce a trained pathogenicity classifier 2108/2600/2700.

At operation 6612, the trained pathogenicity classifier 2108/2600/2700 is further trained on the non-gapped training set (i.e., the PrimateAI dataset 6408) to produce a retrained pathogenicity classifier 2108/2600/2700.

At operation 6622, the retrained pathogenicity classifier 2108/2600/2700 is used to determine the pathogenicity of variants.

At operation 6632, the performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles on a first validation set that includes only non-gapped spatial representations of held-out protein samples. In other implementations, the performance of the retrained pathogenicity classifier 2108/2600/2700 is measured between training cycles on a second validation set that includes gapped spatial representations and non-gapped spatial representations of held-out protein samples.

At operation 6642, the retrained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for a pair in response to processing the pair. In one implementation, a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in the corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.

Generating Training Data and Training Labels

Figure 67 shows one implementation of generating (6700) training data and labels for training the models disclosed herein.

A proteome accessor 6704 accesses a multiplicity of amino acid positions in a proteome that has a multiplicity of proteins.

A reference designator 6714 designates the major allele amino acids at the multiplicity of amino acid positions as the reference amino acids of the multiplicity of proteins.

For each amino acid position in the multiplicity of amino acid positions, a benign labeler 6724 classifies as benign variants those nucleotide substitutions that substitute the particular reference amino acid with the same particular reference amino acid at the particular amino acid position in a particular alternative representation of the particular protein.

For each amino acid position in the multiplicity of amino acid positions, a pathogenic labeler 6734 classifies as pathogenic variants those nucleotide substitutions that substitute the particular reference amino acid with an alternative amino acid at the particular amino acid position. The alternative amino acids differ from the particular reference amino acid.

A trainer 6744 trains the variant pathogenicity classifier 2108/2600/2700 on training data that includes spatial representations of protein samples, where the spatial representations are assigned ground-truth benign labels corresponding to the benign variants and ground-truth pathogenic labels corresponding to the pathogenic variants.
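The labeling scheme carried out by the benign and pathogenic labelers can be sketched at the amino acid level as follows. This is a simplification under stated assumptions: the proteome is a hypothetical dictionary of reference (major-allele) sequences, substitutions are enumerated directly as amino acids rather than as nucleotide changes, and the function name and record layout are invented for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def label_substitutions(proteome):
    """Yield (protein_id, position, ref_aa, alt_aa, label) records.

    proteome: dict mapping a protein identifier to its reference sequence.
    A substitution that leaves the reference amino acid in place is labeled
    benign; a substitution to any different amino acid is labeled pathogenic.
    """
    for protein_id, sequence in proteome.items():
        for position, ref_aa in enumerate(sequence):
            for alt_aa in AMINO_ACIDS:
                label = "benign" if alt_aa == ref_aa else "pathogenic"
                yield (protein_id, position, ref_aa, alt_aa, label)


# Toy proteome with a single two-residue protein (hypothetical identifier).
records = list(label_substitutions({"P1": "MK"}))
```

Each position yields one benign record and nineteen pathogenic records, so the toy proteome above produces 40 records in total.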

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of the first amino acid with respective amino acids at the given amino acid position in the protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective substitutions. In some implementations, the respective amino acids correspond to the 20 naturally occurring amino acids. In other implementations, the respective amino acids correspond to naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the insertion. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at the given vacant amino acid position in the protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective insertions. In some implementations, the respective amino acids correspond to the 20 naturally occurring amino acids. In other implementations, the respective amino acids correspond to naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is spatially tolerated by the other amino acids of the protein. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate a spatial tolerance score for the substitution. In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether respective substitutions of the first amino acid with respective amino acids at the given amino acid position in the protein are spatially tolerated by the other amino acids of the protein. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate respective spatial tolerance scores for the respective substitutions. In some implementations, the respective amino acids correspond to the respective twenty naturally occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally occurring amino acids from a subset of the twenty naturally occurring amino acids.

In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is spatially tolerated by the other amino acids of the protein. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate a spatial tolerance score for the insertion. In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether respective insertions of respective amino acids at the given vacant amino acid position in the protein are spatially tolerated by the other amino acids of the protein. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate respective spatial tolerance scores for the respective insertions. In some implementations, the respective amino acids correspond to the respective twenty naturally occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally occurring amino acids from a subset of the twenty naturally occurring amino acids.

In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate an evolutionary conservation score for the substitution. In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether respective substitutions of the first amino acid with respective amino acids at the given amino acid position in the protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate respective evolutionary conservation scores for the respective substitutions. In some implementations, the respective amino acids correspond to the respective twenty naturally occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally occurring amino acids from a subset of the twenty naturally occurring amino acids.

In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate an evolutionary conservation score for the insertion.

In one implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier (2108/2600/2700) is trained to generate respective evolutionary conservation scores for the respective insertions. In some implementations, the respective amino acids correspond to the respective twenty naturally occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally occurring amino acids from a subset of the twenty naturally occurring amino acids.

In various implementations, spatial tolerance corresponds to structural tolerance. In various implementations, the plurality of amino acid positions ranges from one million to ten million amino acid positions. In various implementations, the plurality of amino acid positions ranges from ten million to hundreds of millions of amino acid positions. In various implementations, the plurality of amino acid positions ranges from hundreds of millions to one billion amino acid positions. In various implementations, the plurality of amino acid positions ranges from one to one million amino acid positions.

In one implementation, unreachable alternative amino acid classes are masked in the ground truth labels. An alternative amino acid class is unreachable when no single nucleotide polymorphism (SNP) can transform the reference codon of the reference amino acid into an alternative amino acid of that class. In such an implementation, the masked amino acid classes incur zero loss and do not contribute to gradient updates. In such an implementation, the masked amino acid classes are identified in a lookup table. In such an implementation, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
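
The masking described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosed classifier: it assumes the standard genetic code, a per-class binary cross-entropy loss, and a lookup table keyed by position index. The function names (`reachable_amino_acids`, `build_mask_table`, `masked_bce`) are hypothetical, and leaving the reference class itself unmasked is an assumption.

```python
import math

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
CLASSES = sorted(set(AMINO_ACIDS) - {"*"})  # the 20 amino acid classes


def reachable_amino_acids(ref_codon):
    """Amino acids reachable from ref_codon by changing exactly one nucleotide."""
    ref_aa = CODON_TABLE[ref_codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue
            alt = CODON_TABLE[ref_codon[:pos] + base + ref_codon[pos + 1:]]
            if alt not in ("*", ref_aa):  # skip stop codons and synonymous changes
                reachable.add(alt)
    return reachable


def build_mask_table(ref_codons):
    """Hypothetical lookup table: position index -> set of masked classes.

    Assumption: the reference class itself is left unmasked.
    """
    return {
        pos: set(CLASSES) - reachable_amino_acids(codon) - {CODON_TABLE[codon]}
        for pos, codon in enumerate(ref_codons)
    }


def masked_bce(scores, labels, masked):
    """Binary cross-entropy summed over the unmasked classes only."""
    loss = 0.0
    for aa in CLASSES:
        if aa in masked:
            continue  # a masked class adds zero loss, hence no gradient
        p, y = scores[aa], labels[aa]
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss
```

For the methionine codon ATG, for example, a single nucleotide change can reach only leucine, valine, threonine, lysine, arginine, and isoleucine, so under these assumptions the other thirteen non-reference classes would be masked at that position.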

In various implementations, the spatial representation is a structural representation of the protein structure of a protein sample. In various implementations, the spatial representation is encoded using voxelization.

Pathogenicity Determination

FIG. 68 depicts one implementation of a method (6800) of determining the pathogenicity of a nucleotide variant. At operation 6802, the method includes accessing a spatial representation of a protein. The spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein.

At operation 6812, the method includes removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein. In one implementation, the removal of the particular spatial configuration is implemented (or automated) by a script.

At operation 6822, the method includes determining the pathogenicity of the nucleotide variant based, at least in part, on the gapped spatial representation and a representation of an alternative amino acid produced by the nucleotide variant at the particular position.
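
The three operations of method 6800 can be sketched as below. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the spatial representation is assumed to be a list of atoms, the gapped representation is assumed to be encoded as one nearest-atom distance channel per amino acid class over a set of voxel centers, and the pathogenicity model is a caller-supplied callable. The `Atom` type and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    position: int    # 0-based residue position in the protein
    amino_acid: str  # one-letter amino acid class of the residue
    xyz: tuple       # (x, y, z) coordinates


def make_gapped(atoms, gap_position):
    """Operation 6812: drop every atom of the residue at gap_position."""
    return [a for a in atoms if a.position != gap_position]


def distance_channels(atoms, voxel_centers, classes):
    """Per amino acid class, distance from each voxel center to the nearest
    atom of that class (infinity when the class has no atoms left)."""
    channels = {}
    for aa in classes:
        coords = [a.xyz for a in atoms if a.amino_acid == aa]
        channels[aa] = [
            min((math.dist(v, c) for c in coords), default=math.inf)
            for v in voxel_centers
        ]
    return channels


def determine_pathogenicity(atoms, gap_position, alt_amino_acid,
                            voxel_centers, classes, predictor):
    """Operations 6802 to 6822, with a hypothetical caller-supplied predictor."""
    gapped = make_gapped(atoms, gap_position)
    channels = distance_channels(gapped, voxel_centers, classes)
    # One-hot representation of the alternative amino acid (an assumption).
    alt_one_hot = [1.0 if aa == alt_amino_acid else 0.0 for aa in classes]
    return predictor(channels, alt_one_hot)
```

Because the atoms of the gap amino acid are removed before any distance is computed, they cannot influence the distance channels, which is one simple way to exclude the gap amino acid's spatial configuration from the representation.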

Structural Tolerance Prediction

FIG. 69 depicts one implementation of a system (6900) for predicting the structural tolerance of amino acid substitutions. At operation 6902, gapping logic is configured to remove a particular amino acid at a particular position from a spatial representation of a protein and to create an amino acid vacancy at the particular position in the spatial representation of the protein.

At operation 6912, structural tolerance prediction logic is configured to process the spatial representation of the protein with the amino acid vacancy and to rank the structural tolerance of substitute amino acids that are candidates for filling the amino acid vacancy, based on amino acid co-occurrence patterns in the vicinity of the amino acid vacancy.

A variant pathogenicity classifier trained using the transfer learning technique disclosed herein (FIGS. 66A and 66B) is referred to as "Transfer Learning." A variant pathogenicity classifier trained using the combined learning technique disclosed herein (FIGS. 65A and 65B) is referred to as "Combined Learning."

The performance results in FIGS. 70A, 70B, and 70C are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets. New developmental delay disorder (new DDD) is one example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI and of Transfer Learning against Combined Learning. The new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with DDD as benign. A similar labeling scheme is used with the autism spectrum disorder (ASD) validation set.

BRCA1 is another example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI and of Transfer Learning against Combined Learning. The BRCA1 validation set labels synthetically generated reference amino acid sequences that simulate proteins of the BRCA1 gene as benign variants, and labels synthetically altered allelic amino acid sequences that simulate proteins of the BRCA1 gene as pathogenic variants. A similar labeling scheme is used with the different validation sets shown in FIGS. 70A, 70B, and 70C for the TP53 gene, the TP53S3 gene and its variants, and other genes and their variants.

In FIGS. 70A, 70B, and 70C, the y-axis shows p-values and the x-axis shows the different validation sets. As the p-values in FIGS. 70A, 70B, and 70C show, Combined Learning generally outperforms the other approaches, followed by Transfer Learning and then PrimateAI 3D. A greater p-value, i.e., a longer vertical bar, indicates greater accuracy in distinguishing benign variants from pathogenic variants. In FIGS. 70A, 70B, and 70C, the vertical bars for Combined Learning are consistently longer than those for the other approaches.

Also in FIGS. 70A, 70B, and 70C, a separate "mean" chart presents the mean of the p-values determined for each of the validation sets. In the mean chart as well, Combined Learning generally outperforms the other approaches, followed by Transfer Learning and then PrimateAI 3D, as seen from the horizontal bars for Combined Learning being consistently longer than those for the other approaches.

Mean statistics can be biased by outliers. To address this, a separate "method rank" chart is also shown in FIGS. 70A, 70B, and 70C. A higher rank number indicates poorer classification accuracy. In the method rank chart as well, Combined Learning generally outperforms the other approaches, followed by Transfer Learning and then PrimateAI 3D. In the method rank chart, a method with more counts of the low ranks 1 and 2 is better than a method with more counts of the higher rank 3.
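
The difference between the mean chart and the method rank chart can be sketched as follows. This is a minimal sketch assuming the per-validation-set results are available as a nested dictionary of the plotted values, with larger values treated as better; the function names and the data layout are hypothetical, and no values from the figures are reproduced here.

```python
from collections import Counter
from statistics import mean


def mean_per_method(results):
    """results: {validation_set: {method: value}} -> {method: mean value}."""
    methods = next(iter(results.values())).keys()
    return {m: mean(r[m] for r in results.values()) for m in methods}


def rank_counts(results, higher_is_better=True):
    """Count how often each method takes rank 1, 2, 3, ... across validation sets."""
    counts = {}
    for per_method in results.values():
        # Rank 1 goes to the best method on this validation set.
        ordered = sorted(per_method, key=per_method.get, reverse=higher_is_better)
        for rank, method in enumerate(ordered, start=1):
            counts.setdefault(method, Counter())[rank] += 1
    return counts
```

A single validation set with an extreme value can dominate `mean_per_method`, whereas in `rank_counts` it contributes only one count to one rank, which is why rank counts are less sensitive to outliers.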

Clauses

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the reader of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.

One or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of a computer product including a non-transitory computer-readable storage medium with computer-usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i) to (iii) implement the specific techniques set forth herein, and the software modules are stored in a computer-readable storage medium (or multiple such media).

The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.

Other implementations of the clauses described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.

We disclose the following clauses:

Clause Set 1 (ILLM 1050-2)

1. A computer-implemented method of determining pathogenicity of a nucleotide variant, the method including:

accessing a protein that has respective amino acids at respective positions;

designating a particular amino acid at a particular position in the protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;

generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids and excludes a spatial configuration of the gap amino acid; and

determining the pathogenicity of the nucleotide variant based, at least in part, on the gapped spatial representation and a representation of an alternative amino acid produced by the nucleotide variant at the particular position.

2. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels,

wherein each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels, and

wherein the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.

3. The computer-implemented method of clause 2, wherein the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.

4. The computer-implemented method of clause 2, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.

5. The computer-implemented method of clause 4, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.

6. The computer-implemented method of clause 1, wherein the particular amino acid is a reference amino acid that is a major allele of the protein.

7. The computer-implemented method of clause 1, wherein a pathogenicity predictor determines the pathogenicity of the nucleotide variant by:

processing, as input, the gapped spatial representation and

the representation of the alternative amino acid; and

generating, as output, a pathogenicity score for the alternative amino acid.

8. The computer-implemented method of clause 7, wherein the pathogenicity predictor is trained on a benign training set.

9. The computer-implemented method of clause 8, wherein the benign training set has respective benign protein samples for respective reference amino acids at respective positions in a proteome.

10. The computer-implemented method of clause 9, wherein the reference amino acids are major allele amino acids of the proteome.

11. The computer-implemented method of clause 10, wherein the proteome has 10 million positions, and therefore the benign training set has 10 million benign protein samples.

12. The computer-implemented method of clause 11, wherein each benign protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid.

13. The computer-implemented method of clause 12, wherein each benign protein sample has a respective representation of the respective reference amino acid as the respective alternative amino acid.

14. The computer-implemented method of clause 13, wherein the pathogenicity predictor trains on a particular benign protein sample and estimates pathogenicity of a particular reference amino acid at a particular position in the particular benign protein sample by:

processing, as input, (i) a particular gapped spatial representation of the particular benign protein sample,

wherein the particular gapped spatial representation is generated

by using the particular reference amino acid as the gap amino acid, and

by using the remaining amino acids at the remaining positions in the particular benign protein sample as the non-gap amino acids; and

(ii) a representation of the particular reference amino acid as a particular alternative amino acid; and

generating, as output, a pathogenicity score for the particular reference amino acid.

15. The computer-implemented method of clause 14, wherein each of the benign protein samples has a ground truth benign label that indicates absolute benignness of the benign protein sample.

16. The computer-implemented method of clause 15, wherein the ground truth benign label is zero.

17. The computer-implemented method of clause 16, wherein the pathogenicity score for the particular reference amino acid is compared against the ground truth benign label to determine an error, and coefficients of the pathogenicity predictor are improved based on the error using a training technique.

18. The computer-implemented method of clause 1, wherein the pathogenicity predictor is trained on a pathogenic training set.

19. The computer-implemented method of clause 18, wherein the pathogenic training set has respective pathogenic protein samples for respective combinatorially generated amino acid substitutions for the respective reference amino acids at the respective positions in the proteome.

20. The computer-implemented method of clause 19, wherein the combinatorially generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternative amino acids of respective amino acid classes that differ from the particular amino acid class.

21. The computer-implemented method of clause 20, wherein the proteome has 10 million positions and there are 19 combinatorially generated amino acid substitutions for each of the 10 million positions, and therefore the pathogenic training set has 190 million pathogenic protein samples.

22. The computer-implemented method of clause 21, wherein each pathogenic protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid.

23. The computer-implemented method of clause 22, wherein each pathogenic protein sample has a respective representation of the respective combinatorially generated amino acid substitution as the respective alternative amino acid produced by a respective combinatorially generated nucleotide variant at the respective position in the proteome.

24. The computer-implemented method of clause 23, wherein the pathogenicity predictor trains on a particular pathogenic protein sample and estimates pathogenicity of a particular combinatorially generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample by:

processing, as input, (i) a particular gapped spatial representation of the particular pathogenic protein sample, wherein the particular gapped spatial representation is generated

by using the remaining amino acids at the remaining positions in the particular pathogenic protein sample as the non-gap amino acids; and

(ii) a representation of the particular combinatorially generated amino acid substitution as a particular alternative amino acid; and

generating, as output, a pathogenicity score for the particular combinatorially generated amino acid substitution.

25. The computer-implemented method of clause 24, wherein each of the pathogenic protein samples has a ground truth pathogenic label that indicates absolute pathogenicity of the pathogenic protein sample.

26. The computer-implemented method of clause 25, wherein the ground truth pathogenic label is one.

27. The computer-implemented method of clause 26, wherein the pathogenicity score for the particular combinatorially generated amino acid substitution is compared against the ground truth pathogenic label to determine an error, and coefficients of the pathogenicity predictor are improved based on the error using a training technique.

28. The computer-implemented method of clause 27, wherein the pathogenicity predictor is trained on 200 million training iterations,

wherein the 200 million training iterations include

10 million training iterations with the 10 million benign protein samples, and

190 million training iterations with the 190 million pathogenic protein samples.
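
The sample counts recited in clauses 11, 21, and 28 follow from one benign sample plus nineteen pathogenic samples per position. The sketch below illustrates that construction on a toy sequence. It assumes a protein is given as a string of one-letter amino acid codes and that a sample can be summarized as a (gap position, alternative amino acid, label) tuple; the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acid classes


def training_samples(reference_sequence):
    """Yield (gap_position, alternative_amino_acid, label) for every position.

    Label 0 marks the benign sample (alternative equals the reference);
    label 1 marks each of the 19 combinatorially generated substitutions.
    """
    for position, ref in enumerate(reference_sequence):
        yield position, ref, 0
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield position, alt, 1


def sample_counts(num_positions):
    """Return (benign, pathogenic, total) sample counts for num_positions."""
    benign = num_positions
    pathogenic = 19 * num_positions
    return benign, pathogenic, benign + pathogenic
```

With 10 million positions, `sample_counts(10_000_000)` gives 10 million benign samples, 190 million pathogenic samples, and 200 million samples in total, matching one training iteration per sample.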

29. The computer-implemented method of clause 10, wherein the proteome has 1 million to 10 million positions, and therefore the benign training set has 1 million to 10 million benign protein samples,

wherein there are 19 combinatorially generated amino acid substitutions for each of the 1 million to 10 million positions, and therefore the pathogenic training set has 19 million to 190 million pathogenic protein samples.

30. 조항 29에 있어서, 병원성 예측자는 2천만 내지 2억 번의 훈련 반복으로 훈련되고,30. Clause 29, wherein the pathogenicity predictor is trained with 200 million to 200 million training iterations, and

2천만 내지 2억 번의 훈련 반복은20 to 200 million training repetitions

100만 내지 1000만 개의 양성 단백질 샘플을 사용한 100만 내지 1000만 번의 훈련 반복, 및1 to 10 million training iterations using 1 to 10 million positive protein samples, and

1,900만 내지 1억9,000만 개의 병원성 단백질 샘플을 사용한 1,900만 내지 1억9,000만 번의 반복을 포함하는, 컴퓨터 구현 방법.A computer-implemented method comprising 19 to 190 million iterations using 19 to 190 million pathogenic protein samples.
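By way of illustration only, the sample counts recited above follow from simple arithmetic: one benign sample per position, plus 19 pathogenic samples per position (one for each of the other naturally occurring amino acids). The sketch below assumes exactly one training iteration per sample; the function name `training_set_sizes` is hypothetical and not part of the disclosure.

```python
# Illustrative arithmetic only; names are hypothetical.
NUM_AMINO_ACIDS = 20  # the 20 naturally occurring amino acids


def training_set_sizes(num_positions):
    """Return (benign, pathogenic, total) sample counts for a proteome size."""
    benign = num_positions  # one reference (benign) sample per position
    pathogenic = num_positions * (NUM_AMINO_ACIDS - 1)  # 19 substitutions per position
    return benign, pathogenic, benign + pathogenic


for positions in (1_000_000, 10_000_000):
    print(positions, training_set_sizes(positions))
```

The lower and upper bounds of the recited ranges correspond to the two proteome sizes used in the loop.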

31. The computer-implemented method of clause 6, wherein the alternate amino acid is the same amino acid as the reference amino acid.

32. The computer-implemented method of clause 31, wherein the alternate amino acid is a different amino acid from the reference amino acid.

33. The computer-implemented method of clause 32, wherein the pathogenicity predictor generates a first pathogenicity score for a first alternate amino acid that is the same as a first reference amino acid, and generates a second pathogenicity score for a second alternate amino acid that is different from the first reference amino acid.

34. The computer-implemented method of clause 33, wherein a final pathogenicity score for the second alternate amino acid is the second pathogenicity score.

35. The computer-implemented method of clause 34, wherein the final pathogenicity score for the second alternate amino acid is based on a combination of the first pathogenicity score and the second pathogenicity score.

36. The computer-implemented method of clause 35, wherein the final pathogenicity score for the second alternate amino acid is a ratio of the second pathogenicity score to the sum of the first pathogenicity score and the second pathogenicity score.

37. The computer-implemented method of clause 36, wherein the final pathogenicity score for the second alternate amino acid is determined by subtracting the first pathogenicity score from the second pathogenicity score.
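By way of illustration only, the three ways of deriving a final pathogenicity score recited above (the alternate score alone, a ratio, and a difference) can be sketched as follows. The function name `final_pathogenicity_score` and the `mode` strings are hypothetical, chosen for this example.

```python
def final_pathogenicity_score(ref_score, alt_score, mode="alt_only"):
    """Combine the reference-class score and an alternate-class score.

    ref_score: score for the alternate amino acid identical to the reference.
    alt_score: score for an alternate amino acid different from the reference.
    """
    if mode == "alt_only":
        return alt_score
    if mode == "ratio":
        total = ref_score + alt_score
        if total == 0:
            raise ValueError("ratio is undefined when both scores are zero")
        return alt_score / total
    if mode == "difference":
        return alt_score - ref_score
    raise ValueError(f"unknown mode: {mode}")


print(final_pathogenicity_score(0.25, 0.75, "ratio"))
print(final_pathogenicity_score(0.25, 0.75, "difference"))
```

The ratio form stays between zero and one for non-negative scores, whereas the difference form can be negative when the reference score exceeds the alternate score.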

38. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of the amino acids having the atoms nearest to voxels.

39. The computer-implemented method of clause 38, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding the nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.

40. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of the respective amino acids having the respective atoms nearest to the voxels.

41. The computer-implemented method of clause 40, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding the respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
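By way of illustration only, disregarding the atoms of the gap amino acid during a nearest-atom search can be sketched as below. The data layout (a list of `(residue_index, coordinates)` pairs and a dictionary of conservation frequencies per residue) and all function names are assumptions made for this example; the toy frequency vectors have two entries rather than twenty.

```python
import math


def nearest_non_gap_residue(voxel_center, atoms, gap_residue):
    """Return the residue owning the atom nearest to the voxel center,
    skipping every atom that belongs to the gap residue."""
    best_residue, best_distance = None, math.inf
    for residue_index, coords in atoms:
        if residue_index == gap_residue:
            continue  # atoms of the gap amino acid are disregarded
        distance = math.dist(voxel_center, coords)
        if distance < best_distance:
            best_residue, best_distance = residue_index, distance
    return best_residue


def profile_channel(voxel_centers, atoms, gap_residue, conservation):
    """Give each voxel the conservation frequencies of its nearest non-gap residue."""
    return [
        conservation[nearest_non_gap_residue(v, atoms, gap_residue)]
        for v in voxel_centers
    ]


atoms = [(0, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)), (2, (5.0, 0.0, 0.0))]
conservation = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}
voxels = [(0.9, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(profile_channel(voxels, atoms, gap_residue=1, conservation=conservation))
```

In this toy input the first voxel lies closest to residue 1, but because residue 1 is the gap amino acid the voxel is assigned the frequencies of residue 0 instead.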

42. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as annotation channels.

43. The computer-implemented method of clause 42, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding the atoms of the gap amino acid when determining the annotation channels.

44. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.

45. The computer-implemented method of clause 44, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding the atoms of the gap amino acid when determining the structural confidence channels.

46. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as additional input channels.

47. The computer-implemented method of clause 46, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding the atoms of the gap amino acid when determining the additional input channels.

48. The computer-implemented method of clause 9, wherein the proteome includes a human proteome and non-human proteomes, the non-human proteomes including non-human primate proteomes.

49. The computer-implemented method of clause 7, wherein unreachable alternate amino acid classes, which are limited by the reachability of a single nucleotide polymorphism (SNP) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes, are masked in the ground truth labels.

50. The computer-implemented method of clause 1, wherein the masked amino acid classes incur zero loss and do not contribute to gradient updates.

51. The computer-implemented method of clause 50, wherein the masked amino acid classes are identified in a lookup table.

52. The computer-implemented method of clause 51, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
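By way of illustration only, a lookup table of masked amino acid classes can be derived from the standard genetic code by enumerating every single-nucleotide substitution of a reference codon. The sketch below keys the table by reference codon, which is an assumption; the clauses key it by reference amino acid position. The names `reachable_amino_acids`, `masked_classes`, `MASK_LOOKUP`, and `masked_loss` are hypothetical, and the summed per-class loss is only one possible way to give masked classes zero loss.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; "*" marks stop codons.
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_BY_CODON)}
ALL_CLASSES = set(AA_BY_CODON) - {"*"}  # the 20 amino acid classes


def reachable_amino_acids(codon):
    """Amino acids obtainable from the codon by one nucleotide substitution."""
    reachable = set()
    for i, base in enumerate(codon):
        for alt in BASES:
            if alt != base:
                aa = CODON_TABLE[codon[:i] + alt + codon[i + 1:]]
                if aa != "*":
                    reachable.add(aa)
    return reachable


def masked_classes(codon):
    """Classes that are neither the reference nor reachable by a single SNP."""
    reference = CODON_TABLE[codon]
    return ALL_CLASSES - reachable_amino_acids(codon) - {reference}


# Lookup table: reference codon -> set of masked amino acid classes.
MASK_LOOKUP = {c: masked_classes(c) for c, aa in CODON_TABLE.items() if aa != "*"}


def masked_loss(per_class_loss, codon):
    """Sum per-class losses, skipping masked classes so they contribute nothing."""
    masked = MASK_LOOKUP[codon]
    return sum(loss for aa, loss in per_class_loss.items() if aa not in masked)


print(sorted(reachable_amino_acids("ATG")))
```

Because masked classes are simply left out of the sum, they also produce no gradient in any framework that differentiates through this loss.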

Clause Set 2

designating a particular amino acid of a particular amino acid class at a particular position in a protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;

determining, based at least in part on the gapped spatial representation, a pathogenicity of respective alternate amino acids at the particular position, the computer-implemented method comprising the foregoing steps.

7. The computer-implemented method of clause 1, wherein the respective alternate amino acids are respective combinatorially generated alternate amino acids produced by respective combinatorially generated nucleotide variants at the particular position.

8. The computer-implemented method of clause 1, wherein a pathogenicity predictor determines the pathogenicity of the respective alternate amino acids by:

processing, as input, the gapped spatial representation; and

generating, as output, respective pathogenicity scores for the respective amino acid classes.

9. The computer-implemented method of clause 8, wherein the pathogenicity predictor is trained on a training set.

10. The computer-implemented method of clause 9, wherein the training set has respective protein samples for respective positions in a proteome.

11. The computer-implemented method of clause 10, wherein the proteome has 10 million positions, and therefore the training set has 10 million protein samples.

12. The computer-implemented method of clause 11, wherein the respective protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions in the proteome as respective gap amino acids.

13. The computer-implemented method of clause 12, wherein the reference amino acids are major allele amino acids of the proteome.

14. The computer-implemented method of clause 13, wherein the pathogenicity predictor trains on a particular protein sample and estimates a pathogenicity of respective alternate amino acids for a particular reference amino acid at a particular position in the particular protein sample by:

processing, as input, a particular gapped spatial representation of the particular protein sample,

the particular gapped spatial representation being generated by using the amino acids at the remaining positions in the particular protein sample as non-gap amino acids; and

15. The computer-implemented method of clause 14, wherein each protein sample has respective ground truth labels for the respective amino acid classes.

16. The computer-implemented method of clause 15, wherein the respective ground truth labels include an absolute benign label for a reference amino acid class among the respective amino acid classes, and respective absolute pathogenic labels for respective alternate amino acid classes among the respective amino acid classes.

17. The computer-implemented method of clause 16, wherein the absolute benign label is zero.

18. The computer-implemented method of clause 17, wherein the absolute pathogenic labels are the same across the respective alternate amino acid classes.

19. The computer-implemented method of clause 18, wherein the absolute pathogenic labels are one.

20. The computer-implemented method of clause 1, wherein an error is determined based on a comparison of the pathogenicity score for the reference amino acid class against the absolute benign label, and

respective comparisons of the respective pathogenicity scores for the respective alternate amino acid classes against the respective absolute pathogenic labels.

21. The computer-implemented method of clause 20, wherein coefficients of the pathogenicity predictor are improved based on the error using a training technique.
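By way of illustration only, the ground truth labels and the error comparison recited above can be sketched as follows. The one-letter amino acid ordering, the function names, and the use of mean squared error are assumptions made for this example; the clauses do not name a specific error function or training technique.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 classes


def ground_truth_labels(reference_aa):
    """Label 0 (absolute benign) for the reference class, 1 (absolute pathogenic) otherwise."""
    return [0.0 if aa == reference_aa else 1.0 for aa in AMINO_ACIDS]


def mean_squared_error(scores, labels):
    """Compare each class score against its label; MSE is an illustrative choice."""
    return sum((s - l) ** 2 for s, l in zip(scores, labels)) / len(labels)


labels = ground_truth_labels("M")
uninformative_scores = [0.5] * len(AMINO_ACIDS)
print(mean_squared_error(uninformative_scores, labels))
```

A predictor that outputs 0.5 for every class is equally wrong on all twenty labels, which is why the printed error equals 0.25.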

22. The computer-implemented method of clause 21, wherein the pathogenicity predictor is trained with 10 million training iterations using 10 million protein samples.

23. The computer-implemented method of clause 8, wherein the respective amino acid classes correspond to the respective 20 naturally occurring amino acids.

24. The computer-implemented method of clause 23, wherein the respective amino acid classes correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

25. The computer-implemented method of clause 11, wherein the proteome has 1 million to 10 million positions, and therefore the training set has 1 million to 10 million protein samples, and wherein the pathogenicity predictor is trained with 1 million to 10 million training iterations using the 1 million to 10 million protein samples.

26. The computer-implemented method of clause 8, wherein the pathogenicity predictor generates a reference pathogenicity score for a first alternate amino acid of the reference amino acid class, and generates respective alternate pathogenicity scores for respective alternate amino acids of the respective alternate amino acid classes.

27. The computer-implemented method of clause 26, wherein respective final alternate pathogenicity scores for the respective alternate amino acids are the respective alternate pathogenicity scores.

28. The computer-implemented method of clause 27, wherein the respective final alternate pathogenicity scores for the respective alternate amino acids are based on respective combinations of the reference pathogenicity score and the respective alternate pathogenicity scores.

29. The computer-implemented method of clause 28, wherein the respective final alternate pathogenicity scores for the respective alternate amino acids are respective ratios of the respective alternate pathogenicity scores to the sum of the reference pathogenicity score and the respective alternate pathogenicity scores.

30. The computer-implemented method of clause 29, wherein the respective final alternate pathogenicity scores for the respective alternate amino acids are determined by subtracting the reference pathogenicity score from each of the respective alternate pathogenicity scores.

31. The computer-implemented method of clause 8, wherein the pathogenicity predictor has an output layer that generates the respective pathogenicity scores.

32. The computer-implemented method of clause 31, wherein the output layer is a normalization layer.

33. The computer-implemented method of clause 32, wherein the respective pathogenicity scores are normalized.

34. The computer-implemented method of clause 31, wherein the output layer is a softmax layer.

35. The computer-implemented method of clause 34, wherein the respective pathogenicity scores are exponentially normalized.

36. The computer-implemented method of clause 31, wherein the output layer has respective sigmoid units that respectively generate the respective pathogenicity scores.

37. The computer-implemented method of clause 31, wherein the respective pathogenicity scores are unnormalized.
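By way of illustration only, the difference between an exponentially normalized (softmax) output layer and an output layer of independent sigmoid units can be sketched as follows. The function names and the example input values are hypothetical.

```python
import math


def softmax(logits):
    """Exponentially normalize the scores so that they sum to one."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_units(logits):
    """Score each class independently; the outputs need not sum to one."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [1.0, 2.0, 3.0]
print(sum(softmax(logits)))
print(sum(sigmoid_units(logits)))
```

With softmax, raising the score of one class necessarily lowers the others; with sigmoid units, several alternate amino acid classes can all receive scores near one at the same time.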

48. The computer-implemented method of clause 10, wherein the proteome includes a human proteome and non-human proteomes, the non-human proteomes including non-human primate proteomes.

49. The computer-implemented method of clause 8, wherein unreachable alternate amino acid classes, which are limited by the reachability of a single nucleotide polymorphism (SNP) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes, are masked in the ground truth labels.

Clause Set 3

1. A computer-implemented method of generating training data for training a variant pathogenicity classifier, comprising:

accessing a plurality of amino acid positions in a proteome having a plurality of proteins;

designating major allele amino acids at the plurality of amino acid positions as reference amino acids of the plurality of proteins;

for each amino acid position in the plurality of amino acid positions,

classifying as benign variants those nucleotide substitutions that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein, and

classifying as pathogenic variants those nucleotide substitutions that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position, the alternate amino acids being different from the particular reference amino acid; and

training the variant pathogenicity classifier using the benign variants and the pathogenic variants as training data.
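By way of illustration only, the labeling scheme recited above (reference-to-reference substitutions labeled benign, all other substitutions labeled pathogenic) can be sketched at the amino acid level as follows. The proteome is assumed to be a dictionary mapping a protein identifier to its reference sequence, and the generator name and tuple layout are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acids


def generate_training_variants(proteome):
    """Yield (protein_id, position, reference, alternate, label) tuples.

    proteome: dict mapping a protein identifier to its reference sequence,
    where each residue is taken to be the major allele amino acid.
    """
    for protein_id, sequence in proteome.items():
        for position, reference in enumerate(sequence):
            for alternate in AMINO_ACIDS:
                label = "benign" if alternate == reference else "pathogenic"
                yield protein_id, position, reference, alternate, label


variants = list(generate_training_variants({"toy_protein": "MK"}))
print(len(variants))
```

Each position contributes one benign and 19 pathogenic examples, so the two-residue toy protein yields 40 labeled variants.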

2. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign.

3. The computer-implemented method of clause 2, wherein the variant pathogenicity classifier is trained to generate a pathogenicity score for the substitution.

4. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are pathogenic or benign.

5. The computer-implemented method of clause 4, wherein the variant pathogenicity classifier is trained to generate respective pathogenicity scores for the respective substitutions.

6. The computer-implemented method of clause 5, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

7. The computer-implemented method of clause 6, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

8. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given empty amino acid position in a protein is pathogenic or benign.

9. The computer-implemented method of clause 8, wherein the variant pathogenicity classifier is trained to generate a pathogenicity score for the insertion.

10. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given empty amino acid position in a protein are pathogenic or benign.

11. The computer-implemented method of clause 10, wherein the variant pathogenicity classifier is trained to generate respective pathogenicity scores for the respective insertions.

12. The computer-implemented method of clause 11, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

13. The computer-implemented method of clause 12, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

14. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is spatially tolerated by other amino acids of the protein.

15. The computer-implemented method of clause 14, wherein the variant pathogenicity classifier is trained to generate a spatial tolerance score for the substitution.

16. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are spatially tolerated by other amino acids of the protein.

17. The computer-implemented method of clause 16, wherein the variant pathogenicity classifier is trained to generate respective spatial tolerance scores for the respective substitutions.

18. The computer-implemented method of clause 17, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

19. The computer-implemented method of clause 18, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

20. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given empty amino acid position in a protein is spatially tolerated by other amino acids of the protein.

21. The computer-implemented method of clause 20, wherein the variant pathogenicity classifier is trained to generate a spatial tolerance score for the insertion.

22. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given empty amino acid position in a protein are spatially tolerated by other amino acids of the protein.

23. The computer-implemented method of clause 22, wherein the variant pathogenicity classifier is trained to generate respective spatial tolerance scores for the respective insertions.

24. The computer-implemented method of clause 23, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

25. The computer-implemented method of clause 24, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

26. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved.

27. The computer-implemented method of clause 26, wherein the variant pathogenicity classifier is trained to generate an evolutionary conservation score for the substitution.

28. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are evolutionarily conserved or non-conserved.

29. The computer-implemented method of clause 28, wherein the variant pathogenicity classifier is trained to generate respective evolutionary conservation scores for the respective substitutions.

30. The computer-implemented method of clause 29, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

31. The computer-implemented method of clause 30, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

32. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given empty amino acid position in a protein is evolutionarily conserved or non-conserved.

33. The computer-implemented method of clause 32, wherein the variant pathogenicity classifier is trained to generate an evolutionary conservation score for the insertion.

34. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given empty amino acid position in a protein are evolutionarily conserved or non-conserved.

35. The computer-implemented method of clause 34, wherein the variant pathogenicity classifier is trained to generate respective evolutionary conservation scores for the respective insertions.

36. The computer-implemented method of clause 35, wherein the respective amino acids correspond to the respective 20 naturally occurring amino acids.

37. The computer-implemented method of clause 36, wherein the respective amino acids correspond to respective naturally occurring amino acids from a subset of the 20 naturally occurring amino acids.

38. The computer-implemented method of clause 14, wherein the spatial tolerance corresponds to structural tolerance.

39. The computer-implemented method of clause 1, wherein the plurality of amino acid positions ranges from 1 million to 10 million amino acid positions.

40. The computer-implemented method of clause 1, wherein the plurality of amino acid positions ranges from 10 million to hundreds of millions of amino acid positions.

41. The computer-implemented method of clause 1, wherein the plurality of amino acid positions ranges from hundreds of millions to 1 billion amino acid positions.

42. The computer-implemented method of clause 1, wherein the plurality of amino acid positions ranges from 1 to 1 million amino acid positions.

43. The computer-implemented method of clause 1, wherein unreachable alternate amino acid classes, which are limited by the reachability of a single nucleotide polymorphism (SNP) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes, are masked in the ground truth labels.

44. The computer-implemented method of clause 1, wherein the masked amino acid classes incur zero loss and do not contribute to gradient updates.

45. The computer-implemented method of clause 44, wherein the masked amino acid classes are identified in a lookup table.

46. The computer-implemented method of clause 45, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.

Clause Set 4 (ILLM 1060-1)

generating a gapped spatial representation of the protein that includes the spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid;

determining, based at least in part on the gapped spatial representation, an evolutionary conservation at the particular position of respective amino acids of respective amino acid classes; and

determining, based at least in part on the evolutionary conservation of the respective amino acids, a pathogenicity of respective nucleotide variants that respectively substitute the particular amino acid with the respective amino acids in alternate representations of the protein, the computer-implemented method comprising the foregoing steps.

7. The computer-implemented method of clause 1, wherein an evolutionary conservation predictor determines the evolutionary conservation by:

generating, as output, respective evolutionary conservation scores for the respective amino acids.

8. The computer-implemented method of clause 7, wherein the respective evolutionary conservation scores are rankable by magnitude.

9. The computer-implemented method of clause 7, further comprising classifying a nucleotide variant as pathogenic when the evolutionary conservation score generated by the evolutionary conservation predictor for the corresponding amino acid substitution is below a threshold.

10. The computer-implemented method of clause 7, further comprising classifying a nucleotide variant as pathogenic when the evolutionary conservation score generated by the evolutionary conservation predictor for the corresponding amino acid substitution is below zero.

11. The computer-implemented method of clause 7, further comprising classifying a nucleotide variant as benign when the evolutionary conservation score generated by the evolutionary conservation predictor for the corresponding amino acid substitution is above a threshold.

12. The computer-implemented method of clause 7, further comprising classifying a nucleotide variant as benign when the evolutionary conservation score generated by the evolutionary conservation predictor for the corresponding amino acid substitution is non-zero.
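As a non-limiting sketch of clauses 8 to 11, the Python below ranks conservation scores by magnitude and applies a single threshold. The default threshold of zero follows clause 10; the treatment of a score exactly equal to the threshold as "undetermined" is an assumption, since the clauses only address scores below or above it. Both function names are hypothetical.

```python
def rank_by_magnitude(scores):
    """Return amino acid classes ordered from highest to lowest score."""
    return sorted(scores, key=lambda aa: scores[aa], reverse=True)


def classify_by_conservation(score, threshold=0.0):
    """Below the threshold: pathogenic. Above it: benign."""
    if score < threshold:
        return "pathogenic"
    if score > threshold:
        return "benign"
    return "undetermined"  # assumption: the clauses do not cover equality


example_scores = {"L": 0.7, "I": 0.2, "P": -0.4}  # made-up values
ranking = rank_by_magnitude(example_scores)
calls = {aa: classify_by_conservation(s) for aa, s in example_scores.items()}
```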

13. The computer-implemented method of clause 7, wherein the evolutionary conservation predictor is trained on a conserved training set and a non-conserved training set.

14. The computer-implemented method of clause 13, wherein the conserved training set has respective conserved protein samples for respective conserved amino acids at respective positions in a proteome, and

the non-conserved training set has respective non-conserved protein samples for respective non-conserved amino acids at the respective positions.

15. The computer-implemented method of clause 14, wherein each of the respective positions has a set of conserved amino acids and a set of non-conserved amino acids.

16. The computer-implemented method of clause 15, wherein a particular set of conserved amino acids for a particular position of a particular protein in the proteome includes at least one major allele amino acid observed at the particular position across a plurality of species.

17. The computer-implemented method of clause 16, wherein the particular set of conserved amino acids includes one or more minor allele amino acids observed at the particular position across the plurality of species.

18. The computer-implemented method of clause 17, wherein a particular set of non-conserved amino acids for the particular position includes the amino acids that are not in the particular set of conserved amino acids.

19. The computer-implemented method of clause 18, wherein the particular set of conserved amino acids and the particular set of non-conserved amino acids are identified based on evolutionary conservation profiles of homologous proteins of the plurality of species.

20. The computer-implemented method of clause 18, wherein the evolutionary conservation profiles of the homologous proteins are determined using a position-specific frequency matrix (PSFM).

21. The computer-implemented method of clause 18, wherein the evolutionary conservation profiles of the homologous proteins are determined using a position-specific scoring matrix (PSSM).
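One way to picture clauses 15 to 20 is the Python sketch below, which splits the twenty amino acids at a single position into conserved and non-conserved sets from one PSFM column. It assumes that "observed across species" means a frequency above a cutoff `min_freq` (zero by default) and that the major allele is the most frequent amino acid; the function name, the cutoff, and the example frequencies are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_by_conservation(psfm_column, min_freq=0.0):
    """Split one PSFM column into (major allele, conserved set, non-conserved set).

    psfm_column maps amino acid letters to frequencies across species.
    """
    conserved = {aa for aa in AMINO_ACIDS if psfm_column.get(aa, 0.0) > min_freq}
    if not conserved:
        raise ValueError("no amino acid exceeds min_freq at this position")
    major = max(conserved, key=lambda aa: psfm_column[aa])
    non_conserved = set(AMINO_ACIDS) - conserved
    return major, conserved, non_conserved


# Made-up column: leucine dominates, isoleucine and valine are minor alleles.
major, conserved, non_conserved = split_by_conservation(
    {"L": 0.90, "I": 0.08, "V": 0.02}
)
```

With these inputs the conserved set has C = 3 members and the non-conserved set has 20 - C = 17, which matches the counting used in clause 23.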

22. The computer-implemented method of clause 16, wherein the major allele amino acid is the reference amino acid.

23. The computer-implemented method of clause 14, wherein each of the respective positions has C conserved amino acids in its set of conserved amino acids,

each of the respective positions has NC non-conserved amino acids in its set of non-conserved amino acids, where NC = 20 - C,

the conserved training set has CP conserved protein samples, where CP = (number of the respective positions) * C, and

the non-conserved training set has NCP non-conserved protein samples, where NCP = (number of the respective positions) * (20 - C).

24. The computer-implemented method of clause 23, wherein C ranges from 1 to 10.

25. The computer-implemented method of clause 24, wherein C varies across the respective positions.

26. The computer-implemented method of clause 25, wherein C is the same for some of the respective positions.

27. The computer-implemented method of clause 14, wherein the respective conserved protein samples and non-conserved protein samples have respective gapped spatial representations generated by using the respective reference amino acids at the respective positions as the respective gap amino acids.

28. The computer-implemented method of clause 27, wherein the evolutionary conservation predictor trains on a particular conserved protein sample and estimates the evolutionary conservation of a particular conserved amino acid at a particular position in the particular conserved protein sample by:

processing, as input, a particular gapped spatial representation of the particular conserved protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as the gap amino acid and the amino acids remaining at the remaining positions of the particular conserved protein sample as non-gap amino acids; and

generating, as output, an evolutionary conservation score for the particular conserved amino acid.

29. The computer-implemented method of clause 28, wherein each conserved protein sample has a ground truth conserved label.

30. The computer-implemented method of clause 29, wherein the ground truth conserved label is an evolutionary conservation frequency.

31. The computer-implemented method of clause 29, wherein the ground truth conserved label is 1.

32. The computer-implemented method of clause 29, wherein the evolutionary conservation score for the particular conserved amino acid is compared against the ground truth conserved label to determine an error, and coefficients of the evolutionary conservation predictor are improved based on the error using a training technique.

33. The computer-implemented method of clause 32, wherein, when the particular conserved amino acid is the particular reference amino acid, the ground truth conserved label is masked and is not used to determine the error, and

the masking prevents the evolutionary conservation predictor from overfitting to the particular reference amino acid.

34. The computer-implemented method of clause 32, wherein the training technique is a loss-function-based gradient update technique.
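The masking described in clauses 32 to 34 can be pictured with the short Python sketch below. It assumes a squared-error loss, which the clauses do not specify, and shows that a masked class adds nothing to the loss and receives a zero gradient. The function name and the example numbers are hypothetical.

```python
def masked_squared_error(scores, labels, masked):
    """Return (loss, gradient with respect to scores), skipping masked classes.

    scores, labels: per-class predictions and ground truth values.
    masked: per-class booleans, True where the label must be ignored.
    """
    loss = 0.0
    grads = []
    for score, label, is_masked in zip(scores, labels, masked):
        if is_masked:
            grads.append(0.0)  # masked class: no loss, no gradient
        else:
            loss += (score - label) ** 2
            grads.append(2.0 * (score - label))
    return loss, grads


# Class 0 plays the role of the reference amino acid and is masked.
loss, grads = masked_squared_error(
    scores=[0.8, 0.3, 0.1], labels=[1.0, 0.0, 0.0], masked=[True, False, False]
)
```

Because the gradient entry for the masked class is exactly zero, a gradient update driven by this loss cannot pull the predictor toward reproducing the reference amino acid.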

35. The computer-implemented method of clause 27, wherein the evolutionary conservation predictor trains on a particular non-conserved protein sample and estimates the evolutionary conservation of a particular non-conserved amino acid at a particular position in the particular non-conserved protein sample by:

processing, as input, a particular gapped spatial representation of the particular non-conserved protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as the gap amino acid and the amino acids remaining at the remaining positions of the particular non-conserved protein sample as non-gap amino acids; and

generating, as output, an evolutionary conservation score for the particular non-conserved amino acid.

36. The computer-implemented method of clause 35, wherein each non-conserved protein sample has a ground truth non-conserved label.

37. The computer-implemented method of clause 35, wherein the ground truth non-conserved label is an evolutionary conservation frequency.

38. The computer-implemented method of clause 35, wherein the ground truth non-conserved label is 0.

39. The computer-implemented method of clause 35, wherein the evolutionary conservation score for the particular non-conserved amino acid is compared against the ground truth non-conserved label to determine an error, and coefficients of the evolutionary conservation predictor are improved based on the error using a training technique.

40. The computer-implemented method of clause 7, wherein the evolutionary conservation predictor is trained on a training set.

41. The computer-implemented method of clause 40, wherein the training set has respective protein samples for respective positions in a proteome.

42. The computer-implemented method of clause 41, wherein the respective protein samples have respective gapped spatial representations generated by using the respective reference amino acids at the respective positions as the respective gap amino acids.

43. The computer-implemented method of clause 42, wherein the evolutionary conservation predictor trains on a particular protein sample and estimates the evolutionary conservation of respective amino acids of respective amino acid classes at a particular position in the particular protein sample by:

processing, as input, a particular gapped spatial representation of the particular protein sample, wherein the particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as the gap amino acid and the amino acids remaining at the remaining positions of the particular protein sample as non-gap amino acids; and

44. The computer-implemented method of clause 43, wherein each protein sample has respective ground truth labels for the respective amino acids.

45. The computer-implemented method of clause 44, wherein the respective ground truth labels include one or more conserved labels for one or more conserved amino acids among the respective amino acids, and one or more non-conserved labels for one or more non-conserved amino acids among the respective amino acids.

46. The computer-implemented method of clause 45, wherein the conserved labels and the non-conserved labels have respective evolutionary conservation frequencies.

47. The computer-implemented method of clause 46, wherein the respective evolutionary conservation frequencies are rankable by magnitude.

48. The computer-implemented method of clause 46, wherein the conserved labels are 1 and the non-conserved labels are 0.

49. The computer-implemented method of clause 46, wherein an error is determined based on:

respective comparisons of the respective evolutionary conservation scores for the respective conserved amino acids against the respective conserved labels, and

respective comparisons of the respective evolutionary conservation scores for the respective non-conserved amino acids against the respective non-conserved labels.

50. The computer-implemented method of clause 49, wherein coefficients of the evolutionary conservation predictor are improved based on the error using a training technique.

51. The computer-implemented method of clause 50, wherein the conserved amino acids include the particular reference amino acid, and the conserved label for the particular reference amino acid is masked and is not used to determine the error, and

52. The computer-implemented method of clause 14, wherein the proteome has 1 to 10 million positions,

each of the 1 to 10 million positions has C conserved amino acids in its set of conserved amino acids,

each of the 1 to 10 million positions has NC non-conserved amino acids in its set of non-conserved amino acids, where NC = 20 - C,

the conserved training set has CP conserved protein samples, where CP = (1 to 10 million) * C, and

the non-conserved training set has NCP non-conserved protein samples, where NCP = (1 to 10 million) * (20 - C).

53. The computer-implemented method of clause 14, wherein the evolutionary conservation predictor is trained with 20 million to 200 million training iterations, including

1 million to 10 million training iterations with 1 million to 10 million conserved protein samples, and

19 million to 190 million training iterations with 19 million to 190 million non-conserved protein samples.
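The figures in clause 53 are consistent with the counting rule of clause 23 when C = 1 at every position, as the short Python check below illustrates. The assumption that C is constant across positions is made only for this arithmetic (clause 25 allows C to vary), and the function name is hypothetical.

```python
def training_set_sizes(num_positions, c, num_classes=20):
    """Return (CP, NCP) for a proteome where every position has C conserved classes."""
    cp = num_positions * c                   # conserved protein samples
    ncp = num_positions * (num_classes - c)  # non-conserved protein samples
    return cp, ncp


low = training_set_sizes(1_000_000, c=1)    # lower end of the stated range
high = training_set_sizes(10_000_000, c=1)  # upper end of the stated range
```

With one sample per training iteration, the totals are 1 million + 19 million = 20 million iterations at the lower end and 10 million + 190 million = 200 million at the upper end.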

54. The computer-implemented method of clause 14, wherein the proteome has 1 million to 10 million positions, so that the training set has 1 million to 10 million protein samples, and the evolutionary conservation predictor is trained with 1 million to 10 million training iterations using the 1 million to 10 million protein samples.

55. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of the amino acids whose atoms are nearest to the voxels.

56. The computer-implemented method of clause 55, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.

57. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of the respective amino acids whose atoms are nearest to the voxels.

58. The computer-implemented method of clause 57, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
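A simplified Python sketch of the idea in clauses 55 and 56 follows. For each voxel it finds the nearest atom while skipping atoms that belong to the gap amino acid, then writes a conservation frequency for the residue that owns that atom into a single channel. Reducing the evolutionary profile to one frequency per residue, the data layout, and the function name are all assumptions made for brevity.

```python
import math


def nearest_atom_channel(voxel_centers, atoms, conservation, gap_residue=None):
    """Per-voxel conservation value taken from the nearest non-gap atom.

    voxel_centers: list of (x, y, z) voxel centre coordinates.
    atoms: list of (residue_index, (x, y, z)) pairs.
    conservation: dict mapping residue_index to a conservation frequency.
    gap_residue: residue index whose atoms are ignored (the gap amino acid).
    """
    channel = []
    for center in voxel_centers:
        best_distance, best_residue = None, None
        for residue_index, position in atoms:
            if residue_index == gap_residue:
                continue  # atoms of the gap amino acid are ignored
            distance = math.dist(center, position)
            if best_distance is None or distance < best_distance:
                best_distance, best_residue = distance, residue_index
        channel.append(0.0 if best_residue is None else conservation[best_residue])
    return channel


atoms = [(5, (0.1, 0.0, 0.0)), (6, (1.0, 0.0, 0.0)), (7, (3.0, 0.0, 0.0))]
conservation = {5: 0.9, 6: 0.4, 7: 0.2}  # made-up frequencies
non_gapped = nearest_atom_channel([(0.0, 0.0, 0.0)], atoms, conservation)
gapped = nearest_atom_channel([(0.0, 0.0, 0.0)], atoms, conservation, gap_residue=5)
```

In the gapped call the voxel at the origin takes its value from residue 6 rather than residue 5, so nothing about the gap amino acid leaks into the channel.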

59. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as annotation channels.

60. The computer-implemented method of clause 59, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the annotation channels.

61. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.

62. The computer-implemented method of clause 61, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the structural confidence channels.

63. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.

64. The computer-implemented method of clause 63, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the structural confidence channels.

65. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as additional input channels.

66. The computer-implemented method of clause 65, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring the atoms of the gap amino acid when determining the additional input channels.

67. The computer-implemented method of clause 14, wherein the proteome includes a human proteome and non-human proteomes, the non-human proteomes including non-human primate proteomes.

68. The computer-implemented method of clause 1, wherein unreachable alternate amino acid classes, which are limited by the reachability of a single nucleotide polymorphism (SNP) to convert the reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes, are masked in the ground truth labels.

69. The computer-implemented method of clause 1, wherein masked amino acid classes incur zero loss and do not contribute to gradient updates.

70. The computer-implemented method of clause 69, wherein the masked amino acid classes are identified in a lookup table.

71. The computer-implemented method of clause 70, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.

Clause Set 5 (ILLM 1061-1)

1. A computer-implemented method of training a pathogenicity predictor, comprising:

accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome;

accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples;

generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples;

training the pathogenicity predictor over one or more training cycles and generating a trained pathogenicity predictor, wherein each training cycle uses, as training examples, gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations; and

using the trained pathogenicity classifier to determine the pathogenicity of variants.

2. The computer-implemented method of clause 1, wherein the respective gapped protein samples are labeled with respective gapped ground truth sequences.

3. The computer-implemented method of clause 2, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to the reference amino acid at a particular position in the particular gapped protein sample.

4. The computer-implemented method of clause 3, wherein the particular gapped protein sample has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position.

5. The computer-implemented method of clause 1, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant.

6. The computer-implemented method of clause 5, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant.

7. The computer-implemented method of clause 6, wherein the particular non-gapped benign protein sample is labeled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid.

8. The computer-implemented method of clause 7, wherein the benign ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids different from the benign alternate amino acid.

9. The computer-implemented method of clause 8, wherein the particular non-gapped pathogenic protein sample is labeled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid.

10. The computer-implemented method of clause 9, wherein the pathogenic ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids different from the pathogenic alternate amino acid.
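The labeling scheme of clauses 2 to 10 can be summarized with the Python sketch below, which builds a 20-entry ground truth sequence for each kind of sample. The numeric encoding (0.0 for benign, 1.0 for pathogenic, `None` for masked) and the function names are assumptions chosen for illustration; the clauses do not prescribe an encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BENIGN, PATHOGENIC, MASKED = 0.0, 1.0, None  # hypothetical encoding


def gapped_ground_truth(reference_aa):
    """Gapped sample: reference class is benign, every other class is pathogenic."""
    return [BENIGN if aa == reference_aa else PATHOGENIC for aa in AMINO_ACIDS]


def non_gapped_ground_truth(alternate_aa, is_benign):
    """Non-gapped sample: only the alternate class is labeled, the rest are masked."""
    value = BENIGN if is_benign else PATHOGENIC
    return [value if aa == alternate_aa else MASKED for aa in AMINO_ACIDS]


gapped = gapped_ground_truth("A")
benign_variant = non_gapped_ground_truth("C", is_benign=True)
pathogenic_variant = non_gapped_ground_truth("C", is_benign=False)
```

A gapped sample therefore supervises all twenty classes at once, whereas a non-gapped sample supervises only the class of the observed or simulated substitution.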

11. The computer-implemented method of clause 1, further comprising using a sample indicator to indicate to the pathogenicity predictor whether a current training example is a gapped spatial representation of a gapped protein sample or a non-gapped spatial representation of a non-gapped protein sample.

12. The computer-implemented method of clause 1, further comprising masking the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein sample.

13. The computer-implemented method of clause 1, wherein the non-gapped benign protein samples are derived from common human and non-human primate nucleotide variants.

14. The computer-implemented method of clause 1, wherein the non-gapped pathogenic protein samples are derived from combinatorially simulated nucleotide variants.

15. The computer-implemented method of clause 1, wherein the pathogenicity predictor generates an amino acid class-wise output sequence in response to processing a training example, and

the amino acid class-wise output sequence has amino acid class-wise pathogenicity scores.

16. The computer-implemented method of clause 1, further comprising measuring performance of the trained pathogenicity predictor between the training cycles on a validation set.

17. The computer-implemented method of clause 16, wherein the validation set includes, for each held-out protein sample, a pair of a gapped spatial representation and a non-gapped spatial representation.

18. The computer-implemented method of clause 1, wherein the trained pathogenicity predictor generates a first amino acid class-wise output sequence for the gapped spatial representation in the pair and a second amino acid class-wise output sequence for the non-gapped spatial representation in the pair, and a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in the held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences.

19. The computer-implemented method of clause 18, wherein the final pathogenicity score is based on an average of the first and second pathogenicity scores.
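As a minimal illustration of clauses 18 and 19, the Python sketch below averages the scores that the two class-wise output sequences assign to the substituted amino acid. An unweighted mean is assumed, and the function name and example scores are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def final_pathogenicity_score(gapped_output, non_gapped_output, alternate_aa):
    """Average the gapped and non-gapped scores for one amino acid substitution."""
    index = AMINO_ACIDS.index(alternate_aa)
    return 0.5 * (gapped_output[index] + non_gapped_output[index])


# Made-up class-wise outputs: uniform except for the cysteine ("C") entry.
gapped_output = [0.5] * 20
non_gapped_output = [0.5] * 20
gapped_output[AMINO_ACIDS.index("C")] = 0.9
non_gapped_output[AMINO_ACIDS.index("C")] = 0.7
score = final_pathogenicity_score(gapped_output, non_gapped_output, "C")
```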

20. The computer-implemented method of clause 1, wherein at least some of the training cycles use the same number of gapped spatial representations and non-gapped spatial representations.

21. The computer-implemented method of clause 1, wherein at least some of the training cycles use batches of training examples that have the same number of gapped spatial representations and non-gapped spatial representations.
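A balanced batch of the kind described in clause 21 could be assembled as in the Python sketch below. Each example is paired with a string tag that plays the role of the sample indicator of clause 11. The shuffling strategy, the tag values, and the function name are assumptions, not details taken from the clauses.

```python
import random


def balanced_batches(gapped, non_gapped, batch_size, seed=0):
    """Yield batches holding equal numbers of gapped and non-gapped examples."""
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be a positive even number")
    half = batch_size // 2
    rng = random.Random(seed)
    gapped, non_gapped = list(gapped), list(non_gapped)
    rng.shuffle(gapped)
    rng.shuffle(non_gapped)
    # Stop when the smaller set cannot fill its half of another batch.
    for b in range(min(len(gapped), len(non_gapped)) // half):
        lo, hi = b * half, (b + 1) * half
        batch = [("gapped", example) for example in gapped[lo:hi]]
        batch += [("non_gapped", example) for example in non_gapped[lo:hi]]
        rng.shuffle(batch)
        yield batch


batches = list(balanced_batches(range(10), range(100, 106), batch_size=4))
```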

22. The computer-implemented method of clause 1, wherein masked labels do not contribute to error determination and therefore do not contribute to the training of the pathogenicity predictor.

23. The computer-implemented method of clause 22, wherein the masked labels are zeroed out.

24. The computer-implemented method of clause 1, wherein the gapped spatial representations are weighted differently from the non-gapped spatial representations, such that the contribution of the gapped spatial representations to gradient updates applied to parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the non-gapped spatial representations varies from the contribution of the non-gapped spatial representations to gradient updates applied to the parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the non-gapped spatial representations.

25. The computer-implemented method of clause 24, wherein the variation is determined by predefined weights.
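One simple reading of clauses 24 and 25 is that each example's gradient is scaled by a predefined weight that depends on whether the example is gapped. The Python sketch below shows that scaling under this assumption; the weight values and the function name are hypothetical.

```python
def scale_gradient(gradient, is_gapped, gapped_weight=0.5, non_gapped_weight=1.0):
    """Scale one example's gradient by the predefined weight for its sample type."""
    weight = gapped_weight if is_gapped else non_gapped_weight
    return [weight * g for g in gradient]


raw_gradient = [1.0, -2.0]  # made-up per-parameter gradient
from_gapped = scale_gradient(raw_gradient, is_gapped=True)
from_non_gapped = scale_gradient(raw_gradient, is_gapped=False)
```

With these example weights, a gapped example moves the parameters half as far as a non-gapped example that produces the same raw gradient.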

26. A computer-implemented method of training a pathogenicity predictor, comprising:

starting with training a pathogenicity classifier on a gapped training set and generating a trained pathogenicity classifier;

further training the trained pathogenicity classifier on a non-gapped training set and generating a retrained pathogenicity classifier; and

using the retrained pathogenicity classifier to determine the pathogenicity of variants.

27. The computer-implemented method of clause 26, further comprising measuring performance of the trained pathogenicity predictor between training cycles on a first validation set that includes only non-gapped spatial representations of held-out protein samples.

28. The computer-implemented method of clause 27, further comprising measuring performance of the retrained pathogenicity predictor between training cycles on a second validation set that includes gapped spatial representations and non-gapped spatial representations of the held-out protein samples.

29. The computer-implemented method of clause 28, wherein the retrained pathogenicity predictor generates a first amino acid class-wise output sequence for a pair in response to processing the pair, and a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in the corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.

30. A computer-implemented method of training a pathogenicity predictor, comprising:

accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome, wherein each gapped protein sample is labeled with a respective gapped ground truth sequence, and a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for the particular amino acid class that corresponds to the reference amino acid at a particular position in the particular gapped protein sample, and respective pathogenic labels for the respective remaining amino acid classes that correspond to alternate amino acids at the particular position;

accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid substituted at a particular position by a benign nucleotide variant, a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid substituted at a particular position by a pathogenic nucleotide variant, the particular non-gapped benign protein sample is labeled with a benign ground truth sequence that has a benign label for the particular amino acid class corresponding to the benign alternate amino acid and respective masked labels for the respective remaining amino acid classes corresponding to amino acids different from the benign alternate amino acid, and the particular non-gapped pathogenic protein sample is labeled with a pathogenic ground truth sequence that has a pathogenic label for the particular amino acid class corresponding to the pathogenic alternate amino acid and respective masked labels for the respective remaining amino acid classes corresponding to amino acids different from the pathogenic alternate amino acid;
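
The labeling scheme of clause 30 can be sketched as twenty-element ground truth sequences, one element per amino acid class. The sketch below is an assumption-laden illustration: the alphabetical class ordering, the use of 0.0 and 1.0 for benign and pathogenic, and the use of `None` as the masked label are choices made here for readability and are not taken from the clause.

```python
# One-letter codes for the twenty amino acid classes (ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BENIGN, PATHOGENIC, MASKED = 0.0, 1.0, None  # MASKED encoding is hypothetical


def gapped_truth(reference_aa):
    """Ground truth for a gapped sample: the reference class is benign and
    each of the nineteen remaining classes is pathogenic."""
    return [BENIGN if aa == reference_aa else PATHOGENIC for aa in AMINO_ACIDS]


def non_gapped_truth(alternate_aa, is_pathogenic):
    """Ground truth for a non-gapped sample: only the class of the observed
    alternate amino acid carries a label; every other class is masked."""
    label = PATHOGENIC if is_pathogenic else BENIGN
    return [label if aa == alternate_aa else MASKED for aa in AMINO_ACIDS]
```

For example, `gapped_truth("A")` labels all twenty classes, whereas `non_gapped_truth("W", True)` labels a single class and masks the other nineteen, so a loss computed on a non-gapped sample would skip the masked entries.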

Clause Set 6

A computer-implemented method, comprising training a variant pathogenicity classifier on training data that includes spatial representations of protein samples, wherein the spatial representations are assigned ground truth benign labels corresponding to benign variants and ground truth pathogenic labels corresponding to pathogenic variants.

47. The computer-implemented method of clause 1, wherein the spatial representations are structural representations of protein structures of the protein samples.

48. The computer-implemented method of clause 1, wherein the spatial representations are encoded using voxelization.

Clause Set 7

accessing a spatial representation of a protein, wherein the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein;

removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and

2. The computer-implemented method of clause 1, wherein the removal of the particular spatial configuration is implemented by a script.

3. A computer-implemented method of determining pathogenicity of a nucleotide variant, comprising:

removing a particular amino acid at a particular position from a protein, thereby generating a gapped protein; and

determining the pathogenicity of the nucleotide variant based, at least in part, on the gapped protein and an alternate amino acid created by the nucleotide variant at the particular position.

4. The computer-implemented method of clause 3, wherein the major allele amino acid is a reference amino acid.

5. A system for predicting spatial tolerability of amino acid substitutions, comprising:

gapping logic configured to remove a particular amino acid at a particular position from a protein and create an amino acid vacancy at the particular position in the protein; and

substitution logic configured to process the protein with the amino acid vacancy and score the tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.

6. The system of clause 5, wherein the substitution logic is further configured to score the tolerability of the substitute amino acids based, at least in part, on structural compatibility between the substitute amino acids and adjacent amino acids in the neighborhood of the amino acid vacancy.

7. A computer-implemented method of determining pathogenicity of nucleotide variants, comprising:

determining the pathogenicity of respective alternate amino acids at a particular position based, at least in part, on a gapped spatial representation, wherein the respective alternate amino acids have respective amino acid classes that are different from a particular amino acid class.

8. A system for predicting evolutionary conservation of amino acid substitutions, comprising:

substitution logic configured to process a protein with an amino acid vacancy and score the evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.

9. The system of clause 8, wherein the substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based, at least in part, on structural compatibility between the substitute amino acids and adjacent amino acids in the neighborhood of the amino acid vacancy.

10. The system of clause 8, wherein the evolutionary conservation is scored using evolutionary conservation frequencies.

11. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM).

12. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM).
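
For readers unfamiliar with the matrices named in clauses 11 and 12, the following sketch shows one common way to derive them from a multiple sequence alignment. It is illustrative only: the function names are hypothetical, and the uniform background frequency and additive pseudocount used for the log-odds scores are simplifying assumptions, not details taken from this disclosure.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def psfm(aligned_sequences):
    """Position-specific frequency matrix: for each alignment column, the
    fraction of sequences carrying each amino acid (gaps are skipped)."""
    matrix = []
    for column in range(len(aligned_sequences[0])):
        residues = [s[column] for s in aligned_sequences if s[column] in AMINO_ACIDS]
        total = len(residues)
        matrix.append(
            {aa: (residues.count(aa) / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return matrix


def pssm(frequency_matrix, pseudocount=0.01):
    """Position-specific scoring matrix as log-odds against a uniform
    background; the pseudocount avoids taking the logarithm of zero."""
    background = 1.0 / len(AMINO_ACIDS)
    return [
        {aa: math.log2((f + pseudocount) / (background + pseudocount))
         for aa, f in column.items()}
        for column in frequency_matrix
    ]
```

In such a matrix, an amino acid observed at a position in every aligned sequence receives a frequency of 1.0 and a positive score, while an amino acid never observed there receives a frequency of 0.0 and a negative score.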

13. The system of clause 8, wherein the evolutionary conservation scores of the substitute amino acids are ranked by magnitude.

14. A system for predicting evolutionary conservation of amino acid substitutions, comprising:

evolutionary conservation prediction logic configured to process a protein with an amino acid vacancy and rank the evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.

15. A system for predicting structural tolerability of amino acid substitutions, comprising:

structural tolerability prediction logic configured to process a protein with an amino acid vacancy and rank the structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in the neighborhood of the amino acid vacancy.

16. A computer-implemented method of determining pathogenicity of nucleotide variants, comprising:

designating a particular amino acid at a particular position in a protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;

determining evolutionary conservation of an alternate amino acid at the particular position based, at least in part, on a gapped spatial representation and a representation of the alternate amino acid; and

determining the pathogenicity of a nucleotide variant that creates the alternate amino acid based, at least in part, on the evolutionary conservation.

Clause Set 8

removing a particular amino acid at a particular position from a spatial representation of a protein, thereby generating a gapped spatial representation of the protein; and

determining the pathogenicity of a nucleotide variant based, at least in part, on the gapped spatial representation of the protein and an alternate amino acid created by the nucleotide variant at the particular position.

gapping logic configured to remove a particular amino acid at a particular position from a spatial representation of a protein and create an amino acid vacancy at the particular position in the spatial representation of the protein; and

a system comprising substitution logic configured to process the spatial representation of the protein with the amino acid vacancy and score the tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.

A system comprising substitution logic configured to process a spatial representation of a protein with an amino acid vacancy and score the evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.

A system comprising evolutionary conservation prediction logic configured to process a spatial representation of a protein with an amino acid vacancy and rank the evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.

A system comprising structural tolerability prediction logic configured to process a spatial representation of a protein with an amino acid vacancy and rank the structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in the neighborhood of the amino acid vacancy.

While the present invention is disclosed by reference to the preferred implementations and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, and that such modifications and combinations will be within the spirit of the invention and the scope of the following claims.

Claims

1. A computer-implemented method of evaluating nucleotide variants as benign or pathogenic, comprising:
accessing a protein having respective amino acids at respective positions;
designating a particular amino acid at a particular position in the protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;
generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid; and
evaluating a nucleotide variant as benign or pathogenic using a neural network-based pathogenicity predictor based, at least in part, on the gapped spatial representation and a representation of an alternate amino acid created by the nucleotide variant at the particular position.

2. The computer-implemented method of claim 1, wherein the spatial configurations of the non-gap amino acids are encoded as distance channels per amino acid class,
the distance channel for each amino acid class has voxel-wise distance values for voxels in a plurality of voxels, and
the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.

3. The computer-implemented method of claim 1 or 2, wherein the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.

4. The computer-implemented method of any one of claims 1 to 3, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.

5. The computer-implemented method of any one of claims 1 to 4, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by ignoring spatial proximity between the corresponding voxels and the atoms of the gap amino acid.

6. The computer-implemented method of any one of claims 1 to 5, wherein the particular amino acid is a reference amino acid that is a major allele of the protein.

7. The computer-implemented method of any one of claims 1 to 6, wherein the neural network-based pathogenicity predictor evaluates the nucleotide variant as benign or pathogenic by processing, as input, the gapped spatial representation and the representation of the alternate amino acid; and
generating, as output, a pathogenicity score for the alternate amino acid.
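
The distance-channel encoding recited in claims 2 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions: each voxel stores the distance to the nearest atom of each amino acid class, atoms are given as hypothetical `(residue_position, amino_acid_class, (x, y, z))` tuples, and a cap of 10.0 is used as a placeholder value for classes with no atom in range. None of these choices is taken from the claims.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distance_channels(atoms, voxel_centers, gap_position, max_distance=10.0):
    """Return one distance channel per amino acid class.

    Each channel holds, for every voxel, the distance from the voxel center
    to the nearest atom of that class. Atoms belonging to the residue at
    gap_position are ignored, so the gap amino acid leaves no trace in the
    representation. Pass gap_position=None for a non-gapped representation.
    """
    channels = {aa: [max_distance] * len(voxel_centers) for aa in AMINO_ACIDS}
    for residue_position, aa, coordinate in atoms:
        if residue_position == gap_position:
            continue  # exclude the gap amino acid's spatial configuration
        for v, center in enumerate(voxel_centers):
            d = math.dist(center, coordinate)
            if d < channels[aa][v]:
                channels[aa][v] = d
    return channels
```

Because the gap residue is skipped rather than zeroed out, the voxels near the gap simply report the next-nearest atoms of each class, which is what lets a downstream predictor judge how well a candidate amino acid would fit the vacancy.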

8. The computer-implemented method of any one of claims 1-7, wherein the neural network-based pathogenicity predictor is trained on a benign training set.

9. The computer-implemented method of claim 8, wherein the benign training set has respective benign protein samples for respective reference amino acids at respective positions in a proteome.

10. The computer-implemented method of claim 9, wherein the reference amino acids are major allele amino acids of the proteome.

11. The computer-implemented method of claim 9 or 10, wherein the proteome has 10 million positions, and therefore the benign training set has 10 million benign protein samples.

12. The computer-implemented method of any one of claims 9 to 11, wherein each benign protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid.

13. The computer-implemented method of any one of claims 9 to 12, wherein each benign protein sample has, as its respective alternate amino acid, a respective representation of the respective reference amino acid.

14. The computer-implemented method of any one of claims 1 to 13, wherein the neural network-based pathogenicity predictor trains on a particular benign protein sample and evaluates a particular reference amino acid at a particular position in the particular benign protein sample as benign or pathogenic by:
processing, as input,
(i) a particular gapped spatial representation of the particular benign protein sample, wherein the particular gapped spatial representation is generated by
using the particular reference amino acid as the gap amino acid, and
using the remaining amino acids at the remaining positions in the particular benign protein sample as the non-gap amino acids, and
(ii) a representation of the particular reference amino acid as the particular alternate amino acid; and
generating, as output, a pathogenicity score for the particular reference amino acid.

15. The computer-implemented method of claim 14, wherein each benign protein sample has a ground truth benign label that indicates absolute benignness of the benign protein sample.

16. The computer-implemented method of claim 15, wherein the ground truth benign label is 0.

17. The computer-implemented method of claim 15 or 16, wherein the pathogenicity score for the particular reference amino acid is compared against the ground truth benign label to determine an error, and coefficients of the neural network-based pathogenicity predictor are improved based on the error using a training technique.

18. The computer-implemented method of any one of claims 1-17, wherein the neural network-based pathogenicity predictor is trained on a pathogenicity training set.

19. The computer-implemented method of claim 18, wherein the pathogenicity training set has a respective pathogenic protein sample for each combinatorially generated amino acid substitution for each reference amino acid at each position in the proteome.

20. The computer-implemented method of claim 19, wherein the combinatorially generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include a respective alternate amino acid of each amino acid class that is different from the particular amino acid class.
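
The sample construction described in claims 9 to 20 can be sketched as an enumeration over positions and amino acid classes. The sketch is illustrative only: the tuple layout and function name are hypothetical, a short string stands in for a proteome, and the labels 0 and 1 follow the ground truth values recited in claims 16 and 26.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def training_samples(sequence):
    """Yield (position, reference_aa, alternate_aa, label) tuples.

    For every position, the reference amino acid paired with itself gives one
    benign sample (label 0), and each of the nineteen other amino acid
    classes gives one combinatorially generated pathogenic sample (label 1).
    In each case the reference amino acid is the one treated as the gap.
    """
    for position, reference_aa in enumerate(sequence):
        for alternate_aa in AMINO_ACIDS:
            label = 0 if alternate_aa == reference_aa else 1
            yield position, reference_aa, alternate_aa, label
```

For a two-residue toy sequence such as `"MK"`, this yields 40 samples, of which 2 are benign and 38 are pathogenic, mirroring the 1-to-19 ratio of the training sets.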

21. The computer-implemented method of claim 19 or 20, wherein the proteome has 10 million positions and there are 19 combinatorially generated amino acid substitutions for each of the 10 million positions, and therefore the pathogenicity training set has 190 million pathogenic protein samples.

22. The computer-implemented method of any one of claims 19 to 21, wherein each pathogenic protein sample has a respective gapped spatial representation generated by using the respective reference amino acid as the respective gap amino acid.

23. The computer-implemented method of any one of claims 19 to 22, wherein each pathogenic protein sample has, as the respective alternate amino acid created by a respective combinatorially generated nucleotide variant at the respective position in the proteome, a respective representation of the respective combinatorially generated amino acid substitution.

24. The computer-implemented method of any one of claims 1 to 23, wherein the neural network-based pathogenicity predictor trains on a particular pathogenic protein sample and evaluates a particular combinatorially generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample as benign or pathogenic by:
processing, as input,
(i) a particular gapped spatial representation of the particular pathogenic protein sample, wherein the particular gapped spatial representation is generated by
using the particular reference amino acid as the gap amino acid, and
using the remaining amino acids at the remaining positions in the particular pathogenic protein sample as the non-gap amino acids, and
(ii) a representation of the particular combinatorially generated amino acid substitution as the particular alternate amino acid; and
generating, as output, a pathogenicity score for the particular combinatorially generated amino acid substitution.

25. The computer-implemented method of claim 24, wherein each pathogenic protein sample has a ground truth pathogenic label that indicates absolute pathogenicity of the pathogenic protein sample.

26. The computer-implemented method of claim 25, wherein the ground truth pathogenic label is 1.

27. The computer-implemented method of claim 25 or 26, wherein the pathogenicity score for the particular combinatorially generated amino acid substitution is compared against the ground truth pathogenic label to determine an error, and coefficients of the pathogenicity predictor are improved based on the error using a training technique.

28. The computer-implemented method of any one of claims 1 to 27, wherein the neural network-based pathogenicity predictor is trained over 200 million training iterations,
the 200 million training iterations including
10 million training iterations with the 10 million benign protein samples, and
190 million training iterations with the 190 million pathogenic protein samples.
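
The sample and iteration counts recited in claims 11, 21, and 28 are mutually consistent, as the following arithmetic shows. The symbols $N_{\text{benign}}$, $N_{\text{pathogenic}}$, and $N_{\text{iterations}}$ are introduced here for illustration only, and the sketch assumes one training iteration per protein sample.

```latex
\begin{aligned}
N_{\text{benign}} &= 10^{7} \\
N_{\text{pathogenic}} &= 19 \times 10^{7} = 1.9 \times 10^{8} \\
N_{\text{iterations}} &= N_{\text{benign}} + N_{\text{pathogenic}} = 2 \times 10^{8}
\end{aligned}
```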

29. A non-transitory computer-readable storage medium storing computer program instructions for evaluating nucleotide variants as benign or pathogenic, wherein the instructions, when executed on a processor, implement a method comprising:
accessing a protein having respective amino acids at respective positions;
designating a particular amino acid at a particular position in the protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;
generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid; and
evaluating a nucleotide variant as benign or pathogenic using a neural network-based pathogenicity predictor based, at least in part, on the gapped spatial representation and a representation of an alternate amino acid created by the nucleotide variant at the particular position.

30. A system including one or more processors coupled to memory,
the memory being loaded with computer instructions for evaluating nucleotide variants as benign or pathogenic, wherein the instructions, when executed on the processors, implement operations comprising:
accessing a protein having respective amino acids at respective positions;
designating a particular amino acid at a particular position in the protein as a gap amino acid, and designating the remaining amino acids at the remaining positions in the protein as non-gap amino acids;
generating a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids and excludes the spatial configuration of the gap amino acid; and
evaluating a nucleotide variant as benign or pathogenic using a neural network-based pathogenicity predictor based, at least in part, on the gapped spatial representation and a representation of an alternate amino acid created by the nucleotide variant at the particular position.