KR20200044731A

KR20200044731A - Deep learning-based technology for pre-training deep convolutional neural networks

Info

Publication number: KR20200044731A
Application number: KR1020197038080A
Authority: KR
Inventors: Hong Gao; Kai-How Farh; Samskruthi Reddy Padigepati
Original assignee: Illumina, Inc.
Priority date: 2018-10-15
Filing date: 2019-05-09
Publication date: 2020-04-29
Also published as: JP2023052011A; AU2021269351B2; AU2019272062A1; JP7200294B2; CN111328419B; JP2021152907A; WO2020081122A1; CN111328419A; IL282689A; SG10202108013QA; IL271091A; AU2019272062B2; JP2021501923A; IL282689B1; JP6888123B2; SG11201911777QA; CN113705585A; JP7515559B2; AU2021269351A1; IL271091B

Abstract

The disclosed technology includes systems and methods for reducing overfitting of neural-network-implemented models that process amino acid sequences and accompanying position frequency matrices (PFMs). The system generates benign-labeled supplemental training example sequence pairs that span from a start position, through a target amino acid position, to an end position. Each supplemental sequence pair supplements a pathogenic or benign missense training example sequence pair and has identical amino acids in its reference and alternate amino acid sequences. The system includes logic to input, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the benign or pathogenic missense example at the matching start and end positions. The system also includes logic to attenuate the training influence of the training PFMs while training the neural-network-implemented model by including the supplemental training example PFMs in the training data.
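The supplementation step described in the abstract can be illustrated with a minimal Python sketch. The `TrainingExample` type, the `make_supplemental` and `augment` helpers, and the 0/1 label encoding are hypothetical names introduced here for illustration only; they are not taken from the disclosure, and the toy PFM has a single column per position purely to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Label encoding is an assumption made for this sketch.
BENIGN, PATHOGENIC = 0, 1


@dataclass(frozen=True)
class TrainingExample:
    start: int                            # start position of the sequence window
    end: int                              # end position of the sequence window
    reference: str                        # reference amino acid sequence
    alternate: str                        # alternate amino acid sequence
    pfm: Tuple[Tuple[float, ...], ...]    # position frequency matrix, one row per position
    label: int


def make_supplemental(example: TrainingExample) -> TrainingExample:
    """Build a benign-labeled pair whose reference and alternate sequences are identical."""
    return TrainingExample(
        start=example.start,
        end=example.end,
        reference=example.reference,
        alternate=example.reference,  # same amino acids in both sequences
        pfm=example.pfm,              # same PFM as the missense example
        label=BENIGN,
    )


def augment(examples: List[TrainingExample]) -> List[TrainingExample]:
    """Return the original examples followed by one supplemental example each."""
    return examples + [make_supplemental(e) for e in examples]


if __name__ == "__main__":
    toy_pfm = tuple((1.0,) for _ in range(5))
    missense = TrainingExample(0, 4, "MKTAY", "MKRAY", toy_pfm, PATHOGENIC)
    for e in augment([missense]):
        print(e.reference, e.alternate, e.label)
```

Because the supplemental example shares its PFM with a pathogenic example but carries a benign label, the PFM alone no longer determines the label, which is the attenuation effect the abstract describes.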

Description

Deep learning-based technology for pre-training deep convolutional neural networks

Priority application

This application claims priority to U.S. Continuation-in-Part Patent Application No. 16/407,149, titled "DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US), which is a continuation-in-part of, and claims priority to, the following three PCT applications and three U.S. nonprovisional applications, all filed on October 15, 2018: (1) PCT Patent Application No. PCT/US2018/055840, titled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," filed October 15, 2018 (Attorney Docket No. ILLM 1000-8/IP-1611-PCT); (2) PCT Patent Application No. PCT/US2018/055878, titled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," filed October 15, 2018 (Attorney Docket No. ILLM 1000-9/IP-1612-PCT); (3) PCT Patent Application No. PCT/US2018/055881, titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," filed October 15, 2018 (Attorney Docket No. ILLM 1000-10/IP-1613-PCT); (4) U.S. Nonprovisional Patent Application No. 16/160,903, titled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," filed October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US); (5) U.S. Nonprovisional Patent Application No. 16/160,986, titled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," filed October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US); and (6) U.S. Nonprovisional Patent Application No. 16/160,968, titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," filed October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US). All three PCT applications and all three U.S. nonprovisional applications claim priority to and/or the benefit of the four U.S. provisional applications listed below.

U.S. Provisional Patent Application No. 62/573,144, titled "TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA," filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV).

U.S. Provisional Patent Application No. 62/573,149, titled "PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNS)," filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV).

U.S. Provisional Patent Application No. 62/573,153, titled "DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA," filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV).

U.S. Provisional Patent Application No. 62/582,898, titled "PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)," filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV).

Incorporations

The following are incorporated by reference for all purposes as if fully set forth herein:

U.S. Provisional Patent Application No. 62/573,144, titled "TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA," by Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV).

U.S. Provisional Patent Application No. 62/573,149, titled "PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNS)," by Laksshman Sundaram, Kai-How Farh, Hong Gao, Samskruthi Reddy Padigepati, and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV).

U.S. Provisional Patent Application No. 62/573,153, titled "DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA," by Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV).

U.S. Provisional Patent Application No. 62/582,898, titled "PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)," by Hong Gao, Kai-How Farh, and Laksshman Sundaram, filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV).

PCT Patent Application No. PCT/US18/55840, titled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," by Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-8/IP-1611-PCT).

PCT Patent Application No. PCT/US2018/55878, titled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," by Laksshman Sundaram, Kai-How Farh, Hong Gao, Samskruthi Reddy Padigepati, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-9/IP-1612-PCT).

PCT Patent Application No. PCT/US2018/55881, titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," by Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-10/IP-1613-PCT).

U.S. Nonprovisional Patent Application No. 16/160,903, titled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS," by Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US).

U.S. Nonprovisional Patent Application No. 16/160,986, titled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION," by Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US).

U.S. Nonprovisional Patent Application No. 16/160,968, titled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS," by Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, filed October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US).

Document 1 - A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO," arXiv:1609.03499, 2016;

Document 2 - S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH," arXiv:1702.07825, 2017;

Document 3 - F. Yu and V. Koltun, "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS," arXiv:1511.07122, 2016;

Document 4 - K. He, X. Zhang, S. Ren, and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION," arXiv:1512.03385, 2015;

Document 5 - R. K. Srivastava, K. Greff, and J. Schmidhuber, "HIGHWAY NETWORKS," arXiv:1505.00387, 2015;

Document 6 - G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "DENSELY CONNECTED CONVOLUTIONAL NETWORKS," arXiv:1608.06993, 2017;

Document 7 - C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "GOING DEEPER WITH CONVOLUTIONS," arXiv:1409.4842, 2014;

Document 8 - S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT," arXiv:1502.03167, 2015;

Document 9 - J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE," arXiv:1704.03669, 2017;

Document 10 - L. C. Piqueras, "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION," Tampere University of Technology, 2016;

Document 11 - J. Wu, "Introduction to Convolutional Neural Networks," Nanjing University, 2017;

Document 12 - I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "CONVOLUTIONAL NETWORKS," Deep Learning, MIT Press, 2016; and

Document 13 - J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang, "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS," arXiv:1512.07108, 2017.

Document 1 describes deep convolutional neural network architectures that use groups of residual blocks with convolution filters having the same convolution window size, batch normalization layers, rectified linear unit (ReLU) layers, dimensionality-altering layers, atrous convolution layers with exponentially growing atrous convolution rates, skip connections, and a softmax classification layer to accept an input sequence and produce an output sequence that scores entries in the input sequence. The disclosed technology uses neural network components and parameters described in Document 1. In one implementation, the disclosed technology modifies the parameters of the neural network components described in Document 1. For example, unlike in Document 1, the atrous convolution rate in the disclosed technology progresses non-exponentially from a lower residual block group to a higher residual block group. In another example, unlike in Document 1, the convolution window size in the disclosed technology varies between groups of residual blocks.

Document 2 describes details of the deep convolutional neural network architectures described in Document 1.

Document 3 describes atrous convolutions used by the disclosed technology. As used herein, atrous convolutions are also referred to as "dilated convolutions." Atrous/dilated convolutions allow for large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel so that neighboring input entries (e.g., nucleotides, amino acids) at larger intervals are considered when a convolution operation is performed. This enables incorporation of long-range contextual dependencies in the input. Atrous convolutions conserve partial convolution calculations for reuse as adjacent nucleotides are processed.
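To make the description of atrous/dilated convolution concrete, the following pure-Python sketch applies a one-dimensional kernel with a configurable dilation rate and computes the receptive field of a stack of such layers. The function names `dilated_conv1d` and `receptive_field` are illustrative assumptions, and the sketch uses "valid" padding and stride 1; it is not the implementation used by the disclosed technology.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], rate: int = 1) -> List[float]:
    """'Valid' 1D cross-correlation with a gap of `rate` between kernel taps."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    # Number of input positions spanned by one application of the kernel.
    span = (len(kernel) - 1) * rate + 1
    return [
        sum(w * x[i + j * rate] for j, w in enumerate(kernel))
        for i in range(len(x) - span + 1)
    ]


def receptive_field(kernel_size: int, rates: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions with the given rates."""
    return 1 + sum((kernel_size - 1) * r for r in rates)


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7]
    print(dilated_conv1d(signal, [1, 1, 1], rate=1))  # [6, 9, 12, 15, 18]
    print(dilated_conv1d(signal, [1, 1, 1], rate=2))  # [9, 12, 15]
    print(receptive_field(3, [1, 2, 4, 8]))           # 31
```

With a kernel of three taps, a rate of 2 covers five input positions per output while still using only three weights, which is the "large receptive field with few trainable parameters" property noted above.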

Document 4 describes residual blocks and residual connections used by the disclosed technology.

Document 5 describes skip connections used by the disclosed technology. As used herein, skip connections are also referred to as "highway networks."

Document 6 describes densely connected convolutional network architectures used by the disclosed technology.

Document 7 describes dimensionality-altering convolution layers and module-based processing pipelines used by the disclosed technology. One example of a dimensionality-altering convolution is a 1×1 convolution.
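As an illustration of how a 1×1 convolution alters dimensionality, the sketch below treats it as a linear map applied independently at every sequence position, changing the number of channels while leaving the sequence length unchanged. The function name `conv1x1` and the nested-list data layout are assumptions chosen for this example, and bias terms are omitted for brevity.

```python
from typing import List, Sequence


def conv1x1(x: Sequence[Sequence[float]], weights: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply a 1x1 convolution.

    x:       one entry per position, each holding C_in channel values.
    weights: C_out rows, each holding C_in weights.
    Returns one entry per position, each holding C_out channel values.
    """
    c_in = len(weights[0])
    if any(len(pos) != c_in for pos in x):
        raise ValueError("every position must have C_in channel values")
    # Each output channel is a weighted sum over the input channels at the same position.
    return [[sum(w * v for w, v in zip(row, pos)) for row in weights] for pos in x]


if __name__ == "__main__":
    features = [[1, 2, 3], [4, 5, 6]]   # 2 positions, 3 channels
    w = [[1, 0, 0], [0, 1, 1]]          # maps 3 channels to 2 channels
    print(conv1x1(features, w))         # [[1, 5], [4, 11]]
```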

Document 8 describes batch normalization layers used by the disclosed technology.

Document 9 also describes atrous/dilated convolutions used by the disclosed technology.

Document 10 describes various architectures of deep neural networks that can be used by the disclosed technology, including convolutional neural networks, deep convolutional neural networks, and deep convolutional neural networks with atrous/dilated convolutions.

Document 11 describes details of convolutional neural networks that can be used by the disclosed technology, including algorithms for training a convolutional neural network with subsampling layers (e.g., pooling) and fully connected layers.

Document 12 describes details of various convolution operations that can be used by the disclosed technology.

Document 13 describes various architectures of convolutional neural networks that can be used by the disclosed technology.

Technical field

The disclosed technology relates to artificial-intelligence-type computers and digital data processing systems, and to corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the disclosed technology relates to using deep learning-based techniques for training deep convolutional neural networks. More specifically, the disclosed technology relates to pre-training deep convolutional neural networks to avoid overfitting.

The subject matter discussed in this section should not be assumed to be prior art merely because it is mentioned in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

Machine learning

In machine learning, input variables are used to predict an output variable. The input variables, often called features, are denoted by $X = (X_1, X_2, \ldots, X_k)$, where each $X_i$, $i \in \{1, \ldots, k\}$, is a feature. The output variable, often called the response or dependent variable, is denoted by $Y_i$. The relationship between $Y$ and the corresponding $X$ can be written in the general form:

$$Y = f(X) + \epsilon$$

In the equation above, $f$ is a function of the features $(X_1, X_2, \ldots, X_k)$ and $\epsilon$ is the random error term. The error term is independent of $X$ and has a mean value of zero.

In practice, the features $X$ are available without having $Y$ or knowing the exact relationship between $X$ and $Y$. Since the error term has a mean value of zero, the goal is to estimate $f$:

$$\hat{Y} = \hat{f}(X)$$

In the equation above, $\hat{f}$ is the estimate of $f$. It is often treated as a black box, meaning that only the relationship between the input and output of $\hat{f}$ is known, while the question of why it works remains unanswered.

The function $\hat{f}$ is found using learning. Supervised learning and unsupervised learning are two approaches used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (= labels), the function $\hat{f}$ is optimized so that it approximates the output. In unsupervised learning, the goal is to find hidden structure in unlabeled data. The algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning.
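A minimal supervised-learning sketch can make the relationship between the true function, the error term, and the learned estimate concrete. The example below assumes, purely for illustration, that the true function is the line 2X + 1, that the error term is zero-mean Gaussian noise, and that the estimate is obtained by ordinary least squares on a single feature; none of these choices come from the disclosure.

```python
import random
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_labeled_data(n: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Labeled data Y = f(X) + eps with an assumed f(X) = 2X + 1 and zero-mean noise."""
    rng = random.Random(seed)
    xs = [10 * i / (n - 1) for i in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_labeled_data(200)
    slope, intercept = fit_line(xs, ys)
    # The estimate approximates the assumed f because the noise averages out.
    print(f"estimated f(X) = {slope:.3f} * X + {intercept:.3f}")
```

Because the noise has zero mean, the fitted slope and intercept land close to the assumed values, which is why estimating the function is a sensible goal even though each individual label is noisy.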

Neural networks

A neural network is a system of interconnected artificial neurons (e.g., a1, a2, a3) that exchange messages with each other. The illustrated neural network has three inputs, two neurons in the hidden layer, and two neurons in the output layer. The hidden layer has an activation function $f(\cdot)$ and the output layer has an activation function $g(\cdot)$. The connections have numeric weights (e.g., w11, w21, w12, w31, w22, w32, v11, v22) that are tuned during the training process, so that a properly trained network responds correctly when fed an image to recognize. The input layer processes the raw input, and the hidden layer processes the output from the input layer based on the weights of the connections between the input layer and the hidden layer. The output layer takes the output from the hidden layer and processes it based on the weights of the connections between the hidden layer and the output layer. The network includes multiple layers of feature-detecting neurons. Each layer has many neurons that respond to different combinations of inputs from the previous layers. These layers are constructed so that the first layer detects a set of primitive patterns in the input image data, the second layer detects patterns of patterns, and the third layer detects patterns of those patterns.
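The forward pass through a network of the size described above (three inputs, two hidden neurons, two output neurons) can be sketched as follows. The choice of a sigmoid for the hidden activation and a softmax for the output activation is an assumption made for this example, as are the function names and the weight-matrix layout; the text does not specify them.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Assumed hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: Sequence[float]) -> List[float]:
    """Assumed output-layer activation; subtracts the max for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Sequence[float],
            w: Sequence[Sequence[float]],
            v: Sequence[Sequence[float]]) -> List[float]:
    """w[j][i] weights input i into hidden neuron j; v[k][j] weights hidden j into output k."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return softmax([sum(vj * hj for vj, hj in zip(row, hidden)) for row in v])


if __name__ == "__main__":
    inputs = [0.5, -1.0, 2.0]                       # three inputs
    w = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]]         # 2 hidden neurons x 3 inputs
    v = [[1.0, -1.0], [-1.0, 1.0]]                  # 2 output neurons x 2 hidden neurons
    print(forward(inputs, w, v))
```

Training adjusts the entries of `w` and `v`; the forward pass itself only applies the current weights layer by layer.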

A neural network model is trained using training samples before it is used to predict outputs for production samples. The prediction quality of the trained model is assessed using a test set of samples that were not given as input during training. If the model correctly predicts the outputs for the test samples, it can be used for inference with high confidence. However, if the model does not correctly predict the outputs for the test samples, the model is said to be overfitted to the training data and to have failed to generalize to the unseen test data.
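The held-out evaluation described above can be illustrated with a deliberately extreme toy case. In the sketch below, a hypothetical `memorizer` model stores every training label in a lookup table and falls back to a default of 0 for unseen inputs, so it scores perfectly on the training set but poorly on the test set. The models, the parity-labeling task, and the helper names are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Sample = Tuple[int, int]  # (input, label)


def accuracy(model: Callable[[int], int], samples: List[Sample]) -> float:
    """Fraction of samples whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in samples) / len(samples)


def memorizer(train: List[Sample]) -> Callable[[int], int]:
    """A model that memorizes training labels and guesses 0 for anything unseen."""
    table = dict(train)
    return lambda x: table.get(x, 0)


def parity_rule(x: int) -> int:
    """A model that captures the underlying rule of this toy task."""
    return x % 2


if __name__ == "__main__":
    train = [(x, x % 2) for x in range(10)]
    test = [(x, x % 2) for x in range(10, 20)]   # never shown during training
    overfit = memorizer(train)
    print("memorizer:", accuracy(overfit, train), accuracy(overfit, test))
    print("rule     :", accuracy(parity_rule, train), accuracy(parity_rule, test))
```

A large gap between training accuracy and test accuracy, as with the memorizer here, is the usual signal that a model has overfitted.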

Surveys of the application of deep learning in genomics can be found in the following publications:

T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017;

Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep Learning For Computational Biology. Mol Syst Biol. 2016;12:878;

Park Y, Kellis M. 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825-826. (doi:10.1038/nbt.3313);

Min, S., Lee, B. & Yoon, S. Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016);

Leung MK, Delong A, Alipanahi B et al. Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets 2016; and

Libbrecht MW, Noble WS. Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015;16(6):321-32.

In the drawings, like reference characters generally refer to like parts throughout the different views. The drawings are not necessarily to scale; emphasis is instead generally placed on illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following drawings.
FIG. 1 is an architecture-level schematic of a system that reduces overfitting during training of a variant pathogenicity prediction model using supplemental training examples;
FIG. 2 shows an example architecture of a deep residual network for pathogenicity prediction, referred to herein as "PrimateAI";
FIG. 3 is a schematic illustration of PrimateAI, a deep learning network architecture for pathogenicity classification;
FIG. 4 shows one implementation of the operation of a convolutional neural network;
FIG. 5 is a block diagram of training a convolutional neural network in accordance with one implementation of the disclosed technology;
FIG. 6 presents an example missense variant and the corresponding supplemental benign training example;
FIG. 7 illustrates the disclosed pre-training of the pathogenicity prediction model using supplemental datasets;
FIG. 8 illustrates training of the pre-trained pathogenicity prediction model after the pre-training epochs;
FIG. 9 illustrates application of the trained pathogenicity prediction model to evaluate unlabeled variants;
FIG. 10 presents a position frequency matrix starting point for an example amino acid sequence with a pathogenic missense variant and the corresponding supplemental benign training example;
FIG. 11 presents a position frequency matrix starting point for an example amino acid sequence with a benign missense variant and the corresponding supplemental benign training example;
FIG. 12 illustrates construction of position frequency matrices for primate, mammal, and vertebrate amino acid sequences;
FIG. 13 presents example one-hot encodings of a human reference amino acid sequence and a human alternate amino acid sequence;
FIG. 14 presents examples of inputs to the variant pathogenicity prediction model; and
FIG. 15 is a simplified block diagram of a computer system that can be used to implement the disclosed technology.

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Introduction

Sections of this application are repeated from the applications incorporated by reference, to provide background for the improvement disclosed. The prior applications disclosed a deep learning system trained using non-human primate missense variant data, as explained below. Before providing that background, we introduce the improvement disclosed.

The inventors observed, empirically, that some patterns of training sometimes lead the deep learning system to overemphasize the position frequency matrix input. Overfitting to the position frequency matrices can diminish the ability of the system to distinguish an amino acid missense that is typically benign, such as R->K, from an amino acid missense that is typically deleterious, such as R->W. Supplementing the training set with specifically chosen training examples can reduce or counteract the overfitting and improve training outcomes.

Supplemental training examples, labeled benign, include the same position frequency matrices ("PFMs") as missense training examples, which may be unlabeled (and presumed pathogenic), labeled pathogenic, or labeled benign. The intuitive impact of these supplemental benign training examples is to force the backpropagation training to distinguish between benign and pathogenic on a basis other than the position frequency matrix.

The supplemental benign training example is constructed to contrast against a pathogenic or unlabeled example in the training set. The supplemental benign training example can also reinforce a benign missense example. For purposes of contrast, the pathogenic missense can be a curated pathogenic missense or a combinatorially generated example in the training set. The chosen benign variant can be a synonymous variant, which expresses the same amino acid from two different codons, that is, two different trinucleotide sequences that code for the same amino acid. When a synonymous benign variant is used, it is not randomly constructed; instead, it is selected from synonymous variants observed in a sequenced population. The synonymous variant is likely to be a human variant, because more sequence data is available for humans than for other primates, mammals, or vertebrates. Supplemental benign training examples have the same amino acid sequence in both the reference and the alternative amino acid sequences. Alternatively, the chosen benign variant can simply be at the same location as the training example against which it contrasts. This can potentially be as effective in counteracting overfitting as the use of synonymous benign variants.

Use of the supplemental benign training examples can be discontinued after the initial training epochs or can continue throughout training, because the examples accurately reflect nature.
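The construction described above can be illustrated with a minimal Python sketch. The `TrainingExample` type, the function names, the five-residue toy sequence, and the uniform toy PFM are all hypothetical and chosen only for illustration; the sketch merely assumes that a supplemental benign example reuses the PFM of the missense example it contrasts with, carries identical reference and alternative sequences, and is labeled benign.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# One row of 20 amino acid frequencies per sequence position (toy representation).
Pfm = Tuple[Tuple[float, ...], ...]


@dataclass(frozen=True)
class TrainingExample:
    """Hypothetical container for one training example sequence pair."""
    ref_seq: str   # reference amino acid sequence
    alt_seq: str   # alternative amino acid sequence
    pfm: Pfm       # position frequency matrix over the same start/end positions
    label: str     # "benign", "pathogenic", or "unlabeled"


def make_supplemental_benign(example: TrainingExample) -> TrainingExample:
    """Same PFM and positions, identical ref/alt sequences, labeled benign."""
    return replace(example, alt_seq=example.ref_seq, label="benign")


def add_supplemental_examples(training_set: List[TrainingExample]) -> List[TrainingExample]:
    """Append one supplemental benign example per original example."""
    return list(training_set) + [make_supplemental_benign(e) for e in training_set]


# Toy data: an R->W missense at the middle position of a 5-residue window.
toy_pfm: Pfm = tuple((0.05,) * 20 for _ in range(5))
missense = TrainingExample("MKRLV", "MKWLV", toy_pfm, "pathogenic")
supplemental = make_supplemental_benign(missense)
augmented = add_supplemental_examples([missense])
```

Because the missense example and its supplemental counterpart share one PFM but carry different labels, the PFM alone cannot separate them, which is the intended attenuating effect.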

Convolutional Neural Networks

As background, a convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns; in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant, and (2) they can learn spatial hierarchies of patterns.

Regarding the first property, after learning a certain pattern in the lower-right corner of a picture, a convolution layer can recognize it anywhere, for example in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convolutional neural networks data efficient, because they need fewer training samples to learn representations that have generalization power.

Regarding the second property, a first convolution layer can learn small local patterns such as edges, a second convolution layer can learn larger patterns made of the features of the first layer, and so on. This allows convolutional neural networks to efficiently learn increasingly complex and abstract visual concepts.

A convolutional neural network learns highly nonlinear mappings by interconnecting layers of artificial neurons, arranged in many different layers, with activation functions that make the layers dependent. It includes one or more convolution layers, interspersed with one or more sub-sampling layers and nonlinear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer. The convolutional neural network learns concurrently because the neurons in the same feature map have identical weights. These local shared weights reduce the complexity of the network, so that when multi-dimensional input data enters the network, the convolutional neural network avoids the complexity of data reconstruction in the feature extraction and regression or classification process.

Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level, for example, a single filter could encode the concept "presence of a face in the input".

For example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.

Convolutions are defined by two key parameters: (1) the size of the patches extracted from the inputs, typically 1 × 1, 3 × 3, or 5 × 5, and (2) the depth of the output feature map, that is, the number of filters computed by the convolution. Often these start with a depth of 32, continue to a depth of 64, and terminate with a depth of 128 or 256.

A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the 3D input feature map, stopping at every location, and extracting the 3D patch of surrounding features, of shape (window_height, window_width, input_depth). Each such 3D patch is then transformed, via a tensor product with the same learned weight matrix (called the convolution kernel), into a 1D vector of shape (output_depth,). All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3 × 3 windows, the vector output[i, j, :] comes from the 3D patch input[i-1:i+1, j-1:j+1, :] (index ranges inclusive). The full process is detailed in FIG. 4 (labeled as 400).
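The sliding-window operation can be sketched in plain Python as follows. This is an illustrative sketch rather than the disclosed implementation: it assumes no padding and a stride of 1, omits the bias term, and uses random toy values. The helper names `conv2d_valid` and `shape` are hypothetical. With no padding, output position (i, j) is computed from the input patch whose top-left corner is (i, j), which is why a 28 × 28 input and 3 × 3 windows give a 26 × 26 output.

```python
import random
from typing import List, Tuple

Tensor3D = List[List[List[float]]]  # indexed [row][column][channel]


def conv2d_valid(feature_map: Tensor3D, kernels: List[Tensor3D]) -> Tensor3D:
    """Slide every kernel over the input (no padding, stride 1)."""
    height, width = len(feature_map), len(feature_map[0])
    in_depth = len(feature_map[0][0])
    k_h, k_w = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = height - k_h + 1, width - k_w + 1
    output = [[[0.0] * len(kernels) for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for n, kernel in enumerate(kernels):
                # Same kernel weights at every (i, j): one patch -> one value per filter.
                output[i][j][n] = sum(
                    feature_map[i + di][j + dj][c] * kernel[di][dj][c]
                    for di in range(k_h)
                    for dj in range(k_w)
                    for c in range(in_depth)
                )
    return output


def shape(tensor: Tensor3D) -> Tuple[int, int, int]:
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


rng = random.Random(0)
# Toy input of size (28, 28, 1) and 32 random 3x3 filters with input depth 1.
image = [[[rng.random()] for _ in range(28)] for _ in range(28)]
filters = [[[[rng.uniform(-1.0, 1.0)] for _ in range(3)] for _ in range(3)]
           for _ in range(32)]
response = conv2d_valid(image, filters)
```

Here `shape(response)` is `(26, 26, 32)`, matching the feature map sizes in the example above.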

Convolutional neural networks comprise convolution layers, which perform the convolution operation between the input values and convolution filters (matrices of weights) that are learned over many gradient update iterations during training. Let $(m, n)$ be the filter size and $W$ be the matrix of weights; a convolution layer then performs a convolution of $W$ with the input $X$ by calculating the dot product $W \cdot x + b$, where $x$ is an instance of $X$ and $b$ is the bias. The step size by which the convolution filters slide across the input is called the stride, and the filter area ($m \times n$) is called the receptive field. The same convolution filter is applied across different positions of the input, which reduces the number of weights learned. It also allows location-invariant learning: if an important pattern exists in the input, the convolution filters learn it no matter where it is in the sequence.
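The roles of the dot product with a bias, the stride, and the shared filter can be illustrated in one dimension. The following sketch is illustrative only; the function name `conv1d`, the three-element filter, and the toy signals are assumptions chosen so that the same pattern appears at two different positions.

```python
from typing import List, Sequence


def conv1d(inputs: Sequence[float], weights: Sequence[float],
           bias: float = 0.0, stride: int = 1) -> List[float]:
    """Compute W . x + b for every window x, moving `stride` positions each time."""
    m = len(weights)  # receptive field of the filter
    return [
        sum(w * x for w, x in zip(weights, inputs[start:start + m])) + bias
        for start in range(0, len(inputs) - m + 1, stride)
    ]


pattern_filter = [1.0, -1.0, 1.0]            # one shared set of three weights
signal_a = [0, 1, -1, 1, 0, 0, 0, 0, 0]      # pattern starts at index 1
signal_b = [0, 0, 0, 0, 0, 1, -1, 1, 0]      # same pattern starts at index 5

resp_a = conv1d(signal_a, pattern_filter)
resp_b = conv1d(signal_b, pattern_filter)
resp_strided = conv1d(signal_a, pattern_filter, stride=2)
```

The peak response has the same value in `resp_a` and `resp_b` and only its position differs, which is the location invariance described above; a stride of 2 visits every other window and so produces a shorter output.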

Training a Convolutional Neural Network

As further background, FIG. 5 depicts a block diagram 500 of training a convolutional neural network in accordance with one implementation of the technology disclosed. The convolutional neural network is adjusted, or trained, so that the input data leads to a specific output estimate. The convolutional neural network is adjusted using backpropagation, based on a comparison of the output estimate and the ground truth, until the output estimate progressively matches or approaches the ground truth.

The convolutional neural network is trained by adjusting the weights between the neurons based on the difference between the ground truth and the actual output. For a weight $w_i$ applied to input $x_i$, this is mathematically described as:

$$\Delta w_i = x_i \, \delta$$

where $\delta = (\text{ground truth}) - (\text{actual output})$.

In one implementation, the training rule is defined as:

$$w_{nm} \leftarrow w_{nm} + \alpha \, (t_m - \varphi_m) \, a_n$$

In the equation above, the arrow indicates an update of the value, $t_m$ is the target value of neuron $m$, $\varphi_m$ is the computed current output of neuron $m$, $a_n$ is input $n$, and $\alpha$ is the learning rate.
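As a minimal sketch of this training rule, the following Python code repeatedly applies the update to a single neuron, moving each weight by the learning rate times the difference between target and current output times the corresponding input. The sketch assumes, for simplicity, a linear neuron whose output is the weighted sum of its inputs; the function names and toy values are hypothetical.

```python
from typing import List, Sequence


def neuron_output(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Current output of a single linear neuron (assumed for illustration)."""
    return sum(w * a for w, a in zip(weights, inputs))


def delta_rule_step(weights: Sequence[float], inputs: Sequence[float],
                    target: float, learning_rate: float) -> List[float]:
    """w_n <- w_n + learning_rate * (target - output) * a_n for every input n."""
    error = target - neuron_output(weights, inputs)
    return [w + learning_rate * error * a for w, a in zip(weights, inputs)]


weights = [0.0, 0.0]
inputs = [1.0, 2.0]
target = 1.0
for _ in range(20):
    weights = delta_rule_step(weights, inputs, target, learning_rate=0.1)
final_output = neuron_output(weights, inputs)
```

With these toy values the remaining error halves on every step, so `final_output` approaches the target of 1.0.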

An intermediate step in training includes generating a feature vector from the input data using the convolution layers. The gradient with respect to the weights in each layer is calculated, starting at the output. This is referred to as the backward pass, or going backwards. The weights in the network are updated using a combination of the negative gradient and the previous weights.

In one implementation, the convolutional neural network uses a stochastic gradient update algorithm (such as ADAM) that performs backpropagation of errors by means of gradient descent. One example of a sigmoid-function-based backpropagation algorithm is described below:

$$\varphi = f(h) = \frac{1}{1 + e^{-h}}$$

In the sigmoid function above, $h$ is the weighted sum computed by a neuron. The sigmoid function has the following derivative:

$$\frac{\partial \varphi}{\partial h} = \varphi \, (1 - \varphi)$$

The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass. The activation of neuron $m$ in the hidden layers, with inputs $a_n$ and weights $w_{nm}$, is described as:

$$\varphi_m = \frac{1}{1 + e^{-h_m}}, \qquad h_m = \sum_{n=1}^{N} a_n w_{nm}$$

This is done for all the hidden layers to get the activation of output neuron $k$, where $v_{mk}$ is the weight from hidden neuron $m$ to output neuron $k$:

$$\varphi_k = \frac{1}{1 + e^{-h_k}}, \qquad h_k = \sum_{m=1}^{M} \varphi_m v_{mk}$$

Then, the error and the corrected weights are calculated per layer. The error at the output is computed as:

$$\delta_{ok} = (t_k - \varphi_k) \, \varphi_k \, (1 - \varphi_k)$$

The error term for the hidden layers is calculated as:

$$\delta_{hm} = \varphi_m \, (1 - \varphi_m) \sum_{k=1}^{K} v_{mk} \, \delta_{ok}$$

The weights of the output layer are updated as:

$$v_{mk} \leftarrow v_{mk} + \alpha \, \delta_{ok} \, \varphi_m$$

The weights of the hidden layers are updated using the learning rate $\alpha$ as:

$$w_{nm} \leftarrow w_{nm} + \alpha \, \delta_{hm} \, a_n$$
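The forward pass, the per-layer error terms, and the weight updates can be sketched for a tiny network with one hidden layer. This is a sketch under stated assumptions, not the disclosed implementation: it assumes a squared-error objective and sigmoid activations in both layers, omits bias terms, and uses hypothetical names (`w` for input-to-hidden weights, `v` for hidden-to-output weights) and arbitrary toy values.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(h: float) -> float:
    return 1.0 / (1.0 + math.exp(-h))


def forward(a: List[float], w: Matrix, v: Matrix) -> Tuple[List[float], List[float]]:
    """w[n][m]: input n -> hidden m; v[m][k]: hidden m -> output k."""
    hidden = [sigmoid(sum(a[n] * w[n][m] for n in range(len(a))))
              for m in range(len(w[0]))]
    output = [sigmoid(sum(hidden[m] * v[m][k] for m in range(len(hidden))))
              for k in range(len(v[0]))]
    return hidden, output


def backprop_step(a: List[float], t: List[float], w: Matrix, v: Matrix,
                  lr: float) -> None:
    """One in-place update of both weight matrices for a single example."""
    hidden, output = forward(a, w, v)
    # Output error: (target - output) times the sigmoid derivative.
    delta_out = [(t[k] - output[k]) * output[k] * (1.0 - output[k])
                 for k in range(len(output))]
    # Hidden error: back-propagated output error times the sigmoid derivative.
    delta_hid = [hidden[m] * (1.0 - hidden[m])
                 * sum(v[m][k] * delta_out[k] for k in range(len(output)))
                 for m in range(len(hidden))]
    for m in range(len(hidden)):
        for k in range(len(output)):
            v[m][k] += lr * delta_out[k] * hidden[m]
    for n in range(len(a)):
        for m in range(len(hidden)):
            w[n][m] += lr * delta_hid[m] * a[n]


def squared_error(a: List[float], t: List[float], w: Matrix, v: Matrix) -> float:
    _, output = forward(a, w, v)
    return sum((tk - ok) ** 2 for tk, ok in zip(t, output))


inputs, targets = [1.0, 0.5], [1.0]
w = [[0.1, -0.2], [0.3, 0.4]]   # 2 inputs -> 2 hidden neurons
v = [[0.2], [-0.1]]             # 2 hidden neurons -> 1 output neuron
error_before = squared_error(inputs, targets, w, v)
for _ in range(200):
    backprop_step(inputs, targets, w, v, lr=0.5)
error_after = squared_error(inputs, targets, w, v)
```

The hidden error terms are computed before the output-layer weights are changed, so both updates use the weights from the same forward pass.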

In one implementation, the convolutional neural network uses a gradient descent optimization to compute the error across all the layers. In such an optimization, for an input feature vector $x$ and the predicted output $\hat{y}$, the loss function is defined as $\ell$ for the cost of predicting $\hat{y}$ when the target is $y$, i.e. $\ell(\hat{y}, y)$. The predicted output $\hat{y}$ is transformed from the input feature vector $x$ using function $f$. Function $f$ is parameterized by the weights of the convolutional neural network, i.e. $\hat{y} = f_w(x)$. The loss function is described as $\ell(\hat{y}, y) = \ell(f_w(x), y)$, or $Q(z, w) = \ell(f_w(x), y)$, where $z$ is an input and output data pair $(x, y)$. The gradient descent optimization is performed by updating the weights according to:

$$w_{t+1} = w_t - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$$

In the equation above, $\alpha$ is the learning rate. The loss is computed as the average over a set of $n$ data pairs. The computation is terminated when the learning rate $\alpha$ is small enough upon linear convergence. In another implementation, the gradient is calculated using only selected data pairs fed to Nesterov's accelerated gradient and an adaptive gradient to inject computational efficiency.
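As an illustration of updating the weights with the gradient of a loss averaged over n data pairs, the following sketch fits a one-parameter model f_w(x) = w * x. The model, the squared-error loss Q(z, w) = 0.5 * (w * x - y)^2, the function names, and the toy data are assumptions made for this example only.

```python
from typing import List, Tuple

DataPair = Tuple[float, float]  # z = (x, y)


def grad_q(z: DataPair, w: float) -> float:
    """Gradient of Q(z, w) = 0.5 * (w * x - y) ** 2 with respect to w."""
    x, y = z
    return (w * x - y) * x


def gradient_descent(data: List[DataPair], w0: float,
                     learning_rate: float, steps: int) -> float:
    """Each step uses the gradient averaged over all n data pairs."""
    w = w0
    for _ in range(steps):
        mean_grad = sum(grad_q(z, w) for z in data) / len(data)
        w -= learning_rate * mean_grad
    return w


# Toy data lying exactly on y = 3x, so the minimizing weight is 3.
toy_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w_fit = gradient_descent(toy_data, w0=0.0, learning_rate=0.1, steps=100)
```

Every update touches all n pairs, which is the cost that the stochastic variant avoids.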

In one implementation, the convolutional neural network uses stochastic gradient descent (SGD) to calculate the cost function. SGD approximates the gradient of the loss function with respect to the weights by computing it from only one randomized data pair $z_t$, described as:

$$v_{t+1} = \mu \, v_t - \alpha \, \nabla_w Q(z_t, w_t)$$

$$w_{t+1} = w_t + v_{t+1}$$

In the equations above, $\alpha$ is the learning rate, $\mu$ is the momentum, and $t$ is the current weight state before updating. The convergence speed of SGD is approximately $O(1/t)$ when the learning rate $\alpha$ is reduced both fast enough and slowly enough. In other implementations, the convolutional neural network uses different loss functions such as Euclidean loss and softmax loss. In a further implementation, the convolutional neural network uses an Adam stochastic optimizer.
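A momentum-based stochastic update, in which each step uses the gradient from a single randomly drawn data pair, can be sketched as follows. The one-parameter model f_w(x) = w * x, the squared-error loss, the function names, the toy data, and the chosen values of the learning rate and momentum are all assumptions for illustration.

```python
import random
from typing import List, Tuple

DataPair = Tuple[float, float]  # z = (x, y)


def grad_q(z: DataPair, w: float) -> float:
    """Gradient of Q(z, w) = 0.5 * (w * x - y) ** 2 with respect to w."""
    x, y = z
    return (w * x - y) * x


def sgd_momentum(data: List[DataPair], w0: float, learning_rate: float,
                 momentum: float, steps: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    w, velocity = w0, 0.0
    for _ in range(steps):
        z_t = rng.choice(data)  # one randomized data pair per update
        velocity = momentum * velocity - learning_rate * grad_q(z_t, w)
        w = w + velocity
    return w


# Toy data lying exactly on y = 3x, so every pair agrees that w = 3 is optimal.
toy_data = [(2.0, 6.0), (3.0, 9.0), (-2.0, -6.0)]
w_sgd = sgd_momentum(toy_data, w0=0.0, learning_rate=0.05, momentum=0.2, steps=500)
```

Because the toy data is noise-free, the iterates settle on w = 3 with a fixed learning rate; with noisy data the learning rate would typically be decayed over time, as the convergence remark above suggests.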

Additional disclosure and explanation of convolution layers, sub-sampling layers, and nonlinear layers is found in the applications incorporated by reference, along with convolution examples and an explanation of training by backpropagation. Architectural variations on basic CNN technology are also covered in the material incorporated by reference.

One variation on the iterative balanced sampling described previously is to select the entire elite training set in one or two cycles instead of twenty. Semi-supervised training may learn enough of a distinction between known benign training examples and reliably classified predicted pathogenic variants that just one or two training cycles, or three to five training cycles, are sufficient to assemble the elite training set. Modification of the disclosed methods and devices to describe just one cycle, two cycles, or a range of three to five cycles is hereby disclosed, and can be readily accomplished by converting the previously disclosed iteration to one, two, or three to five cycles.

Deep Learning in Genomics

Some important contributions of the applications incorporated by reference are repeated here. Genetic variations can help explain many diseases. Every human being has a unique genetic code, and there are many genetic variants within a group of individuals. Most of the deleterious genetic variants have been depleted from genomes by natural selection. It is important to identify which genetic variations are likely to be pathogenic or deleterious. This helps researchers focus on the likely pathogenic genetic variants and accelerates the pace of diagnosis and cure of many diseases.

Modeling the properties and functional effects (e.g., pathogenicity) of variants is an important but challenging task in the field of genomics. Despite the rapid advancement of functional genomic sequencing technologies, interpretation of the functional consequences of variants remains a great challenge due to the complexity of cell type-specific transcription regulation systems.

Advances in biochemical technologies over the past decades have given rise to next generation sequencing (NGS) platforms that quickly produce genomic data at much lower cost than ever before. Such overwhelmingly large volumes of sequenced DNA remain difficult to annotate. Supervised machine learning algorithms typically perform well when large amounts of labeled data are available. In bioinformatics and many other data-rich disciplines, the process of labeling instances is costly, whereas unlabeled instances are inexpensive and readily available. For a scenario in which the amount of labeled data is relatively small and the amount of unlabeled data is substantially larger, semi-supervised learning represents a cost-effective alternative to manual labeling.

An opportunity arises to use semi-supervised algorithms to construct deep learning-based pathogenicity classifiers that accurately predict the pathogenicity of variants. Databases of pathogenic variants that are free from human ascertainment bias may result.

Regarding pathogenicity classifiers, deep neural networks are a type of artificial neural network that uses multiple nonlinear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation, which carries the difference between observed and predicted output to adjust parameters. Deep neural networks have evolved with the availability of large training datasets, the power of parallel and distributed computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains such as computer vision, speech recognition, and natural language processing.

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are components of deep neural networks. Convolutional neural networks have succeeded particularly in image recognition, with an architecture that comprises convolution layers, nonlinear layers, and pooling layers. Recurrent neural networks are designed to use the sequential information of input data, with cyclic connections among building blocks such as perceptrons, long short-term memory units, and gated recurrent units. In addition, many other emergent deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multi-dimensional recurrent neural networks, and convolutional auto-encoders.

The goal of training deep neural networks is to optimize the weight parameters in each layer, which gradually combine simpler features into complex features so that the most suitable hierarchical representations can be learned from data. A single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass sequentially computes the output of each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures the error between the inferred outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network. Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent. Whereas batch gradient descent performs a parameter update for each complete dataset, stochastic gradient descent provides stochastic approximations by performing an update for each small set of data examples. Several optimization algorithms stem from stochastic gradient descent. For example, the Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on the update frequency and the moments of the gradients for each parameter, respectively.
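The idea of adapting the learning rate per parameter can be illustrated with a minimal, single-parameter sketch of the Adagrad update, in which the step is divided by the square root of the accumulated squared gradients. The function name, the quadratic toy objective (w - 3)^2, and the hyperparameter values are assumptions for illustration and do not come from the disclosure.

```python
import math
from typing import Callable


def adagrad_minimize(grad: Callable[[float], float], w0: float,
                     learning_rate: float, steps: int, eps: float = 1e-8) -> float:
    """Scale each step by 1 / sqrt(sum of squared past gradients)."""
    w, accumulated = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        accumulated += g * g
        w -= learning_rate * g / (math.sqrt(accumulated) + eps)
    return w


# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3).
w_adagrad = adagrad_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0,
                             learning_rate=1.0, steps=200)
```

The effective step size shrinks as squared gradients accumulate, so a parameter that has already received many large updates is adjusted more cautiously.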

Another core element in the training of deep neural networks is regularization, which refers to strategies intended to avoid overfitting and thus achieve good generalization performance. For example, weight decay adds a penalty term to the objective loss function so that the weight parameters converge to smaller absolute values. Dropout randomly removes hidden units from neural networks during training and can be considered an ensemble of possible subnetworks. To enhance the capabilities of dropout, a new activation function, maxout, and a variant of dropout for recurrent neural networks, called rnnDrop, have been proposed. Furthermore, batch normalization provides a new regularization method by normalizing the scalar features for each activation within a mini-batch and learning each mean and variance as parameters.
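Two of these regularization strategies can be sketched in a few lines of Python. The sketch assumes an L2 form of the weight decay penalty and the commonly used "inverted" dropout variant, in which surviving activations are rescaled during training; the function names and parameter values are hypothetical.

```python
import random
from typing import List, Sequence


def l2_penalty(weights: Sequence[float], decay: float) -> float:
    """Penalty term added to the loss; larger weights cost more."""
    return decay * sum(w * w for w in weights)


def dropout(activations: Sequence[float], drop_prob: float,
            rng: random.Random) -> List[float]:
    """Zero each unit with probability drop_prob (requires 0 <= drop_prob < 1).

    Surviving units are divided by the keep probability (inverted dropout,
    an assumption here) so the expected activation is unchanged.
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0], decay=0.1)
dropped = dropout([1.0] * 10, drop_prob=0.5, rng=rng)
untouched = dropout([1.0] * 10, drop_prob=0.0, rng=rng)
```

Each call to `dropout` samples a different subset of units, which is why training with dropout can be viewed as training an ensemble of subnetworks.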

Given that sequenced data are multi-dimensional and high-dimensional, deep neural networks have great promise for bioinformatics research because of their broad applicability and enhanced prediction power. Convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions. A hallmark of convolutional neural networks is the use of convolution filters. Unlike traditional classification approaches that are based on elaborately designed and manually crafted features, convolution filters perform adaptive learning of features, analogous to a process of mapping raw input data to an informative representation of knowledge. In this sense, the convolution filters serve as a series of motif scanners, since a set of such filters is capable of recognizing relevant patterns in the input and updating itself during the training procedure. Recurrent neural networks can capture long-range dependencies in sequential data of varying lengths, such as protein or DNA sequences.
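The view of a convolution filter as a motif scanner can be illustrated with a short sketch that one-hot encodes a DNA sequence and slides a single filter along it. The filter here is fixed by hand to the one-hot encoding of a "TATA" motif purely for illustration, whereas a trained network would learn its filter weights; the function names and the toy sequence are hypothetical.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """One row per base, one column per letter of ALPHABET."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in sequence]


def scan(sequence: str, motif_filter: List[List[float]]) -> List[float]:
    """Slide the filter along the sequence; each score is a dot product."""
    encoded = one_hot(sequence)
    width = len(motif_filter)
    return [
        sum(encoded[start + i][c] * motif_filter[i][c]
            for i in range(width) for c in range(len(ALPHABET)))
        for start in range(len(encoded) - width + 1)
    ]


tata_filter = one_hot("TATA")   # hand-set weights standing in for a learned filter
scores = scan("GCGTATAAGC", tata_filter)
best_position = scores.index(max(scores))
```

The highest score occurs at the window that starts at index 3, where the motif is located; the same filter weights are reused at every position.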

Therefore, a powerful computational model for predicting the pathogenicity of variants can have enormous benefits for both basic science and translational research.

Common polymorphisms represent natural experiments whose fitness has been tested by generations of natural selection. Comparing the allele frequency distributions for human missense and synonymous substitutions, we find that the presence of a missense variant at high allele frequency in a non-human primate species reliably predicts that the variant is also under neutral selection in the human population. In contrast, common variants in more distant species experience negative selection as evolutionary distance increases.

We use common variants from six non-human primate species to train a semi-supervised deep learning network that accurately classifies clinical de novo missense mutations using sequence alone. With more than 500 known species, the primate lineage contains sufficient common variation to systematically model the effects of most human variants of unknown significance.

The human reference genome harbors more than 70 million potential protein-altering missense substitutions, most of which are rare mutations whose effects on human health have not been characterized. These variants of unknown significance present a challenge for genome interpretation in clinical applications and are a roadblock to the long-term adoption of sequencing for population-wide screening and individualized medicine.

Cataloguing common variation across diverse human populations is an effective strategy for identifying clinically benign variation, but the common variation available in modern humans is limited by bottleneck events in our species' distant past. Humans and chimpanzees share 99% sequence identity, suggesting that natural selection operating on chimpanzee variants has the potential to model the effects of variants that are identical-by-state in humans. Since the mean coalescence time for neutral polymorphisms in the human population is a fraction of the species' divergence time, naturally occurring chimpanzee variation largely explores mutational space that does not overlap with human variation, aside from rare instances of haplotypes maintained by balancing selection.

The recent availability of aggregated exome data from 60,706 humans makes it possible to test this hypothesis by comparing the allele frequency spectra for missense and synonymous mutations. Singleton variants in ExAC closely match the expected 2.2:1 missense:synonymous ratio predicted by de novo mutation after adjusting for mutation rate using trinucleotide context, but at higher allele frequencies the number of observed missense variants decreases because natural selection filters out deleterious variants. The pattern of missense:synonymous ratios across the allele frequency spectrum indicates that a large fraction of missense variants with population frequency below 0.1% are mildly deleterious, that is, neither pathogenic enough to warrant immediate removal from the population nor neutral enough to be allowed to exist at high allele frequencies, consistent with prior observations on more limited population data. These findings support the widespread empirical practice by diagnostic laboratories of filtering out variants with allele frequency greater than 0.1% to 1% as likely benign for penetrant genetic disease, aside from a handful of well-documented exceptions caused by balancing selection and founder effects.
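The comparison described above amounts to counting missense and synonymous variants within allele-frequency bins and taking their ratio per bin. The following is a minimal sketch of that bookkeeping; the bin edges, labels, and the `(allele_frequency, consequence)` input format are hypothetical choices for illustration, not the binning used in the disclosure.

```python
from collections import defaultdict

# Hypothetical allele-frequency bins, expressed as fractions rather than percent.
BINS = [
    (0.0, 0.001, "<0.1%"),
    (0.001, 0.01, "0.1%-1%"),
    (0.01, 1.0001, ">1%"),
]


def bin_label(allele_frequency):
    """Return the label of the bin that contains the given allele frequency."""
    for low, high, label in BINS:
        if low <= allele_frequency < high:
            return label
    raise ValueError(f"allele frequency out of range: {allele_frequency}")


def missense_synonymous_ratio(variants):
    """variants: iterable of (allele_frequency, consequence) pairs.
    Returns {bin label: missense/synonymous ratio, or None if no synonymous}."""
    counts = defaultdict(lambda: {"missense": 0, "synonymous": 0})
    for allele_frequency, consequence in variants:
        if consequence in ("missense", "synonymous"):
            counts[bin_label(allele_frequency)][consequence] += 1
    ratios = {}
    for label, c in counts.items():
        ratios[label] = c["missense"] / c["synonymous"] if c["synonymous"] else None
    return ratios


if __name__ == "__main__":
    toy = [
        (0.0005, "missense"), (0.0005, "missense"), (0.0005, "synonymous"),
        (0.05, "missense"), (0.05, "synonymous"),
    ]
    print(missense_synonymous_ratio(toy))
```

A ratio that falls as allele frequency rises, as in the toy input, is the signature of selection removing deleterious missense variants; a flat ratio indicates variants that behave neutrally.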

Repeating this analysis with the subset of human variants that are identical-by-state with common chimpanzee variants (observed more than once in chimpanzee population sequencing), we find that the missense:synonymous ratio is largely constant across the allele frequency spectrum. The high allele frequency of these variants in the chimpanzee population indicates that they have already passed through the sieve of natural selection in chimpanzee, and their neutral impact on fitness in human populations provides compelling evidence that the selective pressures on missense variants are highly concordant in the two species. The lower missense:synonymous ratio observed in chimpanzee is consistent with the larger effective population size in ancestral chimpanzee populations, which enables more efficient filtering of mildly deleterious variants.

In contrast, rare chimpanzee variants (observed only once in chimpanzee population sequencing) show a modest decrease in missense:synonymous ratio at higher allele frequencies. Simulating an identically sized cohort from human variation data, we estimate that only 64% of the variants observed once in a cohort of this size would have an allele frequency greater than 0.1% in the general population, compared to 99.8% for the variants seen multiple times in the cohort, indicating that not all of the rare chimpanzee variants have passed through the sieve of selection. Overall, we estimate that 16% of the ascertained chimpanzee missense variants have an allele frequency below 0.1% in the general population and would be subject to negative selection at higher allele frequencies.

We next characterize human variants that are identical-by-state with variation observed in other non-human primate species (bonobo, gorilla, orangutan, rhesus, and marmoset). Similar to chimpanzee, we observe that missense:synonymous ratios are roughly equivalent across the allele frequency spectrum, other than a slight depletion of missense variation at high allele frequencies, which would be anticipated from the inclusion of a small number of rare variants (about 5% to 15%). These results imply that the selective forces on missense variants are largely concordant within the primate lineage at least out to the new world monkeys, which are estimated to have diverged from the human ancestral lineage about 35 million years ago.

Human missense variants that are identical-by-state with variants in other primates are strongly enriched for benign consequence in ClinVar. After excluding variants with unknown or conflicting annotations, we observe that human variants with primate orthologs are approximately 95% likely to be annotated as benign or likely benign in ClinVar, compared to 45% for missense variation in general. The small fraction of ClinVar variants from non-human primates that are classified as pathogenic is comparable to the fraction of pathogenic ClinVar variants that would be observed by ascertaining rare variants from a similarly sized cohort of healthy humans. A substantial fraction of these variants annotated as pathogenic or likely pathogenic received their classifications before the advent of large allele frequency databases and might be classified differently today.

The field of human genomics has long relied on model organisms to infer the clinical impact of human mutations, but the long evolutionary distance to most genetically tractable animal models raises concerns about the extent to which these findings generalize back to humans. To examine the concordance of natural selection on missense variants in human and more distant species, we extend the analysis beyond the primate lineage to include largely common variation from four additional mammalian species (mouse, pig, goat, cow) and two species of more distant vertebrates (chicken, zebrafish). In contrast to the prior primate analyses, we observe that missense variation is markedly depleted at common allele frequencies compared to rare allele frequencies, especially at greater evolutionary distances, indicating that a substantial fraction of common missense variation in more distant species would experience negative selection in human populations. Nonetheless, the observation of a missense variant in more distant vertebrates still increases the likelihood of benign consequence, because the fraction of common missense variants depleted by natural selection is far smaller than the approximately 50% depletion for human missense variants at baseline. Consistent with these results, we find that human missense variants that have been observed in mouse, dog, pig, and cow are approximately 85% likely to be annotated as benign or likely benign in ClinVar, compared to 95% for primate variation and 45% for the ClinVar database as a whole.

The presence of closely related pairs of species at varying evolutionary distances also provides an opportunity to evaluate the functional consequences of fixed missense substitutions in human populations. Within closely related pairs of species (branch length < 0.1) on the mammalian family tree, we observe that fixed missense variation is depleted at common allele frequencies compared to rare allele frequencies, indicating that a substantial fraction of inter-species fixed substitutions would be non-neutral in human, even within the primate lineage. A comparison of the magnitude of missense depletion indicates that inter-species fixed substitutions are significantly less neutral than within-species polymorphisms. Intriguingly, inter-species variation between closely related mammals is not substantially more pathogenic in ClinVar (83% likely to be annotated as benign or likely benign) than within-species common polymorphisms, suggesting that these changes do not abrogate protein function but rather reflect tuning of protein function that confers species-specific adaptive advantages.

The large number of possible variants of unknown significance and the crucial importance of accurate variant classification for clinical applications have inspired multiple attempts to tackle the problem with machine learning, but these efforts have largely been limited by the insufficient quantity of common human variants and the uncertain quality of annotations in curated databases. Variation from the six non-human primates contributes more than 300,000 unique missense variants that do not overlap with common human variation and are largely of benign consequence, greatly enlarging the training dataset that can be used for machine learning approaches.

Unlike earlier models that employ a large number of human-engineered features and meta-classifiers, we apply a simple deep learning residual network that takes as input only the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species. To provide the network with information about protein structure, we train two separate networks to learn secondary structure and solvent accessibility from sequence alone, and incorporate these as sub-networks in the larger deep learning network to predict effects on protein structure. Using sequence as a starting point avoids potential biases in protein structure and functional domain annotation, which may be incompletely ascertained or inconsistently applied.

We use semi-supervised learning to overcome the problem of a training set that contains only variants with benign labels, by initially training an ensemble of networks to separate likely benign primate variants from random unknown variants that are matched for mutation rate and sequencing coverage. This ensemble of networks is used to score the complete set of unknown variants and to influence the selection of unknown variants that seed the next iteration of the classifier, by biasing toward unknown variants with more pathogenic predicted consequence and taking gradual steps at each iteration to prevent the model from prematurely converging to a suboptimal result.

Common primate variation also provides a clean validation dataset for evaluating existing methods that is completely independent of previously used training data, which has been hard to evaluate objectively because of the proliferation of meta-classifiers. We evaluated the performance of our model, along with four other classification algorithms (SIFT, PolyPhen-2, CADD, M-CAP), using 10,000 withheld primate common variants. Because roughly 50% of all human missense variants would be removed by natural selection at common allele frequencies, we calculated the 50th-percentile score for each classifier on a set of randomly picked missense variants that were matched to the 10,000 withheld primate common variants by mutation rate, and used that threshold to evaluate the withheld primate common variants. The accuracy of our deep learning model was significantly better than that of the other classifiers on this independent validation dataset, whether using deep learning networks trained only on human common variants or using both human common variants and primate variants.
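The thresholding procedure described above can be sketched as follows. This is an illustrative reading rather than the evaluation code behind the reported results: the function name is hypothetical, and it assumes that a higher score means "more pathogenic", so a withheld primate variant counts as correctly classified when it scores below the median of the matched random set.

```python
from statistics import median


def benign_call_rate(matched_random_scores, withheld_primate_scores):
    """Use the 50th-percentile score of mutation-rate-matched random missense
    variants as the threshold, then report the fraction of withheld primate
    common variants that fall below it (i.e., are called benign).
    Assumes higher score = more pathogenic."""
    threshold = median(matched_random_scores)
    benign = sum(1 for score in withheld_primate_scores if score < threshold)
    return threshold, benign / len(withheld_primate_scores)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    matched = [0.1, 0.3, 0.5, 0.7, 0.9]
    withheld = [0.1, 0.2, 0.4, 0.6]
    print(benign_call_rate(matched, withheld))
```

Because each classifier gets its own median threshold, classifiers with different score scales can be compared on the same footing.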

Recent trio sequencing studies have catalogued thousands of de novo mutations in patients with neurodevelopmental disorders and their healthy siblings, enabling assessment of the strength of various classification algorithms in separating de novo missense mutations in cases versus controls. For each of the four classification algorithms, we scored each de novo missense variant in cases versus controls and report the p-value from the Wilcoxon rank-sum test of the difference between the two distributions, showing that the deep learning method trained on primate variants (p ~ 10^-33) performed far better than the other classifiers (p ~ 10^-13 to 10^-19) in this clinical scenario. From the approximately 1.3-fold enrichment of de novo missense variants over expectation previously reported for this cohort, and prior estimates that about 20% of missense variants produce loss-of-function effects, we would expect a perfect classifier to separate the two classes with a p-value of p ~ 10^-40.
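For readers unfamiliar with the test named above, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It is a simplification for illustration: ties receive average ranks but the variance is not tie-corrected, the function name is hypothetical, and a production analysis would normally rely on an established statistics library.

```python
import math


def rank_sum_test(cases, controls):
    """Two-sided Wilcoxon rank-sum test (normal approximation).
    Returns (rank sum of cases, z statistic, p-value)."""
    pooled = sorted([(v, 0) for v in cases] + [(v, 1) for v in controls])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(cases), len(controls)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, z, p_value


if __name__ == "__main__":
    # Toy classifier scores, invented for illustration only.
    print(rank_sum_test([5, 6, 7, 8], [1, 2, 3, 4]))
```

A classifier that assigns systematically higher scores to variants from cases than from controls yields a large positive z and a small p-value, which is how the classifiers above are being compared.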

The accuracy of the deep learning classifier scales with the size of the training dataset, and variation data from each of the six primate species independently contributes to boosting the accuracy of the classifier. The large number and diversity of extant non-human primate species, along with evidence that the selective pressures on protein-altering variants are largely concordant within the primate lineage, suggests systematic primate population sequencing as an effective strategy for classifying the millions of human variants of unknown significance that currently limit clinical genome interpretation. Of the 504 known non-human primate species, roughly 60% face extinction due to hunting and habitat loss, which gives urgency to a worldwide conservation effort that would benefit both these unique and irreplaceable species and our own.

Although less whole-genome data than exome data is available, which limits the power to detect the impact of natural selection in deep intronic regions, we were also able to calculate the observed versus expected counts of cryptic splice mutations far from exonic regions. Overall, we observe a 60% depletion of cryptic splice mutations at a distance of more than 50 nt from an exon-intron boundary. The attenuated signal may be a combination of the smaller sample size of whole-genome data compared to exome data and the greater difficulty of predicting the impact of deep intronic variants.
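A depletion figure of this kind is conventionally derived from observed and expected counts as one minus their ratio. The sketch below assumes that definition; the counts in the example are invented to show the arithmetic and are not values from the disclosure.

```python
def depletion(observed, expected):
    """Fractional depletion of observed counts relative to expectation:
    0.0 means no depletion, 1.0 means the variant class is entirely absent."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical counts: 40 variants observed where 100 were expected.
    print(f"{depletion(40, 100):.0%}")
```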

Terms

All literature and similar material cited in this application, including, but not limited to, patents, patent applications, articles, books, treatises, and web pages, regardless of the format of such literature and similar materials, are expressly incorporated by reference in their entirety. In the event that one or more of the incorporated literature and similar materials differs from or contradicts this application, including but not limited to defined terms, term usage, described techniques, or the like, this application controls.

As used herein, the following terms have the meanings indicated.

A base refers to a nucleotide base or nucleotide: A (adenine), C (cytosine), T (thymine), or G (guanine).

This application uses the terms "protein" and "translated sequence" interchangeably.

This application uses the terms "codon" and "base triplet" interchangeably.

This application uses the terms "amino acid" and "translated unit" interchangeably.

This application uses the phrases "variant pathogenicity classifier", "convolutional neural network-based classifier for variant classification", and "deep convolutional neural network-based classifier for variant classification" interchangeably.

The term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed herein.

The term "site" refers to a unique position (e.g., chromosome ID, chromosome position, and orientation) on a reference genome. In some implementations, a site may be a residue, a sequence tag, or the position of a segment on a sequence. The term "locus" may be used to refer to the specific location of a nucleic acid sequence or polymorphism on a reference chromosome.

The term "sample" herein refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism containing a nucleic acid, or from a mixture of nucleic acids containing at least one nucleic acid sequence that is to be sequenced and/or phased. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, a blood fraction, fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, tissue explant, organ culture, and any other tissue or cell preparation, or a fraction or derivative thereof or isolated therefrom. Although the sample is often taken from a human subject (e.g., a patient), samples can be taken from any organism having chromosomes, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids, and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc.

The term "sequence" includes or represents a strand of nucleotides coupled to each other. The nucleotides may be based on DNA or RNA. It should be understood that one sequence may include multiple sub-sequences. For example, a single sequence (e.g., of a PCR amplicon) may have 350 nucleotides. The sample read may include multiple sub-sequences within these 350 nucleotides. For instance, the sample read may include first and second flanking sub-sequences having, for example, 20 to 50 nucleotides. The first and second flanking sub-sequences may be located on either side of a repetitive segment having a corresponding sub-sequence (e.g., 40 to 100 nucleotides). Each of the flanking sub-sequences may include a primer sub-sequence (e.g., 10 to 30 nucleotides) or a portion of a primer sub-sequence. For ease of reading, the term "sub-sequence" will be referred to as "sequence," but it is understood that two sequences are not necessarily separate from each other on a common strand. To differentiate the various sequences described herein, the sequences may be given different labels (e.g., target sequence, primer sequence, flanking sequence, reference sequence, and the like). Other terms, such as "allele," may be given different labels to differentiate between like objects.

The term "paired-end sequencing" refers to sequencing methods that sequence both ends of a target fragment. Paired-end sequencing may facilitate detection of genomic rearrangements and repetitive segments, as well as gene fusions and novel transcripts. Methodology for paired-end sequencing is described in PCT publication WO07010252, PCT application Serial No. PCTGB2007/003798, and US patent application publication US 2009/0088327, each of which is incorporated by reference herein. In one example, a series of operations may be performed as follows: (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first sequencing primer and carry out repeated cycles of extension, scanning, and deblocking, as set forth above; (d) "invert" the target nucleic acids on the flow cell surface by synthesizing a complementary copy; (e) linearize the resynthesized strand; and (f) hybridize a second sequencing primer and carry out repeated cycles of extension, scanning, and deblocking, as set forth above. The inversion operation can be carried out by delivering reagents as set forth above for a single cycle of bridge amplification.

The term "reference genome" or "reference sequence" refers to any particular known genome sequence, whether partial or complete, of any organism, which may be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. A genome includes both the genes and the non-coding sequences of the DNA. The reference sequence may be larger than the reads that are aligned to it. For example, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one example, the reference genome sequence is that of a full-length human genome. In another example, the reference genome sequence is limited to a specific human chromosome, such as chromosome 13. In some implementations, a reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences, although the term reference genome is intended to cover such sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species. In various implementations, the reference genome is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.

"리드"라는 용어는, 뉴클레오타이드 샘플 또는 참조의 분획물을 기술하는 서 열 데이터의 수집을 지칭한다. "리드"이라는 용어는 샘플 리드 및/또는 참조 리드를 지칭할 수 있다. 통상적으로, 반드시 그런 것은 아니지만, 리드는 샘플 또는 참조에서의 연속 염기쌍의 짧은 서열을 나타낸다. 리드는 샘플 또는 참조 분획물의 (ATCG로 된) 염기쌍 서열에 의해 상징적으로 표현될 수 있다. 리드는, 리드가 참조 서열과 일치하는지 또는 다른 기준을 충족하는지를 결정하도록 메모리 장치에 저장될 수 있고 적절하게 처리될 수 있다. 리드는, 서열분석 장치로부터 직접 또는 샘플에 관한 저장된 서열 정보로부터 간접적으로 취득될 수 있다. 일부 경우에, 리드는, 더 큰 서열 또는 영역을 확인하도록 사용될 수 있는, 예를 들어, 염색체 또는 게놈 영역 또는 유전자에 정렬되고 특정하게 할당될 수 있는 충분한 길이(예를 들어, 적어도 약 25bp)의 DNA 서열이다.The term “lead” refers to the collection of sequence data describing a nucleotide sample or fraction of a reference. The term "lead" may refer to a sample lead and / or a reference lead. Typically, but not necessarily, reads represent short sequences of contiguous base pairs in a sample or reference. Reads can be symbolically represented by base pair sequences (in ATCG) of a sample or reference fraction. The reads can be stored in a memory device and processed appropriately to determine whether the reads match the reference sequence or meet other criteria. Reads can be obtained directly from a sequencing device or indirectly from stored sequence information about a sample. In some cases, reads are of sufficient length (eg, at least about 25 bp) that can be used to identify larger sequences or regions, for example, aligned and specifically assigned to chromosomal or genomic regions or genes. It is a DNA sequence.

Next-generation sequencing methods include, for example, sequencing by synthesis technology (Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD sequencing). Depending on the sequencing method, the length of each read may vary from about 30 bp to more than 10,000 bp. For example, the Illumina sequencing method using a SOLiD sequencer generates nucleic acid reads of about 50 bp. In another example, Ion Torrent sequencing generates nucleic acid reads of up to 400 bp, and 454 pyrosequencing generates nucleic acid reads of about 700 bp. In yet another example, single-molecule real-time sequencing methods may generate reads of 10,000 bp to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence reads have a length of 30 bp to 100 bp, 50 bp to 200 bp, or 50 bp to 400 bp.

"샘플 리드", "샘플 서열" 또는 "샘플 분획물"이라는 용어는 샘플로부터의 관심 게놈 서열에 대한 서열 데이터를 지칭한다. 예를 들어, 샘플 리드는, 순방향 및 역방향 프라이머 서열을 갖는 PCR 앰플리콘으로부터의 서열 데이터를 포함한다. 서열 데이터는 임의의 선택 서열 방법으로부터 취득될 수 있다. 샘플 리드는, 예를 들어, 합성에 의한 서열분석(sequencing by synthesis: SBS) 반응, 결찰에 의한 서열분석 반응, 또는 다른 임의의 적합한 서열분석 방법으로부터 발생하는 것일 수 있으며, 이를 위해 이미지의 요소의 길이 및/또는 동일성을 결정하는 것이 필요하다. 샘플 리드는, 다수의 샘플리드로부터 유도된 컨센서스(예를 들어, 평균 또는 가중) 서열일 수 있다. 소정의 구현예에서, 참조 서열을 제공하는 것은, PCR 앰플리콘의 프라이머 서열에 기초하여 관심 좌위를 식별하는 것을 포함한다.The terms "sample read", "sample sequence" or "sample fraction" refer to sequence data for a genomic sequence of interest from a sample. For example, sample reads contain sequence data from PCR amplicons with forward and reverse primer sequences. Sequence data can be obtained from any selected sequence method. Sample reads may be, for example, from sequencing by synthesis (SBS) reactions, sequencing reactions by ligation, or from any other suitable sequencing method, for which It is necessary to determine the length and / or identity. The sample read can be a consensus (eg, average or weighted) sequence derived from multiple sample leads. In certain embodiments, providing a reference sequence includes identifying a locus of interest based on the primer sequence of the PCR amplicon.

"원시 분획물"이라는 용어는, 샘플 리드 또는 샘플 분획물 내의 관심있는 지정된 위치 또는 이차 위치와 적어도 부분적으로 중복되는 관심 게놈 서열의 일부에 대한 서열 데이터를 지칭한다. 원시 분획물의 비제한적인 예로는, 이중 스티치 분획물, 단일 스티치 분획물, 이중 언스티치 분획물, 및 단일 언스티치 분획물을 포함한다. "원시"라는 용어는, 원시 분획물이 샘플 리드의 잠재적 변이체에 대응하고 이러한 잠재적 변이체를 인증 또는 확인하는 변이체를 나타내는지의 여부에 관계없이, 원시 분획물이 샘플 리드에서 시열 데이터와 일부 관계가 있는 서열 데이터를 포함한다는 것을 나타내는 데 사용된다. "원시 분획물"이라는 용어는, 분획물이 반드시 샘플 리드에서 변이체 콜(variant call)을 유효성 확인하는 지지 변이체를 포함한다는 것을 나타내지는 않는다. 예를 들어, 제1 변이체를 나타내기 위해 변이체 콜 애플리케이션에 의해 샘플 리드가 결정될 때, 변이체 콜 애플리케이션은, 하나 이상의 원시 분획물이 다른 경우엔 샘플 리드의 변이체가 주어지는 경우 발생할 것으로 예상될 수 있는 대응 유형의 "지지" 변이체를 갖지 않는다고 결정할 수 있다.The term "raw fraction" refers to sequence data for a portion of a genomic sequence of interest that at least partially overlaps a designated or secondary position of interest in a sample lead or sample fraction. Non-limiting examples of raw fractions include double stitch fractions, single stitch fractions, double unstitch fractions, and single unstitch fractions. The term “raw” refers to sequence data in which the raw fraction is partially related to the sequence data in the sample read, regardless of whether the raw fraction corresponds to a potential variant of the sample read and indicates a variant that certifies or identifies such potential variant. It is used to indicate that it contains. The term "raw fraction" does not indicate that the fraction necessarily includes a supporting variant that validates the variant call in the sample read. For example, when a sample read is determined by a variant call application to indicate a first variant, the variant call application can be expected to occur if one or more primitive fractions are different, which may occur if a variant of the sample read is given. It can be determined that it does not have a "support" variant of.

"맵핑", "정렬된", "정렬", 또는 "정렬하는"이라는 용어는, 리드 또는 태그를 참조 서열과 비교하여 참조 서열이 리드 서열을 포함하는지를 결정하는 프로세스를 지칭한다. 참조 서열이 리드를 포함하는 경우, 리드는, 참조 서열에 맵핑될 수 있고, 또는 특정 구현예에서 참조 서열의 특정 위치에 맵핑될 수 있다. 일부 경우에, 정렬은, 리드가 특정 참조 서열의 구성원인지 여부(즉, 리드가 참조 서열에 존재하는지 또는 부재하는지)를 단순히 알려준다. 예를 들어, 인간 염색체 13에 대한 참조 서열에 대한 리드의 정렬은, 염색체 13에 대한 참조 서열에 리드가 존재하는지의 여부를 알려줄 것이다. 이 정보를 제공하는 도구를 세트 멤버쉽 테스터라고 한다. 일부 경우에, 정렬은, 리드 태그가 맵핑되는 참조 서열의 위치를 추가로 나타낸다. 예를 들어, 참조 서열이 전체 인간 게놈 서열인 경우, 정렬은, 리드가 염색체 13에 존재함을 나타내고, 리드가 특정 가닥 및/또는 염색체 13의 부위에 있음을 추가로 나타낼 수 있다.The terms “mapping”, “aligned”, “aligned”, or “aligning” refer to the process of comparing a read or tag to a reference sequence to determine whether the reference sequence comprises a lead sequence. If the reference sequence comprises a read, the read can be mapped to a reference sequence, or in certain embodiments to a specific position in the reference sequence. In some cases, alignment simply indicates whether the read is a member of a particular reference sequence (ie, whether the read is present or absent from the reference sequence). For example, alignment of a read with respect to a reference sequence for human chromosome 13 will indicate whether a read is present in the reference sequence for chromosome 13. The tool that provides this information is called the set membership tester. In some cases, alignment further indicates the position of the reference sequence to which the read tag is mapped. For example, if the reference sequence is an entire human genome sequence, alignment may further indicate that the read is on chromosome 13 and further indicate that the read is at a site of a particular strand and / or chromosome 13.

"인델"(indel)이라는 용어는, 유기체의 DNA에서의 염기의 삽입 및/또는 삭제를 지칭한다. 마이크로-인델은, 1개 내지 50개 뉴클레오타이드의 순 변화를 초래하는 인델을 나타낸다. 게놈의 코딩 영역에서, 인델의 길이가 3의 배수가 아닌 한, 이것은 프레임시프트 돌연변이를 생성할 것이다. 인델은 점 돌연변이와 대조될 수 있다. 인델은 뉴클레오타이드를 삽입하고 서열로부터 삭제하는 반면, 점 돌연변이는 DNA의 전체 수를 변경하지 않고 뉴클레오타이드들 중 하나를 대체하는 치환 형태이다. 인델은, 또한, 인접한 뉴클레오타이드에서의 치환으로서 정의될 수 있는 탠덤 염기 돌연변이(Tandem Base Mutation: TBM)와 대조될 수 있다 (주로 2개의 인접한 뉴클레오타이드에서의 치환에 해당하지만, 3개의 인접한 뉴클레오타이드에서의 치환이 관찰되었다).The term "indel" refers to the insertion and / or deletion of a base in an organism's DNA. Micro-indels represent indels resulting in a net change of 1 to 50 nucleotides. In the coding region of the genome, unless the length of the indel is a multiple of 3, it will generate a frameshift mutation. Indels can be contrasted with point mutations. Indels insert and delete nucleotides from sequences, while point mutations are substitution forms that replace one of the nucleotides without changing the total number of DNA. Indels can also be contrasted with a Tandem Base Mutation (TBM), which can be defined as a substitution in adjacent nucleotides (mainly corresponding to a substitution in two adjacent nucleotides, but a substitution in three adjacent nucleotides). Was observed).

"변이체"라는 용어는, 핵산 참조와는 다른 핵산 서열을 지칭한다. 통상적인 핵산 서열 변이체는, 단일 뉴클레오타이드 다형성(single nucleotide polymorphism: SNP), 짧은 삭제 및 삽입 다형성(Indel), 카피 수 변이(copy number variation: CNV), 마이크로위성 마커, 또는 짧은 탠덤 반복 및 구조적 변이를 제한 없이 포함한다. 체세포 변이체 콜은, DNA 샘플에서 낮은 빈도로 존재하는 변이체를 식별하기 위한 노력이다. 체세포 변이체 콜은 암 치료의 맥락에서 중요하다. 암은, DNA에 돌연변이가 축적되어 발생하는 것이다. 종양으로부터의 DNA 샘플은, 일반적으로 일부 정상 세포, (돌연변이가 적은) 암 진행의 초기 단계의 일부 세포, 및 (돌연변이가 많은) 일부 후기 단계 세포를 포함하여 이종성이다. 이러한 이종성 때문에, (예를 들어, FFPE 샘플로부터) 종양을 시퀀싱할 때, 체세포 돌연변이는 종종 낮은 빈도로 나타난다. 예를 들어, SNV는 주어진 염기를 커버하는 리드의 10%에서만 보일 수 있다. 변이체 분류기에 의해 체세포 또는 생식 세포로서 분류되는 변이체도, 본 명세서에서 "테스트 중인 변이체"라고 지칭된다. The term "variant" refers to a nucleic acid sequence that is different from a nucleic acid reference. Typical nucleic acid sequence variants include single nucleotide polymorphism (SNP), short deletion and insertion polymorphism (Indel), copy number variation (CNV), microsatellite marker, or short tandem repeat and structural variation. Include without limitation. Somatic variant calls are an effort to identify variants present at low frequency in a DNA sample. Somatic variant calls are important in the context of cancer treatment. Cancer is caused by accumulation of mutations in DNA. DNA samples from tumors are generally heterogeneous, including some normal cells, some cells in the early stages of cancer progression (less mutations), and some late stage cells (high mutations). Because of this heterogeneity, when sequencing tumors (eg, from FFPE samples), somatic mutations often appear at low frequency. For example, SNV can only be seen in 10% of the leads that cover a given base. Variants classified as somatic or germ cells by the variant classifier are also referred to herein as “variants under test”.

"노이즈"라는 용어는, 서열분석 프로세스 및/또는 변이체 콜 애플리케이션에서의 하나 이상의 에러로 인한 잘못된 변이체 콜을 지칭한다.The term “noise” refers to a false variant call due to one or more errors in the sequencing process and / or variant call application.

"변이체 빈도"라는 용어는, 모집단의 특정 좌위에서의 대립유전자(유전자의 변이체)의 상대 빈도를 분획율 또는 백분율로서 표현한 것을 나타낸다. 예를 들어, 분획율 또는 백분율은 해당 대립유전자를 보유하는 모집단에서의 모든 염색체의 분획률일 수 있다. 예를 들어, 샘플 변이체 빈도는, 개인으로부터 관심 게놈 서열에 대하여 취득된 샘플 및/또는 리드의 수에 상응하는 "모집단"에 대한 관심 게놈 서열을 따른 특정 좌위/위치에서의 대립유전자/변이체의 상대 빈도를 나타낸다. 다른 일례로, 베이스라인 변이체 빈도는, 하나 이상의 베이스라인 게놈 서열을 따른 특정 좌위/위치에서의 대립유전자/변이체의 상대 빈도를 나타내며, 여기서 "모집단"은, 정상적인 개인들의 모집단으로부터 하나 이상의 베이스라인 게놈 서열에 대하여 취득된 샘플 및/또는 리드의 수에 상응한다.The term "variant frequency" refers to the expression of the relative frequency of alleles (variants of genes) at specific loci in a population expressed as fractions or percentages. For example, the fraction or percentage may be the fraction of all chromosomes in a population carrying the allele of interest. For example, the frequency of sample variants is the relative of the allele / variant at a particular locus / position along the genomic sequence of interest to the “population” corresponding to the number of samples and / or reads obtained for the genomic sequence of interest from the individual. Frequency. In another example, the baseline variant frequency refers to the relative frequency of alleles / variants at a particular locus / position along one or more baseline genomic sequences, where “population” is one or more baseline genomes from a population of normal individuals. It corresponds to the number of samples and / or reads obtained for the sequence.

The term "variant allele frequency" (VAF) refers to the number of sequenced reads observed to match the variant, divided by the overall coverage at the target position, expressed as a percentage. VAF is a measure of the proportion of sequenced reads carrying the variant.
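
As a worked illustration of this definition, the sketch below computes VAF from a variant read count and the total coverage at a position. The function name and the input validation are assumptions added for the example.

```python
def variant_allele_frequency(variant_reads: int, total_coverage: int) -> float:
    """VAF as a percentage: reads matching the variant / overall coverage."""
    if total_coverage <= 0:
        raise ValueError("total_coverage must be positive")
    if not 0 <= variant_reads <= total_coverage:
        raise ValueError("variant_reads must lie between 0 and total_coverage")
    return 100.0 * variant_reads / total_coverage


# 25 of the 250 reads covering the position carry the variant.
print(variant_allele_frequency(25, 250))  # 10.0
```

The 10% result matches the low-frequency somatic case mentioned earlier, where a variant is seen in only a small share of the reads covering a base.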

"위치", "지정된 위치" 및 "좌위"라는 용어는, 뉴클레오타이드들의 서열 내에서의 하나 이상의 뉴클레오타이드의 위치 또는 좌표를 지칭한다. "위치", "지정된 위치" 및 "좌위"라는 용어들은, 또한, 뉴클레오타이드들의 서열에서의 하나 이상의 염기 쌍의 위치 또는 좌표를 지칭한다.The terms "position", "designated position" and "left" refer to the position or coordinates of one or more nucleotides within the sequence of nucleotides. The terms "position", "designated position" and "left" also refer to the position or coordinates of one or more base pairs in the sequence of nucleotides.

"일배체형"이라는 용어는 함께 유전되는 염색체 상의 인접 부위에 있는 대립유전자들의 조합을 지칭한다. 일배체형은, 좌위의 주어진 세트가 발생하였다면, 이러한 세트 간에 발생한 재조합 이벤트들의 수에 따라 하나의 좌위, 여러 개의 좌위, 또는 전체 염색체일 수 있다.The term "haplotype" refers to a combination of alleles at adjacent sites on chromosomes that are inherited together. The haplotype can be a single locus, multiple loci, or the entire chromosome, depending on the number of recombination events that occurred between these sets, if a given set of loci occurred.

The term "threshold" herein refers to a numeric or non-numeric value that is used as a cutoff to characterize a sample, a nucleic acid, or a portion thereof (e.g., a read). A threshold may be varied based upon empirical analysis. The threshold may be compared to a measured or calculated value to determine whether the source giving rise to such value should be classified in a particular manner. Threshold values can be identified empirically or analytically. The choice of a threshold depends on the level of confidence that the user wishes to have in making the classification. The threshold may be chosen for a particular purpose (e.g., to balance sensitivity and selectivity). As used herein, the term "threshold" indicates a point at which a course of analysis may be changed and/or a point at which an action may be triggered. A threshold is not required to be a predetermined number. Instead, the threshold may be, for instance, a function that is based on a plurality of factors. The threshold may be adaptive to the circumstances. Moreover, a threshold may indicate an upper limit, a lower limit, or a range between limits.

In some implementations, a metric or score that is based on sequencing data may be compared to the threshold. As used herein, the terms "metric" or "score" may include values or results that were determined from the sequencing data. Like a threshold, the metric or score may be adaptive to the circumstances. For instance, the metric or score may be a normalized value. As an example of a score or metric, one or more implementations may use count scores when analyzing the data. A count score may be based on the number of sample reads. The sample reads may have undergone one or more filtering stages such that the sample reads have at least one common characteristic or quality. For example, each of the sample reads used to determine a count score may have been aligned with a reference sequence or may be assigned as a potential allele. The number of sample reads having a common characteristic may be counted to determine a read count. Count scores may be based on the read count. In some implementations, the count score may be a value equal to the read count. In other implementations, the count score may be based on the read count and other information. For example, a count score may be based on the read count for a particular allele of a genetic locus and the total number of reads for the genetic locus. In some implementations, the count score may be based on the read count for the genetic locus and previously obtained data. In some implementations, the count scores may be scores normalized between predetermined values. The count score may also be a function of read counts from other loci of a sample or a function of read counts from other samples that were run concurrently with the sample of interest. For instance, the count score may be a function of the read count of a particular allele and the read counts of other loci in the sample and/or the read counts from other samples. As one example, the read counts from other loci and/or the read counts from other samples may be used to normalize the count score for the particular allele.
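
Two of the count-score variants described above can be sketched as follows. The text does not fix a formula, so both functions are assumptions chosen for illustration: one divides the allele read count by the total reads at the locus, the other normalizes the allele read count by the mean read count of other loci.

```python
def allele_fraction_score(allele_reads: int, locus_reads: int) -> float:
    """Count score from the allele read count and the total reads at the locus."""
    if locus_reads <= 0:
        raise ValueError("locus_reads must be positive")
    return allele_reads / locus_reads


def normalized_count_score(allele_reads: int, other_locus_reads: list) -> float:
    """Count score normalized by the mean read count of other loci (assumed scheme)."""
    if not other_locus_reads:
        raise ValueError("need at least one other locus for normalization")
    mean_other = sum(other_locus_reads) / len(other_locus_reads)
    if mean_other <= 0:
        raise ValueError("mean read count of other loci must be positive")
    return allele_reads / mean_other


print(allele_fraction_score(30, 120))                # 0.25
print(normalized_count_score(30, [100, 120, 80]))    # 0.3
```

Either score could then be compared against a threshold in the sense defined earlier.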

"커버리지" 또는 "프래그먼트 커버리지"라는 용어는, 서열의 동일한 프래그먼트에 대한 다수의 샘플 리드의 계수치 또는 다른 척도를 지칭한다. 리드 계수치는 대응하는 프래그먼트를 커버하는 리드 수의 계수치를 나타낼 수 있다. 대안으로, 커버리지는, 이력 지식, 샘플의 지식, 좌위의 지식 등에 기초하는 지정된 계수에 리드 계수치를 곱함으로써 결정될 수 있다.The term "coverage" or "fragment coverage" refers to the count or other measure of multiple sample reads for the same fragment of a sequence. The lead count value may represent the count value of the number of leads covering the corresponding fragment. Alternatively, coverage can be determined by multiplying a read coefficient by a designated coefficient based on historical knowledge, sample knowledge, loci knowledge, and the like.

"리드 깊이"(통상적으로 "x"가 후속하는 수)라는 용어는 표적 위치에서 중복되는 정렬을 갖는 서열분석된 리드의 수를 지칭한다. 이는 종종 간격들의 세트(예를 들어, 엑손, 유전자 또는 패널)에 걸쳐 컷오프를 초과하는 평균 또는 백분율로서 표현된다. 예를 들어, 임상 보고서에 따르면, 패널 평균 커버리지가 1,105×이고 표적 염기의 98%가 >100×를 커버한다고 말할 수 있다.The term “lead depth” (typically the number followed by “x”) refers to the number of sequenced reads with overlapping alignments at the target position. This is often expressed as the average or percentage over cutoff over a set of intervals (eg exon, gene or panel). For example, according to clinical reports, it can be said that the panel average coverage is 1,105 × and 98% of the target base covers> 100 ×.

"염기 콜 품질 점수" 또는 "Q 점수"라는 용어는, 단일 서열분석된 염기가 정확한 확률에 반비례하여 0 내지 20 범위의 PHRED-스케일 확률을 지칭한다. 예를 들어, Q가 20인 T 염기 콜은, 신뢰도 P-값이 0.01인 경우 올바른 것으로 간주될 수 있다. Q<20인 모든 염기 콜은 품질이 낮은 것으로 간주되어야 하며, 변이체를 지지하는 서열분석된 리드의 상당 부분이 품질이 낮은 것으로 식별된 임의의 변이체는 잠재적 위양성으로 간주되어야 한다.The term “base call quality score” or “Q score” refers to a PHRED-scale probability in the range of 0-20, where a single sequenced base is inversely proportional to the correct probability. For example, a T base call with Q of 20 can be considered correct if the confidence P-value is 0.01. All base calls with Q <20 should be considered low quality, and any variant with a significant portion of the sequenced reads supporting the variant identified as low quality should be considered a potential false positive.

"변이체 리드" 또는 "변이체 리드 수"라는 용어는 변이체의 존재를 지지하는 서열분석된 리드의 수를 지칭한다.The term "variant read" or "variant read number" refers to the number of sequenced reads that support the presence of the variant.

Sequencing Process

This section provides background on sequencing by synthesis (SBS) and the identification of variants. Implementations set forth herein may be applied to analyzing nucleic acid sequences to identify sequence variations. Implementations may be used to analyze potential variants/alleles of a genetic position/locus and to determine a genotype of the genetic locus or, in other words, to provide a genotype call for the locus. By way of example, nucleic acid sequences may be analyzed in accordance with the methods and systems described in U.S. Patent Application Publication No. 2016/0085910 and U.S. Patent Application Publication No. 2013/0296175, the complete subject matter of which is incorporated herein by reference in its entirety.

In one implementation, a sequencing process includes receiving a sample that includes or is suspected of including nucleic acids, such as DNA. The sample may be from a known or unknown source, such as an animal (e.g., a human), plant, bacterium, or fungus. The sample may be taken directly from the source. For instance, blood or saliva may be taken directly from an individual. Alternatively, the sample may not be obtained directly from the source. Then, one or more processors direct the system to prepare the sample for sequencing. The preparation may include removing extraneous material and/or isolating certain material (e.g., DNA). The biological sample may be prepared to include features for a particular assay. For example, the biological sample may be prepared for sequencing by synthesis (SBS). In certain implementations, the preparation may include amplification of certain regions of a genome. For instance, the preparation may include amplifying predetermined genetic loci that are known to include STRs and/or SNPs. The genetic loci may be amplified using predetermined primer sequences.

Next, the one or more processors may direct the system to sequence the sample. The sequencing may be performed through a variety of known sequencing protocols. In particular implementations, the sequencing includes SBS. In SBS, a plurality of fluorescently labeled nucleotides are used to sequence a plurality of clusters of amplified DNA (possibly millions of clusters) present on the surface of an optical substrate (e.g., a surface that at least partially defines a channel in a flow cell). The flow cells may contain nucleic acid samples for sequencing, with the flow cells placed within appropriate flow cell holders.

The nucleic acids can be prepared such that they comprise a known primer sequence adjacent to an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides, DNA polymerase, and the like can be flowed into/through the flow cell by a fluid flow subsystem. Either a single type of nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to possess a reversible termination property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of several types of labeled nucleotides (e.g., A, C, T, G). The nucleotides can include detectable label moieties such as fluorophores. Where the four nucleotides are mixed together, the polymerase is able to select the correct base to incorporate, and each sequence is extended by a single base. Unincorporated nucleotides can be washed away by flowing a wash solution through the flow cell. One or more lasers may stimulate the nucleic acids and induce fluorescence. The fluorescence emitted from the nucleic acids is based upon the fluorophores of the incorporated base, and different fluorophores may emit different wavelengths of emission light. A deblocking reagent can be added to the flow cell to remove reversible terminator groups from the DNA strands that were extended and detected. The deblocking reagent can then be washed away by flowing a wash solution through the flow cell. The flow cell is then ready for a further cycle of sequencing, starting with the introduction of a labeled nucleotide as set forth above. The fluidic and detection operations can be repeated multiple times to complete a sequencing run. Example sequencing methods are described, for example, in Bentley et al., Nature 456:53-59 (2008); International Publication No. WO 04/018497; U.S. Pat. No. 7,057,026; International Publication No. WO 91/06678; International Publication No. WO 07/123744; U.S. Pat. No. 7,329,492; U.S. Pat. No. 7,211,414; U.S. Pat. No. 7,315,019; U.S. Pat. No. 7,405,281; and U.S. Patent Application Publication No. 2008/0108082, each of which is incorporated herein by reference.
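
The per-cycle logic described above (one labeled base incorporated per cycle, with the emitted fluorescence identifying that base) can be pictured with a toy base caller. This is only an illustrative sketch: it assumes one distinct label per base and a simple "brightest channel wins" rule, and the intensity values are invented. It does not represent the base-calling method of any particular instrument.

```python
def call_bases(cycles: list) -> str:
    """Call one base per cycle by picking the channel with the highest intensity.

    `cycles` is a list of dicts mapping a base ("A", "C", "G", "T") to the
    fluorescence intensity measured for its label in that cycle.
    """
    calls = []
    for intensities in cycles:
        # The incorporated base is assumed to be the brightest channel.
        calls.append(max(intensities, key=intensities.get))
    return "".join(calls)


# Three sequencing cycles for one cluster (made-up intensities).
cycles = [
    {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.8, "T": 0.2},
    {"A": 0.0, "C": 0.2, "G": 0.1, "T": 0.7},
]
print(call_bases(cycles))  # AGT
```

Each dictionary stands for one extend, image, and deblock cycle; the resulting string is the read for that cluster.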

In some implementations, nucleic acids can be attached to a surface and amplified prior to or during sequencing. For example, amplification can be carried out using bridge amplification to form nucleic acid clusters on a surface. Useful bridge amplification methods are described, for example, in U.S. Pat. No. 5,641,658; U.S. Patent Application Publication No. 2002/0055100; U.S. Pat. No. 7,115,400; U.S. Patent Application Publication No. 2004/0096853; U.S. Patent Application Publication No. 004/0002090; U.S. Patent Application Publication No. 2007/0128624; and U.S. Patent Application Publication No. 2008/0009420, each of which is incorporated herein by reference in its entirety. Another useful method for amplifying nucleic acids on a surface is rolling circle amplification (RCA), for example, as described in Lizardi et al., Nat. Genet. 19:225-232 (1998) and U.S. Patent Application Publication No. 2007/0099208 A1, each of which is incorporated herein by reference.

One example SBS protocol uses modified nucleotides having removable 3' blocks, for example, as described in International Publication No. WO 04/018497, U.S. Patent Application Publication No. 2007/0166705 A1, and U.S. Pat. No. 7,057,026, each of which is incorporated herein by reference. For example, repeated cycles of SBS reagents can be delivered to a flow cell having target nucleic acids attached thereto, for example, as a result of the bridge amplification protocol. The nucleic acid clusters can be converted to single-stranded form using a linearization solution. The linearization solution can contain, for example, a restriction endonuclease capable of cleaving one strand of each cluster. Other methods of cleavage can be used as an alternative to restriction enzymes or nicking enzymes, including, among others, chemical cleavage (e.g., cleavage of a diol linkage with periodate), cleavage of abasic sites by cleavage with an endonuclease (for example, "USER", as supplied by NEB, Ipswich, Massachusetts, USA, part number M5505S), exposure to heat or alkali, cleavage of ribonucleotides incorporated into amplification products otherwise composed of deoxyribonucleotides, photochemical cleavage, or cleavage of a peptide linker. After the linearization operation, a sequencing primer can be delivered to the flow cell under conditions for hybridization of the sequencing primer to the target nucleic acids that are to be sequenced.

The flow cell can then be contacted with an SBS extension reagent having modified nucleotides with removable 3' blocks and fluorescent labels, under conditions to extend a primer hybridized to each target nucleic acid by a single nucleotide addition. Only a single nucleotide is added to each primer because, once the modified nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template being sequenced, there is no free 3'-OH group available to direct further sequence extension, and therefore the polymerase cannot add further nucleotides. The SBS extension reagent can be removed and replaced with a scan reagent containing components that protect the sample under excitation with radiation. Example components for scan reagents are described in U.S. Patent Application Publication No. 2008/0280773 A1 and U.S. Patent Application No. 13/018,255, each of which is incorporated herein by reference. The extended nucleic acids can then be fluorescently detected in the presence of the scan reagent. Once the fluorescence has been detected, the 3' block may be removed using a deblock reagent that is appropriate to the blocking group used. Example deblock reagents that are useful for respective blocking groups are described in WO004018497, US 2007/0166705 A1, and U.S. Pat. No. 7,057,026, each of which is incorporated herein by reference. The deblock reagent can be washed away, leaving target nucleic acids hybridized to extended primers having 3'-OH groups that are now competent for the addition of a further nucleotide. Accordingly, the cycle of adding extension reagent, scan reagent, and deblock reagent, with optional washes between one or more of the operations, can be repeated until a desired sequence is obtained. The above cycles can be carried out using a single extension reagent delivery operation per cycle when each of the modified nucleotides has a different label attached thereto, known to correspond to the particular base. The different labels facilitate discrimination between the nucleotides added during each incorporation operation. Alternatively, each cycle can include separate operations of extension reagent delivery followed by separate operations of scan reagent delivery and detection, in which case two or more of the nucleotides can have the same label and can be distinguished based on the known order of delivery.

Although the sequencing operation has been described above with respect to a particular SBS protocol, it will be understood that other protocols for sequencing any of a variety of other molecular analyses can be performed as needed.

The one or more processors of the system then receive the sequencing data for subsequent analysis. The sequencing data can be formatted in various ways, such as in a .BAM file. The sequencing data can include, for example, a number of sample reads, that is, a plurality of sample reads that have corresponding sample sequences of nucleotides. Although only one sample read is discussed, it should be understood that the sequencing data can include, for example, hundreds, thousands, hundreds of thousands, or millions of sample reads. Different sample reads can have different numbers of nucleotides. For example, a sample read can range from 10 nucleotides to about 500 nucleotides or more. The sample reads can span the entire genome of the source(s). In one example, the sample reads are directed toward predetermined genetic loci, such as those loci having suspected STRs or suspected SNPs.

Each sample read can include a sequence of nucleotides, which may be referred to as a sample sequence, sample fragment, or target sequence. The sample sequence can include, for example, primer sequences, flanking sequences, and a target sequence. The number of nucleotides within the sample sequence can be 30, 40, 50, 60, 70, 80, 90, 100, or more. In some implementations, one or more of the sample reads (or sample sequences) includes at least 150, 200, 300, 400, or 500 nucleotides, or more. In some implementations, the sample reads can include more than 1000 nucleotides, or 2000 nucleotides or more. The sample reads (or sample sequences) can include primer sequences at one or both ends.

Next, the one or more processors analyze the sequencing data to obtain potential variant call(s) and a sample variant frequency of the sample variant call(s). This operation can also be referred to as a variant call application or variant caller. Thus, the variant caller identifies or detects variants, and the variant classifier classifies the detected variants as somatic or germline. Alternative variant callers can be used in accordance with implementations herein, where different variant callers may be chosen based on the type of sequencing operation being performed, on features of the sample of interest, and the like. One non-limiting example of a variant call application is the Pisces™ application by Illumina Inc. (San Diego, CA), hosted at https://github.com/Illumina/Pisces and described in Dunn, Tamsen; Berry, Gwenn; Emig-Agius, Dorothea; Jiang, Yu; Iyer, Anita; Udar, Nitin; Strömberg, Michael (2017). Pisces: An Accurate and Versatile Single Sample Somatic and Germline Variant Caller. 595-595. 10.1145/3107411.3108203, the complete subject matter of which is expressly incorporated herein by reference in its entirety.

Generating a Benign Training Set

The generation of the expanded training set is disclosed in the applications incorporated by reference. Millions of human genomes and exomes have been sequenced, but their clinical application remains limited by the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here it is demonstrated that common missense variants in other primate species are largely clinically benign in humans, which enables pathogenic mutations to be identified systematically by a process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, a deep neural network is trained that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species will improve the interpretation of millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.

The clinical actionability of diagnostic sequencing is limited by the difficulty of interpreting rare genetic variants in human populations and inferring their impact on disease risk. Because of their deleterious effects on fitness, clinically significant genetic variants tend to be extremely rare in the population, and for the vast majority their effects on human health have not been determined. The large number and rarity of these variants of uncertain clinical significance present a formidable obstacle to the adoption of sequencing for individualized medicine and population-wide health screening.

Most penetrant Mendelian diseases have a very low prevalence in the population, so observing a variant at high frequency in the population is strong evidence in favor of a benign consequence. Assaying common variation across diverse human populations is an effective strategy for cataloging benign variants, but the total amount of common variation in present-day humans is limited by bottleneck events in the recent history of our species, during which a large fraction of ancestral diversity was lost. Population studies of present-day humans show a remarkable expansion from an effective population size (N_e) of fewer than 10,000 individuals within the last 15,000 to 65,000 years, and the small pool of common polymorphisms traces back to the limited capacity for variation in a population of that size. Of the more than 70 million potential protein-altering missense substitutions in the reference genome, only roughly 1 in 1,000 is present at an overall population allele frequency greater than 0.1%.

Outside of modern human populations, chimpanzees comprise the next closest extant species and share 99.4% amino acid sequence identity with humans. The near-identity of the protein-coding sequence in humans and chimpanzees suggests that purifying selection operating on chimpanzee protein-coding variants may also model the fitness consequences of human mutations that are identical-by-state (IBS).

Because the mean time for a neutral polymorphism to persist in the ancestral human lineage (~4N_e generations) is a fraction of the species' divergence time (about 6 million years ago), naturally occurring chimpanzee variation explores a mutational space that largely does not overlap with human variation except by chance, aside from rare instances of haplotypes maintained by balancing selection. If polymorphisms that are IBS affect fitness similarly in the two species, then the presence of a variant at high allele frequency in chimpanzee populations indicates a benign consequence in humans, which expands the catalog of known variants whose benign consequence has been established by purifying selection. Substantial additional detail is provided in the applications incorporated by reference.

Architecture of the Deep Learning Network

In one implementation disclosed in the applications incorporated by reference, the pathogenicity prediction network takes as input the 51-length amino acid sequence centered at the variant of interest, with the missense variant substituted in at the central position, together with the outputs of the secondary structure and solvent accessibility networks (FIG. 2 and FIG. 3). Three 51-length position frequency matrices are generated from multiple sequence alignments of 99 vertebrates: one for 11 primates, one for 50 mammals excluding primates, and one for 38 vertebrates excluding primates and mammals.

The secondary structure deep learning network predicts a three-state secondary structure at each amino acid position: alpha helix (H), beta sheet (B), and coil (C). The solvent accessibility network predicts a three-state solvent accessibility at each amino acid position: buried (B), intermediate (I), and exposed (E). Both networks can take only the flanking amino acid sequence as their input, and both can be trained using labels from known non-redundant crystal structures in the Protein DataBank. For the input to the pre-trained three-state secondary structure and three-state solvent accessibility networks, a single position frequency matrix of length 51 and depth 20, generated from the multiple sequence alignment of all 99 vertebrates, can be used. After pre-training the networks on known crystal structures from the Protein DataBank, the final two layers of the secondary structure and solvent models can be removed, and the output of each network can be connected directly to the input of the pathogenicity model. An exemplary test accuracy achieved for the three-state secondary structure prediction model was 79.86%. There was no substantial difference in the neural network's predictions between using DSSP-annotated structure labels for the approximately 4,000 human proteins that have crystal structures and using predicted structure labels only.

Both the deep learning network for pathogenicity prediction (PrimateAI) and the deep learning networks for predicting secondary structure and solvent accessibility adopt an architecture of residual blocks. The detailed architecture of PrimateAI is described in FIG. 3.

FIG. 2 shows an exemplary architecture 200 of the deep residual network for pathogenicity prediction, referred to herein as "PrimateAI". In FIG. 2, 1D refers to a one-dimensional convolutional layer. The predicted pathogenicity ranges from 0 (benign) to 1 (pathogenic). The network takes as input the human amino acid (AA) reference sequence and alternate sequence (51 AAs) centered at the variant, the position weight matrix (PWM) conservation profiles calculated from 99 vertebrate species, and the outputs of the secondary structure and solvent accessibility prediction deep learning networks, which predict three-state protein secondary structure (helix H, beta sheet B, and coil C) and three-state solvent accessibility (buried B, intermediate I, and exposed E).

FIG. 3 shows a schematic illustration 300 of PrimateAI, the deep learning network architecture for pathogenicity classification. The inputs to the model include 51 amino acids (AAs) of flanking sequence for both the reference sequence and the sequence with the variant substituted in, conservation represented by three 51-AA-length position weight matrices from the primate, mammal, and vertebrate alignments, and the outputs of the pre-trained secondary structure network and solvent accessibility network (also 51 AAs in length).

Improvement Through Pre-Training

This disclosure introduces pre-training of a pathogenicity prediction model to reduce or counteract overfitting and improve training outcomes. The system is described with reference to FIG. 1, which shows an architecture-level schematic 100 of the system in accordance with an implementation. Because FIG. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description. The discussion of FIG. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then the use of the elements in the system is described in greater detail.

This paragraph names the labeled parts of the system illustrated in FIG. 1. The system includes four training datasets: pathogenic missense training examples 121, supplemental benign training examples 131, benign missense training examples 161, and supplemental benign training examples 181. The system further includes a trainer 114, a tester 116, a position frequency matrix (PFM) calculator 184, an input encoder 186, a variant pathogenicity prediction model 157, and network(s) 155. The supplemental benign training examples 131 correspond to the pathogenic missense training examples 121 and are therefore placed together with them in a box with a broken line. Similarly, the supplemental benign training examples 181 correspond to the benign missense training examples 161, and these two datasets are therefore also shown in one box.

The system is described with PrimateAI as an example variant pathogenicity prediction model 157, which takes as input the amino acid sequences flanking the variant of interest and orthologous sequence alignments in other species. The detailed architecture of the PrimateAI model for pathogenicity prediction is presented above with reference to FIG. 3. The input amino acid sequence includes the variant of interest. The term "variant" refers to an amino acid sequence that differs from the amino acid reference sequence. A tri-nucleotide base sequence (also referred to as a codon) at a specific position in the protein-coding region of a chromosome codes for an amino acid. There are twenty types of amino acids, which are coded for by 61 tri-nucleotide sequence combinations. More than one codon, or tri-nucleotide sequence combination, can result in the same amino acid. For example, the codons "AAA" and "AAG" both represent the amino acid lysine (also referred to by the symbol "K").
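The codon degeneracy described above can be checked with a short sketch. It is illustrative only and not part of the disclosed system: it assumes the standard genetic code (NCBI translation table 1), and the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `sense_codons` are hypothetical.

```python
from itertools import product

# Assumption: the standard genetic code (NCBI translation table 1), written in
# the conventional T, C, A, G base order. '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 tri-nucleotide combinations to its one-letter amino acid symbol.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that code for an amino acid (stop codons excluded).
sense_codons = {codon: aa for codon, aa in CODON_TABLE.items() if aa != "*"}

print(len(sense_codons))                        # number of amino-acid-coding codons
print(len(set(sense_codons.values())))          # number of distinct amino acids
print(CODON_TABLE["AAA"], CODON_TABLE["AAG"])   # both codons code for lysine
```

Because 61 coding codons map onto only 20 amino acids, many single-nucleotide changes leave the encoded amino acid unchanged.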

Amino acid sequence variants can be caused by single nucleotide polymorphisms (SNPs). A SNP is a variation in a single nucleotide that occurs at a specific locus in a gene and is observed to some appreciable degree within a population (e.g., > 1%). The disclosed technology focuses on SNPs that occur in the protein-coding regions of a gene, called exons. There are two types of SNPs: synonymous SNPs and missense SNPs. A synonymous SNP is a type of protein-coding SNP that changes a first codon for an amino acid into a second codon for the same amino acid. A missense SNP, on the other hand, changes a first codon for a first amino acid into a second codon for a second amino acid.

FIG. 6 presents an example 600 of "protein sequence pairs" for a missense variant and for a correspondingly constructed synonymous variant. The phrase "protein sequence pair", or simply "sequence pair", refers to a reference protein sequence and an alternate protein sequence. The reference protein sequence comprises reference amino acids expressed by reference codons, or tri-nucleotide bases. The alternate protein sequence comprises alternate amino acids expressed by alternate codons, or tri-nucleotide bases, such that the alternate protein sequence results from a variant occurring in the reference codon that expresses the reference amino acid of the reference protein sequence.

FIG. 6 presents the construction of a supplemental benign synonymous counterpart training example (referred to above as a supplemental benign training example) corresponding to a missense variant. The missense variant can be a pathogenic missense training example or a benign missense training example. Consider the protein sequence pair for a missense variant whose reference amino acid sequence has the codon "TTT" on chromosome 1 at positions 5, 6, and 7 (i.e., 5:7). Now consider that a SNP occurs on the same chromosome at position 6, resulting in an alternate sequence with the codon "TCT" at the same positions, i.e., 5:7. The codon "TTT" in the reference sequence results in the amino acid phenylalanine (F), whereas the codon "TCT" in the alternate amino acid sequence results in the amino acid serine (S). To simplify the illustration, FIG. 6 shows only the amino acids and corresponding codons of the sequence pairs at the target position. The flanking amino acids and their respective codons in the sequence pairs are not shown. In the training dataset, the missense variant is labeled as pathogenic (labeled "1"). To reduce overfitting of the model during training, the disclosed technology constructs a counterpart supplemental benign training example for the corresponding missense variant. The reference sequence in the sequence pair of the constructed supplemental benign training example is the same as the reference sequence of the missense variant shown in the left part of FIG. 6. The right part of FIG. 6 shows the supplemental benign training example, a synonymous counterpart with the same reference sequence codon "TTT" on chromosome 1 at positions 5:7 as in the reference sequence of the missense variant. The alternate sequence constructed for the synonymous counterpart has a SNP at position 7, resulting in the codon "TTC". This codon results in the same amino acid, phenylalanine (F), in the alternate sequence as in the reference sequence at the same position on the same chromosome. Because two different codons at the same position on the same chromosome express the same amino acid at the target position, the synonymous counterpart is labeled as benign (labeled "0"). The benign counterparts are not constructed at random; instead, they are selected from synonymous variants observed in a sequenced population. The disclosed technology constructs the supplemental benign training examples to contrast with the pathogenic missense training examples and thereby reduce overfitting of the variant pathogenicity prediction model during training.
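The FIG. 6 walk-through can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `SequencePair`, `apply_snp`, and `make_pair` are hypothetical names, and the codon lookup holds only the three codons used in the example.

```python
from dataclasses import dataclass

# Only the codons that appear in the FIG. 6 example (standard genetic code).
CODON_TO_AA = {"TTT": "F", "TTC": "F", "TCT": "S"}


@dataclass(frozen=True)
class SequencePair:
    """Reference/alternate codon and amino acid at the target position (hypothetical type)."""
    ref_codon: str
    alt_codon: str
    ref_aa: str
    alt_aa: str
    label: int  # 1 = pathogenic, 0 = benign


def apply_snp(codon: str, offset: int, base: str) -> str:
    """Return the codon with the base at the 0-based offset replaced."""
    return codon[:offset] + base + codon[offset + 1:]


def make_pair(ref_codon: str, offset: int, base: str, label: int) -> SequencePair:
    alt_codon = apply_snp(ref_codon, offset, base)
    return SequencePair(ref_codon, alt_codon,
                        CODON_TO_AA[ref_codon], CODON_TO_AA[alt_codon], label)


# The codon occupies positions 5:7, so position 6 is offset 1 and position 7 is offset 2.
missense = make_pair("TTT", offset=1, base="C", label=1)      # TTT -> TCT, F -> S
supplemental = make_pair("TTT", offset=2, base="C", label=0)  # TTT -> TTC, F -> F

print(missense)
print(supplemental)
```

Both pairs share the reference codon. Only the alternate codon and the label differ, which is what makes the second pair a benign contrast to the first. In the disclosure the synonymous counterparts are selected from variants observed in a sequenced population; the sketch builds one directly only to mirror the figure.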

A supplemental benign training example need not be synonymous. The disclosed technology can also construct supplemental benign training examples in which the alternate sequence has the same amino acid, coded for by the identical tri-nucleotide codon as in the reference sequence. The associated position frequency matrix (PFM) is the same for identical amino acid sequences, regardless of whether the amino acids are expressed by synonymous or identical codons. Such supplemental training examples therefore have the same effect during training as the synonymous counterpart training examples presented in FIG. 6, namely reducing overfitting of the variant pathogenicity prediction model.

The other elements of the system illustrated in FIG. 1 are now described. The trainer 114 uses the four training datasets shown in FIG. 1 to train the variant pathogenicity prediction model. In one implementation, the variant pathogenicity prediction model is implemented as a convolutional neural network (CNN). The training of a CNN is described above with reference to FIG. 5. During training, the CNN is adjusted so that the input data leads to a specific output estimate. Training includes adjusting the CNN using backpropagation, based on a comparison of the output estimate with the ground truth, until the output estimate progressively matches or approaches the ground truth. After training, the tester 116 uses a test dataset to benchmark the variant pathogenicity prediction model. The input encoder 186 converts categorical input data, such as the reference and alternate amino acid sequences, into a form that can be provided as input to the variant pathogenicity prediction model. This is further explained using example reference and alternate sequences in FIG. 13.
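As an illustration of what an input encoder for categorical sequence data might do, the sketch below one-hot encodes a 51-residue reference/alternate pair. It is an assumption-laden example rather than the encoder 186 itself: the amino acid ordering in `AMINO_ACIDS`, the function name `one_hot_encode`, and the toy sequences are all hypothetical.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of rows) with a single 1 per row."""
    encoded = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        encoded.append(row)
    return encoded


# Toy 51-residue pair: 25 flanking residues on each side of the target position.
reference = "A" * 25 + "F" + "A" * 25
alternate = "A" * 25 + "S" + "A" * 25

ref_encoded = one_hot_encode(reference)
alt_encoded = one_hot_encode(alternate)

# The two encodings differ only at the central (target) position.
differing = [i for i, (r, a) in enumerate(zip(ref_encoded, alt_encoded)) if r != a]
print(len(ref_encoded), len(ref_encoded[0]), differing)
```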

The PFM calculator 184 calculates a position frequency matrix (PFM), which is also referred to as a position-specific scoring matrix (PSSM) or position weight matrix (PWM). A PFM indicates the frequency of every amino acid (along the vertical axis) at each amino acid position (along the horizontal axis), as shown in FIG. 10 and FIG. 11. The disclosed technology calculates three PFMs, one each for primates, mammals, and vertebrates. The length of the amino acid sequence for each of the three PFMs can be 51, with the target amino acid flanked upstream and downstream by at least 25 amino acids. A PFM has 20 rows for the amino acids and 51 columns for the amino acid positions in the sequence. The PFM calculator calculates the first PFM with amino acid sequences for 11 primates, the second PFM with amino acid sequences for 48 mammals, and the third PFM with amino acid sequences for 40 vertebrates. A cell in a PFM is a count of the occurrences of an amino acid at a specific position in the sequence. The amino acid sequences for the three PFMs are aligned. This means that, for each amino acid position in the reference or alternate amino acid sequence, the position-wise results for the primate, mammal, and vertebrate PFMs are stored position by position, or by ordinal position, in the same order in which the amino acid positions occur in the reference or alternate amino acid sequence.
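A count-based PFM of the kind described above can be sketched as follows, assuming the input is a list of equal-length, already-aligned amino acid sequences, one per species. The function name `position_frequency_matrix`, the row ordering, and the toy alignment are hypothetical, and the choice to skip gap or unknown symbols is an assumption not stated in the text.

```python
# Assumption: rows follow the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_ROW = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def position_frequency_matrix(aligned_sequences):
    """Count amino acid occurrences per position.

    Returns a 20 x L matrix (list of rows), where L is the alignment length and
    cell [row][col] is the number of sequences with that amino acid at that position.
    """
    length = len(aligned_sequences[0])
    pfm = [[0] * length for _ in AMINO_ACIDS]
    for sequence in aligned_sequences:
        if len(sequence) != length:
            raise ValueError("all aligned sequences must have the same length")
        for col, aa in enumerate(sequence):
            row = AA_ROW.get(aa)
            if row is not None:  # assumption: gaps and unknown symbols are not counted
                pfm[row][col] += 1
    return pfm


# Toy alignment of four species over three positions (real inputs would be 51 long).
toy_alignment = ["KFA", "KFA", "KSA", "KFG"]
pfm = position_frequency_matrix(toy_alignment)

print(pfm[AA_ROW["K"]][0])                       # position 0 is fully conserved
print(pfm[AA_ROW["F"]][1], pfm[AA_ROW["S"]][1])  # position 1 is partially conserved
```

In this toy case every column sums to the number of species, so a fully conserved position holds the full species count in a single row.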

The disclosed technology uses the supplemental benign training examples 131 and 181 during the initial training epochs, for example 2, 3, 5, 8, or 10 epochs, or 3 to 5, 3 to 8, or 2 to 10 epochs. FIG. 7, FIG. 8, and FIG. 9 illustrate the pathogenicity prediction model during the pre-training epochs, the training epochs, and inference. FIG. 7 presents an example 700 of pre-training epochs 1 to 5, in which about 400,000 supplemental benign training examples 131 are combined with about 400,000 pathogenic variants 121 predicted from the deep learning models. Fewer supplemental benign training examples, such as about 100,000, 200,000, or 300,000, can be combined with the pathogenic variants. In one implementation, the pathogenic variant dataset is generated in 20 cycles using random samples from about 68 million synthetic variants, as described above. In another implementation, the pathogenic variant dataset can be generated in one cycle from the approximately 68 million synthetic variants. The pathogenic variants 121 and the supplemental benign training examples 131 are given as input to an ensemble of networks in the first five epochs. Similarly, approximately 400,000 supplemental benign training examples 181 are combined with approximately 400,000 benign variants 161 for ensemble training during the pre-training epochs. Fewer supplemental benign training examples, such as about 100,000, 200,000, or 300,000, can be combined with the benign variants.

The supplemental benign datasets 131 and 181 are not provided as input in the remaining training epochs 6 to n, as shown in example 800 in FIG. 8. Training of the ensemble of networks continues over multiple epochs with the pathogenic variant dataset and the benign variant dataset. Training ends after a predetermined number of training epochs or when a termination condition is reached. The trained network is used during inference to evaluate the synthetic variants 810, as shown in example 900 in FIG. 9. The trained network predicts each variant as pathogenic or benign.
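The epoch schedule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, its parameters, and the use of label 1 for pathogenic/unlabeled and label 0 for benign inputs are assumptions made for the example, and the default of five pre-training epochs is only one of the values the text mentions.

```python
def examples_for_epoch(epoch, pathogenic, benign,
                       supp_for_pathogenic, supp_for_benign,
                       pretrain_epochs=5):
    """Return (examples, labels) for a 1-indexed training epoch.

    Hypothetical helper: supplemental benign examples (label 0) are
    included only while epoch <= pretrain_epochs, then dropped.
    """
    examples = list(pathogenic) + list(benign)
    labels = [1] * len(pathogenic) + [0] * len(benign)
    if epoch <= pretrain_epochs:
        supplemental = list(supp_for_pathogenic) + list(supp_for_benign)
        examples += supplemental
        labels += [0] * len(supplemental)  # supplemental examples are benign
    return examples, labels


if __name__ == "__main__":
    for epoch in (1, 5, 6):
        ex, _ = examples_for_epoch(epoch, ["p1", "p2"], ["b1"],
                                   ["sp1", "sp2"], ["sb1"])
        print(epoch, len(ex))
```

With these toy inputs, epochs 1 and 5 see six examples and epoch 6 sees only the three original ones.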

We now describe the PFM for an example supplemental benign training example 1012 that is constructed as the counterpart of the pathogenic missense variant training example 1002, as illustrated in FIG. 10 (referred to by the numeral 1000). A PFM is generated or referenced for each training example. The PFM for a training example depends only on the position of the reference sequence, so training examples 1002 and 1012 have the same PFM. FIG. 10 shows two training examples. The first training example 1002 is a pathogenic/unlabeled variant. The second training example 1012 is the counterpart supplemental benign training example corresponding to training example 1002. Training example 1002 has a reference sequence 1002R and an alternative sequence 1002A. A first PFM is accessed or generated for training example 1002 based only on the position of the reference sequence 1002R. Training example 1012 has a reference sequence 1012R and an alternative sequence 1012A. The first PFM, for example the one for 1002, can be reused for example 1012.
The PFM is calculated using amino acid sequences from multiple species, such as 99 species of primates, mammals, and vertebrates, as an indication of the conservation of the sequence across species. Humans may or may not be among the species represented when the PFM is calculated. The cells of this PFM contain counts of the occurrences of amino acids across species at each position in the sequence. PFM 1022 is a starting point for the PFM, illustrating the one-hot encoding of a single sequence in the training example. When the PFM is completed for the example of 99 species, a position that is completely conserved across species has a value of "99" instead of "1". Partial conservation results in two or more rows in a column with values that sum to 99 in this example. Because the PFM depends on the overall sequence position and not on the amino acid at the center position of the sequence, the reference and alternative sequences both have the same PFM.

We now describe determining the PFM for training example 1012 using the positions in the example reference sequence in FIG. 10. The example reference and alternative amino acid sequences for both the pathogenic/unlabeled training example 1002 and the supplemental benign training example 1012, as shown in FIG. 10, have 51 amino acids. The reference amino acid sequence 1002R has an arginine amino acid, represented by "R", at position 26 in the sequence (also referred to as the target position). Note that at the nucleotide level, any one of six trinucleotide codons (CGT, CGC, CGA, CGG, AGA, and AGG) encodes the amino acid "R". These codons are not shown in this example, which keeps the explanation simple and focuses on calculating the PFM. Consider an amino acid sequence (not shown) from one of the 99 species that is aligned to the reference sequence and has the amino acid "R" at position 26. This results in a value of "1" in PFM 1022 in the cell at the intersection of row "R" and column "26". Similar values are determined for all columns of the PFM.
The two PFMs (that is, the PFM for the reference sequence 1002R of the pathogenic missense variant 1002 and the PFM for the reference sequence 1012R of the supplemental benign training example 1012) are identical, so only one PFM 1022 is shown for illustration purposes. These two PFMs represent opposite examples of pathogenicity for the associated amino acid: one is labeled pathogenic, or "1", and the other is labeled benign, or "0". The disclosed technology therefore reduces overfitting by providing these examples to the model during training.
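As a rough sketch of how a counterpart supplemental example could be derived from a missense training example, consider the following. The `TrainingExample` class, its field names, and the `pfm_key` tuple used to identify the shared PFM are hypothetical names introduced for illustration; the source only states that the supplemental pair has identical reference and alternative sequences, is labeled benign ("0"), and reuses the same PFM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    reference: str    # reference amino acid sequence
    alternate: str    # alternative amino acid sequence
    pfm_key: tuple    # hypothetical (start, end) key locating the shared PFM
    label: int        # 1 = pathogenic/unlabeled, 0 = benign


def make_supplemental(example: TrainingExample) -> TrainingExample:
    """Build the benign counterpart: alternate == reference, same PFM."""
    return TrainingExample(
        reference=example.reference,
        alternate=example.reference,  # no amino acid change at the target
        pfm_key=example.pfm_key,      # PFM depends only on sequence position
        label=0,                      # supplemental examples are benign
    )


if __name__ == "__main__":
    missense = TrainingExample("AARAA", "AAWAA", (100, 104), 1)
    supplemental = make_supplemental(missense)
    print(supplemental)
```

The pair of examples then presents the model with the same PFM under opposite labels, which is the mechanism the text credits with reducing reliance on the PFM alone.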

A second set of supplemental benign training examples 181 is constructed to correspond to the benign missense variants 161 in the training dataset. FIG. 11 shows an example 1100 in which two PFMs are calculated for an example benign missense variant 1102 and the corresponding supplemental benign training example 1112. As can be seen in this example, the reference sequences 1102R and 1112R are the same for both the benign missense variant 1102 and the supplemental benign training example 1112. Their respective alternative sequences 1102A and 1112A are also shown in FIG. 11. Two PFMs are generated or referenced for the two reference sequences, as described above for the example shown in FIG. 10. The two PFMs are identical, and only one PFM 1122 is shown in FIG. 11 for illustration purposes. Both PFMs represent amino acid sequences labeled benign ("0").

The disclosed technology calculates three PFMs, one each for 11 primate sequences, 48 mammal sequences, and 40 vertebrate sequences. FIG. 12 shows the three PFMs 1218, 1228, and 1238, each with 20 rows and 51 columns. In one implementation, the primate sequences do not include the human reference sequence. In another implementation, the primate sequences include the human reference sequence. The values of the cells in the three PFMs are calculated by counting the occurrences of an amino acid (row label) across all sequences for that PFM at a given position (column label). For example, if three primate sequences have the amino acid "K" at position 26, the cell with row label "K" and column label "26" has a value of "3".
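The counting rule above can be illustrated with a short sketch. The function names, the alphabetical row order, and the choice to skip characters that are not one of the 20 amino acids (such as alignment gaps) are assumptions for this example; the toy sequences are synthetic and only reproduce the "K at position 26" case from the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 rows; this row order is an assumption


def build_pfm(aligned_sequences, length=51):
    """Count amino acid occurrences per position across aligned sequences."""
    pfm = {aa: [0] * length for aa in AMINO_ACIDS}
    for seq in aligned_sequences:
        for col, aa in enumerate(seq[:length]):
            if aa in pfm:  # assumption: gaps/unknown characters are skipped
                pfm[aa][col] += 1
    return pfm


def build_group_pfms(groups, length=51):
    """Build one PFM per species group (e.g. primate, mammal, vertebrate)."""
    return {name: build_pfm(seqs, length) for name, seqs in groups.items()}


if __name__ == "__main__":
    flank = "A" * 25
    with_k = flank + "K" + flank  # "K" at position 26 (index 25)
    with_r = flank + "R" + flank  # "R" at position 26
    groups = {
        "primate": [with_k] * 3 + [with_r] * 8,   # 11 sequences
        "mammal": [with_r] * 48,
        "vertebrate": [with_r] * 40,
    }
    pfms = build_group_pfms(groups)
    print(pfms["primate"]["K"][25])  # 3
```

In this toy data, each column of the primate PFM sums to 11, and the fully conserved flanking positions hold the full count in a single row.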

One-hot encoding is a process by which categorical variables are converted into a form that can be provided as input to a deep learning model. A categorical value represents an alphanumeric value for an entry in the dataset. For example, the reference and alternative amino acid sequences each have 51 amino acid characters arranged in a sequence. The amino acid character "T" at position "1" in a sequence represents the amino acid threonine at the first position in the sequence. The amino acid sequence is encoded by assigning a value of "1" to the cell with row label "T" and column label "1" in the one-hot encoded representation. The one-hot encoded representation of an amino acid sequence has 0s in all cells except those that represent the amino acid (row label) occurring at a particular position (column label). FIG. 13 shows an example 1300 in which the reference and alternative sequences for a supplemental benign training example are represented in one-hot encoded form. The reference and alternative amino acid sequences are provided in one-hot encoded form as input to the variant pathogenicity prediction model. FIG. 14 includes an example 1400 illustrating the inputs to the variant pathogenicity prediction model.
The inputs include the human reference and alternative amino acid sequences in one-hot encoded form, the PFM 1218 for primates, the PFM 1228 for mammals, and the PFM 1238 for vertebrates. As described above, the PFM for primates can include only non-human primates or both human and non-human primates.
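A minimal sketch of the one-hot encoding described above follows. The function name and the alphabetical ordering of the 20 rows are assumptions for illustration; the source does not state the row order used in the figures.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row labels; this ordering is an assumption


def one_hot(sequence):
    """Return a 20 x len(sequence) matrix as a list of rows.

    The cell at (row of amino acid, column of position) is 1; all other
    cells are 0. Raises ValueError for characters outside AMINO_ACIDS.
    """
    rows = [[0] * len(sequence) for _ in AMINO_ACIDS]
    for col, aa in enumerate(sequence):
        rows[AMINO_ACIDS.index(aa)][col] = 1
    return rows


if __name__ == "__main__":
    encoded = one_hot("TRK")
    t_row = AMINO_ACIDS.index("T")
    print(encoded[t_row][0])  # 1: threonine at the first position
```

A 51-residue reference or alternative sequence encoded this way has the same 20 x 51 shape as each PFM, which is what allows the five inputs to be supplied to the model side by side.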

Variations on this approach to supplementing the training set apply both to the architecture described in the applications incorporated herein by reference in their entirety and to any other architecture that uses PFMs in combination with other data types, especially with sequences of amino acids or nucleotides.

Results

The performance of a neural network-based model (for example, the PrimateAI model presented above) improves when the pre-training epochs presented above are used. The following table presents example test results. The results in the table are organized under six headings, which are briefly described before the results are presented. The "Replicate" column presents the results of 20 replicate runs. Each run can be an ensemble of eight models with different random seeds. "Accuracy" is the proportion of the 10,000 withheld primate benign variants that are classified as benign. "Pvalue_DDD" presents the result of a Wilcoxon rank test that evaluates how well de novo mutations in affected children with developmental disorders are separated from those in their unaffected siblings. "pvalue_605genes" presents the result of a test similar to Pvalue_DDD, except that in this case only de novo mutations within the 605 disease-associated genes were used. "Corr_RK_RW" presents the correlation of PrimateAI scores between amino acid changes from R to K and from R to W. A smaller value of Corr_RK_RW indicates better performance. "Pvalue_Corr" presents the p-value of the correlation in the previous column, that is, Corr_RK_RW.

The results show that the median accuracy of benign variant prediction, using the median score of the unknown variants as the cutoff, is 91.44% across the 20 replicate runs. The log p-value of the Wilcoxon rank-sum test for separating de novo missense variants of DDD patients from de novo missense variants of controls is 29.39. Similarly, the log p-value of the rank-sum test when comparing de novo missense variants within only the 605 disease genes is 16.18. These metrics improve on the previously reported results. The correlation between R->K and R->W is significantly reduced, as measured by the Wilcoxon rank-sum test with p-value = 3.11e-70.
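The accuracy figure above uses the median score of the unknown variants as the classification cutoff. The sketch below shows one way such a metric could be computed; the function name is hypothetical, and it assumes that a higher score means "more pathogenic" and that a withheld benign variant counts as correctly classified when its score falls strictly below the cutoff. The scores in the example are made-up toy values, not results from the source.

```python
from statistics import median


def benign_accuracy(benign_scores, unknown_scores):
    """Fraction of withheld benign variants scored below the cutoff.

    Assumptions: higher score = more pathogenic; the cutoff is the
    median score of the unknown (unlabeled) variants.
    """
    cutoff = median(unknown_scores)
    correct = sum(1 for score in benign_scores if score < cutoff)
    return correct / len(benign_scores)


if __name__ == "__main__":
    toy_benign = [0.1, 0.2, 0.3, 0.9]
    toy_unknown = [0.2, 0.5, 0.8]
    print(benign_accuracy(toy_benign, toy_unknown))  # 0.75
```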

Particular Implementations

We describe systems, methods, and articles of manufacture for pre-training a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs). One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. The omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are incorporated by reference into each of the following implementations.

A system implementation of the disclosed technology includes one or more processors coupled to memory. The memory is loaded with computer instructions to reduce overfitting of a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs). The system includes logic to generate benign-labeled supplemental training example sequence pairs, each of which includes a start position, through a target amino acid position, to an end position. Each supplemental sequence pair matches the start position and the end position of a missense training example sequence pair, and has identical amino acids in its reference and alternative amino acid sequences. The system includes logic to input, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The system includes logic to train the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense training example sequence pairs at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

This system implementation and other systems disclosed optionally include one or more of the following features. The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.

The system can include logic to construct the supplemental sequence pairs such that each supplemental sequence pair matches the start position and the end position of a benign missense training example sequence pair.

The system can include logic to construct the supplemental sequence pairs such that each supplemental sequence pair matches the start position and the end position of a pathogenic missense training example sequence pair.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after three training epochs.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

The ratio of supplemental training example sequence pairs to pathogenic training example sequence pairs can be between 1:1 and 1:8. The system can use different values for the range, for example, between 1:1 and 1:12, between 1:1 and 1:16, and between 1:1 and 1:24.

The ratio of supplemental training example sequence pairs to benign training example sequence pairs can be between 1:2 and 1:8. The system can use different values for the range, for example, between 1:1 and 1:12, between 1:1 and 1:16, and between 1:1 and 1:24.
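To make the ratios above concrete, the following sketch selects supplemental examples so that they stand in a 1:k ratio to the missense examples they accompany. The function name, the rounding down of the target count, and the use of seeded random sampling are assumptions for illustration; the source specifies only the ratio ranges.

```python
import random


def sample_supplemental(candidates, n_missense, ratio_denominator, seed=0):
    """Pick supplemental examples at a 1:ratio_denominator ratio.

    Hypothetical helper: with 400,000 missense examples, a 1:1 ratio
    targets 400,000 supplemental examples and a 1:8 ratio targets 50,000.
    """
    target = min(len(candidates), n_missense // ratio_denominator)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(candidates, target)


if __name__ == "__main__":
    pool = list(range(1000))
    print(len(sample_supplemental(pool, n_missense=800, ratio_denominator=8)))  # 100
```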

The system includes logic to use, when generating the supplemental PFMs, amino acid positions from data for non-human primates and non-primate mammals.

Other implementations can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the functions of the system described above. Yet another implementation can include a method performing the functions of the system described above.

A method implementation of the disclosed technology includes generating benign-labeled supplemental training example sequence pairs, each of which includes a start position, through a target amino acid position, to an end position. Each supplemental sequence pair matches the start position and the end position of a missense training example sequence pair, and has identical amino acids in its reference and alternative amino acid sequences. The method includes inputting, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The method includes training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

This method implementation and other methods disclosed optionally include one or more of the following features. The method can also include features described in connection with the systems disclosed. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.

Other implementations can include a set of one or more non-transitory computer-readable storage media that collectively store computer program instructions executable by one or more processors to reduce overfitting of a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs). The computer program instructions, when executed on the one or more processors, implement a method that includes generating benign-labeled supplemental training example sequence pairs, each of which includes a start position, through a target amino acid position, to an end position. Each supplemental sequence pair matches the start position and the end position of a missense training example sequence pair, and has identical amino acids in its reference and alternative amino acid sequences. The method includes inputting, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The method includes training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense training examples at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

Computer-readable media (CRM) implementations of the disclosed technology include one or more non-transitory computer-readable storage media storing computer program instructions that, when executed on one or more processors, implement the method described above. This CRM implementation includes one or more of the following features. CRM implementations can also include features described in connection with the systems and methods described above.

The preceding description is presented to enable the making and use of the disclosed technology. Various modifications to the disclosed implementations will be apparent, and the general principles defined herein can be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the disclosed technology is defined by the appended claims.

Computer System

FIG. 15 is a simplified block diagram 1500 of a computer system that can be used to implement the disclosed technology. The computer system typically includes at least one processor that communicates with a number of peripheral devices via a bus subsystem. These peripheral devices can include, for example, a storage subsystem that includes memory devices and a file storage subsystem, user interface input devices, user interface output devices, and a network interface subsystem. The input and output devices allow user interaction with the computer system. The network interface subsystem provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the neural networks, such as the variant pathogenicity classifier 157, the PFM calculator 184, and the input encoder 186, are communicably linked to the storage subsystem and the user interface input devices.

User interface input devices can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into the computer system.

User interface output devices can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display, such as audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from the computer system to the user or to another machine or computer system.

The storage subsystem stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by the processor alone or in combination with other processors.

The memory used in the storage subsystem can include a number of memories, including a main random access memory (RAM) for storage of instructions and data during program execution and a read-only memory (ROM) in which fixed instructions are stored. The file storage subsystem can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by the file storage subsystem in the storage subsystem, or in other machines accessible by the processor.

The bus subsystem provides a mechanism for letting the various components and subsystems of the computer system communicate with each other as intended. Although the bus subsystem is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.

The computer system itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of the computer system depicted in FIG. 15 is intended only as a specific example for purposes of illustrating the disclosed technology. Many other configurations of the computer system are possible, having more or fewer components than the computer system depicted in FIG. 15.

The deep learning processor can be a GPU or an FPGA, and can be hosted by a deep learning cloud platform such as Google Cloud Platform, Xilinx, or Cirrascale. Examples of deep learning processors include Google's Tensor Processing Unit (TPU), rackmount solutions such as the GX4 Rackmount Series and GX8 Rackmount Series, NVIDIA DGX-1, Microsoft's Stratix V FPGA, Graphcore's Intelligent Processor Unit (IPU), Qualcomm's Zeroth Platform with Snapdragon processors, NVIDIA's Volta, NVIDIA's DRIVE PX, NVIDIA's JETSON TX1/TX2 MODULE, Intel's Nirvana, Movidius VPU, Fujitsu DPI, ARM's DynamicIQ, IBM TrueNorth, and others.

Claims

1. A method of reducing overfitting of a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs), the method comprising:
generating benign-labeled supplemental training example sequence pairs that each extend from a start position, through a target amino acid position, to an end position, wherein each supplemental sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in the reference and alternate amino acid sequences;
inputting, with each supplemental sequence pair, a supplemental training PFM identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense at the matching start and end positions,
whereby the training influence of the training PFMs is attenuated during the training.
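
The generating and inputting steps recited above can be illustrated with a minimal Python sketch. The `TrainingExample` class, its field names, and the `make_supplemental` helper are hypothetical names introduced here for illustration only; they do not come from the specification, and the PFM is represented as a plain tuple of rows purely as an assumption.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TrainingExample:
    """Hypothetical container for one training example sequence pair."""
    start: int                        # start position of the sequence window
    end: int                          # end position of the sequence window
    reference: str                    # reference amino acid sequence
    alternate: str                    # alternate amino acid sequence
    pfm: Tuple[Tuple[float, ...], ...]  # position frequency matrix rows
    label: str                        # "benign" or "pathogenic"


def make_supplemental(missense: TrainingExample) -> TrainingExample:
    """Derive a benign-labeled supplemental example from a missense example.

    The supplemental pair keeps the same start and end positions and the
    same PFM, but its alternate sequence is set equal to the reference
    sequence, so the pair carries no amino acid change.
    """
    return replace(missense, alternate=missense.reference, label="benign")
```

Because the supplemental example shares its PFM with the missense example but carries a benign label and no substitution, a model cannot rely on the PFM alone to separate the two, which is one way to read the attenuation recited in the claim.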

2. The method of claim 1, wherein the supplemental sequence pairs match the start position and the end position of pathogenic missense training example sequence pairs.

3. The method of claim 1, wherein the supplemental sequence pairs match the start position and the end position of benign missense training example sequence pairs.

4. The method of claim 1, further comprising modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

5. The method of claim 1, further comprising modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.
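
The epoch cutoff recited in the two preceding claims can be sketched as a simple schedule. The function name `examples_for_epoch`, the zero-based epoch index, and the default cutoff of 5 are assumptions made for illustration; the claims only state that the supplemental examples stop being used after a predetermined number of training epochs, for example five.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def examples_for_epoch(
    epoch: int,
    missense: Sequence[T],
    supplemental: Sequence[T],
    cutoff_epochs: int = 5,
) -> List[T]:
    """Return the training examples to use in a given epoch.

    `epoch` is zero-based. For the first `cutoff_epochs` epochs the
    supplemental examples are mixed in with the missense examples;
    afterwards only the missense examples are returned.
    """
    if epoch < cutoff_epochs:
        return list(missense) + list(supplemental)
    return list(missense)
```

A training loop would call this once per epoch and shuffle the returned list before batching.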

6. The method of claim 2, wherein the ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

7. The method of claim 3, wherein the ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.
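
One way to honor a supplemental-to-missense ratio between 1:1 and 1:8 is to choose a subset of missense examples from which supplemental examples are then derived. The sketch below is an assumption about how such a ratio might be enforced, not a procedure stated in the claims; `pick_supplemental_sources`, `missense_per_supplemental`, and the seeded random sampling are all hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_supplemental_sources(
    missense: Sequence[T],
    missense_per_supplemental: int,
    seed: int = 0,
) -> List[T]:
    """Select missense examples to serve as sources of supplemental examples.

    A value of 1 yields a 1:1 ratio and a value of 8 yields a 1:8 ratio
    of supplemental to missense examples (rounded down).
    """
    if not 1 <= missense_per_supplemental <= 8:
        raise ValueError("ratio must be between 1:1 and 1:8")
    count = len(missense) // missense_per_supplemental
    # Sample without replacement so each source is used at most once.
    return random.Random(seed).sample(list(missense), count)
```

For instance, 16 missense examples with `missense_per_supplemental=8` yield two source examples, and therefore two supplemental examples.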

8. The method of claim 1, further comprising using amino acid positions from data for non-human primates and non-primate mammals when generating the supplemental PFMs.

9. A system comprising one or more processors coupled to memory, the memory loaded with computer instructions to reduce overfitting of a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs), the instructions, when executed on the processors, implementing operations comprising:
generating benign-labeled supplemental training example sequence pairs that each extend from a start position, through a target amino acid position, to an end position, wherein each supplemental sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in the reference and alternate amino acid sequences;
inputting, with each supplemental sequence pair, a supplemental training PFM identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense at the matching start and end positions,
whereby the training influence of the training PFMs is attenuated or canceled during the training.

10. The system of claim 9, wherein the supplemental sequence pairs match the start position and the end position of pathogenic missense training example sequence pairs.

11. The system of claim 9, wherein the supplemental sequence pairs match the start position and the end position of benign missense training example sequence pairs.

12. The system of claim 9, further implementing an operation of modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

13. The system of claim 9, further implementing an operation of modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

14. The system of claim 10, wherein the ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

15. The system of claim 11, wherein the ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.

16. The system of claim 9, further implementing an operation of using amino acid positions from data for non-human primates and non-primate mammals when generating the supplemental PFMs.

17. A non-transitory computer readable storage medium storing computer program instructions to reduce overfitting of a neural network-implemented model that processes amino acid sequences and accompanying position frequency matrices (PFMs), the instructions, when executed on a processor, implementing a method comprising:
generating benign-labeled supplemental training example sequence pairs that each extend from a start position, through a target amino acid position, to an end position, wherein each supplemental sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in the reference and alternate amino acid sequences;
inputting, with each supplemental sequence pair, a supplemental training PFM identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense at the matching start and end positions,
whereby the training influence of the training PFMs is attenuated during the training.

18. The non-transitory computer readable storage medium of claim 17, wherein the supplemental sequence pairs match the start position and the end position of pathogenic missense training example sequence pairs.

19. The non-transitory computer readable storage medium of claim 17, wherein the supplemental sequence pairs match the start position and the end position of benign missense training example sequence pairs.

20. The non-transitory computer readable storage medium of claim 17, implementing the method further comprising modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

21. The non-transitory computer readable storage medium of claim 17, implementing the method further comprising modifying the training of the neural network-implemented model to stop using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

22. The non-transitory computer readable storage medium of claim 18, wherein the ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

23. The non-transitory computer readable storage medium of claim 19, wherein the ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.

24. The non-transitory computer readable storage medium of claim 17, implementing the method further comprising using amino acid positions from data for non-human primates and non-primate mammals when generating the supplemental PFMs.