KR102184720B1

KR102184720B1 - Prediction method for binding preference between mhc and peptide on cancer cell and analysis apparatus

Info

Publication number: KR102184720B1
Application number: KR1020190125897A
Authority: KR
Inventors: 최정균; 김권일
Original assignee: 한국과학기술원
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2020-11-30
Also published as: WO2021071182A1

Abstract

The present invention provides a technique for predicting the binding potential of a major histocompatibility complex (MHC) and a peptide presented on cancer cells. According to the present invention, a method for predicting MHC-peptide binding on the surface of cancer cells comprises the steps of: receiving, by an analysis device, genome data of a sample; generating, by the analysis device, a matrix representing the binding preference of amino acid pairs using a first amino acid sequence of an MHC and a second amino acid sequence of an antigen generated by the cancer cell; and predicting, by the analysis device, the binding potential of the MHC with the antigen by inputting the matrix into a neural network model trained in advance. The first amino acid sequence is a sequence commonly found in a cohort to which the sample belongs, or a sequence detected by the analysis device from the genome data.

Description

A method and analysis device for predicting MHC-peptide binding degree on the surface of cancer cells {PREDICTION METHOD FOR BINDING PREFERENCE BETWEEN MHC AND PEPTIDE ON CANCER CELL AND ANALYSIS APPARATUS}

The technology described below relates to a technique for predicting the degree of binding between an MHC and a peptide.

Cancer cells produce neoantigens. Epitopes of neoantigens are presented on major histocompatibility complexes (MHCs) located on the surface of cancer cells. T cells recognize the MHC-epitope complex and trigger an immune response.

Cancer cells evade immunity by exploiting the immune checkpoints of immune cells. Immune checkpoint inhibitors suppress these checkpoints so that activated immune cells in the body kill the cancer cells. To identify the neoantigens that cancer cells produce, MHC-peptide binding needs to be predicted.

Nielsen, M. et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS One 2, 2007.

The technology described below is intended to provide a technique for predicting the binding potential of MHC-peptide pairs presented on cancer cells.

A method for predicting the degree of MHC-peptide binding on the surface of cancer cells includes: receiving, by an analysis device, genome data of a sample; generating, by the analysis device, a matrix representing the interaction of amino acid pairs using a first amino acid sequence of a major histocompatibility complex (MHC) and a second amino acid sequence of an antigen generated by the cancer cells; and predicting, by the analysis device, the degree of binding between the MHC and the antigen by inputting the matrix into a neural network model trained in advance. The first amino acid sequence is a sequence commonly found in a cohort to which the sample belongs, or a sequence detected by the analysis device in the genome data.

An analysis device for predicting the degree of MHC-peptide binding includes: an input device that receives genome data of a sample; a storage device that stores a neural network model that predicts the degree of binding between an MHC and an antigen based on a matrix image representing the interaction of amino acid pairs between a first amino acid sequence constituting the major histocompatibility complex (MHC) and a second amino acid sequence constituting the antigen; and a computing device that detects the first amino acid sequence and the second amino acid sequence from the genome data, generates a matrix image indicating the degree of interaction of each pair of an amino acid of the first amino acid sequence and an amino acid of the second amino acid sequence, and inputs the generated matrix image into the neural network model to predict the degree of binding between the MHC and the antigen for the sample.

The technology described below can accurately predict the degree or possibility of binding between an MHC and a peptide using a neural network model.

FIG. 1 is an example of a system for predicting the degree of MHC-peptide binding.
FIG. 2 is an example of a process for predicting the degree of MHC-peptide binding.
FIG. 3 is an example of an interaction map.
FIG. 4 is an example of a conventional CNN model.
FIG. 5 is an example of a CNN model for predicting the degree of MHC-peptide binding.
FIG. 6 is an example of an analysis device for predicting the degree of MHC-peptide binding.
FIG. 7 shows the results of evaluating a CNN model for predicting the degree of MHC-peptide binding.

The technology described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the technology described below to specific embodiments, and it should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the technology described below.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of rights of the technology described below, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

As used in this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "includes" mean that the stated features, numbers, steps, operations, components, parts, or combinations thereof are present, and should be understood not to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, in performing a method or an operating method, the processes constituting the method may occur in an order different from the stated order unless a specific order is clearly stated in context. That is, the processes may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The terms used in the following description are defined below.

An antigen is a substance that induces an immune response.

A neoantigen is an antigen with an alteration that arises through a mutation in a tumor cell or through a post-translational modification specific to a tumor cell. A neoantigen may comprise a polypeptide sequence or a nucleotide sequence. The mutation may include a frameshift or non-frameshift indel, a missense or nonsense substitution, a splice site alteration, a genomic rearrangement or gene fusion, or any genomic or expression alteration that gives rise to a neoORF. The mutation may also include a splice variant. Post-translational modifications specific to tumor cells may include aberrant phosphorylation. Post-translational modifications specific to tumor cells may also include proteasome-generated spliced antigens.

An epitope may refer to the specific portion of an antigen to which an antibody or a T-cell receptor typically binds.

MHC is a peptide structure that serves as a mediator, causing the target substance of an immune response to be recognized as an antigen.

A peptide is a polymer of amino acids. For convenience of explanation, "peptide" hereinafter refers to an amino acid polymer or amino acid sequence that cancer cells present on their surface.

The MHC-peptide complex is presented on the surface of cancer cells and is a composite structure of an MHC and a peptide. T cells recognize the MHC-peptide complex and carry out an immune response.

The degree of binding refers to the extent of binding between an MHC and a peptide. Binding preference, or binding affinity, refers to the binding affinity between an MHC molecule and a peptide.

A specimen, or sample, refers to a single cell or multiple cells, cell fragments, body fluid, or the like collected from a subject to be analyzed.

A subject includes a cell, a tissue, or an organism. The subject is basically a human, but is not limited thereto.

The exome is the subset of the genome that encodes proteins. The exome may refer to the set of exons present in a cell, a group of cells, or a subject.

Genome data, or genome information, refers to genetic information obtained by analyzing a sample. For example, genome data may include base sequences obtained from the deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or protein of cells, tissues, and the like, gene expression data, genetic variation relative to reference genome data, DNA methylation, and the like. In general, genome data includes sequence information obtained by analyzing a specific sample. Genome data can be obtained in various ways. For example, genome data can be generated through next-generation sequencing (NGS) analysis. Genome data can be represented as digital data that a computer can process.

Machine learning, or learning, is a field of artificial intelligence concerned with developing algorithms that allow computers to learn. A machine learning model, or learning model, is a model developed so that a computer can learn. Depending on the approach, learning models include various models such as artificial neural networks and decision trees.

The analysis device refers collectively to any device that predicts the degree of MHC-peptide binding. The analysis device processes and analyzes data using an installed program or code.

FIG. 1 is an example of a system 100 for predicting the degree of MHC-peptide binding. In FIG. 1, the analysis devices 130, 140, and 150 predict the degree of MHC-peptide binding. In FIG. 1, the analysis device is shown in the form of a server 130 and computer terminals 140 and 150. The server 130 may provide a service for predicting the degree of MHC-peptide binding over a network. The computer terminal 140 is connected to a network, receives genome data, and predicts the degree of MHC-peptide binding using an installed application. The computer terminal 150 receives input data from a medium storing genome data (e.g., a USB drive or an SD card) and predicts the degree of MHC-peptide binding using an installed application. The analysis devices 130, 140, and 150 may be implemented in various forms.

The analysis devices 130, 140, and 150 analyze the degree of MHC-peptide binding using genome data. Here, the genome data includes information on the genome sequence. The genome analysis device 110 analyzes a sample and generates genome data. For example, the genome analysis device 110 may be an NGS analyzer. The genome analysis device 110 may also store the generated genome data in a separate DB 120.

The genome analysis device 110 generates genome data using a genomic library. The genome analysis device 110 may perform whole exome sequencing on the genomic library. The genomic library may be prepared using a commercial kit. For example, the genomic library for sequencing may be generated using the AllPrep DNA/RNA Mini Kit (Qiagen, 80204), the AllPrep DNA/RNA Micro Kit (Qiagen, 80284), or the QIAamp DNA FFPE Tissue Kit (Qiagen, 56404).

The users 10, 20, and 30 can check the degree of MHC-peptide binding for a specific sample. The user 10 may access the server 130 through a user terminal (a PC, a smartphone, or the like) and check the result of the analysis performed by the server 130. The user 20 can check the degree of MHC-peptide binding through the computer terminal 140 that the user 20 uses. The user 30 can check the degree of MHC-peptide binding through the computer terminal 150 that the user 30 uses.

The users 10, 20, and 30 may be researchers studying MHC-peptide binding. Alternatively, the users 10, 20, and 30 may be medical staff considering prescribing a cancer immunotherapy drug for a specific patient.

FIG. 2 is an example of a process 200 for predicting the degree of MHC-peptide binding. The analysis device 130, 140, or 150 may predict the degree of MHC-peptide binding through the prediction process 200 described with reference to FIG. 2.

The analysis device receives genome data of a sample (210). Here, the sample refers to the subject whose degree of MHC-peptide binding is to be determined. The sample genome data refers to the genome data of the subject being analyzed.

The analysis device must detect the MHC amino acid sequence and the peptide amino acid sequence in the genome data. The analysis device may use a program or model that predicts the MHC structure. For example, the analysis device may predict the human leukocyte antigen (HLA) structure using HLAminer. The analysis device may also identify the peptide amino acid sequence in the genome data using a suitable tool. For example, the analysis device may use the idfetch program to retrieve the amino acid sequences flanking nonsynonymous mutations in the genome data and thereby detect the peptide sequence.
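As a rough illustration of what "sequences flanking a nonsynonymous mutation" can mean in practice, the sketch below enumerates every fixed-length window of a mutated protein sequence that covers the mutated residue. The helper name `peptides_covering_mutation`, the 0-based position, and the default length of 9 are assumptions made for illustration; this is not a description of how idfetch works.

```python
def peptides_covering_mutation(protein: str, mut_pos: int, length: int = 9) -> list[str]:
    """Return every window of `length` residues that contains index `mut_pos`.

    `protein` is the mutated amino acid sequence and `mut_pos` is the 0-based
    index of the altered residue (both hypothetical inputs).
    """
    if not 0 <= mut_pos < len(protein):
        raise ValueError("mut_pos is outside the protein sequence")
    first = max(0, mut_pos - length + 1)          # leftmost start that still covers mut_pos
    last = min(mut_pos, len(protein) - length)    # rightmost start that fits in the sequence
    return [protein[s:s + length] for s in range(first, last + 1)]


candidates = peptides_covering_mutation("ACDEFGHIKLMNPQRSTVWY", mut_pos=10)
```

Each candidate window could then be paired with an MHC sequence and scored separately.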

The analysis device generates an interaction map using the genome data (220). The interaction map contains binding affinity information between the amino acids constituting the MHC and the amino acids of the peptide, that is, between amino acid pairs. The binding affinity information may be information that directly or indirectly indicates the binding preference between two amino acids. For example, the binding affinity information may be the interaction energy value of an amino acid pair in a given protein structure.

The interaction map may be a matrix whose values indicate binding affinity. The interaction map may also be a two-dimensional image in which the binding affinity between amino acid pairs is expressed as color values. The interaction map is described in more detail later.
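A minimal sketch of the matrix form of the interaction map, assuming a lookup table of pairwise scores is already available. The function name `build_interaction_map`, the toy scores, and the short sequences are illustrative assumptions, not values from the source.

```python
def build_interaction_map(
    mhc_seq: str,
    peptide_seq: str,
    pair_score: dict[tuple[str, str], float],
    default: float = 0.0,
) -> list[list[float]]:
    """Return a matrix with one row per MHC residue and one column per peptide residue.

    Each cell holds the score of the corresponding amino acid pair. The lookup
    is symmetric: (a, b) and (b, a) are treated as the same pair.
    """
    def score(a: str, b: str) -> float:
        return pair_score.get((a, b), pair_score.get((b, a), default))

    return [[score(m, p) for p in peptide_seq] for m in mhc_seq]


# Toy pairwise preferences (made-up numbers for illustration only).
toy_scores = {("L", "V"): 0.8, ("K", "E"): 0.6, ("G", "G"): 0.1}
toy_map = build_interaction_map("LKG", "VE", toy_scores)
```

Here `toy_map` has three rows (MHC residues L, K, G) and two columns (peptide residues V, E).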

The analysis device performs the analysis by inputting the interaction map into a neural network model (230). FIG. 2 shows, as an example, a neural network model that extracts binding features from the interaction map and outputs information on binding or non-binding. The analysis device provides information on the degree of MHC-peptide binding based on the information output by the neural network model.

The analysis device generates the interaction map using the genome data. Alternatively, a computer device other than the analysis device may generate the interaction map and deliver it to the analysis device. In that case, the analysis device can predict the degree of MHC-peptide binding by inputting the received interaction map directly into the neural network model. The following description assumes that the analysis device generates the interaction map.

The analysis device may generate the interaction map based on previously published information on protein structures. Databases (DBs) released for academic or commercial use may hold information on protein structures. Based on native protein structures, the frequency with which amino acids are adjacent in protein structures can be calculated. If a specific amino acid pair is adjacent with high frequency in proteins, the interaction energy of that pair can be regarded as high. Once interaction energies have been calculated for the amino acid pairs that make up protein structures, the analysis device can determine the interaction preference between each pair of an amino acid belonging to the MHC and an amino acid belonging to the peptide. The analysis device may then generate an interaction map that represents the interaction preference between amino acid pairs numerically. Furthermore, the analysis device may generate an interaction map that represents the interaction preference between amino acid pairs as color values.
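The frequency idea can be sketched as follows, assuming a list of residue pairs observed to be adjacent in known structures is already in hand. The function name `contact_frequencies` and the toy contact list are hypothetical. The sketch uses plain relative frequency as the preference score; published statistical potentials typically also normalize by how often each pair would be expected by chance, which is omitted here.

```python
from collections import Counter


def contact_frequencies(contacts: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """Turn observed adjacent residue pairs into relative frequencies.

    Pairs are treated as unordered, so ("L", "V") and ("V", "L") are counted
    together under the sorted key ("L", "V").
    """
    counts = Counter(tuple(sorted(pair)) for pair in contacts)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Toy observations (made up for illustration).
observed = [("L", "V"), ("V", "L"), ("K", "E"), ("L", "V")]
preference = contact_frequencies(observed)
```

A table like `preference` could serve as the pairwise score lookup when filling in an interaction map.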

The interaction preference between amino acid pairs may be determined in units of regions into which the protein structure is divided. For example, the interaction preference between amino acids may be determined based on contacts between Cα atoms, or between other atoms, present in the protein structure. That particular atoms are in contact in a protein structure means that the interaction or affinity between those atoms is high.
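One common way to decide whether two Cα atoms are "in contact" is a distance threshold on their 3D coordinates. The sketch below takes that approach; the function name `ca_contacts` and the 8 Å cutoff are assumptions, since the text does not state how contact is defined.

```python
import math


def ca_contacts(
    ca_coords: list[tuple[float, float, float]],
    cutoff: float = 8.0,
) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose C-alpha atoms lie within `cutoff`.

    `ca_coords` holds one (x, y, z) coordinate per residue, in angstroms.
    The cutoff value is an illustrative assumption.
    """
    contacts = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts


# Three residues on a line: the first two are close, the third is far away.
pairs = ca_contacts([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```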

The researchers verified the effectiveness of the neural network model for predicting the degree of MHC-peptide binding. Some of the verification results are described later. The experimental results showed the highest reliability when contacts between Cα atoms (Cα-Cα) in the protein structure were used. Therefore, it may be preferable to generate the interaction map based on contacts between Cα atoms. The Cα atoms are, among the atoms constituting the molecule, the atoms that form the backbone of the three-dimensional structure.

FIG. 3 is an example of an interaction map. The interaction map is a two-dimensional matrix with a horizontal axis and a vertical axis. FIG. 3 shows, as an example, an image with color values. The horizontal axis corresponds to an amino acid sequence labeled 1 to n, and the vertical axis corresponds to an amino acid sequence labeled a to z. One axis is the amino acid sequence of the MHC, and the other axis is the amino acid sequence of the antigen peptide. The interaction preference (likelihood of proximity) of each amino acid pair is expressed as a color. Black means that the interaction preference is high, and white means that it is low. In FIG. 3, amino acid 7 on the horizontal axis generally has a high interaction preference. Amino acids f and p on the vertical axis also generally have a high interaction preference. The interaction preference of amino acid pair 7-f is higher than that of amino acid pair 7-p.
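The color convention described for FIG. 3 (black for high preference, white for low) can be sketched as a linear rescaling of the matrix to 8-bit gray levels. The function name `to_grayscale` and the min-max scaling are assumptions; the source does not say how values are mapped to colors.

```python
def to_grayscale(matrix: list[list[float]]) -> list[list[int]]:
    """Map matrix values to gray levels 0..255, where 0 (black) is the highest value.

    Values are rescaled linearly between the matrix minimum and maximum.
    A constant matrix is rendered entirely white.
    """
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant matrix
    return [[round(255 * (1 - (v - lo) / span)) for v in row] for row in matrix]


pixels = to_grayscale([[0.0, 0.5], [1.0, 0.25]])
```

In `pixels`, the cell holding 1.0 becomes 0 (black) and the cell holding 0.0 becomes 255 (white).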

The analysis device predicts the degree of MHC-peptide binding using a neural network model. Various models may be used as the neural network model, such as a recurrent neural network (RNN), a feedforward neural network (FFNN), or a convolutional neural network (CNN). For convenience, the following description focuses on a CNN model.

FIG. 4 is an example of a conventional CNN model. FIG. 4 is provided to explain the general structure and operation of a CNN model.

A CNN includes convolutional layers (Conv), pooling layers (Pool), and fully connected layers. Multiple convolutional layers and pooling layers may be arranged repeatedly. The CNN of FIG. 4 may have a structure of five convolutional layers, two pooling layers, and two fully connected layers.

A convolutional layer outputs a feature map through a convolution operation on the input image. The filter that performs the convolution operation is also called a kernel. The size of the filter is called the filter size or kernel size. The operation parameters that make up the kernel are called kernel parameters, filter parameters, or weights.

A convolutional layer performs a convolution operation and a nonlinear operation.

The convolution operation is performed within a window of fixed size. The window can move one cell at a time from the top left to the bottom right of the image, and the distance moved at each step can be adjusted. This step size is called the stride. The convolutional layer performs the convolution operation over all regions of the input image while moving the window across it. The convolutional layer can pad the edges of the image so that the dimensions of the input image are preserved after the convolution operation.
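The window, stride, and padding behavior can be made concrete with a small pure-Python sketch. The function name `conv2d` and the toy 3 x 3 inputs are illustrative assumptions. As in most CNN implementations, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Slide `kernel` over `image` with the given stride and zero padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Surround the image with `pad` rows and columns of zeros.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]

    # Number of window positions along each axis.
    out_h = (h + 2 * pad - kh) // stride + 1
    out_w = (w + 2 * pad - kw) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
same_size = conv2d(image, ones, stride=1, pad=1)  # 3 x 3 output, same as the input
no_pad = conv2d(image, ones)                      # 1 x 1 output
```

With a 3 x 3 kernel, a padding of 1 and a stride of 1 keep the output at 3 x 3, while no padding shrinks it to a single value.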

The nonlinear operation layer determines the output value of each neuron (node). The nonlinear operation layer uses a transfer function. Transfer functions include the ReLU and sigmoid functions.
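The two transfer functions named above have simple closed forms, sketched here for scalar inputs. The two-branch form of `sigmoid` is an implementation choice to avoid overflow for large negative inputs.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real input to the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is small and cannot overflow
    return z / (1.0 + z)
```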

The pooling layer subsamples the feature map obtained from the operation in the convolutional layer. Pooling operations include max pooling and average pooling. Max pooling selects the largest sample value within the window. Average pooling takes the average of the values within the window.
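Both pooling variants can be sketched with one helper that applies a different reduction to each window. The function name `pool2d`, the non-overlapping 2 x 2 window, and the toy feature map are illustrative assumptions.

```python
def pool2d(fmap, size=2, mode="max"):
    """Subsample `fmap` with non-overlapping size x size windows."""
    if mode == "max":
        reduce = max
    elif mode == "average":
        reduce = lambda vals: sum(vals) / len(vals)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[r + a][c + b] for a in range(size) for b in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [6, 2, 0, 4]]
max_pooled = pool2d(fmap, mode="max")       # [[7, 2], [6, 8]]
avg_pooled = pool2d(fmap, mode="average")   # [[4.0, 1.0], [2.5, 4.0]]
```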

The fully connected layer finally classifies the input image. The fully connected layer receives all the values output by the preceding convolutional layer and performs the final classification. In FIG. 4, the fully connected layer outputs the classification result using a softmax function.
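The softmax function turns a list of raw class scores into probabilities that sum to 1. A minimal sketch follows; subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution over classes."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The class with the largest raw score receives the largest probability.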

FIG. 5 is an example of a CNN model 300 for predicting the degree of MHC-peptide binding. The CNN model 300 predicts the degree of MHC-peptide binding based on the input data (the interaction map). The CNN model 300 includes a plurality of convolutional layers 310 and 320, a fully connected layer 330, and an output layer 340. As shown in FIG. 5, the convolutional stage may consist of two layers.

The convolutional layers 310 and 320 perform a convolution operation on the input data and output the value obtained by applying a rectified linear unit (ReLU) function to the convolved value. The convolution operation multiplies the input values by a weight matrix. The weights are obtained through the training process. The convolutional layers 310 and 320 extract interaction features from the input data. The kernels use the interaction map to find specific motifs that are important for peptide-MHC binding.

The interaction map described above may be used as the input data. The input data includes parameters indicating the degree of interaction of amino acid pairs.

The fully connected layer 330 integrates the incoming information. It receives the values output by the convolutional layer 320 as input. The fully connected layer 330 may perform a ReLU operation.

The output layer 340 uses a sigmoid function to output information on the likelihood of binding between the given protein and the MHC protein.

컨볼루션 계층(310, 320)은 특정 개수의 커널 또는 가중치 매트릭스를 이용하여 컨볼루션을 수행한다. 컨불루션은 1차원 또는 2차원 연산 등일 수 있다. 모든 컨볼루션 결과들은 ReLU에 의하여 변환된다. ReLU는 음의 값을 0으로 변환한다. The convolutional layers 310 and 320 perform convolution using a specific number of kernels or weight matrices. The convolution may be a one-dimensional or two-dimensional operation. All convolution results are converted by ReLU. ReLU converts negative values to zero.

The first convolutional layer 310 (hereinafter, the first convolutional layer) detects binding patterns in the input data. The first convolutional layer may use a window that moves with a step of 1. A higher-level convolutional layer corresponds to convolution kernels that detect binding patterns in the output values of the previous layer. The operation of a convolutional layer is given by Equation 1 below. The second convolutional layer 320 (hereinafter, the second convolutional layer) may have the same structure as the first convolutional layer 310. Alternatively, the window size or the stride of the second convolutional layer 320 may differ from those of the first convolutional layer 310.

X is the input data, i is the index of the output position, and k is the index of the kernel. Each convolution kernel W_k is a weight matrix of size M×N, where M is the window size and N is the number of input channels.
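Equation 1 itself is not reproduced in this text. A form consistent with the symbol definitions above and with the ReLU step described for the convolutional layers would be the following. The entry indices m and n and the exact index layout are assumptions for illustration, not notation taken from the patent.

```latex
\mathrm{convolution}(X)_{ik}
  = \mathrm{ReLU}\!\left( \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} (W_k)_{mn} \, X_{i+m,\,n} \right)
```

Here (W_k)_{mn} is the entry of kernel W_k at window offset m and input channel n, so each output position i of kernel k sums over one M×N window of the input.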

No pooling layer is used. That is, all output values of the convolutional layers are used for prediction. This is because amino acids that are relatively far apart can also affect the interaction between the MHC-peptide complex and the T cell receptor.

The fully connected layer 330 takes all outputs of the second convolutional layer 320 as input. The fully connected layer 330 integrates the values output by the previous layer. The fully connected layer computes ReLU(WX), where X is the input and W is the weight matrix of the fully connected layer.

The output layer 340 may output a value between 0 and 1 according to the sigmoid function. The value output by the output layer 340 classifies the binding or non-binding state. The output layer 340 computes Sigmoid(WX), where X is the input and W is the weight matrix of the sigmoid output layer.
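The layer sequence described above (two convolutional layers with ReLU, no pooling, a fully connected ReLU layer, and a sigmoid output) can be sketched as a plain NumPy forward pass. This is an illustration under stated assumptions, not the patent's implementation: the convolution is taken to be one-dimensional along the first axis with the second axis as channels, and every size and random weight below is invented for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(x, kernels, stride=1):
    """x: (length, channels); kernels: (K, M, N) with N == channels."""
    num_kernels, window, _ = kernels.shape
    positions = range(0, x.shape[0] - window + 1, stride)
    out = np.array([
        [np.sum(kernels[k] * x[i:i + window]) for k in range(num_kernels)]
        for i in positions
    ])
    return relu(out)  # ReLU sets negative convolution results to zero

def forward(x, w1, w2, w_fc, w_out):
    h = conv_layer(conv_layer(x, w1), w2)  # two convolutional layers, no pooling
    h = relu(w_fc @ h.ravel())             # fully connected layer: ReLU(WX)
    return sigmoid(w_out @ h)              # output layer: Sigmoid(WX)

# Toy shapes (assumptions): 34 positions x 9 channels, 10 kernels, window 3.
rng = np.random.default_rng(0)
x = rng.random((34, 9))
w1 = rng.normal(scale=0.1, size=(10, 3, 9))
w2 = rng.normal(scale=0.1, size=(10, 3, 10))
w_fc = rng.normal(scale=0.1, size=(16, 30 * 10))
w_out = rng.normal(scale=0.1, size=(1, 16))
score = forward(x, w1, w2, w_fc, w_out)
```

Because no pooling is applied, the fully connected layer receives every output position of the second convolutional layer, which is why its weight matrix has 30 × 10 input columns in this toy setup.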

Instead of the sigmoid, the output layer 340 may use another activation function such as softmax or ReLU.

Hereinafter, the CNN model refers to a model that predicts the degree or likelihood of MHC-peptide binding, as shown in FIG. 5.

The CNN model is trained so as to minimize an objective function. Training corresponds to optimizing the weights used in the CNN model. For example, the weights may be optimized using gradient descent.

The objective function is defined as the sum of the negative log likelihood (NLL) and regularization terms. The objective function for the CNN model can be expressed as Equation 2 below.


s is the index of the training sample (data). t is the index of the interaction feature. Y_t^s is the label (0 or 1) for sample s and interaction feature t. f_t(X^s) is the predicted probability for interaction feature t of input X^s.

Any of the regularization techniques used in training deep learning networks may be used, and a combination of several regularization techniques may be applied to the CNN model. The L2 regularization term ||W||₂² is the sum of squares of all weight matrices. The L1 regularization term ||H^-1||₁ is the L1 norm of all output values of the fully connected layer immediately before the output layer.
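Equation 2 is not reproduced in this text. Based on the description (a sum of the NLL and regularization terms) and the symbol definitions above, one consistent form would be the following. The coefficient names λ1 and λ2 are assumptions for the L2 and L1 regularization parameters, and the binary cross-entropy form of the NLL is inferred from the 0/1 labels rather than stated in the text.

```latex
\mathrm{objective} = \mathrm{NLL}
  + \lambda_1 \lVert W \rVert_2^2
  + \lambda_2 \lVert H^{-1} \rVert_1,
\qquad
\mathrm{NLL} = -\sum_{s} \sum_{t}
  \Big[ Y_t^s \log f_t(X^s) + (1 - Y_t^s) \log\big(1 - f_t(X^s)\big) \Big]
```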

The optimization is subject to a regularization condition. The condition is that ||W_m^n||₂ ≤ λ3 for any layer m and neuron n, or that the L2 norm of the weights of every neuron is greater than a specific value.

Various values may be used for the hyperparameters of the CNN model. The learning rate may be chosen from [0.001, 0.01, 0.1]. The number of kernels in the first and second (convolutional) layers may be chosen from [10, 30, 50]. The L1 and L2 regularization parameters may be chosen from [0.001, 0.01, 0.1]. The momentum may be chosen from [0.1, 0.5, 0.9].
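The candidate values listed above define a small search grid. The sketch below enumerates every combination. It assumes that both convolutional layers share one kernel count, that the L1 and L2 parameters are varied independently, and that the first momentum value, printed as "0,1" in the source, means 0.1. The names `GRID` and `iter_configs` are illustrative only.

```python
from itertools import product

# Candidate values taken from the description; the grouping is an assumption.
GRID = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_kernels": [10, 30, 50],   # assumed to be shared by both conv layers
    "l1": [0.001, 0.01, 0.1],
    "l2": [0.001, 0.01, 0.1],
    "momentum": [0.1, 0.5, 0.9],   # assumption: "0,1" in the source read as 0.1
}

def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
```

With five hyperparameters of three candidates each, the full grid contains 3^5 = 243 configurations.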

In particular, the size of the filters used to extract interaction features in the convolutional layers may vary. For example, a peptide of 1 to 5 bp, a half-length HLA, a two-thirds-length HLA, and a full-length HLA may each use filters of different sizes.

FIG. 6 shows an example of an analysis device 400 for predicting the degree of MHC-peptide binding. The analysis device 400 corresponds to the analysis device 130, 140, or 150 of FIG. 1.

The analysis device 400 may predict the degree of MHC-peptide binding using the neural network model described above. The analysis device 400 may be physically implemented in various forms. For example, it may take the form of a computer device such as a PC, a network server, or a chipset dedicated to image processing. The computer device may include a mobile device such as a smart device.

The analysis device 400 includes a storage device 410, a memory 420, a computing device 430, an interface device 440, a communication device 450, and an output device 460.

The storage device 410 stores the neural network model that predicts the degree of MHC-peptide binding. The neural network model must be trained in advance. The storage device 410 may also store programs or source code required for data processing, as well as the input genome data and the predicted degree of MHC-peptide binding.

The memory 420 may store data and information generated while the analysis device 400 analyzes the received data.

The interface device 440 receives commands and data from the outside. The interface device 440 may receive genome data from a physically connected input device or an external storage device. The interface device 440 may receive a learning model for data analysis. The interface device 440 may also receive training data, information, and parameter values for training the learning model.

The communication device 450 is a component that receives and transmits information over a wired or wireless network. The communication device 450 may receive genome data from an external object. The communication device 450 may also receive data for model training. The communication device 450 may transmit information on the degree of MHC-peptide binding determined for the input sample to an external object.

The communication device 450 and the interface device 440 receive data or commands from the outside, and may therefore be referred to as input devices.

The output device 460 outputs information. It may output an interface required for data processing, the analysis results, and the like.

The computing device 430 may predict the degree of MHC-peptide binding for the input sample genome data using the neural network model stored in the storage device 410. The computing device 430 may predict the degree of MHC-peptide binding from the output of the neural network model, either directly or after certain processing. The computing device 430 may also train the learning model used for predicting MHC-peptide binding with given training data. The computing device 430 may be a device that processes data and performs computations, such as a processor, an AP, or a chip with an embedded program.

The following describes an experiment that verified the performance of the CNN model described above. An immune epitope database (e.g., IEDB) was used as the training data for building the CNN model. The binding preference of amino acid pairs was estimated using a database of public protein structures, and an interaction map was generated from it. The CNN model was trained on the two-dimensional interaction patterns between peptide and MHC.

The training data were obtained from IEDB 3.0 and are intended for predicting binding between peptides and MHC class I. The database provides 57,173 data points (training data) with binding affinity defined in terms of IC₅₀/EC₅₀. An IC₅₀/EC₅₀ value of 500 nM was used as the threshold for determining affinity: a sample was labeled as binding when IC₅₀/EC₅₀ < 500 nM and as non-binding when IC₅₀/EC₅₀ ≥ 500 nM. That is, interaction maps were generated from the training data, and the CNN model was trained on the interaction maps and the binding-state labels of the corresponding training data. The training data consisted mainly of 9-mer and 10-mer peptides and of the HLA-A and HLA-B types of MHC class I.
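The labeling rule above reduces to a single comparison against the 500 nM threshold. The following sketch shows it on made-up records; the peptide strings, HLA names, and affinity values are placeholders rather than IEDB entries, and `label_binding` is a hypothetical helper name.

```python
BINDING_THRESHOLD_NM = 500.0  # IC50/EC50 threshold stated in the description

def label_binding(affinity_nm, threshold_nm=BINDING_THRESHOLD_NM):
    """Return 1 (binding) if IC50/EC50 < threshold, else 0 (non-binding)."""
    return 1 if affinity_nm < threshold_nm else 0

# Made-up example records: (peptide, HLA type, IC50/EC50 in nM).
records = [
    ("PEPTIDEAA", "HLA-A", 35.0),
    ("PEPTIDEBB", "HLA-B", 500.0),
    ("PEPTIDECC", "HLA-A", 12000.0),
]
labels = [label_binding(affinity) for _, _, affinity in records]
```

A value of exactly 500 nM falls on the non-binding side, matching the "≥ 500 nM" condition in the text.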

FIG. 7 shows the results of evaluating the CNN model that predicts the degree of MHC-peptide binding. FIG. 7(A) shows the predicted binding between 9-mer peptides and HLA-A. FIG. 7(B) shows the predicted binding between 10-mer peptides and HLA-A. FIG. 7(C) shows the predicted binding between 9-mer peptides and HLA-B.

The area under the curve (AUC) for HLA-A and HLA-B was 0.93 and 0.94, respectively. The CNN model was compared with NetMHCpan, a previously known prediction tool, using the IEDB test data set. For information on NetMHCpan, see "Automated benchmarking of peptide-MHC class I binding predictions", Bioinformatics 31, 2015. Compared with the conventional prediction tool, the CNN model showed higher performance in 70.5% of all test cases, regardless of HLA type.
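The text does not state how the AUC was computed. Assuming it refers to the area under the ROC curve, the value equals the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, which can be sketched directly. The labels and scores below are toy values, not results from the experiment.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of binder and non-binder scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    # A binder scoring above a non-binder counts 1; a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four binder/non-binder pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for data sets of the size described here.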

In addition, the neural network model construction method and the MHC-peptide binding prediction method described above may be implemented as a program (or application) containing an algorithm executable on a computer. The program may be stored and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB drive, memory card, or ROM.

The present embodiments and the drawings attached to this specification merely illustrate part of the technical idea of the technology described above. It is evident that all modifications and specific embodiments that a person skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings fall within the scope of rights of the technology described above.

Claims

A method for predicting the degree of MHC-peptide binding on the surface of cancer cells, the method comprising:
receiving, by an analysis device, genome data of a sample;
obtaining, by the analysis device, a first amino acid sequence of a major histocompatibility complex (MHC) and a second amino acid sequence of an antigen generated by cancer cells by using the genome data;
generating, by the analysis device, a matrix representing information on amino acid pairs using the first amino acid sequence and the second amino acid sequence; and
predicting, by the analysis device, the degree of binding between the MHC and the antigen by inputting the matrix into a neural network model trained in advance,
wherein, in the matrix, a first axis is the first amino acid sequence, a second axis is the second amino acid sequence, and each amino acid pair formed by the first amino acid sequence and the second amino acid sequence carries information on the proximity of the distance between the two amino acids constituting the pair, and
wherein the proximity is determined based on the frequency with which the amino acid pair is found in proximity in a public protein structure database.

The method of claim 1,
wherein the analysis device detects the first amino acid sequence by using a program that detects human leukocyte antigen (HLA) alleles in the genome data.

The method of claim 1,
wherein the second amino acid sequence is that of a neoantigen generated by cancer cells and includes a mutant sequence.

The method of claim 1,
wherein the matrix represents the binding preference of each amino acid pair based on the Cα of the MHC and the Cα of the antigen.

delete

The method of claim 1,
wherein the neural network model is a convolutional neural network (CNN), and the kernels of the CNN detect features of MHC-peptide binding.

The method of claim 1,
wherein the neural network model is a convolutional neural network (CNN) model consisting of a plurality of convolutional layers, a fully connected layer, and a sigmoid output layer.

The method of claim 7,
wherein each convolutional layer performs a transformation by applying a convolution operation and a rectified linear unit (ReLU) function,
the fully connected layer integrates the output values of the preceding convolutional layer using the ReLU function, and
the sigmoid output layer converts the output value of the fully connected layer into a value between 0 and 1 representing the binding state.

The method of claim 7,
wherein the convolutional layer performs a convolution operation (convolution(X)) according to the following formula.

(X is the input data, i is the index of the output position, k is the index of the kernel, each convolution kernel W_k is an M×N weight matrix, M is the window size, and N is the number of input channels)

The method of claim 1,
wherein the weights of the neural network model are learned with the following objective function.

(NLL is the loss function given by the sum of the negative log likelihoods, ||W||₂² is the L2 regularization term, which is the sum of squares of all weight matrices, and ||H^-1||₁ is the L1 norm of all output values of the fully connected layer)

The method of claim 1,
wherein the neural network model comprises a convolutional layer that uses filters of different sizes according to the length of the MHC and the amino acid length of the antigen.

An analysis device for predicting the degree of MHC-peptide binding, comprising:
an input device that receives genome data of a sample;
a storage device that stores a neural network model for predicting the degree of binding between a major histocompatibility complex (MHC) and an antigen based on information indicating the proximity of each amino acid pair formed by the amino acid sequence constituting the MHC and the amino acid sequence constituting the antigen; and
a computing device that detects a first amino acid sequence of the MHC and a second amino acid sequence of an antigenic peptide from the genome data, generates a matrix representing the proximity of each amino acid pair formed by amino acids of the first amino acid sequence and the second amino acid sequence, and predicts the degree of binding between the MHC and the antigen for the sample by inputting the generated matrix into the neural network model,
wherein, in the matrix, a first axis is the first amino acid sequence, a second axis is the second amino acid sequence, and each amino acid pair formed by the first amino acid sequence and the second amino acid sequence carries information on the proximity of the distance between the two amino acids constituting the pair, and
wherein the proximity is determined based on the frequency with which the amino acid pair is found in proximity in a public protein structure database.

The analysis device of claim 12,
wherein the computing device detects the first amino acid sequence for the sample by using a program that detects human leukocyte antigen (HLA) alleles in the genome data.

The analysis device of claim 12,
wherein the second amino acid sequence is that of a neoantigen generated by cancer cells and includes a mutant sequence.

The analysis device of claim 12,
wherein the computing device generates the matrix representing the binding preference of each amino acid pair based on the Cα of the MHC and the Cα of the antigen.

The analysis device of claim 12,
wherein the neural network model is a convolutional neural network (CNN) model consisting of a plurality of convolutional layers, a fully connected layer, and a sigmoid output layer.

The analysis device of claim 16,
wherein the convolutional layer performs a convolution operation (convolution(X)) according to the following equation.

The analysis device of claim 12,
wherein the weights of the neural network model are learned using the following objective function.

The analysis device of claim 12,
wherein the neural network model comprises a convolutional layer that uses filters of different sizes according to the length of the MHC and the amino acid length of the antigen.

A computer-readable recording medium on which is recorded a program for executing, on a computer, the method for predicting the degree of MHC-peptide binding on the surface of cancer cells according to any one of claims 1 to 4 and 6 to 11.