KR102278727B1

KR102278727B1 - Method for predicting neoantigen using a peptide sequence and HLA class II allele sequence and computer program

Info

Publication number: KR102278727B1
Application number: KR1020200119331A
Authority: KR
Inventors: 황태순; 백순명; 홍성의
Original assignee: 주식회사 테라젠바이오
Priority date: 2020-09-16
Filing date: 2020-09-16
Publication date: 2021-07-19

Abstract

Disclosed is a method for predicting a neoantigen using a peptide sequence and an HLA class II allele sequence. The method measures not only the binding affinity between a peptide sequence contained in cancer tissue and the HLA class II allele alpha-beta complex, or the alpha or beta sequence alone, but also the immunogenicity of the peptide sequence. A neoantigen in the cancer tissue can then be determined based on the measured immunogenicity.

Description

Method and computer program for predicting neoantigens using target cancer tissue and cell-free DNA-derived peptide sequences and HLA class II allele sequences {METHOD FOR PREDICTING NEOANTIGEN USING A PEPTIDE SEQUENCE AND HLA CLASS II ALLELE SEQUENCE AND COMPUTER PROGRAM}

Embodiments of the present disclosure relate to a method and a computer program for predicting neoantigens using peptide sequences derived from target cancer tissue and cell-free DNA, together with HLA class II allele sequences.

Advances in anticancer drug development have led from first-generation chemotherapeutic agents, through second-generation targeted agents, to third-generation immuno-oncology agents, which have recently drawn considerable attention. Unlike the earlier drugs, third-generation immuno-oncology agents rely on the patient's own immune system and therefore have the advantage of markedly fewer side effects. Despite this advantage, a treatment strategy based on immuno-oncology agents can be established only for patients who show expression of marker genes such as PD-L1 and high microsatellite instability (MSI-H). Because of this constraint, strategies are needed for patients who are difficult to treat with existing anticancer drugs, and one proposed alternative is a cancer vaccine based on neoantigens.

Such cancer vaccines are not limited to cancer tissue and can also be applied to cell-free DNA. Cell-free DNA is present in body fluids such as blood, plasma, saliva, lymph, cerebrospinal fluid, synovial fluid, cystic fluid, ascites, pleural effusion, amniotic fluid, chorionic villus samples, placental samples, tracheal lavage fluid, interstitial fluid, and ocular fluid. Exosomes present in blood also contain small amounts of protein, RNA, and DNA, so neoantigens for cancer vaccines can also be predicted from exosomal DNA.

Each patient's cancer tissue carries mutations that are not found in normal tissue. The core strategy of a cancer vaccine is to use peptides derived from these mutations as neoantigens so that the patient's immune system can recognize and attack them. Two conditions must be met first. The first is stable binding between the mutation-derived peptide and the patient-specific HLA allele. HLA class II alleles in particular span more genes (DM, DO, DP, DQ, DR) than HLA class I alleles (A, B, C) and form alpha-beta complexes, which makes their binding affinity harder to predict. The second is confirming that the mutation-derived peptide is immunogenic, that is, that it effectively stimulates the patient's immune system. To capture immunogenicity as fully as possible, every step through which immunogenicity arises should be modeled and its key features extracted, but steps may be omitted or lost in the process, and this can limit subsequent immunogenicity prediction.

Accordingly, the present technology seeks to implement a strategy that prevents the omission of immune process steps and extracts key features, based on combined data of peptides known to date to be immunogenic and MHC (HLA in humans) class II sequences. In addition, the binding affinity between the HLA class II alpha-beta complex and a peptide, the binding affinity between an HLA class II alpha or beta sequence and a peptide, and the immunogenicity of the peptide sequence itself are each modeled. The neoantigens applicable to a patient are then derived through an ensemble of machine learning models based on the following scores produced by those models: i) an immunogenicity score; ii) a binding affinity score between the alpha-beta complex and the peptide; iii) a binding affinity score between the alpha or beta sequence and the peptide; iv) a neoantigen prediction score for the alpha-beta complex and the peptide; and v) a neoantigen prediction score for the alpha or beta sequence and the peptide. The background art described above is technical information that the inventors possessed for, or acquired in the course of, deriving the present invention, and is not necessarily known art disclosed to the general public before the filing of the present invention.

The present invention addresses the need described above. Its object is to predict whether a mutation-derived peptide sequence present in a patient's cancer tissue binds to the patient-specific HLA class II alpha-beta complex and single chains and, after a series of immune processes, ultimately exhibits immunogenicity, and on that basis to determine neoantigens usable in a cancer vaccine tailored to the cancer patient.

A method for predicting a neoantigen using a peptide sequence and allele sequences of the HLA class II alpha and beta chains according to embodiments of the present invention may include: receiving, as input, a peptide sequence derived from target cancer tissue or cell-free DNA and allele sequences of the HLA class II alpha and beta chains; obtaining T cell activity data from the peptide sequence, inputting the T cell activity data into an immunogenicity prediction model, and outputting a first predicted value that predicts the immunogenicity of the peptide sequence; obtaining binding data from the allele sequence of the HLA class II alpha-beta complex, inputting the binding data into a binding prediction model, and outputting a second predicted value that predicts binding between the peptide sequence and the allele sequence of the HLA class II alpha-beta chain complex; obtaining binding data from the allele sequence of the HLA class II alpha or beta chain, inputting the binding data into a binding prediction model, and outputting a third predicted value that predicts binding between the peptide sequence and the HLA class II alpha or beta chain allele sequence; outputting a fourth predicted value that predicts a neoantigen for the target cell using the T cell activity data and the first and second predicted values; and outputting a fifth predicted value that predicts neoantigen information for the target cell using the T cell activity data and the first and third predicted values.

At least one of the immunogenicity prediction model (first predicted value), the model predicting binding between the HLA class II alpha-beta complex and a peptide (second predicted value), the model predicting binding between the HLA class II alpha or beta chain and a peptide (third predicted value), the neoantigen prediction model for the HLA class II alpha-beta complex and a peptide (fourth predicted value), and the neoantigen prediction model for the HLA class II alpha or beta chain and a peptide (fifth predicted value) may be trained by an ensemble machine learning algorithm on a training data set that includes peptide sequences present in a plurality of target cancer tissues and HLA class II allele sequences.
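The text does not say which ensemble algorithm is used. As a minimal sketch only, assuming a simple soft-voting scheme (an unweighted average of the scores from several base models), the idea could look like the following. The names `soft_vote`, `model_a`, `model_b`, and `model_c` are hypothetical, and the base models are constant placeholders standing in for trained predictors.

```python
from statistics import mean
from typing import Callable, Sequence

# A base model maps (peptide sequence, HLA class II allele sequence) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def soft_vote(models: Sequence[Scorer], peptide: str, allele: str) -> float:
    """Average the scores of all base models (soft voting; an assumed scheme)."""
    if not models:
        raise ValueError("at least one base model is required")
    return mean(model(peptide, allele) for model in models)


# Placeholder base models with no biological meaning; real ones would be trained.
def model_a(peptide: str, allele: str) -> float:
    return 0.2


def model_b(peptide: str, allele: str) -> float:
    return 0.6


def model_c(peptide: str, allele: str) -> float:
    return 0.7


ensemble_score = soft_vote([model_a, model_b, model_c], "PEPTIDEA", "ALPHACHAIN")
```

Other ensemble schemes (bagging, boosting, stacking) would fit the same description equally well; averaging is chosen here only because it is the shortest to show.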

The target cancer tissue may include cells engineered to express a single MHC class I or class II allele.

The target cancer tissue may include human cells obtained or derived from a plurality of patients.

The target cancer tissue may include fresh or frozen tumor cells obtained from a plurality of patients.

The target cancer tissue may include fresh or frozen tissue cells obtained from a plurality of patients.

The target cancer tissue may include cell-free DNA from blood obtained from a plurality of patients.

The target cancer tissue may include exosomal DNA obtained from a plurality of patients.

The target cancer tissue may include peptides identified using a T cell assay.

The training data set may include at least one of: data on proteome sequences associated with the target cancer tissue; data on MHC class II alpha and beta peptide sequences associated with the target cancer tissue; binding data between peptides associated with the target cancer tissue and the HLA class II alpha-beta complex or the alpha or beta allele alone; data on the transcriptome associated with the target cancer tissue; and data on the genome associated with the target cancer tissue.

The immunogenicity prediction model may be a model trained with T cell activity data from peptide sequences as input and the immunogenicity of the peptide sequences as output.

The binding prediction model may be a model trained with binding data from peptide sequences and allele sequences of the HLA class II alpha-beta complex, or of the alpha or beta chain alone, as input and the binding between the peptide sequence and that allele sequence as output.

The neoantigen prediction model may be a model trained with, as input, T cell activity data from peptide sequences and allele sequences of the HLA class II alpha-beta complex or the alpha or beta chain alone, together with binding data from those allele sequences and the peptide sequences, and, as output, a neoantigen prediction value for the peptide sequence and the HLA class II allele sequence.

A computer program according to an embodiment of the present invention may be stored in a medium in order to execute, using a computer, any one of the methods according to embodiments of the present invention.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, are further provided.

Other aspects, features, and advantages will become apparent from the following drawings, claims, and detailed description of the invention.

According to an embodiment of the present invention configured as described above, not only the binding affinity between a peptide sequence contained in cancer tissue and the HLA class II allele alpha-beta complex, or the alpha or beta sequence alone, but also the immunogenicity of the peptide sequence is measured, and neoantigens in the cancer tissue can be determined based on the measured immunogenicity.

FIG. 1 is a block diagram of a neoantigen determination apparatus 100 according to embodiments of the present invention.
FIG. 2 is an exemplary diagram of type information for the major HLA class II alleles found in the cells of Koreans.
FIG. 3 is an exemplary diagram of a BLOSUM matrix showing the similarity of characteristics between amino acids.
FIG. 4 is a diagram showing the structure of HLA class II alleles.
FIG. 5 is a training data set for the HLA class II alpha-beta chain complex.
FIG. 6 is a training data set for HLA class II alpha or beta single allele sequences.
FIGS. 7, 8, and 9 are diagrams of example implementations of the immunogenicity prediction unit 121' and the binding prediction unit 122' according to embodiments of the present invention.
FIGS. 10 to 12 are diagrams of example implementations of a neoantigen determination system according to embodiments of the present invention.
FIG. 13 is a block diagram of a training server 10 that trains the immunogenicity prediction model, the binding prediction model, an immune tolerance prediction model, and the like.

Hereinafter, the configuration and operation of the present invention are described in detail with reference to the embodiments shown in the accompanying drawings.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. The effects and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various forms.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

In the following embodiments, terms such as first and second are used not in a limiting sense but to distinguish one component from another.

In the following embodiments, singular expressions include plural expressions unless the context clearly indicates otherwise.

In the following embodiments, terms such as include or have mean that the features or components described in the specification are present, and do not preclude the addition of one or more other features or components.

In the drawings, the sizes of components may be exaggerated or reduced for convenience of description. For example, the size and thickness of each component shown in the drawings are chosen arbitrarily for convenience of description, so the present invention is not necessarily limited to what is illustrated.

Where an embodiment can be implemented differently, a particular process sequence may be performed in an order different from the one described. For example, two processes described in succession may be performed substantially simultaneously, or may be performed in the reverse of the described order.

Here, target cancer tissue means tissue that is the subject of an experiment. For example, the target cancer tissue may be cancer tissue in which antigens capable of eliciting an immune response are to be detected. Preferably, the target cancer tissue may be a tumor cell or an aggregate of cancer cells.

Here, a mutation means any event in which the arrangement of the bases A (adenine), T (thymine), G (guanine), and C (cytosine) in a gene carrying an organism's genetic information is altered so that it differs from the original genetic information of the species. Mutations cause small-scale or large-scale structural variation. Small-scale mutations include point mutations, in which a single base is substituted, and mutations in which bases are inserted or deleted. Large-scale mutations that affect structure include gene duplication, gene deletion, chromosomal inversion, interstitial deletion, chromosomal translocation, and loss of heterozygosity.

Depending on the type of cell in which they arise, mutations are broadly divided into germline mutations and somatic mutations. A somatic mutation is a genetic mutation arising in somatic cells (also called a somatic-cell mutation or somatic variation) and may result from gene mutations or chromosomal abnormalities.

Such mutations can change the function of the protein produced by the affected gene: a particular function may be lost, or a different function may be activated. Because such changes in protein function can cause or accelerate cancer, these mutations may be closely related, directly or indirectly, to the onset and progression of cancer.

As described above, the base sequence of DNA, which carries an organism's genetic information, consists of A, T, G, and C. Three consecutive bases form a code (a codon) that specifies one particular amino acid, and a series of such codes can be translated into a protein. The amino acids are alanine (Ala), cysteine (Cys), aspartic acid (Asp), glutamic acid (Glu), phenylalanine (Phe), glycine (Gly), histidine (His), isoleucine (Ile), lysine (Lys), leucine (Leu), methionine (Met), asparagine (Asn), pyrrolysine (Pyl), proline (Pro), glutamine (Gln), arginine (Arg), serine (Ser), threonine (Thr), selenocysteine (Sec), valine (Val), tryptophan (Trp), and tyrosine (Tyr).
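As an illustration of how a point mutation in a coding sequence yields a changed peptide, the following minimal sketch translates DNA codon by codon using the standard genetic code. The function names and the 15-base example sequence are hypothetical. Selenocysteine and pyrrolysine are not in the standard table (they arise from context-dependent recoding of stop codons), so they are not produced here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_point_mutation(dna: str, position: int, new_base: str) -> str:
    """Return a copy of dna with the base at the 0-based position replaced."""
    return dna[:position] + new_base + dna[position + 1:]


# Hypothetical five-codon coding sequence and a single-base substitution.
normal_dna = "ATGGGTGCTTGGAAA"
mutant_dna = apply_point_mutation(normal_dna, 4, "A")  # codon GGT becomes GAT

normal_peptide = translate(normal_dna)  # Met-Gly-Ala-Trp-Lys
mutant_peptide = translate(mutant_dna)  # Met-Asp-Ala-Trp-Lys
```

The single substituted base changes glycine to aspartic acid, which is the kind of mutation-derived peptide the method takes as input.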

A peptide may mean a peptide or polypeptide formed from a sequence of amino acids. Living organisms have an immune system that removes foreign substances not derived from the genetic information of their own species, and some foreign peptides are immunogenic peptides capable of eliciting an immune response. Mutations arising during cancer development, which differ from the original genetic information, likewise produce immunogenic peptides, and these peptides can bind to HLA class II proteins after a series of processes within the immune system. Furthermore, an immunogenic peptide may have a mutated amino acid sequence, and its length may be 25 amino acids or fewer, although it is not limited to this and may be of various lengths.

A neoantigen means a peptide that elicits an immune response; that is, a neoantigen may be an immunogenic peptide. Neoantigens may be induced by tumor cell-specific mutations and may be presented as epitopes of tumor cells. For brevity, immunogenic peptides are referred to below as neoantigens.

Here, T cell activity data means data measuring the immune response that occurs when stimulation is caused by a specific peptide sequence binding to a specific HLA class II allele. It is obtained as data such as intracellular cytokine expression values and expression values of immune cell-specific activation markers detected by immunogenicity assays, including multimer/tetramer and ELISPOT assays, and the results may be classified as "Positive", "Positive-High", "Positive-Low", "Positive-Intermediate", or "Negative".
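The text lists five outcome categories but does not say how they become training labels. One plausible preprocessing step, shown here purely as an assumption, is to collapse every positive category into 1 and "Negative" into 0. The function name and the example records are hypothetical.

```python
# Outcome categories quoted in the text for T cell activity assays.
POSITIVE_LABELS = {"Positive", "Positive-High", "Positive-Intermediate", "Positive-Low"}
NEGATIVE_LABELS = {"Negative"}


def to_binary_label(outcome: str) -> int:
    """Map an assay outcome category to 1 (immunogenic) or 0 (not immunogenic)."""
    label = outcome.strip()
    if label in POSITIVE_LABELS:
        return 1
    if label in NEGATIVE_LABELS:
        return 0
    raise ValueError(f"unknown T cell activity outcome: {outcome!r}")


# Made-up (peptide, outcome) records used only to exercise the mapping.
records = [
    ("PEPTIDEA", "Positive-High"),
    ("PEPTIDEC", "Negative"),
    ("PEPTIDED", "Positive-Low"),
]
labels = [to_binary_label(outcome) for _, outcome in records]
```

Keeping the graded categories as an ordinal target would be an equally valid choice; binarization simply gives the most common classification setup.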

The neoantigen determination apparatus according to embodiments of the present invention analyzes the peptide sequences of the target cancer tissue and the patient's HLA class II alpha and beta allele sequences, and can determine specific peptides of the target cancer tissue as neoantigens to be used in treating that tissue. FIG. 4 is a schematic diagram showing the characteristics of HLA classes I and II; unlike class I, HLA class II consists of a complex of alpha and beta chains. Among the peptides contained in the target cancer tissue of a patient with specific HLA class II alleles, neoantigens suitable as antigens can be determined. Antibodies that act on the determined neoantigens can then be searched for and used to treat the patient's target cancer tissue. The types and distribution of HLA class II alleles that are particularly common in Koreans are shown in FIG. 2.

FIG. 1 is a block diagram of the neoantigen determination apparatus 100 according to embodiments of the present invention.

The neoantigen determination apparatus 100 is an apparatus for determining neoantigens for treating a disease present in cancer tissue, based on genomic data from the cancer tissue.

The genomic data input unit 110 may receive peptide sequences extracted from cancer tissue and HLA class II alpha and beta allele sequences. The peptide sequences may correspond to one or more peptides contained in the cancer tissue and may be represented as a two-dimensional matrix containing the sequences of those peptides. The HLA class II alpha and beta allele sequences may be represented as embedding vectors of a given size through a word embedding technique in which units of 1 to k amino acids are each treated as one word. They are not limited to this, however, and may be represented in various other forms, for example by vectorizing the similarity between amino acids with a BLOSUM50 or BLOSUM62 matrix as shown in FIG. 3. HLA allele sequences are divided into class I and class II, and, as shown in FIG. 4, an HLA class II allele sequence may have a chain structure of alpha and beta chains.
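To make the two-dimensional matrix representation concrete, here is a minimal sketch that encodes one peptide as a padded position-by-residue matrix. It uses one-hot rows as a simplified stand-in; a BLOSUM50 or BLOSUM62 encoding would replace each one-hot row with the corresponding substitution-matrix row. The function name, the padding length of 25 (borrowed from the peptide length mentioned earlier), and the restriction to the 20 standard residues are all assumptions.

```python
from typing import List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(peptide: str, max_len: int = 25) -> List[List[float]]:
    """Encode a peptide as a max_len x 20 one-hot matrix, zero-padded at the end."""
    if len(peptide) > max_len:
        raise ValueError(f"peptide longer than {max_len} residues")
    matrix = [[0.0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(peptide.upper()):
        if aa not in AA_INDEX:
            raise ValueError(f"unsupported residue {aa!r} at position {pos}")
        matrix[pos][AA_INDEX[aa]] = 1.0
    return matrix


# Hypothetical three-residue peptide padded to five positions.
encoded = encode_peptide("ACD", max_len=5)
```

Fixed-size padding keeps every peptide the same shape, which is what most model inputs require.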

Based on the peptide sequences and the HLA class II allele sequences, the genomic data input unit 110 may separately calculate T cell activity data for the peptides, or binding data between the peptides and the alpha chain data of the HLA class II allele and binding data between the peptides and the beta chain data of the HLA class II allele.

The genomic data input unit 110 may calculate T cell activity data for the peptides of the cancer tissue using a table or database in which measured T cell activity data for peptides are recorded.

Here, regardless of whether the full sequence or a pseudo sequence is used, the HLA class II allele may be input by splitting the HLA class II allele sequence into units ranging from 1-mers to k-mers and expressing it as a set of virtual words. As shown in FIG. 5, the HLA class II allele may be generated as a training data set that includes alpha chain data and beta chain data. The training data set may be generated with the alpha chain data and the beta chain data separated by a delimiter. The training data set may be input to the binding prediction units 1221 and 1222.
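The k-mer splitting and the delimiter-separated alpha/beta record described above can be sketched as follows. This is one reading of the text, assuming overlapping words of every length from 1 up to a chosen maximum; the function names, the `|` delimiter, and the four-residue fragments are hypothetical.

```python
from typing import List


def kmer_words(sequence: str, max_k: int) -> List[str]:
    """Split a sequence into overlapping words of every length from 1 to max_k."""
    words: List[str] = []
    for k in range(1, max_k + 1):
        words.extend(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return words


def complex_record(alpha: str, beta: str, max_k: int, delimiter: str = "|") -> List[str]:
    """Build one training record: alpha-chain words, a delimiter, beta-chain words."""
    return kmer_words(alpha, max_k) + [delimiter] + kmer_words(beta, max_k)


# Hypothetical short fragments standing in for alpha and beta chain sequences.
alpha_fragment = "IKEE"
beta_fragment = "GDTR"
record = complex_record(alpha_fragment, beta_fragment, max_k=2)
```

The resulting word list can then be fed to a word embedding model, in the same way a tokenized sentence would be.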

As shown in FIG. 4, the HLA class II allele may also be generated as a training data set that contains alpha chain data or beta chain data alone. In this case the training data set may be generated independently for the alpha chain data or the beta chain data. The training data set may be input to the binding prediction unit 122.

As shown in FIGS. 5 and 6, the peptides and HLA class II alleles may be used to generate a training data set for the immunogenicity prediction unit 121.

The genomic data input unit 110 may calculate binding data between the peptides of the target cancer tissue and HLA class II alleles using a table or database that records measured binding data, related to binding affinity, for all binding relationships between peptides and HLA class II alleles. Likewise, using a table or database that records measured binding data for all binding relationships between peptides and the alpha chain data and beta chain data of HLA class II alleles, it may calculate binding data between the peptides of the target cancer tissue and the alpha chain data of the HLA class II allele, and binding data between those peptides and the beta chain data of the HLA class II allele.
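The table lookup described above can be pictured with a small sketch. The table layout, the key names `ALPHA_1` and `BETA_1`, the peptide strings, and all numeric values are made-up placeholders, not measured data; the text does not specify the storage format.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical table of recorded binding values keyed by (peptide, chain identifier).
BindingTable = Dict[Tuple[str, str], float]

MEASURED_BINDING: BindingTable = {
    ("PEPTIDEA", "ALPHA_1"): 0.30,
    ("PEPTIDEA", "BETA_1"): 0.75,
    ("PEPTIDEC", "BETA_1"): 0.10,
}


def binding_data(
    peptides: Iterable[str], chains: Iterable[str], table: BindingTable
) -> BindingTable:
    """Collect recorded binding values for every peptide/chain pair present in the table."""
    chain_list = list(chains)
    found: BindingTable = {}
    for peptide in peptides:
        for chain in chain_list:
            key = (peptide, chain)
            if key in table:  # pairs without a recorded measurement are skipped
                found[key] = table[key]
    return found


# Alpha-chain and beta-chain binding data are looked up separately, as in the text.
alpha_binding = binding_data(["PEPTIDEA", "PEPTIDEC"], ["ALPHA_1"], MEASURED_BINDING)
beta_binding = binding_data(["PEPTIDEA", "PEPTIDEC"], ["BETA_1"], MEASURED_BINDING)
```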

The immunity prediction unit 121 may take peptides and HLA class II allele sequences as input, in the form of T cell activity data, and output predicted values corresponding to the immunity of the peptides. Specifically, using a model trained on T cell activity data and the immunity of peptides, the immunity prediction unit 121 may output first predicted values corresponding to the immunity of the peptides. There may be one peptide or several. The T cell activity data may include data on the peptide sequences and/or data on the HLA sequences.

The binding prediction unit 1221 may take as input binding data for the peptides and the HLA class II alpha-beta allele complex and output second predicted values corresponding to the binding of the peptides.

The binding prediction unit 1222 may take as input binding data on the binding relationship between the peptides and a single HLA class II alpha or beta allele and output third predicted values corresponding to the binding of the peptides.

The complex neoantigen prediction unit 131 may output a fourth predicted value, which is the neoantigen prediction for the target cancer tissue, by using a complex neoantigen prediction model trained on the first and second predicted values.

The single neoantigen prediction unit 132 may output a fifth predicted value, which is the neoantigen prediction for the target cancer tissue, by using a single neoantigen prediction model trained on the first and third predicted values.

The neoantigen determination unit 140 may predict, for each peptide, whether the peptide is a neoantigen with immunity and binding properties usable for treatment, by applying a machine learning ensemble model to the first to fifth predicted values determined for that peptide.

In this way, the neoantigen determination apparatus 100 may output whether a peptide is a neoantigen usable for treatment, taking into account not only the binding characteristics but also the immunity characteristics of the tumor cells or cancer cells in the target cancer tissue. The neoantigen determination apparatus 100 may also take into account T cell activity data for the peptides of the target cancer tissue when outputting whether a peptide is a neoantigen.

The neoantigen determination apparatus 100 may be implemented to include at least one of a communication unit, an input unit, and an output unit (not shown), but is not limited thereto. The neoantigen determination apparatus 100 may output data, such as whether a peptide is a neoantigen, through the output unit. It may receive an input requesting data output through the input unit. It may include a communication unit and communicate with external devices. At least one of the genomic data input unit 110, the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 of the neoantigen determination apparatus 100 may be implemented in software or hardware. At least one of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 may be implemented as a single component.

The target cancer tissue may be cells engineered to express a single HLA class I or class II allele. The target cancer tissue may be human cells obtained or derived from a plurality of patients. It may include fresh or frozen tumor cells obtained from a plurality of patients, or fresh or frozen tissue cells obtained from a plurality of patients. The target cancer tissue may include peptide(s) identified using a T cell assay.

The neoantigen determination apparatus 100 may train the algorithms of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 based on a plurality of target cancer tissues. To train the algorithm of at least one of these units, the apparatus may use data such as proteome sequence data of the target cancer tissues, data on HLA class II alleles, binding data between peptides and HLA class II alleles, transcriptome data of the target cancer tissue, and genome data of the target cancer tissue.

At least one of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 may build its algorithm (model) not separately for each peptide length but by treating each peptide as a single word regardless of its length. At least one of these units may use a word embedding technique to represent each peptide as a single word.
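To illustrate how one model can accept peptides of any length, the sketch below encodes residues as integer indices and pads a batch to a common width, which is the usual input form for an embedding layer. The vocabulary layout, the padding index, and the function names are assumptions for illustration; the source does not specify them.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD = 0  # index reserved for padding (assumed convention)
VOCAB: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(peptide: str) -> List[int]:
    """Map each residue to its vocabulary index; raises KeyError on unknown residues."""
    return [VOCAB[aa] for aa in peptide]


def pad_batch(peptides: List[str]) -> List[List[int]]:
    """Encode peptides of different lengths and right-pad them to the longest one."""
    encoded = [encode(p) for p in peptides]
    width = max(len(e) for e in encoded)
    return [e + [PAD] * (width - len(e)) for e in encoded]
```

Because padding happens per batch, the same encoder serves short and long peptides without a separate model per length.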

At least one of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 may vectorize sequences based on an amino acid similarity matrix such as BLOSUM.

The training data for the algorithm of at least one of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 may likewise be input regardless of peptide length. The neoantigen determination unit 140 may use an algorithm trained with a machine learning ensemble model.

The neoantigen determination apparatus 100 may build deep learning and machine learning ensemble models that classify each item as positive (Y) or negative (N) based on the data. The neoantigen determination apparatus 100 may fix the weights of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 and use an additional neural network.
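The idea of fixing the weights of existing predictors and training an additional network on top can be sketched as follows. This is a toy illustration under stated assumptions: the two frozen predictors are hypothetical stand-ins with made-up coefficients, and the "additional network" is reduced to a single logistic layer trained by plain gradient descent.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Frozen base predictors: hypothetical stand-ins for already-trained units.
# Their coefficients are never updated during head training.
FROZEN: List[Callable[[Sequence[float]], float]] = [
    lambda x: sigmoid(2.0 * x[0] - 1.0),
    lambda x: sigmoid(1.5 * x[1] - 0.5),
]


def train_head(
    samples: List[Sequence[float]], labels: List[int], epochs: int = 500, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Fit a logistic head on the frozen predictors' outputs (only w and b change)."""
    w, b = [0.0] * len(FROZEN), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = [f(x) for f in FROZEN]
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
            b -= lr * err
    return w, b


def predict(w: List[float], b: float, x: Sequence[float]) -> float:
    """Score one sample with the frozen predictors plus the trained head."""
    feats = [f(x) for f in FROZEN]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
```

Keeping the base predictors frozen means only the small head has to be fitted, which is the usual motivation for this arrangement when labelled data are scarce.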

In this way, at least one of the immunity prediction unit 121, the binding prediction unit 122, the neoantigen prediction unit 130, and the neoantigen determination unit 140 may be implemented using the immunity data between HLA class II alleles and peptides contained in the T cell activity data.

The immunity prediction unit 121 may apply a word embedding technique to each amino acid of the peptides, and then apply a CNN to the resulting peptide vectors to extract feature values. These feature values may be obtained through training across various layers, such as CNN layers. The immunity prediction unit 121 may then apply a gated recurrent unit (GRU) to the extracted feature values and generate its algorithm by training on whether each peptide is positive or negative for immunity.
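The embedding, CNN, and GRU stages described above can be traced in a small forward pass written in plain Python. This is a sketch only: the layer sizes, the ReLU activation, the random untrained weights, and all names are assumptions, so the returned score has no biological meaning.

```python
import math
import random
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMB, HID, KERNEL = 4, 3, 3  # hypothetical embedding size, channel count, kernel width
rng = random.Random(0)      # fixed seed so the untrained weights are reproducible


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


EMBED = {aa: [rng.uniform(-0.5, 0.5) for _ in range(EMB)] for aa in AMINO_ACIDS}
CONV = rand_matrix(HID, KERNEL * EMB)                       # one filter per channel
WZ, WR, WH = (rand_matrix(HID, 2 * HID) for _ in range(3))  # GRU weights over [x, h]
OUT = [rng.uniform(-0.5, 0.5) for _ in range(HID)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def conv1d(vectors: List[List[float]]) -> List[List[float]]:
    """Valid 1D convolution with ReLU over windows of KERNEL residues."""
    out = []
    for i in range(len(vectors) - KERNEL + 1):
        window = [x for vec in vectors[i:i + KERNEL] for x in vec]
        out.append([max(0.0, s) for s in matvec(CONV, window)])
    return out


def gru(features: List[List[float]]) -> List[float]:
    """Run a single-layer GRU over the feature sequence and return the last state."""
    h = [0.0] * HID
    for x in features:
        z = [sigmoid(v) for v in matvec(WZ, x + h)]  # update gate
        r = [sigmoid(v) for v in matvec(WR, x + h)]  # reset gate
        cand = [math.tanh(v) for v in matvec(WH, x + [ri * hi for ri, hi in zip(r, h)])]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
    return h


def immunity_score(peptide: str) -> float:
    """Embed the peptide, extract features with the CNN, summarize with the GRU."""
    if len(peptide) < KERNEL:
        raise ValueError("peptide is shorter than the convolution kernel")
    h = gru(conv1d([EMBED[aa] for aa in peptide]))
    return sigmoid(sum(o * hi for o, hi in zip(OUT, h)))
```

Because the GRU consumes however many convolution windows the peptide produces, the same weights handle peptides of different lengths.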

The binding prediction unit 122 may generate vectors by applying a word embedding technique to the alpha chain data and beta chain data of the HLA class II allele as well as to the peptide, and may extract feature values by applying a CNN to the HLA class II allele vector and the peptide vector. The binding prediction unit 122 may apply the feature values to two neural networks to produce an HLA class II allele encoder and a peptide encoder, and may generate its algorithm by using these two encoders to train on whether binding is positive or negative.

Alternatively, the binding prediction unit 122 may generate the vectors for the alpha chain data and beta chain data of the HLA class II allele and for the peptide from an amino acid similarity matrix such as BLOSUM, and may extract feature values by applying a CNN to the HLA class II allele vector and the peptide vector. As above, the binding prediction unit 122 may apply the feature values to two neural networks to produce an HLA class II allele encoder and a peptide encoder, and may generate its algorithm by using these two encoders to train on whether binding is positive or negative.

The training data used to generate the algorithm of the neoantigen prediction unit 130 may be divided into positive and negative examples for immunity. The neoantigen prediction unit 130 may output a predicted value for a neoantigen that binds to the HLA allele and is also immunogenic.

The immunity prediction unit 121 of the neoantigen determination apparatus 100 may take T cell activity data as input and output a first predicted value for immunity.

The binding prediction unit 1221 may take T cell activity data and binding data as input and output a second predicted value for binding between the HLA class II alpha-beta complex and the peptide.

The binding prediction unit 1222 may take T cell activity data and binding data as input and output a third predicted value for binding between a single HLA class II alpha or beta allele and the peptide.

The neoantigen prediction unit 131 may take T cell activity data and binding data as input and output a fourth predicted value for immunogenicity between the HLA class II alpha-beta complex and the peptide.

The neoantigen prediction unit 132 may take T cell activity data and binding data as input and output a fifth predicted value for immunogenicity between a single HLA class II alpha or beta allele and the peptide.

The prediction units 120 and 130 of the neoantigen determination apparatus 100 may further include prediction units that predict various factors other than those handled by the immunity prediction unit, the binding prediction unit, and the neoantigen prediction unit.

The neoantigen determination unit 140 of the neoantigen determination apparatus 100 may take the first to fifth predicted values and the T cell activity data as input and output either Y or N to indicate whether the peptide is a neoantigen usable for treatment.

FIG. 7 is a diagram illustrating the prediction model of the neoantigen determination apparatus 100 for the HLA class II alpha-beta complex.

According to an embodiment of the present invention, the HLA class II allele sequence and peptide sequence extracted from the target cancer tissue may be used as input data, and a neoantigen predicted value may be returned as output data.

For the HLA class II alpha-beta complex, as illustrated in FIG. 7, the neoantigen determination apparatus 100 may output a neoantigen predicted value by using the immunity prediction model (121’-a), the binding prediction model (122’-a), and the neoantigen prediction model (123’-a). In this case, the apparatus derives the first predicted value output by the immunity prediction model (121’-a), the second predicted value output by the binding prediction model (122’-a), and the fourth predicted value output by the neoantigen prediction model (123’-a).

FIG. 8 is a diagram illustrating the prediction model of the neoantigen determination apparatus 100 for a single HLA class II alpha or beta chain.

For a single HLA class II alpha or beta chain, the neoantigen determination apparatus 100 may output a neoantigen predicted value by using the immunity prediction model (121’-b), the binding prediction model (122’-b), and the neoantigen prediction model (123’-b). In this case, the apparatus derives the first predicted value output by the immunity prediction model (121’-b), the third predicted value output by the binding prediction model (122’-b), and the fifth predicted value output by the neoantigen prediction model (123’-b).

FIG. 9 is a diagram illustrating the prediction model of the neoantigen determination apparatus 100 that determines whether a peptide is a neoantigen based on each patient's HLA class II alleles and the peptide information in the cancer tissue.

According to an embodiment of the present invention, the first to fifth predicted values may be used as input data, and whether the peptide is a neoantigen may be returned as output data.

Here, a machine learning ensemble model is trained on the first to fifth predicted values. The machine learning methods used include, but are not limited to, Random Forest, Ridge, Kernel Ridge, and Gradient Boosting Regression. The first to fifth predicted values are used as input data, and whether the peptide is a neoantigen is returned as output data.
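The step from five predicted values to a Y/N decision can be sketched as a simple averaging ensemble. This is an illustrative assumption, not the method of the source: the base learners below are hypothetical stand-ins for fitted models such as Random Forest or Ridge, and the weights and the 0.5 threshold are made up.

```python
from statistics import mean
from typing import Callable, List, Sequence

Regressor = Callable[[Sequence[float]], float]


def linear_model(weights: Sequence[float], bias: float = 0.0) -> Regressor:
    """Build a fixed linear scorer; stands in for a fitted base learner."""
    def predict(values: Sequence[float]) -> float:
        return sum(w * v for w, v in zip(weights, values)) + bias
    return predict


# Hypothetical base learners over the first to fifth predicted values.
BASE_MODELS: List[Regressor] = [
    linear_model([0.2, 0.2, 0.2, 0.2, 0.2]),
    linear_model([0.4, 0.1, 0.1, 0.2, 0.2]),
    lambda values: min(values),  # conservative learner: the weakest signal decides
]


def is_neoantigen(values: Sequence[float], threshold: float = 0.5) -> str:
    """Average the base learners' scores and map the result to 'Y' or 'N'."""
    if len(values) != 5:
        raise ValueError("expected the first to fifth predicted values")
    score = mean(model(values) for model in BASE_MODELS)
    return "Y" if score >= threshold else "N"
```

In practice the base learners would be trained on labelled peptides; averaging is only one of several ways to combine them.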

As shown in FIG. 7, the immunity prediction unit 121 may convert the peptide sequence into word embeddings (Embedding) and may be implemented as a CNN layer, a GRU layer, an NN layer, and an NN layer (121’-a).

The binding prediction unit 122 may take the peptide sequence and HLA class II alpha-beta complex data as input, calculate binding data between the peptide sequence and the HLA class II alpha chain complex, and output the binding between the peptide sequence and the HLA class II allele sequence based on the binding data.

The binding prediction unit 122 may apply word embedding (Embedding) to the peptide sequence, the HLA class II alpha chain data, and the HLA class II beta chain data, and pass the embedded data through CNN and GRU layers (122’-a) to predict the binding among the peptide sequence, the HLA class II alpha chain data, and the HLA class II beta chain data.

Alternatively, the binding prediction unit 122 may vectorize the peptide sequence, the HLA class II alpha chain data, and the HLA class II beta chain data based on a BLOSUM matrix, which expresses similarity between amino acids, and then pass the vectors through CNN and GRU layers (122’-a) to predict the binding among the peptide sequence, the HLA class II alpha chain data, and the HLA class II beta chain data.

As shown in FIG. 8, the HLA allele may be class II single chain data, in which case the models may be implemented as 121’-b and 122’-b.

The binding prediction unit 122 of the neoantigen determination apparatus 100 may predict binding with a Pair Model for the alpha and beta chain data of the HLA class II allele, or with a Single Model for a single HLA class II allele.
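The choice between the Pair Model and the Single Model can be pictured as a dispatch on which chains are available. The data class, field names, and return labels below are hypothetical and only illustrate the branching; they are not identifiers from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HlaClassII:
    """Chain sequences available for one HLA class II allele (either may be absent)."""
    alpha: Optional[str] = None
    beta: Optional[str] = None


def select_model(hla: HlaClassII) -> str:
    """Return 'pair' when both chains are present, 'single' when only one is."""
    if hla.alpha and hla.beta:
        return "pair"
    if hla.alpha or hla.beta:
        return "single"
    raise ValueError("at least one chain sequence is required")
```

Under the numbering used in this description, the pair path would yield the second and fourth predicted values and the single path the third and fifth.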

The neoantigen determination apparatus 100 may take as input the first predicted value for immunity, the second and third predicted values for binding, and the fourth and fifth predicted values, which are neoantigen predicted values, and output whether a peptide of the target cancer tissue is a neoantigen.

FIGS. 10 to 12 are diagrams of example implementations of a neoantigen determination system according to embodiments of the present invention.

As shown in FIG. 10, the neoantigen determination apparatus 100 may receive genomic data on cancer tissue from an external electronic device 200. The neoantigen determination apparatus 100 may transmit to the electronic device 200 the resulting information on whether a peptide of the cancer tissue is a neoantigen.

The electronic device 200 may be a computing device that stores genomic data on cancer tissue and includes one or more processors. The electronic device 200 may be a device that outputs genomic data of cancer tissue. The electronic device 200 may be electrically connected to the neoantigen determination apparatus 100, or connected to it through a wired or wireless network, to transmit and receive data.

The electronic device 200 may acquire and store genomic data on the cancer tissues of a plurality of samples over multiple rounds. The neoantigen determination apparatus 100 may sequentially output, for each set of genomic data received from the electronic device 200, results such as whether the peptides are neoantigens.

As shown in FIG. 11, the neoantigen determination apparatus 100 may receive data from a plurality of electronic devices (201, 202, …, 20n) and transmit output data to the plurality of electronic devices (201, 202, …, 20n).

The neoantigen determination apparatus 100 may receive genomic data from the plurality of electronic devices (201, 202, …, 20n). The plurality of electronic devices (201, 202, …, 20n) may be managed by one or more entities.

As shown in FIG. 12, the neoantigen determination apparatus 100 may cause output data to be output through the output units of one or more terminal devices (301, 302, …, 30n). The output data may be output through the output unit of the neoantigen determination apparatus 100 or through the output units of the one or more terminal devices (301, 302, …, 30n). Upon transmitting neoantigen-related data, the neoantigen determination apparatus 100 may request payment of a predetermined fee from the one or more terminal devices (301, 302, …, 30n). The one or more terminal devices (301, 302, …, 30n) may request neoantigen-related information on the peptides and HLA class II alleles contained in cancer tissue, and the output data may be output in response to the request.

FIG. 13 is a block diagram of a learning server 10 that trains the immunity prediction model, the binding prediction model, the neoantigen prediction model, and the like.

The learning server 10 may include a data input unit 11, a first learning unit 12, a second learning unit 13, a third learning unit 14, and a fourth learning unit 15.

The first learning unit 12 trains and generates the immunity prediction model, using as its training data set the T cell activity data of peptide sequences or HLA class II allele sequences together with the immunity of the peptide sequences. As shown at 121 in FIG. 1, the immunity prediction model trained by the first learning unit 12 processes the peptide sequence with a word embedding technique and is trained by feeding the processed peptide sequence into CNN, GRU, and NN layers.

The second learning unit 13 trains and generates the binding prediction model, taking peptide sequence binding data or HLA class II allele sequences as input and using the binding between the peptide sequence and the HLA class II allele sequence as its training data set. As shown at 122 in FIG. 1, the binding prediction model trained by the second learning unit 13 processes the peptide sequence and the HLA class II allele sequence separately with a word embedding technique, feeds the processed peptide sequence into CNN and GRU layers, and feeds the HLA class II allele sequence into CNN, CNN, and GRU layers. The binding prediction model may then train a further model (NN1) on the predicted binding value for the peptide sequence and the predicted binding value for the HLA class II allele sequence, so that it finally outputs a predicted value for the neoantigen of the target cancer tissue.

The third learning unit 14 trains and generates the neoantigen prediction model, taking the peptide sequence and the HLA class II allele sequence as input and using as its training data set whether the peptide is a possible neoantigen given that peptide sequence and HLA allele sequence. As shown at 130 in FIG. 1, the neoantigen prediction model trained by the third learning unit 14 processes the peptide sequence and the HLA class II allele sequence separately with a word embedding technique, feeds the processed peptide sequence into CNN and GRU layers, and feeds the HLA class II allele sequence into CNN, CNN, and GRU layers. The neoantigen prediction model may then train a further model (NN2) on the neoantigen predicted value derived from the given HLA class II allele sequence and peptide sequence, so that it finally outputs a predicted value for the neoantigen.

The learning server 10 may transmit the trained models generated by the first to third learning units (12, 13, 14) to the neoantigen determination apparatus 100. In this way, the algorithms of the immunity prediction unit 121, the binding prediction unit 122, and the neoantigen prediction unit 130 of the neoantigen determination apparatus 100 may be updated periodically.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

소프트웨어는 컴퓨터 프로그램(computer program), 코드(code), 명령(instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(collectively) 처리 장치를 명령할 수 있다. 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성요소(component), 물리적 장치, 가상 장치(virtual equipment), 컴퓨터 저장 매체 또는 장치, 또는 전송되는 신호 파(signal wave)에 영구적으로, 또는 일시적으로 구체화(embody)될 수 있다. 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.The software may comprise a computer program, code, instructions, or a combination of one or more thereof, which configures a processing device to operate as desired or is independently or collectively processed You can command the device. The software and/or data may be any kind of machine, component, physical device, virtual equipment, computer storage medium or device, to be interpreted by or to provide instructions or data to the processing device. , or may be permanently or temporarily embody in a transmitted signal wave. The software may be distributed over networked computer systems, and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method for predicting neoantigens using peptide sequences derived from a target cancer tissue and cell-free DNA and HLA class II allele sequences, the method comprising:
receiving, by a neoantigen determining device, a peptide sequence extracted from a target cancer tissue and an HLA class II allele sequence as inputs;
obtaining, by the neoantigen determining device, T cell activity data from the peptide sequence, inputting the T cell activity data into an immunity prediction model, and outputting a first predicted value that predicts the immunity of the peptide sequence;
inputting, by the neoantigen determining device, alpha-beta complex sequence data of the HLA class II allele sequence and the peptide sequence data into a binding prediction model, and outputting a second predicted value that predicts binding between the peptide sequence and the HLA class II allele sequence;
inputting, by the neoantigen determining device, alpha or beta single sequence data of the HLA class II allele sequence and the peptide sequence data into the binding prediction model, and outputting a third predicted value that predicts binding between the peptide sequence and the HLA class II allele sequence;
outputting, by the neoantigen determining device, a fourth predicted value that predicts whether the peptide sequence of the target cancer tissue is a neoantigen, using the T cell activity data, the first predicted value, and the second predicted value;
outputting, by the neoantigen determining device, a fifth predicted value that predicts whether the peptide sequence of the target cancer tissue is a neoantigen, using the T cell activity data, the first predicted value, and the third predicted value; and
generating, by the neoantigen determining device, neoantigen information including whether the peptide sequence of the target cancer tissue is a neoantigen, using the T cell activity data and the first to fifth predicted values.
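
The data flow of the steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `predict_neoantigen`, `combine`, and `NeoantigenInfo`, the callable model interfaces, the multiplicative combination rule, and the 0.5 threshold are all assumptions introduced for illustration, since the claim does not specify how the predicted values are combined. The T cell activity data is passed in directly here, whereas the claim obtains it from the peptide sequence.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical model interfaces; the claims do not define them.
ImmunityModel = Callable[[Sequence[float]], float]  # T cell activity data -> score
BindingModel = Callable[[str, str], float]          # (HLA sequence, peptide) -> score


@dataclass
class NeoantigenInfo:
    peptide: str
    first: float    # immunity of the peptide
    second: float   # binding to the alpha-beta complex
    third: float    # binding to a single alpha or beta chain
    fourth: float   # neoantigen score from activity, first, second
    fifth: float    # neoantigen score from activity, first, third
    is_neoantigen: bool


def combine(activity: Sequence[float], immunity: float, binding: float) -> float:
    """Placeholder rule (assumption): mean activity x immunity x binding."""
    if not activity:
        raise ValueError("T cell activity data must not be empty")
    return (sum(activity) / len(activity)) * immunity * binding


def predict_neoantigen(peptide: str, alpha_beta_complex: str, single_chain: str,
                       activity: Sequence[float], immunity_model: ImmunityModel,
                       binding_model: BindingModel,
                       threshold: float = 0.5) -> NeoantigenInfo:
    first = immunity_model(activity)
    second = binding_model(alpha_beta_complex, peptide)
    third = binding_model(single_chain, peptide)
    fourth = combine(activity, first, second)
    fifth = combine(activity, first, third)
    # Final decision rule is an assumption: accept if either score passes.
    is_neoantigen = max(fourth, fifth) >= threshold
    return NeoantigenInfo(peptide, first, second, third, fourth, fifth, is_neoantigen)
```

Keeping the two binding predictions separate until the last step mirrors the claim, which derives the fourth and fifth predicted values independently before the final neoantigen information is generated.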

The method of claim 1,
wherein, in the step of generating the neoantigen information on the target cancer tissue,
neoantigen information on a peptide of the target cancer tissue is generated using the T cell activity data and the first to fifth predicted values.

The method of claim 1,
wherein at least one of the immunity prediction model and the binding prediction model is trained by a machine learning algorithm based on a training data set that includes peptide sequences present in a plurality of target cancer tissues, alpha chain data of HLA class II allele sequences, and beta chain data of HLA class II allele sequences.

The method of claim 2,
wherein the target cancer tissue
includes cells engineered to express a single MHC class I or class II allele.

The method of claim 3,
wherein the target cancer tissue
includes human cells obtained from or derived from a plurality of patients.

The method of claim 3,
wherein the target cancer tissue
includes fresh or frozen tumor cells obtained from a plurality of patients.

The method of claim 3,
wherein the target cancer tissue
includes fresh or frozen tissue cells obtained from a plurality of patients.

The method of claim 3,
wherein the target cancer tissue
includes peptides identified using T-cell analysis.

The method of claim 3,
wherein the training data set
includes at least one of: data related to a proteomic sequence associated with the target cancer tissue; data related to an MHC peptide sequence associated with the target cancer tissue; binding data between a peptide associated with the target cancer tissue and class II alpha chain data of an HLA class II allele; binding data between a peptide associated with the target cancer tissue and class II beta chain data of an HLA class II allele; data related to a transcript associated with the target cancer tissue; and data related to a genome associated with the target cancer tissue.
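
As a reading aid, the "at least one of" structure of this training data set can be sketched as a Python record with six optional fields. The class name `TrainingDataSet`, the field names, and the two helper methods are hypothetical; the claim names only the six data categories and does not prescribe any storage format.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Optional


@dataclass
class TrainingDataSet:
    # One optional slot per data category named in the claim (names are assumptions).
    proteomic_sequence_data: Optional[Any] = None
    mhc_peptide_sequence_data: Optional[Any] = None
    peptide_alpha_chain_binding_data: Optional[Any] = None
    peptide_beta_chain_binding_data: Optional[Any] = None
    transcript_data: Optional[Any] = None
    genome_data: Optional[Any] = None

    def present_fields(self) -> List[str]:
        """Return the names of the categories that are actually populated."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]

    def satisfies_at_least_one(self) -> bool:
        """True when at least one of the six categories is populated."""
        return len(self.present_fields()) >= 1
```

For example, `TrainingDataSet(genome_data=[...])` satisfies the condition on its own, while an empty `TrainingDataSet()` does not.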

The method of claim 1,
wherein the immunity prediction model
is a model trained to take T cell activity data obtained from a peptide sequence as input and to output the immunity of the peptide sequence.

The method of claim 1,
wherein the binding prediction model
is a model trained to take as input first binding data between alpha chain data of the HLA class II allele sequence and the peptide sequence, second binding data between beta chain data of the HLA class II allele sequence and the peptide sequence, and the alpha-beta chain complex of the HLA class II allele sequence, and to output the binding properties of the peptide sequence and the HLA class II allele sequence.
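
One possible way to prepare the alpha chain, beta chain, and alpha-beta complex inputs for such a model is sketched below in Python. The claim does not specify any encoding, so the one-hot scheme, the simple concatenation of chain and peptide, and the names `one_hot` and `build_binding_inputs` are all assumptions made for illustration only.

```python
from typing import Dict, List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each residue as a 20-dimensional one-hot row.

    Residues outside the standard alphabet (for example 'X') become all-zero rows.
    """
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows


def build_binding_inputs(peptide: str, alpha: str, beta: str) -> Dict[str, List[List[int]]]:
    """Build three encoded inputs: alpha+peptide, beta+peptide, complex+peptide.

    The alpha-beta complex is represented here as the two chains joined
    end to end, which is a simplifying assumption.
    """
    encoded_peptide = one_hot(peptide)
    return {
        "alpha": one_hot(alpha) + encoded_peptide,
        "beta": one_hot(beta) + encoded_peptide,
        "alpha_beta_complex": one_hot(alpha + beta) + encoded_peptide,
    }
```

Each returned matrix has one row per residue, so its length equals the chain length plus the peptide length.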

A computer program stored in a computer-readable storage medium for executing the method of any one of claims 1 to 11 using a computer.