KR20210052855A

KR20210052855A - Electronic device for selecting biomarkers for predicting cancer prognosis based on patient-specific genetic characteristics and operating method thereof

Info

Publication number: KR20210052855A
Application number: KR1020190138354A
Authority: KR
Inventors: 안재균; 고수현
Original assignee: 인천대학교 산학협력단
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2021-05-11
Also published as: KR102309002B1

Abstract

Disclosed are an electronic device for selecting a biomarker to be used for predicting a cancer prognosis based on genetic characteristics for each patient, and an operating method thereof. The present invention selects the biomarker to be used for predicting the cancer prognosis based on the genetic characteristics for each patient among a plurality of genes, and proposes a technology for constructing a predictive model capable of predicting the cancer prognosis based on the selected biomarker, thereby providing high accuracy in predicting the cancer prognosis in cancer patients.

Description

An electronic device that selects biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics and its operation method {ELECTRONIC DEVICE FOR SELECTING BIOMARKERS FOR PREDICTING CANCER PROGNOSIS BASED ON PATIENT-SPECIFIC GENETIC CHARACTERISTICS AND OPERATING METHOD THEREOF}

The present invention relates to an electronic device that selects biomarkers for predicting cancer prognosis based on patient-specific genetic characteristics, and to an operating method thereof.

Recently, as the number of cancer patients has increased due to factors such as the westernization of diets, various methods for treating cancer have been explored.

In treating cancer, if the prognosis could be predicted in advance from the characteristics of a patient's genes, the effect of treatment could be maximized by applying a treatment method suited to that prognosis.

More recently, with advances in artificial intelligence, the introduction of predictive models that predict cancer prognosis from the characteristics of a patient's genes has also been considered.

In this regard, one may consider dividing cancer patients into a group with a good prognosis and a group with a poor prognosis, and then performing machine learning on the genetic characteristics of the patients in each group. This yields a predictive model that, given the genetic characteristics of a particular cancer patient as input, predicts in advance whether that patient's prognosis will be good or poor.

However, because humans have so many genes, there is a limit to constructing a predictive model that considers all genetic characteristics. In addition, some genes do not significantly affect cancer prognosis, so a predictive model built on all genetic characteristics may have reduced accuracy.

Therefore, research is needed on technology that improves the accuracy of cancer prognosis prediction by selecting, from a large number of genes, only the specific genes that affect prognosis as biomarkers, and constructing a predictive model based on the selected biomarkers.

The present invention selects, from among a plurality of genes, biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics, and presents a technique for constructing a predictive model that predicts cancer prognosis based on the selected biomarkers, so that high accuracy can be provided in predicting the prognosis of cancer patients.

An electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention includes: a gene network storage unit storing data on a gene network set in advance for each of a plurality of cancer patients, where the gene network is a network in which links are established between genes that influence each other among a plurality of genes of different types, and is a network unique to each cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence among the plurality of genes according to cancer expression; an embedding vector generation unit that generates, for each of the plurality of cancer patients, an embedding vector representing each of the plurality of genes, based on the data on that patient's gene network; a clustering result generation unit that generates a clustering result for each of the plurality of genes by performing, for each gene, K-means clustering based on the embedding vectors of that gene for each of the plurality of cancer patients; and a biomarker determination unit that measures the performance of the clustering result generated for each of the plurality of genes and then determines a predetermined number of genes, in descending order of clustering performance, as biomarkers for predicting cancer prognosis.

In addition, an operating method of an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention includes: maintaining a gene network storage unit storing data on a gene network set in advance for each of a plurality of cancer patients, where the gene network is a network in which links are established between genes that influence each other among a plurality of genes of different types, and is a network unique to each cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence among the plurality of genes according to cancer expression; generating, for each of the plurality of cancer patients, an embedding vector representing each of the plurality of genes, based on the data on that patient's gene network; generating a clustering result for each of the plurality of genes by performing, for each gene, K-means clustering based on the embedding vectors of that gene for each of the plurality of cancer patients; and measuring the performance of the clustering result generated for each of the plurality of genes and then determining a predetermined number of genes, in descending order of clustering performance, as biomarkers for predicting cancer prognosis.

The present invention selects, from among a plurality of genes, biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics, and presents a technique for constructing a predictive model that predicts cancer prognosis based on the selected biomarkers, and can thereby provide high accuracy in predicting the prognosis of cancer patients.

FIG. 1 is a diagram showing the structure of an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for explaining the operation of the electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an operating method of the electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. This description is not intended to limit the present invention to specific embodiments, and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar elements. Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains.

In this document, when a part is said to "include" a certain component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, in various embodiments of the present invention, each component, functional block, or means may be composed of one or more sub-components. The electrical, electronic, and mechanical functions performed by each component may be implemented with various known devices or mechanical elements such as electronic circuits, integrated circuits, and ASICs (Application Specific Integrated Circuits), and may be implemented separately or with two or more integrated into one.

Meanwhile, the blocks of the accompanying block diagrams and the steps of the flowcharts may be interpreted as computer program instructions that are loaded into a processor or memory of equipment capable of data processing, such as a general-purpose computer, special-purpose computer, portable notebook computer, or network computer, and that perform specified functions. Because these computer program instructions can be stored in a memory provided in a computer device or in a computer-readable memory, the functions described in the blocks of the block diagrams or the steps of the flowcharts may also be produced as an article of manufacture containing instruction means for performing them. In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative embodiments, the functions mentioned in the blocks or steps may be executed out of the stated order. For example, two blocks or steps shown in succession may be performed substantially simultaneously or in reverse order, and in some cases some blocks or steps may be omitted.

FIG. 1 is a diagram showing the structure of an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention.

Referring to FIG. 1, the electronic device 110 according to the present invention includes a gene network storage unit 111, an embedding vector generation unit 112, a clustering result generation unit 113, and a biomarker determination unit 114.

The gene network storage unit 111 stores data on a gene network set in advance for each of a plurality of cancer patients.

Here, a gene network is a network in which links are established between genes that influence each other among a plurality of genes of different types.

In this regard, as illustrated in FIG. 2, a gene network is information in which links are set between genes that influence each other. Such a gene network may be constructed from biological pathways, protein-protein interaction (PPI) data, Gene Ontology (GO) data, and the like.

Here, the gene network stored in the gene network storage unit 111 for each of the plurality of cancer patients is a network unique to that cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence among the plurality of genes according to cancer expression.

For example, for 'cancer patient 1', 'gene network 1', a unique gene network corresponding to 'cancer patient 1', may be set as a result of measuring in advance the degree of influence among the plurality of genes according to cancer expression. Likewise, for 'cancer patient 2', 'gene network 2', a unique gene network corresponding to 'cancer patient 2', may be set as a result of the same advance measurement.

Such patient-specific gene networks may be constructed by performing a statistical test for each patient, such as a one-sample t-test.

The embedding vector generation unit 112 generates, for each of the plurality of cancer patients, an embedding vector representing each of the plurality of genes, based on the data on that patient's gene network.

Here, according to an embodiment of the present invention, the embedding vector generation unit 112 may include a path information generation unit 115 and a vector determination unit 116.

To generate the embedding vectors representing each of the plurality of genes for a first cancer patient, who is any one of the plurality of cancer patients, the path information generation unit 115 randomly generates, from the gene network of the first cancer patient, a plurality of pieces of path information, each consisting of n genes connected by n consecutive links (where n is a natural number of 2 or more).

For each piece of path information, the vector determination unit 116 selects, from the n genes constituting that path information, the center gene located at the center of the path and the n-1 surrounding genes other than the center gene. It designates the center gene as output data and the surrounding genes as input data, and then trains a CBOW (Continuous Bag of Words) model based on the center gene and the surrounding genes of each piece of path information, thereby determining the embedding vector of each of the plurality of genes for the first cancer patient.

Here, according to an embodiment of the present invention, the vector determination unit 116 may generate a one-hot vector for each of the plurality of genes and, for each piece of path information, designate the one-hot vector of the center gene as output data and the one-hot vectors of the surrounding genes as input data, thereby training the weight matrix constituting the hidden layer of the CBOW model. Each row vector of the trained weight matrix may then be determined as the embedding vector of the corresponding gene.

The operation of the path information generation unit 115 and the vector determination unit 116 is described below by way of example with reference to FIG. 3.

First, assume that n is 3 and that the plurality of genes are 'G₁, G₂, G₃, G₄, G₅, G₆', and consider generating the embedding vector representing each of these six genes for the first cancer patient, who is any one of the plurality of cancer patients.

As shown at reference numeral 311, the path information generation unit 115 may randomly generate, from the gene network of the first cancer patient, a plurality of pieces of path information, each consisting of three genes connected by three consecutive links.

That is, links among the six genes may be set in various ways in the gene network of the first cancer patient, and the path information generation unit 115 may randomly generate, from that gene network, a plurality of pieces of path information, each consisting of three genes connected by three consecutive links.

In this regard, the plurality of pieces of path information may be generated as, for example, '(G₁, G₃, G₅)', '(G₂, G₃, G₆)', and '(G₃, G₆, G₄)'.

Here, the path information generation unit 115 may use a random walk algorithm to randomly generate, from the gene network, the plurality of pieces of path information, each consisting of three genes.
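The path generation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the adjacency-list representation, the function name `random_walk_paths`, and the links in `network_patient_1` are hypothetical and are not taken from the specification. Note that a plain random walk may revisit a gene, which the specification does not address.

```python
import random

def random_walk_paths(network, n, num_paths, seed=0):
    """Sample `num_paths` paths of n genes each by walking along links.

    `network` maps each gene to the list of genes it is linked to
    (a hypothetical representation chosen for this sketch).
    """
    rng = random.Random(seed)
    genes = sorted(network)
    paths = []
    while len(paths) < num_paths:
        path = [rng.choice(genes)]          # random starting gene
        while len(path) < n:
            neighbors = network[path[-1]]
            if not neighbors:               # dead end: discard this walk
                break
            path.append(rng.choice(neighbors))
        if len(path) == n:
            paths.append(tuple(path))
    return paths

# Illustrative network for one patient; the links are made up for the example.
network_patient_1 = {
    "G1": ["G3"],
    "G2": ["G3"],
    "G3": ["G1", "G2", "G5", "G6"],
    "G4": ["G6"],
    "G5": ["G3"],
    "G6": ["G3", "G4"],
}
paths = random_walk_paths(network_patient_1, n=3, num_paths=5)
```

With this illustrative network, paths such as (G₁, G₃, G₅) or (G₃, G₆, G₄) are among those the walk can produce.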

Once the plurality of pieces of path information has been generated, the vector determination unit 116 may, for each piece of path information, select from its three genes the center gene located at the center of the path and the two surrounding genes other than the center gene. It may then designate the center gene as output data and the surrounding genes as input data, and train the CBOW (Continuous Bag of Words) model based on the center gene and the surrounding genes of each piece of path information.

In natural language processing, the CBOW model is a model that predicts the center word from the words surrounding it. Given a set of sentences, CBOW designates the center word as the output and the surrounding words as the input, and produces a model that predicts the center word by training the weight matrix that constitutes its hidden layer.

Each row vector of the trained weight matrix can then be used as the embedding vector of a particular word, and such embedding vectors can be used, for example, to measure similarity between words.

The vector determination unit 116 of the present invention treats the three genes constituting each piece of path information as a sentence of three words and trains the CBOW model on these sentences, thereby determining the embedding vector of each of the six genes 'G₁, G₂, G₃, G₄, G₅, G₆'.

In this regard, suppose the path information generation unit 115 has generated the path information '(G₁, G₃, G₅)', '(G₂, G₃, G₆)', and '(G₃, G₆, G₄)'. As shown at reference numerals 312 and 313, the vector determination unit 116 may select G₃ as the center gene and G₁ and G₅ as the surrounding genes for '(G₁, G₃, G₅)'; G₃ as the center gene and G₂ and G₆ as the surrounding genes for '(G₂, G₃, G₆)'; and G₆ as the center gene and G₃ and G₄ as the surrounding genes for '(G₃, G₆, G₄)'.

Once the center gene and the surrounding genes have been selected, the vector determination unit 116 may, as shown at reference numerals 312 and 313, designate the center gene as the output and the surrounding genes as the input for each piece of path information, and then train the CBOW model on this basis as shown at reference numeral 314.
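The selection of center and surrounding genes for the three example paths can be illustrated with a short sketch. The helper name `center_and_context` is hypothetical, and taking index `len(path) // 2` as the center is an assumption that is only unambiguous for odd n, as in the n = 3 example.

```python
def center_and_context(path):
    """Split a path into its center gene and the remaining surrounding genes."""
    mid = len(path) // 2                     # assumed definition of "center"
    return path[mid], path[:mid] + path[mid + 1:]

# The three example paths given in the description.
paths = [("G1", "G3", "G5"), ("G2", "G3", "G6"), ("G3", "G6", "G4")]

# Each pair is (output = center gene, input = surrounding genes).
pairs = [center_and_context(p) for p in paths]
```

Each resulting pair holds the training target first and the inputs second, matching the output/input roles described for the CBOW model.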

Specifically, the vector determination unit 116 may generate a six-dimensional one-hot vector representing each of the genes 'G₁, G₂, G₃, G₄, G₅, G₆' and then, for each piece of path information, designate the one-hot vector of the center gene as output data and the one-hot vectors of the surrounding genes as input data, thereby training the weight matrix constituting the hidden layer of the CBOW model.

Since the number of genes is assumed to be six, a matrix composed of six row vectors may be used as the weight matrix.

Once the weight matrix has been determined in this way, the vector determination unit 116 may, as shown at reference numeral 315, determine each row vector of the weight matrix as the embedding vector of the corresponding gene among 'G₁, G₂, G₃, G₄, G₅, G₆', thereby determining the embedding vector of each of the six genes for the first cancer patient.
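A minimal CBOW training loop for this six-gene example might look like the sketch below. It is not the specification's implementation: the softmax output layer, the averaging of context rows, the embedding dimension, the learning rate, the epoch count, and the name `train_cbow` are all assumptions made for illustration. Multiplying a one-hot vector by the input weight matrix selects one row, so the sketch indexes rows directly instead of building the one-hot vectors.

```python
import math
import random

def train_cbow(pairs, genes, dim=2, epochs=200, lr=0.1, seed=0):
    """Train a tiny CBOW model; return {gene: its row of the input weight matrix}."""
    rng = random.Random(seed)
    idx = {g: i for i, g in enumerate(genes)}
    v = len(genes)
    # w_in has one row per gene (6 rows here); w_out maps hidden units to genes.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(v)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(v)] for _ in range(dim)]
    for _ in range(epochs):
        for center, context in pairs:
            ctx = [idx[g] for g in context]
            # Hidden layer: average of the rows selected by the context genes.
            h = [sum(w_in[c][j] for c in ctx) / len(ctx) for j in range(dim)]
            scores = [sum(h[j] * w_out[j][k] for j in range(dim)) for k in range(v)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            total = sum(exps)
            # Softmax prediction minus the one-hot target of the center gene.
            err = [e / total for e in exps]
            err[idx[center]] -= 1.0
            grad_h = [sum(w_out[j][k] * err[k] for k in range(v)) for j in range(dim)]
            for j in range(dim):
                for k in range(v):
                    w_out[j][k] -= lr * h[j] * err[k]
            for c in ctx:
                for j in range(dim):
                    w_in[c][j] -= lr * grad_h[j] / len(ctx)
    return {g: w_in[idx[g]] for g in genes}

genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
pairs = [("G3", ("G1", "G5")), ("G3", ("G2", "G6")), ("G6", ("G3", "G4"))]
embeddings = train_cbow(pairs, genes)   # one embedding vector per gene
```

Repeating this training once per patient, each time with paths drawn from that patient's own network, would give one set of gene embeddings per patient.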

In this manner, the path information generation unit 115 and the vector determination unit 116 may train the CBOW model for each of the plurality of cancer patients, thereby generating, for each of the plurality of cancer patients, an embedding vector representing each of the genes 'G₁, G₂, G₃, G₄, G₅, G₆'.

Once the embedding vector of each of the plurality of genes has been generated for each of the plurality of cancer patients, the clustering result generation unit 113 generates a clustering result for each of the plurality of genes by performing, for each gene, K-means clustering based on the embedding vectors of that gene for each of the plurality of cancer patients.

K-means clustering is a clustering technique that determines which group each item of multidimensional input data belongs to. As expressed by the distortion measure of Equation 1 below, it is an algorithm that finds the set of clusters minimizing the sum of squared distances between each centroid and the input data assigned to it.

[Equation 1]
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

Here, $x$ is the multidimensional input data, $k$ is the number of clusters, and $\mu_i$ is the centroid of cluster $S_i$, where $S$ denotes the set of clusters.

For example, when the plurality of genes are 'G₁, G₂, G₃, G₄, G₅, G₆' as in the example above, the clustering result generation unit 113 may perform K-means clustering for each of 'G₁, G₂, G₃, G₄, G₅, G₆' based on the embedding vectors of that gene in the plurality of cancer patients.

That is, the clustering result generation unit 113 may perform K-means clustering on the embedding vectors of 'G₁' from each of the plurality of cancer patients to generate a clustering result for 'G₁', and may do likewise for each of 'G₂', 'G₃', 'G₄', 'G₅', and 'G₆' to generate a clustering result for each of those genes.
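The per-gene clustering can be sketched as below, using a plain Lloyd's-algorithm K-means written for illustration. The data layout (`embeddings_by_patient`), the function names, the choice of k = 2, and the toy embedding values are assumptions and do not come from the specification.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_per_gene(embeddings_by_patient, genes, k=2):
    """For each gene, cluster the patients by that gene's embedding vector."""
    patients = sorted(embeddings_by_patient)
    return {
        g: dict(zip(patients, kmeans([embeddings_by_patient[p][g] for p in patients], k)))
        for g in genes
    }

# Toy embeddings (made-up values): G1 separates the patients, G2 does not.
embeddings_by_patient = {
    "P1": {"G1": [0.0, 0.0], "G2": [1.0, 1.0]},
    "P2": {"G1": [0.1, 0.1], "G2": [1.1, 0.9]},
    "P3": {"G1": [5.0, 5.0], "G2": [0.9, 1.0]},
    "P4": {"G1": [5.1, 5.1], "G2": [1.0, 1.1]},
}
clusters = cluster_per_gene(embeddings_by_patient, ["G1", "G2"], k=2)
```

Each gene ends up with its own grouping of the same patients, which is what the later performance measurement compares across genes.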

Once the clustering result for each of the plurality of genes has been generated, the biomarker determination unit 114 measures the performance of the clustering result generated for each of the plurality of genes and then determines a predetermined number of genes, in descending order of clustering performance, as biomarkers for predicting cancer prognosis.

In this case, according to an embodiment of the present invention, the biomarker determination unit 114 may calculate the normalized mutual information (NMI) of the clustering result generated for each of the plurality of genes and use the calculated normalized mutual information as the measured performance of the clustering result generated for each of the plurality of genes.

Here, normalized mutual information is a performance index that evaluates how appropriately the clustering was performed for a given clustering result.
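
For reference, normalized mutual information between two labelings of the same patients can be computed as in the sketch below. The description does not state which reference labeling the clustering result is compared against, nor which normalization is used; this sketch assumes a reference labeling is available (for example, known prognosis labels) and uses the arithmetic-mean normalization 2·I(A;B) / (H(A) + H(B)). The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(A) of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information I(A;B) between two labelings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c * n) / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 1.0  # both labelings put every item in one cluster
    return 2 * mutual_information(a, b) / (ha + hb)
```

With this normalization the score is 1 when the two labelings define the same partition, regardless of how the clusters are numbered, and 0 when they are statistically independent.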

Because clustering in the present invention groups the plurality of cancer patients on the basis of each patient's embedding vector for a specific gene, good clustering for a specific gene can be taken to mean that the characteristics of the patients are clearly distinguished with respect to that gene.

Accordingly, the biomarker determination unit 114 may determine, in descending order of clustering performance, a predetermined number of genes among the plurality of genes as biomarkers for predicting cancer prognosis.
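
The selection step amounts to ranking the genes by their clustering performance and keeping the top entries. A minimal sketch, assuming the per-gene scores are already available as a dictionary; the function name `select_biomarkers` is hypothetical, and tie handling is not specified in the description (here, ties keep their original order).

```python
def select_biomarkers(scores, count):
    """Return the `count` genes with the highest clustering performance.

    scores: {gene name: performance score, e.g. NMI}
    """
    # sorted() is stable, so genes with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:count]]
```

For example, with purely illustrative scores in which 'G1' and 'G3' rank highest, `select_biomarkers(scores, 2)` returns those two genes.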

According to an embodiment of the present invention, the electronic device 110 may further include a prediction model generation unit 117, which, once the biomarkers have been determined, builds a prediction model for predicting cancer prognosis using the biomarkers.

After the predetermined number of genes have been determined as the biomarkers, when a user inputs, as a training set, gene data for each of the biomarkers collected in advance from each of the plurality of cancer patients together with cancer prognosis result data for each of the plurality of cancer patients, and a model generation command for predicting cancer prognosis is issued, the prediction model generation unit 117 designates the gene data for each of the biomarkers as the input and the cancer prognosis result data as the output and then performs machine learning based on supervised learning, thereby generating a cancer prognosis prediction model.

For example, if the genes 'G₁' and 'G₃' among the genes 'G₁, G₂, G₃, G₄, G₅, G₆' have been determined as biomarkers, the prediction model generation unit 117 may generate the cancer prognosis prediction model by using, as the training set, the gene data for 'G₁' and 'G₃' collected from each of the plurality of cancer patients and the cancer prognosis result data of each of the plurality of cancer patients.

Here, the gene data may be the expression value of each gene, and the cancer prognosis result data is preset data indicating, for each patient, whether the cancer prognosis is good or poor. The user may use the value '1' when the cancer prognosis is good and the value '0' when the cancer prognosis is poor.

When gene data for 'G₁' and 'G₃' and cancer prognosis result data exist for each of the plurality of cancer patients in this way, the prediction model generation unit 117 may generate the cancer prognosis prediction model by applying each patient's gene data for 'G₁' and 'G₃' as the input to a deep neural network and training the deep neural network by comparing its output with that patient's cancer prognosis result data.
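
The supervised training loop can be illustrated with a much simpler stand-in model. The sketch below deliberately replaces the deep neural network with a single sigmoid unit (logistic regression) trained by stochastic gradient descent, purely to show the input/output arrangement: biomarker expression values as input, the 0/1 prognosis label as the target. The function names, learning rate, and epoch count are hypothetical, and any data passed in would be the user's own.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """features: one list of biomarker expression values per patient.
    labels: 1 (good prognosis) or 0 (poor prognosis) per patient.
    Returns (weights, bias) of a single sigmoid unit."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    """Return a score in (0, 1); values near 1 indicate a good prognosis."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

A deep neural network would replace the single unit with stacked layers, but the comparison of the model output against the recorded prognosis label during training is the same in kind.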

In this case, according to an embodiment of the present invention, the electronic device 110 may further include a prediction unit 118.

After the cancer prognosis prediction model has been generated, when the user applies, as an input, first gene data for each of the biomarkers collected from a prediction target cancer patient whose cancer prognosis is to be predicted, and a cancer prognosis prediction command for the prediction target cancer patient is issued, the prediction unit 118 may apply the first gene data for each of the biomarkers collected from the prediction target cancer patient as the input to the cancer prognosis prediction model, thereby calculating cancer prognosis result data for the prediction target cancer patient as output information.

In this regard, as in the foregoing example, when the biomarkers are 'G₁' and 'G₃', the prediction unit 118 may calculate cancer prognosis result data for the prediction target cancer patient as output information by applying the prediction target cancer patient's gene data for 'G₁' and 'G₃' as the input to the cancer prognosis prediction model.

If, in generating the cancer prognosis prediction model, '1' was used as the cancer prognosis result data for a good prognosis and '0' was used as the cancer prognosis result data for a poor prognosis, then when the prediction unit 118 calculates result data exceeding '0.5', the user can predict that the cancer prognosis of the prediction target cancer patient will be good, and when it calculates result data below '0.5', the user can predict that the cancer prognosis of the prediction target cancer patient will be poor.
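
The interpretation rule above can be written as a small helper. This is a sketch only: the function name `interpret` and the returned strings are hypothetical, and the description does not say how a score of exactly 0.5 should be treated, so the sketch reports it as undetermined rather than guessing.

```python
def interpret(score, threshold=0.5):
    """Map a model output to a prognosis reading, given 1 = good, 0 = poor."""
    if score > threshold:
        return "good"
    if score < threshold:
        return "poor"
    # Exactly at the threshold: not covered by the description.
    return "undetermined"
```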

FIG. 4 is a flowchart illustrating a method of operating an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention.

In step S410, a gene network storage unit is maintained, the gene network storage unit storing data on a gene network set in advance for each of a plurality of cancer patients (the gene network being a network in which links are established between genes that influence each other among a plurality of genes of different types, and meaning a gene network unique to each cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence between the plurality of genes according to cancer expression).

In step S420, an embedding vector representing each of the plurality of genes is generated for each of the plurality of cancer patients, based on the data on the gene network of each of the plurality of cancer patients.

In step S430, a clustering result is generated for each of the plurality of genes by performing, for each of the plurality of genes, K-means clustering based on the per-gene embedding vectors of each of the plurality of cancer patients.

In step S440, the performance of the clustering result generated for each of the plurality of genes is measured, and then a predetermined number of genes among the plurality of genes are determined, in descending order of clustering performance, as biomarkers for predicting cancer prognosis.

In this case, according to an embodiment of the present invention, in order to generate an embedding vector representing each of the plurality of genes for a first cancer patient, who is any one of the plurality of cancer patients, step S420 may include: randomly generating, from the gene network of the first cancer patient, a plurality of pieces of path information each consisting of n genes connected by n consecutive links (n being a natural number of 2 or more); and, for each of the plurality of pieces of path information, selecting, from among the n genes constituting that path information, a central gene located at the center of the path and the n-1 neighboring genes other than the central gene, designating the central gene as output data and the neighboring genes as input data, and then training a CBOW model based on the central gene and the neighboring genes in each of the plurality of pieces of path information, thereby determining the embedding vector of each of the plurality of genes for the first cancer patient.
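
The path generation and the split into central and neighboring genes can be sketched as follows. This is a minimal sketch under stated assumptions: the network is represented as a hypothetical adjacency dictionary, every gene is assumed to have at least one link, the walk is allowed to revisit genes, and for an even path length the gene at index n // 2 is taken as the center. None of these choices are fixed by the description.

```python
import random

def random_path(network, n, rng):
    """Random walk that visits n genes.

    network: {gene: [genes linked to it]} (hypothetical representation)
    rng: a random.Random instance, so that runs can be reproduced
    """
    path = [rng.choice(sorted(network))]
    while len(path) < n:
        # Step to a randomly chosen gene linked to the current one.
        path.append(rng.choice(network[path[-1]]))
    return path

def center_and_context(path):
    """Split a path into its central gene and the remaining neighboring genes."""
    mid = len(path) // 2
    return path[mid], path[:mid] + path[mid + 1:]
```

Each (central gene, neighboring genes) pair produced this way serves as one training sample for the CBOW model, with the neighboring genes as input and the central gene as the target.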

In this case, according to an embodiment of the present invention, the step of determining the embedding vector may generate a one-hot vector for each of the plurality of genes, train the weight matrix constituting the hidden layer of the CBOW model by designating, for each of the plurality of pieces of path information, the one-hot vector of the central gene as output data and the one-hot vectors of the neighboring genes as input data, and determine each row vector of the trained weight matrix as the embedding vector of the corresponding gene among the plurality of genes.
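
A compact CBOW training loop illustrates why the rows of the hidden-layer weight matrix serve as embeddings: multiplying a one-hot vector by the matrix simply selects one row. The sketch below is an assumption-laden illustration, not the disclosed implementation: the function name `train_cbow`, the embedding dimension, learning rate, epoch count, full-softmax output layer, and averaging of the neighboring genes' rows are all hypothetical choices.

```python
import math
import random

def train_cbow(samples, genes, dim=2, lr=0.1, epochs=200, seed=0):
    """samples: list of (central gene, [neighboring genes]).
    Returns {gene: embedding}, i.e. the rows of the hidden-layer matrix."""
    rng = random.Random(seed)
    idx = {g: i for i, g in enumerate(genes)}
    v = len(genes)
    # w_in is V x dim: a one-hot input times w_in picks out exactly one row.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(v)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(v)] for _ in range(dim)]
    for _ in range(epochs):
        for center, context in samples:
            ctx = [idx[g] for g in context]
            # Hidden activation: mean of the neighboring genes' rows.
            h = [sum(w_in[i][d] for i in ctx) / len(ctx) for d in range(dim)]
            logits = [sum(h[d] * w_out[d][j] for d in range(dim)) for j in range(v)]
            top = max(logits)
            exps = [math.exp(z - top) for z in logits]
            total = sum(exps)
            # Cross-entropy gradient w.r.t. logits: softmax minus one-hot target.
            err = [e / total for e in exps]
            err[idx[center]] -= 1.0
            # Backpropagate to the hidden layer before w_out is updated.
            grad_h = [sum(err[j] * w_out[d][j] for j in range(v)) for d in range(dim)]
            for d in range(dim):
                for j in range(v):
                    w_out[d][j] -= lr * h[d] * err[j]
            for i in ctx:
                for d in range(dim):
                    w_in[i][d] -= lr * grad_h[d] / len(ctx)
    return {g: w_in[idx[g]] for g in genes}
```

Running this once per patient on that patient's own paths yields one embedding per gene per patient, which is the input expected by the per-gene clustering.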

In addition, according to an embodiment of the present invention, in step S440, the normalized mutual information of the clustering result generated for each of the plurality of genes may be calculated, and the calculated normalized mutual information may be used as the measured performance of the clustering result generated for each of the plurality of genes.

In addition, according to an embodiment of the present invention, the method of operating the electronic device may further include, after the predetermined number of genes have been determined as the biomarkers, when a user inputs, as a training set, gene data for each of the biomarkers collected in advance from each of the plurality of cancer patients together with cancer prognosis result data for each of the plurality of cancer patients, and a model generation command for predicting cancer prognosis is issued, generating a cancer prognosis prediction model by designating the gene data for each of the biomarkers as the input and the cancer prognosis result data as the output and then performing machine learning based on supervised learning.

In this case, according to an embodiment of the present invention, the method of operating the electronic device may further include, after the cancer prognosis prediction model has been generated, when the user applies, as an input, first gene data for each of the biomarkers collected from a prediction target cancer patient whose cancer prognosis is to be predicted, and a cancer prognosis prediction command for the prediction target cancer patient is issued, calculating cancer prognosis result data for the prediction target cancer patient as output information by applying the first gene data for each of the biomarkers collected from the prediction target cancer patient as the input to the cancer prognosis prediction model.

The method of operating an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention has been described above with reference to FIG. 4. Because this method corresponds to the configuration of the operation of the electronic device 110 for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics, described with reference to FIGS. 1 to 3, a more detailed description thereof is omitted.

The method of operating an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the method of operating an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described with reference to specific matters such as specific components and to limited embodiments and drawings, but these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the field to which the present invention belongs can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set out below but also everything that is equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: Electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics
111: Gene network storage unit 112: Embedding vector generation unit
113: Clustering result generation unit 114: Biomarker determination unit
115: Path information generation unit 116: Vector determination unit
117: Prediction model generation unit 118: Prediction unit

Claims

An electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics, the electronic device comprising:
a gene network storage unit storing data on a gene network set in advance for each of a plurality of cancer patients, wherein the gene network is a network in which links are established between genes that influence each other among a plurality of genes of different types, and means a gene network unique to each cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence between the plurality of genes according to cancer expression;
an embedding vector generation unit that generates, for each of the plurality of cancer patients, an embedding vector representing each of the plurality of genes, based on the data on the gene network of each of the plurality of cancer patients;
a clustering result generation unit that generates a clustering result for each of the plurality of genes by performing, for each of the plurality of genes, K-means clustering based on the per-gene embedding vectors of each of the plurality of cancer patients; and
a biomarker determination unit that measures the performance of the clustering result generated for each of the plurality of genes and then determines, in descending order of clustering performance, a predetermined number of genes among the plurality of genes as biomarkers for predicting cancer prognosis.

The electronic device of claim 1,
wherein, in order to generate an embedding vector representing each of the plurality of genes for a first cancer patient, who is any one of the plurality of cancer patients, the embedding vector generation unit comprises:
a path information generation unit that randomly generates, from the gene network of the first cancer patient, a plurality of pieces of path information each consisting of n genes connected by n consecutive links (n being a natural number of 2 or more); and
a vector determination unit that, for each of the plurality of pieces of path information, selects, from among the n genes constituting that path information, a central gene located at the center of the path and the n-1 neighboring genes other than the central gene, designates the central gene as output data and the neighboring genes as input data, and then trains a CBOW (Continuous Bag of Words) model based on the central gene and the neighboring genes in each of the plurality of pieces of path information, thereby determining the embedding vector of each of the plurality of genes for the first cancer patient.

The electronic device of claim 2,
wherein the vector determination unit
generates a one-hot vector for each of the plurality of genes, trains the weight matrix constituting the hidden layer of the CBOW model by designating, for each of the plurality of pieces of path information, the one-hot vector of the central gene as output data and the one-hot vectors of the neighboring genes as input data, and determines each row vector of the trained weight matrix as the embedding vector of the corresponding gene among the plurality of genes.

The electronic device of claim 1,
wherein the biomarker determination unit
calculates normalized mutual information for the clustering result generated for each of the plurality of genes and uses the calculated normalized mutual information as the measured performance of the clustering result generated for each of the plurality of genes.

The electronic device of claim 1, further comprising
a prediction model generation unit that, after the predetermined number of genes have been determined as the biomarkers, when a user inputs, as a training set, gene data for each of the biomarkers collected in advance from each of the plurality of cancer patients together with cancer prognosis result data for each of the plurality of cancer patients, and a model generation command for predicting cancer prognosis is issued, generates a cancer prognosis prediction model by designating the gene data for each of the biomarkers as the input and the cancer prognosis result data as the output and then performing machine learning based on supervised learning.

The electronic device of claim 5, further comprising
a prediction unit that, after the cancer prognosis prediction model has been generated, when the user applies, as an input, first gene data for each of the biomarkers collected from a prediction target cancer patient whose cancer prognosis is to be predicted, and a cancer prognosis prediction command for the prediction target cancer patient is issued, calculates cancer prognosis result data for the prediction target cancer patient as output information by applying the first gene data for each of the biomarkers collected from the prediction target cancer patient as the input to the cancer prognosis prediction model.

A method of operating an electronic device for selecting biomarkers to be used for predicting cancer prognosis based on patient-specific genetic characteristics, the method comprising:
maintaining a gene network storage unit storing data on a gene network set in advance for each of a plurality of cancer patients, wherein the gene network is a network in which links are established between genes that influence each other among a plurality of genes of different types, and means a gene network unique to each cancer patient, set by measuring in advance, for each of the plurality of cancer patients, the degree of influence between the plurality of genes according to cancer expression;
generating, for each of the plurality of cancer patients, an embedding vector representing each of the plurality of genes, based on the data on the gene network of each of the plurality of cancer patients;
generating a clustering result for each of the plurality of genes by performing, for each of the plurality of genes, K-means clustering based on the per-gene embedding vectors of each of the plurality of cancer patients; and
measuring the performance of the clustering result generated for each of the plurality of genes and then determining, in descending order of clustering performance, a predetermined number of genes among the plurality of genes as biomarkers for predicting cancer prognosis.

The method of claim 7,
wherein, in order to generate an embedding vector representing each of the plurality of genes for a first cancer patient, who is any one of the plurality of cancer patients, generating the embedding vector comprises:
randomly generating, from the gene network of the first cancer patient, a plurality of pieces of path information each consisting of n genes connected by n consecutive links (n being a natural number of 2 or more); and
for each of the plurality of pieces of path information, selecting, from among the n genes constituting that path information, a central gene located at the center of the path and the n-1 neighboring genes other than the central gene, designating the central gene as output data and the neighboring genes as input data, and then training a CBOW (Continuous Bag of Words) model based on the central gene and the neighboring genes in each of the plurality of pieces of path information, thereby determining the embedding vector of each of the plurality of genes for the first cancer patient.

The method of claim 8,
wherein determining the embedding vector comprises
generating a one-hot vector for each of the plurality of genes, training the weight matrix constituting the hidden layer of the CBOW model by designating, for each of the plurality of pieces of path information, the one-hot vector of the central gene as output data and the one-hot vectors of the neighboring genes as input data, and determining each row vector of the trained weight matrix as the embedding vector of the corresponding gene among the plurality of genes.

The method of claim 7,
wherein determining the biomarkers comprises
calculating normalized mutual information for the clustering result generated for each of the plurality of genes and using the calculated normalized mutual information as the measured performance of the clustering result generated for each of the plurality of genes.

The method of claim 7, further comprising,
after the predetermined number of genes have been determined as the biomarkers, when a user inputs, as a training set, gene data for each of the biomarkers collected in advance from each of the plurality of cancer patients together with cancer prognosis result data for each of the plurality of cancer patients, and a model generation command for predicting cancer prognosis is issued, generating a cancer prognosis prediction model by designating the gene data for each of the biomarkers as the input and the cancer prognosis result data as the output and then performing machine learning based on supervised learning.

The method of claim 11, further comprising,
after the cancer prognosis prediction model has been generated, when the user applies, as an input, first gene data for each of the biomarkers collected from a prediction target cancer patient whose cancer prognosis is to be predicted, and a cancer prognosis prediction command for the prediction target cancer patient is issued, calculating cancer prognosis result data for the prediction target cancer patient as output information by applying the first gene data for each of the biomarkers collected from the prediction target cancer patient as the input to the cancer prognosis prediction model.

A computer-readable recording medium on which is recorded a computer program for executing, in combination with a computer, the method of any one of claims 7 to 12.

A computer program stored in a storage medium for executing, in combination with a computer, the method of any one of claims 7 to 12.