KR20210147651A

KR20210147651A - Method for generating medical data using gan and system thereof

Info

Publication number: KR20210147651A
Application number: KR1020200065203A
Authority: KR
Inventors: 권창혁; 오귀영
Original assignee: 의료법인 이원의료재단
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2021-12-07

Abstract

One embodiment relates to a method for generating medical data using a GAN and a system thereof. The medical data generating method comprises the steps of: generating learning data; training a generator; inputting the learning data to the trained generator to generate result data; and inputting the result data into a discriminant model to generate sample medical data. A user can thereby quickly and efficiently generate a large number of highly accurate sample medical data similar to clinical medical data.

Description

Method and system for generating medical data using GAN

The following embodiments relate to a method and system for generating medical data using a GAN.

Deriving accurate results in medical research requires diverse cases and a large amount of clinical data.

In reality, however, most medical research is conducted with only a small amount of clinical data because of cost, time, and the difficulty of obtaining Institutional Review Board (IRB) approval for human-derived materials.

In particular, in cancer research, not only prognosis prediction but also stage prediction suffers from imbalanced sample sets because stage 1 and stage 4 samples are scarce.

SMOTE was developed to address this problem, and with the development of 85 upgraded algorithms, attempts to solve the problems of downsampling, oversampling, and imbalanced data are continuing.

Recently, there have been attempts to address insufficient clinical data by expanding gene expression data with a deep-learning-based denoising autoencoder (DA), and the Expansion-Based Stacked AutoEncoder (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN) methods have been proposed.

However, these existing methods are not accurate enough: they can do no more than distinguish cancer samples from normal samples for a small number of cancers.

The matters described in this background section are provided to promote understanding of the background of the invention, and may include matters that are not prior art already known to those of ordinary skill in the art to which this technology belongs.

The following embodiments have been devised to solve the problems described above, and one embodiment aims to provide a technique for generating sample medical data using a GAN model.

The problems to be solved by an embodiment are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

According to one embodiment, there is provided a method for generating sample medical data using a generative adversarial network (GAN), performed in a medical data generating apparatus, the method comprising: generating learning data by using the mean or standard deviation of feature values of clinical medical data; training a generator by inputting the clinical medical data and the learning data into a GAN model; inputting the learning data to the trained generator to generate result data; and inputting the result data into a discriminant model to generate sample medical data, wherein the GAN model comprises the generator and a discriminator, and the discriminator is a loss convergence model.

The generating of the learning data includes: selecting a feature from a first object of the clinical medical data; extracting a feature corresponding to the selected feature from a second object of the clinical medical data; and generating a random number (latent space), which is the learning data, by using the mean or standard deviation of the extracted feature values.

The first object is a DNA gene, and the second object is an RNA gene.

The generating of the random number generates the random number by using at least one of random sampling of the extracted feature values and an average value between samplings.

The GAN model is a CNN-based GAN model, a DNN-based GAN model, a Deep Convolutional GAN model, a CycleGAN model, or a StackGAN model.

The discriminant model is a one-dimensional convolutional neural network (1DCNN), a convolutional neural network (CNN), or a deep neural network (DNN).

One embodiment provides a medical data generating apparatus comprising: a learning data generation unit configured to generate learning data by using the mean or standard deviation of feature values of clinical medical data; a GAN model unit configured to train a generator by inputting the clinical medical data and the learning data into a GAN model, and to generate result data by inputting the learning data to the trained generator; and a result determination unit configured to generate sample medical data by inputting the result data into a discriminant model, wherein the GAN model comprises the generator and a discriminator, and the discriminator is a loss convergence model.

According to the embodiments described above, a user can quickly and efficiently generate a large number of highly accurate sample medical data similar to clinical medical data.

The effects of an embodiment are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a diagram illustrating a system for generating medical data using a GAN according to an embodiment.
FIG. 2 is a diagram illustrating the configuration of a medical data generating apparatus according to an embodiment.
FIG. 3 is a diagram illustrating the configuration of a GAN model unit according to an embodiment.
FIG. 4 is a diagram illustrating the operation of a GAN model unit according to an embodiment.
FIGS. 5 to 8 are diagrams illustrating the performance of a medical data generating apparatus according to an embodiment.
FIG. 9 is a flowchart of a method for generating learning data according to an embodiment.
FIG. 10 is a flowchart of a learning method according to an embodiment.
FIG. 11 is a flowchart of a method for generating result data according to an embodiment.
FIG. 12 is a flowchart of a method for generating sample medical data according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements. Various modifications may be made to the embodiments described below. The embodiments described below are not intended to limit the forms of implementation, and should be understood to include all modifications, equivalents, and substitutes thereof.

The terms used in the embodiments are used only to describe specific embodiments and are not intended to limit them. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or possible addition of one or more other features, numbers, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the embodiments.

FIG. 1 is a diagram illustrating a medical data generation system 10 using a GAN according to an embodiment.

Referring to FIG. 1, the medical data generation system 10 using a GAN according to an embodiment includes a medical data generating apparatus 100.

The medical data generating apparatus 100 receives clinical medical data 20 and generates sample medical data 50 by using a GAN.

First, the GAN (generative adversarial network) of an embodiment is a machine learning (ML) model in which a generative model and a discriminant model compete to automatically produce realistic data such as numerical values, images, videos, and audio.

A GAN consists of a generative model that learns a probability distribution and a discriminant model that distinguishes between different sets. The generative model (or generator) is trained to produce fake examples that deceive the discriminant model as much as possible, and the discriminant model (or discriminator) is trained to distinguish the fake examples presented by the generative model from real examples as accurately as possible. Training the generative model to deceive the discriminant model in this way is called an adversarial process.
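For reference, the adversarial process described above is commonly written as the minimax objective below. This is the standard formulation from the general GAN literature, not a formula stated in this specification; the symbols G (generator), D (discriminator), p_data (distribution of real examples), and p_z (distribution of the generator input) are notation assumed here for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high scores to real examples and low scores to generated ones, while the generator minimizes V by producing examples that the discriminator scores as real.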

The clinical medical data 20 are data containing human disease information, namely medical data obtained in actual clinical practice.

The clinical medical data 20 of an embodiment include DNA and RNA data and may additionally include various information related to human diseases. For example, the clinical medical data 20 may include DNA mutation information, RNA expression information, microRNA (miRNA), and methylation gene information, and may also include Tab-based hospital data.

The clinical medical data 20 of an embodiment may be medical data extracted from various cancers in The Cancer Genome Atlas (TCGA). For example, the clinical medical data 20 may be data for eight cancers among the sample data distributed by TCGA.

The sample medical data 50 are medical data generated by inputting the clinical medical data 20 into the GAN, and may be composed of the same data as the clinical medical data 20.

The medical data generating apparatus 100 can quickly and efficiently generate sample medical data 50 that are highly accurate and similar to the clinical medical data 20, which are actual clinical data.

The specific configuration and functions of the medical data generating apparatus 100 are described in detail below with reference to FIG. 2.

The medical data generating apparatus 100 may be, for example, one of electronic devices such as a computer, an ultra mobile PC (UMPC), a workstation, a netbook, a personal digital assistant (PDA), a portable computer, a web tablet, a wireless phone, a mobile phone, a smartphone, or a portable multimedia player (PMP), and may include any electronic device capable of installing and running an application related to the medical data generating apparatus 100. Under the control of the application, the electronic device may perform overall service operations such as configuring a service screen, data input, data transmission and reception, and data storage.

FIG. 2 is a diagram illustrating the configuration of the medical data generating apparatus 100 according to an embodiment.

Referring to FIG. 2, the medical data generating apparatus 100 according to an embodiment includes a control unit 110, a learning data generation unit 120, a GAN model unit 130, a result data determination unit 140, a user interface unit 150, a database unit 160, and a display unit 170.

Communication between the various entities included in the medical data generating apparatus 100 may be performed through a wired/wireless network (not shown). The wired/wireless network may use standard communication technologies and/or protocols.

The hardware configuration of the medical data generating apparatus 100 may be implemented in various ways. The hardware may be configured by integrating the learning data generation unit 120 with the GAN model unit 130, or by integrating the GAN model unit 130 with the result data determination unit 140. As such, the hardware configuration of the medical data generating apparatus 100 is not limited to the description in this specification and may be implemented in various ways and combinations.

The control unit 110 controls the learning data generation unit 120, the GAN model unit 130, the result data determination unit 140, the user interface unit 150, the database unit 160, and the display unit 170 so that they perform the various functions of the medical data generating apparatus 100.

The control unit 110 may also be referred to as a processor, controller, microcontroller, microprocessor, microcomputer, or the like, and may be implemented by hardware, firmware, software, or a combination thereof.

The learning data generation unit 120 generates learning data 30 for the GAN model by using the clinical medical data 20.

The learning data generation unit 120 performs feature selection on a first object of the clinical medical data 20. In an embodiment, the first object may be DNA genes, and the feature may be a list of genes associated with a disease. Accordingly, the learning data generation unit 120 may select a list of disease-related genes from the DNA genes of the clinical medical data 20.

In another embodiment, the first object may be RNA, microRNA (miRNA), or methylation genes, alone or in combination.

The learning data generation unit 120 performs feature extraction on a second object of the clinical medical data 20. In an embodiment, the second object may be RNA genes, and the feature may be a gene list corresponding to the gene list selected from the first object. Accordingly, the learning data generation unit 120 may extract, from the RNA genes of the clinical medical data 20, the gene list corresponding to the gene list selected from the DNA genes.
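The selection and extraction steps can be illustrated with a minimal sketch. It assumes the second object is available as a samples-by-genes expression matrix with a parallel list of gene names; the function name `extract_matching_features`, the toy gene names, and the rule of silently skipping genes absent from the RNA table are all hypothetical choices for illustration, not details given in the specification.

```python
import numpy as np

def extract_matching_features(selected_genes, rna_gene_names, rna_matrix):
    """Keep the RNA columns whose gene names appear in the selected gene list.

    rna_matrix has shape (n_samples, n_genes); columns follow rna_gene_names.
    Genes that were selected but are missing from the RNA table are skipped
    (an assumption made for this sketch).
    """
    column_of = {gene: i for i, gene in enumerate(rna_gene_names)}
    kept_genes = [g for g in selected_genes if g in column_of]
    columns = [column_of[g] for g in kept_genes]
    return kept_genes, rna_matrix[:, columns]

# Toy data: two samples, four RNA genes.
rna_gene_names = ["TP53", "BRCA1", "EGFR", "KRAS"]
rna_matrix = np.array([[1.0, 2.0, 3.0, 4.0],
                       [5.0, 6.0, 7.0, 8.0]])

# Hypothetical gene list selected from the first object (e.g. DNA mutations).
selected_genes = ["BRCA1", "KRAS", "MYC"]

kept_genes, features = extract_matching_features(
    selected_genes, rna_gene_names, rna_matrix
)
```

Here `features` holds only the expression values of the genes that were selected from the first object and also exist in the second, which is the input to the random number generation described next.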

The learning data generation unit 120 generates a random number (latent space) using the mean or standard deviation of the feature values extracted from the second object of the clinical medical data 20, and determines the generated random number to be the learning data 30.

The random number (latent space) of an embodiment is data representative of the feature values extracted from the second object of the clinical medical data 20.

The learning data generation unit 120 generates the random number, which is data representative of the feature values, based on the mean or standard deviation of the feature values. In another embodiment, the learning data generation unit 120 may generate the random number based on a specific criterion such as random sampling of the feature values or an average value between samplings.

The learning data generation unit 120 determines the generated random number to be the learning data 30.
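One way to read the random number generation is sketched below. The specification does not fix a distribution, so drawing each feature from a normal distribution with that feature's mean and standard deviation is an assumption, as are the function names `latent_from_mean_std` and `latent_from_resampling`. The second function illustrates the alternative embodiment by averaging pairs of randomly sampled rows; the pair size of two is likewise assumed.

```python
import numpy as np

def latent_from_mean_std(features, n_samples, rng):
    """Draw random numbers per feature from N(mean, std) of the extracted values."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return rng.normal(loc=mean, scale=std, size=(n_samples, features.shape[1]))

def latent_from_resampling(features, n_samples, rng):
    """Alternative: average two randomly sampled rows for each output row."""
    idx = rng.integers(0, features.shape[0], size=(n_samples, 2))
    return features[idx].mean(axis=1)

rng = np.random.default_rng(0)
# Toy extracted features: 6 samples x 3 genes; the last gene is constant.
features = np.array([[1.0, 10.0, 5.0],
                     [2.0, 12.0, 5.0],
                     [3.0, 14.0, 5.0],
                     [4.0, 16.0, 5.0],
                     [5.0, 18.0, 5.0],
                     [6.0, 20.0, 5.0]])

latent_a = latent_from_mean_std(features, n_samples=6, rng=rng)
latent_b = latent_from_resampling(features, n_samples=6, rng=rng)
```

Both variants return an array with one row per requested sample and one column per extracted feature, so either can serve as the learning data 30 in the sketches that follow.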

The GAN model unit 130 trains the GAN model using the clinical medical data 20 and the learning data 30, and generates result data 40 by inputting the learning data 30 into the trained GAN model.

As described above, the GAN model of an embodiment is a machine learning model in which a generator and a discriminator compete to automatically produce realistic data such as numerical values, images, videos, and audio.

The GAN model of an embodiment may be any of various models, including not only CNN-based and DNN-based GAN models but also Deep Convolutional GAN, CycleGAN, and StackGAN models.

The specific configuration and functions of the GAN model unit 130 are described in detail below with reference to FIGS. 3 and 4.

The result data determination unit 140 generates the sample medical data 50 by inputting the result data 40 generated by the GAN model unit 130 into a discriminant model.

The result data determination unit 140 identifies, among the result data 40 generated by the GAN model unit 130, the data that are meaningful with respect to the disease, and determines them to be the sample medical data 50.

The discriminant model of an embodiment may be a one-dimensional convolutional neural network (1DCNN), a convolutional neural network (CNN), or a deep neural network (DNN).

The discriminant model of an embodiment compares the CNN and the DNN using a random forest; however, the embodiment is not limited thereto and can be extended to various deep learning algorithms.

The user interface unit 150 provides the user with an interface for entering data. The user may enter the clinical medical data 20 through the user interface unit 150.

The database unit 160 stores the various data that the medical data generating apparatus 100 needs in order to generate the sample medical data 50. For example, the database unit 160 may store the clinical medical data 20, the learning data 30, the result data 40, and the sample medical data 50.

The display unit 170 outputs the various data stored in the medical data generating apparatus 100 to the user through a display device. For example, the display unit 170 may output the clinical medical data 20, the learning data 30, the result data 40, and the sample medical data 50 to the user.

A communication unit (not shown) performs data communication with external devices. The communication unit (not shown) may receive the clinical medical data 20 from an external device and may transmit the sample medical data 50 to an external device.

FIG. 3 is a diagram illustrating the configuration of the GAN model unit 130 according to an embodiment.

Referring to FIG. 3, the GAN model unit 130 according to an embodiment includes a GAN learning unit 131 and a GAN generation unit 133.

The GAN learning unit 131 trains the GAN model using the clinical medical data 20 and the learning data 30.

The GAN learning unit 131 consists of a generative model that learns a probability distribution and a discriminant model that distinguishes between different sets. The generative model (or generator) is trained to produce fake examples that deceive the discriminant model as much as possible, and the discriminant model (or discriminator) is trained to distinguish the fake examples presented by the generative model from real examples as accurately as possible.

The GAN generation unit 133 generates the result data 40 by inputting the learning data 30 into the trained GAN model.

FIG. 4 is a diagram illustrating the operation of the GAN model unit 130 according to an embodiment.

Referring to FIG. 4, the GAN learning unit 131 according to an embodiment includes a generator 12 and a discriminator 11, and the GAN generation unit 133 includes a trained generator 13.

The GAN learning unit 131 has a structure in which the loss is learned through competition between the generator 12 and the discriminator 14. That is, the discriminator 14 can discriminate using a deep learning loss convergence model.

The GAN learning unit 131 receives the clinical medical data 20 and the learning data 30 and may run a number of epochs to train the GAN model. For example, the GAN learning unit 131 may run 1000 epochs to train the GAN model. However, the number of epochs in an embodiment is not limited thereto and may be set differently depending on the type of disease.
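The alternating updates over many epochs can be illustrated with a deliberately tiny stand-in. The sketch below is not the network of the embodiment: it assumes one-dimensional data, a linear generator G(z) = a*z + b, a logistic discriminator D(x) = sigmoid(w*x + c), plain gradient descent, and the commonly used non-saturating generator loss -log D(G(z)). All names (`train_toy_gan`, `a`, `b`, `w`, `c`) are hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(real, epochs=1000, lr=0.01, seed=0):
    """Alternate one discriminator step and one generator step per epoch."""
    rng = np.random.default_rng(seed)
    mean, std = real.mean(), real.std()
    a, b = 1.0, 0.0          # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0          # discriminator parameters: D(x) = sigmoid(w*x + c)
    eps = 1e-12              # keeps log() finite if D saturates
    d_losses = []
    for _ in range(epochs):
        # Learning data: random numbers drawn from the mean/std of the real data.
        z = rng.normal(mean, std, size=real.shape[0])
        fake = a * z + b

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        d_losses.append(float(-np.mean(np.log(d_real + eps))
                              - np.mean(np.log(1.0 - d_fake + eps))))
        w -= lr * (np.mean((d_real - 1.0) * real) + np.mean(d_fake * fake))
        c -= lr * (np.mean(d_real - 1.0) + np.mean(d_fake))

        # Generator step: minimize -log D(G(z)) against the updated discriminator.
        d_fake = sigmoid(w * fake + c)
        a -= lr * np.mean((d_fake - 1.0) * w * z)
        b -= lr * np.mean((d_fake - 1.0) * w)
    return (a, b), (w, c), d_losses

data_rng = np.random.default_rng(1)
real = data_rng.normal(5.0, 2.0, size=200)   # toy stand-in for one clinical feature
(gen_a, gen_b), (disc_w, disc_c), d_losses = train_toy_gan(real, epochs=200)
```

The recorded `d_losses` list is the quantity one would watch for convergence; with an untrained discriminator that outputs 0.5 everywhere, the first entry equals 2·ln 2.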

The GAN learning unit 131 trains the generator 12 through the learning method described above and passes the trained generator 13 to the GAN generation unit 133.

The GAN generation unit 133 generates the result data 40 by inputting the learning data 30 into the trained GAN generator 13.

The GAN generation unit 133 may generate a large number of result data 40. For example, the GAN generation unit 133 may apply the trained GAN generator 13 once to generate the same number of result data (40, GAN1) as the clinical medical data 20, apply it 10 times to generate 10 times as many result data (40, GAN10) as the clinical medical data 20, and apply it 100 times to generate 100 times as many result data (40, GAN100) as the clinical medical data 20.
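The GAN1, GAN10, and GAN100 sets can be sketched as repeated application of a trained generator to freshly drawn learning data. In the sketch below, `generate_multiples` is a hypothetical helper, the generator is a placeholder function rather than a trained network, and redrawing the random numbers from a normal distribution on every pass is an assumption.

```python
import numpy as np

def generate_multiples(generator, real, k, rng):
    """Apply the generator k times; each pass yields as many rows as `real`."""
    mean, std = real.mean(axis=0), real.std(axis=0)
    batches = []
    for _ in range(k):
        z = rng.normal(mean, std, size=real.shape)  # fresh learning data per pass
        batches.append(generator(z))
    return np.concatenate(batches, axis=0)

rng = np.random.default_rng(0)
real = rng.normal(size=(30, 4))          # toy stand-in: 30 samples x 4 features

def placeholder_generator(z):
    # Stand-in for a trained generator; a real one would be a neural network.
    return 0.9 * z + 0.1

gan1 = generate_multiples(placeholder_generator, real, k=1, rng=rng)
gan10 = generate_multiples(placeholder_generator, real, k=10, rng=rng)
gan100 = generate_multiples(placeholder_generator, real, k=100, rng=rng)
```

With 30 real samples, the three calls return 30, 300, and 3000 generated rows respectively, matching the 1x, 10x, and 100x multiples described above.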

With the configuration of the GAN model unit 130 described above, a large number of new result data 40 similar to the clinical medical data 20 can be generated in a very short time.

FIGS. 5 to 8 are diagrams illustrating the performance of the medical data generating apparatus 100 according to an embodiment.

FIG. 5 shows the PCA-based data distribution of the sample medical data 50 generated by the medical data generating apparatus 100 according to an embodiment. It compares, for eight cancer types, the PCA distribution of the clinical medical data 20 with that of GAN1 (result data 40 generated in the same number as the clinical medical data).
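A comparison of this kind can be reproduced in outline by projecting both data sets onto the same two principal components. The sketch below computes PCA with a singular value decomposition in NumPy; fitting the components on the real data only and reusing them for the generated data is an assumption, as are the function name `pca_2d` and the toy arrays.

```python
import numpy as np

def pca_2d(fit_data, *others):
    """Fit 2 principal components on fit_data and project every array onto them."""
    mean = fit_data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(fit_data - mean, full_matrices=False)
    components = vt[:2].T                      # shape (n_features, 2)
    return [(x - mean) @ components for x in (fit_data, *others)]

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 10))               # toy stand-in for clinical data
generated = real + rng.normal(scale=0.1, size=real.shape)  # toy stand-in for GAN1

real_2d, generated_2d = pca_2d(real, generated)
```

The two resulting arrays hold (x, y) coordinates that can be drawn on one scatter plot, coloured by cancer stage, to compare how well the stages separate in each data set.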

도 (a)에서 보는 바와 같이, 8개 암 종의 임상 의료 데이터(20)의 분포는 BRCA에서 3기 데이터가 약간 구분되고, KIRP의 4기가 한곳에 클러스터링 되어서 약간의 구분력이 있지만, 전체적으로 거의 구분이 힘들다. 그러나, GAN1 데이터를 나타낸 도 (b)의 그래프는 명확한 구분력을 나타낸다. 새롭게 생산한 샘플 의료 데이터(50)는 1, 2, 3, 4기로 명확하게 나눌 수 있고, 임상 의료 데이터(20)와 GAN1을 동시에 나타낸 도 (c)에서 명확히 구분된다.As shown in Fig. (a), the distribution of clinical medical data (20) of 8 cancer types has some distinguishing power as stage 3 data is slightly divided in BRCA, and stage 4 of KIRP is clustered in one place. this is hard However, the graph of FIG. (b) showing the GAN1 data shows a clear discrimination power. The newly produced sample medical data 50 can be clearly divided into phases 1, 2, 3, and 4, and is clearly distinguished from the clinical medical data 20 and GAN1 in FIG.

The drawings confirm that the mean and standard deviation criterion extracts the important features well. Panel (d) shows that, when the clinical medical data (20, light blue) and the GAN1 data (yellow) are plotted from the gene perspective, the overall gene distributions are nearly the same. From the gene perspective in panel (d), the range is wider than that of the clinical medical data 20, while from the perspective of the sample medical data 50 (panels (a), (b), and (c)) the stages are clearly separated. Because this holds across eight cancers with diverse distributions, it is clear that the approach also has discriminating power for other types of genetic data, such as prognosis prediction, network data analysis, and omics data analysis.

FIG. 6 compares the clinical medical data (20, Ori), medical data extracted by a feature selection method (FS, Feature Selection), medical data generated using the mean and standard deviation (MS, Mean and Standard deviation), and the sample medical data 50 generated according to an embodiment.

The clinical medical data (20, Ori) and the feature-selected data (FS) gave the same accuracy in THCA, but in BRCA accuracy improved by 9%. All graphs except KIRP show a stable value, converging to the median. For BRCA, only 359 of 19738 gene features (1.8%) were selected, and for LUAD only 360 of 19648 gene features (1.8%) were selected, yet performance improved by 9% and 7%, respectively.

An embodiment can increase accuracy even with a small number of gene features, and it shows a fast learning speed. The roughly 360 selected genes are more likely to include the most important markers for each cancer. Among the remaining cancer types, KIRP, THCA, and READ had the largest share of features extracted, 4%, with performance improvements of 3%, 0%, and 6%, respectively. The features were selected using the feature importance method of random forest.
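The selection step described above can be sketched as keeping the top-ranked fraction of genes by importance score. This is a minimal illustration only: the function name `select_top_features`, the gene names, and the scores are hypothetical, and in practice the scores would come from a fitted random forest rather than a hand-written dictionary.

```python
def select_top_features(importances, fraction):
    """Return the names of the highest-scoring features.

    importances: dict mapping feature name -> importance score (hypothetical input).
    fraction: share of features to keep, e.g. 0.018 for about 1.8%.
    """
    k = max(1, round(len(importances) * fraction))
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


# Toy importance scores; the gene names are placeholders.
scores = {"GENE_A": 0.40, "GENE_B": 0.05, "GENE_C": 0.30, "GENE_D": 0.25}
selected = select_top_features(scores, fraction=0.5)

# Share of features kept in the two cases quoted in the text.
brca_share = 359 / 19738 * 100
luad_share = 360 / 19648 * 100
print(selected, round(brca_share, 1), round(luad_share, 1))
```

Both quoted selections work out to about 1.8% of the available gene features.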

Compared with the clinical medical data (20, Ori), which use all gene features, the mean and standard deviation medical data (MS) dropped unusually sharply, by 12%, in HNSC and by 1% in THCA, but in the remaining cancer types performance improved about as much as with the feature-selected data (FS). KIRC, STAD, and KIRP showed the most notable improvements, 6%, 7%, and 7%, which exceeded those of the feature-selected data (FS). However, because performance improved only for certain samples and dropped for others, it is difficult to apply this approach to other data or analyses.

GAN1 generated result data 40 equal in number to the training data 30 (1x), GAN20 generated 20 times as many, and GAN100 generated 100 times as many as the training data 30. All of these results show a considerable performance improvement.

With GAN1, the improvements in HNSC and STAD were 5% and 7%, similar to FS and MS. In the remaining five cancers, however, accuracy improved greatly, by at least 16% and up to 21%. Although the range of variation is wide, the improvement is clear: even the minimum value over 10 tests is higher than the value for the clinical medical data (20, Ori).

GAN100 shows improvements of 8% and 9% even for THCA and HNSC, where FS and MS gave no improvement. In six cancer types, performance improved by at least 15% and up to 21%. This is because, as the PCA analysis of FIG. 5 shows, the distinct characteristics of the data are captured well and the classes separate. Even though 100 times as much data was generated, performance improved in all cancers and the range of variation was smaller than with GAN1, which confirms that the data were amplified by extracting the features well rather than by amplifying erroneous data.

FIG. 7 illustrates the results obtained when 1DCNN, DNN, and random forest (RF) are used as the discriminant model of an embodiment.

As shown in FIG. 7, when the three discriminant models are compared, for GAN1 the DNN shows the highest accuracy in 5 cancer types, the 1DCNN in 2, and RF in 2. For GAN100, most results show high prediction accuracy; the 1DCNN shows the highest accuracy in 5 cancer types, the DNN in 4, and RF in 3. As a discriminant model, the DNN sometimes gave the highest value with GAN1 and GAN100, but in most cancer types it scored lower than the 1DCNN and RF on the Ori and FS results. The 1DCNN gave better results with GAN100 than with GAN1.

It can be confirmed that, even when the number of samples is inflated 100-fold, the medical data generating apparatus 100 according to an embodiment generates data by extracting key gene features well rather than by accumulating errors. It can also be confirmed that the apparatus can be used for prognosis prediction and for genetic analysis using omics data.

FIG. 8 compares existing sample augmentation methods such as SMOTE and DA with the sample medical data 50 generated using the GAN model according to an embodiment.

To address sample imbalance, oversampling, downsampling, and related problems, more than 85 variants of SMOTE have been developed, and attempts have been made to address noise removal, dimension reduction, clustering, and borderline problems using more than 104 datasets. The results used in FIG. 7 are test results obtained with imblearn.over_sampling from the imbalanced-learn package for Python. In the case of BRCA (breast cancer), of 942 samples (158, 548, 218, 18), the 70% used for training amount to 657 samples (110, 383, 152, 12), and SMOTE produces a balanced dataset of 1532 samples (383, 383, 383, 383).
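The BRCA counts quoted above can be reproduced with simple arithmetic. The sketch below assumes that the 70% training split is taken per class and rounded down, and that oversampling raises every class to the size of the largest one; both are assumptions that happen to match the quoted numbers, not statements about how the split was actually made. The function names are hypothetical.

```python
def stratified_train_counts(class_counts, numerator=7, denominator=10):
    """Per-class training counts for a 70% split, rounded down (assumption)."""
    return [n * numerator // denominator for n in class_counts]


def balanced_counts(train_counts):
    """Per-class counts after oversampling every class up to the largest one."""
    target = max(train_counts)
    return [target] * len(train_counts)


brca = [158, 548, 218, 18]              # per-class sample counts quoted in the text
train = stratified_train_counts(brca)
balanced = balanced_counts(train)

print(sum(brca), train, sum(train))     # 942 [110, 383, 152, 12] 657
print(balanced, sum(balanced))          # [383, 383, 383, 383] 1532
```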

DA (Denoising Autoencoder) is a model that secures sample diversity by inflating the number of samples according to a suitable formula while injecting artificial noise, setting between 2 and 5 feature values to 0.
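The noise injection described for DA can be sketched as follows. Only the masking step is shown (setting 2 to 5 randomly chosen feature values to 0 while multiplying the sample count); the autoencoder itself is omitted, and the function names, the `copies` parameter, and the toy sample are illustrative assumptions.

```python
import random


def mask_features(sample, rng, min_zero=2, max_zero=5):
    """Return a copy of `sample` with 2 to 5 randomly chosen values set to 0."""
    k = rng.randint(min_zero, max_zero)               # inclusive on both ends
    zero_idx = set(rng.sample(range(len(sample)), k))
    return [0.0 if i in zero_idx else v for i, v in enumerate(sample)]


def augment(samples, copies, seed=0):
    """Inflate the sample count by making `copies` noisy versions of each sample."""
    rng = random.Random(seed)
    return [mask_features(s, rng) for s in samples for _ in range(copies)]


original = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]   # toy feature vector
noisy = augment(original, copies=3)
for row in noisy:
    print(row)
```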

SMOTE improved performance substantially over the clinical medical data (20, Ori) in KIRP and BRCA, by 13% and 6%, but for the remaining cancer types the results were about the same or slightly lower. DA gave an improvement of about 2 to 3% in READ but was otherwise similar or lower overall. In contrast, the drawing shows that the sample medical data 50 generated using the GAN model outperform these two representative existing algorithms.

FIG. 9 is a flowchart of a training data generating method according to an embodiment.

Referring to FIG. 9, the training data generating method according to an embodiment includes a feature selection operation on a first object (S100), a feature extraction operation on a second object (S110), a random number generation operation (S120), and a training data determination operation (S130).

First, in the feature selection operation on the first object (S100), the training data generating unit 120 selects features (feature selection) from the first object of the clinical medical data 20. In an embodiment, the first object may be DNA genes, and the features may be a list of disease-related genes. Accordingly, the training data generating unit 120 may select a list of disease-related genes from the DNA genes of the clinical medical data 20. In another embodiment, the first object may be RNA, microRNA (miRNA), or methylation genes, alone or in combination.

Next, in the feature extraction operation on the second object (S110), the training data generating unit 120 extracts features (feature extraction) from the second object of the clinical medical data 20. In an embodiment, the second object may be RNA genes, and the features may be a gene list corresponding to the gene list selected from the first object. Accordingly, the training data generating unit 120 may extract, from the RNA genes of the clinical medical data 20, a gene list corresponding to the gene list selected from the DNA genes.

Next, in the random number generation operation (S120), the training data generating unit 120 generates random numbers (latent space) using the mean or standard deviation of the feature values extracted from the second object of the clinical medical data 20, and determines the generated random numbers as the training data 30.

Then, in the training data determination operation (S130), the training data generating unit 120 determines the generated random numbers as the training data 30.
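Operations S120 and S130 can be sketched as drawing random vectors whose per-gene statistics follow the extracted feature values. The text says only that the mean or standard deviation is used; drawing from a Gaussian with both statistics is an assumption made here for illustration, and `make_training_data` and the toy expression matrix are hypothetical.

```python
import random
import statistics


def make_training_data(feature_matrix, n_samples, seed=0):
    """Draw random vectors using per-feature mean and standard deviation.

    feature_matrix: rows are samples, columns are extracted gene features.
    Gaussian sampling is an illustrative assumption, not taken from the text.
    """
    rng = random.Random(seed)
    columns = list(zip(*feature_matrix))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]    # needs >= 2 samples
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)]
            for _ in range(n_samples)]


# Toy expression values: 3 samples x 3 genes.
expression = [[2.0, 10.0, 0.5],
              [4.0, 12.0, 0.7],
              [6.0, 14.0, 0.9]]
latent = make_training_data(expression, n_samples=5)
print(len(latent), len(latent[0]))
```

Seeding the generator makes the draw reproducible, which helps when comparing runs that differ only in the number of generated samples.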

FIG. 10 is a flowchart of a training method according to an embodiment.

Referring to FIG. 10, the training method according to an embodiment includes an operation of inputting data to the GAN model (S200), a generator training operation (S210), and an operation of transmitting the trained generator (S220).

First, in the operation of inputting data to the GAN model (S200), the GAN training unit 131 inputs the clinical medical data 20 and the training data 30 into the GAN model.

Next, in the generator training operation (S210), the GAN training unit 131 trains the GAN model using the clinical medical data 20 and the training data 30.

As described above, the GAN training unit 131 consists of a generator 12, which learns a probability distribution, and a discriminator 11, which distinguishes between different sets. The generator 12 is trained to create fake examples that deceive the discriminant model as much as possible, and the discriminator 11 is trained to distinguish the fake examples presented by the generator 12 from real examples as accurately as possible.
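The adversarial training described above is commonly written as a minimax objective. The text does not state which loss is used, so the standard GAN objective below is given only as an assumed illustration, with x standing for the clinical medical data 20, z for the training data 30 (random numbers), G for the generator 12, and D for the discriminator 11.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

Under this form, the discriminator raises V by scoring real examples near 1 and generated examples near 0, while the generator lowers V by producing examples that the discriminator scores near 1.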

The GAN training unit 131 may run a number of epochs to train the GAN model. For example, the GAN training unit 131 may run 1000 epochs to train the GAN model. However, the number of epochs in an embodiment is not limited to this and may be set differently depending on the type of disease.

The GAN training unit 131 trains the generator 12 through the training method described above.

Then, in the operation of transmitting the trained generator (S220), the GAN training unit 131 transmits the trained generator 13 to the GAN generation unit 133.

FIG. 11 is a flowchart of a result data generating method according to an embodiment.

Referring to FIG. 11, the result data generating method according to an embodiment includes an operation of receiving the trained generator (S300), a training data input operation (S310), and a result data generation operation (S320).

First, in the operation of receiving the trained generator (S300), the GAN generation unit 133 receives the trained generator 13 from the GAN training unit 131.

Next, in the training data input operation (S310), the GAN generation unit 133 inputs the training data 30 to the trained GAN generator 13.

Then, in the result data generation operation (S320), the GAN generation unit 133 generates the result data 40 from the trained GAN generator 13. The GAN generation unit 133 may generate a plurality of result data 40. For example, the GAN generation unit 133 may apply the trained GAN generator 13 once to generate the same number of result data (40, GAN1) as the clinical medical data 20, apply the trained GAN generator 13 ten times to generate 10 times as many result data (40, GAN10) as the clinical medical data 20, and apply the trained GAN generator 13 one hundred times to generate 100 times as many result data (40, GAN100) as the clinical medical data 20.
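Operations S300 to S320 can be sketched as repeatedly passing the training data through the trained generator. In the sketch below, `toy_generator` is a hypothetical stand-in for the trained generator 13 (a real one would be a neural network), the small noise term is an assumption that keeps repeated passes from producing identical rows, and the training data are assumed to contain as many rows as the clinical medical data.

```python
import random


def generate_result_data(generator, training_data, times):
    """Apply `generator` to every training vector, `times` passes in total."""
    results = []
    for _ in range(times):
        results.extend(generator(z) for z in training_data)
    return results


_rng = random.Random(0)


def toy_generator(z):
    """Hypothetical stand-in for the trained generator: affine map plus noise."""
    return [2.0 * v + 1.0 + _rng.gauss(0.0, 0.01) for v in z]


training_data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy random vectors
gan1 = generate_result_data(toy_generator, training_data, times=1)
gan10 = generate_result_data(toy_generator, training_data, times=10)
gan100 = generate_result_data(toy_generator, training_data, times=100)
print(len(gan1), len(gan10), len(gan100))               # 3 30 300
```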

FIG. 12 is a flowchart of a sample medical data generating method according to an embodiment.

Referring to FIG. 12, the sample medical data generating method according to an embodiment includes a result data receiving operation (S400), a result data input operation (S410), and a sample medical data generation operation (S420).

First, in the result data receiving operation (S400), the result data determining unit 140 receives the result data 40 from the GAN model unit 130.

Next, in the result data input operation (S410), the result data determining unit 140 inputs the result data 40 into the discriminant model. The discriminant model of an embodiment may be a 1-Dimension Convolution Neural Network (1DCNN), a Convolution Neural Network (CNN), or a Deep Neural Network (DNN). In an embodiment, the CNN and DNN discriminant models are compared using a random forest, but the embodiment is not limited to this and can be extended to various deep learning algorithms.

Then, in the sample medical data generation operation (S420), the result data determining unit 140 determines, as the sample medical data 50, those result data 40 that the discriminant model judges to be meaningful with respect to the disease.
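Operations S400 to S420 amount to filtering the result data through the discriminant model. The sketch below assumes the model returns a score between 0 and 1 and that rows at or above a threshold count as meaningful; the scoring function `toy_score`, the threshold of 0.5, and the data are hypothetical, since the text does not specify how the judgment is made.

```python
def select_sample_medical_data(result_data, score_fn, threshold=0.5):
    """Keep only the result rows that the discriminant model scores as meaningful."""
    return [row for row in result_data if score_fn(row) >= threshold]


def toy_score(row):
    """Hypothetical stand-in for a discriminant model (1DCNN, CNN, or DNN)."""
    return sum(row) / len(row)


result_data = [[0.9, 0.8], [0.1, 0.2], [0.6, 0.5]]      # toy generated rows
sample_medical_data = select_sample_medical_data(result_data, toy_score)
print(sample_medical_data)
```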

The various embodiments described herein may be implemented in a computer-readable recording medium using, for example, software, hardware, or a combination of the two.

In a hardware implementation, the embodiments described herein may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions. In some cases, such embodiments may be implemented by the control unit 280.

In a software implementation, embodiments such as procedures or functions may be implemented with separate software modules, each of which performs at least one function or operation. The software code may be implemented by a software application written in a suitable programming language. The software code may also be stored in the memory 260 and executed by the control unit 280.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, a central processing unit (CPU), a graphics processing unit (GPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, an application specific integrated circuit (ASIC), or any other device capable of executing and responding to instructions.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the described components of a system, structure, apparatus, circuit, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

Claims

A method for generating sample medical data using a Generative Adversarial Network (GAN), performed in a medical data generating device, the method comprising:
generating training data by using an average or standard deviation of feature values of clinical medical data;
training a generator by inputting the clinical medical data and the training data into a GAN model;
inputting the training data to the trained generator to generate result data; and
inputting the result data into a discriminant model to generate sample medical data,
wherein the GAN model consists of the generator and a discriminator, and
wherein the discriminator is a loss convergence model.

The method of claim 1,
wherein generating the training data comprises:
selecting a feature from a first object of the clinical medical data;
extracting a feature corresponding to the selected feature from a second object of the clinical medical data; and
generating a random number (latent space), which is the training data, by using the average or standard deviation of the extracted feature values.

The method of claim 2,
wherein the first object is a DNA gene, and
the second object is an RNA gene.

The method of claim 2,
wherein generating the random number comprises
generating the random number by using at least one of random sampling of the extracted feature values and an average value between samples.

The method of claim 1,
wherein the GAN model consists of
a CNN-based GAN model, a DNN-based GAN model, a Deep Convolutional GAN model, a CycleGAN model, or a StackGAN model.

The method of claim 1,
wherein the discriminant model consists of
a 1-Dimension Convolution Neural Network (1DCNN), a Convolution Neural Network (CNN), or a Deep Neural Network (DNN).

A medical data generating device comprising:
a training data generating unit configured to generate training data by using an average or standard deviation of feature values of clinical medical data;
a GAN model unit configured to input the clinical medical data and the training data into a GAN model to train a generator, and to input the training data to the trained generator to generate result data; and
a result determining unit configured to input the result data into a discriminant model to generate sample medical data,
wherein the GAN model consists of the generator and a discriminator, and
wherein the discriminator is a loss convergence model.

The device of claim 7,
wherein the training data generating unit is configured to
select a feature from a first object of the clinical medical data, extract a feature corresponding to the selected feature from a second object of the clinical medical data, and generate a random number (latent space), which is the training data, by using the average or standard deviation of the extracted feature values.

The device of claim 8,
wherein the first object is a DNA gene, and
the second object is an RNA gene.

The device of claim 8,
wherein the training data generating unit is configured to
generate the random number by using at least one of random sampling of the extracted feature values and an average value between samples.

The device of claim 7,
wherein the GAN model consists of
a CNN-based GAN model, a DNN-based GAN model, a Deep Convolutional GAN model, a CycleGAN model, or a StackGAN model.

The device of claim 7,
wherein the discriminant model consists of
a 1-Dimension Convolution Neural Network (1DCNN), a Convolution Neural Network (CNN), or a Deep Neural Network (DNN).

A computer program stored in a medium for executing the method of any one of claims 1 to 6 in combination with hardware.