KR20210050362A

KR20210050362A - Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same

Info

Publication number: KR20210050362A
Application number: KR1020190134867A
Authority: KR
Inventors: 김지헌; 심재용; 이근호; 유현; 이준현; 이준엽; 하지현
Original assignee: 주식회사 모비스
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2021-05-07

Abstract

Provided is an ensemble pruning method using positive determination results, comprising the steps of: receiving, by a computer device, training data; generating, by the computer device, learning models for constructing an ensemble model using the training data; and pruning, by the computer device, the learning models by selecting a plurality of them so that the ensemble model has an area under the ROC curve (AUC) greater than or equal to a reference value, based on determination results obtained by inputting the training data or other training data into the learning models.

Description

Ensemble model pruning method, ensemble model generation method and apparatus for detecting genetic scissors {ENSEMBLE PRUNING METHOD, ENSEMBLE MODEL GENERATION METHOD FOR IDENTIFYING PROGRAMMABLE NUCLEASES AND APPARATUS FOR THE SAME}

The technique described below relates to generating an ensemble model, and in particular to an ensemble pruning technique.

Machine learning models are used in a wide range of fields. A classifier is a learning model that makes a positive or negative determination on input data. A classifier can be implemented with various techniques, such as artificial neural networks and ensembles.

Ensemble methods is the generic term for techniques that use a plurality of learning algorithms in machine learning. Representative ensemble methods include bagging techniques, such as random forest, and boosting techniques.

Korean Patent Publication No. 10-2019-0022431

A classifier basically classifies input data into one of several classes. Depending on the type of application using the machine learning model, the negative data may outnumber the positive data, so that classification accuracy on the positive data is low. This is called the imbalanced classification problem. In such cases, a model whose classification accuracy is higher for positive data than for negative data may be required.

The technique described below aims to provide a method of generating an ensemble model that is effective for positive determination.

The ensemble pruning method includes: receiving, by a computer device, training data; generating, by the computer device, learning models for constructing an ensemble model using the training data; and pruning, by the computer device, the learning models by selecting a plurality of them so that the ensemble model has an AUC (area under the ROC curve) greater than or equal to a reference value, based on determination results obtained by inputting the training data or other training data into the learning models.

The apparatus for generating an ensemble model includes: an input device that receives training data; a storage device that stores the training data and a program for generating the ensemble model; and a computing device that uses the program to generate, from the training data, learning models for constructing the ensemble model, and that prunes the learning models by selecting a plurality of them so that the ensemble model has an AUC (area under the ROC curve) greater than or equal to a reference value, based on determination results obtained by inputting the training data or other training data into the learning models.

The technique described below uses ensemble pruning to select a group of learning models that is effective for positive determination. It can generate an ensemble model for selecting genetic scissors that are effective for a specific sequence.

FIG. 1 is an example of a service system using a learning model.
FIG. 2 is an example of a process by which an ensemble model operates.
FIG. 3 is an example of a process of generating an enhanced ensemble model.
FIG. 4 is an example of an apparatus for generating an enhanced ensemble model.
FIG. 5 is an example of verifying the effect of the enhanced ensemble model.
FIG. 6 is another example of verifying the effect of the enhanced ensemble model.

The technique described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the technique to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the technique described below.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of rights of the technique described below, a first component may be named a second component, and similarly a second component may be named a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

In this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "includes" mean that the stated features, numbers, steps, operations, components, parts, or combinations thereof are present, and should be understood not to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before the drawings are described in detail, it should be made clear that the division of components in this specification is merely a division by the main function each component is responsible for. That is, two or more components described below may be combined into one component, or one component may be divided into two or more components by more finely subdivided functions. In addition to its own main function, each component described below may additionally perform some or all of the functions of other components, and some of the main functions of each component may of course be carried out exclusively by another component.

In addition, when a method or an operating method is performed, the steps constituting the method may occur in an order different from the stated order unless a specific order is clearly stated in the context. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The terms used in the following description are explained below.

Machine learning is a field of artificial intelligence concerned with developing algorithms that allow computers to learn. A machine learning model, or learning model, is a model developed so that a computer can learn. Depending on the approach, there are various kinds of learning models, such as artificial neural networks and decision trees.

Ensemble methods is the generic term for techniques that use a plurality of learning algorithms in machine learning. Representative ensemble methods include bagging techniques, such as random forest, and boosting techniques.

Random forest is a kind of bagging algorithm built from a combination of CART decision trees. A random forest consists of a plurality of decision trees. Each decision tree is trained in advance on a randomly selected subset of the training data and of the feature variables. In a random forest, each tree determines the target variable individually, and the decisions of all trees are then aggregated to reach a final decision.

The following description focuses on ensemble models. The learning models constituting an ensemble model may be decision trees. They may also be models such as artificial neural networks.

The metrics used to evaluate a classification model are described next.

Accuracy is the proportion of correctly classified data among all data. Accuracy indicates how accurately the classifier classifies data.

Sensitivity is the proportion of actual positive data that the classifier classifies as positive. The true positive rate (TPR) is the same concept. Sensitivity indicates how accurately the model classifies positive data.

The false positive rate (FPR) is the proportion of actual negative data that the classifier classifies as positive.

The AUC (area under the ROC curve) is the probability that the classifier's predicted value for a positive datum is higher than its predicted value for a negative datum. It can also be obtained as the area under the ROC (receiver operating characteristic) curve. The ROC curve is drawn from the classifier's true positive rate and false positive rate.
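The metrics defined above can be illustrated with a minimal Python sketch. The function names, the toy labels and scores, and the 0.5 threshold are assumptions made for illustration; ties between a positive and a negative score are counted as one half, which is a common convention rather than something stated in this document.

```python
def confusion_rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)
    return accuracy, tpr, fpr


def auc_by_ranking(y_true, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two positives, three negatives.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

accuracy, tpr, fpr = confusion_rates(y_true, y_pred)
auc = auc_by_ranking(y_true, scores)
print(accuracy, tpr, fpr, auc)
```

Note that the AUC is computed from the scores and does not depend on the threshold, whereas accuracy, TPR, and FPR do.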

The technique described below is an ensemble pruning technique. It prunes the ensemble model on the basis of AUC rather than accuracy. The ensemble model described below is trained or pruned by considering only positive determinations, without taking true negative determinations into account. In the following description, the new type of model produced by this ensemble pruning is called an enhanced learning model (enhanced ensemble model). This is unrelated to reinforcement learning in the machine learning field.

In the technique described below, the individual members (learning models) constituting the ensemble model may be of various types. For example, the ensemble model may consist of a plurality of decision trees, or of a plurality of neural network models. Furthermore, the ensemble model may include a plurality of trees in which each node of a tree is itself a learning model. For convenience, the description below focuses on random forests. However, the ensemble pruning technique described below is not limited to any particular type of model.

FIG. 1 is an example of a service system 100 using a learning model. It shows a service system that predicts the effect of genetic scissors designed by a researcher. In FIG. 1, the analysis devices 130, 140, and 150 analyze a gene sequence and select genetic scissors that are effective for a target sequence.

In FIG. 1, the analysis devices are shown as a server 130 and computer terminals 140 and 150. The server 130 may provide a gene sequence analysis service over a network. The computer terminal 140 is connected to a network, receives a gene sequence, and analyzes it using an installed application. The computer terminal 150 receives input data from a medium storing the gene sequence (for example, a USB drive or an SD card) and analyzes the gene sequence using an installed application. The analysis devices 130, 140, and 150 may be implemented in various forms.

The analysis devices 130, 140, and 150 take a genetic scissors sequence and analyze the effect of that sequence. The sequence input to the analysis devices 130, 140, and 150 may be the entire genetic scissors sequence, or all or part of the guide RNA sequence.

The designer terminal 110 designs genetic scissors expected to be suitable for a specific target. That is, the designer terminal 110 designs the RNA sequence constituting the genetic scissors. The designer terminal 110 may store the generated genomic data in a separate DB 120.

Users 10, 20, and 30 can check the suitability of genetic scissors for a specific sequence. User 10 may access the server 130 through a user terminal (a PC, a smartphone, or the like) and check the analysis result produced by the server 130. User 20 can check the suitability of genetic scissors through the computer terminal 140 that he or she uses. User 30 can check the suitability of genetic scissors through the computer terminal 150 that he or she uses. Users 10, 20, and 30 may be researchers who study genetic scissors.

The analysis devices 130 and 140 analyze the gene sequence using a learning model. They predict the effect of the genetic scissors by feeding the input data into a learning model prepared in advance. The analysis devices 130 and 140 may use various learning models. For example, they may analyze the suitability of genetic scissors using an ensemble technique.

The individual models constituting the ensemble model may receive the gene sequence itself as input. An individual model may also receive information on the secondary or tertiary structure formed by the gene sequence. The secondary or tertiary structure can be generated with various tools that take a gene sequence and predict its structure (the RNAfold Web server, http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi, and the like). Alternatively, an individual model may receive the secondary or tertiary structure of the gene sequence in the form of an image and analyze it. In that case, the individual model may be an artificial neural network such as a CNN (convolutional neural network).

An ensemble is a model that uses a plurality of learning models and combines their prediction results to make a more accurate prediction. Ensemble techniques include bagging, boosting, random forest, and stacking. Criteria and tools for analyzing model performance include the misclassification rate, AUC, the ROC curve, and the lift chart.

FIG. 2 is an example of a process 200 by which an ensemble model operates. The analysis device inputs the input data into a previously trained ensemble model (210). The analysis device analyzes the input data using the ensemble model. The ensemble model 220 feeds the input data to the individual models, and each individual model outputs the result of its analysis of the input data.

FIG. 2 shows, as an example, a random forest consisting of N decision trees. Each decision tree in the random forest starts from the input data, makes a sequence of decisions, and outputs a final determination. In FIG. 2, decision tree A outputs the result High (for example, high suitability) for the object under analysis, and decision tree B outputs the result Low. The random forest makes its final determination by considering the outputs of all the decision trees. For example, the random forest may decide the final conclusion by majority vote.
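The aggregation step described for FIG. 2 can be sketched in a few lines of Python. The function name and the example votes are hypothetical; the tie-breaking behavior (the label seen first wins) follows from `collections.Counter` and is an assumption, since the document does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label output by the largest number of trees."""
    counts = Counter(votes)
    # most_common(1) yields the (label, count) pair with the highest count;
    # on a tie, the label encountered first is returned.
    return counts.most_common(1)[0][0]


# Hypothetical outputs of three decision trees for one input.
tree_outputs = ["High", "Low", "High"]
final_decision = majority_vote(tree_outputs)
print(final_decision)
```

With an odd number of trees and two classes, ties cannot occur, which is one reason odd ensemble sizes are often chosen in practice.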

The analysis device predicts the analysis result (classification result) for the input data on the basis of the information output by the random forest (230). For example, the analysis device may determine that the input gene sequence has high suitability as genetic scissors for the target sequence.

FIG. 3 is an example of a process 300 of generating an enhanced ensemble model. It is assumed that a computer device generates the ensemble model. A computer device means a device capable of data processing and computation, and its physical form may vary. A developer may generate the ensemble model using a computing device such as a PC, a server, or a smart device. Alternatively, the analysis device described with reference to FIG. 1 may generate the ensemble model it will use.

Training data must be prepared in advance. The training data is a set of a plurality of data items and may be the following set D: D = {(x_i, y_i) | i = 1, ..., n}, where x_i is an input value and y_i is the class label for that input value. If x is a candidate sequence for genetic scissors, then x_i ∈ {A, C, G, U}^20, and y_i ∈ {0, 1}, where 1 means true and 0 means false.
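As a minimal sketch of how an element of {A, C, G, U}^20 might be turned into a numeric input for a learning model, the following Python snippet uses one-hot encoding. One-hot encoding is an assumption for illustration; the document does not specify an encoding, and the sequence shown is hypothetical.

```python
ALPHABET = "ACGU"   # nucleotide alphabet used in the set definition
SEQ_LEN = 20        # sequence length from the definition of x_i


def one_hot(seq):
    """Encode a 20-nt RNA sequence as a flat list of 80 zeros and ones."""
    if len(seq) != SEQ_LEN:
        raise ValueError("expected a sequence of length 20")
    if any(base not in ALPHABET for base in seq):
        raise ValueError("sequence must contain only A, C, G, U")
    return [1.0 if base == letter else 0.0
            for base in seq for letter in ALPHABET]


# Hypothetical training pair (x_i, y_i).
x_i = "GACGUUACGGAUCCAUGCAA"
y_i = 1
features = one_hot(x_i)
print(len(features), features[:4])
```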

The individual models constituting the ensemble model may be models that predict the effect of genetic scissors. In this case, the input value may be the genetic scissors sequence or structural information on the genetic scissors. The computer device may generate information on the RNA secondary or tertiary structure using a program that predicts RNA structure from the genetic scissors sequence. The output value may be an effect score of the genetic scissors. The effect score can be computed using a known solution, for example GenomeCRISPR_full05112017 (https://www.dkfz.de/signaling/crispr-downloads/GENOMECRISPR/).

The computer device generates an initial ensemble model using the training data (310). That is, the computer device uses the training data to generate the individual learning models constituting the ensemble model.

The computer device trains each learning model on a randomly selected part of the data contained in the training data. Taking a random forest as an example, for each of the decision trees the computer device randomly selects training data and randomly selects feature variables, and trains the tree on them. In FIG. 3, the computer device selects D₁, D₂, ..., D_{m-1}, D_m from the training data set D and generates the learning models h₁(x), h₂(x), ..., h_{m-1}(x), h_m(x), respectively. The results of each learning model classifying the input data may be as shown in Table 1 below. Table 1 assumes five individual learning models.
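The random selection of training data and feature variables for each member can be sketched as follows. This is only an illustration under stated assumptions: rows are drawn with replacement (bootstrap sampling, as is usual in bagging), features are drawn once per member as the paragraph describes, and the function name and parameters are hypothetical.

```python
import random


def draw_member_inputs(n_samples, n_features, n_members,
                       feats_per_member, seed=0):
    """For each member, pick row indices (D_j) and feature indices."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Bootstrap sample: n_samples row indices drawn with replacement.
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Random feature subset drawn without replacement.
        cols = sorted(rng.sample(range(n_features), feats_per_member))
        members.append((rows, cols))
    return members


members = draw_member_inputs(n_samples=10, n_features=8,
                             n_members=5, feats_per_member=3)
for j, (rows, cols) in enumerate(members, start=1):
    print(f"D_{j}: rows={rows} features={cols}")
```

Each (rows, cols) pair would then be used to train one learning model h_j(x); the training step itself is omitted here.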

Table 1

i    y_i    h₁(x)    h₂(x)    h₃(x)    h₄(x)    h₅(x)    ŷ_i
1    1      1        0        1        1        0        1
2    1      0        0        1        0        1        0
3    0      0        0        0        0        1        0
4    0      0        0        1        0        0        0
5    0      1        0        0        0        0        0

y_i is the actual class label of the input data, and ŷ_i is the value predicted by the learning model. ŷ_i may be determined by majority vote. That is, ŷ_i can be expressed as Equation 1 below.

Referring to Table 1, the ensemble model classified input data 1 correctly but made a wrong determination (a false negative) for input data 2. If, for the application, only correct determination of true data (true positives) matters, the learning models constituting the ensemble need to be pruned. In terms of Table 1, it suffices to build an ensemble consisting only of h₃(x), h₄(x), and h₅(x). Ensemble pruning for increasing the AUC is described below. As stated above, a model built by such ensemble pruning is called an enhanced learning model or enhanced ensemble model.
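The effect of pruning on the Table 1 example can be checked with a short Python sketch. The member predictions below are read from Table 1 under the assumption that its columns are, in order, i, the actual label, h₁(x) to h₅(x), and the ensemble prediction; a strict majority is assumed for the vote. Variable and function names are hypothetical.

```python
# Actual labels y_i for the five inputs of Table 1.
y = [1, 1, 0, 0, 0]

# Predictions of each member on inputs 1..5 (assumed column reading).
preds = {
    "h1": [1, 0, 0, 0, 1],
    "h2": [0, 0, 0, 0, 0],
    "h3": [1, 1, 0, 1, 0],
    "h4": [1, 0, 0, 0, 0],
    "h5": [0, 1, 1, 0, 0],
}


def vote(member_names):
    """Strict-majority vote of the named members on every input."""
    k = len(member_names)
    out = []
    for i in range(len(y)):
        ones = sum(preds[name][i] for name in member_names)
        out.append(1 if 2 * ones > k else 0)
    return out


full = vote(["h1", "h2", "h3", "h4", "h5"])
pruned = vote(["h3", "h4", "h5"])
print("full  :", full)    # input 2 is a false negative
print("pruned:", pruned)  # matches y on all five inputs
```

Under this reading, the full ensemble misses the positive input 2, while the three-member ensemble recovers it without introducing errors on the negative inputs.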

Ensemble pruning techniques include ordering-based pruning, clustering-based pruning, and optimization-based pruning. A pruning technique for generating the enhanced ensemble model is described below.

The computer device selects the individual models that minimize a given objective function value. Put the other way round, the computer device removes the models that do not satisfy that criterion. The technique below is a form of optimization-based pruning.

The computer device may also use a genetic algorithm to construct the set of classifiers that make up the ensemble model. Briefly, a genetic algorithm consists of the following steps. (i) The initialization step generates a population of genes for solving the problem. (ii) The selection step selects, for the next generation, the successful genes that come close to solving the problem. (iii) The genetic operation step recombines the genes chosen in the selection step. (iv) The termination step verifies whether the termination condition is satisfied. If the condition is not satisfied, the process is repeated for the next generation.

In terms of the ensemble model, the initialization step is the process of generating an initial ensemble model using the training data set. The selection step is the process of selecting only specific learning models using a pruning technique. The genetic operation step can be regarded as the process of constructing an ensemble model from only the learning models selected by pruning. The termination step can be regarded as the process of verifying whether the ensemble model attains a certain objective function value.
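The four steps above can be sketched as a small genetic algorithm over 0/1 selection masks. This is a generic illustration, not the patented procedure: the toy data, the toy objective (majority-vote errors plus a small size penalty, standing in for the document's objective function), the one-point crossover, the bit-flip mutation, and the fixed generation budget are all assumptions.

```python
import random

# Hypothetical toy data: labels and 0/1 predictions of five members.
Y = [1, 1, 0, 0, 0]
PREDS = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]


def objective(mask):
    """Toy objective (assumption): vote errors + 0.01 * ensemble size."""
    chosen = [p for p, z in zip(PREDS, mask) if z == 1]
    if not chosen:
        return float("inf")  # an empty ensemble is not allowed
    errors = 0
    for i, label in enumerate(Y):
        ones = sum(p[i] for p in chosen)
        errors += (1 if 2 * ones > len(chosen) else 0) != label
    return errors + 0.01 * len(chosen)


def genetic_prune(n_models, objective, pop_size=20, generations=30,
                  mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # (i) Initialization: the unpruned ensemble plus random masks.
    population = [[1] * n_models] + [
        [rng.randint(0, 1) for _ in range(n_models)]
        for _ in range(pop_size - 1)
    ]
    best = min(population, key=objective)
    for _ in range(generations):  # (iv) termination: generation budget
        # (ii) Selection: keep the better half of the population.
        population.sort(key=objective)
        parents = population[: pop_size // 2]
        # (iii) Genetic operations: crossover and mutation.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_models)
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
        best = min(population + [best], key=objective)
    return best


best_mask = genetic_prune(len(PREDS), objective)
print(best_mask, objective(best_mask))
```

Because the all-ones mask is part of the initial population and the best mask is tracked across generations, the returned selection is never worse, under the toy objective, than the unpruned ensemble.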

Pruning for selecting the next generation is described next. The computer device may select the individual models that minimize a given objective function value, as in Equation 2 below.

z indicates whether each individual learning model (classifier) constituting the ensemble is selected. If the i-th classifier is selected, its z value is 1; otherwise it is 0. g₁ is a value related to the AUC (area under the curve) of the ROC (receiver operating characteristic) curve, which is one of the performance metrics for the imbalanced classification problem. g₂ is a value related to the number of learning models finally selected from among those making up the ensemble. g₃ is a diversity measure of the finally selected learning models. The individual terms of the objective function are described below.
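Since the image of Equation 2 is not available in this text, the following LaTeX is only a hedged sketch of the general shape implied by the surrounding description: a binary selection vector z over the L candidate models and an objective formed from the three terms. Any weights or constraints in the original equation cannot be recovered and are not shown.

```latex
\min_{z \in \{0,1\}^{L}} \; g_1(z) + g_2(z) + g_3(z),
\qquad
z_i =
\begin{cases}
1 & \text{if the } i\text{-th learning model is selected},\\
0 & \text{otherwise}.
\end{cases}
```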

g₁ is 1 - AUC. Since the objective is to make g₁ + g₂ + g₃ as small as possible, minimizing g₁ corresponds to maximizing the AUC. The two terms used to compute it can be expressed by Equation 4 and Equation 5 below, respectively.

In Equation 4, the data considered are the input data whose actual label is true (positive), and the resulting value can be regarded as a measure (score) of how well the selected learning models z classify the true data.

In Equation 5, the data considered are the input data whose actual label is false (negative), and the resulting value can be regarded as a measure (score) of how well the selected learning models z classify the false data.
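A minimal Python sketch of how a term of the form 1 - AUC could be evaluated for a selection mask z is shown below. The exact scores of Equations 4 and 5 are not recoverable from this text, so the sketch assumes that the ensemble score of an input is the fraction of selected members voting positive, and that the AUC is the probability that a positive input outscores a negative one (ties count one half). The data and names are hypothetical.

```python
Y = [1, 1, 0, 0, 0]
PREDS = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]


def ensemble_scores(preds, mask):
    """Assumed score: fraction of selected members predicting 1."""
    chosen = [p for p, z in zip(preds, mask) if z == 1]
    return [sum(p[i] for p in chosen) / len(chosen)
            for i in range(len(chosen[0]))]


def g1(y_true, preds, mask):
    """Return 1 - AUC of the ensemble defined by the mask."""
    scores = ensemble_scores(preds, mask)
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return 1.0 - wins / (len(pos) * len(neg))


g1_full = g1(Y, PREDS, [1, 1, 1, 1, 1])
g1_single = g1(Y, PREDS, [1, 0, 0, 0, 0])
print(g1_full, g1_single)
```

The mask must select at least one model; an empty selection is not handled in this sketch.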

g₂ is a measure of the number of learning models finally selected to form the ensemble.

g₃ is a diversity measure among the finally selected classifiers. Various quantities can be used as the diversity measure. Equation 7 is an example based on the Q-statistic.

L is the total number of individual learning models constituting the ensemble. The variables used in Equation 7 are given in Equation 8 and Equation 9 below, respectively.

[symbol] is the number of objects correctly classified by both classifiers u and v.

[symbol] is the number of objects misclassified by both classifiers u and v.

[symbol] is the number of objects correctly classified by classifier u and misclassified by classifier v.

[symbol] is the number of objects misclassified by classifier u and correctly classified by classifier v.
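The four counts described above are the inputs of the pairwise Q-statistic. The sketch below uses the commonly cited form Q = (N11·N00 - N01·N10) / (N11·N00 + N01·N10); the names `n11`, `n00`, `n10`, `n01` and the function `q_statistic` are assumptions of this sketch, and the exact form and normalization of Equation 7 may differ.

```python
def q_statistic(correct_u, correct_v):
    """Pairwise Q-statistic for two classifiers u and v.

    correct_u[k] / correct_v[k] are True when the classifier labeled
    object k correctly. Q is near 1 when the two classifiers succeed and
    fail on the same objects, and near -1 when they err on different ones.
    """
    n11 = n00 = n10 = n01 = 0
    for cu, cv in zip(correct_u, correct_v):
        if cu and cv:
            n11 += 1          # both correct
        elif not cu and not cv:
            n00 += 1          # both wrong
        elif cu:
            n10 += 1          # u correct, v wrong
        else:
            n01 += 1          # u wrong, v correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0            # undefined case; returning 0 is an assumption
    return (n11 * n00 - n01 * n10) / denom
```

A lower Q between selected classifiers indicates that their errors overlap less, which is the sense in which a diversity term can favor complementary models.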

The computer device may prune the initial ensemble model using the optimization technique or the genetic algorithm described above (220). FIG. 3 shows the learning models selected by pruning. The final ensemble model consists of [symbol], ..., [symbol].

The computer device may perform pruning by evaluating the individual learning models with the training data that was initially input. Alternatively, the computer device may perform pruning by evaluating the individual learning models with a separate training data set other than the initial training data.

The computer device may classify input data using the final ensemble model (230). The computer device may derive a final analysis result by combining the classification results of the selected individual learning models that constitute the ensemble model. The final analysis result may be computed by majority vote, as in Equation 10 below.

x_new denotes the input data to be analyzed.
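The majority-vote combination can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Equation 10: each model is taken to be a callable returning 1 (positive) or -1 (negative), `z` is the 0/1 selection vector, and breaking ties toward the positive class is a choice made only for this sketch.

```python
def majority_vote(models, z, x_new):
    """Classify x_new by majority vote of the selected models.

    models: callables mapping an input to 1 (positive) or -1 (negative).
    z:      selection flags, z[i] == 1 if models[i] survived pruning.
    """
    selected = [m for m, zi in zip(models, z) if zi == 1]
    if not selected:
        raise ValueError("no learning model is selected")
    total = sum(m(x_new) for m in selected)
    # Tie-breaking toward the positive class is an assumption.
    return 1 if total >= 0 else -1
```

For example, with three toy models that always vote 1, -1, and -1, selecting all three yields -1, while selecting only the first yields 1.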

FIG. 4 is an example of an enhanced ensemble model generation device 400. The model generation device 400 generates an enhanced ensemble model using the pruning technique described with reference to FIG. 3. The model generation device 400 may be physically implemented in various forms. For example, the model generation device 400 may take the form of a PC, a network server, a chipset dedicated to image processing, a smart device, or the like.

The model generation device 400 includes a storage device 410, a memory 420, a computing device 430, an interface device 440, a communication device 450, and an output device 460.

The storage device 410 may store a training data set for generating an ensemble model. The storage device 410 may store a program for generating an ensemble model. The storage device 410 may also store other programs or source code required for data processing. The storage device 410 stores the generated ensemble model.

The memory 420 may store data and information generated while the model generation device 400 analyzes received data.

The interface device 440 is a device that receives certain commands and data from the outside. The interface device 440 may receive genomic data from a physically connected input device or an external storage device. The interface device 440 may receive code or a program for generating the individual learning models that constitute the ensemble. The interface device 440 may receive code or a program for ensemble pruning. The interface device 440 may also receive training data, information, and parameter values for training the learning models.

The communication device 450 is a component that receives and transmits certain information over a wired or wireless network. The communication device 450 may receive training data, input data, and the like from an external object. The communication device 450 may receive a program or code for generating an ensemble. The communication device 450 may also transmit the generated ensemble model to an external object.

The communication device 450 or the interface device 440 is a device that receives certain data or commands from the outside. The communication device 450 or the interface device 440 may be referred to as an input device.

The output device 460 is a device that outputs certain information. The output device 460 may output an interface required for data processing, analysis results, and the like.

The computing device 430 may generate an ensemble model using the program stored in the storage device 410. The computing device 430 may perform ensemble pruning using the program stored in the storage device 410. The computing device 430 may be a device that processes data and performs certain operations, such as a processor, an AP, or a chip with an embedded program.

The computing device 430 may also analyze input data using the generated ensemble model. In this case, the model generation device 400 corresponds to the analysis devices 130, 140, and 150 in FIG. 1.

The results of verifying the effect of the enhanced ensemble model are described below.

FIG. 5 is an example of verifying the effect of the enhanced ensemble model. FIG. 5 shows experimental results obtained with data sets provided by the UCI machine learning repository; 115 imbalanced data sets were used. Performance was estimated by repeating the evaluation 30 times, each time using 70% of the data for training and the remaining 30% for testing.
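The evaluation protocol (30 repetitions of a 70% / 30% split) can be sketched as below. The function name `repeated_holdout`, the `evaluate` callback, and the plain random shuffle are assumptions of this sketch; the source does not say whether the splits were stratified by class.

```python
import random


def repeated_holdout(n_samples, evaluate, repeats=30, train_frac=0.7, seed=0):
    """Run `evaluate(train_idx, test_idx)` on `repeats` random splits.

    evaluate is a caller-supplied function (hypothetical here) that trains
    on the training indices and returns a score such as the test AUC.
    """
    rng = random.Random(seed)
    cut = int(round(train_frac * n_samples))
    scores = []
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                      # new random split each repeat
        scores.append(evaluate(idx[:cut], idx[cut:]))
    return scores
```

The returned list of 30 scores per method is the kind of sample that a per-data-set significance test can then compare.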

The baseline for comparison is the AdaBoost+CART model. In AdaBoost, boosting was repeated 500 times, and each CART tree was built with a depth chosen at random from 1 to 10.

The enhanced ensemble model was constructed using the genetic algorithm described above. The population consisted of 300 chromosomes and was evolved for 1,000 generations. Selection used linear ranking. Crossover was single-point with probability 0.8, and mutation was uniform random with probability 0.1.

In FIG. 5, the x-axis shows the AUC of AdaBoost+CART for each data set, and the y-axis shows the AUC of the enhanced ensemble model. AUCP denotes the enhanced ensemble model. For 73 of the 115 data sets, the mean AUC was higher after pruning. That is, the enhanced ensemble model was found to outperform AdaBoost+CART. Welch's t-test (significance level 0.01) indicated that the improvement was statistically significant for 68 data sets.
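For reference, Welch's t-test compares two samples without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom from two lists of AUC values; the function name `welch_t` is an assumption, and obtaining the p-value to compare against 0.01 additionally requires the t-distribution (from a table or a statistics library), which is not shown.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    va = variance(a) / len(a)     # squared standard error of mean(a)
    vb = variance(b) / len(b)     # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In the setting described above, `a` and `b` would be the 30 AUC values of the pruned ensemble and of the baseline on one data set.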

FIG. 6 is another example of verifying the effect of the enhanced ensemble model. FIG. 6 uses data on genetic scissors. That is, it verifies a model that identifies whether a particular genetic scissors sequence is effective against a particular target sequence.

The data set was based on CRISPR-Cas9. There were 5,759 samples in the positive class and 1,672 in the negative class. The features were 20 characters (the sequence). Performance was estimated by repeating the evaluation 30 times, each time using 70% of the data for training and the remaining 30% for testing.

The enhanced ensemble model was constructed using the genetic algorithm described above. The population consisted of 300 chromosomes and was evolved for 1,000 generations. Selection used linear ranking. Crossover was single-point with probability 0.8, and mutation was uniform random with probability 0.1. Elitism was applied, preserving the top 5% of chromosomes.
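The genetic-algorithm settings listed above can be put together in a minimal sketch over binary chromosomes, where each gene plays the role of a selection flag z. Everything named here is an assumption for illustration: the function `evolve`, its parameters, the `fitness` callback (a stand-in for g₁ + g₂ + g₃, lower is better), and the reading of "uniform random mutation with probability 0.1" as flipping one randomly chosen gene in a chromosome with probability 0.1.

```python
import random


def evolve(fitness, n_genes, pop_size=300, generations=1000,
           p_cross=0.8, p_mut=0.1, elite_frac=0.05, seed=0):
    """Minimize `fitness` over 0/1 chromosomes of length n_genes."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        pop.sort(key=fitness)                       # best (lowest) first
        # Linear rank selection: weight falls linearly from best to worst.
        weights = [pop_size - rank for rank in range(pop_size)]
        nxt = [c[:] for c in pop[:n_elite]]         # elitism: keep the top share
        while len(nxt) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            child = a[:]
            if n_genes > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)   # single-point crossover
                child = a[:cut] + b[cut:]
            if rng.random() < p_mut:                # flip one random gene
                k = rng.randrange(n_genes)
                child[k] = 1 - child[k]
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)
```

Because the elite chromosomes are copied unchanged, the best fitness in the population never gets worse from one generation to the next.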

The baseline for comparison is the AdaBoost+CART model. In AdaBoost, boosting was repeated 500 times, and each CART tree was built with a depth chosen at random from 1 to 10.

FIG. 6 shows that the AUC of the model using the enhanced ensemble pruning (AUCP) is noticeably higher than that of the AdaBoost+CART model.

In addition, the classification model generation method, the ensemble generation method, and the ensemble pruning method described above may be implemented as a program (or application) containing an algorithm executable on a computer. The program may be stored in and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored in and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB drive, memory card, or ROM.

The present embodiments and the accompanying drawings merely illustrate part of the technical idea included in the above-described technology. It is apparent that all modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings of the above-described technology fall within the scope of rights of the above-described technology.

Claims

An ensemble model generation method for detecting genetic scissors, comprising:
receiving, by a computer device, training data;
generating, by the computer device, learning models for constructing an ensemble model by using the training data; and
pruning, by the computer device, a plurality of learning models among the learning models so that the ensemble model has an area under the ROC curve (AUC) greater than or equal to a reference value, based on determination results obtained by inputting the training data or other training data into the learning models,
wherein the learning models are models for classifying genetic scissors specific to a specific sequence.

The method of claim 1,
wherein the computer device selects the plurality of learning models such that the ensemble model has a maximum AUC value according to optimization-based pruning.

The method of claim 1,
wherein the computer device selects the plurality of learning models such that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate is maximized.

The method of claim 1,
wherein the computer device selects the plurality of learning models using a genetic algorithm such that the objective function g₁ + g₂ + g₃ of the genetic algorithm is minimized.
(g₁ is 1 - AUC, g₂ is the number of selected learning models, g₃ is the diversity of the plurality of learning models, and AUC is the area under the curve of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate.)

The method of claim 1,
wherein the training data includes information on a secondary structure or a tertiary structure of the gene sequence constituting the genetic scissors acting on the specific sequence.

An ensemble pruning method comprising:
receiving, by a computer device, training data;
generating, by the computer device, learning models for constructing an ensemble model by using the training data; and
pruning, by the computer device, a plurality of learning models among the learning models so that the ensemble model has an area under the ROC curve (AUC) greater than or equal to a reference value, based on determination results obtained by inputting the training data or other training data into the learning models.

The method of claim 6,
wherein the computer device selects the plurality of learning models such that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate is maximized.
([symbol] is the value predicted by the learning model, 1 is positive, -1 is negative, and θ is the threshold value.)

The method of claim 6,
wherein the computer device selects the plurality of learning models such that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate is maximized.

The method of claim 6,
wherein the computer device selects the plurality of learning models using a genetic algorithm such that the objective function g₁ + g₂ + g₃ of the genetic algorithm is minimized.
(g₁ is 1 - AUC, g₂ is the number of selected learning models, g₃ is the diversity of the plurality of learning models, and AUC is the area under the curve of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate.)

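An ensemble model generation apparatus comprising:
an input device for receiving training data;
a storage device for storing the training data and a program for generating an ensemble model; and
a computing device that, using the program, generates learning models for constructing an ensemble model from the training data and prunes a plurality of learning models among the learning models so that the ensemble model has an area under the ROC curve (AUC) greater than or equal to a reference value.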

The apparatus of claim 10,
wherein the computing device selects the plurality of learning models such that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate is maximized.

The apparatus of claim 10,
wherein the computing device selects the plurality of learning models using a genetic algorithm such that the objective function g₁ + g₂ + g₃ of the genetic algorithm is minimized.
(g₁ is 1 - AUC, g₂ is the number of selected learning models, g₃ is the diversity of the plurality of learning models, and AUC is the area under the curve of the receiver operating characteristic (ROC) curve for the true positive rate and the false positive rate.)