KR20230054167A

KR20230054167A - Method for generating machine learning model and device therefor

Info

Publication number: KR20230054167A
Application number: KR1020210137821A
Authority: KR
Inventors: 최유리; 장 피에 로말리자
Original assignee: 주식회사 솔리드웨어
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2023-04-24
Also published as: WO2023063486A1

Abstract

Disclosed are a method for generating a machine learning model and a device therefor. A model generating device clusters a plurality of data samples into a plurality of clusters, searches for a representative data sample of each cluster, receives a label or score for each representative data sample as input, and then trains a machine learning model by semi-supervised learning using the labeled or scored data samples. The present invention can thereby provide a machine learning model capable of performing regression analysis or classification.

Description

Method for generating machine learning model and device therefor

An embodiment of the present invention relates to a method and device for generating a machine learning model, and more particularly, to a method and device for generating a machine learning model using unlabeled data.

Most machine learning is carried out in the domain of supervised learning. Training in supervised learning requires a target variable (that is, an objective variable) in the training data. However, most existing data has no target variable. In other words, since most data is unlabeled, supervised learning requires a complex data processing step for labeling.

Unsupervised learning can be used to visualize and understand the structure of data. Unsupervised learning has the advantage of not requiring a target variable. However, because there is no target variable, very different clustering results may be obtained for the same dataset depending on the data scientist or domain expert. For example, the user can influence key parameters such as the number of clusters, the distance measure used to compare data samples, the criteria for evaluating clustering results, and the interpretation of clustering results.

There are several approaches that use unsupervised learning for feature engineering in supervised learning. These approaches assume that data samples with the same cluster label have a similar relationship to the target value of the supervised learning model, so, as in unsupervised learning, it is often difficult to identify the number of clusters and other parameters.

Semi-supervised learning is used to build regression or classification models by training on datasets with partially labeled samples. In semi-supervised learning, the number of labeled samples is far smaller than in supervised learning. Therefore, when traditional supervised learning approaches are applied to such datasets, it is very difficult to find a reliable relationship between input variables and target variables (regression problems) or good decision boundaries between clusters (or classes) (classification problems).

A technical problem to be solved by an embodiment of the present invention is to provide a method and device for generating a machine learning model capable of performing the regression analysis or classification targeted by the user, based on unlabeled data and limited information obtained from the user.

To achieve the above technical problem, an example of a method for generating a machine learning model according to an embodiment of the present invention includes: clustering a plurality of data samples into a plurality of clusters; searching for a representative data sample of each cluster; receiving a label or score for the representative data sample; and training a machine learning model by semi-supervised learning using the data samples to which labels or scores have been assigned.

To achieve the above technical problem, an example of a model generating device according to an embodiment of the present invention includes: a clustering unit that classifies a plurality of data samples into a plurality of clusters; a sample search unit that searches for a representative data sample of each cluster; a labeling unit that receives a label or score for the representative data sample; and a learning unit that trains a machine learning model by semi-supervised learning using the data samples to which labels or scores have been assigned.

According to an embodiment of the present invention, a machine learning model capable of performing the regression analysis or classification targeted by the user with a high level of accuracy can be generated based on unlabeled data and limited information obtained from the user.

FIG. 1 is a diagram schematically showing the configuration of an example of a model generating device according to an embodiment of the present invention;
FIG. 2 is a flowchart showing an example of a method for generating a machine learning model according to an embodiment of the present invention;
FIG. 3 is a diagram showing an example of clustering according to an embodiment of the present invention;
FIGS. 4 and 5 are diagrams showing an example of a method for searching for a representative data sample according to an embodiment of the present invention;
FIG. 6 is a diagram showing an example of a semi-supervised learning method according to an embodiment of the present invention;
FIG. 7 is a diagram showing another example of a semi-supervised learning method according to an embodiment of the present invention;
FIG. 8 is a diagram showing an example of a method for evaluating a supervised learning model used in semi-supervised learning according to an embodiment of the present invention; and
FIG. 9 is a diagram showing the configuration of an example of a model generating device according to an embodiment of the present invention.

Hereinafter, a method for generating a machine learning model and a device therefor according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically showing the configuration of an example of a model generating device according to an embodiment of the present invention.

Referring to FIG. 1, the model generating device 100 trains and generates a machine learning model 120 using an unlabeled dataset 110. The machine learning model 120 may be a regression analysis model or a classification model desired by the user. Unsupervised learning is generally applied to a dataset 110 that has no values for the target variable (that is, an unlabeled dataset). However, because a model trained by unsupervised learning suffers from low accuracy, the present embodiment presents a method of training and generating the machine learning model 120 through semi-supervised learning, based on the unlabeled dataset 110 and a minimal amount of information provided by the user. This is examined in detail with reference to FIG. 2 and the subsequent figures.

FIG. 2 is a flowchart showing an example of a method for generating a machine learning model according to an embodiment of the present invention.

Referring to FIGS. 1 and 2 together, the model generating device 100 clusters the plurality of data samples constituting the dataset 110 into a plurality of clusters (S200). Here, a data sample may be data composed of variable values for a plurality of variables. The model generating device 100 may perform the clustering using various unsupervised learning models (for example, k-means), and an example of this is shown in FIG. 3.
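The clustering step S200 can be illustrated with a minimal k-means sketch. This is only an illustration under stated assumptions, not the patent's implementation: the function name `kmeans`, its parameters, and the random initialization are hypothetical, and each data sample is assumed to be a tuple of numeric variable values.

```python
import math
import random


def kmeans(samples, k, iters=50, seed=0):
    """Cluster `samples` (equal-length numeric tuples) into k groups.

    Hypothetical helper for illustration; returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)  # initial centroids drawn from the data
    labels = [0] * len(samples)
    for _ in range(iters):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(s, centroids[c]))
            for s in samples
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

The number of clusters `k` is passed in explicitly here; the text notes that it may also be chosen automatically or refined with user feedback.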

The model generating device 100 searches for representative data samples of the plurality of clusters (S210). A representative data sample is a data sample that best reflects the characteristics of the clusters. As an example, the model generating device 100 may search for the representative data sample of each cluster based on data density. An embodiment of a specific method for searching for representative data samples is shown in FIGS. 4 and 5.

The model generating device 100 receives a label or score for each representative data sample from the user (S220). For example, to generate a machine learning model 120 that classifies data into N clusters, the user may assign to a representative data sample a label identifying which of the N clusters that sample belongs to. Alternatively, to generate a regression machine learning model that predicts the relationship between input values and output values, the user may assign a score (for example, a value between 0 and 1) indicating which output value the representative data sample corresponds to. In addition, various conventional labeling methods exist, and the corresponding methods of assigning a label or score may be applied to the present embodiment.

To assist the user in assigning a label or score to a representative data sample, the model generating device 100 may visualize and display the representative data samples using a table or various charts or plots. For example, when displaying the data samples belonging to the plurality of clusters using a data table, a parallel coordinates plot, or a projection plot, the model generating device 100 may display the representative data samples distinguishably alongside them, so that the user can easily determine whether each representative data sample has been classified into the appropriate cluster and can assign a label or score.

The model generating device 100 trains the machine learning model 120 by a semi-supervised learning method using the data samples to which labels or scores have been assigned (S230). That is, the model generating device 100 performs semi-supervised learning using a dataset in which labels or scores have been assigned only to the representative data samples. To increase the accuracy of the semi-supervised learning, the model generating device 100 may perform semi-supervised learning using multi-view learning, examples of which are shown in FIGS. 6 to 8.

FIG. 3 is a diagram showing an example of clustering according to an embodiment of the present invention.

Referring to FIG. 3, the model generating device 100 may classify a dataset 110 composed of a plurality of data samples into a plurality of clusters 310 using an unsupervised learning model 300. The number of clusters 310 may be set in advance by a user or the like, or may be set automatically. As another example, the model generating device 100 may reflect the user's feedback in the clustering to create the plurality of clusters desired by the user. For example, the plurality of clusters may be generated using the method disclosed in Patent Application No. 10-2020-0163344, "Method for reflecting user intention in unsupervised learning and device therefor".

FIGS. 4 and 5 are diagrams showing an example of a method for searching for a representative data sample according to an embodiment of the present invention.

Referring to FIGS. 4 and 5 together, the model generating device 100 may search for N vectors by performing vector quantization on the plurality of data samples (S400). For example, when a data sample is composed of variable values for m variables, each data sample may be represented as an m-dimensional vector (that is, a feature vector). The range of each variable value of the data sample may be normalized before the sample is represented as a vector, and various other conventional methods of representing a data sample as a vector may also be applied to the present embodiment. Vector quantization is the process of mapping K feature vectors to N (< K) vectors, and the number N of vectors may be set by the user or set automatically. For example, the K data samples in the dataset are represented as K feature vectors, and N vectors can be found from the K feature vectors through vector quantization. Since vector quantization itself is already widely known in the field of data mining, further description of it is omitted.

Once the N vectors have been found, the model generating device 100 selects representative data samples based on those vectors (S410). Under the hypothesis that the distance between clusters should be greater than the distance between data samples within a cluster, it is desirable to find representative data samples based on points where the local density of data points is high. For example, the model generating device 100 identifies the feature vector located closest to each of the N vectors and selects the data sample corresponding to that feature vector as a representative data sample. That is, N representative data samples can be extracted from the N vectors.
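The selection in S410 can be sketched as a nearest-neighbor lookup against the quantized vectors. This is a minimal sketch assuming Euclidean distance and a codebook that has already been produced by some vector quantizer; the name `pick_representatives` is hypothetical and does not come from the source.

```python
import math


def pick_representatives(features, codebook):
    """Return, for each codebook vector, the index of the closest feature vector.

    `features` are the K feature vectors of the data samples and `codebook`
    holds the N quantized vectors (both lists of equal-length numeric tuples).
    Duplicate picks are skipped so that each sample is selected at most once.
    """
    representatives = []
    for code in codebook:
        nearest = min(
            range(len(features)),
            key=lambda i: math.dist(features[i], code),
        )
        if nearest not in representatives:
            representatives.append(nearest)
    return representatives
```

Returning indices rather than vectors keeps the link back to the original data samples, which is what the user is later asked to label or score.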

The model generating device 100 receives a label or score for each representative data sample from the user (S420). The model generating device 100 may provide a user interface through which the label or score can be entered. For example, the model generating device 100 may display the data samples of each cluster of the dataset distinguishably through a projection plot or the like, and may also display the representative data samples distinguishably, thereby helping the user assign an appropriate label or score to each representative data sample.

In another embodiment, the model generating device 100 may also receive a confidence level when receiving a label or score from the user. For example, the model generating device 100 may receive the confidence level in various forms, such as a numeric variable between 0.1 and 1 (0.1: low confidence, 1: high confidence), a nominal variable with distinct values (for example, low, medium, high), or a fuzzy set (for example, low, medium, high).

If a predefined stop condition is met, the model generating device 100 ends the process of searching for representative data samples; otherwise, it repeats the search process (S430).

When repeating the search for representative data samples, the model generating device 100 does not use the dataset as it is, but uses the dataset after removing the representative data samples and the data samples around them (S440). For example, the model generating device 100 may remove at least one neighboring data sample in order of proximity to each representative data sample. Proximity can be determined using the distance between the feature vectors of the data samples (for example, the Euclidean distance). Removing the representative data samples and their neighboring data samples prevents the same representative data sample from being found again when the search is repeated.

The model generating device 100 removes the representative data samples and their neighboring data samples, and repeats the process of searching for N vectors over the remaining data samples (S400 to S420) until the stop condition is satisfied.
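The repeated search with removal (S400, S410, S430, S440) can be sketched as the loop below. It is a sketch under several assumptions: the quantizer and the stop condition are passed in as hypothetical callables `quantize` and `stop`, the user labeling step S420 is omitted, and a fixed number `n_neighbors` of nearest samples is removed around each representative.

```python
import math


def search_representatives(features, quantize, n_neighbors, stop, max_rounds=100):
    """Iteratively pick representative samples, removing each pick and its neighbors.

    features    : list of feature vectors (numeric tuples)
    quantize    : callable mapping a list of vectors to a list of codebook vectors
    n_neighbors : how many nearest samples to remove around each representative
    stop        : callable taking the chosen indices and returning True to finish
    """
    remaining = list(range(len(features)))  # indices still available
    chosen = []
    for _ in range(max_rounds):
        if not remaining:
            break
        codebook = quantize([features[i] for i in remaining])  # S400
        for code in codebook:  # S410
            if not remaining:
                break
            rep = min(remaining, key=lambda i: math.dist(features[i], code))
            chosen.append(rep)
            # S440: remove the representative and its closest remaining neighbors.
            others = sorted(
                (i for i in remaining if i != rep),
                key=lambda i: math.dist(features[i], features[rep]),
            )
            removed = {rep, *others[:n_neighbors]}
            remaining = [i for i in remaining if i not in removed]
        if stop(chosen):  # S430
            break
    return chosen
```

Because each pick is removed together with its neighbors, later rounds are pushed toward regions of the data that have not yet produced a representative.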

The stop condition that determines whether to end the search for representative data samples may be set in various ways depending on the embodiment. For example, the stop condition may be that at least one representative data sample has been selected for each cluster. When N clusters exist as in FIG. 3, the model generating device 100 performs the first search for representative data samples and checks whether representative data samples have been found for all N clusters. For example, if no representative data sample has been found for clusters 4, 6, and 9, the model generating device 100 performs a second search and checks whether representative data samples have now been found for the previously uncovered clusters 4, 6, and 9. If representative data samples have been found for all of clusters 4, 6, and 9, the model generating device 100 ends the search; otherwise, it repeats the search again.

Other examples of the stop condition include: the number of representative data samples is greater than or equal to a predefined number; each label in a predefined label set (that is, each label of the clusters into which the user wants to classify) has been assigned to at least one data sample; the minimum and maximum values of a predefined score range (that is, the range of predicted values of the regression analysis model) have each been assigned to at least one data sample; or a stop request has been received from the user. In addition, various other stop conditions may be set depending on the embodiment.
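The stop conditions listed above can each be written as a small predicate. The following is a sketch only; all function names and argument layouts are hypothetical, and a user stop request would simply be an external flag checked alongside these.

```python
def all_clusters_covered(chosen, cluster_of, n_clusters):
    """True when every cluster 0..n_clusters-1 has at least one representative.

    `cluster_of[i]` is the cluster index of data sample i.
    """
    return {cluster_of[i] for i in chosen} == set(range(n_clusters))


def enough_representatives(chosen, minimum):
    """True when the number of representatives reaches the predefined number."""
    return len(chosen) >= minimum


def all_labels_used(assigned_labels, label_set):
    """True when every label of the predefined label set has been assigned."""
    return set(label_set) <= set(assigned_labels)


def score_range_covered(assigned_scores, lowest, highest):
    """True when both ends of the predefined score range have been assigned."""
    return lowest in assigned_scores and highest in assigned_scores
```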

FIG. 6 is a diagram showing an example of a semi-supervised learning method according to an embodiment of the present invention.

Referring to FIG. 6, the model generating device 100 trains the machine learning model 120 using semi-supervised learning in a state where labels or scores have been assigned to only some of the data samples in the dataset. To increase the accuracy of the semi-supervised learning, the present embodiment uses a plurality of supervised learning models 600 (that is, machine learning algorithms).

First, the model generating device 100 trains the plurality of supervised learning models 600 using the data samples to which labels or scores have been assigned. The model generating device 100 then inputs an unlabeled data sample 610 into the trained supervised learning models 600 to predict its label or score.

The model generating device 100 assigns to the data sample a label or score determined by consensus among the labels or scores predicted by the plurality of supervised learning models 600 (620). For example, assume that there are five supervised learning models. If, for a first data sample, the first, second, and fifth supervised learning models output label A and the third and fourth supervised learning models output label B, the model generating device assigns label A to the first data sample by majority vote. In addition, various other methods of determining the label or score to assign to a data sample, such as reflecting the prediction confidence of each supervised learning model, may be applied to the present embodiment. For example, the average prediction confidence of the first, second, and fifth supervised learning models may be compared with the average prediction confidence of the third and fourth supervised learning models, and the label with the higher average may be assigned to the data sample.
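The two consensus rules described above, majority vote and comparison of average prediction confidence, can be sketched as follows. The function names are hypothetical, and tie-breaking (the first label encountered wins) is an assumption that the source does not specify.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the most models (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]


def confidence_vote(predictions, confidences):
    """Return the label whose supporting models have the highest mean confidence."""
    grouped = defaultdict(list)
    for label, confidence in zip(predictions, confidences):
        grouped[label].append(confidence)
    return max(grouped, key=lambda label: sum(grouped[label]) / len(grouped[label]))
```

With the five-model example from the text, `majority_vote(["A", "A", "B", "B", "A"])` picks label A, while `confidence_vote` can pick label B if the two models predicting B are, on average, more confident than the three predicting A.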

The model generating device 100 retrains the plurality of supervised learning models using the newly labeled data sample, and then assigns a label to a second data sample through consensus among the labels predicted for it by the supervised learning models; this process is repeated. Once labels have been assigned to all data samples in this way, the model generating device 100 can train and generate the machine learning model by a supervised learning method using the fully labeled dataset.
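The label-one-sample-then-retrain loop can be sketched as below. This is an illustration under assumptions, not the patent's implementation: the two toy classifiers `NearestCentroid` and `NearestNeighbour`, the `fit`/`predict` interface, the index order in which unlabeled samples are visited, and the use of majority vote as the consensus rule are all hypothetical choices.

```python
import math
from collections import Counter


class NearestCentroid:
    """Toy classifier: predicts the label whose class mean is closest."""

    def fit(self, X, y):
        groups = {}
        for x, label in zip(X, y):
            groups.setdefault(label, []).append(x)
        self.centroids = {
            label: tuple(sum(col) / len(xs) for col in zip(*xs))
            for label, xs in groups.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: math.dist(x, self.centroids[lab]))


class NearestNeighbour:
    """Toy classifier: predicts the label of the closest training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        i = min(range(len(self.X)), key=lambda j: math.dist(x, self.X[j]))
        return self.y[i]


def self_train(models, X, labels):
    """Label every sample in X, starting from `labels` (sample index -> label)."""
    labels = dict(labels)
    unlabeled = [i for i in range(len(X)) if i not in labels]
    for i in unlabeled:
        known = sorted(labels)
        for model in models:  # retrain on everything labeled so far
            model.fit([X[j] for j in known], [labels[j] for j in known])
        votes = [model.predict(X[i]) for model in models]
        labels[i] = Counter(votes).most_common(1)[0][0]  # consensus by majority
    return labels
```

Once `self_train` returns, every sample carries a label, and the final machine learning model can be trained on the completed dataset by ordinary supervised learning.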

So that the plurality of supervised learning models 600 can predict labels or scores more accurately, the model generating device 100 may, in each iteration of the learning process, evaluate each supervised learning model and adjust the hyperparameters of supervised learning models with low evaluation scores. This is examined again with reference to FIG. 7.

FIG. 7 is a diagram showing another example of a semi-supervised learning method according to an embodiment of the present invention.

Referring to FIGS. 6 and 7 together, the model generating device 100 trains the plurality of supervised learning models 600 using the data samples to which labels or scores have been assigned (S700). The model generating device 100 determines the label or score to assign to a data sample 610 that has no label or score, based on the prediction results of the plurality of supervised learning models for that data sample (S710).

The model generating device 100 evaluates each supervised learning model 600 based on the confidence value with which it predicts the label or score (S720). For this purpose, the supervised learning model may be a model that outputs a confidence for its predicted value. For example, when predicting a label or score, a supervised learning model may also output the prediction probability (that is, the prediction confidence) of that label or score, and this prediction probability can be used as the confidence value in the present embodiment.

To evaluate the supervised learning models, the model generating device 100 may assign an evaluation score to each supervised learning model and update it. For example, an initial evaluation score of '0' is assigned to each of the plurality of supervised learning models 600. The initial evaluation score may be set to various values depending on the embodiment. When a label or score is assigned to the data sample 610 through consensus among the plurality of supervised learning models 600, the model generating device 100 may increase the evaluation score of the supervised learning model with the highest confidence value and decrease the evaluation score of any supervised learning model whose confidence value falls outside a predefined criterion. For example, if the first, second, and fifth supervised learning models output label A for the data sample 610 and the third and fourth supervised learning models output label B, so that label A is assigned to the data sample 610, the model generating device 100 may increase by '1' the evaluation score of the supervised learning model with the highest confidence value among the first, second, and fifth supervised learning models that predicted label A. As another example, an evaluation score of '-1' (that is, a decrease of '1') may be given to a supervised learning model whose confidence value falls outside the predefined criterion. In this way, the model generating device may update the evaluation scores of the plurality of supervised learning models each time a label or score is assigned to a data sample 610.
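The evaluation score update can be sketched as follows. It is a sketch under assumptions: the "predefined criterion" is interpreted here as a minimum confidence threshold `min_confidence`, which the source does not spell out, and the function name `update_scores` is hypothetical.

```python
def update_scores(scores, predictions, confidences, consensus, min_confidence):
    """Update per-model evaluation scores after one sample has been labeled.

    scores         : current evaluation score of each model (mutated and returned)
    predictions    : label predicted by each model for the sample
    confidences    : prediction confidence of each model
    consensus      : label finally assigned to the sample
    min_confidence : assumed threshold below which a model is penalized
    """
    # Reward the most confident model among those that predicted the consensus label.
    agreeing = [i for i, p in enumerate(predictions) if p == consensus]
    if agreeing:
        best = max(agreeing, key=lambda i: confidences[i])
        scores[best] += 1
    # Penalize every model whose confidence falls outside the criterion.
    for i, confidence in enumerate(confidences):
        if confidence < min_confidence:
            scores[i] -= 1
    return scores
```

In the five-model example, if label A is assigned and the second model is the most confident of the three models that predicted A, only that model gains a point, while any model below the threshold loses one.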

The model generating device 100 adjusts the hyperparameters of the supervised learning models based on the evaluation score of each supervised learning model (S730). For example, the model generating device 100 adjusts the hyperparameters of a supervised learning model whose evaluation score is at or below a predefined reference value. The types of hyperparameters to adjust and the range of adjustment values may be set in advance. As another example, the model generating device 100 may adjust the hyperparameters by applying any of various conventional hyperparameter optimization methods to optimize the hyperparameters of the supervised learning model.
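One simple way to picture S730 is a random draw from a preset search space for every model whose evaluation score is at or below the reference value. Random search is an assumption made for this sketch; the source only says that the hyperparameter types and ranges may be preset or that a conventional optimization method may be used. The names `adjust_hyperparameters` and `search_space` are hypothetical.

```python
import random


def adjust_hyperparameters(params, scores, threshold, search_space, rng):
    """Return new hyperparameter dicts, resampling those of low-scoring models.

    params       : list of hyperparameter dicts, one per model
    scores       : evaluation score of each model
    threshold    : models with score <= threshold are adjusted
    search_space : dict mapping hyperparameter name -> list of allowed values
    rng          : random.Random instance used for the draw
    """
    adjusted = []
    for model_params, score in zip(params, scores):
        if score <= threshold:
            # Draw a fresh value for every preset hyperparameter.
            model_params = {
                name: rng.choice(values) for name, values in search_space.items()
            }
        adjusted.append(model_params)
    return adjusted
```

After the adjustment, the affected models would be retrained in the next iteration of S700 with their new hyperparameters.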

The model generating device 100 repeats the above process (S700 to S730) until labels or scores have been assigned to all of the data samples (S740).

FIG. 8 is a diagram illustrating an example of a method for evaluating the supervised learning models used in semi-supervised learning according to an embodiment of the present invention.

Referring to FIG. 8, N supervised learning models 800, 802, 804, and 806 are used for semi-supervised learning. When the plurality of supervised learning models 800, 802, 804, and 806 output predicted labels or scores for a data sample, the model generating device 100 updates the evaluation score of each supervised learning model using the prediction confidence of that model.

For example, the model generating device 100 may increase the evaluation score of the second supervised learning model 802, which has the highest prediction confidence, and decrease the evaluation score of the third supervised learning model 804, which has the lowest prediction confidence. Various other methods of increasing or decreasing the evaluation score based on prediction confidence may also be applied to this embodiment.

In another embodiment, the model generating device 100 may evaluate the supervised learning models using data samples to which a special label or special score has been assigned. To this end, when receiving a label or score for a representative data sample from the user (see S420 of FIG. 4), the model generating device 100 may accept a special label or special score indicating that no label or score can be assigned. That is, the user may assign an ordinary label or ordinary score to a representative data sample or, when the label or score of the representative data sample is hard to determine, assign to it a special label or special score meaning 'unknown'.

The model generating device 100 trains the plurality of supervised learning models using data samples that carry special labels/scores as well as ordinary labels/scores. As a result, a value predicted by the plurality of supervised learning models may be either an ordinary label/score or a special label/score.

The model generating device 100 may decrease the evaluation score (that is, by '1') of a supervised learning model that predicts the special label/score for a data sample with high confidence.
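A minimal sketch of this penalty is shown below. The marker value `UNKNOWN`, the function name, and the 0.8 cutoff used to decide what counts as "high confidence" are assumptions for illustration; the description does not specify them.

```python
UNKNOWN = "UNKNOWN"  # hypothetical marker for the special 'unknown' label

def penalize_confident_unknowns(scores, predictions, confidences, high_confidence=0.8):
    """Subtract 1 from each model that predicts UNKNOWN with high confidence."""
    return [
        score - 1 if pred == UNKNOWN and conf >= high_confidence else score
        for score, pred, conf in zip(scores, predictions, confidences)
    ]

# Model 2 predicts 'unknown' confidently; model 3 does so only weakly.
scores = penalize_confident_unknowns(
    [0, 0, 0],
    ["A", UNKNOWN, UNKNOWN],
    [0.90, 0.95, 0.40],
)
print(scores)  # [0, -1, 0]
```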

FIG. 9 is a diagram illustrating the configuration of an example of a model generating device according to an embodiment of the present invention.

Referring to FIG. 9, the model generating device 100 includes a clustering unit 900, a sample search unit 910, a labeling unit 920, and a learning unit 930. The model generating device 100 may be implemented as a computing device including a memory, a processor, and an input/output device. In this case, each unit may be implemented in software, loaded into the memory, and executed by the processor.

The clustering unit 900 classifies a plurality of data samples into a plurality of clusters. An example of the clustering performed by the clustering unit 900 is shown in FIG. 3.

The sample search unit 910 searches for a representative data sample of each cluster. The sample search unit 910 may search for a plurality of representative data samples based on data density. Specific examples of the sample search unit 910 are shown in FIGS. 4 and 5.
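One possible reading of a density-based search is sketched below: pick the densest remaining sample as a representative, remove it together with its neighbors within a fixed distance, and repeat. The density measure (number of remaining points within `radius`), the Euclidean distance, and the stop condition (a maximum number of representatives, or no samples left) are assumptions chosen for the example rather than details taken from the description.

```python
import math

def find_representatives(points, radius, max_representatives):
    """Greedy density-based selection; returns indices of representative samples."""
    remaining = list(range(len(points)))
    representatives = []
    while remaining and len(representatives) < max_representatives:
        # Assumption: density of a point = number of remaining points within `radius`.
        def density(i):
            return sum(math.dist(points[i], points[j]) <= radius for j in remaining)
        rep = max(remaining, key=density)
        representatives.append(rep)
        # Remove the representative and its neighbors before the next search.
        remaining = [j for j in remaining
                     if math.dist(points[rep], points[j]) > radius]
    return representatives

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # dense group
          (5.0, 5.0), (5.1, 5.0),               # smaller group
          (9.0, 0.0)]                           # isolated sample
reps = find_representatives(points, radius=0.5, max_representatives=3)
print(reps)  # [0, 3, 5]
```

Because neighbors of each chosen sample are removed, successive representatives come from different regions of the data, so the samples shown to the user for labeling tend not to be near-duplicates of one another.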

The labeling unit 920 receives a label or score for each representative data sample. Specific examples of the labeling unit 920 are shown in FIGS. 4 and 5. In another embodiment, the labeling unit 920 may receive a special label or special score indicating that no label or score can be assigned. An example of a method for evaluating the supervised learning models used in semi-supervised learning based on the special label or special score is shown in FIG. 8.

The learning unit 930 trains the machine learning model by semi-supervised learning using the data samples to which labels or scores have been assigned. The learning unit 930 may perform a semi-supervised learning process using a plurality of supervised learning models, examples of which are shown in FIGS. 6 and 7. In another embodiment, the learning unit 930 may improve learning accuracy by evaluating the plurality of supervised learning models used in semi-supervised learning and adjusting their hyperparameters. An example of a method for evaluating the plurality of supervised learning models is shown in FIG. 8.

Each embodiment of the present invention may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every type of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, SSD, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner.

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

A method for generating a model, comprising:
clustering a plurality of data samples into a plurality of clusters;
searching for a representative data sample of each cluster;
receiving a label or score for the representative data sample; and
training a machine learning model by semi-supervised learning using the labeled or scored data samples.

The method of claim 1, wherein the clustering step comprises:
classifying the plurality of data samples into the plurality of clusters using an unsupervised learning model.

The method of claim 1, wherein the searching step comprises:
identifying a plurality of representative data samples based on data density.

The method of claim 1, wherein the searching step comprises:
repeating, until a predefined stop condition is satisfied, a process of removing the representative data sample and neighboring data samples located within a certain distance of the representative data sample and then searching for a representative data sample among the remaining data samples.

The method of claim 4, wherein the stop condition comprises:
a case in which each of the plurality of clusters contains at least one data sample to which a label or score has been assigned; a case in which the number of representative data samples is greater than or equal to a predefined number; a case in which each label of a predefined label set has been assigned to at least one data sample; a case in which each of the minimum value and the maximum value of a predefined score has been assigned to at least one data sample; or a case in which a stop request is received from a user.

The method of claim 1, wherein the receiving step comprises:
visualizing and displaying the representative data sample to assist input of the label or score.

The method of claim 1,
wherein the label or score includes a special label or special score indicating that labeling cannot be performed.

The method of claim 1, wherein the step of training the machine learning model comprises:
training a plurality of supervised learning models using the labeled or scored data samples;
assigning a label or score to each data sample to which no label or score has been assigned, through agreement among the labels or scores predicted for that data sample by the plurality of supervised learning models;
adjusting evaluation scores of the plurality of supervised learning models based on confidence values of the predictions of the plurality of supervised learning models;
adjusting hyperparameters of any supervised learning model whose evaluation score is less than a predefined criterion; and
repeating the training step through the hyperparameter adjusting step until labels or scores have been assigned to all data samples.

The method of claim 8, wherein the step of adjusting the evaluation scores comprises:
when a label or score is assigned to a data sample through the agreement of the plurality of supervised learning models, increasing the evaluation score of the supervised learning model with the highest confidence value and reducing the evaluation score of any supervised learning model whose confidence value falls outside a predefined criterion.

The method of claim 8,
wherein the label or score includes a special label or special score indicating that labeling cannot be performed, and
wherein the step of adjusting the evaluation scores comprises:
reducing the evaluation score of a supervised learning model that predicts the special label or special score with a confidence value equal to or higher than a predefined criterion.

A model generating device, comprising:
a clustering unit that classifies a plurality of data samples into a plurality of clusters;
a sample search unit that searches for a representative data sample of each cluster;
a labeling unit that receives a label or score for the representative data sample; and
a learning unit that trains a machine learning model by semi-supervised learning using the labeled or scored data samples.

A computer-readable recording medium on which is recorded a computer program for causing a computer to perform the method according to any one of claims 1 to 10.