KR102189362B1

KR102189362B1 - Method and Device for Machine Learning able to automatically-label

Info

Publication number: KR102189362B1
Application number: KR1020180075333A
Authority: KR
Inventors: 류명훈; 박한
Original assignee: 주식회사 디플리
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2020-12-11
Also published as: WO2020004867A1; KR20200002149A

Abstract

An automatic labeling method for unlabeled data according to an embodiment of the present invention comprises: collecting unlabeled data (S31); classifying the unlabeled data into a plurality of groups by a clustering technique (S34); selecting some data from each of the plurality of groups (S35); assigning labels to the selected data (S36); and assigning the label given to the selected data to all data belonging to the corresponding group (S37). According to an embodiment of the present invention, labels are assigned to only part of the training data and then assigned automatically to all remaining training data, so that the function trained by machine learning can be trained in less time and the labor required for labeling can be greatly reduced.

Description

Machine learning method and device capable of automatic labeling {Method and Device for Machine Learning able to automatically-label}

The present invention relates to a machine learning method and device capable of automatic labeling, and more particularly to a method and device in which labels are assigned to only part of the training data for machine learning and the remainder is then labeled automatically.

Machine learning is a field of artificial intelligence concerned with developing algorithms and techniques that enable computers to learn. For example, a model can be trained by machine learning to determine whether a received email is spam.

Machine learning is broadly divided into supervised learning and unsupervised learning. Supervised learning is a machine learning method that trains a function using training data. The training data generally represent the attributes of each input object as a vector, and each vector is marked with the desired result. Among functions trained in this way, producing a continuous output value is called regression, and indicating which class a given input vector belongs to is called classification. In unsupervised learning, by contrast, no target value is given for the input.

Supervised learning aims to find a predetermined answer by means of an algorithm. It therefore infers, from training data in which input values and target values are given, the function that best maps input values to output values.

In supervised learning, labeled data, that is, data that include target value information, are used to train the function. Based on the labeled data and according to a given learning algorithm, the value actually output for an input is compared with the target value, and if there is an error the function is corrected on that basis. This process is repeated.

Supervised learning requires a person to label each item of training data by hand, which takes considerable time and cost.

Meanwhile, Korean Patent Laid-Open Publication No. 10-2017-0083419 discloses a method of training a deep learning model using a large amount of unlabeled training data, but that method has the drawback of reduced accuracy.

Korean Patent Laid-Open Publication No. 10-2017-0083419 (registered on July 18, 2017)

The present invention has been proposed to solve the above problem, and its object is to provide a machine learning method and device capable of automatically labeling a large amount of unlabeled data.

An automatic labeling method for unlabeled data according to an embodiment of the present invention comprises: collecting unlabeled data (S31); classifying the unlabeled data into a plurality of groups by a clustering technique (S34); selecting some data from each of the plurality of groups (S35); assigning labels to the selected data (S36); and assigning the label given to the selected data to all data belonging to the corresponding group (S37).

For example, the labels of the selected data may be assigned by a predetermined algorithm. The types and number of output values into which the user wishes to classify the data may be determined in advance. The user can therefore configure an algorithm beforehand and have it determine and assign the labels of the selected data. When the machine learning device receives the labels of the selected data, all data belonging to the group that contains the selected data are given the same label.

Preferably, the labels of the selected data may be assigned by user input. The user can determine the correct output value for a data item with a very simple selection and enter it into the machine learning device. When the machine learning device receives the user input, the labels of the selected data are fixed accurately, and all data belonging to the group that contains the selected data are given the same label.

The automatic labeling method according to an embodiment of the present invention may further comprise signal-processing the collected data (S32) and performing dimensionality reduction on the signal-processed data (S33), in which case step S34 classifies the dimension-reduced, unlabeled data into a plurality of groups by the clustering technique.

A machine learning method capable of automatic labeling according to an embodiment of the present invention comprises: determining a hyperparameter combination to be used for training (S10); a first training step (S20) of training a function that includes parameters by machine learning using labeled data; an automatic labeling step (S30) for unlabeled data; a second training step (S40) of training the function by machine learning using the data automatically labeled in step S30; calculating an accuracy index of the function (S50); and changing the hyperparameter combination (S60), wherein steps S20 to S60 are repeated and the hyperparameter combination and parameters used when the highest accuracy index was calculated are selected.

For example, the automatic labeling step (S30) may comprise: collecting unlabeled data (S31); classifying the unlabeled data into a plurality of groups by a clustering technique (S34); selecting some data from each of the plurality of groups (S35); assigning labels to the selected data (S36); and assigning the label given to the selected data to all data belonging to the corresponding group (S37).

A machine learning device capable of automatic labeling according to an embodiment of the present invention comprises: an input unit that collects unlabeled data; a labeling unit that automatically assigns labels to the unlabeled data; a learning unit that trains a function including parameters by machine learning using labeled data; a calculation unit that calculates an accuracy index of the function; and a control unit that controls the input unit, the labeling unit, the learning unit, and the calculation unit. The labeling unit classifies the unlabeled data into a plurality of groups by a clustering technique, selects some data from each of the groups, assigns labels to the selected data, and then assigns the label given to the selected data to all data belonging to the corresponding group.

For example, the control unit may determine the hyperparameter combination used for training. An accuracy index is generated each time the function is trained by machine learning on automatically labeled data under a given hyperparameter combination, and the control unit may select the hyperparameter combination and parameters for which that index is highest.

According to an embodiment of the present invention, labels are assigned to only part of the training data and then assigned automatically to all remaining training data, so that the function trained by machine learning can be trained in less time and the labor required for labeling can be greatly reduced.

FIG. 1 is a conceptual diagram illustrating a machine learning method and device capable of automatic labeling related to the present invention.
FIG. 2 is a diagram showing a machine learning method capable of automatic labeling according to an embodiment of the present invention.
FIG. 3 is a diagram showing the automatic labeling step (S30) of FIG. 2 in detail.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting and do not in themselves have distinct meanings or roles. In describing the embodiments, detailed descriptions of related known technology are omitted where they would obscure the gist of the embodiments. The accompanying drawings are intended only to aid understanding of the embodiments. They do not limit the technical idea disclosed in this specification, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

FIG. 1 is a conceptual diagram illustrating a machine learning method and device capable of automatic labeling related to the present invention.

When ambient sound occurs, it is detected in real time through an input unit (10) such as a microphone and stored as data. The ambient sound may be silence (11) with almost no sound, noise (12), that is, sound in which the user has no interest, or a sound of interest (13) that the user wishes to classify or analyze. Depending on the case, the sound of interest (13) may be a patient's groan (131), a baby's cry (132), or an adult's voice (133). The sound of interest (13) is not limited to these three examples, however, and may be any sound, such as a traffic collision, a vehicle in operation, or an animal sound.

For example, when the sound of interest (13) is an adult's voice (133), a baby's cry (132) may be classified as noise (12). Likewise, when the sound of interest (13) is an animal sound, a patient's groan (131), a baby's cry (132), an adult's voice (133), a traffic collision sound, and the like may be classified as noise (12).
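The dependence of the noise class on the chosen sound of interest can be sketched as a small mapping. This is only an illustration: the function name `relabel` and the label strings are hypothetical and do not appear in the patent.

```python
def relabel(raw_label, sounds_of_interest):
    """Map a raw sound type to the class used for training (hypothetical helper).

    Silence stays silence; a sound of interest keeps its own label;
    every other sound is treated as noise.
    """
    if raw_label == "silence":
        return "silence"
    return raw_label if raw_label in sounds_of_interest else "noise"


# With adult voice as the sound of interest, a baby's cry becomes noise.
print(relabel("baby_cry", {"adult_voice"}))
```

The same recording can therefore carry different labels in different deployments, which is why the set of output labels is fixed before labeling starts.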

Such classification of data types may be performed by a function (F) in the machine learning device (1). To use a vast amount of data as training data for this kind of classification, however, each data item must be labeled with its type, and labeling the entire data set by hand takes a great deal of time.

Supervised learning is a machine learning method that trains a function using training data. The training data generally represent the attributes of each input object as a vector, and each vector is marked with the desired result. Among functions trained in this way, producing a continuous output value is called regression, and indicating which class a given input vector belongs to is called classification. In unsupervised learning, by contrast, no target value is given for the input.

Preferably, in an embodiment of the present invention, the learning unit (30) may use semi-supervised learning, which is intermediate between supervised and unsupervised learning. Semi-supervised learning uses both data marked with target values and data without them for training. In most cases the training data used in this approach contain few items with target values and many without. According to an embodiment of the present invention, applying semi-supervised learning can greatly reduce the time and cost of labeling. For example, only some of the data are labeled and used to train the function, while the remaining unlabeled data are gathered into several groups by a clustering technique. Once the labels of samples drawn from each group have been determined, the labeling unit (20) chooses a single label for the group according to a predetermined condition so that all data in the group carry that label. Such data are referred to as automatically labeled data. The learning unit (30) can then train the function on the automatically labeled data by supervised learning.

Marking the target value is called labeling. For example, when ambient sound occurs and the sound data are taken as the input, labeling means indicating whether the sound is silence (11), noise (12), or a sound of interest (13). In other words, labeling is the groundwork of marking data in advance with example outputs so that a function can be trained on them by a machine learning algorithm.

When a person marks the data directly, this is supervised learning. When none are marked, it is unsupervised learning. When a person marks some and leaves the rest unmarked, it is semi-supervised learning.

In an embodiment of the present invention, the machine learning device (1) can perform automatic labeling (auto-labeling). A label is a result value that the function should output, for example silence, noise, a baby's cry, or a baby sound other than crying. The automatic labeling may be performed in the following order and may be carried out by, for example, the learning unit (30).

First, a person intervenes and labels a fixed number of data items (for example, 100). Specifically, sound data are collected through the input unit (10). The labeling unit (20) applies appropriate signal processing to the collected sound data and then performs dimensionality reduction. Next, a clustering technique, which separates the data into homogeneous groups, is used to gather the data items that share a common character into one data group. The clustering technique classifies the data on the basis of predetermined hyperparameters, and these hyperparameters may be changed according to the accuracy of the training performed later.

Next, once a plurality of data groups has been formed, the labeling unit (20) randomly picks only a predetermined number of items (for example, four) from each data group and determines what kind of items they are. For example, if three or more of the four items picked from a first data group are found to be noise, the whole first data group is regarded as noise and every item in it is labeled as noise. If two or fewer of the four items picked from a second data group are baby cries, every item in the second data group is labeled as noise.
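The sampling-and-voting rule in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the name `label_group`, the `oracle` callable (standing in for the user or a predetermined rule), the vote threshold of 3, and the fallback to `"noise"` are all assumptions generalized from the four-sample example.

```python
import random
from collections import Counter


def label_group(group, oracle, sample_size=4, min_votes=3,
                fallback="noise", rng=None):
    """Choose one label for a whole data group from a small random sample.

    group:  list of data items in one cluster
    oracle: callable returning the label of a single item (hypothetical)
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(group, min(sample_size, len(group)))
    votes = Counter(oracle(item) for item in sample)
    label, count = votes.most_common(1)[0]
    # Accept the majority label only if it reaches the vote threshold;
    # otherwise the whole group is treated as the fallback class.
    return label if count >= min_votes else fallback


# Items here are already their own labels, so the oracle is the identity.
print(label_group(["cry", "cry", "noise", "adult"], lambda item: item))
```

With four samples and a threshold of three, a group whose sample contains only two baby cries falls back to noise, matching the second example in the text.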

Next, the labeling unit (20) performs labeling with this predetermined algorithm, and the learning unit (30) uses the automatically labeled data as training data. The calculation unit (40) then calculates the accuracy index, and the control unit (50) finally selects the hyperparameter combination and parameters that yielded the highest accuracy index.

FIG. 2 is a diagram showing a machine learning method capable of automatic labeling according to an embodiment of the present invention.

In machine learning, a parameter is a variable internal to the learning model whose value can be estimated from data. In other words, parameters are learned from data by machine learning. Examples are the weights of an artificial neural network, the support vectors of a support vector machine (SVM), and the coefficients of a regression analysis. Put simply, if the function sought by machine learning has the form of a linear equation (y = ax + b), the slope (a) and the intercept (b) are its parameters. In that case, taking the slope (a) and intercept (b) of the function optimized by machine learning and using them to analyze later data brings the output close to the correct value.
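As a concrete instance of parameters learned from data, the slope and intercept of y = ax + b can be estimated by ordinary least squares. The sketch below is illustrative only. The patent does not prescribe a fitting method, and the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Estimate slope a and intercept b of y = a*x + b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope: a parameter learned from the data
    b = mean_y - a * mean_x  # intercept: also learned from the data
    return a, b


a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Once `a` and `b` are fixed, new inputs are evaluated with the same two values, which is what the text means by reusing the parameters of the optimized function.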

In machine learning, a hyperparameter is a factor external to the learning model whose value cannot be estimated from data. It is set by the user or administrator. Examples are the learning rate in neural network training, the cost (C) in a support vector machine (SVM), and the value of K in K-nearest neighbors (KNN). Hyperparameters are chosen by the user of the algorithm and can be changed or adjusted if a problem with the algorithm is found.
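To illustrate how a hyperparameter differs from a learned parameter, the sketch below implements a tiny K-nearest-neighbor classifier in which `k` is supplied by the caller rather than derived from the data. The function name `knn_predict` and the toy one-dimensional features are assumptions made for the example.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training items.

    train: list of (feature_vector, label) pairs
    k:     hyperparameter chosen by the user, not learned from the data
    """
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [((0.0,), "noise"), ((0.2,), "cry"), ((0.3,), "cry"), ((1.0,), "noise")]
print(knn_predict(train, (0.05,), k=1))
print(knn_predict(train, (0.05,), k=3))
```

The same training data give different predictions for different values of `k`, which is why the method compares several hyperparameter combinations by their accuracy index.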

Referring to FIG. 2, to perform machine learning with automatic labeling, the hyperparameter combination to be used for training is determined first. The hyperparameter combination may consist of a plurality of hyperparameters.

Next, supervised learning is performed. That is, a function is trained by machine learning on labeled data. The function includes parameters, and it performs best when its parameters are optimized.

Next, automatic labeling is performed. The input unit (10) collects unlabeled data, and the labeling unit (20) classifies the unlabeled data into a plurality of groups by a clustering technique. The labeling unit (20) selects some data from each group and assigns labels to the selected data. The labels may be assigned mechanically or by a program, but receiving them as user input is preferable.

When a first label for the selected data is received from the user, the first label is assigned to all data in the data group to which the selected data belong. The user can enter labels for the data selected from each data group, so that all data in every data group end up labeled.
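Steps S35 to S37 can be sketched end to end once the groups exist. The sketch assumes the groups are already available as a dictionary, takes the first few items of each group instead of a random draw so that the result is repeatable, and uses a hypothetical `ask_user` callable in place of real user input. None of these names come from the patent.

```python
from collections import Counter


def auto_label(groups, ask_user, samples_per_group=3):
    """Label every item by asking about only a few items per group.

    groups:   dict mapping a group id to a list of hashable item ids
    ask_user: callable returning the label of one item (hypothetical)
    Returns (labels, queries): a dict item -> label and the number of
    labels the user actually had to supply.
    """
    labels = {}
    queries = 0
    for items in groups.values():
        picked = items[:samples_per_group]             # S35: select some data
        answers = [ask_user(item) for item in picked]  # S36: label the selection
        queries += len(picked)
        group_label = Counter(answers).most_common(1)[0][0]
        for item in items:                             # S37: propagate to the group
            labels[item] = group_label
    return labels, queries


groups = {0: ["a1", "a2", "a3", "a4", "a5"], 1: ["b1", "b2", "b3", "b4"]}
labels, queries = auto_label(
    groups, lambda item: "baby_cry" if item.startswith("a") else "noise")
print(queries, len(labels))
```

In this toy run the user supplies six labels and nine items end up labeled. The saving grows with group size, since the number of queries depends only on the number of groups.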

The learning unit (30) can then train the function again by machine learning on the automatically labeled data. Labeling in this way is far faster than having a person label every item individually.

The accuracy index can be defined in advance using an accuracy measure such as the F1 measure (F1 score). Accordingly, the higher the accuracy index calculated when the first hyperparameter combination is used, the better optimized the function is understood to be.
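For reference, the F1 score is the harmonic mean of precision and recall for one class. A minimal sketch for a single positive class follows. The function name and the choice to return 0.0 when there are no true positives are assumptions, and the patent does not say how multiple classes would be averaged.

```python
def f1_score(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # convention chosen here to avoid division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["cry", "cry", "noise", "noise"]
predicted = ["cry", "noise", "cry", "noise"]
print(f1_score(truth, predicted, positive="cry"))
```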

For example, assuming that the first through n-th hyperparameter combinations are used, it is preferable to select the hyperparameter combination that gave the highest accuracy index among the n training runs, together with the parameters obtained in that run.
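The selection over n hyperparameter combinations amounts to a loop that keeps the best run. In the sketch below, `train_and_score` is a hypothetical callable standing in for steps S20 to S50 (first training, automatic labeling, second training, and accuracy calculation). The patent does not define such an interface.

```python
def select_best(hyperparameter_combos, train_and_score):
    """Try each combination and keep the one with the highest accuracy index.

    train_and_score: hypothetical callable that runs S20 to S50 for one
    combination and returns (accuracy_index, learned_parameters).
    Returns (best_score, best_combo, best_parameters).
    """
    best = None
    for combo in hyperparameter_combos:           # S60: move to the next combination
        score, params = train_and_score(combo)    # S20 to S50 for this combination
        if best is None or score > best[0]:
            best = (score, combo, params)
    return best


# Toy stand-in: pretend accuracy peaks at k == 3.
def fake_run(combo):
    return 1 - abs(combo["k"] - 3) * 0.1, {"a": combo["k"]}


print(select_best([{"k": 2}, {"k": 3}, {"k": 4}], fake_run))
```

Keeping the learned parameters alongside the winning combination matters because the text selects both, not only the hyperparameters.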

Thus, in the machine learning method capable of automatic labeling according to an embodiment of the present invention, a person labels a small amount of data to secure the accuracy of training, and the remaining data are labeled automatically to secure its speed. Accuracy in the initial training may be somewhat lower than in supervised learning in which all data are labeled before training, but because the step in which a person labels all the data is omitted, machine learning can proceed very quickly. Viewed over the whole training process, the method achieves a moderate level of accuracy and a high level of speed, and is therefore more efficient than conventional machine learning. Labor costs can accordingly be reduced, and training by machine learning can become much faster.

FIG. 3 is a diagram showing the automatic labeling step (S30) of FIG. 2 in detail.

Data such as voice data are collected in an unlabeled state. The collected data may undergo signal processing for machine learning, which may include preprocessing and feature vector extraction. Dimensionality reduction, for example principal component analysis (PCA), may then be applied to the signal-processed data. Dimensionality reduction decreases the amount of data and serves to extract features that properly express the meaning of the data.
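As a rough illustration of PCA-style dimensionality reduction, the sketch below finds only the first principal component by power iteration on the covariance matrix and projects each feature vector onto it. It is a simplified stand-in written without external libraries. The function name, the iteration count, and the restriction to one component are assumptions, and a practical system would more likely use a numerical library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # constant data: no direction of variance to find
        v = [x / norm for x in w]
    # Each d-dimensional row is reduced to a single coordinate along v.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


direction, reduced = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(direction, reduced)
```

In this toy data every point lies on the line y = 2x, so a single coordinate preserves all of the variation while halving the amount of data per item.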

The dimension-reduced data may be divided into a plurality of groups by a clustering technique. The learning unit (30) selects some data from each of the groups and assigns labels to the selected data.
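The patent does not name a specific clustering algorithm. As one possible stand-in, the sketch below uses plain k-means with a deterministic initialization (the first `k` points), which is an assumption made to keep the example repeatable. The number of groups `k` plays the role of a clustering hyperparameter.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters. Returns (assignments, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

The resulting group ids are what the labeling step consumes: a few items are drawn from each group, labeled, and the label is copied to the rest of the group.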

Preferably, the labels of the selected data are entered by the user. It is preferable that all data in a data group be labeled even when the user enters labels for only three or four items.

1: machine learning device
10: input unit
11: silence
12: noise
13: sound of interest
131: patient's groan
132: baby's cry
133: adult's voice
20: labeling unit
30: learning unit
40: calculation unit
50: control unit

Claims

(Deleted)

Determining, by a control unit, a hyperparameter combination to be used for training (S10);
A first training step (S20) of training, by a machine learning method using a small amount of labeled voice data in order to secure the accuracy of training, a function that includes parameters and classifies sounds;
An automatic labeling step (S30) for unlabeled voice data, comprising:
Collecting, by an input unit, unlabeled voice data (S31);
Classifying, by a labeling unit, the unlabeled voice data into a plurality of groups by a clustering technique (S34);
Selecting, by the labeling unit, some voice data from each of the plurality of groups (S35);
Assigning, by the labeling unit, labels to the selected voice data by a predetermined algorithm (S36); and
Assigning, by the labeling unit, the label given to the selected voice data to all voice data belonging to the corresponding group (S37);
A second training step (S40) of training, by a learning unit using the voice data automatically labeled in the automatic labeling step (S30) in order to secure the speed of training, the function for classifying sounds by a machine learning method;
Calculating, by a calculation unit, an accuracy index of the function (S50); and
Changing, by the control unit, the hyperparameter combination (S60),
Wherein steps S20 to S60 are repeated and the control unit selects the hyperparameter combination and parameters used when the highest accuracy index was calculated,
A machine learning method that efficiently trains a less accurate model by automatically labeling all of the unlabeled voice data.

The method of claim 5,
Wherein the accuracy index is calculated using the F1 measure,
A machine learning method that efficiently trains a less accurate model by automatically labeling all of the unlabeled voice data.

(Deleted)

An input unit for collecting unlabeled data;
A labeling unit for automatically assigning labels to the unlabeled data;
A learning unit for training, by machine learning using labeled data, a function that includes parameters and classifies sounds;
A calculation unit for calculating an accuracy index of the function; and
A control unit for controlling the input unit, the labeling unit, the learning unit, and the calculation unit,
Wherein the labeling unit classifies the unlabeled data into a plurality of groups by a clustering technique, selects some data from each of the plurality of groups, assigns labels to the selected data, and then assigns the label given to the selected data to all data belonging to the corresponding group,
The control unit may determine a hyperparameter combination used for training, and selects the hyperparameter combination and parameters for which the accuracy index, generated each time the function is trained by machine learning on automatically labeled data under a given hyperparameter combination, is highest, and
The learning unit trains the function including parameters by machine learning on a small amount of labeled voice data in order to secure the accuracy of training, and trains the function by machine learning on automatically labeled voice data in order to secure the speed of training,
A machine learning device that efficiently trains a less accurate model by automatically labeling all of the unlabeled voice data.

The device of claim 9,
Wherein the accuracy index is calculated using the F1 measure,
A machine learning device that efficiently trains a less accurate model by automatically labeling all of the unlabeled voice data.