KR102321007B1

KR102321007B1 - Method and apparatus for determining whether to introduce machine learning models for labeling tasks according to characteristics of crowdsourcing-based projects

Info

Publication number: KR102321007B1
Application number: KR1020210094319A
Authority: KR
Inventors: 박민우; 박태상; 김창엽; 오진성; 유상재
Original assignee: 주식회사 크라우드웍스
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-11-03

Abstract

The present invention relates to a method and apparatus for determining whether to introduce machine learning models for labeling tasks according to characteristics of crowdsourcing-based projects, which reduce the time and cost required for a project. According to one embodiment of the present invention, the method comprises the steps of: calculating a matching score between a project and at least one pre-trained machine learning model when a project for a labeling task is requested; checking, on the basis of the calculated matching score, whether a pre-trained machine learning model for the labeling task exists among the at least one pre-trained machine learning model; and determining whether to perform automatic labeling or manual labeling according to the checking result. When a pre-trained machine learning model for the labeling task exists, automatic labeling is performed on the basis of the checked pre-trained machine learning model. When no such model exists, a first cost required to perform the labeling task manually is compared with a second cost required to perform the labeling task automatically by using a custom model or a pre-trained machine learning model whose calculated matching score is within a preset threshold range, and whether to perform automatic labeling or manual labeling is determined according to the comparison result.

Description

Method and device for determining whether to introduce machine learning models for labeling tasks according to the characteristics of crowdsourcing-based projects

The present invention relates to a method and apparatus for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project, and more specifically, to a method and apparatus that make it possible to decide whether a labeling task is to be processed automatically by an artificial intelligence-based machine learning model or manually by workers.

More and more companies are collecting and processing large amounts of data through crowdsourcing, which involves the general public in some part of a company's activities.

Crowdsourcing is a compound of "crowd" and "outsourcing" and refers to involving the public in some part of a company's activities.

Crowdsourcing is a structure well suited to the gig economy, meaning fixed-term work traded in digital marketplaces, and can be provided in a form optimized for on-demand service, in which a company responds immediately to consumers' needs.

By involving the public in its activities in this way, a company can hear fresh ideas and practical opinions, and members of the public can be paid for providing feedback. A digital workforce that combines cloud- and AI-based RPA technology with cognitive capabilities and smart analysis can also be used to deliver high-quality project results to consumers.

In addition, crowdsourcing has the advantage that having the public participate directly to produce the desired result costs less to develop than entrusting production or service to an external specialist firm, and it can also win potential customers.

Recently, crowdsourcing has been used for data labeling services, such as data labeling automation, distributed processing design, and online non-face-to-face management, and for human resource management (HR Tech). It is also being actively used in industries that need large amounts of standardized training data to advance artificial intelligence solutions, such as autonomous driving and video learning.

Specifically, when a project is opened, a plurality of tasks is assigned to each of a plurality of workers. Each worker performs the assigned tasks and provides the task results. Thereafter, a plurality of inspection tasks for the task results is assigned to each of a plurality of inspectors, and each inspector performs the assigned inspection tasks.

However, when work data is labeled manually by many workers, tasks must be assigned to each worker and the results collected from each worker. In this case, the larger the amount of data, the lower both the time efficiency and the cost efficiency.

Therefore, there is a need for a technique that, depending on whether a pre-trained machine learning model exists for processing the entire work data to be labeled, takes cost into account and decides whether to perform manual labeling by workers or automatic labeling through fine-tuning or model development.

Korean Patent Application Laid-Open No. 10-2020-0065191 (Registration date: May 29, 2020)

The present invention has been proposed to solve the problems described above. Its object is to provide a method and apparatus for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project, which, depending on whether a pre-trained machine learning model exists for processing the entire work data to be labeled, take cost into account and decide whether to perform manual labeling by workers or automatic labeling through fine-tuning or model development, thereby reducing the time and cost lost on the project.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method, performed by an apparatus according to an embodiment of the present invention for solving the above-described problems, for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project includes: calculating, when a project for a labeling task is requested, a matching score between the project and at least one pre-trained machine learning model; checking, based on the calculated matching score, whether a pre-trained machine learning model for the labeling task exists among the at least one pre-trained machine learning model; and determining whether to perform automatic labeling or manual labeling according to whether, as a result of the checking, a pre-trained machine learning model for the labeling task exists. The method may further include: when a pre-trained machine learning model for the labeling task exists, performing automatic labeling based on the identified pre-trained machine learning model; and when no pre-trained machine learning model for the labeling task exists, comparing a first cost required to perform the labeling task manually with a second cost required to perform the labeling task automatically using a custom model or a pre-trained machine learning model whose calculated matching score is within a preset threshold range, and determining whether to perform automatic labeling or manual labeling according to the comparison result.
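The decision flow summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Decision` and `decide_labeling`, the single `exists_threshold` cut-off used to decide that a model "exists", and the rule that a tie in cost goes to manual labeling are all hypothetical and are not given in the source. The two costs are taken as precomputed inputs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    mode: str                   # "automatic" or "manual"
    model_name: Optional[str]   # pre-trained model used as is, if any


def decide_labeling(scores: Dict[str, float],
                    exists_threshold: float,
                    first_cost: float,
                    second_cost: float) -> Decision:
    """Choose automatic or manual labeling for one requested project.

    scores maps each pre-trained model name to its matching score.
    exists_threshold is an assumed cut-off: a model scoring at or above
    it is treated as an existing model for the labeling task.
    first_cost is the manual cost; second_cost is the automatic cost
    (fine-tuned or custom model).
    """
    best = max(scores, key=scores.get) if scores else None
    if best is not None and scores[best] >= exists_threshold:
        # A suitable pre-trained model exists: label automatically with it.
        return Decision("automatic", best)
    # No suitable model: fall back to the cost comparison.
    if second_cost < first_cost:
        return Decision("automatic", None)
    # Assumption: equal costs default to manual labeling.
    return Decision("manual", None)


if __name__ == "__main__":
    print(decide_labeling({"detector_v1": 0.55}, 0.8, 1000.0, 700.0))
```

Keeping the existence check and the cost comparison in one function mirrors the order of the steps in the summary: the costs are only consulted when no sufficiently matching model is found.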

Meanwhile, an apparatus for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project according to an embodiment of the present invention includes: a communication module that communicates with a request server requesting the project and with the terminals of at least one worker assigned to the project; a storage module that stores at least one process necessary for determining whether to introduce a machine learning model for the labeling task; and a control module that controls, based on the at least one process, operations for determining whether to introduce the machine learning model. When a project for the labeling task is requested, the control module calculates a matching score between the project and at least one pre-trained machine learning model, checks, based on the calculated matching score, whether a pre-trained machine learning model for the labeling task exists among the at least one pre-trained machine learning model, and determines whether to perform automatic labeling or manual labeling according to whether such a model exists. When a pre-trained machine learning model for the labeling task exists, the control module may perform control so that automatic labeling is performed based on the identified pre-trained machine learning model. When no pre-trained machine learning model for the labeling task exists, the control module may perform control so that a first cost required to perform the labeling task manually is compared with a second cost required to perform the labeling task automatically using a custom model or a pre-trained machine learning model whose calculated matching score is within a preset threshold range, and so that whether to perform automatic labeling or manual labeling is determined according to the comparison result.

Other specific details of the present invention are included in the detailed description and drawings.

According to the present invention, depending on whether a pre-trained machine learning model exists for processing the entire work data to be labeled, cost is taken into account to decide whether to perform manual labeling by workers or automatic labeling through fine-tuning or model development, so that the time and cost lost on the project can be reduced.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram illustrating the network structure of a system for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of an apparatus for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method for calculating the second cost in the determination device according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining an example of calculating the first cost and the second cost in the determination device according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent with reference to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. As used herein, the singular also includes the plural unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components in addition to the stated components. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may of course be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein may be used with the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless clearly and specifically defined.

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as terms that include different orientations of the components during use or operation in addition to the orientation shown in the drawings. For example, when a component shown in the drawings is turned over, a component described as "below" or "beneath" another component may be placed "above" the other component. Accordingly, the exemplary term "below" may include both the downward and upward directions. Components may also be oriented in other directions, and spatially relative terms may therefore be interpreted according to the orientation.

As used herein, the term "unit" or "module" refers to a software component or a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium and may be configured to run on one or more processors. Thus, by way of example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

In this specification, a project is carried out on the basis of crowdsourcing, which involves the public in some part of a company's activities, and may further include tasks as a set of sub-operations.

In this specification, an evaluation function and/or evaluation score may refer to a method of evaluating the results of a project (work results, outputs, work products, and the like) and the output score derived from that evaluation.

In a project based on a crowdsourcing service, in which many members of the public act as workers to create and deliver the work product requested by a client and the workers who created it are rewarded in proportion to their work, a plurality of participants is selected and assigned to the labeling task when the client requests labeling of work data. By its nature, when the amount of work data is vast, the compensation paid to workers inevitably grows in proportion to the amount of work data, which increases the cost burden.

The present invention compares the cost of manually labeling work data by at least one worker (first processing method) with the cost of automatically labeling the work data based on a pre-trained machine learning model (second processing method), and decides to label the work data using whichever of the two methods reduces cost. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. For convenience of explanation, the description refers only to workers, but a worker may be a worker who labels work data or any other subject corresponding to a human resource.

FIG. 1 is a block diagram illustrating the network structure of a system for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project according to an embodiment of the present invention.

Referring to FIG. 1, the system for determining whether to introduce a machine learning model for a labeling task includes a device for determining whether to introduce a machine learning model for a labeling task (hereinafter referred to as the 'determination device') 100, a request server 200, and a worker terminal 300. There may be at least one worker terminal 300.

The determination device 100 is a device or terminal of a company (business) that provides a crowdsourcing service. Before opening a requested project on the platform, it decides whether to process the labeling manually by assigning at least one worker (manual labeling) or automatically based on a pre-trained machine learning model (automatic labeling). To make this decision, the determination device 100 considers the cost of labeling under each of the two processing methods and decides to process the labeling using the method that costs less.

That is, the determination device 100 calculates the cost required when the work data is processed by manual labeling (first cost) and the cost required when the work data is processed by automatic labeling (second cost). When the first cost is greater than the second cost, it decides to perform automatic labeling; when the first cost is less than or equal to the second cost, it decides to perform manual labeling.

Meanwhile, when the determination device 100 is commissioned with a project for a labeling task by receiving work request information from the request server 200, it checks, based on the work request information and the at least one pre-trained machine learning model it holds, whether a pre-trained machine learning model usable for the project's labeling task exists. If such a model exists, the determination device 100 uses it to label the work data automatically; if not, it selects at least one worker and requests manual labeling of the work data.

Here, to check whether a pre-trained machine learning model usable for the project's labeling task exists, the determination device 100 extracts label classes from the work request information and from the specification document of each pre-trained machine learning model it holds. The determination device 100 then calculates a matching score based on the similarity between the two sets of label classes and, based on this matching score, decides whether to use a pre-trained machine learning model as is, to fine-tune it, or to train and use a custom model. The similarity can be obtained by vectorizing each label class at the word level and calculating the cosine similarity between the resulting vectors.
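The matching-score step described above can be sketched as follows. This is a minimal illustration that assumes "word-level vectorization" means a simple bag-of-words count vector; the source does not specify the vectorizer (word embeddings would be another option), and the function names `word_vector` and `matching_score` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Iterable


def word_vector(label_classes: Iterable[str]) -> Counter:
    """Build a bag-of-words count vector from a set of label classes."""
    words = []
    for label in label_classes:
        # Split each label class into lowercase word tokens.
        words.extend(re.findall(r"\w+", label.lower()))
    return Counter(words)


def matching_score(project_labels: Iterable[str],
                   model_labels: Iterable[str]) -> float:
    """Cosine similarity between the two label-class vectors (0.0 to 1.0)."""
    a = word_vector(project_labels)
    b = word_vector(model_labels)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    # An empty label set has no direction, so treat it as no match.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    requested = ["car", "truck", "traffic light"]   # from the work request
    model_spec = ["car", "bus", "traffic light"]    # from a model's spec document
    print(round(matching_score(requested, model_spec), 3))
```

A count vector only rewards exact word overlap, so "automobile" and "car" would score zero here; an embedding-based vectorizer would be needed to capture such synonyms.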

For example, when a pre-trained machine learning model whose matching score is at or above the preset threshold range is detected, the labeling task is performed automatically using the detected model. When a pre-trained machine learning model whose matching score is within the preset threshold range is detected, the detected model is fine-tuned and the labeling task is performed automatically. When a pre-trained machine learning model whose matching score is at or below the preset threshold range is detected, the labeling task is performed automatically using a custom model. Each threshold and threshold range can be set or changed by a user or an administrator, and no particular values are prescribed.
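The three cases above amount to a routing rule over the matching score. The sketch below shows one way to express it, assuming the threshold range is given by a lower and an upper bound and that the bounds themselves belong to the "custom" and "use as is" cases respectively; the source does not say how the boundaries are assigned, and the names `Strategy` and `choose_strategy` as well as the example bounds 0.4 and 0.8 are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    USE_AS_IS = "use the pre-trained model as is"
    FINE_TUNE = "fine-tune the pre-trained model"
    CUSTOM = "train and use a custom model"


def choose_strategy(score: float, lower: float, upper: float) -> Strategy:
    """Map a matching score to a model strategy using a threshold range."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if score >= upper:
        # At or above the range: the model fits the task well enough.
        return Strategy.USE_AS_IS
    if score > lower:
        # Strictly inside the range: a partial match worth fine-tuning.
        return Strategy.FINE_TUNE
    # At or below the range: too little overlap with any existing model.
    return Strategy.CUSTOM


if __name__ == "__main__":
    for s in (0.9, 0.6, 0.2):
        print(s, choose_strategy(s, lower=0.4, upper=0.8).value)
```

Passing the bounds as arguments reflects the statement that the thresholds are configurable by a user or an administrator.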

The request server 200 is the device of the company or individual requesting a project, that is, the client.

Through the request server 200, the client requests a project for purposes such as collecting source data or annotating data in order to generate artificial intelligence training data. The data generated through the project can be used as training data for any kind of machine learning, such as supervised learning, unsupervised learning, or reinforcement learning. Collecting source data means collecting raw data, such as recorded speech or photographs. Data annotation means adding relevant annotation data to source data such as text, photos, or videos. For example, data annotation may include, but is not limited to, finding entities in a given passage or finding similar sentences. The project types described above are only one embodiment, and various projects may be handled in the present invention according to the client's design.

To request at least one project (project A, project B, ...), the request server 200 transmits each project's work data together with work request information that includes at least one of the type of the work data, the work target (object), the work target conditions, the work type, the work completion deadline, and the requester and worker requirements (request conditions).
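One way to picture the request payload described above is as a small record type in which every field is optional, since the work request information only needs to include at least one of the listed items. The sketch below is an assumption for illustration: the class and field names (`WorkRequestInfo`, `ProjectRequest`, `data_type`, and so on) and the example values are hypothetical, and the source does not define a wire format.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class WorkRequestInfo:
    """Work request information; every field is optional."""
    data_type: Optional[str] = None            # type of the work data
    work_target: Optional[str] = None          # object to be worked on
    target_conditions: List[str] = field(default_factory=list)
    work_type: Optional[str] = None            # e.g. "labeling"
    deadline: Optional[date] = None            # work completion deadline
    requirements: List[str] = field(default_factory=list)  # request conditions

    def provided_fields(self) -> List[str]:
        """Names of the fields the client actually filled in."""
        return [name for name, value in vars(self).items()
                if value is not None and value != []]


@dataclass
class ProjectRequest:
    """One requested project: its work data plus the request information."""
    name: str
    work_data: List[str]        # e.g. paths of images or documents
    info: WorkRequestInfo


if __name__ == "__main__":
    request = ProjectRequest(
        name="project A",
        work_data=["img_001.jpg", "img_002.jpg"],
        info=WorkRequestInfo(data_type="image", work_type="labeling"),
    )
    print(request.info.provided_fields())
```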

Here, the work data may be images, such as photos or videos, or text, such as documents, and the work request information may be a structured image document or electronic document, or an unstructured text document. The work type may be a work instruction or work guide specifying what work is to be performed.

The work type may be collection, bounding, segmentation, sentiment, OCR (optical character recognition), landmark, labeling, attribute classification, tagging, section extraction, characteristic evaluation, and the like. In the present invention, however, the description is limited to the case where the work type is labeling.

The worker terminal 300 is a terminal of a member of the general public who has been selected as a worker to be assigned to an open project opened on the platform.

The selected worker uses the worker terminal 300, that is, a given terminal, to perform the work requested by the determination device 100. A plurality of worker terminals 300 may be provided, and their number is not limited. Each worker terminal 300 may participate in (be assigned to) an open project through an application, website, or the like provided by the determination device 100.

When the first worker terminal 301 and the second worker terminal 302 receive a work request for an open project on the platform from the determination device 100, they may decide whether to participate in the work for the project and give notice of that decision.

When participating in the work, the first worker terminal 301 and the second worker terminal 302 perform the requested work according to the work request and transmit the result of that work to the platform.

Each worker terminal 300 may be a computer device or telecommunication device such as a mobile terminal, mobile phone, smartphone, laptop computer, desktop computer, digital broadcasting terminal, PDA (personal digital assistant), PMP (portable multimedia player), navigation device, slate PC, tablet PC, ultrabook, or wearable device (for example, a smartwatch, smart glasses, or an HMD (head-mounted display)), but is not limited thereto.

FIG. 2 is a block diagram showing the configuration of a device for determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project, according to an embodiment of the present invention.

Referring to FIG. 2, the determination device 100 may include a communication module 101, a storage module 103, and a control module 105.

The communication module 101 transmits and receives at least one piece of information or data to and from the request server 200 and at least one worker terminal 300 in order to perform (complete) at least one project requested by the request server 200. The communication module 101 may also communicate with the request server 200, the at least one worker terminal 300, and other devices, and transmits and receives wireless signals over a communication network based on wireless Internet technologies.

Wireless Internet technologies include, for example, WLAN (Wireless LAN), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMAX (World Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), and LTE-A (Long Term Evolution-Advanced). The determination device 100 transmits and receives data according to at least one wireless Internet technology, including Internet technologies not listed above.

For short-range communication, at least one of Bluetooth™, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), UWB (Ultra Wideband), ZigBee, NFC (Near Field Communication), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, and Wireless USB (Wireless Universal Serial Bus) technologies may be used. Such a short-range wireless network (Wireless Area Network) may support wireless communication between the determination device 100 and the request server 200, and between the determination device 100 and at least one worker terminal 300. The short-range wireless network may be a wireless personal area network (Wireless Personal Area Network).

The storage module 103 stores information on open projects and on at least one previously completed project, and stores at least one process required to determine whether to introduce a machine learning model for a labeling task.

The control module 105 controls operations for determining whether to introduce a machine learning model based on the at least one process stored in the storage module 103.

Specifically, when a project for a labeling task is requested from the request server 200, the control module 105 calculates a matching score between the project and at least one pre-trained machine learning model that it holds. Based on the calculated matching score, it checks (detects) whether any of the at least one pre-trained machine learning models can be used for the requested labeling task and, depending on whether such a model exists, determines whether the work data is to be labeled automatically or manually.

If the check shows that a pre-trained machine learning model for the labeling task exists, the control module 105 performs automatic labeling based on the identified pre-trained machine learning model.

If, on the other hand, the check shows that no pre-trained machine learning model for the labeling task exists, the control module 105 compares a first cost, incurred when the labeling task is performed manually, with a second cost, incurred when the labeling task is performed automatically using a pre-trained machine learning model or a custom model, and determines whether to perform automatic or manual labeling according to the comparison result. Here, the pre-trained machine learning model considered in calculating the second cost is one whose previously calculated matching score is within a preset threshold range.

In this case, when the first cost is greater than the second cost, the control module 105 selects at least one worker and assigns the labeling task so that manual labeling is performed. When the first cost is less than or equal to the second cost, the control module 105 has automatic labeling performed using a pre-trained model or a custom model: if a pre-trained model whose previously calculated matching score is within the preset threshold range exists, that pre-trained model is used for automatic labeling; if no such model exists and all matching scores are at or below the threshold, a custom model is used for automatic labeling.

FIG. 3 is a flowchart illustrating a method of determining whether to introduce a machine learning model for a labeling task according to the characteristics of a crowdsourcing-based project, according to an embodiment of the present invention.

When the determination device 100 receives work request information from the request server 200 and is thereby requested to carry out a project for a labeling task (S201), it calculates, before opening the project on the platform, a matching score between the project and at least one pre-trained model that it holds, in order to check whether a pre-trained machine learning model that can be used to process the labeling task exists (S203).
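Claims 2 and 3 characterize the matching score as a cosine similarity between word-level vectors of the label classes taken from the work request information and from each model's specification document. A minimal sketch of that idea follows. The bag-of-words count vectors, the lower-casing, and the function names `word_vector` and `matching_score` are assumptions, since the text does not specify how the vectorization is done.

```python
import math
from collections import Counter


def word_vector(label_classes):
    """Bag-of-words counts over all words in the label class names (assumed vectorization)."""
    return Counter(word for label in label_classes for word in label.lower().split())


def matching_score(project_labels, model_labels):
    """Cosine similarity between the two word-count vectors, in the range 0 to 1."""
    a, b = word_vector(project_labels), word_vector(model_labels)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty label list has no direction, so treat it as no match.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


project = ["car", "traffic light", "pedestrian"]
print(matching_score(project, ["pedestrian", "car", "traffic light"]))  # identical classes
print(matching_score(project, ["cat", "dog"]))                          # no shared words
```

With count vectors, two label sets made of the same words score as a complete match regardless of order, which corresponds to the case where a model can be applied as is.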

Next, the determination device 100 checks, based on the calculated matching score, whether a pre-trained machine learning model for the labeling task exists, that is, it attempts to detect such a model (S205). If a pre-trained machine learning model is detected, the determination device 100 performs automatic labeling using the detected model (S207). Here, the detected pre-trained machine learning model is one that can be applied to the labeling task as is, and its matching score may indicate a complete match.

If no pre-trained machine learning model is detected, the determination device 100 calculates a first cost, incurred when the labeling task is processed manually (manual labeling), and a second cost, incurred when the labeling task is processed automatically (automatic labeling), and compares them (S209). If the comparison shows that the first cost is greater than the second cost, it selects at least one worker and requests that the labeling task be performed manually (S211).

If, on the other hand, the comparison shows that the first cost is less than or equal to the second cost, the determination device 100 checks the matching score calculated in step S203 (S213). If a pre-trained model whose matching score is within the preset threshold range exists, automatic labeling is performed based on that pre-trained model (S215); if only pre-trained models whose matching scores are all at or below the preset threshold range exist, automatic labeling is performed based on a custom model (S217).
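The flow from S205 to S217 can be sketched as a single function. This is an illustrative sketch only: the function name `decide_labeling`, the numeric values of `exact` and `threshold`, and the reading of "within the preset threshold range" as a score at or above `threshold` are assumptions, since the text gives no numbers. The cost comparison is coded in the direction this passage states it.

```python
def decide_labeling(scores, first_cost, second_cost, exact=1.0, threshold=0.7):
    """Return the labeling mode chosen by the S205 to S217 flow.

    scores: matching score of each pre-trained model, as calculated in S203.
    exact, threshold: hypothetical cut-offs; the description gives no values.
    """
    best = max(scores, default=0.0)
    if best >= exact:
        # S205/S207: a model that can be applied as is was detected.
        return "auto: pre-trained model"
    if first_cost > second_cost:
        # S211: branch direction exactly as stated in the description.
        return "manual"
    if best >= threshold:
        # S213/S215: a model within the (assumed) threshold range exists.
        return "auto: pre-trained model within threshold range"
    # S217: no model is close enough, so a custom model is used.
    return "auto: custom model"


print(decide_labeling([0.4, 1.0], first_cost=11000, second_cost=9664))
```

The comparison is kept on one line so that it is easy to adjust; note that the FIG. 5 example later in the description concludes in favor of automatic labeling when the second cost is the lower one.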

FIG. 4 is a flowchart illustrating a method for calculating the second cost in the determination device according to an embodiment of the present invention.

Referring to FIG. 4, the determination device 100 repeatedly performs a calculation operation that computes a cost and the corresponding accuracy (work accuracy), using a preset number of training data items from the total work data each time.

First, the determination device 100 performs training on a preset number of training data items taken from the total work data, based on a pre-trained machine learning model or a custom model, and calculates the resulting cost by summing the work and inspection costs for that preset number of training data items and the inspection cost for the number of work data items in the data pool (S301). Here, the number of work data items in the data pool is the total number of work data items minus the number of training data items, that is, the number of items remaining once the training data are excluded. In other words, it is the number of items that have not been used for training. Because the accuracy must be calculated using the work data in the data pool, inspection is required, and the inspection cost for the number of work data items in the data pool must therefore be included in the sum in step S301.

Next, the accuracy is calculated by inspecting the work data in the data pool (S303), and the calculated accuracy is applied to the calculation operation in the next loop.

After step S303, the number of training data items is compared with the number of work data items in the data pool, and if the number of training data items is smaller than the number of work data items in the data pool, the calculation operation of steps S301 to S305 is repeated. In loops after the first, however, the cost in step S301 is calculated with the accuracy reflected in the number of work data items in the data pool. That is, data that were worked accurately are removed from the data pool before the pool is used in the next loop.

Therefore, the higher the accuracy calculated in step S303, the fewer loops need to be repeated. In addition, as steps S301 to S305 are repeated, the accuracy stays the same or improves through iterative training.

The calculation operation of steps S301 to S303 is repeated until the number of work data items in the data pool becomes smaller than the number of training data items. Once it does, the work cost and inspection cost for the number of work data items in the data pool are summed (S307), the calculation operation is no longer repeated, and this value is added to all the values calculated in the preceding n iterations.

In this way, the second cost, incurred when automatic labeling is performed by training or developing a pre-trained model or a custom model, can be calculated.
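The loop of FIG. 4 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the device's actual implementation: the function name `second_cost` and its parameter names are hypothetical, the accuracy is held constant across loops (as in the FIG. 5 example) instead of being measured by inspection in each loop, and the total is assumed to be at least as large as the training batch.

```python
def second_cost(total, n_train, work_cost, inspect_cost, accuracy):
    """Sum the per-loop costs of steps S301 to S307 for automatic labeling."""
    pool = total - n_train  # data pool: items not used for training
    cost = 0
    while True:
        # S301: work + inspection for the training items, inspection for the pool.
        cost += n_train * (work_cost + inspect_cost) + pool * inspect_cost
        # Repeat only while the training count is smaller than the pool count.
        if not n_train < pool:
            break
        # S303: accurately worked items leave the pool, then the next
        # training batch is drawn from what remains.
        correct = round(pool * accuracy)
        # Clamping at zero is an assumption; the text does not cover this case.
        pool = max(pool - correct - n_train, 0)
    # S307: work + inspection for the items still left in the pool.
    return cost + pool * (work_cost + inspect_cost)


# Values from the FIG. 5 example: 1,000 items, 200 per training batch,
# work cost 10, inspection cost 1, accuracy 20%.
print(second_cost(total=1000, n_train=200, work_cost=10, inspect_cost=1, accuracy=0.2))  # 9664
```

Because the pool shrinks by at least one training batch per loop, a higher accuracy empties it sooner and fewer loops contribute to the sum.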

FIG. 5 is a diagram for explaining an example of calculating the first cost and the second cost in the determination device according to an embodiment of the present invention.

Referring to FIG. 5, the first cost and the second cost are each calculated for the case where the total number of work data items is 1,000, the work cost is 10 won per item, the inspection cost is 1 won per item, and the number of training data items is set to 200.

First, the first cost can be calculated by summing the work cost and the inspection cost for the total number of work data items, and can be expressed as Equation 1 below.

According to Equation 1, the first cost is 11,000 won.
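Equation 1 itself is not reproduced in this text. From the definition given for the first cost, it can be reconstructed as below. The symbols are notation introduced here for illustration: C_1 is the first cost, N the total number of work data items, c_w the work cost per item, and c_i the inspection cost per item.

```latex
C_1 = N \,(c_w + c_i) = 1000 \times (10 + 1) = 11{,}000
```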

The second cost, on the other hand, is obtained by repeatedly performing the calculation operation that sums the work and inspection costs for the number of training data items and the inspection cost for the number of work data items in the data pool, and then summing all the values calculated in those iterations. It can be expressed as Equation 2 below.

However, loops performed after the first loop must use the work data in the data pool with the accurately worked data excluded, so the accuracy is reflected in the number of work data items in the data pool. In FIG. 5, the accuracy is fixed at 20% for convenience of explanation, so the number of work data items in the data pool is 440 in loop 2 and 152 in loop 3. Because the number of work data items in the data pool is then smaller than the number of training data items, the number of work data items in the data pool in the final loop 4 is 152.
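The pool sizes quoted above follow from removing the accurately worked share of the pool and then drawing the next 200 training items. Written as a recurrence, with notation introduced here for illustration (P_k is the pool size in loop k, N the total number of work data items, n the number of training data items, and a the accuracy):

```latex
\begin{aligned}
P_1 &= N - n = 1000 - 200 = 800 \\
P_{k+1} &= (1 - a)\,P_k - n \\
P_2 &= 0.8 \times 800 - 200 = 440 \\
P_3 &= 0.8 \times 440 - 200 = 152
\end{aligned}
```

Since P_3 is smaller than n, the final loop uses the remaining 152 items unchanged.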

According to Equation 2, the second cost is 9,664 won.
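Equation 2 itself is not reproduced in this text. Using the FIG. 5 values, the sum it describes can be written out loop by loop. The expansion below is a reconstruction from the description, with C_2 introduced here as a symbol for the second cost, and is not the patent's own notation.

```latex
\begin{aligned}
C_2 &= \bigl[200\,(10+1) + 800 \cdot 1\bigr]
     + \bigl[200\,(10+1) + 440 \cdot 1\bigr]
     + \bigl[200\,(10+1) + 152 \cdot 1\bigr]
     + 152\,(10+1) \\
    &= 3000 + 2640 + 2352 + 1672 = 9{,}664
\end{aligned}
```

The first three terms are the repeated calculation operation (training items plus pool inspection), and the last term is the final step, in which the 152 items left in the pool incur both the work cost and the inspection cost.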

As a result, the first cost is 11,000 won and the second cost is 9,664 won, so it can be determined that automatic labeling costs less.

Accordingly, it is decided to proceed with the labeling task using a pre-trained machine learning model or a custom model (ML model).

The steps of a method or algorithm described in connection with the embodiments of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: determination device 200: request server
300: worker terminal 301: first worker terminal
302: second worker terminal 110: communication module
130: storage module 150: control module

Claims

1. A method for determining whether to introduce a machine learning model for a labeling task of a crowdsourcing-based project, the method being performed by a device for determining whether to introduce a machine learning model for the labeling task of the crowdsourcing-based project and comprising:
calculating a matching score between the project and at least one pre-trained machine learning model when a project for the labeling task is requested;
checking, based on the calculated matching score, whether a pre-trained machine learning model for the labeling task exists among the at least one pre-trained machine learning model; and
determining, based on a result of the checking, whether to perform automatic labeling or manual labeling according to whether a pre-trained machine learning model for the labeling task exists,
wherein, as a result of the checking,
when a pre-trained machine learning model for the labeling task exists, automatic labeling is performed based on the identified pre-trained machine learning model, and
when a pre-trained machine learning model for the labeling task does not exist, the method further comprises comparing a first cost required when the labeling task is performed manually with a second cost required when the labeling task is performed automatically using a pre-trained machine learning model whose calculated matching score is within a preset threshold range or using a custom model, and determining whether to perform automatic labeling or manual labeling according to a result of the comparison,
wherein, as a result of the comparison,
when the first cost is greater than the second cost, manual labeling is performed by assigning the labeling task to at least one worker, and
when the first cost is less than or equal to the second cost, automatic labeling is performed using a pre-trained machine learning model whose calculated matching score is within the preset threshold range among the at least one pre-trained machine learning model, or using the custom model,
wherein the first cost is obtained by summing the work cost and the inspection cost for the total number of work data items on which the labeling task is to be performed,
wherein the second cost is obtained by repeatedly performing a calculation operation, which sums the work cost and inspection cost for the number of training data items to be learned among all the work data on which the labeling task is to be performed and the inspection cost for the number of work data items in a data pool, until the number of work data items in the data pool becomes smaller than the number of training data items, and then summing the respective calculated values,
wherein one performance of the calculation operation is defined as one loop, and
wherein, when the calculation operation is repeated, the ratio of data found by inspection to have been worked accurately among the work data in the data pool is reflected as an accuracy in loops performed after the first loop.

2. The method of claim 1,
wherein the matching score is calculated based on the similarity between the label classes extracted from the work request information for the project and the label classes extracted from the specification document of each of the at least one pre-trained machine learning model.

3. The method of claim 2,
wherein the similarity is calculated by vectorizing each of the label classes at the word level and computing the cosine similarity between the resulting vectors.

(Deleted)

6. The method of claim 1,
wherein, when the number of work data items in the data pool becomes smaller than the number of training data items, a calculation operation using only the number of work data items in the data pool is performed as the final operation and the repetition then stops.

7. The method of claim 1,
wherein the number of training data items is set to a preset number, and
wherein the calculation operation using only the number of work data items in the data pool sums the work cost and the inspection cost for the number of work data items in the data pool.

8. The method of claim 7,
wherein the number of work data items in the data pool is calculated by reflecting the accuracy from the immediately preceding calculation operation in the number of data items remaining after the number of training data items is excluded from the total number of work data items.

9. A device for determining whether to introduce a machine learning model for a labeling task of a crowdsourcing-based project, the device comprising:
a communication module that communicates with a request server requesting the project and with a worker terminal, which is a terminal of at least one worker assigned to the project;
a storage module that stores information on the project and information on at least one previously completed project, and stores at least one process required to determine whether to introduce a machine learning model for the labeling task; and
a control module that controls operations for determining whether to introduce the machine learning model based on the at least one process,
wherein the control module, when a project for the labeling task is requested, calculates a matching score between the project and at least one pre-trained machine learning model, checks, based on the calculated matching score, whether a pre-trained machine learning model for the labeling task exists among the at least one pre-trained machine learning model, and determines, based on a result of the checking, whether to perform automatic labeling or manual labeling according to whether a pre-trained machine learning model for the labeling task exists,
wherein, as a result of the checking,
when a pre-trained machine learning model for the labeling task exists, the control module performs automatic labeling based on the identified pre-trained machine learning model, and
when a pre-trained machine learning model for the labeling task does not exist, the control module compares a first cost required when the labeling task is performed manually with a second cost required when the labeling task is performed automatically using a pre-trained machine learning model whose calculated matching score is within a preset threshold range or using a custom model, and controls whether automatic labeling or manual labeling is performed according to a result of the comparison,
wherein, as a result of the comparison,
when the first cost is greater than the second cost, manual labeling is performed by assigning the labeling task to at least one worker, and
when the first cost is less than or equal to the second cost, automatic labeling is performed using a pre-trained machine learning model whose calculated matching score is within the preset threshold range among the at least one pre-trained machine learning model, or using the custom model,
wherein the first cost is obtained by summing the work cost and the inspection cost for the total number of work data items on which the labeling task is to be performed,
wherein the second cost is obtained by repeatedly performing a calculation operation, which sums the work cost and inspection cost for the number of training data items to be learned among all the work data on which the labeling task is to be performed and the inspection cost for the number of work data items in a data pool, until the number of work data items in the data pool becomes smaller than the number of training data items, and then summing the respective calculated values,
wherein one performance of the calculation operation is defined as one loop, and
wherein, when the calculation operation is repeated, the ratio of data found by inspection to have been worked accurately among the work data in the data pool is reflected as an accuracy in loops performed after the first loop.

10. The device of claim 9,
wherein the matching score is calculated based on the similarity between the label classes extracted from the work request information for the project and the label classes extracted from the specification document of each of the at least one pre-trained machine learning model.

11. The device of claim 10,
wherein the similarity is calculated by vectorizing each of the label classes at the word level and computing the cosine similarity between the resulting vectors.

(Deleted)

14. The device of claim 9,
wherein, when the number of work data items in the data pool becomes smaller than the number of training data items, a calculation operation using only the number of work data items in the data pool is performed as the final operation and the repetition then stops.

15. The device of claim 9,
wherein the number of training data items is set to a preset number, and
wherein the calculation operation using only the number of work data items in the data pool sums the work cost and the inspection cost for the number of work data items in the data pool.

16. The device of claim 15,
wherein the number of work data items in the data pool is calculated by reflecting the accuracy from the immediately preceding calculation operation in the number of data items remaining after the number of training data items is excluded from the total number of work data items.

17. A computer program stored in a computer-readable recording medium for executing, in combination with a computer, the method for determining whether to introduce a machine learning model for a labeling task of a crowdsourcing-based project according to any one of claims 1 to 3 and 6 to 8.