KR102408637B1

KR102408637B1 - A recording medium on which a program for providing an artificial intelligence conversation service is recorded

Info

Publication number: KR102408637B1
Application number: KR1020190016146A
Authority: KR
Inventors: 주동원
Original assignee: 주식회사 자이냅스
Priority date: 2019-02-12
Filing date: 2019-02-12
Publication date: 2022-06-15
Also published as: KR20200103171A

Abstract

A non-volatile recording medium according to an embodiment of the present invention stores a document learning program that causes a computer to execute a document learning method. The method comprises: receiving, from a learning document database, learning document data that includes a supervised learning data set to which labels for document classification are assigned; receiving classification target document data that includes an unsupervised learning data set to which no labels are assigned; variably generating classifiers for classifying the target documents based on the learning document data; predicting labels of the classification target documents by an unsupervised learning method using the variably generated classifiers; assigning labels to the target documents according to the prediction; and integrating the labeled target documents into the learning document data to update the learning document database.

Description

A recording medium on which a program for providing an artificial intelligence conversation service is recorded

The present invention relates to a recording medium and, more specifically, to a non-volatile recording medium storing a document learning program, to be executed on a computer, that uses variable classifiers to run an efficient learning process for providing an artificial intelligence conversation service.

Artificial intelligence is transforming business, organizational operations, lifestyles, and the way people communicate. Various informatization projects are under way to provide optimal services for rapidly changing modern lifestyles and for diverse, constantly changing customer requirements.

In particular, technologies related to big data and deep learning have recently advanced rapidly, so that artificial intelligence technology is now applied to real life in certain fields. It is also applied to the analysis of specific data and to intelligent personal services that integrate, provide, and use information from various fields tailored to each individual.

In addition, as the use of social media such as Twitter, Facebook, and blogs has increased, there have been active attempts to identify sentiment information by automatically analyzing, through big data, various kinds of opinion information such as satisfaction with products or films.

In particular, companies can inform their marketing strategies by understanding how their products or services are evaluated on social media, and policy agencies can decide the direction of policy revisions and publicity methods by analyzing public opinion on policies. As this need has grown, brand monitoring services specializing in sentiment analysis through data mining have also become active.

To provide various user services such as artificial intelligence and data analysis services, data mining over very large volumes of text and other language-based information is required, and machine learning is mainly used for this purpose.

Machine learning aims to automatically assign predicted labels to data-mined documents based on previously learned data. To raise prediction accuracy, two approaches are generally used side by side: supervised learning, which trains on data whose labels are specified in advance, and unsupervised learning, which trains by inferring the relationships among unlabeled data.

However, supervised learning requires training data in which each item is individually labeled in advance, and as the volume of data surges and the types of labels diversify, the human resources and infrastructure budget needed to obtain such data are insufficient.

For example, building a large labeled data set requires either many people to label each data item or a data collection infrastructure for a specific domain, which incurs high financial costs and imposes various constraints.

Unsupervised learning, on the other hand, with current technology requires the number of classifiers used for clustering by relationship inference to be fixed in advance, which limits the range of data that can be processed. In particular, when such a classifier incorrectly infers and predicts some data, the resulting classification errors persist and lower recall.

At present, however, learning processes based on supervised or unsupervised machine learning are operated separately, each giving rise to its own problems, and no single learning process resolves the problems of both.

Korean Registered Patent Publication No. 10-1122844 (2012.02.24.) (Title of invention: System and method for determining the intent of data and responding to the data based on the intent)

The present invention has been devised to solve the problems described above. Its object is to provide a non-volatile recording medium storing a document learning program, to be executed on a computer, that uses both a labeled supervised learning data set and unlabeled unsupervised learning data, and variably generates the classifiers for unsupervised learning according to the labeled learning documents, so that documents can be classified more effectively and with high recall without excessive human resource and infrastructure costs.

To solve the above problem, a non-volatile recording medium according to an embodiment of the present invention stores a document learning program, to be executed on a computer, comprising: receiving, from a learning document database, learning document data that includes a supervised learning data set to which labels for document classification are assigned; receiving classification target document data that includes an unsupervised learning data set to which no labels are assigned; variably generating classifiers for classifying the target documents based on the learning document data; predicting labels of the classification target documents by an unsupervised learning method using the variably generated classifiers; assigning labels to the target documents according to the prediction; and integrating the labeled target documents into the learning document data to update the learning document database.

In addition, a learning apparatus for implementing the learning program of the present invention to solve the above problem is a document learning apparatus comprising: an input unit that receives, from a learning document database, learning document data including a supervised learning data set to which labels for document classification are assigned, and receives classification target document data including an unsupervised learning data set to which no labels are assigned; a classifier generation unit that variably generates classifiers for classifying the target documents based on the learning document data; a data prediction unit that predicts labels of the classification target documents by an unsupervised learning method using the variably generated classifiers; a label determination unit that assigns labels to the target documents according to the prediction; and an integrated learning processing unit that integrates the labeled target documents into the learning document data to update the learning document database.

According to an embodiment of the present invention, both a labeled supervised learning data set and unlabeled unsupervised learning data are used, and the classifiers for unsupervised learning are variably generated according to the labeled learning documents. This makes it possible to provide a non-volatile recording medium storing a document learning program that causes a computer to execute a document learning method capable of classifying documents more effectively and with high recall, without excessive human resource and infrastructure costs.

According to such an embodiment, an effective learning model database can be built even from a relatively small number of learning documents, without having to label a vast amount of training data one item at a time as in the prior art.

In addition, according to an embodiment of the present invention, classifiers are generated in proportion to the number of learning documents and used to predict the labels of the data, which reduces in advance the cases in which labels are predicted incorrectly.

Furthermore, according to an embodiment of the present invention, when a document data item is classified, a label is assigned only if all the classifiers that predicted a label predict the same single label, which has the effect of raising recall.

FIG. 1 is a conceptual diagram schematically illustrating the overall system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the document learning apparatus according to an embodiment of the present invention in more detail.
FIG. 3 is a flowchart illustrating the method of operating the document learning apparatus according to an embodiment of the present invention in more detail.
FIG. 4 is a diagram illustrating learning documents and target documents according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the preprocessing process according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the variable-classifier-based label prediction process according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram of the learning process performed by the document learning apparatus according to an embodiment of the present invention and of the resulting data integration.

The following merely illustrates the principles of the present invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and the invention is to be understood as not limited to the specifically listed embodiments and conditions.

It is also to be understood that all detailed descriptions reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, are intended to encompass their structural and functional equivalents. Such equivalents include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Thus, for example, the block diagrams herein are to be understood as conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like are to be understood as representing various processes that can be substantially represented on a computer-readable medium and performed by a computer or processor, whether or not the computer or processor is explicitly shown.

Furthermore, explicit use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM), and non-volatile memory. Other well-known, conventional hardware may also be included.

The above objects, features, and advantages will become clearer from the following detailed description taken with the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains will be able to readily practice its technical idea. In describing the invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 schematically illustrates the overall system according to an embodiment of the present invention.

Referring to FIG. 1, a system according to an embodiment of the present invention may include a document learning apparatus 100, a service providing apparatus 200, and a user terminal 300. The document learning apparatus 100 may include the learning document database 180 or be connected to it.

More specifically, the document learning apparatus 100 may perform machine learning on the learning document data stored in the learning document database 180 to generate a document learning model, and may use the generated model to automatically classify, and assign labels to, the learning target documents input from the service providing apparatus 200.

In particular, the document learning apparatus 100 according to an embodiment of the present invention receives, from the learning document database, learning document data including a supervised learning data set to which labels for document classification are assigned, and receives, from the service providing apparatus 200 or the database 180, classification target document data including an unsupervised learning data set to which no labels are assigned. Based on the learning document data, it variably generates classifiers for classifying the target documents, predicts the labels of the classification target documents by an unsupervised learning method using the variably generated classifiers, assigns labels to the target documents according to the prediction, and integrates the labeled target documents into the learning document data to update the learning document database. In this way, automatic classification and label assignment of target documents can be handled on the basis of a document learning model built from the learning document data.
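The loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the patented implementation: the names `learn_documents` and `keyword_trainer` are hypothetical, the database is modeled as an in-memory list of `(text, label)` pairs, the labeled documents are split round-robin, and the toy trainer stands in for whatever machine learning model each classifier actually uses.

```python
from typing import Callable, List, Optional, Tuple

LabeledDoc = Tuple[str, str]                      # (text, label)
Classifier = Callable[[str], Optional[str]]       # returns a label, or None if undecided
Trainer = Callable[[List[LabeledDoc]], Classifier]


def learn_documents(database: List[LabeledDoc], targets: List[str],
                    train: Trainer, n_classifiers: int) -> List[str]:
    """One pass: train, predict, assign labels, and merge into the database.

    Mutates `database` in place and returns the targets that stayed unlabeled.
    """
    # Variably generated classifiers: one per partition of the labeled data.
    parts = [database[i::n_classifiers] for i in range(n_classifiers)]
    classifiers = [train(part) for part in parts]

    newly_labeled: List[LabeledDoc] = []
    still_unlabeled: List[str] = []
    for text in targets:
        predictions = {clf(text) for clf in classifiers}
        # Assign a label only when every classifier predicts the same one.
        if len(predictions) == 1 and None not in predictions:
            newly_labeled.append((text, predictions.pop()))
        else:
            still_unlabeled.append(text)

    database.extend(newly_labeled)                # update the learning document database
    return still_unlabeled


def keyword_trainer(docs: List[LabeledDoc]) -> Classifier:
    """Toy trainer (an assumption): remembers the first label seen for each word."""
    seen = {}
    for text, label in docs:
        for word in text.lower().split():
            seen.setdefault(word, label)

    def predict(text: str) -> Optional[str]:
        for word in text.lower().split():
            if word in seen:
                return seen[word]
        return None

    return predict
```

Calling `learn_documents(db, targets, keyword_trainer, 2)` repeatedly would let documents labeled in one pass serve as training data in the next, which is the database update step the paragraph describes.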

For the supervised and unsupervised learning, existing machine learning methods may each be used, such as known supervised methods (binary classification, multi-class classification, regression) and clustering-based unsupervised methods. The variable-classifier-based learning method according to an embodiment of the present invention mixes them appropriately according to variable conditions and may therefore be called a semi-supervised learning method.

The service providing apparatus 200 may then provide the user terminal 300 with a user service based on data processing that uses the semi-supervised learning model obtained in this way. The user service may include a service that, using the learning document database 180 generated by the operation of the document learning apparatus 100 according to an embodiment of the present invention and the resulting document learning model, provides convenience to the user terminal 300 or provides analysis information requested by the user terminal 300.

For example, the services provided by the service providing apparatus 200 may include a user satisfaction service based on consultation data, a sentiment analysis service based on big data, and a hybrid chatbot service.

To provide the user satisfaction service, the service providing apparatus 200 may request the document learning apparatus 100 to assign labels that classify target document data from big data as positive or negative and to generate a model; the learning document data trained in this way can then be used to automatically classify users' satisfaction with consultations. For example, the user terminal 300 may receive from the service providing apparatus 200 analysis information indicating whether users' opinions of its product are positive or negative, derived from the big data of social networks such as Twitter and Facebook.

The service providing apparatus 200 may also provide a chatbot or consultation service to the user terminal 300. Because the way users express themselves varies constantly, a chatbot based on simple rules may be limited in how it can respond. Deep learning is used in chatbots to address this, but the training data is far too scarce to handle responses to every inquiry, existing consultation data is not classified and so cannot readily be used, and manual classification takes a long time.

However, even when unlabeled expression documents are input, the document learning apparatus 100 according to an embodiment of the present invention can classify them effectively with the variable classifiers, together with data pre-classified from existing labeled chatbot or consultation data, and can then secure the results as new training data. The chatbot or consultation service can thus be provided as a hybrid service that fuses new and existing data rather than relying on existing data alone.

Meanwhile, the user terminal 300 and the service providing apparatus 200 may be connected by wire or wirelessly through a network. To communicate with each other, the user terminal 300 and the service providing apparatus 200 may transmit and receive data over the Internet, a LAN, a WAN, a PSTN (Public Switched Telephone Network), a PSDN (Public Switched Data Network), a cable TV network, Wi-Fi, a mobile communication network, or another wireless communication network. Each user terminal 300 and the service providing apparatus 200 may include a communication module for communicating with the protocol corresponding to each network.

The user terminal 300 described herein may be a mobile phone, a smartphone, a laptop computer, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a navigation device, or the like; however, the invention is not limited to these, and the terminal may be any of various other devices capable of user input and information display.

In such a system, the user terminal 300 may be connected to the service providing apparatus 200 to receive the data-based services described above.

FIG. 2 is a block diagram illustrating the document learning apparatus according to an embodiment of the present invention in more detail.

Referring to FIG. 2, the document learning apparatus 100 according to an embodiment of the present invention may include an input unit 110, a document classification unit 120, a preprocessing unit 140, and a learning document database 180. The document classification unit 120 includes a classifier generation unit 121, classifiers 121a, a data prediction unit 125, a label determination unit 127, and an integrated learning processing unit 123.

First, the input unit 110 receives, from the learning document database, learning document data including a supervised learning data set to which labels for document classification are assigned, and receives classification target document data including an unsupervised learning data set to which no labels are assigned.

The input unit 110 may include one or more input interfaces for receiving the learning document data or the classification target document data from the service providing apparatus 200 or the learning document database 180. The target documents may be determined by the target service; examples include consultation documents, chatbot documents, movie review comments, and social media documents.

The preprocessing unit 140 may convert the target document into one or more items of sentence data to which labels can be assigned.

To this end, the preprocessing unit 140 may remove from the target document at least one word or character predefined for the label prediction and classification of the target document. That is, so that the label of an input target document can be predicted and classified more accurately, the preprocessing unit 140 may remove predefined unnecessary characters and words from it.
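As a rough illustration of this removal step, the sketch below strips a predefined set of characters and words from a document. The stop list and the punctuation pattern are assumptions chosen for the example; the patent only states that the words and characters are defined in advance, and `strip_predefined` is a hypothetical name.

```python
import re

# Hypothetical predefined word list; the real list would depend on the service.
STOP_WORDS = {"um", "uh", "please"}

# Hypothetical predefined characters: anything that is not a word character
# or whitespace. In Python 3, \w is Unicode-aware, so Korean text is kept.
STOP_CHARS = re.compile(r"[^\w\s]")


def strip_predefined(text: str) -> str:
    """Remove the predefined characters and words from a target document."""
    without_chars = STOP_CHARS.sub(" ", text)
    kept = [word for word in without_chars.split()
            if word.lower() not in STOP_WORDS]
    return " ".join(kept)
```

For instance, `strip_predefined("Um, the delivery was late!!")` drops the filler word and the punctuation and keeps the four content words.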

In addition, when the target document contains one or more paragraphs, the preprocessing unit 140 may reconstruct it into one or more sentences through morphological analysis of each paragraph. For example, a given paragraph of the target document may be separated into several sentences.
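The paragraph-to-sentence step might look like the following sketch. Note the substitution: the patent describes morphological analysis, whereas this example splits on sentence-ending punctuation, a much simpler stand-in used only to show the shape of the input and output. The function name and the blank-line paragraph convention are assumptions.

```python
import re
from typing import List

# Split after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_paragraphs(document: str) -> List[str]:
    """Break each paragraph of a document into individual sentences.

    Paragraphs are assumed to be separated by blank lines.
    """
    sentences: List[str] = []
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        for sentence in _SENTENCE_END.split(paragraph):
            sentence = sentence.strip()
            if sentence:
                sentences.append(sentence)
    return sentences
```

Each returned sentence can then be treated as one unit of sentence data to which a label may be assigned.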

Meanwhile, the classifier generation unit 121 variably generates classifiers 121a for classifying the target document based on the learning document data, and the data prediction unit 125 predicts the label of the classification target document by an unsupervised learning method using the variably generated classifiers 121a.

More specifically, the classifier generation unit 121 may variably determine the number of classifiers 121a to generate according to the amount of learning document data, where the amount of data may correspond to a preset data size interval.
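One way to map the amount of data to a classifier count through preset size intervals is a sorted boundary lookup, sketched below. The boundaries and the counts per interval are invented for illustration; the patent does not give concrete values.

```python
import bisect

# Hypothetical interval boundaries, in number of labeled learning documents.
SIZE_BOUNDARIES = [1_000, 10_000, 100_000]

# Hypothetical classifier count for each interval:
# [0, 1000) -> 1, [1000, 10000) -> 3, [10000, 100000) -> 5, 100000+ -> 7
CLASSIFIER_COUNTS = [1, 3, 5, 7]


def classifier_count(num_labeled_docs: int) -> int:
    """Return the number of classifiers for the interval the data size falls in."""
    interval = bisect.bisect_right(SIZE_BOUNDARIES, num_labeled_docs)
    return CLASSIFIER_COUNTS[interval]
```

Because the counts grow with the interval, more labeled documents yield more classifiers, in line with the stated effect that classifiers are generated in proportion to the number of learning documents.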

The classifier generation unit 121 may also determine the number of classifiers 121a to generate by performing a predetermined logic process based on the number of labeled documents in the learning document data and the current number of classifiers 121a. The logic process may be determined differently depending on the target service of the learning document data, and the target service may include at least one of a chatbot service, a sentiment analysis service, a user satisfaction service, and a consultation service.

The classifier generation unit 121 may then divide the training documents into as many parts as there are generated classifiers, so that each classifier 121a generates its own learning model through machine learning.

Accordingly, the data prediction unit 125 may predict the label of the target document through the trained classifiers 121a. For example, when five classifiers 121a have been generated, the label of the target document may be predicted by each of the five classifiers 121a.

The label determination unit 127 may then determine the label assignment for the target document based on the prediction of the data prediction unit 125.

More specifically, the label determination unit 127 may determine the accuracy of the prediction made by the data prediction unit 125 and decide whether to assign a label according to that prediction accuracy.

In particular, according to an embodiment of the present invention, for data of the target document to which a label is to be assigned, the identically predicted label may be assigned to that data only when all of the variably generated classifiers 121a predict the same label, which can greatly improve classification accuracy.

In other words, when all the classifiers 121a that predicted a label for the target document predict one and the same label, the label determination unit 127 may regard the label as accurately predicted and classify the target document under the predicted label. For example, when there are five classifiers 121a, all five classifiers 121a must predict the same label for the label of the target document to be regarded as accurately predicted.
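The unanimity rule described above can be illustrated with a minimal Python sketch. The function name `unanimous_label` and the use of `None` to mean "no label assigned" are assumptions made for illustration and do not appear in the specification.

```python
from typing import Hashable, Optional, Sequence


def unanimous_label(predictions: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the shared label if every classifier predicted it, else None.

    Hypothetical helper: None stands for "leave the document unlabeled".
    """
    if not predictions:
        return None  # no classifier voted, so nothing can be assigned
    first = predictions[0]
    # Assign only when all classifiers agree on one label.
    return first if all(p == first for p in predictions) else None


# Five classifiers, as in the example in the text.
agreed = unanimous_label([1, 1, 1, 1, 1])     # all five agree
disagreed = unanimous_label([1, 1, 0, 1, 1])  # one dissent blocks assignment
```

Because label 0 is a valid label, callers should compare the result with `is None` rather than test its truthiness.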

Meanwhile, the integrated learning processing unit 123 may update the training document database 180 by integrating the labeled target documents into the training document data. The integrated learning processing unit 123 may merge the data sets newly classified through the classifiers 121a into the existing training document data, and the merged training documents are stored in the training document database 180 and passed to the input unit 110 so that they can be used to predict new target documents, input later, whose labels have not yet been predicted.

FIG. 3 is a flowchart for explaining in more detail a method of operating a document learning apparatus according to an embodiment of the present invention.

Referring to FIG. 3, the document learning apparatus 100 according to an embodiment of the present invention first preprocesses, in the preprocessing unit 140, the target document input through the input unit 110 (S101).

During preprocessing, if paragraphs are identified in the target document (S103), the preprocessing unit 140 performs morphological analysis on the identified paragraphs and converts them into one or more sentences (S105).

The document learning apparatus 100 then obtains labeled training document data from the training document database 180, and the document classification unit 120 variably generates classifiers 121a based on it (S107).

The classifier generation unit 121 of the document learning apparatus 100 may then divide the training document data into as many parts as there are classifiers 121a, and runs the training process for each classifier 121a (S109).

Thereafter, the data prediction unit 125 of the document learning apparatus 100 predicts a label for each sentence included in the target document using the trained classifiers 121a (S111).

The label determination unit 127 of the document learning apparatus 100 then judges the accuracy of the predicted labels (S113) and determines the label classification value of the target document based on the judged labels (S115).

Accordingly, the integrated learning processing unit 123 of the document learning apparatus 100 integrates the target documents whose classification values have been determined into the existing training document data (S117).

Meanwhile, the document learning apparatus 100 determines whether the target document contains further data to be classified; if so, it repeats the process from step S107, and if not, it may end the learning process.
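The loop of steps S107 to S117 can be sketched in Python as follows. This is only an illustration under stated assumptions: `WordCountClassifier` is a toy stand-in for whatever machine-learning classifier an implementation would use, and `self_train`, `divisor`, the floor division, and the "stop when a round labels nothing" rule are hypothetical choices, not details taken from the specification.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Labeled = Tuple[str, int]


class WordCountClassifier:
    """Toy stand-in classifier: scores a sentence by per-label word counts."""

    def __init__(self) -> None:
        self.counts: Dict[int, Counter] = defaultdict(Counter)

    def fit(self, data: List[Labeled]) -> None:
        for text, label in data:
            self.counts[label].update(text.lower().split())

    def predict(self, text: str) -> Optional[int]:
        words = text.lower().split()
        scores = {lab: sum(c[w] for w in words) for lab, c in self.counts.items()}
        if not scores:
            return None
        best = max(scores.values())
        winners = [lab for lab, s in scores.items() if s == best]
        # Abstain on ties or when no known word appears.
        return winners[0] if best > 0 and len(winners) == 1 else None


def self_train(labeled: List[Labeled], targets: List[str],
               divisor: int) -> Tuple[List[Labeled], List[str]]:
    labeled, remaining = list(labeled), list(targets)
    while remaining:
        n = max(1, len(labeled) // divisor)       # S107: variable classifier count
        size = len(labeled) // n
        classifiers = []
        for k in range(n):                        # S109: one data slice per classifier
            end = len(labeled) if k == n - 1 else (k + 1) * size
            clf = WordCountClassifier()
            clf.fit(labeled[k * size:end])
            classifiers.append(clf)
        newly, still = [], []
        for text in remaining:                    # S111 to S115: predict, then decide
            votes = [clf.predict(text) for clf in classifiers]
            if votes[0] is not None and all(v == votes[0] for v in votes):
                newly.append((text, votes[0]))    # unanimous, so assign the label
            else:
                still.append(text)
        if not newly:
            break                                 # nothing more can be predicted
        labeled.extend(newly)                     # S117: merge into the training data
        remaining = still
    return labeled, remaining


seed = [("good great film", 1), ("bad awful film", 0),
        ("great fun", 1), ("awful boring", 0)]
final_labeled, leftover = self_train(
    seed, ["great story", "good story", "unknown words"], divisor=2)
```

In this toy run, "good story" cannot be labeled in the first round but is labeled in the second, after "great story" has been merged into the training data, while "unknown words" is never labeled and the loop ends.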

FIG. 4 is a diagram for explaining training documents and target documents according to an embodiment of the present invention.

Referring to FIG. 4(A), the training documents are input data and may include labeled training documents, where each label may indicate a predetermined document meaning. Referring to FIG. 4(B), the target document is a document to which no label has been assigned, and a label may be assigned with each of its sentences treated as a target document.

FIG. 5 is a diagram for explaining a preprocessing process according to an embodiment of the present invention.

FIG. 5(A) shows an initial, unlabeled target document; the preprocessing unit 140 may obtain first-pass processed target document data such as that of FIG. 5(B) by removing unnecessary characters.

The preprocessing unit 140 may then perform reconstruction that splits any first-pass processed target document data that is not a single sentence into sentences, obtaining final sentences such as those of FIG. 5(C) as the target documents.
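The two preprocessing passes can be sketched as below. Note the substitution: the specification splits sentences through morphological analysis, whereas this sketch uses a simple punctuation-based regular expression as a stand-in. The set of characters treated as unnecessary and the function name `preprocess` are likewise assumptions.

```python
import re
from typing import List

# Assumption: anything other than word characters, whitespace, and basic
# sentence punctuation counts as an "unnecessary character".
_UNWANTED = re.compile(r"[^\w\s.!?]")
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess(text: str) -> List[str]:
    """First pass: strip unwanted characters. Second pass: split into sentences."""
    cleaned = _UNWANTED.sub(" ", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return [s.strip() for s in _SENTENCE_END.split(cleaned) if s.strip()]


sentences = preprocess("Great movie! ★★★ Loved it. ~~ So good?")
```

Each returned sentence would then be treated as its own target document for label prediction.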

FIG. 6 is a diagram for explaining a variable-classifier-based label prediction process according to an embodiment of the present invention.

FIG. 6 schematizes the algorithm of a variable-classifier-based semi-supervised learning method according to an embodiment of the present invention. It shows the process that follows once the input data, namely the labeled training documents and the unlabeled target documents, have been selected and the preprocessing unit 140 has processed the target documents.

Referring to FIG. 6, the number of classifiers to generate is first determined based on the number of labeled training documents (the data amount, a data interval, or a specific threshold). This number may be obtained by dividing the number of labeled training documents by a constant that controls how many classifiers are generated.
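As a small worked illustration of this division, the sketch below uses the figures reported for the FIG. 7 test (300 labeled documents and a constant of 150). Rounding down with floor division and the lower bound of one classifier are assumptions, since the text does not say how non-integer quotients are handled; `classifier_count` is a hypothetical name.

```python
def classifier_count(num_labeled: int, divisor: int) -> int:
    """Number of classifiers = labeled documents divided by a control constant.

    Assumptions: floor division, and at least one classifier is always built.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return max(1, num_labeled // divisor)


initial = classifier_count(300, 150)  # figures from the FIG. 7 test
```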

The classifier generation unit 121 may then define the number of data items into which the input training documents are divided for training, and the classifier objects may be created freely depending on the implementation environment.

Thereafter, the classifier generation unit 121 may repeat the process of lines 11 to 13 of FIG. 6 as many times as the number of classifiers to be generated, in order to divide the training documents, train on them, and carry out label prediction.

- The training data for each generated classifier is partitioned as the data from a split start point to a split end point, and the classifier trained on that data is used to derive the predicted label value of the target document.
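The start-point and end-point partitioning can be sketched as follows. The contiguous, equal-sized slices and the rule that the last slice absorbs any remainder are assumptions; the names `split_bounds` and `partition` are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_bounds(num_items: int, num_parts: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, covering all items."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    size = num_items // num_parts
    bounds = []
    for k in range(num_parts):
        start = k * size
        # Assumption: the last part absorbs any remainder.
        end = num_items if k == num_parts - 1 else start + size
        bounds.append((start, end))
    return bounds


def partition(items: Sequence[T], num_parts: int) -> List[List[T]]:
    """Slice the training data so that each classifier gets one part."""
    return [list(items[s:e]) for s, e in split_bounds(len(items), num_parts)]


bounds = split_bounds(300, 2)
```

If the training data is ordered by label, contiguous slices could leave a classifier with only one class, so an implementation would probably shuffle the data before slicing.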

Accordingly, the above process may be repeated as many times as there are generated and trained classifiers. The classifiers predict the label value of the target document and pass it to the data prediction unit 125, and the label determination unit 127 may then proceed to judge whether the predicted label is accurate.

The process for judging whether the predicted label is accurate may be exemplified as shown in lines 15, 16, and 18 of FIG. 6.

The label determination unit 127 may judge that a label has been accurately predicted when all the classifiers predicting a label for a given target document predict one and the same label.

Also, for example, in the case of classifiers for sentiment analysis, the document may be classified as positive (1) when the sum of all predicted labels equals the number of classifiers, and as negative (0) when the sum is 0.
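For binary labels, this sum check is equivalent to requiring unanimity, as the following sketch shows. The function name `sentiment_by_sum` and the use of `None` for the undecided case are assumptions made for illustration.

```python
from typing import Optional, Sequence


def sentiment_by_sum(predictions: Sequence[int]) -> Optional[int]:
    """Positive (1) if the sum equals the classifier count, negative (0) if 0."""
    if not predictions:
        return None
    if any(p not in (0, 1) for p in predictions):
        raise ValueError("labels must be 0 (negative) or 1 (positive)")
    total = sum(predictions)
    if total == len(predictions):
        return 1   # every classifier voted positive
    if total == 0:
        return 0   # every classifier voted negative
    return None    # mixed votes: leave the document unlabeled for now


positive = sentiment_by_sum([1, 1, 1])
negative = sentiment_by_sum([0, 0])
undecided = sentiment_by_sum([1, 0, 1])
```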

Accordingly, the integrated learning processing unit 123 may integrate the newly classified target documents into the previously input training documents, and when the above process is repeated, the integrated training document data may be used again to classify target documents that remain unclassified.

In addition, when data is newly classified, retraining can make it possible to classify data that could not be classified before, so the above process may be repeated.

FIG. 7 is an exemplary diagram of the learning process performed by the operation of a document learning apparatus according to an embodiment of the present invention, together with data integration test results.

FIG. 7 shows, for each pass, the learning process of the document learning apparatus for a sentiment analysis service based on movie comment data, together with the data integration results.

Referring to FIG. 7, two initial classifiers 121a are generated by the classifier generation unit 121 from the labeled training documents. When unlabeled target documents are input to the generated classifiers 121a, each classifier 121a predicts the labels of the target documents, and it can be seen that the documents assigned labels as a result are included in the existing training document data.

A total of 149,995 items were used in this test: 300 labeled training documents and 149,695 unlabeled target documents. The labels on the training documents were set to positive (1) and negative (0). When running the variable-classifier-based semi-supervised learning apparatus, the constant was set to 150 so that two classifiers would be generated initially, allowing the two labels to be classified.

Two classifiers were generated to classify the initial target documents, and each classifier was trained on 150 labeled training documents. The number of training documents grew as the process progressed, and the number of classifiers also changed according to the predetermined logic as the number of training documents increased.

As shown in FIG. 7, for the label of a target document to be decided from the predictions of the variably generated classifiers, all of the variably generated classifiers must predict a single label (unanimity). In that case the predicted label is taken as the classification of the target document, and the document is integrated into the training documents so that the above process can be repeated. The process is repeated until the labels of all target documents have been predicted and classified, but the iteration may end when no further target document labels can be predicted.

According to this embodiment of the present invention, an effective learning model database can be built from a relatively small number of training documents, without having to label a vast amount of training data item by item as in the prior art.

In addition, according to an embodiment of the present invention, classifiers can be generated in proportion to the number of training documents to predict the labels of the data, which reduces mispredicted labels in advance. Furthermore, a label is assigned to a document only when all the classifiers that predicted a label for it predict the same label, which can increase recall.

Meanwhile, the methods according to the various embodiments of the present invention described above may be implemented as a program and provided to servers or devices while stored on various non-transitory computer readable media. Accordingly, the user terminal 100 may access a server or device and download the program.

A non-transitory readable medium is not a medium that stores data only momentarily, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, or ROM.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

100: document learning apparatus 110: input unit
120: document classification unit 121: classifier generation unit
121a: classifier 123: integrated learning processing unit
125: data prediction unit 127: label determination unit
140: preprocessing unit 180: training database
200: service providing apparatus 300: user terminal

Claims

A computer-readable non-volatile recording medium on which a document learning program for execution on a computer is recorded, the program comprising:
receiving, from a training document database, training document data including a supervised learning data set to which labels for document classification are assigned;
receiving classification target document data including an unsupervised learning data set to which no labels are assigned;
variably generating classifiers for classifying the classification target document based on the training document data;
predicting a label of the classification target document according to an unsupervised learning scheme using the variably generated classifiers;
assigning the predicted label to the classification target document, or re-predicting the label of the classification target document, according to the labels predicted by each of the variably generated classifiers; and
integrating the labeled classification target document into the training document data and updating the training document database,
wherein the receiving of the classification target document data further comprises
a preprocessing step of converting the classification target document into one or more items of sentence data to which a label can be assigned, and
wherein the assigning or re-predicting comprises
assigning the identically predicted label to the classification target document data only when the same label is predicted by all of the variably generated classifiers.

delete

The recording medium of claim 1,
wherein the preprocessing step comprises
removing, from the classification target document, at least one word or character predefined for label prediction and classification of the classification target document.

The recording medium of claim 1,
wherein the preprocessing step comprises,
when the classification target document includes one or more paragraphs, reconstructing it into one or more sentences through morphological analysis of each paragraph.