KR20210051523A

KR20210051523A - Dialogue system by automatic domain classification

Info

Publication number: KR20210051523A
Application number: KR1020190136910A
Authority: KR
Inventors: 윤동민; 양승원
Original assignee: 주식회사 솔트룩스
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2021-05-10
Also published as: KR102358485B1

Abstract

Provided is an automatic domain classification dialogue system that identifies an appropriate domain from a user's utterance. The dialogue system, which automatically classifies a plurality of domains, comprises: a user interface that receives an input sentence from a user through a network and transmits an output sentence for the input sentence to the user; a cluster generation unit including a first word embedding module that performs word embedding to convert entities into vectors and a clustering module that clusters the vectorized entities; a dialogue corpus cluster storage unit that stores a plurality of dialogue clusters generated when the cluster generation unit clusters all of the vectors obtained by word-embedding, in the first word embedding module, the entities of the plurality of dialogues forming a dialogue corpus; a domain classification learning unit including a domain classification learner that trains a domain classification prediction model using, as input and output respectively, the per-dialogue clusters and the domain of each of the plurality of dialogues, the per-dialogue clusters consisting of those dialogue clusters that correspond to the vectors obtained by word-embedding the entities of each dialogue; and a domain determination unit that determines a domain for the input sentence using an input sentence cluster consisting of a plurality of cluster keys, the cluster keys being those dialogue clusters that correspond to the vectors obtained by word-embedding the entities of the input sentence in the first word embedding module.

Description

Dialogue system by automatic domain classification

The present invention relates to a dialogue system, and more particularly to an automatic domain classification dialogue system that automatically classifies dialogue domains.

The present invention is derived from research led by IPL Co., Ltd. and carried out by Saltlux Co., Ltd. as part of the Robot Industry Core Technology Development Project (Artificial Intelligence Convergence Robot System Technology) of the Ministry of Trade, Industry and Energy. [Research period: January 1, 2019 to December 31, 2019; research management agency: Korea Evaluation Institute of Industrial Technology; research project name: Home social robot and service development system; project serial number: 10077633]

A dialogue system generates and provides a system utterance in response to a user's utterance, and it is provided together with domain dialogue models that enable dialogue in a specific dialogue domain (hereinafter, a domain) or situation. If the domain that fits the user's utterance cannot be identified, it may be difficult to generate and provide a system utterance that matches the user's intent.

An object of the present invention is to provide an automatic domain classification dialogue system capable of identifying a suitable domain from a user's utterance.

To achieve the above object, an automatic domain classification dialogue system according to one aspect of the technical idea of the present invention includes: a user interface that receives an input sentence from a user through a network and transmits an output sentence for the input sentence to the user; a cluster generation unit including a first word embedding module that performs word embedding to convert entities into vectors, and a clustering module that clusters the vectorized entities; a dialogue corpus cluster storage unit that stores a plurality of dialogue clusters generated when the cluster generation unit clusters all of the vectors obtained by word-embedding, in the first word embedding module, the entities of the plurality of dialogues forming a dialogue corpus; a domain classification learning unit including a domain classification learner that trains a domain classification prediction model using, as input and output respectively, the per-dialogue clusters and the domain of each of the plurality of dialogues, where the per-dialogue clusters consist of those dialogue clusters, among the plurality of dialogue clusters, that correspond to the vectors obtained by word-embedding the entities of each dialogue forming the dialogue corpus; and a domain determination unit that determines a domain for the input sentence using an input sentence cluster consisting of a plurality of cluster keys, the cluster keys being those dialogue clusters, among the plurality of dialogue clusters, that correspond to the vectors obtained by word-embedding the entities of the input sentence in the first word embedding module.

The system may further include a dialogue corpus domain management unit including: a domain classification unit that classifies the plurality of dialogues forming the dialogue corpus by domain to generate a plurality of domain dialogue corpora; a second word embedding module that word-embeds the entities of each of the plurality of domain dialogue corpora to generate a plurality of domain word embeddings; and a dialogue model generation unit that generates a plurality of domain dialogue models, one for each of the plurality of domain word embeddings.

The output sentence may be generated by the domain dialogue model that, among the plurality of domain dialogue models, corresponds to the domain determined for the input sentence cluster by the domain determination unit.

The second word embedding module may perform prediction-based word embedding. The second word embedding module may perform word embedding based on Word2Vec or GloVe (Global Vectors for Word Representation).

The domain classification learning unit may further include a domain classification log that stores the user's feedback on the result of determining the domain for the input sentence, and the domain classification learner may regenerate or update the domain classification prediction model using the user's feedback stored in the domain classification log.

When the user's feedback stored in the domain classification log indicates that the domain was classified correctly, the dialogue domain corpus classification unit may store the input sentence in the domain dialogue corpus for the corresponding domain among the plurality of domain dialogue corpora.
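The feedback handling described above can be pictured with a minimal Python sketch. It is illustrative only and not part of the disclosure: the names `FeedbackEntry` and `apply_feedback`, and the representation of the domain dialogue corpora as a dictionary of lists, are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEntry:
    """One hypothetical record in the domain classification log."""
    sentence: str
    predicted_domain: str
    correct: bool  # user feedback: was the predicted domain right?

def apply_feedback(log, domain_corpora):
    """Append each confirmed sentence to its domain's corpus.

    Entries the user marked as wrong are returned so that they could be
    used elsewhere, for example when the prediction model is retrained.
    """
    rejected = []
    for entry in log:
        if entry.correct:
            domain_corpora.setdefault(entry.predicted_domain, []).append(entry.sentence)
        else:
            rejected.append(entry)
    return rejected

log = [
    FeedbackEntry("Will it rain tomorrow?", "weather", True),
    FeedbackEntry("Open a savings account", "weather", False),
]
corpora = {"weather": []}
rejected = apply_feedback(log, corpora)
```

Only sentences whose classification the user confirmed are added to a domain dialogue corpus; how rejected entries are used for regenerating or updating the model is left open here, as it is in the text above.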

The first word embedding module may perform word embedding using a pre-trained language model.

The first word embedding module may perform word embedding based on ELMo (Embeddings from Language Model).

The cluster generation unit may perform density-based clustering.

The cluster generation unit may perform clustering using DBSCAN (density-based spatial clustering of applications with noise).

The automatic domain classification dialogue system according to the present invention word-embeds a user's input sentence with a pre-trained language model to convert its entities into vectors. It then generates an input sentence cluster consisting of a plurality of cluster keys, each cluster key being the dialogue cluster, among the plurality of dialogue clusters forming the dialogue corpus cluster, that corresponds to one of the vectors obtained by word-embedding the entities (the words) of the input sentence. Finally, the domain classification prediction model takes the input sentence cluster as input and provides the classified domain of the input sentence as output.

When the input sentence is word-embedded, identically written words that have different meanings in different domains could end up mapped to a single vector. To avoid this, the system performs word embedding with a pre-trained language model, for example ELMo-based word embedding, so that entities are embedded differently depending on context. In addition, domain classification is performed by generating an input sentence cluster consisting of a plurality of cluster keys that correspond to dialogue clusters generated by density-based clustering.

Accordingly, even when an input sentence is given in an open-domain environment, the automatic domain classification dialogue system according to the present invention can determine the domain of the input sentence and generate an appropriate output sentence.

FIG. 1 is a block diagram showing an automatic domain classification dialogue system according to an exemplary embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a dialogue corpus cluster generated by the cluster generation unit of an automatic domain classification dialogue system according to an exemplary embodiment of the present invention.
FIG. 3 is a block diagram illustrating a process of generating a domain classification prediction model in an automatic domain classification dialogue system according to an exemplary embodiment of the present invention.
FIG. 4 is a block diagram illustrating a process in which the cluster generation unit of a dialogue system having a plurality of dialogue domains generates a cluster for a user's input sentence, according to an exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing the main configuration of the natural language understanding unit of a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.
FIG. 6 is a block diagram showing the main configuration of the dialogue corpus domain management unit of a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.
FIG. 7 is a block diagram illustrating a process of automatic domain classification in a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to describe the present invention more completely to those of ordinary skill in the art. Because the present invention may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to the specific forms disclosed; the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar elements. In the accompanying drawings, the dimensions of structures are enlarged or reduced relative to their actual size for clarity.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

In the drawings and description below, a component shown or described as a single block, for example a "unit" or a "module", may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor.

To provide a full understanding of the configuration and effects of the present invention, preferred embodiments are described with reference to the accompanying drawings.

FIG. 1 is a block diagram showing an automatic domain classification dialogue system according to an exemplary embodiment of the present invention.

Referring to FIG. 1, the automatic domain classification dialogue system 1 (hereinafter, the dialogue system) includes a user interface (UI) 100 that receives an utterance from the user 10 through the network 20 and transmits the utterance of the dialogue system 1 to the user. The network 20 may include anything capable of exchanging data over a wired or wireless connection, such as a wired Internet service, a local area network (LAN), a wide area network (WAN), an intranet, a wireless Internet service, a mobile computing service, a wireless data communication service, a wireless Internet access service, a satellite communication service, a wireless LAN, or Bluetooth. When the network 20 connects to a smartphone, tablet, or similar device, the network 20 may be a wireless data communication service such as 3G, 4G, or 5G, a wireless LAN such as Wi-Fi, Bluetooth, or the like.

The user interface 100 may provide an interface for accessing the dialogue system 1 through, for example, a terminal used by the user 10. Through the user interface 100, the user 10 can send an utterance to the dialogue system 1 and receive the utterance of the dialogue system 1 in reply. An utterance is a sentence that the user 10 or the dialogue system 1 intends to convey to the other party during a dialogue. The utterance of the user 10 may be called the input sentence 30, and the utterance of the dialogue system 1 may be called the output sentence 40. When the input sentence 30 is a query, the output sentence 40 may be a response to that query.

The dialogue system 1 includes a cluster generation unit 200 that is connected to the user interface 100 and receives the input sentence 30. With reference to the natural language understanding unit 800, the cluster generation unit 200 may receive the input sentence 30 after natural language processing has converted it into a computer-readable form. Natural language processing may include: a morphological analysis process that analyzes a target word phrase into morphemes, the smallest units of meaning; a part-of-speech tagging process that selects the most suitable part of speech among the morphological analysis results, using information in the left and right context that serves as a hint for resolving ambiguity; a phrase-level analysis process that analyzes chunks such as noun phrases, verb phrases, and adverb phrases; a clause-level analysis process that decomposes compound and complex sentences into simple sentences; and a syntactic analysis process that determines the structure of a sentence by analyzing the hierarchical relationships among its constituents. In addition, the cluster generation unit 200 may receive from the dialogue corpus storage unit 700 the dialogues that the storage unit holds.

A corpus is language material in which texts are collected in a computer-readable form for language research, and a dialogue corpus is language material in which a plurality of dialogues are collected in a computer-readable form.

The dialogue corpus storage unit 700 may store a dialogue corpus consisting of a plurality of dialogues. The dialogue corpus storage unit 700 may store each dialogue of the dialogue corpus together with that dialogue's domain. That is, each dialogue in the stored dialogue corpus may be stored along with the domain into which it has been classified.

In some embodiments, the dialogue corpus storage unit 700 may store information about a dialogue corpus that exists outside the dialogue system 1. By referring to that information, the dialogue corpus storage unit 700 may access the external dialogue corpus through the network 20 and either load its dialogues for use by the dialogue system 1 or store them in the dialogue corpus storage unit 700.

The cluster generation unit 200 may cluster the dialogue corpus stored in the dialogue corpus storage unit 700 to generate a dialogue corpus cluster consisting of a plurality of dialogue clusters (CL in FIG. 2), and store it in the dialogue corpus cluster storage unit 250. Specifically, the cluster generation unit 200 may word-embed the entities, that is, the words of each dialogue in the stored dialogue corpus, into vectors, and then cluster all of the vectorized entities of the dialogue corpus to generate the plurality of dialogue clusters CL.

The cluster generation unit 200 may generate an input sentence cluster 350 from the input sentence 30. It does so by word-embedding the entities, that is, the words of the input sentence 30, into vectors. Specifically, the cluster generation unit 200 may refer to the plurality of dialogue clusters CL stored in the dialogue corpus cluster storage unit 250 and select the dialogue clusters CL that correspond to the vectorized entities of the input sentence 30 to form the input sentence cluster 350.

The cluster generation unit 200 may convert entities into vectors by performing word embedding with a pre-trained language model. For example, the cluster generation unit 200 may perform ELMo (Embeddings from Language Model) based word embedding. Conventional word embedding, for example prediction-based word embedding, maps identically written words that have different meanings in different domains to a single vector; ELMo-based word embedding addresses this problem by embedding a word differently depending on its context.
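The difference between a single fixed vector per word and a context-dependent vector can be illustrated with a deliberately simplified Python toy. It is not ELMo, which derives its vectors from a pre-trained bidirectional language model; here the "context" is just an average with neighboring words, and the two-dimensional table `STATIC` and both function names are made up for illustration.

```python
# Hypothetical 2-D static vectors; the values carry no real meaning.
STATIC = {
    "river": (1.0, 0.0),
    "bank": (0.5, 0.5),
    "loan": (0.0, 1.0),
}

def static_embed(tokens):
    """Look up one fixed vector per surface form, ignoring context."""
    return [STATIC[t] for t in tokens]

def contextual_embed(tokens, window=1):
    """Toy stand-in for a contextual embedder.

    Each token's vector is averaged with the vectors of its neighbors
    within `window`, so the same word gets different vectors in
    different sentences.
    """
    base = static_embed(tokens)
    out = []
    for i in range(len(tokens)):
        ctx = base[max(0, i - window): i + window + 1]
        out.append(tuple(sum(c) / len(ctx) for c in zip(*ctx)))
    return out

# "bank" has one static vector but two different contextual vectors.
same = static_embed(["river", "bank"])[1] == static_embed(["bank", "loan"])[0]
differs = contextual_embed(["river", "bank"])[1] != contextual_embed(["bank", "loan"])[0]
```

With a static table, "bank" in "river bank" and "bank loan" is indistinguishable; once context enters the vector, the two occurrences separate, which is the property the clustering step relies on.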

The cluster generation unit 200 may generate the plurality of dialogue clusters CL by performing density-based clustering on all of the vectorized entities of the dialogue corpus. For example, the cluster generation unit 200 may cluster all of the vectorized entities of the dialogue corpus using DBSCAN (density-based spatial clustering of applications with noise).
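For readers unfamiliar with DBSCAN, the following pure-Python sketch shows the general algorithm on toy two-dimensional points. It is illustrative only: the parameter names `eps` and `min_pts`, the Euclidean distance, and the sample data are assumptions, and the disclosure does not specify parameter values or an implementation.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Toy "entity vectors": two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in: two clusters emerge from the density of the data, and the isolated point is left as noise rather than being forced into a group.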

When the cluster generation unit 200 word-embeds each dialogue of the dialogue corpus, or the input sentence 30, that dialogue or input sentence is converted into a sentence vector consisting of the vectors obtained by word-embedding its entities, that is, the words it contains.

Among the plurality of dialogue clusters CL forming the dialogue corpus cluster, the dialogue cluster CL that corresponds to each vector obtained by word-embedding an entity of the input sentence 30 may be called a cluster key. The input sentence cluster 350 may consist of a plurality of cluster keys (cluster 1, cluster 2, ..., cluster K), one for each of the vectors forming the sentence vector of the input sentence 30.
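One possible way to turn a sentence vector into cluster keys is sketched below in Python. The text does not state how the corresponding dialogue cluster is selected, so the rule used here (the label of the nearest already-clustered vector, provided it lies within a radius `eps`) is an assumption, as are the function name `cluster_keys` and the toy data.

```python
from math import dist

def cluster_keys(sentence_vectors, clustered_points, labels, eps):
    """Map each entity vector to the label of its nearest clustered point.

    Points labeled as noise (negative labels) are ignored, and a vector
    farther than eps from every clustered point gets None.
    """
    keys = []
    for v in sentence_vectors:
        best_label, best_d = None, eps
        for p, label in zip(clustered_points, labels):
            if label < 0:
                continue  # skip noise points
            d = dist(v, p)
            if d <= best_d:
                best_label, best_d = label, d
        keys.append(best_label)
    return keys

# Stored dialogue clusters (toy data) and a three-entity input sentence.
clustered_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 0, 1, 1]
sentence_vectors = [(0.05, 0.02), (5.0, 5.05), (9.0, 9.0)]
input_sentence_cluster = cluster_keys(sentence_vectors, clustered_points, labels, eps=0.5)
print(input_sentence_cluster)  # [0, 1, None]
```

The resulting list of keys plays the role of the input sentence cluster: it replaces raw vectors with discrete cluster identifiers that a downstream classifier can consume.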

The domain determination unit 500 may determine the domain of the input sentence cluster 350 by referring to the domain classification prediction model 420 held by the domain classification learning unit 400. The domain classification learning unit 400 may generate the domain classification prediction model 420 by learning from the dialogue corpus cluster stored in the dialogue corpus cluster storage unit 250 and the domain of each dialogue. How the domain classification learning unit 400 generates the domain classification prediction model 420 is described in detail with reference to FIG. 3.
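The training and prediction roles described above can be sketched with a small classifier over cluster keys. The model type is not specified at this point in the text, so the multinomial naive Bayes with add-one smoothing used below is a stand-in chosen for brevity; the function names, domain labels, and training data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train(dialogue_clusters, domains):
    """Count cluster keys per domain from (cluster-key list, domain) pairs."""
    key_counts = defaultdict(Counter)
    domain_counts = Counter(domains)
    vocab = set()
    for keys, domain in zip(dialogue_clusters, domains):
        key_counts[domain].update(keys)
        vocab.update(keys)
    return key_counts, domain_counts, vocab

def predict(model, keys):
    """Return the domain with the highest smoothed log-probability."""
    key_counts, domain_counts, vocab = model
    total = sum(domain_counts.values())

    def score(domain):
        counts = key_counts[domain]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        s = log(domain_counts[domain] / total)
        for k in keys:
            s += log((counts[k] + 1) / denom)
        return s

    return max(domain_counts, key=score)

# Input: per-dialogue cluster keys. Output: the dialogue's domain.
dialogue_clusters = [[0, 1, 1], [1, 2], [3, 4], [4, 4, 5]]
domains = ["weather", "weather", "banking", "banking"]
model = train(dialogue_clusters, domains)
print(predict(model, [1, 2]))  # weather
print(predict(model, [4, 5]))  # banking
```

The point of the sketch is the data flow: per-dialogue cluster keys are the model input and the stored domain is the target, and at run time the input sentence cluster is passed through the same model.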

The word-embedded result of the input sentence 30, whose domain has been determined by the domain determination unit 500 using the input sentence cluster 350, is fed to the dialogue corpus domain management unit 600 to generate the utterance of the dialogue system 1 for the input sentence 30. That utterance is rendered as the output sentence 40 with reference to the natural language generation unit 900, and the user interface 100 may provide the output sentence 40 to the user 10 through the network 20.

The dialogue corpus domain management unit 600 may classify the plurality of dialogues in the dialogue corpus stored in the dialogue corpus storage unit 700 by domain, word-embed them per domain, generate a dialogue model per domain, and store the results. The dialogue corpus domain management unit 600 may perform prediction-based word embedding to word-embed the dialogues of the stored dialogue corpus per domain, for example based on Word2Vec or GloVe (Global Vectors for Word Representation). The dialogue corpus domain management unit 600 may then use the per-domain word-embedded dialogues to generate a dialogue model for each domain.

For the word-embedded result of the input sentence 30 whose domain has been determined by the domain determination unit 500, the utterance of the dialogue system 1 in response to the input sentence 30 may be generated using the dialogue model for that domain held by the dialogue corpus domain management unit 600.
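The routing step, in which the determined domain selects which domain dialogue model answers, can be sketched as a simple dispatch table in Python. The callables standing in for domain dialogue models, the fallback message, and the name `respond` are all hypothetical; a real model would consume the word-embedded sentence rather than a raw string.

```python
def respond(input_sentence, domain, domain_models,
            fallback="Sorry, I did not understand."):
    """Generate a reply with the dialogue model registered for `domain`.

    If no model exists for the domain, a fallback reply is returned.
    """
    model = domain_models.get(domain)
    if model is None:
        return fallback
    return model(input_sentence)

# Hypothetical stand-ins for per-domain dialogue models.
domain_models = {
    "weather": lambda s: "Here is the forecast.",
    "banking": lambda s: "Here is your account summary.",
}

reply = respond("Will it rain tomorrow?", "weather", domain_models)
```

Keeping one model per domain behind a lookup means that domains can be added or retrained independently of the domain classifier.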

FIG. 2 is a conceptual diagram illustrating the dialogue corpus clusters generated by the cluster generation unit of the automatic domain classification dialogue system according to an exemplary embodiment of the present invention.

Referring to FIG. 2, the dialogue corpus cluster storage unit 250 may store a plurality of dialogue clusters CL. The plurality of dialogue clusters CL may be generated by density-based clustering, for example DBSCAN, over all of the vectorized entities of a dialogue corpus consisting of a plurality of dialogues.

Because DBSCAN does not require the number of clusters to be specified in advance, the vectorized entities form dialogue clusters CL in which words of similar meaning are grouped together according to the meaning of each word in the dialogue corpus. Vectorized entities that do not lie within the set radius of one another form separate clusters. This compensates for a problem that arises when the number of clusters is fixed: embedding vectors with entirely different meanings are grouped together merely because they are the closest available in the vector space, which degrades the meaning of each cluster.
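The behavior described above can be illustrated with a minimal DBSCAN sketch in plain Python. The parameter names `eps` (the set radius) and `min_pts` (the minimum neighborhood size), the Euclidean distance, and the toy two-dimensional points are assumptions made for illustration; the specification does not state which distance measure or parameter values are used.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within radius eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; the cluster count is not given in advance."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Toy "embedding vectors": two dense groups and one isolated point.
vectors = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
labels = dbscan(vectors, eps=0.5, min_pts=2)
```

In this toy run the two dense groups become two clusters and the isolated vector is left as noise rather than being forced into the nearest cluster, which is the property the paragraph above relies on.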

FIG. 3 is a block diagram illustrating the process of generating the domain classification prediction model in the automatic domain classification dialogue system according to an exemplary embodiment of the present invention.

Referring to FIG. 3, the dialogue corpus storage unit 700 may store each dialogue (dialogue X) of the plurality of dialogues together with the domain (domain Y) of that dialogue. That is, the dialogue corpus storage unit 700 may store a plurality of dialogues whose domains are known.

The cluster generation unit 200 may include a first word embedding module 210 and a clustering module 220. The first word embedding module 210 may simply be called a word embedding module when it does not need to be distinguished from the second word embedding module (630 in FIG. 6) included in the dialogue corpus domain management unit (600 in FIG. 6).

The first word embedding module 210 may perform word embedding with a pre-trained language model to generate a vector for each entity. For example, the first word embedding module 210 may perform ELMo (Embeddings from Language Models)-based word embedding. Conventional word embedding, for example prediction-based word embedding, maps a word that is spelled the same but has different meanings in different domains to a single vector. ELMo-based word embedding addresses this problem by embedding a word differently depending on its context.

The first word embedding module 210 may word-embed, as vectors, the entities, that is, the words that make up each of the plurality of dialogues forming the dialogue corpus stored in the dialogue corpus storage unit 700.

The clustering module 220 may perform density-based clustering on the vectorized entities, for example using DBSCAN (density-based spatial clustering of applications with noise).

The clustering module 220 may perform density-based clustering over all of the vectorized entities that the first word embedding module 210 generated for the dialogue corpus, consisting of a plurality of dialogues, stored in the dialogue corpus storage unit 700. It thereby generates a plurality of dialogue clusters (CL in FIG. 2) and may store them in the dialogue corpus cluster storage unit 250.

The per-dialogue cluster combination unit 280 replaces each entity, that is, each word of each of the plurality of dialogues forming the dialogue corpus, with the dialogue cluster CL that contains it, that is, with a cluster key. This yields, for each dialogue, a per-dialogue cluster consisting of a plurality of cluster keys. The unit then generates a query type learning set 410 consisting of pairs of the per-dialogue cluster and the domain for each of the plurality of dialogues.
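As a rough illustration of how such a learning set could be assembled, the sketch below replaces each entity with a cluster key and pairs the resulting key list with the dialogue's domain. The function name `build_learning_set` and the dictionary `entity_to_cluster` are hypothetical. The sketch also simplifies in two ways that the specification does not state: entities are looked up by surface word rather than by contextual vector, and entities without a cluster are skipped.

```python
def build_learning_set(dialogues, entity_to_cluster):
    """Turn (entities, domain) dialogues into (cluster keys, domain) pairs.

    dialogues: list of (list of entity strings, domain label)
    entity_to_cluster: hypothetical lookup from entity to cluster key
    """
    learning_set = []
    for entities, domain in dialogues:
        # Replace every entity that has a cluster with its cluster key.
        keys = [entity_to_cluster[e] for e in entities if e in entity_to_cluster]
        learning_set.append((keys, domain))
    return learning_set


# Toy data: cluster keys are small integers, domains are plain strings.
entity_to_cluster = {"rain": 3, "tomorrow": 5, "loan": 8, "rate": 9}
dialogues = [
    (["rain", "tomorrow"], "weather"),
    (["loan", "rate", "zzz"], "banking"),  # "zzz" has no cluster and is skipped
]
learning_set = build_learning_set(dialogues, entity_to_cluster)
```

Each element of `learning_set` is one input/output pair of the kind the domain classification learning unit 400 consumes.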

The domain classification learning unit 400 includes a domain classification model learner 450 that is trained on the query type learning set 410 to generate a domain classification prediction model 420. The domain classification prediction model 420 may perform domain classification based on preset domain classification operation rules 430 entered by an administrator. The domain classification log 490 may store feedback from the user (10 in FIG. 1) on the domain classification.

That is, the domain classification model learner 450 may generate the domain classification prediction model 420 using the per-dialogue clusters, each consisting of a plurality of cluster keys, as input and the domains as output. The domain classification prediction model 420 can therefore classify the domain of the input sentence (30 in FIG. 1) using the input sentence cluster 350 shown in FIG. 1 as input.
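The specification does not say which learning algorithm the learner uses. Purely as an assumed stand-in, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on (cluster keys, domain) pairs and then predicts a domain from the cluster keys of an input sentence. The names `train` and `predict` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(learning_set):
    """Count domains and cluster keys per domain from (keys, domain) pairs."""
    domain_counts = Counter()
    key_counts = defaultdict(Counter)
    vocab = set()
    for keys, domain in learning_set:
        domain_counts[domain] += 1
        key_counts[domain].update(keys)
        vocab.update(keys)
    return domain_counts, key_counts, vocab


def predict(model, keys):
    """Return the domain with the highest smoothed log-probability."""
    domain_counts, key_counts, vocab = model
    total = sum(domain_counts.values())
    best, best_score = None, float("-inf")
    for domain, n in domain_counts.items():
        denom = sum(key_counts[domain].values()) + len(vocab)
        score = log(n / total)
        for k in keys:
            if k in vocab:  # cluster keys never seen in training are ignored
                score += log((key_counts[domain][k] + 1) / denom)
        if score > best_score:
            best, best_score = domain, score
    return best


# Toy learning set: cluster keys as input, domain as output.
model = train([
    ([1, 2, 3], "weather"),
    ([1, 2], "weather"),
    ([7, 8], "banking"),
    ([8, 9, 7], "banking"),
])
domain = predict(model, [2, 3])  # cluster keys of a new input sentence
```

Any classifier that accepts a bag or sequence of cluster keys could take the place of this stand-in; the point is only that the input is cluster keys rather than raw words.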

FIG. 4 is a block diagram illustrating the process by which the cluster generation unit of a dialogue system having a plurality of dialogue domains generates a cluster for a user's input sentence according to an exemplary embodiment of the present invention.

Referring to FIG. 4, the input sentence 30 that the user interface 100 receives from the user 10 through the network 20 undergoes natural language processing with reference to the natural language understanding unit 800, is converted into a computer-readable form, and is passed to the cluster generation unit 200.

The first word embedding module 210 of the cluster generation unit 200 word-embeds the entities, that is, the words of the input sentence 30, as vectors. The clustering module 220 then refers to the plurality of dialogue clusters (CL in FIG. 2) stored in the dialogue corpus cluster storage unit 250 and selects the dialogue clusters CL corresponding to the vectorized entities of the input sentence 30 to generate the input sentence cluster 350.

The first word embedding module 210 may perform word embedding with a pre-trained language model to generate a vector for each entity. For example, the first word embedding module 210 may perform ELMo (Embeddings from Language Models)-based word embedding. Conventional word embedding, for example prediction-based word embedding, maps a word that is spelled the same but has different meanings in different domains to a single vector. ELMo-based word embedding is a method that addresses this problem by embedding a word differently depending on its context.

The clustering module 220 may generate the input sentence cluster 350 by selecting the plurality of dialogue clusters CL that correspond to the vectorized entities of the input sentence 30. Accordingly, the plurality of cluster keys (cluster 1, cluster 2, ..., cluster K) that make up the input sentence cluster 350 may have a clustering result similar to one obtained by density-based clustering.

That is, the vectorized entities of the input sentence 30 alone may be too few for density-based clustering. However, the clustering module 220 selects, for the vectorized entities of the input sentence 30, the corresponding dialogue clusters CL that were themselves obtained by density-based clustering. The plurality of cluster keys (cluster 1, cluster 2, ..., cluster K) that make up the input sentence cluster 350 may therefore have a clustering result similar to one obtained by density-based clustering.
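The specification does not define how a vector is matched to its "corresponding" dialogue cluster. One plausible reading, sketched below as an assumption, is to give each input vector the cluster key of its nearest already-clustered corpus vector, provided that vector lies within the same radius `eps` used for clustering. The function name `assign_cluster_keys`, the `-1` noise label, and the toy points are hypothetical.

```python
from math import dist


def assign_cluster_keys(sentence_vectors, clustered_points, labels, eps):
    """Map each input vector to the cluster key of its nearest corpus vector.

    Returns None for a vector with no clustered corpus vector within eps.
    Corpus vectors labeled -1 (noise) are never used as a match.
    """
    keys = []
    for v in sentence_vectors:
        best_key, best_d = None, eps
        for p, label in zip(clustered_points, labels):
            d = dist(v, p)
            if label != -1 and d <= best_d:
                best_key, best_d = label, d
        keys.append(best_key)
    return keys


# Corpus vectors with labels from an earlier clustering run (toy values).
corpus_points = [(0, 0), (0, 0.1), (5, 5), (5, 5.1)]
corpus_labels = [0, 0, 1, 1]

# Three entity vectors of an input sentence; the last is far from every cluster.
input_keys = assign_cluster_keys(
    [(0.05, 0.05), (5, 5.05), (10, 10)], corpus_points, corpus_labels, eps=0.5
)
```

Because the lookup reuses clusters built from the whole corpus, it works even when the sentence contributes only a handful of vectors, which is the situation the paragraph above describes.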

FIG. 5 is a block diagram showing the main components of the natural language understanding unit of a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.

Referring to FIG. 5, the natural language understanding unit 800 may include a natural language analysis unit 810 that interprets the natural-language input sentence 30 received by the user interface 100. The natural language analysis unit 810 may include a morpheme analysis unit 811, a syntax analysis unit 812, a named entity analysis unit 813, a filtering analysis unit 814, an intent classification unit 815, a domain analysis unit 816, and a semantic role labeling (SRL) unit 817.

The morpheme analysis unit 811 may split the input sentence 30 into morphemes, the smallest units of meaning. The syntax analysis unit 812 and the named entity analysis unit 813 may perform syntax analysis and named entity analysis, respectively, on the input sentence 30 split into morphemes.

The syntax analysis unit 812 may perform the following processes: a part-of-speech tagging process, which selects the most suitable part of speech from among the morpheme analysis results using disambiguation cues in the left and right context; a phrase-level analysis process, which analyzes chunks such as noun phrases, verb phrases, and adverbial phrases; a clause-level analysis process, which breaks compound and complex sentences down into simple sentences; and a parsing process, which determines the structure of the sentence by analyzing the hierarchical relationships among its constituents.

The filtering analysis unit 814 may generate a simplified sentence by removing unnecessary features from the input sentence 30.

The intent classification unit 815 and the domain analysis unit 816 may perform intent classification and domain analysis, respectively, of the query to which semantic roles have been assigned, based on the simplified sentence generated by the filtering analysis unit 814. The intent classification unit 815 may detect an intent from the natural language processing result of the input sentence 30. An intent is the intention behind an utterance, and a number of paraphrased sentences can be registered for a single intent. Intent detection (Intent Classification or Intent Detection) is the process of finding, among the intents set up for the dialogue, a sentence with high similarity in order to determine the meaning of a given sentence.

The domain analysis performed by the domain analysis unit 816 may be referred to when the domain classification prediction model (420 in FIG. 1) performs domain classification on the input sentence 30. The semantic role labeling (SRL) unit 817 may assign semantic roles to the input sentence 30. Semantic role labeling may mean assigning labels such as E (Entity), P (Property), Q (Quantity), and C (Class Information) to the morphemes of the input sentence 30.

The natural language understanding unit 800 may further include a query classification unit 820 that receives the natural language processing result from the natural language analysis unit 810 and classifies the input sentence 30. The query classification unit 820 may generate semantic pattern information for the input sentence 30. A semantic pattern is the composition of the labels over the whole input sentence 30, for example an EP (Entity, Property) pattern, an EPP (Entity, two Properties) pattern, or an EPQ (Entity, Property, Quantity) pattern. The semantic pattern information generated by the query classification unit 820 may be used to generate the output sentence 40 for the input sentence 30.
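To make the notion of a semantic pattern concrete, the sketch below derives a pattern string by concatenating the labels of the labeled morphemes in sentence order. This derivation rule, the function name `semantic_pattern`, and the example morphemes are assumptions for illustration; the specification names the patterns but does not describe how they are computed.

```python
SEMANTIC_LABELS = {"E", "P", "Q", "C"}  # Entity, Property, Quantity, Class Information


def semantic_pattern(labeled_morphemes):
    """Concatenate the semantic labels of a sentence in order, skipping unlabeled morphemes."""
    return "".join(
        label for _, label in labeled_morphemes if label in SEMANTIC_LABELS
    )


# Hypothetical output of semantic role labeling for a question about a city.
pattern = semantic_pattern([
    ("Seoul", "E"),
    ("population", "P"),
    ("area", "P"),
    ("?", None),  # unlabeled morpheme
])
```

Under this assumed rule the example yields the EPP pattern, which a later stage could use to pick a matching answer template.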

FIG. 6 is a block diagram showing the main components of the dialogue corpus domain management unit of a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.

Referring to FIG. 6, the dialogue corpus domain management unit 600 may include a dialogue corpus domain classification unit 610, a second word embedding module 630, and a dialogue model generation unit 650. The second word embedding module 630 may simply be called a word embedding module when it does not need to be distinguished from the first word embedding module 210 included in the cluster generation unit (200 in FIG. 3).

The dialogue corpus domain classification unit 610 classifies the plurality of dialogues forming the dialogue corpus stored in the dialogue corpus storage unit 700 by domain into a plurality of domain dialogue corpora 620. When the plurality of dialogues forming the dialogue corpus stored in the dialogue corpus storage unit 700 have N domains, the plurality of domain dialogue corpora 620 may consist of N domain dialogue corpora, namely a first domain dialogue corpus through an N-th domain dialogue corpus.

The second word embedding module 630 may word-embed, as vectors, the entities, that is, the words of the dialogues included in each of the plurality of domain dialogue corpora 620. The second word embedding module 630 may perform word embedding separately for each domain dialogue corpus, each of which consists of the dialogues classified into one domain, to generate a plurality of domain word embeddings 640. When the plurality of dialogues forming the dialogue corpus stored in the dialogue corpus storage unit 700 have N domains, that is, when the plurality of domain dialogue corpora 620 consist of a first domain dialogue corpus through an N-th domain dialogue corpus, the plurality of domain word embeddings 640 may include N domain word embeddings, namely a first domain word embedding through an N-th domain word embedding, one for each of the N domains.

The second word embedding module 630 may generate the plurality of domain word embeddings 640 by prediction-based word embedding, for example based on Word2Vec or GloVe (Global Vectors for Word Representation), performed separately for each domain dialogue corpus. Because the plurality of domain dialogue corpora 620 are already separated by domain, the problem of a word that is spelled the same but has different meanings in different domains being mapped to a single vector does not arise. The second word embedding module 630 can therefore use prediction-based word embedding to generate the plurality of domain word embeddings 640.

The dialogue model generation unit 650 may generate a plurality of domain dialogue models 660, one for each of the plurality of domain word embeddings 640 generated per domain. When the plurality of dialogues forming the dialogue corpus stored in the dialogue corpus storage unit 700 have N domains, that is, when the plurality of domain dialogue corpora 620 consist of a first domain dialogue corpus through an N-th domain dialogue corpus and the plurality of domain word embeddings 640 consist of a first domain word embedding through an N-th domain word embedding, the dialogue model generation unit 650 may generate N domain dialogue models, namely a first domain dialogue model through an N-th domain dialogue model, one for each of the N domains. Each of the N domain dialogue models included in the plurality of domain dialogue models 660 may be selected according to the domain corresponding to the input sentence (30 in FIG. 1) to generate the output sentence (40 in FIG. 1).
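The per-domain flow described for FIG. 6 (split the corpus by domain, build one model per domain, then select the model by the determined domain) can be sketched as follows. All function names are hypothetical, and `toy_build_model` is only a stand-in for the embedding and dialogue model generation steps, whose internals the specification does not detail.

```python
from collections import defaultdict


def split_by_domain(dialogues):
    """Group (text, domain) dialogues into one corpus per domain."""
    corpora = defaultdict(list)
    for text, domain in dialogues:
        corpora[domain].append(text)
    return dict(corpora)


def build_domain_models(corpora, build_model):
    """Build one dialogue model per domain corpus using the supplied builder."""
    return {domain: build_model(texts) for domain, texts in corpora.items()}


def respond(models, domain, sentence):
    """Select the dialogue model for the determined domain and generate a reply."""
    return models[domain](sentence)


def toy_build_model(texts):
    """Stand-in builder: the 'model' only reports how many dialogues it was built from."""
    return lambda sentence: f"[{len(texts)} dialogues] reply to: {sentence}"


corpora = split_by_domain([
    ("will it rain tomorrow", "weather"),
    ("is it cold today", "weather"),
    ("what is the loan rate", "banking"),
])
models = build_domain_models(corpora, toy_build_model)
reply = respond(models, "weather", "hi")
```

Keeping the models in a mapping keyed by domain means that adding an (N+1)-th domain only adds one entry and leaves the other models untouched.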

FIG. 7 is a block diagram illustrating the process of automatic domain classification in a dialogue system having a plurality of dialogue domains according to an exemplary embodiment of the present invention.

Referring to FIG. 7, the input sentence 30 entered by the user 10 through the network 20 may be converted into the input sentence cluster 350 as described with reference to FIG. 4. The domain determination unit 500 may determine the domain for the input sentence cluster 350 with reference to the domain classification prediction model 420. Specifically, the domain determination unit 500 may determine the domain of the input sentence 30 by applying the domain classification prediction model 420 to the plurality of cluster keys (cluster 1, cluster 2, ..., cluster K) of the input sentence cluster 350.

For the domain classification prediction model 420, the per-dialogue cluster combination unit 280 replaces each entity, that is, each word of each of the plurality of dialogues forming the dialogue corpus, with the dialogue cluster CL that contains it, that is, with a cluster key. It thereby generates per-dialogue clusters, each consisting of a plurality of cluster keys, and a query type learning set 410 consisting of pairs of the per-dialogue cluster and the domain for each of the plurality of dialogues.

The domain classification prediction model 420 may be generated by the domain classification model learner (450 in FIG. 3) using the query type learning set (410 in FIG. 3). The query type learning set 410 may consist of pairs of a per-dialogue cluster and a domain, where each per-dialogue cluster consists of a plurality of cluster keys obtained by replacing each entity, that is, each word of each dialogue in the dialogue corpus, with the dialogue cluster (CL in FIG. 2) that contains it.

Because the domain classification model learner 450 generates the domain classification prediction model 420 using the per-dialogue clusters, each consisting of a plurality of cluster keys, as input and the domains as output, the domain classification prediction model 420 can classify the domain of the input sentence 30 using the input sentence cluster 350 as input.

The input sentence 30 whose domain has been determined, or more precisely its natural language processing result, is passed to the domain dialogue model that corresponds to the determined domain among the plurality of domain dialogue models 660. For example, when the domain of the input sentence 30 is determined to be the second domain, the second domain dialogue model may generate, as its output, the utterance of the dialogue system (1 in FIG. 1). The utterance of the dialogue system 1 is turned into an output sentence 40 with reference to the natural language generation unit 900, and the user interface 100 may provide the output sentence 40 to the user 10 through the network 20.

Feedback on the domain classification from the user 10 who received the output sentence 40 may be stored in the domain classification log 490. The feedback stored in the domain classification log 490 may be used to regenerate or update the domain classification prediction model 420.

When the feedback stored in the domain classification log 490 indicates that the domain was classified correctly, the dialogue corpus domain classification unit 610 may store the input sentence 30 in the domain dialogue corpus for the corresponding domain among the plurality of domain dialogue corpora 620, for example the second domain dialogue corpus. The second domain dialogue corpus updated through the feedback of the user 10 may then be word-embedded again by the second word embedding module 630, so that the second domain word embedding among the plurality of domain word embeddings 640 is regenerated or updated. The dialogue model generation unit 650 may likewise regenerate or update the second domain dialogue model, which is the dialogue model for the second domain among the plurality of domain dialogue models 660.
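The feedback loop can be sketched as a small update step: sentences whose classification the user confirmed are appended to the matching domain corpus, and the affected domains are reported so that only their embeddings and dialogue models need to be regenerated. The log entry fields (`sentence`, `domain`, `correct`) and the function name `apply_feedback` are hypothetical; the specification does not define the log format.

```python
def apply_feedback(log, corpora):
    """Add confirmed sentences to their domain corpus.

    log: list of dicts with hypothetical keys "sentence", "domain", "correct"
    corpora: dict mapping domain -> list of sentences (updated in place)
    Returns the set of domains whose corpus changed and needs re-embedding.
    """
    changed = set()
    for entry in log:
        if entry["correct"]:
            corpora.setdefault(entry["domain"], []).append(entry["sentence"])
            changed.add(entry["domain"])
    return changed


corpora = {"weather": ["will it rain tomorrow"], "banking": ["what is the loan rate"]}
log = [
    {"sentence": "is it windy", "domain": "weather", "correct": True},
    {"sentence": "open an account", "domain": "weather", "correct": False},
]
to_rebuild = apply_feedback(log, corpora)
```

Only the domains in `to_rebuild` would then be passed back through word embedding and dialogue model generation, which mirrors the per-domain regeneration described above.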

Referring to FIGS. 1 to 7 together, the automatic domain classification dialogue system 1 according to the present invention performs word embedding on the input sentence 30 of the user 10 with a pre-trained language model to generate a vector for each entity. It then generates an input sentence cluster 350 consisting of a plurality of cluster keys (cluster 1, cluster 2, ..., cluster K). These cluster keys are the dialogue clusters CL, among the plurality of dialogue clusters CL forming the dialogue corpus clusters, that correspond to the respective vectors obtained by word-embedding the entities, that is, the words of the input sentence 30. Finally, the domain classification prediction model 420 takes the input sentence cluster 350 as input, classifies the domain of the input sentence 30, and provides it as output.

Accordingly, in word-embedding the input sentence 30, the system uses a pre-trained language model, for example ELMo-based word embedding, to generate a vector for each entity. This allows words to be embedded differently depending on context, which addresses the problem of a word that is spelled the same but has different meanings in different domains being mapped to a single vector.

In addition, the system performs domain classification by generating an input sentence cluster 350 consisting of a plurality of cluster keys (cluster 1, cluster 2, ..., cluster K) that correspond to the plurality of dialogue clusters CL generated by density-based clustering.

Accordingly, even when an input sentence 30 is given in an open-domain environment, the automatic domain classification dialogue system 1 according to the present invention can determine the domain of the input sentence 30 and generate an appropriate output sentence 40.

Although the present invention has been described in detail above with reference to preferred embodiments, the present invention is not limited to those embodiments, and various modifications and changes may be made by those of ordinary skill in the art within the technical spirit and scope of the present invention.

1: automatic domain classification dialogue system, 10: user, 20: network, 30: input sentence, 40: output sentence, 100: user interface, 200: cluster generation unit, 210: first word embedding module, 220: clustering module, 250: dialogue corpus cluster storage unit, CL: dialogue cluster, 280: per-dialogue cluster combination unit, 350: input sentence cluster, 400: domain classification learning unit, 410: query type training set, 420: domain classification prediction model, 430: domain classification operation rule, 450: domain classification model learner, 490: domain classification log, 500: domain determination unit, 600: dialogue corpus domain management unit, 610: dialogue corpus domain classification unit, 620: domain dialogue corpus, 630: second word embedding module, 640: domain word embedding, 650: dialogue model generation unit, 660: , 700: dialogue corpus, 800: natural language understanding unit, 900: natural language generation unit

Claims

An automatic domain classification dialogue system comprising:
a user interface that receives an input sentence from a user through a network and transmits an output sentence for the input sentence to the user;
a cluster generation unit including a first word embedding module that performs word embedding to generate entities as vectors, and a clustering module that clusters the vectorized entities;
a dialogue corpus cluster storage unit that stores a plurality of dialogue clusters generated by the cluster generation unit clustering all vectors obtained by word-embedding, in the first word embedding module, a plurality of dialogue entities forming a dialogue corpus;
a domain classification learning unit including a domain classification learner that trains a domain classification prediction model using, as input and output respectively, a per-dialogue cluster and a domain for each of a plurality of dialogues forming the dialogue corpus, each per-dialogue cluster consisting of the dialogue clusters, among the plurality of dialogue clusters, that correspond to the vectors obtained by word-embedding the entities constituting that dialogue; and
a domain determination unit that determines a domain for the input sentence using an input sentence cluster consisting of a plurality of cluster keys that correspond, among the plurality of dialogue clusters, to the vectors obtained by word-embedding the entities constituting the input sentence in the first word embedding module.

The automatic domain classification dialogue system of claim 1, further comprising a dialogue corpus domain management unit that includes:
a dialogue corpus domain classification unit that classifies the plurality of dialogues forming the dialogue corpus by domain to generate a plurality of domain dialogue corpora;
a second word embedding module that generates a plurality of domain word embeddings by word-embedding the entities constituting each of the plurality of domain dialogue corpora; and
a dialogue model generation unit that generates a plurality of domain dialogue models, one for each of the plurality of domain word embeddings.

The automatic domain classification dialogue system of claim 2,
wherein the output sentence is generated from the domain dialogue model that corresponds, among the plurality of domain dialogue models, to the domain determined by the domain determination unit for the input sentence cluster.

The automatic domain classification dialogue system of claim 2,
wherein the second word embedding module performs prediction-based word embedding.

The automatic domain classification dialogue system of claim 4,
wherein the second word embedding module performs word embedding based on Word2Vec or Global Vectors for Word Representation (GloVe).

The automatic domain classification dialogue system of claim 2,
wherein the domain classification learning unit further includes a domain classification log that stores the user's feedback on the result of determining the domain for the input sentence, and
the domain classification model learner regenerates or updates the domain classification prediction model using the user's feedback stored in the domain classification log.

The automatic domain classification dialogue system of claim 6,
wherein the dialogue corpus domain classification unit,
when the user's feedback stored in the domain classification log indicates that the domain was correctly classified, stores the input sentence in the domain dialogue corpus for the corresponding domain among the plurality of domain dialogue corpora.

The automatic domain classification dialogue system of claim 1,
wherein the first word embedding module performs word embedding with a pre-trained language model.

The automatic domain classification dialogue system of claim 8,
wherein the first word embedding module performs word embedding based on ELMo (Embeddings from Language Models).

The automatic domain classification dialogue system of claim 1,
wherein the cluster generation unit performs density-based clustering.

The automatic domain classification dialogue system of claim 10,
wherein the cluster generation unit performs clustering using density-based spatial clustering of applications with noise (DBSCAN).