KR101565143B1

KR101565143B1 - Feature Weighting Apparatus for User Utterance Information Classification in Dialogue System and Method of the Same

Info

Publication number: KR101565143B1
Application number: KR1020140080978A
Authority: KR
Inventors: 고영중
Original assignee: 동아대학교 산학협력단
Priority date: 2014-06-30
Filing date: 2014-06-30
Publication date: 2015-11-02

Abstract

The present invention provides a feature weight calculation apparatus and method for classifying user utterance information in a dialogue system. The distribution of each feature across categories is used effectively so that each feature has higher classification performance, allowing information such as user intent to be classified with high precision from a target that contains only a very small number of features, such as an utterance. The apparatus comprises: a training data-based learning module that builds a feature set of the training data using the probability distribution with which features extracted from a corpus annotated with utterance information appear in each speech act; a learning model generation module that generates a learning model based on the feature set built by the training data-based learning module; and an input utterance classification module that, based on the learning model generated by the learning model generation module, uses the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score.

Description

Feature weight calculation apparatus and method for classifying user utterance information in a dialogue system

The present invention relates to a feature weight calculation apparatus and method for classifying user utterance information, and more particularly to a feature weight calculation apparatus and method capable of classifying user intent (such as speech acts) with high accuracy from a target that, like an utterance, contains only very few features.

A spoken dialogue system is an intelligent technology that uses spoken dialogue processing to understand a conversation conducted to achieve a specific goal and to find and present an appropriate response. Implementing such spoken dialogue processing requires both intent analysis, which identifies the speaker's intent hidden in each utterance, and intent prediction, which is used to generate an appropriate system response.

A spoken dialogue system can therefore guarantee good performance only if it analyzes the user's intent accurately, so recognizing the user's utterance more accurately is very important. However, user utterances are very short, so the number of features that can be extracted from them is very small.

Because an utterance can have only very few features, the information classified to identify user intent, such as speech acts, has conventionally been represented with binary weights, which express only the presence or absence of a feature as 1 or 0, rather than with weights such as TF.IDF that are traditionally used in information retrieval and text mining.

However, even with binary weights, many user intent identification techniques still struggle to achieve high performance with only the small number of features that can be extracted from an utterance.

Laid-Open Patent Publication No. 2010-0111164: Spoken dialogue processing apparatus and spoken dialogue processing method for identifying a user's utterance intent
Laid-Open Patent Publication No. 2008-0109322: Method and apparatus for providing a service according to a user's intuitive intent

Accordingly, the present invention has been devised to solve the above problems. Its object is to provide a feature weight calculation apparatus and method for classifying user utterance information in a dialogue system that effectively use the per-category distribution of features so that each feature has higher classification performance, allowing information such as user intent to be classified with high accuracy from a target that, like an utterance, contains only very few features.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, a feature weight calculation apparatus for classifying user utterance information in a dialogue system according to the present invention comprises: a training data-based learning module that builds a feature set of the training data using the probability distribution with which features extracted from a corpus annotated with utterance information appear in each speech act; a learning model generation module that generates a learning model based on the feature set built by the training data-based learning module; and an input utterance classification module that, based on the learning model generated by the learning model generation module, uses the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score.

Preferably, the training data-based learning module comprises: a training corpus input unit that receives a corpus annotated with utterance information; a first utterance feature weight calculation unit that calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which features appear in each speech act in the annotated corpus received from the training corpus input unit; and a classification model generation unit that builds a feature set of the training data based on the feature weight distribution calculated from the speech act distribution by the first utterance feature weight calculation unit, and classifies the learning model using the category (request, ask, response) distribution.

Preferably, the input utterance classification module comprises: a second utterance feature weight calculation unit that, based on the training data classified by the learning model generation module, calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which the features of an input utterance appear in each speech act; and an utterance classification unit that calculates a classification score for each utterance information category based on the feature weight distribution calculated from the speech act distribution by the second utterance feature weight calculation unit, and assigns the utterance information category with the highest score.

Preferably, the first or second utterance feature weight calculation unit calculates the feature weight distribution either by using the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories, or by calculating the entropy of the distribution with which the feature occurs, per speech act, across all categories.

To achieve the above object, a feature weight calculation method for classifying user utterance information in a dialogue system according to the present invention comprises: (A) when a corpus annotated with utterance information is input, calculating feature weights for the training utterance vector representation using the probability distribution with which features appear in each speech act in the input corpus; (B) building a feature set of the training data based on the feature weights calculated from the speech act distribution, and generating a learning model by classifying the model according to the per-category distribution; and (C) based on the generated learning model, generating input utterance information by using the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score.

Preferably, step (A) calculates the feature weight distribution either by using the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories, or by calculating the entropy of the distribution with which the feature occurs, per speech act, across all categories.

Preferably, step (A) calculates the feature weight distribution using an equation (its image is not reproduced in this text) based on the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories. The symbols of the equation denote, in order: the feature; the category; the feature importance of the feature in the category; and the categories other than that category.

Preferably, step (A) calculates the feature weight distribution using two equations (their images are not reproduced in this text) by calculating the entropy of the distribution with which a feature occurs, per speech act, across all categories. Here |C| is the total number of categories, and the remaining symbols denote, in order: the conditional probability of the feature in the category; the uniform distribution value of each category; and the entropy expression.

The feature weight calculation apparatus and method for classifying user utterance information in a dialogue system according to the present invention, as described above, have the following effects.

First, now that voice dialogue interfaces such as Apple's Siri are in the spotlight as NUI/NUX technology, developing a high-accuracy intent analyzer through the present invention would increase the use of voice interfaces and could trigger a rapid rise in the use of spoken dialogue systems. In particular, it could raise the added value of domestically produced devices such as smartphones.

Second, by presenting a new technique that uses category information, such as user intent (speech acts and the like), in feature weighting, the invention offers a path to high-performance user intent analysis without additional cost.

Third, spoken dialogue systems are currently under very active development, and utterance intent analysis, such as speech act analysis, is a very important component technology that is expected to be applied to many application systems in the future.

FIG. 1 is a block diagram showing the configuration of a feature weight calculation apparatus for classifying user utterance information in a dialogue system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a user utterance vector as represented by the utterance feature weight calculation unit of FIG. 1.
FIG. 3 is a flowchart illustrating a feature weight calculation method for classifying user utterance information in a dialogue system according to an embodiment of the present invention.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

A preferred embodiment of the feature weight calculation apparatus and method for classifying user utterance information in a dialogue system according to the present invention is described below with reference to the accompanying drawings. The present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the disclosure of the invention is complete and fully conveys the scope of the invention to those of ordinary skill in the art. The embodiments described in this specification and the configurations shown in the drawings are therefore merely the most preferred embodiment of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a block diagram showing the configuration of a feature weight calculation apparatus for classifying user utterance information in a dialogue system according to an embodiment of the present invention.

Referring to FIG. 1, the feature weight calculation apparatus of the present invention consists of a training data-based learning module (100), a learning model generation module (200), and an input utterance classification module (300).

The training data-based learning module (100) builds a feature set of the training data using the probability distribution with which features extracted from a corpus annotated with utterance information appear in each speech act. To this end, the training data-based learning module (100) consists of: a training corpus input unit (110) that receives a corpus annotated with utterance information; a first utterance feature weight calculation unit (120) that calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which features appear in each speech act in the annotated corpus received from the training corpus input unit (110); and a classification model generation unit (130) that builds a feature set of the training data based on the feature weight distribution calculated from the speech act distribution by the first utterance feature weight calculation unit (120), and classifies the learning model using the category (request, ask, response) distribution.

The learning model generation module (200) generates a learning model based on the model classified by the training data-based learning module (100).

The input utterance classification module (300), based on the learning model generated by the learning model generation module (200), uses the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score. To this end, the input utterance classification module (300) consists of: a second utterance feature weight calculation unit (310) that, based on the training data classified by the learning model generation module (200), calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which the features of an input utterance appear in each speech act; and an utterance classification unit (320) that calculates a classification score for each utterance information category based on the feature weight distribution calculated from the speech act distribution by the second utterance feature weight calculation unit (310), and assigns the utterance information category with the highest score.

Meanwhile, the first and second utterance feature weight calculation units (120, 310) calculate the utterance feature weights for the training and input utterance vector representations. They calculate the feature weight distribution using the probability distribution with which the features of an utterance appear in each speech act. As shown in FIG. 2, once parts of speech have been determined through morphological analysis, the user utterance vector is represented, for grammatical expression, with lexical features (LF) and discourse features (DF).

The first and second utterance feature weight calculation units (120, 310) calculate the feature weight distribution by one of the following two methods.

The first method, shown in Equation 1, calculates the feature weight distribution using the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories.

In Equation 1, whose image is not reproduced in this text, the symbols denote, in order: the feature; the category; the feature importance of the feature in the category; and the categories other than that category.

That is, as in Equation 1, the probability estimated within the speech act in question is placed in the numerator, and the probability from the other speech act types (categories) is placed in the denominator. The more a feature occurs in the speech act in question, the higher its feature importance is set and the higher its calculated feature weight; conversely, the more it occurs in the other speech act types, the lower its feature importance is set and the lower its calculated feature weight.
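The numerator/denominator relationship described for Equation 1 can be illustrated with a minimal Python sketch. Because the equation image is not reproduced in this text, the exact formula is unknown; the plain probability ratio, the add-one style `smoothing` term, and the names `ratio_weights`, `counts`, and `totals` are all assumptions made for illustration, not the patent's definition.

```python
from collections import Counter, defaultdict


def ratio_weights(utterances, smoothing=1.0):
    """Weight each (feature, category) pair by an in-category vs. rest ratio.

    utterances: list of (features, speech_act) pairs.
    ASSUMPTION: the weight is P(feature | category) / P(feature | other
    categories) with additive smoothing; the source does not give the formula.
    """
    counts = defaultdict(Counter)  # counts[category][feature] = utterance count
    totals = Counter()             # totals[category] = number of utterances
    vocab = set()
    for features, category in utterances:
        totals[category] += 1
        for f in set(features):    # count each feature once per utterance
            counts[category][f] += 1
            vocab.add(f)

    n_all = sum(totals.values())
    weights = {}
    for c in totals:
        n_c = totals[c]
        n_rest = n_all - n_c
        for f in vocab:
            in_c = counts[c][f]
            in_rest = sum(counts[o][f] for o in totals if o != c)
            # Numerator: probability estimated inside the category.
            p_c = (in_c + smoothing) / (n_c + 2 * smoothing)
            # Denominator: probability estimated from all other categories.
            p_rest = (in_rest + smoothing) / (n_rest + 2 * smoothing)
            weights[(f, c)] = p_c / p_rest
    return weights
```

With this sketch, a feature that appears mostly in one speech act gets a weight above 1 for that speech act and below 1 for the others, which matches the qualitative behaviour the text describes.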

The second method, shown in Equation 2, calculates the feature weight distribution by calculating the entropy of the distribution with which a feature occurs, per speech act, across all categories.

In Equation 2, whose image is not reproduced in this text, |C| is the total number of categories, and the remaining symbols denote, in order: the conditional probability of the feature in the category; and the uniform distribution value of each category.

Because the denominator of Equation 2 is the entropy expression, the lower the entropy, the higher the calculated feature weight; conversely, the higher the entropy, the lower the calculated feature weight.

Thus, the second method according to Equation 2 uses the degree to which a feature's distribution differs across all categories, and the maximum entropy (MaxEntropy) is computed, as shown in Equation 2, by assuming that the occurrence distribution over all categories is uniform.
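The entropy-based method can be sketched in Python as follows. The text only states that the entropy is in the denominator and that the maximum entropy comes from a uniform distribution over the |C| categories; the specific form `max_entropy / entropy`, the normalisation of raw counts into a distribution, the `epsilon` guard against division by zero, and the function names are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_weight(category_counts, epsilon=1e-6):
    """Weight one feature from its occurrence counts in each category.

    ASSUMPTION: weight = MaxEntropy / (Entropy + epsilon). The source only
    says the entropy is the denominator and MaxEntropy assumes a uniform
    distribution; the exact equation is not reproduced there.
    """
    total = sum(category_counts)
    k = len(category_counts)
    if total == 0 or k < 2:
        return 1.0  # no evidence or a single category: neutral weight
    probs = [c / total for c in category_counts]
    max_entropy = entropy([1.0 / k] * k)  # uniform over |C| categories
    return max_entropy / (entropy(probs) + epsilon)
```

A feature spread evenly over all categories receives a weight close to 1, while a feature concentrated in a single category receives a very large weight that is bounded only by the hypothetical `epsilon`; a real system would likely cap or rescale it.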

The operation of the feature weight calculation apparatus for classifying user utterance information in a dialogue system according to the present invention, configured as above, is described in detail below with reference to the accompanying drawings. Reference numerals identical to those in FIG. 1 denote the same members performing the same functions.

FIG. 3 is a flowchart illustrating a feature weight calculation method for classifying user utterance information in a dialogue system according to an embodiment of the present invention.

Referring to FIG. 3, when a corpus annotated with utterance information is input (S10), feature weights for the training utterance vector representation are calculated using the probability distribution with which features appear in each speech act in the input annotated corpus (S20).

The feature weight distribution is calculated by one of two methods. The first, as in Equation 1 above, uses the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories. The second, as in Equation 2 above, calculates the entropy of the distribution with which the feature occurs, per speech act, across all categories.

Next, a feature set of the training data is built based on the feature weight distribution calculated from the speech act distribution, and a learning model is generated by classifying the model according to the distribution over the categories (request, ask, response) (S30).

Then, based on the generated learning model, input utterance information is generated by using the probability distribution with which the features of the input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score (S40).
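Steps S10 to S40 can be summarised in a short, self-contained Python sketch. The source does not name the classifier or give the scoring formula, so everything here is an assumption for illustration: the log of a smoothed in-category/out-of-category probability ratio as the weight (the log is taken only so that summing weights is meaningful), a sum-of-weights classification score, and the names `train` and `classify`.

```python
import math
from collections import Counter, defaultdict


def train(corpus, smoothing=1.0):
    """S10-S30: read an annotated corpus and build per-category weights.

    corpus: list of (features, speech_act) pairs.
    ASSUMPTION: the weight is a smoothed log probability ratio.
    """
    doc_freq = defaultdict(Counter)  # doc_freq[act][feature]
    n_utts = Counter()               # utterances per speech act
    for features, act in corpus:     # S10: corpus input
        n_utts[act] += 1
        doc_freq[act].update(set(features))
    total = sum(n_utts.values())
    vocab = set().union(*doc_freq.values())
    model = {}
    for act in n_utts:               # S20: feature weights per speech act
        rest = total - n_utts[act]
        for f in vocab:
            inside = doc_freq[act][f]
            outside = sum(doc_freq[o][f] for o in n_utts if o != act)
            p_in = (inside + smoothing) / (n_utts[act] + 2 * smoothing)
            p_out = (outside + smoothing) / (rest + 2 * smoothing)
            model[(f, act)] = math.log(p_in / p_out)
    return model, sorted(n_utts)     # S30: the "learning model"


def classify(model, acts, features):
    """S40: score every category and return the highest-scoring one."""
    scores = {
        act: sum(model.get((f, act), 0.0) for f in set(features))
        for act in acts
    }
    return max(scores, key=scores.get), scores


corpus = [
    (["please", "open", "door"], "request"),
    (["please", "send", "mail"], "request"),
    (["what", "time", "is", "it"], "ask"),
    (["what", "is", "this"], "ask"),
    (["yes", "it", "is"], "response"),
    (["no", "thanks"], "response"),
]
model, acts = train(corpus)
best, _ = classify(model, acts, ["please", "close", "door"])
print(best)
```

Features unseen in training (here `close`) contribute a score of zero, which is a design choice of this sketch rather than something the source specifies.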

Although the technical idea of the present invention has been described specifically in the preferred embodiment above, it should be noted that the embodiment is intended to illustrate the invention and not to limit it. Those of ordinary skill in the technical field of the present invention will also understand that various embodiments are possible within the scope of its technical idea. The true technical scope of protection of the present invention should therefore be determined by the technical idea of the appended claims.

Claims

A feature weight calculation apparatus for classifying user utterance information in a dialogue system, comprising: a training data-based learning module that calculates a feature weight distribution either by using the difference between the distributions with which a feature extracted from a corpus annotated with utterance information occurs, per speech act, in its own category and in the remaining categories, or by calculating the entropy of the distribution with which the feature occurs, per speech act, across all categories, and that builds a feature set of the training data on that basis;
a learning model generation module that generates a learning model based on the feature set built by the training data-based learning module; and
an input utterance classification module that, based on the learning model generated by the learning model generation module, uses the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score.

2. The apparatus of claim 1, wherein the training data-based learning module comprises:
a training corpus input unit that receives a corpus annotated with utterance information;
a first utterance feature weight calculation unit that calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which features appear in each speech act in the annotated corpus received from the training corpus input unit; and
a classification model generation unit that builds a feature set of the training data based on the feature weight distribution calculated from the speech act distribution by the first utterance feature weight calculation unit, and classifies the learning model using the category (request, ask, response) distribution.

The apparatus of claim 1, wherein the input utterance classification module comprises:
a second utterance feature weight calculation unit that, based on the classified training data, calculates a feature weight distribution for the training utterance vector representation using the probability distribution with which the features of an input utterance appear in each speech act; and
an utterance classification unit that calculates a classification score for each utterance information category based on the feature weight distribution calculated from the speech act distribution by the second utterance feature weight calculation unit, and assigns the utterance information category with the highest score.

(Deleted)

A feature weight calculation method for classifying user utterance information in a dialogue system, comprising:
(A) when a corpus annotated with utterance information is input, calculating feature weights for the training utterance vector representation using the probability distribution with which features appear in each speech act in the input corpus;
(B) building a feature set of the training data based on the feature weights calculated from the speech act distribution, and generating a learning model by classifying the model according to the per-category distribution; and
(C) based on the generated learning model, generating input utterance information by using the probability distribution with which the features of an input utterance appear in each speech act to assign the utterance information category with the highest per-category classification score,
wherein step (A) calculates the feature weight distribution either by using the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories, or by calculating the entropy of the distribution with which the feature occurs, per speech act, across all categories.

(Deleted)

The method of claim 5, wherein step (A) calculates the feature weight distribution using an equation (its image is not reproduced in this text) based on the difference between the distributions with which a feature occurs, per speech act, in its own category and in the remaining categories, the symbols of the equation denoting, in order: the feature; the category; the feature importance of the feature in the category; and the categories other than that category.

The method of claim 5, wherein step (A) calculates the feature weight distribution using two equations (their images are not reproduced in this text) by calculating the entropy of the distribution with which a feature occurs, per speech act, across all categories, where |C| is the total number of categories and the remaining symbols denote, in order: the conditional probability of the feature in the category; the uniform distribution value of each category; and the entropy expression.