KR102109866B1

KR102109866B1 - System and Method for Expansion Chatting Corpus Based on Similarity Measure Using Utterance Embedding by CNN

Info

Publication number: KR102109866B1
Application number: KR1020180119113A
Authority: KR
Inventors: 고영중; 안재현
Original assignee: 동아대학교 산학협력단
Priority date: 2018-10-05
Filing date: 2018-10-05
Publication date: 2020-05-12
Also published as: KR20200044178A

Abstract

The present invention relates to a chatting system, and more specifically to an apparatus and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, in which word-level embedding vectors (word embeddings) and convolutional neural networks are used to generate effective utterance-level representations of short utterances. The apparatus includes: a chat pair extraction unit that extracts arbitrary chat pairs from utterance data using a window size; an utterance representation generation unit that generates utterance-level representations so that a machine can understand the utterances; a chat similarity calculation unit that calculates the chat similarity between the arbitrary chat pairs made by the machine and a pre-built chat corpus; and a chat corpus construction unit that, when the chat similarity is higher than a threshold, judges the arbitrary chat pair to be a chat pair with a valid response relationship and expands the chat corpus with it.

Description

System and Method for Expansion Chatting Corpus Based on Similarity Measure Using Utterance Embedding by CNN

The present invention relates to a chatting system, and more specifically to an apparatus and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, in which word-level embedding vectors (word embeddings) and convolutional neural networks are used to generate effective utterance-level representations of short utterances.

A chatting system is a system in which a person and a machine communicate. Its distinguishing feature is that the means of communication is natural language, which was previously used only between people.

Chatting systems fall broadly into similarity-based chatting systems and generation-based chatting systems.

A similarity-based chatting system builds a database of a large number of response pairs, each consisting of a user utterance and a system utterance. When a user utterance arrives as input, the system finds the most similar user utterance in the database and outputs the system utterance paired with it.

Its characteristic is that it can output a robust response utterance when it finds the semantically closest utterance.

One example of the prior art finds the semantically most similar utterance using a three-stage sentence retrieval method. This method raises coverage by varying the morphemes and heuristics used at each stage, and additionally uses various features such as sentence modality and affirmation/negation.

Another method uses LSTM, a deep learning model, to improve the performance of a similarity-based chatting system without manual feature selection.

A generation-based chatting system uses the sequence-to-sequence model from neural machine translation as-is to generate an appropriate response when a user utterance is input.

The sequence-to-sequence model consists of an encoder that summarizes the user utterance and a decoder that generates the system utterance. Because the generated system utterance must conform to sentence structure, the model needs sufficient knowledge of the language, and generating natural sentences requires a larger corpus than the earlier similarity-based chatting systems.

One prior-art method proposes a system capable of everyday conversation and simple question answering using a general sequence-to-sequence model. Another proposes recognizing the user's emotion within a general sequence-to-sequence model and generating a response utterance appropriate to that emotion.

Building such chatting systems requires a large chat corpus in which user utterances and system utterances are bound together as pairs.

However, because publicly available chat corpora are scarce, many studies have built their chat corpora through considerable effort, for example by having people manually clean unrefined utterance logs.

Utterance data refers to all data uttered by people. Here it means data such as movies and drama scripts, in which only utterances exist and are not organized into pairs.

To generate chat pairs from such utterance data, a window size is set and arbitrary chat pairs are constructed.

The similarity between each arbitrary chat pair and a pre-built chat corpus (golden standard corpus) is then calculated to judge whether the system utterance is an appropriate response to the user utterance.

An utterance-level representation expresses an utterance as a vector so that a machine can understand it.

Prior-art methods for generating utterance-level representations have mostly used TF (term frequency) and IDF (inverse document frequency).

However, chat utterances are very short. Representing them with TF or TF*IDF, which are commonly used for sentences and documents, yields highly sparse vectors that carry no semantic information.

Korean Registered Patent No. 10-1814958
Korean Registered Patent No. 10-1741248

The present invention is intended to solve the problems of prior-art chatting systems. Its object is to provide an apparatus and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, in which word-level embedding vectors (word embeddings) and convolutional neural networks are used to generate effective utterance-level representations of short utterances.

Another object of the present invention is to provide such an apparatus and method in which the machine judges, as 0 or 1, whether an utterance pair represented as machine-understandable vectors is a correct chat pair, and thereby expands the chat corpus.

Another object of the present invention is to provide such an apparatus and method that, in order to represent short chat utterances effectively, uses word-level embedding vectors and a convolutional neural network model to generate low-dimensional utterance-level representations (utterance representations) that reflect semantic information well, and uses them to calculate the similarity between utterances.

Another object of the present invention is to provide such an apparatus and method that forms arbitrary pairs from large volumes of utterance data such as movie subtitles and drama scripts, calculates their chat similarity against a pre-built chat corpus (golden standard chatting corpus), and, if the calculated chat similarity is greater than an experimentally determined threshold, judges the arbitrary pair to be a correct chat pair, thereby expanding the chat corpus effectively.

Another object of the present invention is to provide such an apparatus and method that is semi-automatic, because a person can correct the machine-built corpus (machine labeled chatting corpus), and that reduces the cost of human effort, because the machine makes the first-pass judgment.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above objects, an apparatus for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention comprises: a chat pair extraction unit that extracts arbitrary chat pairs from utterance data using a window size; an utterance representation generation unit that generates utterance-level representations so that a machine can understand the utterances; a chat similarity calculation unit that calculates the chat similarity between the arbitrary chat pairs made by the machine and a pre-built chat corpus; and a chat corpus construction unit that, when the chat similarity is higher than a threshold, judges the arbitrary chat pair to be a chat pair with a valid response relationship and expands the chat corpus.

Here, when the i-th arbitrary pair is given as input, the chat similarity calculation unit obtains the chat similarity as follows. The similarity between the i-th pair and every pair of the pre-built chat corpus, which has length n, is calculated. The largest of the resulting similarities is taken as the chat similarity of the i-th pair, and if this chat similarity is greater than a predefined threshold, the pair is judged to be a correct pair.

Cosine similarity is used to calculate the chat similarity. Because both the arbitrarily extracted pair and the pre-built chat corpus consist of pairs of a user utterance and a system utterance, a similarity is calculated for each of the two, and the two similarities are linearly combined into a single chat similarity using gamma (γ), the ratio in which the two similarities are reflected.

In order to produce vectors that are low-dimensional and contain semantic information, the utterance representation generation unit uses the document frequency (DF) of morphemes to select only those morphemes that can effectively represent a short utterance, and generates an average embedding vector from them. The morphemes used selectively include general nouns, proper nouns, numerals, verbs, adjectives, and general adverbs.
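The morpheme selection described above can be sketched roughly as follows. This is a minimal illustration under several assumptions that the text does not state: morphemes carry Sejong-style part-of-speech tags (`NNG`, `NNP`, `NR`, `VV`, `VA`, `MAG` for the six classes listed), DF is used as an upper cutoff on how many utterances a morpheme appears in, and the cutoff `max_df_ratio` and the hand-tagged toy utterances are hypothetical.

```python
from collections import Counter

# Assumed Sejong-style tags for the six classes named in the text:
# general noun, proper noun, numeral, verb, adjective, general adverb.
CONTENT_TAGS = {"NNG", "NNP", "NR", "VV", "VA", "MAG"}


def document_frequency(utterances):
    """Count, for each morpheme, how many utterances contain it."""
    df = Counter()
    for utt in utterances:
        df.update({morph for morph, _tag in utt})
    return df


def select_morphemes(utt, df, n_utts, max_df_ratio=0.5):
    """Keep content-class morphemes whose DF ratio does not exceed the
    (hypothetical) cutoff max_df_ratio."""
    return [
        morph
        for morph, tag in utt
        if tag in CONTENT_TAGS and df[morph] / n_utts <= max_df_ratio
    ]


# Toy utterances as (morpheme, tag) pairs. The tags are written by hand
# for illustration and are not the output of a real tagger.
corpus = [
    [("아버지", "NNG"), ("사랑", "NNG"), ("하", "VV"), ("ㅂ니다", "EF")],
    [("공부", "NNG"), ("하", "VV"), ("아요", "EF")],
    [("지금", "MAG"), ("운동", "NNG"), ("하", "VV"), ("ㅂ니다", "EF")],
]
df = document_frequency(corpus)
selected = [select_morphemes(utt, df, len(corpus)) for utt in corpus]
```

In this toy data the morpheme `하` occurs in every utterance and is dropped by the DF cutoff, while endings such as `ㅂ니다` are dropped because their tag is outside the six classes.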

The utterance representation generation unit also generates the utterance-level representation using a convolutional neural network model and word-level embeddings.

The utterance is expressed as morpheme-level embedding vectors through a projection layer. A convolution layer and max pooling derive a deep feature representation from these vectors, a final output vector is derived from the deep feature representation, and the model is trained by calculating the difference between the output vector and the answer vector.
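The forward pass through these layers can be sketched in plain Python. This is an illustration only: the layer sizes, the random weights, the ReLU activation, and every function name are assumptions, and no training step is shown. It also assumes the utterance has at least as many morphemes as the kernel is wide.

```python
import random


def conv1d_relu(embeddings, kernels, biases):
    """Slide each kernel over consecutive morpheme vectors and return one
    feature map per kernel. ReLU is an assumed activation."""
    width = len(kernels[0])  # number of morphemes one kernel covers
    feature_maps = []
    for kernel, bias in zip(kernels, biases):
        fmap = []
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            total = bias + sum(
                w * x
                for k_row, e_row in zip(kernel, window)
                for w, x in zip(k_row, e_row)
            )
            fmap.append(max(0.0, total))
        feature_maps.append(fmap)
    return feature_maps


def max_pool(feature_maps):
    """Max-over-time pooling: keep the strongest activation per feature map."""
    return [max(fmap) for fmap in feature_maps]


def dense(vector, weights):
    """Linear map from the deep feature representation to the output vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def forward(morpheme_ids, table, kernels, biases, out_weights):
    embeddings = [table[i] for i in morpheme_ids]  # projection layer (lookup)
    deep = max_pool(conv1d_relu(embeddings, kernels, biases))
    return deep, dense(deep, out_weights)


rng = random.Random(0)
EMB, N_KERNELS, WIDTH, OUT, VOCAB = 4, 3, 2, 2, 10  # hypothetical sizes
table = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(VOCAB)]
kernels = [[[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(WIDTH)]
           for _ in range(N_KERNELS)]
biases = [0.0] * N_KERNELS
out_weights = [[rng.uniform(-1, 1) for _ in range(N_KERNELS)]
               for _ in range(OUT)]

# deep: pooled deep feature representation; output: vector compared with
# the answer vector during training.
deep, output = forward([1, 5, 7], table, kernels, biases, out_weights)
```

The function returns both vectors because the two play different roles: `output` exists to be compared with the answer vector, while `deep` is the pooled feature vector.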

For training, the answer vectors of the convolutional neural network model are generated using LSA (latent semantic analysis) and TF*IDF. Each utterance is represented with TF*IDF, and LSA, which reduces dimensionality and performs latent semantic analysis, is used to decompose the resulting matrix. The low-dimensional dense vectors obtained in this way are used as answer vectors. The output vector is derived with the convolutional neural network model, and training proceeds so that its cosine distance from the answer vector decreases.
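The training objective can be illustrated with a small cosine distance function. This sketch assumes the common definition of cosine distance as one minus cosine similarity, which the text does not spell out. The LSA step that produces the answer vectors is not shown, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_distance_loss(output_vec, answer_vec):
    """Assumed loss: 1 - cosine similarity, so aligned vectors give 0."""
    return 1.0 - cosine_similarity(output_vec, answer_vec)


aligned = cosine_distance_loss([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = cosine_distance_loss([1.0, 0.0], [0.0, 1.0])  # unrelated
```

Because the loss depends only on direction, an output vector that points the same way as the answer vector is treated as a match regardless of its length.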

The output vector is used only for training. The vector actually used as the utterance-level representation is the deep feature representation of the trained convolutional neural network model.

To achieve another object, a method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention comprises: a chat pair extraction step of extracting arbitrary chat pairs from utterance data using a window size; an utterance representation generation step of generating utterance-level representations so that a machine can understand the utterances; a chat similarity calculation step of calculating the chat similarity between the arbitrary chat pairs made by the machine and a pre-built chat corpus; and a chat corpus construction step of, when the chat similarity is higher than a threshold, judging the arbitrary chat pair to be a chat pair with a valid response relationship and expanding the chat corpus.

Here, in the chat similarity calculation step, when the i-th arbitrary pair is given as input, the chat similarity is obtained,

In the utterance representation generation step, in order to produce vectors that are low-dimensional and contain semantic information, the document frequency (DF) of morphemes is used to select only those morphemes that can effectively represent a short utterance, and an average embedding vector is generated from them. The morphemes used selectively include general nouns, proper nouns, numerals, verbs, adjectives, and general adverbs.

In the utterance representation generation step, the utterance-level representation is also generated using a convolutional neural network model and word-level embeddings.

The apparatus and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention, as described above, have the following effects.

First, word-level embedding vectors (word embeddings) and convolutional neural networks make it possible to generate effective utterance-level representations of short utterances.

Second, the chat corpus can be expanded effectively because the machine judges, as 0 or 1, whether an utterance pair represented as machine-understandable vectors is a correct chat pair.

Third, short chat utterances can be represented effectively by using word-level embedding vectors and a convolutional neural network model to generate low-dimensional utterance-level representations (utterance representations) that reflect semantic information well, and by using them to calculate the similarity between utterances.

Fourth, the chat corpus can be expanded effectively by forming arbitrary pairs from large volumes of utterance data such as movie subtitles and drama scripts, calculating their chat similarity against a pre-built chat corpus (golden standard chatting corpus), and judging an arbitrary pair to be a correct chat pair if the calculated chat similarity is greater than an experimentally determined threshold.

Fifth, the process is semi-automatic, because a person can correct the machine-built corpus (machine labeled chatting corpus), and the cost of human effort is reduced, because the machine makes the first-pass judgment.

FIG. 1 is a block diagram of an apparatus for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention.
FIG. 2 is a diagram showing an example of movie subtitles, which are one kind of utterance data.
FIG. 3 is an overall diagram of the semi-automatic chat corpus construction model according to the present invention.
FIG. 4 is a diagram showing an example of an average embedding vector.
FIG. 5 is a diagram showing the generation of utterance-level representations using a convolutional neural network model.
FIG. 6 is a flowchart of a method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention.

Hereinafter, preferred embodiments of the apparatus and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention are described in detail.

The features and advantages of the apparatus and method according to the present invention will become apparent from the detailed description of each embodiment below.

FIG. 1 is a block diagram of an apparatus for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention.

The apparatus and method according to the present invention extract chat pairs from a large volume of utterance data and use them to expand the chat corpus, in order to reduce the difficulty of building a chat corpus.

To this end, the present invention defines a chat corpus expansion system and, in order to represent short chat utterances effectively, uses morpheme-level embedding vectors and a convolutional neural network to generate a deep feature representation that represents each utterance well.

With this, the chat corpus can be expanded easily and a chat corpus with varied expressions can be built.

A large, high-quality chat corpus is hard to obtain because the user utterances and system utterances must first be organized into pairs, and the system utterance must be a suitable response to the user utterance.

Because publicly available chat corpora of this kind are scarce, corpora have generally been built by hand, with people judging actual dialogue logs one by one.

However, utterance data that is simply listed in temporal order, such as movie subtitles and drama scripts, exists in abundance. The present invention extracts correct chat pairs from such large volumes of utterance data so that the cost of building a chat corpus can be reduced.

As shown in FIG. 1, the apparatus according to the present invention includes a chat pair extraction unit (10) that extracts arbitrary chat pairs from utterance data using a window size, an utterance representation generation unit (20) that generates utterance-level representations so that a machine can understand the utterances, a chat similarity calculation unit (30) that calculates the chat similarity between the arbitrary chat pairs made by the machine and a pre-built chat corpus, and a chat corpus construction unit (40) that, when the chat similarity is higher than a threshold, judges the arbitrary chat pair to be a chat pair with a valid response relationship and expands the chat corpus.

For semi-automatic expansion of the chat corpus, the present invention configured in this way uses a similarity technique and a pre-built chat corpus to extract correct chat pairs from a large volume of utterance data.

Utterance data means data, such as movie subtitles and drama scripts, in which utterances are simply listed in temporal order. Such data exists in large quantities, but to build a chat corpus it must be extracted as pairs with a valid response relationship.

FIG. 2 is a diagram showing an example of movie subtitles, which are one kind of utterance data.

Because utterance data is organized in temporal order, the present invention first arranges it into arbitrary pairs.

To build the pairs, a window size is set, and the t-th utterance is assumed to form a chat pair with each of the (t+1)-th, (t+2)-th, ..., (t+window size)-th utterances.
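The windowed pairing rule can be sketched as follows. The function name and the sample subtitle lines are illustrative and not taken from the patent's figures or tables.

```python
def extract_candidate_pairs(utterances, window_size):
    """Pair the t-th utterance with each of the next `window_size` utterances."""
    pairs = []
    for t, user_utt in enumerate(utterances):
        for offset in range(1, window_size + 1):
            if t + offset >= len(utterances):
                break  # ran past the end of the utterance list
            pairs.append((user_utt, utterances[t + offset]))
    return pairs


# Hypothetical subtitle lines in temporal order.
subtitles = ["What time is it?", "It's nine.", "We're late.", "Let's go."]
pairs = extract_candidate_pairs(subtitles, window_size=2)
for user_utt, system_utt in pairs:
    print(f"{user_utt!r} -> {system_utt!r}")
```

With four utterances and a window size of 2 this yields five candidate pairs, only some of which are plausible responses, which is why the later similarity check is needed.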

Table 1 shows examples of chat pairs made arbitrarily using the window size.

In the table, the number is the index of each utterance pair, and the chat similarity entry records whether the system utterance stands in a valid response relationship to the user utterance.

Pair 1 is a correct, actual response pair in the movie. Pair 4 is not the actual response to 'What time is it?' in the movie, but it is a response pair that a person would judge to be natural.

Pairs 2, 3, and 5 are incorrect response pairs. When pairs are built simply by using a window size in this way, actual response pairs from the movie, pairs that are not actual responses in the movie but can be regarded as correct, and incorrect pairs are all mixed together.

Therefore, the present invention configures the model for semi-automatic chat corpus construction as follows.

FIG. 3 is an overall diagram of the semi-automatic chat corpus construction model according to the present invention.

The chat similarity is calculated between the arbitrary chat pairs made by the machine, labeled "Machine" in the figure, and the pre-built chat corpus. If the chat similarity is higher than the threshold, the arbitrary chat pair is judged to be a chat pair with a valid response relationship.

The corpus built from the machine's judgments is called the machine labeled chatting corpus. The process is semi-automatic because a person can correct this corpus, and the cost of human effort is reduced because the machine makes the first-pass judgment.

To obtain the chat similarity, it is first necessary to generate a good vector, that is, an utterance-level representation, so that the machine can understand the utterance.

The baseline model of the present invention generates utterance-level representations using TF and TF*IDF, which are commonly used to express utterances as vectors.

Equation 1 gives the chat similarity when the i-th arbitrary pair is given as input.

The similarity is calculated between the i-th pair and every pair of the pre-built chat corpus, which has length n.

The largest of the resulting similarities is taken as the chat similarity of the i-th pair, and if this chat similarity is greater than a predefined threshold, the pair is judged to be a correct pair.
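Going only by the description above, the decision rule can be written in the following form. The notation is chosen here for illustration and is an assumption: $p_i$ is the i-th arbitrary pair, $g_j$ is the j-th pair of the pre-built corpus, $\mathrm{Sim}$ is the pair-level similarity, and $\theta$ is the threshold.

```latex
\mathrm{ChatSim}(p_i) = \max_{1 \le j \le n} \mathrm{Sim}(p_i, g_j),
\qquad
p_i \text{ is judged a correct pair if } \mathrm{ChatSim}(p_i) > \theta
```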

본 발명에서는 유사도를 계산하기 위해 코사인 유사도(Cosine similarity)를 이용한다.In the present invention, cosine similarity is used to calculate the similarity.

Because both the randomly extracted pairs and the pre-built chat corpus consist of pairs of a user utterance and a system utterance, a similarity is computed separately for the user utterances and for the system utterances. The two similarities are then linearly combined, using gamma (γ) as their weighting ratio, into a single chat similarity.
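Based on the description above, Equation 1 can be sketched as follows. The symbols $\mathbf{u}$ (user utterance vector) and $\mathbf{s}$ (system utterance vector) are notation introduced here for illustration, and placing γ on the user-utterance term rather than the system-utterance term is an assumption.

```latex
\mathrm{ChatSim}(i) = \max_{1 \le j \le n}
  \Big[ \gamma \cos(\mathbf{u}_i, \mathbf{u}_j) + (1 - \gamma) \cos(\mathbf{s}_i, \mathbf{s}_j) \Big],
\qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

Here $(\mathbf{u}_i, \mathbf{s}_i)$ is the i-th random pair and $(\mathbf{u}_j, \mathbf{s}_j)$ is the j-th of the n pairs in the pre-built chat corpus. The pair is accepted when $\mathrm{ChatSim}(i)$ exceeds the threshold. When γ = 0.5 the two terms are weighted equally, so the placement of γ does not change the result.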

As described above, the baseline model uses the commonly used TF and TF*IDF to generate utterance unit representations.

However, chat utterances are very short, so representations generated with TF or TF*IDF are extremely sparse vectors that do not represent the utterance well.
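The sparsity problem can be illustrated with a minimal sketch. The smoothed IDF formula, the English toy tokens, and the function names are assumptions made for illustration; the source does not specify the exact TF*IDF weighting, and the actual system operates on Korean morphemes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized utterance."""
    n = len(docs)
    # Document frequency: number of utterances containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant; the source gives no formula).
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

utterances = [
    ["thank", "you"],
    ["thanks", "a", "lot"],
    ["thank", "you", "very", "much"],
]
vecs = tfidf_vectors(utterances)
vocab_size = len({t for doc in utterances for t in doc})

print(len(vecs[0]), "non-zero entries out of", vocab_size)
print(cosine(vecs[0], vecs[1]))  # no shared terms despite similar meaning
print(cosine(vecs[0], vecs[2]))
```

The first two utterances mean nearly the same thing but share no term, so their similarity is zero. This is the weakness of sparse representations for short utterances that the embedding-based methods are meant to address.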

The present invention therefore generates utterance unit representations from word-level embedding vectors.

The word-level embedding vectors used are morpheme-level embedding vectors pre-trained with word2vec on a large corpus.

The first method uses the average embedding vector of the morphemes appearing in an utterance as the utterance unit representation.

The average embedding vector is built as shown in FIG. 4.

FIG. 4 shows an example of an average embedding vector, in which the utterance '아버지 사랑합니다' ("I love you, Father") is represented as the average of its morpheme-level embedding vectors.

An average embedding vector generated in this way is a low-dimensional vector that carries semantic information.
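A minimal sketch of the averaging step is shown below. The 3-dimensional toy vectors, the morpheme segmentation of the example utterance, and the function name are illustrative assumptions; the embeddings described in the source are 100-dimensional and trained with word2vec.

```python
def average_embedding(morphemes, embeddings):
    """Average the embedding vectors of the morphemes found in the lookup table."""
    vectors = [embeddings[m] for m in morphemes if m in embeddings]
    if not vectors:
        return None  # no known morpheme, so no representation can be built
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical toy embeddings for an assumed segmentation of '아버지 사랑합니다'.
toy_embeddings = {
    "아버지": [1.0, 0.0, 2.0],
    "사랑": [0.0, 1.0, 2.0],
    "하": [1.0, 1.0, 0.0],
    "ㅂ니다": [2.0, 2.0, 0.0],
}
utterance = ["아버지", "사랑", "하", "ㅂ니다"]
avg = average_embedding(utterance, toy_embeddings)
print(avg)
```

The result has the same dimensionality as a single morpheme embedding regardless of utterance length, which is what makes it a dense, low-dimensional representation.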

However, when the average embedding vector is computed over all morphemes, it includes many morphemes, such as particles (josa), that appear frequently in other utterances as well.

Particularly for very short utterances such as chat utterances, the particles contribute heavily to the average, which makes it difficult to distinguish similar utterances from dissimilar ones.

The present invention therefore uses the document frequency (DF) of morphemes to select only those morphemes that can effectively represent a short utterance, and generates the average embedding vector from them.

The morphemes used are general nouns, proper nouns, numerals, verbs, adjectives, and general adverbs.
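The part-of-speech selection can be sketched as a simple filter applied before averaging. The tag names below assume that the morphological analyzer emits Sejong-style tags (NNG, NNP, NR, VV, VA, MAG); the source does not name a tagset, and it does not detail how DF is applied, so this sketch covers only the resulting part-of-speech filter.

```python
# Assumed Sejong-style tags for the six categories listed in the text:
# general noun, proper noun, numeral, verb, adjective, general adverb.
CONTENT_TAGS = {"NNG", "NNP", "NR", "VV", "VA", "MAG"}

def select_content_morphemes(tagged):
    """Keep only morphemes whose tag is in the selected categories."""
    return [morpheme for morpheme, tag in tagged if tag in CONTENT_TAGS]

# Hypothetical analyzer output for '밥을 먹었다' ("ate a meal").
tagged = [("밥", "NNG"), ("을", "JKO"), ("먹", "VV"), ("었", "EP"), ("다", "EF")]
selected = select_content_morphemes(tagged)
print(selected)
```

The particle and the verb endings are dropped, so only the noun and the verb stem contribute to the average embedding vector.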

The second method generates the utterance unit representation using a convolutional neural network (CNN) model together with word-level embeddings.

FIG. 5 is a block diagram showing generation of the utterance unit representation using the convolutional neural network model.

FIG. 5 shows the structure of a convolutional neural network model that generates utterance unit representations for short utterances; the input to the model is a short utterance.

The short utterance is expressed as morpheme-level embedding vectors through a projection layer, and a deep feature representation is derived from them using a convolution layer and max pooling.

Structurally, this is an ordinary convolutional neural network model: a final output vector is derived from the deep feature representation, and the model is trained on the difference between the output vector and an answer vector.
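The forward pass described above can be sketched in plain Python. The ReLU activation, the filter width, the number of filters, the zero padding for utterances shorter than the filter width, and the random initialization are all assumptions for illustration; the source does not state these hyperparameters, and no training is performed here.

```python
import math
import random

def conv1d_max_pool(embedded, filters, width):
    """Slide each filter over every window of `width` tokens, then max-pool over time."""
    pooled = []
    for f in filters:  # each filter is a flat weight list of length width * dim
        best = -math.inf
        for start in range(len(embedded) - width + 1):
            window = [x for vec in embedded[start:start + width] for x in vec]
            activation = max(0.0, sum(w * x for w, x in zip(f, window)))  # ReLU (assumed)
            best = max(best, activation)
        pooled.append(best)
    return pooled

def linear(x, weights):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(tokens, embeddings, filters, width, out_weights):
    dim = len(next(iter(embeddings.values())))
    embedded = [embeddings[t] for t in tokens]            # projection layer
    while len(embedded) < width:                          # zero-pad very short utterances
        embedded.append([0.0] * dim)
    deep = conv1d_max_pool(embedded, filters, width)      # deep feature representation
    output = linear(deep, out_weights)                    # output vector
    return deep, output

rng = random.Random(0)
dim, width, n_filters, out_dim = 4, 2, 3, 2
tokens = ["밥", "먹", "었", "다"]
embeddings = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in tokens}
filters = [[rng.uniform(-1, 1) for _ in range(width * dim)] for _ in range(n_filters)]
out_weights = [[rng.uniform(-1, 1) for _ in range(n_filters)] for _ in range(out_dim)]

deep, output = forward(tokens, embeddings, filters, width, out_weights)
print(len(deep), len(output))
```

Max pooling over time yields one value per filter, so the deep feature representation has a fixed size no matter how many morphemes the utterance contains.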

However, the purpose of the model in FIG. 5 is not the usual task of predicting a label with a convolutional neural network but producing an utterance unit representation, so training proceeds smoothly only if a good answer vector is constructed for each input.

In the present invention, the answer vector of the convolutional neural network model is generated using LSA (Latent Semantic Analysis) and TF*IDF so that training proceeds smoothly.

First, each short utterance is expressed with TF*IDF. The resulting matrix is then decomposed with LSA, which both reduces dimensionality and performs latent semantic analysis, and the resulting low-dimensional dense vector is used as the answer vector.

The convolutional neural network model then derives an output vector, and training proceeds so that the cosine distance between the output vector and the answer vector decreases.
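The construction of answer vectors and the training objective can be sketched with NumPy. Implementing LSA as a truncated singular value decomposition of the utterance-by-term matrix is the standard formulation, but the toy matrix, the choice of k, and the function names are assumptions for illustration.

```python
import numpy as np

def lsa_targets(tfidf_matrix, k):
    """Project each row (utterance) of a TF*IDF matrix onto the top-k latent dimensions."""
    u, s, _vt = np.linalg.svd(tfidf_matrix, full_matrices=False)
    return u[:, :k] * s[:k]  # dense, low-dimensional answer vectors

def cosine_distance(a, b):
    """1 - cosine similarity; the quantity training is meant to reduce."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / denom if denom else 1.0

# Toy TF*IDF matrix: 3 utterances x 4 terms (values are placeholders).
tfidf = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
targets = lsa_targets(tfidf, k=2)
print(targets.shape)

# A hypothetical CNN output compared with the answer vector of utterance 0.
cnn_output = np.array([0.5, -0.2])
print(cosine_distance(cnn_output, targets[0]))
```

In a full implementation the cosine distance would be minimized by gradient descent over the network weights; that step is omitted here.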

The output vector is used only for training. The vector actually used as the utterance unit representation is the deep feature representation of the trained convolutional neural network model.

Training in this way extracts the important features from the morpheme-level embedding vectors and represents short utterances more effectively than using the average embedding vector as the utterance unit representation.

The method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention is described in detail below.

FIG. 6 is a flow chart of the method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention.

First, random chat pairs are extracted from the utterance data (S601), and an utterance unit representation generation step is performed to generate utterance unit representations that the machine can process (S602).

Next, the chat similarity between each chat pair randomly generated by the machine and the pre-built chat corpus is calculated (S603).

If the calculated chat similarity is higher than the threshold, the random chat pair is judged to be a chat pair with a correct response relationship (S604), and the chat corpus is built accordingly (S605).
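Steps S601 to S605 can be sketched end to end as follows. How the window size is applied is not detailed in the source; pairing each utterance with each of the next `window` utterances is an assumption, as are the function names, the toy 2-dimensional vectors, and the weighting of the user-utterance term by gamma.

```python
import math

def extract_pairs(utterances, window):
    """S601: pair each utterance with each of the next `window` utterances (assumed scheme)."""
    return [(utterances[i], utterances[j])
            for i in range(len(utterances))
            for j in range(i + 1, min(i + 1 + window, len(utterances)))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def chat_similarity(pair_vec, corpus_vecs, gamma=0.5):
    """S603: best gamma-weighted cosine similarity against every corpus pair."""
    user_vec, system_vec = pair_vec
    return max(gamma * cosine(user_vec, cu) + (1 - gamma) * cosine(system_vec, cs)
               for cu, cs in corpus_vecs)

def expand_corpus(utterances, corpus, embed, window, threshold, gamma=0.5):
    corpus_vecs = [(embed(u), embed(s)) for u, s in corpus]
    accepted = []
    for user, system in extract_pairs(utterances, window):      # S601
        pair_vec = (embed(user), embed(system))                 # S602
        score = chat_similarity(pair_vec, corpus_vecs, gamma)   # S603
        if score > threshold:                                   # S604
            accepted.append((user, system))                     # S605
    return corpus + accepted

# Hypothetical utterance vectors standing in for a trained representation model.
toy_vectors = {
    "hello": [1.0, 0.0],
    "hi there": [0.9, 0.1],
    "hi": [1.0, 0.1],
    "bye": [0.0, 1.0],
}
corpus = [("hello", "hi there")]
dialogue = ["hi", "hello", "bye"]
expanded = expand_corpus(dialogue, corpus, toy_vectors.__getitem__,
                         window=1, threshold=0.9)
print(expanded)
```

Any representation function can be passed as `embed`, so the same loop works with TF*IDF vectors, average embedding vectors, or the deep feature representation of the convolutional neural network model.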

The chat corpus expansion performance of the device and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, as described above, is as follows.

Pairs were formed at random from a large amount of utterance data, and human annotators labeled each pair as "chat-similar" or "not chat-similar" according to whether it could be used in a chat corpus. The experiments were designed to improve performance in predicting the chat-similar label.

In addition, because chat utterances are very short, comparative experiments were run on methods for representing utterances.

The large-scale utterance data used in the semi-automatic chat corpus construction experiments consisted of movie and foreign drama subtitles collected from Opensubtitle.

For the threshold and feature experiments and for evaluation, the development data and the evaluation data were each extracted from six movies and foreign dramas, and three corpus annotators labeled each pair as chat-similar or not.

Table 2 gives the statistics of the development data and Table 3 those of the evaluation data.

To check the reliability of the data, the kappa coefficient was measured and found to be 0.8114.
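The source does not say which kappa statistic was used. With three annotators, Fleiss' kappa is one common choice, and the sketch below computes it under that assumption; the toy label counts are made up for illustration and are not the patent's data.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # every item must be rated by the same number of annotators
    n_categories = len(counts[0])
    total = n_items * n_raters
    # Proportion of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item agreement among annotators.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Columns: [chat-similar, not chat-similar]; three annotators per pair.
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(fleiss_kappa(toy_counts))
```

Values close to 1 indicate that the annotators agree far more often than chance would predict.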

The chat corpus used for the similarity calculation contained about 400,000 pairs. The utterance data and the chat corpus were analyzed with the morphological analyzer of [Kim Hye-min, Yoon Jeong-min, Ahn Jae-hyun, Bae Kyoung-man, Ko Young-joong, "Syllable-Level Morphological Analyzer Using Part-of-Speech Distribution and Bidirectional LSTM-CRFs", Proceedings of the 28th Annual Conference on Hangul and Korean Information Processing, pp. 3-8, 2016.], and the 100-dimensional morpheme-level embedding vectors were trained with word2vec on a large corpus.

Table 4 compares the performance of the baseline system, which generates utterance unit representations using plain TF.

This experiment was run on the baseline system to set gamma (γ), the weighting ratio between the user-utterance similarity and the system-utterance similarity.

F1 was used as the evaluation metric.

According to the results in Table 4, performance was highest at gamma = 0.5, so gamma was set to 0.5 in all subsequent experiments.

All experimental results reported below use the threshold that gave the highest F1 on the development data.
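Selecting the threshold on development data can be sketched as a sweep over candidate values that keeps the one with the highest F1. The scores, labels, and candidate grid below are made-up illustrations, not the patent's data, and the function names are hypothetical.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat a pair as chat-similar when its chat similarity exceeds the threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1 on the given data."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])

# Hypothetical development set: chat similarity per pair and human label.
dev_scores = [0.95, 0.80, 0.60, 0.40, 0.20]
dev_labels = [True, True, False, True, False]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

chosen = best_threshold(dev_scores, dev_labels, candidates)
print(chosen, precision_recall_f1(dev_scores, dev_labels, chosen))
```

The same sweep could be applied to gamma, with the threshold re-selected for each candidate value.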

Table 5 compares the performance of plain TF, IDF, and TF*IDF.

As Table 5 shows, TF*IDF performs better than TF alone or IDF alone.

Table 6 compares performance when the utterance unit representation is generated with TF*IDF and with the average embedding vector.

The average embedding vector performs better than TF*IDF for generating utterance unit representations.

Table 7 compares performance when the utterance unit representation is generated with the final model, i.e. the convolutional neural network model with morpheme-level embedding vectors.

The results in Table 7 show that utterances are represented more effectively by using morpheme-level embedding vectors together with the convolutional neural network model.

As described above, the device and method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network according to the present invention extract chat pairs from a large amount of utterance data to reduce the difficulty of building a chat corpus, and generate utterance unit representations so that correct chat pairs can be extracted reliably.

As a result, precision, recall, and F1 increased over the baseline system by 5.16%p, 6.09%p, and 5.73%p, respectively, confirming that morpheme-level embedding vectors combined with a convolutional neural network model are an effective way to represent short utterances.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10. Chat pair extraction unit
20. Utterance unit representation generation unit
30. Chat similarity calculation unit
40. Chat corpus construction unit

Claims

1. A device for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, the device comprising:
a chat pair extraction unit that extracts random chat pairs from utterance data using a window size;
an utterance unit representation generation unit that generates utterance unit representations so that a machine can process utterances;
a chat similarity calculation unit that calculates a chat similarity between a chat pair randomly generated by the machine and a pre-built chat corpus; and
a chat corpus construction unit that, if the chat similarity is higher than a threshold, judges the random chat pair to be a chat pair with a correct response relationship and expands the chat corpus,
wherein the utterance unit representation generation unit, in order to express utterances as low-dimensional vectors containing semantic information, generates an average embedding vector by selecting, using the document frequency (DF) of morphemes, only morphemes that can effectively represent short utterances, and the morphemes used include general nouns, proper nouns, numerals, verbs, adjectives, and general adverbs.

2. The device of claim 1, wherein the chat similarity calculation unit obtains the chat similarity for an i-th random pair given as input by
[equation]
wherein the similarity between the i-th pair and each of the n pairs in the pre-built chat corpus is calculated, the largest of the obtained similarities is taken as the chat similarity of the i-th pair, and the pair is judged to be a correct pair if the chat similarity is greater than a predefined threshold.

3. The device of claim 2, wherein cosine similarity is used to calculate the chat similarity, and, because both the randomly extracted pair and the pre-built chat corpus consist of pairs of a user utterance and a system utterance, a similarity is calculated for each, and the two similarities are linearly combined, using gamma (γ) as their weighting ratio, into a single chat similarity.

4. (Deleted)

5. The device of claim 1, wherein the utterance unit representation generation unit generates the utterance unit representation using a convolutional neural network model and word-level embeddings.

6. The device of claim 5, wherein an utterance is expressed as morpheme-level embedding vectors through a projection layer, a deep feature representation is derived using a convolution layer and max pooling, a final output vector is derived from the deep feature representation, and training is performed by calculating the difference from an answer vector.

7. The device of claim 6, wherein, for training, the answer vector of the convolutional neural network model is generated using Latent Semantic Analysis (LSA) and TF*IDF,
the utterance is expressed with TF*IDF, the matrix is decomposed using LSA, which reduces dimensionality and performs latent semantic analysis, and the resulting low-dimensional dense vector is used as the answer vector, and
an output vector is derived using the convolutional neural network model, and training proceeds so that the cosine distance between the output vector and the answer vector decreases.

8. A device for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, wherein the output vector is used for training, and the deep feature representation of the trained convolutional neural network model is used as the actual utterance unit representation.

9. A method for expanding a chat corpus based on similarity measurement using utterance embedding by a convolutional neural network, the method comprising:
a chat pair extraction step in which a chat pair extraction unit extracts random chat pairs from utterance data using a window size;
an utterance unit representation generation step in which an utterance unit representation generation unit generates utterance unit representations so that a machine can process utterances;
a chat similarity calculation step in which a chat similarity calculation unit calculates a chat similarity between a chat pair randomly generated by the machine and a pre-built chat corpus; and
a chat corpus construction step in which, if the chat similarity is higher than a threshold, a chat corpus construction unit judges the random chat pair to be a chat pair with a correct response relationship and expands the chat corpus,
wherein, in the utterance unit representation generation step, in order to express utterances as low-dimensional vectors containing semantic information, an average embedding vector is generated by selecting, using the document frequency (DF) of morphemes, only morphemes that can effectively represent short utterances, and the morphemes used include general nouns, proper nouns, numerals, verbs, adjectives, and general adverbs.

10. The method of claim 9, wherein, in the chat similarity calculation step, the chat similarity for an i-th random pair given as input is obtained by
[equation]
wherein the similarity between the i-th pair and each of the n pairs in the pre-built chat corpus is calculated, the largest of the obtained similarities is taken as the chat similarity of the i-th pair, and the pair is judged to be a correct pair if the chat similarity is greater than a predefined threshold.

11. The method of claim 10, wherein cosine similarity is used to calculate the chat similarity, and, because both the randomly extracted pair and the pre-built chat corpus consist of pairs of a user utterance and a system utterance, a similarity is calculated for each, and the two similarities are linearly combined, using gamma (γ) as their weighting ratio, into a single chat similarity.

12. (Deleted)

13. The method of claim 9, wherein, in the utterance unit representation generation step, the utterance unit representation is generated using a convolutional neural network model and word-level embeddings.

14. The method of claim 13, wherein an utterance is expressed as morpheme-level embedding vectors through a projection layer, a deep feature representation is derived using a convolution layer and max pooling, a final output vector is derived from the deep feature representation, and training is performed by calculating the difference from an answer vector.

15. The method of claim 14, wherein, for training, the answer vector of the convolutional neural network model is generated using Latent Semantic Analysis (LSA) and TF*IDF,
the utterance is expressed with TF*IDF, the matrix is decomposed using LSA, which reduces dimensionality and performs latent semantic analysis, and the resulting low-dimensional dense vector is used as the answer vector, and
an output vector is derived using the convolutional neural network model, and training proceeds so that the cosine distance between the output vector and the answer vector decreases.

16. The method of claim 15, wherein the output vector is used for training, and the deep feature representation of the trained convolutional neural network model is used as the actual utterance unit representation.