KR20220139626A

KR20220139626A - Apparatus for optimizing open source korean language understanding pipeline based on artificial neural networks

Info

Publication number: KR20220139626A
Application number: KR1020210045806A
Authority: KR
Inventors: 황명하; 신지강; 서호진; 임정선
Original assignee: 한국전력공사
Priority date: 2021-04-08
Filing date: 2021-04-08
Publication date: 2022-10-17

Abstract

The present invention relates to an apparatus for optimizing an open source Korean language understanding pipeline based on artificial neural networks. The apparatus comprises: a pre-processing function execution unit that pre-processes input sentences and includes a tokenization module for separating text from a corpus (a collection of texts) into language elements that cannot be grammatically divided further, a feature extraction module for identifying and extracting the useful features among the resulting tokens, and learning data for training an artificial neural network chatbot; and an intent classification and entity extraction unit including an intent classification and entity extraction module for classifying the intent of a sentence queried by a speaker and then extracting the entities in that sentence, an entity mapping module for mapping the extracted entities to the appropriate slots, and a dialogue policy module for setting policies for question-and-answer dialogue scenarios. Accordingly, when a commercial chatbot framework is developed, the apparatus can improve an open source chatbot framework for Korean rather than English, together with the performance of the pre-processing function execution unit and the intent classification and entity extraction unit.

Description

Apparatus and method for optimizing an artificial neural network-based open source Korean understanding pipeline

The present invention relates to an apparatus and method for optimizing an artificial neural network-based open source Korean language understanding pipeline and, more particularly, to an apparatus and method that, when a commercial chatbot framework is developed, can improve an open source chatbot framework for Korean rather than English, together with the performance of a pre-processing function execution unit and an intent classification and entity extraction unit.

In general, a chatbot is a technology that helps users achieve a goal through automated conversation, without human intervention. Like a personal assistant, a chatbot can be used in many areas, such as restaurant reservations, shopping, and search. Many companies have therefore recently adopted chatbots to strengthen their competitiveness.

Driven by this high demand, the global chatbot market is growing explosively, and large companies that provide specialized chatbot services (e.g., Facebook, Google, Naver) have emerged.

For example, in April 2016, eight years after Steve Jobs introduced the App Store, Facebook's Mark Zuckerberg announced a platform for Facebook Messenger, which recorded 1 billion users, more than the number of iPhone users at the time the App Store launched.

In Korea as well, several IT companies (e.g., Naver, Kakao) have developed commercial chatbot frameworks, but their usage fees increase maintenance costs. Kakao, for example, charges 30 won per conversation once usage reaches 30,000 conversations or more.

An open source chatbot framework can be used to reduce development and maintenance costs, but because most such frameworks target English, an open source chatbot framework for Korean, with improved performance, is needed.

The background art of the present invention is disclosed in Korean Patent No. 10-2217457 (registered on February 15, 2021; a chat service providing system equipped with a chat robot and capable of medical consultation according to customer needs).

According to one aspect, the present invention was devised to solve the problems described above, and its object is to provide an apparatus and method for optimizing an artificial neural network-based open source Korean language understanding pipeline that, when a commercial chatbot framework is developed, can improve an open source chatbot framework for Korean rather than English, together with the performance of the pre-processing function execution unit and the intent classification and entity extraction unit.

An apparatus for optimizing an artificial neural network-based open source Korean language understanding pipeline according to one aspect of the present invention comprises: a pre-processing function execution unit including, for pre-processing of input sentences, a tokenization module for separating text from a corpus (a collection of texts) into language elements that cannot be grammatically divided further, a feature extraction module for identifying and extracting the useful features among the tokens, and learning data for training an artificial neural network chatbot; and an intent classification and entity extraction unit including an intent classification and entity extraction module for classifying the intent of a speaker's query sentence and, after the intent is classified, extracting the entities in the sentence, an entity mapping module for mapping the extracted entities to the appropriate slots, and a dialogue policy module for setting the policy of question-and-answer dialogue scenarios.

In the present invention, the tokenization module performs tokenization optimized for Korean stem and ending analysis and for noun extraction on the corpus, which is the collection of texts in the learning data; the feature extraction module selects, from the tokens extracted by the tokenization module, those regarded as useful features; and the learning data includes a data set covering common work tasks used across industrial fields.

In the present invention, the intent classification and entity extraction module classifies which of a plurality of designated slots the intent of the query sentence corresponds to; the entity mapping module maps each entity extracted by the intent classification and entity extraction module to the corresponding slot among the plurality of designated slots; and the dialogue policy module sets the policy of question-and-answer dialogue scenarios according to a designated algorithm.

In the present invention, the intent classification and entity extraction module calculates the total loss by adding the entity loss, the intent loss, and the mask loss, and repeats this calculation during training so as to minimize the total loss.

In the present invention, the intent classification and entity extraction module inputs the token set extracted by the tokenization module to a transformer layer through a plurality of feed-forward layers, inputs the token set output from the transformer layer to a conditional random field algorithm together with an entity set, and calculates the entity loss through the conditional random field algorithm using a predesignated parameter for finding the optimal value.

In the present invention, the intent classification and entity extraction module outputs, from the transformer layer, a class token representing the sentence encoding of the token set extracted by the tokenization module and inputs it to an embedding layer; uses the values output respectively from the class token and from an intent set, each through its own embedding layer, to compute, via a designated similarity formula, a dot-product loss that maximizes similarity and a dot-product loss that minimizes similarity; and calculates the intent loss from the computed values.

In the present invention, the intent classification and entity extraction module passes a mask token included in the token set through the feed-forward layer and the transformer layer and inputs the output to an embedding layer; computes, via a designated similarity formula, a dot-product loss that maximizes similarity and a dot-product loss that minimizes similarity for the transformer-layer output of tokens randomly selected from the token set; and calculates the mask loss from the computed values.

A method for optimizing an artificial neural network-based open source Korean language understanding pipeline according to another aspect of the present invention comprises: a step in which, when a sentence is input, the pre-processing function execution unit pre-processes the input sentence by separating text from a corpus (a collection of texts) into language elements that cannot be grammatically divided further through the tokenization module, and by identifying and extracting the useful features among the tokens through the feature extraction module; and a step in which the intent classification and entity extraction unit classifies the intent of the speaker's query sentence and, after the intent is classified, extracts the entities in the sentence through the intent classification and entity extraction module, and maps the extracted entities to the slots designated for them through the entity mapping module.

In the present invention, in order to classify the intent of the speaker's query sentence and then extract the entities in the sentence, the intent classification and entity extraction module calculates the total loss by adding the entity loss, the intent loss, and the mask loss, and repeats this calculation during training so as to minimize the total loss.

According to one aspect, the present invention makes it possible, when a commercial chatbot framework is developed, to improve an open source chatbot framework for Korean rather than English, together with the performance of the pre-processing function execution unit and the intent classification and entity extraction unit.

FIG. 1 is a diagram showing the schematic configuration of an apparatus for optimizing an artificial neural network-based open source Korean language understanding pipeline according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams showing the learning data for the intent classification experiment and the entity extraction experiment in FIG. 1.
FIG. 4 is a diagram illustrating the detailed operations and functions of the intent classification and entity extraction unit in FIG. 1.
FIG. 5 is a diagram showing the optimized neural network structure of the intent classification and entity extraction module in FIG. 1.
FIG. 6 is a diagram showing the parameters for training the optimized neural network structure of the intent classification and entity extraction module in FIG. 1.
FIG. 7 is a diagram showing the results of a comparison carried out to determine the optimal number of training epochs for the neural network of the intent classification and entity extraction module in the intent classification and entity extraction unit in FIG. 1.
FIG. 8 is a diagram showing the results of comparing the performance of the present embodiment with that of prior-art modules in FIG. 1.

Hereinafter, an embodiment of the apparatus and method for optimizing an artificial neural network-based open source Korean language understanding pipeline according to the present invention is described with reference to the accompanying drawings.

In the drawings, the thickness of lines and the size of components may be exaggerated for clarity and convenience of explanation. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users and operators. These terms should therefore be defined on the basis of the content of this specification as a whole.

FIG. 1 is a diagram showing the schematic configuration of an apparatus for optimizing an artificial neural network-based open source Korean language understanding pipeline according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus according to the present embodiment includes a pre-processing function execution unit (100) and an intent classification and entity extraction unit (200).

The pre-processing function execution unit (100) includes a tokenization module (110), a feature extraction module (120), and learning data (130). The intent classification and entity extraction unit (200) includes an intent classification and entity extraction module (210), an entity mapping module (220), and a dialogue policy module (230).

The tokenization module (110) may use a Korean-specific tokenizer (Korean Tokenizer, a Korean version of Mecab), and the feature extraction module (120) may apply a featurizer that extracts features from the count of each token (Count Vectors Featurizer).
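
The count-based featurization described above can be illustrated with a minimal sketch. It assumes the sentences have already been split into tokens (a real system would obtain them from a Korean morphological analyzer such as Mecab); the function names `build_vocabulary` and `count_vector` and the sample tokens are hypothetical and do not come from the specification.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences):
    """Assign a column index to every distinct token, in first-seen order."""
    vocab = {}
    for tokens in tokenized_sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def count_vector(tokens, vocab):
    """Return the occurrence count of each vocabulary token in one sentence.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokens)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Hypothetical, already-tokenized sentences (illustrative only).
corpus = [["일요일", "대전", "날씨", "어때요"], ["대전", "대전", "식단"]]
vocab = build_vocabulary(corpus)
vectors = [count_vector(sentence, vocab) for sentence in corpus]
```

Each sentence becomes a fixed-length vector whose entries are token counts, which is the kind of input a downstream classifier can consume.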

The intent classification and entity extraction module (210) may apply an optimized version of the Dual Intent Entity Transformer structure, the entity mapping module (220) may apply an Entity Synonym Mapper algorithm, and the dialogue policy module (230) may apply a Transformer Embedding Dialogue Policy.

More specifically, the pre-processing function execution unit (100) includes the tokenization module (110) for separating text from a corpus (a collection of texts) into language elements that cannot be grammatically divided further, the feature extraction module (120) for identifying and extracting the useful features among the tokens, and the learning data (130) for training the artificial neural network chatbot.

The tokenization module (110) performs optimized tokenization, such as Korean stem and ending analysis and noun extraction, on the corpus formed by the texts of the learning data (130). The feature extraction module (120) selects, from the tokens extracted by the tokenization module (110), those regarded as useful features and passes them to the intent classification and entity extraction unit (200).

The learning data (130) is built as a data set covering common work tasks used in most industrial fields. Note, however, that this is not intended to be limiting.

When the input sentence is a query sentence, the intent classification and entity extraction unit (200) uses the intent classification and entity extraction module (210) to classify the intent of the speaker's query sentence and then extract the entities in the sentence, the entity mapping module (220) to map the extracted entities to the appropriate slots, and the dialogue policy module (Dialogue Policy) (230) to set the policy of question-and-answer dialogue scenarios.

FIGS. 2 and 3 show the learning data used for the intent classification experiment and the entity extraction experiment in FIG. 1. For the intent classification experiment, 487 items of learning data in 7 categories, such as meal menu, department contact information, individual work assignments, and calculator, were constructed; for the entity extraction experiment, 3,354 items of learning data in 7 categories, such as date, department, task, name, and time, were constructed. These are examples only and are not intended to be limiting.

FIG. 4 illustrates the detailed operations and functions of the intent classification and entity extraction unit (200) in FIG. 1.

The intent classification and entity extraction module (210) first classifies which slot the intent of the query sentence corresponds to.

For example, when the sentence "일요일 대전 날씨는 어때요?" ("How is the weather in Daejeon on Sunday?") is input, intent classification assigns it to the "weather API slot", the entities (or tokens) "일요일" (Sunday) and "대전" (Daejeon) are extracted from the sentence, and the extracted entities (or tokens) are mapped. Because the intent classification uses an artificial neural network algorithm, it also indicates which intent the query sentence is assigned to and with what probability.
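
How a classifier can report both an intent and a probability can be sketched as follows. The sketch assumes the network has already produced one raw score per intent and converts the scores to probabilities with a softmax; the intent names, the scores, and the use of softmax are illustrative assumptions, not details given in the specification.

```python
import math


def softmax(scores):
    """Convert raw per-intent scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - m) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def classify_intent(scores):
    """Return the most probable intent and its probability."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best]


# Hypothetical scores for the weather question (illustrative only).
scores = {"weather": 4.0, "menu": 1.0, "contact": 0.5}
intent, confidence = classify_intent(scores)
```

A dialogue system can compare `confidence` against a threshold to decide whether to act on the predicted intent or ask the user to rephrase.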

The entity mapping module (220) maps each entity extracted by the intent classification and entity extraction module (210) to the corresponding slot.
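
The mapping step can be pictured as filling a dictionary of slots from (entity type, value) pairs, optionally normalizing values through a synonym table. The following is a minimal sketch under that assumption; the slot names, entity types, and synonym entries are hypothetical.

```python
def map_entities_to_slots(entities, slot_for_type, synonyms=None):
    """Fill slots from extracted (entity_type, value) pairs.

    Values found in the synonym table are replaced by their canonical form;
    entity types without a designated slot are skipped.
    """
    synonyms = synonyms or {}
    slots = {}
    for entity_type, value in entities:
        slot = slot_for_type.get(entity_type)
        if slot is None:
            continue  # no slot is designated for this entity type
        slots[slot] = synonyms.get(value, value)
    return slots


# Hypothetical entities extracted from the weather question (illustrative only).
entities = [("date", "일요일"), ("location", "대전")]
slot_for_type = {"date": "weather_date", "location": "weather_city"}
synonyms = {"대전": "대전광역시"}
slots = map_entities_to_slots(entities, slot_for_type, synonyms)
```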

The dialogue policy module (230) sets the policy of question-and-answer dialogue scenarios according to a designated algorithm.

FIG. 5 shows the optimized neural network structure of the intent classification and entity extraction module (210) in FIG. 1.

Referring to FIG. 5, the token set (S101) extracted by the tokenization module (110) is input to the transformer layer (S103) through a plurality of feed-forward layers (S102).

The token set output from the transformer layer (S103) is input, together with the entity set (S105) (see FIG. 3), to the conditional random field algorithm (S104), which calculates the entity loss (S115) using a predesignated parameter for finding the optimal value (e.g., the negative log-likelihood).
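
For reference, the negative log-likelihood of a linear-chain conditional random field is commonly written as below. This is the textbook form, given here as an assumption about what the entity loss looks like; the symbols (emission score $a_t$ computed from the transformer output at position $t$, transition score $T$, token sequence $x$, gold label sequence $y$) are not defined in the specification.

```latex
s(x, y) = \sum_{t=1}^{n} \bigl( a_t(y_t) + T_{y_{t-1},\, y_t} \bigr), \qquad
L_E = -\log p(y \mid x) = \log \sum_{y'} \exp s(x, y') - s(x, y)
```

The sum over $y'$ ranges over all possible label sequences, so minimizing $L_E$ raises the score of the gold sequence relative to every alternative.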

The transformer layer (S103) also outputs a token representing the sentence encoding of the token set (S101) extracted by the tokenization module (110), the class token, which is input to the embedding layer (S106).

Then, using the values output respectively from the class token and from the intent set (S107) (see FIG. 4) via the embedding layer (S108), a designated similarity formula yields a dot-product loss that maximizes the similarity (S109) and a dot-product loss that minimizes the similarity, and the intent loss (Intent Loss) (S110) is calculated from these values.
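
One common way to combine "maximize similarity with the correct label" and "minimize similarity with the other labels" into a single dot-product loss is a softmax-style contrast over the similarities. The sketch below assumes that formulation for a single example; the specification only refers to a designated similarity formula, so the exact form, the function names, and the sample vectors are assumptions.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def dot_product_loss(sentence_vec, positive_vec, negative_vecs):
    """Loss for one example: log-sum-exp over all similarities minus the positive one.

    The loss shrinks as the similarity to the correct intent grows and as the
    similarities to the incorrect intents shrink.
    """
    s_pos = dot(sentence_vec, positive_vec)
    sims = [s_pos] + [dot(sentence_vec, n) for n in negative_vecs]
    m = max(sims)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_norm - s_pos


# Hypothetical embeddings: the first case is well aligned, the second is not.
good = dot_product_loss([1.0, 0.0], [2.0, 0.0], [[0.0, 2.0], [-1.0, 0.0]])
bad = dot_product_loss([1.0, 0.0], [0.0, 2.0], [[2.0, 0.0]])
```

The same function could be reused for the mask loss by treating the embedding of the original token as the positive and embeddings of other tokens as negatives.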

The token set (S101) also includes a mask (Mask) token; its value, after passing through the feed-forward layer (S102) and being output by the transformer layer (S103), is input to the embedding layer (S111).

Here, the embedding layer (S111) consists of an embedding layer for mask tokens and an embedding layer for the actual, unmasked tokens. In practice, 15% of the input tokens in a sequence (i.e., a set of tokens) are selected at random; of these, 70% are replaced with a special mask token (i.e., a token that performs masking), 10% are replaced with a random token (i.e., an arbitrary token other than the original), and the remaining 20% are input as the original tokens.
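
The 15% / 70% / 10% / 20% masking scheme can be sketched as follows. This is a minimal illustration, assuming each token is selected independently with probability 0.15 (so the selected share is 15% in expectation rather than exactly); the mask symbol, the function name, and the sample tokens are hypothetical.

```python
import random

MASK = "__MASK__"  # hypothetical mask symbol


def mask_tokens(tokens, vocab, rng, select_p=0.15, mask_p=0.70, random_p=0.10):
    """Return a masked copy of `tokens` and the indices that were selected.

    Each selected token is replaced by MASK with probability `mask_p`, by a
    random vocabulary token with probability `random_p`, and kept otherwise.
    """
    out = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_p:
            continue  # token is not selected for the masking objective
        selected.append(i)
        r = rng.random()
        if r < mask_p:
            out[i] = MASK
        elif r < mask_p + random_p:
            out[i] = rng.choice(vocab)
        # otherwise the original token is kept (the remaining 20%)
    return out, selected


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = [f"tok{i}" for i in range(1000)]
vocab = sorted(set(tokens))
masked, selected = mask_tokens(tokens, vocab, rng)
```

During training, the model is asked to recover the original token at each index in `selected`, which is what the mask loss measures.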

For the randomly selected tokens, the transformer-layer output is used with a designated similarity formula to compute a dot-product loss (Dot-product Loss) that maximizes the similarity (S112) and a dot-product loss that minimizes the similarity (S112), and the mask loss (S113) is calculated from these values.

Finally, the intent classification and entity extraction module (210) calculates the total loss by adding the entity loss (S115), the intent loss (S110), and the mask loss (S113), and repeats this calculation during training so as to minimize the total loss.
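
Written as an equation, with $L_E$, $L_I$, and $L_M$ denoting the entity, intent, and mask losses and $\theta$ the trainable parameters (symbols introduced here for illustration), the training objective described above is:

```latex
L_{\text{total}} = L_E + L_I + L_M, \qquad
\theta^{*} = \arg\min_{\theta} L_{\text{total}}(\theta)
```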

The parameters for training the optimized neural network structure of the intent classification and entity extraction module (210) are shown in FIG. 6.

For example, the parameters fall into seven types: the number of transformer layers, the transformer model size, whether a masked language model is used, the dropout rate, the sparsity ratio, the embedding dimension, and the hidden layer size; an optimal value is set for each. These are examples only and are not intended to be limiting.

FIG. 7 shows the results of a comparison carried out to determine the optimal number of training epochs for the neural network of the intent classification and entity extraction module in the intent classification and entity extraction unit of FIG. 1. Using the learning data (130), the number of epochs was varied from 100 to 900, and the accuracy and F1-score of intent classification and entity extraction were compared. With 500 epochs, intent classification reached an accuracy of 98.2% and an F1-score of 98.4%, and entity extraction reached an accuracy of 97.4% and an F1-score of 94.7%, the best performance observed.

FIG. 8 shows the results of comparing the performance of the present embodiment with that of prior-art modules in FIG. 1. The comparison covered five recent existing modules: a keyword (Keyword) module, a fallback (Fallback) module, a conditional random field (Conditional Random Field), a fusion of the Dual Intent Entity Transformer with a conditional random field, and the Dual Intent Entity Transformer. The intent classification function of the present embodiment achieved 98.4%, an improvement of up to 19.8% over the existing algorithms, and the entity extraction function achieved 94.7%, an improvement of up to 3.9% over the existing algorithms.

As described above, the present embodiment makes it possible, when a commercial chatbot framework is developed, to improve an open source chatbot framework for Korean rather than English, together with the performance of the pre-processing function execution unit and the intent classification and entity extraction unit. Because it works with Korean, it can be applied flexibly to domestic industries, and it can be applied to various fields in which chatbots are used, such as robotic process automation (RPA), smart grids (Smart Grid), and mobile services. In addition to reducing development and maintenance costs, a Korean-capable chatbot improves customer satisfaction and the company's image.

Although the present invention has been described with reference to the embodiment shown in the drawings, this is merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. The technical scope of protection of the present invention should therefore be determined by the claims below. The implementations described herein may be realized as, for example, a method or process, an apparatus, a software program, a data stream, or a signal. Even if discussed only in the context of a single form of implementation (e.g., only as a method), the features discussed may also be implemented in other forms (e.g., an apparatus or a program). An apparatus may be implemented in suitable hardware, software, firmware, and the like. A method may be implemented in an apparatus such as a processor, which refers generally to processing devices including computers, microprocessors, integrated circuits, and programmable logic devices. Processors also include communication devices such as computers, cell phones, portable/personal digital assistants ("PDA"), and other devices that facilitate communication of information between end users.

100: preprocessing function execution unit
110: tokenization module
120: feature extraction module
130: training data
200: intent classification and entity extraction unit
210: intent classification and entity extraction module
220: entity mapping module
230: dialog policy module

Claims

An apparatus for optimizing an artificial neural network-based open source Korean language understanding pipeline, comprising:
a preprocessing function execution unit including, for preprocessing an input sentence, a tokenization module that separates the input sentence from a corpus, which is a collection of text, into language elements that cannot be grammatically divided further, a feature extraction module that identifies and extracts which features among the token cluster are useful, and training data for training an artificial neural network chatbot; and
an intent classification and entity extraction unit including an intent classification and entity extraction module that classifies the intent of a sentence queried by a speaker and, after classifying the intent, extracts entities in the sentence, an entity mapping module that maps the extracted entities to the appropriate slots, and a dialog policy module that sets a policy for question-and-answer dialog scenarios.

The apparatus of claim 1, wherein
the tokenization module performs tokenization optimized for Korean stem and ending analysis and for noun extraction on the corpus, which is the text collection of the training data,
the feature extraction module selects tokens considered to be useful features from among the tokens extracted through the tokenization module, and
the training data includes a common work target dataset used in industrial fields.

The apparatus of claim 1, wherein
the intent classification and entity extraction module classifies which slot, among a plurality of designated slots, the intent of the query sentence corresponds to,
the entity mapping module maps each entity extracted through the intent classification and entity extraction module to a corresponding slot among the plurality of designated slots, and
the dialog policy module sets a policy for question-and-answer dialog scenarios according to a specified algorithm.
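
As a non-limiting illustration of the entity mapping recited above, the following minimal Python sketch fills designated slots from extracted entities. The slot names and the function `map_entities_to_slots` are hypothetical and are not taken from the specification; the sketch assumes each entity arrives as a (type, value) pair and that a slot is matched by entity type.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical slot names; the claims only recite "a plurality of designated slots".
DESIGNATED_SLOTS = ("location", "date", "device")


def map_entities_to_slots(
    entities: List[Tuple[str, str]],
    slots: Tuple[str, ...] = DESIGNATED_SLOTS,
) -> Dict[str, Optional[str]]:
    """Fill each designated slot with the value of the entity of the same type.

    Entities whose type matches no designated slot are ignored, and slots
    with no matching entity stay None.
    """
    filled: Dict[str, Optional[str]] = {slot: None for slot in slots}
    for entity_type, value in entities:
        if entity_type in filled:
            filled[entity_type] = value
    return filled


if __name__ == "__main__":
    extracted = [("location", "Seoul"), ("date", "tomorrow"), ("color", "red")]
    print(map_entities_to_slots(extracted))
```

In this sketch an unfilled slot remains `None`, which a dialog policy could use to decide whether a follow-up question is needed; that use is an assumption, not something the claims specify.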

The apparatus of claim 1, wherein the intent classification and entity extraction module
calculates a total loss by summing an entity loss, an intent loss, and a mask loss, and performs training to minimize the total loss by repeating the calculation of the total.
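
As a non-limiting illustration of summing the three losses and repeating the calculation to minimize the total, the following toy Python sketch runs plain gradient descent on a single scalar parameter. The three quadratic functions are stand-ins chosen only so the example is runnable; they are assumptions and do not represent the actual entity, intent, or mask losses of the embodiment.

```python
def entity_loss(w: float) -> float:
    # Stand-in for the entity loss (assumption, not the real loss).
    return (w - 1.0) ** 2


def intent_loss(w: float) -> float:
    # Stand-in for the intent loss (assumption, not the real loss).
    return (w - 2.0) ** 2


def mask_loss(w: float) -> float:
    # Stand-in for the mask loss (assumption, not the real loss).
    return (w - 3.0) ** 2


def total_loss(w: float) -> float:
    """Total loss as the plain sum of the three component losses."""
    return entity_loss(w) + intent_loss(w) + mask_loss(w)


def train(w: float = 0.0, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeat the total-loss calculation and step against its gradient."""
    for _ in range(steps):
        # Analytic gradient of the summed quadratics with respect to w.
        grad = 2.0 * (w - 1.0) + 2.0 * (w - 2.0) + 2.0 * (w - 3.0)
        w -= learning_rate * grad
    return w


if __name__ == "__main__":
    w_final = train()
    print(w_final, total_loss(w_final))
```

Because the components are summed without weights, the parameter settles at the point that balances all three objectives rather than at the minimum of any single one.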

The apparatus of claim 4, wherein the intent classification and entity extraction module
inputs the token set extracted through the tokenization module to a transformer layer through a plurality of forward layers,
inputs the token set output from the transformer layer to a conditional random field algorithm together with an entity set, and
calculates the entity loss using a parameter for which a pre-specified optimal value is found through the conditional random field algorithm.
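
As a non-limiting illustration of an entity loss obtained through a conditional random field, the following Python sketch computes the negative log-likelihood of a tag sequence under a linear-chain CRF. The claims do not give the formula, so this standard formulation is an assumption; `emissions` stands in for per-token scores derived from the transformer output, `transitions` for the learned CRF parameters, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(
    emissions: List[List[float]],
    transitions: List[List[float]],
    tags: Sequence[int],
) -> float:
    """Unnormalized score of one tag sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(
    emissions: List[List[float]], transitions: List[List[float]]
) -> float:
    """Log of the summed scores of all tag sequences (forward algorithm)."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            + emissions[t][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)


def crf_entity_loss(
    emissions: List[List[float]],
    transitions: List[List[float]],
    tags: Sequence[int],
) -> float:
    """Negative log-likelihood of the gold entity tags."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags
    )


if __name__ == "__main__":
    emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.1]]  # 3 tokens, 2 tags
    transitions = [[0.1, -0.2], [0.0, 0.4]]
    print(crf_entity_loss(emissions, transitions, [0, 1, 0]))
```

The loss is never negative, and it shrinks as the gold tag sequence takes a larger share of the total probability mass.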

The apparatus of claim 5, wherein the intent classification and entity extraction module
outputs, from the transformer layer, a class token representing the sentence encoding of the token set extracted through the tokenization module and inputs the class token to an embedding layer, and
then, using the values output from the class token and from an intent set through different embedding layers, calculates, through a specified similarity calculation formula, an inner product loss that maximizes similarity and an inner product loss that minimizes similarity, and calculates the intent loss using the calculated values.
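
As a non-limiting illustration of a loss built from inner product similarities, the following Python sketch raises the similarity between the sentence (class token) embedding and the correct intent embedding while lowering its similarity to other intents. The claims refer only to "a specified similarity calculation formula", so the softmax-style form used here is an assumption, and the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product used as the similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def intent_loss(
    sentence_vec: Sequence[float],
    positive_intent_vec: Sequence[float],
    negative_intent_vecs: List[Sequence[float]],
) -> float:
    """Softmax-style loss over inner product similarities.

    Minimizing it pushes the similarity with the correct intent up
    and the similarities with the other intents down.
    """
    s_pos = dot(sentence_vec, positive_intent_vec)
    s_negs = [dot(sentence_vec, v) for v in negative_intent_vecs]
    log_norm = math.log(math.exp(s_pos) + sum(math.exp(s) for s in s_negs))
    return -(s_pos - log_norm)


if __name__ == "__main__":
    sentence = [1.0, 0.0]
    # Sentence embedding aligned with the correct intent: small loss.
    print(intent_loss(sentence, [1.0, 0.0], [[0.0, 1.0]]))
    # Sentence embedding aligned with a wrong intent: larger loss.
    print(intent_loss(sentence, [0.0, 1.0], [[1.0, 0.0]]))
```

The same similarity-based form could in principle be applied to a masked token and its candidate tokens, although the specification's exact formula may differ.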

The apparatus of claim 5, wherein the intent classification and entity extraction module
passes a mask token included in the token set through the forward layer and the transformer layer and inputs the output to an embedding layer, and
calculates, through a specified similarity calculation formula applied to the output of the transformer layer for a token randomly selected from the token set, an inner product loss that maximizes similarity and an inner product loss that minimizes similarity, and calculates the mask loss using the calculated values.

A method for optimizing an artificial neural network-based open source Korean language understanding pipeline, comprising:
when a sentence is input, separating, by a preprocessing function execution unit through a tokenization module for preprocessing the input sentence, the input sentence from a corpus, which is a collection of text, into language elements that cannot be grammatically divided further, and identifying and extracting, through a feature extraction module, which features among the token cluster are useful; and
classifying, by an intent classification and entity extraction unit through an intent classification and entity extraction module, the intent of a sentence queried by a speaker, extracting entities in the sentence after classifying the intent, and mapping, through an entity mapping module, the extracted entities to the slots designated for them.

The method of claim 8, wherein
the tokenization module performs tokenization optimized for Korean stem and ending analysis and for noun extraction on the corpus, which is the text collection of the training data, and
the feature extraction module selects tokens considered to be useful features from among the tokens extracted through the tokenization module.

The method of claim 8, wherein
the intent classification and entity extraction module classifies which slot, among a plurality of designated slots, the intent of the query sentence corresponds to, and
the entity mapping module maps each entity extracted through the intent classification and entity extraction module to a corresponding slot among the plurality of designated slots.

The method of claim 8, wherein, to classify the intent of the sentence queried by the speaker and then extract entities in the sentence,
the intent classification and entity extraction module
calculates a total loss by summing an entity loss, an intent loss, and a mask loss, and performs training to minimize the total loss by repeating the calculation of the total.

The method of claim 11, wherein the intent classification and entity extraction module
inputs the token set extracted through the tokenization module to a transformer layer through a plurality of forward layers,
inputs the token set output from the transformer layer to a conditional random field algorithm together with an entity set, and
calculates the entity loss using a parameter for which a pre-specified optimal value is found through the conditional random field algorithm.

The method of claim 12, wherein the intent classification and entity extraction module
outputs, from the transformer layer, a class token representing the sentence encoding of the token set extracted through the tokenization module and inputs the class token to an embedding layer, and
then, using the values output from the class token and from an intent set through different embedding layers, calculates, through a specified similarity calculation formula, an inner product loss that maximizes similarity and an inner product loss that minimizes similarity, and calculates the intent loss using the calculated values.

The method of claim 12, wherein the intent classification and entity extraction module
passes a mask token included in the token set through the forward layer and the transformer layer and inputs the output to an embedding layer, and
calculates, through a specified similarity calculation formula applied to the output of the transformer layer for a token randomly selected from the token set, an inner product loss that maximizes similarity and an inner product loss that minimizes similarity, and calculates the mask loss using the calculated values.