KR102606415B1

KR102606415B1 - Apparatus and method for contextual intent recognition using speech recognition based on deep learning

Info

Publication number: KR102606415B1
Application number: KR1020230039972A
Authority: KR
Inventors: 이홍재; 고형석
Original assignee: (주)유알피
Priority date: 2023-03-27
Filing date: 2023-03-27
Publication date: 2023-11-29

Abstract

The present invention relates to a device and method for context intent recognition using deep learning-based speech recognition. Speech data uttered by a user is converted into text by a deep learning-based speech recognition model; an intent recognition model pre-trained for a specific domain is applied to the converted text to recognize the intent of the context; the intent recognition result is compared with the ground-truth label; and misrecognized data is classified and stored as retraining data and then used for retraining to improve the accuracy of the recognition models. The device comprises: a data collection unit that collects speech data uttered by users from an external server and generates training data; a speech recognition unit that applies a deep learning-based speech recognition model to the training data to recognize the speech and convert it into text; a natural language processing unit that applies a deep learning-based intent recognition model to the text recognized by the speech recognition unit and provides an intent recognition result for the text; an intent recognition result processing unit that compares the intent recognition result of the text with the ground-truth label of the training data to classify and store retraining data; and a retraining unit that performs automatic iterative training using the retraining data.

Description

Device and method for context intention recognition using deep learning-based voice recognition {APPARATUS AND METHOD FOR CONTEXTUAL INTENT RECOGNITION USING SPEECH RECOGNITION BASED ON DEEP LEARNING}

The present invention relates to a context intent recognition device and method in which speech data uttered by a user is converted into text by a deep learning-based speech recognition model, an intent recognition model pre-trained for a specific domain is applied to the converted text to recognize the intent of the context, the intent recognition result is compared with the ground-truth label, and misrecognized data is classified and stored as retraining data and then used for retraining to improve the accuracy of the recognition models. In addition, the present invention is characterized in that a single device provides both the process of recognizing the context intent of speech data and the process of retraining the trained models in order to improve the context intent recognition device.

Speech recognition technology identifies linguistic meaning in speech. In general, it refers to technology that takes an acoustic signal obtained through a sound input device such as a microphone, performs phoneme analysis and word recognition, and converts the signal into text sentences.

Speech recognition technology consists of an acoustic model and a language model. The acoustic model estimates how strongly a 'phoneme/word sequence' is related to the 'input speech signal', and the language model expresses, as a probability value, how natural that phoneme/word sequence is.
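
The division of labor between the two models is commonly written as a Bayes decision rule. The formulation below is the standard textbook one, not a formula taken from this specification; the symbols $X$ (the input speech signal) and $W$ (a candidate phoneme/word sequence) are introduced here for illustration only.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The denominator $P(X)$ is dropped because it does not depend on $W$.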

As speech recognition technology advances and services such as voice-driven commands and voice-based conversational chatbots become available, it is increasingly important to recognize a user's utterance, understand its context, and accurately determine the user's intent.

To this end, the accuracy of speech recognition is very important. In particular, when there are many domain-specific words, the error rate of speech recognition may be high, and the inferred user intent may differ depending on the speech recognition result.

Therefore, a method is needed that post-processes speech recognition errors to improve the accuracy of semantic analysis for ambiguous words and domain-specific words.

To solve the above problem, an object of the present invention is to provide a device and method for context intent recognition using deep learning-based speech recognition that uses a deep learning-based speech recognition model, recognizes the context and intent of the text output by speech recognition through natural language processing, and retrains on data whose final intent recognition result differs from the ground truth, thereby increasing the accuracy of the speech recognition model and the intent recognition model.

A context intent recognition device using deep learning-based speech recognition according to an embodiment of the present invention may include: a data collection unit that collects speech data uttered by users from an external server and generates training data; a speech recognition unit that applies a deep learning-based speech recognition model to the training data to recognize the speech and convert it into text; a natural language processing unit that applies a deep learning-based intent recognition model to the text recognized by the speech recognition unit and provides an intent recognition result for the text; an intent recognition result processing unit that compares the intent recognition result of the text with the ground-truth label of the training data to classify and store retraining data; and a retraining unit that performs automatic iterative training using the retraining data.

In addition, the natural language processing unit may include: a preprocessing unit that performs sentence splitting, morpheme segmentation, and stop-morpheme removal on the input text; a word embedding unit that converts the words segmented into morphemes into vectors; and an intent analysis unit that analyzes the words in a sentence to recognize the intent of the sentence.

In addition, the word embedding unit is characterized in that it learns domain-specific named entities using word embeddings pre-trained on a language dictionary, and the intent analysis unit is characterized in that it uses a deep learning-based intent recognition model that takes the embedding vectors output by the word embedding unit as input and classifies the intent of the sentence.

In addition, the intent recognition result processing unit may compare the intent recognition result produced by the intent recognition model with the ground-truth label of the training data and, if they do not match, classify and store that training data as retraining data.

In addition, when the amount of retraining data reaches or exceeds a predetermined number, the speech recognition model and the intent recognition model are automatically retrained using the retraining data.
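
The count-based trigger described here can be sketched as a small buffer. This is a minimal illustration under stated assumptions, not the patented implementation: `RetrainBuffer`, `min_samples` (standing in for the "predetermined number"), and the `retrain` callback are hypothetical names, and the sketch clears the buffer after each retraining run, which the specification does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetrainBuffer:
    """Stores misrecognized samples and triggers retraining at a size threshold."""

    min_samples: int                       # hypothetical "predetermined number"
    retrain: Callable[[List[dict]], None]  # hypothetical hook that retrains both models
    samples: List[dict] = field(default_factory=list)

    def add(self, sample: dict) -> bool:
        """Store one sample; return True if this call triggered retraining."""
        self.samples.append(sample)
        if len(self.samples) < self.min_samples:
            return False
        # Hand the accumulated batch to the retraining hook and start a new buffer.
        batch, self.samples = self.samples, []
        self.retrain(batch)
        return True
```

In use, the result processing unit would call `add` for each failed sample, and the callback would retrain the speech recognition model and the intent recognition model on the batch.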

In addition, the data collection unit is characterized in that it collects speech data specialized for a single domain.

Meanwhile, a context intent recognition method using deep learning-based speech recognition according to an embodiment of the present invention may include, in a context intent recognition device using speech recognition: collecting speech data uttered by users from an external server and generating training data; applying a deep learning-based speech recognition model to the training data to convert the speech data into text; applying a deep learning-based intent recognition model to the text converted by the speech recognition model and providing an intent recognition result for the text; comparing the intent recognition result with the ground-truth label and, if they do not match, classifying and storing the data as retraining data; and retraining the speech recognition model and the intent recognition model using the retraining data.

In addition, the step of applying a deep learning-based intent recognition model to the text converted by the speech recognition model and providing an intent recognition result for the text may include: splitting the text into sentences, segmenting each sentence into morphemes, and attaching morpheme tags; checking the morpheme tags of the tagged sentence and removing stop morphemes; generating embedding vectors for the words segmented into morphemes; and applying the generated embedding vectors as input to the deep learning-based intent recognition model to recognize the intent of the sentence and derive an intent recognition result.

In addition, the step of generating embedding vectors for the words segmented into morphemes is characterized by learning domain-specific named entities using word embeddings pre-trained on a language dictionary.

In addition, the step of collecting speech data uttered by users from the external server and generating training data is characterized by collecting speech data specialized for a single domain.

By applying a deep learning speech recognition model, highly accurate speech recognition can be expected, and deep learning-based context intent recognition can identify context and intent that speech recognition alone cannot capture.

In addition, the user's intent can be identified from the user's speech so that the desired task or service can be provided.

In addition, learning domain-specific words can increase the accuracy of context intent analysis, and retraining on the misrecognized portion of the training data can continuously improve the accuracy of the models.

FIG. 1 is an overall relationship diagram of a context intent recognition device using deep learning-based speech recognition according to an embodiment of the present invention.
FIG. 2 is a block diagram of the functions of the context intent recognition device 100 using deep learning-based speech recognition according to an embodiment of the present invention.
FIG. 3 is a block diagram showing in detail the functions of the natural language processing unit 130 in the context intent recognition device using deep learning-based speech recognition according to an embodiment of the present invention.
FIG. 4 is a diagram showing the hardware structure of the context intent recognition device 100 using deep learning-based speech recognition according to an embodiment of the present invention.
FIG. 5 is a flowchart of a context intent recognition method using deep learning-based speech recognition according to an embodiment of the present invention.
FIG. 6 is a flowchart showing in detail the step of providing an intent recognition result in the context intent recognition method using deep learning-based speech recognition according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands the spirit of the present invention could readily propose, by adding, changing, or deleting components within the same spirit, other regressive inventions or other embodiments that fall within the scope of the present invention, and these are also considered to fall within that scope.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of the inventor; their definitions should therefore be based on the content of this specification as a whole. Where a detailed description of a known configuration or function related to the present invention could obscure the gist of the invention, that description is omitted.

Hereinafter, the context intent recognition device 100 using deep learning-based speech recognition according to the present invention is described with reference to the drawings.

FIG. 1 is an overall relationship diagram of a context intent recognition device using deep learning-based speech recognition according to an embodiment of the present invention.

Referring to FIG. 1, the context intent recognition device 100 may be connected over a network to at least one external server 200 so that they can communicate with each other.

The network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, or the mobile Internet. It may also refer to the worldwide open computer network architecture that provides the TCP/IP protocol and the services running on top of it, such as HTTP (Hyper Text Transfer Protocol), HTTPS (Hyper Text Transfer Protocol Secure), Telnet, and FTP (File Transfer Protocol). The term is not limited to these examples and broadly covers any data communication network capable of transmitting and receiving data in various forms.

In the present invention, the external server 200 is a server from which the context intent recognition device 100 collects speech data, and may refer to servers that hold speech data specialized for various domains.

Here, a domain means the specific field, subject, or scope in which a service is performed, and domain-specific speech data may mean user-uttered speech data that includes the words, proper nouns, and sentence patterns frequently used in a particular domain service.

For example, the external server 200 may be a server that stores, manages, and collects speech data recorded from calls with customers during consultation work such as civil affairs services of administrative agencies, insurance services, and financial services.

The external server 200 may also be a server that stores, manages, and collects speech data of users issuing commands to specific devices in a specific space such as a home network or a smart factory.

The external server 200 may also be a big data provision system.

However, the external server is not limited to these examples and may be any server that stores, manages, and collects speech data specialized for various domains.

The context intent recognition device 100 may collect utterance speech data specific to each domain, train the speech recognition model and the intent recognition model, and perform retraining on sentences for which intent recognition failed.

After the speech recognition model and the intent recognition model have been created and trained through the above process, the device may additionally provide a service in which it receives speech data from an external application server, recognizes the speech and converts it into text, recognizes the intent of the sentence from the converted text through natural language processing, and returns the result.
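
The service flow above (speech in, text and intent out) can be sketched as a simple composition of three stages. This is an illustrative sketch only: `recognize_intent` and its three callable parameters are hypothetical names, and the trained models are passed in as plain functions rather than tied to any particular framework.

```python
from typing import Callable, Dict, List


def recognize_intent(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],   # hypothetical trained speech recognition model
    preprocess: Callable[[str], List[str]],   # hypothetical text preprocessing step
    classify: Callable[[List[str]], str],     # hypothetical trained intent recognition model
) -> Dict[str, str]:
    """Run speech recognition, then natural language processing, and return both results."""
    text = speech_to_text(audio)
    tokens = preprocess(text)
    intent = classify(tokens)
    return {"text": text, "intent": intent}
```

Keeping the stages as separate callables mirrors the unit structure of the device, so either model can be swapped out after retraining without touching the calling code.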

FIG. 2 is a block diagram of the functions of the context intent recognition device 100 using deep learning-based speech recognition according to an embodiment of the present invention, and FIG. 3 is a block diagram showing in detail the functions of the natural language processing unit 130 in that device.

Referring to FIGS. 2 and 3, the context intent recognition device 100 may include a data collection unit 110, a speech recognition unit 120, a natural language processing unit 130, an intent recognition result processing unit 140, a retraining unit 150, and a model improvement unit 160.

The data collection unit 110 may collect speech data uttered by users from an external server and generate training data.

Here, the data collection unit 110 may collect domain-specific speech data from the external server 200, and the collected speech data is preferably speech data uttered by users.

In addition, so that it can be used as training data, the speech data may include a transcription of the user's utterance and the intent of each sentence. That is, it may include a ground-truth intent label for the input speech data.

If no transcription exists for the speech data, the speech recognition model may be trained through unsupervised learning.

For the intent recognition model that recognizes the intent of a sentence, however, it is preferable to perform supervised learning with ground-truth labels in order to increase accuracy.

In addition, the data collection unit 110 may perform preprocessing in order to store the collected speech data as a training data set.

The preprocessing may include converting the collected data into a data set structure that matches the structure of the training data for the speech recognition model and storing it in a database.
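
One possible shape for such a data set record is sketched below. The specification does not define a schema, so every field name here (`audio_path`, `domain`, `transcript`, `intent_label`) and the raw input keys are assumptions made for illustration. The helper `stt_training_mode` reflects the rule stated earlier that speech data without a transcription can still be used through unsupervised learning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRecord:
    """Hypothetical record layout for one collected utterance."""

    audio_path: str
    domain: str
    transcript: Optional[str]    # None when no transcription exists
    intent_label: Optional[str]  # ground-truth intent, None when unlabeled


def to_record(raw: dict, domain: str) -> TrainingRecord:
    """Normalize one raw collected item into the record layout."""
    transcript = (raw.get("transcript") or "").strip() or None
    intent = (raw.get("intent") or "").strip() or None
    return TrainingRecord(raw["audio_path"], domain, transcript, intent)


def stt_training_mode(record: TrainingRecord) -> str:
    """Choose how the speech recognition model can use this record."""
    return "supervised" if record.transcript is not None else "unsupervised"
```

Records whose `intent_label` is `None` would be left out of intent model training, since supervised learning is preferred there.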

Meanwhile, the data collection unit 110 may collect speech data specialized for a single domain. Because specific words, utterance sentence structures, and intents may differ from domain to domain, it is preferable to carry out everything from data collection to model training separately for each domain.

The speech recognition unit 120 may apply a deep learning-based speech recognition model to the training data collected and generated by the data collection unit 110 to recognize the speech and convert it into text.

Here, an end-to-end (E2E) deep learning STT (Speech to Text) model may be used as the speech recognition model.

An E2E deep learning STT model converts the input audio signal directly into text. Unlike traditional speech recognition approaches, which comprise a feature extractor, a speech model, and a decoder, an E2E deep learning STT model may consist of a single neural network that forms an end-to-end pipeline between input and output.

When training the speech recognition model, general-purpose speech recognition training data may first be applied regardless of domain to raise the accuracy of the model, after which domain-specific speech data may be applied as training data to train the model further.

Alternatively, a speech recognition model that has already been pre-trained with deep learning may be adopted and additionally trained with domain-specific speech data as training data.

The natural language processing unit 130 may apply a deep learning-based intent recognition model to the text recognized by the speech recognition unit 120 and provide an intent recognition result for the text.

More specifically, the natural language processing unit 130 may receive text as input, preprocess it, determine the intent of the context of the text, and provide the final intent as the result.

The natural language processing unit 130 may include a preprocessing unit 131, a word embedding unit 132, a named entity recognition unit 133, and an intent analysis unit 134.

The preprocessing unit 131 may preprocess the input text by performing sentence splitting, morpheme segmentation, and stop-morpheme removal.

The input text may consist of more than one sentence, or may consist of a single sentence that contains clauses joined by conjunctions.

Accordingly, the input text may be split into sentences, and the one or more sentences split from the same input text may be grouped and managed together.

After splitting into sentences, morphological analysis may be performed on each sentence.

For morphological analysis, a morpheme analyzer such as Mecab, Kkma, Komoran, or Okt may be used.

In addition, one or more morpheme analyzers may be used for segmentation, followed by a step that unifies the tag names of the morphemes.

After morphological analysis, morpheme tags may be attached to each sentence. This makes it possible to identify stop morphemes later when removing noise from the sentence.

For each sentence whose morphological analysis is complete, stop morphemes are removed to eliminate noise.

Stop morphemes may include whitespace, punctuation marks, particles, conjunctions, endings, and other elements that are unnecessary for analyzing the intent of the context; unnecessary words among nouns and adjectives may also be defined and removed.

In addition, normalization may be performed to merge words that are written differently into the same word.

Text normalization may include unifying upper and lower case, converting Chinese characters to Hangul, and so on.
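
The preprocessing steps above can be sketched as follows. To stay self-contained, the sketch does not call a real morpheme analyzer; it assumes the text has already been segmented into `(surface, tag)` pairs. The coarse tag names in `STOP_TAGS` are hypothetical, since Mecab, Kkma, Komoran, and Okt each use their own tag sets, and normalization is reduced to lowercasing only.

```python
import re
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical unified tag names; a real system would map each analyzer's tags onto these.
STOP_TAGS = {"PARTICLE", "ENDING", "CONJUNCTION", "PUNCT", "SPACE"}


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def normalize(token: str) -> str:
    """Minimal normalization: unify upper and lower case."""
    return token.lower()


def remove_stop_morphemes(
    tagged: Iterable[Tuple[str, str]],
    stop_words: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Drop morphemes with a stop tag or listed as stop words; normalize the rest."""
    kept = []
    for surface, tag in tagged:
        token = normalize(surface)
        if tag in STOP_TAGS or token in stop_words:
            continue
        kept.append(token)
    return kept
```

The `stop_words` parameter corresponds to the user-defined list of unnecessary nouns and adjectives mentioned above.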

The word embedding unit 132 may represent the words that have been tokenized and separated into meaningful units by data preprocessing as dense vectors.

More specifically, the word embedding unit 132 may use word2vec, GloVe, or FastText for conversion to dense vectors, or may use an ELMo (Embeddings from Language Models) model, which relies on a pre-trained language model.
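
At its simplest, converting tokens to dense vectors is a table lookup. The sketch below assumes a static, already trained embedding table held in a plain dictionary, which roughly corresponds to the word2vec, GloVe, or FastText case; it does not model ELMo, whose vectors depend on the surrounding context. The function name `embed_tokens` and the zero-vector handling of unknown tokens are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def embed_tokens(
    tokens: Sequence[str],
    table: Dict[str, List[float]],  # hypothetical pre-trained embedding table
    dim: int,
) -> List[List[float]]:
    """Map each token to its dense vector; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [list(table.get(token, zero)) for token in tokens]
```

The resulting sequence of vectors is what a downstream intent classifier would take as input.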

The word embedding unit 132 may first take word embeddings pre-trained on general language resources such as Wikipedia and Korean dictionaries, and then use them to learn a domain-specific named entity dictionary.

For example, when the invention is applied to the domain of handling civil complaints for courts and administrative agencies, an ELMo model pre-trained on a general language dictionary may be trained on a named entity dictionary containing entities specific to that service field, such as legal terms, personal information, organization names, document names, and place names, so that it learns to recognize word meanings that reflect the context.

The intent analysis unit 133 may output the intent of a sentence using a deep learning-based intent recognition model that takes the embedding vectors output by the word embedding unit 132 as input and classifies the intent of the sentence.

More specifically, the intent analysis unit 133 receives from the word embedding unit 132 embedding vectors that, having been trained on the domain-specific named entity dictionary, capture word meanings that reflect the context, and classifies the intent of the sentence with the deep learning-based intent recognition model.

Here, a text classification model such as a Text CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network) may be applied as the deep learning-based intent recognition model.

For example, the sentences below may all be judged to have the intent 'inquiry about expected issuance schedule', with 'driver's license' as the target word (object word).

"When will the driver's license I applied for be issued?"

"I applied for a driver's license a while ago. How long does it take to be issued?"

"I would like to know the issuance status of the driver's license I applied for."

"Please tell me the expected schedule for my driver's license issuance."

"Can I find out the status of my driver's license issuance?"

"How can I check the issuance date of my driver's license?"

"I have an inquiry about driver's license issuance."

The intent analysis unit 133 may classify the above sentences into the class (label) 'inquiry about expected issuance schedule' using the intent recognition model.

The intent recognition result processing unit 140 may classify and store re-learning data by comparing the intent recognition result of the text recognized through the natural language processing unit 130 with the correct answer label of the learning data.

More specifically, the intent classification result may be output as a probability value for each class, and the class with the highest probability may be output together with its probability value.

The intent recognition result processing unit 140 may first examine the probability value and, if it is lower than a predetermined threshold, treat the recognition result as a failure.

The learning data for a recognition result judged to be a failure may be separated and stored as data for re-learning.

If the probability value is equal to or greater than the predetermined threshold, the unit determines whether the class (label) in the classification result matches the correct answer label of the learning data.

The correct answer label is the ground-truth label for the learning data; it may be stored and managed in a separate database or DB table when the learning data is generated.

If the class (label) in the classification result does not match the correct answer label of the learning data, the intent recognition result processing unit 140 may separate that learning data as re-learning data and store it separately.
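The two-stage check described above (threshold first, then label comparison) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `IntentResult`, `top_class`, and `needs_relearning`, the label strings, and the default threshold of 0.7 are all assumptions introduced here, since the specification does not fix a threshold value.

```python
from dataclasses import dataclass


@dataclass
class IntentResult:
    """Top class and its probability (hypothetical container)."""
    label: str
    probability: float


def top_class(probabilities):
    """Pick the class with the highest probability from a {label: prob} dict."""
    label = max(probabilities, key=probabilities.get)
    return IntentResult(label, probabilities[label])


def needs_relearning(result, correct_label, threshold=0.7):
    """Return True when the sample should be stored as re-learning data.

    threshold=0.7 is an illustrative value, not taken from the specification.
    """
    # Stage 1: a probability below the threshold is treated as a failure.
    if result.probability < threshold:
        return True
    # Stage 2: at or above the threshold, compare with the correct answer label.
    return result.label != correct_label
```

A sample whose top class is confident and correct is left alone; a low-confidence result is stored for re-learning even when its label happens to be right, which follows the order of checks given in the text.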

The re-learning unit 150 may perform automatic iterative learning by applying the re-learning data. Re-learning may be performed when the number of re-learning data items is at least a predetermined number.

This predetermined number may be managed as configuration information of the context intent recognition device 100 and set in advance to a specific value.

When re-learning is performed, to avoid overfitting, the re-learning data is not used for training on its own; instead, new learning data may be generated and the learning data reinforced through data augmentation before being applied.
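The trigger condition and the mixing of re-learning data with new data can be sketched as below. This is only an assumption-laden illustration: `build_relearning_set`, `min_failed`, and `fresh_samples` are hypothetical names, and `fresh_samples` merely stands in for whatever newly generated or augmented learning data the device produces, because the specification does not describe a particular augmentation method.

```python
import random


def build_relearning_set(failed_samples, fresh_samples, min_failed=100, seed=0):
    """Return a shuffled training set, or None if re-learning should not start yet.

    min_failed plays the role of the preset count held in the device settings;
    100 is an illustrative default, not a value from the specification.
    """
    # Re-learning is only triggered once enough failed samples have accumulated.
    if len(failed_samples) < min_failed:
        return None
    # Failed samples are never trained on alone: they are combined with
    # new or augmented learning data to reduce the risk of overfitting.
    combined = list(failed_samples) + list(fresh_samples)
    random.Random(seed).shuffle(combined)
    return combined
```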

Re-learning may be performed automatically over the entire recognition process, including voice recognition and intent recognition. The re-learning process is not carried out on a separate device; the voice recognition process, the intent recognition process, and the re-learning process may all take place within the context intent recognition device 100.

In this way, learning data for which intent recognition failed is automatically re-learned, which can continuously raise the accuracy of the voice recognition model and the intent recognition model.

In addition, when re-learning is performed, parameter changes for model improvement may be applied through the model improvement unit 160.

The context intent recognition device 100 may further include a model improvement unit 160 that initializes weights and adjusts hyperparameters to improve the performance of the models in the device.

The model improvement unit 160 may change the weights and hyperparameters of at least one of the voice recognition model and the intent recognition model and then proceed with re-learning.

For example, weight initialization may be performed using He or Xavier initialization. In addition, when a pre-trained model is used, it may be adjusted through fine-tuning.
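As a rough illustration of the two initialization schemes named above, the sketch below draws weights from a zero-mean normal distribution whose standard deviation is sqrt(2 / fan_in) for He initialization and sqrt(2 / (fan_in + fan_out)) for Xavier (Glorot) initialization. The function names and the plain-list weight layout are assumptions made for this example; the specification does not say which variant (normal or uniform) or which framework is used.

```python
import math
import random


def he_std(fan_in):
    """Standard deviation for He (normal) initialization."""
    return math.sqrt(2.0 / fan_in)


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot (normal) initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_weights(fan_in, fan_out, scheme="he", seed=0):
    """Return a fan_in x fan_out weight matrix as nested lists."""
    if scheme == "he":
        std = he_std(fan_in)
    elif scheme == "xavier":
        std = xavier_std(fan_in, fan_out)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

He initialization is commonly paired with ReLU-style activations and Xavier with tanh or sigmoid, so the choice would normally follow the activation functions of the model being re-trained.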

In addition, after the model receives an input and makes a prediction, the weights may be updated so as to minimize the error between the predicted value and the correct answer. Depending on the update method, an optimization algorithm such as stochastic gradient descent, momentum, or Adam may be used.

The learning rate, which is one of the hyperparameters, may also be adjusted.

However, the invention is not limited to these; weights may be initialized and hyperparameters adjusted in various ways, and the result optimized through performance evaluation.
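The weight update and the role of the learning rate can be sketched with two of the optimizers mentioned above. This is a textbook-style illustration on plain Python lists, with hypothetical function names; it does not reflect how the device actually implements training, and Adam is omitted for brevity.

```python
def sgd_step(weights, grads, lr):
    """Plain gradient descent update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


def momentum_step(weights, grads, velocity, lr, beta=0.9):
    """Momentum update: v <- beta * v + g, then w <- w - lr * v."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def quadratic_grad(weights, target=3.0):
    """Gradient of the toy loss sum((w - target)^2), used only for illustration."""
    return [2.0 * (w - target) for w in weights]
```

Repeating `sgd_step` with `quadratic_grad` drives each weight toward the target, and a learning rate that is too large makes the same loop diverge, which is why the learning rate is one of the hyperparameters adjusted during re-learning.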

FIG. 4 is a diagram showing the hardware structure of the context intent recognition device 100 according to an embodiment of the present invention.

Referring to FIG. 4, the hardware structure of the context intent recognition device 100 includes a central processing unit 1000, a memory 2000, a user interface 3000, a database interface 4000, a network interface 5000, a web server 6000, and the like.

The user interface 3000 provides input and output interfaces to the user by means of a graphical user interface (GUI).

The database interface 4000 provides an interface between a database and the hardware structure. The network interface 5000 provides network connections between devices owned by users.

The web server 6000 provides a means for users to access the hardware structure over a network. Most users can use the context intent recognition device 100 by connecting to the web server remotely.

Each step of the configuration or method described above may be implemented as computer-readable code on a computer-readable recording medium or transmitted through a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROM, DVD, magnetic tape, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or various types of communication channels. The computer-readable recording medium may also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one of the components applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or a microprocessor that performs the respective function, and two or more of the components may be combined into a single component that performs all the operations or functions of the combined components. Part of at least one of the components applied to the present invention may also be performed by another of these components. Communication between the components may be performed through a bus (not shown).

FIG. 5 is a flowchart of a context intent recognition method using deep learning-based voice recognition according to an embodiment of the present invention.

Hereinafter, the method of recognizing the intent of a context using deep learning-based voice recognition is described with reference to FIG. 5.

The context intent recognition method using voice recognition is performed by the context intent recognition device 100.

First, the context intent recognition device 100 performs a step (S510) of collecting voice data uttered by users from an external server and generating learning data from it.

Here, the voice data collected from the external server may be voice data specialized for one domain.

The collected voice data is processed through preprocessing to fit the structure of the learning data set and is stored in a database.

Next, the context intent recognition device 100 performs a step (S520) of converting the voice data of the stored learning data into text through a deep learning-based voice recognition model.

Here, an E2E (End-to-End) deep learning STT (Speech to Text) model may be used to perform voice recognition and to output the recognized voice as text.

Thereafter, the context intent recognition device 100 performs a step (S530) of providing an intent recognition result for the text by applying a deep learning-based intent recognition model to the text converted through the voice recognition model.

In step S530, sentence separation, morpheme analysis and tagging, embedding vector generation, and application of a deep learning-based text classification model are performed, and the intent recognition result for the input text is finally output.

Step S530 is described in detail later with reference to FIG. 6.

Next, the context intent recognition device 100 performs a step (S540) of comparing the intent recognition result classified by the deep learning-based intent recognition model with the correct answer label and, if they do not match, classifying and storing the data as re-learning data.

The intent recognition result may be output as the class with the highest probability together with its probability value. If the probability value of the intent recognition result is below the predetermined threshold, the result is judged to be a recognition failure and the data is stored as data for re-learning.

If the probability value is equal to or greater than the threshold, the result is compared with the correct answer label of the learning data; if they do not match, the result is judged to be a recognition failure and the data is stored as data for re-learning.

Next, a step (S550) of re-training the voice recognition model and the intent recognition model by applying the re-learning data may be performed.

In step S550, automatic iterative learning may be performed by applying the data that was judged incorrect in step S540 and classified and stored as re-learning data. Re-learning may be performed when the number of re-learning data items is at least a predetermined number.

When re-learning is performed, a step of reinforcing the re-learning data with new learning data may be carried out to avoid overfitting.

The re-learning process carries out steps S520 to S540.

That is, re-learning is performed automatically over the entire recognition process, including voice recognition and intent recognition, so that model performance is continuously improved.

Step S550 may also include a process of initializing weights and adjusting hyperparameters through the model improvement unit 160 to improve model performance.

FIG. 6 is a flowchart showing in detail the step of providing an intent recognition result in the context intent recognition method using deep learning-based voice recognition according to an embodiment of the present invention.

Referring to FIG. 6, a step (S531) is performed in which the text converted through the voice recognition model is separated into sentences, and each separated sentence is split into morphemes to which morpheme tags are attached.

When the input text consists of one or more sentences, it is split into individual sentences, and the split sentences may be managed as a sentence bundle.

Each separated sentence is split into morphemes using a morphological analyzer such as Mecab, Kkma, Komoran, or Okt, and morpheme tags are attached.

Next, a step (S532) is performed in which the tags of the tagged sentence are checked and unused (stop) morphemes are removed.

Unused morphemes are removed in order to eliminate noise that is unnecessary for sentence analysis. Unused morphemes may include spaces, punctuation marks, particles, conjunctions, endings, and the like.

In addition, normalization may be performed to unify words that are written differently into the same word.
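Steps S531 to S532 can be illustrated with a small filter over already-tagged morphemes. The sketch below does not call any real morphological analyzer; it assumes the input is a list of (surface, tag) pairs using Sejong-style tags (particles beginning with `J`, endings beginning with `E`, punctuation such as `SF`, and the conjunctive adverb `MAJ`). The tag sets, the `NORMALIZATION` table, and the function name are hypothetical choices for this example, and actual tag names differ between analyzers.

```python
# Illustrative tag choices (assumed Sejong-style tags; analyzers differ).
STOP_TAG_PREFIXES = ("J", "E")            # particles and endings
STOP_TAGS = {"SF", "SP", "SS", "SE", "SO", "MAJ"}  # punctuation, conjunctive adverb

# Hypothetical normalization table mapping variant spellings to one form.
NORMALIZATION = {"면허증": "운전면허"}


def remove_unused_morphemes(tagged_morphemes):
    """Drop stop morphemes by tag and normalize the surface forms that remain."""
    kept = []
    for surface, tag in tagged_morphemes:
        if tag.startswith(STOP_TAG_PREFIXES) or tag in STOP_TAGS:
            continue  # noise for intent analysis: skip it
        kept.append((NORMALIZATION.get(surface, surface), tag))
    return kept
```

Filtering by an explicit punctuation set rather than by every tag beginning with `S` is deliberate in this sketch, so that numbers and foreign-language tokens, which may carry meaning in a civil-complaint domain, are not discarded.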

Next, the context intent recognition device 100 performs a step (S533) of generating embedding vectors for the words separated into morpheme units.

When the embedding vectors are generated, word embeddings pre-trained on general language resources such as Wikipedia or a Korean dictionary may be used to learn a domain-specific entity name dictionary.

Thereafter, the context intent recognition device 100 performs a step (S534) of recognizing the intent of the sentence and deriving an intent recognition result by applying the embedding vectors, generated after additionally learning the entity name dictionary, as input to the deep learning-based intent recognition model.

Here, the embedding vectors are embedding vectors trained on the domain-specific entity name dictionary so that the contextual meaning of words can be recognized.

The context intent recognition device 100 may classify the intent of the sentence and output the result by applying a deep learning-based text classification model, such as TextCNN or an RNN, configured to receive the embedding vectors as input.

The configuration and features of the present invention have been described above on the basis of embodiments, but the present invention is not limited thereto. It will be apparent to those skilled in the art to which the present invention pertains that various changes and modifications may be made within the spirit and scope of the present invention, and such changes and modifications therefore fall within the scope of the appended claims.

100: Context intent recognition device
110: Data collection unit
120: Voice recognition unit
130: Natural language processing unit
131: Preprocessing unit
132: Word embedding unit
133: Intent analysis unit
140: Intent recognition result processing unit
150: Re-learning unit
160: Model improvement unit
200: External server

Claims

A context intent recognition device using voice recognition, comprising:
a data collection unit that collects voice data uttered by a user from an external server and generates learning data;
a voice recognition unit that recognizes the voice and converts it into text by applying a voice recognition model to the learning data;
a natural language processing unit that applies a deep learning-based intent recognition model to the text recognized by the voice recognition unit and provides an intent recognition result for the text;
an intent recognition result processing unit that compares the intent recognition result of the text with the correct answer label of the learning data and, if they do not match, classifies the learning data as re-learning data and stores it;
a re-learning unit that, when the number of re-learning data items is at least a predetermined number, automatically re-trains the voice recognition model and the intent recognition model by reinforcing the learning data through data augmentation using the re-learning data; and
a model improvement unit that evaluates model performance while updating weights and hyperparameters for at least one of the voice recognition model and the intent recognition model,
wherein the data collection unit collects voice data specialized for one domain,
wherein the voice recognition model is trained by applying the domain-specific voice data as learning data on the basis of a pre-trained E2E (End-to-End) deep learning STT (Speech to Text) model,
wherein the natural language processing unit comprises:
a pre-processing unit that performs sentence separation, morpheme separation, unused morpheme removal, and normalization on the text recognized by the voice recognition unit and tokenizes it into meaningful word units;
a word embedding unit that converts the words tokenized by the pre-processing unit into vectors; and
an intent analysis unit that classifies the intent of a sentence by applying, as input to the intent recognition model, the embedding vector output by the word embedding unit after additionally learning an entity name dictionary,
and wherein the word embedding unit learns domain-specific entity names using word embeddings pre-trained with a language dictionary.

(deleted)

A context intent recognition method using voice recognition, performed on a context intent recognition device using voice recognition, the method comprising:
a learning data generation step in which a data collection unit collects voice data uttered by a user from an external server and generates learning data;
a voice recognition step in which a voice recognition unit converts the voice data into text by applying a voice recognition model to the learning data;
an intent recognition step in which a natural language processing unit applies a deep learning-based intent recognition model to the text converted through the voice recognition model and provides an intent recognition result for the text;
a re-learning data generation step in which an intent recognition result processing unit compares the intent recognition result with the correct answer label and, if they do not match, classifies the learning data as re-learning data and stores it; and
a re-learning step in which a re-learning unit, when the number of re-learning data items is at least a predetermined number, automatically re-trains the voice recognition model and the intent recognition model by reinforcing the learning data through data augmentation using the re-learning data,
wherein the learning data generation step collects voice data specialized for one domain,
wherein the voice recognition model is trained by applying the domain-specific voice data as learning data on the basis of a pre-trained E2E (End-to-End) deep learning STT (Speech to Text) model,
wherein the intent recognition step comprises:
separating the text converted through the voice recognition model into sentences, splitting the separated sentences into morphemes, and attaching morpheme tags to them;
checking the morpheme tags of the tagged sentence and removing unused morphemes;
tokenizing the sentence from which the unused morphemes have been removed into meaningful word units and generating an embedding vector; and
recognizing the intent of the sentence and deriving an intent recognition result by applying, as input to the intent recognition model, the embedding vector generated after additionally learning an entity name dictionary,
wherein the re-learning step evaluates model performance while updating weights and hyperparameters for at least one of the voice recognition model and the intent recognition model,
and wherein the step of tokenizing the sentence into meaningful word units and generating an embedding vector learns domain-specific entity names using word embeddings pre-trained with a language dictionary.

(deleted)