KR20200145324A

KR20200145324A - Deep learning model-based TTS service providing method for synthesizing speech having pauses respectively corresponding to pauses included in text data

Info

Publication number: KR20200145324A
Application number: KR1020190074182A
Authority: KR
Inventors: 채경수; 황지욱; 유재성; 장세영
Original assignee: 주식회사 머니브레인
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2020-12-30

Abstract

According to one feature of the present invention, suitable preprocessing can be performed on the training data before the deep learning model used to generate a text-to-speech (TTS) model is actually trained. The preprocessing may first analyze the training data, that is, the text information and the voice information, and adjust the two so that they match before actual training. When the text contains marks that call for a pause, such as commas and periods, but the actual voice has no pause at those positions, the pause marks can be removed from the corresponding text data. Alternatively, in the same situation, the corresponding audio data can be manipulated so that it matches the text data. Through this preprocessing, training data can be generated for each word unit that is actually delimited by pauses, which gives regularity to the deep learning model. The method comprises a data receiving step, a data adjustment step, a learning step, and a TTS model construction step.

Description

DEEP LEARNING MODEL-BASED TTS SERVICE PROVIDING METHOD FOR SYNTHESIZING SPEECH HAVING PAUSES RESPECTIVELY CORRESPONDING TO PAUSES INCLUDED IN TEXT DATA

The present disclosure relates to text-to-speech conversion, that is, TTS, and more particularly to a method of providing TTS based on word units (eojeol), and the like.

In recent years, with advances in artificial intelligence, and in natural language understanding in particular, interactive AI agent systems have been increasingly developed and used. Such systems move away from machine operation through traditional machine-centered command input and output, and let users operate machines and obtain desired services from them in a more human-friendly way, for example through dialogue in natural language in the form of voice and/or text. Accordingly, in various fields including (but not limited to) online counseling centers and online shopping malls, users can request a desired service from an interactive AI agent system through natural language conversation in the form of voice and/or text, and obtain the desired result from it. In particular, interactive AI agent systems in which text input and output is excluded or greatly reduced and most user interaction takes place (remotely) by voice input and output, such as smart speaker systems, have recently been attracting attention.

Text-to-speech conversion, that is, the TTS process, is important for building a voice-based interactive AI agent system. Commercially available TTS technology basically prepares a recorded file for each syllable through learning, then collects and concatenates the voice recordings that match each syllable of the input text. Such TTS technology also performs various additional processing so that the concatenated recordings sound more natural. With this conventional TTS technology, when the text contains a place in the middle of a sentence where a pause should be read, such as a comma or other punctuation mark, a well-controlled result can be produced that pauses at each comma or punctuation mark accordingly.

However, a TTS built by deep learning can behave somewhat unstably when generating speech that is supposed to contain the pauses indicated by the text. Deep learning typically takes input in a predetermined unit, such as a sentence or paragraph, and learns from that unit as a whole. The speaker who recorded the voice material used for training, however, sometimes pauses in his or her own style, for example out of personal habit, without exactly following the commas, periods, and the like defined in the text. As a result, the pauses in the text and in the voice often disagree in the training material actually provided to the model, and a deep learning model trained on such material may, when it later receives text and generates speech, run words together or pause at inappropriate positions that do not match the text.

Accordingly, the present disclosure seeks to provide a method of constructing a deep learning-based TTS model that can pause at appropriate positions in a manner consistent with the text.

According to one feature of the present disclosure, appropriate preprocessing may be performed before the deep learning model used to generate a TTS model is actually trained on the training data. The preprocessing according to the present disclosure may first analyze the material for deep learning, that is, the text information and the voice information, and adjust the two so that they match before actual training. According to an embodiment of the present disclosure, when the text contains a pause-related mark such as a comma or a period but the corresponding voice has no pause at that position, the pause mark may be removed from the text material as well. According to an embodiment of the present disclosure, when the text contains a pause-related mark such as a comma or a period but the corresponding voice has no pause at that position, the voice material may be manipulated so that it matches the text material. That is, this preprocessing can generate training data for each word unit that is actually delimited by pauses, and can give regularity to the deep learning model.
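
The text-side adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it assumes that the silence duration at each pause mark has already been measured in the recording by a separate step that the disclosure does not specify, and the names `reconcile_text_with_audio`, `gap_seconds`, and the 0.1-second threshold `MIN_PAUSE_SEC` are hypothetical.

```python
# Hypothetical threshold: a measured silence shorter than this counts as "no pause".
MIN_PAUSE_SEC = 0.1

# Marks treated as pause indications in the text (comma and period).
PAUSE_MARKS = ",."


def reconcile_text_with_audio(text, gap_seconds, min_pause=MIN_PAUSE_SEC):
    """Drop pause marks from `text` where the recording has no actual pause.

    gap_seconds[i] is the silence, in seconds, measured in the audio at the
    i-th pause mark of `text`. How it is measured is outside this sketch.
    """
    mark_count = sum(1 for ch in text if ch in PAUSE_MARKS)
    if mark_count != len(gap_seconds):
        raise ValueError("need exactly one gap measurement per pause mark")

    gaps = iter(gap_seconds)
    kept = []
    for ch in text:
        # next(gaps) is only consumed when the character is a pause mark.
        if ch in PAUSE_MARKS and next(gaps) < min_pause:
            continue  # the speaker did not pause here, so remove the mark
        kept.append(ch)
    return "".join(kept)


adjusted = reconcile_text_with_audio(
    "Hello, my friend, how are you.", [0.02, 0.35, 0.5]
)
```

Here the first comma is dropped because the speaker paused for only 0.02 seconds, while the second comma and the period are kept. The audio-side alternative, which edits the recording instead of the text, is not covered by this sketch.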

According to another feature of the present disclosure, the text and voice material used as training data of the deep learning model for a TTS model may first be separated at the pause units marked in the text, such as each comma and punctuation mark, that is, into word units, and deep learning may be performed on each separated pair of text and voice material. When input text later arrives, that text is likewise first separated into word units, each word unit is synthesized with the trained model, and the synthesized results are concatenated to produce the final result. When the word units are concatenated, a predetermined time may be inserted at each comma, punctuation mark, or the like that lies between them. According to an embodiment of the present disclosure, a predetermined time interval may be placed, for example, 0.2 seconds around a comma and 0.4 seconds around a punctuation mark, and a more natural result may be obtained by adding or subtracting a random time value within a predetermined range to or from each interval. That is, according to an embodiment of the present disclosure, the synthesis result of the deep learning model is not output as is; instead, the final result is obtained by concatenating the synthesis results of the deep learning model according to a predetermined algorithm.
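
The split, synthesize, and concatenate flow described above can be sketched as follows. This is only an illustration under stated assumptions: the per-unit synthesizer is passed in as a hypothetical callable `synthesize_unit` that returns a list of audio samples, the punctuation mark is taken to be a period, the 0.2-second and 0.4-second values are read as the length of the silence inserted at the mark, and `SAMPLE_RATE` and the jitter range `JITTER_SEC` are assumed values (the text only says "a predetermined range").

```python
import random
import re

SAMPLE_RATE = 16000                    # assumed sample rate for this sketch
BASE_PAUSE_SEC = {",": 0.2, ".": 0.4}  # example intervals given in the text
JITTER_SEC = 0.05                      # assumed range of the random adjustment


def split_units(text):
    """Split text into (unit, trailing_mark) pairs at commas and periods."""
    pairs = []
    for match in re.finditer(r"([^,.]+)([,.]?)", text):
        unit = match.group(1).strip()
        if unit:
            pairs.append((unit, match.group(2)))
    return pairs


def pause_samples(mark, rng):
    """Number of silent samples to insert after a unit that ends in `mark`."""
    base = BASE_PAUSE_SEC.get(mark, 0.0)
    if base == 0.0:
        return 0
    seconds = max(0.0, base + rng.uniform(-JITTER_SEC, JITTER_SEC))
    return int(round(seconds * SAMPLE_RATE))


def synthesize_with_pauses(text, synthesize_unit, rng=None):
    """Synthesize each unit separately and join the results with silences."""
    rng = rng or random.Random()
    audio = []
    for unit, mark in split_units(text):
        audio.extend(synthesize_unit(unit))              # per-unit model output
        audio.extend([0.0] * pause_samples(mark, rng))   # inserted silence
    return audio


# Stand-in synthesizer for demonstration: one sample per character.
fake_tts = lambda unit: [1.0] * len(unit)
audio = synthesize_with_pauses("ab, cd.", fake_tts, random.Random(0))
```

Keeping the pause logic outside the model means the pause lengths are set by the joining rule rather than learned from the recordings, which is the regularity the paragraph above aims for.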

According to the present disclosure, even a deep learning-based TTS model can provide speech with more regular and controlled pauses, that is, pauses that match the text.

FIG. 1 is a diagram schematically illustrating a system environment in which an interactive AI agent system can be implemented according to an embodiment of the present disclosure.
FIG. 2 is a functional block diagram schematically illustrating the functional configuration of the user terminal 102 of FIG. 1 according to an embodiment of the present disclosure.
FIG. 3 is a functional block diagram schematically illustrating the functional configuration of the interactive AI agent server 106 of FIG. 1 according to an embodiment of the present disclosure.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. Detailed descriptions of well-known functions and configurations are omitted where they might unnecessarily obscure the subject matter of the present disclosure. It should also be understood that what is described below relates only to one embodiment of the present disclosure, and the present disclosure is not limited thereto.

The terms used in the present disclosure are used only to describe specific embodiments and are not intended to limit the present disclosure. For example, a component expressed in the singular should be understood to include a plurality of components unless the context clearly indicates the singular only. The term "and/or" as used in the present disclosure should be understood to encompass any and all possible combinations of one or more of the listed items. Terms such as "include" or "have" as used in the present disclosure are intended only to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the present disclosure, and their use is not intended to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In embodiments of the present disclosure, a 'module' or 'unit' means a functional part that performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one software module and implemented by at least one processor, except for a 'module' or 'unit' that needs to be implemented in specific hardware.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically illustrating a system environment 100 in which an interactive AI agent system may be implemented according to an embodiment of the present disclosure. As shown, the system environment 100 includes a plurality of user terminals 102, a communication network 104, an interactive AI agent server 106, and an external service server 108.

According to an embodiment of the present disclosure, each of the plurality of user terminals 102 may be any user electronic device having a wired or wireless communication function. Each user terminal 102 may be any of various wired or wireless communication terminals, including, for example, a smartphone, a tablet PC, a music player, a smart speaker, a desktop, a laptop, a PDA, a game console, a digital TV, and a set-top box, and it should be understood that it is not limited to a specific form.

According to an embodiment of the present disclosure, each user terminal 102 may communicate with the interactive AI agent server 106, that is, transmit and receive necessary information, through the communication network 104. According to an embodiment of the present disclosure, each user terminal 102 may communicate with the external service server 108, that is, transmit and receive necessary information, through the communication network 104. According to an embodiment of the present disclosure, each user terminal 102 may receive user input in the form of voice, text, and/or touch from the outside, and may provide the user with an operation result corresponding to that user input (for example, provision of a specific conversation response and/or performance of a specific task), obtained through communication with the interactive AI agent server 106 and/or the external service server 108 over the communication network 104 (and/or through processing within the user terminal 102).

In embodiments of the present disclosure, performing a task as an operation corresponding to user input may include performing any of various types of tasks, including (but not limited to) searching for information, purchasing goods, composing a message, composing an email, making a phone call, playing music, taking a picture, finding the user's location, and map/navigation services. According to an embodiment of the present disclosure, each user terminal 102 may provide a conversation response, as an operation result corresponding to user input, in voice form. According to an embodiment of the present disclosure, the user terminal 102 may provide the response to the user in various forms that further include, in addition to the voice response, signals in other visual, auditory, and/or tactile forms (which may include, for example, voice, sound, text, video, images, symbols, emoticons, hyperlinks, animation, various notices, motion, and haptic feedback, without being limited thereto).

According to an embodiment of the present disclosure, the communication network 104 may include any wired or wireless communication network, for example a TCP/IP communication network. According to an embodiment of the present disclosure, the communication network 104 may include, for example, a Wi-Fi network, a LAN, a WAN, and the Internet, but the present disclosure is not limited thereto. According to an embodiment of the present disclosure, the communication network 104 may be implemented using, for example, Ethernet, GSM, EDGE (Enhanced Data GSM Environment), CDMA, TDMA, OFDM, Bluetooth, VoIP, Wi-MAX, WiBro, or any other of various wired or wireless communication protocols.

According to an embodiment of the present disclosure, the interactive AI agent server 106 may communicate with the user terminal 102 through the communication network 104. According to an embodiment of the present disclosure, the interactive AI agent server 106 may transmit and receive necessary information to and from the user terminal 102 through the communication network 104, and may thereby operate so that an operation result corresponding to the user input received on the user terminal 102, that is, a result that matches the user's intent, is provided to the user.

According to an embodiment of the present disclosure, the interactive AI agent server 106 may, for example, receive the user's natural language input in the form of voice, text, and/or touch from the user terminal 102 through the communication network 104, and process the received natural language input on the basis of models prepared in advance to determine the user's intent. According to an embodiment of the present disclosure, the interactive AI agent server 106 may cause a corresponding operation to be performed on the basis of the determined user intent. According to an embodiment of the present disclosure, the interactive AI agent server 106 may, for example, generate a specific control signal and transmit it to the user terminal 102 so that the user terminal 102 performs a specific task that matches the user's intent. According to an embodiment of the present disclosure, the interactive AI agent server 106 may, for example, connect to the external service server 108 through the communication network 104 so that the user terminal 102 can perform a specific task that matches the user's intent.

According to an embodiment of the present disclosure, the interactive AI agent server 106 may, for example, generate a specific conversation response that matches the user's intent and transmit it to the user terminal 102. According to an embodiment of the present disclosure, the interactive AI agent server 106 may generate the corresponding conversation response in voice form on the basis of the determined user intent, and deliver the generated response to the user terminal 102 through the communication network 104. According to an embodiment of the present disclosure, the conversation response generated by the interactive AI agent server 106 may further include, together with the natural language response in voice form described above, visual elements such as text, images, video, symbols, and emoticons, other auditory elements such as sound, or other tactile elements.

According to an embodiment of the present disclosure, the interactive AI agent server 106 may, as mentioned above, communicate with the external service server 108 through the communication network 104. The external service server 108 may be, for example, a medical service server, a financial service server, a messaging service server, an online counseling center server, an online shopping mall server, an information search server, a map service server, or a navigation service server, but the present disclosure is not limited thereto. It should be understood that, according to an embodiment of the present disclosure, the conversation response based on the user's intent that is delivered from the interactive AI agent server 106 to the user terminal 102 may include, for example, data content retrieved and obtained from the external service server 108.

In this drawing, the interactive AI agent server 106 is shown as a separate physical server configured to communicate with the external service server 108 through the communication network 104, but the present disclosure is not limited thereto. It should be understood that, according to another embodiment of the present disclosure, the interactive AI agent server 106 may be included as part of any of various service servers, such as a medical service server, a financial service server, an online counseling center server, or an online shopping mall server.

FIG. 2 is a functional block diagram schematically illustrating the functional configuration of the user terminal 102 shown in FIG. 1 according to an embodiment of the present disclosure. As shown, the user terminal 102 includes a user input receiving module 202, a sensor module 204, a program memory module 206, a processing module 208, a communication module 210, and a response output module 212.

According to an embodiment of the present disclosure, the user input receiving module 202 may receive various forms of input from the user, for example natural language input such as voice input and/or text input (and additionally other forms of input such as touch input). According to an embodiment of the present disclosure, the user input receiving module 202 includes, for example, a microphone and an audio circuit, and may acquire the user's voice input signal through the microphone and convert the acquired signal into audio data. According to an embodiment of the present disclosure, the user input receiving module 202 may include various types of input devices, for example pointing devices such as a mouse, a joystick, and a trackball, as well as a keyboard, a touch panel, a touch screen, and a stylus, and may acquire text input and/or touch input signals entered by the user through these input devices. According to an embodiment of the present disclosure, the user input received by the user input receiving module 202 may be associated with performing a predetermined task, for example executing a predetermined application or searching for predetermined information, but the present disclosure is not limited thereto. According to another embodiment of the present disclosure, the user input received by the user input receiving module 202 may be intended only for carrying on a simple conversation, regardless of the execution of a predetermined application or a search for information.

According to an embodiment of the present disclosure, the sensor module 204 includes one or more sensors of different types and may acquire, through these sensors, state information of the user terminal 102, for example information on the physical state of the user terminal 102, its software and/or hardware state, or the state of its surrounding environment. According to an embodiment of the present disclosure, the sensor module 204 may include, for example, an optical sensor and detect the ambient light state of the user terminal 102 through the optical sensor. According to an embodiment of the present disclosure, the sensor module 204 may include, for example, a motion sensor and detect through the motion sensor whether the user terminal 102 is moving. According to an embodiment of the present disclosure, the sensor module 204 may include, for example, a speed sensor and a GPS sensor and detect the position and/or orientation of the user terminal 102 through these sensors. It should be understood that, according to another embodiment of the present disclosure, the sensor module 204 may include various other types of sensors, including a temperature sensor, an image sensor, a pressure sensor, and a contact sensor.

According to an embodiment of the present disclosure, the program memory module 206 may be any storage medium that stores various programs executable on the user terminal 102, for example various application programs and related data. According to an embodiment of the present disclosure, the program memory module 206 may store various application programs, including, for example, an instant messaging application, a phone call application, an email application, a camera application, a music playback application, a video playback application, an image management application, a map application, and a browser application, together with data related to the execution of these programs. According to an embodiment of the present disclosure, the program memory module 206 may be configured to include various types of volatile or nonvolatile memory, such as DRAM, SRAM, DDR RAM, ROM, a magnetic disk, an optical disk, and flash memory.

According to an embodiment of the present disclosure, the processing module 208 may communicate with each component module of the user terminal 102 and perform various operations on the user terminal 102. According to an embodiment of the present disclosure, the processing module 208 may launch and execute various application programs stored in the program memory module 206. According to an embodiment of the present disclosure, the processing module 208 may, if necessary, receive signals obtained by the user input receiving module 202 and the sensor module 204 and perform appropriate processing on these signals. According to an embodiment of the present disclosure, the processing module 208 may, if necessary, perform appropriate processing on signals received from the outside through the communication module 210.

According to an embodiment of the present disclosure, the communication module 210 enables the user terminal 102 to communicate, through the communication network 104 of FIG. 1, with the online conversation service server 106, the interactive AI agent server 106, and/or the external service server 108. According to an embodiment of the present disclosure, the communication module 210 may, for example, cause signals obtained by the user input receiving module 202 and the sensor module 204 to be transmitted to the interactive AI agent server 106 and/or the external service server 108 through the communication network 104 according to a predetermined protocol. According to an embodiment of the present disclosure, the communication module 210 may receive, for example through the communication network 104, various signals from the interactive AI agent server 106 and/or the external service server 108, such as a response signal containing a natural language response in voice and/or text form or various control signals, and may perform appropriate processing according to a predetermined protocol.

According to an embodiment of the present disclosure, the response output module 212 may include, for example, a speaker or a headset, and may provide the user with an auditory response corresponding to a user input, for example a voice (and sound) response, through the speaker or headset. According to an embodiment of the present disclosure, the response output module 212 may further include various display devices, such as a touch screen based on LCD, LED, OLED, QLED, or similar technology, and through these display devices may present to the user, together with the voice response described above, a visual response corresponding to the user input, such as text, symbols, video, images, hyperlinks, animations, and various notices. According to an embodiment of the present disclosure, the response output module 212 may further include a motion/haptic feedback generator, through which it may provide the user with a tactile response, for example motion/haptic feedback. It should be understood that, according to an embodiment of the present disclosure, the response output module 212 may provide a voice response corresponding to a user input and, at the same time, additionally provide any of a text response, a sound response, and motion/haptic feedback.

FIG. 3 is a functional block diagram schematically illustrating the functional configuration of the interactive AI agent server 106 of FIG. 1 according to an embodiment of the present disclosure. As illustrated, the interactive AI agent server 106 includes a communication module 302, a speech-to-text (STT) module 304, a natural language understanding (NLU) module 306, a conversation understanding knowledge base 308, a user database 310, a conversation management module 312, a conversation generation module 314, and a text-to-speech (TTS) module 316.

According to an embodiment of the present disclosure, the communication module 302 enables the interactive AI agent server 106 to communicate with the user terminal 102 and/or the external service server 108 through the communication network 104 according to a predetermined wired or wireless communication protocol. According to an embodiment of the present disclosure, the communication module 302 may receive, through the communication network 104, a user input transmitted from the user terminal 102 (including, but not limited to, a touch input, a voice input, and/or a text input). According to an embodiment of the present disclosure, the user input may be a signal requesting execution of a specific task or a conversation response.

According to an embodiment of the present disclosure, the communication module 302 may receive, together with or separately from the user input described above, state information of the user terminal 102 transmitted from the user terminal 102 through the communication network 104. According to an embodiment of the present disclosure, the state information may be, for example, various kinds of state information related to the user terminal 102 at the time of the user input (for example, the physical state of the user terminal 102, the software and/or hardware state of the user terminal 102, and information about the environment around the user terminal 102). According to an embodiment of the present disclosure, the communication module 302 may also take the appropriate measures needed to deliver, through the communication network 104 to the user terminal 102, a conversation response (for example, a natural language conversation response in voice and/or text form) and/or a control signal generated by the interactive AI agent server 106 in response to the received user input.

According to an embodiment of the present disclosure, the STT module 304 may receive a voice input from among the user inputs received through the communication module 302 and convert the received voice input into text data based on pattern matching or the like. According to an embodiment of the present disclosure, the STT module 304 may extract features from the user's voice input to generate a feature vector sequence. According to an embodiment of the present disclosure, the STT module 304 may generate a text recognition result, for example a sequence of words, based on Dynamic Time Warping (DTW) or on various statistical models such as a Hidden Markov Model (HMM), a Gaussian Mixture Model (GMM), a deep neural network model, or an n-gram model. According to an embodiment of the present disclosure, when converting the received voice input into text data based on pattern matching, the STT module 304 may refer to the per-user characteristic data of the user database 310 described later.
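
Of the recognition techniques listed above, DTW is the simplest to illustrate. The following is a minimal sketch only, assuming one-dimensional feature sequences, an absolute-difference local cost, and a hypothetical dictionary of reference templates; none of these names or choices come from the present disclosure.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic-time-warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(features: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the (hypothetical) template closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```

A practical recognizer would operate on multi-dimensional feature vectors (for example, per-frame spectral features) rather than scalars; the scalar case is used here only to keep the recurrence visible.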

According to an embodiment of the present disclosure, the NLU module 306 may receive a text input from the communication module 302 or the STT module 304. According to an embodiment of the present disclosure, the text input received by the NLU module 306 may be, for example, a user text input that the communication module 302 received from the user terminal 102 through the communication network 104, or a text recognition result, for example a sequence of words, that the STT module 304 generated from a user voice input received by the communication module 302. According to an embodiment of the present disclosure, the NLU module 306 may receive, together with or after the text input, state information associated with the user input, for example state information of the user terminal 102 at the time of the user input. As described above, the state information may be, for example, various kinds of state information related to the user terminal 102 at the time of the user's voice input and/or text input on the user terminal 102 (for example, the physical state of the user terminal 102, its software and/or hardware state, and information about the environment around the user terminal 102).

According to an embodiment of the present disclosure, the NLU module 306 may map the received text input to one or more user intents based on the conversation understanding knowledge base 308 described later. Here, a user intent may be associated with a series of operations that the interactive AI agent server 106 can understand and perform according to that intent. According to an embodiment of the present disclosure, the NLU module 306 may refer to the state information described above when mapping the received text input to one or more user intents. According to an embodiment of the present disclosure, the NLU module 306 may refer to the per-user characteristic data of the user database 310 described later when mapping the received text input to one or more user intents.

According to an embodiment of the present disclosure, the conversation understanding knowledge base 308 may include, for example, a predefined ontology model. According to an embodiment of the present disclosure, the ontology model may be expressed, for example, as a hierarchy of nodes, where each node is either an "intent" node corresponding to a user intent or a subordinate "attribute" node linked to an "intent" node (that is, an "attribute" node linked directly to an "intent" node, or linked in turn to an "attribute" node of an "intent" node). According to an embodiment of the present disclosure, an "intent" node and the "attribute" nodes linked to it directly or indirectly may constitute one domain, and the ontology may consist of a set of such domains. According to an embodiment of the present disclosure, the conversation understanding knowledge base 308 may be configured to include, for example, a domain for each of the intents that the interactive AI agent system can understand and act upon. It should be understood that, according to an embodiment of the present disclosure, the ontology model may be changed dynamically by adding or deleting nodes, modifying the relationships between nodes, and so on.
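
The intent/attribute hierarchy described above can be pictured as a small tree structure. The sketch below is illustrative only: the class `Node`, the helper `domain_of`, and the `order_food` example domain are hypothetical names chosen for this example, not elements of the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    kind: str  # "intent" or "attribute"
    children: List["Node"] = field(default_factory=list)

    def link(self, child: "Node") -> "Node":
        """Attach a subordinate node and return it so links can be chained."""
        self.children.append(child)
        return child


def domain_of(intent: Node) -> List[str]:
    """Names of an intent node and every attribute node linked to it,
    directly or indirectly (breadth-first order)."""
    names = [intent.name]
    queue = list(intent.children)
    while queue:
        node = queue.pop(0)
        names.append(node.name)
        queue.extend(node.children)
    return names


# Hypothetical domain: one intent with direct and indirect attribute nodes.
order = Node("order_food", "intent")
restaurant = order.link(Node("restaurant", "attribute"))
order.link(Node("delivery_time", "attribute"))
restaurant.link(Node("cuisine", "attribute"))

# The ontology is modeled here as a set of domains keyed by intent name;
# adding or deleting an entry corresponds to a dynamic change of the model.
ontology: Dict[str, Node] = {order.name: order}
```

Calling `domain_of(order)` lists the intent followed by its attributes, which is one way to enumerate the members of a single domain.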

According to an embodiment of the present disclosure, the user database 310 may be a database that stores and manages characteristic data for each user. According to an embodiment of the present disclosure, the per-user characteristic data in the user database 310 may include, for each user, for example, the user's personal information, previous conversation and behavior records, pronunciation characteristics, vocabulary preferences, location, language setting, contact/friend list, and various other user-specific information.

According to an embodiment of the present disclosure, as described above, the STT module 304 may obtain more accurate text data by referring, when converting a voice input into text data, to the per-user characteristic data of the user database 310, for example each user's pronunciation characteristics. According to an embodiment of the present disclosure, the NLU module 306 may determine the user intent more accurately by referring, when determining the user intent, to the per-user characteristic data of the user database 310, for example each user's characteristics or context. According to an embodiment of the present disclosure, as described later, the conversation generation module 314 may refer to the user characteristic data of the user database 310 when generating a conversation response.

In this drawing, the user database 310 that stores and manages per-user characteristic data is shown as located in the interactive AI agent server 106, but the present disclosure is not limited thereto. It should be understood that, according to another embodiment of the present disclosure, the user database 310 may reside, for example, in the user terminal 102, or may be distributed across the user terminal 102 and the interactive AI agent server 106.

According to an embodiment of the present disclosure, the conversation management module 312 may generate a series of operation flows corresponding to the user intent determined by the NLU module 306. According to an embodiment of the present disclosure, the conversation management module 312 may determine, based on a predetermined conversation flow management model, which operation should be performed in response to the user intent received from the NLU module 306, for example which conversation response and/or task should be carried out, and may generate a detailed operation flow accordingly.

According to an embodiment of the present disclosure, the conversation generation module 314 may generate a conversation response to be provided to the user based on the conversation flow generated by the conversation management module 312. According to an embodiment of the present disclosure, when generating a conversation response, the conversation generation module 314 may refer to the user characteristic data of the user database 310 described above (for example, the user's previous conversation records, pronunciation characteristics, vocabulary preferences, location, language setting, and contact/friend list).

According to an embodiment of the present disclosure, the TTS module 316 may receive the conversation response that the conversation generation module 314 generated for transmission to the user terminal 102. The conversation response received by the TTS module 316 may be a sequence of natural language words in text form. According to an embodiment of the present disclosure, the TTS module 316 may generate a voice signal from the received text input according to various types of algorithms. According to an embodiment of the present disclosure, the TTS module 316 may store voice data for each speech unit in advance. According to an embodiment of the present disclosure, the TTS module 316 may process the received text input to divide it into speech units, generate prosody that takes the grammatical structure of the text input into account, and combine and synthesize the stored voice data according to the generated prosody to produce synthesized speech.
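
The unit-based synthesis flow described above (split the text into units, derive prosody, combine stored voice data) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: whitespace-delimited words as the speech unit, a fixed pause table standing in for prosody generation, a 16 kHz sample rate, and the hypothetical names `synthesize` and `unit_store`.

```python
import re
from typing import Dict, List

SAMPLE_RATE = 16000                # assumed sample rate (Hz)
PAUSE_MS = {",": 150, ".": 300}    # assumed pause length per punctuation mark


def synthesize(text: str, unit_store: Dict[str, List[float]]) -> List[float]:
    """Concatenate pre-stored per-unit waveforms, inserting silence at
    punctuation. Raises KeyError for a unit missing from the store."""
    samples: List[float] = []
    # Tokens are either a run of non-space, non-punctuation characters
    # (treated as one speech unit) or a single comma/period.
    for token in re.findall(r"[^\s,.]+|[,.]", text):
        if token in PAUSE_MS:
            silence_len = SAMPLE_RATE * PAUSE_MS[token] // 1000
            samples.extend([0.0] * silence_len)
        else:
            samples.extend(unit_store[token])
    return samples
```

With a store such as `{"hello": [...], "world": [...]}`, the call `synthesize("hello, world.", store)` returns the two stored waveforms separated by a short silence and followed by a longer one. A real system would smooth the joins and derive pause lengths from a prosody model rather than a fixed table.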

According to the present disclosure, the technically combinable Korean graphemes are matched in advance to their pronounceable phonemes and used for learning, which can greatly reduce the complexity of alignment matching in the actual TTS process. According to the present disclosure, because each grapheme is converted into a much smaller set of pronunciation information, the information used can be made compact and the mapping can be realized far more effectively. According to the present disclosure, the mapping of each grapheme to a pronounceable phoneme may be rule-based, or may be implemented more effectively by machine learning.
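
A rule-based version of such a grapheme-to-phoneme mapping can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (code point = 0xAC00 + (initial × 21 + vowel) × 28 + final). The example below collapses the 27 possible final consonants onto the seven representative final sounds of standard Korean pronunciation, which shows how a large grapheme set reduces to fewer phonemes. It is a simplified illustration, not the mapping of the present disclosure: it treats each syllable in isolation and ignores liaison and assimilation between syllables, and the function names are hypothetical.

```python
from typing import List, Tuple

INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"          # 21 vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 incl. none

# Simplified rule table: each final grapheme -> representative final sound.
FINAL_SOUND = {}
for sound, graphemes in {
    "ㄱ": "ㄱㄲㅋㄳㄺ", "ㄴ": "ㄴㄵㄶ", "ㄷ": "ㄷㅅㅆㅈㅊㅌㅎ",
    "ㄹ": "ㄹㄼㄽㄾㅀ", "ㅁ": "ㅁㄻ", "ㅂ": "ㅂㅍㅄㄿ", "ㅇ": "ㅇ",
}.items():
    for g in graphemes:
        FINAL_SOUND[g] = sound


def decompose(syllable: str) -> Tuple[str, str, str]:
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, 21 * 28)
    vowel, final = divmod(rest, 28)
    return INITIALS[initial], VOWELS[vowel], FINALS[final]


def to_phonemes(text: str) -> List[str]:
    """Map text to a flat phoneme-like sequence; non-Hangul passes through."""
    out: List[str] = []
    for ch in text:
        if "\uac00" <= ch <= "\ud7a3":
            initial, vowel, final = decompose(ch)
            out += [initial, vowel]
            if final:
                out.append(FINAL_SOUND[final])
        else:
            out.append(ch)
    return out
```

For instance, `to_phonemes("값")` reduces the final cluster ㅄ to ㅂ, and `to_phonemes("밖")` reduces ㄲ to ㄱ. A learned model could replace the rule table where context-dependent pronunciation matters.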

The grapheme-to-phoneme conversion according to the present disclosure may follow a deep learning Seq2Seq with Attention approach. The grapheme-to-phoneme conversion module according to the present disclosure may be compatible with deep learning-based TTS. The grapheme-to-phoneme conversion module according to the present disclosure may also be used with its own data and its own pronunciation sequence definitions.

As those skilled in the art will appreciate, the present disclosure is not limited to the examples described herein and may be variously modified, reconfigured, and substituted without departing from the scope of the present disclosure. It should be understood that the various techniques described herein may be implemented in hardware, in software, or in a combination of hardware and software.

A computer program according to an embodiment of the present disclosure may be implemented in a form stored on various types of storage media readable by a computer processor or the like, including nonvolatile memory such as EPROM, EEPROM, and flash memory devices, magnetic disks such as internal hard disks and removable disks, magneto-optical disks, and CD-ROM disks. In addition, the program code(s) may be implemented in assembly language or machine language. All modifications and changes that fall within the true spirit and scope of the present disclosure are intended to be covered by the following claims.

Claims

A method of providing text-to-speech (TTS), the method comprising:
receiving text data and voice data;
analyzing the received text data and voice data, and adjusting any word mismatch between the text data and the voice data so that the two match;
performing learning with a deep learning model using the text data and the voice data in which the word mismatch has been adjusted; and
building a TTS model based on the learning.

The method of claim 1,
wherein the step of adjusting the word mismatch between the text data and the voice data comprises correcting the word notation in the text data based on the voice data.

The TTS providing method, wherein the step of adjusting the word mismatch between the text data and the voice data comprises adjusting the word utterance in the voice data based on the text data.

A method of providing text-to-speech (TTS), the method comprising:
receiving text data and voice data;
analyzing the received text data and separating the text data into individual words;
separating the voice data, based on the separated text data, so that it matches each word of the text data;
performing learning with a deep learning model based on the text data separated into words and the voice data separated to match each word of the text data; and
building a TTS model based on the learning.

The method of claim 4, further comprising:
receiving a text input for TTS;
analyzing the text input and separating it into individual words;
generating, based on the built TTS model, a voice corresponding to each of the separated words; and
concatenating the generated voices according to a predetermined rule.