KR20220096129A

KR20220096129A - Speech synthesis system automatically adjusting emotional tone

Info

Publication number: KR20220096129A
Application number: KR1020200188312A
Authority: KR
Inventors: 윤종성; 이현석
Original assignee: 미디어젠(주)
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07

Abstract

Disclosed is an emotion-based speech synthesis method. The speech synthesis method according to one embodiment of the present specification fine-tunes a BERT language model to train an emotion classification model, applies an input sentence to the emotion classification model, and performs speech synthesis by extracting a plurality of emotion items having different emotion weight values. Accordingly, the emotion information of the synthesized speech can be adjusted, and speech synthesis that reflects complex and subtle emotions can be performed through composite emotion weights.

Description

Speech synthesis system that automatically adjusts emotional tones {SPEECH SYNTHESIS SYSTEM AUTOMATICALLY ADJUSTING EMOTIONAL TONE}

The present specification relates to a speech synthesis system that automatically adjusts emotional tone.

Conventional text-to-speech (TTS) processing outputs text in a pre-stored voice. The primary purpose of TTS processing is to convey the semantic content of the text. Recently, however, a need has been raised for TTS processing to convey not only the semantic content of the text but also its interactive meaning to the other party, so that the intention or emotion of the user who sent the text is reflected in the voice output and the listener experiences an interactive conversation with the text sender.

Emotion-based speech synthesis extracts emotional elements from an input sentence, determines a specific pre-classified emotion, and reflects the determined emotional element in the speech synthesis process. Consequently, only pre-classified emotional elements are reflected in the speech synthesis process. Here, the pre-classified emotions may include major emotional states such as sadness, anger, surprise, and joy. However, the emotional states revealed through an actor's performance in a movie or drama may be impossible to express using only these pre-classified emotional states.

Accordingly, an object of the present specification is to provide a speech synthesis method and apparatus capable of adjusting the emotion information of synthesized speech in an emotion-based speech synthesis process.

Another object of the present specification is to provide a speech synthesis method and apparatus capable of reflecting more finely subdivided emotional states in speech synthesis.

The technical problems to be solved by the present specification are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the detailed description below.

A speech synthesis method in a speech synthesis apparatus according to an embodiment of the present specification includes: preparing training data to be used for speech synthesis; training an emotion classification model by fine-tuning a pre-trained language model on the training data; extracting an emotion element corresponding to an input sentence by applying the input sentence to the emotion classification model; and performing speech synthesis based on the input sentence, embedded speaker information, and the extracted emotion element, wherein the extracted emotion element includes a plurality of emotion items having different emotion weight values.

The pre-trained language model is a BERT language model, and the fine-tuning may mean performing transfer learning on the pre-trained language model using the target emotion corresponding to the input sentence as a tag.

The emotion classification model may include a plurality of distinct major-category emotion items, each major-category emotion item may include a plurality of similar sub-category emotion items, and the major-category emotion items and the sub-category emotion items may each be embedded in the speech synthesis apparatus as emotion vector values.
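The two-level classification described above can be pictured as a small tree whose nodes each carry an emotion vector. The following is only a minimal sketch: the class `EmotionItem`, the sub-category names, the two-dimensional vectors, and the helper `build_embedding_table` are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EmotionItem:
    """One emotion item with its (hypothetical) 2-D emotion vector."""
    name: str
    vector: Tuple[float, float]
    sub_items: List["EmotionItem"] = field(default_factory=list)


# Illustrative taxonomy only; names and coordinates are assumptions.
TAXONOMY = [
    EmotionItem("joy", (0.8, 0.6), [
        EmotionItem("delight", (0.9, 0.7)),
        EmotionItem("contentment", (0.7, 0.2)),
    ]),
    EmotionItem("sadness", (-0.7, -0.5), [
        EmotionItem("grief", (-0.9, -0.3)),
        EmotionItem("melancholy", (-0.6, -0.7)),
    ]),
]


def build_embedding_table(taxonomy: List[EmotionItem]) -> Dict[str, Tuple[float, float]]:
    """Flatten major and sub-category items into a name -> vector lookup."""
    table: Dict[str, Tuple[float, float]] = {}
    for major in taxonomy:
        table[major.name] = major.vector
        for sub in major.sub_items:
            table[sub.name] = sub.vector
    return table


EMBEDDINGS = build_embedding_table(TAXONOMY)
```

A flat lookup table of this kind is one simple way to treat major-category and sub-category items uniformly once each has a vector.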

The speech synthesis method may further include: matching the emotion vector values onto a coordinate plane and outputting the matching result through an output unit of the speech synthesis apparatus; when a point on the coordinate plane corresponding to an emotion vector value not embedded in the speech synthesis apparatus is selected, determining the emotion vector value of the selected point by referring to the emotion vector values already matched on the coordinate plane; and performing the speech synthesis based on the determined emotion vector value.
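The specification does not say how the vector of an unembedded point is derived from the already matched vectors. One possible reading, shown here purely as an assumption, is inverse-distance weighting: nearby matched points contribute more than distant ones. The function name `interpolate_emotion_vector`, the `power` parameter, and the sample anchors are hypothetical.

```python
import math


def interpolate_emotion_vector(point, anchors, power=2.0):
    """Estimate an emotion vector at `point` by inverse-distance weighting.

    anchors: list of ((x, y), vector) pairs already matched on the plane.
    This interpolation rule is an assumption, not the patent's stated method.
    """
    if not anchors:
        raise ValueError("at least one anchor is required")
    weighted = [0.0] * len(anchors[0][1])
    total = 0.0
    for (ax, ay), vec in anchors:
        dist = math.hypot(point[0] - ax, point[1] - ay)
        if dist == 0.0:
            # The selected point coincides with an embedded one: reuse it.
            return list(vec)
        w = 1.0 / dist ** power
        total += w
        weighted = [acc + w * v for acc, v in zip(weighted, vec)]
    return [v / total for v in weighted]


# Two hypothetical embedded points with 3-dimensional emotion vectors.
anchors = [
    ((0.0, 0.0), [1.0, 0.0, 0.0]),
    ((2.0, 0.0), [0.0, 1.0, 0.0]),
]
midpoint = interpolate_emotion_vector((1.0, 0.0), anchors)
```

A point halfway between the two anchors receives an equal mix of both vectors, while a point that lands exactly on an anchor simply reuses the embedded vector.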

The emotion classification model may be configured to output the major-category emotion item that matches an input sentence, together with the weight value of that major-category emotion item.

The emotion classification model may be configured to output a sub-category emotion item of the major-category emotion item that matches an input sentence, together with the weight value of that sub-category emotion item.
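As a rough illustration of an "emotion item plus weight" output, the sketch below assumes the classifier produces one raw score (logit) per emotion item and that the weight is the softmax probability of the best-scoring item. The specification does not state this; the function names, labels, and scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into weights that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(labels, logits):
    """Return the best-matching emotion item and its weight value."""
    weights = softmax(logits)
    best = max(range(len(labels)), key=weights.__getitem__)
    return labels[best], weights[best]


# Hypothetical classifier scores for one input sentence.
labels = ["sadness", "anger", "surprise", "joy"]
logits = [0.2, 0.1, 1.0, 2.5]
item, weight = top_emotion(labels, logits)
```

The same helper could be applied a second time over the sub-category items of the selected major category to obtain a sub-category item and its weight.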

The emotion classification model may extract the emotion element corresponding to the input sentence as a combination of at least two items from among the major-category emotion item that matches the input sentence and the plurality of sub-category emotion items included in that major-category emotion item, and the two or more combined emotion items may each have a different weight value.

The weight value of each of the two or more combined emotion items is adaptively adjustable.
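One way to picture combined emotion items with adjustable weights is a weighted sum of their emotion vectors, with the weights renormalized whenever one of them is changed. This is a sketch under that assumption; the helpers `normalize`, `adjust_weight`, and `blend`, and the sample vectors, are hypothetical.

```python
def normalize(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def adjust_weight(weights, name, new_value):
    """Set one emotion item's weight, then renormalize all weights."""
    updated = dict(weights)
    updated[name] = new_value
    return normalize(updated)


def blend(weights, embeddings):
    """Weighted sum of the emotion vectors of the combined items."""
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, w in weights.items():
        for i, v in enumerate(embeddings[name]):
            mixed[i] += w * v
    return mixed


# Hypothetical 2-D emotion vectors and initial weights.
embeddings = {"sadness": [1.0, 0.0], "anger": [0.0, 1.0]}
weights = {"sadness": 0.75, "anger": 0.25}
before = blend(weights, embeddings)
after = blend(adjust_weight(weights, "anger", 0.75), embeddings)
```

Raising the weight of "anger" to match "sadness" moves the blended vector from mostly sad toward an even mix of the two, which is the kind of gradual adjustment the weights are meant to allow.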

The embedded speaker information may include information on a plurality of different speakers, the voice tone of each of the plurality of speakers, and vector values corresponding to the voice tones associated with the emotions classified by the emotion classification model, all of which may be embedded in the speech synthesis apparatus.

The speech synthesis method may further include: matching the embedded speaker information onto a coordinate plane through the vector values and outputting the matching result through a display of the speech synthesis apparatus; when a point on the coordinate plane corresponding to speaker information not embedded in the speech synthesis apparatus is selected, determining the vector value of the selected point by referring to the vector values already matched on the coordinate plane; and performing the speech synthesis based on the voice tone corresponding to the determined vector value.

A speech synthesis apparatus according to another embodiment of the present specification includes: an input unit; a storage unit that stores a language model pre-trained on training data to be used for speech synthesis; a speech synthesis unit that performs speech synthesis on an input sentence entered through the input unit; and a processor that trains an emotion classification model by fine-tuning the pre-trained language model on the training data. The processor may apply the input sentence to the fine-tuned emotion classification model to extract an emotion element corresponding to the input sentence, the extracted emotion element including a plurality of emotion items having different emotion weight values, and may control the speech synthesis unit to perform speech synthesis based on the input sentence, the speaker information embedded in the speech synthesis apparatus, and the extracted emotion element.

The processor further trains the emotion classification model using the target emotion corresponding to the input sentence as a tag.

The emotion classification model includes a plurality of distinct major-category emotion items, each major-category emotion item includes a plurality of similar sub-category emotion items, and the major-category emotion items and the sub-category emotion items are each embedded in the speech synthesis apparatus as emotion vector values.

The emotion vector values are matched onto a coordinate plane, and the processor may output the matching result through an output unit of the speech synthesis apparatus; when a point on the coordinate plane corresponding to an emotion vector value not embedded in the speech synthesis apparatus is selected, determine the emotion vector value of the selected point by referring to the emotion vector values already matched on the coordinate plane; and control the speech synthesis to be performed based on the determined emotion vector value.

The weight value of each of the two or more combined emotion items is adaptively adjustable.

Preferably, the embedded speaker information includes information on a plurality of different speakers, the voice tone of each of the plurality of speakers, and vector values corresponding to the voice tones associated with the emotions classified by the emotion classification model, all of which are embedded in the speech synthesis apparatus.

The processor may match the embedded speaker information onto a coordinate plane through the vector values and output the result through the output unit of the speech synthesis apparatus; when a point on the coordinate plane corresponding to speaker information not embedded in the speech synthesis apparatus is selected, determine the vector value of the selected point by referring to the vector values already matched on the coordinate plane; and control the speech synthesis to be performed based on the voice tone corresponding to the determined vector value.

An object of the speech synthesis method and apparatus according to an embodiment of the present specification is to provide a speech synthesis method and apparatus capable of adjusting the emotion information of synthesized speech in an emotion-based speech synthesis process.

In addition, according to an embodiment of the present specification, more finely subdivided emotional states can be reflected in speech synthesis.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

The accompanying drawings, which are included as part of the detailed description to aid understanding of the present specification, provide embodiments of the present specification and, together with the detailed description, explain its technical features.
FIG. 1 is a block diagram of an emotion-information-based speech synthesis apparatus according to an embodiment of the present specification.
FIG. 2 is a diagram for explaining a schematic flow in which speech synthesis reflecting the emotion of an input sentence is performed according to an embodiment of the present specification.
FIG. 3 is a flowchart of a speech synthesis method according to an embodiment of the present specification.
FIG. 4A is a diagram for explaining an example of training an emotion classification model using a BERT language model according to an embodiment of the present specification.
FIG. 4B is a diagram for explaining an example of an output result of the emotion classification model according to an embodiment of the present specification.
FIG. 5 is a diagram for explaining a classification system for the emotion information that can be classified by the emotion classification model according to an embodiment of the present specification.
FIG. 6 is a diagram for explaining an example of a result of applying the emotion classification model to an input sentence according to an embodiment of the present specification.
FIG. 7 is a flowchart of a speech synthesis method for explaining another example of applying composite emotion weights according to an embodiment of the present specification.
FIG. 8 is a diagram for explaining an example in which embedded emotion vector values are displayed on a coordinate plane.
FIG. 9 illustrates in more detail, through functional blocks, a process of performing emotion-information-based speech synthesis according to an embodiment of the present specification.
FIG. 10 is a flowchart of a speech synthesis method that applies multi-speaker information according to an embodiment of the present specification.

Hereinafter, the embodiments disclosed in the present specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the drawing in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not in themselves have distinct meanings or roles. In describing the embodiments disclosed in the present specification, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the embodiments. The accompanying drawings are provided only to make the embodiments disclosed in the present specification easier to understand; the technical idea disclosed herein is not limited by the accompanying drawings and should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

A singular expression includes the plural expression unless the context clearly dictates otherwise.

In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present specification described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device, and also include implementations in the form of a carrier wave (for example, transmission over the Internet). Accordingly, the above detailed description should not be construed as restrictive in any respect but should be considered illustrative. The scope of the present specification should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present specification.

FIG. 1 is a block diagram of an emotion-information-based speech synthesis apparatus according to an embodiment of the present specification.

The speech synthesis apparatus (TTS device 100) shown in FIG. 1 may include an audio output device 110 for outputting speech processed by the TTS device 100 or by another device.

FIG. 1 discloses a speech synthesis apparatus (TTS device 100) for performing speech synthesis. An embodiment of the present invention may include computer-readable and computer-executable instructions that may be included in the TTS device 100. Although FIG. 1 discloses a plurality of components included in the TTS device 100, components that are not disclosed may of course also be included in the TTS device 100.

Meanwhile, some components disclosed in the TTS device 100 as a single component may appear multiple times in one device. For example, the TTS device 100 may include a plurality of input devices 120, a plurality of output devices 130, or a plurality of controllers/processors 140.

A plurality of TTS devices may also be applied to one speech synthesis apparatus. In such a multi-device system, the TTS devices may include different components for performing various aspects of speech synthesis processing. The TTS device 100 shown in FIG. 1 is exemplary; it may be an independent device, or it may be implemented as one component of a larger device or system.

An embodiment of the present invention may be applied to a number of different devices and computer systems, for example, a general-purpose computing system, a server-client computing system, a telephone computing system, a laptop computer, a portable terminal, a PDA, or a tablet computer. The TTS device 100 may also be applied as one component of another device or system that provides a voice recognition function, such as automated teller machines (ATMs), kiosks, global positioning systems (GPS), home appliances (for example, refrigerators, ovens, and washing machines), vehicles, and e-book readers.

Referring to FIG. 1, the TTS device 100 may include an audio output device 110 for outputting speech processed by the TTS device 100 or by another device. The audio output device 110 may include a speaker, headphones, or another suitable component that propagates sound. The audio output device 110 may be integrated into the TTS device 100 or implemented separately from the TTS device 100.

The TTS device 100 may include an address/data bus 190 for transferring data between the components of the TTS device 100. Each component in the TTS device 100 may be directly connected to other components through the bus 190. Each component in the TTS device 100 may also be directly connected to the TTS module 170.

The TTS device 100 may include a processor 140. The processor 140 may correspond to a CPU for processing data, computer-readable instructions for processing data, and a memory for storing data and instructions. The memory 150 may include volatile RAM, non-volatile ROM, or other types of memory.

The TTS device 100 may include a storage 160 for storing data and instructions. The storage 160 may include magnetic storage, optical storage, solid-state storage, and the like.

The TTS device 100 may be connected to removable or external memory (for example, a removable memory card, a memory key drive, or network storage) through the input device 120 or the output device 130.

Computer instructions to be processed by the processor 140 for operating the TTS device 100 and its various components may be executed by the processor 140 and may be stored in the memory 150, the storage 160, an external device, or a memory or storage included in the TTS module 170 described later. Alternatively, all or some of the executable instructions may be embedded in hardware or firmware in addition to software. An embodiment of the present invention may be implemented in various combinations of, for example, software, firmware, and/or hardware.

The TTS device 100 includes an input device 120 and an output device 130. For example, the input device 120 may include a microphone, a touch input device, a keyboard, a mouse, a stylus, or another input device. The output device 130 may include a display (a visual display or a tactile display), an audio speaker, headphones, a printer, or another output device. The input device 120 and/or the output device 130 may also include an interface for connecting external peripherals, such as Universal Serial Bus (USB), FireWire, Thunderbolt, or another connection protocol. The input device 120 and/or the output device 130 may also include a network connection such as an Ethernet port or a modem. They may further include a wireless communication device such as radio frequency (RF), infrared, Bluetooth, or wireless local area network (WLAN, such as WiFi), or a wireless network radio for a network such as a 5G network, a Long Term Evolution (LTE) network, a WiMAX network, or a 3G network. The TTS device 100 may connect to the Internet or a distributed computing environment through the input device 120 and/or the output device 130.

The TTS device 100 may include a TTS module 170 for processing text data into an audio waveform containing speech.

The TTS module 170 may be connected to the bus 190, the input device 120, the output device 130, the audio output device 110, the processor 140, and/or other components of the TTS device 100.

The text data (textual data) may originate from an internal component of the TTS device 100. Alternatively, the text data may be received from an input device such as a keyboard or transmitted to the TTS device 100 over a network connection. The text may take the form of sentences including text, numbers, and/or punctuation to be converted into speech by the TTS module 170. The input text may also include special annotations for processing by the TTS module 170, and these special annotations can indicate how specific text should be pronounced. The text data may be processed in real time, or stored and processed later.

The TTS module 170 may include a preprocessor (front end) 171, a speech synthesis engine 172, and a TTS storage 180. The preprocessor 171 may convert input text data into a symbolic linguistic representation for processing by the speech synthesis engine 172. The speech synthesis engine 172 may convert the input text into speech by comparing annotated phonetic unit models with information stored in the TTS storage 180. The preprocessor 171 and the speech synthesis engine 172 may include their own embedded processor or memory, or may use the processor 140 and the memory 150 of the TTS device 100. Instructions for operating the preprocessor 171 and the speech synthesis engine 172 may be held in the TTS module 170, in the memory 150 and the storage 160 of the TTS device 100, or in an external device.

Text input to the TTS module 170 may be sent to the preprocessor 171 for processing. The preprocessor 171 may include modules for performing text normalization, linguistic analysis, and linguistic prosody generation.

During text normalization, the preprocessor 171 processes the text input and generates standard text, converting numbers, abbreviations, and symbols into their written-out equivalents.
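The normalization step described above can be illustrated with a minimal Python sketch. The lookup tables, the function names `normalize_text` and `spell_number`, and the digit-by-digit reading of numbers are all assumptions made for illustration; the specification does not state how the preprocessor 171 implements normalization.

```python
import re

# Hypothetical lookup tables; a real front end would use far larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
SYMBOLS = {"%": " percent", "&": " and "}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_number(match):
    """Read a digit string out digit by digit (a deliberate simplification)."""
    return " ".join(DIGITS[int(ch)] for ch in match.group())


def normalize_text(text):
    """Expand abbreviations, symbols, and digits into written-out words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    for sym, word in SYMBOLS.items():
        text = text.replace(sym, word)
    text = re.sub(r"\d+", spell_number, text)
    # Collapse any repeated whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Dr. Kim scored 95%"))
```

A production normalizer would read "95" as "ninety-five" and would disambiguate abbreviations by context; the sketch only shows where such rules would sit.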

During linguistic analysis, the preprocessor 171 may analyze the language of the normalized text to generate a series of phonetic units corresponding to the input text. This process may be referred to as phonetic transcription. The phonetic units are symbolic representations of the sound units that are eventually combined and output by the TTS device 100 as speech. Various sound units may be used to segment text for speech synthesis. The TTS module 170 may process speech based on phonemes (individual sounds), half-phonemes, di-phones (the last half of one phoneme combined with the first half of the adjacent phoneme), bi-phones (two consecutive phonemes), syllables, words, phrases, sentences, or other units. Each word may be mapped to one or more phonetic units. This mapping may be performed using a language dictionary stored in the TTS device 100.

The linguistic analysis performed by the preprocessor 171 may also include identifying different grammatical elements such as prefixes, suffixes, phrases, punctuation, and syntactic boundaries. The TTS module 170 may use these grammatical elements to produce a natural-sounding audio waveform. The language dictionary may also include letter-to-sound rules and other tools that can be used to pronounce previously unseen words or letter combinations that the TTS module 170 may encounter. In general, the more information the language dictionary contains, the higher the quality of the speech output.
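As a rough illustration of dictionary-based phonetic transcription with a letter-to-sound fallback, the following Python sketch maps words to phonetic units. The lexicon entries, the ARPAbet-style symbols, and the one-letter-per-unit fallback are hypothetical stand-ins and are not taken from the specification.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols used for illustration).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def letter_to_sound(word):
    """Crude fallback for unseen words: treat every letter as its own unit."""
    return [ch.upper() for ch in word if ch.isalpha()]


def transcribe(text):
    """Map each word to phonetic units, using the lexicon when possible."""
    units = []
    for word in text.lower().split():
        units.extend(LEXICON.get(word) or letter_to_sound(word))
    return units


print(transcribe("hello tts"))
```

The fallback here is intentionally naive; real letter-to-sound rules are language-specific and context-dependent.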

Based on the linguistic analysis, the preprocessor 171 may perform linguistic prosody generation, in which the phonetic units are annotated with prosodic characteristics indicating how the final sound units should be pronounced in the output speech.

The prosodic characteristics may also be referred to as acoustic features. During this step, the preprocessor 171 may take into account any prosodic annotations that accompany the text input and pass them on within the TTS module 170. Such acoustic features may include pitch, energy, duration, and the like. The acoustic features may be applied based on prosodic models available to the TTS module 170. A prosodic model indicates how phonetic units should be pronounced in a particular situation. For example, a prosodic model may consider a phoneme's position in a syllable, a syllable's position in a word, a word's position in a sentence or phrase, neighboring phonetic units, and so on. As with the language dictionary, the more information a prosodic model contains, the higher the quality of the speech output.

The output of the preprocessor 171 may include a series of phonetic units annotated with prosodic characteristics. This output may be referred to as a symbolic linguistic representation. The symbolic linguistic representation may be sent to the speech synthesis engine 172. The speech synthesis engine 172 converts the symbolic linguistic representation into an audio waveform of speech for output to the user through the audio output device 110. The speech synthesis engine 172 may be configured to convert the input text into high-quality, natural-sounding speech in an efficient manner. Such high-quality speech may be configured to sound as much like a human speaker as possible.
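One way to picture a symbolic linguistic representation is as a list of phonetic units, each carrying pitch, energy, and duration. The Python sketch below assumes a hypothetical `AnnotatedUnit` record and a toy prosodic rule (lengthening the final unit); the default values are placeholders, not values from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedUnit:
    """One phonetic unit with its prosodic (acoustic) annotations."""
    symbol: str
    pitch_hz: float
    energy: float
    duration_ms: float


def annotate(units: List[str], final_lengthening: float = 1.5) -> List[AnnotatedUnit]:
    """Attach placeholder prosody; lengthen the last unit as a toy positional rule."""
    annotated = []
    for i, symbol in enumerate(units):
        is_last = i == len(units) - 1
        duration = 80.0 * (final_lengthening if is_last else 1.0)
        annotated.append(AnnotatedUnit(symbol, 120.0, 1.0, duration))
    return annotated


representation = annotate(["HH", "AH", "L", "OW"])
print(representation[-1])
```

A prosodic model as described in the text would replace the constant defaults with values that depend on position and neighboring units.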

The speech synthesis engine 172 may perform speech synthesis using one or more different methods.

The unit selection engine 173 compares a recorded speech database with the symbolic linguistic representation generated by the preprocessor 171. The unit selection engine 173 matches the symbolic linguistic representation against the speech audio units in the speech database. Matching units are selected and concatenated to form the speech output. Each unit may include an audio waveform corresponding to a phonetic unit, such as a short .wav file of a specific sound, together with a description of the acoustic properties associated with that .wav file (pitch, energy, etc.), as well as other information such as where the phonetic unit appears in a word, sentence, or phrase, and its neighboring phonetic units.

The unit selection engine 173 may match the input text using all of the information in the unit database in order to generate a natural-sounding waveform. The unit database may contain multiple examples of each phonetic unit, giving the TTS device 100 different options for concatenating units into speech. One advantage of unit selection is that, depending on the size of the database, natural-sounding speech output can be generated. The larger the unit database, the more natural the speech the TTS device 100 can construct.

Besides the unit selection synthesis described above, speech may also be synthesized by parametric synthesis. In parametric synthesis, synthesis parameters such as frequency, volume, and noise may be varied by the parametric synthesis engine 175, a digital signal processor, or another audio generation device to produce an artificial speech waveform.

Parametric synthesis may use an acoustic model and various statistical techniques to match a symbolic linguistic representation to the desired output speech parameters. Parametric synthesis can process speech without the large database associated with unit selection, and it enables accurate processing at high speed. The unit selection synthesis method and the parametric synthesis method may be performed individually or in combination to generate the speech audio output.

Parametric speech synthesis may be performed as follows. The TTS module 170 may include an acoustic model capable of converting a symbolic linguistic representation into a synthetic acoustic waveform of the text input based on audio signal manipulation. The acoustic model may include rules that the parametric synthesis engine 175 can use to assign specific audio waveform parameters to input phonetic units and/or prosodic annotations. The rules may be used to calculate a score indicating the likelihood that a particular audio output parameter (frequency, volume, etc.) corresponds to a portion of the input symbolic linguistic representation from the preprocessor 171.

The parametric synthesis engine 175 may apply a number of techniques to match the speech to be synthesized with the input phonetic units and/or prosodic annotations. One common technique uses a Hidden Markov Model (HMM). An HMM may be used to determine the probability that an audio output matches the text input. An HMM may also be used to convert parameters from the linguistic and acoustic space into parameters to be used by a vocoder (digital voice encoder) to artificially synthesize the desired speech.

The TTS device 100 may include a speech unit database for use in unit selection.

The speech unit database may be stored in the TTS storage 180, the storage 160, or another storage component. The speech unit database may contain recorded speech utterances. Each speech utterance may be associated with text corresponding to its spoken content. The speech unit database may also contain recorded speech (in the form of audio waveforms, feature vectors, or other formats) that occupies a significant amount of storage in the TTS device 100. Unit samples in the speech unit database may be classified in various ways, including by phonetic unit (phoneme, diphone, word, etc.), linguistic prosodic label, acoustic feature sequence, speaker identity, and so on. Sample utterances may be used to create a mathematical model corresponding to the desired audio output for a particular speech unit.

When matching a symbolic linguistic representation, the speech synthesis engine 172 may select the unit in the speech unit database that most closely matches the input text (including both the phonetic units and the prosodic annotations). In general, the larger the speech unit database, the more unit samples are available for selection, which enables more accurate speech output.
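The closest-match selection described above can be sketched in Python as a minimum-cost search over candidate units. The `Unit` record, the `target_cost` formula, and the file names are hypothetical; the specification does not define a cost function, and practical unit selection systems typically also weigh how well adjacent units join, which this sketch omits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    """A database entry: a phonetic unit, its acoustic description, and its audio."""
    symbol: str
    pitch_hz: float
    duration_ms: float
    wav_path: str  # hypothetical file name, for illustration only


def target_cost(candidate: Unit, pitch_hz: float, duration_ms: float) -> float:
    """Assumed cost: scaled absolute mismatch in pitch and duration."""
    return (abs(candidate.pitch_hz - pitch_hz) / 100.0
            + abs(candidate.duration_ms - duration_ms) / 100.0)


def select_unit(db: List[Unit], symbol: str, pitch_hz: float, duration_ms: float) -> Unit:
    """Return the database unit with the same symbol and the lowest target cost."""
    candidates = [u for u in db if u.symbol == symbol]
    if not candidates:
        raise LookupError(f"no unit for symbol {symbol!r}")
    return min(candidates, key=lambda u: target_cost(u, pitch_hz, duration_ms))


db = [
    Unit("AH", 110.0, 90.0, "ah_1.wav"),
    Unit("AH", 180.0, 60.0, "ah_2.wav"),
    Unit("L", 120.0, 70.0, "l_1.wav"),
]
print(select_unit(db, "AH", pitch_hz=120.0, duration_ms=80.0).wav_path)
```

With more candidates per symbol, the minimum cost can only stay the same or decrease, which matches the observation that a larger database permits a closer match.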

Audio waveforms containing the speech output from the TTS module 170 may be sent to the audio output device 110 for output to the user. Audio waveforms containing speech may be stored in a number of different formats, such as a series of feature vectors, uncompressed audio data, or compressed audio data. For example, the speech output may be encoded and/or compressed by an encoder/decoder before being sent. The encoder/decoder may encode and decode audio data such as digitized audio data and feature vectors. The encoder/decoder functionality may reside in a separate component, or may be performed by the processor 140 or the TTS module 170.

The TTS storage 180 may also store other information for speech recognition.

The contents of the TTS storage 180 may be prepared for general TTS use, or may be customized to include sounds and words that are likely to be used in a specific application. For example, for TTS processing by a GPS device, the TTS storage 180 may contain customized speech specific to location and navigation.

As a further example, the TTS storage 180 may be customized for a user based on that user's desired, personalized speech output. For example, the user may prefer the output voice to have a specific gender, a specific accent, a specific speed, or a specific emotion (for example, a happy voice). The speech synthesis engine 172 may include a specialized database or model to account for such user preferences.

The TTS device 100 may also be configured to perform TTS processing in multiple languages. For each language, the TTS module 170 may include data, instructions, and/or components specifically configured to synthesize speech in that language.

To improve performance, the TTS module 170 may modify or update the contents of the TTS storage 180 based on feedback on the TTS processing results, so that the TTS module 170 can improve speech recognition beyond the capability provided by the training corpus.

As the processing capability of the TTS device 100 improves, speech can be output in a way that reflects the emotional attributes of the input text. Alternatively, even if the input text does not include an emotional attribute, the TTS device 100 can output speech that reflects the intention (emotion information) of the user who wrote the input text.

When a model to be integrated into the TTS module that actually performs TTS processing is built, the TTS system may integrate the components mentioned above with other components. For example, the TTS device 100 may insert emotional elements into the speech.

To output speech to which emotion information has been added, the TTS device 100 may include an emotion insertion module 177. The emotion insertion module 177 may be integrated into the TTS module 170, or may be integrated as part of the preprocessor 171 or the speech synthesis engine 172. The emotion insertion module 177 may use metadata corresponding to emotional attributes to implement speech synthesis based on emotion information.

FIG. 2 is a diagram illustrating a schematic flow in which emotion-reflecting speech synthesis is performed on an input sentence according to an embodiment of the present specification.

Referring to FIG. 2, a speech synthesis system according to an embodiment of the present specification may include an emotion recognition system and an emotional speech synthesis system. The emotion recognition system receives an input sentence and extracts the emotion corresponding to it. The emotion recognition system may extract a plurality of emotion elements from the input sentence. For example, the extracted emotion elements may be the three emotions sadness, hurt, and anxiety; although these are similar emotional states, a mixed emotional state may be extracted in which they are present in proportions of 61%, 32%, and 7%, respectively. The emotional speech synthesis system receives the extracted emotion information together with the input sentence as input data, and generates and outputs an audio waveform.

FIG. 3 is a flowchart of a speech synthesis method according to an embodiment of the present specification. The speech synthesis method of FIG. 3 may be implemented in a speech synthesis apparatus. More specifically, the speech synthesis method may be implemented by a processor and/or a controller of the speech synthesis apparatus.

Referring to FIG. 3, the processor may prepare training data (S300).

Here, the training data may include conversations between the system and a person. Alternatively, the training data may be prepared as emotion set data covering a total of 60 emotion items: 6 major-category emotion items, each with 10 sub-category emotion items belonging to it.

The processor may train an emotion classification model by fine-tuning a pre-trained model (S310).

The pre-trained language model may be a BERT language model. The BERT language model can raise performance on a specific task through pre-trained embeddings obtained before that task is performed. Without the BERT language model, classification was performed using machine learning models such as LSTM or CNN. With the BERT language model, the encoder embeds a large corpus related to speech synthesis to perform language modeling, and the result is transferred and fine-tuned before the NLP task is performed. That is, the BERT language model is applied to a large speech synthesis corpus, and an additional model (a machine learning model such as an RNN or CNN) is applied to the output of the BERT language model to perform the desired task (emotion classification). Even if the additional model is only a simple DNN rather than a complex CNN, LSTM, or attention model, there is little difference in performance compared with using a complex model such as a CNN.
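To make the "simple DNN on top of the BERT output" idea concrete, the Python sketch below applies a single dense layer followed by softmax to a sentence embedding. It does not load BERT: the embedding is a placeholder vector, the 768 dimension is an assumption (the specification does not say which BERT variant is used), the 60 classes follow the emotion set described in this specification, and the weights are random rather than fine-tuned. `DenseHead` is a hypothetical name.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DenseHead:
    """A single linear layer plus softmax, standing in for the 'simple DNN'."""

    def __init__(self, in_dim, n_classes, seed=0):
        rng = random.Random(seed)
        # Untrained random weights; fine-tuning would learn these values.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(n_classes)]
        self.bias = [0.0] * n_classes

    def __call__(self, embedding):
        logits = [sum(w * x for w, x in zip(row, embedding)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)


head = DenseHead(in_dim=768, n_classes=60)
placeholder_embedding = [0.01] * 768  # stands in for a pooled BERT sentence vector
probs = head(placeholder_embedding)
print(len(probs), round(sum(probs), 6))
```

In an actual fine-tuning setup, the head and the encoder would be trained jointly on sentences tagged with target emotions, so that the output probabilities can serve as emotion weights.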

FIG. 4A is a diagram illustrating an example of training an emotion classification model using a BERT language model according to an embodiment of the present specification. FIG. 4B is a diagram illustrating an example of an output of the emotion classification model according to an embodiment of the present specification. FIG. 5 is a diagram illustrating the classification scheme of the emotion information that can be classified by the emotion classification model according to an embodiment of the present specification. FIG. 6 is a diagram illustrating an example of the result of applying the emotion classification model to an input sentence according to an embodiment of the present specification.

Referring to FIG. 4A, the fine-tuning may mean performing transfer learning on the pre-trained language model using the target emotion corresponding to each input sentence as a tag. That is, the pre-trained language model is adapted by transfer learning so that it can perform an actual NLP task. In an embodiment of the present specification, the pre-trained language model is trained to perform the NLP task of classifying the emotion of an input sentence.

Referring to FIG. 4B, the processor applies the input sentence to the emotion classification model and reflects the output of the emotion classification model in the speech synthesis. According to an embodiment, the emotion classification model may include a plurality of distinct major-category emotion items, and each major-category emotion item may include a plurality of similar sub-category emotion items. The emotion classification model may be configured to output the major-category emotion item matching the input sentence together with a weight value for that major-category emotion item. The emotion classification model may also be configured to output a sub-category emotion item of the matching major-category emotion item together with a weight value for that sub-category emotion item. Furthermore, the emotion classification model may extract the emotion element corresponding to the input sentence as a combination of at least two items drawn from the matching major-category emotion item and the sub-category emotion items it contains, where each of the combined emotion items has a different weight value.

That is, according to an embodiment of the present specification, applying the input sentence to the emotion classification model may output a single major category 1 (for example, sadness).

As another example, a first weight may be assigned to major category 1 (for example, sadness) and a second weight may be assigned to sub-category 2 (for example, depressed) of major category 1. The weights may be assigned to the emotion items such that the first weight and the second weight sum to 1. In this case, according to an embodiment of the present specification, the emotion weights for speech synthesis may be controlled so that the input sentence reflects not simply a sad emotional state but a depressed and sad emotional state. If, instead of sub-category 2 (for example, depressed), a third weight is assigned to sub-category 3 (for example, pessimistic) of major category 1 and the emotion classification result is output together with major category 1 carrying the first weight, the "sad + pessimistic" emotional state will yield a different speech synthesis result from the "sad + depressed" emotional state. Compared with the case where the emotion classification model outputs only major category 1 (for example, sadness), the speech synthesis result can reflect the emotional state inherent in the input sentence more precisely.

As another example, applying the input sentence to the emotion classification model may output only a plurality of sub-categories of a major category, rather than the major category that represents the broad, representative emotional state. For example, a first weight may be assigned to sub-category 1 (for example, stressed) of major category 1 (for example, anxiety), a second weight to sub-category 2 (for example, confused), and a third weight to sub-category 3 (for example, nervous), and the final result is output. In this case, the processor may control the synthesis so that the three emotional states "stressed + confused + nervous" are reflected in the final speech synthesis with the first, second, and third weights applied respectively. This also has the advantage that the emotional state can be reflected in the speech synthesis more precisely than when the "anxiety" emotional state alone is reflected.
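The weighted combination of emotion items can be sketched in Python as selecting the strongest items and rescaling their weights so that they sum to 1. The function name `top_emotions`, the cut-off `k`, and the example scores are assumptions for illustration; the specification states that the weights of the combined items may sum to 1 but does not say how they are derived.

```python
def top_emotions(scores, k=3):
    """Keep the k highest-scoring emotion items and rescale their weights to sum to 1."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(weight for _, weight in top)
    return {name: weight / total for name, weight in top}


# Hypothetical classifier scores for one input sentence.
scores = {"stressed": 0.5, "confused": 0.3, "nervous": 0.1, "joy": 0.06, "hurt": 0.04}
mix = top_emotions(scores, k=3)
print(mix)
```

The resulting dictionary could then be passed to the synthesis stage alongside the input sentence, so that each retained emotion contributes in proportion to its weight.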

FIG. 5 is a diagram illustrating the classification scheme of the emotion information that can be classified by the emotion classification model according to an embodiment of the present specification.

Referring to FIG. 5, according to an embodiment of the present specification, emotional states are divided into six major categories (anger, sadness, anxiety, hurt, embarrassment, joy), and each major category contains 10 emotional states, giving 60 emotion items in total. Each emotion item may be assigned a code and managed in table form. The table of emotion classification data shown in FIG. 5 is exemplary, and the number of major categories and sub-categories may be increased. When the emotion classification data table is updated, the emotion classification model can be updated by performing the fine-tuning process again on the pre-trained language model based on the updated emotion classification data.
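A code table of this shape can be sketched in Python as follows. The six major-category names come from the text above; the code format (`E` followed by the major index and a two-digit sub index) and the function name `build_codes` are hypothetical, and the sub-category names are left as indices because the table of FIG. 5 is not reproduced here.

```python
MAJOR_CATEGORIES = ["anger", "sadness", "anxiety", "hurt", "embarrassment", "joy"]
SUBS_PER_MAJOR = 10


def build_codes():
    """Map an assumed code string to (major category, sub-category index)."""
    table = {}
    for major_idx, major in enumerate(MAJOR_CATEGORIES, start=1):
        for sub_idx in range(1, SUBS_PER_MAJOR + 1):
            table[f"E{major_idx}{sub_idx:02d}"] = (major, sub_idx)
    return table


codes = build_codes()
print(len(codes), codes["E604"])
```

Extending the taxonomy then amounts to appending to `MAJOR_CATEGORIES` or raising `SUBS_PER_MAJOR`, after which the classifier would need to be fine-tuned again on the updated label set.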

FIG. 6 is a diagram illustrating an example of the result of applying the emotion classification model to an input sentence according to an embodiment of the present specification.

Referring to FIG. 6, according to an embodiment of the present specification, when the input sentence is "행복하다" ("I am happy"), a first weight (0.221) may be assigned to the sixth major category (joy) and a second weight (0.716) may be assigned to the fourth sub-category (satisfied) of the sixth major category, and the result may be output.

The processor may adaptively adjust the first weight and the second weight. For example, the example shown in FIG. 6 is the result of the emotion classification model applying the first weight and the second weight by default, and the first weight and the second weight may be changed through user input. Also, for example, with the first weight and the second weight already output, the processor may automatically change their values by analyzing a subsequently input sentence. The criteria for adaptively adjusting the first weight and the second weight are of course not limited to these examples, and various criteria may be applied. Accordingly, according to an embodiment of the present specification, complex and subtle human emotional states can be distinguished more finely and used for speech synthesis. In particular, when the emotion classification model is applied to sentences in the script of a movie or drama, the weights may be controlled automatically in consideration of the emotional states of the preceding and following sentences, as described above.
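One possible way of adjusting weights in view of neighboring sentences may be sketched as follows. The mixing rule, the parameter `alpha`, and the function name `smooth_weights` are assumptions introduced for illustration; the specification leaves the adjustment criteria open.

```python
def smooth_weights(per_sentence, alpha=0.5):
    """Mix each sentence's emotion weights with the mean of its neighbours.

    per_sentence: list of dicts mapping emotion name -> weight, in script order.
    alpha: hypothetical share given to the neighbouring sentences (0 disables).
    """
    smoothed = []
    for i, current in enumerate(per_sentence):
        # Previous and next sentences, where they exist.
        neighbours = [per_sentence[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(per_sentence)]
        keys = set(current)
        for neighbour in neighbours:
            keys |= set(neighbour)
        mixed = {}
        for key in keys:
            own = current.get(key, 0.0)
            if neighbours:
                context = sum(n.get(key, 0.0) for n in neighbours) / len(neighbours)
                mixed[key] = (1 - alpha) * own + alpha * context
            else:
                mixed[key] = own
        smoothed.append(mixed)
    return smoothed


# Made-up classifier outputs for three consecutive script lines.
script = [{"joy": 1.0}, {"sadness": 1.0}, {"joy": 1.0}]
adjusted = smooth_weights(script, alpha=0.5)
```

In this sketch a sentence classified as purely sad between two joyful lines ends up with equal joy and sadness weights; a user-supplied override could replace the result for any sentence.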

According to an embodiment of the present specification, emotion information may be embedded in the speech synthesis apparatus in the form of fixed emotion vector values. The more types of emotion information there are, the more storage the emotion vector values require, so only a limited set of emotion types can be embedded once system resources and the like are taken into account. Hereinafter, FIG. 7 describes another example in which speech containing a variety of emotions is synthesized through composite emotion weights while taking those system resources into account.

FIG. 7 is a flowchart of a speech synthesis method illustrating another example of applying composite emotion weights according to an embodiment of the present specification, and FIG. 8 is a diagram illustrating an example in which embedded emotion vector values are displayed on a coordinate plane.

Referring to FIG. 7, the processor may map the embedded emotion vector values onto a coordinate plane and display them through the output unit of the speech synthesis apparatus (S700).

Referring to FIG. 8, the embedded emotion vector values may be grouped by type of emotion item and arranged on the plane. For example, assuming that the emotions classifiable by the emotion classification model are divided into ten major categories, each major category may include a plurality of sub-category emotion items with similar emotions, and those sub-category emotion items may be arranged in the coordinate space so as to form one cluster each (801, 802, 803, ..., 809, 810). In FIG. 8, a blank region in which no emotion vector cluster exists may mean that no matching emotion information exists.

The processor may select a specific point on the coordinate plane (S710).

When the specific point lies where an emotion vector cluster exists, an embedded emotion vector is available, so the processor may reflect the corresponding emotion vector in the speech synthesis (S740). However, when the specific point corresponds to an emotion vector value that is not embedded in the speech synthesis apparatus (e.g., region 830 in FIG. 8), the processor may determine the emotion vector value of the selected point by referring to the emotion vector values already mapped onto the coordinate plane (the emotion vector values corresponding to regions 801, 802, 803, ..., 809, 810) (S730).

The processor may perform speech synthesis by generating the speech signal to be synthesized based on the determined emotion vector value.
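The branch between steps S740 and S730 may be illustrated, without limitation, by the following sketch. The specification states only that the vector for an unembedded point is determined "by referring to" the already mapped vectors; inverse-distance weighting, the `hit_radius` threshold, and the function name `resolve_vector` are assumptions chosen for illustration.

```python
import math


def resolve_vector(point, embedded, hit_radius=0.05, power=2.0):
    """Return an emotion vector for a selected point on the coordinate plane.

    embedded: list of (position_2d, vector) pairs for the embedded emotions.
    hit_radius, power: hypothetical tuning parameters.
    """
    distances = [math.dist(point, position) for position, _ in embedded]
    nearest = min(range(len(embedded)), key=distances.__getitem__)
    if distances[nearest] <= hit_radius:
        # Selected point coincides with an embedded vector (cf. S740).
        return list(embedded[nearest][1])
    # Otherwise derive a vector from the mapped ones (cf. S730),
    # here by inverse-distance weighting, which is an assumption.
    weights = [1.0 / d ** power for d in distances]
    total = sum(weights)
    dim = len(embedded[0][1])
    return [
        sum(w * vector[i] for w, (_, vector) in zip(weights, embedded)) / total
        for i in range(dim)
    ]


# Two made-up embedded emotions with plane positions and vectors.
EMBEDDED = [((0.0, 0.0), [1.0, 0.0]), ((2.0, 0.0), [0.0, 1.0])]
on_cluster = resolve_vector((0.0, 0.0), EMBEDDED)
in_between = resolve_vector((1.0, 0.0), EMBEDDED)
```

A point midway between two embedded emotions thus receives an equal mixture of their vectors, which is one way a limited set of embedded vectors could cover emotions that are not stored.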

FIG. 9 illustrates in more detail, by means of functional blocks, the process of performing emotion-information-based speech synthesis according to an embodiment of the present specification.

Referring to FIG. 9, the speech synthesis apparatus according to an embodiment of the present specification may embed a plurality of different pieces of speaker information and a plurality of different pieces of emotion information. Here, the embedded speaker information may comprise vector values, embedded in the speech synthesis apparatus, for a plurality of different speakers, the voice tone of each of those speakers, and the voice tones corresponding to the emotions classified by the emotion classification model.

In the process of using an emotion vector value for speech synthesis in the above-described embodiment, the speaker embedding vector value corresponding to that emotion vector value may be considered together with it. For example, the processor may select the speaker information that best reflects the emotion information determined by the emotion classification model and use the selected speaker's voice for speech synthesis. Further, among the voices of the selected speaker, the voice that best reflects the classified emotion may be selected and used for speech synthesis. To this end, the information embedded in the speech synthesis apparatus according to an embodiment of the present specification may include speaker information in addition to the emotion vectors, and the speaker information may also be embedded in the form of vector values. In addition, the embedded emotion vectors may be embedded for each emotion type.

When an input sentence is received, the processor may determine the emotion information corresponding to the input sentence through the emotion classification model and extract the emotion vector value corresponding to the determined emotion information. The processor may also select a speaker that matches the extracted emotion vector value. That is, the processor may determine, from among the plurality of embedded pieces of speaker information, the speaker vector value that matches the extracted emotion vector value.
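Selecting the speaker vector that matches an extracted emotion vector may, purely as an illustration, be sketched as follows. The specification does not define the matching criterion; cosine similarity, the assumption that speaker voice tone vectors and emotion vectors share one vector space, and the names `cosine` and `pick_voice_tone` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_voice_tone(emotion_vector, speaker_tones):
    """Return the (speaker, tone) key whose vector is most similar.

    speaker_tones: dict mapping (speaker_id, tone_label) -> vector.
    The similarity-based rule is an assumption, not the claimed method.
    """
    return max(speaker_tones,
               key=lambda key: cosine(emotion_vector, speaker_tones[key]))


# Made-up embedded voice tones for two speakers.
SPEAKER_TONES = {
    ("speaker_1", "joy"): [1.0, 0.0],
    ("speaker_1", "sadness"): [0.0, 1.0],
    ("speaker_2", "joy"): [0.7, 0.7],
}
choice = pick_voice_tone([0.9, 0.1], SPEAKER_TONES)
```

The selected key identifies both the speaker and that speaker's voice tone, which corresponds to choosing first a speaker and then the best-fitting voice of that speaker.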

The processor may perform the speech synthesis process based on the encoding result of the input sentence, the embedded emotion information, and the embedded speaker information.

Meanwhile, similarly to the process of selecting emotion information described with reference to FIGS. 7 and 8, the present specification may also perform speech synthesis based on a voice tone other than those of the embedded speakers.

FIG. 10 is a flowchart of a speech synthesis method that applies multi-speaker information according to an embodiment of the present specification.

Referring to FIG. 10, the processor may map the speaker information (speaker vector values) embedded in the speech synthesis apparatus onto a coordinate plane and output it through the output unit of the speech synthesis apparatus (S1000).

Similarly to FIG. 8, the embedded speaker vector values may be grouped by speaker and arranged on the plane. For example, when the voices of ten speakers are stored in the speech synthesis apparatus, the processor arranges the speaker vectors of each of the ten speakers on the plane, and the voices of the ten speakers may be stored separately as different voice tones for each emotion. For example, when one speaker has recorded voices corresponding to six different emotions, that speaker's six emotional voice tones may each be placed within a single cluster region of the coordinate space.

When a specific point is selected on the coordinate plane onto which the speaker information is mapped (S1100), the processor may determine whether the speaker vector value corresponding to the selected point is embedded in the speech synthesis apparatus (S1200).

When the specific point lies where a speaker vector exists, an embedded speaker vector is available, so the processor may reflect the corresponding speaker's voice tone in the speech synthesis (S1400). However, when the specific point corresponds to a speaker vector value that is not embedded in the speech synthesis apparatus, the processor may determine the speaker vector value of the selected point by referring to the speaker vector values already mapped onto the coordinate plane (S1300). The processor may then generate the speech signal to be synthesized based on the determined speaker vector value.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all types of recording devices in which data readable by a computer system is stored. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device, and also include implementation in the form of a carrier wave (e.g., transmission over the Internet). Accordingly, the above detailed description should not be construed as restrictive in any respect but should be considered exemplary. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

1. A speech synthesis method comprising:
preparing training data to be used for speech synthesis;
training an emotion classification model by fine-tuning a pre-trained language model on the training data;
extracting an emotion element corresponding to an input sentence by applying the input sentence to the emotion classification model, wherein the extracted emotion element includes a plurality of emotion items having different emotion weight values; and
performing speech synthesis based on the input sentence, embedded speaker information, and the extracted emotion element.

2. The method of claim 1,
wherein the pre-trained language model is a BERT language model, and
wherein the fine-tuning performs transfer learning on the pre-trained language model by using a target emotion corresponding to an input sentence as a tag.

3. The method of claim 1,
wherein the emotion classification model includes a plurality of distinct major-category emotion items, each major-category emotion item including a plurality of similar sub-category emotion items, and
wherein the major-category emotion items and the plurality of sub-category emotion items are each embedded in a speech synthesis apparatus as emotion vector values.

4. The method of claim 3, further comprising:
mapping the emotion vector values onto a coordinate plane and outputting the mapping result through a display of the speech synthesis apparatus;
when a point corresponding to an emotion vector value not embedded in the speech synthesis apparatus is selected on the coordinate plane, determining the emotion vector value of the selected point by referring to the emotion vector values previously mapped onto the coordinate plane; and
performing the speech synthesis based on the determined emotion vector value.

5. The method of claim 3,
wherein the emotion classification model is configured to output a major-category emotion item corresponding to the input sentence and a weight value of the major-category emotion item.

6. The method of claim 3,
wherein the emotion classification model is configured to output a sub-category emotion item of a major-category emotion item matching the input sentence and a weight value of the sub-category emotion item.

7. The method of claim 3,
wherein the emotion classification model extracts the emotion element corresponding to the input sentence as a combination of at least two of a major-category emotion item corresponding to the input sentence and a plurality of sub-category emotion items included in the major-category emotion item, and
wherein the two or more combined emotion items have different weight values.

8. The method of claim 7,
wherein the weight value of each of the two or more combined emotion items is adaptively adjustable.

9. The method of claim 1,
wherein the embedded speaker information comprises vector values, embedded in a speech synthesis apparatus, of a plurality of different speakers, a voice tone of each of the plurality of speakers, and voice tones corresponding to the emotions classified by the emotion classification model.

10. The method of claim 9, further comprising:
mapping the embedded speaker information onto a coordinate plane through the vector values and outputting the mapping result through a display of the speech synthesis apparatus;
when a point corresponding to speaker information not embedded in the speech synthesis apparatus is selected on the coordinate plane, determining a vector value of the selected point by referring to the vector values previously mapped onto the coordinate plane; and
performing the speech synthesis based on the voice tone corresponding to the determined vector value.

11. A speech synthesis apparatus comprising:
an input unit;
a storage unit for storing a pre-trained language model with respect to training data to be used for speech synthesis;
a speech synthesis unit for performing speech synthesis on an input sentence received through the input unit; and
a processor for training an emotion classification model by fine-tuning the pre-trained language model on the training data,
wherein the processor applies the input sentence to the fine-tuned emotion classification model to extract an emotion element corresponding to the input sentence, the extracted emotion element including a plurality of emotion items having different emotion weight values, and controls the speech synthesis unit to perform speech synthesis based on the input sentence, speaker information embedded in the speech synthesis apparatus, and the extracted emotion element.