KR20210032235A

KR20210032235A - Device, method and computer program for synthesizing emotional voice

Info

Publication number: KR20210032235A
Application number: KR1020190113752A
Authority: KR
Inventors: 배문규; 차재욱
Original assignee: 주식회사 케이티
Priority date: 2019-09-16
Filing date: 2019-09-16
Publication date: 2021-03-24

Abstract

A device for synthesizing an emotional voice comprises: a synthesis model generation unit which generates a synthesis model based on original voice data and a sentence corresponding to the original voice data; an emotion model generation unit which generates an emotion model based on emotional voice data by emotion and synthesis voice data generated by the synthesis model; a user synthesis model generation unit which inputs user voice data of a user and a sentence corresponding to the user voice data into the generated synthesis model to generate a user synthesis model of the user; a user synthesis model training unit which uses the generated user synthesis model to generate user synthesis voice data, inputs the generated user synthesis voice data into the generated emotion model to generate user emotion voice data, and inputs the generated user emotion voice data into the user synthesis model to train the user synthesis model; and a synthesis emotional voice data generation unit which inputs a request sentence and emotional information into the trained user synthesis model to generate user synthesis emotional voice data.

Description

Apparatus, method, and computer program for synthesizing emotional voices

Technical Field

The present invention relates to an apparatus, a method, and a computer program for synthesizing emotional voices.

As demand for AI-based children's services grows, many parents want a service that reads fairy tales to their children in the parent's own voice.

However, because parents are not professional voice actors, a fairy tale service that uses a parent's voice can collect only a small amount of recorded speech. Because the parent's emotional expression is also unpracticed, it is difficult to extract emotion data from the recordings.

Meanwhile, performing emotion synthesis separately for a specific voice requires a large amount of emotion data for that voice. Adding an emotion filter to a voice likewise requires emotion data for that voice and consumes time for the additional processing.

Korean Patent Laid-Open Publication No. 2008-0060909 (published July 2, 2008)

The present invention is intended to solve the problems of the prior art described above. It aims to generate user synthesized emotional voice data by inputting a user's request sentence and emotion information into a user synthesis model that has been trained on the basis of a synthesis model for generating synthesized voice data and an emotion model for generating emotional voice data. However, the technical problems addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, an apparatus for synthesizing emotional voices according to a first aspect of the present invention may include: a synthesis model generation unit that generates a synthesis model based on original voice data and a sentence corresponding to the original voice data; an emotion model generation unit that generates an emotion model based on emotional voice data for each emotion and synthesized voice data generated by the synthesis model; a user synthesis model generation unit that generates a user synthesis model of a user by inputting the user's voice data and a sentence corresponding to the user voice data into the generated synthesis model; a user synthesis model training unit that generates user synthesized voice data using the generated user synthesis model, inputs the generated user synthesized voice data into the generated emotion model to generate user emotional voice data, and trains the user synthesis model by inputting the generated user emotional voice data into it; and a synthesized emotional voice data generation unit that generates user synthesized emotional voice data by inputting a request sentence and emotion information into the trained user synthesis model.

A method for synthesizing emotional voices according to a second aspect of the present invention may include: generating a synthesis model based on original voice data and a sentence corresponding to the original voice data; generating an emotion model based on emotional voice data for each emotion and synthesized voice data generated by the synthesis model; and generating user synthesized emotional voice data by inputting a request sentence and emotion information into a user synthesis model trained on the basis of the synthesis model and the emotion model.

A computer program according to a third aspect of the present invention, stored in a medium and including a sequence of instructions for synthesizing emotional voices, may include a sequence of instructions that, when executed by a computing device, cause the device to: generate a synthesis model based on original voice data and a sentence corresponding to the original voice data; generate an emotion model based on emotional voice data for each emotion and synthesized voice data generated by the synthesis model; generate a user synthesis model of a user by inputting the user's voice data and a sentence corresponding to the user voice data into the generated synthesis model; generate user synthesized voice data using the generated user synthesis model; generate user emotional voice data by inputting the generated user synthesized voice data into the generated emotion model; train the user synthesis model by inputting the generated user emotional voice data into it; and generate user synthesized emotional voice data by inputting a request sentence and emotion information into the trained user synthesis model.

The means for solving the problem described above are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problem, the present invention builds a synthesis model that generates synthesized voice data and an emotion model that generates emotional voice data, and inputs a user's request sentence and emotion information into a user synthesis model trained on the basis of the two models. It can thereby generate user synthesized emotional voice data that reflects an emotion suited to the request sentence.

Existing emotion application methods generate a base sound source for emotional expression and then overlay or adjust additional sound. In contrast, because the user synthesis model can be trained with emotion models generated in advance for each emotion, emotion can be applied without affecting synthesis time.

FIG. 1 is a block diagram of an emotional voice synthesis apparatus according to an embodiment of the present invention.
FIGS. 2A to 2E are diagrams for explaining an emotional voice synthesis method according to an embodiment of the present invention.
FIGS. 3A to 3I are diagrams for explaining a method of generating an emotion model according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an emotional voice synthesis method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily implement the invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar reference numerals are attached to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for implementing the present invention are described with reference to the accompanying configuration diagrams and process flowcharts.

FIG. 1 is a block diagram of an emotional voice synthesis apparatus 10 according to an embodiment of the present invention.

Referring to FIG. 1, the emotional voice synthesis apparatus 10 may include a synthesis model generation unit 100, an emotion model generation unit 110, a user synthesis model generation unit 120, a user synthesis model training unit 130, and a synthesized emotional voice data generation unit 140. However, the emotional voice synthesis apparatus 10 shown in FIG. 1 is only one implementation example of the present invention, and various modifications are possible based on the components shown in FIG. 1.

Hereinafter, FIG. 1 is described with reference to FIGS. 2A to 3I together.

The synthesis model generation unit 100 may generate and train the synthesis model 20 based on the original voice data 201 and the sentence 203 corresponding to the original voice data 201. Here, the original voice data 201 may be voice data in which a voice actor's voice is recorded. The data input to generate the synthesis model may consist of pairs of {original voice data, sentence}.

The sentence 203 input to the synthesis model may be a value embedded in phoneme units. For example, with an embedding such as 'ㄱ: 2, ㄴ: 3, ㄷ: 4', the phoneme units of '동화' ("fairy tale") are [ㄷ ㅗ ㅐ ㅎ ㅘ], so the word may be embedded as [5, 29, 13, 20, 30].
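The phoneme-level lookup can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table holds only the three entries quoted in the text (ㄱ: 2, ㄴ: 3, ㄷ: 4), while the names `PHONEME_IDS`, `UNKNOWN_ID`, and `embed_phonemes`, and the use of 0 for phonemes missing from the table, are assumptions.

```python
# Partial phoneme-to-ID table; only these three entries are quoted in the text.
PHONEME_IDS = {"ㄱ": 2, "ㄴ": 3, "ㄷ": 4}

UNKNOWN_ID = 0  # assumption: ID used for phonemes missing from the table


def embed_phonemes(phonemes):
    """Map a list of phoneme symbols to their integer IDs."""
    return [PHONEME_IDS.get(p, UNKNOWN_ID) for p in phonemes]


print(embed_phonemes(["ㄷ", "ㄱ", "ㄴ"]))  # [4, 2, 3]
```

A full table would cover every consonant and vowel that can occur in the input sentences.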

The original voice data 201 input to the synthesis model may be values converted into a mel spectrogram, with the frequencies of each unit separated by the short-time Fourier transform (STFT) algorithm. The unit size may be, for example, 80 mel-scale frequency values per 50 ms time unit.

For example, the synthesis model generation unit 100 may train the synthesis model 20 by learning at which time span of the original voice data 201 each phoneme of the sentence 203 is located, and by learning the frequency position and frequency intensity at which each phoneme appears in the original voice data 201.

The emotion model generation unit 110 may generate an emotion model 22 for each emotion based on the emotional voice data 205 for each emotion and the synthesized voice data 207 generated by the synthesis model 20. Here, the emotional voice data 205 and the synthesized voice data 207 are voice data for the same sentence. The emotional voice data 205 may be voice data in which a voice actor's emotionally expressive voice is recorded. The data input to generate the emotion model 22 may consist of pairs of {synthesized voice data, emotional voice data}.

The emotion model generation unit 110 may convert the emotional voice data 205 and the synthesized voice data 207 for the same sentence into spectrogram images, and may then convert those spectrogram images into an emotional voice matrix and a synthesized voice matrix, respectively.

Referring to FIG. 3A, the emotion model generation unit 110 may convert the synthesized voice data for the sentence '하이' ("hi") 301 into a spectrogram image 303. The horizontal axis of the spectrogram image 303 is the time axis (sample axis), and the vertical axis is the frequency axis. The value of each pixel in the spectrogram image 303 represents the sound intensity (dB).

The emotion model generation unit 110 may convert the spectrogram image 303 corresponding to the synthesized voice data for the '하이' sentence 301 into a synthesized voice matrix 305. The rows of the synthesized voice matrix 305 correspond to frequency, the columns to time, and the matrix values to sound intensity.

The emotion model generation unit 110 may calculate, based on a preset sample rate, the number of samples contained in one time unit for the samples included in the synthesized voice matrix 305. If the time unit is 1.25 ms, the number of samples k contained in one time unit can be derived from [Equation 1].

[Equation 1]

\[ k = (\text{sample rate in Hz}) \times 1.25 \times 10^{-3} \]

For example, substituting 16 kHz for the preset sample rate in [Equation 1] yields 20 samples per 1.25 ms time unit.
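The arithmetic behind the 16 kHz example can be checked with a short sketch. The function name `samples_per_time_unit` and its parameters are hypothetical, introduced only for illustration.

```python
def samples_per_time_unit(sample_rate_hz, unit_ms):
    """Number of samples k that fall inside one time unit."""
    # round() guards against floating-point error in the product.
    return round(sample_rate_hz * unit_ms / 1000)


print(samples_per_time_unit(16_000, 1.25))  # 20
```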

The emotion model generation unit 110 may divide the synthesized voice matrix 305 so that each preset time unit contains the number of samples calculated from the preset sample rate. For example, when 20 samples per 1.25 ms time unit are calculated for a preset sample rate of 16 kHz, the emotion model generation unit 110 may divide the columns of the synthesized voice matrix 305 into blocks of 20 columns (307). When the matrix is divided in this way, its time axis shrinks from one column per sample to one column per 20-sample block.

The emotion model generation unit 110 may calculate an average value for each time unit of the divided synthesized voice matrix (hereinafter referred to as a sub synthesized voice matrix). For example, for each of the sub synthesized voice matrices 309 and 311, the emotion model generation unit 110 may average (313, 315) the matrix values of each frequency over the time unit.

Once the matrix values of every sub synthesized voice matrix 309 and 311 have been averaged, the emotion model generation unit 110 may generate a normalized synthesized voice matrix 317.
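The division into blocks followed by per-frequency averaging can be sketched in plain Python. The function name `average_time_blocks`, the list-of-rows representation, and the handling of leftover columns are assumptions; the text does not say what happens when the number of samples is not a multiple of the block size.

```python
def average_time_blocks(matrix, k):
    """Average every k consecutive columns of each row.

    matrix: list of rows (one row per frequency), each a list of
    intensities over samples. Trailing columns that do not fill a
    whole block are dropped (an assumption, see the lead-in).
    """
    n_blocks = len(matrix[0]) // k
    return [
        [sum(row[b * k:(b + 1) * k]) / k for b in range(n_blocks)]
        for row in matrix
    ]


spec = [
    [1.0, 3.0, 5.0, 7.0],  # frequency bin 0
    [2.0, 2.0, 4.0, 8.0],  # frequency bin 1
]
print(average_time_blocks(spec, 2))  # [[2.0, 6.0], [2.0, 6.0]]
```

With the values from the text, `k` would be 20 and each output column would represent one 1.25 ms time unit.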

Referring to FIG. 3B, if the number of time units of the normalized synthesized voice matrix 317 is $T_s$, the synthesized voice matrix 317 has a size of (8000, $T_s$).

The emotion model generation unit 110 may derive a distance matrix by comparing the components of the normalized synthesized voice matrix 317. Specifically, it may derive the distance matrix 321 for the synthesized voice matrix 317 based on the difference between the components (matrix values) of the N-th column and those of the (N+1)-th column. The distance matrix can be derived based on [Equation 2].

[Equation 2]

\[ d_i = \sum_{f=1}^{8000} \left| x_{i+1}[f] - x_i[f] \right| \]

Here, $x_i$ is a matrix of size (1, 8000) that holds the intensity of each frequency at the i-th time unit, and $d_i$ is a scalar value. The remaining two terms of the equation are matrices whose sizes depend on the number of time units.

That is, the frequency matrix at the i-th time unit is $x_i$, and computing the intensity difference between the i-th and (i+1)-th time units for all frequency values yields the values of the distance matrix.

The emotion model generation unit 110 may derive the distance matrix 321 for the synthesized voice matrix 317 by summing, for each time unit, the matrix 319 that contains the absolute values of the differences between the components of the N-th column and those of the (N+1)-th column of the synthesized voice matrix 317. Here, the distance matrix 321 has a size of (8000, $T_s$ − 1), where $T_s$ is the number of time units of the normalized synthesized voice matrix 317.
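The adjacent-column comparison can be sketched as follows. This sketch reads "summing for each time unit" as collapsing the absolute differences across frequencies, so that each pair of neighbouring time units yields a single distance value. That reading, the function name `adjacent_distances`, and the list-of-rows representation are assumptions.

```python
def adjacent_distances(matrix):
    """Sum of absolute intensity differences between neighbouring columns.

    matrix: list of rows (frequencies) by columns (time units).
    Returns one distance per pair of adjacent time units, so the
    result has one entry fewer than the matrix has columns.
    """
    n_cols = len(matrix[0])
    return [
        sum(abs(row[i + 1] - row[i]) for row in matrix)
        for i in range(n_cols - 1)
    ]


X = [
    [1.0, 1.0, 5.0, 5.0],
    [2.0, 2.0, 0.0, 0.0],
]
print(adjacent_distances(X))  # [0.0, 6.0, 0.0]
```

A large value marks a point where the spectrum changes abruptly, which the following steps treat as a candidate phoneme boundary.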

Referring to FIGS. 3B and 3C together, the emotion model generation unit 110 may extract distance values from the distance matrix 321 of the synthesized voice matrix 317 based on the number of phonemes constituting the synthesized voice data for the '하이' sentence 301.

From the derived distance matrix 321, the emotion model generation unit 110 may extract the largest distance values, taking one fewer than the number of phonemes constituting the synthesized voice data.

That is, if the synthesized voice data consists of N phonemes, the emotion model generation unit 110 may extract the N−1 largest distance values among the distance values included in the distance matrix 321.

Here, the '하이' sentence 301 can be separated into the phonemes [ㅎ ㅏ ㅣ], so the number of phonemes constituting the sentence 301 is three. Because the '하이' sentence 301 has three phonemes, the emotion model generation unit 110 may extract the two largest distance values 323 and 325 from the distance matrix 321.

The emotion model generation unit 110 may then identify the index values 327 of the extracted distance values 323 and 325.
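Selecting the N−1 largest distances and recovering their positions can be sketched as below. The function name `boundary_indices` and the sample distance values are hypothetical; the values were chosen only so that the result matches the indices 2 and 4 used in the text's example.

```python
def boundary_indices(distances, n_phonemes):
    """Indices of the (n_phonemes - 1) largest distances, in time order."""
    n_boundaries = n_phonemes - 1
    # Rank positions by distance, largest first.
    ranked = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    # Keep the top entries and restore chronological order.
    return sorted(ranked[:n_boundaries])


distances = [0.1, 0.2, 4.0, 0.3, 3.5, 0.2]
print(boundary_indices(distances, 3))  # [2, 4]
```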

Referring to FIGS. 3C and 3D together, the emotion model generation unit 110 may derive synthesized voice clusters by dividing the synthesized voice matrix 317 into as many parts as there are phonemes, according to the identified index values 327. For example, based on index 2 (the third position in order) and index 4 (the fifth position in order) included in the index values 327, the emotion model generation unit 110 may divide the synthesized voice matrix 317 at the positions of the identified index values to derive the synthesized voice clusters 329, 331, and 333. That is, the first through third columns of the synthesized voice matrix 317, which correspond to index 2, are split off as the synthesized voice cluster 329, and the fourth and fifth columns, which correspond to index 4, are split off as the synthesized voice cluster 331.

If the synthesized voice matrix 317 is denoted X, it is divided on the basis of the index values 327, and the division is performed along the time axis. That is, when the index values are $i_1, i_2, \dots, i_{n-1}$, X is divided into $X_1, X_2, \dots, X_n$, where $X_1$ consists of the columns of X up to and including $i_1$, $X_2$ consists of the following columns up to and including $i_2$, and so on. Here, Y is the emotional voice matrix 355, and $X_1, X_2, \dots, X_n$ are the synthesized voice clusters 329, 331, and 333. For example, with the index values 2 and 4 of FIG. 3C, $X_1$ consists of the first through third columns of X and $X_2$ consists of the fourth and fifth columns.
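The split along the time axis can be sketched as follows, treating each index as the zero-based position of the last column in a cluster, consistent with index 2 closing the cluster formed by the first through third columns. The function name `split_by_indices` and the seven-column toy matrix are assumptions made for illustration.

```python
def split_by_indices(matrix, indices):
    """Split a rows-by-columns matrix along the time (column) axis.

    Each index marks the last column of a cluster, so index 2 closes a
    cluster made of columns 0..2 (the first through third columns).
    """
    bounds = [0] + [i + 1 for i in sorted(indices)] + [len(matrix[0])]
    return [
        [row[start:end] for row in matrix]
        for start, end in zip(bounds, bounds[1:])
    ]


X = [list(range(1, 8))]  # one frequency row, seven time units
clusters = split_by_indices(X, [2, 4])
print(clusters)  # [[[1, 2, 3]], [[4, 5]], [[6, 7]]]
```

Three phonemes give two boundaries and therefore three clusters, one per phoneme.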

Referring to FIG. 3E, the emotion model generation unit 110 may convert the emotional voice data for the '하이' ("hi") sentence 301 into a spectrogram image 341. The horizontal axis of the spectrogram image 341 is the time axis (sample axis), and the vertical axis is the frequency axis. The value of each pixel in the spectrogram image 341 represents the sound intensity (dB).

The emotion model generation unit 110 may convert the spectrogram image 341 corresponding to the emotional voice data for the '하이' sentence 301 into an emotional voice matrix 343. The rows of the emotional voice matrix 343 correspond to frequency, the columns to time, and the matrix values to sound intensity.

The emotion model generation unit 110 may calculate, based on the preset sample rate, the number of samples contained in one time unit for the samples included in the emotional voice matrix 343. Here, the number of samples can be derived from [Equation 1] mentioned above.

The emotion model generation unit 110 may divide the emotional voice matrix 343 so that each preset time unit contains the number of samples calculated from the preset sample rate. For example, substituting 16 kHz for the preset sample rate in [Equation 1] yields 20 samples per 1.25 ms time unit, and accordingly the emotion model generation unit 110 may divide the columns of the emotional voice matrix 343 into blocks of 20 columns (345).

The emotion model generation unit 110 may calculate an average value for each time unit of the divided emotional voice matrix (hereinafter referred to as a sub emotional voice matrix). For example, for each of the sub emotional voice matrices 347 and 349, the emotion model generation unit 110 may average (351, 353) the matrix values of each frequency over the time unit.

Once the matrix values of every sub emotional voice matrix 347 and 349 have been averaged, the emotion model generation unit 110 may generate a normalized emotional voice matrix 355.

Referring to FIG. 3F, if the number of time units of the normalized emotional voice matrix 355 is $T_e$, the emotional voice matrix 355 has a size of (8000, $T_e$).

The emotion model generation unit 110 may derive a distance matrix by comparing the components of the normalized emotional voice matrix 355. Specifically, it may derive the distance matrix 359 for the emotional voice matrix 355 based on the difference between the components (matrix values) of the N-th column and those of the (N+1)-th column. Here, the distance matrix can be derived based on [Equation 2] described above.

The emotion model generation unit 110 may derive the distance matrix 359 for the emotional voice matrix 355 by summing, for each time unit, the matrix 357 that contains the absolute values of the differences between the components of the N-th column and those of the (N+1)-th column of the emotional voice matrix 355. Here, the distance matrix 359 has a size of (8000, $T_e$ − 1), where $T_e$ is the number of time units of the normalized emotional voice matrix 355.

Referring to FIGS. 3F and 3G together, the emotion model generator 110 may extract distance values from the distance matrix 359 of the emotion speech matrix 355 based on the number of phonemes constituting the emotion speech data for the sentence "하이" ("hi") 301.

The emotion model generator 110 may extract, from the derived distance matrix 359 of the emotion speech matrix 355, the highest distance values, numbering one fewer than the number of phonemes constituting the emotion speech data.

That is, if the emotion speech data consists of N phonemes, the emotion model generator 110 may extract the N-1 highest distance values among the plurality of distance values included in the distance matrix 359.

Here, the sentence "하이" 301 can be separated into the phonemes [ㅎ ㅏ ㅣ], so the number of phonemes constituting the sentence 301 is three. Since the sentence 301 has three phonemes, the emotion model generator 110 may extract the two highest distance values 361 and 363 from the distance matrix 359.

The emotion model generator 110 may identify the index values 365 of the extracted distance values 361 and 363.
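As a rough sketch of this selection step, the following picks the N-1 largest boundary distances and returns their indices in time order. The helper name `top_boundary_indices` and the sample distances are hypothetical; the distances are assumed to be one value per column boundary.

```python
import numpy as np

def top_boundary_indices(distances: np.ndarray, num_phonemes: int) -> np.ndarray:
    """Indices of the (num_phonemes - 1) largest distances, sorted by time."""
    k = num_phonemes - 1
    if k <= 0:
        return np.array([], dtype=int)
    # argsort is ascending, so the last k entries are the k largest distances
    top = np.argsort(distances, kind="stable")[-k:]
    return np.sort(top)

# Seven boundaries with two clear peaks; three phonemes as in [ㅎ ㅏ ㅣ].
distances = np.array([0.1, 0.3, 7.5, 0.2, 0.4, 6.0, 0.1])
idx = top_boundary_indices(distances, num_phonemes=3)
print(idx)  # [2 5]
```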

Referring to FIGS. 3G and 3H together, the emotion model generator 110 may derive emotion speech clusters by dividing the emotion speech matrix 355 into as many parts as there are phonemes, according to the identified index values 365. For example, based on index 2 (the third index in order) and index 5 (the sixth index in order) included in the index values 365, the emotion model generator 110 may divide the emotion speech matrix 355 at the ordinal positions of the identified index values to derive the emotion speech clusters 367, 369, and 371.

Let the emotion speech matrix 355 be Y. Y is divided at the index values 365, and the axis of division is the time unit. That is, given the index values 365, Y can be divided into Y_1, Y_2, ..., Y_N. Here, Y is the emotion speech matrix 355, and Y_1, Y_2, ..., Y_N are the emotion speech clusters (367, 369, 371). [Worked example given as equation images; not reproduced.]
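The split along the time axis can be illustrated with `np.split`. This is a sketch under one stated assumption: boundary index i is taken to mean the gap between column i and column i+1, so the cut is placed after column i. The function name `split_into_clusters` and the toy matrix are hypothetical.

```python
import numpy as np

def split_into_clusters(matrix: np.ndarray, boundary_indices) -> list:
    """Split a (freq_bins, time_units) matrix into clusters along time."""
    # Assumption: index i marks the gap after column i, so cut at column i + 1.
    cut_points = [int(i) + 1 for i in boundary_indices]
    return np.split(matrix, cut_points, axis=1)

# Toy matrix with 2 frequency bins and 8 time units, split at indices 2 and 5.
Y = np.arange(16).reshape(2, 8)
clusters = split_into_clusters(Y, [2, 5])
print([c.shape for c in clusters])  # [(2, 3), (2, 3), (2, 2)]
```

Two boundary indices yield three clusters, matching the three phonemes in the "하이" example.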

Referring to FIGS. 3D, 3H, and 3I together, the emotion model generator 110 may adjust the lengths of the derived synthesized speech clusters (329, 331, 333) based on the derived emotion speech clusters (367, 369, 371). Specifically, based on the matrices corresponding to the emotion speech clusters (367, 369, 371), the cluster lengths of the matrices corresponding to the synthesized speech clusters (329, 331, 333) may be adjusted as follows.

For any cluster index k, the length of the k-th cluster is measured in time units, and the adjustment proceeds as follows.

If the synthesized speech cluster is longer than the corresponding emotion speech cluster (i.e., the length of the synthesized speech cluster 323 is greater than the length of the emotion speech cluster 341), as many indices (in unit-time terms) as the difference in length are extracted, starting from the smallest value, from the distance matrix of the synthesized speech cluster, and the columns (40) corresponding to the extracted indices are deleted from the synthesized speech cluster matrix. During deletion, after one column is deleted, the distances for the time-unit columns immediately before and after the deleted column are recalculated, and column deletion is then repeated.

If the synthesized speech cluster is shorter than the corresponding emotion speech cluster (i.e., the length of the synthesized speech cluster 331 is less than the length of the emotion speech cluster 369), as many indices as the difference in length are extracted, starting from the smallest value, from the distance matrix of the synthesized speech cluster, and the columns 373 corresponding to the extracted indices are duplicated in the synthesized speech cluster matrix.
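The length adjustment can be sketched as follows. This is an illustrative reading of the text, not the patent's exact procedure: the names `boundary_distances` and `match_cluster_length` are hypothetical, the distance is assumed to be the summed absolute difference between adjacent columns, deletion is assumed to remove the later column of the closest pair, and duplication is assumed to copy the column at each selected index.

```python
import numpy as np

def boundary_distances(cluster: np.ndarray) -> np.ndarray:
    """Summed |difference| between each pair of adjacent columns: shape (T-1,)."""
    return np.abs(cluster[:, 1:] - cluster[:, :-1]).sum(axis=0)

def match_cluster_length(synth: np.ndarray, target_len: int) -> np.ndarray:
    """Shrink or stretch a (freq_bins, time_units) cluster to target_len columns."""
    out = synth.copy()
    # Too long: delete one column at a time, recomputing distances after each deletion.
    while out.shape[1] > max(target_len, 1):
        i = int(np.argmin(boundary_distances(out)))
        out = np.delete(out, i + 1, axis=1)  # assumption: drop the later column of the pair
    # Too short: duplicate the columns at the smallest-distance indices in one pass.
    need = target_len - out.shape[1]
    if need > 0:
        d = boundary_distances(out)
        order = np.argsort(d, kind="stable") if d.size else np.array([0])
        picks = np.resize(order, need)  # cycles if more copies are needed than indices
        repeats = np.ones(out.shape[1], dtype=int)
        np.add.at(repeats, picks, 1)
        out = np.repeat(out, repeats, axis=1)
    return out

shortened = match_cluster_length(np.array([[1.0, 1.0, 5.0, 9.0]]), 3)
stretched = match_cluster_length(np.array([[1.0, 5.0, 6.0]]), 5)
print(shortened)  # [[1. 5. 9.]]
print(stretched)  # [[1. 1. 5. 5. 6.]]
```

Removing or repeating the columns that differ least from their neighbours changes the duration while disturbing the spectral content as little as possible.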

The emotion model generator 110 may derive a synthesized emotion speech sequence based on the length-adjusted synthesized speech cluster 375, a phoneme information vector, and a phoneme position vector.

The emotion model generator 110 may determine whether the derived synthesized emotion speech sequence is speech carrying emotion, based on the derived synthesized emotion speech sequence, the actual emotion speech sequence for the emotion speech data, the phoneme information vector, and the phoneme position vector, and may train the emotion model using the determination result.

The emotion model may consist of a converter and a discriminator. Here, the converter converts basic speech into emotional speech, and the discriminator determines whether the converted speech carries emotion.

The length-adjusted synthesized speech data, a vector c containing phoneme information, and a vector containing phoneme position information may be input to the converter F. The output of the converter may be a synthesized emotion speech sequence, that is, the synthesized speech data converted by the converter into a speech vector (matrix) carrying emotion.

The actual emotion speech Y for the emotion speech data, the synthesized emotion speech sequence derived by the converter, the vector c containing phoneme information, and the vector containing phoneme position information may be input to the discriminator D. The output of the discriminator D is denoted d. Here, d is a probability distribution: the closer d is to 0, the more the speech is judged to carry no emotion, and the closer d is to 1, the more the speech is judged to carry emotion.

The converter F and the discriminator D included in the emotion model each have their parameters adjusted through training, and the entropy functions used for this purpose are as follows.

The entropy function of the discriminator is [equation not reproduced], and the discriminator may be set to output exactly 1 for emotion data and exactly 0 for non-emotion data.

The entropy function of the converter is [equation not reproduced], and the converter aims for the converted emotion speech to pass the discriminator.

Accordingly, the final entropy is [equation not reproduced], and the parameters may be adjusted so that, as a result, [equation not reproduced] is obtained.
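The entropy functions themselves are not legible in this text. As a hedged illustration only, the sketch below uses the conventional binary cross-entropy objectives of an adversarial converter/discriminator pair, which match the stated goals (discriminator outputs 1 for emotion data and 0 otherwise; converter tries to pass the discriminator). The function names and the `eps` stabiliser are assumptions, not taken from the patent.

```python
import numpy as np

def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy that is low when D(real) -> 1 and D(converted) -> 0."""
    return float(-(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)).mean())

def converter_loss(d_fake: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy that is low when the discriminator scores converted speech near 1."""
    return float(-np.log(d_fake + eps).mean())

# d_real: discriminator scores for actual emotion speech (hypothetical values)
# d_fake: discriminator scores for converter output (hypothetical values)
sharp = discriminator_loss(np.array([0.9]), np.array([0.1]))
unsure = discriminator_loss(np.array([0.5]), np.array([0.5]))
print(sharp < unsure)  # True: a confident, correct discriminator has lower loss
```

In training, the two losses would be minimised alternately, each with respect to its own network's parameters.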

Next, the encoding method performed by the converter is described.

The synthesized speech data input to the converter is first converted into a spectrogram image (a 2D image). Because the spectrogram image corresponding to the synthesized speech data has a variable time length, it must be cut to a fixed length. The cutting length can be adjusted arbitrarily, but a length that is too long causes serious loss in the synthesized speech data, so the image is cut in units of 12.5 ms. If the time unit is set to 1.25 ms, the synthesized speech data is cut in units of 10 time units. Therefore, [equation not reproduced]. Convolution is then performed on the resulting data. Here, convolution divides the synthesized speech data into pieces of the kernel size and then multiplies each piece by the kernel weights to output the final value. The kernel size may be chosen arbitrarily, but it may be set so that the frequency components are strongly compressed.
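The fixed-length cutting can be sketched as below. With a 12.5 ms segment and a 1.25 ms time unit, each segment holds 12.5 / 1.25 = 10 columns. The function name `cut_into_segments` is hypothetical, and dropping a trailing remainder shorter than one segment is an assumption; the source does not say how leftovers are handled.

```python
import numpy as np

def cut_into_segments(spec: np.ndarray, segment_ms: float = 12.5, unit_ms: float = 1.25) -> np.ndarray:
    """Cut a (freq_bins, time_units) spectrogram into fixed-length segments.

    Returns an array of shape (num_segments, freq_bins, columns_per_segment).
    """
    cols = int(round(segment_ms / unit_ms))  # 10 time units per segment
    n = spec.shape[1] // cols
    # Assumption: discard trailing columns that do not fill a whole segment.
    trimmed = spec[:, : n * cols]
    return trimmed.reshape(spec.shape[0], n, cols).transpose(1, 0, 2)

spec = np.arange(3 * 25, dtype=float).reshape(3, 25)  # 3 bins, 25 time units
segments = cut_into_segments(spec)
print(segments.shape)  # (2, 3, 10)
```

Each `segments[j]` equals `spec[:, 10*j : 10*(j+1)]`, so the segments can be fed to a convolution layer as a batch.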

Thereafter, batch normalization may be performed on the convolution output, which reduces training instability for the convolved data. Based on [Equation 3], batch normalization computes the mean and standard deviation of the input values for each channel of the convolution output and then normalizes with respect to them.

[Equation 3]

[equation not reproduced]

Here, the symbols of [Equation 3] denote, in order, the mean of the convolution output, its standard deviation, white noise, and two parameters that are computed through training.
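Since [Equation 3] is not legible here, the following is a sketch of standard per-channel batch normalization, assumed to be what the text describes. The names `x`, `gamma`, `beta`, and `eps` are conventional choices rather than the patent's symbols; in the standard formulation `eps` is a small constant for numerical stability, which may correspond to the term the text calls white noise.

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize x of shape (batch, channels, height, width) per channel."""
    # Mean and variance over batch and spatial axes, one value per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma (scale) and beta (shift) are the two trainable per-channel parameters.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((4, 3, 5, 6)) + 2.0
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.shape)  # (4, 3, 5, 6)
```

With `gamma` at 1 and `beta` at 0, each output channel has approximately zero mean and unit variance regardless of the input scale.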

Thereafter, when a deep convolutional network is applied to the batch-normalized output, a hidden layer as in [Equation 4] is determined.

[Equation 4]

[equation not reproduced]

Here, the W vectors and b vectors are trainable parameters, S(·) is the sigmoid function, the product operator in the equation denotes the element-wise product of two matrices, C is the phoneme information sequence, and L is the phoneme position sequence.

Next, the decoding method performed by the converter is described.

For decoding, a transposed convolution is performed. When a deep convolutional network is applied to the transposed convolution output, a hidden layer as in [Equation 5] is determined, and the final output can be expressed as in [Equation 6].

[Equation 5]

[equation not reproduced]

[Equation 6]

[equation not reproduced]

Meanwhile, the discriminator repeatedly applies the converter's encoding process (i.e., convolution, batch normalization, and the deep convolutional network) to the input data, downsampling while increasing the number of channels, and then applies a softmax function to compute a probability value.

Training may proceed by inputting the synthesized speech data, the emotion speech data, the phoneme information sequence, and the phoneme position sequence into each of the converter and the discriminator.

Referring to FIGS. 2C to 2E, the user synthesis model generator 120 may generate a user synthesis model 24 of a user by inputting the user's speech data 209 and a sentence 211 corresponding to the user speech data 209 into the generated synthesis model 20, and may generate user synthesized speech data 213 using the generated user synthesis model 24. Here, the user speech data 209 may be speech data in which the user's voice is recorded.

The user synthesis model training unit 130 may generate user emotion speech data 215 by inputting the generated user synthesized speech data 213 into the generated emotion model 22. Here, the emotion model 22 is not modified and is identical to the per-emotion emotion model generated earlier.

The user synthesis model training unit 130 may train the user synthesis model 24 by inputting the generated user emotion speech data 215 into it.

The synthesized emotion speech data generator 140 may generate user synthesized emotion speech data 221 by inputting a request sentence 217 and emotion information 219 into the trained user synthesis model 24. Here, the emotion information 219 may include tag information for an emotion selected from a plurality of emotions (e.g., joy, sadness, surprise, anger).

Meanwhile, those skilled in the art will readily understand that the synthesis model generator 100, the emotion model generator 110, the user synthesis model generator 120, the user synthesis model training unit 130, and the synthesized emotion speech data generator 140 may each be implemented separately, or that one or more of them may be implemented in integrated form.

FIG. 4 is a flowchart illustrating a method for synthesizing emotional speech according to an embodiment of the present invention.

Referring to FIG. 4, in step S401 the emotional speech synthesis device 10 may generate a synthesis model based on original speech data and a sentence corresponding to the original speech data.

In step S403, the emotional speech synthesis device 10 may generate an emotion model based on emotion speech data for each emotion and synthesized speech data generated by the synthesis model.

In step S405, the emotional speech synthesis device 10 may generate user synthesized emotion speech data by inputting a request sentence and emotion information into a user synthesis model trained based on the synthesis model and the emotion model.

In the above description, steps S401 to S405 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium can be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: emotional speech synthesis device
100: synthesis model generator
110: emotion model generator
120: user synthesis model generator
130: user synthesis model training unit
140: synthesized emotion speech data generator

Claims

A device for synthesizing emotional speech, comprising:
a synthesis model generator that generates a synthesis model based on original speech data and a sentence corresponding to the original speech data;
an emotion model generator that generates an emotion model based on emotion speech data for each emotion and synthesized speech data generated by the synthesis model;
a user synthesis model generator that generates a user synthesis model of a user by inputting user speech data of the user and a sentence corresponding to the user speech data into the generated synthesis model;
a user synthesis model training unit that generates user synthesized speech data using the generated user synthesis model, generates user emotion speech data by inputting the generated user synthesized speech data into the generated emotion model, and trains the user synthesis model by inputting the generated user emotion speech data into the user synthesis model; and
a synthesized emotion speech data generator that generates user synthesized emotion speech data by inputting a request sentence and emotion information into the trained user synthesis model.

The device of claim 1,
wherein the emotion model generator converts the emotion speech data and the synthesized speech data into spectrogram images, and converts the emotion speech data and the synthesized speech data so converted into an emotion speech matrix and a synthesized speech matrix.

The device of claim 2,
wherein the emotion model generator divides the emotion speech matrix and the synthesized speech matrix so that each preset time unit contains the number of samples calculated based on a preset sample rate, and calculates, for each time unit, the average value of the divided emotion speech matrix and of the divided synthesized speech matrix.

The device of claim 3,
wherein the emotion model generator derives respective distance matrices by comparing the components of the emotion speech matrix and of the synthesized speech matrix,
extracts distance values from the derived distance matrix of the emotion speech matrix and the derived distance matrix of the synthesized speech matrix based on the number of phonemes constituting the emotion speech data and the synthesized speech data, and identifies index values of the extracted distance values.

The device of claim 4,
wherein the emotion model generator extracts, from the derived distance matrix of the emotion speech matrix and the derived distance matrix of the synthesized speech matrix, the highest distance values, numbering one fewer than the number of phonemes.

The device of claim 4,
wherein the emotion model generator derives emotion speech clusters and synthesized speech clusters by dividing the emotion speech matrix and the synthesized speech matrix into as many parts as the number of phonemes according to the identified index values, and adjusts the lengths of the derived synthesized speech clusters based on the derived emotion speech clusters.

The device of claim 6,
wherein the emotion model generator derives a synthesized emotion speech sequence based on the length-adjusted synthesized speech clusters, a phoneme information vector, and a phoneme position vector,
determines whether the derived synthesized emotion speech sequence is speech carrying emotion, based on the derived synthesized emotion speech sequence, an actual emotion speech sequence for the emotion speech data, the phoneme information vector, and the phoneme position vector, and trains the emotion model using the determination result.

A method for synthesizing emotional speech, comprising:
generating a synthesis model based on original speech data and a sentence corresponding to the original speech data;
generating an emotion model based on emotion speech data for each emotion and synthesized speech data generated by the synthesis model; and
generating user synthesized emotion speech data by inputting a request sentence and emotion information into a user synthesis model trained based on the synthesis model and the emotion model.

The method of claim 8,
wherein generating the emotion model comprises
converting the emotion speech data and the synthesized speech data into spectrogram images, and converting the emotion speech data and the synthesized speech data so converted into an emotion speech matrix and a synthesized speech matrix.

The method of claim 9,
wherein generating the emotion model further comprises
dividing the emotion speech matrix and the synthesized speech matrix so that each preset time unit contains the number of samples calculated based on a preset sample rate, and calculating, for each time unit, the average value of the divided emotion speech matrix and of the divided synthesized speech matrix.

The method of claim 10,
wherein generating the emotion model further comprises:
deriving respective distance matrices by comparing the components of the emotion speech matrix and of the synthesized speech matrix; and
extracting distance values from the derived distance matrix of the emotion speech matrix and the derived distance matrix of the synthesized speech matrix based on the number of phonemes constituting the emotion speech data and the synthesized speech data, and identifying index values of the extracted distance values.

The method of claim 11,
wherein generating the emotion model further comprises
extracting, from the derived distance matrix of the emotion speech matrix and the derived distance matrix of the synthesized speech matrix, the highest distance values, numbering one fewer than the number of phonemes.

The method of claim 11,
wherein generating the emotion model further comprises:
deriving emotion speech clusters and synthesized speech clusters by dividing the emotion speech matrix and the synthesized speech matrix into as many parts as the number of phonemes according to the identified index values; and
adjusting the lengths of the derived synthesized speech clusters based on the derived emotion speech clusters.

The method of claim 13,
wherein generating the emotion model further comprises:
deriving a synthesized emotion speech sequence based on the length-adjusted synthesized speech clusters, a phoneme information vector, and a phoneme position vector; and
determining whether the derived synthesized emotion speech sequence is speech carrying emotion, based on the derived synthesized emotion speech sequence, an actual emotion speech sequence for the emotion speech data, the phoneme information vector, and the phoneme position vector, and training the emotion model using the determination result.

A computer program stored in a medium and comprising a sequence of instructions for synthesizing emotional speech,
wherein the computer program, when executed by a computing device, comprises a sequence of instructions that cause the computing device to:
generate a synthesis model based on original speech data and a sentence corresponding to the original speech data;
generate an emotion model based on emotion speech data for each emotion and synthesized speech data generated by the synthesis model;
generate a user synthesis model of a user by inputting user speech data of the user and a sentence corresponding to the user speech data into the generated synthesis model;
generate user synthesized speech data using the generated user synthesis model;
generate user emotion speech data by inputting the generated user synthesized speech data into the generated emotion model;
train the user synthesis model by inputting the generated user emotion speech data into the user synthesis model; and
generate user synthesized emotion speech data by inputting a request sentence and emotion information into the trained user synthesis model.