KR20210012265A

KR20210012265A - Providing method of voice, learning method for providing voice and apparatus thereof

Info

Publication number: KR20210012265A
Application number: KR1020190089684A
Authority: KR
Inventors: 이종언; 박지웅
Original assignee: 주식회사 엘지유플러스
Priority date: 2019-07-24
Filing date: 2019-07-24
Publication date: 2021-02-03

Abstract

According to one embodiment, a voice providing method and device are provided. The voice providing method receives a text sequence, extracts a first feature vector from the text sequence, generates a second feature vector by applying an embedding vector learned in advance for each condition to the first feature vector, determines context information of the text from the second feature vector, synthesizes a voice corresponding to the text sequence based on the context information, and provides the synthesized voice. The voice providing method and device may therefore provide a personalized text-to-speech (TTS) service that reflects a speaker's various emotions and tones.

Description

Voice providing method, learning method for providing voice, and devices thereof {PROVIDING METHOD OF VOICE, LEARNING METHOD FOR PROVIDING VOICE AND APPARATUS THEREOF}

The embodiments relate to a voice providing method, a learning method for providing a voice, and apparatuses thereof.

With the development of artificial intelligence technology, various voice services (for example, audio content provision, control through voice recognition, autonomous driving, and augmented reality) are being provided through smart devices such as AI (Artificial Intelligence) assistants and AI speakers. In addition, with the arrival of the hyper-connected 5G era, traffic coming in from various smart devices has increased, and there is a need to provide per-individual voice services trained on the optimal domain for each device. At present, however, it is common simply to synthesize and provide a voice corresponding to the input text, and it is difficult to convey a speaker's various emotions, tones, and expression in various languages.

According to an embodiment, a personalized text-to-speech (TTS) service that reflects a speaker's various emotions and tones may be provided.

According to an embodiment, text may be converted into various languages and provided.

According to an embodiment, a personalized TTS service may generate a voice that reflects various variables such as gender, age, and ambient noise.

According to an embodiment, training data suited to a new service and domain can be generated and learned immediately, which reduces the cost of generating training data and improves the performance of a speech recognition engine.

According to an embodiment, a speech synthesis method includes: receiving a text sequence; extracting a first feature vector from the text sequence; generating a second feature vector by applying an embedding vector learned in advance for each condition to the first feature vector; determining context information of the text from the second feature vector; synthesizing a voice corresponding to the text sequence based on the context information; and providing the synthesized voice.

The condition may include at least one of emotion, tone, and language conversion.

The generating of the second feature vector may include at least one of: generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of emotions; generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of tones; and generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of languages for language conversion.

The plurality of emotions may include at least one of happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation.

The plurality of tones may include at least one of a voice tone and a change in speed.

The voice providing method may further include receiving, from a user, a selection of a personalization condition, and the extracting of the first feature vector may include: determining a personalized TTS (Text to Speech) engine corresponding to the personalization condition; and outputting the first feature vector by applying the text sequence to the personalized TTS engine.

The personalization condition may include at least one of the user's age, the user's gender, and the voice frequency band preferred by the user.

The synthesizing of the voice may include: predicting a distribution of intonation, intensity, tone, and frequency corresponding to the text sequence based on the context information; and synthesizing the voice using the predicted intonation, intensity, and tone and the predicted frequency distribution.

The speech synthesis method may further include: mixing noise related to a speaker into the synthesized voice; and outputting the voice mixed with the noise.

According to an embodiment, a learning method includes: receiving reference voice signals (reference audios) corresponding to a plurality of emotions; training a network that classifies each of the reference voice signals by the plurality of emotions; collecting prosody features corresponding to the emotions of the reference voice signals using the trained network; and determining an embedding vector for each of the plurality of emotions based on the prosody features.

The collecting of the prosody features may include extracting prosody features common to a specific emotion of the reference voice signals, regardless of the spoken content of the reference voice signals.

The learning method may further include training a network that classifies each of the reference voice signals by a plurality of tones and by a plurality of languages.

The learning method may further include receiving a noise signal corresponding to an environment in which the reference voice signals were collected, and the training of the network that classifies by the plurality of emotions may include: mixing the reference voice signals with the noise signal corresponding to the environment in which the reference voice signals were collected; and training a network that classifies the mixed signal by the plurality of emotions and by the environment.

According to an embodiment, a speech synthesis apparatus includes: a communication interface that receives a text sequence; a processor that extracts a first feature vector from the text sequence, generates a second feature vector by applying an embedding vector learned in advance for each condition to the first feature vector, determines context information of the text from the second feature vector, and synthesizes a voice corresponding to the text sequence based on the context information; and a speaker that outputs the synthesized voice.

According to an embodiment, a learning apparatus includes: a communication interface that receives reference voice signals corresponding to a plurality of emotions; and a processor that trains a network that classifies each of the reference voice signals by the plurality of emotions, collects prosody features corresponding to the emotions of the reference voice signals using the trained network, and determines an embedding vector for each of the plurality of emotions based on the prosody features, wherein the communication interface outputs the embedding vector.

According to one aspect, a personalized text-to-speech (TTS) service that reflects a speaker's various emotions and tones may be provided.

According to one aspect, text may be converted into various languages and provided.

According to one aspect, a personalized TTS service may generate a voice that reflects various variables such as gender, age, and ambient noise.

According to one aspect, training data suited to a new service and domain can be generated and learned immediately, which reduces the cost of generating training data and improves the performance of a speech recognition engine.

FIG. 1 is a flowchart illustrating a voice providing method according to an embodiment.
FIG. 2 is a diagram illustrating a process of generating a synthesized voice according to an embodiment.
FIG. 3 is a diagram illustrating a process of training a speech recognition engine according to an embodiment.
FIG. 4 is a diagram illustrating the configuration and operation of a speech synthesis (TTS) manager according to an embodiment.
FIG. 5 is a flowchart illustrating a learning method according to an embodiment.
FIG. 6 is a diagram illustrating a process of training a speech synthesis (TTS) model according to an embodiment.
FIG. 7 is a block diagram of a voice providing apparatus according to an embodiment.
FIG. 8 is a block diagram of a learning apparatus according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote the same elements.

Various changes may be made to the embodiments described below. The embodiments described below are not intended to limit the forms of implementation, and should be understood to include all changes, equivalents, and substitutes thereof.

The terms used in the embodiments are used only to describe specific embodiments and are not intended to limit the embodiments. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted when it is determined that it may unnecessarily obscure the gist of the embodiments.

FIG. 1 is a flowchart illustrating a voice providing method according to an embodiment. Referring to FIG. 1, a voice providing apparatus according to an embodiment receives a text sequence (110).

The voice providing apparatus extracts a first feature vector from the text sequence (120). The voice providing apparatus may extract the first feature vector by applying the text sequence received in step 110 to a personalized text-to-speech (TTS) engine. Depending on the embodiment, the voice providing apparatus may receive a selection of a personalization condition from a user. In this case, the voice providing apparatus may determine a personalized TTS engine corresponding to the personalization condition and output the first feature vector by applying the text sequence to the personalized TTS engine. The personalization condition may include, for example, the user's age, the user's gender, and the voice frequency band preferred by the user.

The voice providing apparatus generates a second feature vector by applying an embedding vector learned in advance for each condition to the first feature vector (130). Here, the condition may include, for example, emotion, tone, and language conversion. The voice providing apparatus may, for example, generate the second feature vector by applying an embedding vector learned in advance for each of a plurality of emotions. The plurality of emotions may include, for example, happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation. The plurality of tones may include, for example, a voice tone and a change in speed. The voice providing apparatus may generate the second feature vector by applying an embedding vector learned in advance for each of a plurality of tones. Alternatively, the voice providing apparatus may generate the second feature vector by applying an embedding vector learned in advance for each of a plurality of languages for language conversion.

The voice providing apparatus determines context information of the text from the second feature vector (140). The voice providing apparatus may determine the context information of the text from the second feature vector using, for example, an attention model. The attention model may predict the next spectrogram from the context information of the text. The spectrogram is, for example, a sound spectrogram, and may represent the distribution of intonation, intensity, tone, and frequency.
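
As an illustration of how an attention model can reduce a sequence of second feature vectors to context information, the following minimal sketch computes a dot-product attention context vector. The text does not specify the attention mechanism, so the scoring function, the `query` vector (standing in for a decoder state), and all names here are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, features):
    """Return (weights, context) for dot-product attention.

    query:    list of floats, a hypothetical decoder state
    features: list of second feature vectors, one per text position
    """
    # Score each position by its dot product with the query.
    scores = [sum(q * f for q, f in zip(query, vec)) for vec in features]
    weights = softmax(scores)
    dim = len(features[0])
    # The context is the weighted sum of the feature vectors.
    context = [sum(w * vec[i] for w, vec in zip(weights, features)) for i in range(dim)]
    return weights, context


second_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([0.0, 0.0], second_features)
```

With an all-zero query every position receives the same weight, so the context is simply the mean of the feature vectors; a non-zero query shifts the weight toward the positions it aligns with.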

The voice providing apparatus synthesizes a voice corresponding to the text sequence based on the context information (150). The voice providing apparatus may, for example, predict the distribution of intonation, intensity, tone, and frequency corresponding to the text sequence based on the context information. The voice providing apparatus may synthesize the voice using the predicted intonation, intensity, and tone and the predicted frequency distribution.

The voice providing apparatus outputs the synthesized voice (160). Depending on the embodiment, the voice providing apparatus may mix noise related to a speaker into the synthesized voice and output the voice mixed with the noise. Here, the noise related to the speaker may correspond, for example, to the noises that occur during everyday life at school when the speaker is a student, and to the noises that occur during everyday life at the office when the speaker is an office worker. In this way, "noise related to the speaker" may be used to encompass all of the various everyday noises that can arise in connection with the speaker's age group, main living space, main living environment, and the like.
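
The mixing step can be sketched as additive mixing at a chosen signal-to-noise ratio. The text only says that speaker-related noise is mixed into the synthesized voice; the SNR parameter, the looping of short noise clips, the `NOISE_BY_PROFILE` table, and the sample values below are illustrative assumptions.

```python
import math


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise ratio equals snr_db.

    The noise is looped if it is shorter than the speech.
    """
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    noise_rms = rms(looped)
    if noise_rms == 0.0:
        return list(speech)  # silent noise: nothing to mix
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, looped)]


# Hypothetical mapping from a speaker profile to a short noise clip.
NOISE_BY_PROFILE = {
    "student": [0.2, -0.1, 0.3, -0.2],             # stand-in for school noise
    "office_worker": [0.05, 0.05, -0.05, -0.05],   # stand-in for office noise
}

speech = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
mixed = mix_at_snr(speech, NOISE_BY_PROFILE["student"], snr_db=20.0)
```

Fixing the SNR rather than a raw gain keeps the noise level comparable across noise clips recorded at different volumes.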

The voice providing apparatus may be, for example, a device in which the speech synthesis apparatus 200 described below with reference to FIG. 2 is embedded, or may include a separate speech synthesis apparatus. The voice providing apparatus may be, for example, a device with an AI assistant function such as a smartphone, an AI speaker, an IoT (Internet of Things) device, or a set-top box, or it may be another smart home appliance or a smart vehicle, or any device that performs a similar function.

FIG. 2 is a diagram illustrating a process of generating a synthesized voice according to an embodiment. Referring to FIG. 2, a speech synthesis (TTS) apparatus 200 according to an embodiment may synthesize text into speech through an end-to-end deep neural network (DNN). The deep neural network may include, for example, a text encoder 210, a style embedding module 220, a conditioning module 230, an attention model 240, and a decoder 250.

Suppose that a text sequence is input to the speech synthesis (TTS) apparatus 200. In this case, the text encoder 210 may convert the input text sequence into numbers or vectors that represent the features of the text sequence. The text encoder 210 may determine, from the input text sequence, a feature vector (for example, the first feature vector) representing the features of the text sequence. The text encoder 210 may be configured, for example, as a CBHG network that includes a 1-D convolution bank, a highway network, and a bidirectional gated recurrent unit (GRU). When the speech synthesis (TTS) apparatus 200 is being trained, for example, text and the ground-truth audio may be input together.

The style embedding module 220 may provide an embedding vector learned in advance for each condition. Here, the condition may include the user's emotion (for example, happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, or irritation), the user's tone (for example, voice tone or change in speed), and language conversion (for example, conversion from English to Korean, from Chinese to Korean, or from German to English).

The conditioning module 230 may generate the second feature vector by applying the embedding vector learned in advance for each condition to the first feature vector determined by the text encoder 210.
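
The text does not say how the embedding vector is "applied" to the first feature vector. Two common choices are concatenation and element-wise addition; the sketch below shows both under that assumption. The `STYLE_EMBEDDINGS` table, its values, and the function name are hypothetical.

```python
def condition(first_features, embedding, mode="concat"):
    """Combine each per-position feature vector with one condition embedding.

    mode="concat" appends the embedding to every position;
    mode="add" sums element-wise and requires equal dimensions.
    """
    if mode == "concat":
        return [vec + embedding for vec in first_features]
    if mode == "add":
        if any(len(vec) != len(embedding) for vec in first_features):
            raise ValueError("feature and embedding dimensions must match for 'add'")
        return [[a + b for a, b in zip(vec, embedding)] for vec in first_features]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical pre-learned embeddings, one per condition value.
STYLE_EMBEDDINGS = {
    ("emotion", "joy"): [0.9, 0.1],
    ("emotion", "sadness"): [0.1, 0.8],
}

first_features = [[0.2, 0.4], [0.6, 0.1], [0.3, 0.3]]  # one vector per text position
second_features = condition(first_features, STYLE_EMBEDDINGS[("emotion", "joy")])
```

Concatenation keeps the text features and the style information in separate dimensions, while addition keeps the feature size unchanged for the downstream attention model.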

The attention model 240 may determine the context information of the text from the second feature vector determined through the conditioning module 230. The attention model 240 may predict the next spectrogram from the context information of the text. The spectrogram is, for example, a sound spectrogram, and may represent the distribution of intonation, intensity, tone, and frequency.

The decoder 250 may generate the next spectrogram based on the next spectrogram predicted by the attention model 240. The attention model 240 and the decoder 250 may be configured, for example, as an attention RNN (Recurrent Neural Network). The decoder 250 may output the synthesized voice by converting the next spectrogram into audio or speech.

The decoder 250 may include a vocoder (not shown). The vocoder may convert the parameter values for each phoneme into audio or speech. The vocoder may do so by converting the next spectrogram generated by the decoder 250 into audio or speech. The vocoder may include, for example, the Griffin-Lim vocoder algorithm and a neural vocoder.

FIG. 3 is a diagram illustrating a process of training a speech recognition engine according to an embodiment. FIG. 3 shows the operation performed when a learning apparatus according to an embodiment receives a request to generate training data for a speech recognition engine.

When a training data generation request is received (310), the learning apparatus may generate training data through, for example, a speech synthesis manager (TTS manager) (320). The speech synthesis manager may, for example, generate optimal training data using a speech synthesis module of the general unit-selection type (Unit Selection System; USS), generate optimal training data from stock phrases recorded from frequently used sentences, or generate the optimal training data needed for speech recognition by setting various voice characteristics, noise, frequency formats, and the like through a deep-learning-based personalized speech synthesis module. A unit-selection speech synthesis module may, for example, record a large amount of audio, build a database (DB) of it in phoneme units, and generate audio by concatenating the optimal phoneme units for the requested text. A deep-learning-based speech synthesis module may, for example, extract feature parameters of speech, model them with deep learning, and generate audio by predicting acoustic information.

For example, suppose that the training data the learning apparatus is to generate is training data for a kids' service. The learning apparatus may generate the desired training data by mixing, for example, the voices of children aged 9 or younger with noise that occurs at school. The way the speech synthesis manager generates training data is described in detail below with reference to FIG. 4.

The learning apparatus may store the training data generated in step 320 in various storage systems such as NAS storage or AWS S3 (330).

Using the training data, the learning apparatus may train the STT (Speech To Text) module on the optimal data for each service or each domain (340).

For example, to provide a navigation service, the learning apparatus may generate optimal training data by setting various parameters or conditions for gender, age, vehicle noise (for example, noise when the engine is switched on or off, noise when driving at 30 km/h, 50 km/h, 100 km/h, or 120 km/h, and noise depending on whether the windows are open or closed), and the like. In this way, the learning apparatus can train the STT module with training data optimized for each of various services.
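
One way to enumerate the conditions for the navigation example is a Cartesian product over the parameter values, with each combination describing one batch of training data to synthesize. This is a sketch only: the key names, the age groups, and the decision to treat engine and speed noise as a single `driving_state` axis are assumptions, not part of the source.

```python
from itertools import product

# Speed and window values follow the navigation example in the text;
# key names and age groups are illustrative assumptions.
CONDITIONS = {
    "gender": ["female", "male"],
    "age_group": ["20s", "40s", "60s"],
    "driving_state": ["engine_on", "engine_off", "30 km/h", "50 km/h", "100 km/h", "120 km/h"],
    "windows": ["open", "closed"],
}


def condition_grid(conditions):
    """Yield one dict per combination of condition values."""
    keys = list(conditions)
    for values in product(*(conditions[k] for k in keys)):
        yield dict(zip(keys, values))


grid = list(condition_grid(CONDITIONS))
```

Each dict in `grid` could then be handed to the synthesis and mixing stages to produce one labeled subset of the STT training data.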

FIG. 4 is a diagram illustrating the configuration and operation of a speech synthesis (TTS) manager according to an embodiment. Referring to FIG. 4, the speech synthesis manager 320 according to an embodiment is based on deep learning and may include a deep-learning-based TTS module 410, a voice characteristic module 420, and a mixing module 430.

When a personalization condition such as the user's age, the user's gender, or the voice frequency band preferred by the user is input by the user, the deep-learning-based TTS module 410 may select or determine a personalized TTS engine corresponding to the personalization condition. The user's age may range, for example, from the teens or twenties to the sixties or older.
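
Engine selection from a personalization condition can be pictured as a keyed lookup with a fallback. The sketch below is a minimal illustration; the engine identifiers, the registry contents, the decade-based age grouping, and the default frequency band are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalizationCondition:
    age: int
    gender: str                   # e.g. "female" or "male"
    band_hz: tuple = (80, 8000)   # preferred voice band; not used for selection in this sketch


def age_group(age):
    """Bucket an age into decades, capped at '60s+' for older users."""
    if age < 10:
        return "under10"
    if age >= 60:
        return "60s+"
    return f"{(age // 10) * 10}s"


# Hypothetical registry: (age group, gender) -> engine identifier.
ENGINES = {
    ("20s", "female"): "tts-f-20",
    ("60s+", "male"): "tts-m-60",
}
DEFAULT_ENGINE = "tts-default"


def select_engine(cond):
    """Pick the engine registered for the condition, else fall back to the default."""
    return ENGINES.get((age_group(cond.age), cond.gender), DEFAULT_ENGINE)


engine = select_engine(PersonalizationCondition(age=25, gender="female"))
```

The preferred frequency band is carried in the condition so that a later stage, such as the mixing module, could use it.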

Suppose, for example, that reference voice signals (reference audios) corresponding to a plurality of emotions such as happiness, joy, sadness, depression, anger, calmness, and cheerfulness are input. The voice characteristic module 420 may train a network that classifies each of the reference voice signals by a plurality of styles (or emotions). The voice characteristic module 420 may include a classifier that classifies the plurality of emotions.

Using the trained network, the voice characteristic module 420 may extract prosody features common to a specific emotion, regardless of the content of the reference voice signals, and determine an embedding vector for each of the plurality of emotions. The determined embedding vector may then be reflected, together with the applied condition, at the text encoder stage of the personalized TTS engine (see, for example, the text encoder 210 of FIG. 2) during speech synthesis, and used to synthesize the voice.
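
The text does not state how the per-emotion embedding vector is derived from the collected prosody features. One simple possibility, shown here purely as an assumption, is to average the feature vectors of all reference signals that share an emotion, so that content-specific variation tends to cancel out. The feature values are made up for illustration.

```python
from collections import defaultdict


def emotion_embeddings(samples):
    """Average prosody feature vectors per emotion label.

    samples: iterable of (emotion, feature_vector) pairs, where the vectors
    are assumed to come from a trained emotion classifier.
    """
    grouped = defaultdict(list)
    for emotion, vec in samples:
        grouped[emotion].append(vec)
    # zip(*vecs) walks the vectors dimension by dimension.
    return {
        emotion: [sum(col) / len(vecs) for col in zip(*vecs)]
        for emotion, vecs in grouped.items()
    }


samples = [
    ("joy", [1.0, 3.0]),       # e.g. "Hello" spoken joyfully
    ("joy", [3.0, 5.0]),       # e.g. "Thank you" spoken joyfully
    ("sadness", [0.0, 2.0]),   # e.g. "Hello" spoken sadly
]
embeddings = emotion_embeddings(samples)
```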

Depending on the embodiment, the voice characteristic module 420 may additionally use a tone conversion module or a language conversion module for tone or language conversion. For example, the synthesized speech may be applied to the tone conversion module, or the text sequence may be applied to the language conversion module.

The mixing module 430 may mix noise reflecting the surrounding environment into the speech generated through the voice characteristic module 420, or may adjust the frequency of the generated speech to a frequency band desired by the speaker.
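One common way to mix environmental noise into speech is to scale the noise to a target signal-to-noise ratio and add it sample by sample. The sketch below assumes that approach; the specification does not state how the mixing module 430 sets the noise level, and the function names and the `snr_db` parameter are illustrative.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if not signal:
        raise ValueError("signal must not be empty")
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_with_noise(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise RMS ratio is snr_db.

    The noise is truncated to the speech length, scaled, and added
    sample-wise. Silent noise leaves the speech unchanged.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]
```

A lower `snr_db` makes the environment more prominent; the frequency-band adjustment also mentioned above would be a separate filtering step and is not shown here.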

Depending on the embodiment, the speech synthesis manager 320 may, instead of using the deep learning-based TTS module 410, generate speech by the USS method or from stock phrases based on a general corpus, and/or add noise reflecting the surrounding environment to the generated speech.

FIG. 5 is a flowchart illustrating a learning method according to an embodiment. Referring to FIG. 5, a learning apparatus according to an embodiment receives reference speech signals (reference audios) corresponding to a plurality of emotions (510). The plurality of emotions may include, for example, happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation. The reference speech signals may include speech signals corresponding to various emotions, for example, "Hello" uttered joyfully, "Hello" uttered sadly, "Hello" uttered angrily, "Thank you" uttered joyfully, "Thank you" uttered gloomily, and "Thank you" uttered in a lively manner.

The learning apparatus trains a network that classifies each of the reference speech signals by emotion (520). Alternatively, the learning apparatus may train a network that classifies each of the reference speech signals by tone and by language. Depending on the embodiment, the learning apparatus may receive a noise signal corresponding to the environment in which the reference speech signals were collected. For example, when the reference speech signals were collected at a place where young students gather (such as a school or a private academy), the noise signal may include the various types of noise that occur there (a school bell, sounds from the playground, desks or chairs being dragged, chatter during breaks, and the like).

The learning apparatus may also mix the reference speech signals with the noise signal corresponding to the environment in which they were collected, and train a network that classifies the mixed signal by emotion and by environment.

Using the trained network, the learning apparatus collects prosody features corresponding to the emotions of the reference speech signals (530). The learning apparatus may collect prosody features corresponding to emotions such as joy, sadness, and anger that are inherent in the reference speech signals. The learning apparatus may extract prosody features that are common to a specific emotion, regardless of what is spoken in the reference speech signals.

The learning apparatus determines an embedding vector for each of the plurality of emotions based on the prosody features (540).
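Step 540 can be illustrated with a deliberately simple stand-in. The sketch below assumes that the prosody features have already been collected as fixed-length vectors labeled by emotion (steps 510 to 530), and it takes the per-emotion mean as the embedding vector. The averaging rule and the function name are assumptions for illustration; the specification does not say how the embedding is derived from the features.

```python
from collections import defaultdict


def determine_emotion_embeddings(samples):
    """Average prosody feature vectors per emotion label.

    samples: iterable of (emotion_label, feature_vector) pairs, where all
    feature vectors have the same length.
    Returns a dict mapping each emotion label to its mean feature vector.
    """
    grouped = defaultdict(list)
    for emotion, features in samples:
        grouped[emotion].append(features)

    embeddings = {}
    for emotion, vectors in grouped.items():
        dim = len(vectors[0])
        embeddings[emotion] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return embeddings
```

Because the same greeting uttered with different emotions falls into different groups, while different greetings uttered with the same emotion are averaged together, the result depends on the emotion rather than on the spoken content, which is the property the text asks for.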

FIG. 6 is a diagram illustrating a process of training a speech synthesis (TTS) model according to an embodiment. FIG. 6 shows a process in which a learning apparatus according to an embodiment determines an embedding vector for each of a plurality of emotions from reference speech signals.

The learning apparatus may include a prosody encoder 610, a prosody embedding module 620, an attention model 630, and a classifier 640.

When a Speech Synthesis Markup Language (SSML) tag including a plurality of emotions and reference speech signal(s) 605 are received, the prosody encoder 610 may extract, from the reference speech signal 605, prosody features common to a specific emotion, regardless of what is spoken in the reference speech signal 605.

The prosody embedding module 620 may include embedding vectors learned in advance for each condition. An embedding vector may correspond to, for example, a feature vector representing the common rhythm, pitch, and spacing between sounds that appear when various emotions such as happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation are expressed.

The attention model 630 may classify the embedding vectors included in the prosody embedding module 620 by emotion (happy, sad, angry, cheerful, lively, and so on) through the classifier 640, and output an optimized per-style (per-emotion) embedding vector 650 that represents the degree of emotional expression, intonation, intensity, tone, and frequency distribution expressing the emotion of the reference speech signal(s) 605.
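The specification does not give the form of the attention model 630. As a rough illustration only, the sketch below assumes scaled dot-product attention: a query vector (standing in for the prosody encoder output) is scored against each stored emotion embedding, the scores are normalized with a softmax, and the weighted sum is returned as the style embedding. All names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, embeddings):
    """Blend emotion embeddings by their similarity to the query.

    query: prosody feature vector (list of floats).
    embeddings: dict mapping emotion label -> vector of the same length.
    Returns (weights per label, blended style embedding).
    """
    if not embeddings or not query:
        raise ValueError("query and embeddings must not be empty")
    labels = list(embeddings)
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, embeddings[label])) / scale
        for label in labels
    ]
    weights = softmax(scores)
    blended = [
        sum(w * embeddings[label][i] for w, label in zip(weights, labels))
        for i in range(len(query))
    ]
    return dict(zip(labels, weights)), blended
```

In this reading, the attention weights play the role of the degree of emotional expression: a reference signal that sounds mostly happy puts most of the weight on the happy embedding.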

FIG. 7 is a block diagram of a voice providing apparatus according to an embodiment. Referring to FIG. 7, a voice providing apparatus 700 according to an embodiment includes a communication interface 710 and a processor 730. The voice providing apparatus 700 may further include a memory 750 and a speaker 770. The communication interface 710, the processor 730, the memory 750, and the speaker 770 may communicate with one another through a communication bus 705.

The communication interface 710 receives a text sequence.

The processor 730 extracts a first feature vector from the text sequence. The processor 730 generates a second feature vector by applying, to the first feature vector, an embedding vector learned in advance for each condition. The processor 730 determines context information of the text from the second feature vector. Based on the context information, the processor 730 synthesizes speech corresponding to the text sequence.
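The data flow from the text sequence to the context information can be sketched with toy stand-ins for each stage. Everything below is an assumption made for illustration: the character-based encoder, the choice of element-wise addition as the way the embedding is "applied", and the averaging used to form the context are not taken from the specification, and the final synthesis step is omitted.

```python
def extract_first_feature_vector(text, dim=4):
    """Toy text encoder: one dim-sized vector per character."""
    return [
        [((ord(ch) * (i + 1)) % 17) / 16.0 for i in range(dim)]
        for ch in text
    ]


def apply_condition_embedding(first, embedding):
    """Add the condition embedding to every time step (assumed operation)."""
    for step in first:
        if len(step) != len(embedding):
            raise ValueError("embedding dimension must match feature dimension")
    return [[x + e for x, e in zip(step, embedding)] for step in first]


def determine_context(second):
    """Toy context information: the average of all time steps."""
    if not second:
        raise ValueError("feature sequence must not be empty")
    n = len(second)
    return [sum(step[i] for step in second) / n for i in range(len(second[0]))]
```

For example, `determine_context(apply_condition_embedding(extract_first_feature_vector("hi"), [1.0, 0.0, 0.0, 0.0]))` yields a four-element context vector whose first component is shifted by the condition embedding.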

The memory 750 may store the text sequence and/or the speech synthesized for the text sequence.

The speaker 770 outputs the speech synthesized by the processor 730.

In addition, the processor 730 may perform at least one of the methods described above with reference to FIGS. 1 and 2, or an algorithm corresponding to at least one of those methods. The processor 730 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. Hardware-implemented data processing devices may include, for example, a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an Application-Specific Integrated Circuit (ASIC), and a Field Programmable Gate Array (FPGA).

The processor 730 may execute a program and control the voice providing apparatus 700. Program code executed by the processor 730 may be stored in the memory 750.

The memory 750 may also store various information generated during processing by the processor 730, as well as various data and programs. The memory 750 may include a volatile memory or a nonvolatile memory. The memory 750 may include a mass storage medium such as a hard disk to store various data.

FIG. 8 is a block diagram of a learning apparatus according to an embodiment. Referring to FIG. 8, a learning apparatus 800 according to an embodiment includes a communication interface 810, a processor 830, and a memory 850. The communication interface 810, the processor 830, and the memory 850 may communicate with one another through a communication bus 805.

The communication interface 810 receives reference speech signals corresponding to a plurality of emotions. The communication interface 810 may output the embedding vector that the processor 830 determines for each of the plurality of emotions.

The processor 830 trains a network that classifies each of the reference speech signals by emotion. Using the trained network, the processor 830 collects prosody features corresponding to the emotions of the reference speech signals. The processor 830 determines an embedding vector for each of the plurality of emotions based on the prosody features.

In addition, the processor 830 may perform at least one of the methods described above with reference to FIGS. 3 to 6, or an algorithm corresponding to at least one of those methods. The processor 830 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. Hardware-implemented data processing devices may include, for example, a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an Application-Specific Integrated Circuit (ASIC), and a Field Programmable Gate Array (FPGA).

The processor 830 may execute a program and control the learning apparatus 800. Program code executed by the processor 830 may be stored in the memory 850.

The memory 850 may store various information received through the communication interface 810. The memory 850 may also store various information generated during processing by the processor 830, as well as various data and programs. The memory 850 may include a volatile memory or a nonvolatile memory. The memory 850 may include a mass storage medium such as a hard disk to store various data.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and a person of ordinary skill in the field to which the present invention belongs can make various modifications and variations from this description.

Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims set out below and by their equivalents.

700: voice providing apparatus
705: communication bus
710: communication interface
730: processor
750: memory
770: speaker

Claims

A speech synthesis method comprising:
receiving a text sequence;
extracting a first feature vector from the text sequence;
generating a second feature vector by applying, to the first feature vector, an embedding vector learned in advance for each condition;
determining context information of the text from the second feature vector;
synthesizing speech corresponding to the text sequence based on the context information; and
providing the synthesized speech.

The method of claim 1,
wherein the condition comprises at least one of emotion, tone, and language conversion.

The method of claim 2,
wherein generating the second feature vector comprises at least one of:
generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of emotions;
generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of tones; and
generating the second feature vector by applying an embedding vector learned in advance for each of a plurality of languages for language conversion.

The method of claim 3,
wherein the plurality of emotions comprises at least one of happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation.

The method of claim 3,
wherein the plurality of tones comprises at least one of a voice tone and a speed change.

The method of claim 2, further comprising:
receiving a selection of a personalization condition from a user,
wherein extracting the first feature vector comprises:
determining a personalized text-to-speech (TTS) engine corresponding to the personalization condition; and
outputting the first feature vector by applying the text sequence to the personalized TTS engine.

The method of claim 6,
wherein the personalization condition comprises at least one of the user's age, the user's gender, and a voice frequency band preferred by the user.

The method of claim 2,
wherein synthesizing the speech comprises:
predicting an intonation, an intensity, a tone, and a frequency distribution corresponding to the text sequence based on the context information; and
synthesizing the speech using the predicted intonation, intensity, tone, and frequency distribution.

The method of claim 2, further comprising:
mixing speaker-related noise with the synthesized speech; and
outputting the speech mixed with the noise.

A learning method comprising:
receiving reference speech signals (reference audios) corresponding to a plurality of emotions;
training a network that classifies each of the reference speech signals by the plurality of emotions;
collecting, using the trained network, prosody features corresponding to the emotions of the reference speech signals; and
determining an embedding vector for each of the plurality of emotions based on the prosody features.

The method of claim 10,
wherein the plurality of emotions comprises at least one of happiness, joy, sadness, depression, anger, pleasantness, liveliness, cheerfulness, and irritation.

The method of claim 10,
wherein collecting the prosody features comprises:
extracting prosody features common to a specific emotion of the reference speech signals, regardless of the speech content of the reference speech signals.

The method of claim 10, further comprising:
training a network that classifies each of the reference speech signals by a plurality of tones and a plurality of languages.

The method of claim 10, further comprising:
receiving a noise signal corresponding to an environment in which the reference speech signals were collected,
wherein training the network that classifies by the plurality of emotions comprises:
mixing the reference speech signals with the noise signal corresponding to the environment in which the reference speech signals were collected; and
training a network that classifies the mixed signal by the plurality of emotions and by the environment.

A computer program stored in a computer-readable recording medium to execute, in combination with hardware, the method of any one of claims 1 to 14.

A speech synthesis device comprising:
a communication interface configured to receive a text sequence;
a processor configured to extract a first feature vector from the text sequence, generate a second feature vector by applying, to the first feature vector, an embedding vector learned in advance for each condition, determine context information of the text from the second feature vector, and synthesize speech corresponding to the text sequence based on the context information; and
a speaker configured to output the synthesized speech.

A learning device comprising:
a communication interface configured to receive reference speech signals corresponding to a plurality of emotions; and
a processor configured to train a network that classifies each of the reference speech signals by the plurality of emotions, collect, using the trained network, prosody features corresponding to the emotions of the reference speech signals, and determine an embedding vector for each of the plurality of emotions based on the prosody features,
wherein the communication interface outputs the embedding vector.