KR20210131125A

KR20210131125A - Learning device and device for speaking rate controllable text-to-speech

Info

Publication number: KR20210131125A
Application number: KR1020200049525A
Authority: KR
Inventors: 배재성
Original assignee: 주식회사 엔씨소프트
Priority date: 2020-04-23
Filing date: 2020-04-23
Publication date: 2021-11-02

Abstract

A text-to-speech learning device capable of adjusting speech speed according to one embodiment of the present invention comprises at least one processor. The at least one processor receives text and speed information, generates first output speech data using a text-to-speech (TTS) model, and updates the TTS model using the generated first output speech data and the speech data that is stored in a speech database and paired with the text. The device can thereby be trained to adjust the speech speed of the speech contained in the output speech data.

Description

A text-to-speech learning device capable of controlling speech speed and a text-to-speech device capable of controlling speech speed

The following embodiments relate to a text-to-speech learning apparatus capable of controlling speech speed and to a text-to-speech apparatus capable of controlling speech speed.

Machine learning is a field of artificial intelligence that evolved from the study of pattern recognition and computational learning theory; it is concerned with developing algorithms and techniques that enable computers to learn.

The core of machine learning lies in representation and generalization. Representation is the evaluation of data, and generalization is the handling of data that has not yet been seen. This is also the domain of computational learning theory.

Deep learning is defined as a set of machine learning algorithms that attempt high-level abstraction through a combination of several nonlinear transformation techniques. Broadly, it can be described as a field of machine learning that teaches computers the human way of thinking.

Text-to-speech (TTS) is the realization of a human voice through a computer program; it makes it easy to produce speech for almost any word or sentence without a voice actor.

According to an embodiment of the present invention, a text-to-speech learning apparatus can be provided that can be trained to adjust the speech speed of the speech contained in output speech data.

According to another embodiment of the present invention, a text-to-speech learning apparatus can be provided that can be trained to adjust the utterance time of the speech contained in output speech data.

According to still another embodiment of the present invention, a text-to-speech learning apparatus can be provided that can be trained to adjust the speech speed of an entire sentence, or of some of its words, based on a single piece of speed information.

According to still another embodiment of the present invention, a text-to-speech apparatus can be provided that can generate output speech data whose speech speed has been adjusted.

According to still another embodiment of the present invention, a text-to-speech apparatus can be provided that can generate output speech data whose utterance time has been adjusted so that the utterance finishes within a preset time.

According to still another embodiment of the present invention, a text-to-speech apparatus can be provided that can adjust the speech speed of an entire sentence, or of some of its words, based on a single piece of speed information.

According to an embodiment of the present invention, a text-to-speech learning apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text and speed information, generates first output speech data using a text-to-speech (TTS) model, and updates the TTS model using the generated first output speech data and the speech data that is stored in a speech database and paired with the text.

According to another embodiment of the present invention, a text-to-speech learning apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text, speed information, and reference audio data, generates second output speech data using the text-to-speech (TTS) model, and updates the TTS model using the generated second output speech data and the speech data that is stored in a speech database and paired with the text.

According to still another embodiment of the present invention, a text-to-speech learning apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text, speed information, and speaker information (speaker lookup), generates third output speech data using the text-to-speech (TTS) model, and updates the TTS model using the generated third output speech data and the speech data that is stored in a speech database and paired with the text.

According to still another embodiment of the present invention, a text-to-speech learning apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text, speed information, reference audio data, and speaker information (speaker lookup), generates fourth output speech data using a text-to-speech (TTS) model, and updates the TTS model using the generated fourth output speech data and the speech data that is stored in a speech database and paired with the text.

In addition, the speed information may be adjustable.

In addition, the speed information may be generated based on the text and on the speech data that is stored in the speech database and paired with the text.

According to still another embodiment of the present invention, a text-to-speech apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text and speed information and generates output speech data using a trained text-to-speech (TTS) model.

In addition, the at least one processor may apply the speed information to only at least a part of the text when generating the output speech data using the trained text-to-speech (TTS) model.
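Applying the speed information to only part of the text can be pictured as building one speed value per grapheme or phoneme. The following is a minimal sketch, not the method of the embodiment: the function name `partial_speed`, the half-open span convention, and the idea of falling back to a default speed outside the span are all assumptions made for illustration.

```python
from typing import List, Tuple


def partial_speed(n_units: int, default_speed: float,
                  span: Tuple[int, int], span_speed: float) -> List[float]:
    """Return one speed value per text unit (hypothetical helper).

    Units inside the half-open span [start, end) get span_speed;
    every other unit keeps default_speed (an assumed fallback).
    """
    start, end = span
    if not 0 <= start <= end <= n_units:
        raise ValueError("span must lie inside the text")
    return [span_speed if start <= i < end else default_speed
            for i in range(n_units)]


# Speed up only units 2..4 of a 6-unit text.
per_unit = partial_speed(6, default_speed=5.0, span=(2, 5), span_speed=8.0)
```

A per-unit vector of this kind has the same length as the tiled speed information described later, so it could in principle be supplied in its place.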

According to still another embodiment of the present invention, a text-to-speech apparatus capable of controlling speech speed includes at least one processor, wherein the at least one processor receives text, determines, based on the text, a time range for the output speech data that can be generated using a trained text-to-speech (TTS) model, and generates output speech data using the trained TTS model based on the text and at least one piece of time information included in the determined time range.
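One way to read the time-range step is sketched below. The sketch rests on assumptions that the text does not state: that the trained model supports speeds between `s_min` and `s_max` (hypothetical parameters), that speed is measured in text units per second, and that a requested duration outside the range is clamped to the nearest bound.

```python
from typing import Tuple


def time_range(n_units: int, s_min: float, s_max: float) -> Tuple[float, float]:
    """Shortest and longest duration (seconds) under the assumed speed bounds."""
    if not 0 < s_min <= s_max:
        raise ValueError("need 0 < s_min <= s_max")
    return n_units / s_max, n_units / s_min


def speed_for_time(n_units: int, target_seconds: float,
                   s_min: float, s_max: float) -> float:
    """Pick a time inside the range and convert it to a speed value."""
    shortest, longest = time_range(n_units, s_min, s_max)
    chosen = min(max(target_seconds, shortest), longest)  # clamp (assumption)
    return n_units / chosen


low, high = time_range(20, s_min=5.0, s_max=20.0)
speed = speed_for_time(20, target_seconds=2.0, s_min=5.0, s_max=20.0)
```

Under these assumptions a 20-unit text can be spoken in between 1 and 4 seconds, and a 2-second target corresponds to 10 units per second.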

In addition, the at least one processor may post-process the generated output speech data.

According to an embodiment of the present invention, the model can be trained to adjust the speech speed of the speech contained in output speech data.

In addition, the model can be trained to adjust the utterance time of the speech contained in output speech data.

In addition, the model can be trained so that the speech speed of an entire sentence, or of some of its words, can be adjusted based on a single piece of speed information.

In addition, output speech data with an adjusted speech speed can be generated.

In addition, output speech data can be generated whose utterance time is adjusted so that the utterance finishes within a preset time.

In addition, the speech speed of an entire sentence, or of some of its words, can be adjusted based on a single piece of speed information.

FIG. 1 is a diagram illustrating a text-to-speech learning apparatus capable of controlling speech speed, according to an embodiment, training a style encoder and a text-to-speech model based on a speech database, text, speed information, reference audio data, and speaker information.
FIG. 2 is a flowchart illustrating a text-to-speech learning method capable of controlling speech speed according to an embodiment.
FIG. 3 is a diagram illustrating a text-to-speech apparatus capable of controlling speech speed, according to an embodiment, generating output speech data based on text and speed information.
FIG. 4 is a flowchart illustrating a text-to-speech method capable of controlling speech speed according to an embodiment.
FIG. 5 is a diagram illustrating a text-to-speech apparatus capable of controlling speech speed, according to an embodiment, generating output speech data based on text and time information.
FIG. 6 is a flowchart illustrating a text-to-speech method capable of controlling speech speed according to another embodiment.
FIG. 7 is a block diagram of an exemplary computer system for implementing an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed herein are given only to illustrate those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because the embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail herein. This is not intended, however, to limit the embodiments to the specific forms disclosed; they include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. By contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terminology used herein is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

As used herein, terms such as "comprise" or "have" specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined herein.

In the following description, the same reference symbols denote the same components, and unnecessary redundant descriptions and descriptions of well-known technology are omitted.

In embodiments of the present invention, "communication", "communication network", and "network" may be used with the same meaning. The three terms refer to wired and wireless short-range and wide-area data transmission and reception networks over which files can be exchanged among a user terminal, terminals of other users, and a download server.

Hereinafter, the present invention is described in detail by describing preferred embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a text-to-speech learning apparatus capable of controlling speech speed, according to an embodiment, training a style encoder and a text-to-speech model based on a speech database, text, speed information, reference audio data, and speaker information.

Referring to FIG. 1, the text-to-speech learning apparatus capable of controlling speech speed according to an embodiment includes a speech database 100, a style encoder 110, a text-to-speech model 120, and a loss calculator 130.

The speech database 100 stores speech data for training the text-to-speech model 120. The speech data may be stored as wave (wav) files, but the file format in which the speech data is stored is not limited thereto.

The speech database 100 also stores texts that transcribe the speech data. The texts may be in the form of text (txt) files, but the form of the texts is not limited thereto.

In the speech database 100, the speech data and the texts 101 that transcribe the speech data may be stored in pairs.
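The pairing of speech data with its text can be sketched as a small record type. The sketch below is illustrative only: `PairedUtterance` and `silent_wav` are hypothetical names, the audio is assumed to be WAV bytes held in memory, and the silent clip merely stands in for real recorded speech. It relies only on Python's standard `wave` module.

```python
import io
import wave
from dataclasses import dataclass


@dataclass
class PairedUtterance:
    """One database entry: a transcript and the WAV bytes it transcribes."""
    text: str
    wav_bytes: bytes

    def duration_seconds(self) -> float:
        # Duration = number of audio frames / frames per second.
        with wave.open(io.BytesIO(self.wav_bytes), "rb") as reader:
            return reader.getnframes() / reader.getframerate()


def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory as stand-in audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


entry = PairedUtterance(text="hello world", wav_bytes=silent_wav(1.5))
```

Keeping the text and audio in one record makes it straightforward to read off both the text length and the audio duration for the same utterance.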

The style encoder 110 may encode reference audio data 103 to generate data in a preset data form (e.g., a vector). In doing so, the style encoder 110 may encode the reference audio data 103 so that a feature (e.g., a style) contained in the reference audio data 103 is expressed. The reference audio data 103 may be in the form of a wave (wav) file, but the form of the reference audio data 103 is not limited thereto.

The style encoder 110 may encode the reference audio data 103 to fit a preset format (e.g., an n x m matrix).

The style encoder 110 may encode the reference audio data 103 to generate data arranged (embedded) to fit a preset format (e.g., an n x m matrix).

The style encoder 110 may include a deep neural network, but the configuration of the style encoder 110 is not limited thereto.

The text-to-speech learning apparatus capable of controlling speech speed may generate speed information 102 so that the entire spoken text can be controlled on the basis of a single piece of time information for the whole text rendered as speech. Here, the speed information 102 may be a measurement of how fast the speech is on average.

The text-to-speech learning apparatus capable of controlling speech speed may generate the speed information 102 using a per-sentence length information (speed) (Sample Average Speed) calculator (not shown). The Sample Average Speed calculator (not shown) may use [Equation 1] below to generate the speed information 102, S.

Here, T is the total length (duration) of the speech, and N is the number of graphemes (characters) or the number of phonemes in the text.

The text-to-speech learning apparatus capable of controlling speech speed obtains the total length (duration) of the speech from the speech data stored in the speech database 100, obtains the number of graphemes or the number of phonemes of the character string or pronunciation sequence from the text that transcribes the speech data, and may generate the speed information 102 based on the obtained duration and the obtained grapheme or phoneme count.
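The computation of the speed information from T and N can be sketched as follows. Because [Equation 1] itself is not reproduced in this text, the direction of the ratio is an assumption: the sketch takes the speed to be N / T (text units per second), and the source may define it differently. Excluding whitespace from the grapheme count is likewise an assumption.

```python
from typing import Sequence


def speed_info(duration_seconds: float, units: Sequence[str]) -> float:
    """Sentence-level average speed from T (seconds) and N (unit count).

    ASSUMPTION: speed = N / T, in units per second. The source's
    Equation 1 is not shown here and may use another form.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if not units:
        raise ValueError("text must contain at least one unit")
    return len(units) / duration_seconds


# Graphemes of a short utterance, whitespace excluded (an assumption).
graphemes = [ch for ch in "hello world" if not ch.isspace()]
s = speed_info(2.0, graphemes)  # 10 graphemes spoken in 2.0 seconds
```

The same function applies unchanged to a phoneme sequence, since only the count of units enters the calculation.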

The text-to-speech learning apparatus capable of controlling speech speed may generate data by replicating (tiling) the generated speed information 102 to fit a preset format (e.g., an n x 1 vector).
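Tiling a single speed value into an n x 1 vector can be sketched in a few lines. The helper name `tile_speed` and the nested-list representation of the column vector are assumptions chosen to keep the example dependency-free.

```python
from typing import List


def tile_speed(speed: float, n: int) -> List[List[float]]:
    """Repeat one scalar speed value n times as an n x 1 column."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return [[speed] for _ in range(n)]


tiled = tile_speed(5.0, 3)  # three rows, one column, all equal to 5.0
```

Choosing n equal to the number of graphemes or phonemes gives one copy of the sentence-level speed per text unit.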

The text-to-speech learning apparatus capable of controlling speech speed may obtain speaker information 104 for determining who the speaker is. There may be a plurality of speakers. The speaker information 104 may be data arranged (embedded) to fit a preset format (e.g., an n x m matrix).

The text-to-speech model 120 may include a deep neural network, but the configuration of the text-to-speech model 120 is not limited thereto.

Any network architecture for text-to-speech (TTS) may be used as the text-to-speech model 120.

According to an embodiment, the text-to-speech model 120 may generate output speech data 121 based on the text 101 and the speed information 102 tiled to match the number of graphemes or phonemes. Here, based on a plurality of pieces of speed information 102, the text-to-speech model 120 may generate a plurality of pieces of output speech data in which the text is uttered at different speeds.
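How the tiled speed information reaches the model is not specified here. One common way to condition a sequence model is to append the value to each per-unit input row, and the sketch below illustrates only that assumption. The `text_embeddings` values are made-up stand-ins for whatever per-grapheme or per-phoneme representation a TTS model would use.

```python
from typing import List


def condition_on_speed(text_embeddings: List[List[float]],
                       speed: float) -> List[List[float]]:
    """Append the same speed value to every per-unit embedding row.

    This is tiling followed by concatenation, an assumed conditioning scheme.
    """
    return [row + [speed] for row in text_embeddings]


# Hypothetical 3-unit text with 2-dimensional embeddings.
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# One conditioned input per requested speed; each would yield its own output.
inputs_by_speed = {s: condition_on_speed(text_embeddings, s) for s in (3.0, 5.0, 8.0)}
```

Feeding each conditioned input to the model separately corresponds to producing several outputs of the same text at different speeds.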

According to an embodiment, the text-to-speech model 120 may generate output speech data 121 based on the text 101, the speed information 102 tiled to match the number of graphemes or phonemes, and the data encoded and output by the style encoder 110. Here, based on a plurality of pieces of speed information 102, the text-to-speech model 120 may generate a plurality of pieces of output speech data in which the text is uttered at different speeds. The text-to-speech model 120 may also generate the output speech data 121 so that a feature (e.g., a style) contained in the reference audio data 103 is expressed.

According to an embodiment, the text-to-speech model 120 may generate output speech data 121 based on the text 101, the speed information 102 tiled to match the number of graphemes or phonemes, and the speaker information 104. Here, based on a plurality of pieces of speed information 102, the text-to-speech model 120 may generate a plurality of pieces of output speech data in which the text is uttered at different speeds. The text-to-speech model 120 may also generate the output speech data 121 in the voice of the speaker indicated by the speaker information 104.

According to an embodiment, the text-to-speech model 120 may generate output speech data 121 based on the text 101, the speed information 102 tiled to match the number of graphemes or phonemes, the data encoded and output by the style encoder 110, and the speaker information 104. Here, based on a plurality of pieces of speed information 102, the text-to-speech model 120 may generate a plurality of pieces of output speech data in which the text is uttered at different speeds. The text-to-speech model 120 may also generate the output speech data 121 in the voice of the speaker indicated by the speaker information 104 so that a feature (e.g., a style) contained in the reference audio data 103 is expressed.

The loss calculator 130 may calculate a loss value using the output speech data 121 and the speech data that is stored in the speech database 100 and paired with the text 101.

If the output speech data 121 is in the form of a speech signal (e.g., wav), the loss calculator 130 may select portions of the output speech data 121 and apply a Fourier transform to them or extract acoustic feature vectors from them.

If the speech data is in the form of a speech signal (e.g., wav), the loss calculator 130 may select portions of the speech data and apply a Fourier transform to them or extract acoustic feature vectors from them.

The loss calculator 130 may generate a two-dimensional matrix by combining the Fourier-transformed portions of the output speech data 121 or the acoustic feature vectors extracted from the portions of the output speech data 121.

The loss calculator 130 may generate a two-dimensional matrix by combining the Fourier-transformed portions of the speech data or the acoustic feature vectors extracted from the portions of the speech data.

The loss calculator 130 may calculate the loss value by computing the difference between the two-dimensional matrix generated from the output speech data 121 and the two-dimensional matrix generated from the speech data.
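The loss computation described above can be sketched end to end: cut each signal into portions, apply a Fourier transform to each portion, stack the results into a two-dimensional matrix, and take the difference between the two matrices. The frame length, hop size, use of magnitudes, and the mean absolute difference are all assumptions; the text says only that a difference is calculated. The sketch also assumes both signals have the same length.

```python
import cmath
import math
from typing import List


def frames(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Select overlapping portions of the signal."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def dft_magnitudes(frame: List[float]) -> List[float]:
    """Magnitude of the discrete Fourier transform for bins 0..n/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectrogram(signal: List[float], frame_len: int = 8, hop: int = 4) -> List[List[float]]:
    """Stack per-frame spectra into a 2D matrix (frames x frequency bins)."""
    return [dft_magnitudes(f) for f in frames(signal, frame_len, hop)]


def l1_loss(a: List[List[float]], b: List[List[float]]) -> float:
    """Mean absolute difference between two equally shaped 2D matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / sum(len(row) for row in a)


target = [math.sin(2 * math.pi * t / 8) for t in range(16)]  # stand-in for stored speech
output = [0.5 * x for x in target]                           # stand-in for model output
loss = l1_loss(spectrogram(output), spectrogram(target))
```

In practice, outputs generated at different speeds differ in length from the stored speech, so a real system would need some alignment or padding step that this sketch leaves out.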

The loss calculator 130 may use backpropagation 131 to update the deep neural network included in the text-to-speech model 120 or the style encoder 110.

The loss calculator 130 may update the deep neural networks included in the text-to-speech model 120 or the style encoder 110 sequentially or simultaneously.

The loss calculator 130 may repeatedly update the deep neural network included in the text-to-speech model 120 or the style encoder 110 until the loss value matches a loss value preset based on the speech database 100.
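The repeat-until-threshold update can be illustrated with a deliberately tiny stand-in. The sketch below replaces the deep neural network with a single weight `w`, replaces backpropagation with a hand-written gradient of the mean squared error, and stops once the loss is at or below a preset value. The learning rate, the threshold, the step cap, and the "at or below" stopping test are all assumptions.

```python
from typing import List, Tuple


def mse(w: float, pairs: List[Tuple[float, float]]) -> float:
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)


def train(pairs: List[Tuple[float, float]], lr: float = 0.1,
          target_loss: float = 1e-6, max_steps: int = 10_000) -> Tuple[float, float, int]:
    """Update w by gradient descent until the loss reaches the preset value."""
    w = 0.0
    for step in range(max_steps):
        loss = mse(w, pairs)
        if loss <= target_loss:        # preset loss value reached
            return w, loss, step
        # Gradient of the loss with respect to w (stand-in for backpropagation).
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w, mse(w, pairs), max_steps


weight, final_loss, steps = train([(1.0, 2.0), (2.0, 4.0)])
```

The step cap guards against a threshold that is never reached, a practical safeguard that the text does not mention.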

FIG. 2 is a flowchart illustrating a text-to-speech learning method capable of adjusting a speech speed according to an embodiment.

Referring to FIG. 2, the text-to-speech learning apparatus capable of adjusting the speech speed generates output speech data based on text and speed information (200).

In this case, the text-to-speech learning apparatus may additionally receive reference speech data or speaker information and generate the output speech data using a text-to-speech (TTS) model.

The TTS model may include a deep neural network, but the configuration of the TTS model is not limited thereto.

The text-to-speech learning apparatus capable of adjusting the speech speed calculates a loss value based on the output speech data and the speech data that is stored in the speech database and paired with the text (210).

In this case, if the output speech data is in the form of a speech signal (e.g., wav), the text-to-speech learning apparatus may select portions of the output speech data and either apply a Fourier transform to them or extract acoustic feature vectors from them.

Similarly, if the speech data is in the form of a speech signal (e.g., wav), the text-to-speech learning apparatus may select portions of the speech data and either apply a Fourier transform to them or extract acoustic feature vectors from them.

The text-to-speech learning apparatus may generate a two-dimensional matrix by combining the Fourier-transformed portions of the output speech data, or the acoustic feature vectors extracted from the portions of the output speech data.

The text-to-speech learning apparatus may also generate a two-dimensional matrix by combining the Fourier-transformed portions of the speech data, or the acoustic feature vectors extracted from the portions of the speech data.

The text-to-speech learning apparatus may then calculate the loss value as the difference between the two-dimensional matrix generated from the output speech data and the two-dimensional matrix generated from the speech data.

The text-to-speech learning apparatus capable of adjusting the speech speed repeatedly updates the deep neural network included in the apparatus until the loss value matches a loss value preset based on the speech database (220).

In this case, the text-to-speech learning apparatus may use backpropagation to repeatedly update at least one deep neural network included in the apparatus.
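Steps 200 to 220 form a generate, compare, update loop. The sketch below shows only that control flow under strong simplifying assumptions: a single scalar weight `w` stands in for the deep neural network, the loss is a mean squared error, the gradient is derived by hand rather than by a backpropagation library, and "matching the preset loss value" is interpreted as the loss falling to or below a threshold. The names `train`, `target_loss`, `lr`, and `max_steps` are hypothetical.

```python
def train(pairs, target_loss=1e-4, lr=0.1, max_steps=10_000):
    """Fit y = w * x to (x, y) pairs, stopping once the loss reaches target_loss."""
    w = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # Step 200: generate outputs with the current parameters.
        outputs = [w * x for x, _ in pairs]
        # Step 210: loss between the outputs and the paired targets.
        loss = sum((o - y) ** 2 for o, (_, y) in zip(outputs, pairs)) / len(pairs)
        if loss <= target_loss:
            break  # the preset loss value has been reached
        # Step 220: gradient of the loss with respect to w, then the update.
        grad = sum(2 * (o - y) * x for o, (x, y) in zip(outputs, pairs)) / len(pairs)
        w -= lr * grad
    return w, loss


# Toy data in which every target is exactly twice its input.
w, final_loss = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The `max_steps` bound is a safeguard added for the sketch, so that the loop terminates even if the threshold is never reached.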

FIG. 3 is a diagram illustrating how a text-to-speech apparatus capable of adjusting a speech speed generates output speech data based on text and speed information, according to an embodiment.

Referring to FIG. 3, the text-to-speech apparatus capable of adjusting the speech speed may receive text 301 and speed information 302 and generate output speech data 303 using a trained text-to-speech (TTS) model 310. In this case, the speed information 302 may indicate the average speed of the speech obtained when the text 301 is converted into speech. The text 301 may be a sentence or a word, but is not limited thereto. The text 301 may be in the form of a text (txt) file, but its form is not limited thereto.

The text-to-speech apparatus capable of adjusting the speech speed may include at least one of a keyboard, a mouse, a game control unit, an analog unit, and a touch screen as an input means for receiving the text 301 and the speed information 302.

The text-to-speech apparatus may obtain the text 301 from a communicatively connected external device (e.g., a server).

The text-to-speech apparatus may include a device (e.g., a USB port) for receiving the text 301.

The text-to-speech apparatus may generate data by replicating (tiling) the received speed information 302 to fit a preset format (e.g., an n x 1 vector).
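Tiling a single speed value into an n x 1 vector can be sketched as follows. The function name `tile_speed` is hypothetical, and the specification does not state what n corresponds to; the sketch simply takes n as a parameter.

```python
def tile_speed(speed, n):
    """Replicate one scalar speed value into an n x 1 column vector."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Each row holds a single element, so the result has shape (n, 1).
    return [[float(speed)] for _ in range(n)]


speed_vector = tile_speed(1.2, 3)
```

Every row carries the same value, which is what lets one scalar control the whole utterance.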

The text-to-speech apparatus capable of adjusting the speech speed may use a single piece of speed information 302 to control the utterance duration (length) or the speech speed of the speech included in the output speech data 303 generated from the text 301.

The text-to-speech apparatus according to an embodiment may generate an interface for applying the speed information 302 to the received text 301. In this case, the interface may display an internal interface (e.g., a scroll bar) for determining the speed information 302 for the entire text 301. The speed information 302 may be a variable value.

The text-to-speech apparatus according to an embodiment may generate an interface for applying the speed information 302 to the received text 301. In this case, the interface may display the received text 301 in units of syllables or words, and may display an internal interface (e.g., a scroll bar) for determining the speed information 302 for the entire displayed text 301. The speed information 302 may be a variable value.

The text-to-speech apparatus according to an embodiment may generate an interface for applying the speed information 302 to only at least a part of the received text 301. In this case, the interface may display a first internal interface (e.g., a scroll bar) for determining the speed information 302 for the entire text 301, may display the received text 301 in units of syllables or words, and may display a second internal interface (e.g., a scroll bar) for determining the speed information 302 for each of the displayed syllables or words. The speed information 302 may be a variable value.

However, the manner in which the speed information is reflected in the text-to-speech conversion described above is merely an embodiment, and the present invention is not limited by that embodiment.

The text-to-speech apparatus capable of adjusting the speech speed according to an embodiment may generate the output speech data 303 using the text-to-speech model 310, based on the text 301 and the speed information 302 determined through the internal interface (e.g., a scroll bar).

According to an embodiment, the text-to-speech apparatus may generate, based on the text 301 and the speed information 302, output speech data 303 in which the same speed information 302 is reflected across the entire text 301.

According to an embodiment, the text-to-speech apparatus may generate, based on the text 301 and the speed information 302, output speech data 303 in which different speed information 302 is reflected in only at least a part of the text 301.
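The two cases above, one speed for the entire text and a different speed for only part of it, can be represented as a per-word speed assignment. The sketch below is an assumption made for illustration: `assign_speeds` and its `overrides` mapping (word index to speed) are hypothetical names, and splitting on whitespace is used as a stand-in for word-unit segmentation.

```python
def assign_speeds(text, global_speed, overrides=None):
    """Return (word, speed) pairs; overrides maps a word index to its own speed."""
    overrides = overrides or {}
    words = text.split()
    for index in overrides:
        if not 0 <= index < len(words):
            raise IndexError(f"no word at index {index}")
    # Words without an override fall back to the speed set for the entire text.
    return [(word, overrides.get(i, global_speed)) for i, word in enumerate(words)]


uniform = assign_speeds("speak this sentence", 1.0)
partial = assign_speeds("speak this sentence", 1.0, {1: 0.5})
```

In `uniform` every word carries the global value, while in `partial` only the second word is slowed down.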

Referring again to FIG. 3, the text-to-speech apparatus capable of adjusting the speech speed according to an embodiment includes the text-to-speech model 310.

The text-to-speech model 310 may include a trained deep neural network, but the configuration of the text-to-speech model 310 is not limited thereto.

Any text-to-speech (TTS) network architecture may be used as the text-to-speech model 310.

FIG. 4 is a flowchart illustrating a text-to-speech method capable of adjusting a speech speed according to an embodiment.

Referring to FIG. 4, the text-to-speech apparatus capable of adjusting the speech speed receives text (400).

In this case, the text-to-speech apparatus may include at least one of a keyboard, a mouse, a game control unit, an analog unit, and a touch screen as an input means for receiving the text.

The text-to-speech apparatus may also display the received text.

The text-to-speech apparatus capable of adjusting the speech speed receives speed information (410).

In this case, the text-to-speech apparatus may include at least one of a keyboard, a mouse, a game control unit, an analog unit, and a touch screen as an input means for receiving the speed information.

The text-to-speech apparatus may generate an interface for applying the speed information to the received text. In this case, the interface may display an internal interface (e.g., a scroll bar) for determining the speed information for the entire text.

The text-to-speech apparatus may also generate an interface for applying the speed information to the received text in which the received text is displayed in units of syllables or words, together with an internal interface (e.g., a scroll bar) for determining the speed information for the entire displayed text.

The text-to-speech apparatus may also receive the speed information for the entire text through a first internal interface (e.g., a scroll bar), then display the received text in units of syllables or words and display a second internal interface (e.g., a scroll bar) for determining the speed information for each of the displayed syllables or words.

The text-to-speech apparatus capable of adjusting the speech speed generates output speech data based on the text and the speed information (420).

In this case, the text-to-speech apparatus may generate output speech data in which the same speed information is reflected across the entire text.

The text-to-speech apparatus may also generate output speech data in which the speed information is reflected in only at least a part of the text.

According to another embodiment, the text-to-speech apparatus may receive, through a first internal interface (e.g., a scroll bar), speed information for the entire text to be input, then receive the text, display the received text in units of syllables or words, and display a second internal interface (e.g., a scroll bar) for determining the speed information for each of the displayed syllables or words.

FIG. 5 is a diagram illustrating how a text-to-speech apparatus capable of adjusting a speech speed generates output speech data based on text and time information, according to an embodiment.

Referring to FIG. 5, the text-to-speech apparatus capable of adjusting the speech speed may receive text 501 and time information 503 and generate output speech data 504 using a trained text-to-speech (TTS) model 500. In this case, the text 501 may be a sentence or a word, but is not limited thereto. The text 501 may be in the form of a text (txt) file, but its form is not limited thereto.

The text-to-speech apparatus may include at least one of a keyboard, a mouse, a game control unit, an analog unit, and a touch screen as an input means for receiving the text 501 and the time information 503.

The text-to-speech apparatus may obtain the text 501 from a communicatively connected external device (e.g., a server).

The text-to-speech apparatus may include a device (e.g., a USB port) for receiving the text 501.

Referring again to FIG. 5, the text-to-speech apparatus capable of adjusting the speech speed according to an embodiment includes the text-to-speech model 500.

The text-to-speech model 500 may determine the average utterance duration (length) of the speech that would be included in the output speech data 504 when the text 501 is converted into the output speech data 504.

The text-to-speech apparatus may output a time range 502 based on the average utterance duration (length), determined by the text-to-speech model 500, of the speech included in the output speech data 504.

The text-to-speech apparatus may receive time information 503 that falls within the time range 502. In this case, the time information 503 may be entered in seconds, but the unit of time used to enter the time information 503 is not limited thereto.

If time information 503 outside the time range 502 is entered, the text-to-speech apparatus may invalidate the entered time information 503 and request that new time information 503 be entered.
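The range check described above can be sketched as follows. The specification says only that the time range is based on the average utterance duration; deriving it as the average plus or minus a fixed fraction is an assumption made here, as are the names `time_range`, `tolerance`, and `accept_time_info`. A list of entries stands in for repeated user input.

```python
def time_range(average_seconds, tolerance=0.5):
    """Derive an allowed (low, high) range around the average duration."""
    if average_seconds <= 0 or not 0 <= tolerance < 1:
        raise ValueError("average must be positive and tolerance in [0, 1)")
    return average_seconds * (1 - tolerance), average_seconds * (1 + tolerance)


def accept_time_info(entries, allowed):
    """Return the first entry inside the allowed range, or None if none qualifies."""
    low, high = allowed
    for seconds in entries:
        if low <= seconds <= high:
            return seconds
        # Out-of-range input is invalidated and the next entry is requested.
    return None


allowed = time_range(4.0)                                 # (2.0, 6.0)
accepted = accept_time_info([10.0, 1.0, 3.5], allowed)    # 3.5
```

The first two entries are rejected because they fall outside the range, and the third is accepted.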

The text-to-speech apparatus capable of adjusting the speech speed may generate data by replicating (tiling) the received time information 503 to fit a preset format (e.g., an n x 1 vector).

The text-to-speech apparatus may use a single piece of time information 503 to control the utterance duration (length) or the speech speed of the speech included in the output speech data 504 generated from the text 501.

Based on the text 501 and the time information 503, the text-to-speech model 500 may convert the text 501 into output speech data 504 in which the time information 503 is reflected.

The text-to-speech model 500 may include a trained deep neural network, but the configuration of the text-to-speech model 500 is not limited thereto.

Any text-to-speech (TTS) network architecture may be used as the text-to-speech model 500.

The text-to-speech apparatus may post-process the generated output speech data 504. In this case, the post-processing may use a signal-processing-based method, but the post-processing method is not limited thereto.

FIG. 6 is a flowchart illustrating a text-to-speech method capable of adjusting a speech speed according to another embodiment.

Referring to FIG. 6, the text-to-speech apparatus capable of adjusting the speech speed receives text (600).

In this case, the text-to-speech apparatus may include at least one of a keyboard, a mouse, a game control unit, an analog unit, and a touch screen as an input means for receiving the text.

The text-to-speech apparatus capable of adjusting the speech speed determines a time range (610).

In this case, the text-to-speech apparatus may determine the time range using a text-to-speech model.

The text-to-speech apparatus may determine the time range based on the average utterance duration (length) of the speech that would be included in the output speech data when the text-to-speech model included in the apparatus converts the text into the output speech data.

The text-to-speech apparatus may also output the determined time range.

The text-to-speech apparatus capable of adjusting the speech speed receives time information (620).

In this case, the text-to-speech apparatus may accept only time information that falls within the time range.

The text-to-speech apparatus may receive the time information in seconds.

If time information outside the time range is entered, the text-to-speech apparatus may invalidate the entered time information and request that new time information be entered.

The text-to-speech apparatus capable of adjusting the speech speed generates output speech data based on the text and at least one piece of time information that falls within the determined time range (630).

In this case, the text-to-speech apparatus may generate the output speech data by reflecting the time information in the text.

The text-to-speech apparatus capable of adjusting the speech speed post-processes the generated output speech data (650).

In this case, the post-processing may use a signal-processing-based method, but the post-processing method is not limited thereto.

The text-to-speech apparatus capable of adjusting the speech speed outputs the post-processed output speech data (660).

In this case, the speech data may be in the form of a wave (wav) file, but its form is not limited thereto.
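Steps 650 and 660 can be illustrated with a small sketch that post-processes a waveform and encodes it as WAV data. The specification does not name a particular post-processing method; peak normalization is used here purely as one example of signal-processing-based post-processing. The function names, the 16 kHz sample rate, and the 16-bit mono format are likewise assumptions.

```python
import io
import struct
import wave


def peak_normalize(samples, peak=0.9):
    """Scale float samples so that the largest magnitude equals peak."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    return [s * peak / largest for s in samples]


def to_wav_bytes(samples, sample_rate=16000):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV data."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# A short toy waveform standing in for the generated output speech data.
processed = peak_normalize([0.1, -0.2, 0.05])
wav_data = to_wav_bytes(processed)
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; passing a file path to `wave.open` instead would write the same data to disk.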

FIG. 7 is a block diagram of an exemplary computer system for implementing an embodiment of the present invention.

Referring to FIG. 7, an exemplary computer system for implementing an embodiment of the present invention includes a bus or other communication channel 701 for exchanging information, and a processor 702 is coupled to the bus 701 to process information.

The computer system 700 includes a main memory 703, which is a random access memory (RAM) or other dynamic storage device coupled to the bus 701, for storing information and instructions to be processed by the processor 702.

The main memory 703 may also be used to store temporary variables or other intermediate information during execution of instructions by the processor 702.

The computer system 700 may include a read only memory (ROM) and other static storage devices 704 coupled to the bus 701 for storing static information or instructions for the processor 702.

A mass storage device 705, such as a magnetic disk, a Zip disk, or an optical disk, and its corresponding drive may also be coupled to the computer system 700 for storing information and instructions.

The computer system 700 may be connected via the bus 701 to a display device 710, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying information to an end user.

A character input device, such as a keyboard 720, may be coupled to the bus 701 to communicate information and commands to the processor 702.

Another type of user input device is a cursor control device 730, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 702 and for controlling cursor movement on the display 710.

A communication device 740 is also coupled to the bus 701.

The communication device 740 may include a modem, a network interface card, or an interface device used to connect to Ethernet, token ring, or another type of physical attachment in order to support connection to a local area network or a wide area network. In this manner, the computer system 700 may be connected to a number of clients and servers via a conventional network infrastructure such as the Internet.

이상에서, 본 발명의 실시예를 구성하는 모든 구성 요소들이 하나로 결합되거나 결합되어 동작하는 것으로 설명되었다고 해서, 본 발명이 반드시 이러한 실시예에 한정되는 것은 아니다. 즉, 본 발명의 목적 범위 안에서라면, 그 모든 구성 요소들이 적어도 하나로 선택적으로 결합하여 동작할 수도 있다. In the above, even though it has been described that all components constituting the embodiment of the present invention are combined or operated in combination, the present invention is not necessarily limited to this embodiment. That is, within the scope of the object of the present invention, all the components may operate by selectively combining at least one.

또한, 그 모든 구성 요소들이 각각 하나의 독립적인 하드웨어로 구현될 수 있지만, 각 구성 요소들의 그 일부 또는 전부가 선택적으로 조합되어 하나 또는 복수 개의 하드웨어에서 조합된 일부 또는 전부의 기능을 수행하는 프로그램 모듈을 갖는 컴퓨터 프로그램으로서 구현될 수도 있다. 그 컴퓨터 프로그램을 구성하는 코드들 및 코드 세그먼트들은 본 발명의 기술 분야의 당업자에 의해 용이하게 추론될 수 있을 것이다. In addition, all of the components may be implemented as one independent hardware, but some or all of the components are selectively combined to perform some or all functions of the combined components in one or a plurality of hardware program modules It may be implemented as a computer program having Codes and code segments constituting the computer program can be easily deduced by those skilled in the art of the present invention.

이러한 컴퓨터 프로그램은 컴퓨터가 읽을 수 있는 저장매체(Computer Readable Media)에 저장되어 컴퓨터에 의하여 읽혀지고 실행됨으로써, 본 발명의 실시예를 구현할 수 있다. 컴퓨터 프로그램의 저장매체로서는 자기 기록매체, 광 기록매체, 등이 포함될 수 있다.Such a computer program is stored in a computer readable storage medium (Computer Readable Media), read and executed by the computer, thereby implementing the embodiment of the present invention. The storage medium of the computer program may include a magnetic recording medium, an optical recording medium, and the like.

또한, 이상에서 기재된 "포함하다", "구성하다" 또는 "가지다" 등의 용어는, 특별히 반대되는 기재가 없는 한, 해당 구성 요소가 내재될 수 있음을 의미하는 것이므로, 다른 구성 요소를 제외하는 것이 아니라 다른 구성 요소를 더 포함할 수 있는 것으로 해석되어야 한다. In addition, terms such as "comprises", "comprises" or "have" described above mean that the corresponding component may be embedded, unless otherwise stated, so that other components are excluded. Rather, it should be construed as being able to further include other components.

기술적이거나 과학적인 용어를 포함한 모든 용어들은, 다르게 정의되지 않는 한, 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자에 의해 일반적으로 이해되는 것과 동일한 의미를 가진다. 사전에 정의된 용어와 같이 일반적으로 사용되는 용어들은 관련 기술의 문맥 상의 의미와 일치하는 것으로 해석되어야 하며, 본 발명에서 명백하게 정의하지 않는 한, 이상적이거나 과도하게 형식적인 의미로 해석되지 않는다.All terms, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs, unless otherwise defined. Terms commonly used, such as those defined in the dictionary, should be interpreted as being consistent with the meaning of the context of the related art, and are not interpreted in an ideal or excessively formal meaning unless explicitly defined in the present invention.

The methods disclosed herein include one or more operations or steps for achieving the method described above. The method operations and/or steps may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of operations or steps is specified, the order and/or use of specific operations and/or steps may be modified without departing from the scope of the claims.

As used herein, a phrase referring to "at least one of" a list of items refers to any combination of those items, including single members. As an example, "at least one of: a, b, or c" is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c, or any other ordering of a, b, and c).
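
As an illustration only, the combinations covered by such a phrase can be enumerated with a short script. The sketch below assumes a cap on the number of members (`max_size`), which the passage itself does not impose, and it treats combinations as unordered, so it lists each grouping once rather than every ordering. The function name `at_least_one_of` is hypothetical.

```python
from itertools import combinations_with_replacement


def at_least_one_of(items, max_size):
    """List every non-empty combination of `items`, repeats allowed,
    with at most `max_size` members (the cap is an assumption of this sketch)."""
    combos = []
    for size in range(1, max_size + 1):
        # combinations_with_replacement yields unordered groupings such as (a, a, b)
        for combo in combinations_with_replacement(items, size):
            combos.append("-".join(combo))
    return combos


combos = at_least_one_of(["a", "b", "c"], max_size=3)
print(combos)
```

With three items and at most three members, the list contains the single members, the pairs, and the triples named in the passage, including repeated elements such as a-a-b.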

As used herein, the term "determining" encompasses a wide variety of operations. For example, "determining" may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database, or another data structure), ascertaining, and the like. "Determining" may also include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. "Determining" may further include resolving, selecting, choosing, establishing, and the like.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention.

Therefore, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100...voice database
110...style encoder
120...text-to-speech model
130...loss calculator
300...text-to-speech model
500...text-to-speech model
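
As a rough illustration of how the voice database (100), the text-to-speech model (120), and the loss calculator (130) listed above could interact during learning, the following minimal sketch runs a toy training loop. Everything in it is an assumption made for illustration: `ToyTTSModel` predicts only an utterance length in frames rather than audio, the squared-error loss and the plain gradient step stand in for whatever loss and optimizer an actual embodiment uses, the database entries are made up, and the style encoder (110) is omitted.

```python
from dataclasses import dataclass


@dataclass
class ToyTTSModel:
    """Hypothetical stand-in for the TTS model (120): predicts a length in frames."""
    frames_per_char: float = 1.0

    def generate(self, text: str, speed: float) -> float:
        # A higher speed value yields fewer output frames for the same text.
        return self.frames_per_char * len(text) / speed


def loss_fn(output_frames: float, target_frames: float) -> float:
    """Hypothetical stand-in for the loss calculator (130): squared error."""
    return (output_frames - target_frames) ** 2


def train_step(model, text, speed, target_frames, lr=1e-3):
    """Generate output from text and speed, compare with the paired data, update the model."""
    output = model.generate(text, speed)
    loss = loss_fn(output, target_frames)
    # d(loss)/d(frames_per_char) = 2 * (output - target) * len(text) / speed
    grad = 2.0 * (output - target_frames) * len(text) / speed
    model.frames_per_char -= lr * grad
    return loss


# Made-up voice database (100): each text is paired with the length of its recording.
voice_db = [("hello world", 55.0), ("speech speed", 60.0)]

model = ToyTTSModel()
losses = []
for epoch in range(200):
    total = 0.0
    for text, target_frames in voice_db:
        total += train_step(model, text, speed=1.0, target_frames=target_frames)
    losses.append(total)

print(model.frames_per_char, losses[0], losses[-1])
```

The loop mirrors the structure recited in the claims: the model receives text and speed information, produces output, and is updated from the difference between that output and the data paired with the same text in the database.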

Claims

1. A text-to-speech learning apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text and speed information and generates first output voice data using a text-to-speech (TTS) model, and
updates the text-to-speech (TTS) model using the generated first output voice data and voice data that is stored in a voice database and paired with the text.

2. A text-to-speech learning apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text, speed information, and reference audio data and generates second output voice data using a text-to-speech (TTS) model, and
updates the text-to-speech (TTS) model using the generated second output voice data and voice data that is stored in a voice database and paired with the text.

3. A text-to-speech learning apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text, speed information, and a speaker lookup and generates third output voice data using a text-to-speech (TTS) model, and
updates the text-to-speech (TTS) model using the generated third output voice data and voice data that is stored in a voice database and paired with the text.

4. A text-to-speech learning apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text, speed information, reference audio data, and a speaker lookup and generates fourth output voice data using a text-to-speech (TTS) model, and
updates the text-to-speech (TTS) model using the generated fourth output voice data and voice data that is stored in a voice database and paired with the text.

5. The apparatus according to any one of claims 1 to 4,
wherein the speed information is adjustable.

6. The apparatus according to any one of claims 1 to 4,
wherein the speed information is
generated based on the text and the voice data that is stored in the voice database and paired with the text.

7. A text-to-speech apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text and speed information and generates output voice data using a trained text-to-speech (TTS) model.

8. The apparatus of claim 7,
wherein the at least one processor
applies the speed information to at least a part of the text and generates the output voice data using the trained text-to-speech (TTS) model.

9. A text-to-speech apparatus capable of controlling speech speed, comprising:
at least one processor,
wherein the at least one processor
receives text and determines, based on the text, a time range of output voice data that can be generated using a trained text-to-speech (TTS) model, and
generates output voice data using the trained text-to-speech (TTS) model based on the text and at least one piece of time information included in the determined time range.

10. The apparatus of claim 9,
wherein the at least one processor
post-processes the generated output voice data.
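
For illustration only, the idea of determining a time range from the text and then selecting time information inside that range can be sketched as follows. The rate bounds (`min_rate`, `max_rate`), the baseline `base_rate`, the use of characters per second as the unit, and both function names are assumptions of this sketch; the claims do not specify how the range is computed or how time information maps to speed information. Post-processing of the output is not shown.

```python
def time_range(text, min_rate=5.0, max_rate=20.0):
    """Return (shortest, longest) duration in seconds for `text`,
    assuming hypothetical speaking-rate bounds in characters per second."""
    n = len(text)
    return n / max_rate, n / min_rate


def speed_for_duration(text, requested_seconds, base_rate=10.0):
    """Clamp a requested duration into the time range and convert it to a
    speed value relative to a hypothetical baseline rate (1.0 = baseline)."""
    shortest, longest = time_range(text)
    seconds = min(max(requested_seconds, shortest), longest)
    rate = len(text) / seconds  # characters per second at the chosen duration
    return seconds, rate / base_rate


# A 3-second request for an 11-character text exceeds the longest allowed
# duration under these assumed bounds, so it is clamped before conversion.
seconds, speed = speed_for_duration("hello world", requested_seconds=3.0)
print(seconds, speed)
```

The resulting speed value could then be passed, together with the text, to a trained model in the manner of the `speed` input described in the claims.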