KR20210027016A

KR20210027016A - Voice synthesis method and apparatus thereof

Info

Publication number: KR20210027016A
Application number: KR1020200009391A
Authority: KR
Inventors: 최승도; 민경보; 박상준; 주기현
Original assignee: 삼성전자주식회사
Priority date: 2019-08-30
Filing date: 2020-01-23
Publication date: 2021-03-10

Abstract

Provided are a voice synthesis method and a device thereof. The method for an electronic device to synthesize a voice from a text includes the following operations of: obtaining a text inputted into the electronic device; obtaining a text representation by encoding the text; obtaining an audio representation of a first audio frame set; obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set; obtaining an audio feature of the second audio frame set through decoding on the audio representation of the second audio frame set; creating feedback information based on the audio feature of the second audio frame set; and synthesizing a voice based on at least one of an audio feature of the first audio frame set and the audio feature of the second audio frame set. Therefore, the present invention is capable of synthesizing a voice corresponding to an inputted text by obtaining a current audio frame based on feedback information.

Description

Voice synthesis method and apparatus {VOICE SYNTHESIS METHOD AND APPARATUS THEREOF}

The present invention relates to a method and apparatus for synthesizing speech.

Recently, various electronic devices have been equipped with a text-to-speech (TTS) function that synthesizes text into speech and outputs it. To provide the TTS function, an electronic device may use a predetermined TTS model that includes the phonemes of a text and the voice data corresponding to those phonemes.

Recently, speech synthesis methods based on artificial neural networks (for example, deep neural networks) that use end-to-end learning have been actively studied, and speech synthesized by such methods contains far more natural voice features than speech produced by conventional methods. Accordingly, the design of a new artificial neural network structure that reduces the amount of computation required for speech synthesis is needed.

An object of the present disclosure is to provide a speech synthesis method and apparatus capable of synthesizing speech corresponding to an input text by obtaining a current audio frame using feedback information that includes information on the energy of a previous audio frame.

The objects of the present invention are not limited to the object mentioned above, and other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be readily apparent that the objects and advantages of the present invention can be realized by the means set forth in the claims and combinations thereof.

As a technical means for achieving the above-described technical problem, a first aspect of the present disclosure may provide a method by which an electronic device synthesizes speech from text, the method including: obtaining text input to the electronic device; obtaining a text representation by encoding the text using a text encoder of the electronic device; obtaining an audio representation of a first audio frame set from an audio encoder of the electronic device; obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set; obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set; generating feedback information based on the audio feature of the second audio frame set; and synthesizing speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.

In addition, as a technical means for achieving the above-described technical problem, a second aspect of the present disclosure may provide an electronic device for synthesizing speech from text, the electronic device including at least one processor, wherein the at least one processor obtains text input to the electronic device, obtains a text representation by encoding the text, obtains an audio representation of a first audio frame set, obtains an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set, obtains an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set, generates feedback information based on the audio feature of the second audio frame set, and synthesizes speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.

In addition, a third aspect of the present disclosure may provide a computer-readable recording medium on which is recorded a program for executing, on a computer, a method of synthesizing speech from text, the method including: obtaining text input to an electronic device; obtaining a text representation by encoding the text; obtaining an audio representation of a first audio frame set; obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set; obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set; generating feedback information based on the audio feature of the second audio frame set; and synthesizing speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.

According to the present disclosure, it is possible to provide a speech synthesis method and apparatus capable of synthesizing speech corresponding to an input text by obtaining a current audio frame using feedback information that includes information on the energy of a previous audio frame.

FIG. 1A is a diagram conceptually illustrating a method by which an electronic device synthesizes speech from text, according to some embodiments.
FIG. 1B is a diagram conceptually illustrating a method by which an electronic device outputs audio frames from text in the time domain and generates feedback information from the output audio frames, according to some embodiments.
FIG. 2 is a flowchart illustrating, in order, a method by which an electronic device synthesizes speech from text using a speech synthesis model, according to some embodiments.
FIG. 3 is a diagram illustrating a process in which an electronic device synthesizes speech from text using a speech learning model, according to some embodiments.
FIG. 4 is a diagram illustrating a method by which an electronic device trains a speech synthesis model, according to some embodiments.
FIG. 5 is a diagram illustrating a method by which an electronic device generates feedback information, according to some embodiments.
FIG. 6 is a diagram illustrating a method by which an electronic device generates feedback information, according to some embodiments.
FIG. 7 is a diagram illustrating a method by which an electronic device synthesizes speech using a speech synthesis model including a convolutional neural network, according to some embodiments.
FIG. 8 is a diagram illustrating a method by which an electronic device synthesizes speech using a speech synthesis model including a recurrent neural network (RNN), according to some embodiments.
FIG. 9 is a block diagram illustrating a configuration of an electronic device according to some embodiments.
FIG. 10 is a block diagram illustrating a configuration of a server according to some embodiments.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily implement them. However, the present disclosure may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present disclosure clearly, and similar reference numerals are assigned to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed therebetween. In addition, when a part is said to "include" a certain component, this means that it may further include other components rather than excluding them, unless specifically stated to the contrary.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1A is a diagram conceptually illustrating a method by which an electronic device synthesizes speech from text, according to some embodiments.

FIG. 1B is a diagram conceptually illustrating a method by which an electronic device outputs audio frames from text in the time domain and generates feedback information from the output audio frames, according to some embodiments.

Referring to FIG. 1A, an electronic device according to some embodiments may synthesize speech 103 from text 101 using a speech synthesis model that includes a text encoder 111, an audio encoder 113, an audio decoder 115, and a vocoder 117.

The text encoder 111 may obtain a text representation by encoding the input text 101.

The text representation is encoded information obtained by encoding the input text, and may include information on a unique vector sequence corresponding to each character in the text.

For example, the text encoder 111 may obtain an embedding sequence for the characters in the text 101 and encode the obtained embedding sequence, thereby obtaining a text representation that includes information on the unique vector sequence corresponding to each character in the text 101.
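The embedding step described above can be illustrated with a minimal sketch. It assumes a fixed lookup table with one vector per character; the function names, the vector dimension, and the pseudo-random initialization are hypothetical stand-ins for learned parameters, and the subsequent encoding by a neural network is not shown.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Assign each character a fixed vector (hypothetical stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in vocab}

def embed_text(text, table):
    """Return the embedding sequence: one vector per character, in text order."""
    return [table[ch] for ch in text]

text = "hello"
table = build_embedding_table(sorted(set(text)), dim=4)
embedding_sequence = embed_text(text, table)  # five vectors of dimension 4
```

Identical characters map to the same vector, which is what makes the sequence "unique" per character rather than per position.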

The text encoder 111 may be, for example, a module including at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a long short-term memory (LSTM) network, but is not limited thereto.

The audio encoder 113 may obtain an audio representation of the first audio frame set.

The audio representation is encoded information obtained based on the text representation in order to synthesize text into speech. The audio representation may be converted into an audio feature by decoding it with the audio decoder 115.

The audio feature is information including a plurality of components whose spectral distributions over frequency differ from one another, and may be information used directly for speech synthesis by the vocoder 117 described later. The audio feature may include, for example, information on at least one of a spectrum, a mel-spectrum, a cepstrum, or mel-frequency cepstral coefficients (MFCCs), but is not limited thereto.

Meanwhile, the first audio frame set (FS_1, see FIG. 1B) may include, among the audio frames generated from the text representation, that is, among all the audio frames used for speech synthesis, the audio frames (f_1, f_2, f_3, f_4, see FIG. 1B) whose audio features have already been obtained through the audio decoder 115 described later.

The audio encoder 113 may be, for example, a module including at least one of a convolutional neural network, a recurrent neural network, or an LSTM, but is not limited thereto.

The audio decoder 115 may obtain the audio representation of the second audio frame set FS_2 based on the text representation and the audio representation of the first audio frame set FS_1.

An electronic device according to some embodiments may obtain an audio feature for each of a plurality of audio frames f_1 to f_8 in order to synthesize speech from text using the speech synthesis model.

Referring also to FIG. 1B, the electronic device may obtain audio features for the plurality of audio frames f_1 to f_8 output from the audio decoder 115 in the time domain. In an embodiment, the electronic device does not obtain an audio feature for each of the audio frames f_1 to f_8 individually; instead, it forms frame sets FS_1 and FS_2, each including a preset number of audio frames among all the frames f_1 to f_8, and obtains audio features per frame set. The preset number may be, for example, four.

In the embodiment shown in FIG. 1B, the first audio frame set FS_1 may include the first audio frame f_1 to the fourth audio frame f_4, and the second audio frame set FS_2 may include the fifth audio frame f_5 to the eighth audio frame f_8. The second audio frame set FS_2 may include the frames f_5 to f_8 that succeed the first audio frame set FS_1 in the time domain.
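The grouping of frames into frame sets can be sketched as follows. This is a minimal illustration that assumes frames are simply partitioned in time order; the function name and the string labels standing in for frames are hypothetical.

```python
def split_into_frame_sets(frames, frames_per_set=4):
    """Partition time-ordered frames into consecutive frame sets of a preset size."""
    return [frames[i:i + frames_per_set]
            for i in range(0, len(frames), frames_per_set)]

frames = [f"f_{i}" for i in range(1, 9)]   # f_1 ... f_8, as in FIG. 1B
frame_sets = split_into_frame_sets(frames)
fs_1, fs_2 = frame_sets                    # FS_1 = f_1..f_4, FS_2 = f_5..f_8
```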

The electronic device may extract feature information on any one of the first audio frame f_1 to the fourth audio frame f_4 included in the first audio frame set FS_1, and may extract compression information from at least one of the audio frames. In the embodiment shown in FIG. 1B, the electronic device may extract audio feature information F0 of the first audio frame f_1 and may extract compression information E0, E1, E2, and E3 from the first audio frame f_1 to the fourth audio frame f_4, respectively. The compression information E0, E1, E2, and E3 may include, for example, information on at least one of the magnitude of the amplitude of the audio signal corresponding to the audio frame, the magnitude of the root mean square (RMS) of the amplitude of the audio signal, or the magnitude of the peak value of the audio signal.

In an embodiment, the electronic device may generate feedback information for obtaining the audio feature information of the second audio frame set FS_2 by combining the audio feature information F0 of the first audio frame f_1, among the plurality of audio frames f_1 to f_4 included in the first audio frame set FS_1, with the compression information E0, E1, E2, and E3 for the first audio frame f_1 to the fourth audio frame f_4, respectively. Although the embodiment shown in FIG. 1B has been described with the electronic device obtaining the audio feature information F0 from the first audio frame f_1 of the first audio frame set FS_1, the disclosure is not limited thereto. In another embodiment, the electronic device may obtain the audio feature information from any one of the second audio frame f_2 to the fourth audio frame f_4, rather than from the first audio frame f_1, of the first audio frame set FS_1.
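A minimal sketch of this combination is shown below. It assumes the "combining" is a plain concatenation of the selected frame's feature vector with one scalar per frame, and it uses the RMS and peak measures named in the text; the function names, the concatenation itself, and the sample values are illustrative assumptions rather than the disclosed implementation.

```python
import math

def rms(samples):
    """Root mean square of the amplitude values of one frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def peak(samples):
    """Peak absolute amplitude of one frame."""
    return max(abs(s) for s in samples)

def make_feedback(frame_features, frame_samples, feature_index=0, energy_fn=rms):
    """Concatenate one frame's feature vector (F0) with per-frame scalars (E0..E3)."""
    selected = frame_features[feature_index]          # F0 from f_1 by default
    compressed = [energy_fn(s) for s in frame_samples]
    return list(selected) + compressed

# Hypothetical values for a four-frame set FS_1.
frame_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
frame_samples = [[3.0, -3.0, 3.0, -3.0], [1.0, -1.0], [0.0, 0.0], [2.0, -2.0, 2.0, -2.0]]
feedback = make_feedback(frame_features, frame_samples)
```

Passing `feature_index=1` through `3` corresponds to the alternative embodiment in which the feature vector is taken from f_2 to f_4, and `energy_fn=peak` swaps the RMS measure for the peak value.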

In an embodiment of the present disclosure, the feedback information generation period of the electronic device may correspond to the number of audio frames that the electronic device obtains from the text. In an embodiment, the feedback information generation period may be the length of the speech signal output through a preset number of audio frames. For example, when a speech signal 10 ms long is output through one audio frame, a speech signal of 40 ms may be output through four audio frames, and one piece of feedback information may be generated per 40 ms of output speech signal. That is, the feedback information generation period may be the length of the output speech signal corresponding to four audio frames. However, the disclosure is not limited thereto.

In an embodiment, the feedback information generation period may be determined based on speaker characteristics. For example, when the period for obtaining audio features for four audio frames is set as the feedback information generation period for a user with an average speech rate, the electronic device may set the period for obtaining audio features for six audio frames as the feedback information generation period for a user with a relatively slow speech rate. Conversely, the electronic device may set the period for obtaining audio features for two audio frames as the feedback information generation period for a user with a relatively fast speech rate. The speech rate may be judged based on, for example, the measured number of phonemes per unit time. The speech rate of each user may be stored in advance in a database, and the electronic device may refer to the database to determine the feedback information generation period according to the speech rate and may perform training using the determined feedback information generation period.
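The rule above, together with the 10 ms per frame example, can be sketched as follows. The frame counts 6, 4, and 2 and the 10 ms frame length come from the text; the average rate of 12 phonemes per second, the 20% tolerance band, and the in-memory dictionary standing in for the database are hypothetical assumptions chosen only for illustration.

```python
def frames_per_feedback(phonemes_per_second, average_rate=12.0, tolerance=0.2):
    """Map a speaker's speech rate to the number of frames per feedback period."""
    if phonemes_per_second < average_rate * (1.0 - tolerance):
        return 6   # relatively slow speaker: longer period
    if phonemes_per_second > average_rate * (1.0 + tolerance):
        return 2   # relatively fast speaker: shorter period
    return 4       # average speaker

def feedback_period_ms(frame_count, frame_length_ms=10):
    """Length of output speech covered by one piece of feedback information."""
    return frame_count * frame_length_ms

# Hypothetical per-user speech rates, standing in for the database.
speaker_rates = {"slow_user": 8.0, "average_user": 12.0, "fast_user": 16.0}
periods = {user: feedback_period_ms(frames_per_feedback(rate))
           for user, rate in speaker_rates.items()}
```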

The electronic device may change the feedback information generation period based on the type of text. In an embodiment, the electronic device may identify the type of text using a preprocessor 310 (see FIG. 3). The preprocessor 310 may include modules such as a grapheme-to-phoneme (G2P) module and a morphological analyzer, and may output a phoneme sequence or a grapheme sequence by performing preprocessing using at least one of the G2P module or the morphological analyzer. For example, when the text is "안녕하세요" ("Hello"), the electronic device may use the preprocessor 310 to separate the consonants and vowels, as in "ㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛ", and may check the order and frequency of the consonants and vowels. When the text is a vowel or silence, the electronic device may lengthen the feedback information generation period. For example, when the feedback information generation period is the length of the speech signal output through four audio frames and the text is a vowel or silence, the electronic device may change the feedback information generation period to the length of the output speech signal corresponding to six audio frames. As another example, when the type of text is a consonant or an unvoiced sound, the electronic device may shorten the feedback information generation period. For example, when the text is a consonant or an unvoiced sound, the electronic device may change the feedback information generation period to the length of the output speech signal corresponding to two audio frames.
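The separation of "안녕하세요" into consonants and vowels, and the type-dependent period, can be sketched as follows. The decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables; it is an illustrative stand-in, not the G2P module or morphological analyzer of the preprocessor 310. The function names and the type labels are hypothetical, and treating every initial or final jamo as a consonant is a simplification.

```python
from collections import Counter

# Jamo tables in Unicode composition order (initial, medial, final).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose_hangul(text):
    """Split each precomposed Hangul syllable into its jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                      # 19 * 21 * 28 syllables
            out.append(CHOSEONG[code // 588])      # 588 = 21 * 28
            out.append(JUNGSEONG[(code % 588) // 28])
            out.append(JONGSEONG[code % 28])       # empty string when no final
        else:
            out.append(ch)
    return "".join(out)

def jamo_type(jamo):
    """Classify a jamo as a vowel or a consonant (simplified)."""
    return "vowel" if jamo in JUNGSEONG else "consonant"

def frames_per_feedback_for_type(unit_type, default=4):
    """Frame counts from the text: 6 for vowel/silence, 2 for consonant/unvoiced."""
    return {"vowel": 6, "silence": 6, "consonant": 2, "unvoiced": 2}.get(unit_type, default)

jamo = decompose_hangul("안녕하세요")
counts = Counter(jamo_type(j) for j in jamo)
```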

As another example, the electronic device may output phonemes from the text through the preprocessor 310, convert each phoneme of the text into a phonetic symbol using a pre-stored pronunciation dictionary, estimate pronunciation information of the text from the phonetic symbols, and change the feedback information generation period based on the estimated pronunciation information.

In general, for a consonant or an unvoiced sound, the speech signal is short, so few audio frames correspond to it; for a vowel or silence, the speech signal is long, so many audio frames correspond to it. By flexibly changing the feedback information generation period according to the type of text, such as consonant, vowel, silence, or unvoiced sound, the electronic device of the present disclosure may improve the accuracy of obtaining attention information and thus improve speech synthesis performance. In addition, when relatively few audio frames are needed to output a speech signal, as with a consonant or an unvoiced sound, the amount of computation involved in obtaining audio feature information and feedback information from the audio frames may be reduced.

The audio decoder 115 may obtain the audio representation of the second audio frames based on the previously obtained audio representation of the first audio frames.

As described above, an electronic device according to some embodiments may reduce the amount of computation required to obtain audio features by obtaining the audio features for synthesizing speech in units of a plurality of audio frames.

The audio decoder 115 may obtain the audio features of the second audio frames by decoding the audio representation of the second audio frames.

The audio decoder 115 may be, for example, a module including at least one of a convolutional neural network, a recurrent neural network, or an LSTM, but is not limited thereto.

The vocoder 117 may synthesize the speech 103 based on the audio features obtained by the audio decoder 115.

For example, the vocoder 117 may synthesize the speech 103 corresponding to the text 101 based on at least one of the audio features of the first audio frames or the audio features of the second audio frames obtained by the audio decoder 115.

The vocoder 117 may synthesize the speech 103 from the audio features based on, for example, at least one of the WaveNet, Parallel WaveNet, WaveRNN, or LPCNet methods, but is not limited thereto.

When the audio feature of the second audio frame set FS_2 has been obtained, the audio encoder 113 may, for example, receive the audio feature of the second audio frame set FS_2 from the audio decoder 115 and, based on the text representation received from the text encoder 111 and the audio feature of the second audio frame set FS_2, obtain an audio representation of a third audio frame set that succeeds the second audio frame set FS_2.

Through this feedback loop of the speech learning model, the audio features of the audio frames constituting the speech to be synthesized may be obtained sequentially, from the audio features of the first audio frame set to the audio features of the last audio frame set.
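The feedback loop can be sketched in outline as follows. The loop structure (encode the previous result, decode the next frame set conditioned on the text representation, feed the result back) follows the description; the toy arithmetic functions standing in for the neural modules, the zero-valued initial feedback, and all names are hypothetical assumptions that only make the control flow runnable.

```python
def run_feedback_loop(text_repr, num_frame_sets, encode_audio,
                      decode_frame_set, make_feedback, initial_feedback):
    """Obtain audio features one frame set at a time, feeding each result back."""
    feedback = initial_feedback
    features = []
    for _ in range(num_frame_sets):
        audio_repr = encode_audio(feedback)                  # audio encoder 113
        frame_set = decode_frame_set(text_repr, audio_repr)  # audio decoder 115
        features.extend(frame_set)
        feedback = make_feedback(frame_set)                  # input to the next iteration
    return features

# Toy stand-ins for the neural modules (not the disclosed models).
def toy_encode_audio(feedback):
    return sum(feedback)

def toy_decode_frame_set(text_repr, audio_repr, frames_per_set=4):
    return [[text_repr + audio_repr + k] for k in range(frames_per_set)]

def toy_make_feedback(frame_set):
    return list(frame_set[0])

features = run_feedback_loop(1.0, 2, toy_encode_audio, toy_decode_frame_set,
                             toy_make_feedback, initial_feedback=[0.0])
```

Each iteration yields four frames at once, so eight frames require only two passes through the encoder and decoder.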

Meanwhile, in order to synthesize speech with natural prosody (rhythm), an electronic device according to some embodiments may, in the process of obtaining the audio representation of the second audio frame set FS_2, convert the previously obtained audio feature of the first audio frame set FS_1 into predetermined feedback information and use that, rather than using the audio feature as it is.

That is, the speech synthesis model may convert the text representation into an audio representation through the text encoder 111 and the audio decoder 115, thereby obtaining audio features for synthesizing speech corresponding to the text, and may synthesize speech by converting the obtained audio features.

Hereinafter, the speech synthesis method of the present disclosure is described in detail based on some embodiments in which the speech synthesis model used by the electronic device obtains and uses feedback information.

FIG. 2 is a flowchart sequentially illustrating a method by which an electronic device according to some embodiments synthesizes speech from text using a speech synthesis model.

A specific method by which the electronic device synthesizes speech from text using the speech synthesis model, and a specific method of training the speech synthesis model, are described later with reference to the embodiments of FIGS. 3 and 4.

Referring to FIG. 2, in step S201, the electronic device according to some embodiments may obtain input text.

In step S202, the electronic device may obtain a text representation by encoding the input text. In an embodiment, the electronic device may use the text encoder 111 (see FIG. 1A) to encode an embedding sequence for each character included in the input text, thereby obtaining a text representation that includes information about the unique vector sequence corresponding to each character in the text.

In step S203, the electronic device may obtain an audio representation of the first audio frame set.

In step S204, the electronic device may obtain an audio representation of the second audio frame set based on the text representation and the audio representation of the first audio frame set.

In step S205, the electronic device obtains the audio features of the second audio frame set from the audio representation of the second audio frame set. In an embodiment, the electronic device may decode the audio representation of the second audio frame set using the audio decoder 115 (see FIG. 1A) and obtain the audio features of the second audio frame set through the decoding. An audio feature is information that includes a plurality of components whose spectral distributions over frequency differ from one another. The audio features may include, for example, information about at least one of a spectrum, a mel-spectrum, a cepstrum, and MFCCs, but are not limited thereto.

In step S206, the electronic device may generate feedback information based on the audio features of the second audio frame set. Here, the feedback information may be information obtained from the audio features of the second audio frame set for use in obtaining the audio features of a third audio frame set that follows the second audio frame set.

The feedback information may include, for example, compression information about at least one audio frame included in the second audio frame set, in addition to information about the audio features of at least one audio frame of the second audio frame set.

The compression information about an audio frame may include information about the energy of that audio frame.

The compression information about an audio frame may include, for example, information about the total energy of the audio frame and the per-frequency energy of the audio frame. The energy of an audio frame may be a value related to the loudness of the sound corresponding to the audio features of that audio frame.

In an embodiment, when the audio feature of a specific audio frame is an 80-dimensional mel-spectrum, the audio frame M may be expressed in the form of the following equation.
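The equation itself is not preserved in this text. A plausible form, with the component symbols $a_1, \dots, a_{80}$ assumed purely for illustration, would be:

```latex
M = [\, a_1,\ a_2,\ a_3,\ \dots,\ a_{80} \,]
```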

In this case, the energy of the audio frame M may be obtained, for example, based on the following mean-of-mel-spectrum equation.
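The equation is missing here as well. Taking "mean of mel-spectrum" at face value, and writing the 80 components of M as $a_1, \dots, a_{80}$ and the energy as $E_{\text{mean}}$ (all symbols assumed), it would read:

```latex
E_{\text{mean}}(M) = \frac{1}{80} \sum_{i=1}^{80} a_i
```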

As another example, the energy of the audio frame M may be obtained based on the following root-mean-square (RMS) of mel-spectrum equation.
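This equation is also absent from the text. Under the standard definition of a root mean square, with the 80 components of M written as $a_1, \dots, a_{80}$ and the energy as $E_{\text{rms}}$ (symbols assumed), it would read:

```latex
E_{\text{rms}}(M) = \sqrt{\frac{1}{80} \sum_{i=1}^{80} a_i^{2}}
```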

Meanwhile, in another embodiment, when the audio feature of a specific audio frame is a 22-dimensional cepstrum, the audio frame C may be expressed in the form of the following equation.
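The equation is not preserved in this text. Since the description names the first cepstrum element b₁, a consistent form (the remaining symbols $b_2, \dots, b_{22}$ being assumed by analogy) would be:

```latex
C = [\, b_1,\ b_2,\ b_3,\ \dots,\ b_{22} \,]
```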

In this case, because the first element of a cepstrum correlates relatively strongly with the actual loudness of the sound, the energy of the audio frame C may be, for example, b₁, the first element of the cepstrum.
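The three energy definitions described above can be sketched as follows. This is a minimal illustration, assuming each frame is a plain NumPy vector; the function names and the toy frames are hypothetical and not taken from the disclosure.

```python
import numpy as np

def energy_mean(mel):
    """Energy as the mean of the mel-spectrum components of one frame."""
    return float(np.mean(mel))

def energy_rms(mel):
    """Energy as the root mean square of the mel-spectrum components."""
    return float(np.sqrt(np.mean(np.square(mel))))

def energy_cepstrum(cep):
    """Energy as the first cepstrum element (b1 in the text)."""
    return float(cep[0])

# Toy frames (assumed values): an 80-dimensional mel frame and a 22-dimensional cepstrum frame.
mel_frame = np.full(80, 2.0)
cep_frame = np.arange(1.0, 23.0)

e_mean = energy_mean(mel_frame)
e_rms = energy_rms(mel_frame)
e_cep = energy_cepstrum(cep_frame)
```

Each function reduces a whole frame to a single scalar, which is what makes the value usable as compact per-frame information.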

The compression information about an audio frame may include, for example, information about at least one of the magnitude of the amplitude of the audio signal corresponding to the audio frame, the magnitude of the root mean square (RMS) of the amplitude of the audio signal, or the magnitude of the peak value of the audio signal.

The electronic device may generate the feedback information by, for example, combining information about the audio features of at least one audio frame included in the second audio frame set with compression information about each of at least one audio frame included in the second audio frame set.

Steps S203 to S206 may be performed repeatedly for n audio frame sets. In an embodiment, the electronic device obtains the audio representation of the k-th audio frame set in step S203; obtains the audio representation of the (k+1)-th audio frame set in step S204, based on the text representation and the audio representation of the k-th audio frame set; obtains the audio features of the (k+1)-th audio frame set in step S205 by decoding the audio representation of the (k+1)-th audio frame set; and generates feedback information in step S206 based on the audio features of the (k+1)-th audio frame set. Here, k is an ordinal number over consecutive audio frame sets and may take the values 1, 2, 3, ..., n. When k+1 is less than or equal to n, the total number of audio frame sets, the electronic device may encode the feedback information of the (k+1)-th audio frame set using the audio encoder 314 (see FIG. 3) to obtain the audio representation of the (k+2)-th audio frame set that follows the (k+1)-th audio frame set. That is, while k+1 is less than or equal to n, the electronic device may repeat steps S203 to S206.
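The repetition of steps S203 to S206 can be sketched as a simple autoregressive loop. This is only an illustration under stated assumptions: the four callables stand in for the audio encoder, the attention-plus-decoder step, the decoding step, and the feedback information generator, and their names, the `initial_feedback` seed, and the toy arithmetic are all hypothetical.

```python
import numpy as np

def synthesize_features(text_repr, n_steps, encode_audio, predict_next,
                        decode, make_feedback, initial_feedback):
    """Run the S203-S206 loop n_steps times and collect the audio features."""
    feedback = initial_feedback
    features = []
    for _ in range(n_steps):
        audio_repr_k = encode_audio(feedback)                    # S203: k-th audio representation
        audio_repr_next = predict_next(text_repr, audio_repr_k)  # S204: (k+1)-th audio representation
        feat_next = decode(audio_repr_next)                      # S205: (k+1)-th audio features
        feedback = make_feedback(feat_next)                      # S206: feedback for the next pass
        features.append(feat_next)
    return features

# Toy stand-ins (assumptions): identity encoder/decoder/feedback, additive predictor.
identity = lambda x: x
feats = synthesize_features(
    text_repr=np.ones(2),
    n_steps=3,
    encode_audio=identity,
    predict_next=lambda text, audio: audio + text,
    decode=identity,
    make_feedback=identity,
    initial_feedback=np.zeros(2),
)
```

The point of the sketch is the data dependency: each pass consumes only the text representation and the feedback produced by the previous pass, so the frame sets must be generated in order.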

A specific method by which the electronic device generates the feedback information is described later with reference to the embodiments of FIGS. 5 and 6.

In step S207, the electronic device may synthesize speech based on at least one of the audio features of the first audio frame set or the audio features of the second audio frame set. Speech may be synthesized in step S207 once the audio features of the k-th audio frame set or of the (k+1)-th audio frame set are obtained, but the disclosure is not limited thereto. In an embodiment, the electronic device may repeat steps S203 to S206 until the audio features of the n-th audio frame set are obtained, and then synthesize speech based on at least one of the audio features of the (k+1)-th through n-th audio frame sets.

Although FIG. 2 shows step S207 as being performed sequentially after step S206, the disclosure is not limited thereto.

FIG. 3 is a diagram illustrating a process in which an electronic device according to some embodiments synthesizes speech from text using a speech learning model.

Referring to FIG. 3, the electronic device according to some embodiments may synthesize speech 303 from text 301 using a speech synthesis model that includes a preprocessor 310, a text encoder 311, a feedback information generator 312, an attention module 313, an audio encoder 314, an audio decoder 315, and a vocoder 317.

The preprocessor 310 may preprocess the text 301 so that the text encoder 311, described below, can obtain information about at least one of the pronunciation or the meaning of the text, which helps in learning the patterns contained in the input text 301.

Text in natural-language form may contain character strings that obscure the essential meaning of the text, such as typos and special characters. The preprocessor 310 may preprocess the text 301 in order to obtain, from the text 301, information about at least one of the pronunciation or the meaning of the text and to learn the patterns contained in the text.

The preprocessor 310 may include, for example, modules such as a grapheme-to-phoneme (G2P) converter and a morphological analyzer, and such modules may perform the preprocessing based on preset rules or a trained model. The output of the preprocessor 310 may be, for example, a phoneme sequence or a grapheme sequence, but is not limited thereto.
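As a rough illustration of rule-based preprocessing that yields a grapheme sequence, the sketch below strips special characters and normalizes whitespace. It is a hypothetical, English-only toy: the function name and the character rules are assumptions, and a real preprocessor for Korean text would rely on G2P conversion or morphological analysis as described above.

```python
import re

def preprocess(text):
    """Toy rule-based cleanup that returns a grapheme (character) sequence."""
    text = text.lower()
    # Replace anything other than letters, digits, apostrophes, and spaces.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse runs of whitespace left behind by the previous rule.
    text = re.sub(r"\s+", " ", text).strip()
    return list(text)

graphemes = preprocess("Hello,  World!!")
```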

The text encoder 311 may obtain a text representation by encoding the preprocessed text received from the preprocessor 310.

The audio encoder 314 may receive previously generated feedback information from the feedback information generator 312 and encode the received feedback information to obtain the audio representation of the first audio frame set.

The attention module 313 may obtain attention information for identifying the portion of the text representation that requires attention, based on at least part of the text representation received from the text encoder 311 and the audio representation of the first audio frame set received from the audio encoder 314.

Because a text representation containing information about a vector sequence of fixed size is generally used as the input sequence of a speech synthesis model, an attention mechanism may be used to learn the mapping between the input sequence and the output sequence of the speech synthesis model.

A speech synthesis model that uses an attention mechanism may refer back to the entire text input to the text encoder, that is, the text representation, at every time step at which the audio features needed for speech synthesis are obtained. Rather than referring to all parts of the text representation in equal proportion, the model focuses on the parts related to the audio features to be predicted at each time step, which can improve the efficiency and accuracy of speech synthesis.

The attention module 313 may, for example, identify the portion of the text representation that requires attention based on at least part of the text representation received from the text encoder 311 and the audio representation of the first audio frame set received from the audio encoder 314. The attention module 313 may generate attention information that includes information about the portion of the text representation that requires attention.

A specific method by which the electronic device obtains the audio representation of the second audio frame set based on the text representation and the attention information is described later with reference to the embodiments of FIGS. 7 and 8.

The audio decoder 315 may generate the audio representation of the second audio frame set, which follows the first audio frame set, based on the attention information received from the attention module 313 and the audio representation of the first audio frame set received from the audio encoder 314.

The audio decoder 315 may obtain the audio features of the second audio frame set by decoding the generated audio representation of the second audio frame set.

The vocoder 317 may synthesize the speech 303 corresponding to the text 301 by converting at least one of the audio features of the first audio frame set or the audio features of the second audio frame set received from the audio decoder 315.

Meanwhile, when the audio decoder 315 obtains the audio features of the second audio frame set, the feedback information generator 312 may receive the audio features of the second audio frame set from the audio decoder 315.

Based on the audio features of the second audio frame set received from the audio decoder 315, the feedback information generator 312 may obtain the feedback information used to obtain the audio features of a third audio frame set that follows the second audio frame set.

That is, based on the audio features of a previously obtained audio frame set received from the audio decoder 315, the feedback information generator 312 may obtain feedback information for obtaining the audio features of the audio frame set that follows the previously obtained audio frame set.

Through this feedback loop of the speech learning model, the audio features of the audio frames constituting the speech to be synthesized may be obtained sequentially, from those of the first audio frame set through those of the last audio frame set.

FIG. 4 is a diagram illustrating a method by which an electronic device according to some embodiments trains a speech synthesis model.

The speech synthesis model used by the electronic device according to some embodiments may be trained, for example, by receiving as training data a predetermined text and audio features obtained from an audio signal corresponding to the predetermined text, and by synthesizing speech corresponding to the input text.

Referring to FIG. 4, the speech synthesis model trained by the electronic device according to some embodiments may further include, in addition to the preprocessor 310, the text encoder 311, the feedback information generator 312, the attention module 313, the audio encoder 314, the audio decoder 315, and the vocoder 317, an audio feature extractor 411 for obtaining audio features from a target audio signal.

The audio feature extractor 411 may extract audio features for all audio frames constituting the input audio signal 400.

From the audio features for all audio frames of the audio signal 400 received from the audio feature extractor 411, the feedback information generator 312 may obtain the feedback information needed to obtain the audio features for all audio frames constituting the speech 403.

The audio encoder 314 may encode the feedback information received from the feedback information generator 312 to obtain an audio representation for all audio frames of the audio signal 400.

Meanwhile, the preprocessor 310 may preprocess the input text 401.

The attention module 313 may obtain attention information for identifying the portion of the text representation that requires attention, based on the text representation received from the text encoder 311 and the audio representation for all audio frames of the audio signal 400 received from the audio encoder 314.

The audio decoder 315 may obtain an audio representation for all audio frames constituting the speech 403, based on the attention information received from the attention module 313 and the audio representation for all audio frames of the audio signal 400 received from the audio encoder 314.

The audio decoder 315 may decode the audio representation for all audio frames constituting the speech 403 to obtain the audio features for all audio frames constituting the speech 403.

The vocoder 317 may synthesize the speech 403 corresponding to the text 401 based on the audio features, received from the audio decoder 315, for all audio frames constituting the speech 403.

The electronic device may train the speech synthesis model by comparing the audio features of the audio frames constituting the synthesized speech 403 with the audio features for all audio frames of the audio signal 400, and obtaining the weight parameters that minimize the loss between the two sets of audio features.
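The comparison between predicted and target audio features can be sketched as follows. The description does not name a loss function, so the L1 (mean absolute error) loss here is an assumption, as are the function names and the way target frames are split into consecutive frame sets to form (input set, next set) training pairs.

```python
import numpy as np

def frame_set_pairs(target_feats, set_size):
    """Split (T, D) target features into consecutive sets and pair each set with the next."""
    starts = range(0, len(target_feats) - set_size + 1, set_size)
    sets = [target_feats[i:i + set_size] for i in starts]
    return list(zip(sets[:-1], sets[1:]))

def l1_loss(predicted, target):
    """Mean absolute difference between predicted and target audio features (assumed loss)."""
    return float(np.mean(np.abs(predicted - target)))

# Toy target: 8 frames of 2-dimensional features, grouped into sets of 4 frames.
target = np.arange(16.0).reshape(8, 2)
pairs = frame_set_pairs(target, set_size=4)

# A trivial "prediction" that repeats the input set, scored against the true next set.
prev_set, next_set = pairs[0]
loss = l1_loss(prev_set, next_set)
```

In an actual training run the prediction would come from the model and the weight parameters would be updated to reduce this value; the sketch only shows how the two feature sets are lined up and scored.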

FIG. 5 is a diagram illustrating a method by which an electronic device according to some embodiments generates feedback information.

The speech synthesis model used by the electronic device according to some embodiments may include a feedback information generator for obtaining feedback information from audio features.

Based on the audio features of the first audio frame set obtained from the audio decoder, the feedback information generator may generate the feedback information used to obtain the audio features of the second audio frame set, which follows the first audio frame set.

The feedback information generator may, for example, obtain information about the audio features of at least one audio frame of the first audio frame set and, at the same time, obtain compression information about at least one audio frame of the first audio frame set.

The feedback information generator may generate the feedback information by, for example, combining the obtained information about the audio features of the at least one audio frame with the compression information about the at least one audio frame.

Referring to FIG. 5, the feedback information generator of the speech synthesis model according to some embodiments may extract (513), from the audio features 511, the information needed to generate the feedback information.

The feedback information generator may, for example, extract the information needed to generate the feedback information from the audio feature information (F₀, F₁, F₂, F₃) corresponding to each of the first through fourth audio frames 521, 522, 523, and 524 included in the first audio frame set 520.

The feedback information generator may, for example, obtain compression information (E₀, E₁, E₂, E₃) for each audio frame from the audio feature information (F₀, F₁, F₂, F₃) corresponding to each of the first through fourth audio frames 521, 522, 523, and 524 included in the first audio frame set 520.

The compression information may include, for example, information about at least one of the magnitude of the amplitude of the audio signal corresponding to each of the first through fourth audio frames 521, 522, 523, and 524, the magnitude of the root mean square of the amplitude of the audio signal, or the magnitude of the peak value of the audio signal.

The feedback information generator may generate (515) the feedback information 517 by combining the extracted (513) information with at least one of the pieces of audio feature information (F₀, F₁, F₂, F₃) corresponding to the first through fourth audio frames 521, 522, 523, and 524.

The feedback information generator may generate the feedback information by, for example, combining the audio feature information (F₀) corresponding to the first audio frame 521 with the compression information (E₀, E₁, E₂, E₃) for each of the first audio frame 521, the second audio frame 522, the third audio frame 523, and the fourth audio frame 524. However, the disclosure is not limited thereto, and in another embodiment the feedback information generator may obtain the audio feature information (F₀) from any one of the second audio frame 522, the third audio frame 523, and the fourth audio frame 524.

FIG. 6 is a diagram illustrating a method by which an electronic device according to some embodiments generates feedback information.

Referring to FIG. 6, the feedback information generator of the speech synthesis model used by the electronic device according to some embodiments may extract (613), from the audio features 611, the information needed to generate the feedback information.

The feedback information generator may, for example, extract the information needed to generate the feedback information from the audio feature information (F₀, F₁, F₂, F₃) corresponding to each of the first through fourth audio frames 621, 622, 623, and 624 included in the first audio frame set 620.

The feedback information generator may, for example, obtain compression information (E₁, E₂, E₃) for each audio frame from the audio feature information (F₁, F₂, F₃) corresponding to each of the second audio frame 622 through the fourth audio frame 624.

The compression information may include, for example, information about at least one of the magnitude of the amplitude of the audio signal corresponding to each of the second audio frame 622 through the fourth audio frame 624, the magnitude of the root mean square of the amplitude of the audio signal, or the magnitude of the peak value of the audio signal.

The feedback information generator may generate (615) the feedback information 617 by combining the extracted (613) information with at least one of the pieces of audio feature information (F₀, F₁, F₂, F₃) corresponding to the first audio frame set 620.

The feedback information generator may generate the feedback information by, for example, combining the audio feature information (F₀) corresponding to the first audio frame 621 with the compression information (E₁, E₂, E₃) for the second audio frame 622 through the fourth audio frame 624. However, the disclosure is not limited thereto, and in another embodiment the feedback information generator may obtain the audio feature information (F₀) from any one of the second audio frame 622, the third audio frame 623, and the fourth audio frame 624.

Comparing the feedback information obtained in the embodiments of FIGS. 5 and 6 shows that the feedback information obtained in the embodiment of FIG. 6 does not include the compression information (E₀) for the first audio frame 521.

That is, the speech synthesis model used by the electronic device of the present disclosure may generate feedback information by freely extracting compression information from the audio feature information of the first audio frame set 520, 620 and combining it.
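The two combinations described for FIGS. 5 and 6 can be sketched with a single helper. This is a minimal illustration, assuming 80-dimensional frame features, the mean as the per-frame energy, and simple concatenation as the way of combining; the function name, parameters, and toy data are hypothetical.

```python
import numpy as np

def make_feedback(frame_feats, feature_frame=0, energy_frames=(0, 1, 2, 3)):
    """Concatenate one frame's full features with per-frame energies of the chosen frames."""
    frame_feats = np.asarray(frame_feats, dtype=float)
    # One scalar energy per selected frame (mean over the feature dimension).
    energies = frame_feats[list(energy_frames)].mean(axis=1)
    return np.concatenate([frame_feats[feature_frame], energies])

# Toy frame set: 4 frames (F0..F3), each an 80-dimensional feature vector.
feats = np.arange(4 * 80, dtype=float).reshape(4, 80)

fb_fig5 = make_feedback(feats)                           # F0 + (E0, E1, E2, E3)
fb_fig6 = make_feedback(feats, energy_frames=(1, 2, 3))  # F0 + (E1, E2, E3)
```

Under these assumptions the FIG. 5 variant yields 84 values and the FIG. 6 variant 83, both far smaller than the 320 values of the four full frames.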

FIG. 7 is a diagram illustrating a method by which an electronic device according to some embodiments synthesizes speech using a speech synthesis model that includes a convolutional neural network.

Referring to FIG. 7, the electronic device according to some embodiments may synthesize speech from text using a speech synthesis model that includes a text encoder 711, a feedback information generator 712, an attention module 713, an audio encoder 714, an audio decoder 715, and a vocoder 717.

The text encoder 711 may encode the input text L to obtain a text representation K and a text representation V.

The text representation K may be a text representation used to generate attention information A, which is used to determine which part of the text representation an audio representation Q, described below, is related to.

Meanwhile, the text representation V may be a text representation used to obtain an audio representation R by identifying, based on the attention information A, the part of the text representation V that requires attention.

The text encoder 711 may include, for example, an embedding module for obtaining an embedding column for each character included in the text L, and a one-dimensional (1D) non-causal convolution layer.

Because the text encoder 711 can obtain context information from both the characters preceding and the characters following a given character in the text, it may use a 1D non-causal convolution layer.

When the embedding column for each character included in the text L is obtained through the embedding module of the text encoder 711 and the obtained embedding columns are input to the 1D non-causal convolution layer, the text representation K and the text representation V may be output as results of the same convolution operation on the embedding columns.

The feedback information generator 712 may, for example, generate feedback information F1 from the audio features of the first audio frame set 720, which includes four audio frames 721, 722, 723, and 724 and is obtained in advance through the audio decoder 715. The feedback information F1 is used to obtain the audio features of a second audio frame set including the four audio frames that follow the four audio frames 721, 722, 723, and 724.

For example, when the feedback information F1 to be generated is the initial feedback information that starts the feedback loop, the feedback information generator 712 may generate, from the audio features of four audio frames 721, 722, 723, and 724 each having a value of 0, the feedback information F1 used to obtain the audio features of the second audio frame set of four audio frames that follows them.

As described with reference to the embodiment of FIG. 5, the feedback information F1 according to some embodiments may be generated by combining the information F₀ on the audio feature corresponding to the first audio frame 721 with the compression information E₀, E₁, E₂, and E₃ on each of the first audio frame 721 to the fourth audio frame 724.
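The combination of F₀ with per-frame compression information can be sketched as follows. This is only an illustration: the function names `compress` and `make_feedback` are hypothetical, and reducing a frame to its mean value is an assumed stand-in for whatever compression the model actually applies.

```python
from typing import List, Sequence


def compress(frame: Sequence[float]) -> float:
    """Hypothetical compression: reduce a frame's feature vector to its mean."""
    return sum(frame) / len(frame)


def make_feedback(frames: Sequence[Sequence[float]],
                  feature_index: int = 0,
                  compressed_indices: Sequence[int] = (0, 1, 2, 3)) -> List[float]:
    """Concatenate one frame's full features (F0) with compressed values (E_i)."""
    feedback = list(frames[feature_index])
    feedback.extend(compress(frames[i]) for i in compressed_indices)
    return feedback


# A toy frame set of four frames, each with a two-dimensional audio feature.
frame_set = [
    [1.0, 3.0],
    [2.0, 4.0],
    [0.0, 2.0],
    [5.0, 5.0],
]

# F0 combined with E0..E3 (the combination described for FIG. 5 and FIG. 7).
f1 = make_feedback(frame_set)

# F0 combined with E1..E3 only (the combination described for FIG. 6).
f1_variant = make_feedback(frame_set, compressed_indices=(1, 2, 3))
```

The `compressed_indices` parameter reflects that the choice of which frames contribute compression information may differ between embodiments.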

The audio encoder 714 may obtain an audio representation Q1 corresponding to the four audio frames 721, 722, 723, and 724 based on the feedback information F1 received from the feedback information generator 712.

The audio encoder 714 may include, for example, a 1D causal convolution layer. Because the output of the audio decoder 715 may be fed back to the input of the audio encoder 714 during speech synthesis, the audio encoder 714 may use a 1D causal convolution layer so as not to use information on subsequent audio frames, that is, future information.
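The difference between the non-causal convolution of the text encoder and the causal convolution of the audio encoder can be illustrated with a minimal sketch. The functions below are illustrative assumptions rather than the layers of the disclosed model; they use zero padding and the cross-correlation form that is conventional for neural network layers.

```python
from typing import List


def conv1d_non_causal(x: List[float], kernel: List[float]) -> List[float]:
    """Centered 1D convolution: output t sees both past and future samples."""
    half = len(kernel) // 2
    padded = [0.0] * half + x + [0.0] * half
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(x))
    ]


def conv1d_causal(x: List[float], kernel: List[float]) -> List[float]:
    """Left-padded 1D convolution: output t sees only samples at or before t."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + x
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(x))
    ]


kernel = [1.0, 1.0, 1.0]
signal = [1.0, 2.0, 3.0, 4.0]
changed = [1.0, 2.0, 3.0, 100.0]  # only the last (future) sample differs

causal_a = conv1d_causal(signal, kernel)
causal_b = conv1d_causal(changed, kernel)
non_causal_a = conv1d_non_causal(signal, kernel)
non_causal_b = conv1d_non_causal(changed, kernel)
```

In this sketch, changing the last sample leaves the first three causal outputs unchanged, whereas the non-causal output at index 2 changes because it also reads the following sample.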

That is, the audio encoder 714 may, for example, obtain the audio representation Q1 corresponding to the four audio frames 721, 722, 723, and 724 as the result of a convolution operation based on the feedback information F1 received from the feedback information generator 712 and on the feedback information (for example, F0) previously generated for the audio frame set that temporally precedes the four audio frames 721, 722, 723, and 724.

The attention module 713 may obtain attention information A1 for identifying the part of the text representation V that requires attention, based on the text representation K received from the text encoder 711 and the audio representation Q1, corresponding to the first audio frame set 720, received from the audio encoder 714.

The attention module 713 may, for example, obtain the attention information A1 by computing a matrix product between the text representation K received from the text encoder 711 and the audio representation Q1, corresponding to the first audio frame set 720, received from the audio encoder 714.

In obtaining the attention information A1, the attention module 713 may, for example, refer to attention information A0 that was generated for the audio frame set temporally preceding the four audio frames 721, 722, 723, and 724.

The attention module 713 may obtain an audio representation R1 by identifying, based on the obtained attention information A1, the part of the text representation V that requires attention.

The attention module 713 may, for example, obtain weights from the attention information A1 and obtain the audio representation R1 by computing, based on the obtained weights, a weighted sum between the attention information A1 and the text representation V.

The attention module 713 may obtain an audio representation R1' by concatenating the audio representation R1 with the audio representation Q1 corresponding to the first audio frame set 720.
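The attention steps described above (matrix product, weights, weighted sum, concatenation) can be sketched as follows. The row-major layout (one row per text position in K and V, one row per audio frame in Q), the use of softmax to derive the weights, and all function names are assumptions made for illustration; the disclosure states only that weights are obtained from the attention information.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax (an assumed way to turn scores into weights)."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(K: Matrix, V: Matrix, Q: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Return (A, weights, R, R') for text representations K, V and audio representation Q."""
    A = matmul(Q, transpose(K))                  # attention information: frames x text positions
    weights = [softmax(row) for row in A]        # one weight distribution per audio frame
    R = matmul(weights, V)                       # weighted sum over text positions
    R_prime = [r + q for r, q in zip(R, Q)]      # concatenate R and Q along the feature axis
    return A, weights, R, R_prime


K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three text positions
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
Q = [[0.0, 0.0]]                            # one all-zero audio frame representation

A1, W1, R1, R1_prime = attend(K, V, Q)
```

With an all-zero Q, every text position receives the same score, so R1 is the plain average of the rows of V.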

The audio decoder 715 may decode the audio representation R1' received from the attention module 713 to obtain the audio features of the second audio frame set.

The audio decoder 715 may include, for example, a 1D causal convolution layer. Because the output of the audio decoder 715 may be fed back to the input of the audio encoder 714 during speech synthesis, the audio decoder 715 may use a 1D causal convolution layer so as not to use information on subsequent audio frames, that is, future information.

That is, the audio decoder 715 may, for example, obtain the audio features of the second audio frame set following the four audio frames 721, 722, 723, and 724 as the result of a convolution operation based on the audio representation R1 and the audio representation Q1 and on the audio representations (for example, the audio representation R0 and the audio representation Q0) previously generated for the audio frame set that temporally precedes the four audio frames 721, 722, 723, and 724.

The vocoder 717 may synthesize speech based on at least one of the audio features of the first audio frame set 720 or the audio features of the second audio frame set.

Meanwhile, when the audio features of the second audio frame set are obtained, the audio decoder 715 may pass the obtained audio features of the second audio frame set to the feedback information generator 712.

Based on the audio features of the second audio frame set, the feedback information generator 712 may generate feedback information F2 used to obtain the audio features of a third audio frame set that follows the second audio frame set. For example, the feedback information generator 712 may generate the feedback information F2 in the same manner as the feedback information F1 described above.

The feedback information generator 712 may pass the generated feedback information F2 to the audio encoder 714.

The audio encoder 714 may obtain an audio representation Q2 corresponding to the four audio frames of the second audio frame set based on the feedback information F2 received from the feedback information generator 712.

The audio encoder 714 may, for example, obtain the audio representation Q2 corresponding to the four audio frames of the second audio frame set as the result of a convolution operation based on the feedback information F2 received from the feedback information generator 712 and on the feedback information (for example, at least one of F0 and F1) previously generated for the audio frame sets that temporally precede those four audio frames.

The attention module 713 may obtain attention information A2 for identifying the part of the text representation V that requires attention, based on the text representation K received from the text encoder 711 and the audio representation Q2, corresponding to the second audio frame set, received from the audio encoder 714.

The attention module 713 may, for example, obtain the attention information A2 by computing a matrix product between the text representation K received from the text encoder 711 and the audio representation Q2, corresponding to the second audio frame set, received from the audio encoder 714.

In obtaining the attention information A2, the attention module 713 may, for example, refer to attention information (for example, the attention information A1) that was generated for the audio frame set temporally preceding the four audio frames of the second audio frame set.

The attention module 713 may obtain an audio representation R2 by identifying, based on the obtained attention information A2, the part of the text representation V that requires attention.

The attention module 713 may, for example, obtain weights from the attention information A2 and obtain the audio representation R2 by computing, based on the obtained weights, a weighted sum between the attention information A2 and the text representation V.

The attention module 713 may obtain an audio representation R2' by concatenating the audio representation R2 with the audio representation Q2 corresponding to the second audio frame set.

The audio decoder 715 may decode the audio representation R2' received from the attention module 713 to obtain the audio features of the third audio frame set.

The audio decoder 715 may, for example, obtain the audio features of the third audio frame set following the second audio frame set as the result of a convolution operation based on the audio representation R2 and the audio representation Q2 and on the audio representations (for example, at least one of the audio representations R0 and R1, and at least one of the audio representations Q0 and Q1) previously generated for the audio frame sets that temporally precede the four audio frames.

The vocoder 717 may synthesize speech based on at least one of the audio features of the first audio frame set 720 through the audio features of the third audio frame set.

The electronic device according to some embodiments may repeat the feedback loop described above, which was used to obtain the audio features of the first audio frame set 720 through the third audio frame set, until the features of all audio frame sets corresponding to the text L are obtained.

For example, when the electronic device determines that the attention information A generated by the attention module 713 corresponds to the last of the embedding columns for the characters included in the text L, it may determine that the features of all audio frame sets corresponding to the input text L have been obtained and end the feedback loop.
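The feedback loop and its attention-based stopping condition can be sketched as follows. Everything here is a simplified assumption: `synthesize_features`, `toy_step`, and the `max_steps` guard are hypothetical names, and `toy_step` merely stands in for one full pass through the feedback information generator, audio encoder, attention module, and audio decoder.

```python
from typing import Callable, List, Tuple

Frames = List[List[float]]
Step = Callable[[Frames, int], Tuple[Frames, List[float]]]

NUM_TEXT_POSITIONS = 3  # toy text with three characters


def synthesize_features(step: Step, num_text_positions: int,
                        frames_per_set: int = 4, feature_dim: int = 2,
                        max_steps: int = 1000) -> List[Frames]:
    """Run the feedback loop until attention peaks on the last text position."""
    # Initial frame set: all-zero audio features start the loop.
    current: Frames = [[0.0] * feature_dim for _ in range(frames_per_set)]
    outputs: List[Frames] = []
    for i in range(max_steps):  # safety bound (an assumption, not from the source)
        current, attention = step(current, i)
        outputs.append(current)
        # Index of the text position that receives the most attention.
        focus = max(range(num_text_positions), key=attention.__getitem__)
        if focus == num_text_positions - 1:
            break
    return outputs


def toy_step(prev: Frames, i: int) -> Tuple[Frames, List[float]]:
    """Stand-in model: bumps every feature by 1 and advances attention one position."""
    next_frames = [[v + 1.0 for v in frame] for frame in prev]
    peak = min(i, NUM_TEXT_POSITIONS - 1)
    attention = [1.0 if n == peak else 0.0 for n in range(NUM_TEXT_POSITIONS)]
    return next_frames, attention


frame_sets = synthesize_features(toy_step, NUM_TEXT_POSITIONS)
```

In this toy run the attention peak moves one text position per iteration, so the loop produces three frame sets before stopping.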

FIG. 8 is a diagram illustrating a method by which an electronic device synthesizes speech using a speech synthesis model including a recurrent neural network (RNN), according to some embodiments.

Referring to FIG. 8, an electronic device according to some embodiments may synthesize speech from text using a speech synthesis model including a text encoder 811, an attention module 813, an audio decoder 815, and a vocoder 817.

The text encoder 811 may encode the input text to obtain a text representation.

The text encoder 811 may include, for example, an embedding module for obtaining an embedding column for each character included in the text, and a pre-net module and a CBHG module for converting the embedding columns into the text representation.

When the embedding column for each character included in the text is obtained through the embedding module of the text encoder 811, the obtained embedding columns may be converted into the text representation by the pre-net module and the CBHG module.

The attention module 813 may obtain attention information for identifying the part of the text representation that requires attention, based on the text representation received from the text encoder 811 and the audio representation, corresponding to the first audio frame set, received from the audio decoder 815.

For example, when the feedback information to be generated is the initial feedback information that starts the feedback loop, the feedback information generator 812 may use a start audio frame (go frame) having a value of 0 as the first audio frame and generate the feedback information used to obtain the audio features of the second audio frame set.

When the feedback information is generated by the feedback information generator 812, the audio decoder 815 may obtain the audio representation of the first audio frame by encoding the audio feature of the first audio frame using a pre-net module and an attention RNN module.

When the audio representation of the first audio frame is obtained, the attention module 813 may generate attention information based on the text representation to which the previous attention information has been applied and on the audio representation of the first audio frame. The attention module 813 may obtain the audio representation of the second audio frame set 820 using the text representation and the generated attention information.

The audio decoder 815 may use a decoder RNN module to obtain the audio features of the second audio frame set 820 from the audio representation of the first audio frame and the audio representation of the second audio frame set 820.

The vocoder 817 may synthesize speech based on at least one of the audio features of the first audio frame set or the audio features of the second audio frame set 820.

Meanwhile, when the audio features of the second audio frame set 820 are obtained, the audio decoder 815 may pass the obtained audio features of the second audio frame set 820 to the feedback information generator 812.

The second audio frame set 820 may include a first audio frame 821 to a third audio frame 823. The feedback information according to some embodiments may be generated by combining the information F₀ on the audio feature corresponding to the first audio frame 821 with the compression information E₁ and E₂ on the second audio frame 822 and the third audio frame 823. Although FIG. 8 shows the second audio frame set 820 as including a total of three audio frames 821, 822, and 823, this is merely an example for convenience of description, and the number of audio frames is not limited to what is shown. For example, the second audio frame set 820 may include one, two, or four or more audio frames.

The feedback information generator 812 may pass the generated feedback information to the audio decoder 815.

On receiving the feedback information, the audio decoder 815 may obtain the audio representation of the second audio frame set 820 by encoding the audio features of the second audio frame set 820 using the pre-net module and the attention RNN module, based on the received feedback information and the previous feedback information.

When the audio representation of the second audio frame set 820 is obtained, the attention module 813 may generate attention information based on the text representation to which the previous attention information has been applied and on the audio representation of the second audio frame set 820. The attention module 813 may obtain the audio representation of the third audio frame set using the text representation and the generated attention information.

The audio decoder 815 may use the decoder RNN module to obtain the audio features of the third audio frame set from the audio representation of the second audio frame set 820 and the audio representation of the third audio frame set.

The vocoder 817 may synthesize speech based on at least one of the audio features of the first audio frame set through the audio features of the third audio frame set.

The electronic device according to some embodiments may repeat the feedback loop described above, which was used to obtain the features of the first audio frame set through the third audio frame set, until the features of all audio frame sets corresponding to the text are obtained.

For example, when the electronic device determines that the attention information generated by the attention module 813 corresponds to the last of the embedding columns for the characters included in the text, it may determine that the features of all audio frame sets corresponding to the input text have been obtained and end the feedback loop.

However, the disclosure is not limited thereto, and the electronic device may end the feedback loop using a separate neural network model trained in advance on when the feedback loop should stop repeating. In an embodiment, the electronic device may end the feedback loop using a separate neural network model trained to perform stop token prediction.
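A stop-token check of this kind can be sketched as follows. The logistic scorer, its weights and bias, the 0.5 threshold, and the function names are all hypothetical stand-ins for the separately trained neural network model, whose structure the disclosure does not specify.

```python
import math
from typing import Sequence


def stop_probability(state: Sequence[float], weights: Sequence[float],
                     bias: float) -> float:
    """Logistic score standing in for a trained stop-token predictor."""
    z = sum(w * s for w, s in zip(weights, state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def count_steps(states: Sequence[Sequence[float]], weights: Sequence[float],
                bias: float, threshold: float = 0.5) -> int:
    """Return how many loop iterations run before the predictor signals a stop."""
    for i, state in enumerate(states, start=1):
        if stop_probability(state, weights, bias) > threshold:
            return i
    return len(states)


# Toy decoder states, one per feedback loop iteration.
decoder_states = [[0.0], [1.0], [2.0], [3.0]]
steps = count_steps(decoder_states, weights=[2.0], bias=-3.0)
```

With these toy values the score first exceeds the threshold at the third state, so the loop would end after three iterations.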

FIG. 9 is a block diagram illustrating a configuration of an electronic device according to some embodiments.

Referring to FIG. 9, an electronic device 1000 according to some embodiments may include a processor 1001, a user input unit 1002, a communication unit 1003, a memory 1004, a microphone 1005, a speaker 1006, and a display 1007.

The user input unit 1002 may receive text to be used for speech synthesis.

The user input unit 1002 may include, for example, a key pad, a dome switch, a touch pad (of a contact capacitive type, a pressure resistive film type, an infrared sensing type, a surface ultrasonic conduction type, an integral tension measurement type, a piezo effect type, or the like), a jog wheel, a jog switch, and the like, but is not limited thereto.

The communication unit 1003 may include one or more communication modules for communicating with a server 2000. For example, the communication unit 1003 may include at least one of a short-range communication unit or a mobile communication unit.

The short-range wireless communication unit may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an infrared (IrDA, Infrared Data Association) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra wideband (UWB) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit transmits and receives wireless signals to and from at least one of a base station, an external terminal, or a server on a mobile communication network. Here, the wireless signals may include a voice call signal, a video call signal, or various types of data associated with the transmission and reception of text or multimedia messages.

The memory 1004 may store a speech synthesis model used to synthesize speech from text.

The speech synthesis model stored in the memory 1004 may include a plurality of modules classified by function. The speech synthesis model stored in the memory 1004 may include, for example, at least one of a preprocessor, a text encoder, an attention module, an audio encoder, an audio decoder, a feedback information generator, a vocoder, or an audio feature extractor.

The memory 1004 may store, for example, a program for controlling the operation of the electronic device 1000. The memory 1004 may include at least one instruction for controlling the operation of the electronic device 1000.

The memory 1004 may store, for example, information on the input text and the synthesized speech.

The memory 1004 may include, for example, at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk, but is not limited thereto.

The microphone 1005 may receive a user's voice. The voice input through the microphone 1005 when the user speaks may be converted into, for example, an audio signal used to train the speech synthesis model stored in the memory 1004.

The speaker 1006 may output the speech synthesized from the text as sound.

The speaker 1006 may output, as sound, signals related to functions performed by the electronic device 1000 (for example, a call signal reception sound, a message reception sound, and a notification sound).

The display 1007 may display and output information processed by the electronic device 1000.

The display 1007 may display, for example, an interface for showing the text used for speech synthesis and the speech synthesis result.

The display 1007 may display, for example, an interface for controlling the electronic device 1000, an interface for indicating the state of the electronic device 1000, and the like.

The processor 1001 may typically control the overall operation of the electronic device 1000. For example, by executing programs stored in the memory 1004, the processor 1001 may control the user input unit 1002, the communication unit 1003, the memory 1004, the microphone 1005, the speaker 1006, and the display 1007 as a whole.

When text is input, the processor 1001 may start a speech synthesis process by activating the speech synthesis model stored in the memory 1004.

The processor 1001 may obtain a text representation by encoding the text through the text encoder of the speech synthesis model.

Through the feedback information generator of the speech synthesis model, the processor 1001 may generate feedback information that is used to obtain, among the audio frames generated from the text representation, the audio features of a second audio frame set from the audio features of a first audio frame set. The second audio frame set may be, for example, an audio frame set including frames that follow the first audio frame set.

The feedback information may include, for example, information about the audio feature of at least one audio frame included in the first audio frame set and compression information about at least one audio frame included in the first audio frame set.

For example, through the feedback information generator of the speech synthesis model, the processor 1001 may obtain information about the audio feature of at least one audio frame included in the first audio frame set and compression information about at least one audio frame included in the first audio frame set, and may generate the feedback information by combining the obtained audio feature information with the obtained compression information.
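The combination of audio feature information and compression information described above may be illustrated with a minimal Python sketch. It is not the disclosed implementation. It assumes that the waveform samples of each frame are available, that compression information is the peak amplitude and the RMS amplitude of each frame (two of the quantities the claims list; pitch is omitted), that the audio feature of the last frame in the set is the one fed back, and that "combining" means concatenation. The names `compression_info` and `make_feedback` are hypothetical.

```python
import math
from typing import List, Sequence


def compression_info(frame_samples: Sequence[float]) -> List[float]:
    """Summarise one audio frame as [peak amplitude, RMS amplitude]."""
    if not frame_samples:
        return [0.0, 0.0]
    peak = max(abs(s) for s in frame_samples)
    rms = math.sqrt(sum(s * s for s in frame_samples) / len(frame_samples))
    return [peak, rms]


def make_feedback(
    frame_features: Sequence[Sequence[float]],
    frame_samples: Sequence[Sequence[float]],
) -> List[float]:
    """Build feedback information for one audio frame set.

    Assumption: the feature vector of the last frame is kept in full,
    and every frame in the set contributes only its compression info.
    """
    feedback = list(frame_features[-1])
    for samples in frame_samples:
        feedback.extend(compression_info(samples))
    return feedback
```

Under these assumptions, a frame set of N frames with D-dimensional features yields a feedback vector of length D + 2N, which is much smaller than feeding back all N feature vectors while still carrying coarse information about every frame.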

The processor 1001 may generate an audio representation of the second audio frame set based on the text representation and the feedback information.

For example, through the attention module of the speech synthesis model, the processor 1001 may obtain, based on the text representation and the audio representation of the first audio frame set, attention information for identifying a portion of the text representation that requires attention.

For example, through the attention module of the speech synthesis model, the processor 1001 may identify and extract, based on the attention information, the portion of the text representation that requires attention, and may obtain the audio representation of the second audio frame set by combining the extraction result with the audio representation of the first audio frame set.
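The two attention steps above may be sketched as follows. This is only an illustration under stated assumptions: the disclosure does not specify the scoring function, so dot-product scoring with a softmax is substituted here, the "extraction result" is taken to be the attention-weighted sum of the text representation vectors, and "combining" is taken to be concatenation. The function names are hypothetical, and the text and audio vectors are assumed to share the same dimension.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(
    text_repr: Sequence[Sequence[float]], audio_repr: Sequence[float]
) -> List[float]:
    """Attention information: one weight per text position.

    Assumption: each score is the dot product between a text vector
    and the audio representation of the first audio frame set.
    """
    scores = [sum(t * a for t, a in zip(vec, audio_repr)) for vec in text_repr]
    return softmax(scores)


def next_audio_representation(
    text_repr: Sequence[Sequence[float]], audio_repr: Sequence[float]
) -> List[float]:
    """Audio representation of the second audio frame set (sketch)."""
    weights = attention_weights(text_repr, audio_repr)
    dim = len(text_repr[0])
    # Weighted sum of text vectors: the portion of the text attended to.
    context = [
        sum(w * vec[i] for w, vec in zip(weights, text_repr)) for i in range(dim)
    ]
    # Combine the extraction result with the previous audio representation.
    return context + list(audio_repr)
```

In this sketch the weights sum to one, so a text position whose vector aligns strongly with the current audio representation dominates the extracted context.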

For example, through the audio decoder of the speech synthesis model, the processor 1001 may obtain the audio features of the second audio frame set by decoding the audio representation of the second audio frame set.

For example, through the vocoder of the speech synthesis model, the processor 1001 may synthesize speech based on at least one of the audio features of the first audio frame set or the audio features of the second audio frame set.
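Taken together, the operations of the processor 1001 form a loop in which the features of one audio frame set are fed back to produce the next. The following Python sketch shows only that control flow. Every module is passed in as a callable, and all names and signatures are hypothetical. Two further assumptions are made: the audio encoder takes the feedback information as its input, and the loop runs for a fixed number of steps, whereas a practical system would typically use a stop criterion.

```python
from typing import Callable, List

Vector = List[float]


def synthesize(
    text: str,
    n_steps: int,
    initial_feedback: Vector,
    text_encoder: Callable[[str], Vector],
    audio_encoder: Callable[[Vector], Vector],
    attend: Callable[[Vector, Vector], Vector],
    audio_decoder: Callable[[Vector], Vector],
    make_feedback: Callable[[Vector], Vector],
    vocoder: Callable[[List[Vector]], Vector],
) -> Vector:
    """Generate audio frame sets one after another, then vocode them."""
    text_repr = text_encoder(text)            # text representation
    feedback = initial_feedback               # nothing generated yet
    features_per_set: List[Vector] = []
    for _ in range(n_steps):
        # Audio representation of the preceding (first) audio frame set.
        audio_repr = audio_encoder(feedback)
        # Audio representation of the following (second) audio frame set.
        next_repr = attend(text_repr, audio_repr)
        # Audio features of the second audio frame set.
        features = audio_decoder(next_repr)
        features_per_set.append(features)
        # Feedback information used when generating the next frame set.
        feedback = make_feedback(features)
    return vocoder(features_per_set)
```

Because each module is an argument, the loop can be exercised with trivial stand-in functions before any trained model is available, which makes the feedback path easy to check in isolation.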

The processor 1001 according to some embodiments may, for example, perform artificial intelligence operations. The processor 1001 may be, for example, any one of a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), a field programmable gate array (FPGA), or an application-specific integrated circuit (ASIC), but is not limited thereto.

FIG. 10 is a block diagram illustrating a configuration of a server according to some embodiments.

The speech synthesis method according to some embodiments may be performed by the electronic device 1000 and/or by the server 2000 connected to the electronic device 1000 through wired or wireless communication.

Referring to FIG. 10, the server 2000 according to some embodiments may include a processor 2001, a communication unit 2002, and a memory 2003.

The communication unit 2002 may include one or more communication modules for communicating with the electronic device 1000. For example, the communication unit 2002 may include at least one of a short-range communication unit or a mobile communication unit.

The short-range wireless communication unit may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The memory 2003 may store a speech synthesis model used to synthesize speech from text.

The speech synthesis model stored in the memory 2003 may include a plurality of modules classified by function. The speech synthesis model stored in the memory 2003 may include, for example, at least one of a preprocessor, a text encoder, an attention module, an audio encoder, an audio decoder, a feedback information generator, a vocoder, or an audio feature extractor.

The memory 2003 may store a program for controlling the operation of the server 2000. The memory 2003 may include at least one instruction for controlling the operation of the server 2000.

The memory 2003 may store, for example, information about the input text and the synthesized speech.

The memory 2003 may include at least one type of storage medium among, for example, a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk, but is not limited thereto.

The processor 2001 may typically control the overall operation of the server 2000. For example, by executing programs stored in the memory 2003, the processor 2001 may control the communication unit 2002 and the memory 2003 overall.

The processor 2001 may receive text for speech synthesis from the electronic device 1000 through the communication unit 2002.

When the text is received, the processor 2001 may start a speech synthesis process by activating the speech synthesis model stored in the memory 2003.

The processor 2001 may obtain a text representation by encoding the text through the text encoder of the speech synthesis model.

Through the feedback information generator of the speech synthesis model, the processor 2001 may generate feedback information that is used to obtain, among the audio frames generated from the text representation, the audio features of a second audio frame set from the audio features of a first audio frame set. The second audio frame set may be, for example, an audio frame set including frames that follow the first audio frame set.

For example, through the feedback information generator of the speech synthesis model, the processor 2001 may obtain information about the audio feature of at least one audio frame included in the first audio frame set and compression information about at least one audio frame included in the first audio frame set, and may generate the feedback information by combining the obtained audio feature information with the obtained compression information.

The processor 2001 may generate an audio representation of the second audio frame set based on the text representation and the feedback information.

For example, through the attention module of the speech synthesis model, the processor 2001 may obtain, based on the text representation and the audio representation of the first audio frame set, attention information for identifying a portion of the text representation that requires attention.

For example, through the attention module of the speech synthesis model, the processor 2001 may identify and extract, based on the attention information, the portion of the text representation that requires attention, and may obtain the audio representation of the second audio frame set by combining the extraction result with the audio representation of the first audio frame set.

For example, through the audio decoder of the speech synthesis model, the processor 2001 may obtain the audio features of the second audio frame set by decoding the audio representation of the second audio frame set.

For example, through the vocoder of the speech synthesis model, the processor 2001 may synthesize speech based on at least one of the audio features of the first audio frame set or the audio features of the second audio frame set.

The processor 2001 according to some embodiments may, for example, perform artificial intelligence operations. The processor 2001 may be, for example, any one of a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), a field programmable gate array (FPGA), or an application-specific integrated circuit (ASIC), but is not limited thereto.

Some embodiments may also be implemented in the form of a recording medium including computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

In addition, in this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The above description of the present disclosure is provided for illustration, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present disclosure. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present disclosure is indicated by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present disclosure.

Claims

A method by which an electronic device synthesizes speech from text, the method comprising:
obtaining text input to the electronic device;
obtaining a text representation by encoding the text using a text encoder of the electronic device;
obtaining an audio representation of a first audio frame set from an audio encoder of the electronic device;
obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set;
obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set;
generating feedback information based on the audio feature of the second audio frame set; and
synthesizing speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.

The method of claim 1,
wherein the second audio frame set includes at least one frame that follows the first audio frame set.

The method of claim 1,
wherein the generating of the feedback information comprises:
obtaining information about an audio feature of at least one audio frame of the second audio frame set;
obtaining compression information about at least one audio frame of the second audio frame set; and
generating the feedback information by combining the information about the audio feature of the at least one audio frame with the compression information about the at least one audio frame.

The method of claim 3,
wherein the feedback information is used to obtain an audio feature of a third audio frame set that follows the second audio frame set.

The method of claim 3,
wherein the compression information includes information about at least one of a magnitude of an amplitude value of an audio signal corresponding to each of the at least one audio frame, a magnitude of a root mean square (RMS) of the amplitude value of the audio signal, or a magnitude of a pitch value of the audio signal.

The method of claim 1,
wherein the obtaining of the audio representation of the second audio frame set comprises:
obtaining, based on at least a portion of the text representation and the audio representation of the first audio frame set, attention information for identifying a portion of the text representation that requires attention; and
obtaining the audio representation of the second audio frame set based on at least a portion of the text representation and the attention information.

An electronic device for synthesizing speech from text,
the electronic device comprising at least one processor,
wherein the at least one processor is configured to:
obtain text input to the electronic device,
obtain a text representation by encoding the text,
obtain an audio representation of a first audio frame set,
obtain an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set,
obtain an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set,
generate feedback information based on the audio feature of the second audio frame set, and
synthesize speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.

The electronic device of claim 7,
wherein the second audio frame set includes at least one frame that follows the first audio frame set.

The electronic device of claim 7,
wherein the at least one processor is configured to
obtain information about an audio feature of at least one audio frame of the second audio frame set, obtain compression information about at least one audio frame of the second audio frame set, and generate the feedback information by combining the information about the audio feature of the at least one audio frame with the compression information about the at least one audio frame.

The electronic device of claim 9,
wherein the feedback information is used to obtain an audio feature of a third audio frame set that follows the second audio frame set.

The electronic device of claim 9,
wherein the compression information includes information about at least one of a magnitude of an amplitude value of an audio signal corresponding to each of the at least one audio frame, a magnitude of a root mean square (RMS) of the amplitude value of the audio signal, or a magnitude of a pitch value of the audio signal.

The electronic device of claim 8,
wherein the at least one processor is configured to
obtain, based on at least a portion of the text representation and the audio representation of the first audio frame set, attention information for identifying a portion of the text representation that requires attention, and obtain the audio representation of the second audio frame set based on at least a portion of the text representation and the attention information.

A computer-readable recording medium storing a program for causing a computer to execute a method of synthesizing speech from text, the method comprising:
obtaining text input to an electronic device;
obtaining a text representation by encoding the text;
obtaining an audio representation of a first audio frame set;
obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set;
obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set;
generating feedback information based on the audio feature of the second audio frame set; and
synthesizing speech based on at least one of an audio feature of the first audio frame set or the audio feature of the second audio frame set.