KR102311239B1

KR102311239B1 - Deep neural network based non-autoregressive speech synthesizer method and system

Info

Publication number: KR102311239B1
Application number: KR1020190005234A
Authority: KR
Inventors: 장준혁; 이준모
Original assignee: 한양대학교 산학협력단
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2021-10-12
Also published as: KR20200092511A

Abstract

A deep neural network-based non-autoregressive speech synthesis method and system are presented. A deep neural network-based non-autoregressive speech synthesis system according to an embodiment may include: a sentence data analysis unit that analyzes sentence data and outputs refined sentence data; a speech feature vector sequence synthesis unit that generates a template input and adds the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence; and a speech reconstruction unit that converts the speech feature vector sequence into speech data.

Description

DEEP NEURAL NETWORK BASED NON-AUTOREGRESSIVE SPEECH SYNTHESIZER METHOD AND SYSTEM

Embodiments of the present invention relate to a deep neural network-based non-autoregressive speech synthesis method and system, and more particularly to a deep neural network-based non-autoregressive speech synthesis method and system that estimates the length of the speech feature vector and generates an empty input in order to generate speech vectors non-recursively. This invention results from research carried out in 2017 with funding from the government (Ministry of Science and ICT) and the support of the Institute for Information & Communications Technology Promotion (No. 2017-0-00474, Development of Intelligent Speech Signal Processing Technology for AI Speaker Voice Assistants).

Deep neural network (DNN)-based speech synthesis is a technology that uses a deep neural network to generate speech data from sentence data. In general, a sentence data analysis unit that analyzes the sentence data and a speech feature vector sequence synthesis unit that generates speech feature vectors are configured as a single network.

The first stage, the sentence data analysis unit, is divided into two parts: a character embedding part that splits the sentence data into jamo and converts the separated jamo into a vector sequence that is valid as neural network input, and a part in which a network composed of a convolutional neural network and a recurrent neural network refines, from the embedded vector sequence, the information needed to generate speech feature vectors.

The second stage, the speech feature vector sequence synthesis unit, also consists of two steps. In the first step, an attention mechanism selectively gathers the information relevant to the speech data from the refined sentence data vector sequence, and a recurrent neural network then generates mel filter bank speech feature vectors from the gathered information. The input of the recurrent neural network is formed autoregressively from the mel filter bank output generated at the previous step. In the second step, the mel filter bank speech data is mapped to a log-power spectrogram.

The quality of speech synthesis is generally measured with the Mean Opinion Score (MOS), a subjective evaluation method. For some vocoder models the degree of speech distortion is measured, but for DNN-based end-to-end models only MOS is used.

DNN-based end-to-end speech synthesis refers to technology in which a single DNN model analyzes sentence data and generates spectrogram-based speech features or a speech signal. Because sentence data and speech data have different sampling rates, a sequence-to-sequence (seq2seq) network and an attention mechanism are used to bridge the mismatch. A seq2seq network consists of an encoder and a decoder. The encoder refines the sentence data, and the decoder generates spectrogram-based speech features from the information refined by the encoder. The decoder produces its outputs sequentially through an autoregressive flow in which each generated output becomes the input at the next time step. The autoregressive flow allows information to be passed along efficiently, but it is slow because the outputs must be generated one after another in time order.

As described above, existing neural network-based speech synthesis synthesizes speech feature vectors autoregressively, feeding the mel filter bank speech feature vector generated at the previous step back in as the input of the current step. In the training phase there is no need to wait for the previous step's output, because the input is built by shifting the target speech feature vector sequence, which is already given. In the test phase, however, the output has to be generated step by step, so generation is slow.
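The asymmetry between training and test time described above can be illustrated with a toy sketch. It is not the network of any cited system: `step` is a hypothetical stand-in for one decoder step, and the all-zero first frame is an assumed convention for the shifted input.

```python
import numpy as np

def step(prev_frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for one decoder step (not a trained network)."""
    return 0.9 * prev_frame + 0.1

def teacher_forced_inputs(target: np.ndarray) -> np.ndarray:
    """Training: shift the known target right by one frame, so every
    decoder input is available before decoding starts."""
    first = np.zeros_like(target[:1])          # assumed all-zero start frame
    return np.concatenate([first, target[:-1]], axis=0)

def autoregressive_decode(num_frames: int, dim: int) -> np.ndarray:
    """Test: each frame depends on the previous output, so the loop
    cannot be parallelised over time."""
    frames = []
    prev = np.zeros(dim)
    for _ in range(num_frames):
        prev = step(prev)
        frames.append(prev)
    return np.stack(frames)

target = np.arange(12, dtype=float).reshape(4, 3)    # 4 frames, 3 mel bins
inputs = teacher_forced_inputs(target)
outputs_train = step(inputs)                         # all frames in one call
outputs_test = autoregressive_decode(num_frames=4, dim=3)
```

In the training path all four frames are processed by a single vectorised call, whereas the test path needs one call per frame.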

In an autoregressive speech synthesis model, the output is fed back as input, so owing to the nature of recurrent neural networks the speech frequently becomes quieter toward the end of the utterance. In addition, a separate mechanism is needed to detect the end of the speech during generation, and if the end is not detected correctly, generation stops partway through.

Wang Y, Skerry-Ryan RJ, Stanton D, Wu Y, Weiss RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, Le Q. Tacotron: A fully end-to-end text-to-speech synthesis model. arXiv preprint arXiv:1703.10135. 2017 Mar 29.

Embodiments of the present invention describe a deep neural network-based non-autoregressive speech synthesis method and system and, more specifically, provide a technique for building a speech synthesis model that estimates the length of the speech feature vector and generates an empty input in order to generate speech vectors non-recursively.

To solve the problems of existing autoregressive speech synthesis, embodiments of the present invention can provide a deep neural network-based non-autoregressive speech synthesis method and system that removes the autoregressive flow of existing autoregressive speech synthesis and constructs a new input, called a template, so that speech can be generated in the test phase in the same way as in the training phase.

An object of the present invention is to provide a method that uses a new input, called a template, to remove the need for a speech synthesis model to generate speech feature vectors step by step. A further object is to provide a method that solves the problems existing autoregressive models commonly encounter when synthesizing long utterances.

A deep neural network-based non-autoregressive speech synthesis system according to an embodiment may include: a sentence data analysis unit that analyzes sentence data and outputs refined sentence data; a speech feature vector sequence synthesis unit that generates a template input and adds the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence; and a speech reconstruction unit that converts the speech feature vector sequence into speech data.

The sentence data analysis unit may decompose the sentence data into Hangul jamo to generate a jamo-level input, embed that input to form embedded sentence data in the form of a sentence feature vector sequence, and refine the embedded sentence data using a convolutional artificial neural network to form the refined sentence data.

The sentence data analysis unit may decompose the sentence data into Hangul jamo to generate a jamo-level input, index the jamo-level input to map it to numeric data, one-hot encode the numeric data, and multiply the one-hot encoded vector sequence by a sentence embedding matrix to generate the embedded sentence data, which consists of a vector sequence with continuous characteristics.

The template, which is the input of the speech feature vector sequence synthesis unit, may consist of absolute positional encoding data and relative positional encoding data.

The speech feature vector sequence synthesis unit may generate the template input, add the refined sentence data to the template input using an attention mechanism to generate an encoded template, decode the encoded template to synthesize a mel filter bank speech feature vector sequence, and synthesize a log power spectrum speech feature vector sequence from the mel filter bank speech feature vector sequence.

The speech reconstruction unit may use the Griffin-Lim algorithm to generate phase information from the speech feature vector sequence, which carries magnitude information, and convert it into the speech data.

A deep neural network-based non-autoregressive speech synthesis method according to another embodiment may include: a sentence data analysis step of analyzing sentence data and outputting refined sentence data; a speech feature vector sequence synthesis step of generating a template input and adding the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence; and a speech reconstruction step of converting the speech feature vector sequence into speech data.

The sentence data analysis step may include: decomposing the sentence data into Hangul jamo to generate a jamo-level input and embedding it to form embedded sentence data in the form of a sentence feature vector sequence; and refining the embedded sentence data using a convolutional artificial neural network to form the refined sentence data.

The step of forming the embedded sentence data in the form of a sentence feature vector sequence may include: decomposing the sentence data into Hangul jamo to generate a jamo-level input; indexing the jamo-level input to map it to numeric data; one-hot encoding the numeric data; and multiplying the one-hot encoded vector sequence by a sentence embedding matrix to generate the embedded sentence data, which consists of a vector sequence with continuous characteristics.

Here, the template, which is the input of the speech feature vector sequence synthesis unit, may consist of absolute positional encoding data and relative positional encoding data.

The speech feature vector sequence synthesis step may include: generating the template input; a speech data encoding step of adding the refined sentence data to the template input using an attention mechanism to generate an encoded template; a speech data decoding step of decoding the encoded template to synthesize a mel filter bank speech feature vector sequence; and synthesizing a log power spectrum speech feature vector sequence from the mel filter bank speech feature vector sequence.

In addition, the step of generating the template input may include: generating absolute positional encoding data; generating relative positional encoding data; and concatenating the generated absolute positional encoding data and relative positional encoding data to generate the template.
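As a rough illustration of the three sub-steps above, the sketch below builds a template for a given number of output frames. The text here does not specify the form of either encoding, so every detail is an assumption: both encodings use Transformer-style sinusoids, the absolute encoding is driven by the frame index, the relative encoding by the frame's fractional position in the utterance, and `dim` and `scale` are hypothetical parameters.

```python
import numpy as np

def sinusoid(positions: np.ndarray, dim: int) -> np.ndarray:
    """Sinusoidal encoding of scalar positions (an assumed choice)."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def make_template(num_frames: int, dim: int = 16, scale: float = 100.0) -> np.ndarray:
    """Concatenate absolute and relative positional encodings per frame."""
    t = np.arange(num_frames, dtype=float)
    absolute = sinusoid(t, dim)                        # depends on frame index
    fraction = t / max(num_frames - 1, 1)              # 0.0 at start, 1.0 at end
    relative = sinusoid(fraction * scale, dim)         # depends on progress
    return np.concatenate([absolute, relative], axis=1)

template = make_template(num_frames=50)                # shape (50, 32)
```

Under these assumptions the relative half of the last frame is identical for any utterance length, while the absolute half keeps growing with the frame index, so the template carries no content and only tells the network where each output frame sits.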

The speech data encoding step may include: receiving the refined sentence data and the template as inputs and having the attention mechanism select the parts needed for log power spectrum synthesis to form a fixed-length vector; and encoding a template that carries accurate information by repeatedly applying the convolutional artificial neural network and the attention mechanism.
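One way to picture the attention part of the encoding step is the sketch below, in which each template frame queries the refined sentence data and the selected context is added back onto the template. The form of attention is not given here, so scaled dot-product attention and the residual addition are assumptions, and the convolutional layers that would alternate with attention are omitted.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(template: np.ndarray, text: np.ndarray):
    """template: (T, d) queries; text: (N, d) keys and values (assumed shapes)."""
    scores = template @ text.T / np.sqrt(template.shape[1])   # (T, N)
    weights = softmax(scores, axis=1)            # one distribution per frame
    context = weights @ text                     # (T, d) selected text content
    return template + context, weights           # assumed residual addition

rng = np.random.default_rng(0)
template = rng.normal(size=(6, 8))               # 6 output frames
text = rng.normal(size=(4, 8))                   # 4 refined jamo vectors
encoded, weights = attend(template, text)
```

Because every frame's query is available at once, the whole `(T, N)` weight matrix is computed in a single pass with no dependence on previously generated frames.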

The speech data decoding step may include synthesizing a mel filter bank speech feature vector sequence from the encoded template through a speech data decoding artificial neural network.

The speech reconstruction step may use the Griffin-Lim algorithm to generate phase information from the speech feature vector sequence, which carries magnitude information, and convert it into the speech data.

According to embodiments of the present invention, because the speech feature vector sequence is synthesized all at once in a non-autoregressive manner, it is possible to provide a deep neural network-based non-autoregressive speech synthesis method and system that synthesizes speech faster than autoregressive speech synthesis.

In addition, according to embodiments of the present invention, the gradual decrease in output magnitude seen in autoregressive regression models does not occur, so the loudness of the speech remains constant throughout the sentence.

FIG. 1 is a diagram showing the overall configuration of a non-autoregressive speech synthesis system according to an embodiment.
FIGS. 2 to 4 are flowcharts illustrating a non-autoregressive speech synthesis method according to an embodiment.
FIG. 5 is a diagram for explaining a process of generating a mel filter bank speech feature vector sequence according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments of the present invention relate to a deep neural network-based non-autoregressive speech synthesis method and system that estimates the length of the speech feature vector and generates an empty input in order to generate speech vectors non-recursively. The embodiments remove the autoregressive flow of existing autoregressive speech synthesis systems and construct a new input, called a template, so that speech can be generated in the test phase in the same way as in the training phase.

According to the embodiments, synthesizing the speech feature vector sequence all at once in a non-autoregressive manner allows speech to be synthesized faster than with autoregressive speech synthesis.

FIG. 1 is a diagram showing the overall configuration of a non-autoregressive speech synthesis system according to an embodiment.

Referring to FIG. 1, a deep neural network-based non-autoregressive speech synthesis system (100) according to an embodiment may include a sentence data analysis unit (110) that analyzes sentence data (101), a speech feature vector sequence synthesis unit (120) that synthesizes a speech feature vector sequence (104), and a speech reconstruction unit (130) that converts the speech feature vector sequence (104) into speech.

The sentence data analysis unit (110) may analyze the sentence data (101) and output refined sentence data (102). The sentence data analysis unit (110) may be divided into a character embedding part, which converts sentence input data arriving in units of Hangul jamo into input for the deep neural network, and an artificial neural network part, which refines the embedded data. More specifically, the character embedding part decomposes the sentence data (101) into Hangul jamo to generate a jamo-level input and embeds it to form embedded sentence data in the form of a sentence feature vector sequence, and the artificial neural network part refines the embedded sentence data using a convolutional artificial neural network to form the refined sentence data (102).

In particular, the sentence data analysis unit (110) may decompose the sentence data (101) into Hangul jamo to generate a jamo-level input, index the jamo-level input to map it to numeric data, one-hot encode the numeric data, and multiply the one-hot encoded vector sequence by a sentence embedding matrix to generate embedded sentence data consisting of a vector sequence with continuous characteristics.

The speech feature vector sequence synthesis unit (120) may generate a template (103) input and add the refined sentence data (102) to the generated template (103) using an attention mechanism to generate the speech feature vector sequence (104). Here, the template (103), which is the input of the speech feature vector sequence synthesis unit (120), may consist of absolute positional encoding data and relative positional encoding data.

In addition, the speech feature vector sequence synthesis unit (120) may be composed of a speech data encoding unit and a speech data decoding unit.

The speech feature vector sequence synthesis unit (120) may generate the template (103) input and, through the speech data encoding unit, add the refined sentence data (102) to the template (103) input using an attention mechanism to generate an encoded template. The speech data decoding unit may then decode the encoded template to synthesize a mel filter bank speech feature vector sequence. A log power spectrum speech feature vector sequence may then be synthesized from the mel filter bank speech feature vector sequence.

The speech reconstruction unit (130) may convert the speech feature vector sequence (104) into speech data (105). More specifically, the speech reconstruction unit (130) may use the Griffin-Lim algorithm to generate phase information from the speech feature vector sequence (104), which carries magnitude information, and convert it into the speech data (105).
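For readers unfamiliar with the Griffin-Lim algorithm, a minimal NumPy-only sketch follows. It alternates between imposing the known magnitude and re-estimating phase from the resynthesized signal. The frame size `N_FFT`, hop `HOP`, Hann window, iteration count, and random initial phase are all assumptions, and the input is taken to be a linear magnitude spectrogram; log power features would first have to be converted back to linear magnitude.

```python
import numpy as np

N_FFT, HOP = 512, 128                       # assumed analysis parameters
WINDOW = np.hanning(N_FFT)

def stft(x: np.ndarray) -> np.ndarray:
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # (n_frames, N_FFT // 2 + 1)

def istft(spec: np.ndarray) -> np.ndarray:
    n_frames = spec.shape[0]
    out = np.zeros(N_FFT + HOP * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    for i in range(n_frames):               # windowed overlap-add
        sl = slice(i * HOP, i * HOP + N_FFT)
        out[sl] += frames[i] * WINDOW
        norm[sl] += WINDOW ** 2
    # Normalise where the window overlap is meaningful; leave the
    # near-zero edges attenuated instead of amplifying them.
    return out / np.where(norm > 1e-3, norm, 1.0)

def griffin_lim(magnitude: np.ndarray, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase)            # impose known magnitude
        phase = np.exp(1j * np.angle(stft(signal)))  # keep only the new phase
    return istft(magnitude * phase)

sr = 16000
x = np.sin(2 * np.pi * 440.0 * np.arange(sr // 2) / sr)   # 0.5 s test tone
mag = np.abs(stft(x))                                     # phase discarded
y = griffin_lim(mag, n_iter=20)
```

The usage lines discard the phase of a test tone and recover a waveform from magnitude alone, which mirrors the role of the reconstruction unit.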

FIGS. 2 to 4 are flowcharts illustrating a non-autoregressive speech synthesis method according to an embodiment.

Referring to FIG. 2, a deep neural network-based non-autoregressive speech synthesis method according to an embodiment may include a sentence data analysis step (S110) of analyzing sentence data and outputting refined sentence data, a speech feature vector sequence synthesis step (S120) of generating a template input and adding the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence, and a speech reconstruction step (S130) of converting the speech feature vector sequence into speech data.
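The data flow through the three steps can be summarised as a simple function composition. The sketch below shows only the order of the stages: `run_pipeline` and the three `toy_*` functions are hypothetical placeholders, not the trained networks or the reconstruction algorithm.

```python
from typing import Callable, List

Frames = List[List[float]]

def run_pipeline(sentence: str,
                 analyze: Callable[[str], Frames],
                 synthesize: Callable[[Frames], Frames],
                 reconstruct: Callable[[Frames], List[float]]) -> List[float]:
    refined = analyze(sentence)        # S110: sentence data analysis
    features = synthesize(refined)     # S120: speech feature vector synthesis
    return reconstruct(features)       # S130: speech reconstruction

# Toy stand-ins so the composition can be executed end to end.
def toy_analyze(sentence: str) -> Frames:
    return [[float(ord(ch))] for ch in sentence]

def toy_synthesize(refined: Frames) -> Frames:
    # Pretend each text vector expands to three feature frames.
    return [[v[0] / 1000.0, v[0] / 1000.0] for v in refined for _ in range(3)]

def toy_reconstruct(features: Frames) -> List[float]:
    return [sum(frame) for frame in features]

waveform = run_pipeline("음성", toy_analyze, toy_synthesize, toy_reconstruct)
```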

Referring to FIG. 3, the sentence data analysis step (S110) may include a step (S111) of decomposing the sentence data into Hangul jamo to generate a jamo-level input and embedding it to form embedded sentence data in the form of a sentence feature vector sequence, and a step (S112) of refining the embedded sentence data using a convolutional artificial neural network to form refined sentence data.

Referring to FIG. 4, the speech feature vector sequence synthesis step (S120) may include a step (S121) of generating a template input, a speech data encoding step (S122) of adding the refined sentence data to the template input using an attention mechanism to generate an encoded template, a speech data decoding step (S123) of decoding the encoded template to synthesize a mel filter bank speech feature vector sequence, and a step (S124) of synthesizing a log power spectrum speech feature vector sequence from the mel filter bank speech feature vector sequence.

Each step of the non-autoregressive speech synthesis method according to an embodiment is described in more detail below.

The deep neural network-based non-autoregressive speech synthesis method according to an embodiment is described in more detail using, as an example, the deep neural network-based non-autoregressive speech synthesis system described with reference to FIG. 1. Here, the system according to an embodiment may include a sentence data analysis unit, a speech feature vector sequence synthesis unit, and a speech reconstruction unit.

In the sentence data analysis step (S110), the sentence data analysis unit may analyze the sentence data and output refined sentence data. Here, the sentence data analysis step (S110) may generate the refined sentence data by refining with a single artificial neural network. This network may be referred to as the sentence data refinement artificial neural network, and it can be trained.

The sentence data analysis step (S110) may include a step (S111) of decomposing the sentence data into Hangul jamo to generate a jamo-level input and embedding it to form embedded sentence data in the form of a sentence feature vector sequence, and a step (S112) of refining the embedded sentence data using a convolutional artificial neural network to form refined sentence data. Here, embedding is the process of converting discrete symbolic data, such as a sentence, into feature vectors that are continuous and have diverse characteristics.

The step (S111) of forming the embedded sentence data may include a step of decomposing the sentence data into Hangul jamo to generate a jamo-level input (sentence decomposition step), a step of indexing the jamo-level input to map it to numeric data (indexing step), a step of one-hot encoding the numeric data, and a step of multiplying the one-hot encoded vector sequence by a sentence embedding matrix to generate embedded sentence data consisting of a vector sequence with continuous characteristics (feature vector conversion step).

The sentence decomposition step splits a Hangul sentence into its jamo. The indexing step numbers each resulting jamo: an index table is built in which each jamo corresponds one-to-one to a number, and the jamo are numbered according to that table. After these two steps the sentence data has become numeric data. The generated sentence feature vector sequence (that is, the embedded sentence data) can be expressed as the following equation.

[Equation 1]

In the feature vector conversion step, the generated numeric data is one-hot encoded, and each resulting one-hot vector is then multiplied by the embedding matrix, as in Equation 2, to convert it into a feature vector. This yields the embedded sentence data as a vector sequence.

[Equation 2]
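The feature vector conversion step can be illustrated with a small Python sketch. It shows that multiplying a one-hot row vector by the embedding matrix selects one row of that matrix. The matrix values, the zero-based indices, and the function names are placeholders assumed for the example. In the embodiment the embedding matrix would be learned.

```python
def one_hot(index, vocab_size):
    """Return a one-hot row vector with a 1 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(vocab_size)]


def row_times_matrix(vec, matrix):
    """Multiply a row vector (1 x V) by a matrix (V x D), giving a row vector (1 x D)."""
    n_cols = len(matrix[0])
    return [sum(v * row[d] for v, row in zip(vec, matrix)) for d in range(n_cols)]


def embed(indices, embedding):
    """One-hot encode each index and multiply it by the embedding matrix."""
    vocab_size = len(embedding)
    return [row_times_matrix(one_hot(i, vocab_size), embedding) for i in indices]


# Toy embedding matrix: 3 jamo in the index table, 2 embedding channels.
embedding = [[0.1, 0.2],
             [0.3, 0.4],
             [0.5, 0.6]]
embedded = embed([2, 0, 2], embedding)
```

Because the product with a one-hot vector is a row lookup, an implementation could index the matrix directly instead of forming the one-hot vectors.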

In the speech feature vector sequence synthesis step (S120), the speech feature vector sequence synthesizing unit may generate a template input and add the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence.

The speech feature vector sequence synthesis step (S120) may include: a step (S121) of generating the template input; a speech data encoding step (S122) of generating an encoded template by adding the refined sentence data to the template input using the attention mechanism; a speech data decoding step (S123) of synthesizing a Mel filter bank speech feature vector sequence by decoding the encoded template; and a step (S124) of synthesizing a log power spectrum speech feature vector sequence from the Mel filter bank speech feature vector sequence.

The step of generating the template input (S121) may include generating absolute positional encoding data, generating relative positional encoding data, and concatenating the generated absolute and relative positional encoding data to produce the template.

The speech data encoding step (S122) may include: receiving the refined sentence data and the template as inputs, with the attention mechanism selecting the parts needed for log power spectrum synthesis to form a fixed-length vector; and encoding a template that carries accurate information by repeatedly applying the convolutional artificial neural network and the attention mechanism. The speech data encoding artificial neural network can be trained, and it may be a convolutional artificial neural network.

The speech data decoding step (S123) may include synthesizing the Mel filter bank speech feature vector sequence from the encoded template through a speech data decoding artificial neural network. The speech data decoding artificial neural network can be trained.

As mentioned above, the template (or template data) that is the input of the speech feature vector sequence synthesizing unit may consist of a relative positional encoding and an absolute positional encoding. Each positional encoding may have the same dimensions as the Mel filter bank speech feature vector sequence; that is, it is a matrix of size [time, frequency bins]. The absolute positional encoding can be expressed as the following equation.

[Equation 3]

The absolute positional encoding may consist of sin and cos waveforms with three parameters. The first, pos, is a natural number that increases from 1 with the time order in the data. The second is a natural number that increases from 1 with the order of the frequency bins in the data. The third is 80, equal to the number of frequency bins of the Mel filter bank.

The sin and cos waves, which vary over time without repeating, add time information to the data. They help the convolutional neural network, whose ability to learn order information is limited compared with a recurrent neural network, reflect order and time information in its output.
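A sinusoidal encoding with these three parameters might be sketched as follows. The exact expression of Equation 3 is not reproduced in this text, so the sketch assumes the widely used Transformer-style form, in which odd-numbered bins hold a sine and even-numbered bins hold a cosine of pos divided by 10000 raised to a bin-dependent power. The constant 10000, the pairing of bins, and the function name are assumptions of the example.

```python
import math

N_BINS = 80  # number of Mel filter bank frequency bins stated in the text


def absolute_encoding(n_frames, n_bins=N_BINS, base=10000.0):
    """Return a [time, frequency bins] matrix of sin/cos values (assumed form)."""
    rows = []
    for pos in range(1, n_frames + 1):        # time index, starting at 1
        row = []
        for i in range(1, n_bins + 1):        # frequency-bin index, starting at 1
            pair = (i - 1) // 2               # bins 1-2 share a frequency, 3-4 the next
            angle = pos / base ** (2 * pair / n_bins)
            row.append(math.sin(angle) if i % 2 == 1 else math.cos(angle))
        rows.append(row)
    return rows
```

Under this assumed form each time step receives a distinct row, which is the property the description relies on to give the convolutional network access to order information.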

The relative positional encoding can be expressed as the following equation.

[Equation 4]

The relative positional encoding is similar to the absolute positional encoding, but in place of one of its parameters it includes a parameter for the length of the entire speech data. Because this length parameter is part of the encoding, it can express time information relative to the total length, unlike the absolute positional encoding.

The two positional encodings are concatenated to form the template. The full template is therefore a matrix of size [time, frequency bins × 2].
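The construction of the template by concatenation might look like the following sketch. Only the concatenation and the resulting [time, frequency bins × 2] shape come from the description. Both encoding formulas are stand-ins: the absolute part assumes a Transformer-style sinusoid with the constant 10000, and the relative part is a purely hypothetical variant that puts the total number of frames in place of that constant.

```python
import math


def sinusoid_encoding(n_frames, n_bins, scale):
    """Assumed sin/cos layout; `scale` is the constant raised to a bin-dependent power."""
    return [[(math.sin if i % 2 == 0 else math.cos)(pos / scale ** (2 * (i // 2) / n_bins))
             for i in range(n_bins)]
            for pos in range(1, n_frames + 1)]


def make_template(n_frames, n_bins=80):
    """Concatenate absolute and relative encodings along the frequency-bin axis."""
    absolute = sinusoid_encoding(n_frames, n_bins, 10000.0)          # assumed constant
    relative = sinusoid_encoding(n_frames, n_bins, float(n_frames))  # hypothetical form
    return [a + r for a, r in zip(absolute, relative)]
```

With these stand-ins, two templates of different lengths share the same absolute half for a given frame but differ in the relative half, which matches the role the description gives to the length parameter.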

FIG. 5 is a diagram for explaining a process of generating a Mel filter bank speech feature vector sequence according to an embodiment.

FIG. 5 shows the process by which the speech feature vector sequence synthesizing unit of the deep neural network-based speech synthesis system generates the Mel filter bank speech feature vector sequence (204).

The speech feature vector sequence synthesizing unit may consist of a speech data encoding unit (220) and a speech data decoding unit (230). The speech data encoding unit (220) may consist of a plurality of convolutional networks (221) and attention mechanisms (210). Each time it passes through an attention mechanism (210), the template (202) may be encoded into new information that appropriately carries the sentence information needed to generate the Mel filter bank speech feature vector sequence (204).

In general, the speech feature vector sequence and the sentence data differ in sampling rate: the sampling rate of the speech feature vector sequence is usually much higher than that of the sentence data. When sentence information is added to the template (202), this difference raises the question of which sentence data should be added at each time step of the template (202). The attention mechanism (210) has proven highly effective at solving this problem in conventional autoregressive speech synthesis systems. The present invention likewise uses the attention mechanism (210) to add the refined sentence data (201) to the template (202) appropriately.

However, whereas the Mel filter bank speech feature vector sequence (204) that serves as the input of an autoregressive model contains the sentence and speech information of previous time steps, the template (202) contains only time information. To compensate for this lack of information in the template (202), the present invention may provide a method in which the attention mechanism (210) is applied repeatedly (recursively) to raise the accuracy of the information step by step, as in the structure of FIG. 5. The speech data encoding unit (220) may use the attention mechanism (210) and the convolutional network (221) three times in total to add the information of the refined sentence data (201) to the template (202) appropriately.

After passing through the speech data encoding unit (220), the template (202) is mapped to an encoded template (203) that contains the sentence information needed to synthesize the Mel filter bank speech feature vector sequence (204). Each attention mechanism (210) operates as follows.

First, the refined sentence data (201) is split into a V matrix and a K matrix. The V and K matrices have the same dimensions, [sentence time, channel]. The template after it has passed through the convolutional network (221) is called the Q matrix, and its size is [speech time, channel]. The channel size of the sentence data and the speech data is the same.

[Equation 5]

$$E = QK^{\top}$$

Equation 5 gives the degree of matching between the two inputs of the attention mechanism (210). The matrix $E$, which expresses how well the two matrices match, is obtained from the scalar products of the rows of $Q$ and $K$. The size of $E$ is [speech time, sentence time], and its elements can be written as $e_{ij}$.

[Equation 6]

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}$$

Equation 6 uses the softmax function to convert the entries of $E$ into values $a_{ij}$ that can be read as probabilities. The matrix made up of the $a_{ij}$ is denoted $A$ and is called the alignment matrix. The final output of the attention mechanism (210), the matrix $C$, is obtained as in the following equation.

[Equation 7]

$$C = AV$$

The $C$ matrix contains information about the sentence data and is called the context data. Because $C$ has the same dimensions as $Q$, [speech time, channel], $C$ and $Q$ can be concatenated and used as the next input to the network.
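One pass of the attention mechanism described by Equations 5 to 7 might be sketched in plain Python as follows. The sketch computes the matching matrix from Q and K, turns each row into probabilities with a softmax to obtain the alignment matrix A, forms the context matrix C from A and V, and concatenates C with Q along the channel axis. Taking the softmax over the sentence-time axis, the absence of any scaling factor, and all function names are assumptions of the example.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax of one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """q: [speech time, channel]; k and v: [sentence time, channel]."""
    scores = matmul(q, transpose(k))            # [speech time, sentence time]
    alignment = [softmax(r) for r in scores]    # each row sums to 1 (assumed axis)
    context = matmul(alignment, v)              # [speech time, channel]
    next_input = [c + q_row for c, q_row in zip(context, q)]  # concatenate channels
    return alignment, context, next_input
```

Each row of the alignment matrix then indicates how strongly one speech frame draws on each position of the refined sentence data.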

The attention mechanism (210) is applied three times in total, and each application may form a progressively more accurate alignment matrix. The encoded template (203) generated in this way may be mapped to the Mel filter bank speech feature vector sequence (204) through the speech data decoding unit (230).

The Mel filter bank speech feature vector sequence (204) synthesized through the above process may be mapped to a log power spectrum speech feature vector sequence through a post-processing artificial neural network. The artificial neural networks used in the sentence data analysis unit and the speech feature vector sequence synthesizing unit may be trained using the error between the two synthesized feature vector sequences and the actual speech feature vector data. That is, the speech feature vector sequence synthesis artificial neural network can be trained.
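The training error mentioned above might, for example, be computed as in the following sketch. The description does not state which error measure is used, so the mean absolute error and the unweighted sum of the two terms are assumptions chosen only for illustration, as are the function names.

```python
def mean_abs_error(pred, target):
    """Mean absolute difference between two [time, bins] matrices of equal shape."""
    total = sum(abs(p - t)
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    count = sum(len(row) for row in target)
    return total / count


def synthesis_loss(mel_pred, mel_true, logpow_pred, logpow_true):
    """Sum of the errors of the two synthesized feature vector sequences (assumed form)."""
    return mean_abs_error(mel_pred, mel_true) + mean_abs_error(logpow_pred, logpow_true)
```

In a real system this scalar would be minimized with respect to the network weights by a gradient-based optimizer.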

In the speech reconstruction step (S130), the speech reconstruction unit may convert the speech feature vector sequence into speech data.

The speech reconstruction unit may restore the speech data using the log power spectrum speech feature vector sequence finally synthesized in the previous step. Because the log power spectrum speech feature vector sequence synthesized by the deep neural network carries only magnitude information and no phase information, new phase information must be generated with the Griffin-Lim algorithm to reconstruct the waveform.
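The following is a minimal, unoptimized sketch of Griffin-Lim phase reconstruction using only the Python standard library. It alternates between an inverse short-time Fourier transform and a forward transform, keeping the given magnitudes and updating only the phase. The Hann window, the zero-phase starting point, the use of the full symmetric spectrum, the natural-logarithm convention in `log_power_to_magnitude`, and all names and parameters are assumptions of the example. A practical system would use an FFT library.

```python
import cmath
import math


def log_power_to_magnitude(log_power):
    """Assumes natural-log power values: magnitude = sqrt(exp(log power))."""
    return [[math.exp(0.5 * v) for v in frame] for frame in log_power]


def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spec):
    n = len(spec)
    return [sum(spec[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)).real / n
            for t in range(n)]


def stft(signal, size, hop):
    win = hann(size)
    return [dft([signal[s + k] * win[k] for k in range(size)])
            for s in range(0, len(signal) - size + 1, hop)]


def istft(frames, size, hop, length):
    """Least-squares overlap-add inverse, normalized by the summed squared window."""
    win = hann(size)
    out = [0.0] * length
    norm = [0.0] * length
    for idx, spec in enumerate(frames):
        chunk = idft(spec)
        for k in range(size):
            out[idx * hop + k] += chunk[k] * win[k]
            norm[idx * hop + k] += win[k] ** 2
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


def griffin_lim(magnitude, size, hop, length, n_iter=30):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    spec = [[complex(m, 0.0) for m in frame] for frame in magnitude]  # zero phase
    for _ in range(n_iter):
        signal = istft(spec, size, hop, length)
        estimate = stft(signal, size, hop)
        # Keep the target magnitudes, take the phase from the current estimate.
        spec = [[m * cmath.exp(1j * cmath.phase(e)) for m, e in zip(mag_f, est_f)]
                for mag_f, est_f in zip(magnitude, estimate)]
    return istft(spec, size, hop, length)
```

The `magnitude` argument is expected to hold one full-length spectrum per frame for a signal of `length` samples analyzed with the same `size` and `hop`.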

According to the embodiments, the speech feature vector sequence is synthesized all at once in a non-autoregressive manner, so speech can be synthesized faster than with an autoregressive speech synthesis method. In addition, the gradual decrease in output magnitude that appears in autoregressive regression models does not occur, so the loudness of the speech can stay constant throughout the sentence.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored in one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

a sentence data analysis unit that analyzes the sentence data and outputs refined sentence data;
a speech feature vector sequence synthesizing unit for generating a template input and adding the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence; and
A speech reconstruction unit that converts the speech feature vector sequence into speech data
including,
The sentence data analysis unit,
decomposes the sentence data into Hangul jamo units to generate a jamo-unit input, embeds it to form embedded sentence data in the form of a sentence feature vector sequence, and refines the embedded sentence data using a convolutional artificial neural network to form the refined sentence data,
The template, which is an input of the speech feature vector sequence synthesizing unit,
is generated, as a template having time information, by generating absolute positional encoding data containing order and time information and relative positional encoding data containing time information relative to the total length, and then concatenating the generated absolute positional encoding data and relative positional encoding data; the absolute positional encoding data consists of sin and cos waveforms having three parameters, namely a natural number that increases from 1 with the time order in the data, a natural number that increases from 1 with the order of the frequency bins in the data, and the number of frequency bins of the Mel filter bank, and helps the convolutional artificial neural network reflect order and time information in its output; and the relative positional encoding data includes a parameter for the length of the entire speech data and thereby expresses time information relative to the total length,
The speech feature vector sequence synthesizing unit,
generates the template input, adds the refined sentence data to the template input using the attention mechanism to generate an encoded template, synthesizes a Mel filter bank speech feature vector sequence by decoding the encoded template, and synthesizes a log power spectrum speech feature vector sequence from the Mel filter bank speech feature vector sequence,
A non-autoregressive speech synthesis system based on a deep neural network.

delete

According to claim 1,
The sentence data analysis unit,
decomposes the sentence data into Hangul jamo units to generate a jamo-unit input, indexes the jamo-unit input and maps it to numeric data, one-hot encodes the numeric data, and multiplies the one-hot encoded vector sequence by a sentence embedding matrix to generate the embedded sentence data, a vector sequence with continuous values,
A non-autoregressive speech synthesis system based on a deep neural network.

delete

According to claim 1,
The voice reconstruction unit,
Generating phase information from the speech feature vector sequence having magnitude information using a Griffin-lim algorithm and converting it into the speech data
A non-autoregressive speech synthesis system based on a deep neural network.

A sentence data analysis step of analyzing the sentence data and outputting refined sentence data;
a speech feature vector sequence synthesis step of generating a template input and adding the refined sentence data to the generated template using an attention mechanism to generate a speech feature vector sequence; and
Speech reconstruction step of converting the speech feature vector sequence into speech data
including,
The sentence data analysis step is,
forming embedded sentence data in the form of a sentence feature vector string by decomposing the sentence data into Jamo units of Hangul to generate a Jamo unit input; and
Refining the embedded sentence data using a convolutional artificial neural network to form the refined sentence data
including,
The step of synthesizing the speech feature vector sequence,
generating the template input;
a voice data encoding step of generating an encoded template by adding the refined sentence data to the template input using an attention mechanism;
a speech data decoding step of synthesizing a Mel filter bank speech feature vector sequence through decoding the encoded template; and
Synthesizing a log power spectrum speech feature vector sequence from the Mel filter bank speech feature vector sequence
includes,
The step of generating the template input includes:
generating encoded data of an absolute position including order and time information;
generating encoded data of a relative position including relative time information for an entire length; and
generating the template having time information by concatenating the generated encoded data of the absolute position and the encoded data of the relative position
including,
The absolute positional encoding data consists of sin and cos waveforms having three parameters, namely a natural number that increases from 1 with the time order in the data, a natural number that increases from 1 with the order of the frequency bins in the data, and the number of frequency bins of the Mel filter bank, and helps the convolutional artificial neural network reflect order and time information in its output, and the relative positional encoding data includes a parameter for the length of the entire speech data and thereby expresses time information relative to the total length.
Characterized in, a deep neural network-based non-autoregressive speech synthesis method.

delete

8. The method of claim 7,
The step of forming the embedded sentence data in the form of the sentence feature vector column comprises:
generating a Jamo unit input by decomposing the sentence data into Jamo units of Hangul;
indexing the Jamo unit input and mapping it to numeric data;
One-hot encoding the numeric data; and
generating the embedded sentence data consisting of a vector sequence having continuous characteristics by multiplying a one-hot encoded vector sequence with a sentence embedding matrix
Including, a deep neural network-based non-autoregressive speech synthesis method.

delete

8. The method of claim 7,
The voice data encoding step includes:
receiving the refined sentence data and the template as inputs and forming a vector of a fixed length by the attention mechanism selecting a part necessary for log power spectrum synthesis; and
Encoding a template containing accurate information by repeating the convolutional artificial neural network and the attention mechanism
Including, a deep neural network-based non-autoregressive speech synthesis method.

delete

8. The method of claim 7,
The voice reconstruction step is
Generating phase information from the speech feature vector sequence having magnitude information using a Griffin-lim algorithm and converting it into the speech data
Characterized in, a deep neural network-based non-autoregressive speech synthesis method.