CN113488021A - Method for improving naturalness of speech synthesis - Google Patents

Method for improving naturalness of speech synthesis

Info

Publication number
CN113488021A
Authority
CN
China
Prior art keywords
duration
phonemes
phoneme
text
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110906779.3A
Other languages
Chinese (zh)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaoying Innovation Technology Co ltd
Original Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaoying Innovation Technology Co ltd filed Critical Hangzhou Xiaoying Innovation Technology Co ltd
Priority to CN202110906779.3A
Publication of CN113488021A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method for improving the naturalness of speech synthesis. It comprises the following steps: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, forming a phoneme dictionary from all the phonemes, representing the phonemes of the text with an embedding layer whose input dimension is the size of the phoneme dictionary, and encoding the represented features with a CBHG module; taking the text encoding result as input, predicting the duration of each phoneme, comparing the predictions with the real labels, and optimizing the duration model; decoding the features expanded by the duration model, combining the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform of the original audio, and restoring the decoded complex feature to a speech waveform through an inverse short-time Fourier transform. The invention has the beneficial effects that: it reduces model complexity and the amount of computation, saving computing and deployment costs; and it improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.

Description

Method for improving naturalness of speech synthesis
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a method for improving the naturalness of speech synthesis.
Background
Speech synthesis has benefited greatly from the development of deep learning and its application across many fields. Its development can be roughly divided into two stages. 1. Concatenative and parametric methods. The concatenative (splicing) method searches a relatively large corpus for speech segments matching the text to be synthesized and joins them together. Although the synthesized speech is a real person's voice, the expression of global characteristics such as intonation and prosody is limited; the method also requires a large corpus and places high demands on the dataset. The parametric method builds a statistical mapping model between text parameters and acoustic parameters; its drawback is that the synthesized speech sounds mechanically unnatural and parameter tuning is troublesome. 2. Deep-learning-based methods. Deep-learning-based speech synthesis has evolved toward end-to-end systems, and synthesis quality keeps improving, but truly end-to-end models remain rare: most systems bridge text and speech through a mel spectrogram, which costs the synthesized speech some naturalness.
In existing speech synthesis technology, a text normalization module first processes the text into phonemes as input, an embedding layer represents the text or phonemes as features, and feature-extraction networks then encode the represented features. The encoded features have the same length as the input phoneme sequence; only the dimension increases from one to many. The pronunciation duration of each text unit or phoneme is then predicted from the text encoding result. The predicted durations are rounded, and their count matches the phoneme sequence length. The encoded features are expanded according to the rounded durations, finally yielding a text encoding whose length matches the mel spectrogram extracted from real speech. The expanded features are decoded by a deep learning network, and a loss is computed against the mel spectrogram extracted from real speech. Separately, taking the mel spectrogram extracted from real speech as input, a neural vocoder such as WaveNet, Parallel WaveNet, or HiFi-GAN is used to predict the real speech waveform. At synthesis time, however, the vocoder's input is the decoded mel spectrogram, not a real one. The prior art thus predicts a mel spectrogram from text and then predicts the speech waveform from the predicted mel spectrogram with a vocoder, and the objective functions of these two processes are not consistent.
Disclosure of Invention
The invention provides a method for improving the naturalness of speech synthesis that reduces the amount of computation and overcomes the shortcomings of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for improving the naturalness of speech synthesis specifically comprises the following steps:
(1) text encoding: obtain the phonemes corresponding to the text with a grapheme-to-phoneme tool, then form a phoneme dictionary from all the phonemes, and use the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., map each phoneme to a feature vector through an embedding in deep learning;
(2) a CBHG module encodes the represented features; the represented features are the feature vectors from deep learning, and encoding means mapping them to another feature vector through the CBHG module;
(3) duration model: taking the text encoding result as input, predict the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer; the duration here is the single value the network predicts for each phoneme;
(4) compare the predictions with the real labels and optimize the duration model; the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model;
(5) speech decoding: decode the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combine the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio;
(6) the decoded complex features are restored to a speech waveform through an inverse short-time Fourier transform.
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme pronunciation durations, the speaker's characteristics, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the shortcomings of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps and synthesizes more natural speech. The invention provides an end-to-end speech synthesis system that, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computing and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Preferably, in step (2), the CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
Preferably, step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, expanding the encoded phoneme features according to the duration values.
The invention has the beneficial effects that: it reduces model complexity and the amount of computation, saving computing and deployment costs; and it improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
In the embodiment shown in FIG. 1, a method for improving naturalness of speech synthesis specifically includes the following steps:
(1) Text encoding: obtain the phonemes corresponding to the text with a grapheme-to-phoneme tool, then form a phoneme dictionary from all the phonemes, and use the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., map each phoneme to a feature vector through an embedding in deep learning, as sketched below.
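A minimal sketch of this embedding step, assuming PyTorch; the toy phoneme dictionary and the embedding dimension of 256 are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

# Toy phoneme dictionary; in practice it holds every phoneme the
# grapheme-to-phoneme tool can produce.
phoneme_dict = {"sil": 0, "n": 1, "i3": 2, "h": 3, "ao3": 4}

# The dictionary size is the input dimension (vocabulary size) of the
# embedding layer; the output dimension 256 is an assumption.
embedding = nn.Embedding(num_embeddings=len(phoneme_dict), embedding_dim=256)

phoneme_ids = torch.tensor([[phoneme_dict[p] for p in ("n", "i3", "h", "ao3")]])
features = embedding(phoneme_ids)  # shape (1, 4, 256): one vector per phoneme
```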
(2) A CBHG module encodes the represented features; the represented features are the feature vectors from deep learning, and encoding means mapping them to another feature vector through the CBHG module. The CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
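A compact sketch of such a CBHG encoder, assuming PyTorch; the kernel-size range, number of highway layers, and feature width are illustrative assumptions (the patent names only the three components):

```python
import torch
import torch.nn as nn

class CBHG(nn.Module):
    def __init__(self, dim=256, bank_size=8, highway_layers=4):
        super().__init__()
        # Bank of 1-D convolutional filters with kernel sizes 1..bank_size
        self.conv_bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
             for k in range(1, bank_size + 1)])
        self.projection = nn.Conv1d(bank_size * dim, dim, kernel_size=3, padding=1)
        # Highway network: gated mixing of a transformed signal and its input
        self.highways = nn.ModuleList(
            [nn.Linear(dim, 2 * dim) for _ in range(highway_layers)])
        # Bidirectional GRU; hidden size halved so the output width stays `dim`
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, dim)
        t = x.size(1)
        y = x.transpose(1, 2)                  # (batch, dim, time) for Conv1d
        bank = torch.cat([conv(y)[:, :, :t] for conv in self.conv_bank], dim=1)
        y = self.projection(bank).transpose(1, 2) + x   # residual connection
        for layer in self.highways:
            h, gate = layer(y).chunk(2, dim=-1)
            g = torch.sigmoid(gate)
            y = g * torch.relu(h) + (1 - g) * y
        out, _ = self.gru(y)                   # (batch, time, dim)
        return out
```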
(3) Duration model: taking the text encoding result as input, predict the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer; the duration here is the single value the network predicts for each phoneme. A sketch follows.
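A sketch of such a duration model under stated assumptions: PyTorch, kernel size 3, and channel width 256 (the patent specifies only the 3 convolution layers and 1 fully connected layer):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(            # 3-layer convolutional network
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU())
        self.fc = nn.Linear(dim, 1)            # 1-layer fully connected head

    def forward(self, x):                      # x: (batch, phonemes, dim)
        y = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.fc(y).squeeze(-1)          # one duration per phoneme
```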
(4) Compare the predictions with the real labels and optimize the duration model; the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model. Specifically, after the pronunciation duration of each phoneme is obtained, the encoded phoneme features are expanded according to the duration values. Consider the input and output before and after the length adjuster in FIG. 1: if there are three phonemes a, b, c with predicted durations 2, 3 and 4 respectively, the expanded sequence is aabbbcccc.
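A minimal sketch of this step, assuming PyTorch and a mean-squared-error duration loss (the patent does not name a specific loss function); the expansion reproduces the aabbbcccc example above:

```python
import torch
import torch.nn.functional as F

# Duration loss: network-predicted vs. real per-phoneme durations (in frames).
predicted = torch.tensor([2.3, 2.8, 4.1])
target = torch.tensor([2.0, 3.0, 4.0])
duration_loss = F.mse_loss(predicted, target)  # reduced during training

# Length adjustment: expand encoded phoneme features by the rounded durations.
encoded = torch.randn(3, 256)                  # features for phonemes a, b, c
durations = torch.tensor([2, 3, 4])            # rounded predicted durations
expanded = torch.repeat_interleave(encoded, durations, dim=0)
print(expanded.shape)                          # torch.Size([9, 256]), i.e. aabbbcccc
```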
(5) Speech decoding: decode the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combine the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio. The 2-layer bidirectional long short-term memory network is a bidirectional LSTM. The complex feature differs from ordinary features, which generally live in the real domain: the complex domain has one more component than the real domain, i.e., the feature consists of a real part and an imaginary part. The short-time Fourier transform (STFT) is a standard mathematical operation and can also be implemented by a neural network. A sketch follows.
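A sketch of such a decoder under stated assumptions: PyTorch, an assumed FFT size of 1024, and two linear projections producing the real and imaginary parts (the patent gives no concrete STFT parameters):

```python
import torch
import torch.nn as nn

class ComplexDecoder(nn.Module):
    def __init__(self, dim=256, n_fft=1024):
        super().__init__()
        # 2-layer bidirectional LSTM over the duration-expanded features
        self.lstm = nn.LSTM(dim, dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        n_bins = n_fft // 2 + 1
        self.real_proj = nn.Linear(2 * dim, n_bins)  # real part per frame
        self.imag_proj = nn.Linear(2 * dim, n_bins)  # imaginary part per frame

    def forward(self, x):                   # x: (batch, frames, dim)
        y, _ = self.lstm(x)
        # Combine the two decoded parts into one complex-valued feature
        return torch.complex(self.real_proj(y), self.imag_proj(y))
```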
(6) The decoded complex features are restored to a speech waveform through an inverse short-time Fourier transform, as sketched below.
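A minimal sketch of the reconstruction, assuming PyTorch's built-in inverse STFT; the FFT size, hop length, and window are assumptions and must match the analysis settings used on the original audio:

```python
import torch

n_fft, hop = 1024, 256
# A decoded complex-valued spectrogram: (frequency bins, frames).
spec = torch.randn(n_fft // 2 + 1, 80, dtype=torch.complex64)

waveform = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))  # 1-D waveform samples
```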
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme pronunciation durations, the speaker's characteristics, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the shortcomings of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps and synthesizes more natural speech. The invention provides an end-to-end speech synthesis system that, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computing and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.

Claims (3)

1. A method for improving the naturalness of speech synthesis, characterized by comprising the following steps:
(1) text encoding: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, then forming a phoneme dictionary from all the phonemes, and using the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., mapping each phoneme to a feature vector through an embedding in deep learning;
(2) encoding the represented features with a CBHG module, wherein the represented features are the feature vectors from deep learning and encoding means mapping them to another feature vector through the CBHG module;
(3) duration model: taking the text encoding result as input and predicting the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer, wherein the duration is the single value the network predicts for each phoneme;
(4) comparing the predictions with the real labels and optimizing the duration model, wherein the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model;
(5) speech decoding: decoding the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combining the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio;
(6) restoring the decoded complex features to a speech waveform through an inverse short-time Fourier transform.
2. The method of claim 1, wherein in step (2) the CBHG module comprises a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
3. The method of claim 1, wherein step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, expanding the encoded phoneme features according to the duration values.
CN202110906779.3A 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis Pending CN113488021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110906779.3A CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110906779.3A CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Publications (1)

Publication Number Publication Date
CN113488021A true CN113488021A (en) 2021-10-08

Family

ID=77946052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110906779.3A Pending CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Country Status (1)

Country Link
CN (1) CN113488021A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739508A (en) * 2020-08-07 2020-10-02 浙江大学 End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
WO2021127821A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis model training method, apparatus, computer device, and storage medium


Similar Documents

Publication Publication Date Title
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
Kleijn et al. Wavenet based low rate speech coding
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111179905A (en) Rapid dubbing generation method and device
CN110767210A (en) Method and device for generating personalized voice
CN113112995B (en) Word acoustic feature system, and training method and system of word acoustic feature system
CN112489629A (en) Voice transcription model, method, medium, and electronic device
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
KR102272554B1 (en) Method and system of text to multiple speech
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
US20240127832A1 (en) Decoder
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
Zhao et al. Research on voice cloning with a few samples
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN113436607B (en) Quick voice cloning method
KR20230075340A (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114203151A (en) Method, device and equipment for training speech synthesis model
CN113488021A (en) Method for improving naturalness of speech synthesis
CN113327578A (en) Acoustic model training method and device, terminal device and storage medium
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination