CN116524894A - Vocoder construction method, voice synthesis method and related devices
- Publication number
- CN116524894A (application number CN202310081092.XA)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- unit
- convolution
- convolution layer
- phase
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The embodiment of the application discloses a vocoder construction method, a voice synthesis method and a related device. Target acoustic features are first obtained and input respectively into an amplitude spectrum prediction model and a phase spectrum prediction model to obtain a first logarithmic amplitude spectrum and a first phase spectrum, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform. The amplitude spectrum loss, phase spectrum loss, short-time spectrum loss and waveform loss are calculated, and correction parameters are calculated from these losses. The amplitude spectrum prediction model and the phase spectrum prediction model are corrected according to the correction parameters to obtain an amplitude spectrum predictor and a phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor can directly predict the amplitude spectrum and the phase spectrum in parallel, which improves the efficiency of voice generation and reduces the overall computational complexity.
Description
Technical Field
The present disclosure relates to the field of speech signal processing, and in particular, to a method for constructing a vocoder, a method for synthesizing speech, and a related apparatus.
Background
Speech synthesis aims to make machines speak as smoothly and naturally as humans, which benefits many voice-interaction applications. Currently, statistical parametric speech synthesis (statistical parametric speech synthesis, SPSS) is one of the dominant approaches.
A statistical parametric speech synthesis framework consists of an acoustic model (acoustic model) and a vocoder (vocoder). The vocoder converts the acoustic features into the final speech waveform, and its performance can significantly affect the quality of the synthesized speech. With the development of neural networks, autoregressive neural network vocoders represented by WaveNet and SampleRNN were proposed and significantly improved the quality of synthesized speech, but they are limited by the autoregressive generation mode and have low generation efficiency. Subsequently, neural network vocoders based on knowledge distillation, neural network vocoders based on inverse autoregressive flow, and neural network glottal models combined with linear autoregressive neural network vocoders were proposed in turn; although their generation efficiency is improved, their overall computational complexity is high. Recently, non-autoregressive, non-streaming neural network vocoders have become mainstream; most of them use neural networks to achieve a direct mapping from acoustic features to speech waveforms and define generative adversarial loss functions between the predicted and real waveforms. However, limited by the direct prediction of waveforms, their generation efficiency still needs to be improved.
Therefore, how to provide a vocoder with high speech generation efficiency and simple overall operation is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the above problems, the present application provides a method for constructing a vocoder, a method for synthesizing voice, and a related device, thereby providing a vocoder with high voice generation efficiency and simple operation. The embodiment of the application discloses the following technical scheme:
a method of constructing a vocoder, the vocoder comprising: an amplitude spectrum predictor and a phase spectrum predictor, the method comprising:
acquiring a target acoustic feature;
inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
Respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform;
calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss;
correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain a corrected amplitude spectrum prediction model serving as the amplitude spectrum predictor;
and correcting the phase spectrum prediction model according to the correction parameters so as to obtain a corrected phase spectrum prediction model serving as the phase spectrum predictor.
In one possible implementation, the method further includes:
comparing the correction parameter with a preset parameter;
in response to the correction parameter being less than or equal to the preset parameter, executing the step of correcting the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the step of correcting the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor;
and in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic features into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, and the subsequent steps, until the correction parameter meets the preset parameter.
In one possible implementation, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer;
the first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence;
the first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics;
the first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer;
the first output convolution layer is used for carrying out convolution calculation on the calculation result of the first residual convolution network so as to obtain a second logarithmic amplitude spectrum.
In one possible implementation, the phase spectrum prediction model includes: the second input convolution layer, the second residual convolution network, the second output convolution layer, the third output convolution layer and the phase calculation module;
the second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer;
the second input convolution layer is used for carrying out convolution calculation on the target acoustic features;
the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer;
the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
and the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
In one possible implementation manner, the first residual convolution network and the second residual convolution network are each formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers;
the residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer;
the first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks;
the averaging unit is used for averaging the calculation result of the first adding unit;
and the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
In one possible implementation, the residual convolution sub-block includes: a second LReLU unit, an expanded (dilated) convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit;
the second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second addition unit are connected in sequence; the second LReLU unit is configured to activate the matrix input to the second LReLU unit to obtain a second activation matrix;
the expanded convolution layer is used for carrying out convolution calculation on the first activation matrix;
the third LReLU unit is configured to activate the calculation result of the expanded convolution layer to obtain a third activation matrix;
the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix;
and the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In one possible implementation, the initial parameters of the first input convolution layer, the first output convolution layer, the second output convolution layer, the third output convolution layer, and the fourth output convolution layer are all randomly initialized.
A method of speech synthesis, the method comprising:
acquiring acoustic features to be synthesized;
inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the construction method of the vocoder;
inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the construction method of the vocoder;
Calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum;
preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
In a possible implementation manner, the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized includes:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
A device for constructing a vocoder, the device comprising:
a first acquisition unit configured to acquire a target acoustic feature;
the first input unit is used for inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
the second input unit is used for inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
The first calculation unit is used for calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
the first preprocessing unit is used for preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
a second calculation unit, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed voice waveform;
a third calculation unit, configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss;
the first correction unit is used for correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor;
and the second correction unit is used for correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
A speech synthesis apparatus, the apparatus comprising:
the second acquisition unit is used for acquiring acoustic features to be synthesized;
the third input unit is used for inputting the acoustic features to be synthesized into a pre-constructed amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum;
The fourth input unit is used for inputting the acoustic features to be synthesized into a pre-constructed phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized;
a fourth calculation unit, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum;
the second preprocessing unit is used for preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
the first converting unit is used for converting the second reconstructed voice waveform into the synthesized voice corresponding to the acoustic feature to be synthesized.
Compared with the prior art, the application has the following beneficial effects:
the application provides a vocoder construction method, a voice synthesis method and a related device. Specifically, when the method for constructing a vocoder provided in the embodiments of the present application is executed, the acquisition target acoustic feature may be acquired first. Inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum; and inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features, and calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum. And then preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the acoustic feature to be synthesized. And then, respectively calculating the amplitude spectrum loss of the first pair of amplitude spectrums, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform, and calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor; and correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor of the method are all of full-frame level, and can be used for directly predicting the voice amplitude spectrum and the phase spectrum in parallel, so that the voice generation efficiency is remarkably improved, and the complexity of overall operation is reduced. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing amplitude spectrum loss, phase spectrum loss, short-time spectrum loss, and waveform loss.
Drawings
In order to more clearly illustrate the present embodiments or the technical solutions in the prior art, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a method flowchart of a method for constructing a vocoder according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for constructing a vocoder according to an embodiment of the present application;
fig. 3 is a method flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an amplitude spectrum prediction model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a residual convolution network according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a residual convolution sub-block according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a phase spectrum prediction model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of still another residual convolution sub-block according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, the following description will first explain the background technology related to the embodiments of the present application.
Speech synthesis aims to make machines speak as smoothly and naturally as humans, which benefits many voice-interaction applications. Currently, statistical parametric speech synthesis (statistical parametric speech synthesis, SPSS) is one of the dominant approaches.
A statistical parametric speech synthesis framework consists of an acoustic model (acoustic model) and a vocoder (vocoder). The vocoder converts the acoustic features into the final speech waveform. The performance of the vocoder can significantly affect the quality of the synthesized speech.
Conventional vocoders such as STRAIGHT and WORLD are widely used in current statistical parametric speech synthesis systems. However, these conventional vocoders suffer from drawbacks such as the loss of spectral details and phase information, which can degrade the listening quality of the synthesized speech.
Currently, with the development of neural networks, autoregressive neural network vocoders represented by WaveNet and SampleRNN have been proposed and significantly improve the quality of synthesized speech, but they are limited by the autoregressive generation mode and have low generation efficiency. Subsequently, neural network vocoders based on knowledge distillation, neural network vocoders based on inverse autoregressive flow, and neural network glottal models combined with linear autoregressive neural network vocoders were proposed in turn; although their generation efficiency is improved, their overall computational complexity is high. Recently, non-autoregressive, non-streaming neural network vocoders have become mainstream; most of them use neural networks to achieve a direct mapping from acoustic features to speech waveforms and define generative adversarial loss functions between the predicted and real waveforms. However, limited by the direct prediction of waveforms, their generation efficiency still needs to be improved.
In order to solve the above problems, the embodiments of the present application provide a vocoder construction method, a voice synthesis method and a related device. First, target acoustic features are acquired and input into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum; the target acoustic features are also input into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform corresponding to the target acoustic features. Then, the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed voice waveform are calculated respectively, and correction parameters are obtained from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, the amplitude spectrum prediction model is corrected according to the correction parameters to obtain an amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameters to obtain the phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor both operate entirely at the frame level and can directly predict the speech amplitude spectrum and phase spectrum in parallel, which significantly improves the efficiency of voice generation and reduces the overall computational complexity. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, which is a method flowchart of a method for constructing a vocoder according to an embodiment of the present application, as shown in fig. 1, the method for constructing a vocoder may include steps S101 to S109:
s101: target acoustic features are acquired.
To construct a vocoder, the vocoder's construction system may first obtain the target acoustic features.
The target acoustic features are obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "today's weather is good", the acoustic model can convert it into the corresponding target acoustic features, and the vocoder can then perform audio synthesis based on the target acoustic features to obtain clean synthesized audio data.
The acoustic features may include, but are not limited to, at least one spectral parameter such as the spectrum or the cepstrum. In addition, one or more of the fundamental frequency and voiced/unvoiced flags may be included. In the present embodiment, the acoustic feature is described by taking a spectrum as an example, specifically a Mel spectrogram (mel-spectrogram). In other embodiments, the acoustic feature may be a cepstrum plus the fundamental frequency, optionally combined with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic features as those used when training the vocoder must be prepared as input. For example, if the acoustic feature used in training is an 80-dimensional Mel spectrogram, then an 80-dimensional Mel spectrogram is also taken as input in application.
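For illustration only, the following is a minimal sketch (not part of the original disclosure) of one way an 80-dimensional Mel spectrogram could be prepared as the acoustic feature, assuming the librosa library; the sample rate, FFT size and hop length below are assumptions, not values specified in this application.

```python
import librosa
import numpy as np

def extract_mel_spectrogram(wav_path: str, n_mels: int = 80) -> np.ndarray:
    # load the natural waveform (sample rate is an illustrative assumption)
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )                                         # shape: (n_mels, num_frames)
    log_mel = np.log(mel + 1e-5)              # log compression for numerical stability
    return log_mel.T                          # shape: (num_frames, n_mels)
```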
S102: and inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, wherein the first logarithmic amplitude spectrum comprises a first amplitude spectrum.
After the target acoustic features are obtained, the vocoder building system may input the target acoustic features into the amplitude spectrum prediction model, so as to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, where the first logarithmic amplitude spectrum comprises the first amplitude spectrum.
Referring to fig. 5, a schematic structural diagram of an amplitude spectrum prediction model provided in an embodiment of the present application, as shown in fig. 5, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer.
The first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence.
The first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics.
The first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
The first output convolution layer is used for carrying out convolution calculation on the calculation result of the first residual convolution network so as to obtain a second logarithmic amplitude spectrum.
The initial parameters of the first input convolution layer and the first output convolution layer are both randomly initialized.
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
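For illustration, the following is a minimal PyTorch sketch of the layout described above (first input convolution layer, first residual convolution network, first output convolution layer). The channel counts, kernel sizes and class name are assumptions, and the residual convolution network is left as a pluggable module (see the sub-block sketch further below).

```python
import torch
import torch.nn as nn

class AmplitudeSpectrumPredictor(nn.Module):
    def __init__(self, acoustic_dim=80, hidden_dim=512, spec_bins=513,
                 residual_network: nn.Module = None):
        super().__init__()
        # first input convolution layer: acoustic features -> hidden channels
        self.input_conv = nn.Conv1d(acoustic_dim, hidden_dim, kernel_size=7, padding=3)
        # first residual convolution network (placeholder if none is supplied)
        self.residual_network = residual_network or nn.Identity()
        # first output convolution layer: hidden channels -> log amplitude spectrum bins
        self.output_conv = nn.Conv1d(hidden_dim, spec_bins, kernel_size=7, padding=3)

    def forward(self, acoustic_features):
        # acoustic_features: (batch, acoustic_dim, num_frames)
        x = self.input_conv(acoustic_features)
        x = self.residual_network(x)
        log_amplitude = self.output_conv(x)   # (batch, spec_bins, num_frames)
        return log_amplitude
```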
Referring to fig. 6, which is a schematic structural diagram of a residual convolution network provided in this embodiment of the present application, as shown in fig. 6, the first residual convolution network is formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit, and a first LReLU unit, where each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
In some possible implementations, the LReLU unit, i.e., the leaky rectified linear unit (Leaky ReLU) function, is a variant of the classical and widely used ReLU activation function in which the output has a small slope for negative inputs. Since the derivative is never zero, this reduces the occurrence of inactive ("dead") neurons and allows gradient-based learning (although it may be slow), solving the problem that neurons stop learning after the ReLU function enters the negative interval. Activation by the LReLU unit means applying this function to the calculation result.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
Referring to fig. 7, which is a schematic structural diagram of a residual convolution sub-block provided in an embodiment of the present application, as shown in fig. 7, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer, and the second addition unit are sequentially connected.
And the second LReLU unit is used for activating the matrix input into the second LReLU unit to obtain a second activation matrix.
In one possible implementation, the activation of the second LReLU unit is a function operation on the matrix input to the second LReLU unit.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLU unit is used for activating the calculation result of the expanded convolution layer to obtain a third activation matrix.
In some possible implementations, the activation of the third LReLU unit is a function operation on the calculation result of the expanded convolution layer.
And the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix.
In some possible implementations, the initial parameters of the fourth output convolution layer and the expanded convolution layer are both randomly initialized.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the first input convolution layer, and the second addition unit is also connected to the first input convolution layer.
At this time, the second LReLU unit is configured to activate the calculation result of the first input convolution layer to obtain a second activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the calculation result of the first input convolution layer.
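A hedged PyTorch sketch of one residual convolution sub-block as described above (second LReLU unit, expanded/dilated convolution layer, third LReLU unit, fourth output convolution layer, second addition unit) follows; the channel count, kernel size, dilation and LReLU slope are illustrative assumptions, not values specified in this application.

```python
import torch
import torch.nn as nn

class ResidualConvSubBlock(nn.Module):
    def __init__(self, channels=512, kernel_size=3, dilation=1, lrelu_slope=0.1):
        super().__init__()
        self.lrelu1 = nn.LeakyReLU(lrelu_slope)            # "second LReLU unit"
        self.dilated_conv = nn.Conv1d(                     # expanded (dilated) convolution layer
            channels, channels, kernel_size,
            dilation=dilation, padding=(kernel_size - 1) // 2 * dilation,
        )
        self.lrelu2 = nn.LeakyReLU(lrelu_slope)            # "third LReLU unit"
        self.out_conv = nn.Conv1d(                         # "fourth output convolution layer"
            channels, channels, kernel_size, padding=(kernel_size - 1) // 2
        )

    def forward(self, x):
        y = self.lrelu1(x)          # activate the matrix input to the sub-block
        y = self.dilated_conv(y)
        y = self.lrelu2(y)
        y = self.out_conv(y)
        return x + y                # "second addition unit": residual addition with the input
```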
S103: and inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features.
After the target acoustic feature is obtained, the vocoder building system may input the target acoustic feature into the phase spectrum prediction model, so as to obtain a first phase spectrum corresponding to the target acoustic feature.
Referring to fig. 8, a schematic structural diagram of a phase spectrum prediction model provided in an embodiment of the present application, as shown in fig. 8, the phase spectrum prediction model includes: the system comprises a second input convolution layer, a second residual convolution network, a second output convolution layer, a third output convolution layer and a phase calculation module.
The second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer.
The second input convolution layer is used for carrying out convolution calculation on the target acoustic features.
And the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
And the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
The initial parameters of the second output convolution layer and the third output convolution layer are both randomly initialized. Because these initial parameters are set randomly, the parameters of the second output convolution layer and the third output convolution layer are different.
In some possible implementations, the phase calculation module is formulated as follows:
Φ(R, I) = arctan(I / R) - (π / 2) · sgn*(I) · [sgn*(R) - 1]
where R is the calculation result of the second output convolution layer and I is the calculation result of the third output convolution layer; Φ(0, 0) = 0. When R ≥ 0, sgn*(R) = 1; when R < 0, sgn*(R) = -1. When I ≥ 0, sgn*(I) = 1; when I < 0, sgn*(I) = -1.
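A small NumPy sketch of the phase calculation Φ(R, I) as reconstructed above is shown below; the function name and the epsilon guard against division by zero are assumptions for illustration.

```python
import numpy as np

def phase_from_real_imag(R: np.ndarray, I: np.ndarray) -> np.ndarray:
    """Sketch of Phi(R, I) following the sgn* definitions above, with Phi(0, 0) = 0."""
    sgn_R = np.where(R >= 0, 1.0, -1.0)
    sgn_I = np.where(I >= 0, 1.0, -1.0)
    eps = 1e-12                                   # guard against division by zero
    phase = np.arctan(I / (R + sgn_R * eps)) - (np.pi / 2.0) * sgn_I * (sgn_R - 1.0)
    return np.where((R == 0) & (I == 0), 0.0, phase)   # enforce Phi(0, 0) = 0
```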
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
As shown in fig. 6, the second residual convolution network in the phase spectrum prediction model is likewise formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, where each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
Both the first residual convolution network and the second residual convolution network are formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, but the initial parameters of the N residual convolution blocks in the first residual convolution network and of the N residual convolution blocks in the second residual convolution network are set randomly. Because the initial parameters of the units and modules in the second residual convolution network of the phase spectrum prediction model and in the first residual convolution network of the amplitude spectrum prediction model are set randomly, their parameters are different.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
Referring to fig. 9, which is a schematic structural diagram of another residual convolution sub-block provided in an embodiment of the present application, as shown in fig. 9, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer, and the second addition unit are sequentially connected.
And the second LReLU unit is used for activating the matrix input into the second LReLU unit to obtain a second activation matrix.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLU unit is used for activating the calculation result of the expanded convolution layer to obtain a third activation matrix.
The fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the second input convolution layer, and the second addition unit is also connected to the second input convolution layer.
At this time, the second LReLU unit is configured to activate the calculation result of the second input convolution layer to obtain a second activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the calculation result of the second input convolution layer.
S104: and calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum.
After obtaining the first amplitude spectrum contained in the first logarithmic amplitude spectrum and the first phase spectrum, the vocoder building system may calculate a first reconstructed short-time spectrum from the first amplitude spectrum and the first phase spectrum.
In one possible implementation manner, the first reconstructed short-time spectrum Ŝ is calculated from the first amplitude spectrum and the first phase spectrum as follows:
Ŝ = Â · e^(jP̂) = Â · (cos P̂ + j sin P̂)
where Â is the first amplitude spectrum, i.e., the amplitude spectrum part recovered from the first logarithmic amplitude spectrum, and P̂ is the first phase spectrum.
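A minimal sketch of this calculation, assuming PyTorch tensors of shape (batch, frequency bins, frames); the function name is illustrative.

```python
import torch

def reconstruct_short_time_spectrum(log_amplitude: torch.Tensor,
                                    phase: torch.Tensor) -> torch.Tensor:
    amplitude = torch.exp(log_amplitude)        # recover the amplitude spectrum from its logarithm
    real = amplitude * torch.cos(phase)         # real part of the reconstructed spectrum
    imag = amplitude * torch.sin(phase)         # imaginary part of the reconstructed spectrum
    return torch.complex(real, imag)            # reconstructed short-time spectrum
```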
S105: and preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature.
After the first reconstructed short-time spectrum is obtained by calculation according to the first amplitude spectrum and the first phase spectrum, the vocoder building system can perform preprocessing such as inverse short-time fourier transform on the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature.
In a possible implementation manner, preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature includes:
and performing inverse short-time Fourier transform on the first reconstructed short-time spectrum to obtain the first reconstructed voice waveform corresponding to the target acoustic feature.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short Time Fourier Transform, ISTFT), the inverse Fourier transform is performed on each frame of the frequency-domain signal, the result of the inverse transform is then windowed (with the same window type, window length and overlap used during framing), and finally the windowed signals of all frames are overlap-added and divided by the overlap-added squares of the window functions, so that the original signal can be reconstructed.
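A sketch of this preprocessing step, assuming torch.istft is used; the FFT size, hop length and Hann window below are illustrative and would need to match the parameters used when extracting the natural short-time spectra.

```python
import torch

def short_time_spectrum_to_waveform(spectrum: torch.Tensor,
                                    n_fft: int = 1024,
                                    hop_length: int = 256) -> torch.Tensor:
    # spectrum: complex tensor of shape (batch, n_fft // 2 + 1, num_frames)
    window = torch.hann_window(n_fft)
    waveform = torch.istft(spectrum, n_fft=n_fft, hop_length=hop_length,
                           win_length=n_fft, window=window)
    return waveform                             # shape: (batch, num_samples)
```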
S106: and respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform.
After the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform corresponding to the target acoustic feature, the vocoder building system can calculate the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed voice waveform respectively.
In one possible implementation, the amplitude spectrum loss L_A of the first logarithmic amplitude spectrum is calculated as follows:
L_A = mean( (log Â - log A)² )
where log Â is the first logarithmic amplitude spectrum and log A is the natural logarithmic amplitude spectrum;
S is the natural short-time complex spectrum extracted from the natural waveform by the short-time Fourier transform (Short Time Fourier Transform, STFT), and Re and Im represent the real and imaginary parts of S, respectively.
In one possible implementation, the phase spectrum loss L_P of the first phase spectrum is calculated as follows:
L_P = L_IP + L_GD + L_IAF
where L_IP is the instantaneous phase loss, L_GD is the group delay loss, and L_IAF is the instantaneous angular frequency loss.
The instantaneous phase loss is defined as the negative cosine loss between the first (predicted) phase spectrum P̂ and the natural phase spectrum P, namely:
L_IP = mean( -cos(P̂ - P) )
The group delay loss is defined as the negative cosine loss between the predicted group delay Δ_DF P̂ and the natural group delay Δ_DF P, namely:
L_GD = mean( -cos(Δ_DF P̂ - Δ_DF P) )
The instantaneous angular frequency loss is defined as the negative cosine loss between the predicted instantaneous angular frequency Δ_DT P̂ and the natural instantaneous angular frequency Δ_DT P, namely:
L_IAF = mean( -cos(Δ_DT P̂ - Δ_DT P) )
where Δ_DF and Δ_DT denote differencing along the frequency axis and along the time axis, respectively. The natural phase spectrum is calculated as P = Φ(Re(S), Im(S)).
Here Re(S) is the real part of the natural short-time complex spectrum S and Im(S) is its imaginary part; Φ(0, 0) = 0. When Re(S) ≥ 0, sgn*(Re(S)) = 1; when Re(S) < 0, sgn*(Re(S)) = -1. When Im(S) ≥ 0, sgn*(Im(S)) = 1; when Im(S) < 0, sgn*(Im(S)) = -1.
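A hedged PyTorch sketch of the three phase sub-losses described above, implemented as negative cosine losses with differencing along the frequency and time axes; the tensor layout (batch, frequency bins, frames) and the equal weighting of the three terms follow the reconstruction above and are assumptions.

```python
import torch

def negative_cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean(-torch.cos(pred - target))

def phase_spectrum_loss(pred_phase: torch.Tensor, natural_phase: torch.Tensor) -> torch.Tensor:
    # instantaneous phase loss
    l_ip = negative_cosine_loss(pred_phase, natural_phase)
    # group delay loss: difference along the frequency axis (dim=1)
    l_gd = negative_cosine_loss(torch.diff(pred_phase, dim=1),
                                torch.diff(natural_phase, dim=1))
    # instantaneous angular frequency loss: difference along the time axis (dim=2)
    l_iaf = negative_cosine_loss(torch.diff(pred_phase, dim=2),
                                 torch.diff(natural_phase, dim=2))
    return l_ip + l_gd + l_iaf
```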
In one possible implementation, the short-time spectrum loss L_S of the first reconstructed short-time spectrum is calculated as follows.
The short-time spectrum loss L_S is used to improve the degree of matching between the predicted amplitude spectrum Â and phase spectrum P̂ and to guarantee the quality of the short-time spectrum Ŝ reconstructed from them (i.e., Ŝ = Â · e^(jP̂)). It comprises three sub-losses: the real part loss L_R, the imaginary part loss L_I, and the short-time spectrum consistency loss L_C. The real part loss is defined as the absolute error loss between the reconstructed real part Re(Ŝ) and the natural real part Re(S), and the imaginary part loss is defined as the absolute error loss between the reconstructed imaginary part Im(Ŝ) and the natural imaginary part Im(S), namely:
L_R = mean( |Re(Ŝ) - Re(S)| ),  L_I = mean( |Im(Ŝ) - Im(S)| )
The short-time spectrum consistency loss is defined between the reconstructed short-time spectrum Ŝ and its consistent short-time spectrum Ŝ_C, and is used to reduce the gap between the two. Since the amplitude spectrum and the phase spectrum are predicted, and the consistent short-time spectrum domain is only a subset of the complex domain, the reconstructed short-time spectrum Ŝ is not necessarily a truly existing short-time spectrum. The truly existing short-time spectrum Ŝ_C corresponding to Ŝ is obtained by performing the inverse short-time Fourier transform on Ŝ to obtain a waveform and then performing the short-time Fourier transform on that waveform, namely:
Ŝ_C = STFT( ISTFT( Ŝ ) )
The short-time spectrum consistency loss is defined as the two-norm between Ŝ_C and Ŝ, written in terms of their real and imaginary parts:
L_C = ‖ Re(Ŝ_C) - Re(Ŝ) ‖₂ + ‖ Im(Ŝ_C) - Im(Ŝ) ‖₂
Finally, the short-time spectrum loss L_S is a linear combination of the real part loss L_R, the imaginary part loss L_I, and the short-time spectrum consistency loss L_C in a certain proportion, namely:
L_S = λ_RI · (L_R + L_I) + L_C
where λ_RI is a short-time spectrum loss hyperparameter, which can be manually determined and adjusted.
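A sketch of the short-time spectrum loss, assuming PyTorch complex tensors of shape (batch, frequency bins, frames); the value and placement of the λ_RI weight, the FFT parameters, and the use of a torch.istft followed by torch.stft round trip for the consistency term are assumptions for illustration.

```python
import torch

def short_time_spectrum_loss(recon_spec: torch.Tensor, natural_spec: torch.Tensor,
                             lambda_ri: float = 45.0,
                             n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    # real part loss and imaginary part loss (absolute error)
    l_r = torch.mean(torch.abs(recon_spec.real - natural_spec.real))
    l_i = torch.mean(torch.abs(recon_spec.imag - natural_spec.imag))

    # consistency loss: compare the reconstructed spectrum with the spectrum of the
    # waveform obtained from it (inverse STFT followed by STFT)
    window = torch.hann_window(n_fft)
    waveform = torch.istft(recon_spec, n_fft=n_fft, hop_length=hop_length,
                           win_length=n_fft, window=window)
    consistent_spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                                 win_length=n_fft, window=window, return_complex=True)
    num_frames = min(consistent_spec.shape[-1], recon_spec.shape[-1])
    diff = consistent_spec[..., :num_frames] - recon_spec[..., :num_frames]
    l_c = torch.mean(diff.real ** 2 + diff.imag ** 2)

    return lambda_ri * (l_r + l_i) + l_c
```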
In one possible implementation, the waveform loss L_W of the first reconstructed speech waveform is calculated as follows.
The waveform loss L_W is used to narrow the gap between the reconstructed waveform and the natural waveform. As in HiFi-GAN, it includes the generator loss of the generative adversarial network, the discriminator loss of the generative adversarial network, the feature matching loss, and the Mel spectrogram loss; the waveform loss of the first reconstructed speech waveform is a linear combination of these losses in a certain proportion. λ_Mel is a waveform loss hyperparameter (the weight of the Mel spectrogram loss), which can be manually determined and adjusted.
S107: and calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
After the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated, the vocoder building system calculates the correction parameters according to the amplitude spectrum loss L_A, the phase spectrum loss L_P, the short-time spectrum loss L_S and the waveform loss L_W.
In one possible implementation, the correction parameter L is calculated as follows:
L = λ_A · L_A + λ_P · L_P + λ_S · L_S + L_W
where λ_A, λ_P and λ_S are correction hyperparameters, which can be manually determined and adjusted.
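A sketch of combining the four losses into the correction parameter, following the reconstruction above; the weight values and the assumption that the waveform loss is left unweighted are placeholders, not values specified in this application.

```python
import torch

def correction_parameter(l_amplitude: torch.Tensor, l_phase: torch.Tensor,
                         l_spectrum: torch.Tensor, l_waveform: torch.Tensor,
                         lambda_a: float = 45.0, lambda_p: float = 100.0,
                         lambda_s: float = 20.0) -> torch.Tensor:
    # weighted linear combination of the amplitude, phase, short-time spectrum and waveform losses
    return lambda_a * l_amplitude + lambda_p * l_phase + lambda_s * l_spectrum + l_waveform
```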
S108: and correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor.
After the correction parameters are calculated according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss, the vocoder building system can correct each parameter in the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor.
S109: and correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
After the correction parameters are calculated according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss, the vocoder building system can correct each parameter in the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
In one possible implementation, the method further comprises A1-A3:
A1: and comparing the correction parameter with a preset parameter.
In order to improve the voice generation efficiency of the vocoder, the amplitude spectrum prediction model and the phase spectrum prediction model need to be trained iteratively until the correction parameter is smaller than or equal to the preset parameter; the vocoder building system therefore needs to compare the correction parameter with the preset parameter.
In a possible implementation manner, the preset parameter is a correction parameter value that is obtained by performing multiple iterative training on the amplitude spectrum prediction model and the phase spectrum prediction model and is not changed any more, and is generally set to 0.6525, and the preset parameter can be adjusted according to actual conditions.
A2: and in response to the correction parameter being smaller than or equal to the preset parameter, executing the correction of the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and correcting the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor.
A3: and in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter (a simplified training-loop sketch is given below).
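A simplified training-loop sketch of the iterative correction described in A1-A3. The model classes, the `compute_correction_parameter` helper, the optimizer and the threshold value are all assumptions introduced for illustration:

```python
import torch

def train_vocoder(amp_model, phase_model, data_loader, optimizer,
                  compute_correction_parameter, preset_parameter=0.6525,
                  max_steps=100_000):
    """Iteratively correct both prediction models until the correction parameter
    no longer exceeds the preset parameter (or a step budget is exhausted)."""
    step = 0
    for features, natural_wave in data_loader:
        log_amp = amp_model(features)      # first logarithmic amplitude spectrum
        phase = phase_model(features)      # first phase spectrum
        loss = compute_correction_parameter(log_amp, phase, natural_wave)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                   # corrects the parameters of both models
        step += 1
        if loss.item() <= preset_parameter or step >= max_steps:
            break
    return amp_model, phase_model
```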
Based on the content of S101-S109, the target acoustic feature is first acquired and input into the amplitude spectrum prediction model to obtain the first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum comprises the first amplitude spectrum; the target acoustic feature is also input into the phase spectrum prediction model to obtain the first phase spectrum corresponding to the target acoustic feature. The first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and is preprocessed to obtain the first reconstructed speech waveform corresponding to the target acoustic feature. Next, the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated respectively, and the correction parameter is obtained from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, the amplitude spectrum prediction model is corrected according to the correction parameter to obtain the amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameter to obtain the phase spectrum predictor. Both the amplitude spectrum predictor and the phase spectrum predictor of this method operate entirely at the frame level and can predict the speech amplitude spectrum and phase spectrum directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
Based on the embodiment of the method for constructing the vocoder, the embodiment of the application also provides a voice synthesis method. Referring to fig. 3, fig. 3 is a flowchart of a method for synthesizing speech according to an embodiment of the present application. As shown in fig. 3, the method includes S301-S306:
S301: and acquiring the acoustic features to be synthesized.
When using the vocoder, the acoustic feature to be synthesized is first acquired.

In one possible implementation, the acoustic feature to be synthesized may be obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "the weather is good today", the text is input into the acoustic model and converted into the corresponding acoustic feature to be synthesized; the vocoder can then perform audio synthesis based on this acoustic feature to obtain synthesized audio data. The type of the acoustic model may be selected according to actual needs.

The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature to be synthesized is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature to be synthesized may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
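As an illustration of preparing such an input, a minimal sketch that extracts an 80-dimensional mel-spectrogram from a waveform with librosa; the sampling rate and frame parameters are assumptions for the example and should match the vocoder's training configuration:

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load audio and compute a log mel-spectrogram of shape (n_mels, frames)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))   # log compression for numerical stability
```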
S302: inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 6.
S303: inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 6.
S304: and calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum.
After the acoustic feature to be synthesized is respectively input into the amplitude spectrum predictor and the phase spectrum predictor to obtain the second logarithmic amplitude spectrum and the second phase spectrum, the speech synthesis system also needs to calculate the second reconstructed short-time spectrum from the second amplitude spectrum comprised in the second logarithmic amplitude spectrum and from the second phase spectrum.
The amplitude spectrum predictor and the phase spectrum predictor are constructed according to the construction method of the vocoder described in S101-S109.
In one possible implementation, the second reconstructed short-time spectrum $\hat{S}$ is calculated from the second amplitude spectrum and the second phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the second amplitude spectrum, i.e. the amplitude spectrum part recovered from the second logarithmic amplitude spectrum, and $\hat{P}$ is the second phase spectrum.
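A minimal PyTorch sketch of this reconstruction, assuming the predictors output a log-amplitude spectrum and a phase spectrum of the same shape:

```python
import torch

def reconstruct_short_time_spectrum(log_amplitude, phase):
    """Combine a log-amplitude spectrum and a phase spectrum into a complex short-time spectrum."""
    amplitude = torch.exp(log_amplitude)      # recover the amplitude spectrum part
    real = amplitude * torch.cos(phase)
    imag = amplitude * torch.sin(phase)
    return torch.complex(real, imag)          # S_hat = A_hat * exp(j * P_hat)
```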
S305: and preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In order to obtain the synthesized voice, after the second reconstructed short-time spectrum is obtained by calculating according to the second amplitude spectrum and the second phase spectrum, the voice synthesis system also needs to preprocess the obtained second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In some possible implementations, the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized includes:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
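As a sketch, this inverse transform with windowed overlap-add and window-square normalisation is available directly in PyTorch; the parameters below are assumptions for the example and must match those used when the spectra were defined:

```python
import torch

def spectrum_to_waveform(S_hat, n_fft=1024, hop_length=256, win_length=1024):
    """Inverse STFT of a complex spectrum of shape (batch, n_fft // 2 + 1, frames)."""
    window = torch.hann_window(win_length, device=S_hat.device)
    return torch.istft(S_hat, n_fft=n_fft, hop_length=hop_length,
                       win_length=win_length, window=window)
```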
S306: and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
In order to obtain the synthesized speech, after obtaining the second reconstructed speech waveform, the speech synthesis system further needs to convert the obtained second reconstructed speech waveform into the synthesized speech corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second reconstructed speech waveform may be converted into playable speech using software or libraries (for example, Python audio libraries) that write a sound waveform out as an audio file.
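For example, a minimal sketch using the soundfile package, one of several Python libraries that can serve this purpose:

```python
import soundfile as sf

def waveform_to_audio_file(waveform, path="synthesized.wav", sample_rate=22050):
    """Write the reconstructed waveform (a 1-D numpy array; call .cpu().numpy()
    on a tensor first) to a WAV file that can be played back as the synthesized speech."""
    sf.write(path, waveform, sample_rate)
```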
Based on the above content of S301-S306, the acoustic feature to be synthesized can be predicted by the trained amplitude spectrum predictor and phase spectrum predictor to obtain the second logarithmic amplitude spectrum and the second phase spectrum corresponding to the acoustic feature to be synthesized, where the second logarithmic amplitude spectrum comprises the second amplitude spectrum. The second reconstructed short-time spectrum is then calculated from the second amplitude spectrum and the second phase spectrum, and the second reconstructed short-time spectrum is preprocessed to obtain the second reconstructed speech waveform corresponding to the acoustic feature to be synthesized. Finally, the second reconstructed speech waveform is converted into the synthesized speech corresponding to the acoustic feature to be synthesized. Because the operations of the amplitude spectrum predictor and the phase spectrum predictor are entirely frame-level, the speech amplitude spectrum and phase spectrum can be predicted directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a vocoder building device according to an embodiment of the present application. As shown in fig. 2, the vocoder constructing apparatus includes:
a first acquisition unit 201 for acquiring a target acoustic feature.
The target acoustic feature is an acoustic feature to be input into the amplitude spectrum prediction model and the phase spectrum prediction model for training, and is derived by inputting text to be synthesized into an acoustic model. For example, if the text is "the weather is good today", the acoustic model converts it into the corresponding target acoustic feature, and the vocoder can then perform audio synthesis based on the target acoustic feature to obtain clean synthesized audio data. Since the acoustic features output by an acoustic model are usually somewhat noisy, synthesizing audio directly from noisy acoustic features affects the sound quality of the synthesized audio data. The type of the acoustic model may be selected according to actual needs.
The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
The first input unit 202 is configured to input the target acoustic feature into an amplitude spectrum prediction model, and obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum includes the first amplitude spectrum.
In some possible implementations, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer.
The first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence.
The first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics.
The first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
The first output convolution layer is used for carrying out convolution calculation on the first residual convolution network so as to obtain a second logarithmic magnitude spectrum.
The initial parameters of the first input convolution layer and the first output convolution layer are obtained through random setting of the convolution layers.
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
In some possible implementations, the first residual convolution network is formed by sequentially connecting N parallel residual convolution blocks connected in a jumping manner, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are both positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping mode.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
In some possible implementations, the LReLU unit, i.e. the leaky rectified linear unit (Leaky ReLU) function, is a variant of the classical and widely used ReLU activation function whose output has a small slope for negative inputs. Since the derivative is never zero, this reduces the occurrence of silent neurons and allows gradient-based learning (although it may be slow), solving the problem of neurons ceasing to learn once the ReLU function enters the negative interval. Activation by the LReLU unit means applying this function to the calculation result.

In some possible implementations, the activation performed by the first LReLU unit is the application of this function to the calculation result of the averaging unit.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
In some possible implementations, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer and a second adding unit.

The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are sequentially connected.
And the second LReLu unit is used for activating the matrix input into the second LReLu unit to obtain a second activation matrix.
In some possible implementations, the activation performed by the second LReLU unit is the application of this function to the matrix input to the second LReLU unit.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLu unit is used for activating the calculation result of the expansion convolution layer to obtain a third activation matrix.
In some possible implementations, the activation performed by the third LReLU unit is the application of this function to the calculation result of the expanded convolution layer.

The fourth output convolution layer is configured to perform convolution calculation on the third activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the first input convolution layer; the second adding unit is also connected to the first input convolution layer.

At this time, the second LReLU unit is configured to activate the calculation result of the first input convolution layer to obtain the second activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the first input convolution layer.
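To make the structure above concrete, here is a minimal PyTorch-style sketch of an amplitude spectrum prediction model with this topology (first input convolution layer, N parallel skip-connected residual convolution blocks whose outputs are summed and averaged, a first LReLU unit, and a first output convolution layer). The channel counts, kernel sizes and dilation rates are assumptions chosen for illustration, not values fixed by this application:

```python
import torch
import torch.nn as nn

class ResidualConvSubBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.act1 = nn.LeakyReLU(0.1)                        # second LReLU unit
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size,
                                      dilation=dilation,
                                      padding=(kernel_size - 1) * dilation // 2)  # expanded convolution layer
        self.act2 = nn.LeakyReLU(0.1)                        # third LReLU unit
        self.out_conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)  # fourth output convolution layer

    def forward(self, x):
        y = self.out_conv(self.act2(self.dilated_conv(self.act1(x))))
        return x + y                                         # second adding unit (residual connection)

class ResidualConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.sub_blocks = nn.ModuleList(
            [ResidualConvSubBlock(channels, kernel_size, d) for d in dilations])  # X cascaded sub-blocks

    def forward(self, x):
        for sub in self.sub_blocks:
            x = sub(x)
        return x

class AmplitudeSpectrumModel(nn.Module):
    def __init__(self, in_dim=80, channels=512, out_dim=513, n_blocks=3):
        super().__init__()
        self.input_conv = nn.Conv1d(in_dim, channels, 7, padding=3)    # first input convolution layer
        self.blocks = nn.ModuleList(
            [ResidualConvBlock(channels) for _ in range(n_blocks)])    # N parallel residual conv blocks
        self.act = nn.LeakyReLU(0.1)                                   # first LReLU unit
        self.output_conv = nn.Conv1d(channels, out_dim, 7, padding=3)  # first output convolution layer

    def forward(self, mel):
        x = self.input_conv(mel)                            # (batch, channels, frames)
        summed = sum(block(x) for block in self.blocks)     # first adding unit
        x = self.act(summed / len(self.blocks))             # averaging unit + activation
        return self.output_conv(x)                          # predicted log-amplitude spectrum
```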
The second input unit 203 is configured to input the target acoustic feature into a phase spectrum prediction model, so as to obtain a first phase spectrum corresponding to the target acoustic feature.
In some possible implementations, the phase spectrum prediction model includes: the system comprises a second input convolution layer, a second residual convolution network, a second output convolution layer, a third output convolution layer and a phase calculation module.
The second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer.
The second input convolution layer is used for carrying out convolution calculation on the target acoustic features.
And the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
And the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
The initial parameters of the second output convolution layer and the third output convolution layer are obtained through random setting of the convolution layers. Because the initial parameters of the second output convolution layer and the third output convolution layer are obtained by random setting of the convolution layers, the parameters of the second output convolution layer and the third output convolution layer are different.
In some possible implementations, the phase calculation module is formulated as follows:

$$\Phi(R,I) = \arctan\!\Big(\frac{I}{R}\Big) - \frac{\pi}{2}\,\mathrm{sgn}^{*}(I)\,\big(\mathrm{sgn}^{*}(R) - 1\big)$$

where $R$ is the calculation result of the second output convolution layer, $I$ is the calculation result of the third output convolution layer, and $\Phi(0,0)=0$. The modified sign function $\mathrm{sgn}^{*}$ is defined as: when $R \ge 0$, $\mathrm{sgn}^{*}(R)=1$; when $R<0$, $\mathrm{sgn}^{*}(R)=-1$; when $I \ge 0$, $\mathrm{sgn}^{*}(I)=1$; when $I<0$, $\mathrm{sgn}^{*}(I)=-1$.
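A minimal sketch of this phase calculation in PyTorch; since the formula is equivalent to the two-argument arctangent, torch.atan2 could be used directly instead. The small eps is an assumption added only to avoid division by zero in the sketch:

```python
import math
import torch

def phase_calculation(R, I, eps=1e-8):
    """Map the parallel convolution outputs (R, I) to a phase spectrum in (-pi, pi]."""
    sgn_r = torch.where(R >= 0, torch.ones_like(R), -torch.ones_like(R))   # sgn*(R)
    sgn_i = torch.where(I >= 0, torch.ones_like(I), -torch.ones_like(I))   # sgn*(I)
    # Equivalently: torch.atan2(I, R)
    return torch.atan(I / (R + eps)) - (math.pi / 2) * sgn_i * (sgn_r - 1.0)
```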
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
The second residual convolution network in the phase spectrum prediction model is formed by sequentially connecting N parallel residual convolution blocks which are connected in a jumping manner, a first adding unit, an averaging unit and a first LReLu unit, wherein the residual convolution blocks are formed by cascading X residual convolution sub-blocks; n, X are all positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit and the adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping mode.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
The first residual convolution network and the second residual convolution network are formed by sequentially connecting N residual convolution blocks which are connected in parallel in a jumping manner, a first adding unit, an averaging unit and a first LReLu unit, but initial parameters of the N residual convolution blocks in the first residual convolution network and initial parameters of the N residual convolution blocks in the second residual convolution network are obtained through random setting. Because the initial parameters of the N residual convolution blocks in the first residual convolution network and the N residual convolution blocks in the second residual convolution network are obtained through random setting, the parameters of the N residual convolution blocks in the first residual convolution network and the N residual convolution blocks in the second residual convolution network are different.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
The residual convolution sub-block in the second residual convolution network also includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer and a second adding unit.

The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are sequentially connected.
And the second LReLu unit is used for activating the matrix input into the second LReLu unit to obtain a second activation matrix.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLu unit is used for activating the calculation result of the expansion convolution layer to obtain a third activation matrix.
The fourth output convolution layer is configured to perform convolution calculation on the third activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the second input convolution layer; the second adding unit is also connected to the second input convolution layer.

At this time, the second LReLU unit is configured to activate the calculation result of the second input convolution layer to obtain the second activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the second input convolution layer.
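Analogously, a minimal sketch of a phase spectrum prediction model with this topology. It reuses the ResidualConvBlock and phase_calculation helpers from the earlier sketches, and again all sizes are illustrative assumptions:

```python
import torch.nn as nn

class PhaseSpectrumModel(nn.Module):
    def __init__(self, in_dim=80, channels=512, out_dim=513, n_blocks=3):
        super().__init__()
        self.input_conv = nn.Conv1d(in_dim, channels, 7, padding=3)   # second input convolution layer
        self.blocks = nn.ModuleList(
            [ResidualConvBlock(channels) for _ in range(n_blocks)])   # second residual convolution network
        self.act = nn.LeakyReLU(0.1)
        self.real_conv = nn.Conv1d(channels, out_dim, 7, padding=3)   # second output convolution layer (R)
        self.imag_conv = nn.Conv1d(channels, out_dim, 7, padding=3)   # third output convolution layer (I)

    def forward(self, mel):
        x = self.input_conv(mel)
        x = self.act(sum(block(x) for block in self.blocks) / len(self.blocks))
        R, I = self.real_conv(x), self.imag_conv(x)
        return phase_calculation(R, I)                                # phase calculation module
```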
The first calculating unit 204 is configured to calculate a first reconstructed short-time spectrum according to the first amplitude spectrum and the first phase spectrum.
In one possible implementation, the first reconstructed short-time spectrum $\hat{S}$ is calculated from the first amplitude spectrum and the first phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the first amplitude spectrum, i.e. the amplitude spectrum part recovered from the first logarithmic amplitude spectrum, and $\hat{P}$ is the first phase spectrum.
The first preprocessing unit 205 is configured to preprocess the first reconstructed short-time spectrum to obtain a first reconstructed speech waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
A second calculating unit 206, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed speech waveform.
In one possible implementation, the amplitude spectrum loss $\mathcal{L}_A$ of the first logarithmic amplitude spectrum is calculated as follows:

$$\mathcal{L}_A = \mathbb{E}\Big[\big\|\log\hat{A} - \log A\big\|_2^{2}\Big]$$

where $\log\hat{A}$ is the first logarithmic amplitude spectrum and $\log A$ is the natural logarithmic amplitude spectrum, i.e. $\log A = \tfrac{1}{2}\log\!\big(\mathrm{Re}(S)^{2} + \mathrm{Im}(S)^{2}\big)$.

Here $s$ is the natural waveform, the natural short-time complex spectrum $S$ is extracted from it by the short-time Fourier transform (Short-Time Fourier Transform, STFT), and $\mathrm{Re}$ and $\mathrm{Im}$ denote the real part and the imaginary part of $S$, respectively.
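A minimal sketch of this amplitude spectrum loss, assuming the predicted log-amplitude spectrum and the natural complex spectrum are tensors of matching shape:

```python
import torch

def amplitude_spectrum_loss(log_amp_pred, natural_spectrum):
    """MSE between the predicted log-amplitude spectrum and the natural log-amplitude spectrum."""
    log_amp_nat = 0.5 * torch.log(natural_spectrum.real ** 2 +
                                  natural_spectrum.imag ** 2 + 1e-9)   # log|S|, eps for stability
    return torch.mean((log_amp_pred - log_amp_nat) ** 2)
```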
In one possible implementation, the phase spectrum loss $\mathcal{L}_P$ of the first phase spectrum is calculated as follows:

$$\mathcal{L}_P = \mathcal{L}_{IP} + \mathcal{L}_{GD} + \mathcal{L}_{IAF}$$

where $\mathcal{L}_{IP}$ is the instantaneous phase loss, $\mathcal{L}_{GD}$ is the group delay loss and $\mathcal{L}_{IAF}$ is the instantaneous angular frequency loss.

The instantaneous phase loss is defined as the negative cosine loss between the predicted phase spectrum $\hat{P}$ and the natural phase spectrum $P$, namely:

$$\mathcal{L}_{IP} = \mathbb{E}\big[-\cos(\hat{P} - P)\big]$$

The group delay loss is defined as the negative cosine loss between the predicted group delay $\Delta_{DF}\hat{P}$ and the natural group delay $\Delta_{DF}P$, namely:

$$\mathcal{L}_{GD} = \mathbb{E}\big[-\cos(\Delta_{DF}\hat{P} - \Delta_{DF}P)\big]$$

The instantaneous angular frequency loss is defined as the negative cosine loss between the predicted instantaneous angular frequency $\Delta_{DT}\hat{P}$ and the natural instantaneous angular frequency $\Delta_{DT}P$, namely:

$$\mathcal{L}_{IAF} = \mathbb{E}\big[-\cos(\Delta_{DT}\hat{P} - \Delta_{DT}P)\big]$$

where $\Delta_{DF}$ and $\Delta_{DT}$ denote the difference along the frequency axis and the difference along the time axis, respectively. The natural phase spectrum is calculated as $P = \Phi\big(\mathrm{Re}(S), \mathrm{Im}(S)\big)$.

Here $\mathrm{Re}(S)$ is the real part of the natural short-time complex spectrum $S$ and $\mathrm{Im}(S)$ is its imaginary part, $\Phi$ is the phase calculation formula given above with $\Phi(0,0)=0$, and $\mathrm{sgn}^{*}$ satisfies: when $\mathrm{Re}(S)\ge 0$, $\mathrm{sgn}^{*}(\mathrm{Re}(S))=1$; when $\mathrm{Re}(S)<0$, $\mathrm{sgn}^{*}(\mathrm{Re}(S))=-1$; when $\mathrm{Im}(S)\ge 0$, $\mathrm{sgn}^{*}(\mathrm{Im}(S))=1$; when $\mathrm{Im}(S)<0$, $\mathrm{sgn}^{*}(\mathrm{Im}(S))=-1$.
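A minimal sketch of the three anti-wrapping phase sub-losses, assuming phase spectra shaped (batch, frequency_bins, frames):

```python
import torch

def negative_cosine_loss(pred, target):
    return torch.mean(-torch.cos(pred - target))

def phase_spectrum_loss(phase_pred, phase_nat):
    """Instantaneous phase, group delay and instantaneous angular frequency losses."""
    l_ip = negative_cosine_loss(phase_pred, phase_nat)
    l_gd = negative_cosine_loss(torch.diff(phase_pred, dim=1),
                                torch.diff(phase_nat, dim=1))    # difference along the frequency axis
    l_iaf = negative_cosine_loss(torch.diff(phase_pred, dim=2),
                                 torch.diff(phase_nat, dim=2))   # difference along the time axis
    return l_ip + l_gd + l_iaf
```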
In one possible implementation, the short-time spectrum loss $\mathcal{L}_S$ of the first reconstructed short-time spectrum is calculated as described below.

The short-time spectral loss $\mathcal{L}_S$ is used to improve the degree of matching between the predicted amplitude spectrum $\hat{A}$ and phase spectrum $\hat{P}$ and to guarantee the consistency of the short-time spectrum $\hat{S}$ reconstructed from them (i.e. $\hat{S} = \hat{A}e^{j\hat{P}}$). It comprises three sub-losses: the real part loss $\mathcal{L}_R$, the imaginary part loss $\mathcal{L}_I$ and the short-time spectrum consistency loss $\mathcal{L}_C$. The real part loss is defined as the absolute error between the reconstructed real part $\mathrm{Re}(\hat{S})$ and the natural real part $\mathrm{Re}(S)$, and the imaginary part loss as the absolute error between the reconstructed imaginary part $\mathrm{Im}(\hat{S})$ and the natural imaginary part $\mathrm{Im}(S)$, namely:

$$\mathcal{L}_R = \mathbb{E}\big[\|\mathrm{Re}(\hat{S}) - \mathrm{Re}(S)\|_1\big], \qquad \mathcal{L}_I = \mathbb{E}\big[\|\mathrm{Im}(\hat{S}) - \mathrm{Im}(S)\|_1\big]$$
The short-time spectrum consistency loss is defined between the reconstructed short-time spectrum $\hat{S}$ and its consistent counterpart $\hat{S}'$, and is used to reduce the gap between the two. Since both the amplitude spectrum and the phase spectrum are predicted, and the set of consistent short-time spectra is only a subset of the complex domain, the reconstructed short-time spectrum $\hat{S}$ is not necessarily a truly existing short-time spectrum. The truly existing short-time spectrum $\hat{S}'$ corresponding to $\hat{S}$ is obtained by applying the inverse short-time Fourier transform to $\hat{S}$ to obtain a waveform, and then applying the short-time Fourier transform to that waveform, namely:

$$\hat{S}' = \mathrm{STFT}\big(\mathrm{ISTFT}(\hat{S})\big)$$

The short-time spectrum consistency loss is then defined as the two-norm between $\hat{S}$ and $\hat{S}'$, written in the form of their real and imaginary parts:

$$\mathcal{L}_C = \mathbb{E}\Big[\big\|\mathrm{Re}(\hat{S})-\mathrm{Re}(\hat{S}')\big\|_2 + \big\|\mathrm{Im}(\hat{S})-\mathrm{Im}(\hat{S}')\big\|_2\Big]$$

Finally, the short-time spectral loss $\mathcal{L}_S$ is a linear combination of the real part loss $\mathcal{L}_R$, the imaginary part loss $\mathcal{L}_I$ and the short-time spectrum consistency loss $\mathcal{L}_C$ in a certain proportion, namely:

$$\mathcal{L}_S = \lambda_{RI}\,(\mathcal{L}_R + \mathcal{L}_I) + \mathcal{L}_C$$

where $\lambda_{RI}$ is a short-time spectral loss hyper-parameter that can be determined and adjusted manually.
In one possible implementation, the waveform loss $\mathcal{L}_W$ of the first reconstructed speech waveform is calculated as follows:

$$\mathcal{L}_W = \mathcal{L}_{\mathrm{GAN}\text{-}G} + \mathcal{L}_{\mathrm{GAN}\text{-}D} + \mathcal{L}_{FM} + \lambda_{Mel}\,\mathcal{L}_{Mel}$$

The waveform loss $\mathcal{L}_W$ is used to narrow the gap between the reconstructed waveform and the natural waveform. As in HiFi-GAN, it includes the generator loss $\mathcal{L}_{\mathrm{GAN}\text{-}G}$ of the generative adversarial network, the discriminator loss $\mathcal{L}_{\mathrm{GAN}\text{-}D}$, the feature matching loss $\mathcal{L}_{FM}$ and the mel-spectrogram loss $\mathcal{L}_{Mel}$; the waveform loss of the first reconstructed speech waveform is a linear combination of these losses in a certain proportion. $\lambda_{Mel}$ is a waveform loss hyper-parameter that can be determined and adjusted manually.
A third calculation unit 207 is configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss.
In one possible implementation, the correction parameter $\mathcal{L}$ is calculated as follows:

$$\mathcal{L} = \lambda_A\,\mathcal{L}_A + \lambda_P\,\mathcal{L}_P + \lambda_S\,\mathcal{L}_S + \mathcal{L}_W$$

where $\lambda_A$, $\lambda_P$ and $\lambda_S$ are correction hyper-parameters that can be determined and adjusted manually.
A first correction unit 208, configured to correct the magnitude spectrum prediction model according to the correction parameter, so as to obtain the magnitude spectrum predictor.
A second correction unit 209 is configured to correct the phase spectrum prediction model according to the correction parameter so as to obtain the phase spectrum predictor.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. As shown in fig. 4, the speech synthesis apparatus includes:
a second acquisition unit 401 is configured to acquire an acoustic feature to be synthesized.
In one possible implementation, the acoustic feature to be synthesized may be obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "the weather is good today", the text is input into the acoustic model and converted into the corresponding acoustic feature to be synthesized; the vocoder can then perform audio synthesis based on this acoustic feature to obtain synthesized audio data. The type of the acoustic model may be selected according to actual needs.

The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature to be synthesized is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature to be synthesized may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
A third input unit 402, configured to input the acoustic feature to be synthesized into a pre-constructed amplitude spectrum predictor, so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic feature to be synthesized, where the second logarithmic amplitude spectrum includes the second amplitude spectrum.
A fourth input unit 403, configured to input the acoustic feature to be synthesized into a pre-constructed phase spectrum predictor, so as to obtain a second phase spectrum corresponding to the acoustic feature to be synthesized.
A fourth calculating unit 403, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum.
In one possible implementation, the second reconstructed short-time spectrum $\hat{S}$ is calculated from the second amplitude spectrum and the second phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the second amplitude spectrum, i.e. the amplitude spectrum part recovered from the second logarithmic amplitude spectrum, and $\hat{P}$ is the second phase spectrum.
And a second preprocessing unit 404, configured to preprocess the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second preprocessing unit 404 is specifically configured to:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
A first converting unit 405, configured to convert the second reconstructed speech waveform into a synthesized speech corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second reconstructed speech waveform may be converted into playable speech using software or libraries (for example, Python audio libraries) that write a sound waveform out as an audio file.
In one possible implementation, the apparatus further includes:
and the comparison unit is used for comparing the correction parameter with a preset parameter.
The first execution unit is used for responding to the correction parameter being smaller than or equal to the preset parameter, and is used for executing the correction of the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the correction of the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor.
And the second execution unit is used for responding to the correction parameter being larger than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter.
The embodiments of the present application provide a vocoder construction method, a speech synthesis method and related devices. The vocoder construction method includes: acquiring a target acoustic feature and inputting it into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum; and inputting the target acoustic feature into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic feature. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed speech waveform corresponding to the target acoustic feature. The amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated respectively, and the correction parameter is then calculated from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. The amplitude spectrum prediction model is corrected according to the correction parameter to obtain an amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameter to obtain a phase spectrum predictor. Both the amplitude spectrum predictor and the phase spectrum predictor of this method operate entirely at the frame level and can predict the speech amplitude spectrum and phase spectrum directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
The above describes in detail a method for constructing a vocoder, a method for synthesizing speech, and related devices. In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should be noted that it would be obvious to those skilled in the art that various improvements and modifications can be made to the present application without departing from the principles of the present application, and such improvements and modifications fall within the scope of the claims of the present application.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method of constructing a vocoder, the vocoder comprising: an amplitude spectrum predictor and a phase spectrum predictor, the method comprising:
acquiring a target acoustic feature;
inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
Preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform;
calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss;
correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain a corrected amplitude spectrum prediction model serving as the amplitude spectrum predictor;
and correcting the phase spectrum prediction model according to the correction parameters so as to obtain a corrected phase spectrum prediction model serving as the phase spectrum predictor.
2. The method according to claim 1, wherein the method further comprises:
comparing the correction parameter with a preset parameter;
executing the correction of the amplitude spectrum prediction model according to the correction parameter to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the correction of the phase spectrum prediction model according to the correction parameter to obtain a corrected phase spectrum prediction model as the phase spectrum predictor in response to the correction parameter being less than or equal to the preset parameter;
And in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter.
3. The method of claim 1, wherein the magnitude spectrum prediction model comprises: a first input convolution layer, a first residual convolution network, and a first output convolution layer;
the first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence;
the first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics;
the first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer;
the first output convolution layer is used for carrying out convolution calculation on the first residual convolution network so as to obtain a second logarithmic magnitude spectrum.
4. The method of claim 1, wherein the phase spectrum prediction model comprises: the second input convolution layer, the second residual convolution network, the second output convolution layer, the third output convolution layer and the phase calculation module;
the second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer;
the second input convolution layer is used for carrying out convolution calculation on the target acoustic features;
the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer;
the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
and the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
5. The method according to claim 3 or 4, wherein the first residual convolution network and the second residual convolution network are each formed by sequentially connecting N parallel jump-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers;
the residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer;
the first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping manner;
the average unit is used for carrying out average calculation on the calculation result of the first adding unit;
and the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
6. The method of claim 5, wherein the residual convolution sub-block comprises: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second adding unit;

the second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are connected in sequence; the second LReLU unit is configured to activate a matrix input to the second LReLU unit to obtain a second activation matrix;
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix;
the third LReLU unit is configured to activate the calculation result of the expanded convolution layer to obtain a third activation matrix;
the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix;
and the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
7. The method of claim 3, 4 or 6, wherein the initial parameters of the first input convolution layer, the first output convolution layer, the second output convolution layer, the third output convolution layer and the fourth output convolution layer are each randomly set by a convolution layer.
8. A method of speech synthesis, the method comprising:
acquiring acoustic features to be synthesized;
inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 7;
Inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 7;
calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum;
preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
9. The method of claim 8, wherein the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed speech waveform corresponding to the acoustic feature to be synthesized comprises:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
10. A vocoder building apparatus, the apparatus comprising:
a first acquisition unit configured to acquire a target acoustic feature;
the first input unit is used for inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
The second input unit is used for inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
the first calculation unit is used for calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
the first preprocessing unit is used for preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
a second calculation unit, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed voice waveform;
a third calculation unit, configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss;
the first correction unit is used for correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor;
and the second correction unit is used for correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
11. A speech synthesis apparatus, the apparatus comprising:
The second acquisition unit is used for acquiring acoustic features to be synthesized;
the third input unit is used for inputting the acoustic features to be synthesized into a pre-constructed amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum;
the fourth input unit is used for inputting the acoustic features to be synthesized into a pre-constructed phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized;
a fourth calculation unit, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum;
the second preprocessing unit is used for preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
the first converting unit is used for converting the second reconstructed voice waveform into the synthesized voice corresponding to the acoustic feature to be synthesized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081092.XA CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081092.XA CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524894A true CN116524894A (en) | 2023-08-01 |
Family
ID=87403545
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310081092.XA Pending CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524894A (en) |
- 2023-01-16: CN application CN202310081092.XA filed; published as CN116524894A; status: Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |