CN116524894A - Vocoder construction method, voice synthesis method and related devices
- Publication number
- CN116524894A (application number CN202310081092.XA)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- unit
- convolution
- convolution layer
- phase
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The embodiment of the application discloses a vocoder construction method, a voice synthesis method and a related device. Target acoustic features are first obtained and input respectively into an amplitude spectrum prediction model and a phase spectrum prediction model to obtain a first logarithmic amplitude spectrum and a first phase spectrum, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform. The amplitude spectrum loss, phase spectrum loss, short-time spectrum loss and waveform loss are calculated, and correction parameters are calculated from these losses. The amplitude spectrum prediction model and the phase spectrum prediction model are corrected according to the correction parameters to obtain an amplitude spectrum predictor and a phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor can directly predict the amplitude spectrum and the phase spectrum in parallel, which improves the efficiency of voice generation and reduces the overall computational complexity.
Description
Technical Field
The present disclosure relates to the field of speech signal processing, and in particular, to a method for constructing a vocoder, a method for synthesizing speech, and a related apparatus.
Background
Speech synthesis aims to make machines speak as smoothly and naturally as humans, which benefits many voice-interaction applications. Currently, statistical parametric speech synthesis (statistical parametric speech synthesis, SPSS) is one of the dominant approaches.
A statistical parametric speech synthesis framework consists of an acoustic model (acoustic model) and a vocoder (vocoder). The vocoder converts the acoustic features into the final speech waveform, and its performance can significantly affect the quality of the synthesized speech. With the development of neural networks, autoregressive neural network vocoders represented by WaveNet and SampleRNN were proposed and significantly improved the quality of synthesized speech, but they are limited by the autoregressive generation mode and have low generation efficiency. Subsequently, neural network vocoders based on knowledge distillation, neural network vocoders based on inverse autoregressive flow, and neural network glottal models combined with linear autoregressive neural network vocoders were proposed in turn; although their generation efficiency is improved, their overall computational complexity is high. Recently, non-autoregressive, non-streaming neural network vocoders have become mainstream; most of them use neural networks to achieve a direct mapping from acoustic features to speech waveforms and define generative adversarial loss functions between the predicted and real waveforms. However, limited by the direct prediction of waveforms, their generation efficiency still needs to be improved.
Therefore, how to provide a vocoder with high speech generation efficiency and simple overall operation is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the above problems, the present application provides a method for constructing a vocoder, a method for synthesizing voice, and a related device, thereby providing a vocoder with high voice generation efficiency and simple operation. The embodiment of the application discloses the following technical scheme:
a method of constructing a vocoder, the vocoder comprising: an amplitude spectrum predictor and a phase spectrum predictor, the method comprising:
acquiring a target acoustic feature;
inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
Respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform;
calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss;
correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain a corrected amplitude spectrum prediction model serving as the amplitude spectrum predictor;
and correcting the phase spectrum prediction model according to the correction parameters so as to obtain a corrected phase spectrum prediction model serving as the phase spectrum predictor.
In one possible implementation, the method further includes:
comparing the correction parameter with a preset parameter;
in response to the correction parameter being less than or equal to the preset parameter, executing the step of correcting the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the step of correcting the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor;
and in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic features into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, and the subsequent steps, until the correction parameter meets the preset parameter.
In one possible implementation, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer;
the first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence;
the first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics;
the first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer;
the first output convolution layer is used for carrying out convolution calculation on the calculation result of the first residual convolution network so as to obtain a second logarithmic amplitude spectrum.
In one possible implementation, the phase spectrum prediction model includes: the second input convolution layer, the second residual convolution network, the second output convolution layer, the third output convolution layer and the phase calculation module;
the second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer;
the second input convolution layer is used for carrying out convolution calculation on the target acoustic features;
the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer;
the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
and the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
In one possible implementation manner, the first residual convolution network and the second residual convolution network are each formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers;
the residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer;
the first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks;
the averaging unit is used for averaging the calculation result of the first adding unit;
and the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
In one possible implementation, the residual convolution sub-block includes: a second LReLU unit, an expanded (dilated) convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit;
the second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second addition unit are connected in sequence; the second LReLU unit is configured to activate the matrix input to the second LReLU unit to obtain a second activation matrix;
the expanded convolution layer is used for carrying out convolution calculation on the first activation matrix;
the third LReLU unit is configured to activate the calculation result of the expanded convolution layer to obtain a third activation matrix;
the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix;
and the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In one possible implementation, the initial parameters of the first input convolution layer, the first output convolution layer, the second output convolution layer, the third output convolution layer, and the fourth output convolution layer are all randomly initialized.
A method of speech synthesis, the method comprising:
acquiring acoustic features to be synthesized;
inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the construction method of the vocoder;
inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the construction method of the vocoder;
Calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum;
preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
In a possible implementation manner, the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized includes:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
A device for constructing a vocoder, the device comprising:
a first acquisition unit configured to acquire a target acoustic feature;
the first input unit is used for inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
the second input unit is used for inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
The first calculation unit is used for calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
the first preprocessing unit is used for preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
a second calculation unit, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed voice waveform;
a third calculation unit, configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss;
the first correction unit is used for correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor;
and the second correction unit is used for correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
A speech synthesis apparatus, the apparatus comprising:
the second acquisition unit is used for acquiring acoustic features to be synthesized;
the third input unit is used for inputting the acoustic features to be synthesized into a pre-constructed amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum;
The fourth input unit is used for inputting the acoustic features to be synthesized into a pre-constructed phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized;
a fourth calculation unit, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum;
the second preprocessing unit is used for preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
the first converting unit is used for converting the second reconstructed voice waveform into the synthesized voice corresponding to the acoustic feature to be synthesized.
Compared with the prior art, the application has the following beneficial effects:
the application provides a vocoder construction method, a voice synthesis method and a related device. Specifically, when the method for constructing a vocoder provided in the embodiments of the present application is executed, the acquisition target acoustic feature may be acquired first. Inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum; and inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features, and calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum. And then preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the acoustic feature to be synthesized. And then, respectively calculating the amplitude spectrum loss of the first pair of amplitude spectrums, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform, and calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor; and correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor of the method are all of full-frame level, and can be used for directly predicting the voice amplitude spectrum and the phase spectrum in parallel, so that the voice generation efficiency is remarkably improved, and the complexity of overall operation is reduced. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing amplitude spectrum loss, phase spectrum loss, short-time spectrum loss, and waveform loss.
Drawings
In order to more clearly illustrate the present embodiments or the technical solutions in the prior art, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a method flowchart of a method for constructing a vocoder according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for constructing a vocoder according to an embodiment of the present application;
fig. 3 is a method flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an amplitude spectrum prediction model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a residual convolution network according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a residual convolution sub-block according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a phase spectrum prediction model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of still another residual convolution sub-block according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, the following description will first explain the background technology related to the embodiments of the present application.
Speech synthesis aims to make machines speak as smoothly and naturally as humans, which benefits many voice-interaction applications. Currently, statistical parametric speech synthesis (statistical parametric speech synthesis, SPSS) is one of the dominant approaches.
A statistical parametric speech synthesis framework consists of an acoustic model (acoustic model) and a vocoder (vocoder). The vocoder converts the acoustic features into the final speech waveform. The performance of the vocoder can significantly affect the quality of the synthesized speech.
Conventional vocoders such as STRAIGHT and WORLD are widely used in current statistical parametric speech synthesis systems. However, these conventional vocoders suffer from drawbacks such as the loss of spectral details and phase information, which can degrade the listening quality of the synthesized speech.
Currently, with the development of neural networks, autoregressive neural network vocoders represented by WaveNet and SampleRNN have been proposed and significantly improve the quality of synthesized speech, but they are limited by the autoregressive generation mode and have low generation efficiency. Subsequently, neural network vocoders based on knowledge distillation, neural network vocoders based on inverse autoregressive flow, and neural network glottal models combined with linear autoregressive neural network vocoders were proposed in turn; although their generation efficiency is improved, their overall computational complexity is high. Recently, non-autoregressive, non-streaming neural network vocoders have become mainstream; most of them use neural networks to achieve a direct mapping from acoustic features to speech waveforms and define generative adversarial loss functions between the predicted and real waveforms. However, limited by the direct prediction of waveforms, their generation efficiency still needs to be improved.
In order to solve the above problems, the embodiments of the present application provide a vocoder construction method, a voice synthesis method and a related device. First, target acoustic features are acquired and input into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum; the target acoustic features are also input into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform corresponding to the target acoustic features. Then, the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed voice waveform are calculated respectively, and correction parameters are obtained from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, the amplitude spectrum prediction model is corrected according to the correction parameters to obtain an amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameters to obtain the phase spectrum predictor. The amplitude spectrum predictor and the phase spectrum predictor both operate entirely at the frame level and can directly predict the speech amplitude spectrum and phase spectrum in parallel, which significantly improves the efficiency of voice generation and reduces the overall computational complexity. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, which is a method flowchart of a method for constructing a vocoder according to an embodiment of the present application, as shown in fig. 1, the method for constructing a vocoder may include steps S101 to S109:
s101: target acoustic features are acquired.
To construct a vocoder, the vocoder's construction system may first obtain the target acoustic features.
The target acoustic features are obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "today's weather is good", the acoustic model can convert it into the corresponding target acoustic features, and the vocoder can then perform audio synthesis based on the target acoustic features to obtain clean synthesized audio data.
The acoustic features may include, but are not limited to, at least one spectral parameter such as the spectrum or the cepstrum. In addition, one or more of the fundamental frequency and voiced/unvoiced flags may be included. In the present embodiment, the acoustic feature is described by taking a spectrum as an example, specifically a Mel spectrogram (mel-spectrogram). In other embodiments, the acoustic feature may be a cepstrum plus the fundamental frequency, optionally combined with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic features as those used when training the vocoder must be prepared as input. For example, if the acoustic feature used in training is an 80-dimensional Mel spectrogram, then an 80-dimensional Mel spectrogram is also taken as input in application.
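For illustration only, the following is a minimal sketch (not part of the original disclosure) of one way an 80-dimensional Mel spectrogram could be prepared as the acoustic feature, assuming the librosa library; the sample rate, FFT size and hop length below are assumptions, not values specified in this application.

```python
import librosa
import numpy as np

def extract_mel_spectrogram(wav_path: str, n_mels: int = 80) -> np.ndarray:
    # load the natural waveform (sample rate is an illustrative assumption)
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )                                         # shape: (n_mels, num_frames)
    log_mel = np.log(mel + 1e-5)              # log compression for numerical stability
    return log_mel.T                          # shape: (num_frames, n_mels)
```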
S102: and inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, wherein the first logarithmic amplitude spectrum comprises a first amplitude spectrum.
After the target acoustic features are obtained, the vocoder building system may input the target acoustic features into the amplitude spectrum prediction model, so as to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic features, where the first logarithmic amplitude spectrum comprises the first amplitude spectrum.
Referring to fig. 5, a schematic structural diagram of an amplitude spectrum prediction model provided in an embodiment of the present application, as shown in fig. 5, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer.
The first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence.
The first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics.
The first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
The first output convolution layer is used for carrying out convolution calculation on the calculation result of the first residual convolution network so as to obtain a second logarithmic amplitude spectrum.
The initial parameters of the first input convolution layer and the first output convolution layer are both randomly initialized.
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
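For illustration, the following is a minimal PyTorch sketch of the layout described above (first input convolution layer, first residual convolution network, first output convolution layer). The channel counts, kernel sizes and class name are assumptions, and the residual convolution network is left as a pluggable module (see the sub-block sketch further below).

```python
import torch
import torch.nn as nn

class AmplitudeSpectrumPredictor(nn.Module):
    def __init__(self, acoustic_dim=80, hidden_dim=512, spec_bins=513,
                 residual_network: nn.Module = None):
        super().__init__()
        # first input convolution layer: acoustic features -> hidden channels
        self.input_conv = nn.Conv1d(acoustic_dim, hidden_dim, kernel_size=7, padding=3)
        # first residual convolution network (placeholder if none is supplied)
        self.residual_network = residual_network or nn.Identity()
        # first output convolution layer: hidden channels -> log amplitude spectrum bins
        self.output_conv = nn.Conv1d(hidden_dim, spec_bins, kernel_size=7, padding=3)

    def forward(self, acoustic_features):
        # acoustic_features: (batch, acoustic_dim, num_frames)
        x = self.input_conv(acoustic_features)
        x = self.residual_network(x)
        log_amplitude = self.output_conv(x)   # (batch, spec_bins, num_frames)
        return log_amplitude
```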
Referring to fig. 6, which is a schematic structural diagram of a residual convolution network provided in this embodiment of the present application, as shown in fig. 6, the first residual convolution network is formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit, and a first LReLU unit, where each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
In some possible implementations, the LReLU unit, i.e., the leaky rectified linear unit (Leaky ReLU) function, is a variant of the classical and widely used ReLU activation function in which the output has a small slope for negative inputs. Since the derivative is never zero, this reduces the occurrence of inactive ("dead") neurons and allows gradient-based learning (although it may be slow), solving the problem that neurons stop learning after the ReLU function enters the negative interval. Activation by the LReLU unit means applying this function to the calculation result.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
Referring to fig. 7, which is a schematic structural diagram of a residual convolution sub-block provided in an embodiment of the present application, as shown in fig. 7, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer, and the second addition unit are sequentially connected.
And the second LReLU unit is used for activating the matrix input into the second LReLU unit to obtain a second activation matrix.
In one possible implementation, the activation of the second LReLU unit is a function operation on the matrix input to the second LReLU unit.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLU unit is used for activating the calculation result of the expanded convolution layer to obtain a third activation matrix.
In some possible implementations, the activation of the third LReLU unit is a function operation on the calculation result of the expanded convolution layer.
And the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix.
In some possible implementations, the initial parameters of the fourth output convolution layer and the expanded convolution layer are both randomly initialized.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the first input convolution layer, and the second addition unit is also connected to the first input convolution layer.
At this time, the second LReLU unit is configured to activate the calculation result of the first input convolution layer to obtain a second activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the calculation result of the first input convolution layer.
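A hedged PyTorch sketch of one residual convolution sub-block as described above (second LReLU unit, expanded/dilated convolution layer, third LReLU unit, fourth output convolution layer, second addition unit) follows; the channel count, kernel size, dilation and LReLU slope are illustrative assumptions, not values specified in this application.

```python
import torch
import torch.nn as nn

class ResidualConvSubBlock(nn.Module):
    def __init__(self, channels=512, kernel_size=3, dilation=1, lrelu_slope=0.1):
        super().__init__()
        self.lrelu1 = nn.LeakyReLU(lrelu_slope)            # "second LReLU unit"
        self.dilated_conv = nn.Conv1d(                     # expanded (dilated) convolution layer
            channels, channels, kernel_size,
            dilation=dilation, padding=(kernel_size - 1) // 2 * dilation,
        )
        self.lrelu2 = nn.LeakyReLU(lrelu_slope)            # "third LReLU unit"
        self.out_conv = nn.Conv1d(                         # "fourth output convolution layer"
            channels, channels, kernel_size, padding=(kernel_size - 1) // 2
        )

    def forward(self, x):
        y = self.lrelu1(x)          # activate the matrix input to the sub-block
        y = self.dilated_conv(y)
        y = self.lrelu2(y)
        y = self.out_conv(y)
        return x + y                # "second addition unit": residual addition with the input
```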
S103: and inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features.
After the target acoustic feature is obtained, the vocoder building system may input the target acoustic feature into the phase spectrum prediction model, so as to obtain a first phase spectrum corresponding to the target acoustic feature.
Referring to fig. 8, a schematic structural diagram of a phase spectrum prediction model provided in an embodiment of the present application, as shown in fig. 8, the phase spectrum prediction model includes: the system comprises a second input convolution layer, a second residual convolution network, a second output convolution layer, a third output convolution layer and a phase calculation module.
The second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer.
The second input convolution layer is used for carrying out convolution calculation on the target acoustic features.
And the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
And the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
The initial parameters of the second output convolution layer and the third output convolution layer are both randomly initialized. Because these initial parameters are set randomly, the parameters of the second output convolution layer and the third output convolution layer are different.
In some possible implementations, the phase calculation module is formulated as follows:
Φ(R, I) = arctan(I / R) - (π / 2) · sgn*(I) · [sgn*(R) - 1]
where R is the calculation result of the second output convolution layer and I is the calculation result of the third output convolution layer; Φ(0, 0) = 0. When R ≥ 0, sgn*(R) = 1; when R < 0, sgn*(R) = -1. When I ≥ 0, sgn*(I) = 1; when I < 0, sgn*(I) = -1.
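A small NumPy sketch of the phase calculation Φ(R, I) as reconstructed above is shown below; the function name and the epsilon guard against division by zero are assumptions for illustration.

```python
import numpy as np

def phase_from_real_imag(R: np.ndarray, I: np.ndarray) -> np.ndarray:
    """Sketch of Phi(R, I) following the sgn* definitions above, with Phi(0, 0) = 0."""
    sgn_R = np.where(R >= 0, 1.0, -1.0)
    sgn_I = np.where(I >= 0, 1.0, -1.0)
    eps = 1e-12                                   # guard against division by zero
    phase = np.arctan(I / (R + sgn_R * eps)) - (np.pi / 2.0) * sgn_I * (sgn_R - 1.0)
    return np.where((R == 0) & (I == 0), 0.0, phase)   # enforce Phi(0, 0) = 0
```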
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
As shown in fig. 6, the second residual convolution network in the phase spectrum prediction model is likewise formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, where each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding the calculation results of the N parallel, skip-connected residual convolution blocks.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLU unit is used for activating the calculation result of the averaging unit to obtain a first activation matrix.
Both the first residual convolution network and the second residual convolution network are formed by sequentially connecting N parallel, skip-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, but the initial parameters of the N residual convolution blocks in the first residual convolution network and of the N residual convolution blocks in the second residual convolution network are set randomly. Because the initial parameters of the units and modules in the second residual convolution network of the phase spectrum prediction model and in the first residual convolution network of the amplitude spectrum prediction model are set randomly, their parameters are different.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
Referring to fig. 9, which is a schematic structural diagram of another residual convolution sub-block provided in an embodiment of the present application, as shown in fig. 9, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second addition unit.
The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer, and the second addition unit are sequentially connected.
And the second LReLU unit is used for activating the matrix input into the second LReLU unit to obtain a second activation matrix.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLU unit is used for activating the calculation result of the expanded convolution layer to obtain a third activation matrix.
The fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the matrix input to the second LReLU unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the second input convolution layer, and the second addition unit is also connected to the second input convolution layer.
At this time, the second LReLU unit is configured to activate the calculation result of the second input convolution layer to obtain a second activation matrix.
And the second addition unit is used for adding the calculation result of the fourth output convolution layer and the calculation result of the second input convolution layer.
S104: and calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum.
After obtaining the first amplitude spectrum contained in the first logarithmic amplitude spectrum and the first phase spectrum, the vocoder building system may calculate a first reconstructed short-time spectrum from the first amplitude spectrum and the first phase spectrum.
In one possible implementation manner, the first reconstructed short-time spectrum Ŝ is calculated from the first amplitude spectrum and the first phase spectrum as follows:
Ŝ = Â · e^(jP̂) = Â · (cos P̂ + j sin P̂)
where Â is the first amplitude spectrum, i.e., the amplitude spectrum part recovered from the first logarithmic amplitude spectrum, and P̂ is the first phase spectrum.
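A minimal sketch of this calculation, assuming PyTorch tensors of shape (batch, frequency bins, frames); the function name is illustrative.

```python
import torch

def reconstruct_short_time_spectrum(log_amplitude: torch.Tensor,
                                    phase: torch.Tensor) -> torch.Tensor:
    amplitude = torch.exp(log_amplitude)        # recover the amplitude spectrum from its logarithm
    real = amplitude * torch.cos(phase)         # real part of the reconstructed spectrum
    imag = amplitude * torch.sin(phase)         # imaginary part of the reconstructed spectrum
    return torch.complex(real, imag)            # reconstructed short-time spectrum
```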
S105: and preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature.
After the first reconstructed short-time spectrum is obtained by calculation according to the first amplitude spectrum and the first phase spectrum, the vocoder building system can perform preprocessing such as inverse short-time fourier transform on the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature.
In a possible implementation manner, preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature includes:
and performing inverse short-time Fourier transform on the first reconstructed short-time spectrum to obtain the first reconstructed voice waveform corresponding to the target acoustic feature.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short Time Fourier Transform, ISTFT), the inverse Fourier transform is performed on each frame of the frequency-domain signal, the result of the inverse transform is then windowed (with the same window type, window length and overlap used during framing), and finally the windowed signals of all frames are overlap-added and divided by the overlap-added squares of the window functions, so that the original signal can be reconstructed.
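A sketch of this preprocessing step, assuming torch.istft is used; the FFT size, hop length and Hann window below are illustrative and would need to match the parameters used when extracting the natural short-time spectra.

```python
import torch

def short_time_spectrum_to_waveform(spectrum: torch.Tensor,
                                    n_fft: int = 1024,
                                    hop_length: int = 256) -> torch.Tensor:
    # spectrum: complex tensor of shape (batch, n_fft // 2 + 1, num_frames)
    window = torch.hann_window(n_fft)
    waveform = torch.istft(spectrum, n_fft=n_fft, hop_length=hop_length,
                           win_length=n_fft, window=window)
    return waveform                             # shape: (batch, num_samples)
```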
S106: and respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform.
After the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed voice waveform corresponding to the target acoustic feature, the vocoder building system can calculate the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed voice waveform respectively.
In one possible implementation, the amplitude spectrum loss L_A of the first logarithmic amplitude spectrum is calculated as follows:
L_A = mean( (log Â - log A)² )
where log Â is the first logarithmic amplitude spectrum and log A is the natural logarithmic amplitude spectrum;
S is the natural short-time complex spectrum extracted from the natural waveform by the short-time Fourier transform (Short Time Fourier Transform, STFT), and Re and Im represent the real and imaginary parts of S, respectively.
In one possible implementation, the phase spectrum loss L_P of the first phase spectrum is calculated as follows:
L_P = L_IP + L_GD + L_IAF
where L_IP is the instantaneous phase loss, L_GD is the group delay loss, and L_IAF is the instantaneous angular frequency loss.
The instantaneous phase loss is defined as the negative cosine loss between the first (predicted) phase spectrum P̂ and the natural phase spectrum P, namely:
L_IP = mean( -cos(P̂ - P) )
The group delay loss is defined as the negative cosine loss between the predicted group delay Δ_DF P̂ and the natural group delay Δ_DF P, namely:
L_GD = mean( -cos(Δ_DF P̂ - Δ_DF P) )
The instantaneous angular frequency loss is defined as the negative cosine loss between the predicted instantaneous angular frequency Δ_DT P̂ and the natural instantaneous angular frequency Δ_DT P, namely:
L_IAF = mean( -cos(Δ_DT P̂ - Δ_DT P) )
where Δ_DF and Δ_DT denote differencing along the frequency axis and along the time axis, respectively. The natural phase spectrum is calculated as P = Φ(Re(S), Im(S)).
Here Re(S) is the real part of the natural short-time complex spectrum S and Im(S) is its imaginary part; Φ(0, 0) = 0. When Re(S) ≥ 0, sgn*(Re(S)) = 1; when Re(S) < 0, sgn*(Re(S)) = -1. When Im(S) ≥ 0, sgn*(Im(S)) = 1; when Im(S) < 0, sgn*(Im(S)) = -1.
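A hedged PyTorch sketch of the three phase sub-losses described above, implemented as negative cosine losses with differencing along the frequency and time axes; the tensor layout (batch, frequency bins, frames) and the equal weighting of the three terms follow the reconstruction above and are assumptions.

```python
import torch

def negative_cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean(-torch.cos(pred - target))

def phase_spectrum_loss(pred_phase: torch.Tensor, natural_phase: torch.Tensor) -> torch.Tensor:
    # instantaneous phase loss
    l_ip = negative_cosine_loss(pred_phase, natural_phase)
    # group delay loss: difference along the frequency axis (dim=1)
    l_gd = negative_cosine_loss(torch.diff(pred_phase, dim=1),
                                torch.diff(natural_phase, dim=1))
    # instantaneous angular frequency loss: difference along the time axis (dim=2)
    l_iaf = negative_cosine_loss(torch.diff(pred_phase, dim=2),
                                 torch.diff(natural_phase, dim=2))
    return l_ip + l_gd + l_iaf
```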
In one possible implementation, the short-time spectrum loss L_S of the first reconstructed short-time spectrum is calculated as follows.
The short-time spectrum loss L_S is used to improve the degree of matching between the predicted amplitude spectrum Â and phase spectrum P̂ and to guarantee the quality of the short-time spectrum Ŝ reconstructed from them (i.e., Ŝ = Â · e^(jP̂)). It comprises three sub-losses: the real part loss L_R, the imaginary part loss L_I, and the short-time spectrum consistency loss L_C. The real part loss is defined as the absolute error loss between the reconstructed real part Re(Ŝ) and the natural real part Re(S), and the imaginary part loss is defined as the absolute error loss between the reconstructed imaginary part Im(Ŝ) and the natural imaginary part Im(S), namely:
L_R = mean( |Re(Ŝ) - Re(S)| ),  L_I = mean( |Im(Ŝ) - Im(S)| )
The short-time spectrum consistency loss is defined between the reconstructed short-time spectrum Ŝ and its consistent short-time spectrum Ŝ_C, and is used to reduce the gap between the two. Since the amplitude spectrum and the phase spectrum are predicted, and the consistent short-time spectrum domain is only a subset of the complex domain, the reconstructed short-time spectrum Ŝ is not necessarily a truly existing short-time spectrum. The truly existing short-time spectrum Ŝ_C corresponding to Ŝ is obtained by performing the inverse short-time Fourier transform on Ŝ to obtain a waveform and then performing the short-time Fourier transform on that waveform, namely:
Ŝ_C = STFT( ISTFT( Ŝ ) )
The short-time spectrum consistency loss is defined as the two-norm between Ŝ_C and Ŝ, written in terms of their real and imaginary parts:
L_C = ‖ Re(Ŝ_C) - Re(Ŝ) ‖₂ + ‖ Im(Ŝ_C) - Im(Ŝ) ‖₂
Finally, the short-time spectrum loss L_S is a linear combination of the real part loss L_R, the imaginary part loss L_I, and the short-time spectrum consistency loss L_C in a certain proportion, namely:
L_S = λ_RI · (L_R + L_I) + L_C
where λ_RI is a short-time spectrum loss hyperparameter, which can be manually determined and adjusted.
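A sketch of the short-time spectrum loss, assuming PyTorch complex tensors of shape (batch, frequency bins, frames); the value and placement of the λ_RI weight, the FFT parameters, and the use of a torch.istft followed by torch.stft round trip for the consistency term are assumptions for illustration.

```python
import torch

def short_time_spectrum_loss(recon_spec: torch.Tensor, natural_spec: torch.Tensor,
                             lambda_ri: float = 45.0,
                             n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    # real part loss and imaginary part loss (absolute error)
    l_r = torch.mean(torch.abs(recon_spec.real - natural_spec.real))
    l_i = torch.mean(torch.abs(recon_spec.imag - natural_spec.imag))

    # consistency loss: compare the reconstructed spectrum with the spectrum of the
    # waveform obtained from it (inverse STFT followed by STFT)
    window = torch.hann_window(n_fft)
    waveform = torch.istft(recon_spec, n_fft=n_fft, hop_length=hop_length,
                           win_length=n_fft, window=window)
    consistent_spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                                 win_length=n_fft, window=window, return_complex=True)
    num_frames = min(consistent_spec.shape[-1], recon_spec.shape[-1])
    diff = consistent_spec[..., :num_frames] - recon_spec[..., :num_frames]
    l_c = torch.mean(diff.real ** 2 + diff.imag ** 2)

    return lambda_ri * (l_r + l_i) + l_c
```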
In one possible implementation, the waveform loss L_W of the first reconstructed speech waveform is calculated as follows.
The waveform loss L_W is used to narrow the gap between the reconstructed waveform and the natural waveform. As in HiFi-GAN, it includes the generator loss of the generative adversarial network, the discriminator loss of the generative adversarial network, the feature matching loss, and the Mel spectrogram loss; the waveform loss of the first reconstructed speech waveform is a linear combination of these losses in a certain proportion. λ_Mel is a waveform loss hyperparameter (the weight of the Mel spectrogram loss), which can be manually determined and adjusted.
S107: and calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
After the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated, the vocoder building system calculates the correction parameters according to the amplitude spectrum loss L_A, the phase spectrum loss L_P, the short-time spectrum loss L_S and the waveform loss L_W.
In one possible implementation, the correction parameter L is calculated as follows:
L = λ_A · L_A + λ_P · L_P + λ_S · L_S + L_W
where λ_A, λ_P and λ_S are correction hyperparameters, which can be manually determined and adjusted.
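A sketch of combining the four losses into the correction parameter, following the reconstruction above; the weight values and the assumption that the waveform loss is left unweighted are placeholders, not values specified in this application.

```python
import torch

def correction_parameter(l_amplitude: torch.Tensor, l_phase: torch.Tensor,
                         l_spectrum: torch.Tensor, l_waveform: torch.Tensor,
                         lambda_a: float = 45.0, lambda_p: float = 100.0,
                         lambda_s: float = 20.0) -> torch.Tensor:
    # weighted linear combination of the amplitude, phase, short-time spectrum and waveform losses
    return lambda_a * l_amplitude + lambda_p * l_phase + lambda_s * l_spectrum + l_waveform
```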
S108: and correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor.
After the correction parameters are calculated according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss, the vocoder building system can correct each parameter in the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor.
S109: and correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
After the correction parameters are calculated according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss, the vocoder building system can correct each parameter in the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
In one possible implementation, the method further comprises A1-A3:
A1: and comparing the correction parameter with a preset parameter.
In order to improve the voice generation efficiency of the vocoder, the amplitude spectrum prediction model and the phase spectrum prediction model need to be trained iteratively until the correction parameter is smaller than or equal to the preset parameter; the vocoder building system therefore needs to compare the correction parameter with the preset parameter.
In a possible implementation manner, the preset parameter is a correction parameter value that is obtained by performing multiple iterative training on the amplitude spectrum prediction model and the phase spectrum prediction model and is not changed any more, and is generally set to 0.6525, and the preset parameter can be adjusted according to actual conditions.
A2: and in response to the correction parameter being smaller than or equal to the preset parameter, executing the correction of the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and correcting the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor.
A3: and in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter (a simplified training-loop sketch is given below).
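A simplified training-loop sketch of the iterative correction described in A1-A3. The model classes, the `compute_correction_parameter` helper, the optimizer and the threshold value are all assumptions introduced for illustration:

```python
import torch

def train_vocoder(amp_model, phase_model, data_loader, optimizer,
                  compute_correction_parameter, preset_parameter=0.6525,
                  max_steps=100_000):
    """Iteratively correct both prediction models until the correction parameter
    no longer exceeds the preset parameter (or a step budget is exhausted)."""
    step = 0
    for features, natural_wave in data_loader:
        log_amp = amp_model(features)      # first logarithmic amplitude spectrum
        phase = phase_model(features)      # first phase spectrum
        loss = compute_correction_parameter(log_amp, phase, natural_wave)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                   # corrects the parameters of both models
        step += 1
        if loss.item() <= preset_parameter or step >= max_steps:
            break
    return amp_model, phase_model
```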
Based on the content of S101-S109, the target acoustic feature is first acquired and input into the amplitude spectrum prediction model to obtain the first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum comprises the first amplitude spectrum; the target acoustic feature is also input into the phase spectrum prediction model to obtain the first phase spectrum corresponding to the target acoustic feature. The first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and is preprocessed to obtain the first reconstructed speech waveform corresponding to the target acoustic feature. Next, the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated respectively, and the correction parameter is obtained from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. Finally, the amplitude spectrum prediction model is corrected according to the correction parameter to obtain the amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameter to obtain the phase spectrum predictor. Both the amplitude spectrum predictor and the phase spectrum predictor of this method operate entirely at the frame level and can predict the speech amplitude spectrum and phase spectrum directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
Based on the embodiment of the method for constructing the vocoder, the embodiment of the application also provides a voice synthesis method. Referring to fig. 3, fig. 3 is a flowchart of a method for synthesizing speech according to an embodiment of the present application. As shown in fig. 3, the method includes S301-S306:
S301: and acquiring the acoustic features to be synthesized.
When using the vocoder, the acoustic feature to be synthesized is first acquired.

In one possible implementation, the acoustic feature to be synthesized may be obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "the weather is good today", the text is input into the acoustic model and converted into the corresponding acoustic feature to be synthesized; the vocoder can then perform audio synthesis based on this acoustic feature to obtain synthesized audio data. The type of the acoustic model may be selected according to actual needs.

The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature to be synthesized is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature to be synthesized may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
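As an illustration of preparing such an input, a minimal sketch that extracts an 80-dimensional mel-spectrogram from a waveform with librosa; the sampling rate and frame parameters are assumptions for the example and should match the vocoder's training configuration:

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load audio and compute a log mel-spectrogram of shape (n_mels, frames)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))   # log compression for numerical stability
```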
S302: inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 6.
S303: inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 6.
S304: and calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum.
After the acoustic feature to be synthesized is respectively input into the amplitude spectrum predictor and the phase spectrum predictor to obtain the second logarithmic amplitude spectrum and the second phase spectrum, the speech synthesis system also needs to calculate the second reconstructed short-time spectrum from the second amplitude spectrum comprised in the second logarithmic amplitude spectrum and from the second phase spectrum.
The amplitude spectrum predictor and the phase spectrum predictor are constructed according to the construction method of the vocoder described in S101-S109.
In one possible implementation, the second reconstructed short-time spectrum $\hat{S}$ is calculated from the second amplitude spectrum and the second phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the second amplitude spectrum, i.e. the amplitude spectrum part recovered from the second logarithmic amplitude spectrum, and $\hat{P}$ is the second phase spectrum.
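A minimal PyTorch sketch of this reconstruction, assuming the predictors output a log-amplitude spectrum and a phase spectrum of the same shape:

```python
import torch

def reconstruct_short_time_spectrum(log_amplitude, phase):
    """Combine a log-amplitude spectrum and a phase spectrum into a complex short-time spectrum."""
    amplitude = torch.exp(log_amplitude)      # recover the amplitude spectrum part
    real = amplitude * torch.cos(phase)
    imag = amplitude * torch.sin(phase)
    return torch.complex(real, imag)          # S_hat = A_hat * exp(j * P_hat)
```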
S305: and preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In order to obtain the synthesized voice, after the second reconstructed short-time spectrum is obtained by calculating according to the second amplitude spectrum and the second phase spectrum, the voice synthesis system also needs to preprocess the obtained second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In some possible implementations, the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized includes:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
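As a sketch, this inverse transform with windowed overlap-add and window-square normalisation is available directly in PyTorch; the parameters below are assumptions for the example and must match those used when the spectra were defined:

```python
import torch

def spectrum_to_waveform(S_hat, n_fft=1024, hop_length=256, win_length=1024):
    """Inverse STFT of a complex spectrum of shape (batch, n_fft // 2 + 1, frames)."""
    window = torch.hann_window(win_length, device=S_hat.device)
    return torch.istft(S_hat, n_fft=n_fft, hop_length=hop_length,
                       win_length=win_length, window=window)
```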
S306: and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
In order to obtain the synthesized speech, after obtaining the second reconstructed speech waveform, the speech synthesis system further needs to convert the obtained second reconstructed speech waveform into the synthesized speech corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second reconstructed speech waveform may be converted into playable speech using software or libraries (for example, Python audio libraries) that write a sound waveform out as an audio file.
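For example, a minimal sketch using the soundfile package, one of several Python libraries that can serve this purpose:

```python
import soundfile as sf

def waveform_to_audio_file(waveform, path="synthesized.wav", sample_rate=22050):
    """Write the reconstructed waveform (a 1-D numpy array; call .cpu().numpy()
    on a tensor first) to a WAV file that can be played back as the synthesized speech."""
    sf.write(path, waveform, sample_rate)
```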
Based on the above content of S301-S306, the acoustic feature to be synthesized can be predicted by the trained amplitude spectrum predictor and phase spectrum predictor to obtain the second logarithmic amplitude spectrum and the second phase spectrum corresponding to the acoustic feature to be synthesized, where the second logarithmic amplitude spectrum comprises the second amplitude spectrum. The second reconstructed short-time spectrum is then calculated from the second amplitude spectrum and the second phase spectrum, and the second reconstructed short-time spectrum is preprocessed to obtain the second reconstructed speech waveform corresponding to the acoustic feature to be synthesized. Finally, the second reconstructed speech waveform is converted into the synthesized speech corresponding to the acoustic feature to be synthesized. Because the operations of the amplitude spectrum predictor and the phase spectrum predictor are entirely frame-level, the speech amplitude spectrum and phase spectrum can be predicted directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a vocoder building device according to an embodiment of the present application. As shown in fig. 2, the vocoder constructing apparatus includes:
a first acquisition unit 201 for acquiring a target acoustic feature.
The target acoustic feature is an acoustic feature to be input into the amplitude spectrum prediction model and the phase spectrum prediction model for training, and is derived by inputting text to be synthesized into an acoustic model. For example, if the text is "the weather is good today", the acoustic model converts it into the corresponding target acoustic feature, and the vocoder can then perform audio synthesis based on the target acoustic feature to obtain clean synthesized audio data. Since the acoustic features output by an acoustic model are usually somewhat noisy, synthesizing audio directly from noisy acoustic features affects the sound quality of the synthesized audio data. The type of the acoustic model may be selected according to actual needs.
The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
The first input unit 202 is configured to input the target acoustic feature into an amplitude spectrum prediction model, and obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum includes the first amplitude spectrum.
In some possible implementations, the amplitude spectrum prediction model includes: a first input convolution layer, a first residual convolution network, and a first output convolution layer.
The first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence.
The first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics.
The first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
The first output convolution layer is used for carrying out convolution calculation on the first residual convolution network so as to obtain a second logarithmic magnitude spectrum.
The initial parameters of the first input convolution layer and the first output convolution layer are obtained through random setting of the convolution layers.
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
In some possible implementations, the first residual convolution network is formed by sequentially connecting N parallel residual convolution blocks connected in a jumping manner, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are both positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit and the first adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping mode.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
In some possible implementations, the LReLU unit, i.e. the leaky rectified linear unit (Leaky ReLU) function, is a variant of the classical and widely used ReLU activation function whose output has a small slope for negative inputs. Since the derivative is never zero, this reduces the occurrence of silent neurons and allows gradient-based learning (although it may be slow), solving the problem of neurons ceasing to learn once the ReLU function enters the negative interval. Activation by the LReLU unit means applying this function to the calculation result.

In some possible implementations, the activation performed by the first LReLU unit is the application of this function to the calculation result of the averaging unit.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
In some possible implementations, the residual convolution sub-block includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer and a second adding unit.

The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are sequentially connected.
And the second LReLu unit is used for activating the matrix input into the second LReLu unit to obtain a second activation matrix.
In some possible implementations, the activation performed by the second LReLU unit is the application of this function to the matrix input to the second LReLU unit.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLu unit is used for activating the calculation result of the expansion convolution layer to obtain a third activation matrix.
In some possible implementations, the activation performed by the third LReLU unit is the application of this function to the calculation result of the expanded convolution layer.

The fourth output convolution layer is configured to perform convolution calculation on the third activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the first input convolution layer; the second adding unit is also connected to the first input convolution layer.

At this time, the second LReLU unit is configured to activate the calculation result of the first input convolution layer to obtain the second activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the first input convolution layer.
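To make the structure above concrete, here is a minimal PyTorch-style sketch of an amplitude spectrum prediction model with this topology (first input convolution layer, N parallel skip-connected residual convolution blocks whose outputs are summed and averaged, a first LReLU unit, and a first output convolution layer). The channel counts, kernel sizes and dilation rates are assumptions chosen for illustration, not values fixed by this application:

```python
import torch
import torch.nn as nn

class ResidualConvSubBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.act1 = nn.LeakyReLU(0.1)                        # second LReLU unit
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size,
                                      dilation=dilation,
                                      padding=(kernel_size - 1) * dilation // 2)  # expanded convolution layer
        self.act2 = nn.LeakyReLU(0.1)                        # third LReLU unit
        self.out_conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)  # fourth output convolution layer

    def forward(self, x):
        y = self.out_conv(self.act2(self.dilated_conv(self.act1(x))))
        return x + y                                         # second adding unit (residual connection)

class ResidualConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.sub_blocks = nn.ModuleList(
            [ResidualConvSubBlock(channels, kernel_size, d) for d in dilations])  # X cascaded sub-blocks

    def forward(self, x):
        for sub in self.sub_blocks:
            x = sub(x)
        return x

class AmplitudeSpectrumModel(nn.Module):
    def __init__(self, in_dim=80, channels=512, out_dim=513, n_blocks=3):
        super().__init__()
        self.input_conv = nn.Conv1d(in_dim, channels, 7, padding=3)    # first input convolution layer
        self.blocks = nn.ModuleList(
            [ResidualConvBlock(channels) for _ in range(n_blocks)])    # N parallel residual conv blocks
        self.act = nn.LeakyReLU(0.1)                                   # first LReLU unit
        self.output_conv = nn.Conv1d(channels, out_dim, 7, padding=3)  # first output convolution layer

    def forward(self, mel):
        x = self.input_conv(mel)                            # (batch, channels, frames)
        summed = sum(block(x) for block in self.blocks)     # first adding unit
        x = self.act(summed / len(self.blocks))             # averaging unit + activation
        return self.output_conv(x)                          # predicted log-amplitude spectrum
```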
The second input unit 203 is configured to input the target acoustic feature into a phase spectrum prediction model, so as to obtain a first phase spectrum corresponding to the target acoustic feature.
In some possible implementations, the phase spectrum prediction model includes: the system comprises a second input convolution layer, a second residual convolution network, a second output convolution layer, a third output convolution layer and a phase calculation module.
The second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer.
The second input convolution layer is used for carrying out convolution calculation on the target acoustic features.
And the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer.
In some possible implementations, the deep convolution calculation refers to performing a plurality of convolution calculations.
And the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network.
And the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
The initial parameters of the second output convolution layer and the third output convolution layer are obtained through random setting of the convolution layers. Because the initial parameters of the second output convolution layer and the third output convolution layer are obtained by random setting of the convolution layers, the parameters of the second output convolution layer and the third output convolution layer are different.
In some possible implementations, the phase calculation module is formulated as follows:

$$\Phi(R,I) = \arctan\!\Big(\frac{I}{R}\Big) - \frac{\pi}{2}\,\mathrm{sgn}^{*}(I)\,\big(\mathrm{sgn}^{*}(R) - 1\big)$$

where $R$ is the calculation result of the second output convolution layer, $I$ is the calculation result of the third output convolution layer, and $\Phi(0,0)=0$. The modified sign function $\mathrm{sgn}^{*}$ is defined as: when $R \ge 0$, $\mathrm{sgn}^{*}(R)=1$; when $R<0$, $\mathrm{sgn}^{*}(R)=-1$; when $I \ge 0$, $\mathrm{sgn}^{*}(I)=1$; when $I<0$, $\mathrm{sgn}^{*}(I)=-1$.
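A minimal sketch of this phase calculation in PyTorch; since the formula is equivalent to the two-argument arctangent, torch.atan2 could be used directly instead. The small eps is an assumption added only to avoid division by zero in the sketch:

```python
import math
import torch

def phase_calculation(R, I, eps=1e-8):
    """Map the parallel convolution outputs (R, I) to a phase spectrum in (-pi, pi]."""
    sgn_r = torch.where(R >= 0, torch.ones_like(R), -torch.ones_like(R))   # sgn*(R)
    sgn_i = torch.where(I >= 0, torch.ones_like(I), -torch.ones_like(I))   # sgn*(I)
    # Equivalently: torch.atan2(I, R)
    return torch.atan(I / (R + eps)) - (math.pi / 2) * sgn_i * (sgn_r - 1.0)
```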
In some possible implementations, each convolutional layer (Convolutional layer) in the convolutional neural network is composed of a number of convolutional units, and parameters of each convolutional unit are optimized through a back-propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
The second residual convolution network in the phase spectrum prediction model is formed by sequentially connecting N parallel residual convolution blocks which are connected in a jumping manner, a first adding unit, an averaging unit and a first LReLu unit, wherein the residual convolution blocks are formed by cascading X residual convolution sub-blocks; n, X are all positive integers.
The second LReLU unit, the expanded convolution layer, the third LReLU unit and the adding unit are sequentially connected.
The residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer.
The first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping mode.
And the average unit is used for carrying out average calculation on the calculation result of the first adding unit.
And the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
The first residual convolution network and the second residual convolution network are formed by sequentially connecting N residual convolution blocks which are connected in parallel in a jumping manner, a first adding unit, an averaging unit and a first LReLu unit, but initial parameters of the N residual convolution blocks in the first residual convolution network and initial parameters of the N residual convolution blocks in the second residual convolution network are obtained through random setting. Because the initial parameters of the N residual convolution blocks in the first residual convolution network and the N residual convolution blocks in the second residual convolution network are obtained through random setting, the parameters of the N residual convolution blocks in the first residual convolution network and the N residual convolution blocks in the second residual convolution network are different.
In some possible implementations, the residual convolution calculations include convolution calculations and activation calculations.
The residual convolution sub-block in the second residual convolution network also includes: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer and a second adding unit.

The second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are sequentially connected.
And the second LReLu unit is used for activating the matrix input into the second LReLu unit to obtain a second activation matrix.
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix.
And the third LReLu unit is used for activating the calculation result of the expansion convolution layer to obtain a third activation matrix.
The fourth output convolution layer is configured to perform convolution calculation on the third activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
In some possible implementations, when the residual convolution sub-block is the first residual convolution sub-block of the cascaded X residual convolution sub-blocks, the second LReLU unit is further connected to the second input convolution layer; the second adding unit is also connected to the second input convolution layer.

At this time, the second LReLU unit is configured to activate the calculation result of the second input convolution layer to obtain the second activation matrix.
And the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the second input convolution layer.
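Analogously, a minimal sketch of a phase spectrum prediction model with this topology. It reuses the ResidualConvBlock and phase_calculation helpers from the earlier sketches, and again all sizes are illustrative assumptions:

```python
import torch.nn as nn

class PhaseSpectrumModel(nn.Module):
    def __init__(self, in_dim=80, channels=512, out_dim=513, n_blocks=3):
        super().__init__()
        self.input_conv = nn.Conv1d(in_dim, channels, 7, padding=3)   # second input convolution layer
        self.blocks = nn.ModuleList(
            [ResidualConvBlock(channels) for _ in range(n_blocks)])   # second residual convolution network
        self.act = nn.LeakyReLU(0.1)
        self.real_conv = nn.Conv1d(channels, out_dim, 7, padding=3)   # second output convolution layer (R)
        self.imag_conv = nn.Conv1d(channels, out_dim, 7, padding=3)   # third output convolution layer (I)

    def forward(self, mel):
        x = self.input_conv(mel)
        x = self.act(sum(block(x) for block in self.blocks) / len(self.blocks))
        R, I = self.real_conv(x), self.imag_conv(x)
        return phase_calculation(R, I)                                # phase calculation module
```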
The first calculating unit 204 is configured to calculate a first reconstructed short-time spectrum according to the first amplitude spectrum and the first phase spectrum.
In one possible implementation, the first reconstructed short-time spectrum $\hat{S}$ is calculated from the first amplitude spectrum and the first phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the first amplitude spectrum, i.e. the amplitude spectrum part recovered from the first logarithmic amplitude spectrum, and $\hat{P}$ is the first phase spectrum.
The first preprocessing unit 205 is configured to preprocess the first reconstructed short-time spectrum to obtain a first reconstructed speech waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
A second calculating unit 206, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed speech waveform.
In one possible implementation, the amplitude spectrum loss $\mathcal{L}_A$ of the first logarithmic amplitude spectrum is calculated as follows:

$$\mathcal{L}_A = \mathbb{E}\Big[\big\|\log\hat{A} - \log A\big\|_2^{2}\Big]$$

where $\log\hat{A}$ is the first logarithmic amplitude spectrum and $\log A$ is the natural logarithmic amplitude spectrum, i.e. $\log A = \tfrac{1}{2}\log\!\big(\mathrm{Re}(S)^{2} + \mathrm{Im}(S)^{2}\big)$.

Here $s$ is the natural waveform, the natural short-time complex spectrum $S$ is extracted from it by the short-time Fourier transform (Short-Time Fourier Transform, STFT), and $\mathrm{Re}$ and $\mathrm{Im}$ denote the real part and the imaginary part of $S$, respectively.
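A minimal sketch of this amplitude spectrum loss, assuming the predicted log-amplitude spectrum and the natural complex spectrum are tensors of matching shape:

```python
import torch

def amplitude_spectrum_loss(log_amp_pred, natural_spectrum):
    """MSE between the predicted log-amplitude spectrum and the natural log-amplitude spectrum."""
    log_amp_nat = 0.5 * torch.log(natural_spectrum.real ** 2 +
                                  natural_spectrum.imag ** 2 + 1e-9)   # log|S|, eps for stability
    return torch.mean((log_amp_pred - log_amp_nat) ** 2)
```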
In one possible implementation, the phase spectrum loss $\mathcal{L}_P$ of the first phase spectrum is calculated as follows:

$$\mathcal{L}_P = \mathcal{L}_{IP} + \mathcal{L}_{GD} + \mathcal{L}_{IAF}$$

where $\mathcal{L}_{IP}$ is the instantaneous phase loss, $\mathcal{L}_{GD}$ is the group delay loss and $\mathcal{L}_{IAF}$ is the instantaneous angular frequency loss.

The instantaneous phase loss is defined as the negative cosine loss between the predicted phase spectrum $\hat{P}$ and the natural phase spectrum $P$, namely:

$$\mathcal{L}_{IP} = \mathbb{E}\big[-\cos(\hat{P} - P)\big]$$

The group delay loss is defined as the negative cosine loss between the predicted group delay $\Delta_{DF}\hat{P}$ and the natural group delay $\Delta_{DF}P$, namely:

$$\mathcal{L}_{GD} = \mathbb{E}\big[-\cos(\Delta_{DF}\hat{P} - \Delta_{DF}P)\big]$$

The instantaneous angular frequency loss is defined as the negative cosine loss between the predicted instantaneous angular frequency $\Delta_{DT}\hat{P}$ and the natural instantaneous angular frequency $\Delta_{DT}P$, namely:

$$\mathcal{L}_{IAF} = \mathbb{E}\big[-\cos(\Delta_{DT}\hat{P} - \Delta_{DT}P)\big]$$

where $\Delta_{DF}$ and $\Delta_{DT}$ denote the difference along the frequency axis and the difference along the time axis, respectively. The natural phase spectrum is calculated as $P = \Phi\big(\mathrm{Re}(S), \mathrm{Im}(S)\big)$.

Here $\mathrm{Re}(S)$ is the real part of the natural short-time complex spectrum $S$ and $\mathrm{Im}(S)$ is its imaginary part, $\Phi$ is the phase calculation formula given above with $\Phi(0,0)=0$, and $\mathrm{sgn}^{*}$ satisfies: when $\mathrm{Re}(S)\ge 0$, $\mathrm{sgn}^{*}(\mathrm{Re}(S))=1$; when $\mathrm{Re}(S)<0$, $\mathrm{sgn}^{*}(\mathrm{Re}(S))=-1$; when $\mathrm{Im}(S)\ge 0$, $\mathrm{sgn}^{*}(\mathrm{Im}(S))=1$; when $\mathrm{Im}(S)<0$, $\mathrm{sgn}^{*}(\mathrm{Im}(S))=-1$.
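A minimal sketch of the three anti-wrapping phase sub-losses, assuming phase spectra shaped (batch, frequency_bins, frames):

```python
import torch

def negative_cosine_loss(pred, target):
    return torch.mean(-torch.cos(pred - target))

def phase_spectrum_loss(phase_pred, phase_nat):
    """Instantaneous phase, group delay and instantaneous angular frequency losses."""
    l_ip = negative_cosine_loss(phase_pred, phase_nat)
    l_gd = negative_cosine_loss(torch.diff(phase_pred, dim=1),
                                torch.diff(phase_nat, dim=1))    # difference along the frequency axis
    l_iaf = negative_cosine_loss(torch.diff(phase_pred, dim=2),
                                 torch.diff(phase_nat, dim=2))   # difference along the time axis
    return l_ip + l_gd + l_iaf
```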
In one possible implementation, the short-time spectrum loss $\mathcal{L}_S$ of the first reconstructed short-time spectrum is calculated as described below.

The short-time spectral loss $\mathcal{L}_S$ is used to improve the degree of matching between the predicted amplitude spectrum $\hat{A}$ and phase spectrum $\hat{P}$ and to guarantee the consistency of the short-time spectrum $\hat{S}$ reconstructed from them (i.e. $\hat{S} = \hat{A}e^{j\hat{P}}$). It comprises three sub-losses: the real part loss $\mathcal{L}_R$, the imaginary part loss $\mathcal{L}_I$ and the short-time spectrum consistency loss $\mathcal{L}_C$. The real part loss is defined as the absolute error between the reconstructed real part $\mathrm{Re}(\hat{S})$ and the natural real part $\mathrm{Re}(S)$, and the imaginary part loss as the absolute error between the reconstructed imaginary part $\mathrm{Im}(\hat{S})$ and the natural imaginary part $\mathrm{Im}(S)$, namely:

$$\mathcal{L}_R = \mathbb{E}\big[\|\mathrm{Re}(\hat{S}) - \mathrm{Re}(S)\|_1\big], \qquad \mathcal{L}_I = \mathbb{E}\big[\|\mathrm{Im}(\hat{S}) - \mathrm{Im}(S)\|_1\big]$$
The short-time spectrum consistency loss is defined between the reconstructed short-time spectrum $\hat{S}$ and its consistent counterpart $\hat{S}'$, and is used to reduce the gap between the two. Since both the amplitude spectrum and the phase spectrum are predicted, and the set of consistent short-time spectra is only a subset of the complex domain, the reconstructed short-time spectrum $\hat{S}$ is not necessarily a truly existing short-time spectrum. The truly existing short-time spectrum $\hat{S}'$ corresponding to $\hat{S}$ is obtained by applying the inverse short-time Fourier transform to $\hat{S}$ to obtain a waveform, and then applying the short-time Fourier transform to that waveform, namely:

$$\hat{S}' = \mathrm{STFT}\big(\mathrm{ISTFT}(\hat{S})\big)$$

The short-time spectrum consistency loss is then defined as the two-norm between $\hat{S}$ and $\hat{S}'$, written in the form of their real and imaginary parts:

$$\mathcal{L}_C = \mathbb{E}\Big[\big\|\mathrm{Re}(\hat{S})-\mathrm{Re}(\hat{S}')\big\|_2 + \big\|\mathrm{Im}(\hat{S})-\mathrm{Im}(\hat{S}')\big\|_2\Big]$$

Finally, the short-time spectral loss $\mathcal{L}_S$ is a linear combination of the real part loss $\mathcal{L}_R$, the imaginary part loss $\mathcal{L}_I$ and the short-time spectrum consistency loss $\mathcal{L}_C$ in a certain proportion, namely:

$$\mathcal{L}_S = \lambda_{RI}\,(\mathcal{L}_R + \mathcal{L}_I) + \mathcal{L}_C$$

where $\lambda_{RI}$ is a short-time spectral loss hyper-parameter that can be determined and adjusted manually.
In one possible implementation, the waveform loss $\mathcal{L}_W$ of the first reconstructed speech waveform is calculated as follows:

$$\mathcal{L}_W = \mathcal{L}_{\mathrm{GAN}\text{-}G} + \mathcal{L}_{\mathrm{GAN}\text{-}D} + \mathcal{L}_{FM} + \lambda_{Mel}\,\mathcal{L}_{Mel}$$

The waveform loss $\mathcal{L}_W$ is used to narrow the gap between the reconstructed waveform and the natural waveform. As in HiFi-GAN, it includes the generator loss $\mathcal{L}_{\mathrm{GAN}\text{-}G}$ of the generative adversarial network, the discriminator loss $\mathcal{L}_{\mathrm{GAN}\text{-}D}$, the feature matching loss $\mathcal{L}_{FM}$ and the mel-spectrogram loss $\mathcal{L}_{Mel}$; the waveform loss of the first reconstructed speech waveform is a linear combination of these losses in a certain proportion. $\lambda_{Mel}$ is a waveform loss hyper-parameter that can be determined and adjusted manually.
A third calculation unit 207 is configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss.
In one possible implementation, the correction parameter $\mathcal{L}$ is calculated as follows:

$$\mathcal{L} = \lambda_A\,\mathcal{L}_A + \lambda_P\,\mathcal{L}_P + \lambda_S\,\mathcal{L}_S + \mathcal{L}_W$$

where $\lambda_A$, $\lambda_P$ and $\lambda_S$ are correction hyper-parameters that can be determined and adjusted manually.
A first correction unit 208, configured to correct the magnitude spectrum prediction model according to the correction parameter, so as to obtain the magnitude spectrum predictor.
A second correction unit 209 is configured to correct the phase spectrum prediction model according to the correction parameter so as to obtain the phase spectrum predictor.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. As shown in fig. 4, the speech synthesis apparatus includes:
a second acquisition unit 401 is configured to acquire an acoustic feature to be synthesized.
In one possible implementation, the acoustic feature to be synthesized may be obtained by inputting the text to be synthesized into an acoustic model. For example, if the text to be synthesized is "the weather is good today", the text is input into the acoustic model and converted into the corresponding acoustic feature to be synthesized; the vocoder can then perform audio synthesis based on this acoustic feature to obtain synthesized audio data. The type of the acoustic model may be selected according to actual needs.

The acoustic feature may include, but is not limited to, at least one of spectral parameters such as a spectrum or a cepstrum, and may additionally include one or more of the fundamental frequency and voiced/unvoiced flags. In this embodiment, the acoustic feature to be synthesized is described by taking a spectrum as an example, specifically a mel-spectrogram (mel-spectrum). In other embodiments, the acoustic feature to be synthesized may be a cepstrum combined with the fundamental frequency, optionally together with voiced/unvoiced flags. It will be appreciated that, in use, the same class of acoustic feature as was used when training the vocoder must be prepared as input. For example, if an 80-dimensional mel-spectrogram was used in training, an 80-dimensional mel-spectrogram is also taken as the input in application.
A third input unit 402, configured to input the acoustic feature to be synthesized into a pre-constructed amplitude spectrum predictor, so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic feature to be synthesized, where the second logarithmic amplitude spectrum includes the second amplitude spectrum.
A fourth input unit 403, configured to input the acoustic feature to be synthesized into a pre-constructed phase spectrum predictor, so as to obtain a second phase spectrum corresponding to the acoustic feature to be synthesized.
A fourth calculating unit 403, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum.
In one possible implementation, the second reconstructed short-time spectrum $\hat{S}$ is calculated from the second amplitude spectrum and the second phase spectrum as follows:

$$\hat{S} = \hat{A}\,e^{\,j\hat{P}} = \hat{A}\cos\hat{P} + j\,\hat{A}\sin\hat{P}$$

where $\hat{A}$ is the second amplitude spectrum, i.e. the amplitude spectrum part recovered from the second logarithmic amplitude spectrum, and $\hat{P}$ is the second phase spectrum.
And a second preprocessing unit 404, configured to preprocess the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second preprocessing unit 404 is specifically configured to:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
In one possible implementation, when performing the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT), an inverse Fourier transform is applied to each frame of the frequency-domain signal, the inverse-transformed result is then windowed (with the same window type, window length and overlap as used during framing), and finally the windowed frames are overlap-added and divided by the overlap-added squares of the window function of each frame, so that the original signal is reconstructed.
A first converting unit 405, configured to convert the second reconstructed speech waveform into a synthesized speech corresponding to the acoustic feature to be synthesized.
In some possible implementations, the second reconstructed speech waveform may be converted into playable speech using software or libraries (for example, Python audio libraries) that write a sound waveform out as an audio file.
In one possible implementation, the apparatus further includes:
and the comparison unit is used for comparing the correction parameter with a preset parameter.
The first execution unit is used for responding to the correction parameter being smaller than or equal to the preset parameter, and is used for executing the correction of the amplitude spectrum prediction model according to the correction parameter so as to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the correction of the phase spectrum prediction model according to the correction parameter so as to obtain a corrected phase spectrum prediction model as the phase spectrum predictor.
And the second execution unit is used for responding to the correction parameter being larger than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter.
The embodiments of the present application provide a vocoder construction method, a speech synthesis method and related devices. The vocoder construction method includes: acquiring a target acoustic feature and inputting it into an amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature, where the first logarithmic amplitude spectrum comprises a first amplitude spectrum; and inputting the target acoustic feature into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic feature. A first reconstructed short-time spectrum is then calculated from the first amplitude spectrum and the first phase spectrum, and the first reconstructed short-time spectrum is preprocessed to obtain a first reconstructed speech waveform corresponding to the target acoustic feature. The amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstructed short-time spectrum and the waveform loss of the first reconstructed speech waveform are calculated respectively, and the correction parameter is then calculated from the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss. The amplitude spectrum prediction model is corrected according to the correction parameter to obtain an amplitude spectrum predictor, and the phase spectrum prediction model is corrected according to the correction parameter to obtain a phase spectrum predictor. Both the amplitude spectrum predictor and the phase spectrum predictor of this method operate entirely at the frame level and can predict the speech amplitude spectrum and phase spectrum directly and in parallel, which significantly improves speech generation efficiency and reduces the complexity of the overall operation. At the same time, the present application trains both the amplitude spectrum predictor and the phase spectrum predictor by utilizing the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss.
The above describes in detail a method for constructing a vocoder, a method for synthesizing speech, and related devices. In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should be noted that it would be obvious to those skilled in the art that various improvements and modifications can be made to the present application without departing from the principles of the present application, and such improvements and modifications fall within the scope of the claims of the present application.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method of constructing a vocoder, the vocoder comprising: an amplitude spectrum predictor and a phase spectrum predictor, the method comprising:
acquiring a target acoustic feature;
inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
Preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the target acoustic feature;
respectively calculating the amplitude spectrum loss of the first logarithmic amplitude spectrum, the phase spectrum loss of the first phase spectrum, the short-time spectrum loss of the first reconstruction short-time spectrum and the waveform loss of the first reconstruction voice waveform;
calculating to obtain correction parameters according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss and the waveform loss;
correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain a corrected amplitude spectrum prediction model serving as the amplitude spectrum predictor;
and correcting the phase spectrum prediction model according to the correction parameters so as to obtain a corrected phase spectrum prediction model serving as the phase spectrum predictor.
2. The method according to claim 1, wherein the method further comprises:
comparing the correction parameter with a preset parameter;
executing the correction of the amplitude spectrum prediction model according to the correction parameter to obtain a corrected amplitude spectrum prediction model as the amplitude spectrum predictor, and the correction of the phase spectrum prediction model according to the correction parameter to obtain a corrected phase spectrum prediction model as the phase spectrum predictor in response to the correction parameter being less than or equal to the preset parameter;
And in response to the correction parameter being greater than the preset parameter, taking the corrected amplitude spectrum prediction model as the amplitude spectrum prediction model, taking the corrected phase spectrum prediction model as the phase spectrum prediction model, and executing the step of inputting the target acoustic feature into the amplitude spectrum prediction model to obtain a first logarithmic amplitude spectrum corresponding to the target acoustic feature and the subsequent steps, until the correction parameter meets the preset parameter.
3. The method of claim 1, wherein the magnitude spectrum prediction model comprises: a first input convolution layer, a first residual convolution network, and a first output convolution layer;
the first input convolution layer is connected with the first residual convolution network; the first residual convolution network is respectively connected with the first input convolution layer and the first output convolution layer in sequence;
the first input convolution layer is used for carrying out convolution calculation on the target acoustic characteristics;
the first residual convolution network is used for performing depth convolution calculation on the calculation result of the first input convolution layer;
the first output convolution layer is used for carrying out convolution calculation on the first residual convolution network so as to obtain a second logarithmic magnitude spectrum.
4. The method of claim 1, wherein the phase spectrum prediction model comprises: the second input convolution layer, the second residual convolution network, the second output convolution layer, the third output convolution layer and the phase calculation module;
the second input convolution layer is connected with the second residual convolution network; the second residual convolution network is respectively connected with the second input convolution layer, the second output convolution layer and the third output convolution layer; the phase calculation module is respectively connected with the second output convolution layer and the third output convolution layer;
the second input convolution layer is used for carrying out convolution calculation on the target acoustic features;
the second residual convolution network is used for performing depth convolution calculation on the calculation result of the second input convolution layer;
the second output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
the third output convolution layer is used for carrying out convolution calculation on the calculation result of the second residual convolution network;
and the phase calculation module is used for carrying out phase calculation according to the calculation results output by the second output convolution layer and the third output convolution layer so as to obtain the second phase spectrum.
5. The method according to claim 3 or 4, wherein the first residual convolution network and the second residual convolution network are each formed by sequentially connecting N parallel jump-connected residual convolution blocks, a first adding unit, an averaging unit and a first LReLU unit, wherein each residual convolution block is formed by cascading X residual convolution sub-blocks; N and X are positive integers;
the residual convolution block is used for carrying out residual convolution calculation on the calculation result of the first input convolution layer or the second input convolution layer;
the first adding unit is used for adding and calculating the calculation results of the N parallel residual convolution blocks which are connected in a jumping manner;
the average unit is used for carrying out average calculation on the calculation result of the first adding unit;
and the first LReLu unit is used for activating the calculation result of the average unit to obtain a first activation matrix.
6. The method of claim 5, wherein the residual convolution sub-block comprises: a second LReLU unit, an expanded convolution layer, a third LReLU unit, a fourth output convolution layer, and a second adding unit;

the second LReLU unit, the expanded convolution layer, the third LReLU unit, the fourth output convolution layer and the second adding unit are connected in sequence; the second LReLU unit is configured to activate a matrix input to the second LReLU unit to obtain a second activation matrix;
The expansion convolution layer is used for carrying out convolution calculation on the first activation matrix;
the third LReLU unit is configured to activate the calculation result of the expanded convolution layer to obtain a third activation matrix;
the fourth output convolution layer is used for carrying out convolution calculation on the third activation matrix;
and the second adding unit is used for adding and calculating the calculation result of the fourth output convolution layer and the matrix input to the second LReLu unit.
7. The method of claim 3, 4 or 6, wherein the initial parameters of the first input convolution layer, the first output convolution layer, the second output convolution layer, the third output convolution layer and the fourth output convolution layer are each randomly set by a convolution layer.
8. A method of speech synthesis, the method comprising:
acquiring acoustic features to be synthesized;
inputting the acoustic features to be synthesized into an amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum; the amplitude spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 7;
Inputting the acoustic features to be synthesized into a phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized; the phase spectrum predictor is constructed according to the method for constructing a vocoder according to any one of claims 1 to 7;
calculating according to the second amplitude spectrum and the second phase spectrum to obtain a second reconstructed short-time spectrum;
preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
and converting the second reconstructed voice waveform into synthesized voice corresponding to the acoustic feature to be synthesized.
9. The method of claim 8, wherein the preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed speech waveform corresponding to the acoustic feature to be synthesized comprises:
and performing inverse short-time Fourier transform on the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized.
10. A vocoder building apparatus, the apparatus comprising:
a first acquisition unit configured to acquire a target acoustic feature;
the first input unit is used for inputting the target acoustic features into an amplitude spectrum prediction model to obtain a first logarithm amplitude spectrum corresponding to the target acoustic features, wherein the first logarithm amplitude spectrum comprises a first amplitude spectrum;
The second input unit is used for inputting the target acoustic features into a phase spectrum prediction model to obtain a first phase spectrum corresponding to the target acoustic features;
the first calculation unit is used for calculating according to the first amplitude spectrum and the first phase spectrum to obtain a first reconstructed short-time spectrum;
the first preprocessing unit is used for preprocessing the first reconstructed short-time spectrum to obtain a first reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
a second calculation unit, configured to calculate an amplitude spectrum loss of the first logarithmic amplitude spectrum, a phase spectrum loss of the first phase spectrum, a short-time spectrum loss of the first reconstructed short-time spectrum, and a waveform loss of the first reconstructed voice waveform;
a third calculation unit, configured to calculate a correction parameter according to the amplitude spectrum loss, the phase spectrum loss, the short-time spectrum loss, and the waveform loss;
the first correction unit is used for correcting the amplitude spectrum prediction model according to the correction parameters so as to obtain the amplitude spectrum predictor;
and the second correction unit is used for correcting the phase spectrum prediction model according to the correction parameters so as to obtain the phase spectrum predictor.
11. A speech synthesis apparatus, the apparatus comprising:
The second acquisition unit is used for acquiring acoustic features to be synthesized;
the third input unit is used for inputting the acoustic features to be synthesized into a pre-constructed amplitude spectrum predictor so as to obtain a second logarithmic amplitude spectrum corresponding to the acoustic features to be synthesized, wherein the second logarithmic amplitude spectrum comprises a second amplitude spectrum;
the fourth input unit is used for inputting the acoustic features to be synthesized into a pre-constructed phase spectrum predictor so as to obtain a second phase spectrum corresponding to the acoustic features to be synthesized;
a fourth calculation unit, configured to calculate a second reconstructed short-time spectrum according to the second amplitude spectrum and the second phase spectrum;
the second preprocessing unit is used for preprocessing the second reconstructed short-time spectrum to obtain a second reconstructed voice waveform corresponding to the acoustic feature to be synthesized;
the first converting unit is used for converting the second reconstructed voice waveform into the synthesized voice corresponding to the acoustic feature to be synthesized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081092.XA CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081092.XA CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524894A true CN116524894A (en) | 2023-08-01 |
Family
ID=87403545
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310081092.XA Pending CN116524894A (en) | 2023-01-16 | 2023-01-16 | Vocoder construction method, voice synthesis method and related devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524894A (en) |
- 2023-01-16: CN application CN202310081092.XA filed; published as CN116524894A; status: Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |