CN110503128B - Spectrogram-to-waveform synthesis using a convolutional generative adversarial network - Google Patents
Spectrogram-to-waveform synthesis using a convolutional generative adversarial network
- Publication number
- CN110503128B CN110503128B CN201910419461.5A CN201910419461A CN110503128B CN 110503128 B CN110503128 B CN 110503128B CN 201910419461 A CN201910419461 A CN 201910419461A CN 110503128 B CN110503128 B CN 110503128B
- Authority
- CN
- China
- Prior art keywords
- loss
- waveform
- spectrogram
- neural network
- transposed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/12—Classification; Matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
For the problem of synthesizing waveforms from a spectrogram, embodiments of an efficient neural network architecture based on transposed convolution are presented herein to achieve high computational intensity and fast inference. In one or more embodiments, to train the convolutional vocoder architecture, a critic that identifies unrealistic waveforms is utilized, and training is guided using losses associated with perceptual audio quality together with a generative adversarial network (GAN) framework. Embodiments of the model can achieve audio synthesis more than 500 times faster than real time while producing high-quality audio. Also disclosed are multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from a spectrogram. Compared to commonly used iterative algorithms such as Griffin-Lim, MCNN embodiments can make significantly better use of modern multi-core processors and can produce very fast (more than 300 times faster than real time) waveform synthesis. Embodiments herein produce high-quality speech synthesis without any iterative algorithms or autoregression in the computation.
Description
Technical Field
The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to embodiments of an efficient neural network architecture that enables high computational intensity and fast inference.
Background
Deep neural networks have shown good performance on challenging research benchmarks, while driving the frontiers of numerous influential applications such as language translation, speech recognition, and speech synthesis.
Spectrograms are a commonly used representation. A spectrogram contains the intensity information of the time-varying spectrum of a waveform. The waveform-to-spectrogram conversion is fundamentally lossy because only the squared magnitude of the short-time Fourier transform (STFT) is retained. Spectrogram inversion has been studied extensively in the literature. However, no known algorithm can ensure a globally optimal solution with low computational complexity. The fundamental challenge is that the intensity constraint is non-convex when the phase is unknown.
One of the most popular techniques for spectrogram inversion is the Griffin-Lim (GL) algorithm. GL iteratively estimates the unknown phase by repeatedly transforming between the frequency domain and the time domain using the STFT and its inverse, substituting the magnitude of each frequency component with the predicted magnitude at each step. Although GL is attractive in its simplicity, the sequential nature of its operations can make it slow. Fast variants have been investigated that modify the update steps with terms that depend on the magnitudes of previous update steps. A single-pass spectrogram inversion (SPSI) algorithm has been explored, which is said to be able to synthesize a waveform in a single, fully deterministic pass and can be further improved with additional GL iterations. SPSI estimates the instantaneous frequency of each frame by peak picking and quadratic interpolation. Another non-iterative spectrogram inversion technique has been proposed, based on a direct relationship between the partial derivatives of the phase and the magnitude of the STFT; this relationship can be derived analytically under the assumption of a Gaussian window. Convex relaxation has been applied to spectrogram inversion to represent it as a semidefinite program, which ensures convergence at the cost of increased dimensionality. In general, one common drawback of these general-purpose spectrogram inversion techniques is that they have a fixed objective function, which makes them inflexible to adapt to a particular domain, such as human speech, in order to improve the perceptual quality of the output.
One common domain of use for spectrograms is audio. Autoregressive modeling of waveforms is a common approach, especially for audio. The most advanced results in generative speech modeling use neural networks that apply autoregression at the sample rate. However, these models present challenges for deployment because inference needs to be performed approximately 16k to 24k times per second. One approach is to approximate the autoregressive model with an efficient-inference model that can be trained by learning an inverse autoregressive flow using distillation. Recently, autoregressive neural networks have also been applied to spectrogram inversion, but autoregression at the sample rate can result in slow synthesis. The fundamental question is whether explicit autoregressive modeling is necessary for high-quality synthesis. Some generative models synthesize audio by applying autoregression at the rate of spectrogram time frames (e.g., hundreds of samples) without causing a significant degradation in audio quality.
Accordingly, there is a need for improved systems and methods for waveform synthesis from a spectrogram.
Disclosure of Invention
One aspect of the present application provides a computer-implemented method of training a neural network model for spectrogram inversion, comprising: inputting an input spectrogram comprising a plurality of frequency channels into a convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, wherein the transposed convolutional layers in the set are separated by non-linear operations, and wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set; outputting, from the convolutional neural network, a synthesized waveform for the input spectrogram, the input spectrogram having a corresponding ground-truth waveform; obtaining a loss for the convolutional neural network using the corresponding ground-truth waveform, the synthesized waveform, and a loss function, wherein the loss function includes at least one of a spectral convergence loss and a log-scale short-time Fourier transform (STFT) magnitude loss; and updating the convolutional neural network using the loss.
According to an embodiment of the present application, the convolutional neural network comprises: a plurality of heads, wherein each head receives the input spectrogram and comprises a set of transposed convolutional layers, in which the transposed convolutional layers are separated by non-linear operations and which reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set.
According to an embodiment of the present application, the heads of the plurality of heads are initialized with at least some different parameters to allow the convolutional neural network to focus on different portions of a waveform associated with the input spectrogram during training.
According to an embodiment of the application, each head of the plurality of heads generates a head output waveform from the input spectrogram, and the method further comprises: obtaining a combination of the head output waveforms, wherein the head output waveforms are combined in a weighted combination using a trainable weight value for each head output waveform.
According to an embodiment of the application, the method further comprises: applying a scaled softsign function to the weighted combination to obtain a final output waveform.
According to an embodiment of the application, the convolutional neural network further comprises a generative adversarial network, and the loss function further comprises a generative adversarial network loss component.
According to an embodiment of the application, the loss function further comprises one or more additional loss terms selected from the group consisting of instantaneous frequency loss, weighted phase loss, and waveform envelope loss.
According to an embodiment of the present application, the convolutional neural network is trained with a large-scale multi-speaker dataset to generate a trained convolutional neural network for synthesizing waveforms for speakers not included in the large-scale multi-speaker dataset, or is trained with a single-speaker dataset to generate a trained convolutional neural network for synthesizing waveforms for an individual speaker.
Another aspect of the application provides a computer-implemented method of generating a waveform from a spectrogram using a trained convolutional neural network, the computer-implemented method comprising: inputting an input spectrogram comprising a plurality of frequency channels into a trained convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, wherein the transposed convolutional layers in the set are separated by non-linear operations, wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set, and wherein the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.
According to an embodiment of the application, the trained convolutional neural network comprises: a plurality of heads, wherein each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolutional layers, wherein the transposed convolutional layers in the set are separated by non-linear operations, and wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set.
According to an embodiment of the present application, the output waveform is obtained by performing the steps of: combining the head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms, wherein the output waveform of each head is weighted using a trained weight for that head.
According to an embodiment of the application, the scaling function is a scaled softsign function.
According to an embodiment of the application, the trained convolutional neural network is trained using a loss function comprising at least one loss component selected from the group consisting of a spectral convergence loss, a log-scale short-time Fourier transform (STFT) magnitude loss, an instantaneous frequency loss, a weighted phase loss, and a waveform envelope loss.
According to an embodiment of the application, the trained convolutional neural network is trained using a generative adversarial network, and the loss function further includes a generative adversarial network loss component.
According to an embodiment of the application, the method further comprises obtaining the input spectrogram by converting a mel-spectrogram.
Yet another aspect of the application provides a non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by at least one processor, cause the following steps to be performed: inputting an input spectrogram comprising a plurality of frequency channels into a trained convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, wherein the transposed convolutional layers in the set are separated by non-linear operations, wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set, and wherein the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.
According to an embodiment of the application, the trained convolutional neural network comprises: a plurality of heads, wherein each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolutional layers, in which the transposed convolutional layers are separated by non-linear operations and which reduces the number of frequency channels of the input spectrogram to one channel after the last transposed convolutional layer in the set.
According to an embodiment of the present application, the output waveform is obtained by performing the steps of: combining the head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms, wherein the output waveform of each head is weighted using a trained weight for that head.
According to an embodiment of the application, the trained convolutional neural network is trained using a loss function, wherein the loss function includes at least one loss component selected from the group consisting of a spectral convergence loss, a log-scale short-time Fourier transform (STFT) magnitude loss, an instantaneous frequency loss, a weighted phase loss, a waveform envelope loss, and a generative adversarial network loss component.
According to an embodiment of the present application, the non-transitory computer-readable medium further comprises one or more sequences of instructions which, when executed by at least one processor, cause the following step to be performed: obtaining the input spectrogram by converting a mel-spectrogram.
Drawings
Reference will now be made to embodiments of the present disclosure, examples of which may be illustrated in the accompanying drawings. The drawings are intended to be illustrative, not restrictive. While the present disclosure is generally described in the context of these embodiments, it should be understood that the scope of the present disclosure is not intended to be limited to these particular embodiments. The items in the drawings may not be to scale.
FIG. 1 shows a graphical depiction of an architecture of a generative model according to an embodiment of the present disclosure.
Figure 2 graphically depicts a multi-headed convolutional neural network (MCNN) architecture for spectrogram inversion, in accordance with embodiments of the present disclosure.
Figure 3 depicts a general method for training a Convolutional Neural Network (CNN) that may be used to generate a synthetic waveform from an input spectrogram, in accordance with embodiments of the present disclosure.
Figure 4 depicts a general method of generating a synthetic waveform from an input spectrogram using a trained Convolutional Neural Network (CNN), in accordance with embodiments of the present disclosure.
Fig. 5 depicts an exemplary comparison of the waveform (entire utterance and two zoomed-in portions) and spectrogram of a ground-truth sample with the waveform (entire utterance and two zoomed-in portions) and spectrogram generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Figure 6 depicts synthesized waveforms for spectrogram inputs having constant frequencies of 4000 Hz, 2000 Hz, and 1000 Hz, and a synthesized waveform for a spectrogram input having superimposed 1000 Hz and 2000 Hz sine waves, according to an embodiment of the present disclosure.
Fig. 7 shows a visualization of guidance from an evaluator (critic) according to an embodiment of the present disclosure.
Fig. 8 shows a spectrogram of a generated sample without GAN (800) and with GAN training (805) according to an embodiment of the present disclosure.
Fig. 9 depicts a comparison of the ground-truth waveform (entire utterance and a zoomed-in portion) and its spectrogram (left) with the MCNN-generated waveform (entire utterance and a zoomed-in portion) and its spectrogram (right), according to an embodiment of the present disclosure.
Fig. 10 depicts the log-STFT of synthesized samples from an MCNN embodiment trained with only the SC loss (top) and from an MCNN embodiment trained with all losses (bottom), according to an embodiment of the present disclosure.
Fig. 11 depicts waveforms synthesized by an MCNN embodiment (trained on LibriSpeech) for spectrogram inputs corresponding to 500 Hz, 1000 Hz, and 2000 Hz sinusoids, and for a spectrogram input corresponding to a superposition of the 1000 Hz and 2000 Hz sinusoids, in accordance with embodiments of the present disclosure.
Fig. 12 shows the output of each head and the overall waveform according to an embodiment of the present disclosure. The top row shows an example synthesized waveform and its log-STFT, while the bottom 8 rows show the output waveforms of each of the included heads. For better visualization, the waveform of each head is normalized, and small-amplitude components in the STFT are discarded after applying a threshold.
Fig. 13 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Fig. 14 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Fig. 15 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Fig. 16 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Fig. 17 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
Fig. 18 depicts an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to an embodiment of the present disclosure.
FIG. 19 depicts a simplified block diagram of a computing device/information handling system according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. Furthermore, those skilled in the art will recognize that the embodiments of the present disclosure described below can be implemented in various ways (e.g., processes, apparatuses, systems, devices, or methods) on a tangible computer-readable medium.
The components or modules illustrated in the drawings are illustrative of exemplary embodiments of the disclosure and are intended to avoid obscuring the disclosure. It should also be understood that throughout this discussion, components may be described as separate functional units (which may include sub-units), but those skilled in the art will recognize that various components or portions thereof may be divided into separate components or may be integrated together (including being integrated within a single system or component). It should be noted that the functions or operations discussed herein may be implemented as components. The components may be implemented in software, hardware, or a combination thereof.
Furthermore, the connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, reformatted, or otherwise changed by intermediate components. In addition, additional connections or fewer connections may be used. It should also be noted that the terms "coupled," "connected," or "communicatively coupled" should be understood to include direct connections, indirect connections through one or more intermediate devices, and wireless connections.
Reference in the specification to "one embodiment," "a preferred embodiment," "an embodiment," or "embodiments" means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure, and may be included in more than one embodiment. Moreover, the appearances of the above-described phrases in various places in the specification are not necessarily all referring to the same embodiment or a plurality of the same embodiments.
Certain terminology is used in various places throughout this specification for the purpose of description and should not be construed as limiting. A service, function, or resource is not limited to a single service, single function, or single resource; the use of these terms may refer to a distributable or aggregatable grouping of related services, functions, or resources.
The terms "comprising," "including," "containing," and "containing" are to be construed as open-ended terms, and any listing thereafter is an example and not intended to be limiting on the listed items. Any headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. Each reference mentioned in this patent document is incorporated herein by reference in its entirety.
Furthermore, one skilled in the art will recognize that: (1) certain steps may optionally be performed; (2) The steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in a different order; and (4) certain steps may be performed simultaneously.
It should be noted that any experiments and results provided herein are provided by way of illustration and are performed under specific conditions using one or more specific embodiments; accordingly, neither these experiments nor their results should be used to limit the scope of the disclosure of this patent document.
A. Introduction
Deep neural networks have recently shown significant success in generative modeling across a number of applications, ranging from the synthesis of high-resolution, photo-realistic images to highly natural speech samples. Three long-term goals continue to be pursued to achieve their broad applicability: (i) matching the distribution of generated samples to the distribution of real samples; (ii) enabling high-level control by conditioning generation on certain inputs; and (iii) increasing the inference speed of generative models for hardware deployment. In this patent document, these goals are pursued for the application of synthesizing waveforms from a spectrogram. This problem is also referred to in the signal processing literature as spectrogram inversion.
An audio signal source (such as the human voice production system) may generate an audio signal in an autoregressive manner. This has also motivated autoregressive generation in audio modeling. In fact, the most advanced results in speech synthesis employ autoregressive generation at the sample rate. However, these models present deployment challenges because inference must be run 16,000 or 24,000 times per second. One approach to addressing this challenge is to approximate the autoregressive model with a model that is efficient at inference; this approach has been carried out by learning an inverse autoregressive flow model using distillation. The fundamental question is whether autoregression is necessary at all. For example, many generative models synthesize audio by applying autoregression at the rate of spectrogram time frames (which can be hundreds of samples) and can still obtain very high-quality audio (when combined with traditional spectrogram inversion techniques, as outlined in Section B.1).
Presented herein are embodiments of a deep neural network architecture that can efficiently synthesize waveforms from a spectrogram without performing any autoregressive computation. The goal is not necessarily to claim state-of-the-art neural speech quality or state-of-the-art spectrogram inversion performance in metrics such as spectral convergence, but rather to justify the premise of using a deep neural network for this basic signal processing problem. Since embodiments of the architecture are trainable, they may be integrated with any generative modeling application that outputs spectrograms, such as text-to-speech, audio style conversion, or speech enhancement.
The proposed convolutional waveform synthesis network embodiments may be trained using a combination of audio reconstruction losses and, in embodiments, may be trained within a generative adversarial network (GAN) framework. From the viewpoint of the GAN literature, their applicability to generating high-quality audio samples is demonstrated. Previous work on GANs has demonstrated good results in high-quality image synthesis. For audio synthesis, such demonstrations have so far been limited to unconditional speech synthesis, or to low-quality speech synthesis from small-scale domains (such as spoken digits) or speech enhancement, where the input already contains rich information. One of the purposes of some embodiments herein is also to advance the state of the art in conditional audio synthesis using GANs.
B. Background of the invention
1. Spectrogram inversion
A spectrogram contains the intensity information of the time-varying spectrum of a waveform, given by the squared magnitude of its short-time Fourier transform (STFT). The STFT is typically applied after windowing the waveform, commonly with a Hanning, Hamming, or Gaussian window. The conversion of a waveform into a spectrogram is fundamentally lossy due to the magnitude-squaring operation. Spectrogram inversion has been extensively studied in the signal processing literature. However, no known algorithm ensures a globally optimal solution with low (i.e., less than non-deterministic polynomial-time-hard) computational complexity. The fundamental challenge is that the intensity constraint is non-convex when the phase in the frequency domain is unknown.
The most popular technique for spectrogram inversion is the Griffin-Lim (GL) algorithm. GL iteratively estimates the unknown phases by repeatedly converting between the frequency domain and the time domain using the STFT and its inverse, substituting the magnitude of each frequency component with the predicted magnitude at each step. Although GL is attractive in its simplicity, its well-known drawbacks are that it is slow (usually requiring many sequential iterations), inflexible (its objective function is not directly related to goals such as perceptual audio quality), and fixed (it has no trainable parameters with which to improve performance on particular domains such as human speech). To address these problems, variants of GL have been studied. A fast GL has been proposed that modifies the update steps using terms that depend on the magnitudes of previous update steps. A single-pass spectrogram inversion (SPSI) algorithm has also been proposed that is capable of synthesizing a waveform in a single, fully deterministic pass and can be further enhanced with additional GL iterations. SPSI estimates the instantaneous frequency of each frame by peak picking and quadratic interpolation. Another non-iterative spectrogram inversion technique has been proposed, based on a direct relationship between the partial derivatives of the phase and the magnitude of the STFT for a Gaussian window. This technique assumes a Gaussian window for the analytical derivation, but other windows can be approximated (with some penalty). In another approach, convex relaxation is applied to the spectrogram inversion problem to represent it as a semidefinite program. Convex relaxation has been shown to ensure convergence at the expense of increasing the problem dimensionality.
Recently, deep neural networks have been applied to spectrogram inversion. In one or more embodiments, a variant of the WaveNet architecture is used for spectrogram inversion. Embodiments of this architecture perform autoregression at the sample rate and may be implemented using stacked dilated convolution layers. In one or more embodiments, a spectrogram frame is presented as an external conditioner at each sample.
Some applications use spectrograms with a non-linear frequency scale, such as mel-spectrograms. In one or more embodiments herein, the emphasis is on performing spectrogram inversion with a linear frequency scale in order to address the more general problem. A neural network may convert a mel-spectrogram to a linear-scale spectrogram using a simple architecture such as a fully connected layer.
2. Deep neural network for generating audio applications
Recently, deep neural networks have shown excellent results in generative audio applications. The primary generative audio application is text-to-speech. For example, the WaveNet architecture has been proposed, which synthesizes speech autoregressively, conditioned on linguistic features. WaveNet has been approximated by a parallelizable architecture that learns by distillation. On the other hand, many successful text-to-speech methods are combined with spectrogram inversion techniques and are based on separating the problem into converting text to a spectrogram (or text to a mel-spectrogram). A sequence-to-sequence model with an attention mechanism generates mel-spectrograms autoregressively at the spectrogram frame rate; spectrogram inversion can then be applied using the GL algorithm or WaveNet. There are also applications that directly convert one or more input audio samples. One application is speech enhancement, which improves the quality of distorted or noisy speech. A common approach is to process the raw waveform or the spectrogram using a GAN. Another application is the conversion of speech style or speaker. In some cases, this transformation is applied, after initializing the input with the content audio, by optimizing a style loss defined between the waveforms or spectrograms of the input and style audio. Speaker identity modification is performed using generative model adaptation or directly using neural network speaker encodings. In general, many deep neural network models used for generative audio modeling output spectrograms, and these models synthesize waveforms using a spectrogram inversion technique such as the GL algorithm. Predicting a smooth and structured representation is an effective approach for neural network training, since a rapidly fluctuating phase (whether in the time or frequency domain) makes neural network training more difficult and produces worse results.
C. Convolutional waveform synthesis embodiments
Assume that the input spectrogram |STFT(s)|^2 of a waveform s has dimensions T_spec × F_spec and that the corresponding waveform has length T_wave. The ratio T_wave/T_spec may be determined by the spectrogram parameters, hop length and window length. These parameters are assumed to be known a priori.
To generate a waveform from the spectrogram, the neural network should perform non-linear upsampling in the time domain while using the spectral information. Typically, the window length is much longer than the hop length, so processing the additional information in adjacent time frames may be important. To obtain fast inference, a neural network architecture that can achieve high computational intensity is needed. Computational intensity may be defined as the average number of operations per data access. Modern multi-core processors are well suited to models with high computational intensity, which can be achieved by repeatedly applying computations with the same kernels.
1. Single-headed convolutional neural network implementation
For this purpose, the emphasis is on a vocoder architecture embodiment that includes L transposed convolutional layers, as shown in FIG. 1. FIG. 1 shows a graphical depiction of an architecture of a generative model according to an embodiment of the present disclosure.
In one or more embodiments, each transposed convolutional layer 110-x comprises a one-dimensional (1-D) convolution in time and is followed by a softsign non-linearity (not shown). The softsign non-linearity has lower implementation complexity and more gradient flow than other saturating non-linearities. For the l-th layer, w_l is the filter width, s_l is the stride, and c_l is the number of output filters (channels). The stride in the convolution determines the amount of upsampling in time, and the strides may be selected such that their product over all layers equals the upsampling ratio T_wave/T_spec. The filter width controls the amount of local neighborhood information used in upsampling. The number of filters determines the number of frequency channels in the processed representation, and the number of filters may be gradually reduced to 1 to produce the time-domain waveform 120. A trainable scalar may be used at the last layer to match the scale of the inverse STFT operation.
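As an illustration of the architecture just described, the following is a minimal sketch of a single-head transposed-convolution vocoder, written in Python with PyTorch. It is a hedged example only: the filter widths, strides, channel counts, and input channel count are placeholder assumptions, not parameters taken from the disclosed embodiments.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvVocoderHead(nn.Module):
        def __init__(self, in_channels=513,
                     widths=(13, 13, 13, 13),      # w_l: filter widths
                     strides=(5, 5, 4, 4),         # s_l: product = T_wave / T_spec
                     channels=(256, 128, 64, 1),   # c_l: gradually reduced to 1
                     nonlinearity=F.softsign):
            super().__init__()
            layers, c_prev = [], in_channels
            for w, s, c in zip(widths, strides, channels):
                # padding/output_padding chosen so each layer upsamples time
                # exactly by its stride s
                p = (w - s + 1) // 2
                op = 2 * p - (w - s)
                layers.append(nn.ConvTranspose1d(c_prev, c, kernel_size=w,
                                                 stride=s, padding=p,
                                                 output_padding=op))
                c_prev = c
            self.layers = nn.ModuleList(layers)
            self.nonlinearity = nonlinearity
            # trainable scalar to match the scale of the inverse STFT operation
            self.scale = nn.Parameter(torch.ones(1))

        def forward(self, spec):                   # spec: (batch, F_spec, T_spec)
            x = spec
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i < len(self.layers) - 1:       # non-linearity between layers
                    x = self.nonlinearity(x)
            return self.scale * x.squeeze(1)       # waveform: (batch, T_wave)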
2. Multi-headed convolutional neural network (MCNN) implementation
It should be noted that two or more heads may be combined into a multi-head convolutional neural network (MCNN) architecture. In one or more embodiments, the MCNN includes multiple heads that use the same types of layers but with different weights and initializations, and the multiple heads learn cooperatively as a form of ensemble learning. Using multiple heads allows the model to assign different upsampling kernels to different components of the waveform, which is further analyzed in Appendix B.
Figure 2 graphically depicts a multi-head convolutional neural network (MCNN) architecture 200 for spectrogram inversion, in accordance with embodiments of the present disclosure. In one or more embodiments, each head may be configured as described above or as depicted in FIG. 2, which shows an exploded view of an example Head i 202-i. As shown in FIG. 2, each head, like the head depicted in FIG. 1, includes L transposed convolutional layers 210-x. Each transposed convolutional layer may include a one-dimensional convolution operation in time followed by an exponential linear unit (ELU) 212-x. It should be noted that other non-linearities, such as ReLU and softsign, may be used, but empirically the ELU produced superior audio quality. Similar to the single-head embodiment, for the l-th layer, w_l is the filter width, s_l is the stride, and c_l is the number of output filters (channels); the stride in the convolution determines the amount of upsampling in time, and the strides may be selected such that their product over all layers equals the upsampling ratio T_wave/T_spec. The filter width controls the amount of local neighborhood information used in upsampling. The number of filters determines the number of frequency channels in the processed representation, and the number of filters may be gradually reduced to 1 to produce a time-domain waveform.
In one or more embodiments, the MCNN may receive a spectrogram input 205 of arbitrary duration, because the convolution filters along the channel dimensions are shared across different time steps.
In one or more embodiments, the trainable scalar 204-x is multiplied by the output of each head 202-x to match the overall scale of the inverse STFT operation and to determine the relative weights of the different heads.
Finally, in one or more embodiments, the outputs of all the heads are summed 208 and passed through a scaled softsign non-linearity 209 (e.g., f(x) = a·x/(1 + |b·x|), where a and b are trainable scalars) to limit the output waveform 220.
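A hedged sketch of the multi-head combination follows, reusing the ConvVocoderHead module sketched in Section C.1 (here with an ELU between layers, as in the MCNN heads). The head count and the per-head hyperparameters are illustrative assumptions, and each head's own trainable output scalar serves as its relative weight.

    class MCNN(nn.Module):
        def __init__(self, num_heads=8, **head_kwargs):
            super().__init__()
            self.heads = nn.ModuleList(
                [ConvVocoderHead(nonlinearity=F.elu, **head_kwargs)
                 for _ in range(num_heads)])
            # trainable scalars a, b of the scaled softsign output non-linearity
            self.a = nn.Parameter(torch.ones(1))
            self.b = nn.Parameter(torch.ones(1))

        def forward(self, spec):                   # spec: (batch, F_spec, T_spec)
            x = torch.stack([head(spec) for head in self.heads], dim=0).sum(dim=0)
            # scaled softsign f(x) = a * x / (1 + |b * x|) limits the output
            return self.a * x / (1 + torch.abs(self.b * x))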
D. Audio reconstruction losses
The fundamental challenge of generative modeling is the selection of a loss function that is highly correlated with the perceptual quality of the output. In one or more embodiments, a linear combination of one or more of the following loss terms between the estimated signal ŝ and the ground-truth signal s may be used:
(i) Spectral convergence (SC) loss:
In one or more embodiments, the spectral convergence loss may be formulated as:
L_SC(s, ŝ) = || |STFT(s)| − |STFT(ŝ)| ||_F / || |STFT(s)| ||_F        (1)
where ||·||_F is the Frobenius norm over time and frequency. The spectral convergence loss emphasizes large spectral components, which can be particularly important in the early stages of training.
(ii) Log-scale STFT magnitude loss:
In one or more embodiments, the log-scale STFT magnitude loss may be formulated as:
L_mag(s, ŝ) = || M(f) ⊙ (log(|STFT(s)| + ε) − log(|STFT(ŝ)| + ε)) ||_1        (2a)
where ||·||_1 is the L1 norm and ε is a small value. M(f) is a band-pass filter used to focus on the frequencies important to human hearing (it may simply be assumed to be an ideal band-pass filter with a [300, 3500] Hz passband, with its amplitude selected empirically). In contrast to the spectral convergence loss, the goal of the log-scale STFT magnitude loss is to accurately fit the fine details given by small magnitudes, which tends to be more important in the later stages of training.
In one or more embodiments, the log-scale STFT magnitude loss may also be formulated, without the band-pass weighting, as:
L_mag(s, ŝ) = || log(|STFT(s)| + ε) − log(|STFT(ŝ)| + ε) ||_1        (2b)
(iii) Instantaneous frequency loss:
In one or more embodiments, the instantaneous frequency loss may be formulated as:
L_IF(s, ŝ) = || ∂φ(STFT(s))/∂t − ∂φ(STFT(ŝ))/∂t ||_1        (3)
where φ(·) is the phase argument function, and the time derivative ∂/∂t is estimated using finite differences. Spectral phases are highly unstructured in both the time and frequency domains, so fitting the raw phase values is very challenging and does not improve training. In contrast, the instantaneous frequency is a smoother phase-related metric that can be fitted more accurately during training.
(iv) Weighted phase loss:
In one or more embodiments, the weighted phase loss may be formulated as:
L_WP(s, ŝ) = || 1 − cos(φ(STFT(s)) − φ(STFT(ŝ))) ||_1        (4a)
Alternatively, in one or more embodiments, the weighted phase loss may be formulated as:
L_WP(s, ŝ) = || |STFT(s)| ⊙ |STFT(ŝ)| − ℜ(STFT(s)) ⊙ ℜ(STFT(ŝ)) − ℑ(STFT(s)) ⊙ ℑ(STFT(ŝ)) ||_1        (4b)
where ℜ(·) is the real part, ℑ(·) is the imaginary part, and ⊙ is the element-by-element product.
When a circular normal distribution is assumed for the phase, the log-likelihood is proportional to cos(φ(STFT(s)) − φ(STFT(ŝ))). Accordingly, the loss can be defined as 1 − cos(φ(STFT(s)) − φ(STFT(ŝ))), which is minimized when the phases match. To pay more attention to high-magnitude components and for better numerical stability, this term can be further modified by scaling it with |STFT(s)| ⊙ |STFT(ŝ)| before taking the L1 norm, which yields equation 4b. In one or more embodiments, this loss is minimized when the STFT phases of the ground-truth and generated waveforms match.
(v) Waveform envelope loss:
In one or more embodiments, the waveform envelope loss may be formulated as:
L_env(s, ŝ) = Σ_{n=1}^{N} || |s| ∗ g_n − |ŝ| ∗ g_n ||_1        (5)
where ∗ is the convolution operator and g_n is the impulse response of a low-pass filter. N Gaussian filters are considered (with cut-off frequencies chosen so that averaging occurs over tens to hundreds of samples). The envelope of a waveform shapes the perceived character of a sound and how it is understood. In Equation 5, one of the simplest waveform envelope extraction methods (rectification followed by low-pass filtering) is used to match the envelope of the ground-truth waveform with that of the generated waveform.
As described above, in one or more embodiments, a combination of one or more of the above loss terms between the estimated signal ŝ and the ground-truth signal s may be used. In one or more embodiments, the combination may be a linear combination, including a weighted linear combination.
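For illustration, a hedged sketch of two of the above loss terms (spectral convergence and the log-scale STFT magnitude loss without the band-pass weighting) is given below, assuming PyTorch. The FFT size, hop length, and epsilon are illustrative; the remaining loss terms can be implemented analogously from the STFT.

    def stft_magnitude(x, n_fft=1024, hop_length=256):
        window = torch.hann_window(n_fft, device=x.device)
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return spec.abs()                          # (batch, freq, frames)

    def spectral_convergence_loss(s_true, s_gen):
        m_true, m_gen = stft_magnitude(s_true), stft_magnitude(s_gen)
        # norm of the magnitude difference, normalized by the target's norm
        return torch.linalg.norm(m_true - m_gen) / torch.linalg.norm(m_true)

    def log_stft_magnitude_loss(s_true, s_gen, eps=1e-5):
        m_true, m_gen = stft_magnitude(s_true), stft_magnitude(s_gen)
        # L1 distance between log magnitudes (here averaged over elements)
        return torch.mean(torch.abs(torch.log(m_true + eps)
                                    - torch.log(m_gen + eps)))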
E. Generative adversarial network (GAN) framework embodiments
Distance-based metrics learn conditional expectations or other central statistics, but marginalizing over the underlying phase structure may produce unrealistic samples. In one or more embodiments, a GAN framework may be used to encourage the distribution of the generated audio to match the true distribution. Many early variants of GANs are challenging to train and are sensitive to optimization and architecture choices. In one or more embodiments herein, the Cramer GAN is used for its general robustness. Furthermore, the Cramer GAN does not suffer from biased gradients, and training converges fairly well even with only one evaluator (critic) update per generator step (unlike similar alternatives such as the Wasserstein GAN).
In one or more embodiments, the original Cramer GAN algorithm is used as is. The fundamental differences are the type of input and the corresponding evaluator (critic) architecture required. For temporal processing, the evaluator slides over the input waveform and emits transformed features. The final scalar score is calculated by taking the time average of the features and measuring the energy distance between the pooled vectors obtained from the ground-truth and predicted waveforms. Embodiments do not use previous vectors, as these were found not to improve the perceptual quality of the generated audio. For the evaluator architecture, one or more embodiments use a stack of one-dimensional convolutions, layer normalization, and leaky rectified linear units (leaky ReLUs). The one-dimensional convolution in layer l has filter width w_l, stride s_l, and c_l output filters (channels). The evaluator architecture can be considered a mirror of the upsampling operation in the waveform synthesis. Strides and filter sizes different from those of the vocoder are used so that the receptive fields are not exactly aligned. The gradient penalty and the generation of multiple samples in the Cramer GAN add considerable computational and memory overhead during training, but this does not affect inference efficiency.
It should be noted that although the Cramer GAN is used in one or more embodiments, other GAN variants may be employed.
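A hedged sketch of a waveform evaluator (critic) in the spirit of the architecture described above is given below: a stack of one-dimensional convolutions, layer normalization, and leaky ReLUs, followed by time averaging. The widths, strides, and channel counts are illustrative assumptions chosen to differ from the vocoder's so that the receptive fields do not align exactly; the Cramer GAN energy-distance score would be computed from the pooled feature vectors.

    class WaveformCritic(nn.Module):
        def __init__(self, widths=(15, 15, 9, 9), strides=(4, 4, 4, 4),
                     channels=(64, 128, 256, 256)):
            super().__init__()
            blocks, c_prev = [], 1
            for w, s, c in zip(widths, strides, channels):
                blocks += [nn.Conv1d(c_prev, c, kernel_size=w, stride=s,
                                     padding=w // 2),
                           nn.GroupNorm(1, c),     # layer norm over channels
                           nn.LeakyReLU(0.2)]
                c_prev = c
            self.net = nn.Sequential(*blocks)

        def forward(self, wave):                   # wave: (batch, T_wave)
            feats = self.net(wave.unsqueeze(1))    # slide over the waveform
            return feats.mean(dim=-1)              # time-averaged feature vector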
F. General method embodiments
Figure 3 depicts a general method for training a convolutional neural network (CNN) that may be used to generate a synthesized waveform from an input spectrogram, according to embodiments of the present disclosure. In one or more embodiments, a method for spectrogram inversion includes inputting (305) an input spectrogram comprising a plurality of frequency channels into a convolutional neural network (CNN) comprising at least one head. It should be noted that the input spectrogram may initially be a mel-spectrogram that is converted to a linear-scale spectrogram. As described above, in one or more embodiments, a head includes a set of transposed convolutional layers in which the transposed convolutional layers are separated by non-linear operations, and the set of transposed convolutional layers reduces the number of frequency channels of the spectrogram to one channel after the last transposed convolutional layer.
If the CNN system includes only a single head, its output is used. If the CNN system has more than one head, the outputs from the heads are combined (310). In one or more embodiments, a weighted combination of the head outputs may be obtained by weighting each head output by a value, which may itself be trainable.
Whether it is a single output or a weighted combined output, a scaling function (e.g., a scaled softsign) may be applied (315) to the output to obtain a final output waveform. The final output waveform and its corresponding ground-truth waveform are used to calculate the loss. In one or more embodiments, a combined loss function may be utilized (320) that includes one or more loss components selected from spectral convergence loss, log-scale STFT amplitude loss, instantaneous frequency loss, weighted phase loss, and waveform envelope loss. In one or more embodiments, a loss from a GAN may also be included. Finally, the CNN is updated (325) by back propagation using the loss.
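A minimal sketch of two of the listed loss components and their weighted combination is shown below (assuming PyTorch, a magnitude STFT with placeholder parameters, and illustrative weights; the instantaneous-frequency, weighted-phase, envelope, and GAN terms are omitted):

```python
import torch

def stft_mag(x, n_fft=2048, hop=256, win=1024):
    # Magnitude STFT; the transform parameters here are placeholders.
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs()

def spectral_convergence(s_hat, s):
    S, S_hat = stft_mag(s), stft_mag(s_hat)
    return torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")

def log_stft_magnitude(s_hat, s, eps=1e-6):
    S, S_hat = stft_mag(s), stft_mag(s_hat)
    return (torch.log(S + eps) - torch.log(S_hat + eps)).abs().mean()

def combined_loss(s_hat, s, w_sc=1.0, w_mag=1.0):
    # Weighted combination; further terms (instantaneous frequency, weighted
    # phase, waveform envelope, GAN loss) would be added analogously.
    return w_sc * spectral_convergence(s_hat, s) + w_mag * log_stft_magnitude(s_hat, s)
```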
For the MCNN implementation, training may include initializing multiple heads with at least some different parameters to allow the CNN system to focus on different aspects of the spectrogram.
It should be noted that the training data set may be a large-scale multi-speaker dataset, and the trained CNN may then be applied to speakers not included in the training data (non-preset or "unseen" speakers) to generate waveforms. Alternatively, the CNN may be trained using a dataset of a single speaker.
Figure 4 depicts a general method of generating a synthetic waveform from an input spectrogram using a trained Convolutional Neural Network (CNN) according to an embodiment of the present disclosure. In one or more embodiments, an input spectrogram comprising a plurality of frequency channels is input (405) into a Convolutional Neural Network (CNN) comprising at least one head. It should be noted that the input spectrogram may initially be a mel-frequency spectrogram which is converted into a spectrogram.
If the CNN system includes only a single head, its output is obtained. If the CNN system has more than one head, the outputs from the heads are combined (410). In one or more embodiments, a weighted combination of the head outputs may be obtained by weighting the output of each head by a value.
In one or more embodiments, whether it is a single output or a weighted combined output, a scaling function (e.g., a scaled softsign) may be applied (415) to the output to obtain a final output waveform.
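For illustration, one plausible form of a scaled softsign with trainable scale parameters a and b is sketched below; this parameterization is an assumption and is not taken verbatim from the embodiment:

```python
import torch
import torch.nn as nn

class ScaledSoftsign(nn.Module):
    """f(x) = a * x / (1 + |b * x|), with a and b trainable scalars."""

    def __init__(self, a: float = 1.0, b: float = 1.0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(a))
        self.b = nn.Parameter(torch.tensor(b))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.a * x / (1.0 + torch.abs(self.b * x))
```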
G. Results of the experiment
It is noted that these experiments and results are provided by way of illustration and are performed under particular conditions using one or more embodiments; therefore, neither these experiments nor their results should be used to limit the scope of the disclosure of this patent document.
1. Single headed CNN implementation
a) Experimental setup
Training was performed using the LibriSpeech dataset (V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference, pages 5206-5210, IEEE 2015). The LibriSpeech dataset is an automatic speech recognition dataset that includes 960 hours of public domain audio books from 2484 speakers (sampled at 16 kHz) and has lower audio quality than typical speech synthesis datasets. In an embodiment, a preprocessing pipeline including segmentation and noise reduction is used that is the same as or similar to the preprocessing pipeline in W. Ping, K. Peng, A. Gibiansky, A. Kannan, S. Narang, J. Raiman and J. Miller, "Deep Voice 3: 2000-Speaker Neural Text-to-Speech," available from arXiv:1710.07654, 2017, which is also disclosed in the following patent documents: commonly assigned U.S. patent application Ser. No. 16/058,265 (docket No. 28888-2175), entitled "SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING," filed August 8, 2018; and U.S. provisional patent application No. 62/574,382 (docket No. 28888-2175P), entitled "SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING," filed October 19, 2017, listing Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller as inventors (for convenience, its disclosure may be generally referred to as "Deep Voice 3" or "DV3"). Each of the above documents is incorporated herein by reference in its entirety.
The following spectrogram parameters were assumed: a hop length of 400, a window length of 1600, and a fast Fourier transform (FFT) size of 4096. These are typical values for audio applications. The vocoder structure has 6 transposed convolution layers, where (s_1, w_1, c_1) = (5, 9, 1024), (s_2, w_2, c_2) = (2, 3, 1024), (s_3, w_3, c_3) = (2, 3, 256), (s_4, w_4, c_4) = (5, 9, 128), (s_5, w_5, c_5) = (2, 3, 32), and (s_6, w_6, c_6) = (2, 3, 1). The evaluator architecture has 11 convolutional layers, where (s_1, w_1, c_1) = (5, 2, 32), (s_2, w_2, c_2) = (5, 2, 64), (s_3, w_3, c_3) = (5, 2, 64), (s_4, w_4, c_4) = (5, 2, 128), (s_5, w_5, c_5) = (5, 2, 128), (s_6, w_6, c_6) = (3, 2, 256), (s_7, w_7, c_7) = (3, 2, 256), (s_8, w_8, c_8) = (3, 2, 512), (s_9, w_9, c_9) = (2, 2, 512), (s_10, w_10, c_10) = (2, 2, 1024), and (s_11, w_11, c_11) = (1, 2, 1024). The coefficients of the loss terms proposed in subsection D are chosen to be 1, 5, 10, 1 and 500, respectively (audio quality is optimized by a random grid search; the gradients of the different losses then contribute at a similar order of magnitude), and the coefficient for the Cramer GAN loss is chosen to be 100. The tested implementation was trained by applying successive update steps to the vocoder and the evaluator using the Adam optimizer. For the vocoder, the learning rate is 0.0006, annealed at a rate of 0.95 every 5000 iterations. For the evaluator, the learning rate is 0.0003. Both the vocoder and the evaluator use a batch size of 16. The model was trained for approximately 800k iterations.
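The vocoder layer schedule above can be captured as a simple configuration; the plain-Python sketch below only records the (stride, width, channels) tuples and checks that the strides multiply out to the hop length of 400, i.e., that each spectrogram frame is upsampled to 400 waveform samples:

```python
from math import prod

# (stride, filter_width, output_channels) for the six transposed convolution layers.
vocoder_layers = [(5, 9, 1024), (2, 3, 1024), (2, 3, 256),
                  (5, 9, 128), (2, 3, 32), (2, 3, 1)]

hop_length = 400
assert prod(s for s, _, _ in vocoder_layers) == hop_length  # 5*2*2*5*2*2 = 400
assert vocoder_layers[-1][2] == 1                           # single-channel waveform output
```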
These results were compared to the following spectrogram inversion methods:
(i) GL: standard implementations of "Signal estimation from modified short-time Fourier transform" proposed in IEEE Transactions on optics, speech, and Signal Processing,32 (2): 236-243 by D.Griffin and J.Lim, 4.1984 were used in 3 iterations and 50 iterations.
(ii) SPSI: an embodiment of "Single pass spectrum inversion" from g.t. beauregard, m.harish and l.wyse was used, presented in 2015IEEE International Conference on Digital Signal Processing (DSP), pages 427-431, at 7 months 2015. In addition to the general single-pass spectrogram inversion (SPSI), further improvement with 3 and 50 additional GL iterations was considered.
(iii) WaveNet: the WaveNet architecture from WaveNet was used on the librispech dataset without speaker embedding"Deep Voice 2: 15/974,397, entitled "SYSTEM AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH (SYSTEM AND METHOD FOR MULTI-SPEAKER NEURtext TO SPEECH)" filed 8/2018(docket number: 28888-2144); AND U.S. provisional patent application No. 62/508,579 (docket No. 28888-2144P), entitled "SYSTEMS AND METHODS FOR MULTI-talker NEURAL TEXT-TO-SPEECH," filed on 19/5/2017, each of which is incorporated herein by reference in its entirety (FOR convenience, its disclosure may be generally referred TO as "Deep Voice 2" or "DV 2"). The WaveNet architecture conditions the spectrogram to predict the waveform. The same hyperparameter was retained except that the number of layers was increased to 40 and trained to converge (approximately 1.3M iterations).
b) Synthesized audio waveform quality
The resulting audio waveform is illustrated in fig. 5. Fig. 5 depicts an exemplary comparison of the waveform (entire utterance and two zoomed-in portions) and spectrogram of a ground-truth sample 505 with a synthesized version 510 of the waveform and spectrogram generated using a convolutional GAN vocoder. It is observed that complex waveform patterns can be fitted and that there is only a small phase error between the correlated high-amplitude spectral components (low offset between peaks). Appendix C shows further examples.
Mean opinion score (MOS), spectral convergence, and speaker classification accuracy (to measure speaker distinguishability after synthesis) were used to evaluate the synthesis quality on held-out LibriSpeech samples (see Table 1) and on unseen speakers (see Table 2).
For MOS, human ratings are collected independently for each assessment through the Amazon Mechanical Turk framework, and multiple votes on the same sample are aggregated by a majority voting rule.
The speaker classifier model implementation used for classification accuracy comes from J. Chen, K. Peng, W. Ping and Y. Zhou, "Neural Voice Cloning with a Few Samples," available at arXiv:1802.06006, 2018, which is also disclosed in the following patent documents: U.S. patent application Ser. No. 16/143,330 (docket No. 28888-2201), entitled "SYSTEMS AND METHODS FOR NEURAL VOICE CLONING WITH A FEW SAMPLES," filed September 26, 2018; and U.S. provisional patent application No. 62/628,736 (docket No. 28888-2201P), entitled "NEURAL VOICE CLONING WITH A FEW SAMPLES," filed February 9, 2018, listing Jitong Chen, Kainan Peng and Wei Ping as inventors. Each of the above documents is incorporated herein by reference in its entirety.
For held-out LibriSpeech samples, the convolutional vocoder implementation is on par with a large number of GL iterations in terms of naturalness and spectral convergence. WaveNet performs poorly when trained on LibriSpeech and generalizes poorly to unseen speakers. It was verified that the WaveNet implementation can indeed generate very high quality audio for datasets with higher audio quality and fewer speakers. The low score of WaveNet may be related to the hyperparameters not being re-optimized for the LibriSpeech dataset, or it may be due to its sensitivity to the low audio quality of the training samples. The GAN loss improves audio quality by minimizing background artifacts, resulting in a slight improvement in naturalness. For unseen speakers, the results are very similar, albeit with a small penalty (which may be due to slight overfitting of the evaluator to the training distribution), and still demonstrate the generalization capability of the model implementation.
Table 1: mean Opinion Score (MOS) with 95% confidence interval, spectral convergence, and speaker classification accuracy for libris spech test samples. WaveNet does not show spectral convergence since the synthesized waveform does not match the initial silence filling.
Table 2: mean Opinion Score (MOS) with 95% confidence interval and spectral convergence for non-pre-set speakers (from internal data set). WaveNet does not show spectral convergence since the synthesized waveform does not match the initial silence filling.
c) Frequency-based representation learning
Convolutional vocoder model implementations are trained using only human speech, which consists of multi-frequency, time-varying signals. Interestingly, however, as shown in FIG. 6, the model implementation learns a Fourier basis representation over the spectral range of human speech. Fig. 6 depicts synthesized waveforms for spectrogram inputs having constant frequencies of 4000 Hz, 2000 Hz, and 1000 Hz, and the synthesized waveform for a spectrogram input of superimposed 1000 Hz and 2000 Hz sine waves, according to an embodiment of the present disclosure. When a spectrogram having constant frequencies is input, sinusoidal waveforms at these frequencies are synthesized. Furthermore, when the input spectrogram includes a small number of frequency bands, the synthesized waveform is a superposition of pure sine waves at the constituent frequencies.
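Such a probe can be constructed by feeding the magnitude spectrogram of a pure tone to the trained model; a hedged sketch follows, assuming librosa, the spectrogram parameters used above, and a hypothetical `vocoder` callable standing in for the trained network:

```python
import numpy as np
import librosa

sr, dur, freq = 16000, 1.0, 1000.0          # 16 kHz audio, 1 s, 1 kHz tone
t = np.arange(int(sr * dur)) / sr
tone = np.sin(2 * np.pi * freq * t).astype(np.float32)

# Magnitude spectrogram with the parameters assumed in the experiments above.
spec = np.abs(librosa.stft(tone, n_fft=4096, hop_length=400, win_length=1600))

# `vocoder` is a placeholder for the trained convolutional vocoder; feeding this
# constant-frequency spectrogram is expected to yield a sinusoid near 1000 Hz.
# waveform = vocoder(spec)
```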
d) Evaluator guidance analysis
To understand what the evaluator has learned and how it affects the vocoder, the gradient of the Cramer distance with respect to the generated audio of the trained model can be visualized. This gradient is overlaid on the resulting spectrogram and the residual spectrogram in fig. 7 to show the audio regions to which the evaluator is most sensitive. Fig. 7 illustrates a visualization of the guidance from the evaluator in accordance with an embodiment of the present disclosure. Image 700 shows the gradient over the generated audio, and image 705 shows the gradient over the log spectral difference. Evaluator gradients are calculated with respect to the generated audio and are shown color-coded on a logarithmic scale. The luminance shows the log-scale frequency content of the signal, with the gradient superimposed on it. The gradient matches the residual in the human voicing range, serving in part as a distance metric. The low-frequency gradient is several orders of magnitude smaller than that of the higher-frequency content, but it is assigned to correct lower-frequency distortion. The high-frequency gradient does not have a high-level structure that helps reduce errors, but its periodic peaks align with the overall stride over time. In fact, without the GAN objective, as shown in fig. 8, the spectrogram of the generated audio is more affected by the checkerboard artifacts typically produced by transposed convolutions and by constant-frequency noise at high frequencies. Fig. 8 shows spectrograms of a generated sample without GAN training (800) and with GAN training (805) according to an embodiment of the present disclosure.
e) Deployment considerations
The inference complexity and computational intensity (based on the assumptions presented in Appendix A) and the inference latency were benchmarked on an Nvidia Tesla P100 GPU. A TensorFlow operation-level implementation without specific kernel optimization is considered. It should be noted that further improvements can be obtained with further optimization. For a fair comparison, an implementation of GL using TensorFlow FFT/inverse-FFT operations on the GPU is considered.
The computational complexity of the convolutional GAN vocoder implementation is about 4 GFLOPs/sec. It achieves about 8.2M samples/sec, which is about 513 times faster than real-time waveform synthesis. Compared to the tested model embodiment, the latency of GL is about 2 times higher with 3 iterations and about 32 times higher with 50 iterations (see Table 3). Although the tested embodiment requires more computation (FLOPs) than GL, it requires less GPU DRAM bandwidth (bytes/s). In fact, the computational intensity of the model implementation, 65 FLOPs/byte, is much higher than the computational intensity of a GL iteration, 1.9 FLOPs/byte. High computational intensity is important for better utilization of computational resources such as GPUs. Furthermore, the model implementation has a shorter critical path of dependent operations in its computation graph than GL (18 ops, compared to 27 ops for 3 GL iterations and 450 ops for 50 GL iterations). Thus, embodiments herein are more suitable for GPU parallelization.
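The real-time factor and the intensity comparison quoted above follow from simple arithmetic; a plain-Python check using the reported figures (a 16 kHz output rate is assumed, matching the training data):

```python
samples_per_sec = 8.2e6   # reported inference throughput of the vocoder
sample_rate = 16_000      # audio sample rate (Hz), assumed from the training data

real_time_factor = samples_per_sec / sample_rate
print(f"real-time factor ~ {real_time_factor:.1f}x")   # 512.5, i.e., ~513x as reported

# Computational intensity = FLOPs performed per byte of DRAM traffic (reported values).
model_intensity, gl_intensity = 65.0, 1.9
print(f"intensity ratio ~ {model_intensity / gl_intensity:.1f}x")   # ~34x
```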
Table 3: mean Opinion Score (MOS) and derived metrics. Samples/sec were inferred from the Tensorflow implementation with no specific kernel optimization and reference testing on NVIDIA Tesla P100.
2. Multi-headed CNN implementation
a) Experimental setup
Similar to the single-head CNN test embodiment, the LibriSpeech dataset was used, after a preprocessing pipeline including segmentation and noise reduction similar to that in Deep Voice 3.
A hop length of 256 (16 ms), a Hanning window length of 1024 (64 ms), and an FFT size of 2048 are assumed as spectrogram parameters. The MCNN implementation tested has 8 transposed convolution layers, where (s_i, w_i, c_i) = (2, 13, 2^(8-i)) for 1 ≤ i ≤ 8; that is, the number of channels is halved while the temporal dimension is upsampled by a factor of 2 at each layer. The coefficients of the loss terms in Section D above are chosen to be 1, 6, 10 and 1, respectively (audio quality is optimized using a random grid search). The model was trained using the Adam optimizer. The initial learning rate was 0.0005, annealed at a rate of 0.94 every 5000 iterations. The model was trained for approximately 600k iterations with a batch size of 16, distributed over 4 GPUs with synchronous updates. These results are compared to conventional GL and SPSI implementations, with and without additional GL iterations.
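A compact sketch of the multi-head combination described above is given below (assuming PyTorch; the heads and the scaling module, e.g., a scaled softsign, are supplied externally, and the use of one trainable scalar weight per head is an assumption consistent with the description rather than the embodiment's exact parameterization):

```python
import torch
import torch.nn as nn

class MCNN(nn.Module):
    """Multi-head CNN: each head maps the input spectrogram to a waveform; the head
    outputs are combined with trainable scalar weights and passed through a scaling
    function to produce the final waveform."""

    def __init__(self, heads: nn.ModuleList, scaling: nn.Module):
        super().__init__()
        self.heads = heads                                    # e.g., 8 heads
        self.head_weights = nn.Parameter(torch.ones(len(heads)))
        self.scaling = scaling                                # e.g., a scaled softsign

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        combined = sum(w * head(spectrogram)
                       for w, head in zip(self.head_weights, self.heads))
        return self.scaling(combined)
```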
b) Synthesized audio waveform quality
The synthesized audio waveform is illustrated in fig. 9. Fig. 9 depicts a comparison of a ground-truth waveform (entire utterance and zoomed-in portion) and its spectrogram (left/905) with an MCNN-generated waveform (entire utterance and zoomed-in portion) and its spectrogram (right/910), according to an embodiment of the present disclosure. It is observed that complex patterns can be fitted and that there is only a small phase error between the correlated high-amplitude spectral components (low offset between peaks). The synthesis quality on held-out LibriSpeech samples (see Table 4 below) was evaluated using mean opinion score (MOS), spectral convergence (SC), and classification accuracy over the 2484 speakers (to measure speaker distinguishability).
For MOS, human ratings are collected independently for each assessment through the Amazon Mechanical Turk framework, and multiple votes on the same sample are aggregated by a majority voting rule.
The speaker classifier model implementation used for classification accuracy comes from J. Chen, K. Peng, W. Ping and Y. Zhou, "Neural Voice Cloning with a Few Samples," available at arXiv:1802.06006, 2018, which is also disclosed in the following patent documents: U.S. patent application Ser. No. 16/143,330 (docket No. 28888-2201), entitled "SYSTEMS AND METHODS FOR NEURAL VOICE CLONING WITH A FEW SAMPLES," filed September 26, 2018; and U.S. provisional patent application No. 62/628,736 (docket No. 28888-2201P), entitled "NEURAL VOICE CLONING WITH A FEW SAMPLES," filed February 9, 2018, listing Jitong Chen, Kainan Peng and Wei Ping as inventors. Each of the above documents is incorporated herein by reference in its entirety.
According to subjective human ratings (MOS), the MCNN outperforms GL, even with a large number of GL iterations and with SPSI initialization. When trained only with the spectral convergence (SC) loss, the MCNN implementation is on par with GL with a very high number of iterations in terms of SC. In fact, training only with the SC loss yields an even slightly better SC on the test samples.
However, as shown in fig. 10, with the SC loss alone, lower audio quality is observed for some samples due to generated background noise and relatively unclear high-frequency harmonics. Fig. 10 depicts the log-STFT of synthesized speech for an MCNN trained with only the SC loss (top) and an MCNN trained with all losses (bottom), according to an embodiment of the present disclosure. To further improve audio quality, the flexibility of the MCNN implementation to integrate other losses is beneficial, as shown in Table 4.
Table 4: MOS with 95% confidence interval, mean spectral convergence, and speaker classification accuracy for LibriSpeech test samples.
Model | MOS (out of 5) | Spectral convergence (dB) | Classification accuracy (%)
---|---|---|---
MCNN (baseline) | 3.50±0.18 | –12.9 | 76.8
MCNN (filter width 9) | 3.26±0.18 | –11.9 | 73.2
MCNN (2 heads) | 2.78±0.17 | –10.7 | 71.4
MCNN (loss: equation (1)) | 3.32±0.16 | –13.3 | 69.6
MCNN (loss: equations (1) & (2b)) | 3.35±0.18 | –12.6 | 73.2
GL (3 iterations) | 2.55±0.26 | –5.9 | 76.8
GL (50 iterations) | 3.28±0.24 | –10.1 | 78.6
GL (150 iterations) | 3.41±0.21 | –13.6 | 82.1
SPSI | 2.52±0.28 | –4.9 | 75.0
SPSI + GL (3 iterations) | 3.18±0.23 | –8.7 | 78.6
SPSI + GL (50 iterations) | 3.41±0.19 | –11.8 | 78.6
Ground truth | 4.20±0.16 | –∞ | 85.7
Ablation studies also show that a sufficiently large filter width and a sufficiently large number of heads are important. Transposed convolutions tend to produce checkerboard-like patterns, and a single head may not be able to generate all frequencies efficiently. As a whole, however, the different heads cooperate to eliminate artifacts and to cover different frequency bands, which is described in further detail in Appendix B. Finally, the high speaker classification accuracy indicates that the MCNN implementation can efficiently maintain speaker characteristics (e.g., pitch, accent, etc.) without any conditioning, which suggests the potential for direct integration into training for applications such as voice cloning.
c) Generalization and optimization to specific speakers
As shown in Table 5, audio quality is maintained even when the MCNN implementation trained on LibriSpeech is used for unseen speakers from high-quality text-to-speech data sets, such as the internal data set from Deep Voice 1 (M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta and M. Shoeybi, "Deep voice: Real-time neural text-to-speech," vol. 70, pp. 195-204, 06-11 Aug 2017, also available at arxiv.org/pdf/1702.07825.pdf), which is also disclosed in the following patent documents: commonly assigned U.S. patent application Ser. No. 15/882,926 (docket No. 28888-2105), entitled "SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH," filed January 29, 2018; and U.S. provisional patent application No. 62/463,482 (docket No. 28888-2105P), entitled "SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH," filed February 24, 2017, the disclosure of each of which is incorporated herein by reference in its entirety (for convenience, the disclosure may be referred to as "Deep Voice 1"). To assess the extent to which quality can be improved, an individual MCNN model implementation is trained using only speaker-specific audio data, with re-optimized hyperparameters. The filter width is increased to 19 to improve the resolution of modeling the sharper high-frequency components. Since the data set is smaller (about 20 hours), a lower learning rate and more aggressive annealing are applied. The loss coefficient of equation 2b is increased because the quality of the data set is higher and a lower SC can be achieved. The quality generated by the single-speaker MCNN model implementation still differs from the ground truth.
Table 5: MOS with 95% confidence interval for single-speaker samples (internal data set from Deep Voice 1).
d) Frequency-based representation learning
The MCNN embodiment is trained using only human speech, which consists of multi-frequency, time-varying signals. Interestingly, as shown in fig. 11, similar to the single-head CNN, the MCNN implementation learns a Fourier basis representation over the spectral range of human speech (the train-test mismatch increases for higher frequencies beyond human speech, resulting in a worse representation there). Fig. 11 depicts waveforms synthesized by an MCNN embodiment (trained on LibriSpeech) for spectrogram inputs corresponding to sine waves of 500 Hz, 1000 Hz, and 2000 Hz, and for a spectrogram input of superimposed 1000 Hz and 2000 Hz sine waves, according to an embodiment of the present disclosure. When the input spectrogram corresponds to constant frequencies, sinusoidal waveforms at these frequencies are synthesized. When the input spectrogram corresponds to a small number of frequency bands, the synthesized waveform is a superposition of pure sinusoids at the constituent frequencies. In all cases, phase consistency is observed over long time windows.
e) Deployment considerations
The inference complexity and computational intensity (based on the assumptions presented in Appendix A) and the runtime were benchmarked on an Nvidia Tesla P100 GPU. A TensorFlow operation-level implementation without specific kernel optimization is considered; hardware-specific optimization may yield further improvements. For a fair comparison, a GPU implementation of GL using TensorFlow FFT/inverse-FFT operations is considered. The baseline MCNN model embodiment from Table 4 generates about 5.2M samples/second, which is about 330 times faster than real-time waveform synthesis. Compared to the MCNN implementation, the runtime of GL is about 20 times longer with 50 iterations and about 60 times longer with 150 iterations.
The computational complexity of the MCNN implementation is about 2.2 GFLOPs/sec, which is in fact slightly higher than the complexity of 150 GL iterations. However, the nature of the neural network architecture makes it well suited to modern multi-core processors such as GPUs or TPUs, resulting in a much shorter runtime. First, the MCNN implementation requires much less DRAM bandwidth (in bytes/s): its computational intensity of 61 FLOPs/byte is more than an order of magnitude greater than the 1.9 FLOPs/byte of GL. Furthermore, compared to GL, the MCNN implementation has a shorter critical path of dependent operations in its computation graph, which benefits parallelization. Such efficient inference is achieved with highly specialized models learned from large-scale training data, which is not possible for signal processing algorithms such as GL.
H. Some conclusions
Embodiments presented herein demonstrate the potential of convolutional neural network architectures for the long-standing spectrogram inversion problem, achieving very low latency without significantly sacrificing perceptual quality. Such architectures will benefit even more from future hardware, in ways that traditional iterative signal processing techniques such as GL and autoregressive models such as WaveNet cannot exploit.
It has been demonstrated that embodiments of the convolutional GAN model can be trained using large-scale, lower-quality data sets, generalize to unseen speakers, and learn a pure frequency basis. In one or more implementations, the evaluator's guidance in the GAN framework helps to suppress the checkerboard artifacts from upsampling and to suppress constant-frequency noise, resulting in slightly higher audio quality for human raters. Improvements in GAN algorithms that identify attributes related to perceptual quality, together with appropriate inductive biases, are expected to narrow the gap with models such as WaveNet.
One limitation on the output audio quality is the low quality of the speech recognition data set used. Using higher-quality data sets is expected to improve synthesis quality; quality improvements may also be achieved with lower-quality but larger-scale data sets. It should be noted that these embodiments may be integrated into the end-to-end training of other generative audio models, such as text-to-speech applications or audio style conversion applications.
Embodiments of the MCNN architecture for the spectrogram inversion problem are also presented herein. The MCNN embodiments achieve very fast waveform synthesis without significantly sacrificing perceptual quality. The MCNN implementation can be trained on large-scale speech data sets and generalizes well to unseen speech or speakers. The MCNN implementation may benefit even more from future hardware, in ways that autoregressive neural network models and conventional iterative signal processing techniques such as GL cannot exploit. Furthermore, the MCNN implementation would benefit from larger-scale audio data sets, which are expected to narrow the quality gap from the ground truth. The MCNN implementation may be integrated into the end-to-end training of other generative audio models, such as text-to-speech systems or audio style conversion systems.
I. Appendix
1. Appendix A-complexity modeling
The computational complexity of an operation is represented by the total number of algorithmic FLOPs, without considering hardware-specific logic-level implementations. This complexity metric has limitations, as it does not capture some of the major sources of cost, such as loading and storing data. Since most simple mathematical operations can be implemented as a single instruction, all point-wise operations (including non-linearities) are counted as 1 FLOP. The complexity of register memory move operations is ignored. Matrix-matrix multiplication between an m×n matrix W and an n×p matrix X is assumed to require 2mnp FLOPs. Similar expressions apply to the multidimensional tensors used in convolutional layers. For a real fast Fourier transform (FFT) of a length-N vector, a complexity of 2.5N log2(N) FLOPs is assumed. For most operations used in this patent document, the TensorFlow profiling tool provides FLOP counts, which are used directly.
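The counting conventions above translate directly into a few simple rules; a plain-Python sketch (illustrative only):

```python
from math import log2

def flops_matmul(m: int, n: int, p: int) -> float:
    # Matrix product of an (m x n) and an (n x p) matrix: 2*m*n*p FLOPs.
    return 2.0 * m * n * p

def flops_rfft(n: int) -> float:
    # Real FFT of a length-N vector: 2.5 * N * log2(N) FLOPs (assumed model).
    return 2.5 * n * log2(n)

def flops_pointwise(num_elements: int) -> float:
    # Point-wise operations (including non-linearities) count as 1 FLOP each.
    return float(num_elements)

# Example: one 4096-point real FFT, matching the FFT size used above.
print(flops_rfft(4096))   # 2.5 * 4096 * 12 = 122880 FLOPs
```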
2. Appendix B-analysis of contributions from multiple heads
Fig. 12 shows the output of each head and the overall waveform according to an embodiment of the present disclosure. The top row shows an example of the synthesized waveform and its log-STFT, while the bottom 8 rows show the waveform output of each of the heads. For better visualization, the waveform is normalized in each head, and small-amplitude components in the STFT are discarded after applying a threshold. It is observed that the multiple heads focus on different parts of the waveform in time, and also on different frequency bands. For example, head 2 focuses mainly on low-frequency components. During training, the individual heads are not explicitly constrained to such roles. In fact, the different heads share the same architecture, and the initial random weights of a head determine which part of the waveform it will focus on in the later stages of training; the end-to-end objective encourages the heads to cooperate. Thus, initializing all heads with the same weights would negate the benefits of the multi-head architecture. While the interpretability of the individual waveform outputs is low (it should also be noted that non-linear combinations of these waveforms may generate new frequencies that are not present in the individual outputs), their combination can produce highly natural-sounding waveforms.
3. Appendix C-more Waveform Synthesis examples
Examples of randomly sampled waveform synthesis for held-out LibriSpeech samples using the single-head CNN are shown in figs. 13 to 18.
Figs. 13-18 each depict an exemplary comparison of a ground-truth sample, in terms of waveform (showing the entire utterance and two different zoomed-in portions) and spectrogram, with its version generated using a convolutional GAN vocoder, according to embodiments of the present disclosure.
J. Computing system implementation
In an embodiment, aspects of this patent document may relate to or may include, or may be implemented on, one or more information handling systems/information computing systems. A computing system may include any instrumentality or combination of instrumentalities operable to compute, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or include a personal computer (e.g., a laptop), a tablet, a Personal Digital Assistant (PDA), a smartphone, a smartwatch, a smart package, a server (e.g., a blade server or a rack server), a network storage device, a camera, or any other suitable device and may vary in size, shape, performance, functionality, and price. The computing system may include Random Access Memory (RAM), one or more processing resources (e.g., a Central Processing Unit (CPU) or hardware or software control logic), ROM, and/or other types of memory. Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, a touch screen, and/or a video display. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
FIG. 19 depicts a simplified block diagram of a computing device/information handling system (or computing system) according to an embodiment of the present disclosure. While the computing system may be configured differently and may include different components (including fewer or more components than shown in fig. 19), it should be understood that the functionality illustrated for system 1900 may be operable to support various embodiments of a computing system.
As shown in FIG. 19, computing system 1900 includes one or more Central Processing Units (CPUs) 1901, CPU 1901 providing computing resources and controlling the computer. The CPU 1901 may be implemented by a microprocessor or the like, and may also include one or more Graphics Processing Units (GPUs) 1919 and/or floating point coprocessors for mathematical computations. The system 1900 may also include a system memory 1902, which system memory 1902 may be in the form of Random Access Memory (RAM), read Only Memory (ROM), or both.
As shown in fig. 19, a plurality of controllers and peripheral devices may also be provided. The input controller 1903 represents an interface to various input devices 1904, such as a keyboard, mouse, touch screen, and/or stylus. The computing system 1900 may also include a storage controller 1907, the storage controller 1907 for interfacing with one or more storage devices 1908, each of which includes storage media (such as tape or disk) or optical media, which may be used to record programs of instructions for operating systems, utilities and applications, which may include embodiments of programs that implement aspects of the present disclosure. Storage 1908 may also be used to store processed data or data to be processed in accordance with the present disclosure. The system 1900 may further include a display controller 1909 to provide an interface for a display device 1911, where the display device 1911 may be a Cathode Ray Tube (CRT), thin Film Transistor (TFT) display, organic light emitting diode, electroluminescent panel, plasma panel, or other type of display. Computing system 1900 may also include one or more peripheral controllers or interfaces 1905 for one or more peripheral devices 1906. Examples of peripheral devices may include one or more printers, scanners, input devices, output devices, sensors, and so forth. The communication controller 1914 may interface with one or more communication devices 1915 that enable the system 1900 to connect to remote devices over any of a variety of networks, including the internet, cloud resources (e.g., ethernet cloud, fibre channel over ethernet (FCoE)/Data Center Bridge (DCB) cloud, etc.), local Area Networks (LANs), wide Area Networks (WANs), storage Area Networks (SANs), or by any suitable electromagnetic carrier signals, including infrared signals.
In the system shown, all major system components may be connected to bus 1916, and bus 1916 may represent more than one physical bus. However, the various system components may or may not be physically proximate to each other. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs implementing aspects of the present disclosure may be accessed from a remote location (e.g., a server) via a network. Such data and/or programs may be conveyed by any of a variety of machine-readable media, including but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; a magneto-optical medium; and hardware devices that are specially configured to store or store and execute program code, such as Application Specific Integrated Circuits (ASICs), programmable Logic Devices (PLDs), flash memory devices, and ROM and RAM devices.
Aspects of the disclosure may be encoded on one or more non-transitory computer-readable media with instructions for causing one or more processors or processing units to perform steps. It should be noted that the one or more non-transitory computer-readable media should include both volatile and non-volatile memory. It should be noted that alternative implementations are possible, including hardware implementations or software/hardware implementations. The hardware-implemented functions may be implemented using ASICs, programmable arrays, digital signal processing circuits, and the like. Thus, the term "device" in any claim is intended to encompass both software implementations and hardware implementations. Similarly, the term "computer-readable medium or media" as used herein includes software and/or hardware or a combination thereof having a program of instructions embodied thereon. It should be understood that with such alternative implementations as contemplated, the figures and accompanying description provide those skilled in the art with the functional information required to write program code (i.e., software) and/or fabricate circuits (i.e., hardware) to perform the required processing.
It should be noted that embodiments of the present disclosure may also relate to computer products having non-transitory tangible computer-readable media thereon with computer code for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant art. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; a magneto-optical medium; and hardware devices that are specially configured to store or store and execute program code, such as Application Specific Integrated Circuits (ASICs), programmable Logic Devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as code produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the disclosure may be implemented, in whole or in part, as machine-executable instructions in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In a distributed computing environment, program modules may be physically located in local, remote, or both arrangements.
One skilled in the art will recognize that neither the computing system nor the programming language is important to the practice of the present disclosure. Those skilled in the art will also recognize that a number of the above elements may be physically and/or functionally divided into sub-modules or combined together.
Those skilled in the art will appreciate that the foregoing examples and embodiments are illustrative and do not limit the scope of the disclosure. It is to be understood that all permutations, enhancements, equivalents, combinations, and improvements of the present disclosure that would be apparent to one skilled in the art upon reading the specification and studying the drawings are included within the true spirit and scope of the present disclosure. It should also be noted that elements of any claim may be arranged differently to include multiple dependencies, configurations and combinations.
Claims (20)
1. A computer-implemented method of training a neural network model for spectrogram inversion, comprising:
inputting an input spectrogram comprising a plurality of frequency channels into a convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, wherein each transposed convolutional layer is separated by a non-linear operation in the set of transposed convolutional layers, and wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers;
outputting, from the convolutional neural network, a composite waveform for the input spectrogram having a corresponding true value waveform;
obtaining a loss of the convolutional neural network using the corresponding true value waveform, the composite waveform, and a loss function, wherein the loss function includes at least one loss component selected from a spectral convergence loss and a log-scale short-time Fourier transform amplitude loss; and
updating the convolutional neural network using the loss.
2. The computer-implemented method of claim 1, wherein the convolutional neural network comprises:
a plurality of heads, wherein each head receives the input spectrogram and comprises a set of transposed convolutional layers, wherein each transposed convolutional layer is separated by a non-linear operation, and wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers.
3. The computer-implemented method of claim 2, wherein a head of the plurality of heads is initialized with at least some different parameters to allow the convolutional neural network to focus on different portions of a waveform associated with the input spectrogram during training.
4. The computer-implemented method of claim 2, wherein each head of the plurality of heads generates a head output waveform from the input spectrogram, the method further comprising:
obtaining a combination of the head output waveforms, wherein the head output waveforms are combined in a weighted combination using trainable weight values for each head output waveform.
5. The computer-implemented method of claim 4, further comprising:
a scaled softsign function is applied to the weighted combination to obtain a final output waveform.
6. The computer-implemented method of claim 1, wherein the convolutional neural network further comprises a generative adversarial network and the loss function further comprises a generative adversarial network loss component.
7. The computer-implemented method of claim 1, wherein the loss function further comprises one or more additive loss terms selected from instantaneous frequency loss, weighted phase loss, and waveform envelope loss.
8. The computer-implemented method of claim 1, wherein the convolutional neural network is trained with a massive multi-speaker data set to produce a trained convolutional neural network for synthesizing waveforms for speakers not included in the massive multi-speaker data set or is trained with a single speaker data set to produce a trained convolutional neural network for synthesizing waveforms for individual speakers.
9. A computer-implemented method of generating a waveform from a spectrogram using a trained convolutional neural network, the computer-implemented method comprising:
inputting an input spectrogram comprising a plurality of frequency channels into a trained convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, wherein each transposed convolutional layer is separated by a non-linear operation in the set of transposed convolutional layers, and wherein the set of transposed convolutional layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers, and wherein the head outputs an output waveform;
applying a scaling function to the output waveform to obtain a final composite waveform; and
outputting the final composite waveform corresponding to the input spectrogram.
10. The computer-implemented method of claim 9, wherein the trained convolutional neural network comprises:
a plurality of heads, wherein each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolutional layers, in which each transposed convolutional layer is separated by a non-linear operation, and which reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers.
11. The computer-implemented method of claim 10, wherein the output waveform is obtained by performing the steps of:
combining head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms, wherein the output waveforms of the heads are weighted using trained weights for the heads.
12. The computer-implemented method of claim 9, wherein the scaling function is a scaled softsign function.
13. The computer-implemented method of claim 9, wherein the trained convolutional neural network is trained using a loss function comprising at least one loss component selected from spectral convergence loss, log-scale short-time fourier transform amplitude loss, instantaneous frequency loss, weighted phase loss, and waveform envelope loss.
14. The computer-implemented method of claim 13, wherein the trained convolutional neural network is trained using a generative adversarial network, and the loss function further comprises a generative adversarial network loss component.
15. The computer-implemented method of claim 9, further comprising converting the input spectrogram from a mel-scale spectrogram.
16. A non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by at least one processor, cause the following steps to be performed:
inputting an input spectrogram comprising a plurality of frequency channels into a trained convolutional neural network comprising at least one head, wherein a head comprises a set of transposed convolutional layers, in which each transposed convolutional layer is separated by a non-linear operation, and which reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers, and the head outputs an output waveform;
applying a scaling function to the output waveform to obtain a final composite waveform; and
outputting the final composite waveform corresponding to the input spectrogram.
17. The non-transitory computer-readable medium of claim 16, wherein the trained convolutional neural network comprises:
a plurality of heads, wherein each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolutional layers, in which each transposed convolutional layer is separated by a non-linear operation, and which reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolutional layer in the set of transposed convolutional layers.
18. The non-transitory computer-readable medium of claim 17, wherein the output waveform is obtained by performing the steps of:
combining head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms, wherein the output waveforms of the heads are weighted using trained weights for the heads.
19. The non-transitory computer-readable medium of claim 16, wherein the trained convolutional neural network is trained using a loss function, wherein the loss function comprises at least one loss component selected from the group consisting of a spectral convergence loss, a log-scale short-time Fourier transform amplitude loss, an instantaneous frequency loss, a weighted phase loss, a waveform envelope loss, and a generative adversarial network loss component.
20. The non-transitory computer-readable medium of claim 16, further comprising one or more sequences of instructions which, when executed by at least one processor, cause the following steps to be performed:
converting the input spectrogram from a mel-scale spectrogram.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862673353P | 2018-05-18 | 2018-05-18 | |
US62/673,353 | 2018-05-18 | ||
US201862765316P | 2018-08-20 | 2018-08-20 | |
US62/765,316 | 2018-08-20 | ||
US16/365,673 | 2019-03-27 | ||
US16/365,673 US11462209B2 (en) | 2018-05-18 | 2019-03-27 | Spectrogram to waveform synthesis using convolutional networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503128A CN110503128A (en) | 2019-11-26 |
CN110503128B true CN110503128B (en) | 2023-01-13 |
Family
ID=68532843
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910419461.5A Active CN110503128B (en) | 2018-05-18 | 2019-05-20 | Spectrogram for waveform synthesis using convolution-generated countermeasure network |
Country Status (2)
Country | Link |
---|---|
US (1) | US11462209B2 (en) |
CN (1) | CN110503128B (en) |
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102136464B1 (en) * | 2018-07-31 | 2020-07-21 | 전자부품연구원 | Audio Segmentation Method based on Attention Mechanism |
US10971170B2 (en) * | 2018-08-08 | 2021-04-06 | Google Llc | Synthesizing speech from text using neural networks |
US10891969B2 (en) * | 2018-10-19 | 2021-01-12 | Microsoft Technology Licensing, Llc | Transforming audio content into images |
US11468879B2 (en) * | 2019-04-29 | 2022-10-11 | Tencent America LLC | Duration informed attention network for text-to-speech analysis |
BR112021025892A2 (en) * | 2019-06-28 | 2022-02-08 | Nec Corp | Counterfeit detection apparatus, counterfeit detection method and computer readable storage medium |
WO2021127978A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
WO2021128173A1 (en) * | 2019-12-26 | 2021-07-01 | 浙江大学 | Voice signal-driven facial animation generation method |
CN111341341B (en) * | 2020-02-11 | 2021-08-17 | 腾讯科技(深圳)有限公司 | Training method of audio separation network, audio separation method, device and medium |
US11170793B2 (en) * | 2020-02-13 | 2021-11-09 | Adobe Inc. | Secure audio watermarking based on neural networks |
JP7472575B2 (en) * | 2020-03-23 | 2024-04-23 | ヤマハ株式会社 | Processing method, processing device, and program |
CN111564160B (en) * | 2020-04-21 | 2022-10-18 | 重庆邮电大学 | Voice noise reduction method based on AEWGAN |
US11867733B2 (en) | 2020-05-11 | 2024-01-09 | Robert William ENOUY | Systems and methods of signal analysis and data transfer using spectrogram construction and inversion |
CN111695676B (en) * | 2020-05-22 | 2023-01-17 | 中国科学院软件研究所 | Wavefront restoration method and system based on generation countermeasure network |
WO2021248473A1 (en) * | 2020-06-12 | 2021-12-16 | Baidu.Com Times Technology (Beijing) Co., Ltd. | Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses |
CN111723329B (en) * | 2020-06-19 | 2023-04-07 | 南京大学 | Seismic phase feature recognition waveform inversion method based on full convolution neural network |
US20220036238A1 (en) * | 2020-07-30 | 2022-02-03 | Tektronix, Inc. | Mono channel burst classification using machine learning |
CN112037760B (en) | 2020-08-24 | 2022-01-07 | 北京百度网讯科技有限公司 | Training method and device of voice spectrum generation model and electronic equipment |
US20230363679A1 (en) * | 2020-09-17 | 2023-11-16 | The Penn State Research Foundation | Systems and methods for assisting with stroke and other neurological condition diagnosis using multimodal deep learning |
US11328733B2 (en) * | 2020-09-24 | 2022-05-10 | Synaptics Incorporated | Generalized negative log-likelihood loss for speaker verification |
US20220101872A1 (en) * | 2020-09-25 | 2022-03-31 | Descript, Inc. | Upsampling of audio using generative adversarial networks |
CN112259086A (en) * | 2020-10-15 | 2021-01-22 | 杭州电子科技大学 | Speech conversion method based on spectrogram synthesis |
CN112422870B (en) * | 2020-11-12 | 2021-09-17 | 复旦大学 | Deep learning video frame insertion method based on knowledge distillation |
CN112562700A (en) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Emotional voice synthesis method, device, equipment and storage medium |
CN112712812B (en) * | 2020-12-24 | 2024-04-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal generation method, device, equipment and storage medium |
CN112786003A (en) * | 2020-12-29 | 2021-05-11 | 平安科技(深圳)有限公司 | Speech synthesis model training method and device, terminal equipment and storage medium |
CN112767957B (en) * | 2020-12-31 | 2024-05-31 | 中国科学技术大学 | Method for obtaining prediction model, prediction method of voice waveform and related device |
US11947628B2 (en) * | 2021-03-30 | 2024-04-02 | Snap Inc. | Neural networks for accompaniment extraction from songs |
CN112949089B (en) * | 2021-04-01 | 2022-11-15 | 吉林大学 | Aquifer structure inversion identification method based on discrete convolution residual error network |
CN113327573B (en) * | 2021-05-28 | 2024-10-11 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium |
CN113505654A (en) * | 2021-06-17 | 2021-10-15 | 浙江优特轴承有限公司 | Bearing health state identification method based on multi-view attention network |
CN113890564B (en) * | 2021-08-24 | 2023-04-11 | 浙江大学 | Special ad hoc network frequency hopping anti-interference method and device for unmanned aerial vehicle based on federal learning |
CN113960551B (en) * | 2021-08-30 | 2024-08-02 | 西安电子科技大学 | Clutter image generation method and target detection method for SAR image |
CN114155879B (en) * | 2021-12-06 | 2022-07-01 | 哈尔滨工程大学 | Abnormal sound detection method for compensating abnormal perception and stability by using time-frequency fusion |
CN114882867B (en) * | 2022-04-13 | 2024-05-28 | 天津大学 | Depth network waveform synthesis method and device based on filter bank frequency discrimination |
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4252028B2 (en) * | 2004-11-25 | 2009-04-08 | 住友電気工業株式会社 | Traffic sound identification device, traffic sound determination program for causing computer to function as traffic sound identification device, recording medium, and traffic sound determination method |
WO2009027980A1 (en) * | 2007-08-28 | 2009-03-05 | Yissum Research Development Company Of The Hebrew University Of Jerusalem | Method, device and system for speech recognition |
CN101881727B (en) * | 2010-07-06 | 2014-01-29 | 西安交通大学 | Multicomponent gas concentration quantitative analysis method based on absorption spectrogram reconstruction |
CN104021373B (en) * | 2014-05-27 | 2017-02-15 | 江苏大学 | Semi-supervised speech feature variable factor decomposition method |
US10127901B2 (en) * | 2014-06-13 | 2018-11-13 | Microsoft Technology Licensing, Llc | Hyper-structure recurrent neural networks for text-to-speech |
US9582753B2 (en) * | 2014-07-30 | 2017-02-28 | Mitsubishi Electric Research Laboratories, Inc. | Neural networks for transforming signals |
US10540957B2 (en) * | 2014-12-15 | 2020-01-21 | Baidu Usa Llc | Systems and methods for speech transcription |
US10332509B2 (en) * | 2015-11-25 | 2019-06-25 | Baidu USA, LLC | End-to-end speech recognition |
US10460747B2 (en) * | 2016-05-10 | 2019-10-29 | Google Llc | Frequency based audio analysis using neural networks |
US11069335B2 (en) * | 2016-10-04 | 2021-07-20 | Cerence Operating Company | Speech synthesis using one or more recurrent neural networks |
US11010601B2 (en) * | 2017-02-14 | 2021-05-18 | Microsoft Technology Licensing, Llc | Intelligent assistant device communicating non-verbal cues |
CN106920545B (en) * | 2017-03-21 | 2020-07-28 | 百度在线网络技术(北京)有限公司 | Speech feature extraction method and device based on artificial intelligence |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
CN107316013B (en) * | 2017-06-14 | 2020-04-07 | 西安电子科技大学 | Hyperspectral image classification method based on NSCT (non-subsampled Contourlet transform) and DCNN (data-to-neural network) |
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN107845390A (en) * | 2017-09-21 | 2018-03-27 | 太原理工大学 | A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features |
CN108010514B (en) * | 2017-11-20 | 2021-09-10 | 四川大学 | Voice classification method based on deep neural network |
CN108010538B (en) * | 2017-12-22 | 2021-08-24 | 北京奇虎科技有限公司 | Audio data processing method and device and computing equipment |
US10937438B2 (en) * | 2018-03-29 | 2021-03-02 | Ford Global Technologies, Llc | Neural network generative modeling to transform speech utterances and augment training data |
US11487993B2 (en) * | 2018-04-24 | 2022-11-01 | GM Global Technology Operations LLC | Apparatus and method that detect wheel alignment condition |
- 2019
- 2019-03-27 US US16/365,673 patent/US11462209B2/en active Active
- 2019-05-20 CN CN201910419461.5A patent/CN110503128B/en active Active
Also Published As
Publication number | Publication date |
---|---|
US20190355347A1 (en) | 2019-11-21 |
CN110503128A (en) | 2019-11-26 |
US11462209B2 (en) | 2022-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503128B (en) | Spectrogram for waveform synthesis using convolutional generative adversarial networks | |
Arık et al. | Fast spectrogram inversion using multi-head convolutional neural networks | |
Kong et al. | On fast sampling of diffusion probabilistic models | |
Luo et al. | Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation | |
US11238843B2 (en) | Systems and methods for neural voice cloning with a few samples | |
Yen et al. | Cold diffusion for speech enhancement | |
Drude et al. | NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing | |
US20200066253A1 (en) | Parallel neural text-to-speech | |
Su et al. | Bandwidth extension is all you need | |
US20230317056A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
US20210090547A1 (en) | Small-footprint flow-based models for raw audio | |
WO2019163848A1 (en) | Device for learning speech conversion, and device, method, and program for converting speech | |
US20230326476A1 (en) | Bandwidth extension and speech enhancement of audio | |
CN114267366A (en) | Speech noise reduction through discrete representation learning | |
Qi et al. | Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement | |
Şimşekli et al. | Non-negative tensor factorization models for Bayesian audio processing | |
Tu et al. | DNN training based on classic gain function for single-channel speech enhancement and recognition | |
JP2023545820A (en) | Generative neural network model for processing audio samples in the filter bank domain | |
Zhang et al. | Complex image generation swintransformer network for audio denoising | |
Yu et al. | A novel target decoupling framework based on waveform-spectrum fusion network for monaural speech enhancement | |
Shukla et al. | Speech enhancement system using deep neural network optimized with Battle Royale Optimization | |
Yang et al. | Attention-based latent features for jointly trained end-to-end automatic speech recognition with modified speech enhancement | |
RU2823016C1 (en) | Audio data generator and methods of generating audio signal and training audio data generator | |
Baoueb et al. | SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis | |
US20240161736A1 (en) | Electronic device and method of low latency speech enhancement using autoregressive conditioning-based neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||