CN117616498A - Compression of audio waveforms using neural networks and vector quantizers - Google Patents

Compression of audio waveforms using neural networks and vector quantizers

Info

Publication number
CN117616498A
Authority
CN
China
Prior art keywords
vector
audio waveform
feature
neural network
waveform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280046175.9A
Other languages
Chinese (zh)
Inventor
Neil Zeghidour
Marco Tagliasacchi
Dominik Roblek
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/856,856 (US11600282B2)
Application filed by Google LLC
Publication of CN117616498A

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for compressing audio waveforms using neural networks and vector quantizers. One of the methods comprises: receiving an audio waveform comprising a respective audio sample for each of a plurality of time steps; processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform; generating a respective compiled representation of each feature vector of the plurality of feature vectors using a plurality of vector quantizers, each associated with a respective codebook of code vectors, wherein the respective compiled representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, the plurality of code vectors defining a quantized representation of the feature vector; and generating a compressed representation of the audio waveform by compressing the respective compiled representation of each of the plurality of feature vectors.

Description

Compression of audio waveforms using neural networks and vector quantizers
Cross reference to related applications
The present application claims priority from U.S. patent application Ser. No. 17/856,856, filed on July 1, 2022, which claims priority from U.S. provisional application Ser. No. 63/218,139, filed on July 2, 2021. The disclosures of these prior applications are considered to be part of the disclosure of the present application and are incorporated by reference into the disclosure of the present application.
Technical Field
The present description relates to processing data using a machine learning model.
Background
A machine learning model receives input and generates output, such as predicted output, based on the received input. Some machine learning models are parametric models and generate output based on the received input and the values of the parameters of the model.
Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers, each of which applies a nonlinear transformation to a received input to generate an output.
Disclosure of Invention
The present specification generally describes a compression system implemented as a computer program on one or more computers in one or more locations that can compress audio waveforms. The present specification further describes a decompression system implemented as a computer program on one or more computers in one or more locations that can decompress audio waveforms.
In general, the compression system and decompression system may be located in any suitable locations. In particular, the decompression system may optionally be located at a location remote from the compression system. For example, the compression system may be implemented by one or more first computers at a first location, while the decompression system may be implemented by one or more second (different) computers at a second (different) location.
In some implementations, the compression system may generate a compressed representation of the input audio waveform and store the compressed representation in a data store, such as a logical data store or a physical data storage device. The decompression system may later access the compressed representation from the data store and process the compressed representation to generate a corresponding output audio waveform. The output audio waveform may be, for example, a reconstruction of the input audio waveform or an enhanced (e.g., denoised) version of the input audio waveform.
In some implementations, the compression system may generate a compressed representation of the input audio waveform and transmit the compressed representation to the destination over a data communication network, such as a local area network, a wide area network, or the internet. The decompression system may access the compressed representation at the destination and process the compressed representation to generate a corresponding output waveform.
According to a first aspect, there is provided a method performed by one or more computers, the method comprising: receiving an audio waveform comprising a respective audio sample for each of a plurality of time steps; processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform; generating, using a plurality of vector quantizers each associated with a respective codebook of code vectors, a respective compiled representation of each of the plurality of feature vectors, wherein the respective compiled representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, the plurality of code vectors defining a quantized representation of the feature vector; and generating a compressed representation of the audio waveform by compressing the respective compiled representation of each of the plurality of feature vectors.
In some implementations, the plurality of vector quantizers are ordered into a sequence, and wherein, for each feature vector of the plurality of feature vectors, generating the compiled representation of the feature vector comprises: for a first vector quantizer in the sequence of vector quantizers: receiving a feature vector; identifying, based on the feature vectors, corresponding code vectors from a codebook of the vector quantizer to represent the feature vectors; and determining a current residual vector based on an error between (i) the feature vector and (ii) a code vector representing the feature vector; wherein the compiled representation of the feature vector identifies a code vector representing the feature vector.
In some implementations, for each feature vector of the plurality of feature vectors, generating the compiled representation of the feature vector further includes: for each vector quantizer following the first vector quantizer in the sequence of vector quantizers: receiving a current residual vector generated by a previous vector quantizer in the sequence of vector quantizers; identifying, from a codebook of the vector quantizer, a corresponding code vector to represent the current residual vector based on the current residual vector; and if the vector quantizer is not the last vector quantizer in the sequence of vector quantizers: updating the current residual vector based on an error between (i) the current residual vector and (ii) a code vector representing the current residual vector; wherein the compiled representation of the feature vector identifies a code vector representing the current residual vector.
In some implementations, generating the compressed representation of the audio waveform includes: the respective compiled representation of each of the plurality of feature vectors is entropy encoded.
In some implementations, a respective quantized representation of each feature vector is defined by a sum of a plurality of code vectors identified by a compiled representation of the feature vector.
In some implementations, the codebooks of the plurality of vector quantizers all include an equal number of code vectors.
In some implementations, the encoder neural network and the codebooks of the plurality of vector quantizers are co-trained with a decoder neural network, wherein the decoder neural network is configured to: receiving a respective quantized representation of each of a plurality of feature vectors representing an input audio waveform generated using an encoder neural network and a plurality of vector quantizers; and processing the quantized representation of the feature vector representing the input audio waveform to generate an output audio waveform.
In some embodiments, training comprises: obtaining a plurality of training examples, each training example comprising: (i) a respective input audio waveform and (ii) a corresponding target audio waveform; processing the respective input audio waveform from each training example using the encoder neural network, the plurality of vector quantizers from the sequence of vector quantizers, and the decoder neural network to generate an output audio waveform that is an estimate of the corresponding target audio waveform; determining a gradient of an objective function that depends on the respective output audio waveform and target audio waveform of each training example; and updating one or more of the following using the gradient of the objective function: a set of encoder neural network parameters, a set of decoder neural network parameters, or the codebooks of the plurality of vector quantizers.
In some implementations, for one or more of the training examples, the target audio waveform is an enhanced version of the input audio waveform.
In some implementations, for one or more of the training examples, the target audio waveform is a denoised version of the input audio waveform.
In some implementations, the target audio waveform is the same as the input audio waveform for one or more of the training examples.
In some implementations, processing each input audio waveform to generate the corresponding output audio waveform includes: conditioning the encoder neural network, the decoder neural network, or both on data defining whether the corresponding target audio waveform is (i) the input audio waveform or (ii) an enhanced version of the input audio waveform.
In some implementations, the method further includes, for each training example: selecting a respective number of vector quantizers to be used in quantizing the feature vectors representing the input audio waveform; and generating the corresponding output audio waveform using only the selected number of vector quantizers from the sequence of vector quantizers.
In some implementations, the selected number of vector quantizers to be used in quantizing feature vectors representing the input audio waveform varies between training examples.
In some implementations, for each training example, selecting a respective number of vector quantizers to be used in quantizing the feature vectors representing the input audio waveform includes: randomly sampling the number of vector quantizers to be used in quantizing the feature vectors representing the input audio waveform.
In some implementations, the objective function includes a reconstruction loss that measures, for each training example, an error between (i) the output audio waveform and (ii) the corresponding target audio waveform.
In some implementations, for each training example, a reconstruction loss measures a multi-scale spectral error between (i) the output audio waveform and (ii) the corresponding target audio waveform.
In some implementations, for each training example: processing data derived from the output audio waveform using a discriminator neural network to generate a set of one or more discriminator scores, wherein each discriminator score characterizes an estimated likelihood that the output audio waveform was generated using the encoder neural network, the plurality of vector quantizers, and the decoder neural network; wherein the objective function includes an adversarial loss that depends on the discriminator scores generated by the discriminator neural network.
In some implementations, the data derived from the output audio waveform includes the output audio waveform, a downsampled version of the output audio waveform, or a fourier transformed version of the output audio waveform.
In some implementations, for each training example, the reconstruction loss measures the error between: (i) One or more intermediate outputs generated by the discriminator neural network by processing the output audio waveform, and (ii) one or more intermediate outputs generated by the discriminator neural network by processing the corresponding target audio waveform.
In some implementations, during training, the codebooks of the plurality of vector quantizers are repeatedly updated using an exponential moving average of the feature vectors generated by the encoder neural network.
In some implementations, the encoder neural network includes a sequence of encoder blocks, each encoder block configured to process a respective set of input feature vectors according to a set of encoder block parameters to generate a set of output feature vectors having a lower temporal resolution than the set of input feature vectors.
In some implementations, the decoder neural network includes a sequence of decoder blocks, each decoder block configured to process a respective set of input feature vectors according to a set of decoder block parameters to generate a set of output feature vectors having a higher temporal resolution than the set of input feature vectors.
In some implementations, the audio waveform is a speech waveform or a music waveform.
In some embodiments, the method further comprises transmitting the compressed representation of the audio waveform over a network.
According to another aspect, there is provided a method performed by one or more computers, the method comprising: receiving a compressed representation of an input audio waveform; decompressing the compressed representation of the input audio waveform to obtain a respective compiled representation of each of a plurality of feature vectors representing the input audio waveform, wherein the compiled representation of each feature vector identifies a plurality of code vectors including respective code vectors from a respective codebook of each of a plurality of vector quantizers, the plurality of code vectors defining quantized representations of the feature vectors; generating a respective quantized representation of each feature vector from the compiled representation of the feature vector; and processing the quantized representation of the feature vector using a decoder neural network to generate an output audio waveform.
According to another aspect, there is provided a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective methods herein.
According to another aspect, one or more non-transitory computer storage media are provided storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective methods herein.
The subject matter described in this specification can be implemented in specific embodiments to realize one or more of the following advantages.
The compression/decompression system described in this specification may enable more efficient compression of audio data than some conventional systems. By enabling more efficient audio data compression, the system allows for more efficient audio data transmission (e.g., by reducing the communication network bandwidth required to transmit audio data) and more efficient audio data storage (e.g., by reducing the amount of memory required to store audio data).
The compression/decompression system includes an encoder neural network, a set of vector quantizers, and a decoder neural network, which are jointly trained (i.e., from "end to end"). The co-training of the corresponding neural network parameters of the encoder and decoder neural networks along with the codebook of the vector quantizer enables the parameters of the compression/decompression system to be adjusted consistently to achieve more efficient audio compression than would otherwise be possible. For example, as the neural network parameters of the encoder neural network are iteratively adjusted, the codebook of vector quantizers is simultaneously optimized to enable more accurate quantization of the feature vectors generated by the encoder neural network. The neural network parameters of the decoder neural network are also optimized simultaneously to enable more accurate reconstruction of the audio waveform from quantized feature vectors generated using the updated codebook of the vector quantizer.
Performing vector quantization on feature vectors representing audio waveforms using a single vector quantizer, in which each feature vector is represented using r bits, may require a codebook of size 2^r. That is, the size of the codebook of the vector quantizer may increase exponentially with the number of bits allocated to represent each feature vector. As the number of bits allocated to represent each feature vector increases, learning and storing the codebook becomes computationally infeasible. To solve this problem, the compression/decompression system performs vector quantization using a sequence of multiple vector quantizers, each maintaining a corresponding codebook. The first vector quantizer may directly quantize the feature vector generated by the encoder neural network, while each subsequent vector quantizer may quantize the residual vector defining the quantization error of the previous vector quantizer.
The sequence of vector quantizers may iteratively refine the quantization of feature vectors while each maintains a codebook significantly smaller than that required by a single vector quantizer. For example, each vector quantizer may maintain a codebook of size 2^(r/N_q), where r is the number of bits allocated to represent each feature vector and N_q is the number of vector quantizers. Thus, performing vector quantization using a sequence of multiple vector quantizers enables the compression/decompression system to reduce the memory required to store the quantizer codebooks and allows vector quantization to be performed in cases where it would otherwise be computationally infeasible.
Performing vector quantization using a set of multiple vector quantizers (i.e., rather than a single vector quantizer) also enables the compression/decompression system to control the compression bit rate, e.g., the number of bits used to represent audio data per second. To reduce the bit rate, the compression/decompression system may use fewer vector quantizers to perform vector quantization. Conversely, to increase the bit rate, the compression/decompression system may use more vector quantizers to perform vector quantization. During training, the number of vector quantizers used for compression/decompression of each audio waveform may vary (e.g., randomly) between training examples, such that the compression/decompression system learns a single set of parameter values that enable efficient compression/decompression over a range of possible bit rates. Thus, the compression/decompression system enables a reduction in the consumption of computational resources by eliminating any requirement to train and maintain a plurality of respective encoders, decoders and vector quantizers, each optimized for a respective bit rate.
The compression/decompression system may be trained to jointly perform audio data compression and audio data enhancement, such as denoising. That is, the compression and decompression systems may be trained to enhance (e.g., denoise) the audio waveform as part of compressing and decompressing the waveform, without increasing the overall delay. In contrast, some conventional systems apply separate audio enhancement algorithms to the audio waveform on the transmitter side (i.e., before compression) or on the receiver side (i.e., after decompression), which may result in increased delay.
The details of one or more embodiments of the subject matter are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
Fig. 1 depicts an example audio compression system that may compress audio waveforms using an encoder neural network and a residual vector quantizer.
Fig. 2 depicts an example audio decompression system that may use a decoder neural network and a residual vector quantizer to decompress a compressed audio waveform.
FIG. 3 is a schematic diagram of an example training system that may jointly train an encoder neural network, a decoder neural network, and a residual vector quantizer.
Fig. 4 is a flow chart of an example process for compressing an audio waveform.
Fig. 5 is a flow chart of an example process for decompressing a compressed audio waveform.
Fig. 6 is a flow chart of an example process for generating a quantized representation of a feature vector using a residual vector quantizer.
FIG. 7 is a flowchart of an example process for jointly training an encoder neural network, a decoder neural network, and a residual vector quantizer.
Fig. 8A and 8B illustrate examples of a fully convolutional neural network architecture.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Fig. 1 depicts an example audio compression system 100 that may compress audio waveforms using an encoder neural network 102 and a residual vector quantizer 106. Similarly, fig. 2 depicts an example audio decompression system 200 that may use the decoder neural network 104 and the residual vector quantizer 106 to decompress compressed audio waveforms. For clarity, reference will be made to fig. 1 and 2 in describing the various components involved in compression and decompression. The audio compression/decompression system 100/200 is an example of a system implemented as a computer program on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.
The compression/decompression system 100/200 utilizes a neural network architecture (neural codec) that is superior to conventional codecs, such as waveform and parameter codecs, in terms of operating bit rate and generic audio compression. For comparison, waveform codecs typically use time/frequency domain transforms for compression/decompression with little or no assumption of source audio. Thus, they produce high quality audio at medium to high bit rates, but often introduce coding artifacts at low bit rates. Parametric codecs can overcome this problem by making specific assumptions about the source audio (e.g. speech), but are ineffective for generic audio compression. Instead, the audio compression and decompression system 100/200 may compress and decompress speech, music, and general audio at a bit rate that is typically targeted by a speech customization codec (e.g., a parametric codec). Thus, the audio compression/decompression system 100/200 may operate in a manner not achievable by conventional codecs. In some implementations, the compression and decompression system 100/200 may be configured for a particular type of audio content (e.g., speech) due to the flexibility allowed by the neural network architecture.
Referring to fig. 1, the compression system 100 receives an audio waveform 112 to be compressed. Waveform 112 may include an audio sample at each time step, where the time steps generally correspond to a particular sampling rate. A higher sampling rate captures higher-frequency components of the audio waveform 112. For example, a standard audio sampling rate of 48kHz is used by professional digital devices because it can reconstruct sound at frequencies up to 24kHz (e.g., the upper limit of human hearing). While such a sampling rate is ideal for comfortable listening, the compression system 100 may generally be configured to process the waveform 112 at any sampling rate, even with non-uniformly sampled waveforms.
The audio waveform 112 may originate from any suitable audio source. For example, waveform 112 may be a recording from an external audio device (e.g., speech from a microphone), a pure digital product (e.g., electronic music), or general-purpose audio such as sound effects and background noise (e.g., white noise, room tones). In some implementations, the audio compression system 100 can perform audio enhancement simultaneously as the waveform 112 is compressed, e.g., suppressing unwanted background noise.
In conventional audio processing pipelines, compression and enhancement are typically performed by separate modules. For example, the audio enhancement algorithm may be applied at the input of the encoder 102 before the waveform is compressed or at the output of the decoder 104 after the waveform is decompressed. In this arrangement, each processing step contributes to the end-to-end delay, for example, by buffering the waveform to the expected frame length of the algorithm. In contrast, with judicious training of the various neural network components (see fig. 3), the audio compression system 100 may enable joint compression and enhancement without relying on separate modules and without creating additional delays. In some implementations, the audio decompression system 200 implements joint decompression and enhancement. In general, compression system 100, decompression system 200, or both, may be designed to perform audio enhancement by fully training neural network components.
The encoder 102 processes (e.g., encodes) the audio waveform 112 to generate a sequence of feature vectors 208 representing the waveform 112. A feature vector 208 (e.g., an embedding or latent representation) is a compressed representation of the waveform that extracts the most relevant information about its audio content. The encoder 102 may downsample the input waveform 202 to generate the compressed feature vectors 208 such that the feature vectors 208 have a lower sampling rate than the original audio waveform 112. For example, the encoder neural network 102 may use multiple convolutional layers with increasing strides to generate the feature vectors 208 at the lower sampling rate (e.g., lower temporal resolution).
The residual (e.g., multi-level) vector quantizer RVQ 106 then processes the feature vectors 208 to generate a compiled representation of each feature vector (CFV) 210 and a corresponding quantized representation of the feature vector (QFV) 212. RVQ 106 may generate QFV 212 at a particular bit rate by utilizing one or more vector quantizers 108. RVQ 106 achieves (lossy) compression by mapping the high-dimensional space of feature vectors 208 to the discrete subspace of the code vectors. As will be described in detail below, CFV 210 specifies codewords (e.g., indexes) from a respective codebook 110 of each vector quantizer 108, where each codeword identifies a code vector stored in the associated codebook 110. Thus, QFV 212 is an approximation of the feature vector 208 defined by the combination of code vectors specified by the corresponding CFV 210. Typically, QFV 212 is the sum (e.g., linear combination) of the code vectors specified by CFV 210.
In some cases, RVQ 106 uses a single vector quantizer 108 with a single codebook 110. The quantizer 108 may compress the feature vector 208 into QFV 212 by selecting a code vector in its codebook 110 to represent the feature vector 208. Quantizer 108 may select a code vector based on any suitable distance metric (e.g., error) between the two vectors, such as an L_n norm, a cosine distance, etc. For example, RVQ 106 may select the code vector with the smallest Euclidean distance (e.g., L_2 norm) from the feature vector. Quantizer 108 may then store the corresponding codeword in CFV 210. Since codewords typically require fewer bits than code vectors, the CFV 210 consumes less space in memory and can achieve greater compression than QFV 212 without additional loss.
However, since the size of the codebook 110 increases exponentially with the increase in bit rate, the approach of a single vector quantizer 108 may become too expensive. To overcome this problem, RVQ 106 may utilize a sequence of vector quantizers 108. In this case, each vector quantizer 108 in the sequence contains a respective codebook 110 of code vectors. RVQ 106 may then use an iterative method to generate CFV 210 and corresponding QFV 212 such that each vector quantizer 108 in the sequence further refines the quantization.
For example, at the first vector quantizer 108, the quantizer may receive the feature vector 208 and select a code vector from its codebook 110 to represent the feature vector 208 based on a minimum distance metric. A residual vector may be calculated as the difference between the feature vector 208 and the code vector representing the feature vector 208. The residual vector may be received by the next quantizer 108 in the sequence, which selects a code vector from its codebook 110 to represent the residual vector based on the minimum distance metric. The difference between these two vectors can be used as the residual vector for the next iteration. The iterative method may continue for each vector quantizer 108 in the sequence. The code vectors identified in this manner may be summed to form the QFV 212, and the codeword for each code vector may be stored in the corresponding CFV 210.
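The iterative quantization just described can be illustrated with a short sketch. The following Python code is a minimal illustration under assumed array shapes and a Euclidean distance metric; it is not the patented implementation, and the function and variable names are hypothetical.

```python
import numpy as np

def residual_vector_quantize(feature_vector, codebooks):
    """Quantize one feature vector with a sequence of codebooks (RVQ sketch).

    codebooks: list of arrays, each of shape [num_code_vectors, dim].
    Returns the selected codewords (indices) and the quantized vector (QFV).
    """
    residual = feature_vector
    codewords = []                               # compiled representation: one index per quantizer
    quantized = np.zeros_like(feature_vector)
    for codebook in codebooks:
        # Select the code vector closest to the current residual (Euclidean distance).
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        code_vector = codebook[index]
        codewords.append(index)
        quantized += code_vector                 # the QFV is the sum of selected code vectors
        residual = residual - code_vector        # pass the remaining quantization error onward
    return codewords, quantized
```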
In general, RVQ 106 may utilize any suitable number of vector quantizers 108. The number of quantizers N_q and the size N_i of each codebook control a trade-off between computational complexity and coding efficiency. Thus, the sequence of quantizers 108 provides a flexible means of balancing these two opposing factors. In some cases, the size of each codebook is the same, N_i = N, so that the total bit budget is evenly allocated across the vector quantizers 108. Uniform allocation provides practical modularity to RVQ 106 because each codebook 110 consumes the same space in memory.
Furthermore, for a fixed codebook size N_i, the number of vector quantizers N_q in the sequence determines the resulting bit rate of QFV 212, where a higher bit rate corresponds to a greater number of quantizers 108. Thus, RVQ 106 provides a convenient framework for variable (e.g., scalable) bit rates through structured dropping of quantizers 108. That is, the audio compression and decompression system 100/200 may vary the number of quantizers 108 in the sequence to facilitate adjustable performance at any desired bit rate, while reducing overall memory usage compared to maintaining multiple fixed-bit-rate codecs. Because of these capabilities, the compression/decompression system 100/200 may be particularly suited for low-latency implementations and for devices with limited computing resources (e.g., smartphones, tablets, watches, etc.).
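As an illustration of how the number of quantizers determines the bit rate, consider the following back-of-the-envelope calculation; the sample rate, encoder striding factor, and codebook size are hypothetical values chosen only for this example.

```python
sample_rate = 24_000                                  # audio samples per second (assumed)
striding_factor = 320                                 # encoder downsampling (assumed)
frames_per_second = sample_rate / striding_factor     # 75 feature vectors per second
codebook_size = 1024                                  # N_i code vectors -> 10 bits per codeword
bits_per_codeword = 10

for n_q in (2, 4, 8):                                 # number of vector quantizers actually used
    bitrate = frames_per_second * n_q * bits_per_codeword
    print(f"n_q={n_q}: {bitrate / 1000:.1f} kbps")    # 1.5, 3.0, 6.0 kbps
```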
The CFV 210 may then be further compressed into a compressed representation of the audio waveform 114, for example, using the entropy codec 302. The entropy codec 302 may implement any suitable lossless entropy coding, such as arithmetic coding, Huffman coding, and the like.
Referring to fig. 2, a decompression system 200 receives a compressed representation of an audio waveform 114. In general, the compressed audio waveform 114 may represent any type of audio content, such as speech, music, general audio, and the like. Indeed, as previously described, the audio compression and decompression system 100/200 may be implemented for specific tasks (e.g., voice customized compression/decompression) to optimize around specific types of audio content.
The compressed audio waveform 114 may be decompressed to CFV 210, for example, using entropy codec 302. The CFV 210 are then processed by RVQ 106 into QFV 212. As described above, each CFV 210 includes codewords (e.g., indexes) that identify a code vector in the corresponding codebook 110 of each vector quantizer 108. The combination of code vectors specified by each CFV 210 identifies the corresponding QFV 212. Typically, the code vectors identified by each CFV 210 are summed to form the corresponding QFV 212.
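On the decompression side, recovering a QFV from the codewords of a CFV amounts to a table lookup per quantizer followed by a sum. A minimal Python sketch, assuming the same hypothetical codebook layout as in the earlier quantization sketch:

```python
import numpy as np

def dequantize(codewords, codebooks):
    """Reconstruct the quantized feature vector (QFV) from one CFV.

    codewords: one codeword (index) per vector quantizer in the sequence.
    codebooks: list of arrays, each of shape [num_code_vectors, dim].
    """
    return np.sum([codebooks[i][idx] for i, idx in enumerate(codewords)], axis=0)
```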
The decoder 104 may then process (e.g., decode) the QFV 212 to generate the audio waveform 112. The decoder 104 mirrors the process of the encoder 102, producing an output waveform starting from the (quantized) feature vectors. The decoder 104 may upsample the QFV 212 to generate the output waveform 206 at a higher sampling rate than the QFV 212. For example, decoder 104 may use multiple convolutional layers with decreasing strides to generate the output waveform 206 at the higher sampling rate (e.g., higher temporal resolution).
Note that the compression/decompression system 100/200 may be implemented in a variety of different embodiments, such as integrated as a single system or separate systems. Furthermore, the components of each of the compression/decompression systems 100/200 need not be constrained to a single client device. For example, in some embodiments, the compression system 100 stores the compressed audio waveform 114 in local memory, which is then retrieved from local memory by the decompression system 200. In other implementations, the compression system 100 on the transmitter client transmits the compressed audio waveform 114 across a network (e.g., the internet, 5G cellular network, bluetooth, wi-Fi, etc.), which may be received by the decompression system 200 on the receiver client.
As will be described in more detail below, the neural network architecture may be trained using the training system 300. Training system 300 may enable efficient generic compression or custom compression (e.g., speech customization) by utilizing a set of suitable training examples 116 and various training processes. Specifically, training system 300 may jointly train the encoder neural network 102 and the decoder neural network 104 to efficiently encode and decode the feature vectors 208 of the various waveforms contained in training examples 116. In addition, training system 300 may train RVQ 106 to efficiently quantize the feature vectors 208. In particular, each codebook 110 of each cascaded vector quantizer 108 may be trained to minimize quantization error. To facilitate trainable codebooks 110, each vector quantizer 108 may be implemented, for example, as a vector-quantized variational autoencoder (VQ-VAE).
When implementing such a data-driven training solution, the audio compression/decompression system 100/200 may be a complete "end-to-end" machine learning approach. In an end-to-end implementation, the compression/decompression system 100/200 utilizes neural networks for all tasks involved in training and post-training inference; no external system performs processing such as feature extraction. In general, training system 300 may utilize an unsupervised learning algorithm, a semi-supervised learning algorithm, a supervised learning algorithm, or a combination of these algorithms. For example, training system 300 may balance reconstruction losses with adversarial losses to achieve audio compression that is both faithful and perceptually similar to the original audio when played back.
In general, the neural networks included in the audio compression/decompression systems 100/200 may have any suitable neural network architecture that enables them to perform their described functions. In particular, the neural networks may each include any suitable number (e.g., 5, 10, or 100 layers) and any suitable neural network layers (e.g., fully connected layers, convolutional layers, attention layers, etc.) arranged in any suitable configuration (e.g., as a linear sequence of layers).
In some embodiments, the compression/decompression system 100/200 utilizes a fully convolutional neural network architecture. Fig. 8A and 8B illustrate an example implementation of such an architecture for the encoder 102 and decoder 104 neural networks. The fully convolutional architecture may be particularly advantageous for low-latency compression because it has sparser connectivity than a fully connected network (e.g., a multi-layer perceptron) and has filters (e.g., kernels) that can be optimized to limit coding artifacts. Furthermore, convolutional neural networks provide an effective means of resampling the waveform 112, i.e., by using different strides for different convolutional layers to change the temporal resolution of the waveform 112.
In a further embodiment, the compression/decompression system 100/200 uses strictly causal convolutions in implementing the fully convolutional architecture, such that in both training and offline inference, padding is applied only to the past and not to the future, and no padding is required for streaming inference. In this case, the total latency of the compression and decompression system 100/200 is entirely determined by the temporal resampling ratio between the waveforms and their corresponding feature vectors.
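The following sketch shows how strided causal convolutions can implement the causal padding and temporal resampling described above, using PyTorch-style layers; the channel counts, kernel sizes, and strides are illustrative assumptions and do not reproduce the architecture of FIGS. 8A and 8B.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that pads only on the left (the past), never the future."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.left_pad = kernel_size - stride          # amount of past context to pad
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                             # x: [batch, channels, time]
        x = nn.functional.pad(x, (self.left_pad, 0))  # causal (left-only) padding
        return self.conv(x)

# Two strided causal convolutions downsample the waveform by 4x in time; a
# transposed convolution on the decoder side would upsample it back.
encoder_stage = nn.Sequential(
    CausalConv1d(1, 32, kernel_size=7, stride=2),
    nn.ELU(),
    CausalConv1d(32, 64, kernel_size=7, stride=2),
)
waveform = torch.randn(1, 1, 16_000)                  # one second of 16 kHz audio (assumed)
features = encoder_stage(waveform)                    # time axis reduced by the product of strides
```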
Fig. 3 illustrates operations performed by the example training system 300 to jointly train the encoder neural network 102, the decoder neural network 104, and the residual vector quantizer 106. The neural networks are trained end to end on an objective function 214, which may include a plurality of reconstruction losses. In some embodiments, a discriminator neural network 216 is also trained to provide an adversarial loss 218 and, in some cases, an additional reconstruction loss.
Training system 300 receives a set of training examples 116. Each training example 116 includes a respective input audio waveform 202 and a corresponding target audio waveform 204 that the neural networks are trained to reconstruct. That is, using the objective function 214, the target waveform 204 may be compared to the resulting output audio waveform 206 to evaluate the performance of the neural networks. Specifically, the objective function 214 may include a reconstruction loss that measures an error between the target waveform 204 and the output waveform 206. In some cases, a point-wise reconstruction loss on the raw waveform is used, for example, the mean squared error between the waveforms.
However, this type of reconstruction loss may have limitations in some cases, for example, because two different waveforms may sound perceptually the same, while point-wise similar waveforms may sound very different. To alleviate this problem, the objective function 214 may utilize a multi-scale spectral reconstruction loss that measures the error between mel-spectrograms of the target waveform 204 and the output waveform 206. A spectrogram characterizes the frequency spectrum of an audio waveform over time, for example using a Short-Time Fourier Transform (STFT). A mel-spectrogram is a spectrogram converted to the mel scale. Since humans typically do not perceive sound frequencies on a linear scale, the mel scale may weight the frequency components appropriately to improve fidelity. For example, the reconstruction loss between a target waveform $x$ and an output waveform $\hat{x}$ may include terms that measure absolute and logarithmic errors of the mel-spectrogram,

$$\mathcal{L}_{rec} = \sum_{s \in \{2^6, \ldots, 2^{11}\}} \sum_t \left\| \mathcal{S}_t^s(x) - \mathcal{S}_t^s(\hat{x}) \right\|_1 + \alpha_s \sum_t \left\| \log \mathcal{S}_t^s(x) - \log \mathcal{S}_t^s(\hat{x}) \right\|_2,$$

where $\|\cdot\|_n$ denotes the $L_n$ norm. Other reconstruction losses are also possible, although meeting strictly proper scoring rules may be desirable for training purposes. Here, $\mathcal{S}_t^s(\cdot)$ denotes the $t$-th frame (e.g., time slice) of a 64-bin mel-spectrogram computed with a window length equal to $s$ and a hop length equal to $s/4$. The coefficients $\alpha_s$ weight the log-spectral term at each scale $s$ and can be set, e.g., to $\alpha_s = \sqrt{s/2}$.
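The following Python sketch expresses the multi-scale spectral reconstruction loss in code, assuming a mel-spectrogram helper from librosa; the window lengths, 64-bin mel resolution, and hop length of s/4 follow the description above, while the α_s weighting and the remaining details are assumptions of this sketch.

```python
import numpy as np
import librosa

def multiscale_spectral_loss(target, output, sample_rate=16_000, eps=1e-5):
    """Sum of L1 and weighted log-L2 mel-spectrogram errors over several window lengths."""
    loss = 0.0
    for s in (2**6, 2**7, 2**8, 2**9, 2**10, 2**11):      # window lengths 64 ... 2048
        def mel(x):
            return librosa.feature.melspectrogram(
                y=x, sr=sample_rate, n_fft=s, hop_length=s // 4, n_mels=64)
        S_t, S_o = mel(target), mel(output)                # shape [64 mel bins, frames]
        alpha_s = np.sqrt(s / 2.0)                         # assumed per-scale weighting
        loss += np.abs(S_t - S_o).sum()                    # absolute (L1) spectral error
        log_diff = np.log(S_t + eps) - np.log(S_o + eps)
        loss += alpha_s * np.sqrt((log_diff ** 2).sum(axis=0)).sum()   # per-frame log-L2 error
    return loss
```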
As previously described, the set of training examples 116 may be selected to implement various modes of the compression/decompression system 100/200, such as generic audio compression, voice customization compression, and the like. For example, to train generic audio compression, training examples 116 may include speech, music, and generic audio waveforms. In other implementations, the training examples 116 may include only music waveforms to facilitate optimal music compression and playback.
In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural network toward faithful and perceptually similar reconstruction. However, the target waveform 204 may also be modified relative to the input waveform 202 to encourage more complex functions, such as joint compression and enhancement. Enhanced properties may be determined by designing training examples 116 with specific qualities. For example, the target waveform 204 may be a speech enhanced version of the input waveform 202 such that the neural network improves the audio dialog as the waveform is reconstructed. Alternatively or additionally, the target waveform 204 may be a de-noised version of the input waveform 202 that trains the network to suppress background noise. In general, this technique may be used to enable any desired audio enhancement.
In further implementations, the encoder 102 and/or decoder 104 may be conditioned on data, typically included in the training examples 116, that defines whether the target waveform 204 is the same as the input waveform 202 or an enhanced version of the waveform 202. For example, the training examples 116 may include a conditioning signal representing two modes (enhancement enabled or disabled) such that the neural networks are trained to enable enhancement only when the signal is present. To achieve this, the encoder 102 and/or decoder 104 may have a dedicated layer, such as a feature-wise linear modulation (FiLM) layer, to process the conditioning signal. After training, this technique may allow the audio compression/decompression system 100/200 to flexibly control enhancement in real time by feeding the conditioning signal through the network. Accordingly, the compression system 100 may implement such controllable enhancement to allow compression of acoustic scenes and natural sounds that would otherwise be removed by the enhancement (e.g., denoising).
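As an illustration of such a conditioning layer, the following is a minimal, hypothetical feature-wise linear modulation (FiLM) sketch in PyTorch; it is not the specific conditioning mechanism of the described system, and the layer dimensions and interface are assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Scales and shifts feature channels based on a conditioning signal."""
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, num_channels)
        self.to_shift = nn.Linear(cond_dim, num_channels)

    def forward(self, features, conditioning):
        # features: [batch, channels, time]; conditioning: [batch, cond_dim]
        scale = self.to_scale(conditioning).unsqueeze(-1)   # [batch, channels, 1]
        shift = self.to_shift(conditioning).unsqueeze(-1)
        return scale * features + shift

# The conditioning signal could be, e.g., a one-hot flag: enhancement enabled or disabled.
```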
Returning now to the encoder neural network 102, the input waveform 202 is processed by the encoder 102 and encoded as a sequence of feature vectors 208. This process, which may involve multiple encoder network layers, may be collectively defined by an encoder function $\varepsilon_\theta$ that maps the input waveform $x$ to feature vectors $y$ such that $y(x) = \varepsilon_\theta(x)$. The encoder function $\varepsilon_\theta$ is parameterized by the encoder network parameters $\theta$, which may be updated using the objective function 214 to minimize losses during encoding.
The feature vectors 208 are then compressed by RVQ 106 to generate the compiled CFV 210 and corresponding QFV 212. The quantization process of RVQ 106, which may involve multiple vector quantizers 108, may be collectively defined by an RVQ function $\mathcal{Q}_\psi$ that maps the feature vectors $y$ to QFV $\hat{y}$ such that $\hat{y}(y) = \mathcal{Q}_\psi(y)$. The RVQ function $\mathcal{Q}_\psi$ is parameterized by the codebook parameters $\psi$, which can be updated using the objective function 214 to minimize losses during quantization.
Training system 300 may minimize quantization losses associated with RVQ 106 by properly aligning the code vectors with the vector space of the feature vectors 208. That is, training system 300 may update the codebook parameters ψ by backpropagating the gradient of the objective function 214. For example, the codebooks 110 may be repeatedly updated during training using an exponential moving average of the feature vectors 208. Training system 300 may also improve the use of the codebooks 110 by running a k-means algorithm on the first set of training examples 116 and using the learned centroids as an initialization for subsequent training examples 116. Alternatively or additionally, if a code vector has not been assigned to a feature vector 208 for multiple training examples 116, the training system 300 may replace it with a random feature vector 208 sampled during the current training example 116. For example, training system 300 may track an exponential moving average (with a decay factor of 0.99) of the assignments to each code vector and replace any code vector whose statistic falls below 2.
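The codebook maintenance just described (exponential moving averages and replacement of rarely used code vectors) can be sketched as follows; the decay factor of 0.99 and the replacement threshold of 2 follow the description, while the data layout, the EMA update toward assigned-feature means, and the helper names are assumptions of this sketch.

```python
import numpy as np

def update_codebook(codebook, usage_ema, features, assignments, decay=0.99):
    """EMA-based codebook maintenance for one vector quantizer (illustrative sketch).

    codebook:    array of shape [num_code_vectors, dim]
    usage_ema:   running exponential moving average of assignments per code vector
    features:    array of shape [num_features, dim] of feature (or residual) vectors
    assignments: index of the code vector assigned to each feature vector
    """
    counts = np.bincount(assignments, minlength=len(codebook))
    usage_ema = decay * usage_ema + (1.0 - decay) * counts
    # Move each used code vector toward the mean of the features assigned to it (EMA).
    for i in np.unique(assignments):
        codebook[i] = decay * codebook[i] + (1.0 - decay) * features[assignments == i].mean(axis=0)
    # Replace code vectors whose usage statistic falls below 2 with random feature vectors.
    for i in np.where(usage_ema < 2.0)[0]:
        codebook[i] = features[np.random.randint(len(features))]
    return codebook, usage_ema
```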
To adequately train the neural networks for variable (e.g., scalable) bit rates, training system 300 may select a particular number $n_q$ of quantizers 108 for each training example 116 such that the number of quantizers 108 differs between training examples 116. For example, for each training example 116, training system 300 may sample $n_q$ uniformly at random from $[1, N_q]$ and use only the first $i = 1, \ldots, n_q$ quantizers 108 in the sequence. Thus, the networks are trained to encode and decode audio waveforms at all target bit rates corresponding to the range $n_q = 1, \ldots, N_q$, and no architectural changes are required for the encoder 102 or decoder 104. After training, the audio compression and decompression system 100/200 may select, during compression and decompression, a particular number $n_q$ of quantizers adapted to the desired bit rate.
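A minimal sketch of this per-example sampling of the number of quantizers; only the uniform sampling over [1, N_q] follows the description, the rest is a placeholder.

```python
import random

def quantizers_for_training_example(all_quantizers):
    """Use only the first n_q quantizers, with n_q sampled uniformly from [1, N_q]."""
    n_q = random.randint(1, len(all_quantizers))   # inclusive on both ends
    return all_quantizers[:n_q]
```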
Reference is now made to the decoder neural network 104. The QFV 212 are processed by the decoder 104 and decoded into the output audio waveform 206. Similar to the encoder 102, decoding may involve multiple decoder network layers, which may be collectively defined by a decoder function $\mathcal{G}_\phi$ that maps the input QFV $\hat{y}$ to the output waveform $\hat{x}$ such that $\hat{x}(\hat{y}) = \mathcal{G}_\phi(\hat{y})$. In some embodiments, the input waveform $x$ and the output waveform $\hat{x}$ have the same sampling rate, but this is not necessarily the case. The decoder function $\mathcal{G}_\phi$ is parameterized by the decoder network parameters $\phi$, which can be updated using the objective function 214 to minimize losses during decoding.
Because the output waveform $\hat{x}$ generally depends on the encoder network parameters $\theta$, the codebook parameters $\psi$, and the decoder network parameters $\phi$, these network parameters may be updated using an objective function 214 that includes a reconstruction loss between the output waveform 206 and the target waveform 204 of each training example 116. Specifically, the gradient of the objective function 214 may be calculated to iteratively update the network parameters by backpropagation, for example, using a gradient descent method. In general, the network parameters are updated with the goal of optimizing the objective function 214.
In some implementations, the training system 300 utilizes the discriminator neural network 216 to incorporate an adversarial loss 218, and potentially additional reconstruction losses, into the objective function 214. The adversarial loss 218 may improve the perceptual quality of the waveforms reconstructed by the neural networks. In this case, the discriminator 216 is jointly trained by the training system 300 and competes with the encoder 102, decoder 104, and RVQ 106. That is, the discriminator 216 is trained to distinguish the target waveform 204 from the output waveform 206, while the encoder 102, decoder 104, and RVQ 106 are trained to fool the discriminator 216.
The discriminator 216 may realize the adversarial loss 218 using discriminator scores indexed by $k = \{1, 2, \ldots, K\}$, such that each score characterizes an estimated likelihood that the output waveform 206 was not generated as output of the decoder 104. For example, the discriminator 216 may receive the output waveform $\hat{x}$ from the decoder 104 and process the waveform using one or more neural network layers to generate logits $\mathcal{D}_{k,t}(\hat{x})$. Here, $\mathcal{D}_k$ is a discriminator function that maps an input waveform to output logits, $k$ indexes a particular discriminator output, and $t$ indexes a particular logit of that discriminator output. In some implementations, the discriminator 216 utilizes a fully convolutional neural network such that the number of logits is proportional to the length of the input waveform.
The discriminator 216 may use the logits to determine a corresponding discriminator score for each discriminator output $k$. For example, each score may be determined as

$$\mathcal{L}_k = \mathbb{E}_x\!\left[\frac{1}{T_k} \sum_t \max\!\big(0,\, 1 - \mathcal{D}_{k,t}(\hat{x})\big)\right],$$

where $T_k$ is the number of logits of output $k$, and $\mathbb{E}_x$ is the expectation over $x$. In some embodiments, the adversarial loss $\mathcal{L}_{adv}$ is the average over the discriminator scores,

$$\mathcal{L}_{adv} = \frac{1}{K} \sum_{k} \mathcal{L}_k.$$
The adversarial loss $\mathcal{L}_{adv}$ may be included in the objective function 214 to improve the perceptual quality of the reconstructed waveforms. Furthermore, the discriminator 216 may be trained by the training system 300 to minimize a discriminator loss function $\mathcal{L}_{\mathcal{D}}$ so as to distinguish the target waveform $\tilde{x}$ from the output waveform $\hat{x}$. In some embodiments, $\mathcal{L}_{\mathcal{D}}$ has the following form:

$$\mathcal{L}_{\mathcal{D}} = \mathbb{E}_x\!\left[\frac{1}{K}\sum_k \frac{1}{T_k} \sum_t \max\!\big(0,\, 1 - \mathcal{D}_{k,t}(\tilde{x})\big)\right] + \mathbb{E}_x\!\left[\frac{1}{K}\sum_k \frac{1}{T_k} \sum_t \max\!\big(0,\, 1 + \mathcal{D}_{k,t}(\hat{x})\big)\right].$$

Note that the target waveform $\tilde{x}$ generally depends on the input waveform $x$, as it may be identical to the input waveform or an enhanced version of it. By training the discriminator 216 to effectively classify the target waveform 204 against the output waveform 206 with respect to $\mathcal{L}_{\mathcal{D}}$, the encoder 102, decoder 104, and RVQ 106 learn to minimize $\mathcal{L}_{adv}$ in order to fool the discriminator.
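A sketch of hinge-style adversarial and discriminator losses corresponding to the formulas above; the list-of-logits interface (one tensor per discriminator output k) and the averaging are assumptions of this sketch rather than the specific implementation of the described system.

```python
import torch

def adversarial_loss(fake_logits):
    """Generator-side adversarial loss: hinge on the logits of each discriminator output k."""
    per_output = [torch.relu(1.0 - logits).mean() for logits in fake_logits]
    return torch.stack(per_output).mean()

def discriminator_loss(real_logits, fake_logits):
    """Discriminator-side loss: push target-waveform logits above +1 and output-waveform logits below -1."""
    real = [torch.relu(1.0 - logits).mean() for logits in real_logits]
    fake = [torch.relu(1.0 + logits).mean() for logits in fake_logits]
    return torch.stack(real).mean() + torch.stack(fake).mean()
```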
In some implementations, the discriminator 216 utilizes different versions of the waveform to determine the discriminator scores. For example, in addition to the original waveform, discriminator 216 may use downsampled versions of the waveform (e.g., 2x downsampled, 4x downsampled, etc.) or a Fourier-transformed version of the waveform (e.g., STFT, Hartley transform, etc.), which increases the diversity of the adversarial loss 218. As a specific implementation using four discriminator scores, one score may correspond to the STFT of the waveform, while the remaining discriminator scores may correspond to the original waveform, the 2x downsampled waveform, and the 4x downsampled waveform.
In a further embodiment, the discriminator 216 introduces a reconstruction loss in the form of a "feature loss." Specifically, the feature loss $\mathcal{L}_{feat}$ measures an error between the discriminator's internal-layer outputs for the target audio waveform 204 and for the output audio waveform 206. For example, the feature loss $\mathcal{L}_{feat}$ can be expressed as the mean absolute difference between the discriminator outputs for the target waveform $\tilde{x}$ and for the output waveform $\hat{x}$ at each internal layer $l \in \{1, 2, \ldots, L\}$:

$$\mathcal{L}_{feat} = \mathbb{E}_x\!\left[\frac{1}{K L}\sum_{k,l} \frac{1}{T_{k,l}} \sum_t \left| \mathcal{D}_{k,t}^{(l)}(\tilde{x}) - \mathcal{D}_{k,t}^{(l)}(\hat{x}) \right|\right].$$
The feature loss may be a useful tool to facilitate increased fidelity between the output waveform 206 and the target waveform 204. Taking all of the above loss terms into account, the objective function $\mathcal{L}$ may control the trade-off between reconstruction loss, adversarial loss, and feature loss,

$$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{feat}\,\mathcal{L}_{feat}.$$

By weighting the respective loss terms with weighting factors $\lambda_{rec}$, $\lambda_{adv}$, and $\lambda_{feat}$, the objective function 214 may emphasize certain characteristics such as faithful reconstruction, fidelity, perceptual quality, and so forth. In some embodiments, the weighting factors are set to $\lambda_{rec} = \lambda_{adv} = 1$ and $\lambda_{feat} = 100$.
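Putting the loss terms together, the following sketch shows the feature loss and the weighted overall objective; the nested list-of-intermediates interface is an assumption of this sketch, while the default weights follow the values given above.

```python
import torch

def feature_loss(real_intermediates, fake_intermediates):
    """Mean absolute difference between discriminator internal-layer outputs
    for the target waveform and for the output waveform."""
    losses = []
    for real_layers, fake_layers in zip(real_intermediates, fake_intermediates):  # per output k
        for r, f in zip(real_layers, fake_layers):                                # per layer l
            losses.append((r - f).abs().mean())
    return torch.stack(losses).mean()

def total_objective(l_rec, l_adv, l_feat, lam_rec=1.0, lam_adv=1.0, lam_feat=100.0):
    """Weighted objective with the weighting factors described above."""
    return lam_rec * l_rec + lam_adv * l_adv + lam_feat * l_feat
```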
Fig. 4 is a flow chart of an example process 400 for compressing an audio waveform. For convenience, process 400 will be described as being performed by a system of one or more computers located at one or more locations. For example, an audio compression system, such as audio compression system 100 of fig. 1, suitably programmed in accordance with the present description, may perform process 400.
The system receives an audio waveform (402). The audio waveform includes a respective audio sample at each of a plurality of time steps. In some cases, the time step may correspond to a particular sampling rate.
The system processes the audio waveform using an encoder neural network to generate feature vectors representing the audio waveform (404).
The system processes each feature vector using a plurality of vector quantizers to generate a respective compiled representation of the feature vector (406), wherein each vector quantizer is associated with a respective codebook of code vectors. Each compiled representation of a feature vector identifies a plurality of code vectors including a code vector from a codebook of each vector quantizer, the plurality of code vectors defining a respective quantized representation of the feature vector. In some implementations, the respective quantized representations of the feature vectors are defined by a sum of a plurality of code vectors.
The system compresses the compiled representation of the feature vector to generate a compressed representation of the audio waveform (408). In some implementations, the system uses entropy encoding to compress the compiled representation of the feature vector.
Fig. 5 is a flow chart of an example process 500 for decompressing a compressed audio waveform. For convenience, process 500 will be described as being performed by a system of one or more computers located at one or more locations. For example, an audio decompression system, such as audio decompression system 200 of fig. 2, suitably programmed according to the present description, may perform process 500.
The system receives a compressed representation of an input audio waveform (502).
The system decompresses the compressed representation of the audio waveform to obtain a compiled representation of feature vectors representing the input audio waveform (504). In some implementations, the system uses entropy decoding to decompress a compressed representation of the input audio waveform.
For each compiled representation of the feature vector, the system identifies a plurality of code vectors including the code vector from the codebook of each vector quantizer, the plurality of code vectors defining a respective quantized representation of the feature vector (506). In some implementations, the respective quantized representations of the feature vectors are defined by a sum of a plurality of code vectors.
The system processes the quantized representation of the feature vector using a decoder neural network to generate an output audio waveform (510). The output audio waveform may include a respective audio sample at each of a plurality of time steps. In some cases, the time step may correspond to a particular sampling rate.
Fig. 6 is a flow diagram of an example process 600 for generating a quantized representation of a feature vector using a residual vector quantizer. For convenience, process 600 will be described as being performed by a system of one or more computers located at one or more locations.
The system receives a feature vector at a first vector quantizer in a sequence of vector quantizers (602).
The system identifies a code vector from a codebook of a first vector quantizer in the sequence based on the feature vector to represent the feature vector (604). For example, a distance metric (e.g., error) between the feature vector and each code vector in the codebook may be calculated. The code vector with the smallest distance metric may be selected to represent the feature vector.
The system determines a current residual vector based on an error between the feature vector and a code vector representing the feature vector (606). For example, the residual vector may be the difference between the feature vector and the code vector representing the feature vector. Codewords corresponding to code vectors representing feature vectors may be stored in a compiled representation of the feature vectors.
The system receives at a next vector quantizer in the sequence a current residual vector generated by a previous vector quantizer in the sequence (608).
The system identifies a code vector from a codebook of a next vector quantizer in the sequence based on the current residual vector to represent the current residual vector (610). For example, a distance metric (e.g., error) between the current residual vector and each code vector in the codebook may be calculated. The code vector with the smallest distance metric may be selected to represent the current residual vector. Codewords corresponding to the code vector representing the current residual vector may be stored in a compiled representation of the feature vector.
The system updates the current residual vector based on an error between the current residual vector and a code vector representing the current residual vector (612). For example, the current residual vector may be updated by subtracting a code vector representing the current residual vector from the current residual vector.
Steps 606-612 may be repeated for each residual vector quantizer in the sequence. The final compiled representation of the feature vector contains a codeword for each code vector selected from its corresponding codebook during process 600. The quantized representation of the feature vector corresponds to the sum of all code vectors specified by the codewords of the compiled representation of the feature vector. In some implementations, the codebooks of the vector quantizers in the sequence include an equal number of code vectors, such that each codebook is allocated the same space in memory.
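Purely for illustration, the residual quantization loop of process 600 can be sketched as follows, assuming a Euclidean distance metric; the function name, array shapes, and use of NumPy are assumptions of the sketch rather than features of the described system.

```python
# Minimal sketch of residual vector quantization (process 600), assuming a
# Euclidean distance metric between residuals and code vectors.
import numpy as np

def residual_vector_quantize(feature_vector: np.ndarray,
                             codebooks: np.ndarray) -> tuple[list[int], np.ndarray]:
    """Quantizes one feature vector with a sequence of vector quantizers.

    Args:
      feature_vector: float array of shape [D].
      codebooks: float array of shape [num_quantizers, codebook_size, D].

    Returns:
      codewords: index of the selected code vector in each quantizer's codebook.
      quantized: sum of the selected code vectors (the quantized representation).
    """
    codewords = []
    residual = feature_vector.copy()
    quantized = np.zeros_like(feature_vector)
    for codebook in codebooks:
        # Select the code vector closest to the current residual (steps 604/610).
        distances = np.linalg.norm(codebook - residual, axis=-1)
        index = int(np.argmin(distances))
        codewords.append(index)
        quantized += codebook[index]
        # Update the residual with the quantization error (steps 606/612).
        residual = residual - codebook[index]
    return codewords, quantized
```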
Fig. 7 is a flow chart of an example process 700 for jointly training an encoder neural network, a decoder neural network, and a residual vector quantizer. For convenience, process 700 will be described as being performed by a system of one or more computers located at one or more locations. For example, a training system, such as training system 300 of FIG. 3, suitably programmed in accordance with the present description, may perform process 700.
The system obtains training examples including respective input audio waveforms and corresponding target audio waveforms (702). In some implementations, the target audio waveform of one or more training examples may be an enhanced version of the input audio waveform, such as a denoised version of the input audio waveform. The target audio waveform of one or more training examples may also be the same as the input audio waveform. In some implementations, the input audio waveform may be a speech or music waveform.
The system processes the input audio waveform of each training example using an encoder neural network, a plurality of vector quantizers, and a decoder neural network to generate a corresponding output audio waveform (704), where each vector quantizer is associated with a corresponding codebook. In some implementations, the encoder and/or decoder neural network is conditioned on data defining whether the corresponding target audio waveform is the input audio waveform itself or an enhanced version of the input audio waveform.
The system determines, e.g., using backpropagation, a gradient of an objective function that depends on the respective output audio waveform and target audio waveform of each training example (706).
The system uses the gradient of the objective function to update one or more of the following: a set of encoder network parameters, a set of decoder network parameters, or the codebooks of the plurality of vector quantizers (708). For example, the parameters may be updated using the update rule of any suitable gradient descent optimization technique (e.g., RMSprop, Adam, etc.).
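For illustration, one possible training step for process 700 is sketched below in PyTorch, using a simple L2 reconstruction loss in place of the full objective (which may also include multi-scale spectral and adversarial terms); the encoder, quantizer, decoder, and optimizer objects are placeholders, and the straight-through handling of the quantization step is an assumption of the sketch.

```python
# Hedged sketch of a single training step for process 700; all objects passed in
# are assumed placeholders (e.g., the quantizer returns the quantized feature
# vectors and their codewords), and the loss is a simplified stand-in.
import torch

def training_step(encoder, quantizer, decoder, optimizer,
                  input_waveform: torch.Tensor,
                  target_waveform: torch.Tensor) -> float:
    features = encoder(input_waveform)            # step 704: encode
    quantized, _codewords = quantizer(features)   # residual vector quantization
    output_waveform = decoder(quantized)          # decode

    # Step 706: objective that depends on the output and target audio waveforms.
    # In practice a straight-through estimator (or exponential-moving-average
    # codebook updates) is needed so gradients flow past the quantization step.
    loss = torch.mean((output_waveform - target_waveform) ** 2)

    # Step 708: update encoder/decoder parameters (and possibly the codebooks).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```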
Fig. 8A and 8B illustrate examples of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C denotes the number of channels and D is the dimension of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495-507, 2022. The model is an adaptation, without skip connections, of the SEANet encoder-decoder network designed by Y. Li, M. Tagliasacchi, O. Rybakov, V. Ungureanu and D. Roblek, "Real-Time Speech Frequency Bandwidth Extension," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 691-695.
The encoder 102 includes a Conv1D layer 802 followed by four EncoderBlocks 804. Each block includes three ResidualUnits 812 containing dilated convolutions with dilation rates of 1, 3 and 9, respectively, followed by a downsampling layer in the form of a strided convolution. The inner convolutional layers of the EncoderBlock 804 and the ResidualUnit 812 are shown in FIG. 8B. The number of channels is doubled each time downsampling is performed. A final Conv1D layer 802 with a kernel of length 3 and a stride of 1 sets the dimension of the feature vectors 208 to D. A FiLM conditioning layer 806 may also be implemented to process the conditioning signal for joint compression and enhancement. The FiLM layer 806 applies a feature-wise affine transformation to the feature vectors 208 of the neural network, conditioned on the conditioning signal.
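A hedged PyTorch sketch of the ResidualUnit 812 and EncoderBlock 804 described above follows; the kernel sizes, padding scheme, and ELU activations are assumptions, since only the dilation rates (1, 3, 9), the strided downsampling, and the channel doubling are stated here.

```python
# Illustrative sketch only; layer hyperparameters beyond those stated in the
# text (dilations 1/3/9, strided downsampling, channel doubling) are assumed.
import torch
from torch import nn

class ResidualUnit(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=7,
                      dilation=dilation, padding=3 * dilation),  # dilated conv
            nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layers(x)  # residual connection

class EncoderBlock(nn.Module):
    def __init__(self, in_channels: int, stride: int):
        super().__init__()
        out_channels = 2 * in_channels  # channels double at each downsampling
        self.layers = nn.Sequential(
            ResidualUnit(in_channels, dilation=1),
            ResidualUnit(in_channels, dilation=3),
            ResidualUnit(in_channels, dilation=9),
            # Downsampling via a strided convolution.
            nn.Conv1d(in_channels, out_channels, kernel_size=2 * stride,
                      stride=stride, padding=stride // 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)
```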
In this case, the decoder 104 effectively mirrors the encoder 102. The DecoderBlock 810 includes a transposed Conv1D layer 814 for upsampling, followed by three ResidualUnits 812. The inner convolutional layers of the DecoderBlock 810 and the ResidualUnit 812 are shown in FIG. 8B. The decoder 104 uses the same strides as the encoder 102, but in reverse order, to reconstruct a waveform having the same resolution as the input waveform. The number of channels is halved each time upsampling is performed. A final Conv1D layer 802 with one filter, a kernel of size 7, and a stride of 1 projects the feature vectors 208 back into the waveform 112. A FiLM conditioning layer 806 may also be implemented to process the conditioning signal for joint decompression and enhancement. In some embodiments, both the encoder 102 and the decoder 104 perform audio enhancement, while in other embodiments only one of the encoder 102 or the decoder 104 does so.
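A corresponding hedged sketch of the DecoderBlock 810 and of a FiLM-style conditioning layer 806 is given below, reusing the ResidualUnit class from the encoder sketch above; the kernel sizes, activations, and the way the conditioning signal is embedded are assumptions made for illustration only.

```python
# Illustrative sketch; assumes the ResidualUnit class from the encoder sketch is
# already defined in the same module.
import torch
from torch import nn

class DecoderBlock(nn.Module):
    def __init__(self, in_channels: int, stride: int):
        super().__init__()
        out_channels = in_channels // 2  # channels halve at each upsampling
        self.layers = nn.Sequential(
            # Upsampling via a transposed convolution.
            nn.ConvTranspose1d(in_channels, out_channels, kernel_size=2 * stride,
                               stride=stride, padding=stride // 2),
            ResidualUnit(out_channels, dilation=1),
            ResidualUnit(out_channels, dilation=3),
            ResidualUnit(out_channels, dilation=9),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

class FiLMLayer(nn.Module):
    """Feature-wise affine transformation conditioned on a conditioning signal."""
    def __init__(self, channels: int, conditioning_dim: int):
        super().__init__()
        self.scale = nn.Linear(conditioning_dim, channels)
        self.shift = nn.Linear(conditioning_dim, channels)

    def forward(self, features: torch.Tensor, conditioning: torch.Tensor) -> torch.Tensor:
        # features: [batch, channels, time]; conditioning: [batch, conditioning_dim]
        gamma = self.scale(conditioning).unsqueeze(-1)
        beta = self.shift(conditioning).unsqueeze(-1)
        return gamma * features + beta
```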
The term "configured" is used in this specification in connection with systems and computer program components. A system for one or more computers configured to perform a particular operation or action means that the system has installed thereon software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform the operation or action. For one or more computer programs configured to perform particular operations or actions, it is meant that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus may also be or further comprise a dedicated logic circuit, for example an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). In addition to hardware, the apparatus may optionally include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (also known as a program, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
In this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or multiple computers.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, or combinations of, special purpose logic circuitry, e.g., an FPGA or ASIC, and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general-purpose or special-purpose microprocessor or both, or any other kind of central processing unit. Typically, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory may be supplemented by, or incorporated in, special purpose logic circuitry. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, such devices are not required for a computer. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a Universal Serial Bus (USB) flash drive), to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example: semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disk; and CD ROM and DVD-ROM discs.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other types of devices may also be used to provide interaction with a user; for example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending documents to and receiving documents from a device used by the user; for example, by sending web pages to a web browser on a user device in response to requests received from the web browser. Further, the computer may interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smart phone running a messaging application) and receiving response messages from the user in return.
The data processing apparatus for implementing machine learning models may also include, for example, dedicated hardware accelerator units for processing common and computationally-intensive parts of machine learning training or production, i.e., inference, workloads.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., a TensorFlow framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a Web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), such as the Internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data (e.g., HTML pages) to the user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device as a client. Data generated at the user device, e.g., results of a user interaction, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings and described in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
The scope of protection claimed is as follows.

Claims (28)

1. A method performed by one or more computers, the method comprising:
receiving an audio waveform comprising a respective audio sample for each of a plurality of time steps;
processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform;
generating a respective compiled representation of each feature vector of the plurality of feature vectors using a plurality of vector quantizers, each associated with a respective codebook of code vectors,
wherein the respective compiled representation of each feature vector identifies a plurality of code vectors including respective code vectors from the codebook of each vector quantizer, the plurality of code vectors defining quantized representations of the feature vectors; and
generating a compressed representation of the audio waveform by compressing the respective compiled representation of each feature vector of the plurality of feature vectors.
2. The method of claim 1, wherein the plurality of vector quantizers are ordered into a sequence, and wherein for each feature vector of the plurality of feature vectors, generating the compiled representation of the feature vector comprises:
for a first vector quantizer in the sequence of vector quantizers:
receiving the feature vector;
identifying, based on the feature vector, a respective code vector from the codebook of the vector quantizer to represent the feature vector; and
determining a current residual vector based on an error between (i) the feature vector and (ii) the code vector representing the feature vector;
wherein the compiled representation of the feature vector identifies the code vector representing the feature vector.
3. The method of claim 2, wherein generating the compiled representation of the feature vector for each feature vector of the plurality of feature vectors further comprises:
for each vector quantizer following the first vector quantizer in the sequence of vector quantizers:
receiving a current residual vector generated by a previous vector quantizer in the sequence of vector quantizers;
identifying, based on the current residual vector, a respective code vector from the codebook of the vector quantizer to represent the current residual vector; and
if the vector quantizer is not the last vector quantizer in the sequence of vector quantizers:
updating the current residual vector based on an error between (i) the current residual vector and (ii) the code vector representing the current residual vector;
wherein the compiled representation of the feature vector identifies the code vector representing the current residual vector.
4. The method of any preceding claim, wherein generating the compressed representation of the audio waveform comprises:
entropy encoding the respective compiled representation of each feature vector of the plurality of feature vectors.
5. The method of any preceding claim, wherein the respective quantized representation of each feature vector is defined by a sum of the plurality of code vectors identified by the compiled representation of the feature vector.
6. The method of any preceding claim, wherein the codebooks of the plurality of vector quantizers all comprise an equal number of code vectors.
7. The method of any preceding claim, wherein the encoder neural network and the codebooks of the plurality of vector quantizers are co-trained with a decoder neural network, wherein the decoder neural network is configured to:
receiving a respective quantized representation of each of a plurality of feature vectors representing an input audio waveform generated using the encoder neural network and the plurality of vector quantizers; and
processing the quantized representations of the feature vectors representing the input audio waveform to generate an output audio waveform.
8. The method of claim 7, wherein the training comprises:
obtaining a plurality of training examples, each training example comprising: (i) A corresponding input audio waveform and (ii) a corresponding target audio waveform;
processing the respective input audio waveform from each training example using the encoder neural network, a plurality of vector quantizers from a sequence of vector quantizers, and the decoder neural network to generate an output audio waveform that is an estimate of the corresponding target audio waveform;
determining a gradient of an objective function that depends on the respective output of each training example and the target waveform; and
updating one or more of the following using the gradient of the objective function: a set of encoder neural network parameters, a set of decoder neural network parameters, or the codebooks of the plurality of vector quantizers.
9. The method of claim 8, wherein, for one or more of the training examples, the target audio waveform is an enhanced version of the input audio waveform.
10. The method of claim 9, wherein, for one or more of the training examples, the target audio waveform is a denoised version of the input audio waveform.
11. The method of any of claims 9-10, wherein the target audio waveform is the same as the input audio waveform for one or more of the training examples.
12. The method of claim 11, wherein processing each input audio waveform to generate a corresponding output audio waveform comprises:
the encoder neural network, the decoder neural network, or both are adjusted according to data defining whether the corresponding target audio waveform is (i) the input audio waveform or (ii) an enhanced version of the input audio waveform.
13. The method of any of claims 8-12, further comprising, for each training example:
selecting a respective number of vector quantizers to be used in quantizing feature vectors representing the input audio waveform;
generating the corresponding output audio waveform using only the selected number of vector quantizers from the sequence of vector quantizers.
14. The method of claim 13, wherein the selected number of vector quantizers to be used in quantizing feature vectors representing input audio waveforms varies between training examples.
15. The method of any of claims 13-14, wherein, for each training example, selecting the respective number of vector quantizers to be used in quantizing feature vectors representing the input audio waveform comprises:
randomly sampling the number of vector quantizers to be used in quantizing feature vectors representing the input audio waveform.
16. The method of any of claims 8-15, wherein the objective function comprises a reconstruction loss that measures, for each training example, an error between (i) the output audio waveform and (ii) the corresponding target audio waveform.
17. The method of claim 16, wherein for each training example, the reconstruction loss measures a multi-scale spectral error between (i) the output audio waveform and (ii) the corresponding target audio waveform.
18. The method of any of claims 8-17, wherein the training further comprises, for each training example:
processing data derived from the output audio waveform using a discriminator neural network to generate a set of one or more discriminator scores, wherein each discriminator score characterizes an estimated likelihood that the output audio waveform is an audio waveform generated using the encoder neural network, the plurality of vector quantizers, and the decoder neural network;
wherein the objective function comprises an antagonistic loss that depends on the discriminator score generated by the discriminator neural network.
19. The method of claim 18, wherein the data derived from the output audio waveform comprises the output audio waveform, a downsampled version of the output audio waveform, or a fourier transformed version of the output audio waveform.
20. The method of any of claims 18-19, wherein for each training example, the reconstruction loss measures an error between: (i) one or more intermediate outputs generated by the discriminator neural network by processing the output audio waveform, and (ii) one or more intermediate outputs generated by the discriminator neural network by processing the corresponding target audio waveform.
21. The method of any of claims 8-20, wherein, during the training, the codebooks of the plurality of vector quantizers are repeatedly updated using an exponential moving average of feature vectors generated by the encoder neural network.
22. The method of any preceding claim, wherein the encoder neural network comprises a sequence of encoder blocks, each encoder block being configured to process a respective set of input feature vectors according to a set of encoder block parameters to generate a set of output feature vectors having a lower temporal resolution than the set of input feature vectors.
23. The method of any of claims 7-22, wherein the decoder neural network comprises a sequence of decoder blocks, each decoder block configured to process a respective set of input feature vectors according to a set of decoder block parameters to generate a set of output feature vectors having a higher temporal resolution than the set of input feature vectors.
24. The method of any preceding claim, wherein the audio waveform is a speech waveform or a music waveform.
25. The method of any preceding claim, further comprising transmitting the compressed representation of the audio waveform over a network.
26. A method performed by one or more computers, the method comprising:
receiving a compressed representation of an input audio waveform;
decompressing the compressed representation of the input audio waveform to obtain a respective compiled representation of each of a plurality of feature vectors representing the input audio waveform,
wherein the compiled representation of each feature vector identifies a plurality of code vectors including respective code vectors from respective codebooks of each of a plurality of vector quantizers, the plurality of code vectors defining quantized representations of the feature vectors;
generating a respective quantized representation of each feature vector from the compiled representation of the feature vector; and
the quantized representation of the feature vector is processed using a decoder neural network to generate an output audio waveform.
27. A system, comprising:
one or more computers; and
one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-26.
28. One or more non-transitory computer storage media storing instructions which, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-26.
CN202280046175.9A 2021-07-02 2022-07-05 Compression of audio waveforms using neural networks and vector quantizers Pending CN117616498A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/218,139 2021-07-02
US17/856,856 US11600282B2 (en) 2021-07-02 2022-07-01 Compressing audio waveforms using neural networks and vector quantizers
US17/856,856 2022-07-01
PCT/US2022/036097 WO2023278889A1 (en) 2021-07-02 2022-07-05 Compressing audio waveforms using neural networks and vector quantizers

Publications (1)

Publication Number Publication Date
CN117616498A true CN117616498A (en) 2024-02-27

Family

ID=89944760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280046175.9A Pending CN117616498A (en) 2021-07-02 2022-07-05 Compression of audio waveforms using neural networks and vector quantizers

Country Status (1)

Country Link
CN (1) CN117616498A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination