CN116457797A - Method and apparatus for processing audio using neural network

Publication number: CN116457797A
Application number: CN202180076578.3A
Authority: CN (China)
Prior art keywords: audio signal, perceptual, neural network, domain audio, domain
Legal status: Pending
Other languages: Chinese (zh)
Inventors: M. S. Vinton, Cong Zhou, R. M. Fejgin, G. A. Davidson
Current and original assignee: Dolby Laboratories Licensing Corp.
Priority claimed from PCT/US2021/055090 (published as WO2022081915A1)

Classifications

    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/098: Distributed learning, e.g. federated learning
    • G10L 25/30: Speech or voice analysis techniques, not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the use of neural networks

Abstract

A method of processing an audio signal using a neural network, or using a first neural network and a second neural network, is described herein. A method of training the neural network, or of jointly training a set of the first neural network and the second neural network, is further described. Furthermore, a method of obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network, and a method of obtaining an audio signal from a latent feature space representation of a perceptual-domain audio signal using a neural network, are described. Corresponding apparatus and computer program products are also described.

Description

Method and apparatus for processing audio using neural network
Cross Reference to Related Applications
The present application claims priority to the following applications: U.S. provisional application 63/092,118, filed on October 15, 2020, and European application 20210968.2, filed on December 1, 2020, each of which is incorporated herein by reference.
Technical Field
The present disclosure relates generally to a method of processing audio signals using a neural network or using a first neural network and a second neural network, and in particular to a method of processing audio signals in the perceptual domain using a neural network or using a first neural network and a second neural network. The present disclosure further relates to a method of training the neural network or of jointly training a set of the first neural network and the second neural network. The present disclosure also relates to a method of obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network, and to a method of obtaining an audio signal from a latent feature space representation of a perceptual-domain audio signal using a neural network. The present disclosure also relates to corresponding apparatus and computer program products.
Although some embodiments will be described herein with particular reference to this disclosure, it should be understood that the disclosure is not limited to this field of use and is applicable to a broader context.
Background
Any discussion of the background art throughout the disclosure should in no way be considered as an admission that such art is widely known or forms part of the common general knowledge in the field.
High performance audio encoders and decoders exploit the limitations of the human auditory system to remove extraneous information that is not audible to humans. Typically, the encoding system uses psychoacoustic or perceptual models to calculate the corresponding masking thresholds. The masking threshold is then used to control the encoding process so that the impact of the introduced noise on hearing is minimized.
Neural networks have shown promise in many applications to date, including the encoding and/or decoding of images, video, and even speech. However, there remains a need for neural network solutions for general audio encoding and/or decoding that use typical training techniques, in particular in encoding and/or decoding applications involving perceptual-domain audio signals.
Disclosure of Invention
According to a first aspect of the present disclosure, a method of processing an audio signal using a neural network is provided. The method may comprise step (a): obtaining a perceptual-domain audio signal. The method may further comprise step (b): inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal. The method may further comprise step (c): obtaining a processed perceptual-domain audio signal as an output of the neural network. And the method may comprise step (d): converting the processed perceptual-domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
In some embodiments, processing the perceptual-domain audio signal through the neural network may be performed in the time domain.
In some embodiments, the method may further comprise: before step (d), converting the audio signal to the frequency domain.
In some embodiments, the neural network may be conditioned on information indicative of the mask.
In some embodiments, the neural network may be conditioned on the perceptual-domain audio signal.
In some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across time.
In some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across frequencies.
In some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across time and frequency.
In some embodiments, the perceptual-domain audio signal may be obtained by: (a) converting an audio signal from the original signal domain to the perceptual domain by applying the mask; (b) encoding the perceptual-domain audio signal; and (c) decoding the perceptual-domain audio signal.
In some embodiments, quantization may be applied to the perceptual domain audio signal prior to encoding and inverse quantization may be applied to the perceptual domain audio signal after decoding.
According to a second aspect of the present disclosure, a method of processing an audio signal using a first neural network and a second neural network is provided. The method may comprise step (a): obtaining, by a first device, a perceptual-domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in the original signal domain. The method may further comprise step (b): inputting the perceptual-domain audio signal into the first neural network to map the perceptual-domain audio signal to a latent feature space representation. The method may further comprise step (c): obtaining the latent feature space representation as an output of the first neural network. The method may further comprise step (d): transmitting the latent feature space representation of the perceptual-domain audio signal and the mask to a second device. The method may further comprise step (e): receiving, by the second device, the latent feature space representation of the perceptual-domain audio signal and the mask. The method may further comprise step (f): inputting the latent feature space representation into the second neural network to generate an approximated perceptual-domain audio signal. The method may further comprise step (g): obtaining the approximated perceptual-domain audio signal as an output of the second neural network. And the method may comprise step (h): converting the approximated perceptual-domain audio signal to the original signal domain based on the mask.
In some embodiments, the method may further comprise: encoding the mask and the latent feature space representation of the perceptual-domain audio signal into a bitstream, and transmitting the bitstream to the second device; and the method may further comprise: receiving, by the second device, the bitstream, and decoding the bitstream to obtain the latent feature space representation of the perceptual-domain audio signal and the mask.
In some embodiments, the latent feature space representation of the perceptual-domain audio signal and the mask may be quantized before being encoded into the bitstream and dequantized before being processed by the second neural network.
In some embodiments, the second neural network may be conditioned on the latent feature space representation of the perceptual-domain audio signal and/or the mask.
In some embodiments, mapping the perceptual-domain audio signal to the latent feature space representation by the first neural network and generating the approximated perceptual-domain audio signal by the second neural network may be performed in the time domain.
In some embodiments, obtaining the perceptual-domain signal in step (a) and converting the approximated perceptual-domain signal in step (h) may be performed in the frequency domain.
According to a third aspect of the present disclosure, a method of jointly training a set of a first neural network and a second neural network is provided. The method may comprise step (a): inputting a perceptual-domain audio training signal into the first neural network to map the perceptual-domain audio training signal to a latent feature space representation. The method may further comprise step (b): obtaining the latent feature space representation of the perceptual-domain audio training signal as an output of the first neural network. The method may further comprise step (c): inputting the latent feature space representation of the perceptual-domain audio training signal into the second neural network to generate an approximated perceptual-domain audio training signal. The method may further comprise step (d): obtaining the approximated perceptual-domain audio training signal as an output of the second neural network. And the method may comprise step (e): iteratively adjusting parameters of the first neural network and the second neural network based on a difference between the approximated perceptual-domain audio training signal and an original perceptual-domain audio signal.
In some embodiments, the first neural network and the second neural network may be trained in the perceptual domain based on one or more loss functions.
In some embodiments, the first neural network and the second neural network may be trained in the perceptual domain based on a negative log-likelihood criterion.
According to a fourth aspect of the present disclosure, a method of training a neural network is provided. The method may comprise step (a): inputting a perceptual-domain audio training signal into the neural network to process the perceptual-domain audio training signal. The method may further comprise step (b): obtaining a processed perceptual-domain audio training signal as an output of the neural network. And the method may comprise step (c): iteratively adjusting parameters of the neural network based on a difference between the processed perceptual-domain audio training signal and an original perceptual-domain audio signal.
In some embodiments, the neural network may be trained in the perceptual domain based on one or more loss functions.
In some embodiments, the neural network may be trained in the perceptual domain based on a negative log-likelihood criterion.
According to a fifth aspect of the present disclosure, a method of obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network is provided. The method may comprise step (a): obtaining a perceptual-domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in the original signal domain. The method may further comprise step (b): inputting the perceptual-domain audio signal into a neural network to map the perceptual-domain audio signal to a latent feature space representation. The method may further comprise step (c): obtaining the latent feature space representation of the perceptual-domain audio signal as an output of the neural network. And the method may comprise step (d): outputting the latent feature space representation of the perceptual-domain audio signal as a bitstream.
In some embodiments, in step (d), further information indicative of the mask may also be output in the bitstream.
In some embodiments, the latent feature space representation of the perceptual-domain audio signal and/or the information indicative of the mask may be quantized before being output in the bitstream.
In some embodiments, mapping the perceptual-domain audio signal to the latent feature space representation by the neural network may be performed in the time domain.
In some embodiments, obtaining the perceptual-domain audio signal may be performed in the frequency domain.
According to a sixth aspect of the present disclosure, there is provided a method of obtaining an audio signal from a latent feature space representation of a perceptual-domain audio signal using a neural network. The method may comprise step (a): receiving a latent feature space representation of a perceptual-domain audio signal as a bitstream. The method may further comprise step (b): inputting the latent feature space representation into a neural network to generate the perceptual-domain audio signal. The method may further comprise step (c): obtaining the perceptual-domain audio signal as an output of the neural network. And the method may comprise step (d): converting the perceptual-domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
In some embodiments, the neural network may be conditioned on the latent feature space representation of the perceptual-domain audio signal.
In some embodiments, in step (a), further information indicative of the mask may be received in the bitstream, and the neural network may be conditioned on the information.
In some embodiments, the latent feature space representation of the perceptual-domain audio signal and/or the information indicative of the mask may have been quantized when received, and inverse quantization may be performed prior to step (b).
In some embodiments, generating the perceptual-domain audio signal by the neural network may be performed in the time domain.
In some embodiments, converting the perceptual-domain audio signal to the original signal domain may be performed in the frequency domain.
According to a seventh aspect of the present disclosure, there is provided an apparatus for processing an audio signal using a neural network. The apparatus may include a neural network and one or more processors configured to perform a method comprising: (a) obtaining a perceptual domain audio signal; (b) Inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal; (c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and (d) converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from the psychoacoustic model.
According to an eighth aspect of the present disclosure, there is provided an apparatus for obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network. The apparatus may include a neural network and one or more processors configured to perform a method comprising: (a) obtaining a perceptual-domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in the original signal domain; (b) inputting the perceptual-domain audio signal into the neural network to map the perceptual-domain audio signal to a latent feature space representation; (c) obtaining the latent feature space representation of the perceptual-domain audio signal as an output of the neural network; and (d) outputting the latent feature space representation of the perceptual-domain audio signal as a bitstream.
According to a ninth aspect of the present disclosure, there is provided an apparatus for obtaining an audio signal from a latent feature space representation of a perceptual-domain audio signal using a neural network. The apparatus may include a neural network and one or more processors configured to perform a method comprising: (a) receiving a latent feature space representation of a perceptual-domain audio signal as a bitstream; (b) inputting the latent feature space representation into the neural network to generate the perceptual-domain audio signal; (c) obtaining the perceptual-domain audio signal as an output of the neural network; and (d) converting the perceptual-domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
According to tenth to fifteenth aspects of the present disclosure, there are provided computer program products comprising computer-readable storage media having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform the above-described methods.
Drawings
Example embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
fig. 1 illustrates an example of a method of processing an audio signal using a neural network.
Fig. 2 illustrates a further example of a method of processing an audio signal using a neural network.
Fig. 3 illustrates an example of a system including an apparatus for processing an audio signal using a neural network.
Fig. 4a and 4b illustrate examples of a method of processing an audio signal using a first neural network and a second neural network.
Fig. 5 illustrates an example of a system having means for obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network, and means for obtaining an audio signal from the latent feature space representation of the perceptual-domain audio signal using a neural network.
Fig. 6 illustrates an example of a method of training a neural network.
Fig. 7 illustrates an example of a method of jointly training a set of first and second neural networks.
Fig. 8 illustrates an example of an original audio signal and mask as a function of level and frequency.
Fig. 9 illustrates an example of a perceptual domain audio signal as a function of level and frequency obtained by applying a mask to an original audio signal.
Fig. 10 illustrates an example of converting an audio signal to a perception domain and processing the audio signal using a neural network.
Fig. 11 illustrates an example where the audio encoder and decoder operate in the perceptual domain and there is a neural network in both the encoder and decoder. The figure also illustrates an example of training the neural network using a simple loss function when the network is operating in the perceptual domain.
Fig. 12 illustrates an example where an audio encoder and decoder operate in the perceptual domain and there is a neural network in the decoder. The figure also illustrates an example of training the neural network using a simple loss function when the network is operating in the perceptual domain.
Detailed Description
Overview
While neural networks have shown promise for encoding and/or decoding images, video, and even speech, encoding and/or decoding general audio using neural networks is challenging. Two factors complicate the compression of general audio with neural networks. First, audio encoders and decoders need to exploit the limitations of the human auditory system to achieve high performance. To exploit the perceptual constraints of the human auditory system, neural networks cannot be trained directly with non-perceptual loss functions such as L1 or L2:
L1 = sum_n |x_n - x̂_n|,   L2 = sum_n (x_n - x̂_n)^2

where x_n is a target value (ground truth) and x̂_n is a predicted value (the output value of the network).
Second, typical audio signals have very high dynamic range and are very diverse in nature, which complicates neural network training.
The present disclosure describes methods and apparatus for transforming an audio signal to a perceptual domain prior to application of a neural network in a corresponding audio encoder and/or decoder. The perceptual-domain conversion of audio signals not only significantly reduces the dynamic range, but also allows training the network using non-perceptual loss functions like L1 and L2.
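By way of illustration (this sketch is not part of the patent disclosure, and the masking curve below is a stand-in rather than the output of a real psychoacoustic model), the following NumPy fragment shows why a plain L2 loss computed on the perceptual-domain signal behaves as a perceptually weighted loss on the original-domain signal:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2(x, x_hat):
    return np.sum((x - x_hat) ** 2)

# Toy magnitude spectrum with a large dynamic range, and a hypothetical
# masking curve (a real system would derive it from a psychoacoustic model).
X = rng.normal(size=256) * np.logspace(0, -3, 256)
mask = 0.1 * np.abs(X) + 1e-4

P = X / mask                                   # perceptual-domain signal
P_hat = P + rng.normal(scale=0.01, size=256)   # spectrally white coding error
X_hat = P_hat * mask                           # back to the original domain

# The white error added in the perceptual domain becomes mask-shaped in the
# original domain, so l2(P, P_hat) penalizes bins by audibility, not level.
print(l2(P, P_hat), l2(X, X_hat))
```

Note also the much smaller dynamic range of P compared to X, which is the second benefit discussed above.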
Method of processing an audio signal using a neural network
Referring to the example of fig. 1, a method of processing an audio signal using a neural network is illustrated. In step S101, a perceptual-domain audio signal is obtained. The term "perceptual domain" as used herein refers to signals in which the relative level differences between frequency components are (approximately) proportional to their relative subjective importance. In general, converting an audio signal to the perceptual domain minimizes the auditory impact of adding white (spectrally flat) noise to the perceptual-domain signal, because when the signal is converted back to the original signal domain, the noise is shaped so as to minimize audibility.
Referring to the example of fig. 2, a perceptual domain audio signal may be obtained from steps S101a, S101b and S101c, wherein in step S101a the audio signal may be converted from an original signal domain to a perceptual domain by applying a mask.
One way to convert the audio signal to the perceptual domain is to estimate a mask or masking curve, for example using a psychoacoustic model. The masking curve generally defines the level of just-noticeable distortion (JND) that can be detected by the human auditory system for a given stimulus signal. Once the masking curve has been derived from the psychoacoustic model, the spectrum of the audio signal may be divided by the masking curve to produce the perceptual-domain audio signal. After encoding and/or decoding by the neural network, the perceptual-domain audio signal, which resulted from multiplication with the inverse of the mask estimate, may be converted back to the original signal domain by multiplying by the mask. Multiplying by the mask after decoding ensures that the errors introduced by the encoding and decoding processes follow the masking curve. While this is one way of converting the original audio signal to the perceptual domain, many other ways are conceivable, for example filtering in the time domain by means of a suitably designed time-varying filter. Referring to the examples of figs. 8 and 9, the conversion of the spectrum of an original audio signal into the perceptual domain is illustrated. The graph of fig. 8 shows the spectrum of the original audio signal (solid line) and an estimated mask or masking curve calculated with a psychoacoustic model (dash-dot line). The perceptual-domain signal obtained by multiplying by the inverse of the mask estimate is illustrated in the graph of fig. 9. The perceptual-domain signal not only allows the use of simple loss terms during training of the neural network; as a comparison of fig. 9 with fig. 8 shows, it also exhibits a much smaller dynamic range than the original audio signal spectrum.
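A minimal sketch of this forward and inverse conversion, assuming a frame-wise FFT representation and a precomputed masking curve (the psychoacoustic model itself is stubbed out with a constant curve), might look as follows:

```python
import numpy as np

def to_perceptual(frame, mask):
    """Divide the spectrum by the masking curve (multiply by its inverse)."""
    return np.fft.rfft(frame) / mask

def from_perceptual(perceptual_spectrum, mask, n):
    """Multiply by the masking curve and return to the time domain."""
    return np.fft.irfft(perceptual_spectrum * mask, n=n)

rng = np.random.default_rng(1)
n = 1024
frame = rng.normal(size=n)
mask = np.full(n // 2 + 1, 0.05)   # stand-in for a model-derived masking curve

P = to_perceptual(frame, mask)
# Coding noise added in the perceptual domain (white, complex) ...
P_noisy = P + 0.01 * (rng.normal(size=P.shape) + 1j * rng.normal(size=P.shape))
# ... is shaped by the masking curve on the way back, which tends to keep it
# near or below the masking threshold.
out = from_perceptual(P_noisy, mask, n)
```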
Referring again to the example of fig. 2, then, in step S101b, the perceptual-domain audio signal may be encoded and then decoded in step S101c to obtain a perceptual-domain audio signal. In some embodiments, quantization may be applied to the perceptual domain audio signal prior to encoding and inverse quantization may be applied to the perceptual domain audio signal after decoding.
Referring again to the example of fig. 1, in step S102, the perceptual-domain audio signal is input into a neural network to process the perceptual-domain audio signal. The neural network used is not limited and may be selected according to the processing requirements. While the neural network may operate either in the frequency domain or in the time domain, in some embodiments, processing the perceptual-domain audio signal through the neural network may be performed in the time domain. Further, in some embodiments, the neural network may be conditioned on information indicative of the mask. Alternatively or additionally, in some embodiments, the neural network may be conditioned on the perceptual-domain audio signal.
In some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across time. Alternatively, in some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across frequencies. Further alternatively, in some embodiments, processing the perceptual-domain audio signal through the neural network may include predicting the processed perceptual-domain audio signal across time and frequency.
In step S103, a processed perceptual domain audio signal is then obtained as an output of the neural network. In some embodiments, the processed perceptual-domain audio signal may be converted to the frequency domain prior to the next step S104.
In step S104, the processed perceptual-domain audio signal is converted to the original signal domain based on a mask indicative of a masking threshold derived from the psychoacoustic model. For example, to calculate the mask, the psychoacoustic model may utilize frequency coefficients from a time-frequency transform that is adapted to convert the processed perceptual-domain audio signal to the frequency domain. Alternatively or additionally, the mask used in step S104 may be based on the mask that was used to convert the original audio signal to the perceptual domain. In this case, the mask may be obtained as side information; the mask may optionally be quantized.
Accordingly, the term "original signal domain" as used herein refers to the signal domain of the audio signal prior to its conversion to the perceptual domain.
The method described above may be implemented in various ways. For example, the method may be implemented by an apparatus for processing an audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform the method.
Referring to the example of fig. 3, a system including an apparatus for processing an audio signal using a neural network is illustrated. The apparatus may be a decoder. In this case, the neural network is used only in the decoder.
As shown in the example of fig. 3, the perceptual-domain audio signal may be quantized in a quantizer 101 and (entropy) encoded by, for example, a corresponding conventional encoder 102. The quantized coded perceptual audio signal may then be transmitted, e.g. as a bitstream, to a decoder 103 to obtain a quantized perceptual domain audio signal, e.g. by (entropy) decoding the received bitstream. The quantized perceptual domain audio signal may then be inverse quantized in a corresponding inverse quantizer 104. The obtained perceptual-domain audio signal may then be input to a neural network (decoder neural network) 105 to obtain a processed perceptual-domain audio signal as an output of the neural network 105.
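The decoder-side flow of fig. 3 can be sketched as follows. The uniform quantizer step size, the stand-in PostNet architecture, and the omission of the entropy codec are illustrative assumptions, not components specified by the disclosure:

```python
import torch
import torch.nn as nn

class PostNet(nn.Module):
    """Stand-in for the decoder neural network (105)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

STEP = 0.05  # assumed uniform quantizer step size

def decode_frame(symbols: torch.Tensor, mask: torch.Tensor, post_net: PostNet) -> torch.Tensor:
    perceptual = symbols * STEP        # inverse quantizer (104); entropy decoding omitted
    processed = post_net(perceptual)   # decoder neural network (105)
    return processed * mask            # convert back to the original signal domain

post_net = PostNet()
symbols = torch.randint(-10, 10, (1, 512)).float()  # entropy-decoded quantizer indices
mask = torch.full((1, 512), 0.05)                   # transmitted mask (side information)
frame = decode_frame(symbols, mask, post_net)
```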
Alternatively or additionally, the above-described methods may be implemented by a computer program product comprising a computer-readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform the methods.
Method of processing an audio signal using a first neural network and a second neural network
Referring to the examples of fig. 4a and 4b, a method of processing an audio signal using a first neural network and a second neural network is illustrated. For example, the first neural network may be implemented at an encoder, while the second neural network may be implemented at a decoder.
As shown in the example of fig. 4a, in step S201, a perceptual domain audio signal is obtained by a first device by applying a mask indicative of a masking threshold derived from a psycho-acoustic model to an audio signal in an original signal domain. For example, the first device may be an encoder. In some embodiments, obtaining the perceptual-domain audio signal may be performed in the frequency domain.
In step S202, the obtained perceptual-domain audio signal is then input into a first neural network to map the perceptual-domain audio signal to a latent feature space representation.
In some embodiments, mapping the perceptual-domain audio signal to the latent feature space representation by the first neural network may be performed in the time domain.
In step S203, the latent feature space representation is obtained as an output of the first neural network.
The latent feature space representation of the perceptual-domain audio signal and the mask are then transmitted to a second device in step S204. In some embodiments, the above method may further comprise: encoding the latent feature space representation of the perceptual-domain audio signal and the mask into a bitstream, and transmitting the bitstream to the second device. In some embodiments, the latent feature space representation of the perceptual-domain audio signal and the mask may additionally be quantized prior to being encoded into the bitstream.
Referring now to the example of fig. 4b, in step S205, the latent feature space representation of the perceptual-domain audio signal and the mask are received by a second device. The second device may be a decoder. In some embodiments, the method may further comprise: receiving, by the second device, the latent feature space representation of the perceptual-domain audio signal and the mask as a bitstream, and decoding the bitstream to obtain the latent feature space representation of the perceptual-domain audio signal and the mask. In some embodiments, where the latent feature space representation of the perceptual-domain audio signal and the mask have been quantized, they may be dequantized prior to processing by the second neural network.
In step S206, the latent feature space representation is input into the second neural network to generate an approximated perceptual-domain audio signal. In some embodiments, the second neural network may be conditioned on the latent feature space representation of the perceptual-domain audio signal and/or the mask. In some embodiments, generating the approximated perceptual-domain audio signal by the second neural network may be performed in the time domain.
In step S207, the approximated perceptual-domain audio signal is obtained as an output of the second neural network.
In step S208, the approximated perceptual-domain audio signal is converted to the original signal domain based on the mask. In some embodiments, converting the approximated perceptual-domain signal may be performed in the frequency domain.
The above method may be implemented by a system having respective first and second devices. Alternatively or additionally, the above-described methods may be implemented by a corresponding computer program product comprising a computer-readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform the method.
Alternatively, the above-described method may be implemented in part by means for obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network, and in part by means for obtaining an audio signal from the latent feature space representation of the perceptual-domain audio signal using a neural network. These means may then be implemented as standalone apparatuses or as a system.
The method of obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network then comprises the following steps. In step (a), a perceptual-domain audio signal is obtained by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in the original signal domain. In some embodiments, obtaining the perceptual-domain audio signal may be performed in the frequency domain.
In step (b), the perceptual-domain audio signal is input into a neural network to map the perceptual-domain audio signal to a latent feature space representation. In some embodiments, mapping the perceptual-domain audio signal to the latent feature space representation by the neural network may be performed in the time domain.
In step (c), the latent feature space representation of the perceptual-domain audio signal is obtained as an output of the neural network. And in step (d), the latent feature space representation of the perceptual-domain audio signal is then output as a bitstream.
In some embodiments, in step (d), further information indicative of the mask may also be output in the bitstream. In some embodiments, the latent feature space representation of the perceptual-domain audio signal and/or the information indicative of the mask may be quantized before being output in the bitstream.
The method of obtaining an audio signal from a latent feature space representation of a perceptual-domain audio signal using a neural network comprises the following steps. In step (a), a latent feature space representation of a perceptual-domain audio signal is received as a bitstream. In step (b), the latent feature space representation is input into a neural network to generate the perceptual-domain audio signal. In step (c), the perceptual-domain audio signal is obtained as an output of the neural network. And in step (d), the perceptual-domain audio signal is converted to the original signal domain based on a mask indicative of a masking threshold derived from the psychoacoustic model.
In some embodiments, the neural network may be conditioned on the latent feature space representation of the perceptual-domain audio signal. In some embodiments, in step (a), further information indicative of the mask may be received in the bitstream, and the neural network may be conditioned on this information. In some embodiments, the latent feature space representation of the perceptual-domain audio signal and/or the information indicative of the mask may have been quantized when received, and inverse quantization may be performed prior to step (b). In some embodiments, generating the perceptual-domain audio signal by the neural network may be performed in the time domain. In some embodiments, converting the perceptual-domain audio signal to the original signal domain may be performed in the frequency domain.
Referring to the example of fig. 5, a system is illustrated having means (also: a first apparatus) for obtaining and transmitting a latent feature space representation of a perceptual-domain audio signal using a neural network, and means (also: a second apparatus) for obtaining an audio signal from the latent feature space representation of the perceptual-domain audio signal using a neural network.
In the example of fig. 5, in the (first) apparatus 201, the perceptual-domain audio signal may be input into the (first) neural network 202 for the above-described processing. The first neural network 202 may be an encoder neural network. The latent feature space representation output from the (first) neural network may be quantized in a quantizer 203 and transmitted to the (second) apparatus 204. The quantized latent feature space representation may be encoded and transmitted as a bitstream to the (second) apparatus 204. In the (second) apparatus 204, the received latent feature space representation may first be inverse quantized in an inverse quantizer 205, and optionally decoded, before being input into the (second) neural network 206 to generate an approximated perceptual-domain audio signal based on the latent feature space representation. The approximated perceptual-domain audio signal may then be obtained as an output of the (second) neural network 206.
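A hedged end-to-end sketch of the fig. 5 pipeline follows; the layer shapes, the latent dimension, and the uniform quantizer step size are illustrative assumptions only:

```python
import torch
import torch.nn as nn

FRAME, LATENT, STEP = 512, 64, 0.05

# First (encoder) neural network 202 and second (decoder) neural network 206.
encoder = nn.Sequential(nn.Linear(FRAME, 256), nn.ReLU(), nn.Linear(256, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, FRAME))

perceptual = torch.randn(1, FRAME)       # perceptual-domain audio frame

z = encoder(perceptual)                  # map to the latent feature space (202)
z_q = torch.round(z / STEP)              # quantizer (203) -> bitstream symbols
z_hat = z_q * STEP                       # inverse quantizer (205)
approx_perceptual = decoder(z_hat)       # approximated perceptual-domain signal (206)
# Multiplying by the transmitted mask would then return the approximation
# to the original signal domain.
```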
Method of training a neural network
Referring to the example of fig. 6, a method of training a neural network is illustrated. In step S301, a perceptual-domain audio training signal is input into a neural network to process the perceptual-domain audio training signal. The perceptual domain audio training signal is processed by a neural network and, in step S302, the processed perceptual domain audio training signal is then obtained as an output of the neural network. Based on the difference between the processed perceptual-domain audio training signal and the original perceptual-domain audio signal from which the perceptual-domain audio training signal may have been obtained, the parameters of the neural network are then iteratively adjusted in step S303. The neural network is trained based on this iterative adjustment to generate increasingly better processed perceptual domain audio training signals. The goal of this iterative adjustment is to have the neural network generate a processed perceptual domain audio training signal that is indistinguishable from the corresponding original perceptual domain audio signal.
In some embodiments, the neural network may be trained in the perceptual domain based on one or more loss functions. A neural network designed to encode audio signals in the perceptual domain can be trained with simple loss functions such as L1 and L2, as the errors these functions introduce are white across the frequency spectrum. In the case of L1 and L2, the neural network may predict the mean of the processed perceptual-domain audio training signal.
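A minimal training-loop sketch for this case might look as follows, assuming a simple feed-forward network, synthetic training frames, and an L1 loss applied directly in the perceptual domain (none of these choices are mandated by the disclosure):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
l1 = nn.L1Loss()

for _ in range(100):                                    # S301-S303, iterated
    target = torch.randn(8, 512)                        # original perceptual-domain frames
    degraded = target + 0.1 * torch.randn_like(target)  # training input
    out = net(degraded)                                 # S301/S302: process and obtain output
    loss = l1(out, target)                              # simple loss is valid in this domain
    opt.zero_grad(); loss.backward(); opt.step()        # S303: adjust parameters
```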
Alternatively, in some embodiments, the neural network may be trained in the perceptual domain based on a negative log-likelihood (NLL) criterion. In the NLL case, the neural network may predict the parameters, i.e. mean and scale, of a preselected distribution. A logarithmic parameterization of the scale parameter can generally be used to avoid numerical instability. The preselected distribution may be a Laplace distribution. Alternatively, the preselected distribution may be a logistic distribution or a Gaussian distribution. In the case of a Gaussian distribution, the scale parameter may be replaced by a variance parameter. In the NLL case, a sampling operation may be used to convert from the distribution parameters to the processed perceptual-domain audio training signal. The sampling operation may be written as:
x̂ = mean + F(scale, u)

where x̂ is the predicted processed perceptual-domain audio training signal, mean and scale are the prediction parameters from the neural network, F() is a sampling function determined by the preselected distribution, and u is drawn from a uniform distribution.
For example, in the case of a Laplace distribution,
F = -scale * sign(u) * log(1 - 2*|u|),   u ~ U(-0.5, 0.5)
A weighting function derived from the quantized mask may be applied to the scale parameter in the sampling function F(). Further, in the case of sampling from a mixture (e.g., a Gaussian mixture) for each output coefficient, there may be a vector of parameters.
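The Laplace sampling function above transcribes directly into code; the log-scale parameterization follows the numerical-stability remark above, and the small epsilon guard is an added assumption:

```python
import torch

def sample_laplace(mean: torch.Tensor, log_scale: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    scale = torch.exp(log_scale)           # log parameterization for stability
    u = torch.rand_like(mean) - 0.5        # u ~ U(-0.5, 0.5)
    f = -scale * torch.sign(u) * torch.log(1 - 2 * u.abs() + eps)
    return mean + f                        # predicted perceptual-domain sample
```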
Method of jointly training a set of a first neural network and a second neural network
Referring to the example of fig. 7, a method of jointly training a set of first and second neural networks is illustrated.
In step S401, a perceptual-domain audio training signal is input into the first neural network to map the perceptual-domain audio training signal to a latent feature space representation. In step S402, the latent feature space representation of the perceptual-domain audio training signal is obtained as an output of the first neural network. In step S403, the latent feature space representation of the perceptual-domain audio training signal is then input into the second neural network to generate an approximated perceptual-domain audio training signal. In step S404, the approximated perceptual-domain audio training signal may then be obtained as an output of the second neural network. And in step S405, parameters of the first neural network and the second neural network are iteratively adjusted based on a difference between the approximated perceptual-domain audio training signal and the original perceptual-domain audio signal from which the perceptual-domain audio training signal was derived.
In some embodiments, the first neural network and the second neural network may be trained in the perceptual domain based on one or more loss functions. In some embodiments, the first and second neural networks may be trained in the perceptual domain based on a negative log-likelihood (NLL) criterion. The goal of the iterative adjustment is to have the first and second neural networks generate an approximated perceptual-domain audio training signal that is indistinguishable from the corresponding original perceptual-domain audio signal.
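A joint-training sketch corresponding to steps S401 through S405 might look as follows; the architectures and the use of an L1 loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(512, 64))    # first neural network (encoder)
dec = nn.Sequential(nn.Linear(64, 512))    # second neural network (decoder)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

for _ in range(100):
    x = torch.randn(8, 512)                # perceptual-domain training frames
    z = enc(x)                             # S401/S402: latent feature space representation
    x_hat = dec(z)                         # S403/S404: approximated training signal
    loss = (x_hat - x).abs().mean()        # S405: simple L1 works in the perceptual domain
    opt.zero_grad(); loss.backward(); opt.step()
```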
Further exemplary embodiments
With reference to the examples of fig. 10-12, further exemplary embodiments of the methods and apparatus described herein are illustrated. In the example of fig. 10, a schematic diagram showing the conversion of an audio signal to the perceptual domain for data reduction using a neural network is illustrated. In the example of fig. 10, PCM audio data is used as input.
In the example of fig. 11, a schematic diagram of an audio encoder and decoder operating in the perceptual domain, with a neural network in both the encoder and the decoder, is illustrated. Fig. 11 also shows training the neural networks using a simple loss function when the networks operate in the perceptual domain. In the example of fig. 11, the ground-truth signal refers to an original perceptual-domain audio signal, based on which a corresponding perceptual-domain audio training signal may be derived, and to which the approximated perceptual-domain audio signal may be compared in order to iteratively adjust the neural networks.
In the example of fig. 12, a schematic diagram of an audio encoder and decoder operating in the perceptual domain, with a neural network in the decoder, is illustrated. Fig. 12 also shows training the neural network using a simple loss function when the network operates in the perceptual domain. Also in this case, the ground-truth signal refers to an original perceptual-domain audio signal, based on which a corresponding perceptual-domain audio training signal is derived, and to which the processed perceptual-domain audio signal may be compared in order to iteratively adjust the neural network.
Interpretation
Unless specifically stated otherwise, as is apparent from the following discussion, it is appreciated that throughout the disclosure, discussions utilizing terms such as "processing," "computing," "determining," "analyzing," or the like refer to the actions and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.
In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory, to transform the electronic data into other electronic data, e.g., that may be stored in registers and/or memory. A "computer" or "computing machine" or "computing platform" may include one or more processors.
In one example embodiment, the methods described herein may be performed by one or more processors that accept computer-readable (also referred to as machine-readable) code containing a set of instructions that, when executed by the one or more processors, carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further comprise a memory subsystem including main RAM and/or static RAM and/or ROM. A bus subsystem may be included for communication between the components. The processing system may further be a distributed processing system in which the processors are coupled together by a network. If the processing system requires a display, such a display may be included, for example a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data entry is required, the processing system further includes an input device, such as one or more of an alphanumeric input unit (e.g., a keyboard), a pointing control device (e.g., a mouse), and so forth. The processing system may also encompass a storage system, such as a disk drive unit. The processing system in some configurations may include a sound output device and a network interface device. The memory subsystem thus includes a computer-readable carrier medium carrying computer-readable code (e.g., software) comprising a set of instructions that, when executed by one or more processors, cause performance of one or more of the methods described herein. It should be noted that when the method includes several elements (e.g., several steps), no order of the elements is implied unless specifically stated. The software may reside on the hard disk, or it may be completely or at least partially resident in the RAM and/or the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute a computer-readable carrier medium carrying computer-readable code. Furthermore, the computer-readable carrier medium may form, or be included in, a computer program product.
In alternative example embodiments, the one or more processors may operate as standalone devices or may be connected (e.g., networked) to other processors in a networked deployment; the one or more processors may operate in the capacity of a server or a user machine in a server-user network environment, or as a peer machine in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), a tablet PC, a personal digital assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
It should be noted that the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
Thus, one example embodiment of each method described herein is in the form of a computer-readable carrier medium carrying a set of instructions, for example, a computer program for execution on one or more processors (e.g., one or more processors that are part of a web server arrangement). Accordingly, as will be appreciated by one skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer readable carrier medium (e.g., a computer program product). The computer-readable carrier medium carries computer-readable code comprising a set of instructions that, when executed on one or more processors, cause the one or more processors to implement a method. Accordingly, aspects of the present disclosure may take the form of an entirely hardware example embodiment, an entirely software example embodiment or an example embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.
The software may further be transmitted or received over a network via a network interface device. While the carrier medium is a single medium in the example embodiments, the term "carrier medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "carrier medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by one or more of the processors and that causes the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical disks, magnetic disks, and magneto-optical disks. Volatile media include dynamic memory, such as main memory. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications. For example, the term "carrier medium" shall accordingly be taken to include, but not be limited to, solid-state memories; computer products embodied in optical and magnetic media; a medium bearing a propagated signal that is detectable by at least one processor or one or more processors and that represents a set of instructions that, when executed, implement a method; and a transmission medium in a network bearing a propagated signal that is detectable by at least one of the one or more processors and that represents the set of instructions.
It will be appreciated that in one example embodiment, the steps of the methods discussed are performed by a suitable processor (or processors) in a processing (e.g., computer) system executing instructions (computer readable code) stored in a storage device. It will also be appreciated that the present disclosure is not limited to any particular implementation or programming technique, and that the present disclosure may be implemented using any suitable technique for implementing the functions described herein. The present disclosure is not limited to any particular programming language or operating system.
Reference throughout this disclosure to "one embodiment," "some embodiments," or "an example embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment," "in some embodiments," or "in example embodiments" in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner, as will be apparent to one of ordinary skill in the art in light of this disclosure, in one or more example embodiments.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
In the claims below and in the description herein, any one of the terms "comprising," "comprised of," or "which comprises" is an open term that means including at least the elements/features that follow, but not excluding others. Thus, when the term "comprising" is used in the claims, it should not be interpreted as being limited to the means or elements or steps listed thereafter. For example, the scope of the expression "a device comprising A and B" should not be limited to devices consisting only of elements A and B. As used herein, any one of the terms "including," "which includes," or "that includes" is also an open term that likewise means including at least the elements/features that follow the term, but not excluding others. Thus, "including" is synonymous with and means "comprising."
It should be appreciated that in the foregoing description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment/figure or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the description are hereby expressly incorporated into this description, with each claim standing on its own as a separate example embodiment of this disclosure.
Moreover, while some example embodiments described herein include some features included in other example embodiments and not others included in other example embodiments, combinations of features of different example embodiments are intended to be within the scope of the present disclosure and form different example embodiments, as will be appreciated by those of skill in the art. For example, in the appended claims, any of the example embodiments claimed may be used in any combination.
In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Therefore, while there has been described what are believed to be the best modes of the present disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as falling within the scope of the present disclosure. For example, any formulas given above merely represent procedures that may be used. Functionality may be added to or deleted from the block diagrams, and operations may be interchanged among the functional blocks. Steps may be added to or deleted from the methods described, while remaining within the scope of the present disclosure.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) below:
EEE 1. A computer-implemented method of encoding an audio signal using a neural network, the method comprising the steps of:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal in a bitstream.
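By way of a non-normative illustration only, the following Python sketch shows one way steps (a) to (c) could be realized. Everything in it is hypothetical: the frame length, the use of a plain DCT in place of the actual analysis filterbank, the helper names, and the encoder architecture; the psychoacoustic model that supplies the per-bin masking thresholds is assumed to exist and is not specified here.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct

N_BINS = 1024  # hypothetical frame length; a DCT stands in for the
               # analysis filterbank purely for brevity

def to_perceptual_domain(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Step (a): divide each frequency coefficient by the per-bin masking
    threshold so that equal coding errors become (roughly) equally audible."""
    return dct(frame, norm="ortho") / mask  # mask: per-bin thresholds, > 0

class Encoder(nn.Module):
    """Steps (b)-(c): a hypothetical network mapping one perceptual domain
    frame to a compact potential (latent) feature space representation."""
    def __init__(self, n_bins: int = N_BINS, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, perceptual: torch.Tensor) -> torch.Tensor:
        return self.net(perceptual)
```

The vector returned by Encoder corresponds to the representation that step (d) would write into the bitstream.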
EEE 2. The method according to EEE 1 wherein further information indicative of the mask is output in the bitstream in step (d).
EEE 3. The method according to EEE 1 or 2, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized before being output in the bitstream.
EEE 4. The method according to any one of EEEs 1 to 3, wherein mapping the perceptual domain audio signal to the potential feature space representation by the neural network is performed in the time domain; and/or
Wherein obtaining the perceptual domain audio signal is performed in the frequency domain.
EEE 5. A computer-implemented method of decoding an audio signal using a neural network, wherein the method comprises the steps of:
(a) Obtaining a representation of a perceptual domain audio signal;
(b) Inputting the representation of the perceptual domain audio signal into the neural network to process the representation of the perceptual domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
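Step (d) is essentially the inverse of the masking operation applied at the encoder. A minimal sketch, assuming the same hypothetical DCT filterbank and real-valued perceptual domain coefficients as in the earlier example:

```python
import numpy as np
from scipy.fft import idct

def to_signal_domain(perceptual: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Step (d): re-apply the per-bin masking threshold to undo the
    perceptual normalization, then invert the analysis transform."""
    return idct(perceptual * mask, norm="ortho")
```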
EEE 6. The method of EEE 5, wherein processing the perceptual domain audio signal through the neural network is performed in the time domain; and/or
Wherein the method further comprises: before step (d), converting the audio signal to the frequency domain.
EEE 7. The method of EEE 5 or 6, wherein the neural network is conditioned on information indicative of the mask; and/or
Wherein the neural network is conditioned on the perceptual domain audio signal.
EEE 8. The method of EEE 7, wherein processing the perceptual domain audio signal through the neural network comprises at least one of:
predicting the processed perceptual domain audio signal across time;
predicting the processed perceptual domain audio signal across frequencies; and
predicting the processed perceptual domain audio signal across time and frequency.
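As one illustration of the first of these options, a causal recurrent model could predict the current perceptual domain frame from previously decoded frames. The architecture below is a hypothetical sketch, not the predictor of this disclosure:

```python
import torch
import torch.nn as nn

class TimePredictor(nn.Module):
    """Hypothetical prediction across time: estimate the current perceptual
    domain frame from a window of previously decoded frames."""
    def __init__(self, n_bins: int = 1024, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, past_frames: torch.Tensor) -> torch.Tensor:
        # past_frames: (batch, time, n_bins); only past context is seen,
        # so the prediction is causal.
        h, _ = self.rnn(past_frames)
        return self.out(h[:, -1])  # predicted current frame
```

Prediction across frequencies, or across time and frequency, would follow the same pattern with the conditioning context drawn from neighbouring bins instead of (or in addition to) past frames.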
The method of any of EEEs 5-8, wherein the representation of the perceptual-domain audio signal comprises the perceptual-domain audio signal.
EEE 10. The method according to any one of EEEs 5 to 9,
wherein the representation of the perceptual-domain audio signal is obtained from:
converting an audio signal from the original signal domain to the perceptual domain by applying the mask;
encoding the perceptual domain audio signal; and
decoding the perceptual domain audio signal; and optionally
Wherein quantization is applied to the perceptual domain audio signal prior to encoding and inverse quantization is applied to the perceptual domain audio signal after decoding.
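Because the perceptual normalization already equalizes the audibility of errors across bins, the optional quantization can, for illustration, be a plain uniform scalar quantizer with a single step size. A minimal sketch, assuming real-valued perceptual domain coefficients; the step size is an arbitrary placeholder:

```python
import numpy as np

STEP = 0.5  # hypothetical uniform step size

def quantize(perceptual: np.ndarray) -> np.ndarray:
    """Applied to the perceptual domain signal prior to encoding."""
    return np.round(perceptual / STEP).astype(np.int32)

def inverse_quantize(indices: np.ndarray) -> np.ndarray:
    """Applied to the perceptual domain signal after decoding."""
    return indices.astype(np.float64) * STEP
```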
EEE 11. The method according to EEE 5,
wherein step (a) involves receiving a potential feature space representation of the perceptual domain audio signal in a bitstream; and
Wherein step (b) involves inputting the potential feature space representation into the neural network to generate the processed perceptual domain audio signal.
EEE 12. The method of EEE 11, wherein the neural network is conditioned on the potential feature space representation of the perceptual domain audio signal.
EEE 13. The method of EEE 11 or 12, further comprising receiving additional information indicative of the mask as the bitstream,
wherein the neural network is conditioned on the additional information.
EEE 14. The method according to any one of EEEs 11 to 13, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is received in quantized form; and
Wherein the method further comprises inverse quantizing the potential feature space representation prior to inputting into the neural network.
EEE 15. The method according to any one of EEEs 11 to 14, wherein generating the perceptual domain audio signal by the neural network is performed in the time domain; and/or
Wherein converting the perceptual-domain audio signal into the original signal domain is performed in the frequency domain.
EEE 16. A method (e.g., a computer-implemented method) of processing an audio signal using a neural network, wherein the method comprises the steps of:
(a) Obtaining a perceptual domain audio signal;
(b) Inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
EEE 17. The method of EEE 16, wherein processing the perceptual domain audio signal through the neural network is performed in the time domain.
EEE 18. The method of EEE 16 or 17, wherein the method further comprises: before step (d), converting the audio signal to the frequency domain.
EEE 19. The method according to any one of EEEs 16 to 18, wherein the neural network is conditioned on information indicative of the mask.
EEE 20. The method according to any one of EEEs 16 to 19, wherein the neural network is conditioned on the perceptual domain audio signal.
EEE 21. The method of EEE 19 or 20, wherein processing the perceptual-domain audio signal through the neural network comprises predicting the processed perceptual-domain audio signal across time.
EEE 22. The method of EEE 19 or 20, wherein processing the perceptual-domain audio signal through the neural network comprises predicting the processed perceptual-domain audio signal across frequencies.
EEE 23. The method of EEE 19 or 20, wherein processing the perceptual domain audio signal through the neural network comprises predicting the processed perceptual domain audio signal across time and frequency.
EEE 24. The method according to any one of EEEs 16 to 23, wherein the perceptual domain audio signal is obtained from:
(a) Converting an audio signal from the original signal domain to the perceptual domain by applying the mask;
(b) Encoding the perceptual domain audio signal; and
(c) Decoding the perceptual domain audio signal.
EEE 25. The method of EEE 24 wherein quantization is applied to the perceptual domain audio signal prior to encoding and inverse quantization is applied to the perceptual domain audio signal after decoding.
EEE 26. A method (e.g., a computer-implemented method) of processing an audio signal using a first neural network and a second neural network, wherein the method comprises the steps of:
(a) Obtaining, by a first device, a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into the first neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation as an output of the first neural network;
(d) Transmitting the potential feature space representation of the perceptual domain audio signal and the mask to a second device;
(e) Receiving, by the second device, the potential feature space representation of the perceptual domain audio signal and the mask;
(f) Inputting the potential feature space representation into the second neural network to generate an approximate perceptual domain audio signal;
(g) Obtaining the approximated perceptual domain audio signal as an output of the second neural network; and
(h) Converting the approximate perceptual domain audio signal to the original signal domain based on the mask.
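Taken together, steps (a) to (h) form a transmit/receive chain. The sketch below wires the hypothetical helpers from the earlier examples into that chain; the decoder network is assumed to mirror the Encoder, and the transport of the latent representation and the mask between the two devices is abstracted away:

```python
import numpy as np
import torch

def first_device(frame: np.ndarray, mask: np.ndarray, encoder) -> tuple:
    """Steps (a)-(d): mask the frame and map it to the latent space."""
    perceptual = to_perceptual_domain(frame, mask)              # step (a)
    with torch.no_grad():
        z = encoder(torch.from_numpy(perceptual).float())       # steps (b)-(c)
    return z.numpy(), mask                                      # step (d) payload

def second_device(z: np.ndarray, mask: np.ndarray, decoder) -> np.ndarray:
    """Steps (e)-(h): regenerate an approximate perceptual domain frame
    and convert it back to the original signal domain."""
    with torch.no_grad():
        approx = decoder(torch.from_numpy(z).float())           # steps (f)-(g)
    return to_signal_domain(approx.numpy(), mask)               # step (h)
```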
EEE 27. The method of EEE 26, wherein the method further comprises: encoding the potential feature space representation of the perceptual domain audio signal and the mask into a bitstream, and transmitting the bitstream to the second device; and wherein the method further comprises: receiving, by the second device, the bitstream, and decoding the bitstream to obtain the potential feature space representation of the perceptual domain audio signal and the mask.
EEE 28. The method of EEE 27, wherein the potential feature space representation of the perceptual domain audio signal and the mask are quantized before being encoded into the bitstream and dequantized before being processed by the second neural network.
EEE 29. The method according to any one of EEEs 26 to 28, wherein the second neural network is conditioned on the potential feature space representation of the perceptual domain audio signal and/or the mask.
EEE 30. The method according to any one of EEEs 26 to 29, wherein mapping the perceptual domain audio signal to the potential feature space representation by the first neural network and generating the approximate perceptual domain audio signal by the second neural network are performed in the time domain.
EEE 31. The method according to any one of EEEs 26 to 30, wherein obtaining the perceptual domain audio signal in step (a) and converting the approximate perceptual domain audio signal in step (h) are performed in the frequency domain.
EEE 32. A method (e.g., a computer-implemented method) of jointly training a set of first and second neural networks, wherein the method comprises the steps of:
(a) Inputting a perceptual domain audio training signal into the first neural network to map the perceptual domain audio training signal to a potential feature space representation;
(b) Obtaining the potential feature space representation of the perceptual domain audio training signal as an output of the first neural network;
(c) Inputting the potential feature space representation of the perceptual domain audio training signal into the second neural network to generate an approximate perceptual domain audio training signal;
(d) Obtaining the approximate perceptual domain audio training signal as an output of the second neural network; and
(e) Iteratively adjusting parameters of the first neural network and the second neural network based on a difference between the approximate perceptual domain audio training signal and an original perceptual domain audio signal.
EEE 33. The method of EEE 32, wherein the first neural network and the second neural network are trained in the perceptual domain based on one or more loss functions.
EEE 34. The method of EEE 32 wherein the first neural network and the second neural network are trained in the perceptual domain based on negative log-likelihood conditions.
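A compact PyTorch sketch of the joint training loop of EEE 32 follows. The L1 reconstruction error in the perceptual domain is only a stand-in for the loss function(s) of EEE 33, which this disclosure leaves open; the optimizer, learning rate, and data loader are likewise hypothetical:

```python
import torch
import torch.nn as nn

def train_jointly(encoder: nn.Module, decoder: nn.Module, loader,
                  epochs: int = 10) -> None:
    """Steps (a)-(e): autoencode perceptual domain training frames and
    iteratively adjust both networks on the perceptual domain error."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for perceptual in loader:      # perceptual domain training frames
            z = encoder(perceptual)            # steps (a)-(b)
            approx = decoder(z)                # steps (c)-(d)
            loss = (approx - perceptual).abs().mean()  # difference, step (e)
            opt.zero_grad()
            loss.backward()
            opt.step()
```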
EEE 35. A method (e.g., a computer-implemented method) of training a neural network, wherein the method comprises the steps of:
(a) Inputting a perceptual domain audio training signal into the neural network to process the perceptual domain audio training signal;
(b) Obtaining a processed perceptual domain audio training signal as an output of the neural network; and
(c) Iteratively adjusting parameters of the neural network based on differences between the processed perceptual domain audio training signal and an original perceptual domain audio signal.
EEE 36. The method of EEE 35, wherein the neural network is trained in the perceptual domain based on one or more loss functions.
EEE 37. The method of EEE 35, wherein the neural network is trained in the perceptual domain based on negative log-likelihood conditions.
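EEEs 34 and 37 leave the form of the negative log-likelihood criterion open. Purely as an illustration, if the network predicts the mean and log-variance of a diagonal Gaussian over each perceptual domain sample, the criterion could be written as follows (additive constants dropped):

```python
import torch

def gaussian_nll(mean: torch.Tensor, log_var: torch.Tensor,
                 target: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the perceptual domain target under a
    diagonal Gaussian predicted by the network, up to a constant."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```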
EEE 38. A method (e.g., a computer-implemented method) of obtaining and transmitting a potential feature space representation of a perceptual domain audio signal using a neural network, the method comprising the steps of:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal as a bitstream.
EEE 39. The method of EEE 38, wherein, in step (d), further information indicative of the mask is output as the bitstream.
EEE 40. The method according to EEE 38 or 39, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized before being output as the bitstream.
EEE 41. The method according to any one of EEEs 38 to 40, wherein mapping the perceptual domain audio signal to the potential feature space representation by the neural network is performed in the time domain.
EEE 42. The method according to any one of EEEs 38 to 41, wherein obtaining the perceptual-domain audio signal is performed in the frequency domain.
EEE 43. A method (e.g., a computer-implemented method) of obtaining an audio signal from a potential feature space representation of a perceptual domain audio signal using a neural network, the method comprising the steps of:
(a) Receiving a potential feature space representation of a perceptual domain audio signal as a bitstream;
(b) Inputting the potential feature space representation into a neural network to generate the perceptual domain audio signal;
(c) Obtaining the perceptual domain audio signal as an output of the neural network; and
(d) Converting the perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
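For steps (b) and (c), a decoder mirroring the earlier hypothetical Encoder could map the latent vector back to one perceptual domain frame; such a module is also what the two-device sketch above takes as its decoder argument. Conditioning on the mask (EEE 45) would simply add a second input. Again a sketch, not the architecture of this disclosure:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Hypothetical generative counterpart of the earlier Encoder: maps a
    latent vector back to one perceptual domain frame."""
    def __init__(self, latent_dim: int = 64, n_bins: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_bins),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```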
EEE 44. The method of EEE 43, wherein the neural network is conditioned on the potential feature space representation of the perceptual domain audio signal.
EEE 45. The method according to EEE 43 or 44, wherein, in step (a), further information indicative of the mask is received as the bitstream, and the neural network is conditioned on the information.
EEE 46. The method according to any one of EEEs 43 to 45, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized upon receipt, and inverse quantization is performed prior to step (b).
EEE 47. The method according to any one of EEEs 43 to 46, wherein generating the perceptual domain audio signal by the neural network is performed in the time domain.
EEE 48. The method of any of EEEs 43-47, wherein converting the perceptual-domain audio signal to the original signal domain is performed in the frequency domain.
EEE 49. An apparatus for processing an audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Obtaining a perceptual domain audio signal;
(b) Inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
EEE 50. An apparatus for obtaining and transmitting a potential feature space representation of a perceptual domain audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal as a bitstream.
EEE 51. An apparatus for obtaining an audio signal from a potential feature space representation of a perceptual domain audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Receiving a potential feature space representation of a perceptual domain audio signal as a bitstream;
(b) Inputting the potential feature space representation into a neural network to generate the perceptual domain audio signal;
(c) Obtaining the perceptual domain audio signal as an output of the neural network; and
(d) Converting the perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
EEE 52. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 1 to 10.
EEE 53. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 11 to 16.
EEE 54. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 17 to 19.
EEE 55. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 20 to 22.
EEE 56. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 23 to 27.
EEE 57. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to any one of EEEs 28 to 33.

Claims (54)

1. A computer-implemented method of encoding an audio signal using a neural network, the method comprising the steps of:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal in a bitstream.
2. The method of claim 1, wherein further information indicative of the mask is output in the bitstream in step (d).
3. The method according to claim 1 or 2, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized before being output in the bitstream.
4. A method according to any of claims 1 to 3, wherein mapping the perceptual-domain audio signal to the potential feature space representation by the neural network is performed in the time domain; and/or
Wherein obtaining the perceptual domain audio signal is performed in the frequency domain.
5. A computer-implemented method of decoding an audio signal using a neural network, wherein the method comprises the steps of:
(a) Obtaining a representation of a perceptual domain audio signal by decoding a received bitstream;
(b) Inputting the representation of the perceptual domain audio signal into the neural network to process the representation of the perceptual domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
6. The method of claim 5, wherein processing the perceptual-domain audio signal through the neural network is performed in the time domain; and/or
Wherein the method further comprises: before step (d), converting the audio signal to the frequency domain.
7. The method of claim 5 or 6, wherein the neural network is conditioned on information indicative of the mask; and/or
Wherein the neural network is conditioned on the perceptual domain audio signal.
8. The method of claim 7, wherein processing the perceptual domain audio signal through the neural network comprises at least one of:
predicting the processed perceptual domain audio signal across time;
predicting the processed perceptual domain audio signal across frequencies; and
predicting the processed perceptual domain audio signal across time and frequency.
9. The method of any of claims 5 to 8, wherein the representation of the perceptual-domain audio signal comprises the perceptual-domain audio signal.
10. The method according to any one of claims 5 to 9,
wherein the representation of the perceptual-domain audio signal is obtained from:
converting an audio signal from the original signal domain to the perceptual domain by applying the mask;
encoding the perceptual domain audio signal; and
decoding the perceptual domain audio signal; and optionally
Wherein quantization is applied to the perceptual domain audio signal prior to encoding and inverse quantization is applied to the perceptual domain audio signal after decoding.
11. The method according to claim 5,
wherein step (a) involves receiving a potential feature space representation of the perceptual domain audio signal in a bitstream; and
Wherein step (b) involves inputting the potential feature space representation into the neural network to generate the processed perceptual domain audio signal.
12. The method of claim 11, wherein the neural network is conditioned on the potential feature space representation of the perceptual domain audio signal.
13. The method of claim 11 or 12, further comprising receiving additional information indicative of the mask as the bitstream,
wherein the neural network is conditioned on the additional information.
14. The method of any of claims 11 to 13, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is received in quantized form; and
Wherein the method further comprises inverse quantizing the potential feature space representation prior to inputting into the neural network.
15. The method of any of claims 11 to 14, wherein generating the perceptual-domain audio signal by the neural network is performed in the time domain; and/or
Wherein converting the perceptual-domain audio signal into the original signal domain is performed in the frequency domain.
16. A method (e.g., a computer-implemented method) of processing an audio signal using a neural network, wherein the method comprises the steps of:
(a) Obtaining a perceptual domain audio signal;
(b) Inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
17. The method of claim 16, wherein processing the perceptual-domain audio signal through the neural network is performed in the time domain.
18. The method according to claim 16 or 17, wherein the method further comprises: before step (d), converting the audio signal to the frequency domain.
19. The method of any of claims 16 to 18, wherein the neural network is conditioned on information indicative of the mask.
20. The method of any of claims 16 to 19, wherein the neural network is conditioned on the perceptual-domain audio signal.
21. The method of claim 19 or 20, wherein processing the perceptual-domain audio signal through the neural network comprises predicting the processed perceptual-domain audio signal across time.
22. The method of claim 19 or 20, wherein processing the perceptual-domain audio signal through the neural network comprises predicting the processed perceptual-domain audio signal across frequencies.
23. The method of claim 19 or 20, wherein processing the perceptual-domain audio signal through the neural network comprises predicting the processed perceptual-domain audio signal across time and frequency.
24. The method of any of claims 16 to 23, wherein the perceptual-domain audio signal is obtained from:
(a) Converting an audio signal from the original signal domain to the perceptual domain by applying the mask;
(b) Encoding the perceptual domain audio signal; and
(c) Decoding the perceptual domain audio signal.
25. The method of claim 24, wherein quantization is applied to the perceptual domain audio signal prior to encoding and inverse quantization is applied to the perceptual domain audio signal after decoding.
26. A method (e.g., a computer-implemented method) of processing an audio signal using a first neural network and a second neural network, wherein the method comprises the steps of:
(a) Obtaining, by a first device, a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into the first neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation as an output of the first neural network;
(d) Transmitting the potential feature space representation of the perceptual domain audio signal and the mask to a second device;
(e) Receiving, by the second device, the potential feature space representation of the perceptual domain audio signal and the mask;
(f) Inputting the potential feature space representation into the second neural network to generate an approximate perceptual domain audio signal;
(g) Obtaining the approximated perceptual domain audio signal as an output of the second neural network; and
(h) Converting the approximate perceptual domain audio signal to the original signal domain based on the mask.
27. The method of claim 26, wherein the method further comprises: encoding the potential feature space representation of the perceptual domain audio signal and the mask into a bitstream, and transmitting the bitstream to the second device; and wherein the method further comprises: receiving, by the second device, the bitstream, and decoding the bitstream to obtain the potential feature space representation of the perceptual domain audio signal and the mask.
28. The method of claim 27, wherein the potential feature space representation of the perceptual domain audio signal and the mask are quantized before being encoded into the bitstream and dequantized before being processed by the second neural network.
29. The method of any of claims 26 to 28, wherein the second neural network is conditioned on the potential feature space representation of the perceptual domain audio signal and/or the mask.
30. The method of any of claims 26 to 29, wherein mapping the perceptual-domain audio signal to the potential feature space representation by the first neural network and generating the approximate perceptual-domain audio signal by the second neural network are performed in the time domain.
31. The method according to any one of claims 26 to 30, wherein obtaining the perceptual domain audio signal in step (a) and converting the approximate perceptual domain audio signal in step (h) are performed in the frequency domain.
32. A method (e.g., a computer-implemented method) of jointly training a set of first and second neural networks, wherein the method comprises the steps of:
(a) Inputting a perceptual domain audio training signal into the first neural network to map the perceptual domain audio training signal to a potential feature space representation;
(b) Obtaining the potential feature space representation of the perceptual domain audio training signal as an output of the first neural network;
(c) Inputting the potential feature space representation of the perceptual domain audio training signal into the second neural network to generate an approximate perceptual domain audio training signal;
(d) Obtaining the approximate perceptual domain audio training signal as an output of the second neural network; and
(e) Iteratively adjusting parameters of the first neural network and the second neural network based on a difference between the approximate perceptual domain audio training signal and an original perceptual domain audio signal.
33. The method of claim 32, wherein the first neural network and the second neural network are trained in the perceptual domain based on one or more loss functions.
34. The method of claim 32, wherein the first neural network and the second neural network are trained in the perceptual domain based on negative log-likelihood conditions.
35. A method (e.g., a computer-implemented method) of training a neural network, wherein the method comprises the steps of:
(a) Inputting a perceptual domain audio training signal into the neural network to process the perceptual domain audio training signal;
(b) Obtaining a processed perceptual domain audio training signal as an output of the neural network; and
(c) Iteratively adjusting parameters of the neural network based on differences between the processed perceptual domain audio training signal and an original perceptual domain audio signal.
36. The method of claim 35, wherein the neural network is trained in the perceptual domain based on one or more loss functions.
37. The method of claim 35, wherein the neural network is trained in the perceptual domain based on negative log-likelihood conditions.
38. A method (e.g., a computer-implemented method) of obtaining and transmitting a potential feature space representation of a perceptual domain audio signal using a neural network, the method comprising the steps of:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal as a bitstream.
39. The method of claim 38, wherein, in step (d), further information indicative of the mask is output as the bitstream.
40. The method of claim 38 or 39, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized before being output as the bitstream.
41. The method of any one of claims 38 to 40, wherein mapping the perceptual domain audio signal to the potential feature space representation by the neural network is performed in the time domain.
42. The method of any one of claims 38 to 41, wherein obtaining the perceptual-domain audio signal is performed in the frequency domain.
43. A method (e.g., a computer-implemented method) of obtaining an audio signal from a potential feature space representation of a perceptual domain audio signal using a neural network, the method comprising the steps of:
(a) Receiving a potential feature space representation of a perceptual domain audio signal as a bitstream;
(b) Inputting the potential feature space representation into a neural network to generate the perceptual domain audio signal;
(c) Obtaining the perceptual domain audio signal as an output of the neural network; and
(d) Converting the perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
44. The method according to claim 43, wherein the neural network is conditioned on the potential feature space representation of the perceptual domain audio signal.
45. The method of claim 43 or 44, wherein in step (a) further information indicative of the mask is received as the bitstream, and the neural network is conditioned on the information.
46. The method of any one of claims 43 to 45, wherein the potential feature space representation of the perceptual domain audio signal and/or the information indicative of the mask is quantized upon receipt and inverse quantization is performed prior to step (b).
47. The method of any one of claims 43 to 46, wherein generating the perceptual-domain audio signal by the neural network is performed in the time domain.
48. The method of any of claims 43 to 47, wherein converting the perceptual-domain audio signal to the original signal domain is performed in the frequency domain.
49. An apparatus for processing an audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Obtaining a perceptual domain audio signal;
(b) Inputting the perceptual-domain audio signal into the neural network to process the perceptual-domain audio signal;
(c) Obtaining a processed perceptual domain audio signal as an output of the neural network; and
(d) Converting the processed perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
50. An apparatus for obtaining and transmitting a potential feature space representation of a perceptual domain audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Obtaining a perceptual domain audio signal by applying a mask indicative of a masking threshold derived from a psychoacoustic model to the audio signal in an original signal domain;
(b) Inputting the perceptual domain audio signal into a neural network to map the perceptual domain audio signal to a potential feature space representation;
(c) Obtaining the potential feature space representation of the perceptual domain audio signal as an output of the neural network; and
(d) Outputting the potential feature space representation of the perceptual domain audio signal as a bitstream.
51. An apparatus for obtaining an audio signal from a potential feature space representation of a perceptual domain audio signal using a neural network, wherein the apparatus comprises the neural network and one or more processors configured to perform a method comprising:
(a) Receiving a potential feature space representation of a perceptual domain audio signal as a bitstream;
(b) Inputting the potential feature space representation into a neural network to generate the perceptual domain audio signal;
(c) Obtaining the perceptual domain audio signal as an output of the neural network; and
(d) Converting the perceptual domain audio signal to an original signal domain based on a mask indicative of a masking threshold derived from a psychoacoustic model.
52. An apparatus configured to perform the method of any one of claims 1 to 48.
53. A computer program comprising instructions adapted to cause a device having processing capabilities to perform a method according to any one of claims 1 to 48 when executed by the device.
54. A computer readable storage medium having instructions adapted to cause a device having processing capabilities to perform a method according to any one of claims 1 to 48 when executed by the device.
CN202180076578.3A 2020-10-15 2021-10-14 Method and apparatus for processing audio using neural network Pending CN116457797A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063092118P 2020-10-15 2020-10-15
US63/092,118 2020-10-15
EP20210968.2 2020-12-01
PCT/US2021/055090 WO2022081915A1 (en) 2020-10-15 2021-10-14 Method and apparatus for processing of audio using a neural network

Publications (1)

Publication Number Publication Date
CN116457797A true CN116457797A (en) 2023-07-18

Family

ID=73654712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180076578.3A Pending CN116457797A (en) 2020-10-15 2021-10-14 Method and apparatus for processing audio using neural network

Country Status (1)

Country Link
CN (1) CN116457797A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination