WO2020025140A1 - Sound processing apparatus and method for sound enhancement - Google Patents

Sound processing apparatus and method for sound enhancement

Info

Publication number
WO2020025140A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
training
sound signal
sound
noise signal
Prior art date
Application number
PCT/EP2018/071070
Other languages
French (fr)
Inventor
Peter GROSCHE
Gil Keren
Jing HAN
Bjoern Schuller
Wenyu Jin
Panji Setiawan
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2018/071070 priority Critical patent/WO2020025140A1/en
Priority to EP18752715.5A priority patent/EP3797415B1/en
Publication of WO2020025140A1 publication Critical patent/WO2020025140A1/en

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the invention relates to the field of sound processing. More specifically, the invention relates to a sound processing apparatus and method for sound, in particular speech enhancement.
  • Sound or audio enhancement conventionally uses only a recording of the speech and environment, i.e. noise for producing the enhanced speech audio.
  • audio enhancement procedures often make use of neural networks, such as the speech enhancement procedure described in the article "A Fully Convolutional Neural Network For Speech Enhancement", Se Rim Park and Jinwon Lee, in Proc. Interspeech 2017, August 20-24, 2017, pages 1993-1997, Stockholm, Sweden.
  • embodiments of the invention are based on the idea to use for a plurality of training sound signals, including a training target signal and a training noise signal, the training noise signal as an additional input for training the neural network of a sound processing apparatus for improving the sound enhancement process.
  • the environment recording i.e. the training noise signal can be fed into a dedicated portion of the neural network that outputs an audio environment representation defined, for instance, by a parameter set.
  • the environment representation in turn, can be fed as an additional input to another portion of the neural network that produces the enhanced sound.
  • by explicitly learning to represent audio environments and using these representations for enhancement, embodiments of the invention make it possible to perform efficient speech enhancement in unpredictable audio environments.
  • the invention relates to a sound, in particular speech processing apparatus configured to process a current noisy sound signal comprising a target signal and a current noise signal into an enhanced, i.e. de-noised sound signal.
  • the apparatus, which could be implemented, for instance, as a loudspeaker, a mobile phone and the like, comprises processing circuitry, in particular one or more processors, configured to provide an adjustable neural network.
  • the adjustable neural network is configured to be trained, i.e. conditioned using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal, preferably using a set of training sound signals comprising a plurality of training target signals and a plurality of training noise signals.
  • the adjustable neural network is configured to adjust itself on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal.
  • the processing circuitry is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
  • an improved sound processing apparatus and method allowing for an improved enhancement of the current noisy sound signal.
  • the neural network can better separate the target signal from the sounds originating in the environment and reverberations of both the target sound and the environment sounds.
  • the processing circuitry is configured to transform the training noise signal, the noisy training sound signal and the training target signal from the time domain into the frequency domain, wherein the adjustable neural network is configured, in the training phase, to be trained using the training noise signal, the noisy training sound signal and the training target signal in the frequency domain.
  • the processing circuitry is configured to process the respective log spectra of the training noise signal, the noisy training sound signal and the training target signal.
  • the processing circuitry is configured to transform the current noise signal and the sound signal from the time domain into the frequency domain, wherein, in the application phase, the adjustable neural network is configured to adjust itself on the basis of the current noise signal in the frequency domain and to generate the estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal in the frequency domain and wherein the processing circuitry is configured to process the sound signal into the enhanced sound signal in the frequency domain on the basis of the estimated noise signal in the frequency domain.
  • the processing circuitry is further configured to transform the enhanced sound signal from the frequency domain into the time domain.
  • the processing circuitry is further configured to extract phase information from the sound signal comprising the target signal and the current noise signal and to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal and the extracted phase information.
  • the sound signal is a multi-channel sound signal, wherein the processing circuitry is configured to select a channel of the multi-channel sound signal and to extract the phase information from the selected channel of the multi-channel sound signal.
  • the neural network may be trained to localize sound sources that belong to the environment, and remove sounds originating from these locations from the current noisy sound signal.
  • in the training phase, the neural network is further configured to generate an estimated training noise signal on the basis of the training sound signal comprising the training target signal and the training noise signal, to process the training sound signal into an enhanced training sound signal on the basis of the estimated training noise signal and to be trained by minimizing a difference measure between the training target signal and the enhanced training sound signal.
  • a gradient-based optimization algorithm can be used for training the neural network.
  • the processing circuitry is configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the sound signal.
  • the sound signal is a multi-channel sound signal, wherein the processing circuitry is configured to select a channel of the multi-channel sound signal and to process the multi-channel sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the selected channel of the multi-channel sound signal.
  • the neural network comprises a first neural sub-network and a second neural sub-network, wherein, in the training phase, the first neural sub-network is configured to generate on the basis of the training noise signal a parameter set describing the training noise signal and to provide the parameter set to the second neural sub-network, wherein the second neural sub-network is configured to adjust on the basis of the parameter set provided by the first neural sub-network.
  • in the application phase, the first neural sub-network is configured to generate on the basis of the current noise signal a parameter set describing the current noise signal, i.e. the current environment, and to provide the parameter set to the second neural sub-network, wherein the second neural sub-network is configured to adjust on the basis of the parameter set provided by the first neural sub-network.
  • the neural network can learn how to represent a sound environment not encountered yet, and later use this representation for enhancement.
  • the first neural sub-network and/or the second neural sub-network comprises one or more convolutional layers.
  • the invention relates to a corresponding sound processing method for processing a current noisy sound signal comprising a target signal and a current noise signal into an enhanced, i.e. de-noised sound signal.
  • the method comprises the steps of providing an adjustable neural network; in a training phase, training, i.e. conditioning the adjustable neural network using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and, in an application phase, adjusting the neural network on the basis of the current noise signal, generating an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal, and processing the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
  • the sound processing method according to the second aspect of the invention can be performed by the sound processing apparatus according to the first aspect of the invention. Further features of the sound processing method according to the second aspect of the invention result directly from the functionality of the sound processing apparatus according to the first aspect of the invention and its different implementation forms described above and below.
  • the invention relates to a computer program comprising program code for performing the sound processing method according to the second aspect, when executed on a processor or a computer.
  • the invention can be implemented in hardware and/or software.
  • Fig. 1a shows a schematic diagram illustrating an example of processing blocks implemented in a single channel sound processing apparatus according to an embodiment in a training phase
  • Fig. 1b shows a schematic diagram illustrating an example of processing blocks implemented in a single channel sound processing apparatus according to an embodiment in an application phase
  • Fig. 2a shows a schematic diagram illustrating an example of processing blocks implemented in a multi-channel sound processing apparatus according to an embodiment in a training phase
  • Fig. 2b shows a schematic diagram illustrating an example of processing blocks implemented in a multi-channel sound processing apparatus according to an embodiment in an application phase
  • Fig. 3 shows a flow diagram illustrating an example of a sound processing method according to an embodiment.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures.
  • the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
  • Figure 1a shows a schematic diagram illustrating an example of processing blocks implemented in a single channel sound processing apparatus 100 according to an embodiment in a training phase
  • figure 1b shows a schematic diagram illustrating an example of processing blocks implemented in the single channel sound processing apparatus 100 in an application phase.
  • the sound processing apparatus 100 is configured to process a current noisy sound, in particular speech signal comprising a target signal and a current noise signal into an enhanced, i.e. de-noised sound, in particular speech signal.
  • the apparatus 100 which could be implemented, for instance, as a loudspeaker, a mobile phone and the like, comprises processing circuitry, in particular one or more processors, configured to provide, i.e. implement an adjustable neural network.
  • the adjustable neural network comprises a first neural sub-network 103 and a second neural sub-network 107.
  • the first neural sub-network 103 and/or the second neural sub-network 107 (referred to as "Environment Residual Blocks" 103, 107 in the figures) can comprise one or more residual blocks.
  • the first neural sub-network 103 and the second neural sub-network 107 can constitute independent, i.e. separate neural networks.
  • the neural network, the first neural sub-network 103 and/or the second neural sub-network 107 can comprise one or more convolutional layers. More details about possible implementations of the neural network, the first neural sub-network 103 and/or the second neural sub-network 107 can be found, for instance, in the article "A Fully Convolutional Neural Network For Speech Enhancement", Se Rim Park and Jinwon Lee, in Proc. Interspeech 2017, August 20-24, 2017, pages 1993-1997, Stockholm, Sweden, which is fully incorporated by reference herein.
  • the adjustable neural network 103, 107 of the sound processing apparatus 100 is configured to be trained, i.e. conditioned using as a first input a training noise signal (referred to in figure 1a as "Environment Waveform"), as a second input a noisy training sound signal (referred to in figure 1a as "Environment + speech Waveform") comprising a training target signal and the training noise signal and as a third input the training target signal (referred to in figure 1a as "clean Waveform").
  • the training phase involves processing a set of training sound signals comprising a plurality of known training target signals and a plurality of known training noise signals.
  • the adjustable neural network 103, 107 of the sound processing apparatus 100 is configured to adjust itself on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal.
  • the processing circuitry of the sound processing apparatus is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
  • the processing circuitry of the sound processing apparatus 100 is configured to transform the training noise signal, the noisy training sound signal, the training target signal, the current noise signal and the current sound signal from the time domain into the frequency domain by generating a respective log spectrum thereof.
  • the blocks 101, 105 and 113 can be configured to perform a short-time Fourier transform (STFT) using, for instance, 25 ms frames shifted by 10 ms to extract the spectrum of each signal.
  • the spectrum of the training noise signal (which is provided by block 101 of figure 1 ) is then processed by the first neural sub-network 103.
  • the first neural sub-network 103 comprises a sequence of residual blocks.
  • a respective residual block comprises two parallel paths.
  • the first path can contain two convolutional layers applied one after another, where batch normalization and a rectified-linear non-linearity are applied in between the layers.
  • the second path can contain the identity function.
  • the respective outputs of the two paths can be summed, and a rectified-linear non-linearity can be applied.
  • the output provided by the first neural sub-network 103 is a representation of the environment associated with a respective training noise signal.
  • the first neural sub-network 103 is configured to generate on the basis of the training noise signal provided by block 101 a parameter set, i.e. an environment embedding vector describing the training noise signal, and to provide the parameter set to the second neural sub-network 107, wherein the second neural sub-network 107 is configured to adjust itself on the basis of the parameter set provided by the first neural sub-network 103.
  • the first neural sub-network 103 is configured to generate on the basis of the current noise signal the environment embedding vector describing the current noise signal and to provide the environment embedding vector to the second neural sub-network 107, wherein the second neural sub-network 107 is configured to adjust itself on the basis of the parameter set provided by the first neural sub-network 103.
  • the output of the first neural sub-network 103, i.e. the environment embedding vector describing in the training phase the training noise signal or in the application phase the current noise signal, is used by the second neural sub-network 107 to adjust itself.
  • the parameter set defined by the environment embedding vector is used as an additional input by the second neural sub-network 107 such that the output of the second neural sub-network 107 depends on the environment embedding vector, and is “adjusting” to the noise in that sense.
  • the second neural sub-network 107 comprises a set of residual blocks, each comprised of two convolutional layers.
  • the environment embedding vector is projected (via a linear transformation) to a vector with a dimension equal to the number of feature maps in the convolutional layer. Then, the output of this projection is added to every spatial location in the output map of the convolutional layer.
  • the adjusted second neural sub-network 107 is configured to generate an estimated training noise signal (referred to as "Enhancement Mask" in figure 1a) on the basis of the training sound signal provided by block 105.
  • the adjusted second neural sub-network 107 is configured to generate an estimated noise signal (referred to as "Enhancement Mask" in figure 1b) on the basis of the sound signal provided by block 105.
  • an enhanced training sound signal (referred to as "Enhanced Speech Spectrum" in figure 1a) is generated on the basis of the estimated training noise signal provided by the second neural sub-network 107 and the training sound signal provided by block 105. In an embodiment, this can be done by subtracting the estimated training noise signal from the training sound signal or, alternatively, by adding the negative of the estimated training noise signal to the training sound signal.
  • an enhanced sound signal (referred to as "Enhanced Speech Spectrum" in figure 1b) is generated on the basis of the estimated noise signal provided by the second neural sub-network 107 and the sound signal provided by block 105. In an embodiment, this can be done by subtracting the estimated noise signal from the sound signal or, alternatively, by adding the negative of the estimated noise signal to the sound signal.
  • the output of block 109, i.e. the enhanced training sound signal, is used for training the second neural sub-network 107 by minimizing a difference measure, such as the absolute difference(s), the squared difference(s) and the like, between the training target signal provided by block 113 and the enhanced training sound signal provided by block 109.
  • a gradient-based optimization algorithm can be used for training, i.e. optimizing the model parameters of the second neural sub-network 107.
  • the processing circuitry of the sound processing apparatus 100 can be further configured to extract phase information from the sound signal comprising the target signal and the current noise signal and to transform the spectrum of the enhanced sound signal back into the time domain on the basis of the extracted phase information.
  • the final output of the sound processing apparatus 100 is the enhanced, i.e. de-noised sound signal in the time domain (referred to as "Enhanced Waveform").
  • Figures 2a and 2b show a further embodiment of the sound processing apparatus 100 shown in figures 1a and 1b.
  • the sound processing apparatus 100 is configured to process multi-channel sound signals.
  • in the following, mainly the differences between the embodiment of the sound processing apparatus 100 shown in figures 2a and 2b and the embodiment of the sound processing apparatus 100 shown in figures 1a and 1b will be described.
  • the processing circuitry of the sound processing apparatus 100 can be configured to select a channel of the multi-channel sound signal and to process the multi-channel sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from (or adding its negative to) the selected channel of the multi-channel sound signal.
  • the selected channel could be, for instance, the channel closest to the speaker.
  • the enhanced spectrum is considered the output of the beamforming procedure in the multichannel setting.
  • processing circuitry in block 114 of figure 2b can be configured to select a channel of the multi-channel sound signal and to extract the phase information from the selected channel of the multi-channel sound signal.
  • the multiple channels of the noise signal can be used for localizing the sound sources by processing these channels with a time-frequency transformation that is more localized in time, such as a STFT over frames of 10 ms, shifted by 5 ms, or a wavelet transform.
  • FIG. 3 shows a flow diagram illustrating an example of a corresponding sound processing method 300 according to an embodiment.
  • the method 300 comprises the steps of: providing 301 the adjustable neural network 103, 107; in a training phase 303, training, i.e. conditioning the adjustable neural network 103, 107 using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and, in an application phase 305, adjusting the neural network 107 on the basis of the current noise signal, generating an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal, and processing the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
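The embedding-based conditioning described above (a linear projection of the environment embedding vector to a vector with one entry per feature map, added at every spatial location of the convolutional output) can be sketched in NumPy. The dimensions and variable names below are purely illustrative assumptions, not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_feature_map(feature_map, embedding, proj):
    """Project the environment embedding (linear transformation) to a
    vector with one entry per feature map, then add that entry at every
    spatial location of the corresponding map (via broadcasting)."""
    bias = embedding @ proj                 # shape: (n_maps,)
    return feature_map + bias[:, None, None]

# Toy dimensions (hypothetical): 8 feature maps over a 16x10
# time-frequency patch, and a 4-dimensional environment embedding.
fmap = rng.standard_normal((8, 16, 10))
emb = rng.standard_normal(4)
proj = rng.standard_normal((4, 8))          # the linear projection

out = condition_feature_map(fmap, emb, proj)
assert out.shape == fmap.shape
# Each feature map is shifted by one scalar everywhere:
shift = out - fmap
assert np.allclose(shift, shift[:, :1, :1])
```

Because the projected value is constant over time and frequency, the conditioning biases every spatial position of a feature map identically, which is what lets the noise representation steer the enhancement output.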

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention relates to a sound processing apparatus (100) configured to process a current noisy sound signal comprising a target signal and a current noise signal into an enhanced sound signal. The apparatus (100) comprises processing circuitry configured to provide an adjustable neural network (103, 107), wherein the adjustable neural network (103, 107) is configured: in a training phase, to be trained using as a first input a training noise signal, as a second input a training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and, in an application phase, to adjust on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal. In the application phase, the processing circuitry is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal. Moreover, the invention relates to a corresponding sound processing method.

Description

DESCRIPTION
SOUND PROCESSING APPARATUS AND METHOD FOR SOUND ENHANCEMENT
TECHNICAL FIELD
The invention relates to the field of sound processing. More specifically, the invention relates to a sound processing apparatus and method for sound, in particular speech enhancement.
BACKGROUND
Sound or audio enhancement conventionally uses only a recording of the speech and environment, i.e. noise, for producing the enhanced speech audio. Often audio enhancement procedures make use of neural networks, such as the speech enhancement procedure described in the article "A Fully Convolutional Neural Network For Speech Enhancement", Se Rim Park and Jinwon Lee, in Proc. Interspeech 2017, August 20-24, 2017, pages 1993-1997, Stockholm, Sweden.
However, given only one recording that contains both the speech and the noise created by the environment, it can be difficult, in particular for a neural network to ascertain which components of an audio signal originate from the environment, which components are the clean speech or sound, i.e. the target signal and which components are just reverberation effects of both the speech and the environment. Additionally, in multichannel settings, audio localization can be performed, but sound enhancement may have difficulties predicting whether a given sound source is to be attributed to the speech or the environment.
Thus, there is still a need for an improved sound processing apparatus and method allowing for an improved enhancement of a noisy sound signal.
SUMMARY
It is an object of the invention to provide an improved sound processing apparatus and method allowing for an improved enhancement of a noisy sound signal. The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
Generally, embodiments of the invention are based on the idea to use for a plurality of training sound signals, including a training target signal and a training noise signal, the training noise signal as an additional input for training the neural network of a sound processing apparatus for improving the sound enhancement process. In an embodiment, the environment recording, i.e. the training noise signal can be fed into a dedicated portion of the neural network that outputs an audio environment representation defined, for instance, by a parameter set. The environment representation, in turn, can be fed as an additional input to another portion of the neural network that produces the enhanced sound. By explicitly learning to represent audio environments and using these representations for enhancement, embodiments of the invention make it possible to perform efficient speech enhancement in unpredictable audio environments.
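The two-portion idea can be sketched with placeholder networks. The random linear maps, dimensions and function names below are illustrative stand-ins under assumed toy dimensions, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder "networks": random linear maps standing in for the two
# portions of the neural network (purely illustrative).
W_env = rng.standard_normal((64, 8))        # environment portion -> parameter set
W_enh = rng.standard_normal((64 + 8, 64))   # enhancement portion

def environment_portion(noise_spec):
    """Map a noise-only spectrum frame to an environment representation."""
    return noise_spec @ W_env               # (8,) parameter set

def enhancement_portion(noisy_spec, env_params):
    """Estimate the noise in a noisy frame, conditioned on the
    environment representation fed in as an additional input."""
    return np.concatenate([noisy_spec, env_params]) @ W_enh

noise_frame = rng.standard_normal(64)       # separate environment recording
noisy_frame = rng.standard_normal(64)       # speech + noise

env_params = environment_portion(noise_frame)
noise_estimate = enhancement_portion(noisy_frame, env_params)
enhanced = noisy_frame - noise_estimate     # enhancement by subtraction
assert enhanced.shape == (64,)
```

The key structural point is that the enhancement portion receives the environment representation as an extra input, so its output depends on the separate noise recording.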
More specifically, according to a first aspect the invention relates to a sound, in particular speech processing apparatus configured to process a current noisy sound signal comprising a target signal and a current noise signal into an enhanced, i.e. de-noised sound signal. The apparatus, which could be implemented, for instance, as a loudspeaker, a mobile phone and the like, comprises processing circuitry, in particular one or more processors, configured to provide an adjustable neural network. In a training phase, the adjustable neural network is configured to be trained, i.e. conditioned using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal, preferably using a set of training sound signals comprising a plurality of training target signals and a plurality of training noise signals. In an application phase, the adjustable neural network is configured to adjust itself on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal. The processing circuitry is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
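A training example under this three-input scheme can be assembled from a separate environment recording and a clean target. The signals, amplitudes and sampling rate below are synthetic assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 16000                                     # assumed sampling rate

t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220 * t)           # stand-in clean target signal
noise = 0.3 * rng.standard_normal(fs)          # environment-only recording
noisy = target + noise                         # noisy training sound signal

# The three inputs used in the training phase:
training_example = {
    "first_input_noise": noise,                # training noise signal
    "second_input_noisy": noisy,               # target + noise mixture
    "third_input_target": target,              # training target signal
}
assert np.allclose(training_example["second_input_noisy"],
                   training_example["third_input_target"]
                   + training_example["first_input_noise"])
```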
Thus, an improved sound processing apparatus and method are provided, allowing for an improved enhancement of the current noisy sound signal. By additionally conditioning the neural network on the basis of a separate recording of the environment, i.e. the training noise signal, the neural network can better separate the target signal from the sounds originating in the environment and reverberations of both the target sound and the environment sounds.
In a further possible implementation form of the first aspect, the processing circuitry is configured to transform the training noise signal, the noisy training sound signal and the training target signal from the time domain into the frequency domain, wherein the adjustable neural network is configured, in the training phase, to be trained using the training noise signal, the noisy training sound signal and the training target signal in the frequency domain. In an embodiment, the processing circuitry is configured to process the respective log spectra of the training noise signal, the noisy training sound signal and the training target signal.
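A log-spectrum front end of this kind might look as follows in NumPy. The 16 kHz sampling rate is an assumption; the 25 ms frames shifted by 10 ms follow the framing mentioned for the STFT blocks of the embodiment:

```python
import numpy as np

def log_spectrum(x, fs=16000, frame_ms=25, shift_ms=10, eps=1e-8):
    """Log-magnitude STFT with frame_ms frames shifted by shift_ms.
    eps avoids taking log of zero magnitudes."""
    frame = int(fs * frame_ms / 1000)          # 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)          # 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame) // shift
    window = np.hanning(frame)
    frames = np.stack([x[i * shift : i * shift + frame] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)     # one-sided spectrum
    return np.log(np.abs(spectrum) + eps)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
S = log_spectrum(x)
assert S.shape == (98, 201)   # 98 frames, 201 frequency bins
```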
In a further possible implementation form of the first aspect, the processing circuitry is configured to transform the current noise signal and the sound signal from the time domain into the frequency domain, wherein, in the application phase, the adjustable neural network is configured to adjust itself on the basis of the current noise signal in the frequency domain and to generate the estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal in the frequency domain and wherein the processing circuitry is configured to process the sound signal into the enhanced sound signal in the frequency domain on the basis of the estimated noise signal in the frequency domain.
In a further possible implementation form of the first aspect, the processing circuitry is further configured to transform the enhanced sound signal from the frequency domain into the time domain.
In a further possible implementation form of the first aspect, in the application phase, the processing circuitry is further configured to extract phase information from the sound signal comprising the target signal and the current noise signal and to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal and the extracted phase information.
In a further possible implementation form of the first aspect, the sound signal is a multi-channel sound signal, wherein, in the application phase, the processing circuitry is configured to select a channel of the multi-channel sound signal and to extract the phase information from the selected channel of the multi-channel sound signal. In multi-channel embodiments of the sound processing apparatus, the neural network may be trained to localize sound sources that belong to the environment, and remove sounds originating from these locations from the current noisy sound signal.
In a further possible implementation form of the first aspect, in the training phase, the neural network is further configured to generate an estimated training noise signal on the basis of the training sound signal comprising the training target signal and the training noise signal, to process the training sound signal into an enhanced training sound signal on the basis of the estimated training noise signal and to be trained by minimizing a difference measure between the training target signal and the enhanced training sound signal. In an implementation form, a gradient-based optimization algorithm can be used for training the neural network.
In a further possible implementation form of the first aspect, in the application phase, the processing circuitry is configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the sound signal.
In a further possible implementation form of the first aspect, the sound signal is a multi-channel sound signal, wherein, in the application phase, the processing circuitry is configured to select a channel of the multi-channel sound signal and to process the multi-channel sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the selected channel of the multi-channel sound signal.
In a further possible implementation form of the first aspect, the neural network comprises a first neural sub-network and a second neural sub-network, wherein, in the training phase, the first neural sub-network is configured to generate on the basis of the training noise signal a parameter set describing the training noise signal and to provide the parameter set to the second neural sub-network, wherein the second neural sub-network is configured to adjust on the basis of the parameter set provided by the first neural sub-network.
In a further possible implementation form of the first aspect, in the application phase, the first neural sub-network is configured to generate on the basis of the current noise signal a parameter set describing the current noise signal, i.e. the current environment, and to provide the parameter set to the second neural sub-network, wherein the second neural sub-network is configured to adjust on the basis of the parameter set provided by the first neural sub-network.
By explicitly producing an environment representation using, for instance, a parameter set or vector, as implemented in embodiments of the invention, the neural network can learn how to represent a sound environment not encountered yet, and later use this representation for an improved sound enhancement.
In a further possible implementation form of the first aspect, the first neural sub-network and/or the second neural sub-network comprises one or more convolutional layers.
According to a second aspect the invention relates to a corresponding sound processing method for processing a current noisy sound signal comprising a target signal and a current noise signal into an enhanced, i.e. de-noised sound signal. The method comprises the steps of providing an adjustable neural network; in a training phase, training, i.e. conditioning the adjustable neural network using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and, in an application phase, adjusting the neural network on the basis of the current noise signal, generating an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal, and processing the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
The sound processing method according to the second aspect of the invention can be performed by the sound processing apparatus according to the first aspect of the invention. Further features of the sound processing method according to the second aspect of the invention result directly from the functionality of the sound processing apparatus according to the first aspect of the invention and its different implementation forms described above and below.
According to a third aspect the invention relates to a computer program comprising program code for performing the sound processing method according to the second aspect, when executed on a processor or a computer. The invention can be implemented in hardware and/or software.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, wherein:
Fig. 1a shows a schematic diagram illustrating an example of processing blocks implemented in a single-channel sound processing apparatus according to an embodiment in a training phase;
Fig. 1b shows a schematic diagram illustrating an example of processing blocks implemented in a single-channel sound processing apparatus according to an embodiment in an application phase;
Fig. 2a shows a schematic diagram illustrating an example of processing blocks implemented in a multi-channel sound processing apparatus according to an embodiment in a training phase;
Fig. 2b shows a schematic diagram illustrating an example of processing blocks implemented in a multi-channel sound processing apparatus according to an embodiment in an application phase; and
Fig. 3 shows a flow diagram illustrating an example of a sound processing method according to an embodiment.
In the various figures, identical reference signs will be used for identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the invention may be placed. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the invention is defined by the appended claims.
For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1a shows a schematic diagram illustrating an example of processing blocks implemented in a single-channel sound processing apparatus 100 according to an embodiment in a training phase, while figure 1b shows a schematic diagram illustrating an example of processing blocks implemented in the single-channel sound processing apparatus 100 in an application phase.
As will be described in more detail further below, the sound processing apparatus 100 is configured to process a current noisy sound signal, in particular a speech signal, comprising a target signal and a current noise signal into an enhanced, i.e. de-noised, sound signal, in particular a speech signal.
The apparatus 100, which could be implemented, for instance, as a loudspeaker, a mobile phone and the like, comprises processing circuitry, in particular one or more processors, configured to provide, i.e. implement, an adjustable neural network. In the embodiment shown in figures 1a and 1b, the adjustable neural network comprises a first neural sub-network 103 and a second neural sub-network 107. In an embodiment, the first neural sub-network 103 and/or the second neural sub-network 107 (referred to as "Environment Residual Blocks" 103, 107 in the figures) can comprise one or more residual blocks. In further embodiments, the first neural sub-network 103 and the second neural sub-network 107 can constitute independent, i.e. separate, neural networks. In an embodiment, the neural network, the first neural sub-network 103 and/or the second neural sub-network 107 can comprise one or more convolutional layers. More details about possible implementations of the neural network, the first neural sub-network 103 and/or the second neural sub-network 107 can be found, for instance, in the article "A Fully Convolutional Neural Network For Speech Enhancement", Se Rim Park and Jinwon Lee, in Proc. Interspeech 2017, August 20-24, 2017, pages 1993-1997, Stockholm, Sweden, which is fully incorporated by reference herein.
In a training phase, the adjustable neural network 103, 107 of the sound processing apparatus 100 is configured to be trained, i.e. conditioned, using as a first input a training noise signal (referred to in figure 1a as "Environment Waveform"), as a second input a noisy training sound signal (referred to in figure 1a as "Environment + speech Waveform") comprising a training target signal and the training noise signal and as a third input the training target signal (referred to in figure 1a as "clean Waveform"). Usually, the training phase involves processing a set of training sound signals comprising a plurality of known training target signals and a plurality of known training noise signals.
In an application phase, the adjustable neural network 103, 107 of the sound processing apparatus 100 is configured to adjust itself on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal. The processing circuitry of the sound processing apparatus is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
As illustrated by blocks 101, 105 and 113 in figures 1a and 1b, in an embodiment, the processing circuitry of the sound processing apparatus 100 is configured to transform the training noise signal, the noisy training sound signal, the training target signal, the current noise signal and the current sound signal from the time domain into the frequency domain by generating a respective log spectrum thereof. To this end, the blocks 101, 105 and 113 can be configured to perform a short-time Fourier transform (STFT) using, for instance, 25 ms frames shifted by 10 ms to extract the spectrum of each signal.
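The framing described above (25 ms frames shifted by 10 ms, with a log spectrum extracted per frame) can be sketched in NumPy as follows. The sample rate, the Hann window and the flooring constant `eps` are illustrative assumptions not fixed by the text.

```python
import numpy as np

def log_spectrum(x, sr=16000, frame_ms=25, shift_ms=10, eps=1e-8):
    """Log-magnitude STFT as in blocks 101/105/113: 25 ms frames, 10 ms shift.
    Sample rate, window choice and eps are illustrative assumptions."""
    n = int(sr * frame_ms / 1000)      # samples per frame (400 at 16 kHz)
    hop = int(sr * shift_ms / 1000)    # frame shift (160 at 16 kHz)
    win = np.hanning(n)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    spec = np.fft.rfft(np.stack(frames), axis=1)       # (frames, n//2+1) bins
    return np.log(np.abs(spec) + eps), np.angle(spec)  # log magnitude, phase

# toy input: 1 s of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
mag, phase = log_spectrum(np.sin(2 * np.pi * 440 * t))
```

The phase returned here is the information that block 114 later extracts for waveform reconstruction.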
The spectrum of the training noise signal (which is provided by block 101 of figure 1a) is then processed by the first neural sub-network 103. In an embodiment, the first neural sub-network 103 comprises a sequence of residual blocks. In an embodiment, a respective residual block comprises two parallel paths. The first path can contain two convolutional layers applied one after another, where batch normalization and a rectified-linear non-linearity are applied in between the layers. The second path can contain the identity function. The respective outputs of the two paths can be summed, and a rectified-linear non-linearity can be applied. The output provided by the first neural sub-network 103 is a representation of the environment associated with a respective training noise signal (referred to as "Environment Embedding" in the figures). Thus, in an embodiment, in the training phase (illustrated in figure 1a), the first neural sub-network 103 is configured to generate on the basis of the training noise signal provided by block 101 a parameter set, i.e. an environment embedding vector describing the training noise signal, and to provide the parameter set to the second neural sub-network 107, wherein the second neural sub-network 107 is configured to adjust itself on the basis of the parameter set provided by the first neural sub-network 103. Likewise, in the application phase (illustrated in figure 1b), the first neural sub-network 103 is configured to generate on the basis of the current noise signal the environment embedding vector describing the current noise signal and to provide the environment embedding vector to the second neural sub-network 107, wherein the second neural sub-network 107 is configured to adjust itself on the basis of the parameter set provided by the first neural sub-network 103.
The output of the first neural sub-network 103, i.e. the environment embedding vector describing in the training phase the training noise signal or in the application phase the current noise signal, is used by the second neural sub-network 107 to adjust itself. In other words, the parameter set defined by the environment embedding vector is used as an additional input by the second neural sub-network 107 such that the output of the second neural sub-network 107 depends on the environment embedding vector, and is “adjusting” to the noise in that sense. There can be multiple ways for the second neural sub-network 107 to use this additional input, which also depend on the inner structure of the second neural sub-network 107. In one embodiment, the second neural sub-network 107 comprises a set of residual blocks, each comprised of two convolutional layers. For each convolutional layer, the environment embedding vector is projected (a linear transformation) to a vector with a dimension equal to the number of feature maps in the convolutional layer. Then, the output of this projection is added to every spatial location in the output map of the convolutional layer.
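The conditioning mechanism just described, projecting the environment embedding vector to the number of feature maps of a convolutional layer and adding the result at every spatial location of that layer's output, can be sketched as follows. The tensor dimensions and the projection matrix are illustrative assumptions.

```python
import numpy as np

def condition_on_embedding(feature_maps, embedding, proj):
    """Add a linear projection of the environment embedding to every spatial
    location of a convolutional layer's output map.
    feature_maps: (C, H, W); embedding: (D,); proj: (C, D)."""
    bias = proj @ embedding              # (C,): one offset per feature map
    return feature_maps + bias[:, None, None]

rng = np.random.default_rng(1)
fmap = rng.standard_normal((16, 8, 8))   # 16 feature maps of a conv layer
emb = rng.standard_normal(32)            # environment embedding vector
proj = rng.standard_normal((16, 32)) * 0.1
out = condition_on_embedding(fmap, emb, proj)
```

In this way the second sub-network's output depends on the embedding, which is what lets it "adjust" to the current noise environment.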
In the training phase, the adjusted second neural sub-network 107 is configured to generate an estimated training noise signal (referred to as "Enhancement Mask" in figure 1a) on the basis of the training sound signal provided by block 105. Likewise, in the application phase, the adjusted second neural sub-network 107 is configured to generate an estimated noise signal (referred to as "Enhancement Mask" in figure 1b) on the basis of the sound signal provided by block 105. In the training phase, in block 109 of the sound processing apparatus 100 shown in figure 1a, an enhanced training sound signal (referred to as "Enhanced Speech Spectrum" in figure 1a) is generated on the basis of the estimated training noise signal provided by the second neural sub-network 107 and the training sound signal provided by block 105. In an embodiment, this can be done by subtracting the estimated training noise signal from the training sound signal or, alternatively, by adding the negative of the estimated training noise signal to the training sound signal.
Likewise, in the application phase, in block 109 of the sound processing apparatus 100 shown in figure 1b, an enhanced sound signal (referred to as "Enhanced Speech Spectrum" in figure 1b) is generated on the basis of the estimated noise signal provided by the second neural sub-network 107 and the sound signal provided by block 105. In an embodiment, this can be done by subtracting the estimated noise signal from the sound signal or, alternatively, by adding the negative of the estimated noise signal to the sound signal.
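The combination step in block 109, subtracting the estimated noise spectrum from the noisy spectrum (equivalently, adding its negative), can be sketched as follows. The flooring at zero is an illustrative safeguard against negative magnitudes, not something mandated by the text.

```python
import numpy as np

def enhance(noisy_spec, estimated_noise_spec, floor=0.0):
    """Block 109: subtract the estimated noise spectrum from the noisy
    spectrum. The floor (an assumption) prevents negative magnitudes."""
    return np.maximum(noisy_spec - estimated_noise_spec, floor)

# toy 2-frame, 2-bin magnitude spectra
noisy = np.array([[1.0, 2.0], [0.5, 3.0]])
noise = np.array([[0.4, 0.5], [0.7, 1.0]])
enhanced = enhance(noisy, noise)
```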
In the training phase shown in figure 1a, the output of block 109, i.e. the enhanced training sound signal, is used for training the second neural sub-network 107 by minimizing a difference measure, such as the absolute difference(s), the squared difference(s) and the like, between the training target signal provided by block 113 and the enhanced training sound signal provided by block 109. In an embodiment, a gradient-based optimization algorithm can be used for training, i.e. optimizing, the model parameters of the second neural sub-network 107.
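The training objective above, minimizing a difference measure such as the squared difference between the training target and the enhanced output via gradient-based optimization, can be illustrated with plain gradient descent on a toy scalar "mask" model. The single-parameter model and the learning rate are illustrative stand-ins, not the patent's network.

```python
import numpy as np

def loss_and_grad(w, noisy, target):
    """Squared-difference measure between target and enhanced = noisy - w*noisy.
    The scalar w is a toy stand-in for the second sub-network's parameters."""
    err = (noisy - w * noisy) - target
    loss = np.mean(err ** 2)
    grad = np.mean(2.0 * err * (-noisy))   # d(loss)/d(w)
    return loss, grad

rng = np.random.default_rng(2)
target = rng.standard_normal(256)           # "clean" training target
noisy = target + 0.5 * rng.standard_normal(256)  # target plus noise

w, lr = 0.0, 0.1
losses = []
for _ in range(50):                         # plain gradient descent steps
    loss, grad = loss_and_grad(w, noisy, target)
    losses.append(loss)
    w -= lr * grad
```

In practice the gradient is computed by backpropagation through both sub-networks rather than by this closed-form expression.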
In the application phase shown in figure 1b, in block 112 (referred to as "Waveform Reconstruction" in figure 1b) the spectrum of the enhanced sound signal provided by block 109 is transformed back into the time domain. To this end, as illustrated by block 114 of figure 1b, the processing circuitry of the sound processing apparatus 100 can be further configured to extract phase information from the sound signal comprising the target signal and the current noise signal and to transform the spectrum of the enhanced sound signal back into the time domain on the basis of the extracted phase information. In the application phase, the final output of the sound processing apparatus 100 is the enhanced, i.e. de-noised, sound signal in the time domain (referred to as "Enhanced Waveform").

Figures 2a and 2b show a further embodiment of the sound processing apparatus 100 shown in figures 1a and 1b. In the embodiment shown in figures 2a and 2b, the sound processing apparatus 100 is configured to process multi-channel sound signals. In the following, only the main differences between the embodiment of the sound processing apparatus 100 shown in figures 2a and 2b and the embodiment of the sound processing apparatus 100 shown in figures 1a and 1b will be described.
As can be taken from figure 2b illustrating the application phase, the processing circuitry of the sound processing apparatus 100 can be configured to select a channel of the multi-channel sound signal and to process the multi-channel sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from (or adding its negative to) the selected channel of the multi-channel sound signal. The selected channel could be, for instance, the channel closest to the speaker. The enhanced spectrum is considered the output of the beamforming procedure in the multi-channel setting.
Moreover, the processing circuitry in block 114 of figure 2b can be configured to select a channel of the multi-channel sound signal and to extract the phase information from the selected channel of the multi-channel sound signal.
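The reconstruction performed by blocks 112 and 114, combining the enhanced magnitude spectrum with the phase extracted from the (selected channel of the) noisy input and transforming back to the time domain, might look as follows. The simple overlap-add scheme without window-sum normalization is a standard but assumed choice; the frame and hop lengths match the STFT sketch earlier.

```python
import numpy as np

def reconstruct(enhanced_mag, noisy_phase, frame_len=400, hop=160):
    """Block 112: combine enhanced magnitude with the noisy signal's phase
    (block 114), inverse-FFT each frame and overlap-add into a waveform."""
    spec = enhanced_mag * np.exp(1j * noisy_phase)     # complex spectrum
    frames = np.fft.irfft(spec, n=frame_len, axis=1)   # real time-domain frames
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f          # overlap-add
    return out

# toy enhanced spectrum: 10 frames, 201 bins (frame_len // 2 + 1)
rng = np.random.default_rng(3)
mag = np.abs(rng.standard_normal((10, 201)))
phase = rng.uniform(-np.pi, np.pi, (10, 201))
wave = reconstruct(mag, phase)
```

A production implementation would additionally normalize by the summed analysis/synthesis windows to obtain perfect reconstruction.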
The multiple channels of the noise signal can be used for localizing the sound sources by processing these channels with a time-frequency transformation that is more localized in time, such as an STFT over frames of 10 ms, shifted by 5 ms, or a wavelet transform.
Figure 3 shows a flow diagram illustrating an example of a corresponding sound processing method 300 according to an embodiment. The method 300 comprises the steps of: providing 301 the adjustable neural network 103, 107; in a training phase 303, training, i.e. conditioning, the adjustable neural network 103, 107 using as a first input a training noise signal, as a second input a noisy training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and, in an application phase 305, adjusting the neural network 107 on the basis of the current noise signal, generating an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal, and processing the sound signal into the enhanced sound signal on the basis of the estimated noise signal.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives, may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless of whether they are in direct physical or electrical contact, or they are not in direct contact with each other.
Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.
Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. A sound processing apparatus (100) configured to process a sound signal comprising a target signal and a current noise signal into an enhanced sound signal, wherein the apparatus (100) comprises: processing circuitry configured to provide an adjustable neural network (103, 107), wherein the adjustable neural network (103, 107) is configured: in a training phase, to be trained using as a first input a training noise signal, as a second input a training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal, and in an application phase, to adjust itself on the basis of the current noise signal and to generate an estimated noise signal on the basis of the sound signal; wherein, in the application phase, the processing circuitry is further configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
2. The apparatus (100) of claim 1, wherein the processing circuitry is configured to transform the training noise signal, the training sound signal and the training target signal from a time domain into a frequency domain and wherein the adjustable neural network is configured, in the training phase, to be trained using the training noise signal, the training sound signal and the training target signal in the frequency domain.
3. The apparatus (100) of claim 1 or 2, wherein the processing circuitry is configured to transform the current noise signal and the sound signal from the time domain into the frequency domain, wherein, in the application phase, the adjustable neural network is configured to adjust itself on the basis of the current noise signal in the frequency domain and to generate the estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal in the frequency domain and wherein the processing circuitry is configured to process the sound signal into the enhanced sound signal in the frequency domain on the basis of the estimated noise signal in the frequency domain.
4. The apparatus (100) of claim 3, wherein the processing circuitry is further configured to transform the enhanced sound signal from the frequency domain into the time domain.
5. The apparatus (100) of any one of the preceding claims, wherein, in the application phase, the processing circuitry is further configured to extract phase information from the sound signal comprising the target signal and the current noise signal and to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal and the extracted phase information.
6. The apparatus (100) of claim 5, wherein the sound signal is a multi-channel sound signal and wherein, in the application phase, the processing circuitry is configured to select a channel of the multi-channel sound signal and to extract the phase information from the selected channel of the multi-channel sound signal.
7. The apparatus (100) of any one of the preceding claims, wherein, in the training phase, the neural network is further configured to generate an estimated training noise signal on the basis of the training sound signal comprising the training target signal and the training noise signal, to process the training sound signal into an enhanced training sound signal on the basis of the estimated training noise signal and to be trained by minimizing a difference measure between the training target signal and the enhanced training sound signal.
8. The apparatus (100) of any one of the preceding claims, wherein, in the application phase, the processing circuitry is configured to process the sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the sound signal.
9. The apparatus (100) of any one of the preceding claims, wherein the sound signal is a multi-channel sound signal and wherein, in the application phase, the processing circuitry is configured to select a channel of the multi-channel sound signal and to process the multi-channel sound signal into the enhanced sound signal on the basis of the estimated noise signal by subtracting the estimated noise signal from the selected channel of the multi-channel sound signal.
10. The apparatus (100) of any one of the preceding claims, wherein the neural network (103, 107) comprises a first neural sub-network (103) and a second neural sub-network (107), wherein, in the training phase, the first neural sub-network (103) is configured to generate on the basis of the training noise signal a parameter set describing the training noise signal and to provide the parameter set to the second neural sub-network (107), wherein the second neural sub-network (107) is configured to adjust on the basis of the parameter set provided by the first neural sub-network (103).
11. The apparatus (100) of any one of the preceding claims, wherein the neural network (103, 107) comprises a first neural sub-network (103) and a second neural sub-network (107), wherein, in the application phase, the first neural sub-network (103) is configured to generate on the basis of the current noise signal a parameter set describing the current noise signal and to provide the parameter set to the second neural sub-network (107), wherein the second neural sub-network (107) is configured to adjust on the basis of the parameter set provided by the first neural sub-network (103).
12. The apparatus (100) of claim 10 or 11, wherein the first neural sub-network (103) and/or the second neural sub-network (107) comprises one or more convolutional layers.
13. A sound processing method (300) for processing a sound signal comprising a target signal and a current noise signal into an enhanced sound signal, wherein the method (300) comprises: providing (301 ) an adjustable neural network (103, 107); in a training phase (303), training the adjustable neural network (103, 107) using as a first input a training noise signal, as a second input a training sound signal comprising a training target signal and the training noise signal and as a third input the training target signal; and in an application phase (305), adjusting the neural network (103, 107) on the basis of the current noise signal, generating an estimated noise signal on the basis of the sound signal comprising the target signal and the current noise signal, and processing the sound signal into the enhanced sound signal on the basis of the estimated noise signal.
14. A computer program comprising program code for performing the method (300) of claim 13, when executed on a computer or a processor.
Non-Patent Citations

ANURAG KUMAR ET AL.: "Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks", Proc. Interspeech 2016, 12 September 2016, pages 3738-3742.
CHOI, J. ET AL.: "An auditory-based adaptive speech enhancement system by neural network according to noise intensity", 42nd Midwest Symposium on Circuits and Systems, 8-11 August 1999, vol. 2, pages 993-996.
SE RIM PARK; JINWON LEE: "A Fully Convolutional Neural Network For Speech Enhancement", Proc. Interspeech 2017, 20 August 2017, pages 1993-1997.
YONG XU ET AL.: "Dynamic Noise Aware Training for Speech Enhancement Based on Deep Neural Networks", Interspeech 2014, 14 September 2014, pages 2670-2674.
Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768795A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Noise suppression method, device, equipment and storage medium for voice signal
CN111933171A (en) * 2020-09-21 2020-11-13 北京达佳互联信息技术有限公司 Noise reduction method and device, electronic equipment and storage medium
CN111933171B (en) * 2020-09-21 2021-01-22 北京达佳互联信息技术有限公司 Noise reduction method and device, electronic equipment and storage medium
CN112767908A (en) * 2020-12-29 2021-05-07 安克创新科技股份有限公司 Active noise reduction method based on key sound recognition, electronic equipment and storage medium
CN112767908B (en) * 2020-12-29 2024-05-21 安克创新科技股份有限公司 Active noise reduction method based on key voice recognition, electronic equipment and storage medium
CN113780107A (en) * 2021-08-24 2021-12-10 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model

Also Published As

Publication number Publication date
EP3797415B1 (en) 2024-06-19
EP3797415A1 (en) 2021-03-31

Similar Documents

Publication Publication Date Title
EP3797415B1 (en) Sound processing apparatus and method for sound enhancement
US9060052B2 (en) Single channel, binaural and multi-channel dereverberation
CN102907120B (en) For the system and method for acoustic processing
US9681246B2 (en) Bionic hearing headset
KR102059486B1 (en) Method and apparatus for decoding stereo loudspeaker signals from a higher-order ambisonics audio signal
JP6198800B2 (en) Apparatus and method for generating an output signal having at least two output channels
JP6377249B2 (en) Apparatus and method for enhancing an audio signal and sound enhancement system
EP3005362B1 (en) Apparatus and method for improving a perception of a sound signal
JP6280983B2 (en) Apparatus and method for center signal scaling and stereophonic enhancement based on signal-to-downmix ratio
JP5906312B2 (en) Method and apparatus for decomposing stereo recordings using frequency domain processing using a spectral weight generator
CN103180752B (en) For resolving equipment and the method for the fuzziness arriving direction estimation
Marquardt et al. Interaural coherence preservation for binaural noise reduction using partial noise estimation and spectral postfiltering
JP6434157B2 (en) Audio signal processing apparatus and method
US20220337952A1 (en) Content based spatial remixing
JP2007047427A (en) Sound processor
KR20170092669A (en) An audio signal processing apparatus and method for modifying a stereo image of a stereo signal
Marelli et al. Efficient approximation of head-related transfer functions in subbands for accurate sound localization
Kabzinski et al. An adaptive crosstalk cancellation system using microphones at the ears
Pirhosseinloo et al. An Interaural Magnification Algorithm for Enhancement of Naturally-Occurring Level Differences.
Cobos et al. Resynthesis of sound scenes on wave-field synthesis from stereo mixtures using sound source separation algorithms
KR102547423B1 (en) Audio signal processor, system and methods for distributing an ambient signal to a plurality of ambient signal channels
JP6832095B2 (en) Channel number converter and its program
Grubesa et al. Speaker Recognition Method combining FFT, Wavelet Functions and Neural Networks
CN117896666A (en) Method for playback of audio data, electronic device and storage medium
Marin-Hurtado et al. Preservation of localization cues in BSS-based noise reduction: Application in binaural hearing aids

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18752715

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018752715

Country of ref document: EP

Effective date: 20201221

NENP Non-entry into the national phase

Ref country code: DE