CN116348884A - Method and apparatus for audio processing using convolutional neural network architecture

Info

Publication number
CN116348884A
Authority
CN
China
Prior art keywords
cnn
path
output
convolution
layer
Prior art date
Legal status
Pending
Application number
CN202180071332.7A
Other languages
Chinese (zh)
Inventor
孙俊岱
芦烈
双志伟
Current Assignee
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2021/055672 (published as WO2022087009A1)
Publication of CN116348884A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Systems, methods, and computer program products for audio processing based on Convolutional Neural Networks (CNNs) are described. A first CNN architecture may include the contracting path of a U-net, a multi-scale CNN, and the expansion path of the U-net. The contracting path may include a first encoding layer and may be configured to generate an output representation of the contracting path. The multi-scale CNN may be configured to generate an intermediate representation based on the output representation of the contracting path, and may include at least two parallel convolution paths. The expansion path may include a first decoding layer and may be configured to generate a final representation based on the intermediate representation generated by the multi-scale CNN. In a second CNN architecture, the first encoding layer may include a first multi-scale CNN having at least two parallel convolution paths, and the first decoding layer may include a second multi-scale CNN having at least two parallel convolution paths.

Description

Method and apparatus for audio processing using convolutional neural network architecture
Cross Reference to Related Applications
The present application claims priority to the following priority applications: PCT international application PCT/CN2020/121829, filed 19 October 2020; U.S. provisional application 63/112,220, filed 11 November 2020; and EP application 20211501.0, filed 3 December 2020.
Technical Field
The present disclosure relates generally to methods and apparatus for audio processing using Convolutional Neural Networks (CNNs). More specifically, the present disclosure relates to extracting speech from an original noisy speech signal using a U-net-based CNN architecture.
Although some embodiments will be described herein with particular reference to this disclosure, it should be appreciated that the disclosure is not limited to such a field of use and is applicable in broader contexts.
Background
Any discussion of the background art throughout the disclosure should in no way be considered as an admission that such art is widely known or forms part of the common general knowledge in the field.
Deep Neural Networks (DNNs) have become a viable option for addressing various audio processing problems. Types of DNNs include the feed-forward Multilayer Perceptron (MLP), the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and the Generative Adversarial Network (GAN). Among these, the CNN is a type of feed-forward network.
The U-Net architecture [O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2015, pp. 234-241] was introduced in biomedical imaging to improve the precision and localization of microscopic images of neuronal structures. The architecture is built on a stack of convolutional layers, as shown in fig. 1. Each downsampling layer 11, 12, 13 halves the size of the image and doubles the number of channels, so that the image is encoded into a small (lower-dimensional) and deep representation. The encoded latent features are then decoded back to the original image size by the stack of upsampling layers 14, 15, 16.
In recent years, the U-Net architecture has been used in the field of audio processing by treating the audio spectrogram as an image. The U-net architecture can thus be applied to a variety of audio processing problems, including sound separation, speech enhancement, and speech source separation. Speech source separation aims at recovering target speech from background interference and has many applications in the field of speech and/or audio technology. In this context, speech source separation is also commonly referred to as the "cocktail party problem". Extracting dialogue from professional content (such as movies and TV) is particularly challenging due to the complex background.
It is an object of this document to provide a novel U-net-based CNN architecture that can be applied in various fields of audio processing, including sound separation, speech enhancement, and speech source separation.
Disclosure of Invention
According to a first aspect of the present disclosure, a Convolutional Neural Network (CNN) architecture is provided. For example, the CNN architecture may be implemented by a computing system. The CNN architecture may include the contracting path of a U-net, a multi-scale CNN, and the expansion path of the U-net. The contracting path may include a first encoding layer and may be configured to generate an output representation of the contracting path. The multi-scale CNN may be configured to generate an intermediate representation based on the output representation of the contracting path. The multi-scale CNN may include at least two parallel convolution paths. The expansion path may include a first decoding layer and may be configured to generate a final representation based on the intermediate representation generated by the multi-scale CNN.
The proposed CNN architecture may be suitable for, or used in, audio processing. As such, it may receive a first audio signal (first audio sample) as input to the contracting path and output a second audio signal (second audio sample) from the expansion path.
The first encoding layer may be configured to perform convolution and downsampling operations. The first encoding layer may be configured to forward the result of the convolution and downsampling operations to the multi-scale CNN. In case the contracting path does not comprise any other encoding layer, the first audio signal may be applied directly to the first encoding layer.
The first decoding layer may be configured to generate an output by: receiving the intermediate representation generated by the multi-scale CNN, receiving the output of the first encoding layer, concatenating the intermediate representation with the output of the first encoding layer, performing a convolution operation, and performing an upsampling operation. The expansion path may be configured to generate the final representation based on the output of the first decoding layer. In particular, if the expansion path comprises only one (i.e., the first) decoding layer, the expansion path may be configured to use the output of the first decoding layer directly as the final representation.
The CNN architecture may further include a second encoding layer. The second encoding layer may be configured to perform a convolution operation, perform a downsampling operation, and forward the result to the first encoding layer. Furthermore, the CNN architecture may include a second decoding layer. The second decoding layer may be configured to: receive the output of the first decoding layer, receive the output of the second encoding layer, concatenate the output of the first decoding layer with the output of the second encoding layer, perform a convolution operation, and perform an upsampling operation.
In general, the contracting path may include additional encoding layers, while the expansion path may include additional corresponding decoding layers of the same size. In other words, encoding and decoding layers may be added in pairs. For example, an additional encoding layer may be added before the second encoding layer to preprocess the input of the second encoding layer, and an additional decoding layer may be added after the second decoding layer to post-process the output of the second decoding layer. Alternatively, additional layers may be added between the first and second encoding layers and between the first and second decoding layers, respectively.
The multi-scale CNN may be configured to generate an aggregate output based on the outputs of the at least two parallel convolution paths, for example by concatenating or adding those outputs. The multi-scale CNN may be configured to weight the outputs of the at least two parallel convolution paths using different weights, in particular before concatenating or adding them. The weights may be based on trainable parameters learned in a training process.
Each parallel convolution path of the multi-scale CNN may include L convolution layers, where L is a natural number greater than 1, and where the l-th layer of the L layers has N_l filters, with l = 1 … L.
For each parallel convolution path, the number of filters N_l in the l-th layer may increase with the layer index l. For example, the number of filters may be given by N_l = l × N_0, where N_0 is a predetermined constant greater than 1. In one aspect, the filter size of the filters may be the same within each parallel convolution path. In another aspect, the filter size of the filters may differ between different parallel convolution paths.
For a given parallel convolution path, the filters of at least one layer of that path may be dilated 2D convolution filters. The dilation operation of the filters of the at least one layer may be performed on the frequency axis only.
For a given parallel convolution path, the filters of two or more layers of that path may be dilated 2D convolution filters, and the dilation factor of the dilated 2D convolution filters may increase exponentially with the layer index l. For example, for a given parallel convolution path, the dilation may be (1, 1) in the first of the L convolution layers, (1, 2) in the second layer, and (1, 2^(L-1)) in the last (L-th) layer, where (c, d) indicates a dilation factor c along the time axis and a dilation factor d along the frequency axis.
Furthermore, the multi-scale CNN may include a complex convolution layer having a first CNN, a second CNN, an addition unit, and a subtraction unit. The first CNN may be configured to generate a first intermediate representation and a second intermediate representation based on the real and imaginary parts of an input signal. The second CNN may be configured to generate a third intermediate representation and a fourth intermediate representation based on the real and imaginary parts of the input signal. The addition unit may be configured to generate a real output representation based on the first and third intermediate representations. The subtraction unit may be configured to generate an imaginary output representation based on the second and fourth intermediate representations.
According to a second aspect of the present disclosure, another CNN architecture is provided. This CNN architecture may likewise be implemented by a computing system. The CNN architecture may include the contracting path of a U-net and the expansion path of the U-net. The contracting path may include a first encoding layer and may be configured to generate an output representation of the contracting path. The first encoding layer may include a first multi-scale CNN having at least two parallel convolution paths. The expansion path may include a first decoding layer and may be configured to generate a final representation based on the output representation of the contracting path. The first decoding layer may include a second multi-scale CNN having at least two parallel convolution paths. This CNN architecture may likewise be suitable for, or used in, audio processing: it may receive a first audio signal (first audio sample) as input to the contracting path and output a second audio signal (second audio sample) from the expansion path.
Both the first multi-scale CNN and the second multi-scale CNN may be implemented using the multi-scale CNN described previously. In particular, the first multi-scale CNN and the second multi-scale CNN may be based on the same network structure.
The first encoding layer may be configured to perform a downsampling (or pooling) operation on the output of the first multi-scale CNN. The first decoding layer may be configured to: receive the output representation of the contracting path, receive the output of the first encoding layer, perform a concatenation based on the output of the first encoding layer and the output representation of the contracting path, feed the result to the second multi-scale CNN, and perform an upsampling operation. In this way, the first decoding layer may determine the final representation.
The contracting path may include a second encoding layer, and the expansion path may include a corresponding second decoding layer. The second encoding layer may include a third multi-scale CNN having at least two parallel convolution paths, and the second decoding layer may include a fourth multi-scale CNN having at least two parallel convolution paths. The third and fourth multi-scale CNNs may be based on network structures similar or identical to those of the first and second multi-scale CNNs.
In one aspect, the second encoding layer may be configured to perform a convolution operation using the third multi-scale CNN, perform a downsampling operation, and forward the result to the first encoding layer. In another aspect, the second decoding layer may be configured to: receive the output of the first decoding layer and the output of the second encoding layer, concatenate the output of the first decoding layer with the output of the second encoding layer, perform a convolution operation using the fourth multi-scale CNN, and finally perform an upsampling operation to obtain the final representation of the expansion path.
The CNN architecture may further include another multi-scale CNN coupled between the contracting path and the expansion path, wherein this further multi-scale CNN includes at least two parallel convolution paths and is configured to receive and process the output representation of the contracting path. The further multi-scale CNN may be configured to forward its output to the expansion path.
The first multi-scale CNN may be configured to: generate an aggregate output based on the outputs of the at least two parallel convolution paths, perform a 2D convolution on the aggregate output, and perform a downsampling or pooling operation based on the result of the 2D convolution.
The second multi-scale CNN may be configured to: generate an aggregate output based on the outputs of the at least two parallel convolution paths, perform a 2D convolution on the aggregate output, and perform an upsampling operation based on the result of the 2D convolution.
Likewise, the first and/or second multi-scale CNN may be configured to generate the aggregate output by concatenating or adding the outputs of the respective at least two parallel convolution paths, and to weight those outputs using different weights prior to concatenating or adding them. The weights may be based on trainable parameters learned in a training process.
Each parallel convolution path of the first and/or second multi-scale CNN may comprise L convolution layers, where L is a natural number greater than 1, and where the l-th layer of the L layers has N_l filters, with l = 1 … L. For each parallel convolution path, the number of filters N_l in the l-th layer may increase with the layer index l; for example, it may be given by N_l = l × N_0, where N_0 is a predetermined constant greater than 1. The filter size of the filters may be the same within each parallel convolution path. Alternatively, the filter size may differ between different parallel convolution paths. For a given parallel convolution path, the filters of at least one layer of that path may be dilated 2D convolution filters, and the dilation operation may be performed on the frequency axis only. Specifically, the filters of two or more layers of a parallel convolution path may be dilated 2D convolution filters, and the dilation factor of the dilated 2D convolution filters may increase exponentially with the layer index l.
The first multi-scale CNN or the second multi-scale CNN may include a complex convolution layer. The complex convolution layer may include a first CNN, a second CNN, an addition unit, and a subtraction unit. The first CNN may be configured to generate a first intermediate representation and a second intermediate representation based on the real and imaginary parts of an input signal. The second CNN may be configured to generate a third intermediate representation and a fourth intermediate representation based on the real and imaginary parts of the input signal. The addition unit may be configured to generate a real output representation based on the first and third intermediate representations. The subtraction unit may be configured to generate an imaginary output representation based on the second and fourth intermediate representations.
The complex target range of the complex convolution layer may be limited by clipping complex target values whose absolute values are greater than a predetermined threshold. Alternatively, the complex target range of the complex convolution layer may be limited by using a transform function to map complex target values to mapped complex target values having absolute values less than or equal to a predetermined threshold.
According to a third aspect of the present disclosure, an apparatus for audio processing is provided. The apparatus may be configured to receive an input audio signal and to output an output audio signal. The apparatus may comprise any of the CNN architectures described above. The input of the contracting path may be based on the input audio signal, and the output audio signal may be based on the output of the expansion path.
According to a fourth aspect of the present disclosure, a method of audio processing using a Convolutional Neural Network (CNN) is provided. The method may include providing the contracting path of a U-net having a first encoding layer. The method may include generating, by the contracting path, an output representation of the contracting path. The method may include providing a multi-scale CNN including at least two parallel convolution paths. The method may include generating, by the multi-scale CNN, an intermediate representation based on the output representation of the contracting path. The method may include providing the expansion path of the U-net having a first decoding layer. The method may include generating, by the expansion path, a final representation based on the intermediate representation generated by the multi-scale CNN.
According to a fifth aspect of the present disclosure, another method of audio processing using a Convolutional Neural Network (CNN) is provided. The method may include providing the contracting path of a U-net having a first encoding layer, wherein the first encoding layer includes a first multi-scale CNN having at least two parallel convolution paths. The method may include generating, by the contracting path, an output representation of the contracting path. The method may include providing the expansion path of the U-net having a first decoding layer, wherein the first decoding layer includes a second multi-scale CNN having at least two parallel convolution paths. The method may include generating, by the expansion path, a final representation based on the output representation of the contracting path.
According to a sixth aspect of the present disclosure, computer program products are provided, each comprising a computer-readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform some or all of the steps of the above-described methods.
According to a seventh aspect of the present disclosure, there is provided a computing system implementing the aforementioned CNN architecture(s).
In accordance with another aspect of the present disclosure, a system for audio processing is provided. The system may include one or more processors and a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to: receive an input audio signal and process the input audio signal using any of the CNN architectures described above. The processing may include: providing an input to the contracting path of the CNN architecture based on the input audio signal, and generating an output audio signal based on an output of the expansion path of the CNN architecture. Further, the system may be configured to provide the output audio signal to a downstream device.
Drawings
Example embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
fig. 1 illustrates a conventional U-net architecture.
Fig. 2 illustrates a first embodiment of the proposed CNN architecture.
Fig. 3 illustrates an example of an aggregated multi-scale CNN.
Fig. 4 illustrates a more detailed view of the aggregated multi-scale CNN of fig. 3.
Fig. 5 illustrates a second embodiment of the proposed CNN architecture.
Fig. 6 illustrates an exemplary multi-scale encoding layer.
Fig. 7 illustrates an exemplary multi-scale decoding layer.
Fig. 8 illustrates another exemplary multi-scale encoding layer.
Fig. 9 illustrates another exemplary multi-scale decoding layer.
Fig. 10 illustrates an exemplary complex convolution layer.
Detailed Description
Fig. 1 illustrates a conventional U-net architecture. The U-net architecture is built on a stack of convolutional layers organized into a contracting path comprising the encoding layers 11, 12, 13 and an expansion path comprising the decoding layers 14, 15, 16. The contracting path follows the typical architecture of a convolutional network: each encoding layer 11, 12, 13 may consist of the repeated application of convolutions, each followed by a rectified linear unit (ReLU), and a max-pooling operation with an appropriate stride for downsampling. The number of feature channels may be doubled at each downsampling step. Each decoding layer 14, 15, 16 in the expansion path may consist of an upsampling of the feature map followed by a convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and further convolutions, each followed by a ReLU.
Fig. 2 illustrates a first embodiment of the proposed CNN architecture. In this architecture, a multi-scale CNN block 27 is embedded in the bottleneck layer of the U-net. Compared to the original U-net, the output representations of the encoding layers 21, 22, 23 are fed into the multi-scale CNN 27. The structure of a possible multi-scale CNN 27 is shown in figs. 3 and 4. The multi-scale CNN block 27 may leave the feature dimensions unchanged, but uses several convolution paths to fully analyze the latent representations at different scales and finally aggregates them. The output of the multi-scale CNN block 27 is fed to the decoding layers 24, 25, 26. In the first embodiment shown, the concatenations (skip connections) between encoding and decoding layers may be the same as in U-net. As will be appreciated by those skilled in the art, the proposed CNN architecture may be implemented by a suitable computing system.
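As a rough orientation, the overall arrangement could be sketched in PyTorch as follows. This is a minimal sketch under assumptions: the depth (two encoding/decoding layer pairs), channel counts, and kernel sizes are illustrative, and multiscale_block stands for any module that preserves feature dimensions, such as the aggregated multi-scale CNN of figs. 3 and 4.

```python
# Hypothetical sketch of the first embodiment (fig. 2): a U-net whose
# bottleneck is a multi-scale CNN block. Depth, channel counts, and kernel
# sizes are illustrative assumptions, not the patented configuration.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d((1, 2))       # downsample along frequency only

    def forward(self, x):
        y = self.pool(self.conv(x))
        return y, y                            # layer output and skip connection

class DecoderLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=(1, 2))

    def forward(self, x, skip):
        x = torch.cat([x, skip], dim=1)        # concatenate with encoder output
        return self.up(self.conv(x))           # convolve, then upsample

class UNetWithMultiScaleBottleneck(nn.Module):
    def __init__(self, multiscale_block):
        super().__init__()
        self.enc1 = EncoderLayer(1, 32)
        self.enc2 = EncoderLayer(32, 64)
        self.bottleneck = multiscale_block     # keeps feature dimensions
        self.dec2 = DecoderLayer(64 + 64, 32)  # concatenation doubles channels
        self.dec1 = DecoderLayer(32 + 32, 1)

    def forward(self, x):                      # x: (batch, 1, time, freq)
        x, s1 = self.enc1(x)
        x, s2 = self.enc2(x)
        x = self.bottleneck(x)
        x = self.dec2(x, s2)
        return self.dec1(x, s1)
```

For 8 × 2048 magnitude inputs like those in the speech-enhancement example that follows, UNetWithMultiScaleBottleneck(nn.Identity()) already runs end to end, with the identity standing in for the multi-scale block.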
In an exemplary embodiment, the architecture shown in fig. 2 has been applied to a speech enhancement application for 48 kHz audio signals. The data were transformed to the time-frequency (T-F) domain using a 4096-point Short-Time Fourier Transform (STFT) with 50% overlap, yielding a 2049-point magnitude spectrum per frame. Subsequently, 8 frames of data are fed to the model, ignoring the Direct Current (DC) bin (i.e., the input dimension is 8 × 2048), and the target is the magnitude ratio mask for the 8 frames (i.e., 8 × 2048). In this example embodiment, each encoding layer consists of a 2D convolution with a stride of 1 and a kernel size of 3 × 3, followed by a pooling layer of size 1 × 2. In the encoding path, the feature size is halved along the frequency axis while the number of filters is doubled each time; in the decoding path, the feature size is doubled along the frequency axis while the number of filters is halved each time.
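One possible realization of this feature preparation is sketched below. The description fixes only the 4096-point STFT, the 50% overlap, the dropped DC bin, and the 8 × 2048 input dimension; the Hann window and the chunking into non-overlapping 8-frame blocks are assumptions.

```python
# Assumed feature preparation for the 48 kHz speech-enhancement example:
# 4096-point STFT with 50% overlap, magnitude spectrum, DC bin dropped,
# frames grouped into 8-frame chunks of shape (8, 2048).
import torch

def stft_features(audio: torch.Tensor) -> torch.Tensor:
    spec = torch.stft(audio, n_fft=4096, hop_length=2048,
                      window=torch.hann_window(4096), return_complex=True)
    mag = spec.abs()                     # (2049, num_frames) magnitude spectrum
    mag = mag[1:, :].T                   # drop the DC bin -> (num_frames, 2048)
    n = mag.shape[0] // 8 * 8
    return mag[:n].reshape(-1, 8, 2048)  # model inputs of dimension 8 x 2048
```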
Figs. 3 and 4 illustrate examples of an aggregated multi-scale CNN that may be used directly as the multi-scale CNN block 27 in fig. 2. The aggregated multi-scale CNN 3 in fig. 3 comprises a plurality of parallel convolution paths 31, 32, 33. Although the number of parallel convolution paths is not limited, the aggregated multi-scale CNN may, for example, include three parallel convolution paths. By means of these parallel convolution paths, local and general feature information of the time-frequency transform of multiple frames of the audio signal can be extracted at different scales. The outputs of the parallel convolution paths 31, 32, 33 are aggregated and subjected to a further 2D convolution 34.
By means of the multi-scale CNN block 27 in the U-net bottleneck layer, different filter sizes can be used in combination with different strides or dilations to capture features at different scales. Based on the multi-scale CNN, the network is able to generate scale-dependent features, which is very important and cost-effective for practical applications. In fig. 3, each parallel convolution path may use the same filter size within the path, while the three parallel paths may have different kernel sizes from one another. In this way, the model learns features of the same scale in each path, which greatly accelerates the convergence of the model. In each path, an exponentially increasing dilation factor can be applied along the frequency axis. This enlarges the receptive field, works like a comb filter, and can uncover latent harmonic structures/correlations. At the same time, the number of channels/filters increases along the convolutional layers. Using parallel paths with different scales, local and (relatively) global information can be captured, and features characterizing various speech harmonic shapes can be extracted. The outputs from the paths are aggregated for further processing: we can concatenate them together or calculate a weighted average. Based on preliminary experiments, we found that features extracted by filters of different sizes have different properties. Using larger convolution filter sizes tends to preserve more speech harmonics but also more noise, while using smaller filter sizes preserves only the key components of speech and removes noise more aggressively. Thus, if a larger weight is chosen for a path with a larger filter size, the model will be relatively conservative and have better speech preservation (at the cost of more residual noise). If, on the other hand, a larger weight is chosen for a path with a smaller filter size, the model will remove noise more aggressively, and some speech components may also be lost. The weights can therefore be varied to control the aggressiveness of the model, and optimal weights may also be designed/learned based on the preferred trade-off in a particular application; a sketch of such a weighted aggregation follows.
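As a hedged illustration, the weighted average could be realized with trainable path weights; the softmax normalization below is an assumption, since the text only requires different, learnable weights.

```python
# Illustrative weighted aggregation of parallel-path outputs. Trainable
# softmax-normalized weights are one plausible realization of the
# designed/learned weights discussed above.
import torch
import torch.nn as nn

class WeightedAggregate(nn.Module):
    def __init__(self, num_paths: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_paths))  # learned in training

    def forward(self, path_outputs):             # list of same-shape tensors
        w = torch.softmax(self.logits, dim=0)
        # larger weight on a large-kernel path -> conservative, keeps harmonics;
        # larger weight on a small-kernel path -> aggressive noise removal
        return sum(wi * yi for wi, yi in zip(w, path_outputs))
```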
Fig. 4 illustrates a more detailed view of the aggregated multi-scale CNN 4. The time-frequency transforms of the multiple frames may be input (in parallel) into the plurality of parallel convolution paths. Each of the parallel convolution paths of the CNN may include N convolution layers, where N is a natural number greater than 1, and each convolution layer may include a number of filters.
The filter size of the filters may be the same (i.e., uniform) within each parallel convolution path. For example, filters of size (k1, k1) (i.e., k1 × k1) may be used in every layer of the top parallel convolution path. By using filters of the same size within each parallel convolution path, a mixing of features of different scales can be avoided. In this way, the CNN learns feature extraction at the same scale in each path, which greatly increases the convergence speed of the CNN. The filter size may differ between different parallel convolution paths. For example, and without intended limitation, if the aggregated multi-scale CNN includes three parallel convolution paths, the filter size may be (k1, k1) in the first (top) parallel convolution path, (k2, k2) in the second (middle) parallel convolution path, and (k3, k3) in the third (bottom) parallel convolution path. The filter size may depend, for example, on the harmonic length relevant for feature extraction.
The filters of at least one layer of a parallel convolution path may be dilated 2D convolution filters. The use of dilated filters makes it possible to extract the correlation of harmonic features across different receptive fields. Dilation enables far receptive fields to be reached by hopping over (i.e., skipping) a series of time-frequency (TF) bins. The dilation of the filters of the at least one layer may be applied on the frequency axis only. For example, in the context of the present disclosure, a dilation of (1, 2) may indicate that there is no dilation along the time axis (dilation factor 1), while every other bin along the frequency axis is skipped (dilation factor 2). In general, a dilation of (1, d) may indicate that (d-1) bins are skipped along the frequency axis between the bins used for feature extraction by the corresponding filters.
As shown in fig. 4, for a given convolution path, the filters of two or more layers of the parallel convolution path may be dilated 2D convolution filters, wherein the dilation factor of the dilated 2D convolution filters increases exponentially with the layer index l. In this way, a receptive field that grows exponentially with depth can be achieved. As shown in the example of fig. 4, for a given convolution path, the dilation may be (1, 1) in the first of the N convolution layers, (1, 2) in the second layer, and (1, 2^(N-1)) in the last layer, where (c, d) indicates a dilation factor c along the time axis and a dilation factor d along the frequency axis.
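One such path could be sketched as follows. The ReLU activations and the channel schedule N_l = l · N0 (taken from the summary above) are assumptions, and k is taken to be odd so that the padding keeps the time-frequency size unchanged.

```python
# Sketch of one parallel convolution path of fig. 4: fixed (k, k) kernels,
# frequency-axis dilation (1, 2**(l-1)), and an assumed channel growth
# N_l = l * n0. Padding keeps the time/frequency size unchanged (k odd).
import torch.nn as nn

def make_path(num_layers: int, k: int, n0: int, in_ch: int = 1) -> nn.Sequential:
    layers = []
    for l in range(1, num_layers + 1):
        dil = (1, 2 ** (l - 1))                  # dilate along frequency only
        pad = (k // 2, (k // 2) * dil[1])        # "same" padding per axis
        layers += [nn.Conv2d(in_ch, l * n0, k, dilation=dil, padding=pad),
                   nn.ReLU()]
        in_ch = l * n0
    return nn.Sequential(*layers)
```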
Fig. 5 illustrates a second embodiment of the proposed CNN architecture. As will be appreciated by those skilled in the art, this CNN architecture, too, may be implemented by a suitable computing system. In the second embodiment, multi-scale CNNs are embedded in the encoding layers 51, 52, 53 of the contracting path of the U-net and in the decoding layers 54, 55, 56 of the expansion path of the U-net. Fig. 6 illustrates an exemplary multi-scale encoding layer 6 that may be embedded in one or more of the encoding layers 51, 52, 53 of the second embodiment in fig. 5; it comprises three parallel convolution paths 61, 62, 63 and one downsampling layer 64. Fig. 7 illustrates an exemplary multi-scale decoding layer that may be embedded in one or more of the decoding layers 54, 55, 56 of the second embodiment in fig. 5; it comprises three parallel convolution paths 71, 72, 73 and one upsampling layer 74. Figs. 8 and 9 illustrate more detailed views of multi-scale CNNs that may be used for speech enhancement: fig. 8 illustrates another exemplary multi-scale encoding layer 8, and fig. 9 illustrates another exemplary multi-scale decoding layer 9.
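Under the same assumptions as in the earlier sketches, a multi-scale encoding layer of this second embodiment might look as follows. The concatenation-based aggregation and the 3 × 3 convolution after it are assumptions; the matching decoding layer would replace the pooling with upsampling and add a concatenation with the corresponding skip connection. Here agg_ch is the sum of the channel counts produced by the individual paths.

```python
# Assumed sketch of a multi-scale encoding layer (fig. 6): parallel
# convolution paths, aggregation, a 2D convolution, then frequency pooling.
import torch
import torch.nn as nn

class MultiScaleEncodingLayer(nn.Module):
    def __init__(self, paths, agg_ch, out_ch):
        super().__init__()
        self.paths = nn.ModuleList(paths)   # e.g. three make_path(...) instances
        self.conv = nn.Conv2d(agg_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d((1, 2))    # downsampling layer 64 in fig. 6

    def forward(self, x):
        y = torch.cat([p(x) for p in self.paths], dim=1)  # aggregate paths
        return self.pool(self.conv(y))
```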
Finally, fig. 10 illustrates an exemplary complex convolution layer. Both the first embodiment in fig. 2 and the second embodiment in fig. 5 can be extended to complex-domain processing. In this case, the input will be a complex-valued vector or matrix, such as the complex spectrum of the input signal, and the output will also be a complex-valued vector or matrix, such as a complex soft-mask estimate in the case of a speech enhancement application.
One option to achieve this goal is to pack the real and imaginary parts of the input matrix into two input channels and to apply real-valued convolution operations with shared real-valued convolution filters. However, this approach does not obey the rules of complex multiplication, so the network may learn the real and imaginary parts independently. To solve this problem, the complex convolution layer shown in fig. 10 is used. In particular, the complex convolution layer models the correlation between amplitude and phase by simulating complex multiplication.
In fig. 10, the exemplary complex convolution layer 1000 includes a first CNN 103, a second CNN 104, an addition unit 105, and a subtraction unit 106. The first CNN 103 may be configured to generate a first intermediate representation and a second intermediate representation based on the real part 101 and the imaginary part 102 of the input signal. The second CNN 104 may be configured to generate a third intermediate representation and a fourth intermediate representation based on the real and imaginary parts of the input signal. The addition unit 105 may be configured to generate the real output representation based on the first and third intermediate representations. The subtraction unit 106 may be configured to generate the imaginary output representation based on the second and fourth intermediate representations.
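A minimal sketch of such a layer is given below. It follows the standard complex multiplication rule (Wr + jWi)(xr + jxi) = (Wr·xr - Wi·xi) + j(Wr·xi + Wi·xr); the precise assignment of the four intermediate representations to the addition and subtraction units of fig. 10 is an assumption.

```python
# Sketch of a complex convolution layer built from two real-valued CNNs
# (cf. fig. 10). The sign placement follows complex multiplication:
# real = Wr*xr - Wi*xi, imag = Wr*xi + Wi*xr.
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)  # "first CNN"
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)  # "second CNN"

    def forward(self, x_real, x_imag):
        real = self.conv_r(x_real) - self.conv_i(x_imag)  # real output
        imag = self.conv_r(x_imag) + self.conv_i(x_real)  # imaginary output
        return real, imag
```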
The target mask of the complex model is also complex-valued, and its real and imaginary parts have a considerable range of values. A nonlinear transformation/compression is typically required to map the original value range to a fixed range, e.g., the range [0, 1]. This makes learning/convergence in model training easier. In this disclosure, we propose two solutions:
(1) The complex target values are limited to the unit circle (or another circle with a fixed radius) while keeping the phase the same. In other words, if the absolute value of a complex target is greater than 1, it may be limited to 1. In early experiments, complex targets with absolute values greater than 1 accounted for approximately 5-10% of all data points, so limiting them to 1 may have only a small impact on the final result.
(2) A specially designed compression function, such as an S-shaped (sigmoid-like) function, is used to narrow the target range. An inverse function can be applied after the network output to transform the estimate back into the original range of values.
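Hypothetical implementations of both options, assuming complex-valued mask tensors, are sketched below; the tanh squashing in option (2) is one possible choice of S-shaped function, not a confirmed detail.

```python
# Option (1): clip the complex target magnitude to a fixed radius while
# preserving the phase. Option (2): squash the magnitude with tanh (an
# assumed S-shaped function) and invert it after the network output.
import torch

def clip_to_circle(mask: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    mag = mask.abs().clamp(min=1e-8)
    scale = (radius / mag).clamp(max=1.0)   # < 1 only where |mask| > radius
    return mask * scale                     # magnitude clipped, phase preserved

def squash(mask: torch.Tensor) -> torch.Tensor:
    mag = mask.abs()
    unit = mask / mag.clamp(min=1e-8)       # unit phasor (keeps the phase)
    return unit * torch.tanh(mag)           # magnitude mapped into [0, 1)

def unsquash(mask: torch.Tensor) -> torch.Tensor:
    mag = mask.abs()
    unit = mask / mag.clamp(min=1e-8)
    return unit * torch.atanh(mag.clamp(max=1 - 1e-6))
```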
The loss function may include several terms, namely losses on the real and imaginary parts of the estimated soft mask and of the spectrum. The loss function may also include a magnitude loss, or a waveform-domain loss obtained by transforming the complex values into real values through an Inverse Fast Fourier Transform (IFFT). All terms may be weighted based on the particular application.
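A hedged sketch of such a combined loss is given below; the L1 criteria, the specific weights, and the helper name enhancement_loss are assumptions, and full-band complex spectra (including the DC bin) are taken as inputs.

```python
# Illustrative multi-term loss: real/imaginary losses on the estimated soft
# mask and on the masked spectrum, plus a waveform-domain term obtained via
# the inverse STFT. Criteria and weights are assumptions.
import torch
import torch.nn.functional as F

def enhancement_loss(mask_est, mask_ref, spec_noisy, wave_ref,
                     w_mask=1.0, w_spec=1.0, w_wave=0.1):
    spec_est = mask_est * spec_noisy             # complex masking
    spec_ref = mask_ref * spec_noisy
    l_mask = F.l1_loss(torch.view_as_real(mask_est), torch.view_as_real(mask_ref))
    l_spec = F.l1_loss(torch.view_as_real(spec_est), torch.view_as_real(spec_ref))
    wave_est = torch.istft(spec_est, n_fft=4096, hop_length=2048,
                           window=torch.hann_window(4096))
    return w_mask * l_mask + w_spec * l_spec + w_wave * F.l1_loss(wave_est, wave_ref)
```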
Interpretation
Unless specifically stated otherwise, as is apparent from the following discussion, it is appreciated that throughout this disclosure, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "analyzing," or the like refer to the actions and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.
In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory, to transform the electronic data into other electronic data, e.g., that may be stored in registers and/or memory. A "computer" or "computing machine" or "computing platform" may include one or more processors.
In one example embodiment, the methods described herein may be performed by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that, when executed by the one or more processors, carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further comprise a memory subsystem including main RAM and/or static RAM and/or ROM. A bus subsystem may be included for communication between the components. The processing system may further be a distributed processing system in which the processors are coupled together by a network. If the processing system requires a display, such a display may be included, e.g., a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT) display. If manual data entry is required, the processing system also includes an input device, such as one or more of an alphanumeric input unit (e.g., a keyboard), a pointing control device (e.g., a mouse), and so forth. The processing system may also encompass a storage system, such as a disk drive unit. The processing system in some configurations may include a sound output device and a network interface device. The memory subsystem thus includes a computer-readable carrier medium carrying computer-readable code (e.g., software) comprising a set of instructions that, when executed by one or more processors, cause carrying out one or more of the methods described herein. Note that when a method includes several elements (e.g., several steps), no ordering of such elements is implied unless specifically stated. The software may reside on a hard disk, or it may be completely or at least partially resident in RAM and/or within the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute a computer-readable carrier medium carrying computer-readable code. Furthermore, a computer-readable carrier medium may form, or be included in, a computer program product.
In alternative example embodiments, the one or more processors may operate as standalone devices or may be connected (e.g., networked) to other processors in a networked deployment; the one or more processors may operate in the capacity of a server or a user machine in a server-user network environment, or as peer machines in a peer-to-peer or distributed network environment. The one or more processors may form a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch, or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
It should be noted that the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
Thus, one example embodiment of each method described herein is in the form of a computer-readable carrier medium carrying a set of instructions, such as a computer program for execution on one or more processors (e.g., one or more processors that are part of a web server arrangement). Accordingly, as will be appreciated by one skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer readable carrier medium (e.g., a computer program product). The computer-readable carrier medium carries computer-readable code comprising a set of instructions that, when executed on one or more processors, cause the one or more processors to implement a method. Accordingly, aspects of the present disclosure may take the form of an entirely hardware example embodiment, an entirely software example embodiment or an example embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.
The software may further be transmitted or received over a network via a network interface device. While the carrier medium is a single medium in the example embodiments, the term "carrier medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "carrier medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by one or more processors and that causes the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical disks, magnetic disks, and magneto-optical disks. Volatile media include dynamic memory, such as main memory. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications. For example, the term "carrier medium" shall accordingly be taken to include, but not be limited to, solid-state memories and computer products embodied in optical and magnetic media; a medium bearing a propagated signal detectable by at least one of the one or more processors and representing a set of instructions that, when executed, implement a method; and a transmission medium in a network bearing a propagated signal detectable by at least one of the one or more processors and representing a set of instructions.
It will be appreciated that in one example embodiment, the steps of the methods discussed are performed by a suitable processor (or processors) in a processing (e.g., computer) system executing instructions (computer readable code) stored in a storage device. It will also be appreciated that the present disclosure is not limited to any particular implementation or programming technique, and that the present disclosure may be implemented using any suitable technique for implementing the functions described herein. The present disclosure is not limited to any particular programming language or operating system.
Reference throughout this disclosure to "one example embodiment," "some example embodiments," or "example embodiments" means that a particular feature, structure, or characteristic described in connection with the example embodiments is included in at least one example embodiment of the present disclosure. Thus, the appearances of the phrases "in one example embodiment," "in some example embodiments," or "in example embodiments" in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner, as will be apparent to one of ordinary skill in the art in light of this disclosure, in one or more example embodiments.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
In the claims below and in the description herein, any one of the terms "comprising," "comprised of," or "which comprises" is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term "comprising," when used in the claims, should not be interpreted as being limited to the means or elements or steps listed thereafter. For example, the scope of the expression "a device comprising A and B" should not be limited to devices consisting only of elements A and B. Any one of the terms "including" or "which includes" as used herein is likewise an open term that means including at least the elements/features that follow the term, but not excluding others. Thus, "including" is synonymous with, and means, "comprising".
It should be appreciated that in the foregoing description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment/figure or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the description are hereby expressly incorporated into this description, with each claim standing on its own as a separate example embodiment of this disclosure.
Moreover, while some example embodiments described herein include some features included in other example embodiments and not others included in other example embodiments, combinations of features of different example embodiments are intended to be within the scope of the present disclosure and form different example embodiments, as will be appreciated by those of skill in the art. For example, in the appended claims, any of the example embodiments claimed may be used in any combination.
In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Therefore, while there has been described what are believed to be the best modes of the present disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as fall within the scope of the present disclosure. For example, any formulas given above represent only processes that may be used. Functions may be added or deleted from the block diagrams and operations may be interchanged among the functional blocks. Steps may be added or deleted to the methods described within the scope of the present disclosure.
Various aspects and implementations of the present disclosure may also become apparent from the example embodiments (EEEs) enumerated below, which are not the claims.
EEE 1. A Convolutional Neural Network (CNN) architecture comprising:
a contracting path of a U-net with a first encoding layer, wherein the contracting path is configured to generate an output representation of the contracting path,
a multi-scale CNN configured to generate an intermediate representation based on the output representation of the contracting path, wherein the multi-scale CNN comprises at least two parallel convolution paths, and
an expansion path of the U-net with a first decoding layer, wherein the expansion path is configured to generate a final representation based on the intermediate representation generated by the multi-scale CNN.
EEE 2. The CNN architecture according to EEE 1, wherein the first encoding layer is configured to perform convolution and downsampling operations.
EEE 3. The CNN architecture according to EEE 1 or 2, wherein the first decoding layer is configured to generate an output by:
receiving the intermediate representation generated by the multi-scale CNN,
receiving the output of the first encoding layer,
concatenating the intermediate representation with the output of the first encoding layer,
performing a convolution operation, and
performing an upsampling operation.
EEE 4. The CNN architecture according to any one of the preceding EEEs, further comprising a second encoding layer, wherein the second encoding layer is configured to:
perform a convolution operation,
perform a downsampling operation, and
forward the result to the first encoding layer.
EEE 5. The CNN architecture according to EEE 4, further comprising a second decoding layer, wherein the second decoding layer is configured to:
receive the output of the first decoding layer,
receive the output of the second encoding layer,
concatenate the output of the first decoding layer with the output of the second encoding layer,
perform a convolution operation, and
perform an upsampling operation.
EEE 6. The CNN architecture according to any one of the preceding EEEs, wherein the multi-scale CNN is configured to generate an aggregate output based on the outputs of the at least two parallel convolution paths.
EEE 7. The CNN architecture according to EEE 6, wherein the multi-scale CNN is configured to generate the aggregate output by concatenating or adding the outputs of the at least two parallel convolution paths.
EEE 8. The CNN architecture according to EEE 6 or 7, wherein the multi-scale CNN is configured to weight the outputs of the at least two parallel convolution paths using different weights.
EEE 9. The CNN architecture according to any one of the preceding EEEs, wherein each parallel convolution path of the multi-scale CNN comprises L convolution layers, wherein L is a natural number greater than or equal to 1, and wherein the l-th layer of the L layers has N_l filters, where l = 1 … L.
EEE 10. The CNN architecture according to EEE 9, wherein, for each parallel convolution path, the number of filters N_l in the l-th layer increases with the layer index l.
EEE 11. The CNN architecture according to EEE 9, wherein the filter size of the filters is the same within each parallel convolution path.
EEE 12. The CNN architecture according to EEE 9, wherein the filter size of the filters differs between different parallel convolution paths.
EEE 13. The CNN architecture according to EEE 9, wherein, for a given parallel convolution path, the filters of at least one layer of the parallel convolution path are dilated 2D convolution filters.
EEE 14. The CNN architecture according to EEE 13, wherein the dilation operation of the filters of the at least one layer of the parallel convolution path is performed on the frequency axis only.
EEE 15. The CNN architecture according to EEE 13, wherein, for a given parallel convolution path, the filters of two or more layers of the parallel convolution path are dilated 2D convolution filters, and wherein the dilation factor of the dilated 2D convolution filters increases exponentially with the layer index l.
EEE 16. A Convolutional Neural Network (CNN) architecture comprising:
a contracting path of a U-net with a first encoding layer, wherein the contracting path is configured to generate an output representation of the contracting path, and wherein the first encoding layer comprises a first multi-scale CNN with at least two parallel convolution paths, and
an expansion path of the U-net with a first decoding layer, wherein the expansion path is configured to generate a final representation based on the output representation of the contracting path, and wherein the first decoding layer comprises a second multi-scale CNN with at least two parallel convolution paths.
EEE 17. The CNN architecture of EEE 16, further comprising another multi-scale CNN coupled between the contracting path and the expansion path, wherein the other multi-scale CNN
comprises at least two parallel convolution paths, and
is configured to receive and process the output representation of the contracting path.
EEE 18. The CNN architecture according to EEE 16 or 17, wherein the first multi-scale CNN is configured to:
generate an aggregate output based on the outputs of the at least two parallel convolution paths,
perform a 2D convolution on the aggregate output, and
perform a downsampling or pooling operation based on the result of the 2D convolution.
EEE 19. The CNN architecture according to any one of EEEs 16 to 18, wherein the second multi-scale CNN is configured to:
generate an aggregate output based on the outputs of the at least two parallel convolution paths,
perform a 2D convolution on the aggregate output, and
perform an upsampling operation based on the result of the 2D convolution.
The CNN architecture according to any one of EEEs 16 to 19, wherein the first multi-scale CNN or the second multi-scale CNN comprises a complex convolution layer having:
a first CNN configured to generate a first intermediate representation and a second intermediate representation based on real and imaginary parts of an input signal,
a second CNN configured to generate a third intermediate representation and a fourth intermediate representation based on real and imaginary parts of the input signal,
an addition unit configured to generate a real output representation based on the first intermediate representation and the third intermediate representation, and
a subtracting unit configured to generate an imaginary output representation based on the second intermediate representation and the fourth intermediate representation.
EEE 21. The CNN architecture according to EEE 20, wherein the complex target range of the complex convolution layer is limited by ignoring complex target values whose absolute values are greater than a predetermined threshold.
EEE 22. The CNN architecture according to EEE 20, wherein the complex target range of the complex convolution layer is limited by using a transform function to map complex target values to mapped complex target values having absolute values less than or equal to a predetermined threshold.
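As one possible transform function for EEE 22 (the specific rule and threshold value are assumptions, not taken from the source), complex target values can be rescaled to the threshold magnitude while preserving phase:

```python
def limit_complex_target(real, imag, threshold=10.0):
    """Map complex targets so that their magnitude never exceeds the
    threshold, preserving phase (one candidate transform for EEE 22)."""
    mag = torch.sqrt(real ** 2 + imag ** 2).clamp(min=1e-12)
    scale = (threshold / mag).clamp(max=1.0)
    return real * scale, imag * scale
```

The EEE 21 variant would instead mask out (i.e., ignore in the training loss) any target whose magnitude exceeds the threshold.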
EEE 23. An apparatus for audio processing, wherein
the apparatus is configured to receive an input audio signal and to output an output audio signal,
the apparatus comprises a CNN architecture according to any one of the preceding EEEs, and
the input of the contracted path is based on the input audio signal, and the output audio signal is based on the output of the expanded path.
EEE 24. A method (e.g., a computer-implemented method) of audio processing using a Convolutional Neural Network (CNN), the method comprising:
providing a contracted path of a U-net having a first encoding layer,
generating an output representation of the contracted path from the contracted path,
providing a multi-scale CNN comprising at least two parallel convolution paths,
generating an intermediate representation by said multi-scale CNN based on the output representation of said contracted path,
providing an expanded path of the U-net having a first decoding layer, and
generating, by the expanded path, a final representation based on the intermediate representation generated by the multi-scale CNN.
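Wiring the pieces together, the method of EEE 24 could correspond to a forward pass like the following sketch, which reuses ParallelConvPath from above; the single encoding/decoding layer pair and all channel widths are illustrative assumptions.

```python
class FirstArchitecture(nn.Module):
    """Illustrative wiring of the first CNN architecture (cf. EEE 24):
    contracted path -> multi-scale CNN -> expanded path, with a skip
    connection from the encoding layer to the decoding layer."""

    def __init__(self):
        super().__init__()
        # First encoding layer: convolution + downsampling.
        self.enc_conv = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1),
                                      nn.ReLU())
        self.down = nn.MaxPool2d(2)
        # Multi-scale CNN between the contracted and expanded paths.
        self.path_a = ParallelConvPath(32, base_filters=16, num_layers=2,
                                       kernel_size=(3, 3))
        self.path_b = ParallelConvPath(32, base_filters=16, num_layers=2,
                                       kernel_size=(5, 5))
        self.mid_conv = nn.Conv2d(64, 32, kernel_size=1)
        # First decoding layer: concatenate, convolve, upsample.
        self.dec_conv = nn.Conv2d(64, 1, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2)

    def forward(self, x):  # x: (batch, 1, freq, time), freq/time even
        enc = self.down(self.enc_conv(x))      # contracted-path output
        mid = torch.cat([self.path_a(enc), self.path_b(enc)], dim=1)
        mid = self.mid_conv(mid)               # intermediate representation
        cat = torch.cat([mid, enc], dim=1)     # skip connection
        return self.up(self.dec_conv(cat))     # final representation
```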
EEE 25. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to EEE 24.
EEE 26. A method (e.g., a computer-implemented method) of audio processing using a Convolutional Neural Network (CNN), the method comprising:
providing a contracted path of a U-net having a first encoding layer, wherein the first encoding layer comprises a first multi-scale CNN having at least two parallel convolution paths,
generating an output representation of the contracted path from the contracted path,
providing an expanded path of the U-net having a first decoding layer, wherein the first decoding layer comprises a second multi-scale CNN having at least two parallel convolution paths, and
generating, by the expanded path, a final representation based on the output representation of the contracted path.
EEE 27. A computer program product comprising a computer readable storage medium having instructions adapted to, when executed by a device having processing capabilities, cause the device to perform a method according to EEE 26.
EEE 28. A system for audio processing comprising:
one or more processors; and
a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving an input audio signal;
processing the input audio signal using a CNN architecture according to any of EEEs 1 to 22, the processing comprising:
providing an input to a contracted path of the CNN architecture based on the input audio signal; and
generating an output audio signal based on an output of the expanded path of the CNN architecture.
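A hypothetical usage of the FirstArchitecture sketch above on a spectrogram patch (the shape and content are arbitrary test values):

```python
model = FirstArchitecture()
spec = torch.randn(1, 1, 64, 128)   # (batch, channel, freq, time)
out = model(spec)
print(out.shape)                    # torch.Size([1, 1, 64, 128])
```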
EEE 29. A computing system implementing the CNN architecture according to any one of EEEs 1 to 22.

Claims (29)

1. A Convolutional Neural Network (CNN) architecture for audio processing, the CNN architecture comprising:
a contracted path of a U-net having a first encoding layer, wherein the contracted path is configured to generate an output representation of the contracted path based on a first audio signal provided as an input to the contracted path,
a multi-scale CNN configured to generate an intermediate representation based on the output representation of the contracted path, wherein the multi-scale CNN comprises at least two parallel convolution paths, and
an expanded path of the U-net having a first decoding layer, wherein the expanded path is configured to generate a final representation based on the intermediate representation generated by the multi-scale CNN, and to output a second audio signal.
2. The CNN architecture of claim 1, wherein the first encoding layer is configured to perform convolution and downsampling operations.
3. The CNN architecture of claim 1 or 2, wherein the first decoding layer is configured to generate an output by:
receiving the intermediate representation generated by the multi-scale CNN,
receiving the output of said first coding layer,
concatenating the intermediate representation with the output of the first encoding layer,
performing a convolution operation, and
performing an upsampling operation.
4. The CNN architecture of any preceding claim, further comprising a second encoding layer, wherein the second encoding layer is configured to:
perform a convolution,
perform a downsampling operation, and
forward the result to the first encoding layer.
5. The CNN architecture of claim 4, further comprising a second decoding layer, wherein the second decoding layer is configured to:
receive an output of the first decoding layer,
receive an output of the second encoding layer,
concatenate the output of the first decoding layer with the output of the second encoding layer,
perform a convolution operation, and
perform an upsampling operation.
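Purely as an illustration of the skip wiring added by claims 4 and 5 (channel widths are assumed, and the multi-scale CNN and the claim-3 concatenation are omitted for brevity; imports as in the sketches above), a two-level encoder/decoder could look like:

```python
class TwoLevelUNet(nn.Module):
    """Second encoding/decoding layers with a skip connection
    (cf. claims 4 and 5); a deliberately reduced sketch."""

    def __init__(self):
        super().__init__()
        self.enc2 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))    # second encoding layer
        self.enc1 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))    # first encoding layer
        self.dec1 = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                  nn.Upsample(scale_factor=2))  # first decoding layer
        self.dec2 = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1),
                                  nn.Upsample(scale_factor=2))  # second decoding layer

    def forward(self, x):
        e2 = self.enc2(x)    # convolve, downsample, forward on (claim 4)
        e1 = self.enc1(e2)
        d1 = self.dec1(e1)
        # Claim 5: concatenate the first decoder's output with the second
        # encoder's output, then convolve and upsample.
        return self.dec2(torch.cat([d1, e2], dim=1))
```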
6. The CNN architecture according to any preceding claim, wherein the multi-scale CNN is configured to generate an aggregate output based on the outputs of the at least two parallel convolution paths.
7. The CNN architecture of claim 6, wherein the multi-scale CNN is configured to generate the aggregate output by concatenating or adding outputs of the at least two parallel convolution paths.
8. The CNN architecture of claim 6 or 7, wherein the multi-scale CNN is configured to weight the outputs of the at least two parallel convolution paths using different weights.
9. The CNN architecture according to any preceding claim, wherein each parallel convolution path of the multi-scale CNN comprises L convolution layers, wherein L is a natural number greater than or equal to 1, and wherein the l-th layer of the L layers has N_l filters, where l = 1, ..., L.
10. The CNN architecture of claim 9, wherein, for each parallel convolution path, the number N_l of filters in the l-th layer increases as the layer number l increases.
11. The CNN architecture of claim 9, wherein the filter size of the filters is the same in each parallel convolution path.
12. The CNN architecture of claim 9, wherein the filter size of the filters differs between different parallel convolution paths.
13. The CNN architecture of claim 9, wherein, for a given parallel convolution path, the filters of at least one layer of the parallel convolution path are dilated 2D convolution filters.
14. The CNN architecture of claim 13, wherein the dilation operation of the filters of the at least one layer of the parallel convolution path is performed only on the frequency axis.
15. The CNN architecture of claim 13, wherein, for a given parallel convolution path, the filters of two or more layers of the parallel convolution path are dilated 2D convolution filters, and wherein the dilation factor of the dilated 2D convolution filters increases exponentially with the layer number l.
16. A Convolutional Neural Network (CNN) architecture for audio processing, the CNN architecture comprising:
a contracted path of a U-net having a first encoding layer, wherein the contracted path is configured to generate an output representation of the contracted path based on a first audio signal provided as an input to the contracted path, wherein the first encoding layer comprises a first multi-scale CNN having at least two parallel convolution paths, and
an expanded path of the U-net having a first decoding layer, wherein the expanded path is configured to generate a final representation based on the output representation of the contracted path, and to output a second audio signal, wherein the first decoding layer comprises a second multi-scale CNN having at least two parallel convolution paths.
17. The CNN architecture of claim 16, further comprising another multi-scale CNN coupled between the contracted path and the expanded path, wherein the other multi-scale CNN
comprises at least two parallel convolution paths, and
is configured to receive and process the output representation of the contracted path.
18. The CNN architecture according to claim 16 or 17, wherein the first multi-scale CNN is configured to:
generating an aggregate output based on the outputs of the at least two parallel convolution paths,
performing a 2D convolution on the aggregated output, and
performing a downsampling or pooling operation based on the result of the 2D convolution.
19. The CNN architecture according to any one of claims 16 to 18, wherein the second multi-scale CNN is configured to:
generating an aggregate output based on the outputs of the at least two parallel convolution paths,
performing a 2D convolution on the aggregated output, and
performing an upsampling operation based on the result of the 2D convolution.
20. The CNN architecture according to any one of claims 16 to 19, wherein the first or second multi-scale CNN comprises a complex convolution layer having:
a first CNN configured to generate a first intermediate representation and a second intermediate representation based on real and imaginary parts of an input signal,
a second CNN configured to generate a third intermediate representation and a fourth intermediate representation based on real and imaginary parts of the input signal,
an addition unit configured to generate a real output representation based on the first intermediate representation and the third intermediate representation, and
a subtracting unit configured to generate an imaginary output representation based on the second intermediate representation and the fourth intermediate representation.
21. The CNN architecture of claim 20, wherein the complex target range of the complex convolution layer is limited by ignoring complex target values having absolute values greater than a predetermined threshold.
22. The CNN architecture of claim 20, wherein the complex target range of the complex convolution layer is limited by using a transform function to map complex target values to mapped complex target values having absolute values less than or equal to a predetermined threshold.
23. An apparatus for audio processing, wherein
the apparatus is configured to receive an input audio signal and to output an output audio signal,
the apparatus comprises a CNN architecture according to any one of the preceding claims, and
the input of the contracted path is based on the input audio signal, and the output audio signal is based on the output of the expanded path.
24. A method of audio processing using a Convolutional Neural Network (CNN), the method comprising:
providing a contracted path of a U-net having a first encoding layer,
generating an output representation of the contracted path from the contracted path,
providing a multi-scale CNN comprising at least two parallel convolution paths,
generating an intermediate representation by said multi-scale CNN based on the output representation of said contracted path,
providing an expanded path of the U-net having a first decoding layer, and
generating, by the expanded path, a final representation based on the intermediate representation generated by the multi-scale CNN.
25. A computer program product comprising a computer readable storage medium having instructions adapted to cause a device having processing capabilities to perform the method of claim 24 when executed by the device.
26. A method of audio processing using a Convolutional Neural Network (CNN), the method comprising:
providing a contracted path of a U-net having a first encoding layer, wherein the first encoding layer comprises a first multi-scale CNN having at least two parallel convolution paths,
generating an output representation of the contracted path from the contracted path,
providing an expanded path of the U-net having a first decoding layer, wherein the first decoding layer comprises a second multi-scale CNN having at least two parallel convolution paths, and
generating, by the expanded path, a final representation based on the output representation of the contracted path.
27. A computer program product comprising a computer readable storage medium having instructions adapted to cause a device having processing capabilities to perform the method of claim 26 when executed by the device.
28. A system for audio processing, comprising:
one or more processors; and
a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving an input audio signal;
processing the input audio signal using the CNN architecture according to any one of claims 1 to 22, the processing comprising:
providing an input to a contracted path of the CNN architecture based on the input audio signal; and
generating an output audio signal based on an output of the expanded path of the CNN architecture.
29. A computing system implementing the CNN architecture according to any one of claims 1 to 22.
CN202180071332.7A 2020-10-19 2021-10-19 Method and apparatus for audio processing using convolutional neural network architecture Pending CN116348884A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CNPCT/CN2020/121829 2020-10-19
CN2020121829 2020-10-19
US202063112220P 2020-11-11 2020-11-11
US63/112,220 2020-11-11
EP20211501.0 2020-12-03
PCT/US2021/055672 WO2022087009A1 (en) 2020-10-19 2021-10-19 Method and apparatus for audio processing using a convolutional neural network architecture

Publications (1)

Publication Number Publication Date
CN116348884A 2023-06-27

Family

ID=73698608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180071332.7A Pending CN116348884A (en) 2020-10-19 2021-10-19 Method and apparatus for audio processing using convolutional neural network architecture

Country Status (1)

Country Link
CN (1) CN116348884A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination