CN116368495A - Method and apparatus for audio processing using nested convolutional neural network architecture - Google Patents


Info

Publication number
CN116368495A
Authority
CN
China
Prior art keywords
data set
convolution
encoded data
scale
downsampled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180071571.2A
Other languages
Chinese (zh)
Inventor
孙俊岱
芦烈
双志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2021/055691 external-priority patent/WO2022087025A1/en
Publication of CN116368495A publication Critical patent/CN116368495A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Systems, methods, and computer program products for audio processing based on Convolutional Neural Networks (CNNs) are described. The CNN architecture may include a multi-scale input block and a multi-scale nest block. The multi-scale input block may be configured to receive input data and generate a first downsampled input data set by downsampling the input data. The multi-scale nested block may include a first encoding layer configured to generate a first encoded data set by performing convolution based on input data. The multi-scale nested block may include a second encoding layer configured to generate a second encoded data set by performing a convolution based on the first downsampled input data set. Further, the multi-scale nested block may include a first convolution layer configured to generate a first output data set by upsampling the second encoded data set, concatenating the first encoded data set and the upsampled second encoded data set, and performing a convolution. The first convolutional layer may be nested between the encoding and decoding layers, thereby increasing the number of communication channels within the CNN and simplifying the underlying optimization problem.

Description

Method and apparatus for audio processing using nested convolutional neural network architecture
Cross Reference to Related Applications
The present application claims priority to the following priority applications: PCT international application PCT/CN2020/121829 filed on October 19, 2020, U.S. provisional application 63/112,220 filed on November 11, 2020, European application 20211501.0 filed on December 3, 2020, PCT international application PCT/CN2021/078705 filed on March 2, 2021, and U.S. provisional application 63/164,028 filed on March 22, 2021.
Technical Field
The present disclosure relates generally to methods and apparatus for audio processing using Convolutional Neural Networks (CNNs). More particularly, the present disclosure relates to extracting speech from an original noisy speech signal using an aggregated multi-scale nested CNN architecture.
Although some embodiments will be described herein with particular reference to this disclosure, it should be understood that the disclosure is not limited to this field of use and is applicable to a broader context.
Background
Any discussion of the background art throughout the disclosure should in no way be considered as an admission that such art is widely known or forms part of the common general knowledge in the field.
Deep Neural Networks (DNNs) have become a viable option to address various audio processing issues. Types of DNNs include feed-forward Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs). Among these, the CNN is a type of feed-forward network.
In recent years, CNN architecture has been used in the field of audio processing. In particular, CNN architecture has been successfully applied to various audio processing problems including sound separation, speech enhancement, and speech source separation. Speech source separation aims at recovering target speech from background interference and has many applications in the field of speech and/or audio technology. In this context, the separation of speech sources is also commonly referred to as the "cocktail party problem". In such a scenario, extracting conversations from professional content (such as movies and TV) presents challenges due to the complex context.
It is an object of this document to provide a novel CNN architecture that can be applied in various fields of audio processing, including sound separation, speech enhancement and speech source separation.
Disclosure of Invention
According to a first aspect of the present disclosure, a computing system implementing a Convolutional Neural Network (CNN) architecture is described. The CNN architecture may include a multi-scale input block and a multi-scale nest block. The multi-scale input block may be configured to receive input data and generate a first downsampled input data set by downsampling the input data. The multi-scale nested block may include a first encoding layer configured to generate a first encoded data set by performing convolution based on input data. The multi-scale nested block may include a second encoding layer configured to generate a second encoded data set by performing a convolution based on the first downsampled input data set. Further, the multi-scale nested block may include a first convolution layer configured to generate a first output data set by performing convolution based on the first encoded data set and an upsampled second encoded data set, wherein the upsampled second encoded data set is obtained by upsampling the second encoded data set. For example, the first convolution layer may be configured to generate the first output data set by upsampling the second encoded data set, concatenating the first encoded data set and the upsampled second encoded data set, and performing a convolution based on a concatenation result of the first encoded data set and the upsampled second encoded data set. Alternatively, the upsampling and/or cascading may be performed by some other layer or unit, for example by the first encoding layer or the second encoding layer. The first convolutional layer may be nested between the encoding and decoding layers, thereby increasing the number of communication channels within the CNN and simplifying the underlying optimization problem.
The input data may represent an audio signal. For example, the input data may include an audio signal spectrum extending along a time dimension and a frequency dimension. The multi-scale input block may then be configured to perform a downsampling operation by downsampling the spectrum in the time dimension or by downsampling the spectrum in the frequency dimension. Alternatively, the multi-scale input block may be configured to perform the downsampling operation by downsampling the spectrum in both the time and frequency dimensions. As will be described in the following description, the multi-scale input block may be configured to generate further downsampled versions of the input data, and thus generate multiple-scale raw input data, which is forwarded to the multi-scale nested block for further processing.
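As a minimal illustration of the axis-specific downsampling described above, the sketch below decimates a toy time-frequency spectrum along either dimension. It is a pure-Python sketch with simple 2x decimation standing in for whatever pooling or strided layer the multi-scale input block actually uses; all names and values are illustrative.

```python
def downsample(spec, axis, factor=2):
    """Decimate a 2D spectrum (rows = time frames, cols = frequency bins) along one axis."""
    if axis == "time":
        return spec[::factor]
    if axis == "frequency":
        return [row[::factor] for row in spec]
    raise ValueError("axis must be 'time' or 'frequency'")

# Toy 4x4 spectrum: entry at (t, f) is t*10 + f, so decimation is easy to check.
spec = [[float(t * 10 + f) for f in range(4)] for t in range(4)]

half_time = downsample(spec, "time")            # 2x4: every other frame
half_freq = downsample(spec, "frequency")       # 4x2: every other bin
half_both = downsample(half_time, "frequency")  # 2x2: both dimensions halved
```

Downsampling both dimensions is simply the composition of the two single-axis operations, which is why the multi-scale input block can offer either or both.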
In a multi-scale nested block, the coding layer and the convolution layer may be the same or different. They may for example comprise a single convolution layer or multiple convolution layers, the outputs of which are aggregated or added in any way. Each convolution operation may be, for example, a 2D convolution, and may be followed by a suitable activation function. The convolution layer may have a plurality of filters. The filter sizes of the coding layer and the convolutional layer may be different. The filters may be initialized with random weights and the weights may be trained during a training process. The training process may include both a forward propagation process and a backward propagation process. The data sets generated by the coding layer and the convolution layer may also be denoted as feature maps in this document.
The multi-scale nested block may further include a second convolution layer configured to generate a second output data set by performing a convolution based on the second encoded data set. The multi-scale nested block may further comprise a third convolution layer configured to generate a third output data set by performing a convolution based on the first output data set and an upsampled second output data set, wherein the upsampled second output data set is obtained by upsampling the second output data set. For example, the third convolution layer may be configured to generate a third output data set by upsampling the second output data set, concatenating the first output data set and the upsampled second output data set, and performing a convolution based on a concatenation result of the first output data set and the upsampled second output data set.
The third convolutional layer may also be denoted/treated as a first decoding layer and the third output data set may be denoted as a first decoding data set. In other words, the first decoded data set may represent a decoded data set having the same scale as the input data. Similarly, the second convolutional layer may also be denoted as a second decoded layer, and the second output data set may be denoted as a second decoded data set. That is, the second decoded data set may represent a decoded data set at a lower scale than the input data, or more precisely: a decoded data set at the scale of the first downsampled input data set. Thus, according to the above explanation of the described CNN architecture, the first convolutional layer is coupled between two coding layers and two decoding layers, and thus may also be denoted as a nested (or intermediate) convolutional layer. Thus, the presence of such nested convolution layers increases the communication within the proposed CNN architecture. In particular, the introduction of nested convolution layers brings the semantic level of an encoded data set (e.g., an encoder feature map) closer to the semantic level of a decoded data set (e.g., a decoder feature map). A technical advantage is that the optimizer may face easier optimization problems when the received encoded data set and the corresponding decoded data set are semantically more similar.
From a network perspective, the first encoding layer, the first convolution layer, and the third convolution layer (i.e., the first decoding layer) may be configured to process and output a data set at the same scale as the input data. The three layers may form a first hierarchy of multi-scale nested blocks. Similarly, the second encoding layer and the second convolutional layer (i.e., the second decoding layer) may be configured to process and output data sets at the same scale as the first downsampled input data set. The two layers may form a second level of multi-scale nested blocks. Thus, the proposed CNN architecture can also be denoted as "nested" because (a) the first convolution layer is located between the different layers of the first hierarchy. In conventional CNN architectures, such middle layers are typically not provided, and the output of the encoder is forwarded directly to the corresponding decoder on the same level. Furthermore, the proposed CNN architecture can be denoted as "nested" because (b) the first convolution layer establishes a connection (with additional convolution processing) between the different levels. For example, the first convolution layer may perform some additional convolution processing between the second coding layer (at the second level) and the third convolution layer (at the first level), where such additional convolution processing is not typically provided in prior art architectures. In other words, a multi-scale nest block may include multiple levels, each level being associated with a respective resolution of its input data, wherein the number of (serial) layers decreases by one from one level to the next.
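The nested dataflow described above (upsample the lower-scale encoded feature map, concatenate it with the same-scale feature map along the channel axis, then convolve) can be sketched as follows. The 1x1 mixing convolution, the feature-map shapes, and the weights are illustrative stand-ins, not details taken from this disclosure.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def concat_channels(a_channels, b_channels):
    """Concatenate two lists of same-sized 2D feature maps along the channel axis."""
    return a_channels + b_channels

def conv1x1(channels, weights):
    """1x1 convolution: a per-pixel weighted sum across input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(wgt * ch[i][j] for wgt, ch in zip(weights, channels))
             for j in range(w)] for i in range(h)]

# First encoded set: one 4x4 channel; second encoded set: one 2x2 channel.
x00 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x10 = [[1, 0], [0, 1]]

up = upsample2x(x10)                    # 2x2 -> 4x4, back to the first scale
cat = concat_channels([x00], [up])      # 2 channels of 4x4
out = conv1x1(cat, weights=[0.5, 0.5])  # "first output data set" (1 channel)
```

The same upsample-concatenate-convolve pattern repeats at every nested layer; only the scales and the number of concatenated inputs change.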
The multi-scale input block may be further configured to generate a second downsampled input data set by downsampling the first downsampled input data set. The multi-scale nested block may further include a third encoding layer configured to generate a third encoded data set by performing a convolution based on the second downsampled input data set. Further, the second convolution layer may be configured to generate a second output data set by performing a convolution based on the second encoded data set and an upsampled third encoded data set, wherein the upsampled third encoded data set is obtained by upsampling the third encoded data set. For example, the second convolution layer may be configured to generate the second output data set by upsampling the third encoded data set, concatenating the second encoded data set and the upsampled third encoded data set, and performing a convolution based on the concatenation of the second encoded data set and the upsampled third encoded data set.
The second encoding layer may be configured to generate a second encoded data set by performing convolution based on the first downsampled input data set and the downsampled first encoded data set, wherein the downsampled first encoded data set is obtained by downsampling the first encoded data set. For example, the second encoding layer may be further configured to downsample the first encoded data set, concatenate the first downsampled input data set and the downsampled first encoded data set, and generate the second encoded data set by performing a convolution based on the concatenation.
Alternatively or additionally, the third encoding layer may be configured to downsample the second encoded data set, concatenate the second downsampled input data set and the downsampled second encoded data set, and generate the third encoded data set by performing a convolution based on the concatenation (i.e., the concatenation of the second downsampled input data set and the downsampled second encoded data set).
In contrast to the above-described CNN architecture in which the second encoding layer is not configured to receive and downsample the output of the first encoding layer and in which the third encoding layer is not configured to receive and downsample the output of the second encoding layer, the CNN architecture with the corresponding receiving and downsampling functions faces an easier underlying optimization problem.
The second convolution layer may be configured to generate a second output data set by performing convolution based on the second encoded data set, a downsampled first output data set obtained by downsampling the first output data set, and an upsampled third encoded data set obtained by upsampling the third encoded data set. For example, the second convolution layer may be configured to generate a second output data set by: downsampling the first output data set, upsampling the third encoded data set, concatenating the downsampled first output data set, the upsampled third encoded data set and the second encoded data set, and performing a convolution based on the concatenation.
The third convolution layer may be configured to generate a third output data set by performing a convolution based on the first output data set, an up-sampled second output data set obtained by up-sampling the second output data set, and the first encoded data set. For example, the third convolution layer may be configured to generate a third output data set by: upsampling the second output data set, concatenating the first output data set, the upsampled second output data set and the first encoded data set, and performing a convolution based on the concatenation.
The third encoding layer may be configured to generate a third encoded data set by performing convolution based on the second downsampled input data set, the downsampled first encoded data set obtained by downsampling the first encoded data set, and the downsampled second encoded data set obtained by downsampling the second encoded data set. For example, the third encoding layer may be configured to generate a third encoded data set by: downsampling the first encoded data set, downsampling the second encoded data set, concatenating the downsampled first encoded data set, the downsampled second encoded data set, and the second downsampled input data set, and performing a convolution based on the concatenating.
The computing system may further include a weighted addition block configured to apply the first weight to the third output data set. The weighted addition block may be configured to apply a second weight to the second output data set. The weighted addition block may be configured to generate an output of the multi-scale nested block by adding the weighted third output data set and the weighted second output data set. The first weight and/or the second weight may be a learnable parameter or may be set based on knowledge of the signal processing domain.
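A minimal sketch of the weighted addition block described above, assuming scalar weights and same-sized feature maps; in practice the weights would be learnable parameters or set from signal-processing knowledge, and the second output data set would first be upsampled to the full scale. The weight values here are purely illustrative.

```python
def weighted_add(a, b, w1, w2):
    """Element-wise w1*a + w2*b over two same-sized 2D feature maps."""
    return [[w1 * x + w2 * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

out3 = [[1.0, 2.0], [3.0, 4.0]]  # third output data set (full scale)
out2 = [[0.5, 0.5], [0.5, 0.5]]  # second output data set, already upsampled

# Hypothetical weights; a trained system would learn w1 and w2.
fused = weighted_add(out3, out2, w1=0.8, w2=0.2)
```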
The first encoding layer may be configured to generate a first encoded data set by performing convolution based on the input data and an upsampled first downsampled input data set, wherein the upsampled first downsampled input data set is obtained by upsampling the first downsampled input data set. For example, the first encoding layer may be configured to generate the first encoded data set by upsampling the first downsampled input data set, concatenating the upsampled first downsampled input data set and the input data, and performing a convolution based on the concatenation.
Alternatively or additionally, the second encoding layer may be configured to generate the second encoded data set by performing a convolution based on the first downsampled input data set and the upsampled second downsampled input data set, wherein the upsampled second downsampled input data set is obtained by upsampling the second downsampled input data set. For example, the second encoding layer is configured to generate a second encoded data set by upsampling a second downsampled input data set, concatenating the upsampled second downsampled input data set and the first downsampled input data set, and performing a convolution based on the concatenation.
The multi-scale input block may include a convolutional layer or dense layer configured to generate a first downsampled input data set based on the input data. The parameters of the convolutional layer or dense layer may be trainable during the training process. The multi-scale input block may be configured to generate the first downsampled input data set using a maximum pooling process, an average pooling process, or a mixture of the maximum pooling process and the average pooling process.
The first encoding layer or the second encoding layer may comprise a multi-scale convolution block configured to generate an output by concatenating or adding the outputs of the at least two parallel convolution paths. The multi-scale convolution block may be configured to weight the outputs of at least two parallel convolution paths using different weights. Likewise, the weights may be based on trainable parameters learned from the training process.
Each parallel convolution path of the multi-scale convolution block may comprise L convolution layers, wherein L is a natural number greater than 1, and wherein the l-th layer of the L layers has Nl filters, where l = 1 … L. For each parallel convolution path, the number of filters Nl in the l-th layer may increase with the layer index l. For example, for each parallel convolution path, the number of filters Nl in the l-th layer may be given by Nl = l × N0, where N0 is a predetermined constant greater than 1. In one aspect, the filter size of the filters may be the same within each parallel convolution path. On the other hand, the filter size may differ between different parallel convolution paths. For a given parallel convolution path, the filters of at least one layer of the parallel convolution path may be dilated 2D convolution filters. The dilation of the filters of at least one layer of the parallel convolution path may be performed only on the frequency axis.
For a given parallel convolution path, the filters of two or more layers of the parallel convolution path may be dilated 2D convolution filters, and the dilation factor of the dilated 2D convolution filters may increase exponentially with increasing layer index l. For example, for a given parallel convolution path, the dilation may be (1, 1) in the first of the L convolution layers, (1, 2) in the second layer, and so on up to (1, 2^(L−1)) in the last of the L convolution layers, where (c, d) indicates a dilation factor c along the time axis and a dilation factor d along the frequency axis.
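The per-path layer schedule described above, with Nl = l × N0 filters and an exponentially growing frequency-axis dilation (1, 2^(l−1)), can be sketched as follows; N0 = 16 and L = 4 are hypothetical values chosen only for illustration.

```python
def path_schedule(num_layers, n0):
    """Return (filter_count, dilation) per layer for one parallel convolution path.

    Layer l (1-based) gets l * n0 filters and dilation (1, 2**(l-1)):
    no dilation on the time axis, powers of two on the frequency axis.
    """
    return [(l * n0, (1, 2 ** (l - 1))) for l in range(1, num_layers + 1)]

sched = path_schedule(4, 16)
# [(16, (1, 1)), (32, (1, 2)), (48, (1, 4)), (64, (1, 8))]
```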
As already indicated in the foregoing description, the input data may comprise an audio signal. The CNN architecture may further include an aggregation block configured to receive an output of the multi-scale nested block. The aggregate block may include at least one of: a convolution layer configured to reduce the number of channels associated with the input data, a pooling layer configured to reduce the dimensions associated with the input data, and a loop layer configured to order the output of the multi-scale nested block.
The CNN architecture can be further extended according to the principles described above. For example, the multi-scale input block may be further configured to generate a third downsampled input data set by downsampling the second downsampled input data set. The multi-scale nested block may further include a fourth encoding layer configured to generate a fourth encoded data set by performing convolution based on the third downsampled input data set. The multi-scale nested block may further include a fourth convolution layer configured to generate a fourth output data set by performing convolution based on the third encoded data set and the up-sampled fourth encoded data set, wherein the up-sampled fourth encoded data set is obtained by up-sampling the fourth encoded data set. For example, the fourth convolution layer may be configured to generate a fourth output data set by upsampling a fourth coded data set, concatenating the third coded data set and the upsampled fourth coded data set, and performing convolution based on a result of the concatenating. The multi-scale nested block may further include a fifth convolution layer configured to generate a fifth output data set by performing convolution based on the second output data set and an upsampled fourth output data set, wherein the upsampled fourth output data set is obtained by upsampling the fourth output data set. For example, the fifth convolution layer may be configured to generate a fifth output data set by upsampling the fourth output data set, concatenating the second output data set and the upsampled fourth output data set, and performing convolution based on a result of the concatenating. The multi-scale nested block may further include a sixth convolution layer configured to generate a sixth output data set by performing convolution based on the third output data set and an upsampled fifth output data set, wherein the upsampled fifth output data set is obtained by upsampling the fifth output data set.
In the described CNN architecture, the following three convolutional layers can be considered nested layers: a first convolution layer, a second convolution layer, and a third convolution layer. The layers are coupled between three encoding layers and three decoding layers. Further, the first coding layer, the first convolution layer, the third convolution layer, and the sixth convolution layer are at a first scale, i.e., at the scale of the input data. Similarly, the second coding layer, the second convolution layer, and the fifth convolution layer are at a second scale, i.e., at the scale of the first downsampled input data. Finally, only two layers (i.e., the third coding layer and the fourth convolutional layer) are at the third scale. At the fourth scale, the fourth coding layer constitutes the only layer. In general, the CNN architecture can be further extended according to the pyramidal structure described above. Here, the number of layers per scale depends on the number of downsampled input data sets provided by the multi-scale input block.
According to a second aspect of the present disclosure, an apparatus for audio processing is provided. The apparatus may be configured to receive an input of an input audio signal and output an output audio signal. The apparatus may include a CNN architecture described in this document. The input data received by the multi-scale input block may be based on the input audio signal and the output audio signal may be based on a third output data set generated by a third convolution layer of the multi-scale nested block.
According to a third aspect of the present disclosure, a method of audio processing using a convolutional neural network CNN is provided. The method may include receiving input data. The method may include generating a first downsampled input data set by downsampling input data. The method may include generating a first encoded data set by performing a convolution based on input data. The method may include generating a second encoded data set by performing a convolution based on the first downsampled input data set. The method may include generating a third output data set by performing a convolution based on the first output data set and an up-sampled second output data set, wherein the up-sampled second output data set is obtained by up-sampling the second output data set.
The method may further include generating a second output data set by performing a convolution based on the second encoded data set. The method may further include generating a first output data set by performing a convolution based on the first encoded data set and an upsampled second encoded data set, wherein the upsampled second encoded data set is obtained by upsampling the second encoded data set.
According to a fourth aspect of the present disclosure, there is provided a computer program product comprising a computer readable storage medium having instructions adapted to cause a device having processing capabilities to perform the above-described method when executed by the device.
Drawings
Example embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary architecture of an AMS-Nest.
Fig. 2 illustrates an exemplary multi-scale convolution block M.
FIG. 3 illustrates an exemplary horizontally dense multi-scale nest block.
Fig. 4 illustrates an exemplary vertically cascaded multi-scale nest block.
Fig. 5 illustrates an exemplary vertically dense multi-scale nest block.
FIG. 6 illustrates an exemplary architecture of an AMS-Nest that removes connections between M blocks.
FIG. 7 illustrates an exemplary architecture with weighted AMS-Nest.
FIG. 8 illustrates an exemplary architecture of an AMS-Nest with a cascade added between M blocks at different levels of multi-scale input blocks and multi-scale nested blocks.
Detailed Description
The present document discloses an aggregated multi-scale nested neural network architecture called AMS-Nest. The proposed architecture can be regarded as a depth-supervised encoder-decoder network, where the encoder and decoder subnetworks are connected by a series of nested paths. Here, the encoder may include a layer that encodes the original input to a particular space or dimension to obtain the encoding characteristics. The decoder may include a layer that decodes the encoded features into the original space or dimension.
FIG. 1 illustrates an exemplary architecture 1 of AMS-Nest. Architecture 1 basically includes three main blocks, referred to as a multi-scale input block 11, a multi-scale nest block 12, and an aggregation block 13. FIG. 1 shows a model structure of an AMS-Nest with a depth of five. In FIG. 1 and all figures that follow, horizontal arrows indicate forwarding of a feature map from the output of one layer to the input of another layer. Downward-pointing arrows (pointing straight down in the multi-scale input block 11 and at a downward 45-degree angle in the multi-scale nest block 12) indicate that the feature map is downsampled. The feature map may be downsampled by the layer outputting the feature map, the layer receiving the feature map, or a third layer or entity. Upward-pointing arrows (at an upward 45-degree angle in the multi-scale nest block 12) indicate that the feature map is upsampled. Likewise, the feature map may be upsampled by the layer outputting the feature map, the layer receiving the feature map, or a third layer or entity.
The multi-scale input block 11 downsamples the original input to several scales, the multi-scale nest block 12 captures features using different filter sizes based on the multi-scale input, and each scale of the input corresponds to a separate horizontal path. Furthermore, these paths may be tightly connected. The aggregation block 13 narrows the number of channels and the dimensions to match the target shape. By removing the bottom blocks (In4, M4, C30, C21, C12, C03), the depth of the AMS-Nest can be reduced to four. After additional layers are similarly removed, the depth of the AMS-Nest may be reduced to three or even two. In this disclosure, the number of horizontal paths is defined as the hierarchy of the AMS-Nest. Thus, the structure shown in FIG. 1 may represent an AMS-Nest of hierarchy 5.
The multi-scale input block 11 generates a first downsampled input data set In1 and a second downsampled input data set In2. The multi-scale nest block 12 includes, for example, a first coding layer M0, which generates a first encoded data set by performing a convolution based on the input data In0. The multi-scale nest block 12 further includes, for example, a second coding layer M1, a third coding layer M2, a first convolution layer C00, a second convolution layer C10, and a third convolution layer C01.
Multi-scale input block
The multi-scale input block 11 may contain multiple scales of the original input. As shown in FIG. 1, In1 to In4 are downsampled versions of the original input In0, and the arrows indicate the downsampling process. The downsampling process may be implemented in several different ways.
Option 1: using nerve layers
The convolutional layer may be used as a downsampling layer by using different strides, and the dense layer may also be applied by using a smaller number of nodes than the original input.
For a multi-scale input block based on convolution layers, the i-th input can be expressed as:

In_i = Conv2D(channel, kernel, stride, dilation)(In_{i-1}),  i ≥ 1

where Conv2D(channel, kernel, stride, dilation)(·) denotes a 2D convolution layer. Channel, kernel, stride, and dilation are the four main parameters of a convolution layer, and a stride > 1 can be used to reduce the dimensions of the previous input. Other parameters may be determined based on the use case. For downsampling, a single convolution layer or multiple convolution layers may be used.
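The effect of stride-based downsampling can be illustrated with a minimal numpy sketch. The kernel here is a fixed 2×2 averaging filter purely for illustration; in a real network the kernel weights would be learned:

```python
import numpy as np

def conv2d_downsample(x, kernel, stride=2):
    """Valid 2D convolution with stride > 1, acting as a downsampling layer."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

In0 = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for the original input
k = np.full((2, 2), 0.25)                        # illustrative fixed averaging kernel
In1 = conv2d_downsample(In0, k, stride=2)        # first downsampled input
In2 = conv2d_downsample(In1, k, stride=2)        # second downsampled input
print(In0.shape, In1.shape, In2.shape)           # (8, 8) (4, 4) (2, 2)
```

Each stride-2 layer halves both spatial dimensions, producing the pyramid of inputs fed to the different levels.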
For a multi-scale input block based on dense layers, the i-th input can be expressed as:

In_i = Dense(node_i)(In_{i-1}),  i ≥ 1

where node_i is the number of nodes of the dense layer, and node_i < node_{i-1}. For downsampling, a single dense layer or multiple dense layers may be used.
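A dense-layer variant can be sketched analogously; the weight matrix (random here, learned in practice) projects the flattened previous input onto fewer nodes:

```python
import numpy as np

def dense_downsample(x, n_nodes, rng):
    """Dense layer whose output has fewer nodes than its input (node_i < node_{i-1})."""
    w = rng.standard_normal((n_nodes, x.size))   # random stand-in for learned weights
    return w @ x.ravel()

rng = np.random.default_rng(0)
In0 = np.ones(64)                       # flattened stand-in for the original input
In1 = dense_downsample(In0, 32, rng)    # node_1 = 32 < 64
In2 = dense_downsample(In1, 16, rng)    # node_2 = 16 < 32
print(In1.shape, In2.shape)             # (32,) (16,)
```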
Option 2: using pooling
By using different step sizes, the original input can be downsampled to different ratios with a pooling process. Several pooling options are possible; for example, L_p pooling may be used:
y^{(f+1)}_{i,j} = ( (1/|R_{i,j}|) · Σ_{(m,n) ∈ R_{i,j}} (a^{(f)}_{m,n})^p )^{1/p}

where y^{(f+1)}_{i,j} is the input of the (f+1)-th level, i.e., the output of the pooling operator at position (i, j), and a^{(f)}_{m,n} is the feature value at position (m, n) within the pooling region R_{i,j} of the f-th level. In particular, when the parameter p = 1, L_p pooling corresponds to average pooling, and when p → ∞, L_p pooling corresponds to max pooling.
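Assuming the pooled sum is normalized by the region size (so that p = 1 gives exactly average pooling and large p approaches max pooling), a numpy sketch over non-overlapping regions looks like this:

```python
import numpy as np

def lp_pool(x, size=2, p=1.0):
    """L_p pooling over non-overlapping size×size regions of a 2D feature map."""
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            region = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = np.mean(region ** p) ** (1.0 / p)
    return out

x = np.array([[1.0, 3.0], [2.0, 4.0]])
print(lp_pool(x, 2, p=1.0))     # p = 1: average pooling, value 2.5
print(lp_pool(x, 2, p=200.0))   # large p: close to max pooling, value near 4
```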
Alternatively, mixed pooling may be used:

y^{(f+1)}_{i,j} = λ · max_{(m,n) ∈ R_{i,j}} a^{(f)}_{m,n} + (1 − λ) · (1/|R_{i,j}|) · Σ_{(m,n) ∈ R_{i,j}} a^{(f)}_{m,n}

where λ is a random value of 0 or 1 (which may be chosen arbitrarily), indicating whether max pooling or average pooling is selected. λ is recorded during forward propagation and can be reused for the backward propagation operation.
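Mixed pooling can be sketched the same way; λ is drawn once per forward pass and returned so that the backward pass can replay the same choice:

```python
import numpy as np

def mixed_pool(x, size=2, lam=None, rng=np.random.default_rng(0)):
    """Randomly pick max pooling (lam=1) or average pooling (lam=0) per forward pass."""
    if lam is None:
        lam = int(rng.integers(0, 2))       # random choice: 0 or 1
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            r = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = lam * r.max() + (1 - lam) * r.mean()
    return out, lam                         # lam is recorded for backpropagation

x = np.array([[1.0, 3.0], [2.0, 4.0]])
print(mixed_pool(x, lam=1)[0])   # max pooling: [[4.]]
print(mixed_pool(x, lam=0)[0])   # average pooling: [[2.5]]
```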
Multi-scale nested block
In the proposed nested CNN architecture, the feature maps of the encoder (i.e., the encoding layers M0, M1, ...) pass through one or more convolution layers, where the number of convolution layers depends on the level of the pyramid (see FIG. 1). For example, the skip path between nodes M0 and C03 is formed by three convolution layers (C00, C01, C02). Each convolution layer may be preceded by a concatenation layer that combines the output from the previous convolution layer of the same block with the corresponding upsampled output of the lower-level block. Essentially, the nested (or intermediate) convolution blocks bring the semantic level of the encoder feature maps closer to the semantic level of the feature maps awaited in the decoder. In FIG. 1, the decoder can be said to include the layers (C30, C21, C12, ..., C03). A technical advantage is that the optimizer faces an easier optimization problem when the received encoder feature maps and the corresponding decoder feature maps are semantically more similar.
In general, the nodes in the multi-scale nest block 12 include two types: type M and type C. Type M may, for example, include multi-scale convolution block 2 shown in fig. 2. The multi-scale convolution block 2 generates an output by concatenating or adding the outputs of at least two parallel convolution paths 21, 22, 23. Although the number of parallel convolution paths is not limited, the aggregate multi-scale CNN may include three parallel convolution paths 21, 22, 23. By means of these parallel convolution paths, local and general feature information of the time-frequency transformation of a plurality of frames of the audio signal can be extracted on different scales. The outputs of the parallel convolution paths are aggregated and subjected to a further 2D convolution 24.
It should be mentioned that the word "multi-scale" in the term "multi-scale convolution block 2" has a different meaning than in the terms "multi-scale input block 11" and "multi-scale nest block 12". In the term "multi-scale convolution block 2", the word "multi-scale" indicates that different filter sizes may be used in each parallel convolution path. In the terms "multi-scale input block 11" and "multi-scale nest block 12", the word "multi-scale" indicates that input data having different sizes/resolutions are processed at each level. In particular, each level may process input data downsampled, for example, by a factor of 2 compared to the level above it.
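As a rough illustration of the parallel-path idea in the multi-scale convolution block (a sketch, not the patented block itself), the code below runs parallel 1-D convolution paths with different filter sizes over the same input and stacks their outputs as channels; `np.convolve` with fixed averaging filters stands in for learned convolutions:

```python
import numpy as np

def multi_scale_block(x, kernel_sizes=(3, 5, 7)):
    """Parallel convolution paths with different filter sizes, outputs stacked."""
    paths = []
    for k in kernel_sizes:
        kernel = np.full(k, 1.0 / k)                   # illustrative averaging filter
        paths.append(np.convolve(x, kernel, mode="same"))
    return np.stack(paths, axis=0)                     # shape: (num_paths, length)

x = np.sin(np.linspace(0, 2 * np.pi, 16))
features = multi_scale_block(x)
print(features.shape)    # (3, 16): one channel per parallel path
```

Small kernels capture local detail while larger kernels capture broader context, which is the point of aggregating the paths.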
Type C represents, for example, a common convolution layer in which different kernels and dilation rates may be set. It should be noted that type C and type M may be the same or different. A layer of type C may include any convolution block. In the simplest case, a layer of type C may include only a single convolution operation.
Fig. 3 illustrates an exemplary horizontally dense multi-scale nest block 32 that may replace the multi-scale nest block 12 in fig. 1. To make the nested blocks more powerful, the concept of dense convolution layers can be applied in this example. As shown in fig. 3, several skip connections may be added to the nest block. Dashed arrows indicate the corresponding concatenation process. For example, in fig. 3, the third convolution layer C01 generates a third output data set by performing a convolution based on the first output data set generated by layer C00, the upsampled second output data set originating from layer C10, and the first encoded data set originating from layer M0. In other words, in a horizontally dense architecture, the first encoded data set is forwarded directly to layer C01, bypassing layer C00. Additionally, the first encoded data set may also be forwarded to layers C00, C02 and C03.
Formally, the skip path can be written as follows. Let c_{i,j} denote the output of node C_{i,j} and m_i the output of M_i, where the index i runs along the downsampling levels of the encoder (i.e., the level number) and the index j runs along the convolution layers of the dense block of the skip path. The stack of feature maps represented by c_{i,j} can be computed as:

c_{i,j} = H([m_i, U(m_{i+1})])  for j = 0,
c_{i,j} = H([m_i, c_{i,0}, ..., c_{i,j-1}, U(c_{i+1,j-1})])  for j > 0,

where the function H(·) is a convolution operation followed by an activation function, U(·) denotes an upsampling layer, and [·] denotes a concatenation layer. Basically, a C node at position j = 0 receives only two inputs, both from the encoder subnetwork but on two consecutive levels. A node at position j > 0 receives j + 2 inputs: the encoder output m_i of the same level, the outputs of the previous j nodes in the same skip path, and the upsampled output from the lower skip path. Because dense convolution blocks are used along each skip path, all previous feature maps accumulate and reach the current node.
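The input bookkeeping of the skip-path computation can be checked with a small sketch in which H is replaced by a plain ReLU and U by nearest-neighbour upsampling (both are stand-ins; a real implementation would use learned convolutions):

```python
import numpy as np

def U(x):
    """Nearest-neighbour upsampling by 2 along both spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def H(x):
    """Stand-in for convolution + activation (here just a ReLU, for shape checks)."""
    return np.maximum(x, 0.0)

def node(i, j, m, c):
    """Compute c[i][j]; feature maps have shape (channels, height, width)."""
    if j == 0:
        inputs = [m[i], U(m[i + 1])]                       # two encoder inputs
    else:
        inputs = [m[i]] + c[i][:j] + [U(c[i + 1][j - 1])]  # j + 2 inputs
    return H(np.concatenate(inputs, axis=0))               # concatenate, then H(.)

# Dummy encoder outputs m0, m1, m2 at three levels (1 channel each):
m = [np.ones((1, 8, 8)), np.ones((1, 4, 4)), np.ones((1, 2, 2))]
c = [[], [], []]
c[0].append(node(0, 0, m, c))   # c00: 1 + 1 = 2 channels
c[1].append(node(1, 0, m, c))   # c10: 2 channels at the lower level
c[0].append(node(0, 1, m, c))   # c01: m0 + c00 + U(c10) -> 1 + 2 + 2 = 5 channels
print(c[0][1].shape)            # (5, 8, 8)
```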
Fig. 4 illustrates an exemplary vertically cascaded multi-scale nest block 42 that may replace the multi-scale nest block 12 of fig. 1. In this example, the second convolution layer C10 generates a second output data set by performing a convolution based on the second encoded data set originating from layer M1, the downsampled first output data set originating from layer C00, and the upsampled third encoded data set originating from layer M2. That is, the output of layer C00 is explicitly downsampled and forwarded to layer C10. As can be seen from fig. 4, an upsampled version of the second output data set may be sent to node C01, while a downsampled version of the second output data set may be sent to node C20. Fig. 5 illustrates an exemplary vertically dense multi-scale nest block 52 that may replace the multi-scale nest block 12 in fig. 1. In fig. 5, the additional downsampling processes are illustrated using downward-curved arrows. Here, the third encoding layer M2 generates a third encoded data set by performing a convolution based on the second downsampled input data set, the downsampled first encoded data set originating from layer M0, and the downsampled second encoded data set originating from layer M1.
FIG. 6 illustrates an exemplary architecture 62 of the AMS-Nest in which the connections between M blocks are removed. Likewise, the block shown in FIG. 6 may replace the multi-scale nest block 12 of FIG. 1. The first encoding layer M0 generates a first encoded data set by performing a convolution based on the input data, the second encoding layer M1 generates a second encoded data set by performing a convolution based on the first downsampled input data set, and the first convolution layer C00 generates a first output data set by performing a convolution based on the first encoded data set and the upsampled second encoded data set. The second convolution layer C10 generates a second output data set by performing a convolution based on the second encoded data set. The third convolution layer C01 generates a third output data set by performing a convolution based on the first output data set and an upsampled second output data set, wherein the upsampled second output data set is obtained by upsampling the second output data set.
Fig. 7 illustrates another exemplary architecture 72 of the multi-scale nest block 12 of fig. 1 with weighting. Different weights may be set for each level in the multi-scale nest block. The weights may be set based on signal-processing domain knowledge, or they may be learnable parameters. The depicted CNN architecture includes a weighted addition block configured to apply a first weight W0 to the third output data set, apply a second weight W1 to the second output data set, and generate an output of the multi-scale nest block based on the weighted third output data set and the weighted second output data set.
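A minimal sketch of the weighted addition (the weight values here are hypothetical; in practice they are fixed from domain knowledge or learned):

```python
import numpy as np

def weighted_add(outputs, weights):
    """Combine per-level output data sets with scalar weights."""
    return sum(w * o for w, o in zip(weights, outputs))

third_output = np.full((4, 4), 2.0)    # illustrative per-level data sets
second_output = np.full((4, 4), 6.0)
W0, W1 = 0.75, 0.25                    # hypothetical weight values
out = weighted_add([third_output, second_output], [W0, W1])
print(out[0, 0])    # 0.75*2 + 0.25*6 = 3.0
```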
Finally, FIG. 8 illustrates an exemplary architecture 8 of the AMS-Nest that adds concatenations between the M blocks at different levels, comprising a multi-scale input block 81 and a multi-scale nest block 82. FIG. 8 also shows a corresponding aggregation block 83. The corresponding upsampling and concatenation processes are indicated in FIG. 8 by the corresponding upward arrows. As shown in the depicted example, the input of the (f+1)-th level may be fed into the M block of the f-th level.
In fig. 8, the first encoding layer M0 is configured to generate a first encoded data set by performing a convolution based on the input data In0 and an upsampled first downsampled input data set, wherein the upsampled first downsampled input data set is obtained by upsampling the first downsampled input data set In1. The second encoding layer M1 is configured to generate a second encoded data set by performing a convolution based on the first downsampled input data set In1 and an upsampled second downsampled input data set, wherein the upsampled second downsampled input data set is obtained by upsampling the second downsampled input data set In2.
Aggregation block
The aggregation block 13 or 83 may reduce the number of convolution channels and the input dimensions of the output of the nest block 12 or 82 to match the target shape. The aggregation block may include one or more convolution layers, pooling layers, or even recurrent layers. The convolution layers (which may also be trainable) may gradually reduce the number of channels, the pooling layers may reduce the dimensionality, and the recurrent layers may help order the outputs. The number of convolution layers depends on the channel difference between the input and the output of the aggregation block 13 or 83. Let the number of channels of the input of the aggregation block be N_i and the number of channels of the output be N_o (1 for a mono signal and 2 for a stereo signal); the minimum number of convolution layers N_c can then be calculated as:

N_c = ⌈ log_s(N_i / N_o) ⌉

where the step size s represents the per-layer reduction factor.

The number of pooling layers depends on the difference between the number of frames in the input and in the output of the aggregation block. Let the number of frames of the input of the aggregation block be F_i and the number of frames of the output be F_o; the number of pooling layers N_p can then be calculated as:

N_p = ⌈ log_s(F_i / F_o) ⌉

where the step size s represents the pooling size.
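The layer counts above can be computed with a small helper, assuming each layer reduces its input size by a constant factor (the stride or pooling size). Repeated division is used instead of a floating-point logarithm to avoid rounding issues:

```python
def min_layers(size_in, size_out, step):
    """Minimum number of layers to reduce size_in to size_out when each layer
    divides the size by `step` (i.e., the ceiling of log_step(size_in / size_out))."""
    n = 0
    while size_in > size_out:
        size_in /= step
        n += 1
    return n

# Channels: e.g. a 64-channel nest-block output down to mono (1) or stereo (2), stride 2:
print(min_layers(64, 1, 2))   # 6 convolution layers
print(min_layers(64, 2, 2))   # 5 convolution layers
# Frames: e.g. 128 input frames down to 1 output frame with pooling size 4:
print(min_layers(128, 1, 4))  # 4 pooling layers
```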
Interpretation of the drawings
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the disclosed discussions utilizing terms such as "processing," "computing," "calculating," "determining," "analyzing," or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.
In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory, to transform the electronic data into other electronic data, e.g., that may be stored in registers and/or memory. A "computer" or "computing machine" or "computing platform" may include one or more processors.
In one example embodiment, the methods described herein may be performed by one or more processors that accept computer-readable (also referred to as machine-readable) code containing a set of instructions that, when executed by the one or more processors, perform at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further comprise a memory subsystem including main RAM and/or static RAM and/or ROM. A bus subsystem may be included for communication between the components. The processing system may further be a distributed processing system in which the processors are coupled together by a network. If the processing system requires a display, such a display may be included, for example a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data entry is required, the processing system further includes an input device, such as one or more of an alphanumeric input unit (e.g., a keyboard), a pointing control device (e.g., a mouse), and so on. The processing system may also encompass a storage system, such as a disk drive unit. The processing system in some configurations may include a sound output device and a network interface device. The memory subsystem thus includes a computer-readable carrier medium carrying computer-readable code (e.g., software) that includes a set of instructions that, when executed by one or more processors, cause performance of one or more of the methods described herein. It should be noted that when a method includes several elements (e.g., several steps), no order of the elements is implied unless specifically stated.
The software may reside on the hard disk, or it may be completely or at least partially resident in the RAM and/or processor during execution thereof by the computer system. Thus, the memory and processor also constitute a computer-readable carrier medium carrying computer-readable code. Furthermore, the computer readable carrier medium may be formed or included in a computer program product.
In alternative example embodiments, the one or more processors may operate as standalone devices or may be connected (e.g., networked) to other processors. In a networked deployment, the one or more processors may operate in the capacity of a server or user machine in a server-user network environment, or as a peer machine in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), a tablet PC, a personal digital assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
It should be noted that the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
Thus, one example embodiment of each method described herein is in the form of a computer-readable carrier medium carrying a set of instructions, such as a computer program for execution on one or more processors (e.g., one or more processors that are part of a web server arrangement). Accordingly, as will be appreciated by one skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer readable carrier medium (e.g., a computer program product). The computer-readable carrier medium carries computer-readable code comprising a set of instructions that, when executed on one or more processors, cause the one or more processors to implement a method. Accordingly, aspects of the present disclosure may take the form of an entirely hardware example embodiment, an entirely software example embodiment or an example embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.
The software may further be transmitted or received over a network via a network interface device. While the carrier medium is a single medium in the example embodiments, the term "carrier medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "carrier medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by one or more processors and that causes the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical disks, magnetic disks, and magneto-optical disks. Volatile media include dynamic memory, such as main memory. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications. For example, the term "carrier medium" shall accordingly be taken to include, but not be limited to, solid-state memories and computer products embodied in optical and magnetic media; a medium carrying a propagated signal that is detectable by at least one of one or more processors and that represents a set of instructions which, when executed, implement a method; and a transmission medium in a network carrying a propagated signal that is detectable by at least one of the one or more processors and that represents the set of instructions.
It will be appreciated that in one example embodiment, the steps of the methods discussed are performed by a suitable processor (or processors) in a processing (e.g., computer) system executing instructions (computer readable code) stored in a storage device. It will also be appreciated that the present disclosure is not limited to any particular implementation or programming technique, and that the present disclosure may be implemented using any suitable technique for implementing the functions described herein. The present disclosure is not limited to any particular programming language or operating system.
Reference throughout this disclosure to "one example embodiment," "some example embodiments," or "example embodiments" means that a particular feature, structure, or characteristic described in connection with the example embodiments is included in at least one example embodiment of the present disclosure. Thus, the appearances of the phrases "in one example embodiment," "in some example embodiments," or "in example embodiments" in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner, as will be apparent to one of ordinary skill in the art in light of this disclosure, in one or more example embodiments.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
In the claims below and in the description herein, any one of the terms "comprising", "comprised of", or "which comprises" is an open term that means including at least the elements/features that follow, but not excluding other elements/features. Thus, when the term "comprising" is used in the claims, it should not be interpreted as being limited to the means or elements or steps listed thereafter. For example, the scope of the expression "a device comprising A and B" should not be limited to devices consisting only of elements A and B. As used herein, the term "including" or "which includes" is likewise an open term that means including at least the elements/features that follow the term, but not excluding other elements/features. Thus, "including" is synonymous with and means "comprising".
It should be appreciated that in the foregoing description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment/figure or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the description are hereby expressly incorporated into this description, with each claim standing on its own as a separate example embodiment of this disclosure.
Moreover, while some example embodiments described herein include some features included in other example embodiments and not others included in other example embodiments, combinations of features of different example embodiments are intended to be within the scope of the present disclosure and form different example embodiments, as will be appreciated by those of skill in the art. For example, in the appended claims, any of the example embodiments claimed may be used in any combination.
In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Therefore, while there has been described what are believed to be the best modes of the present disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as fall within the scope of the present disclosure. For example, any formulas given above represent only processes that may be used. Functions may be added or deleted from the block diagrams and operations may be interchanged among the functional blocks. Steps may be added or deleted to the methods described within the scope of the present disclosure.

Claims (19)

1. A computing system implementing a Convolutional Neural Network (CNN) architecture, the CNN architecture comprising a multi-scale input block and a multi-scale nest block, wherein the multi-scale input block is configured to receive input data,
generate a first downsampled input data set by downsampling the input data, and
wherein the multi-scale nest block comprises:
a first encoding layer configured to generate a first encoded data set by performing a convolution based on the input data,
a second encoding layer configured to generate a second encoded data set by performing convolution based on the first downsampled input data set, and
a first convolution layer configured to generate a first output data set by performing a convolution based on the first encoded data set and an up-sampled second encoded data set, wherein the up-sampled second encoded data set is obtained by up-sampling the second encoded data set.
2. The computing system of claim 1, wherein the multi-scale nest block further comprises:
a second convolution layer configured to generate a second output data set by performing a convolution based on the second encoded data set; and
a third convolution layer configured to generate a third output data set by performing a convolution based on the first output data set and an up-sampled second output data set, wherein the up-sampled second output data set is obtained by up-sampling the second output data set.
3. The computing system of claim 2 wherein,
the multi-scale input block is further configured to generate a second downsampled input data set by downsampling the first downsampled input data set,
the multi-scale nested block further comprises a third encoding layer configured to generate a third encoded data set by performing a convolution based on the second downsampled input data set, and
the second convolution layer is configured to generate the second output data set by performing a convolution based on the second encoded data set and an up-sampled third encoded data set, wherein the up-sampled third encoded data set is obtained by up-sampling the third encoded data set.
4. The computing system of any of the preceding claims, wherein,
the second encoding layer is configured to generate the second encoded data set by performing a convolution based on the first downsampled input data set and a downsampled first encoded data set, wherein the downsampled first encoded data set is obtained by downsampling the first encoded data set.
5. The computing system of any one of claims 2 to 4, wherein,
the second convolution layer is configured to generate the second output data set by performing a convolution based on the second encoded data set, a downsampled first output data set obtained by downsampling the first output data set, and an upsampled third encoded data set obtained by upsampling the third encoded data set.
6. The computing system of any one of claims 2 to 5, wherein,
the third convolution layer is configured to generate the third output data set by performing a convolution based on the first output data set, an up-sampled second output data set, obtained by up-sampling the second output data set, and the first encoded data set.
7. A computing system according to claim 3 or any claim dependent on claim 3, wherein the third encoding layer is configured to generate the third encoded data set by performing a convolution based on the second downsampled input data set, a downsampled first encoded data set obtained by downsampling the first encoded data set, and a downsampled second encoded data set obtained by downsampling the second encoded data set.
8. The computing system of any one of claims 2 to 7, wherein the CNN architecture includes a weighted addition block configured to apply a first weight to the third output data set,
applying a second weight to the second output data set, an
-generating an output of the multi-scale nested block based on the weighted third output dataset and the weighted second output dataset.
9. The computing system of any of the preceding claims, wherein,
the first encoding layer is configured to generate the first encoded data set by performing a convolution based on the input data and an upsampled first downsampled input data set, wherein the upsampled first downsampled input data set is obtained by upsampling the first downsampled input data set, or
The second encoding layer is configured to generate the second encoded data set by performing a convolution based on the first downsampled input data set and an upsampled second downsampled input data set, wherein the upsampled second downsampled input data set is obtained by upsampling the second downsampled input data set.
10. The computing system of any of the preceding claims, wherein the multi-scale input block comprises a convolutional layer or a dense layer configured to generate the first downsampled input data set based on the input data.
11. The computing system of any of claims 1 to 9, wherein the multi-scale input block is configured to generate the first downsampled input data set using a max-pooling process, an average pooling process, or a mixture of max-pooling and average pooling processes.
12. The computing system of any of the preceding claims, wherein the first encoding layer or the second encoding layer comprises a multi-scale convolution block configured to generate an output by concatenating or adding outputs of at least two parallel convolution paths.
13. The computing system of claim 12, wherein the multi-scale convolution block is configured to weight the outputs of the at least two parallel convolution paths using different weights.
14. The computing system of any of the preceding claims, wherein the input data comprises an audio signal, wherein the CNN architecture further comprises an aggregate block configured to receive an output of the multi-scale nested block, and wherein the aggregate block comprises at least one of:
A convolution layer configured to reduce a number of channels associated with the input data,
a pooling layer configured to reduce a dimension associated with the input data, and
a loop layer configured to order the outputs of the multi-scale nested blocks.
15. The computing system of claim 3 or any claim dependent on claim 3, wherein the multi-scale input block is further configured to generate a third downsampled input data set by downsampling the second downsampled input data set, and wherein the multi-scale nested block further comprises:
a fourth encoding layer configured to generate a fourth encoded data set by performing a convolution based on the third downsampled input data set,
a fourth convolution layer configured to generate a fourth output data set by performing a convolution based on the third encoded data set and an upsampled fourth encoded data set, wherein the upsampled fourth encoded data set is obtained by upsampling the fourth encoded data set,
A fifth convolution layer configured to generate a fifth output data set by performing convolution based on the second output data set and an up-sampled fourth output data set, wherein the up-sampled fourth output data set is obtained by up-sampling the fourth output data set, and
a sixth convolution layer configured to generate a sixth output data set by performing convolution based on the third output data set and an up-sampled fifth output data set, wherein the up-sampled fifth output data set is obtained by up-sampling the fifth output data set.
16. An apparatus for audio processing, wherein,
the device is configured to receive an input audio signal and to output an output audio signal,
the apparatus comprising a computing system implementing a CNN architecture according to any of the preceding claims, and
the input data received by the multi-scale input block is based on the input audio signal and the output audio signal is based on the third output data set generated by a third convolution layer of the multi-scale nested block.
17. A method of audio processing using a Convolutional Neural Network (CNN), the method comprising:
receiving input data,
generating a first downsampled input data set by downsampling the input data,
generating a first encoded data set by performing a convolution based on the input data,
generating a second encoded data set by performing a convolution based on the first downsampled input data set, and
generating a first output data set by performing a convolution based on the first encoded data set and an upsampled second encoded data set, wherein the upsampled second encoded data set is obtained by upsampling the second encoded data set.
18. The method of claim 17, further comprising:
generating a second output data set by performing a convolution based on the second encoded data set; and
generating a third output data set by performing a convolution based on the first output data set and an upsampled second output data set, wherein the upsampled second output data set is obtained by upsampling the second output data set.
19. A computer program product comprising a computer-readable storage medium having instructions which, when executed by a device having processing capability, cause the device to perform the method of claim 17 or 18.
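The data flow recited in claims 17 and 18 can be illustrated with a minimal NumPy sketch. This is a simplification, not the claimed implementation: real CNN layers use learned multi-channel kernels, and the claimed concatenation followed by a channel-mixing convolution is stood in for here by a simple average; all function names and the fixed kernel are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernel):
    # 1-D "same"-length convolution standing in for a CNN layer
    return np.convolve(x, kernel, mode="same")

def downsample(x, factor=2):
    # average pooling by `factor` (one way to downsample)
    return x.reshape(-1, factor).mean(axis=1)

def upsample(x, factor=2):
    # nearest-neighbour upsampling by `factor`
    return np.repeat(x, factor)

def merge(a, b):
    # stand-in for concatenation along the channel axis followed by a
    # channel-mixing convolution (here: plain averaging)
    return 0.5 * (a + b)

def nested_block(x, k):
    x_down = downsample(x)                 # first downsampled input data set
    enc1 = conv1d(x, k)                    # first encoded data set (full scale)
    enc2 = conv1d(x_down, k)               # second encoded data set (half scale)
    # claim 17: first output data set from enc1 and the upsampled enc2
    out1 = conv1d(merge(enc1, upsample(enc2)), k)
    # claim 18: second and third output data sets
    out2 = conv1d(enc2, k)
    out3 = conv1d(merge(out1, upsample(out2)), k)
    return out3

x = np.arange(8, dtype=float)              # toy stand-in for audio input data
k = np.array([0.25, 0.5, 0.25])            # fixed smoothing kernel
out = nested_block(x, k)
print(out.shape)  # (8,) - the output stays at the input's full resolution
```

Claim 15 extends the same pattern one scale deeper: a third downsampled input set feeds a fourth encoding layer, and the fourth through sixth convolution layers merge each scale with the upsampled output of the scale below, giving the densely nested skip connections described in the abstract.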
CN202180071571.2A 2020-10-19 2021-10-19 Method and apparatus for audio processing using nested convolutional neural network architecture Pending CN116368495A (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
CNPCT/CN2020/121829 2020-10-19
CN2020121829 2020-10-19
US63/112,220 2020-11-11
EP20211501.0 2020-12-03
CN2021078705 2021-03-02
CNPCT/CN2021/078705 2021-03-02
US202163164028P 2021-03-22 2021-03-22
US63/164,028 2021-03-22
PCT/US2021/055691 WO2022087025A1 (en) 2020-10-19 2021-10-19 Method and apparatus for audio processing using a nested convolutional neural network architecture

Publications (1)

Publication Number Publication Date
CN116368495A true CN116368495A (en) 2023-06-30

Family

ID=86924912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180071571.2A Pending CN116368495A (en) 2020-10-19 2021-10-19 Method and apparatus for audio processing using nested convolutional neural network architecture

Country Status (1)

Country Link
CN (1) CN116368495A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination