WO2022204729A1 - Broadcasted residual learning - Google Patents

Broadcasted residual learning

Info

Publication number
WO2022204729A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
multidimensional
convolution
temporal
dimension
Prior art date
Application number
PCT/US2022/071364
Other languages
French (fr)
Inventor
Byeonggeun KIM
Simyung CHANG
Jinkyu Lee
Dooyong Sung
Original Assignee
Qualcomm Incorporated
Priority date
Filing date
Publication date
Application filed by Qualcomm Incorporated filed Critical Qualcomm Incorporated
Priority to BR112023018634A priority Critical patent/BR112023018634A2/en
Priority to EP22716833.3A priority patent/EP4315167A1/en
Priority to CN202280022308.9A priority patent/CN117015784A/en
Priority to KR1020237031629A priority patent/KR20230159418A/en
Priority to JP2023557146A priority patent/JP2024511033A/en
Priority claimed from US17/656,621 external-priority patent/US20220309344A1/en
Publication of WO2022204729A1 publication Critical patent/WO2022204729A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • aspects of the present disclosure relate to machine learning, and more specifically, to efficient data processing.
  • Certain aspects provide a method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
  • processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer- readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
  • FIG. 1 depicts an example workflow for broadcasted residual learning.
  • FIG. 2 depicts example block diagrams for residual learning techniques.
  • FIG. 3 is an example broadcasted residual learning block for use in efficient processing of input data.
  • FIG. 4 is an example broadcasted residual learning block for use in efficient processing of input data in a transitional layer.
  • FIG. 5 is an example flow diagram illustrating a method for processing data using broadcasted residual learning.
  • FIG. 6 depicts an example processing system configured to perform various aspects of the present disclosure.
  • aspects of the present disclosure provide techniques for broadcasted residual learning.
  • the techniques described herein provide high model accuracy and significantly improved computational efficiency (e.g., a small model size and light computational load), as compared to existing approaches.
  • A wide variety of efficient convolutional neural networks (CNNs) have been developed recently. Generally, the CNNs are made up of repeated blocks of the same structure and are often based on residual learning and depthwise separable convolutions. This has resulted in a number of CNN-based KWS approaches.
  • Existing approaches either use one-dimensional temporal convolutions or two-dimensional (e.g., frequency and temporal) convolutions. Each approach has respective benefits and drawbacks.
  • the broadcasted residual learning techniques described herein can be used to efficiently process data, both during training (while training data is passed through the model) and during runtime (when new data is passed through to generate inferences).
  • broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS).
  • the audio data and features can be represented using two-dimensional tensors (e.g., with a frequency dimension and a temporal dimension).
  • the broadcasted residual learning generally involves performing convolution on input tensors to extract two-dimensional features, reducing the dimensionality of the two-dimensional features to allow for efficient convolutions on the features (e.g., requiring reduced computations, processing steps, and energy), expanding the resulting tensors to the original dimensionality of the two-dimensional features, and augmenting the expanded tensors with the original two-dimensional features.
  • the expanded tensors are further augmented with the original input tensor.
  • the broadcasted residual learning described herein can be performed in a neural network architecture to perform a variety of tasks, such as classifying input audio.
  • the techniques described herein can be implemented as broadcasted residual learning blocks, and a number of these blocks can be used in sequence within a neural network architecture.
  • the broadcasted residual learning retains many residual functions of one-dimensional temporal convolution, while still allowing two-dimensional convolution to be used together via a broadcasted-residual connection that expands temporal output to the frequency dimension.
  • This residual mapping enables the network to effectively represent useful audio features with far less computation than conventional convolutional neural networks, which reduces computational complexity, latency, compute requirements, memory requirements, and the like.
  • the broadcasted residual learning techniques described herein can achieve state-of-the-art accuracy on speech command datasets using fewer computations and parameters, as compared to conventional systems.
  • FIG. 1 depicts an example workflow 100 for broadcasted residual learning.
  • the workflow 100 begins with an input tensor 105.
  • the tensor 105 may be audio data (e.g., represented by a log Mel spectrogram indicating a spectrum of frequencies over time), or audio features (e.g., features generated by processing audio data).
  • the input tensor 105 is a two-dimensional tensor with a frequency dimension and a temporal dimension.
  • the temporal dimension may be delineated into time intervals or steps, while the frequency dimension is delineated based on frequency values or bands.
  • the frequencies present at each interval (e.g., the magnitude of sound at each frequency) can be reflected via the values in the tensor.
  • the input tensor 105 is processed using a first convolution operation 110, resulting in a set of two-dimensional feature maps 115.
  • the feature maps 115 have dimensionality H X W X c, where H and W are spatial dimensions (e.g., a frequency dimension and a temporal dimension, respectively) and c is the number of channels.
  • the convolution operation 110 is a depthwise convolution performed using one or more kernels configured to extract features of the frequency dimension.
  • the convolution operation 110 may use n X 1 kernels, where n corresponds to the frequency dimension. That is, the depthwise kernels for the convolution operation 110 may have a length greater than one in the frequency dimension, with a length of one in the temporal dimension. This allows the convolution operation 110 to serve as a frequency depthwise convolution that extracts frequency features (e.g., feature maps 115) for the tensor 105.
  • these feature maps 115 are two-dimensional (with a length greater than one in both the frequency dimension and the temporal dimension).
  • a dimension reduction operation 120 is performed to reduce the dimensionality of the feature maps 115.
  • the dimension reduction operation 120 may reduce the feature maps 115 to eliminate the frequency dimension and preserve the temporal dimension. This results in one-dimensional feature maps 125.
  • the feature maps 125 may have the same temporal dimensionality and number of channels as the feature maps 115, but with a length of one in the frequency dimension.
  • the dimension reduction operation 120 is generally performed on a per frequency (or a per frequency band) basis, and can include a variety of techniques, including maximum pooling (such that the maximum value, or the feature with the most activated presence, is retained), average pooling (such that the average value is retained), minimum pooling (such that the minimum value is retained), and the like.
  • the dimension reduction operation 120 can also be performed by convolving the feature maps 115 using an H X 1 kernel without padding in order to reduce the dimension, where H corresponds to the size of the frequency dimension.
  • the one-dimensional feature maps 125 (which correspond to the temporal dimension) can be convolved with significantly fewer computational resources, as compared to traditional two-dimensional convolution. This significantly improves the efficiency of the broadcasted residual learning.
  • the feature maps 125 are processed using a second convolution operation 130.
  • the convolution operation 130 is a depthwise-separable convolution (e.g., a depthwise convolution followed by a pointwise convolution).
  • the convolution operation 130 may be performed using one or more kernels configured to extract features for the temporal dimension.
  • the convolution operation 130 may use 1 X m kernels, where m corresponds to the temporal dimension.
  • the depthwise kernels for the convolution operation 130 may have a length greater than one in the temporal dimension, with a length of one in the frequency dimension. This allows the convolution operation 130 to serve as a temporal depthwise convolution that extracts temporal features for the feature maps 125.
  • the convolution operation 130 may be a depthwise separable convolution. In such an aspect, following the temporal depthwise convolution, the convolution operation 130 can apply one or more pointwise kernels. This results in feature maps 135.
  • the feature maps 135 are then broadcasted to the frequency dimension, as indicated by the arrows 137.
  • This broadcasting operation (also referred to as an expanding operation) generally converts the one-dimensional feature maps 135 to multi-dimensional feature maps 140 with the same dimensionality as the feature maps 115.
  • the broadcasting involves copying and stacking the feature maps 135 until they reach a height of H (in this example).
  • the residual connection 150 reflects the residual nature of broadcasted residual learning.
  • the input tensor 105 is augmented with the feature maps 140 using operation 145 to generate the output 155.
  • the feature maps 140 may also or alternatively be augmented with the feature maps 115.
  • This operation 145 may generally include any number of combination techniques, including element-wise summation, averaging, multiplication, and the like.
  • the residual connection 150 allows the system to retain two-dimensional features of the input, despite the dimension reduction operation 120.
  • FIG. 2 depicts example block diagrams 200A and 200B for residual learning techniques.
  • the input 205 is processed using some convolution operation 210.
  • the resulting tensor can then be summed with the original input 205 (via the identity shortcut 215), as indicated by operation 220. This yields the output 225 of the ordinary residual block 200A.
  • the function f(x) (reflected by convolution operation 210) may be decomposed into f1 and f2, which correspond to the temporal and two-dimensional operations, respectively. This is reflected in the broadcasted residual block 200B.
  • x and y are input and output features, respectively, and f1 and f2 are convolution operations
  • BC(·) is a broadcasting or expansion operation
  • reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension).
  • H and W are the frequency and time steps, respectively.
  • input 250 is processed using a convolution operation 255 to extract two-dimensional features.
  • the resulting tensor can then be reduced using dimension reduction 260, and the reduced tensor(s) are processed using the convolution operation 265 to extract temporal features. These features are then expanded to the frequency dimension and augmented with the original input 250 via the identity shortcut 270, resulting in output 280.
  • FIG. 3 is an example broadcasted residual learning block 300 for use in efficient processing of input data, such as audio input data.
  • an input tensor 305 is received and processed using a first operation 310 (labeled f2 in FIG. 3).
  • the operation 310 corresponds to the two-dimensional feature extraction discussed above (e.g., convolution operation 110), and yields two-dimensional feature maps in ℝ^(H×W) (e.g., feature maps 115 in FIG. 1).
  • the convolution operation 310 is performed using a frequency depthwise convolution 320 that comprises one or more n X 1 frequency-depthwise convolution kernels.
  • the operation 310 also includes a SubSpectral Normalization (SSN) operation 325.
  • the SSN operation 325 generally operates by splitting the input features (generated by the frequency depthwise convolution 320) into sub-bands in the frequency dimension, and separately normalizing each sub-band (e.g., with batch normalization). This allows the system to achieve frequency-aware temporal features, as compared to ordinary batch normalization on the entire feature set.
  • broadcasted residual learning block 300 uses frequency average pooling to average the input features by frequency, resulting in features in ℝ^(1×W) as discussed above (e.g., feature maps 125 in FIG. 1).
  • the operation 340 may correspond to the temporal convolution operation discussed above (e.g., convolution operation 130).
  • the operation 340 is a depthwise separable convolution (e.g., a composite of a temporal depthwise convolution 345 and a pointwise convolution 355).
  • the temporal depthwise convolution 345 may comprise one or more 1 X m temporal-depthwise convolution kernels to generate temporal features (e.g., feature maps 135 in FIG. 1).
  • the operation 340 then includes a batch normalization operation 350 followed by swish activation (also indicated by 350). Although swish activation is depicted in FIG. 3, in aspects, any suitable activation function can be used.
  • the operation 340 can also include channel-wise dropout (indicated by 360) at a dropout rate p. This dropout can be used as regularization for the model in order to prevent overfitting and improve generalization.
  • a broadcasting operation (which may correspond to the broadcasting operation 137 of FIG. 1), represented by operation 365 (which also includes the tensor augmentation discussed above with reference to operation 145 of FIG. 1), can then be used to expand the features from the operation 340 (in ℝ^(1×W)) to ℝ^(H×W).
  • the system uses not only the residual connection 315 (sometimes referred to as the “identity shortcut”) to augment the features with the original input 305 (at operation 365), but also uses an auxiliary residual connection 335 from the two-dimensional features output by the frequency depthwise convolution 320 (at operation 365).
  • This auxiliary residual connection 335 enables the system to retain frequency-aware features of the input, despite the dimension reduction operation.
  • the output of this broadcasting and augmentation operation 365 can then be processed using one or more activation functions (e.g., ReLU function 370), and then provided as output 375 from the residual learning block 300.
  • the broadcasted residual learning block 300 can be expressed as y = x + f2(x) + BC(f1(reduction(f2(x)))), where x and y are input and output features, respectively, f1 and f2 are convolution operations, BC(·) is a broadcasting or expansion operation, and reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension).
  • machine learning models can provide, for example, more efficient KWS as compared to conventional techniques while retaining two-dimensional features.
  • the computational load is reduced by a factor of the frequency steps H (often forty or more), as compared to traditional two-dimensional depthwise separable convolutions.
  • FIG. 4 is an example transition broadcasted residual learning block 400 for use in efficient processing of input data, such as audio input data.
  • the transition broadcasted residual learning block 400 is similar to the normal broadcasted residual learning block 300, with two differences that enable the transition broadcasted residual learning block 400 to be used in transitional layers where the number of channels in the input 405 differs from the number of channels in the output 475.
  • the operation 410 replaces the operation 310 in FIG. 3.
  • the operation 410 includes an additional pointwise convolution 412, which is used to change the number of channels in the input 405 to the desired number of channels for the output 475.
  • this pointwise convolution 412 may be followed with batch normalization and an activation function (such as ReLU), indicated by 413.
  • transition broadcasted residual learning block 400 does not include the identity shortcut (residual connection 315 in FIG. 3). That is, the transition broadcasted residual learning block 400 does not augment the output using the input 405 (because the dimensionality differs).
  • transition broadcasted residual learning block 400 largely mirrors the normal broadcasted residual learning block 300 described above with reference to FIG. 3.
  • FIG. 5 is an example flow diagram illustrating a method 500 for processing data using broadcasted residual learning.
  • the method 500 begins at block 505, where a processing system receives an input tensor comprising a frequency dimension and a temporal dimension.
  • the processing system processes the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension.
  • the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
  • the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
  • the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
  • the processing system converts the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation.
  • the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
  • the method 500 further comprises performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
  • the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub band of the plurality of sub bands.
  • the processing system processes the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map.
  • the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
  • a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
  • the processing system expands the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map.
  • the processing system augments the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
  • the method 500 further includes outputting the augmented multidimensional output (e.g., as output from a residual block to another residual block or other block or layer of a model, as output from the model, and the like).
  • the method 500 further comprises augmenting the multidimensional output feature map with the input tensor via a second residual connection.
  • the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
  • the techniques, methods, and workflows described with respect to FIGS. 1-5 may be performed on one or more devices.
  • FIG. 6 depicts an example processing system 600 which may be configured to perform aspects of the various methods described herein, including, for example, the methods described with respect to FIGS. 1-5.
  • Processing system 600 includes a central processing unit (CPU) 602, which in some examples may be a multi-core CPU. Instructions executed at the CPU 602 may be loaded, for example, from a program memory associated with the CPU 602 or may be loaded from a memory partition 624.
  • Processing system 600 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 604, a digital signal processor (DSP) 606, a neural processing unit (NPU) 608, a multimedia processing unit 610, and a wireless connectivity component 612.
  • An NPU such as 608, is generally a specialized circuit configured for implementing all the necessary control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like.
  • An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.
  • NPUs, such as 608, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
  • a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples they may be part of a dedicated neural-network accelerator.
  • NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
  • the two tasks may still generally be performed independently.
  • NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
  • NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process it through an already trained model to generate a model output (e.g., an inference).
  • NPU 608 is a part of one or more of CPU 602, GPU 604, and/or DSP 606.
  • wireless connectivity component 612 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
  • Wireless connectivity processing component 612 is further connected to one or more antennas 614.
  • Processing system 600 may also include one or more sensor processing units 616 associated with any manner of sensor, one or more image signal processors (ISPs) 618 associated with any manner of image sensor, and/or a navigation processor 620, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
  • Processing system 600 may also include one or more input and/or output devices 622, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.
  • one or more of the processors of processing system 600 may be based on an ARM or RISC-V instruction set.
  • Processing system 600 also includes memory 624, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
  • memory 624 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 600.
  • memory 624 includes machine learning component 624A, which may be configured according to one or more aspects described herein.
  • the machine learning component 624A may provide data or audio analysis using one or more machine learning models (e.g., neural networks) configured with one or more broadcasted residual learning blocks.
  • the memory 624 further includes a set of frequency depthwise kernel(s) 624B and a set of temporal depthwise kernel(s) 624C.
  • the frequency depthwise kernels 624B generally include one-dimensional kernels with a length greater than one in the frequency dimension
  • temporal depthwise kernels 624C include one-dimensional kernels with a length greater than one in the temporal dimension.
  • the frequency depthwise kernels 624B can generally be used to perform frequency depthwise convolution (e.g., convolution operation 110 in FIG. 1), while the temporal depthwise kernels 624C are generally used to perform temporal depthwise convolution (e.g., convolution operation 130 in FIG. 1).
  • Processing system 600 further comprises machine learning circuit 626, such as described above, for example, with respect to FIGS. 1-5.
  • machine learning circuit 626 may be implemented in other processing devices of processing system 600, such as within CPU 602, GPU 604, DSP 606, NPU 608, and the like.
  • processing system 600 and/or components thereof may be configured to perform the methods described herein.
  • aspects of processing system 600 may be omitted, such as where processing system 600 is a server computer or the like.
  • multimedia component 610, wireless connectivity 612, sensors 616, ISPs 618, and/or navigation component 620 may be omitted in other aspects.
  • aspects of processing system 600 may be distributed between multiple devices.
  • a method comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
  • Clause 2 The method of Clause 1, wherein the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
  • Clause 3 The method of any of Clauses 1-2, further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
  • Clause 4 The method of any one of Clauses 1-3, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
  • Clause 5 The method of any one of Clauses 1-4, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
  • Clause 6 The method of any one of Clauses 1-5, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
  • Clause 7 The method of any one of Clauses 1-6, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub bands in the frequency dimension; and performing batch normalization on each sub band of the plurality of sub-bands.
  • Clause 8 The method of any one of Clauses 1-7, wherein the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
  • Clause 9 The method of any one of Clauses 1-8, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
  • Clause 10 The method of any one of Clauses 1-9, wherein a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
  • Clause 11 The method of any one of Clauses 1-10, wherein: the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
  • Clause 12 A system, comprising means for performing a method in accordance with any one of Clauses 1-11.
  • Clause 13 A system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any one of Clauses 1-11.
  • Clause 14 A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any one of Clauses 1-11.
  • Clause 15 A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any one of Clauses 1-11.
  • an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
  • the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
  • exemplary means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
  • a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
  • “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
  • determining encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” may include resolving, selecting, choosing, establishing and the like.
  • the term “connected to”, in the context of sharing electronic signals and data between the elements described herein, may generally mean in data communication between the respective elements that are connected to each other.
  • elements may be directly connected to each other, such as via one or more conductive traces, lines, or other conductive carriers capable of carrying signals and/or data between the respective elements that are directly connected to each other.
  • elements may be indirectly connected to each other, such as via one or more data busses or similar shared circuitry and/or integrated circuit elements for communicating signals and data between the respective elements that are indirectly connected to each other.
  • the methods disclosed herein comprise one or more steps or actions for achieving the methods.
  • the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
  • the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
  • the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
  • the means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Certain aspects of the present disclosure provide techniques for efficient broadcasted residual machine learning. An input tensor comprising a frequency dimension and a temporal dimension is received, and the input tensor is processed with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension. The multidimensional intermediate feature map is converted to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation, and the one-dimensional intermediate feature map is processed using a second convolution operation to generate a temporal feature map. The temporal feature map is expanded to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map, and the multidimensional output feature map is augmented with the multidimensional intermediate feature map via a first residual connection.

Description

BROADCASTED RESIDUAL LEARNING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Patent Application No. 17/656,621, filed March 25, 2022, which claims the benefit of and priority to U.S. Provisional Patent Application No. 63/166,161, filed on March 25, 2021, the entire contents of each of which are incorporated herein by reference in their entirety.
INTRODUCTION
[0002] Aspects of the present disclosure relate to machine learning, and more specifically, to efficient data processing.
[0003] Designing efficient machine learning architectures is an important topic in neural speech processing. In particular, keyword spotting (KWS), which aims to detect a predefined keyword, has become increasingly important. KWS plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to provide models that minimize errors while also operating efficiently. Model efficiency is particularly important in KWS, as the process is typically performed in edge devices (e.g., in devices with limited resources such as mobile phones, smart speakers, and Internet of Things (IoT) devices) while simultaneously requiring low latency.
[0004] Accordingly, systems and methods are needed for providing high accuracy classifications with efficient model designs.
BRIEF SUMMARY
[0005] Certain aspects provide a method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection. [0006] Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer- readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
[0007] The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The appended figures depict certain aspects of the one or more aspects and are therefore not to be considered limiting of the scope of this disclosure.
[0009] FIG. 1 depicts an example workflow for broadcasted residual learning.
[0010] FIG. 2 depicts example block diagrams for residual learning techniques.
[0011] FIG. 3 is an example broadcasted residual learning block for use in efficient processing of input data.
[0012] FIG. 4 is an example broadcasted residual learning block for use in efficient processing of input data in a transitional layer.
[0013] FIG. 5 is an example flow diagram illustrating a method for processing data using broadcasted residual learning.
[0014] FIG. 6 depicts an example processing system configured to perform various aspects of the present disclosure.
[0015] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.
DETAILED DESCRIPTION
[0016] Aspects of the present disclosure provide techniques for broadcasted residual learning. The techniques described herein provide high model accuracy and significantly improved computational efficiency (e.g., a small model size and light computational load), as compared to existing approaches.
[0017] A wide variety of efficient convolutional neural networks (CNNs) have been developed recently. Generally, the CNNs are made up of repeated blocks of the same structure and are often based on residual learning and depthwise separable convolutions. This has resulted in a number of CNN-based KWS approaches. Existing approaches either use one-dimensional temporal convolutions or two-dimensional (e.g., frequency and temporal) convolutions. Each approach has respective benefits and drawbacks.
[0018] For example, for models using one-dimensional temporal convolution, less computing resources are typically needed, as compared to models relying on two- dimensional approaches. However, with one-dimensional convolution, the internal biases of the convolution (such as translation equivariance) cannot be obtained for the frequency dimension.
[0019] On the other hand, approaches based on two-dimensional convolution require significantly more computational resources than one-dimensional methods, even when using efficient designs and architectures such as depthwise separable convolution. This may prevent such two-dimensional approaches from being useful for a wide variety of devices and implementations.
[0020] The broadcasted residual learning techniques described herein can be used to efficiently process data, both during training (while training data is passed through the model) and during runtime (when new data is passed through to generate inferences).
[0021] In some aspects, broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS). Generally, the audio data and features can be represented using two-dimensional tensors (e.g., with a frequency dimension and a temporal dimension). Although audio is used in examples herein, aspects of the present disclosure can be readily applied to a wide variety of data.
[0022] In some aspects, the broadcasted residual learning generally involves performing convolution on input tensors to extract two-dimensional features, reducing the dimensionality of the two-dimensional features to allow for efficient convolutions on the features (e.g., requiring reduced computations, processing steps, and energy), expanding the resulting tensors to the original dimensionality of the two-dimensional features, and augmenting the expanded tensors with the original two-dimensional features. In some aspects, the expanded tensors are further augmented with the original input tensor.
[0023] In some aspects, the broadcasted residual learning described herein can be performed in a neural network architecture to perform a variety of tasks, such as classifying input audio. For example, the techniques described herein can be implemented as broadcasted residual learning blocks, and a number of these blocks can be used in sequence within a neural network architecture.
[0024] Advantageously, the broadcasted residual learning retains many residual functions of one-dimensional temporal convolution, while still allowing two-dimensional convolution to be used together via a broadcasted-residual connection that expands temporal output to the frequency dimension. This residual mapping enables the network to effectively represent useful audio features with far less computation than conventional convolutional neural networks, which reduces computational complexity, latency, compute requirements, memory requirements, and the like. In aspects, the broadcasted residual learning techniques described herein can achieve state-of-the-art accuracy on speech command datasets using fewer computations and parameters, as compared to conventional systems.
Example Workflow for Broadcasted Residual Learning
[0025] FIG. 1 depicts an example workflow 100 for broadcasted residual learning. The workflow 100 begins with an input tensor 105. In some examples, the tensor 105 may be audio data (e.g., represented by a log Mel spectrogram indicating a spectrum of frequencies over time), or audio features (e.g., features generated by processing audio data). In some aspects, the input tensor 105 is a two-dimensional tensor with a frequency dimension and a temporal dimension. The temporal dimension may be delineated into time intervals or steps, while the frequency dimension is delineated based on frequency values or bands. The frequencies present at each interval (e.g., the magnitude of sound at each frequency) can be reflected via the values in the tensor.
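As one concrete illustration of such an input (a minimal sketch assuming a torchaudio front end; the sample rate, FFT size, hop length, and number of Mel bins are illustrative choices rather than values from the disclosure), a log Mel spectrogram tensor may be produced as follows:

    # Sketch: building a log Mel spectrogram input tensor (frequency x time).
    # All parameter values below are illustrative assumptions.
    import torch
    import torchaudio

    waveform = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=40
    )(waveform)                            # shape (1, 40, 101): (channel, frequency, time)
    input_tensor = torch.log(mel + 1e-6)   # log Mel spectrogram, analogous to tensor 105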
[0026] The input tensor 105 is processed using a first convolution operation 110, resulting in a set of two-dimensional feature maps 115. As illustrated, the feature maps 115 have dimensionality H X W X c, where H and W are spatial dimensions (e.g., a frequency dimension and a temporal dimension, respectively) and c is the number of channels.
[0027] In one aspect, the convolution operation 110 is a depthwise convolution performed using one or more kernels configured to extract features of the frequency dimension. For example, the convolution operation 110 may use n X 1 kernels, where n corresponds to the frequency dimension. That is, the depthwise kernels for the convolution operation 110 may have a length greater than one in the frequency dimension, with a length of one in the temporal dimension. This allows the convolution operation 110 to serve as a frequency depthwise convolution that extracts frequency features (e.g., feature maps 115) for the tensor 105.
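A minimal sketch of such a frequency depthwise convolution, assuming a PyTorch implementation; the channel count and the 3 X 1 kernel size are illustrative assumptions:

    # Sketch: frequency depthwise convolution (n x 1 kernels), one kernel per channel.
    import torch
    import torch.nn as nn

    c, H, W = 16, 40, 101                      # channels, frequency steps, time steps (illustrative)
    x = torch.randn(1, c, H, W)                # (batch, channel, frequency, time), like tensor 105
    freq_dw = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0), groups=c)
    feature_maps = freq_dw(x)                  # still (1, c, H, W): two-dimensional frequency features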
[0028] As illustrated, these feature maps 115 are two-dimensional (with a length greater than one in both the frequency dimension and the temporal dimension). In the illustrated workflow 100, a dimension reduction operation 120 is performed to reduce the dimensionality of the feature maps 115. Specifically, the dimension reduction operation 120 may reduce the feature maps 115 to eliminate the frequency dimension and preserve the temporal dimension. This results in one-dimensional feature maps 125. The feature maps 125 may have the same temporal dimensionality and number of channels as the feature maps 115, but with a length of one in the frequency dimension.
[0029] The dimension reduction operation 120 is generally performed on a per frequency (or a per frequency band) basis, and can include a variety of techniques, including maximum pooling (such that the maximum value, or the feature with the most activated presence, is retained), average pooling (such that the average value is retained), minimum pooling (such that the minimum value is retained), and the like. In some aspects, the dimension reduction operation 120 can also be performed by convolving the feature maps 115 using an H X 1 kernel without padding in order to reduce the dimension, where H corresponds to the size of the frequency dimension.
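For example, under the same illustrative PyTorch assumptions, the frequency dimension reduction may be realized either as average pooling over the frequency axis or as an unpadded H X 1 depthwise convolution:

    # Sketch: collapsing the frequency dimension to length one while preserving time.
    import torch
    import torch.nn as nn

    c, H, W = 16, 40, 101
    feature_maps_2d = torch.randn(1, c, H, W)               # like feature maps 115
    reduced = feature_maps_2d.mean(dim=2, keepdim=True)     # average pooling: (1, c, 1, W)

    # Alternative: an unpadded H x 1 depthwise convolution also eliminates frequency.
    reduce_conv = nn.Conv2d(c, c, kernel_size=(H, 1), groups=c)
    reduced_alt = reduce_conv(feature_maps_2d)              # (1, c, 1, W)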
[0030] Advantageously, the one-dimensional feature maps 125 (which correspond to the temporal dimension) can be convolved with significantly fewer computational resources, as compared to traditional two-dimensional convolution. This significantly improves the efficiency of the broadcasted residual learning.
[0031] As illustrated, the feature maps 125 are processed using a second convolution operation 130. In some aspects, the convolution operation 130 is a depthwise-separable convolution (e.g., a depthwise convolution followed by a pointwise convolution). In contrast to the convolution operation 110 (which corresponds to the frequency dimension), the convolution operation 130 may be performed using one or more kernels configured to extract features for the temporal dimension. For example, the convolution operation 130 may use 1 X m kernels, where m corresponds to the temporal dimension.
[0032] That is, the depthwise kernels for the convolution operation 130 may have a length greater than one in the temporal dimension, with a length of one in the frequency dimension. This allows the convolution operation 130 to serve as a temporal depthwise convolution that extracts temporal features for the feature maps 125. In some aspects, the convolution operation 130 may be a depthwise separable convolution. In such an aspect, following the temporal depthwise convolution, the convolution operation 130 can apply one or more pointwise kernels. This results in feature maps 135.
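A sketch of this temporal depthwise separable convolution under the same illustrative assumptions (the kernel length m = 3 is arbitrary):

    # Sketch: temporal depthwise convolution (1 x m kernels) followed by a pointwise convolution.
    import torch
    import torch.nn as nn

    c, W = 16, 101
    reduced = torch.randn(1, c, 1, W)                                # frequency already reduced
    temporal_dw = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1), groups=c)
    pointwise = nn.Conv2d(c, c, kernel_size=1)
    temporal_features = pointwise(temporal_dw(reduced))              # (1, c, 1, W), like feature maps 135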
[0033] In the workflow 100, the feature maps 135 are then broadcasted to the frequency dimension, as indicated by the arrows 137. This broadcasting operation (also referred to as an expanding operation) generally converts the one-dimensional feature maps 135 to multi-dimensional feature maps 140 with the same dimensionality as the feature maps 115. In some aspects, the broadcasting involves copying and stacking the feature maps 135 until they reach a height of H (in this example).
[0034] The residual connection 150 reflects the residual nature of broadcasted residual learning. In the workflow 100, the input tensor 105 is augmented with the feature maps 140 using operation 145 to generate the output 155. In some aspects, the feature maps 140 may also or alternatively be augmented with the feature maps 115. This operation 145 may generally include any number of combination techniques, including element-wise summation, averaging, multiplication, and the like. Advantageously, the residual connection 150 allows the system to retain two-dimensional features of the input, despite the dimension reduction operation 120.
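A sketch of the broadcasting and residual augmentation steps, again assuming PyTorch and element-wise summation as the combination technique:

    # Sketch: broadcast the temporal features back to the frequency dimension and
    # add them to the original input via the residual connection.
    import torch

    c, H, W = 16, 40, 101
    input_tensor = torch.randn(1, c, H, W)                # original input (tensor 105)
    temporal_features = torch.randn(1, c, 1, W)           # temporal features (feature maps 135)
    broadcast = temporal_features.expand(-1, -1, H, -1)   # copy/stack to height H (feature maps 140)
    output = input_tensor + broadcast                     # residual connection 150 -> output 155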
Example Residual Learning Techniques
[0035] FIG. 2 depicts example block diagrams 200A and 200B for residual learning techniques.
[0036] Block 200A reflects a conventional residual block used in some residual models. This block 200A may be expressed as y = x + f(x), where x and y are input and output features, respectively, and function f(·) computes the convolution output. The identity shortcut of x and the result of f(x) are of the same dimensionality and can be summed by simple element-wise addition.
[0037] Specifically, as illustrated by the residual block 200A, the input 205 is processed using some convolution operation 210. The resulting tensor can then be summed with the original input 205 (via the identity shortcut 215), as indicated by operation 220. This yields the output 225 of the ordinary residual block 200A.
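As a minimal sketch of the ordinary residual block 200A (the choice of convolution standing in for operation 210 is illustrative):

    # Sketch: ordinary residual block, y = x + f(x).
    import torch
    import torch.nn as nn

    c = 16
    f = nn.Conv2d(c, c, kernel_size=3, padding=1)   # stands in for convolution operation 210
    x = torch.randn(1, c, 40, 101)                  # input 205
    y = x + f(x)                                    # identity shortcut 215 and summation 220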
[0038] In aspects of the present disclosure, in order to utilize both one-dimensional and two-dimensional features together, the function f(x) (reflected by convolution operation 210) may be decomposed into f1 and f2, which correspond to the temporal and two-dimensional operations, respectively. This is reflected in the broadcasted residual block 200B.
[0039] The broadcasted residual block 200B may be expressed as: y = x + BC(f1(reduction(f2(x)))),
[0040] where x and y are input and output features, respectively, f1 and f2 are convolution operations, BC(·) is a broadcasting or expansion operation, and reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension). In this equation, batch and channel dimensions are ignored for conceptual clarity, and the input feature x is in ℝ^(H×W), where H and W are the frequency and time steps, respectively.
[0041] As illustrated by the residual block 200B, input 250 is processed using a convolution operation 255 to extract two-dimensional features. The resulting tensor can then be reduced using dimension reduction 260, and the reduced tensor(s) are processed using the convolution operation 265 to extract temporal features. These features are then expanded to the frequency dimension and augmented with the original input 250 via the identity shortcut 270, resulting in output 280.
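A compact sketch of block 200B under the same PyTorch assumptions, with f2 taken as a frequency depthwise convolution and f1 as a temporal depthwise separable convolution (kernel sizes are illustrative):

    # Sketch: broadcasted residual block 200B, y = x + BC(f1(reduction(f2(x)))).
    import torch
    import torch.nn as nn

    class BroadcastedResidual(nn.Module):
        def __init__(self, c: int):
            super().__init__()
            self.f2 = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0), groups=c)   # two-dimensional op
            self.f1 = nn.Sequential(                                                   # temporal op
                nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1), groups=c),
                nn.Conv2d(c, c, kernel_size=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            two_d = self.f2(x)                           # f2(x)
            reduced = two_d.mean(dim=2, keepdim=True)    # reduction(.): average pool over frequency
            temporal = self.f1(reduced)                  # f1(...)
            return x + temporal.expand_as(x)             # x + BC(...): broadcast to H and add

    y = BroadcastedResidual(c=16)(torch.randn(1, 16, 40, 101))   # same shape in and out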
Example Broadcasted Residual Learning Block
[0042] FIG. 3 is an example broadcasted residual learning block 300 for use in efficient processing of input data, such as audio input data.
[0043] As illustrated, an input tensor 305 is received and processed using a first operation 310 (labeled f2 in FIG. 3). The operation 310 corresponds to the two-dimensional feature extraction discussed above (e.g., convolution operation 110), and yields two-dimensional feature maps in ℝ^(H×W) (e.g., feature maps 115 in FIG. 1). As illustrated, the convolution operation 310 is performed using a frequency depthwise convolution 320 that comprises one or more n X 1 frequency-depthwise convolution kernels.
[0044] As illustrated, the operation 310 also includes a SubSpectral Normalization (SSN) operation 325. The SSN operation 325 generally operates by splitting the input features (generated by the frequency depthwise convolution 320) into sub-bands in the frequency dimension, and separately normalizing each sub-band (e.g., with batch normalization). This allows the system to achieve frequency-aware temporal features, as compared to ordinary batch normalization on the entire feature set.
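A sketch of the SSN operation under the same PyTorch assumptions; the number of sub-bands (five) is illustrative, and the frequency size is assumed to divide evenly by it:

    # Sketch: SubSpectral Normalization -- split the frequency axis into sub-bands
    # and batch-normalize each sub-band separately.
    import torch
    import torch.nn as nn

    class SubSpectralNorm(nn.Module):
        def __init__(self, channels: int, sub_bands: int):
            super().__init__()
            self.sub_bands = sub_bands
            # Folding the sub-bands into the channel axis lets a single BatchNorm2d
            # normalize each (channel, sub-band) pair independently.
            self.bn = nn.BatchNorm2d(channels * sub_bands)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape                       # h is assumed divisible by sub_bands
            x = x.reshape(b, c * self.sub_bands, h // self.sub_bands, w)
            x = self.bn(x)
            return x.reshape(b, c, h, w)

    ssn = SubSpectralNorm(channels=16, sub_bands=5)
    out = ssn(torch.randn(1, 16, 40, 101))             # 40 frequency steps -> 5 bands of 8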
[0045] The system can then perform dimension reduction using operation 330. In the illustrated example, broadcasted residual learning block 300 uses frequency average pooling to average the input features by frequency, resulting in features in ℝ^(1×W), as discussed above (e.g., feature maps 125 in FIG. 1).
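With the same assumed (batch, channels, frequency, time) layout, frequency average pooling reduces to a single mean over the frequency axis; the tensor sizes below are illustrative:

```python
import torch

x = torch.randn(1, 16, 40, 101)        # (batch, channels, frequency=40, time=101) -- illustrative sizes
pooled = x.mean(dim=2, keepdim=True)   # frequency average pooling: shape (1, 16, 1, 101)
```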
[0046] These features are then processed using a second operation 340 (labeled f1 in FIG. 3). The operation 340 may correspond to the temporal convolution operation discussed above (e.g., convolution operation 130). In one aspect, the operation 340 is a depthwise separable convolution (e.g., a composite of a temporal depthwise convolution 345 and a pointwise convolution 355).
[0047] The temporal depthwise convolution 345 may comprise one or more 1 × m temporal-depthwise convolution kernels to generate temporal features (e.g., feature maps 135 in FIG. 1).
[0048] As illustrated, the operation 340 then includes a batch normalization operation 350 followed by swish activation (also indicated by 350). Although swish activation is depicted in FIG. 3, in aspects, any suitable activation function can be used.
[0049] Following a pointwise convolution 355, the operation 340 can also include channel-wise dropout (indicated by 360) at a dropout rate p. This dropout can be used as regularization for the model in order to prevent overfitting and improve generalization. A broadcasting operation (which may correspond to the broadcasting operation 137 of FIG. 1), represented by operation 365 (which also includes the tensor augmentation discussed above with reference to operation 145 of FIG. 1), can then be used to expand the features from the operation 340 (in ℝ^(1×W)) to ℝ^(H×W).
[0050] In some aspects, to be frequency-convolution-aware over sequential blocks (e.g., sequential applications of the broadcasted residual learning block 300), the system uses not only the residual connection 315 (sometimes referred to as the “identity shortcut”) to augment the features with the original input 305 (at operation 365), but also uses an auxiliary residual connection 335 from the two-dimensional features output by the frequency depthwise convolution 320 (at operation 365). This auxiliary residual connection 335 enables the system to retain frequency-aware features of the input, despite the dimension reduction operation. The output of this broadcasting and augmentation operation 365 (also referred to as a broadcast sum operation in some aspects) can then be processed using one or more activation functions (e.g., ReLU function 370), and then provided as output 375 from the residual learning block 300.
[0051] In this way, the broadcasted residual learning block 300 can be expressed as y = x + f2(x) + BC(f1(reduction(f2(x)))), where x and y are input and output features, respectively, f1 and f2 are convolution operations, BC(·) is a broadcasting or expansion operation, and reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension).
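Putting the pieces of block 300 together, a hedged PyTorch sketch might look like the following. The kernel sizes, sub-band count, dropout rate, and channel width are illustrative assumptions, and the SubSpectralNorm class repeats the earlier sketch so the block is self-contained; this is an approximation of the described structure, not the exact disclosed implementation.

```python
import torch
import torch.nn as nn

class SubSpectralNorm(nn.Module):
    # As in the earlier sketch: per-sub-band batch normalization over the frequency axis.
    def __init__(self, channels: int, sub_bands: int):
        super().__init__()
        self.sub_bands = sub_bands
        self.bn = nn.BatchNorm2d(channels * sub_bands)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape  # frequency dimension h must be divisible by sub_bands
        x = self.bn(x.view(b, c * self.sub_bands, h // self.sub_bands, w))
        return x.view(b, c, h, w)

class BroadcastedResidualBlock(nn.Module):
    """Sketch of block 300: frequency-depthwise conv + SSN, frequency average pooling,
    temporal depthwise-separable conv with BN/swish/dropout, broadcast sum with two
    residual connections, and a final ReLU."""
    def __init__(self, channels: int, freq_kernel: int = 3, time_kernel: int = 3,
                 sub_bands: int = 5, dropout: float = 0.1):
        super().__init__()
        # f2: n x 1 frequency-depthwise convolution (operation 320) + SSN (operation 325)
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(freq_kernel, 1),
                                 padding=(freq_kernel // 2, 0), groups=channels)
        self.ssn = SubSpectralNorm(channels, sub_bands)
        # f1: 1 x m temporal-depthwise conv (345) + BN + swish (350) + pointwise conv (355) + dropout (360)
        self.time_dw = nn.Conv2d(channels, channels, kernel_size=(1, time_kernel),
                                 padding=(0, time_kernel // 2), groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        self.swish = nn.SiLU()
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)
        self.drop = nn.Dropout2d(dropout)   # channel-wise dropout
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        two_d = self.ssn(self.freq_dw(x))             # frequency-aware two-dimensional features
        pooled = two_d.mean(dim=2, keepdim=True)      # frequency average pooling (operation 330)
        t = self.drop(self.pw(self.swish(self.bn(self.time_dw(pooled)))))
        # Broadcast sum (operation 365): identity shortcut 315 + auxiliary residual 335 + broadcast temporal features
        return self.relu(x + two_d + t)

# Illustrative usage: frequency dimension (40) must be divisible by sub_bands (5)
y = BroadcastedResidualBlock(16)(torch.randn(2, 16, 40, 101))   # y.shape == (2, 16, 40, 101)
```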
[0052] Using the broadcasted residual learning block 300, machine learning models can provide, for example, more efficient KWS as compared to conventional techniques while retaining two-dimensional features. By performing the temporal depthwise and the pointwise convolutions on one-dimensional temporal features, the computational load is reduced by a factor of the frequency steps H (often forty or more), as compared to traditional two-dimensional depthwise separable convolutions.
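A quick back-of-the-envelope check of that factor, counting multiply-accumulates for the temporal-depthwise and pointwise stage applied over a full H × W map versus the frequency-pooled 1 × W map (all sizes below are illustrative assumptions):

```python
# Rough multiply-accumulate (MAC) counts, ignoring bias terms.
C, H, W, k = 16, 40, 101, 3   # channels, frequency steps, time steps, temporal kernel taps (illustrative)

# 1 x k temporal-depthwise + 1 x 1 pointwise over every (frequency, time) position
macs_full = C * H * W * k + C * C * H * W
# The same operations over the single frequency-pooled row
macs_pooled = C * 1 * W * k + C * C * 1 * W

print(macs_full // macs_pooled)   # 40, i.e., a reduction by the factor H described above
```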
Example Transitional Broadcasted Residual Learning Block
[0053] FIG. 4 is an example transition broadcasted residual learning block 400 for use in efficient processing of input data, such as audio input data.
[0054] The transition broadcasted residual learning block 400 is similar to the normal broadcasted residual learning block 300, with two differences that enable the transition broadcasted residual learning block 400 to be used in transitional layers where the number of channels in the input 405 differs from the number of channels in the output 475.
[0055] Specifically, the operation 410 replaces the operation 310 in FIG. 3. The operation 410 includes an additional pointwise convolution 412, which is used to change the number of channels in the input 405 to the desired number of channels for the output 475. As illustrated, this pointwise convolution 412 may be followed with batch normalization and an activation function (such as ReLU), indicated by 413.
[0056] The second difference between the transition broadcasted residual learning block 400 and the normal broadcasted residual learning block 300 is that the transition broadcasted residual learning block 400 does not include the identity shortcut (residual connection 315 in FIG. 3). That is, the transition broadcasted residual learning block 400 does not augment the output using the input 405 (because the dimensionality differs).
[0057] In other respects, the transition broadcasted residual learning block 400 largely mirrors the normal broadcasted residual learning block 300 described above with reference to FIG. 3.
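A corresponding sketch of the transition block 400, under the same illustrative assumptions as the block 300 sketch. Plain batch normalization stands in for SSN to keep the example short, the channel change is performed by the leading pointwise convolution, and only the auxiliary residual from the two-dimensional features is kept (no identity shortcut), per the description above.

```python
import torch
import torch.nn as nn

class TransitionBroadcastedResidualBlock(nn.Module):
    """Sketch of block 400: pointwise conv (412) + BN + ReLU (413) changes the channel count;
    the identity shortcut is dropped because input and output channel counts differ."""
    def __init__(self, in_channels: int, out_channels: int,
                 freq_kernel: int = 3, time_kernel: int = 3, dropout: float = 0.1):
        super().__init__()
        # Channel change (412 / 413), part of operation 410
        self.channel_change = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )
        # Remainder of operation 410: n x 1 frequency-depthwise convolution
        self.freq_dw = nn.Conv2d(out_channels, out_channels, kernel_size=(freq_kernel, 1),
                                 padding=(freq_kernel // 2, 0), groups=out_channels)
        self.norm = nn.BatchNorm2d(out_channels)   # stand-in for SSN (simplifying assumption)
        self.time_dw = nn.Conv2d(out_channels, out_channels, kernel_size=(1, time_kernel),
                                 padding=(0, time_kernel // 2), groups=out_channels)
        self.bn = nn.BatchNorm2d(out_channels)
        self.swish = nn.SiLU()
        self.pw = nn.Conv2d(out_channels, out_channels, kernel_size=1)
        self.drop = nn.Dropout2d(dropout)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_change(x)                  # input 405 brought to the output channel count
        two_d = self.norm(self.freq_dw(x))
        pooled = two_d.mean(dim=2, keepdim=True)    # frequency average pooling
        t = self.drop(self.pw(self.swish(self.bn(self.time_dw(pooled)))))
        return self.relu(two_d + t)                 # no identity shortcut: only the auxiliary residual remains
```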
Example Method for Broadcasted Residual Learning
[0058] FIG. 5 is an example flow diagram illustrating a method 500 for processing data using broadcasted residual learning.
[0059] The method 500 begins at block 505, where a processing system receives an input tensor comprising a frequency dimension and a temporal dimension.
[0060] At block 510, the processing system processes the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension. In some cases, the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
[0061] In some aspects, the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
[0062] In some aspects, the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor. [0063] At block 515, the processing system converts the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation.
[0064] In some aspects, the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
[0065] In some aspects, the method 500 further comprises performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
[0066] In some aspects, the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
[0067] At block 520, the processing system processes the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map.
[0068] In some aspects, the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
[0069] In some aspects, a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
[0070] At block 525, the processing system expands the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map.
[0071] At block 530, the processing system augments the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection. [0072] In some aspects, the method 500 further includes outputting the augmented multidimensional output (e.g., as output from a residual block to another residual block or other block or layer of a model, as output from the model, and the like).
[0073] In some aspects, the method 500 further comprises augmenting the multidimensional output feature map with the input tensor via a second residual connection.
[0074] In some aspects, the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
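Viewed functionally, blocks 505 through 530 of method 500 can be traced step by step in a short sketch; the layer objects, tensor sizes, and the explicit expand used for the broadcasting operation are illustrative assumptions:

```python
import torch
import torch.nn as nn

def method_500_step(x: torch.Tensor,
                    freq_dw: nn.Conv2d, temporal_dw: nn.Conv2d, pointwise: nn.Conv2d) -> torch.Tensor:
    """One pass through blocks 505-530 for x of shape (batch, channels, frequency, time)."""
    intermediate = freq_dw(x)                          # block 510: first (frequency-wise) convolution
    reduced = intermediate.mean(dim=2, keepdim=True)   # block 515: frequency dimension reduction
    temporal = pointwise(temporal_dw(reduced))         # block 520: second (temporal, depthwise-separable) convolution
    h = intermediate.shape[2]
    output = temporal.expand(-1, -1, h, -1)            # block 525: broadcast back to the frequency dimension
    return output + intermediate                       # block 530: augment via the first residual connection

# Illustrative usage with assumed sizes (16 channels, 40 frequency steps, 101 time steps)
x = torch.randn(2, 16, 40, 101)                        # block 505: the received input tensor
freq_dw = nn.Conv2d(16, 16, (3, 1), padding=(1, 0), groups=16)
temporal_dw = nn.Conv2d(16, 16, (1, 3), padding=(0, 1), groups=16)
pointwise = nn.Conv2d(16, 16, 1)
y = method_500_step(x, freq_dw, temporal_dw, pointwise)
print(y.shape)  # torch.Size([2, 16, 40, 101])
```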
Example Processing System for Broadcasted Residual Learning
[0075] In some aspects, the techniques, methods, and workflows described with respect to FIGS. 1-5 may be performed on one or more devices.
[0076] FIG. 6 depicts an example processing system 600 which may be configured to perform aspects of the various methods described herein, including, for example, the methods described with respect to FIGS. 1-5.
[0077] Processing system 600 includes a central processing unit (CPU) 602, which in some examples may be a multi-core CPU. Instructions executed at the CPU 602 may be loaded, for example, from a program memory associated with the CPU 602 or may be loaded from a memory partition 624.
[0078] Processing system 600 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 604, a digital signal processor (DSP) 606, a neural processing unit (NPU) 608, a multimedia processing unit 610, and a wireless connectivity component 612.
[0079] An NPU, such as 608, is generally a specialized circuit configured for implementing all the necessary control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit. [0080] NPUs, such as 608, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples they may be part of a dedicated neural-network accelerator.
[0081] NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
[0082] NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
[0083] NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process it through an already trained model to generate a model output (e.g., an inference).
[0084] In one implementation, NPU 608 is a part of one or more of CPU 602, GPU 604, and/or DSP 606.
[0085] In some examples, wireless connectivity component 612 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. Wireless connectivity processing component 612 is further connected to one or more antennas 614.
[0086] Processing system 600 may also include one or more sensor processing units 616 associated with any manner of sensor, one or more image signal processors (ISPs) 618 associated with any manner of image sensor, and/or a navigation processor 620, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components. [0087] Processing system 600 may also include one or more input and/or output devices 622, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.
[0088] In some examples, one or more of the processors of processing system 600 may be based on an ARM or RISC-V instruction set.
[0089] Processing system 600 also includes memory 624, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 624 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 600.
[0090] In particular, in this example, memory 624 includes machine learning component 624A, which may be configured according to one or more aspects described herein. For example, the machine learning component 624A may provide data or audio analysis using one or more machine learning models (e.g., neural networks) configured with one or more broadcasted residual learning blocks.
[0091] The memory 624 further includes a set of frequency depthwise kernel(s) 624B and a set of temporal depthwise kernel(s) 624C. As discussed above, the frequency depthwise kernels 624B generally include one-dimensional kernels with a length greater than one in the frequency dimension, while temporal depthwise kernels 624C include one-dimensional kernels with a length greater than one in the temporal dimension.
[0092] The frequency depthwise kernels 624B can generally be used to perform frequency depthwise convolution (e.g., convolution operation 110 in FIG. 1), while the temporal depthwise kernels 624C are generally used to perform temporal depthwise convolution (e.g., convolution operation 130 in FIG. 1).
[0093] Processing system 600 further comprises machine learning circuit 626, such as described above, for example, with respect to FIGS. 1-5.
[0094] Though depicted as a separate circuit for clarity in FIG. 6, the machine learning circuit 626 may be implemented in other processing devices of processing system 600, such as within CPU 602, GPU 604, DSP 606, NPU 608, and the like.
[0095] Generally, processing system 600 and/or components thereof may be configured to perform the methods described herein. [0096] Notably, in other aspects, aspects of processing system 600 may be omitted, such as where processing system 600 is a server computer or the like. For example, multimedia component 610, wireless connectivity 612, sensors 616, ISPs 618, and/or navigation component 620 may be omitted in other aspects. Further, aspects of processing system 600 may be distributed between multiple devices.
[0097] The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
Example Clauses
[0098] Clause 1: A method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
[0099] Clause 2: The method of Clause 1, wherein the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
[0100] Clause 3: The method of any of Clauses 1-2, further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
[0101] Clause 4: The method of any one of Clauses 1-3, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
[0102] Clause 5: The method of any one of Clauses 1-4, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor. [0103] Clause 6: The method of any one of Clauses 1-5, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
[0104] Clause 7: The method of any one of Clauses 1-6, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
[0105] Clause 8: The method of any one of Clauses 1-7, wherein the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
[0106] Clause 9: The method of any one of Clauses 1-8, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
[0107] Clause 10: The method of any one of Clauses 1-9, wherein a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
[0108] Clause 11: The method of any one of Clauses 1-10, wherein: the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
[0109] Clause 12: A system, comprising means for performing a method in accordance with any one of Clauses 1-11.
[0110] Clause 13: A system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any one of Clauses 1-11.
[0111] Clause 14: A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any one of Clauses 1-11.
[0112] Clause 15: A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any one of Clauses 1-11.
Additional Considerations
[0113] The preceding description is provided to enable any person skilled in the art to practice the various aspects described herein. The examples discussed herein are not limiting of the scope, applicability, or aspects set forth in the claims. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
[0114] As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
[0115] As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c). [0116] As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” may include resolving, selecting, choosing, establishing and the like.
[0117] As used herein, the term “connected to”, in the context of sharing electronic signals and data between the elements described herein, may generally mean in data communication between the respective elements that are connected to each other. In some cases, elements may be directly connected to each other, such as via one or more conductive traces, lines, or other conductive carriers capable of carrying signals and/or data between the respective elements that are directly connected to each other. In other cases, elements may be indirectly connected to each other, such as via one or more data busses or similar shared circuitry and/or integrated circuit elements for communicating signals and data between the respective elements that are indirectly connected to each other.
[0118] The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.
[0119] The following claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims

WHAT IS CLAIMED IS:
1. A computer-implemented method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection; and outputting the augmented multidimensional output feature map.
2. The computer-implemented method of Claim 1, wherein the multidimensional intermediate feature map is a two-dimensional intermediate feature map, and wherein converting the multidimensional intermediate feature map to the one-dimensional intermediate feature map reduces a number of computations performed by a processor when generating the temporal feature map.
3. The computer-implemented method of Claim 1, further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
4. The computer-implemented method of Claim 1, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
5. The computer-implemented method of Claim 4, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
6. The computer-implemented method of Claim 1, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
7. The computer-implemented method of Claim 6, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
8. The computer-implemented method of Claim 1, wherein the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
9. The computer-implemented method of Claim 1, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
10. The computer-implemented method of Claim 9, wherein a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
11. The computer-implemented method of Claim 1, wherein: the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
12. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform an operation, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection; and outputting the augmented multidimensional output feature map.
13. The non-transitory computer-readable medium of Claim 12, the operation further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
14. The non-transitory computer-readable medium of Claim 12, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
15. The non-transitory computer-readable medium of Claim 14, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
16. The non-transitory computer-readable medium of Claim 12, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
17. The non-transitory computer-readable medium of Claim 16, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
18. The non-transitory computer-readable medium of Claim 12, wherein the frequency dimension reduction operation comprises at least one of (i) a maximum pooling operation, (ii) an average pooling operation, or (iii) a convolution operation.
19. The non-transitory computer-readable medium of Claim 12, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
20. The non-transitory computer-readable medium of Claim 19, wherein a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
21. The non-transitory computer-readable medium of Claim 12, wherein: the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
22. A processing system, comprising: a memory comprising computer-executable instructions; one or more processors configured to execute the computer-executable instructions and cause the processing system to perform an operation comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection; and outputting the augmented multidimensional output feature map.
23. The processing system of Claim 22, the operation further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
24. The processing system of Claim 22, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
25. The processing system of Claim 24, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
26. The processing system of Claim 22, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
27. The processing system of Claim 26, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
28. The processing system of Claim 22, wherein the frequency dimension reduction operation comprises at least one of (i) a maximum pooling operation, (ii) an average pooling operation, or (iii) a convolution operation.
29. The processing system of Claim 22, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
30. A processing system, comprising: means for receiving an input tensor comprising a frequency dimension and a temporal dimension; means for processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; means for converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; means for processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; means for expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and means for augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.