WO2022204729A1 - Broadcasted residual learning - Google Patents
- Publication number: WO2022204729A1 (PCT/US2022/071364)
- Authority: WIPO (PCT)
- Prior art keywords: feature map, multidimensional, convolution, temporal, dimension
Classifications
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06F17/153 — Multidimensional correlation or convolution
Definitions
- aspects of the present disclosure relate to machine learning, and more specifically, to efficient data processing, such as keyword spotting (KWS) on edge devices (e.g., in devices with limited resources such as mobile phones, smart speakers, and Internet of Things (IoT) devices).
- Certain aspects provide a method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
- processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
- FIG. 1 depicts an example workflow for broadcasted residual learning.
- FIG. 2 depicts example block diagrams for residual learning techniques.
- FIG. 3 is an example broadcasted residual learning block for use in efficient processing of input data.
- FIG. 4 is an example broadcasted residual learning block for use in efficient processing of input data in a transitional layer.
- FIG. 5 is an example flow diagram illustrating a method for processing data using broadcasted residual learning.
- FIG. 6 depicts an example processing system configured to perform various aspects of the present disclosure.
- aspects of the present disclosure provide techniques for broadcasted residual learning.
- the techniques described herein provide high model accuracy and significantly improved computational efficiency (e.g., a small model size and light computational load), as compared to existing approaches.
- convolutional neural networks (CNNs) used for KWS are typically made up of repeated blocks of the same structure and are often based on residual learning and depthwise separable convolutions. This has resulted in a number of CNN-based KWS approaches.
- Existing approaches either use one-dimensional temporal convolutions or two-dimensional (e.g., frequency and temporal) convolutions. Each approach has respective benefits and drawbacks.
- the broadcasted residual learning techniques described herein can be used to efficiently process data, both during training (while training data is passed through the model) and during runtime (when new data is passed through to generate inferences).
- broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS).
- the audio data and features can be represented using two-dimensional tensors (e.g., with a frequency dimension and a temporal dimension).
- the broadcasted residual learning generally involves performing convolution on input tensors to extract two-dimensional features, reducing the dimensionality of the two-dimensional features to allow for efficient convolutions on the features (e.g., requiring reduced computations, processing steps, and energy), expanding the resulting tensors to the original dimensionality of the two-dimensional features, and augmenting the expanded tensors with the original two-dimensional features.
- the expanded tensors are further augmented with the original input tensor.
- the broadcasted residual learning described herein can be performed in a neural network architecture to perform a variety of tasks, such as classifying input audio.
- the techniques described herein can be implemented as broadcasted residual learning blocks, and a number of these blocks can be used in sequence within a neural network architecture.
- the broadcasted residual learning retains many residual functions of one-dimensional temporal convolution, while still allowing two-dimensional convolution to be used together via a broadcasted-residual connection that expands temporal output to the frequency dimension.
- This residual mapping enables the network to effectively represent useful audio features with far less computation than conventional convolutional neural networks, which reduces computational complexity, latency, compute requirements, memory requirements, and the like.
- the broadcasted residual learning techniques described herein can achieve state-of-the-art accuracy on speech command datasets using fewer computations and parameters, as compared to conventional systems.
- FIG. 1 depicts an example workflow 100 for broadcasted residual learning.
- the workflow 100 begins with an input tensor 105.
- the tensor 105 may be audio data (e.g., represented by a log Mel spectrogram indicating a spectrum of frequencies over time), or audio features (e.g., features generated by processing audio data).
- the input tensor 105 is a two-dimensional tensor with a frequency dimension and a temporal dimension.
- the temporal dimension may be delineated into time intervals or steps, while the frequency dimension is delineated based on frequency values or bands, indicating the frequencies present at each interval (e.g., the magnitude of sound at each frequency).
- the input tensor 105 is processed using a first convolution operation 110, resulting in a set of two-dimensional feature maps 115.
- the feature maps 115 have dimensionality H × W × c, where H and W are spatial dimensions (e.g., the frequency dimension and the temporal dimension, respectively) and c is the number of channels.
- the convolution operation 110 is a depthwise convolution performed using one or more kernels configured to extract features of the frequency dimension.
- the convolution operation 110 may use n × 1 kernels, where n corresponds to the frequency dimension. That is, the depthwise kernels for the convolution operation 110 may have a length greater than one in the frequency dimension, with a length of one in the temporal dimension. This allows the convolution operation 110 to serve as a frequency depthwise convolution that extracts frequency features (e.g., feature maps 115) for the tensor 105.
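The frequency depthwise convolution described above can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's implementation: the layout (channels, frequency, time), the function name, and zero-padding to preserve the frequency length are all assumptions.

```python
import numpy as np

def freq_depthwise_conv(x, kernels):
    # Assumed shapes: x is (C, H, W) -- channels, frequency, time;
    # kernels is (C, n): one n x 1 kernel per channel, with taps along frequency.
    C, H, W = x.shape
    n = kernels.shape[1]
    pad = n // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))  # zero-pad the frequency axis
    out = np.empty((C, H, W))
    for c in range(C):
        for h in range(H):
            # each output frequency bin mixes n neighbouring input bins,
            # independently per channel (depthwise) and per time step
            out[c, h] = kernels[c] @ xp[c, h:h + n, :]
    return out
```

With a kernel of [0, 1, 0] per channel, the operation reduces to the identity, which makes the depthwise (per-channel, frequency-only) structure easy to verify.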
- these feature maps 115 are two-dimensional (with a length greater than one in both the frequency dimension and the temporal dimension).
- a dimension reduction operation 120 is performed to reduce the dimensionality of the feature maps 115.
- the dimension reduction operation 120 may reduce the feature maps 115 to eliminate the frequency dimension and preserve the temporal dimension. This results in one-dimensional feature maps 125.
- the feature maps 125 may have the same temporal dimensionality and number of channels as the feature maps 115, but with a length of one in the frequency dimension.
- the dimension reduction operation 120 is generally performed on a per frequency (or a per frequency band) basis, and can include a variety of techniques, including maximum pooling (such that the maximum value, or the feature with the most activated presence, is retained), average pooling (such that the average value is retained), minimum pooling (such that the minimum value is retained), and the like.
- the dimension reduction operation 120 can also be performed by convolving the feature maps 115 using an H × 1 kernel without padding in order to reduce the dimension, where H corresponds to the size of the frequency dimension.
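The pooling variants of the dimension reduction operation 120 can be sketched compactly. The (C, H, W) layout and the function name are illustrative assumptions; each mode collapses the frequency axis to length one while preserving the temporal axis, as the text describes.

```python
import numpy as np

def freq_reduce(x, mode="avg"):
    # x: (C, H, W). Collapse the frequency axis (H) to length one,
    # keeping the temporal axis intact -> (C, 1, W).
    if mode == "avg":
        return x.mean(axis=1, keepdims=True)   # average pooling
    if mode == "max":
        return x.max(axis=1, keepdims=True)    # keep most activated feature
    if mode == "min":
        return x.min(axis=1, keepdims=True)    # minimum pooling
    raise ValueError(f"unknown mode: {mode}")
```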
- the one-dimensional feature maps 125 (which correspond to the temporal dimension) can be convolved with significantly fewer computational resources, as compared to traditional two-dimensional convolution. This significantly improves the efficiency of the broadcasted residual learning.
- the feature maps 125 are processed using a second convolution operation 130.
- the convolution operation 130 is a depthwise-separable convolution (e.g., a depthwise convolution followed by a pointwise convolution).
- the convolution operation 130 may be performed using one or more kernels configured to extract features for the temporal dimension.
- the convolution operation 130 may use 1 × m kernels, where m corresponds to the temporal dimension.
- the depthwise kernels for the convolution operation 130 may have a length greater than one in the temporal dimension, with a length of one in the frequency dimension. This allows the convolution operation 130 to serve as a temporal depthwise convolution that extracts temporal features for the feature maps 125.
- the convolution operation 130 may be a depthwise separable convolution. In such an aspect, following the temporal depthwise convolution, the convolution operation 130 can apply one or more pointwise kernels. This results in feature maps 135.
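The temporal depthwise-separable convolution (temporal depthwise followed by pointwise) can be sketched as below. Shapes and names are illustrative assumptions: the input is the reduced (C, 1, W) map, and the pointwise weights mix channels at each time step.

```python
import numpy as np

def temporal_dw_separable(x, dw_kernels, pw_weights):
    # Assumed shapes: x is (C, 1, W); dw_kernels is (C, m), one 1 x m temporal
    # kernel per channel; pw_weights is (C_out, C) for the 1x1 pointwise conv.
    C, _, W = x.shape
    m = dw_kernels.shape[1]
    pad = m // 2
    xp = np.pad(x[:, 0, :], ((0, 0), (pad, pad)))  # zero-pad the time axis
    dw = np.empty((C, W))
    for c in range(C):
        for t in range(W):
            dw[c, t] = dw_kernels[c] @ xp[c, t:t + m]  # temporal depthwise
    # pointwise (1x1) convolution mixes channels at each time step
    return (pw_weights @ dw)[:, None, :]  # shape (C_out, 1, W)
```

With a [0, 1, 0] depthwise kernel and identity pointwise weights, the operation is the identity, which verifies that the two stages compose as described.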
- the feature maps 135 are then broadcasted to the frequency dimension, as indicated by the arrows 137.
- This broadcasting operation (also referred to as an expanding operation) generally converts the one-dimensional feature maps 135 to multi-dimensional feature maps 140 with the same dimensionality as the feature maps 115.
- the broadcasting involves copying and stacking the feature maps 135 until they reach a height of H (in this example).
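The copy-and-stack broadcasting step maps directly onto NumPy's broadcasting rules; a minimal sketch (layout and name assumed):

```python
import numpy as np

def broadcast_to_freq(y, H):
    # y: (C, 1, W) temporal feature map. Replicate it H times along the
    # frequency axis, recovering the (C, H, W) shape of the 2-D features.
    C, _, W = y.shape
    return np.broadcast_to(y, (C, H, W)).copy()
```

Every frequency row of the result is an identical copy of the one-dimensional temporal map, which is what allows the subsequent element-wise augmentation with the two-dimensional features.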
- the residual connection 150 reflects the residual nature of broadcasted residual learning.
- the input tensor 105 is augmented with the feature maps 140 using operation 145 to generate the output 155.
- the feature maps 140 may also or alternatively be augmented with the feature maps 115.
- This operation 145 may generally include any number of combination techniques, including element-wise summation, averaging, multiplication, and the like.
- the residual connection 150 allows the system to retain two-dimensional features of the input, despite the dimension reduction operation 120.
- FIG. 2 depicts example block diagrams 200A and 200B for residual learning techniques.
- the input 205 is processed using some convolution operation 210.
- the resulting tensor can then be summed with the original input 205 (via the identity shortcut 215), as indicated by operation 220. This yields the output 225 of the ordinary residual block 200A.
- the function f(x) (reflected by convolution operation 210) may be decomposed into f1 and f2, which correspond to the temporal and two-dimensional operations, respectively. This is reflected in the broadcasted residual block 200B, which can be expressed as y = x + BC(f1(reduction(f2(x)))).
- x and y are input and output features, respectively, and f1 and f2 are convolution operations
- BC(·) is a broadcasting or expansion operation
- reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension).
- H and W are the frequency and time steps, respectively.
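The composition for block 200B, y = x + BC(f1(reduction(f2(x)))), can be written as a few lines of NumPy, taking f1 and f2 as callables. This is a structural sketch of the data flow, with average pooling assumed as the reduction:

```python
import numpy as np

def broadcasted_residual(x, f2, f1):
    # y = x + BC(f1(reduction(f2(x)))): f2 extracts two-dimensional features,
    # f1 operates on the reduced, temporal-only map.
    z = f2(x)                               # (C, H, W) 2-D features
    z = z.mean(axis=1, keepdims=True)       # reduction: frequency avg pooling
    z = f1(z)                               # (C, 1, W) temporal features
    return x + np.broadcast_to(z, x.shape)  # BC(.) plus identity shortcut
```

With f2 as the identity and f1 as a doubling map on an all-ones input, the reduction leaves ones, f1 yields twos, and the identity shortcut brings the output to threes, tracing each term of the equation.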
- input 250 is processed using a convolution operation 255 to extract two-dimensional features.
- the resulting tensor can then be reduced using dimension reduction 260, and the reduced tensor(s) are processed using the convolution operation 265 to extract temporal features. These features are then expanded to the frequency dimension and augmented with the original input 250 via the identity shortcut 270, resulting in output 280.
- FIG. 3 is an example broadcasted residual learning block 300 for use in efficient processing of input data, such as audio input data.
- an input tensor 305 is received and processed using a first operation 310 (labeled f 2 in FIG. 3).
- the operation 310 corresponds to the two-dimensional feature extraction discussed above (e.g., convolution operation 110), and yields two-dimensional feature maps in R^(H×W) (e.g., feature maps 115 in FIG. 1).
- the convolution operation 310 is performed using a frequency depthwise convolution 320 that comprises one or more n × 1 frequency-depthwise convolution kernels.
- the operation 310 also includes a SubSpectral Normalization (SSN) operation 325.
- the SSN operation 325 generally operates by splitting the input features (generated by the frequency depthwise convolution 320) into sub-bands in the frequency dimension, and separately normalizing each sub-band (e.g., with batch normalization). This allows the system to achieve frequency-aware temporal features, as compared to ordinary batch normalization on the entire feature set.
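The split-and-normalize structure of SSN can be sketched as below. This is a simplified sketch under stated assumptions: a (N, C, H, W) batch layout, an evenly divisible frequency axis, and no learnable scale/shift parameters (which a full batch-normalization implementation would include).

```python
import numpy as np

def subspectral_norm(x, num_bands, eps=1e-5):
    # x: (N, C, H, W). Split the frequency axis (H) into num_bands sub-bands
    # and batch-normalize each sub-band independently, per channel per band.
    N, C, H, W = x.shape
    assert H % num_bands == 0, "frequency bins must divide evenly into bands"
    xb = x.reshape(N, C, num_bands, H // num_bands, W)
    mean = xb.mean(axis=(0, 3, 4), keepdims=True)   # stats within each band
    var = xb.var(axis=(0, 3, 4), keepdims=True)
    return ((xb - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)
```

Because the statistics are computed per sub-band rather than over the whole frequency axis, each band is whitened on its own, which is what makes the subsequent temporal features frequency-aware.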
- broadcasted residual learning block 300 uses frequency average pooling to average the input features by frequency, resulting in features in R^(1×W), as discussed above (e.g., feature maps 125 in FIG. 1).
- the operation 340 may correspond to the temporal convolution operation discussed above (e.g., convolution operation 130).
- the operation 340 is a depthwise separable convolution (e.g., a composite of a temporal depthwise convolution 345 and a pointwise convolution 355).
- the temporal depthwise convolution 345 may comprise one or more 1 × m temporal-depthwise convolution kernels to generate temporal features (e.g., feature maps 135 in FIG. 1).
- the operation 340 then includes a batch normalization operation 350 followed by swish activation (also indicated by 350). Although swish activation is depicted in FIG. 3, in aspects, any suitable activation function can be used.
- the operation 340 can also include channel-wise dropout (indicated by 360) at a dropout rate p. This dropout can be used as regularization for the model in order to prevent overfitting and improve generalization.
- a broadcasting operation (which may correspond to the broadcasting operation 137 of FIG. 1), represented by operation 365 (which also includes the tensor augmentation discussed above with reference to operation 145 of FIG. 1), can then be used to expand the features from the operation 340 (in R^(1×W)) to R^(H×W).
- the system uses not only the residual connection 315 (sometimes referred to as the “identity shortcut”) to augment the features with the original input 305 (at operation 365), but also uses an auxiliary residual connection 335 from the two-dimensional features output by the frequency depthwise convolution 320 (at operation 365).
- This auxiliary residual connection 335 enables the system to retain frequency-aware features of the input, despite the dimension reduction operation.
- the output of this broadcasting and augmentation operation 365 can then be processed using one or more activation functions (e.g., ReLU function 370), and then provided as output 375 from the residual learning block 300.
- the broadcasted residual learning block 300 can be expressed as y = x + f2(x) + BC(f1(reduction(f2(x)))), where x and y are input and output features, respectively, f1 and f2 are convolution operations, BC(·) is a broadcasting or expansion operation, and reduction(·) is a dimension reduction operation (e.g., average pooling by frequency dimension).
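Block 300 differs from block 200B in its auxiliary residual connection from the two-dimensional features. Its data flow can be sketched (f1 and f2 as callables, average pooling assumed as the reduction):

```python
import numpy as np

def broadcasted_residual_block(x, f2, f1):
    # y = x + f2(x) + BC(f1(avgpool(f2(x)))): the identity shortcut plus the
    # auxiliary residual from the two-dimensional features (connection 335).
    z2 = f2(x)                               # e.g., frequency DW conv + SSN
    z1 = f1(z2.mean(axis=1, keepdims=True))  # e.g., temporal DW-separable conv
    return x + z2 + np.broadcast_to(z1, x.shape)
```

The extra z2 term is what lets the block retain frequency-aware two-dimensional features despite the dimension reduction in the temporal branch.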
- machine learning models can provide, for example, more efficient KWS as compared to conventional techniques while retaining two-dimensional features.
- the computational load is reduced by a factor of the frequency steps H (often forty or more), as compared to traditional two-dimensional depthwise separable convolutions.
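The claimed reduction factor can be checked with a rough multiply-accumulate (MAC) count. The sizes below are illustrative assumptions (not from the patent): C channels, H frequency bins, W time steps, kernel length k.

```python
# Rough MAC counts for one depthwise-separable convolution stage.
C, H, W, k = 64, 40, 100, 3
macs_2d = C * H * W * k * k + C * C * H * W   # 2-D depthwise + 1x1 pointwise
macs_1d = C * W * k + C * C * W               # temporal-only equivalent
ratio = macs_2d / macs_1d                     # on the order of H
```

With these numbers the ratio works out to roughly H, consistent with the statement above that the temporal branch saves a factor of about the number of frequency steps.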
- FIG. 4 is an example transition broadcasted residual learning block 400 for use in efficient processing of input data, such as audio input data.
- the transition broadcasted residual learning block 400 is similar to the normal broadcasted residual learning block 300, with two differences that enable the transition broadcasted residual learning block 400 to be used in transitional layers where the number of channels in the input 405 differs from the number of channels in the output 475.
- the operation 410 replaces the operation 310 in FIG. 3.
- the operation 410 includes an additional pointwise convolution 412, which is used to change the number of channels in the input 405 to the desired number of channels for the output 475.
- this pointwise convolution 412 may be followed with batch normalization and an activation function (such as ReLU), indicated by 413.
- transition broadcasted residual learning block 400 does not include the identity shortcut (residual connection 315 in FIG. 3). That is, the transition broadcasted residual learning block 400 does not augment the output using the input 405 (because the dimensionality differs).
- transition broadcasted residual learning block 400 largely mirrors the normal broadcasted residual learning block 300 described above with reference to FIG. 3.
- FIG. 5 is an example flow diagram illustrating a method 500 for processing data using broadcasted residual learning.
- the method 500 begins at block 505, where a processing system receives an input tensor comprising a frequency dimension and a temporal dimension.
- the processing system processes the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension.
- the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
- the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
- the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
- the processing system converts the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation.
- the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
- the method 500 further comprises performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
- the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
- the processing system processes the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map.
- the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
- a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
- the processing system expands the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map.
- the processing system augments the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
- the method 500 further includes outputting the augmented multidimensional output (e.g., as output from a residual block to another residual block or other block or layer of a model, as output from the model, and the like).
- the method 500 further comprises augmenting the multidimensional output feature map with the input tensor via a second residual connection.
- the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
- the techniques, methods, and workflows described with respect to FIGS. 1-5 may be performed on one or more devices.
- FIG. 6 depicts an example processing system 600 which may be configured to perform aspects of the various methods described herein, including, for example, the methods described with respect to FIGS. 1-5.
- Processing system 600 includes a central processing unit (CPU) 602, which in some examples may be a multi-core CPU. Instructions executed at the CPU 602 may be loaded, for example, from a program memory associated with the CPU 602 or may be loaded from a memory partition 624.
- Processing system 600 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 604, a digital signal processor (DSP) 606, a neural processing unit (NPU) 608, a multimedia processing unit 610, and a wireless connectivity component 612.
- An NPU such as 608, is generally a specialized circuit configured for implementing all the necessary control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like.
- An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.
- NPUs, such as 608, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
- a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples they may be part of a dedicated neural-network accelerator.
- NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
- the two tasks may still generally be performed independently.
- NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
- NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process it through an already trained model to generate a model output (e.g., an inference).
- NPU 608 is a part of one or more of CPU 602, GPU 604, and/or DSP 606.
- wireless connectivity component 612 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
- Wireless connectivity processing component 612 is further connected to one or more antennas 614.
- Processing system 600 may also include one or more sensor processing units 616 associated with any manner of sensor, one or more image signal processors (ISPs) 618 associated with any manner of image sensor, and/or a navigation processor 620, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
- Processing system 600 may also include one or more input and/or output devices 622, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.
- one or more of the processors of processing system 600 may be based on an ARM or RISC-V instruction set.
- Processing system 600 also includes memory 624, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
- memory 624 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 600.
- memory 624 includes machine learning component 624A, which may be configured according to one or more aspects described herein.
- the machine learning component 624A may provide data or audio analysis using one or more machine learning models (e.g., neural networks) configured with one or more broadcasted residual learning blocks.
- the memory 624 further includes a set of frequency depthwise kernel(s) 624B and a set of temporal depthwise kernel(s) 624C.
- the frequency depthwise kernels 624B generally include one-dimensional kernels with a length greater than one in the frequency dimension
- temporal depthwise kernels 624C include one-dimensional kernels with a length greater than one in the temporal dimension.
- the frequency depthwise kernels 624B can generally be used to perform frequency depthwise convolution (e.g., convolution operation 110 in FIG. 1), while the temporal depthwise kernels 624C are generally used to perform temporal depthwise convolution (e.g., convolution operation 130 in FIG. 1).
- Processing system 600 further comprises machine learning circuit 626, such as described above, for example, with respect to FIGS. 1-5.
- Machine learning circuit 626 may be implemented in other processing devices of processing system 600, such as within CPU 602, GPU 604, DSP 606, NPU 608, and the like.
- Processing system 600 and/or components thereof may be configured to perform the methods described herein.
- Aspects of processing system 600 may be omitted, such as where processing system 600 is a server computer or the like.
- Multimedia component 610, wireless connectivity 612, sensors 616, ISPs 618, and/or navigation component 620 may be omitted in other aspects.
- Aspects of processing system 600 may be distributed between multiple devices.
- Clause 1 A method, comprising: receiving an input tensor comprising a frequency dimension and a temporal dimension; processing the input tensor with a first convolution operation to generate a multidimensional intermediate feature map comprising the frequency dimension and the temporal dimension; converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map in the temporal dimension using a frequency dimension reduction operation; processing the one-dimensional intermediate feature map using a second convolution operation to generate a temporal feature map; expanding the temporal feature map to the frequency dimension using a broadcasting operation to generate a multidimensional output feature map; and augmenting the multidimensional output feature map with the multidimensional intermediate feature map via a first residual connection.
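This data flow can be sketched minimally in NumPy (illustrative only: average pooling is assumed for the frequency dimension reduction, simple 'same'-padded 1-D kernels stand in for the convolution operations, and all names and shapes are hypothetical):

```python
import numpy as np

def conv1d_same(x, k, axis):
    """'Same'-padded 1-D convolution of x along `axis` with kernel k
    (a toy stand-in for the block's depthwise convolutions)."""
    p = len(k) // 2
    pad = [(0, 0)] * x.ndim
    pad[axis] = (p, p)
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i, w in enumerate(k):
        out += w * np.take(xp, range(i, i + x.shape[axis]), axis=axis)
    return out

def broadcast_residual_block(x, freq_k, temp_k):
    """x: (channels, freq, time) input tensor."""
    f2 = conv1d_same(x, freq_k, axis=1)           # first convolution -> multidimensional intermediate map (C, F, T)
    reduced = f2.mean(axis=1, keepdims=True)      # frequency dimension reduction -> one-dimensional map (C, 1, T)
    tmap = conv1d_same(reduced, temp_k, axis=2)   # second (temporal) convolution -> temporal feature map
    return np.broadcast_to(tmap, f2.shape) + f2   # broadcast over frequency + residual augmentation
```

The broadcasting step expands the (C, 1, T) temporal feature map back to (C, F, T) so the residual addition with the intermediate map is well-defined.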
- Clause 2 The method of Clause 1, wherein the multidimensional intermediate feature map is a two-dimensional intermediate feature map.
- Clause 3 The method of any of Clauses 1-2, further comprising augmenting the multidimensional output feature map with the input tensor via a second residual connection.
- Clause 4 The method of any one of Clauses 1-3, wherein the first convolution operation uses one or more depthwise convolution kernels with a size greater than one in the frequency dimension and equal to one in the temporal dimension.
- Clause 5 The method of any one of Clauses 1-4, wherein the input tensor is output from a pointwise convolution operation configured to change a number of channels in the input tensor.
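The pointwise convolution of Clause 5 re-mixes channels without touching the frequency or temporal dimensions; a one-operation NumPy sketch (hypothetical shapes):

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 (pointwise) convolution: a linear mix of channels at each
    (frequency, time) position. x: (C_in, F, T); w: (C_out, C_in)."""
    return np.einsum('oc,cft->oft', w, x)

x = np.random.randn(16, 40, 101)
w = np.random.randn(64, 16)
y = pointwise_conv(x, w)   # channel count changed 16 -> 64; F and T unchanged
```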
- Clause 6 The method of any one of Clauses 1-5, further comprising performing a subspectral normalization (SSN) operation on the multidimensional intermediate feature map prior to converting the multidimensional intermediate feature map to a one-dimensional intermediate feature map.
- Clause 7 The method of Clause 6, wherein the SSN operation comprises: dividing the multidimensional intermediate feature map into a plurality of sub-bands in the frequency dimension; and performing batch normalization on each sub-band of the plurality of sub-bands.
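The subspectral normalization of Clauses 6-7 can be sketched as follows (a simplified version: the learnable scale/shift parameters and running statistics of a full batch-norm layer are omitted, and the frequency dimension is assumed divisible by the number of sub-bands):

```python
import numpy as np

def subspectral_norm(x, num_sub_bands, eps=1e-5):
    """Divide the frequency axis into sub-bands and batch-normalize each
    sub-band independently. x: (batch, channels, freq, time)."""
    b, c, f, t = x.shape
    assert f % num_sub_bands == 0, "freq must split evenly into sub-bands"
    # view as (batch, channels, sub_bands, freq_per_band, time)
    xs = x.reshape(b, c, num_sub_bands, f // num_sub_bands, t)
    # per-(channel, sub-band) statistics over batch, in-band frequency, and time
    mean = xs.mean(axis=(0, 3, 4), keepdims=True)
    var = xs.var(axis=(0, 3, 4), keepdims=True)
    return ((xs - mean) / np.sqrt(var + eps)).reshape(b, c, f, t)
```

Each sub-band is normalized with its own statistics, so different frequency regions are not forced to share a single mean and variance.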
- Clause 8 The method of any one of Clauses 1-7, wherein the frequency dimension reduction operation comprises at least one of a maximum pooling operation, an average pooling operation, or a convolution operation.
- Clause 9 The method of any one of Clauses 1-8, wherein the second convolution operation comprises a depthwise separable convolution operation, wherein a depthwise convolution of the depthwise separable convolution operation is configured to use one or more depthwise convolution kernels with a size equal to one in the frequency dimension and greater than one in the temporal dimension.
- Clause 10 The method of Clause 9, wherein a pointwise convolution of the depthwise separable convolution operation is configured to use one or more pointwise convolution kernels subsequent to the depthwise convolution.
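The temporal depthwise separable convolution of Clauses 9-10 can be sketched in NumPy as a per-channel temporal convolution followed by a pointwise channel mix (a toy implementation with hypothetical shapes and 'same' zero padding assumed):

```python
import numpy as np

def temporal_depthwise_separable(x, dw_kernels, pw_weights):
    """Depthwise temporal convolution (one 1-D kernel per channel, size 1 in
    the frequency dimension) followed by a pointwise convolution.

    x: (C_in, F, T); dw_kernels: (C_in, kt); pw_weights: (C_out, C_in).
    """
    c, f, t = x.shape
    kt = dw_kernels.shape[1]
    p = kt // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p)))
    dw = np.zeros_like(x)
    for ch in range(c):                       # depthwise: no cross-channel mixing
        for j in range(t):
            dw[ch, :, j] = xp[ch, :, j:j + kt] @ dw_kernels[ch]
    return np.einsum('oc,cft->oft', pw_weights, dw)  # pointwise: 1x1 channel mix
```

Separating the temporal filtering (depthwise) from the channel mixing (pointwise) uses far fewer parameters than a full convolution over channels and time.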
- Clause 11 The method of any one of Clauses 1-10, wherein: the input tensor comprises input audio features; and the first and second convolution operations are part of a broadcast residual neural network configured to classify the input audio features.
- Clause 12 A system, comprising means for performing a method in accordance with any one of Clauses 1-11.
- Clause 13 A processing system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any one of Clauses 1-11.
- Clause 14 A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any one of Clauses 1-11.
- Clause 15 A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any one of Clauses 1-11.
- An apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
- The scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
- The word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
- A phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
- “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
- The term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database, or another data structure), ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
- The term “connected to”, in the context of sharing electronic signals and data between the elements described herein, generally means that the respective connected elements are in data communication with each other.
- Elements may be directly connected to each other, such as via one or more conductive traces, lines, or other conductive carriers capable of carrying signals and/or data between the respective elements that are directly connected to each other.
- Elements may be indirectly connected to each other, such as via one or more data busses or similar shared circuitry and/or integrated circuit elements for communicating signals and data between the respective elements that are indirectly connected to each other.
- The methods disclosed herein comprise one or more steps or actions for achieving the methods.
- The method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
- The order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
- The various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
- The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to, a circuit, an application-specific integrated circuit (ASIC), or a processor.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
BR112023018634A BR112023018634A2 (en) | 2021-03-25 | 2022-03-25 | SPREAD RESIDUAL LEARNING |
EP22716833.3A EP4315167A1 (en) | 2021-03-25 | 2022-03-25 | Broadcasted residual learning |
CN202280022308.9A CN117015784A (en) | 2021-03-25 | 2022-03-25 | Broadcast Residual Learning |
KR1020237031629A KR20230159418A (en) | 2021-03-25 | 2022-03-25 | Broadcast residual learning |
JP2023557146A JP2024511033A (en) | 2021-03-25 | 2022-03-25 | Broadcast residual learning |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163166161P | 2021-03-25 | 2021-03-25 | |
US63/166,161 | 2021-03-25 | ||
US17/656,621 US20220309344A1 (en) | 2021-03-25 | 2022-03-25 | Broadcasted residual learning |
US17/656,621 | 2022-03-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022204729A1 true WO2022204729A1 (en) | 2022-09-29 |
Family
ID=81308368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/071364 WO2022204729A1 (en) | 2021-03-25 | 2022-03-25 | Broadcasted residual learning |
Country Status (4)
Country | Link |
---|---|
JP (1) | JP2024511033A (en) |
KR (1) | KR20230159418A (en) |
BR (1) | BR112023018634A2 (en) |
WO (1) | WO2022204729A1 (en) |
2022
- 2022-03-25 KR KR1020237031629A patent/KR20230159418A/en unknown
- 2022-03-25 BR BR112023018634A patent/BR112023018634A2/en unknown
- 2022-03-25 WO PCT/US2022/071364 patent/WO2022204729A1/en active Application Filing
- 2022-03-25 JP JP2023557146A patent/JP2024511033A/en active Pending
Non-Patent Citations (3)
Title |
---|
DING RUNWEI ET AL: "Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network", 2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 7 October 2018 (2018-10-07), pages 4138 - 4142, XP033454671, DOI: 10.1109/ICIP.2018.8451096 * |
RAPHAEL TANG ET AL: "Deep Residual Learning for Small-Footprint Keyword Spotting", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 October 2017 (2017-10-28), XP081509130 * |
SEUNGWOO CHOI ET AL: "Temporal Convolution for Real-time Keyword Spotting on Mobile Devices", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 April 2019 (2019-04-08), XP081166237 * |
Also Published As
Publication number | Publication date |
---|---|
BR112023018634A2 (en) | 2023-10-10 |
KR20230159418A (en) | 2023-11-21 |
JP2024511033A (en) | 2024-03-12 |
Legal Events

| Code | Title | Details |
---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22716833; Country: EP; Kind code: A1 |
| WWE | WIPO information: entry into national phase | Ref document number: 2023557146; Country: JP |
| WWE | WIPO information: entry into national phase | Ref document number: 202280022308.9; Country: CN. Ref document number: 2301005853; Country: TH |
| REG | Reference to national code | Country: BR; Legal event code: B01A; Ref document number: 112023018634 |
| ENP | Entry into the national phase | Ref document number: 112023018634; Country: BR; Kind code: A2; Effective date: 2023-09-14 |
| WWE | WIPO information: entry into national phase | Ref document number: 2022716833; Country: EP |
| NENP | Non-entry into the national phase | Country: DE |
| ENP | Entry into the national phase | Ref document number: 2022716833; Country: EP; Effective date: 2023-10-25 |