WO2024118071A1 - Reduced power consumption in machine learning models - Google Patents

Reduced power consumption in machine learning models Download PDF

Info

Publication number
WO2024118071A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
output
hidden
current frame
machine learning
Prior art date
Application number
PCT/US2022/051409
Other languages
French (fr)
Inventor
Hsilin Huang
Haoyu REN
Chi Fung WONG
Tsz Lok Poon
Original Assignee
Zeku, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku, Inc. filed Critical Zeku, Inc.
Priority to PCT/US2022/051409 priority Critical patent/WO2024118071A1/en
Publication of WO2024118071A1 publication Critical patent/WO2024118071A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for reducing power consumption in machine learning (ML) models.
  • Machine learning is an area of rapid development and adoption in various technological fields.
  • ML models typically demand a huge amount of computing resources, data bandwidth and storage.
  • device resources are limited for computing, bandwidth, and storage.
  • ML models typically consume more power with the increased need for computing and memory access.
  • Chip-to-Chip data transfer consumes more power than internal data transferring, for example, between a processor system on a chip (SoC) and a dynamic random access memory (DRAM) chip.
  • Conventional approaches to saving additional power utilize a bigger static random access memory (SRAM) to hold more data inside the chip and avoid accessing data from DRAM.
  • a method includes obtaining input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtaining, via on-chip memory, at least one first output of at least one first hidden layer of the two or more hidden layers of a respective at least one previous frame.
  • the method continues by processing, via the machine learning model, the current frame utilizing the at least one first output as input to at least one second hidden layer of the two or more hidden layers for the current frame.
  • An apparatus includes a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executable by the processor to obtain input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtain, via on-chip memory, a first output of a first hidden layer of the two or more hidden layers of a respective previous frame.
  • the set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the first output as input to a second hidden layer of the two or more hidden layers for the current frame.
  • a system includes on-chip memory, one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model, a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various processes.
  • the set of instructions is executable by the processor to obtain input data of a current frame for processing by the one or more processing logic blocks of the machine learning model, and obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame.
  • the set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the at least one first output as input to at least one second hidden layer of the one or more hidden layers for the current frame.
  • FIG. 1 is a schematic block diagram of a system for reduced power consumption in ML models, in accordance with various embodiments
  • FIG. 2 is a schematic block diagram of an ML model architecture for reduced power consumption, in accordance with various embodiments
  • FIG. 3A is a schematic block diagram of an alternative ML model architecture utilizing chunks, in accordance with various embodiments
  • FIG. 3B is a schematic block diagram of an alternative ML model architecture cut into frames, in accordance with various embodiments.
  • FIG. 3C is a schematic block diagram of an alternative ML model architecture asymmetrically cut into frames, in accordance with various embodiments
  • FIG. 4 is a flow diagram of a method for reducing power consumption in ML models, in accordance with various embodiments.
  • FIG. 5 is a schematic diagram of a computer system for reducing power consumption in ML models, in accordance with various embodiments.
  • Various embodiments provide tools and techniques for reducing power consumption in ML models.
  • a method for reducing power consumption in ML models includes obtaining input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtaining, via on-chip memory, at least one first output of at least one first hidden layer of the two or more hidden layers of a respective at least one previous frame.
  • the method continues by processing, via the machine learning model, the current frame utilizing the at least one first output as input to at least one second hidden layer of the two or more hidden layers for the current frame.
  • the method may further include storing, via the on-chip memory, at least one second output of at least one third hidden layer of the two or more hidden layers of the current frame.
  • the at least one previous frame includes a first previous frame and a second previous frame.
  • utilizing the at least one first output as input to the at least one second hidden layer for the current frame further includes utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame, and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
  • the on-chip memory includes static random access memory.
  • the machine learning model is a convolutional neural network.
  • the at least one first hidden layer includes at least one of a pooling layer or convolutional layer.
  • the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
  • an apparatus for reducing power consumption in ML models includes a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions.
  • the set of instructions may be executable by the processor to obtain input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtain, via on-chip memory, a first output of a first hidden layer of the two or more hidden layers of a respective previous frame.
  • the set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the first output as input to a second hidden layer of the two or more hidden layers for the current frame.
  • the set of instructions may further be executable by the processor to store, via the on-chip memory, a second output of a third hidden layer of the two or more hidden layers of the current frame.
  • the set of instructions may further be executable by the processor to obtain, via on-chip memory, a second output of a second hidden layer of the two or more hidden layers of a second previous frame, and process, via the machine learning model, the current frame using the second output as input to a third hidden layer of the two or more hidden layers for the current frame.
  • on-chip memory includes static random access memory.
  • the machine learning model is a convolutional neural network.
  • the at least one first hidden layer includes at least one of a pooling layer or convolutional layer.
  • the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
  • a system for reducing power consumption in ML models includes on-chip memory, one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model, a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various processes.
  • the set of instructions is executable by the processor to obtain input data of a current frame for processing by the one or more processing logic blocks of the machine learning model, and obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame.
  • the set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the at least one first output as input to at least one second hidden layer of the one or more hidden layers for the current frame.
  • the set of instructions may further be executable by the processor to store, via the on-chip memory, at least one second output of at least one third hidden layer of the one or more hidden layers of the current frame.
  • the at least one previous frame includes a first previous frame and a second previous frame, wherein utilizing the at least one first output as input to the at least one second hidden layer for the current frame further includes utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame, and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
  • the on-chip memory includes static random access memory.
  • the at least one first hidden layer includes at least one of a pooling layer or convolutional layer, and wherein the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
  • the phrase "at least one of" preceding a series of items, with the term "and" or "or" to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item).
  • the phrase "at least one of" does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items.
  • the phrases “at least one of A, B, and C" or “at least one of A, B, or C” each refer to only A, only B, or only C; and/or any combination of A, B, and C. In instances where it is intended that a selection be of "at least one of each of A, B, and C," or alternatively, "at least one of A, at least one of B, and at least one of C,” it is expressly described as such.
  • Conventional ML model architectures typically have at least an input layer and an output layer, and one or more hidden layers, which are further stored in memory of a computing system.
  • power is consumed during memory access when storing and fetching the internal layers of the ML model through an off-chip memory, such as DRAM.
  • hidden layers may be stored in cache or on-chip memory, such as SRAM, magnetoresistive random access memory (MRAM), resistive random access memory (RRAM), phase change memory (PCM), and ferroelectric random access memory (FeRAM).
  • all the feature maps and weights may be stored in on-chip memory.
  • a challenge to storing the feature maps and weights is the size of feature maps for images with large dimensions, such as 4K (e.g., 2160x3840) resolution images.
  • Fig. 1 is a schematic block diagram of a system for reduced power consumption in ML models, in accordance with various embodiments.
  • the system 100 includes one or more processing modules 105a-105f (collectively, “processing modules 105,” also interchangeably referred to as “processing logic blocks”), processor 110, cache memory 115, and DRAM 120.
  • the system 100 is configured to execute an ML model.
  • the ML model may be implemented in hardware, software, or both hardware and software.
  • the ML model may include a neural network, such as a convolutional neural network (CNN), including a residual neural network (ResNet), recurrent neural network (RNN), U-Net, a feed forward network (FFN), including a multilayer perceptron (MLP), a transformer, or other suitable neural network that includes one or more layers.
  • the ML model may be stored, at least in part, in the DRAM 120 of the system 100, and similarly fetched from the DRAM 120.
  • the processor 110 includes any computer processor configured to run and/or perform the processing of the ML model.
  • the processor 110 may include a neural processing unit (NPU).
  • the processor 110 may, in various embodiments, be a chip (e.g., an IC package) including on-chip memory, such as cache memory 115.
  • the NPU 110 may further be coupled to off-chip memory, such as DRAM 120.
  • cache memory 115 may be configured to store at least part of the ML model.
  • the cache memory 115 may be configured to store the feature maps, filters, and weights of a pooling layer, which in this example is a "pooling 1/8" layer.
  • the pooling 1/8 layer is a pooling layer in which the size of a feature map is reduced by a factor of 8 via a pooling function.
  • a pooling function may include, for example, mean pooling, max pooling, or min pooling, among other pooling functions, as sketched below.
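  • For concreteness, the following is a minimal sketch of how a "pooling 1/8" operation could be realized in plain numpy, reducing each spatial dimension of a feature map by a factor of 8. The function name, the block-reshape approach, and the mean/max/min options are illustrative assumptions; the disclosure specifies only the 1/8 reduction ratio and the general pooling functions.

```python
import numpy as np

def pool_by_factor(feature_map, factor=8, mode="mean"):
    """Reduce each spatial dimension of an (H, W, C) feature map by `factor`.

    Illustrative sketch only: the disclosure specifies the 1/8 reduction
    ratio and example pooling functions, not this exact implementation.
    """
    h, w, c = feature_map.shape
    assert h % factor == 0 and w % factor == 0, "H and W must be divisible by the pooling factor"
    # Group pixels into factor x factor blocks, then reduce each block.
    blocks = feature_map.reshape(h // factor, factor, w // factor, factor, c)
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "min":
        return blocks.min(axis=(1, 3))
    raise ValueError(f"unsupported pooling mode: {mode}")

# A 4K, 4-channel frame (2160 x 3840 x 4) shrinks by 8x per side, so the
# pooled map is 64x smaller and far easier to hold in on-chip SRAM.
frame = np.random.rand(2160, 3840, 4).astype(np.float32)
print(pool_by_factor(frame, factor=8, mode="mean").shape)  # (270, 480, 4)
```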
  • each processing module 105a-105f may include a respective layer of the ML model, feature maps, weights, filters, and/or other piece of executable code of the ML model.
  • the data (e.g., layer of the ML model, feature map, weights, filters, etc.) of the respective processing modules 105a-105f may be stored and/or fetched from DRAM 120.
  • the processing module 105c associated with pooling 1/8 layer may be stored and/or fetched from the cache memory 115.
  • the input data to the system 100 may be an image or other data that may be represented in matrix-form.
  • Each instance of input data in a stream of input data may be referred to as a frame.
  • the frame may refer to the input data itself (e.g., an image of a series of images) or to data generated based on an instance of the input data (e.g., a feature map of a given image).
  • the system 100 may include feature extraction network, for example implemented in one or more of the processing modules 105a-105f, configured to extract a feature map from the input data.
  • one or more respective feature maps extracted from one or more previous frames may be used and stored in cache memory 115, and fetched from cache memory 115 to be used by the pooling 1/8 layer of the respective module (e.g., processing module 105c).
  • the pooling 1/8 layer may be fetched from DRAM 120 (not shown).
  • the feature maps generated by the pooling function (e.g., a feature map on which a pooling function has been applied) and stored in the cache memory 115 may be used for the current and/or future frames.
  • one or more pooling layers, in this case the pooling 1/8 layer, are processed ahead of time and stored only to cache memory 115, as illustrated in the sketch below.
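  • The sketch below illustrates this store-ahead-and-fetch pattern: the pooling 1/8 output computed while handling one frame is kept in a small cache (standing in for cache memory 115) and fetched when the next frame is processed. The OnChipCache class, its key scheme, and the process_frame flow are assumptions introduced for illustration, not the disclosed implementation.

```python
import numpy as np

class OnChipCache:
    """Toy stand-in for the on-chip cache memory 115: holds layer outputs
    keyed by (layer name, frame index). Purely illustrative."""
    def __init__(self):
        self._store = {}
    def put(self, layer, frame_idx, tensor):
        self._store[(layer, frame_idx)] = tensor
    def get(self, layer, frame_idx):
        return self._store.get((layer, frame_idx))

def pool_1_8(x):
    # Mean-pool each spatial dimension by 8 (see the earlier pooling sketch).
    h, w, c = x.shape
    return x.reshape(h // 8, 8, w // 8, 8, c).mean(axis=(1, 3))

cache = OnChipCache()

def process_frame(frame_idx, frame):
    # Fetch the pooling-1/8 map that was prepared while handling frame t-1.
    pooled_prev = cache.get("pooling_1_8", frame_idx - 1)
    # Prepare this frame's pooled map ahead of time and keep it on-chip for
    # the next frame, avoiding a DRAM round trip.
    cache.put("pooling_1_8", frame_idx, pool_1_8(frame))
    # Downstream layers of the current frame would consume `pooled_prev`
    # (it is None only for the very first frame, which has no predecessor).
    return pooled_prev
```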
  • a processing module may include a piece of executable code and/or logic (e.g., a processing logic block) of at least part of the ML model (such as data used by the ML model, one or more layers of the ML model, etc.).
  • processing modules 105a-105e may include data processed by the ML model for a previous frame and executed by the processor 110 (e.g., an NPU).
  • the processing modules 105a-105f may include dedicated hardware, software, or both hardware and software, such as dedicated hardware logic (e.g., programmable logic such as a field-programmable gate array (FPGA)) and/or a dedicated custom integrated circuit (e.g., an ASIC) configured to perform at least part of the processing of the ML model.
  • Fig. 2 is a schematic block diagram of an architecture 200 of an ML model for reduced power consumption, in accordance with various embodiments.
  • the architecture 200 includes input frames 205, a first concatenation layer 210, pooling 1/8 layer 215, pooling 1/4 layer 220, pooling 1/2 layer 225, respective tile concatenation layers 230a-230c, including a first tile concatenation layer 230a, second tile concatenation layer 230b, and third tile concatenation layer 230c.
  • the architecture 200 further includes a plurality of tiles, such as pooling 1/8 tiles 235a, pooling 1/4 tiles 235b, and pooling 1/2 tiles 235c.
  • the architecture 200 further includes output frames 235d.
  • the architecture 200 depicts a dependency-based CNN architecture similar to a pyramid network.
  • the first concatenation layer 210 has dependencies from the input frames 205, which includes frame t-1 205a (also referred to as "previous frame 205a") and frame t 205b (also referred to as "current frame 205b").
  • the input layer data (e.g., a frame 205a, 205b of the frames 205) may be 4K x 4K (Bayer format, RGGB, 4 channels), which at 12 bits/pixel is equal to a size of 24 MB.
  • the input frames 205 themselves are respectively feature maps.
  • Each feature map may further be divided into tiles, to which a pooling algorithm may be applied to obtain the feature map at each pooling layer.
  • the concatenation layer 210 feeds feature maps to different pooling ratios, such as pooling 1/4 layer 220 and pooling 1/2 layer 225. For purposes of explanation, a tile-by-tile processing operation is used for each frame, proceeding from left to right until the end of a given line of a feature map of a respective frame 205a-205b.
  • pooling 1/8 layer 215 receives its feature maps that are prepared through a previous pipeline.
  • the number of dependency lines may be calculated based on the number of convolutions (including deconvolutions) for a given layer, size of a convolutional filter, pooling factor (e.g., ratio - in this example, 1/8), and total number of layers.
  • the reduction in dependency lines translates to reductions in storage size.
  • the feature map has a small size that can be processed ahead of time and stored in cache memory or on-chip memory.
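  • Since the disclosure names the inputs to this calculation (convolution count, filter size, pooling ratio, and layer count) without giving a closed-form expression, the helper below uses an assumed halo-based estimate purely to make the idea concrete.

```python
def estimate_dependency_lines(num_convs, kernel_size, pooling_factor, num_layers):
    """Rough estimate of how many input lines must be retained before a pooled
    layer can emit one output line. The formula is an assumption for
    illustration; only its inputs are named in the disclosure."""
    # Each k x k convolution needs k // 2 extra lines of context (halo).
    halo_per_layer = num_convs * (kernel_size // 2)
    # A 1/p pooling layer maps one output line back to p input lines.
    return halo_per_layer * num_layers * pooling_factor

# Example: 2 convolutions per layer, 3x3 filters, 1/8 pooling, 4 layers.
print(estimate_dependency_lines(2, 3, 8, 4))  # 64 dependency lines
```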
  • Fig. 3A is a schematic block diagram of an alternative ML model architecture 300A utilizing chunks, in accordance with various embodiments.
  • the architecture 300A includes an input layer 305, output layer 310, skip layers 315 including first skip layer (skip1) 315a, second skip layer (skip2) 315b, and third skip layer (skip3) 315c.
  • the various elements of architecture 300A are schematically illustrated in Fig. 3A, and modifications to the various elements and arrangements of architecture 300A may be possible and in accordance with the various embodiments.
  • the architecture 300A may be a U-Net based ML model.
  • the input layer size is 2000 x 1504 x 4 x 12 bits / pixel
  • the output layer size is also 2000 x 1504 x 4 x 12 bits / pixel.
  • the architecture further includes three skip layers 315: skip1 315a, skip2 315b, and skip3 315c.
  • Skip1 315a, skip2 315b, and skip3 315c may refer to the outputs of the respective skip layers, such as a tile and/or frame generated based on the input data from the input layer 305.
  • the outputs of the respective skip layers 315 may be generated by one or more convolutional layers.
  • the output of convolutional layer Conv1 and the output of first skip layer 315a, Skip1_conv1, are held in storage, for example, in DRAM. Furthermore, the output of first skip layer 315a, Skip1_conv1, is held until the end of the process for a given layer. Moreover, the output of second skip layer 315b, Skip2_conv1, and the output of third skip layer 315c, Skip3_conv1, must also be held in storage. Given the above, a minimum buffer size may be given as:
  • Size of buffer = (output of Skip1_conv1 + Skip2_conv1 + Skip3_conv1) + max(input_data, output_data)
  • where input_data is the size of the input layer 305
  • and output_data is the size of the output layer 310.
  • a chunk-by-chunk process is provided.
  • the size of the buffer becomes:
  • Size of buffer = (output of Skip1_conv1 + Skip2_conv1) + output of Conv4 + (line buffer size of Tile layer + line buffer size of Concat1 + line buffer size of Skip3_conv1 + line buffer size of Conv4 + line buffer size of upConv1 + line buffer size of upConv2 + line buffer size of Concat2 + line buffer size of Conv5 + line buffer size of upConv3 + line buffer size of upConv4 + line buffer size of Concat3 + line buffer size of Conv6).
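  • To make the two buffer estimates above concrete, the sketch below plugs in placeholder layer shapes. Only the 2000 x 1504 x 4, 12 bits/pixel input/output size comes from the text; every other dimension, and the single-row line-buffer approximation, is a hypothetical choice for illustration.

```python
def tensor_mb(h, w, c, bits_per_px=12):
    """Size in MB of an h x w x c tensor at the given bit depth."""
    return h * w * c * bits_per_px / 8 / 2**20

# Only the input/output shape (2000 x 1504 x 4 at 12 bits/pixel, ~17.2 MB)
# is stated in the text; the other shapes below are placeholders.
input_data  = tensor_mb(2000, 1504, 4)
output_data = tensor_mb(2000, 1504, 4)
skip1 = tensor_mb(2000, 1504, 8)   # hypothetical Skip1_conv1 output
skip2 = tensor_mb(1000, 752, 16)   # hypothetical Skip2_conv1 output
skip3 = tensor_mb(500, 376, 32)    # hypothetical Skip3_conv1 output
conv4 = tensor_mb(250, 188, 64)    # hypothetical Conv4 output

# Frame-at-a-time: all three skip outputs plus the larger of input/output.
frame_at_a_time = skip1 + skip2 + skip3 + max(input_data, output_data)

# Chunk-by-chunk: Skip3_conv1 and the decoder-side layers shrink to line
# buffers, approximated here as one 1504-wide row each for 12 such layers.
line_buffers = 12 * tensor_mb(1, 1504, 32)
chunk_by_chunk = skip1 + skip2 + conv4 + line_buffers

print(f"frame-at-a-time ~= {frame_at_a_time:.1f} MB")
print(f"chunk-by-chunk  ~= {chunk_by_chunk:.1f} MB")
```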
  • Fig. 3B depicts a similar U-Net based architecture 300B.
  • Fig. 3B depicts global pooling layer 320 that has been cut into two frames - a current frame (frame t) and a previous frame (frame t-1).
  • the output of global pooling layer 320 from a previous frame (frame t-1) may be utilized to reduce buffer size, leveraging the outputs of different layers from previous frames in off-chip memory and/or on-chip memory.
  • Fig. 3C depicts an architecture in which data from two previous frames are leveraged to reduce buffer size and memory requirements.
  • architecture 300C fetches the output of convolutional layer Conv4 from a first previous frame (frame t-1) immediately preceding the current frame (frame t) (e.g., output of previous frame 325), and the output from global pooling from a second previous frame (frame t-2), which precedes frame t by two frames (e.g., output of second previous frame 320).
  • These outputs may be stored in off-chip memory and/or on-chip memory prior to the processing pipeline for the current frame.
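  • A minimal sketch of the bookkeeping this implies is shown below, assuming a small history cache that retains the Conv4 output of frame t-1 and the global-pooling output of frame t-2; the class and method names are hypothetical, not part of the disclosure.

```python
from collections import deque

class LayerHistoryCache:
    """Holds the cached outputs architecture 300C depends on: the Conv4 output
    of the immediately preceding frame (t-1) and the global-pooling output of
    the frame before that (t-2). Names and structure are illustrative."""

    def __init__(self):
        self._conv4 = deque(maxlen=1)        # one frame of history
        self._global_pool = deque(maxlen=2)  # two frames of history

    def store_current(self, conv4_out, global_pool_out):
        # Called at the end of each frame's pipeline.
        self._conv4.append(conv4_out)
        self._global_pool.append(global_pool_out)

    def fetch_for_current(self):
        # Called before the current frame's pipeline starts.
        conv4_t1 = self._conv4[-1] if self._conv4 else None
        pool_t2 = self._global_pool[0] if len(self._global_pool) == 2 else None
        return conv4_t1, pool_t2
```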
  • Fig. 4 is a flow diagram of a method 400 for reducing power consumption in ML models, in accordance with various embodiments.
  • the method 400 includes, at block 405, obtaining input data for a current frame to be processed by an ML model.
  • the ML model may include, without limitation, a neural network, such as, without limitation, a CNN, ResNet, RNN, U-Net, FFN, MLP, transformer, or other suitable neural network.
  • processing by the ML model may include encoding and/or decoding of the input data (or a feature map of the input data) of the current frame.
  • input data may include, without limitation, raw data (e.g., unprocessed image data, or other unprocessed source data), a feature map, or other processed data.
  • a feature extraction network may be configured to generate the feature map.
  • the feature extraction network itself may include various types of CNNs, such as a ResNet, RNN, R-CNN, etc., configured to extract and/or identify features in the current frame, and further to generate respective feature maps from other respective frames (e.g., previous frames and/or future frames).
  • the method 400 continues, at block 415, by obtaining an output of at least one layer of a neural network (e.g., ML model) of at least one previous frame from on-chip memory.
  • the one or more outputs of a layer from a neural network (e.g., ML model) for one or more previous frames may be stored in on-chip memory, such as SRAM, of an SoC, IC, or other computing device.
  • the ML model may utilize an architecture that includes an input layer, an output layer, and one or more hidden layers.
  • the ML model may be a pyramid architecture that includes one or more pooling layers as part of the one or more hidden layers.
  • an output of a pooling 1/8 layer from a previous frame (e.g., frame t-1) may be obtained (e.g., fetched) from SRAM.
  • storing these outputs yields a reduction in storage requirements that exceeds the cost of storing the outputs themselves.
  • the ML model may employ a U-Net based architecture.
  • the one or more hidden layers may include a global pooling layer, one or more convolution layers, one or more skip layers, one or more concatenation layers, convolution layers, etc.
  • an output of a global pooling layer from a first previous frame immediately preceding the current frame (e.g., the immediate previous frame in a sequence of frames), frame t-1, may be obtained from SRAM and/or DRAM storage.
  • outputs from layers of multiple previous frames may be utilized to "cut" a given network into three frames.
  • the layers of a neural network may be "cut” into three different groups of layers.
  • the groups of layers may, in some examples, be asymmetric.
  • An input to a group of layers may, in some examples, be obtained from an output of a layer (e.g., a preceding layer) from a preceding frame.
  • an output of a convolutional layer from a first previous frame may be fetched and utilized as the input for a global pooling layer of a current frame (frame t).
  • the output of a global pooling layer from a second previous frame (frame t-2), which precedes frame t by two frames, may be taken as input to a chunking and/or tiling layer, which may further be utilized as an input to a concatenation layer for the current frame.
  • These outputs may be stored and fetched from SRAM prior to the processing pipeline for the current frame.
  • processing of a current frame by an ML model may include encoding and/or decoding of the current frame (e.g., a feature map or input data) via an ML model.
  • the outputs of a hidden layer of a previous frame may be taken as an input to a hidden layer of an encoder and/or decoder of an encoder-decoder network.
  • the output of a first hidden layer of a previous frame may be taken as input to a second hidden layer for a current frame, where the second hidden layer immediately follows the first hidden layer of a network / ML model.
  • the method 400 further includes storing the output of at least one layer of a current frame in on-chip memory.
  • an output of a layer for a current frame may be stored in on-chip memory for use in the processing of a future frame.
  • the output of a layer for a current frame may include the output of a layer in which an input is obtained from a previous frame.
  • the output of a first layer of a first previous frame (frame t-2) may be used as input to a second layer succeeding the first layer.
  • the output of the second layer may then be stored in on-chip memory to be used later as input to a third layer succeeding the second layer.
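  • Putting these steps together, the loop below sketches the fetch/process/store cycle of method 400. The layer attributes (consumes_previous_frame, frame_offset, cache_output) and the cache interface are hypothetical stand-ins used only to show where the on-chip reads and writes would occur.

```python
def run_model_on_stream(frames, model_layers, cache):
    """Hedged sketch of method 400: fetch cached hidden-layer outputs of
    previous frames from on-chip memory, process the current frame with them,
    then store this frame's designated outputs for future frames."""
    for t, frame in enumerate(frames):
        x = frame  # block 405: obtain input data (or its feature map) for frame t
        for layer in model_layers:
            if layer.consumes_previous_frame:
                # Block 415: reuse an output produced for frame t-1 (or t-2),
                # fetched from on-chip memory rather than recomputed or read
                # back from DRAM.
                prev = cache.get(layer.dependency_name, t - layer.frame_offset)
                x = layer(x, prev)
            else:
                x = layer(x)
            if layer.cache_output:
                # Store this output on-chip so a future frame can consume it.
                cache.put(layer.name, t, x)
        yield x  # processed result for frame t
```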
  • Fig. 5 is a schematic block diagram of a computer system 500 for reducing power consumption in ML models.
  • Fig. 5 provides a schematic illustration of one embodiment of a computer system 500, such as the system 100, and ML model architectures, such as architectures 200 and 300A-300C, or subsystems thereof, which may perform the methods provided by various other embodiments, as described herein.
  • Fig. 5 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate.
  • the computer system 500 includes multiple hardware elements that may be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 515, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, and/or the like.
  • the computer system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer system 500 might also include a communications subsystem 530, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or a low-power wireless device.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein.
  • the computer system 500 further comprises a working memory 535, which can include a RAM or ROM device, such as DRAM, SRAM, etc., as described above.
  • the computer system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above.
  • the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 500 in response to processor 510 executing one or more sequences of one or more instructions (which may be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally receives the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
  • the instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A system includes on-chip memory, one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model, a processor, and a non-transitory computer readable medium in communication with the processor. The non-transitory computer readable medium includes a set of instructions encoded thereon, and executable by the processor to obtain input data of a current frame for processing by the machine learning model, obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame, and process, via the machine learning model, the current frame using the at least one first output.

Description

REDUCED POWER CONSUMPTION IN MACHINE LEARNING MODELS
FIELD
[0001] The present disclosure relates, in general, to methods, systems, and apparatuses for reducing power consumption in machine learning (ML) models.
BACKGROUND
[0002] Machine learning is an area of rapid development and adoption in various technological fields. ML models typically demand a huge amount of computing resources, data bandwidth, and storage. In internet of things (IoT) devices and handheld systems, device resources for computing, bandwidth, and storage are limited. Moreover, ML models typically consume more power with the increased need for computing and memory access.
[0003] Chip-to-chip data transfer, for example between a processor system on a chip (SoC) and a dynamic random access memory (DRAM) chip, consumes more power than internal data transfer. Conventional approaches to saving additional power utilize a bigger static random access memory (SRAM) to hold more data inside the chip and avoid accessing data from DRAM. However, a bigger SRAM increases die size and manufacturing costs.
[0004] Thus, systems, methods, and apparatuses for reducing power consumption in machine learning models are provided.
SUMMARY
[0005] Tools and techniques for reduced power consumption in machine learning models are provided.
[0006] A method includes obtaining input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtaining, via on-chip memory, at least one first output of at least one first hidden layer of the two or more hidden layers of a respective at least one previous frame. The method continues by processing, via the machine learning model, the current frame utilizing the at least one first output as input to at least one second hidden layer of the two or more hidden layers for the current frame.
[0007] An apparatus includes a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executable by the processor to obtain input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtain, via on-chip memory, a first output of a first hidden layer of the two or more hidden layers of a respective previous frame. The set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the first output as input to a second hidden layer of the two or more hidden layers for the current frame.
[0008] A system includes on-chip memory, one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model, a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various processes. The set of instructions is executable by the processor to obtain input data of a current frame for processing by the one or more processing logic blocks of the machine learning model, and obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame. The set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the at least one first output as input to at least one second hidden layer of the one or more hidden layers for the current frame.
[0009] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided therein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0011] Fig. 1 is a schematic block diagram of a system for reduced power consumption in ML models, in accordance with various embodiments;
[0012] Fig. 2 is a schematic block diagram of an ML model architecture for reduced power consumption, in accordance with various embodiments;
[0013] Fig. 3A is a schematic block diagram of an alternative ML model architecture utilizing chunks, in accordance with various embodiments;
[0014] Fig. 3B is a schematic block diagram of an alternative ML model architecture cut into frames, in accordance with various embodiments;
[0015] Fig. 3C is a schematic block diagram of an alternative ML model architecture asymmetrically cut into frames, in accordance with various embodiments;
[0016] Fig. 4 is a flow diagram of a method for reducing power consumption in ML models, in accordance with various embodiments; and
[0017] Fig. 5 is a schematic diagram of a computer system for reducing power consumption in ML models, in accordance with various embodiments.
DETAILED DESCRIPTION OF EMBODIMENTS
[0018] Various embodiments provide tools and techniques for reducing power consumption in ML models.
[0019] In some embodiments, a method for reducing power consumption in ML models is provided. The method includes obtaining input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtaining, via on-chip memory, at least one first output of at least one first hidden layer of the two or more hidden layers of a respective at least one previous frame. The method continues by processing, via the machine learning model, the current frame utilizing the at least one first output as input to at least one second hidden layer of the two or more hidden layers for the current frame.
[0020] In some examples, the method may further include storing, via the on-chip memory, at least one second output of at least one third hidden layer of the two or more hidden layers of the current frame. In some examples, the at least one previous frame includes a first previous frame and a second previous frame.
[0021] In some examples, utilizing the at least one first output as input to the at least one second hidden layer for the current frame further includes utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame, and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
[0022] In some examples, the on-chip memory includes static random access memory. In some examples, the machine learning model is a convolutional neural network. In some examples, the at least one first hidden layer includes at least one of a pooling layer or convolutional layer. In some further examples, the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
[0023] In some embodiments, an apparatus for reducing power consumption in ML models is provided. The apparatus includes a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executable by the processor to obtain input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers, and obtain, via on-chip memory, a first output of a first hidden layer of the two or more hidden layers of a respective previous frame. The set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the first output as input to a second hidden layer of the two or more hidden layers for the current frame.
[0024] In some examples, the set of instructions may further be executable by the processor to store, via the on-chip memory, a second output of a third hidden layer of the two or more hidden layers of the current frame. In some examples, the set of instructions may further be executable by the processor to obtain, via on-chip memory, a second output of a second hidden layer of the two or more hidden layers of a second previous frame, and process, via the machine learning model, the current frame using the second output as input to a third hidden layer of the two or more hidden layers for the current frame. In some examples, on-chip memory includes static random access memory. In some examples, the machine learning model is a convolutional neural network. In some examples, the at least one first hidden layer includes at least one of a pooling layer or convolutional layer. In further examples, the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
[0025] In further embodiments, a system for reducing power consumption in ML models is provided. The system includes on-chip memory, one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model, a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various processes. The set of instructions is executable by the processor to obtain input data of a current frame for processing by the one or more processing logic blocks of the machine learning model, and obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame. The set of instructions is further executable by the processor to process, via the machine learning model, the current frame using the at least one first output as input to at least one second hidden layer of the one or more hidden layers for the current frame.
[0026] In some examples, the set of instructions may further be executable by the processor to store, via the on-chip memory, at least one second output of at least one third hidden layer of the one or more hidden layers of the current frame.
[0027] In some examples, the at least one previous frame includes a first previous frame and a second previous frame, wherein utilizing the at least one first output as input to the at least one second hidden layer for the current frame further includes utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame, and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
[0028] In some examples, the on-chip memory includes static random access memory. In further examples, the at least one first hidden layer includes at least one of a pooling layer or convolutional layer, and wherein the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
[0029] In the following description, for the purposes of explanation, numerous details are set forth to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments may be practiced without some of these details. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0030] When an element is referred to herein as being "connected" or "coupled" to another element, it is to be understood that the elements can be directly connected to the other element, or have intervening elements present between the elements. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present in the "direct" connection between the elements. However, the existence of a direct connection does not exclude other connections, in which intervening elements may be present.
[0031] Moreover, the terms left, right, front, back, top, bottom, forward, reverse, clockwise and counterclockwise are used for purposes of explanation only and are not limited to any fixed direction or orientation. Rather, they are used merely to indicate relative locations and/or directions between various parts of an object and/or components.
[0032] Furthermore, the methods and processes described herein may be described in a particular order for ease of description. However, it should be understood that, unless the context dictates otherwise, intervening processes may take place before and/or after any portion of the described process, and further various procedures may be reordered, added, and/or omitted in accordance with various embodiments.
[0033] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or” unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes” and "included," should be considered nonexclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0034] As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each refer to only A, only B, or only C; and/or any combination of A, B, and C. In instances where it is intended that a selection be of "at least one of each of A, B, and C," or alternatively, "at least one of A, at least one of B, and at least one of C," it is expressly described as such.
[0035] Conventional ML model architectures typically have at least an input layer and an output layer, and one or more hidden layers, which are further stored in memory of a computing system. Typically, power is consumed during memory access when storing and fetching the internal layers of the ML model through an off-chip memory, such as DRAM.
[0036] Accordingly, to save power consumption, hidden layers may be stored in cache or on-chip memory, such as SRAM, magnetoresistive random access memory (MRAM), resistive random access memory (RRAM), phase change memory (PCM), and ferroelectric random access memory (FeRAM). Furthermore, all the feature maps and weights may be stored in on-chip memory. However, a challenge to storing the feature maps and weights is the size of feature maps for images with large dimensions, such as 4K (e.g., 2160x3840) resolution images. Thus, a framework for ML model processing is set forth in which a partial output of a layer is used as the input to the next layer, reducing size and storage in the on-chip memory.
[0037] Fig. 1 is a schematic block diagram of a system for reduced power consumption in ML models, in accordance with various embodiments. The system 100 includes one or more processing modules 105a-105f (collectively, "processing modules 105," also interchangeably referred to as "processing logic blocks"), processor 110, cache memory 115, and DRAM 120. It should be noted that the various components of the system 100 are schematically illustrated in Fig. 1, and that modifications to the various components and other arrangements of system 100 may be possible and in accordance with the various embodiments.
[0038] In various embodiments, the system 100 is configured to execute an ML model. In various examples, the ML model may be implemented in hardware, software, or both hardware and software. The ML model may include a neural network, such as a convolutional neural network (CNN), including a residual neural network (ResNet), recurrent neural network (RNN), U-Net, a feed forward network (FFN), including a multilayer perceptron (MLP), a transformer, or other suitable neural network that includes one or more layers.
[0039] In various examples, the ML model may be stored, at least in part, in the DRAM 120 of the system 100, and similarly fetched from the DRAM 120. The processor 110 includes any computer processor configured to run and/or perform the processing of the ML model. In various examples, the processor 110 may include a neural processing unit (NPU). The processor 110 (e.g., NPU) may include one or more microprocessors, such as a central processing unit (CPU) or a graphics processing unit (GPU).
[0040] The processor 110 (e.g., the NPU) may, in various embodiments, be a chip (e.g., an IC package) including on-chip memory, such as cache memory 115. The NPU 110 may further be coupled to off-chip memory, such as DRAM 120. In various embodiments, cache memory 115 may be configured to store at least part of the ML model. For example, in some embodiments, the cache memory 115 may be configured to store the feature maps, filters, and weights of a pooling layer, which in this example is a "pooling 1/8" layer. The pooling 1/8 layer is a pooling layer in which the size of a feature map is reduced by a factor of 8 via a pooling function. A pooling function may include, for example, mean pooling, max pooling, or min pooling, among other pooling functions. In various embodiments, each processing module 105a-105f may include a respective layer of the ML model, feature maps, weights, filters, and/or other pieces of executable code of the ML model. The data (e.g., layer of the ML model, feature map, weights, filters, etc.) of the respective processing modules 105a-105f may be stored and/or fetched from DRAM 120. In some examples, the processing module 105c associated with the pooling 1/8 layer may be stored and/or fetched from the cache memory 115.
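For illustration only, a "pooling 1/8" operation of the kind described above might be sketched as follows; the function name, the NumPy representation, and the (H, W, C) layout are assumptions for explanation rather than the disclosed implementation.

```python
import numpy as np

def pool_1_8(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    """Reduce an (H, W, C) feature map by a factor of 8 in H and W.

    A minimal sketch of a pooling 1/8 operation; the window size and the
    mode ("max", "mean", or "min") are illustrative assumptions.
    """
    h, w, c = feature_map.shape
    assert h % 8 == 0 and w % 8 == 0, "dimensions assumed divisible by 8"
    # Group pixels into non-overlapping 8x8 windows.
    windows = feature_map.reshape(h // 8, 8, w // 8, 8, c)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "min":
        return windows.min(axis=(1, 3))
    return windows.mean(axis=(1, 3))  # mean pooling

# e.g., a 2160 x 3840 x 4 feature map becomes a 270 x 480 x 4 map.
pooled = pool_1_8(np.zeros((2160, 3840, 4), dtype=np.float32))
```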
[0041] In some examples, the input data to the system 100 may be an image or other data that may be represented in matrix form. Each instance of input data in a stream of input data may be referred to as a frame. The frame, as used herein, may refer to the input data itself (e.g., an image of a series of images) or to data generated based on an instance of the input data (e.g., a feature map of a given image). In some embodiments, the system 100 may include a feature extraction network, for example implemented in one or more of the processing modules 105a-105f, configured to extract a feature map from the input data. In some embodiments, one or more respective feature maps extracted from one or more previous frames may be stored in cache memory 115, and fetched from cache memory 115 to be used by the pooling 1/8 layer of the respective module (e.g., processing module 105c). For example, for the one or more previous frames, the pooling 1/8 layer may be fetched from DRAM 120 (not shown). Once processed by the pooling layer, the feature maps generated by the pooling function (e.g., a feature map on which a pooling function has been applied) may be saved to the cache memory 115 to be used for the current and/or future frames. In this way, one or more pooling layers, in this case the pooling 1/8 layer, are processed ahead of time and only their outputs stored to cache memory 115.
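A minimal sketch of this caching pattern, under the assumption that the on-chip cache can be modeled as a simple key-value store; the names on_chip_cache, extract_features, and process_frame are hypothetical and only illustrate the reuse of a pooled feature map prepared during an earlier pipeline.

```python
# Hypothetical on-chip cache, modeled here as a dict keyed by frame index.
on_chip_cache = {}

def process_frame(frame_idx, frame, extract_features, pool_1_8):
    """Fetch the pooled feature map prepared during an earlier pipeline and
    prepare this frame's pooled map for frames that follow."""
    pooled_prev = on_chip_cache.get(frame_idx - 1)  # prepared ahead of time
    feature_map = extract_features(frame)
    # The pooling 1/8 layer is run here, ahead of the frames that need it,
    # and only its (small) output is kept in the on-chip cache.
    on_chip_cache[frame_idx] = pool_1_8(feature_map)
    return feature_map, pooled_prev
```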
[0042] Effectively, a neural network layer is removed from processing by the processor 110 for a current frame by processing the data for a pooling layer a few pipelines ahead of the processor 110 in the pooling 1/8 processing module 105c. Specifically, in some embodiments, a processing module may include a piece of executable code and/or logic (e.g., a processing logic block) of at least part of the ML model (such as data used by the ML model, one or more layers of the ML model, etc.). In some examples, the processing modules 105a-105e may include data processed by the ML model for a previous frame and executed by the processor 110 (e.g., an NPU). In other embodiments, the processing modules 105a-105f may include dedicated hardware, software, or hardware and software, such as dedicated hardware logic (e.g., programmable logic such as a field-programmable gate array (FPGA)) and/or a dedicated custom integrated circuit (e.g., an ASIC) configured to perform at least part of the processing of the ML model.
[0043] Continuing with the example above, the reduction in the number of dependencies is described in greater detail below with respect to Fig. 2.
[0044] Fig. 2 is a schematic block diagram of an architecture 200 of an ML model for reduced power consumption, in accordance with various embodiments. The architecture 200 includes input frames 205, a first concatenation layer 210, pooling 1/8 layer 215, pooling 1/4 layer 220, pooling 1/2 layer 225, and respective tile concatenation layers 230a-230c, including a first tile concatenation layer 230a, second tile concatenation layer 230b, and third tile concatenation layer 230c. The architecture 200 further includes a plurality of tiles, such as pooling 1/8 tiles 235a, pooling 1/4 tiles 235b, and pooling 1/2 tiles 235c. The architecture 200 further includes output frames 235d.
[0045] The architecture 200, in various embodiments, depicts a dependency-based CNN architecture similar to a pyramid network. As shown in Fig. 2, the first concatenation layer 210 has dependencies from the input frames 205, which include frame t-1 205a (also referred to as "previous frame 205a") and frame t 205b (also referred to as "current frame 205b"). In some examples, for purposes of explanation, the input layer data (e.g., a frame 205a, 205b of the frames 205) may be 4K x 4K (Bayer format RGGB, 4 channels), which at 12 bits / pixel is equal to a size of 24 MB. In some examples, the input frames 205 are themselves feature maps.
[0046] Each feature map may further be divided into tiles, to which a pooling algorithm may be applied to obtain the feature map at each pooling layer. The concatenation layer 210 feeds feature maps to layers with different pooling ratios, such as the pooling 1/4 layer 220 and the pooling 1/2 layer 225. For purposes of explanation, a tile-by-tile processing operation is used for each frame, proceeding from left to right until the end of a given line of a feature map of a respective frame 205a-205b.
[0047] In a typical architecture where the pooling 1/8 layer 215 is further dependent on the concatenation layer 210, the total number of buffer lines needed is given by the following formula: 2 * (2 * (2 * (8 + 2) + 8) + 8) = 128 lines. With architecture 200, the pooling 1/8 layer 215 instead receives feature maps that are prepared through a previous pipeline. Thus, in this example, by "removing" a layer, the dependency lines can be calculated as: 2 * (2 * (8 + 2) + 8) = 56 lines. The number of dependency lines may be calculated based on the number of convolutions (including deconvolutions) for a given layer, the size of a convolutional filter, the pooling factor (e.g., ratio - in this example, 1/8), and the total number of layers.
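The nested arithmetic above can be reproduced programmatically; the levels/base/halo parameterization in this sketch is an interpretation of the numbers in the example, not a formula taken verbatim from the disclosure.

```python
def dependency_lines(levels: int, base: int = 8, halo: int = 2) -> int:
    """Nested dependency-line count for the pyramid example.

    Reproduces the arithmetic in the text: the innermost level needs
    (base + halo) lines, each enclosing level doubles the count (a
    pooling ratio of 2 per level) and adds `base` more lines, and the
    final pooling stage contributes the outermost factor of 2.
    """
    lines = base + halo
    for _ in range(levels - 1):
        lines = 2 * lines + base
    return 2 * lines

# With the pooling 1/8 layer dependent on the concatenation layer:
assert dependency_lines(levels=3) == 128
# With the pooling 1/8 layer prepared in a previous pipeline ("removed"):
assert dependency_lines(levels=2) == 56
```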
[0048] In various embodiments, the reduction in dependency lines translates to reductions in storage size. Continuing with the example above, for purposes of explanation, taking each input as 12 bits, with each line having 2000 pixels and 4 channels per pixel, the total size of the dependency lines is 12 / 8 * 2000 * 4 * 56 = 672 KB. In a traditional arrangement, the dependency lines for the concatenation layer would have been 12 / 8 * 2000 * 4 * 128 = 1.5 MB. Moreover, a feature map of the pooling 1/8 layer 215 has a size of 8 bits * 250 * 188 * 8 = 376 KB. Thus, the feature map has a small size that can be processed ahead of time and stored in cache memory or on-chip memory.
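A short sketch reproducing the storage figures above; the constants mirror this specific example (12 bits per input, 2000 pixels per line, 4 channels) and are not general.

```python
BITS_PER_PIXEL = 12
PIXELS_PER_LINE = 2000
CHANNELS = 4

def line_buffer_bytes(num_lines: int) -> float:
    """Size in bytes of the dependency-line buffer for this example."""
    return BITS_PER_PIXEL / 8 * PIXELS_PER_LINE * CHANNELS * num_lines

print(line_buffer_bytes(56) / 1e3)   # 672.0 KB with the pooling 1/8 layer removed
print(line_buffer_bytes(128) / 1e6)  # ~1.5 MB in the traditional arrangement

# Pooling 1/8 feature map held on-chip instead (8 bits * 250 * 188 * 8 channels):
pool_1_8_bytes = 1 * 250 * 188 * 8
print(pool_1_8_bytes / 1e3)          # 376.0 KB
```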
[0049] Similar reductions in storage size requirements can be realized in other architectures for ML models, as will be described below with respect to Figs. 3A-3C.
[0050] Fig. 3A is a schematic block diagram of an alternative ML model architecture 300A utilizing chunks, in accordance with various embodiments. The architecture 300A includes an input layer 305, output layer 310, and skip layers 315 including a first skip layer (skip1) 315a, second skip layer (skip2) 315b, and third skip layer (skip3) 315c. It should be noted that the various elements of architecture 300A are schematically illustrated in Fig. 3A, and that modifications to the various elements and arrangements of architecture 300A may be possible and in accordance with the various embodiments.
[0051] In various embodiments, the architecture 300A may be a U-Net based ML model. In the example shown, the input layer size is 2000 x 1504 x 4 x 12 bits / pixel, and the output layer size is also 2000 x 1504 x 4 x 12 bits / pixel. The architecture further includes three skip layers 315: skip1 315a, skip2 315b, and skip3 315c. Skip1 315a, skip2 315b, and skip3 315c may refer to the outputs of the respective skip layers, such as a tile and/or frame generated based on the input data from the input layer 305. In some examples, the outputs of the respective skip layers 315 may be generated by one or more convolutional layers.
[0052] In a conventional arrangement, using layer-by-layer processing, the output of convolutional layer Conv1 and the output of the first skip layer 315a, Skip1_conv1, are held in storage, for example, in DRAM. Furthermore, the output of the first skip layer 315a, Skip1_conv1, is held until the end of the process for a given layer. Moreover, the output of the second skip layer 315b, Skip2_conv1, and the output of the third skip layer 315c, Skip3_conv1, must also be held in storage. Given the above, a minimum buffer size may be given as:
[0053] Size of buffer = (output of Skip1_conv1 + Skip2_conv1 + Skip3_conv1) + max(input_i_data + output_i_data)
[0054] Where input_i_data is the size of the input layer 305, and output_i_data is the size of the output layer 310.
[0055] Accordingly, in various embodiments, to reduce the buffer size, a chunk-by-chunk process is provided. By processing chunks, the size of the buffer becomes:
[0056] Size of buffer = (output of Skip1_conv1 + Skip2_conv1) + output of Conv4 + (line buffer size of Tile layer + line buffer size of Concat1 + line buffer size of Skip3_conv1 + line buffer size of Conv4 + line buffer size of upConv1 + line buffer size of upConv2 + line buffer size of Concat2 + line buffer size of Conv5 + line buffer size of upConv3 + line buffer size of upConv4 + line buffer size of Concat3 + line buffer size of Conv6).
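A hedged sketch of how the chunk-by-chunk buffer size above might be tallied; the layer names follow the formula, the "Conv4_lines" key is a hypothetical label distinguishing Conv4's line buffer from its full output, and the byte values supplied by the caller are placeholders.

```python
def chunk_buffer_size(sizes: dict) -> int:
    """Sum the terms of the chunk-by-chunk buffer-size formula above.

    `sizes` maps layer names to byte counts; full outputs are kept only
    for Skip1_conv1, Skip2_conv1, and Conv4, while the remaining layers
    contribute only line buffers.
    """
    full_outputs = ["Skip1_conv1", "Skip2_conv1", "Conv4"]
    line_buffers = ["Tile", "Concat1", "Skip3_conv1", "Conv4_lines",
                    "upConv1", "upConv2", "Concat2", "Conv5",
                    "upConv3", "upConv4", "Concat3", "Conv6"]
    return (sum(sizes[name] for name in full_outputs)
            + sum(sizes[name] for name in line_buffers))
```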
[0057] This results in a reduced buffer size. Due to global pooling, however, the full size of the previous layer is utilized. To address this, further embodiments set forth cutting the global pooling across two frames, as described below with respect to Fig. 3B.
[0058] Fig. 3B, like Fig. 3A, depicts a similar U-Net based architecture 300B. In contrast with Fig. 3A, however, Fig. 3B depicts a global pooling layer 320 that has been cut into two frames - a current frame (frame t) and a previous frame (frame t-1). Using a cut arrangement, only certain line buffers need to be stored rather than the outputs for the full layer. Specifically, the output of the global pooling layer 320 from a previous frame (frame t-1) may be utilized to reduce buffer size, leveraging the outputs of different layers from previous frames in off-chip memory and/or on-chip memory.
[0059] Further reductions in buffer size may be achieved using asymmetric cuts, as set forth with respect to Fig. 3C.
[0060] Like Fig. 3B, Fig. 3C depicts an architecture in which data from two previous frames are leveraged to reduce buffer size and memory requirements.
Specifically, data from images and/or video between sequential frames are often similar. For example, three sequential frames from a video often contain similar image information (e.g., of the same scene, object, etc.). Thus, in various embodiments, architecture 300C fetches the output of convolutional layer Conv4 from a first previous frame (frame t-1) immediately preceding the current frame (frame t) (e.g., output of previous frame 325), and the output of the global pooling layer from a second previous frame (frame t-2), which precedes frame t by two frames (e.g., output of second previous frame 320). These outputs may be stored in off-chip memory and/or on-chip memory prior to the processing pipeline for the current frame.
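A minimal sketch of this asymmetric cut, assuming the cached outputs can be modeled as a dictionary keyed by frame index and layer name; which intermediates feed Conv4 and the global pooling layer here, and the "rest_of_network" callable, are illustrative assumptions.

```python
def run_current_frame(t, current_input, layers, cached):
    """Sketch of the asymmetric three-frame cut described above."""
    # Outputs prepared during earlier frames' pipelines, fetched from memory.
    conv4_prev = cached[(t - 1, "Conv4")]           # from frame t-1
    gpool_prev = cached[(t - 2, "GlobalPooling")]   # from frame t-2

    # The current frame is processed using the cached tensors as inputs
    # where the network was cut, avoiding recomputation of those layers.
    output = layers["rest_of_network"](current_input, conv4_prev, gpool_prev)

    # Prepare and cache this frame's contributions for future frames.
    cached[(t, "Conv4")] = layers["Conv4"](current_input)
    cached[(t, "GlobalPooling")] = layers["GlobalPooling"](conv4_prev)
    return output
```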
[0061] Using a cut arrangement, only certain line buffers need to be stored rather than the outputs for the full layer. Specifically, the output of global pooling from the previous frame (frame t-1) may be utilized to reduce buffer size by storing it within cache memory. Thus, a further reduction in buffer size may be realized, resulting in lower storage requirements and, in turn, reduced power consumption.
[0062] Fig. 4 is a flow diagram of a method 400 for reducing power consumption in ML models, in accordance with various embodiments. The method 400 includes, at block 405, obtaining input data for a current frame to be processed by an ML model. As previously described, the ML model may include, without limitation, a neural network, such as, without limitation, a CNN, ResNet, RNN, U-Net, FFN, MLP, transformer, or other suitable neural network. In various embodiments, processing by the ML model may include encoding and/or decoding of the input data (or a feature map of the input data) of the current frame. As used in this context, input data may include, without limitation, raw data (e.g., unprocessed image data, or other unprocessed source data), a feature map, or other processed data.
[0063] At block 410, the method 400 continues by generating a feature map of the frame. As previously described, a feature extraction network may be configured to generate the feature map. The feature extraction network itself may include various types of CNNs, such as a ResNet, RNN, R-CNN, etc., configured to extract and/or identify features in the current frame, and further to generate respective feature maps from other respective frames (e.g., previous frames and/or future frames).
[0064] The method 400 continues, at block 415, by obtaining an output of at least one layer of a neural network (e.g., ML model) of at least one previous frame from on-chip memory. As previously described, in some embodiments, the one or more outputs of a layer from a neural network (e.g., ML model) for one or more previous frames may be stored in on-chip memory, such as SRAM, of an SoC, IC, or other computing device. In various embodiments, the ML model may utilize an architecture that includes an input layer, an output layer, and one or more hidden layers. As previously described, in some examples, the ML model may be a pyramid architecture that includes one or more pooling layers as part of the one or more hidden layers. Thus, in some examples, an output of a pooling 1/8 layer from a previous frame (e.g., frame t-1) may be obtained (e.g., fetched) from SRAM. As previously described, the reduction in buffer requirements enabled by storing these outputs exceeds the size of the stored outputs themselves.
[0065] In yet further examples, the ML model may employ a U-Net based architecture. Accordingly, in some embodiments, the one or more hidden layers may include a global pooling layer, one or more convolution layers, one or more skip layers, one or more concatenation layers, etc. In some examples, an output of a global pooling layer from a first previous frame (frame t-1), i.e., the frame immediately preceding the current frame in a sequence of frames, may be obtained from SRAM and/or DRAM storage.
[0066] In further examples, as previously described, outputs from layers of multiple previous frames may be utilized to "cut" a given network into three frames. In other words, the layers of a neural network may be "cut" into three different groups of layers. The groups of layers may, in some examples, be asymmetric. An input to a group of layers may, in some examples, be obtained from an output of a layer (e.g., a preceding layer) from a preceding frame.
[0067] For example, an output of a convolutional layer from a first previous frame (frame t-1) may be fetched and utilized as the input for a global pooling layer of a current frame (frame t). The output of a global pooling layer from a second previous frame (frame t-2), which precedes frame t by two frames, may be taken as input to a chunking and/or tiling layer, which may further be utilized as an input to a concatenation layer for the current frame. These outputs may be stored and fetched from SRAM prior to the processing pipeline for the current frame. The method 400 continues, at block 420, by processing the current frame using the output of the at least one layer of the at least one previous frame as an input to at least one layer of the current frame. In various embodiments, processing of a current frame by an ML model may include encoding and/or decoding of the current frame (e.g., a feature map or input data) via an ML model. Thus, the outputs of a hidden layer of a previous frame may be taken as an input to a hidden layer of an encoder and/or decoder of an encoder-decoder network. In some examples, the output of a first hidden layer of a previous frame may be taken as input to a second hidden layer for a current frame, where the second hidden layer immediately follows the first hidden layer of a network / ML model.
[0068] At block 425, the method 400 further includes storing the output of at least one layer of a current frame in on-chip memory. As described above, an output of a layer for a current frame may be stored in on-chip memory for use in the processing of a future frame. In some examples, as previously described, the output of a layer for a current frame may include the output of a layer whose input is obtained from a previous frame. For example, the output of a first layer of a second previous frame (frame t-2) may be used as input to a second layer succeeding the first layer. The output of the second layer may then be stored in on-chip memory to be used later as input to a third layer succeeding the second layer.
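Putting blocks 405 through 425 together, a hedged end-to-end sketch might look as follows; on_chip stands in for SRAM (or other on-chip memory), and the layer names, the "injected" argument, and the model's return structure are hypothetical.

```python
def method_400(current_frame, t, model, feature_extractor, on_chip):
    """End-to-end sketch of the flow of Fig. 4 (blocks 405-425).

    This is an illustration under stated assumptions, not the claimed
    method itself; `on_chip` is modeled as a dict keyed by (frame, layer).
    """
    # Block 405 / 410: obtain input data and generate its feature map.
    feature_map = feature_extractor(current_frame)

    # Block 415: fetch a hidden-layer output cached for a previous frame.
    prev_hidden = on_chip.get((t - 1, "hidden_layer_1"))

    # Block 420: process the current frame, feeding the cached output into
    # the hidden layer that immediately follows it in the network.
    result, hidden_outputs = model(feature_map, injected=prev_hidden)

    # Block 425: store a hidden-layer output of the current frame for reuse.
    on_chip[(t, "hidden_layer_1")] = hidden_outputs["hidden_layer_1"]
    return result
```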
[0069] The techniques and processes described above with respect to various embodiments may be performed by one or more computer systems. Fig. 5 is a schematic block diagram of a computer system 500 for reducing power consumption in ML models. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500, such as the system 100, and ML model architectures, such as architectures 200 and 300A-300C, or subsystems thereof, which may perform the methods provided by various other embodiments, as described herein. It should be noted that Fig. 5 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a separated or integrated manner.
[0070] The computer system 500 includes multiple hardware elements that may be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 515, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, and/or the like.
[0071] The computer system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM") and/or a read-only memory ("ROM”), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
[0072] The computer system 500 might also include a communications subsystem 530, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or a low-power wireless device. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein. In many embodiments, the computer system 500 further comprises a working memory 535, which can include a RAM or ROM device, such as DRAM, SRAM, etc., as described above.
[0073] The computer system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0074] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0075] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware (such as programmable logic controllers, single board computers, FPGAs, ASICs, and SoCs) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0076] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 500 in response to processor 510 executing one or more sequences of one or more instructions (which may be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
[0077] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the computer system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
[0078] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0079] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0080] The communications subsystem 530 (and/or components thereof) generally receives the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
[0081] While some features and aspects have been described with respect to the embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while some functionality is ascribed to one or more system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with the several embodiments.
[0082] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with or without some features for ease of description and to illustrate aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method comprising: obtaining input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers; obtaining, via on-chip memory, at least one first output of at least one first hidden layer of the two or more hidden layers of a respective at least one previous frame; and processing, via the machine learning model, the current frame utilizing the at least one first output as input to at least one second hidden layer of the two or more hidden layers for the current frame.
2. The method of claim 1, further comprising: storing, via the on-chip memory, at least one second output of at least one third hidden layer of the two or more hidden layers of the current frame.
3. The method of claim 1, wherein the at least one previous frame includes a first previous frame and a second previous frame.
4. The method of claim 3, wherein utilizing the at least one first output as input to the at least one second hidden layer for the current frame further comprises: utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame; and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
5. The method of claim 1, wherein the on-chip memory includes static random access memory.
6. The method of claim 1, wherein the machine learning model is a convolutional neural network.
7. The method of claim 1, wherein the at least one first hidden layer includes at least one of a pooling layer or convolutional layer.
8. The method of claim 1, wherein the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
9. A non-transitory computer readable medium in communication with a processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: obtain input data of a current frame for processing by a machine learning model, the machine learning model including two or more hidden layers; obtain, via on-chip memory, a first output of a first hidden layer of the two or more hidden layers of a respective previous frame; and process, via the machine learning model, the current frame using the first output as input to a second hidden layer of the two or more hidden layers for the current frame.
10. The non-transitory computer readable medium of claim 9, wherein the set of instructions is further executable by the processor to: store, via the on-chip memory, a second output of a third hidden layer of the two or more hidden layers of the current frame.
11. The non-transitory computer readable medium of claim 9, wherein the set of instructions is further executable by the processor to: obtain, via on-chip memory, a second output of a second hidden layer of the two or more hidden layers of a second previous frame; and process, via the machine learning model, the current frame using the second output as input to a third hidden layer of the two or more hidden layers for the current frame.
12. The non-transitory computer readable medium of claim 9, wherein the on-chip memory includes static random access memory.
13. The non-transitory computer readable medium of claim 9, wherein the machine learning model is a convolutional neural network.
14. The non-transitory computer readable medium of claim 9, wherein the first hidden layer includes at least one of a pooling layer or convolutional layer.
15. The non-transitory computer readable medium of claim 9, wherein the second hidden layer includes at least one of a pooling layer or convolutional layer.
16. A system comprising: on-chip memory; one or more processing logic blocks of a machine learning model, the one or more processing logic blocks comprising one or more hidden layers of the machine learning model; a processor; a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: obtain input data of a current frame for processing by the one or more processing logic blocks of the machine learning model; obtain, via the on-chip memory, at least one first output of at least one first hidden layer of the one or more hidden layers of a respective at least one previous frame; and process, via the machine learning model, the current frame using the at least one first output as input to at least one second hidden layer of the one or more hidden layers for the current frame.
17. The system of claim 16, wherein the set of instructions is further executable by the processor to: store, via the on-chip memory, at least one second output of at least one third hidden layer of the one or more hidden layers of the current frame.
18. The system of claim 16, wherein the at least one previous frame includes a first previous frame and a second previous frame, wherein utilizing the at least one first output as input to the at least one second hidden layer for the current frame further comprises: utilizing a first previous output associated with the first previous frame, the at least one first output including the first previous output, as input to a first current hidden layer of the at least one second hidden layer for the current frame; and utilizing a second previous output associated with the second previous frame, the at least one first output including the second previous output, as input to a second current hidden layer of the at least one second hidden layer for the current frame.
19. The system of claim 16, wherein the on-chip memory includes static random access memory.
20. The system of claim 16, wherein the at least one first hidden layer includes at least one of a pooling layer or convolutional layer, and wherein the at least one second hidden layer includes at least one of a pooling layer or convolutional layer.
PCT/US2022/051409 2022-11-30 2022-11-30 Reduced power consumption in machine learning models WO2024118071A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/051409 WO2024118071A1 (en) 2022-11-30 2022-11-30 Reduced power consumption in machine learning models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/051409 WO2024118071A1 (en) 2022-11-30 2022-11-30 Reduced power consumption in machine learning models

Publications (1)

Publication Number Publication Date
WO2024118071A1 true WO2024118071A1 (en) 2024-06-06

Family

ID=91324644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/051409 WO2024118071A1 (en) 2022-11-30 2022-11-30 Reduced power consumption in machine learning models

Country Status (1)

Country Link
WO (1) WO2024118071A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210383201A1 (en) * 2018-10-19 2021-12-09 Northwestern University Design and optimization of edge computing distributed neural processor for wearable devices
WO2022072659A1 (en) * 2020-10-01 2022-04-07 Beijing Dajia Internet Information Technology Co., Ltd. Video coding with neural network based in-loop filtering
US20220337824A1 (en) * 2021-04-07 2022-10-20 Beijing Dajia Internet Information Technology Co., Ltd. System and method for applying neural network based sample adaptive offset for video coding

