CN112016681B - Decomposition of machine learning operations - Google Patents

Decomposition of machine learning operations

Info

Publication number
CN112016681B
Authority
CN
China
Prior art keywords
operations
model
electronic device
output
memory
Prior art date
Legal status
Active
Application number
CN202010374389.1A
Other languages
Chinese (zh)
Other versions
CN112016681A (en)
Inventor
C·凯特萨克里斯
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Priority claimed from US16/601,507 external-priority patent/US11687789B2/en
Application filed by Apple Inc filed Critical Apple Inc
Publication of CN112016681A publication Critical patent/CN112016681A/en
Application granted granted Critical
Publication of CN112016681B publication Critical patent/CN112016681B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The present disclosure relates to decomposition of machine learning operations. The subject technology receives a representation of a neural network (NN) model to be executed on an electronic device, the representation of the NN model including nodes corresponding to intermediate layers of the NN model. The subject technology determines, for the respective operation corresponding to each node in each respective intermediate layer of the NN model, a respective set of operations that is mathematically equivalent to the respective operation, such that an aggregation of the outputs of the respective set of operations is equivalent to the output of the respective operation. The subject technology generates a graph based on each respective set of operations, where the graph includes a set of branches, each branch including a plurality of operations. The subject technology determines a respective order for executing each branch of the graph.

Description

Decomposition of machine learning operations
Cross Reference to Related Applications
The present application claims the benefit of U.S. Provisional Patent Application Serial No. 62/855,850, entitled "DECOMPOSITION OF MACHINE LEARNING OPERATIONS," filed May 31, 2019, which is hereby incorporated by reference in its entirety and made a part of this U.S. patent application for all purposes.
Technical Field
The present description relates generally to machine learning operations, including decomposing machine learning operations to perform more efficiently on a target platform.
Background
Software engineers and scientists have been using computer hardware for machine learning to make improvements across different industry applications, including image classification, video analytics, speech recognition, natural language processing, and the like. Notably, neural networks are being utilized more frequently to create systems that can perform different computing tasks based on training from large amounts of data.
Drawings
Some features of the subject technology are shown in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.
FIG. 1 illustrates an example network environment in accordance with one or more implementations.
FIG. 2 illustrates an example software architecture for performing the decomposition process for operation of the neural network in accordance with one or more implementations.
FIG. 3 illustrates example data flows from various nodes in a portion of a neural network in accordance with one or more implementations.
FIG. 4 illustrates an example data flow after a portion of the neural network depicted in FIG. 3 has undergone a decomposition process, in accordance with one or more implementations.
FIG. 5 illustrates an example of a first neural network and a second neural network that have undergone a decomposition process in accordance with one or more implementations.
FIG. 6 illustrates a flow diagram of an example process for performing a decomposition process for a neural network in accordance with one or more implementations.
FIG. 7 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.
Detailed Description
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated in and constitute a part of this specification. The specific embodiments include specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein, but may be practiced with one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
In recent years, the popularity of machine learning has increased significantly due to the availability of large amounts of training data and the advancement of more powerful and efficient computing hardware. One popular machine learning technique is to utilize a deep neural network to perform a set of machine learning tasks. A common approach for training a deep neural network is to use a graphics processing unit (GPU), and the same approach is commonly used to execute the deep neural network on new input data after training. However, in some instances, when a given deep neural network is executed, different operations of the deep neural network may require memory accesses (e.g., writing to and/or reading from) slower memory (e.g., off-chip memory) because the output of an operation is too large to be stored in faster memory, such as an on-chip cache (e.g., L1, L2) of the target device executing the network. For example, the output of a node in the deep neural network may provide data that is too large to be stored in an on-chip cache of the target device executing the deep neural network; instead, such data is stored in a slower memory such as DRAM. Consequently, the deep neural network may execute more slowly as data is read from and written to DRAM.
Implementations of the subject technology described herein reduce memory traffic for each operation of a neural network by performing a decomposition process that splits a given operation of the neural network into various operations having outputs that can fit within a cache of a target device executing the network. Thus, the performance of the neural network may be improved by avoiding access to slower memory (e.g., DRAM), which may be necessary when the decomposition process is not performed. Advantageously, the accuracy of the network is not affected by the decomposition process described herein. Accordingly, these benefits are understood to improve the computing functionality of a given electronic device, such as an end-user device, which may generally have fewer available computing resources than, for example, one or more cloud-based servers.
FIG. 1 illustrates an example network environment 100 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
The network environment 100 includes an electronic device 110, an electronic device 115, and a server 120. Network 106 may communicatively couple (directly or indirectly) electronic device 110 and/or server 120, electronic device 115 and/or server 120, and/or electronic device 110 and/or electronic device 115. In one or more implementations, the network 106 may be an interconnection network that may include the internet or devices communicatively coupled to the internet. For purposes of explanation, network environment 100 is shown in FIG. 1 as including electronic device 110, electronic device 115, and server 120; however, network environment 100 may include any number of electronic devices and any number of servers.
The electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smart phone, a peripheral device (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, etc. In fig. 1, by way of example, electronic device 110 is depicted as a desktop computer. The electronic device 110 may be and/or may include all or part of an electronic system discussed below with respect to fig. 7.
In one or more implementations, the electronic device 110 can provide a system for splitting operations from a neural network model into code (e.g., C code, C++ code, Swift code) of a particular programming language. In particular, the subject system can include a neural network compiler for compiling the code. In an example, using the compiled code, the subject system can create an executable software package for deployment on a target platform (such as the electronic device 115) with the assistance of the server 120. When executing the compiled code, the target platform can perform one or more given operations of the neural network model.
The electronic device 115 may be, for example, a portable computing device such as a laptop computer, a smart phone, a peripheral device (e.g., a digital camera, an earphone), a tablet device, a wearable device such as a watch, a band, etc., or any electronic device. The electronic device may also include processors with different computing capabilities, including, for example, a CPU, GPU, and/or a neural processor. In fig. 1, by way of example, the electronic device 115 is depicted as a smart phone device. In one or more implementations, the electronic device 115 may be and/or may include all or a portion of the electronic devices discussed below with respect to the electronic system discussed below with respect to fig. 7.
In one or more implementations, the server 120 deploys compiled code included in the executable software package to the target device for execution. In an example, the electronic device 115 may be a target device for receiving a software package with compiled neural network code and for executing the compiled code in a runtime environment of the electronic device 115. The electronic device 115 (or any electronic device that is a target device) includes a framework such that the framework is capable of performing operations in compiled code of the neural network. A framework may refer to a software environment that provides specific functionality as part of a larger software platform to facilitate development of software applications.
FIG. 2 illustrates an example software architecture for performing the decomposition process for operation of the neural network in accordance with one or more implementations. For purposes of illustration, the software architecture is described as being provided by the electronic device 110 of fig. 1, such as by a processor and/or memory of the electronic device 110; however, the software architecture may be implemented by any other electronic device. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the computing architecture includes a neural network compiler 215. The memory 240 includes Neural Network (NN) model source code 244 that, after compilation by the neural network compiler 215, generates a Neural Network (NN) binary executable 242 that may be deployed to different target platforms for execution. In an example, the NN model source code 244 may include code for various algorithms that may be utilized, alone or in combination, to implement particular functions for execution on a given target device. As described above, the target device may include various hardware sensors and different processors (e.g., as provided by the electronic device 115) that may be utilized when running the NN binary executable 242 on the target device. In examples, the specific functions may include image processing or computer vision related functions, speech recognition, natural language processing, and the like.
Although in the example of fig. 2, the neural network compiler 215 is provided on the electronic device 110, in some implementations, such a compiler may be provided on a particular electronic device (e.g., the electronic device 115) that compiles source code locally and executes the compiled code on the same device. In implementations, the NN model source code 244 may be compiled for a particular target platform and then deployed to a different device (such as the electronic device 115) for execution. In an example, the NN model source code 244 may include at least code corresponding to a set of operations to be performed by corresponding nodes from each layer of a given NN model. In an example, code of an operation in a layer of the NN is a respective function call for performing the operation and/or a set of parameters for the function call. Additionally, code corresponding to one or more input and output features, data structures, and feature types may be included in the NN model source code 244.
As further shown, the neural network compiler 215 includes an operation decomposition engine 230 that performs a decomposition process on the NN model source code 244 to split corresponding operations of nodes of the NN model into various decomposed operations (e.g., operations that produce outputs with reduced data sizes that fit within a cache). The operation decomposition engine 230 also serves as a scheduling component that determines the order in which the decomposed operations are performed. The decomposition process described herein refers to the following: for a given operation at a node, generating operations (e.g., decomposed operations) that produce outputs having reduced sizes, each such operation providing a particular output of a particular size such that the output can be stored within a cache (such as an on-chip cache) of a target device (e.g., the electronic device 115). In an example, the size of the cache (such as the L2 cache 252 on the electronic device 115) is determined by the underlying hardware architecture of the given electronic device. Thus, the respective output size of each decomposed operation is constrained by the size of the cache of the electronic device (e.g., the size of the L2 cache 252 on the electronic device 115). In addition, the operation decomposition engine 230 performs the decomposition process to ensure that, for a given node, the aggregation of the outputs of the split operations is equal to the output of the node's operation prior to the split. In particular implementations, the output of the foregoing operations may be in the form of a data structure, such as a container (e.g., a tensor) that can store data in N dimensions (e.g., a matrix, a vector, an array of arrays, etc.).
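The cache-size constraint described above can be pictured with a short sketch. The Python snippet below is only a simplified illustration, not the patent's implementation: it assumes 2-byte (FP16) elements, a height-wise split, and a hypothetical 2 MiB cache budget, and its function names are invented for illustration.

# Simplified sketch: decide whether a layer's output tensor fits the on-chip
# cache and, if not, how many height-wise pieces the computation could be split
# into so that each piece's output fits. Assumes 2-byte (FP16) elements.
from math import ceil

def tensor_bytes(channels: int, height: int, width: int, dtype_bytes: int = 2) -> int:
    """Size in bytes of a (channels, height, width) tensor."""
    return channels * height * width * dtype_bytes

def split_count(channels: int, height: int, width: int,
                cache_bytes: int, dtype_bytes: int = 2) -> int:
    """Smallest number of height-wise pieces whose outputs each fit the cache."""
    total = tensor_bytes(channels, height, width, dtype_bytes)
    return max(1, ceil(total / cache_bytes))

# Example: a 64x100x200 FP16 output against a hypothetical 2 MiB cache budget.
print(split_count(64, 100, 200, cache_bytes=2 * 1024 * 1024))  # -> 2 pieces of 64x50x200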
The operation decomposition engine 230 may obtain source code from the NN model source code 244 and perform decomposition of operations corresponding to nodes of the NN model represented in the NN model source code 244. In an example, code corresponding to the break up operation may be included in the source code. The neural network compiler 215 retrieves source code from the operational decomposition engine 230 and compiles the code into an NN binary executable for the target device, which can be stored in the neural network binary executable 242 and then deployed to the target device (e.g., the electronic device 115) for execution.
Although in the example of fig. 2, the neural network compiler 215 is provided on the electronic device 110, in some implementations, such a compiler may be provided on a particular electronic device that compiles code for the neural network model and executes the compiled neural network model on the same device.
As described above, the neural network model may be compiled from the NN model source code 244 for a particular target platform and then deployed to a different device (such as the electronic device 115) for execution. As further shown, in implementations, the electronic device 115 includes a system on a chip (SoC) 250. The SoC 250 includes an L2 cache 252, a CPU 254, a GPU 255, and a neural processor 256. The electronic device 115 also includes DRAM 258, which is slower to access than the L2 cache 252. Accessing the DRAM 258 consumes computing resources of the electronic device 115 because it requires a significant amount of power, and it may affect the performance of the NN model by slowing down memory-bound layers (e.g., pooling layers, element-wise layers, etc.) of the NN. In contrast, in implementations, the L2 cache 252 is very fast, but its size is significantly smaller than that of the DRAM 258. Thus, many outputs of the operations of the NN model typically will not fit into the L2 cache 252. For purposes of illustration, the on-chip cache in FIG. 2 is depicted as the L2 cache 252; however, the on-chip cache may be any level of cache, such as L1, L2, L3, L4, and the like. As shown, in implementations, the L2 cache 252 is included as part of the neural processor 256, and consequently other processors on the SoC 250 cannot access the L2 cache 252.
Recently, specialized hardware has been developed that is optimized for performing particular operations from a given NN. A given electronic device may include a neural processor 256, which may be implemented as circuitry that performs various machine learning operations based on computations including multiplication, addition, and summation. Such computations may be arranged to perform, for example, convolution of input data. In an example, a neural processor is specifically configured to execute machine learning algorithms, typically by operating on predictive models such as NNs. In one or more implementations, an electronic device may include a neural processor 256 in addition to the CPU 254 and/or the GPU 255.
As discussed herein, CPU 254 may refer to a main processor in a given electronic device that performs the basic arithmetic, logic, control, and input/output operations specified by the instructions of a computer program or application, including some operations for a neural network model. As discussed herein, GPU255 may refer to special-purpose electronic circuitry designed to perform operations for rendering graphics, which in many instances is also utilized to process computational workloads for machine learning operations (e.g., as specified by instructions of a computer program or application). CPU 254, GPU255, and neural processor 256 may each have different computational specifications and capabilities depending on their respective implementations, where each of the foregoing components may provide a different degree of performance for certain operations than other components.
As discussed herein, a convolutional neural network refers to a particular type of neural network that uses different types of layers made up of nodes existing in three dimensions, where the dimensions may differ between layers. In a convolutional neural network, nodes in a layer may be connected to only a subset of the nodes in the previous layer. The final output layer may be fully connected and may be sized according to the number of classifiers. In some examples, a convolutional neural network model may include various combinations, multiples, and orderings of the following types of layers: an input layer, a convolutional layer, a pooling layer, a rectified linear unit (ReLU) layer, and a fully connected layer. Part of the operations performed by a convolutional neural network includes applying a set of filters (or kernels) that iterate over the input data based on one or more parameters. In an example, the depth of a convolutional layer may be equal to the number of filters used. It should be appreciated that, given the hyper-parameters of the convolutional neural network, the size of the different volumes at each layer can be mathematically determined.
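As a reference for the claim that layer volumes can be determined mathematically from the hyper-parameters, the textbook output-size arithmetic for a square-kernel convolution is sketched below; the parameter names are illustrative and are not taken from the patent.

# Textbook convolution output-size arithmetic (illustrative, not from the patent).
def conv_output_dims(in_h: int, in_w: int, kernel: int, stride: int = 1,
                     padding: int = 0, num_filters: int = 1):
    """Output (channels, height, width) of a square-kernel convolution layer."""
    out_h = (in_h + 2 * padding - kernel) // stride + 1
    out_w = (in_w + 2 * padding - kernel) // stride + 1
    return num_filters, out_h, out_w

# A 3x3 convolution with stride 1 and no padding shrinks each spatial dimension
# by 2, and the output depth equals the number of filters:
print(conv_output_dims(102, 202, kernel=3, num_filters=64))  # -> (64, 100, 200)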
Convolutional neural networks typically run on cloud-based computing platforms due to the volume of data being processed. In such instances, memory management is often an afterthought because cloud-based systems have few practical memory limitations (e.g., more computing power and/or memory can be made available on demand). In contrast, it may not be possible or practical to store all of the weights and resulting node values of a convolutional neural network in memory on a memory-constrained device (e.g., a mobile electronic device such as a smartphone).
FIG. 3 illustrates example data flows from various nodes in a portion of a neural network 300 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the portion of the neural network 300 includes data 302, data 304, and data 306. This portion of the neural network 300 includes a pooling layer 303 and a convolution layer 305. In the example of fig. 3, the portion of the neural network 300 has not undergone a decomposition process by the operational decomposition engine 230. At time t0, data 302 corresponds to an output of size 4 Megabytes (MB). At time t1, data 304 corresponds to an output of size 4 Megabytes (MB). At time t2, data 306 corresponds to an output of size 2.4 Megabytes (MB). In fig. 3, 8MB of data is transferred between the pooling layer 303 and the convolution layer 305 without utilizing a cache. In an example, the pooling layer 303 performs a pooling operation on the data 302, which is received by the pooling layer 303 as input data. The pooling operation may include a maximum pooling operation (e.g., reporting maximum output within a rectangular neighborhood), an average of a rectangular neighborhood, a euclidean norm of a rectangular neighborhood, or a weighted average based on distance from a center pixel. The convolution layer 305 performs a convolution operation on the data 304 and provides data 306 as an output. In an example, the convolution operation may include performing affine transformation and/or filtering.
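The sizes quoted above are consistent with the tensor dimensions used in the worked example later in this description (T0 = 96×104×204, T1 = 96×102×202, T2 = 64×100×200), assuming 2-byte (e.g., FP16) elements; the element type is an assumption, since it is not stated in the text. A quick check:

# Rough size check for the FIG. 3 tensors under the assumptions stated above.
def mib(channels: int, height: int, width: int, dtype_bytes: int = 2) -> float:
    return channels * height * width * dtype_bytes / (1024 * 1024)

print(f"T0 (data 302): {mib(96, 104, 204):.1f} MiB")  # ~3.9 MiB, quoted as ~4 MB
print(f"T1 (data 304): {mib(96, 102, 202):.1f} MiB")  # ~3.8 MiB, quoted as ~4 MB
print(f"T2 (data 306): {mib(64, 100, 200):.1f} MiB")  # ~2.4 MiB, quoted as 2.4 MB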
The following discussion describes an example of the decomposition process and how such a mechanism may be realized.
The following steps are provided for a simple convolution with optional padding; in this example, the generalization can account for other parameters (e.g., stride):
1) Define the "logical region" as the coordinate range plus the padding, if any, in all directions (top, right, bottom, left).
2) Construct a function that identifies the logical region of the input that is required to produce a given logical region of the output.
3) Apply this derivation of all the involved regions recursively in a bottom-up fashion (from the output of the last operator toward the input of the first operator).
For the example of FIG. 3, the following procedure may be performed:
Pick a splitting factor, e.g., height = 50, width = 200.
This defines two output regions: (0,0,0)..(63,49,199) and (0,50,0)..(63,99,199).
Here the regions are given as 0-based (channel, height, width) coordinate ranges.
For the last operator (the 3×3 convolution):
- 306 (T2) region (0,0,0)..(63,49,199) requires input (0,0,0)..(95,51,201) from 304 (T1)
- 306 (T2) region (0,50,0)..(63,99,199) requires input (0,50,0)..(95,101,201) from 304 (T1)
For the first operator (the 3×3 pooling):
- 304 (T1) region (0,0,0)..(95,51,201) requires input (0,0,0)..(95,53,203) from 302 (T0)
- 304 (T1) region (0,50,0)..(95,101,201) requires input (0,50,0)..(95,103,203) from 302 (T0)
The intermediate/temporary tensors (i.e., T1(a) and T1(b), corresponding to data 404 and data 405 in FIG. 4) will have the dimensions of their respective regions:
- T1(a), covering logical 304 (T1) region (0,0,0)..(95,51,201): 96×52×202
- T1(b), covering logical 304 (T1) region (0,50,0)..(95,101,201): 96×52×202
In the above example, the presence of an input stride would expand the region of the required input, while the presence of padding means that some regions may have padding present while other regions do not. A sketch of this region derivation is shown below.
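The following Python sketch reproduces the region derivation above for the stride-1, unpadded 3×3 operators of FIG. 3; the helper function and its parameters are illustrative and are not taken from the patent.

# Sketch of the region derivation in steps 1)-3), for stride-1, unpadded 3x3
# operators (both the pooling and the convolution shrink each spatial dimension
# by kernel - 1). Regions are inclusive, 0-based (channel, height, width) ranges.
def input_region_for(out_region, kernel=3, in_channels=None):
    (c0, h0, w0), (c1, h1, w1) = out_region
    # With stride 1 and no padding, an output row/column range [a, b] needs the
    # input range [a, b + kernel - 1]. A convolution needs every input channel
    # (pass in_channels); per-channel pooling keeps the same channel range.
    lo_c, hi_c = (0, in_channels - 1) if in_channels is not None else (c0, c1)
    return (lo_c, h0, w0), (hi_c, h1 + kernel - 1, w1 + kernel - 1)

# Split T2 (64x100x200) into two height-50 halves and walk backwards:
for t2_region in [((0, 0, 0), (63, 49, 199)), ((0, 50, 0), (63, 99, 199))]:
    t1_region = input_region_for(t2_region, kernel=3, in_channels=96)  # convolution 3x3
    t0_region = input_region_for(t1_region, kernel=3)                  # pooling 3x3 (per channel)
    print(t2_region, "<-", t1_region, "<-", t0_region)
# ((0, 0, 0), (63, 49, 199)) <- ((0, 0, 0), (95, 51, 201)) <- ((0, 0, 0), (95, 53, 203))
# ((0, 50, 0), (63, 99, 199)) <- ((0, 50, 0), (95, 101, 201)) <- ((0, 50, 0), (95, 103, 203))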
FIG. 4 illustrates an example data flow 400 after a portion of the neural network 300 depicted in FIG. 3 has undergone a decomposition process, in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided. Fig. 4 will be discussed with reference to the components of fig. 3.
In the example of fig. 4, the operational decomposition engine 230 has performed a decomposition process on the pooling layer 303 and the convolution layer 305 of the portion of the neural network 300 in fig. 3. For example, pooling layer 303 has been split into pooling layer 410 and pooling layer 420. Thus, pooled layer 410 and pooled layer 420 have been generated as a result of the decomposition process performed by operational decomposition engine 230. In addition, convolutional layer 305 has been split into convolutional layer 412 and convolutional layer 422, which are the result of the decomposition process performed by operational decomposition engine 230.
As further shown, the data 402 is provided as input to a pooling layer 410 that performs a pooling operation. The pooling layer 410 provides as output data 404 that is received as input data to the convolution layer 412. The convolution layer 412 performs convolution operations on the data 404 and provides the data 406 as an output.
As further shown, data 403 is provided as input to a pooling layer 420 that performs pooling operations. The pooling layer 420 provides data 405 as an output that is received as input data to the convolution layer 422. The convolution layer 422 performs a convolution operation on the data 405 and provides the data 407 as an output.
As shown in FIG. 4, the hatched regions corresponding to the data 406 and the data 407 are each (64×50×200). In this example, the data 406 and the data 407 correspond to the data 306 (T2) in FIG. 3. The hatched regions are shown projected onto the data 306 (T2) because the results are stored into the original T2 corresponding to the data 306. Thus, in this example, T2 itself is not decomposed; only the computation is.
Further, in this example, T0 (e.g., corresponding to the data 302) is not decomposed into smaller objects; instead, the regions required for the computation are read directly from T0. As shown, the shaded regions in the data 402 and the data 403 are each (96×54×204), and the data 402 and the data 403 collectively cover the data 302 (T0). In this example, the two regions overlap each other.
In the example of FIG. 4, each object is produced with 2 fewer elements in height and width than its input. The type of layer, the kernel dimensions, the stride, and the padding determine how much input is needed per 1×1 output. Although in this example the input requirements of the convolutional layer and the pooling layer are the same, this is shown as such only for simplicity.
In the example of FIG. 4, the data 404 and the data 405 have common elements (the two middle rows). This amounts to redundant computation and redundant reads of T0. In this example, the convolution layer 412 and the convolution layer 422 use the same coefficients for the convolution, which are also read from DRAM.
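The cost of this redundancy can be estimated with a short sketch; the dimensions, the 2-byte element size, and the filter counts below are assumptions carried over from the earlier example, and the variable names are illustrative.

# Rough accounting of the redundancy introduced by the split in FIG. 4.
overlap_rows = 51 - 50 + 1               # T1 rows 50..51 are computed by both branches
redundant_elems = 96 * overlap_rows * 202
print(redundant_elems * 2, "bytes of T1 recomputed")   # ~78 KB, assuming FP16

# The 3x3 convolution coefficients (96 input channels, 64 filters assumed) are
# re-read from DRAM once per additional branch:
coeff_bytes = 64 * 96 * 3 * 3 * 2
print(coeff_bytes, "bytes of coefficients re-read per extra branch")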
Fig. 5 illustrates an example of a neural network 500 and a neural network 550 that have undergone a decomposition process in accordance with one or more implementations.
A neural network (NN) may be represented as a directed graph with a single path (e.g., a producer/consumer chain) that includes 1) respective nodes representing the operations of the layers of the NN, and 2) other nodes representing the respective output of each of the operation nodes. In such a graph, a node representing the output of one node (corresponding to an operation) is consumed by a subsequent node corresponding to a different operation in the graph. The following example illustrates this concept:
[node 1: operation 1] → [node 2: output of operation 1] → [node 3: operation 2] → [node 4: output of operation 2], and so on.
In the above example, memory (e.g., cache or DRAM) is required to store the outputs corresponding to node 2 and node 4. The operation decomposition engine 230 of the NN compiler 215 determines how to split the operations in each layer of the NN 500 into multiple branches, wherein each branch includes multiple operations, and wherein the objects generated by the operations in each branch will fit into a cache (e.g., the L2 cache 252) on the target device. In an example, the aggregation of the outputs of these branches is mathematically equivalent to the outputs of the corresponding operations in the layers of the NN (e.g., the original NN before being split into branches). Furthermore, each branch is fully executed before switching to execute another branch.
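A minimal sketch of this representation and of the per-branch cache constraint is given below; the class and field names are invented for illustration and do not come from the patent.

# Minimal sketch: operation nodes alternate with the tensor nodes holding their
# outputs, and a branch is acceptable only when every output produced inside it
# fits the target's cache.
from dataclasses import dataclass, field

@dataclass
class OpNode:
    name: str
    output_bytes: int                 # size of the tensor node this operation produces

@dataclass
class Branch:
    ops: list = field(default_factory=list)

    def fits_cache(self, cache_bytes: int) -> bool:
        return all(op.output_bytes <= cache_bytes for op in self.ops)

chain = [OpNode("pool_3x3", 4_000_000), OpNode("conv_3x3", 2_500_000)]
print(Branch(chain).fits_cache(cache_bytes=2 * 1024 * 1024))  # False -> split needed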
In implementations, the operation decomposition engine 230 determines 1) the size of objects corresponding to input or output data and 2) the order of the objects to enable such objects to be stored in a cache that is directly accessible from the SoC 250. Since caches (e.g., L2 cache 252) are used to store such objects where possible, memory traffic to DRAMs (e.g., DRAM 258) is reduced.
The operation decomposition engine 230 weighs a tradeoff between: 1) generating more branches, where each branch introduces additional computational overhead for redundant operations; or 2) performing the operation without splitting it into multiple operations, resulting in slower memory accesses to DRAM for data that does not fit into the cache. Temporary objects are determined by the neural network compiler 215 to ensure that such objects are short-lived (e.g., produced and consumed within a short amount of time, and then released to free up cache memory).
In an example, the operation decomposition engine 230 determines a first set of operations for the respective operation and determines a second set of operations for the respective operation. The operation decomposition engine 230 selects one of the first set of operations or the second set of operations based at least in part on an analysis of the first set of operations and the second set of operations. In an example, the analysis may indicate which of the first set of operations and the second set of operations utilize less resources (e.g., memory) to facilitate selection by the operations decomposition engine 230.
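One hedged way to picture the selection between candidate sets of operations is a simple resource comparison; the cost figure below (peak bytes of intermediate data live within a branch) and the example numbers are illustrative assumptions, not the patent's actual analysis.

# Toy selection between two candidate decompositions of the same operation:
# keep the candidate whose branches have the smaller peak intermediate footprint.
def peak_intermediate_bytes(candidate):
    """candidate: per-branch lists of intermediate tensor byte sizes."""
    return max(sum(branch) for branch in candidate)

first = [[2_016_768, 1_280_000], [2_016_768, 1_280_000]]   # e.g., a 2-way split
second = [[1_396_224, 896_000]] * 3                        # e.g., a 3-way split
chosen = min((first, second), key=peak_intermediate_bytes)
print("3-way split chosen" if chosen is second else "2-way split chosen")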
In an example, a CNN operator may reduce, maintain, or even expand the dimensions (all three dimensions) of its output relative to its input. This can have an adverse effect on the split outputs and on the number of operators involved in the "chain". The more disparate the input/output dimensions, the less efficient the L2 cache space utilization and the greater the number of "branches". The branch count affects the overall mechanism because each branch results in additional coefficient rereading and additional computation. This is one reason why smaller "chains" should also be examined.
In an example, the operation decomposition engine 230 first finds an operation chain that appears in the producer/consumer pattern shown in FIG. 3: T0 → T1 → … → T_n.
Such a chain includes n ≥ 2 operators, n+1 tensors, and n-1 intermediate tensors.
Then, the operation decomposition engine 230 checks whether any intermediate tensor T_i (0 < i < n) is not guaranteed an allocation in L2. Given an intermediate tensor T_i, if the combined size of (T_(i-1) and T_i) or of (T_i and T_(i+1)) exceeds the L2 cache, then T_i (along with T_(i-1) and T_(i+1)) is not guaranteed to be in L2.
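Expressed as a sketch (with illustrative names and sizes), the check reads:

# Sketch of the L2-residency check described above: an intermediate tensor T_i is
# not guaranteed to be in L2 if it cannot be held together with its producer-side
# neighbor T_(i-1) or with its consumer-side neighbor T_(i+1).
def guaranteed_in_l2(sizes, i, l2_bytes):
    """sizes: tensor byte sizes [T_0, ..., T_n]; requires 0 < i < len(sizes) - 1."""
    return (sizes[i - 1] + sizes[i] <= l2_bytes and
            sizes[i] + sizes[i + 1] <= l2_bytes)

sizes = [4_000_000, 4_000_000, 2_500_000]   # roughly T0, T1, T2 of FIG. 3
print(guaranteed_in_l2(sizes, 1, l2_bytes=2 * 1024 * 1024))  # False -> decompose the chain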
As previously described, there may be redundancy in the memory accesses for T0 shown in the previous figures. Because T0 may end up in DRAM, this offsets the savings obtained by keeping the chain's intermediate products in L2 through decomposition. Thus, the original chain may be trimmed so as to operate on segments of it, i.e., with T0 → … → T2 and T5 → … → T8 replacing T0 → … → T8, where so determined.
As shown in the example of FIG. 5, the neural network 500 includes intermediate layers 505, 515, and 525. The intermediate layers 505, 515, and 525 may be different types of layers, such as convolutional layers, ReLU layers, pooling layers, fully connected layers, etc., that perform the respective operations corresponding to those layer types. Thus, the aforementioned intermediate layers may have different dimensions. As further shown, the input data 501 and the output data 560 are stored in DRAM. Prior to the decomposition process, the neural network 500 stores the data 510, 520, and 530 in DRAM because the respective sizes of the aforementioned data do not fit within the cache memory.
The neural network 500 also shows the dependencies between the different intermediate layers. Thus, the intermediate layer 515 uses the output of the intermediate layer 505 (e.g., the data 510), and the intermediate layer 525 uses the output of the intermediate layer 515.
The operation decomposition engine 230 performs the decomposition process on the intermediate layer 505 and splits the operation O1 into three operations O2, O3, and O4 corresponding to the intermediate layer 506, the intermediate layer 507, and the intermediate layer 508, respectively. In an example, the operation O1 corresponds to a pooling layer operation, and the operations O2, O3, and O4 are each respective pooling layer operations with various hyper-parameters (e.g., spatial extent and/or stride) that can affect the size of the output of the corresponding pooling layer operation.
Given a chain, the operation decomposition engine 230 determines the DRAM traffic involved. In implementations, the DRAM traffic involved is due to: 1) tensors that are not guaranteed to be in L2; and 2) certain operations with kernel coefficients (mostly convolutions or convolutional layers).
In an example, the output (T_n) is split into multiple parts (initially 2 parts, then 3, 4 parts, etc.), and all intermediate tensor sizes and all involved regions are recalculated, until a "split factor" is determined that ensures that the DRAM traffic will drop.
In particular, for convolutions or convolutional layers, the kernel coefficients may have to be reread from DRAM, and this has to be taken into account in the implementation.
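A hedged sketch of this split-factor search is shown below; the tensor dimensions, the FP16 element size, and the simple traffic model (spilled intermediates are written and read back once, and coefficients are re-fetched once per extra piece) are all assumptions for illustration.

# Toy split-factor search for the FIG. 3 chain: increase the number of output
# pieces until the estimated DRAM traffic drops.
def intermediate_bytes(parts, out_h=100, channels=96, width=202, kernel=3, elem=2):
    """Per-piece size of the intermediate T1 when T2 (out_h rows) is split height-wise."""
    piece_h = -(-out_h // parts) + (kernel - 1)        # ceil division, plus halo rows
    return channels * piece_h * width * elem

def dram_traffic(parts, l2_bytes, coeff_bytes=64 * 96 * 3 * 3 * 2):
    per_piece = intermediate_bytes(parts)
    spilled = per_piece * parts * 2 if per_piece > l2_bytes else 0   # written, then read back
    rereads = (parts - 1) * coeff_bytes        # coefficients fetched again per extra piece
    return spilled + rereads

l2 = 2 * 1024 * 1024
for parts in (1, 2, 3):
    print(parts, dram_traffic(parts, l2))
# Traffic drops sharply once each piece of T1 fits in L2 (parts >= 2 here), and
# then creeps back up as extra pieces re-read the same coefficients.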
Next, the operation decomposition engine 230 performs the decomposition process on the intermediate layer 515 and splits the operation O5 into three operations O6, O7, and O8 corresponding to the intermediate layer 516, the intermediate layer 517, and the intermediate layer 518, respectively.
In addition, the operation decomposition engine 230 performs the decomposition process on the intermediate layer 525 and splits the operation O9 into three operations O10, O11, and O12 corresponding to the intermediate layer 526, the intermediate layer 527, and the intermediate layer 528, respectively.
In this example, the operation decomposition engine 230 may group decomposition operations into different execution branches for the network. For example, branch 570 includes middle layer 506, middle layer 516, and middle layer 526. In addition, branch 572 includes middle layer 507, middle layer 517, and middle layer 527. In addition, branch 574 includes middle layer 508, middle layer 518, and middle layer 528.
To provide input data to the initial group of intermediate layers, the operation decomposition engine 230 performs the decomposition process on the data 501. As shown, when the network is executed on the target device, the data 501 is split into the data 502, the data 503, and the data 504, which are provided as input data to the intermediate layer 506, the intermediate layer 507, and the intermediate layer 508, respectively.
The following discussion describes data flows throughout the network. Each of middle layer 506, middle layer 507, and middle layer 508 performs a respective operation and generates data 511, data 512, and data 513, respectively, as outputs. As shown, data 511, data 512, and data 513 are provided to middle layer 516, middle layer 517, and middle layer 518, respectively. Each of middle layer 516, middle layer 517, and middle layer 518 performs a respective operation and generates data 521, data 522, and data 523 as outputs, respectively. Further, as shown, data 521, data 522, and data 523 are provided to intermediate layer 526, intermediate layer 527, and intermediate layer 528, respectively. Each of intermediate layers 526, 527, and 528 performs a corresponding operation and generates data 531, 532, and 533, respectively, as output.
For each branch, the decomposition process performed by the operation decomposition engine 230 has split each original intermediate layer into multiple operations, with the corresponding intermediate layer in each branch providing output data that can fit into the cache as shown in FIG. 5, thereby minimizing the utilization of memory bandwidth on the target device (e.g., by forgoing memory accesses to the slower DRAM).
In this example, the operation decomposition engine 230 determines an order in which the operations corresponding to the intermediate layers of the aforementioned branches are performed. For example, the operation decomposition engine 230 may determine the order in which the branches 570, 572, and 574 are executed. In a specific implementation, each branch is fully executed before another branch is selected for execution.
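Run-to-completion scheduling of the branches can be pictured with the small sketch below; the callables stand in for compiled operations and are purely illustrative.

# Sketch: each branch is a list of decomposed operations, and a branch runs to
# completion before the next branch starts, so its intermediates can live and
# die inside the cache.
def run_branches(branches, order):
    for b in order:                       # e.g., order = [0, 1, 2] for branches 570/572/574
        for op in branches[b]:
            op()                          # every operation of this branch executes...
        # ...before any operation of the next branch is issued

branches = [
    [lambda: print("pool A"), lambda: print("conv A")],   # stand-ins for a first branch
    [lambda: print("pool B"), lambda: print("conv B")],   # stand-ins for a second branch
]
run_branches(branches, order=[0, 1])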
As further shown, the output layer 540 receives the data 531, the data 532, and the data 533. The aggregation of the data 531, the data 532, and the data 533 is equivalent to the data 530 in the neural network 500, which ensures that the accuracy of the neural network 550 is not affected by the decomposition process. In this example, the output layer 540 in the neural network 500 and in the neural network 550 performs the same operations to provide the output data 560 (which is equivalent data in both networks in FIG. 5). In an example, the data 531, the data 532, and the data 533 may be aggregated by summing the foregoing data together or by using concatenation techniques to provide the output data 560.
In implementations, there is no "aggregation" per se (e.g., no additional movement or duplication). For example, when the output is split, the computation is split into multiple regions (e.g., the shaded regions of T2 in FIG. 4), and each result is written directly into the final buffer (T2).
In a specific implementation, these logical regions have coordinates. For example, the data 407 corresponds to the T2 region starting at (50,0) and extending to and including (99,199), using 0-based indices in (height, width) order. In addition, when the convolution layer 422 produces results, those results are written directly into T2.
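In other words, each branch's result lands directly in its region of the final buffer; the NumPy snippet below is only an illustration of that indexing, using the FIG. 4 coordinates and an assumed FP16 element type.

# Illustrative only: write each branch's result directly into its region of the
# final output buffer (no separate aggregation step).
import numpy as np

t2 = np.empty((64, 100, 200), dtype=np.float16)          # final buffer (data 306)
result_406 = np.zeros((64, 50, 200), dtype=np.float16)   # produced via convolution layer 412
result_407 = np.ones((64, 50, 200), dtype=np.float16)    # produced via convolution layer 422

t2[:, 0:50, :] = result_406      # region (0,0) up to and including (49,199)
t2[:, 50:100, :] = result_407    # region (50,0) up to and including (99,199), as for data 407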
In particular implementations, the decomposition operations of the neural network 550 (e.g., additional branches of the layer corresponding to the operations performed for each branch, including the order in which each of these branches is performed) may be included with the code of the network for compiling into a binary executable.
FIG. 6 illustrates a flow diagram of an example process 600 for performing a decomposition process for a neural network in accordance with one or more implementations. For purposes of illustration, the process 600 is described herein primarily with reference to components of the software architecture of fig. 2, which may be executed by one or more processors of the electronic device 110 of fig. 1. However, process 600 is not limited to electronic device 110, and one or more blocks (or operations) of process 600 may be performed by one or more other components of other suitable devices, such as by electronic device 115. For further illustration purposes, blocks of process 600 are described herein as occurring sequentially or linearly. However, multiple blocks of process 600 may occur in parallel. Furthermore, the blocks of process 600 need not be performed in the order shown, and/or one or more blocks of process 600 need not be performed and/or may be replaced by other operations.
The operational decomposition engine 230 receives a representation of a Neural Network (NN) model to be executed on an electronic device (610). In an example, the representation of the NN model includes nodes corresponding to intermediate layers of the NN model, where at least some of the nodes each correspond to a respective operation of a respective intermediate layer of the NN model to be performed by the electronic device.
The operation decomposition engine 230 determines, for each respective operation corresponding to each node in each respective intermediate layer of the NN model, a respective set of operations that is mathematically equivalent to the respective operation, such that the aggregation of the outputs of the respective set of operations is equivalent to the output of the respective operation (612).
The operation decomposition engine 230 generates a graph based on each respective set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including a particular operation from each respective set of operations (614).
The operation decomposition engine 230 determines a respective order for executing each branch of the graph (616).
FIG. 7 illustrates an electronic system 700 with which one or more implementations of the subject technology may be implemented. Electronic system 700 may be and/or may be part of electronic device 110, electronic device 115, and/or server 120 shown in fig. 1. Electronic system 700 may include various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 700 includes bus 708, one or more processing units 712, system memory 704 (and/or a buffer), ROM 710, persistent storage 702, input device interface 714, output device interface 706, and one or more network interfaces 716, or a subset and variation thereof.
Bus 708 generally represents all of the system buses, peripheral buses, and chipset buses that communicatively connect the many internal devices of electronic system 700. In one or more implementations, the bus 708 communicatively connects the one or more processing units 712 with the ROM 710, the system memory 704, and the persistent storage 702. One or more processing units 712 retrieve instructions to be executed and data to be processed from the various memory units in order to perform the processes of the subject disclosure. In different implementations, one or more of the processing units 712 may be a single processor or a multi-core processor.
ROM 710 stores static data and instructions required by one or more processing units 712, as well as other modules of electronic system 700. On the other hand, persistent storage 702 may be a read-write memory device. Persistent storage 702 may be a non-volatile memory unit that stores instructions and data even when electronic system 700 is turned off. In one or more implementations, a mass storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the persistent storage device 702.
In one or more implementations, removable storage devices (such as floppy disks, flash memory drives, and their corresponding disk drives) may be used as the persistent storage device 702. As with persistent storage 702, system memory 704 may be a read-write memory device. However, unlike persistent storage 702, system memory 704 may be a volatile read-write memory, such as random access memory. The system memory 704 may store any of instructions and data that may be required by the one or more processing units 712 at runtime. In one or more implementations, the processes of the subject disclosure are stored in system memory 704, persistent storage 702, and/or ROM 710. One or more processing units 712 retrieve instructions to be executed and data to be processed from the various memory units in order to perform one or more embodied processes.
The bus 708 is also connected to an input device interface 714 and an output device interface 706. The input device interface 714 enables a user to communicate information and select commands to the electronic system 700. Input devices that may be used with input device interface 714 may include, for example, an alphanumeric keyboard and a pointing device (also referred to as a "cursor control device"). The output device interface 706 may, for example, enable display of images generated by the electronic system 700. Output devices that may be used with output device interface 706 may include, for example, printers and display devices, such as Liquid Crystal Displays (LCDs), light Emitting Diode (LED) displays, organic Light Emitting Diode (OLED) displays, flexible displays, flat panel displays, solid state displays, projectors, or any other device for outputting information. One or more implementations may include a device that serves as both an input device and an output device, such as a touch screen. In these implementations, the feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in fig. 7, the bus 708 also couples the electronic system 700 to one or more networks and/or to one or more network nodes, such as the electronic device 115 shown in fig. 1, through one or more network interfaces 716. In this manner, electronic system 700 may be part of a computer network, such as a LAN, a wide area network ("WAN") or an intranet, or may be part of a network of networks, such as the Internet. Any or all of the components of electronic system 700 may be used with the subject disclosure.
One aspect of the disclosed technology may include applying machine learning and/or compiler techniques that may perform operations on user data. The present disclosure contemplates that in some instances, the user data may include personal information data that uniquely identifies or may be used to identify a particular person. Such personal information data may include demographic data, location-based data, online identifiers, telephone numbers, email addresses, home addresses, data or records related to the user's health or fitness level (e.g., vital sign measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be used to benefit users. For example, personal information data may be used to perform machine learning tasks that provide results (e.g., predictions) of interest to a user. Thus, the use of such personal information data enables the user to have greater control over the results delivered. In addition, the present disclosure contemplates other uses for personal information data that are beneficial to the user. For example, health and fitness data may be used according to user preferences to provide insight into their overall health condition, or may be used as positive feedback to individuals who use technology to pursue health goals.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will adhere to established privacy policies and/or privacy practices. In particular, such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. Such information regarding the use of personal data should be prominent and easily accessible to users, and should be updated as the collection and/or use of the data changes. Personal information from users should be collected only for legitimate uses. Further, such collection and use should occur only after receiving the consent of the users or another legitimate basis specified in applicable law. Moreover, such entities should consider taking any steps needed to safeguard and secure access to such personal information data and to ensure that others with access to the personal information data adhere to their privacy policies and procedures. Further, such entities can subject themselves to evaluation by third parties to certify their adherence to widely accepted privacy policies and practices. In addition, policies and practices should be tailored to the particular types of personal information data being collected and/or accessed, and adapted to applicable laws and standards, including jurisdiction-specific considerations that may serve to impose higher standards. For instance, in the United States, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
Despite the foregoing, the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data that components of the systems described herein may attempt to access. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data. For example, with respect to advertisement delivery services, the present technology can be configured to allow users to select to "opt in" or "opt out" of participation in the collection of personal information data during registration for services or anytime thereafter. In another example, users can choose not to provide mood-associated data for targeted content delivery services. In yet another example, users can select to limit the length of time mood-associated data is maintained, or entirely prohibit the development of a baseline mood profile. In addition to providing "opt in" and "opt out" options, the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an app that their personal information data will be accessed, and then reminded again just before personal information data is accessed by the app.
Further, it is the intent of the present disclosure that personal information data should be managed and handled in a way that minimizes the risk of unintentional or unauthorized access or use. Risk can be minimized by restricting data access and by deleting data once it is no longer needed. In addition, and when applicable, including in certain health-related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at a city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that the various embodiments can be implemented without the need to access such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data. For example, content can be selected and delivered to users based on aggregated non-personal information data or an absolute minimum amount of personal information, such as the content being handled only on the user's device, or other non-personal information available to the content delivery services.
Implementations within the scope of the present disclosure may be partially or fully implemented using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) having one or more instructions written thereon. The tangible computer readable storage medium may also be non-transitory in nature.
A computer readable storage medium may be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device including any processing electronics and/or processing circuitry capable of executing the instructions. By way of example, and not limitation, computer readable media can comprise any volatile semiconductor memory such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer readable medium may also include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, feRAM, feTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack, FJG, and Millipede memories.
Furthermore, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to the computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to the computing device, for example, via one or more wired connections, one or more wireless connections, or any combination thereof.
The instructions may be directly executable or may be used to develop executable instructions. For example, the instructions may be implemented as executable or non-executable machine code, or may be implemented as high-level language instructions that may be compiled to produce executable or non-executable machine code. Further, the instructions may also be implemented as data, or may include data. Computer-executable instructions may also be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, and the like. As will be appreciated by one of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions may vary significantly without altering the underlying logic, functionality, processing, and output.
While the above discussion primarily refers to a microprocessor or multi-core processor executing software, one or more implementations are performed by one or more integrated circuits, such as an ASIC or FPGA. In one or more implementations, such integrated circuits execute instructions stored on the circuits themselves.
Those of skill in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged differently (e.g., arranged in a different order, or divided in a different manner) without departing from the scope of the subject technology.
It should be understood that the specific order or hierarchy of blocks in the processes disclosed herein is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks may be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this patent application, the terms "base station," "receiver," "computer," "server," "processor," and "memory" refer to an electronic or other technical device. These terms exclude a person or group of people. For purposes of this specification, the term "display" or "displaying" means displaying on an electronic device.
As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" used to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases "at least one of A, B, and C" and "at least one of A, B, or C" each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words "configured to," "operable to," and "programmed to" do not imply any particular tangible or intangible modification of a subject, but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation. Likewise, a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
Phrases such as "an aspect," "the aspect," "another aspect," "some aspects," "one or more aspects," "an implementation," "the implementation," "another implementation," "some implementations," "one or more implementations," "an embodiment," "the embodiment," "another embodiment," "some embodiments," "one or more embodiments," "a configuration," "the configuration," "another configuration," "some configurations," "one or more configurations," "the subject technology," "the disclosure," "the present disclosure," and other variations thereof and the like are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as "an aspect" or "some aspects" may refer to one or more aspects and vice versa, and this applies similarly to the other foregoing phrases.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" or as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Furthermore, to the extent that the terms "includes," "has," and the like are used in either the description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112(f) unless the element is expressly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for."
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." Unless specifically stated otherwise, the term "some" refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.

Claims (20)

1. A method for reducing memory traffic, comprising:
Receiving a representation of a neural network NN model to be executed on an electronic device, the representation of the NN model including nodes corresponding to intermediate layers of the NN model, wherein at least some of the nodes each correspond to a respective operation of a respective intermediate layer of the NN model to be executed by the electronic device;
Determining a respective set of operations mathematically equivalent to the respective operation for the respective operation corresponding to at least one node in at least one respective intermediate layer of the NN model such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation, wherein the determining for the respective operation corresponding to the at least one node in the at least one respective intermediate layer of the NN model comprises:
Determining a first plurality of operations for the respective operation;
determining a second plurality of operations for the respective operation; and
Selecting one of the first plurality of operations or the second plurality of operations based at least in part on an analysis of the first plurality of operations and the second plurality of operations, wherein the analysis is based at least in part on statistics indicative of computational overhead and memory access, and the analysis indicates which of the first plurality of operations or the second plurality of operations utilizes less memory resources;
Generating a graph based on each respective set of operations, wherein the graph includes a set of branches, at least one branch including a plurality of operations including a particular operation from each respective set of operations;
determining a respective order for executing each branch of the graph; and
storing the graph and the respective order.
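As a purely illustrative aid, the selection step recited above can be pictured with a minimal Python sketch. The toy cost model, the two candidate pluralities of operations (a matrix multiplication split by output columns versus by output rows), and every function name below are assumptions made for illustration; they are not drawn from the claimed implementation.

# Illustrative sketch only: a toy decomposition chooser in the spirit of claim 1.
# The cost model and the candidate decompositions are assumptions, not the patented method.
import numpy as np

def candidates_for_matmul(x_shape, w_shape, parts=4):
    """Return two candidate decompositions of y = x @ w as lists of slice plans."""
    m, k = x_shape
    k2, n = w_shape
    assert k == k2
    # Candidate A: split the output along its columns (each branch reads all of x).
    cand_a = [("cols", c) for c in np.array_split(np.arange(n), parts)]
    # Candidate B: split the output along its rows (each branch reads all of w).
    cand_b = [("rows", r) for r in np.array_split(np.arange(m), parts)]
    return cand_a, cand_b

def estimate_bytes(plan, x_shape, w_shape, itemsize=4):
    """Toy statistic: bytes touched by one branch (inputs read plus partial output written)."""
    m, k = x_shape
    _, n = w_shape
    axis, idx = plan
    if axis == "cols":
        return (m * k + k * len(idx) + m * len(idx)) * itemsize
    return (len(idx) * k + k * n + len(idx) * n) * itemsize

def choose_decomposition(x_shape, w_shape, cache_bytes):
    """Pick the candidate whose largest branch touches the least memory, then order its branches."""
    cand_a, cand_b = candidates_for_matmul(x_shape, w_shape)
    worst_a = max(estimate_bytes(p, x_shape, w_shape) for p in cand_a)
    worst_b = max(estimate_bytes(p, x_shape, w_shape) for p in cand_b)
    chosen = cand_a if worst_a <= worst_b else cand_b
    # "Graph": one branch per slice plan, executed in order of the slice start index.
    branches = sorted(chosen, key=lambda p: int(p[1][0]))
    return branches, min(worst_a, worst_b) <= cache_bytes

branches, fits_cache = choose_decomposition((256, 512), (512, 1024), cache_bytes=1 << 20)
print(len(branches), "branches; largest branch fits the assumed cache budget:", fits_cache)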
2. The method of claim 1, further comprising:
compiling a binary package for the electronic device based at least in part on the graph and the respective order for executing each branch of the graph, wherein the electronic device executes each respective set of operations based on the respective order.
3. The method of claim 1, wherein one of the first plurality of operations or the second plurality of operations is selected based on a set of heuristics that utilize the statistics.
4. The method of claim 1, wherein the output from each operation of the respective set of operations is constrained based at least in part on an amount of available memory in a cache of the electronic device.
5. The method of claim 4, wherein the aggregation of the outputs of the respective set of operations is stored in a memory of the electronic device, the memory being slower than the cache of the electronic device.
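The cache constraint in claims 4 and 5 reduces, in practice, to sizing arithmetic of the following kind. This is a minimal sketch under assumed numbers (a 4 MiB cache, a 16-bit feature map, and a reserve factor left for inputs and weights); the function name and the constants are hypothetical and not taken from the claims.

# Illustrative only: derive a split count so each partial output fits an assumed cache budget,
# while the aggregated result is written out to slower memory (e.g., DRAM).
import math

def split_count(output_shape, itemsize, cache_bytes, reserve=0.5):
    """Number of branches needed so one partial output fits within reserve * cache_bytes."""
    total = math.prod(output_shape) * itemsize
    budget = int(cache_bytes * reserve)  # leave the rest of the cache for inputs and weights
    return max(1, math.ceil(total / budget)), total

parts, total = split_count((1, 64, 224, 224), itemsize=2, cache_bytes=4 << 20)
print(f"full output: {total / 2**20:.2f} MiB -> {parts} branches; "
      f"the {total / 2**20:.2f} MiB aggregate is written to DRAM")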
6. The method of claim 1, wherein the plurality of operations of each branch begin after an input node of the NN model and end before an output node of an output layer of the NN model.
7. The method of claim 1, wherein the plurality of operations of each branch provide a portion of an output of the NN model from an output layer.
8. The method of claim 7, wherein an aggregate of each output of each branch is equal to the output of the NN model from the output layer.
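Claims 7 and 8 rest on the branch outputs aggregating exactly to the original layer output. The short numerical check below uses a column-wise split of a matrix multiplication as a stand-in for a decomposed layer; the shapes and the choice of split are assumptions for illustration only.

# Demonstration that the aggregate of the branch outputs equals the undecomposed output.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)   # layer input
w = rng.standard_normal((16, 32)).astype(np.float32)  # layer weights

full = x @ w                                                         # undecomposed layer output
branch_outputs = [x @ w_cols for w_cols in np.split(w, 4, axis=1)]   # four branches
aggregated = np.concatenate(branch_outputs, axis=1)                  # aggregate of branch outputs

assert np.allclose(full, aggregated)
print("aggregate of the branch outputs matches the full output:", aggregated.shape)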
9. The method of claim 8, wherein the output of the NN model from the output layer is stored in dynamic random access memory, DRAM.
10. The method of claim 1, wherein the electronic device comprises a cache memory and a dynamic random access memory, DRAM.
11. A system for reducing memory traffic, comprising:
A processor;
a memory device including instructions that, when executed by the processor, cause the processor to:
Receiving a representation of a neural network NN model to be executed on an electronic device, the representation of the NN model comprising nodes corresponding to layers of the NN model, wherein at least one of the nodes corresponds to an operation of a corresponding layer of the NN model to be executed by the electronic device;
determining a set of operations mathematically equivalent to the operation such that an aggregation of outputs of the set of operations is equivalent to the outputs of the operation, wherein determining the set of operations comprises:
Determining a first plurality of operations for the operation;
Determining a second plurality of operations for the operation; and
Selecting one of the first plurality of operations or the second plurality of operations based at least in part on an analysis of the first plurality of operations and the second plurality of operations, wherein the analysis is based at least in part on statistics indicative of computational overhead and memory access, and the analysis indicates which of the first plurality of operations or the second plurality of operations utilizes less memory resources;
Generating a graph based on the set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including at least one operation from the set of operations;
determining a respective order for executing each branch of the graph; and
storing the graph and the respective order for executing each branch of the graph for compilation of the NN model.
12. The system of claim 11, wherein the memory device further includes instructions that when executed by the processor further cause the processor to:
compile a binary package for the electronic device based at least in part on the graph and the respective order for executing each branch of the graph, wherein the electronic device executes each operation of the respective plurality of operations based on the respective order.
13. The system of claim 11, wherein one of the first plurality of operations or the second plurality of operations is selected further based on the statistics indicative of computational overhead and memory access and on a set of heuristics that utilize the statistics.
14. The system of claim 11, wherein the output from each operation of the set of operations is constrained based at least in part on an amount of available memory in a cache of the electronic device.
15. The system of claim 14, wherein the aggregation of the output of the set of operations is stored in a memory of the electronic device, the memory being slower than the cache of the electronic device.
16. The system of claim 11, wherein the set of operations starts after an input node of the NN model and ends before an output node of an output layer of the NN model.
17. The system of claim 11, wherein the set of operations for each branch provides a portion of an output of the NN model from an output layer.
18. The system of claim 17, wherein an aggregate of each output of each branch is equal to the output of the NN model from the output layer.
19. The system of claim 18, wherein the output of the NN model from the output layer is stored in DRAM.
20. A non-transitory computer-readable medium comprising instructions that, when executed by a computing device, cause the computing device to perform operations comprising:
Receiving a representation of a neural network NN model to be executed on an electronic device, the representation of the NN model comprising nodes corresponding to layers of the NN model, wherein at least some of the nodes each correspond to a respective operation of a respective layer of the NN model to be executed by the electronic device;
Determining a respective set of operations mathematically equivalent to the respective operation for the respective operation corresponding to at least one node in at least one layer of the NN model such that an aggregation of outputs of the respective set of operations is equivalent to an output of the respective operation, wherein determining the respective set of operations for the respective operation corresponding to the at least one node in the at least one layer of the NN model comprises:
Determining a first plurality of operations for the respective operation;
determining a second plurality of operations for the respective operation; and
Selecting one of the first plurality of operations or the second plurality of operations based at least in part on an analysis of the first plurality of operations and the second plurality of operations, wherein the analysis is based at least in part on statistics indicative of computational overhead and memory access, and the analysis indicates which of the first plurality of operations or the second plurality of operations utilizes less memory resources;
Generating a graph based on each respective set of operations, wherein the graph includes a set of branches, each branch including a plurality of operations including a particular operation from each respective set of operations; and
A respective order for executing each branch of the graph is determined.
CN202010374389.1A 2019-05-31 2020-05-06 Decomposition of machine learning operations Active CN112016681B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962855850P 2019-05-31 2019-05-31
US62/855,850 2019-05-31
US16/601,507 US11687789B2 (en) 2019-05-31 2019-10-14 Decomposition of machine learning operations
US16/601,507 2019-10-14

Publications (2)

Publication Number Publication Date
CN112016681A CN112016681A (en) 2020-12-01
CN112016681B true CN112016681B (en) 2024-04-30

Family

ID=73506773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010374389.1A Active CN112016681B (en) 2019-05-31 2020-05-06 Decomposition of machine learning operations

Country Status (1)

Country Link
CN (1) CN112016681B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491469A (en) * 2016-06-11 2017-12-19 苹果公司 Intelligent task is found
CN108229646A (en) * 2017-08-08 2018-06-29 北京市商汤科技开发有限公司 neural network model compression method, device, storage medium and electronic equipment
CN108351805A (en) * 2015-10-28 2018-07-31 谷歌有限责任公司 Calculate the accelerator processing based on stream of figure

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11023803B2 (en) * 2017-04-10 2021-06-01 Intel Corporation Abstraction library to enable scalable distributed machine learning
US11488008B2 (en) * 2017-05-05 2022-11-01 Intel Corporation Hardware implemented point to point communication primitives for machine learning
CN107798382B (en) * 2017-11-21 2020-09-01 南京地平线机器人技术有限公司 Method and apparatus for adapting feature data in convolutional neural networks

Also Published As

Publication number Publication date
CN112016681A (en) 2020-12-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant