CN112015675A - Allocation of machine learning tasks into shared cache - Google Patents

Allocation of machine learning tasks into shared cache

Info

Publication number
CN112015675A
Authority
CN
China
Prior art keywords
cache
memory
operations
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010322486.6A
Other languages
Chinese (zh)
Other versions
CN112015675B (en)
Inventor
F·P·万纳
C·M·福雷特
姚笑终
S·哈雷哈拉苏巴曼尼安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/601,501 (US11080200B2)
Application filed by Apple Inc
Priority to CN202311593938.4A (published as CN117632785A)
Publication of CN112015675A
Application granted
Publication of CN112015675B
Legal status: Active
Anticipated expiration: (not listed)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871Allocation or management of cache space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present disclosure relates to the allocation of machine learning tasks into a shared cache. The subject technology receives code corresponding to a Neural Network (NN) model, the code including specific operations performed by the NN model. From among the specific operations, the subject technology determines a set of operations to be allocated to a cache of an electronic device that is to execute the NN model. The subject technology generates a set of cache indicators corresponding to the determined set of operations. The subject technology compiles the code and the generated set of cache indicators to provide a compiled binary file for the NN model to execute on a target device.

Description

Allocation of machine learning tasks into shared cache
Cross Reference to Related Applications
This application claims the benefit of U.S. Provisional Patent Application Serial No. 62/855,900, entitled "ALLOCATION OF MACHINE LEARNING TASKS INTO A SHARED CACHE," filed on May 31, 2019, which is hereby incorporated by reference in its entirety and made part of the present U.S. utility patent application for all purposes.
Technical Field
The present specification relates generally to compiling a neural network model for execution on a target platform.
Background
Software engineers and scientists have been using machine learning on computer hardware to make improvements across different industry applications, including image classification, video analysis, speech recognition, and natural language processing. Notably, neural networks are being used ever more frequently to create systems that can perform different computational tasks based on training from large amounts of data.
Drawings
Some of the features of the subject technology are set forth in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.
FIG. 1 illustrates an exemplary network environment in accordance with one or more implementations.
FIG. 2 illustrates an exemplary computing architecture for compiling a neural network with cache indicators according to one or more implementations.
Fig. 3 illustrates an example of processing a machine learning operation with respect to an on-chip memory, such as a cache, and/or an off-chip memory, such as a DRAM, based on a cache indicator provided in the operation.
FIG. 4 illustrates a flow diagram of an exemplary process for compiling a neural network using cache indicators, according to one or more implementations.
FIG. 5 illustrates a flow diagram of an example process for allocating memory for a neural network based on cache indicators in memory transactions, according to one or more implementations.
FIG. 6 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.
Detailed Description
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. The subject technology is not limited to the specific details set forth herein, however, and may be practiced with one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
The popularity of machine learning has risen dramatically in recent years due to the availability of large amounts of training data and advances in more powerful and efficient computing hardware. One popular machine learning technique is to utilize a deep neural network to perform a set of machine learning tasks. A common approach is to utilize a graphics processing unit (GPU) to train a deep neural network and, after training, also to execute the deep neural network on new input data.
On a given platform for executing one or more neural networks, the platform may provide a limited amount of memory. For example, modern computing devices typically include various types of memory, including faster cache memory (e.g., on-chip memory) and slower main memory (e.g., off-chip memory), such as dynamic random access memory or DRAM. Executing such a neural network on a faster cache memory may improve the performance of the neural network because the performance penalty of accessing slower DRAM is avoided. Additionally, on some computing platforms (such as mobile devices), accessing DRAM also results in greater power consumption when compared to accessing faster cache memory.
Implementations of the subject technology described herein improve the computing functionality of electronic devices by: by including at least a cache indicator during the compilation process of a neural network, where possible, faster cache memory is enabled to be utilized by a given neural network while being executed by an electronic device. For example, the cache indicator may indicate that a faster cache memory (e.g., on-chip memory) is preferred for a given task or operation of the neural network, e.g., to account for the relative performance penalty that would result from using slower off-chip memory (e.g., DRAM).
Such cache indicators enable other hardware components (e.g., a cache engine or controller) to perform cache memory allocation during runtime, where the cache memory allocation may be prioritized for a task or operation that prefers cache memory. Advantageously, the neural network may preferentially access faster cache memory and thus perform faster-completing machine learning tasks. Thus, these benefits are understood to improve the computing functionality of a given electronic device, such as an end-user device, which may typically have fewer available computing resources than, for example, one or more cloud-based servers.
FIG. 1 illustrates an exemplary network environment 100 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
Network environment 100 includes electronic device 110, electronic device 115, and server 120. Network 106 may communicatively (directly or indirectly) couple electronic device 110 and/or server 120, electronic device 115 and/or server 120, and/or electronic device 110 and/or electronic device 115. In one or more implementations, the network 106 may be an interconnected network that may include the internet or a device communicatively coupled to the internet. For purposes of explanation, network environment 100 is shown in FIG. 1 as including electronic device 110, electronic device 115, and server 120; however, network environment 100 may include any number of electronic devices and any number of servers.
The electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, and so forth. In FIG. 1, by way of example, the electronic device 110 is depicted as a desktop computer. The electronic device 110 may be and/or may include all or part of an electronic system discussed below with respect to fig. 6.
In one or more implementations, the electronic device 110 and/or the server 120 can provide a system for compiling a given neural network model. In an example, using compiled code, the subject system can create an executable software package to deploy on a target platform, such as electronic device 115, under the direction of server 120. When the compiled code is executed, the target platform may perform a given operation of the neural network model.
The electronic device 115 may be, for example, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, etc., or any other electronic device. The electronic device 115 may also include processors with different computing capabilities, including, for example, a CPU, a GPU, and/or a neural processor. In fig. 1, by way of example, the electronic device 115 is depicted as a smartphone device. In one or more implementations, the electronic device 115 may be and/or may include all or part of an electronic system discussed below with respect to fig. 6.
In one or more implementations, the server 120 deploys compiled code included in an executable software package to a target device for execution. In an example, the electronic device 115 may be a target device for receiving a software package with compiled neural network code and executing the compiled code in a runtime environment of the electronic device 115. The electronic device 115 (or any electronic device that is a target device) may include a framework enabled to perform operations in compiled code of a neural network. A framework may refer to a software environment that provides specific functionality as part of a larger software platform to facilitate software application development.
FIG. 2 illustrates an exemplary computing architecture 200 for compiling a neural network with cache indicators according to one or more implementations. For purposes of illustration, the computing architecture is described as being provided by the electronic device 110 of fig. 1, such as by a processor and/or memory of the electronic device 110; however, the computing architecture may be implemented by any other electronic device. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, computing architecture 200 includes electronic device 110 and electronic device 115. The electronic device 110 includes a compiler 215 and a memory 240. Memory 240 includes Neural Network (NN) model source code 244 that, after being compiled by compiler 215, generates a Neural Network (NN) binary executable 242 that may be deployed to different target platforms for execution. In an example, the NN model source code 244 may include code for various algorithms, which may be used alone or in combination to implement particular functions for execution on a given target device. As described above, the target device may include various hardware sensors and different processors (e.g., as provided by the electronic device 115) that may be utilized when running the NN binary executable 242 on the target device. In an example, the particular functionality may include image processing or computer vision related functionality, speech recognition, natural language processing, and so forth.
Although compiler 215 is provided on electronic device 110 in the example of fig. 2, in some implementations, a compiler may be provided on a particular electronic device (e.g., electronic device 115) that locally compiles source code and executes the compiled code on the same device. In particular implementations, the NN model source code 244 may be compiled for a particular target platform and then deployed to a different device, such as the electronic device 115, for execution. In an example, the NN model source code 244 may include at least code corresponding to a set of operations (e.g., machine learning tasks) to be performed by corresponding nodes from each layer of a given NN model. As referred to herein, a machine learning task corresponds to at least one operation performed by a given node in a particular layer of a given NN model. It should be appreciated that in a particular implementation, a machine learning task may refer to various operations performed by multiple nodes in a network (e.g., in the same layer or layers). In an example, the code of an operation in a layer of the NN is a respective function call for performing the operation and/or a set of parameters for the function call. Additionally, code corresponding to input and output features, data structures, and feature types may be included in the NN model source code 244.
As discussed further below, the target device (e.g., electronic device 115) may include multiple processors (e.g., CPUs, GPUs, Neural Processors (NPs)) for performing operations of a given NN model, where each processor has access to memory, such as a cache or slower Dynamic Random Access Memory (DRAM) provided by the target device, that is shared among the processors of the target device. Given the memory constraints of the target device, data for the various operations of the NN model performed by the aforementioned processors may not always fit within the cache, which would provide better performance, and instead may be stored in slower DRAM in order to complete such operations.
In particular implementations, compiler 215 analyzes NN model source code 244 and determines which data of a given Neural Network (NN) model will benefit from being placed in faster storage (e.g., memory cache 257) rather than slower storage (e.g., DRAM 258). Such data may include, for example, data corresponding to the input and output characteristics described above, and/or data structures of the NN model. By way of example, the respective outputs of the operations by the NN model may be in the form of data structures, such as containers (e.g., tensors) that may store data in N dimensions (e.g., matrices, vectors, arrays of arrays, etc.).
In particular implementations, compiler 215 performs the following operations: 1) determining the machine learning tasks performed by the NN model based on the code, 2) determining which machine learning tasks should be allocated in the faster memory cache 257 to improve performance, and 3) generating cache indicators to associate with the respective machine learning tasks, enabling the compiled NN model, during runtime, to allocate memory in the memory cache 257 (e.g., where possible) or to forgo such allocation (with the data instead being placed in slower DRAM).
As mentioned herein, the cache indicator may include information indicating whether to request allocation of memory in the shared cache, or to perform another operation, such as evicting or invalidating data already stored in the shared cache. In an example, such information may be included in an instruction (e.g., as part of a memory transaction) sent to a processor (e.g., CPU, GPU, NP) that is then processed by the processor to determine whether to request allocation of memory within a cache or slower memory or to evict a portion of memory. To allocate memory cache 257, compiler 215 may use knowledge of the size of memory cache 257 available on the target device to determine whether allocation of memory cache 257 is feasible.
For a given cache indicator, the compiler 215 may include information corresponding to a particular operation of a node of the NN, or may associate the cache indicator with a set of operations from a single node or performed across different nodes and/or layers of the NN. In the aforementioned memory transactions, a cache indicator may be associated with each of the instructions, which may include a set of instructions that are ultimately issued to the processor (or processors, depending on the operation to be performed). In another example, depending on the instruction, not every instruction in a memory transaction includes a cache indicator. In one or more implementations, the cache indicator may be included only with operations where the preferred memory changes, for example, from on-chip memory to off-chip memory or vice versa, and may be omitted when the preferred memory remains static for a plurality of consecutive operations.
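For concreteness, the sketch below shows one way a cache indicator could be represented alongside a compiled operation, including the case where the indicator is omitted because the preferred memory has not changed. The type and field names are illustrative assumptions, not the binary format actually produced by compiler 215.

```python
# Illustrative only: the type and field names below are assumptions made for
# this sketch, not the actual format emitted by compiler 215.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CachePreference(Enum):
    ALLOCATE = "allocate"   # request space in the shared memory cache
    BYPASS = "bypass"       # place the data in slower DRAM
    EVICT = "evict"         # invalidate cached data after its last use


@dataclass
class CacheIndicator:
    preference: CachePreference
    size_bytes: int                      # memory required by the operation's data


@dataclass
class CompiledOperation:
    name: str                            # e.g. "conv_1", "pool_2"
    indicator: Optional[CacheIndicator]  # may be omitted when the preference is unchanged
```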
In particular implementations, compiler 215 generates the cache indicators when compiling code for the NN model using the following policies/criteria. Data that is utilized only once is not preferred/prioritized for placement in the cache and may be placed in slower DRAM if needed. In contrast, data that is utilized more than once is preferred/prioritized for placement in the cache. For data that is utilized more than once, at the last operation that uses the data, compiler 215 may also determine whether to request eviction of the data from the cache (e.g., a cache delete operation that invalidates the portion of the cache holding the data). Additionally, compiler 215 may assign a first priority value to a first set of data such that this data is given higher priority for placement in the cache than other data (e.g., data that has been assigned a lower priority value). A sketch of this policy is shown below.
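The heuristic below applies the stated criteria: data used once goes to DRAM, reused data is preferred for the cache subject to the known cache size and priority values, and the cached copy is evicted at its last use. The function and parameter names are hypothetical; this is not the actual implementation of compiler 215.

```python
# Hypothetical sketch of the allocation policy described above; not the actual
# implementation of compiler 215.
def plan_cache_indicators(ops, priorities, cache_size_bytes):
    """Return a cache preference ("allocate", "bypass", or "evict") per operation.

    ops: list of (op_name, data_id, data_size_bytes) tuples in execution order.
    priorities: map of data_id -> priority value (higher is preferred).
    """
    use_counts = {}
    for _, data_id, _ in ops:
        use_counts[data_id] = use_counts.get(data_id, 0) + 1
    last_use = {data_id: i for i, (_, data_id, _) in enumerate(ops)}
    sizes = {data_id: size for _, data_id, size in ops}

    # Reused data competes for cache space, highest priority first, within the
    # known size of the cache on the target device.
    reused = [d for d, n in use_counts.items() if n > 1]
    budget = cache_size_bytes
    cached = set()
    for d in sorted(reused, key=lambda d: priorities.get(d, 0), reverse=True):
        if sizes[d] <= budget:
            cached.add(d)
            budget -= sizes[d]

    plan = {}
    for i, (op_name, data_id, _) in enumerate(ops):
        if data_id not in cached:
            plan[op_name] = "bypass"    # used only once, or did not fit in the cache
        elif i == last_use[data_id]:
            plan[op_name] = "evict"     # last use: invalidate the cached copy
        else:
            plan[op_name] = "allocate"  # reused data: prefer the shared cache
    return plan


# Example: act_1 is read twice, so it is cached and evicted at its last use;
# act_2 is used only once and goes to DRAM.
ops = [("conv_1", "act_1", 2048), ("conv_2", "act_1", 2048), ("pool_1", "act_2", 1024)]
print(plan_cache_indicators(ops, priorities={}, cache_size_bytes=4096))
# {'conv_1': 'allocate', 'conv_2': 'evict', 'pool_1': 'bypass'}
```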
Such priority may be based on performance requirements (e.g., cost), such as how quickly the data needs to be read, for example, to meet the requirements of the machine learning task being performed by the NN model, and/or whether the computational requirements of the task are greater than its memory requirements, in which case placing the data in slower memory does not significantly impact performance. Additionally, the compiler 215 may consider energy requirements, e.g., whether tasks should be placed in the cache to meet energy and/or thermal requirements of the target device executing the NN model.
Compiler 215 further utilizes the generated cache indicators to process the source code and compile the code into an NN binary executable for the target device, which may be stored as the NN binary executable 242 and then deployed to the target device (e.g., electronic device 115) for execution. Although compiler 215 is provided on electronic device 110 in the example of FIG. 2, in some implementations, a compiler may be provided on a particular electronic device that compiles code for a neural network model and executes the compiled neural network model on the same device. As described above, the neural network model may be compiled from the NN model source code 244 for a particular target platform and then deployed to a different device, such as the electronic device 115, for execution.
As further shown, in one or more implementations, the electronic device 115 includes a system on a chip (SOC) 250. SOC 250 may include L2 cache 252 (e.g., on-chip memory), CPU 254, GPU 255, and neural processor 256. The electronic device 115 also includes memory cache 257 and DRAM 258 (e.g., off-chip memory).
In one or more implementations, memory cache 257 may be on-chip (e.g., part of SOC 250, as shown in the example of fig. 2) or off-chip (not shown). Further, with respect to power, performance, and/or accessibility, memory cache 257 may fall between L2 cache 252 and DRAM 258. For example, memory cache 257 may be more general purpose than L2 cache 252, but less general purpose than DRAM 258.
DRAM 258 may be a memory that has slower access speeds than memory cache 257 and/or L2 cache 252. In one or more implementations, DRAM 258 may be shared across multiple (e.g., all) tasks and processing units of electronic device 115. Accessing DRAM 258 may consume computing resources of the electronic device 115 because it may utilize a relatively significant amount of power and may degrade the performance of the NN model by slowing down memory-bound layers (e.g., pooling layers, element-wise layers, etc.) of the NN. In contrast, in an implementation, memory cache 257 is faster than DRAM 258, but is smaller in size than DRAM 258. Therefore, data (e.g., input data being processed, output data, intermediate data, etc.) corresponding to the operations of the NN model often does not fit in the memory cache 257 and is instead stored in DRAM 258.
Usage of the memory cache 257 may be managed (e.g., based on cache indicators), such as by a quota system or general access permissions, so that access to the memory cache 257 is provided for some tasks or engines (e.g., rather than for others). In one or more implementations, memory cache 257 may be checked before DRAM 258 for data requests. For example, a cache indicator as described herein may be generated (e.g., by compiler 215) to allocate data to memory cache 257. However, the data may or may not still be available in the memory cache 257. A request may be made to the driver (e.g., by the corresponding engine) to retrieve the data from the memory cache 257. If the data is still available in the memory cache 257, the data may be obtained from the memory cache 257 and sent to the corresponding engine. If the data is no longer available in the memory cache 257, the request for the data may be forwarded to, and the data obtained from, DRAM 258. It is also possible that only some of the data is still available in memory cache 257, in which case the available portion of the data may be obtained from memory cache 257 and the remaining portion obtained from DRAM 258, as shown in the sketch below.
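A minimal sketch of this cache-first lookup with DRAM fallback follows; the plain dictionaries stand in for memory cache 257 and DRAM 258, and the real path through the driver and cache engine is simplified away.

```python
# Toy sketch of the lookup path described above: check memory cache 257 first,
# fall back to DRAM 258 for any block that is no longer cached.
def read_data(block_ids, memory_cache, dram):
    results = {}
    for block_id in block_ids:
        if block_id in memory_cache:
            results[block_id] = memory_cache[block_id]   # fast path: still cached
        else:
            results[block_id] = dram[block_id]           # forwarded to slower DRAM
    return results


# Example: only part of the requested data is still in the cache, so the
# remainder is obtained from DRAM.
memory_cache = {"feature_a": b"\x01\x02"}
dram = {"feature_a": b"\x01\x02", "feature_b": b"\x03\x04"}
print(read_data(["feature_a", "feature_b"], memory_cache, dram))
```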
Thus, in one or more implementations, compiler 215 may choose to place data for subsequent access in one or more of: an L2 cache 252 (e.g., corresponding to the fastest relative access), a DRAM 258 with a cache indicator that enables use of the memory cache 257 (e.g., corresponding to the second fastest relative access), and/or a DRAM 258 without a cache indicator for the memory cache 257 (e.g., corresponding to the third fastest relative access).
As also shown, driver 260 is provided by an Operating System (OS) running on electronic device 115. In an example, the driver 260 allows other software (e.g., one or more applications 270) to communicate with firmware that enables such software to control (e.g., by executing commands) one or more components of hardware, such as the neural processor 256, the CPU 254, the GPU 255, the memory cache 257, and/or the DRAM 258 included in the electronic device 115. As discussed further herein, the driver 260 may request various operations involving the memory cache 257 and/or the DRAM 258 based at least in part on cache indicators included in one or more memory transactions as part of executing a given NN model. Additionally, while one driver is shown in the example of FIG. 2 for simplicity, it should be understood that in a particular implementation, various drivers for hardware components are provided. For example, in addition to the drivers for memory cache 257 and/or DRAM 258, a respective driver may be provided for each of the processors described above.
In a particular implementation, during runtime of the NN model, a client application from the applications 270 executing the binary file of the NN model may send operations (e.g., a request including a set of instructions and/or a cache indicator) to the driver 260 to facilitate processing by the neural processor 256, the CPU 254, the GPU 255, the memory cache 257, and/or the DRAM 258. In particular implementations, driver 260 may receive such operations from a client application and forward the operations (e.g., as they relate to a memory transaction) to a cache engine (not shown) provided by memory cache 257 for processing. Based on the cache indicator, the cache engine may determine whether to allocate memory in memory cache 257, evict a portion of the data in memory cache 257, or allocate memory in DRAM 258. Exemplary interactions between the driver 260 and the memory cache 257 are further discussed below in FIG. 3.
Recently, specialized (e.g., dedicated) hardware has been developed that is optimized for performing specific operations from a given NN. A given electronic device may include a neural processor 256, which may be implemented as circuitry that performs various machine learning operations based on computations including multiplication, addition, and accumulation. Such calculations may be arranged to perform, for example, a convolution of the input data. In an example, the neural processor 256 is specifically configured to execute machine learning algorithms, typically by operating on a predictive model such as NN. In one or more implementations, the electronic device may include a neural processor 256 in addition to the CPU 254 and/or GPU 255.
As discussed herein, a CPU may refer to a main processor in a given electronic device that performs the operations of basic arithmetic, logic, control, and input/output operations specified by instructions of a computer program or application, including some operations of a neural network model. As discussed herein, a GPU may refer to a specialized electronic circuit designed to perform operations for rendering graphics, which may also be used in many cases to process computational workloads (e.g., as specified by instructions of a computer program or application) for machine learning operations. The CPU, GPU, and neural processor may each have different computational specifications and capabilities depending on their respective implementations, where each of the above components may provide different degrees of performance for certain operations than others.
FIG. 3 illustrates an example of processing machine learning operations with respect to an on-chip memory (e.g., memory cache 257) and/or an off-chip memory (e.g., DRAM 258) based on a cache indicator provided with the operations. FIG. 3 will be discussed with reference to the components of the computing architecture 200 depicted in FIG. 2.
As shown in FIG. 3, the driver 260 may receive a machine learning (ML) operation 310 (e.g., from a client application executing the NN model), which is part of a memory transaction of the neural network model. The driver 260 may analyze the cache indicators 312 provided with the ML operation 310 to determine whether to request allocation of memory in the memory cache 257. The driver 260 may utilize knowledge of the respective quota of the memory cache 257 allocated to each processor on the target device, such as the electronic device 115, to determine whether allocation is feasible based on the amount of available memory. As shown, within the memory cache 257, the driver 260 may allocate a quota 350 for the neural processor 256, a quota 355 for the GPU 255, and a quota 360 for the CPU 254. For example, if the size of memory cache 257 is 16 megabytes (16 MB), then quota 350 may be 4 MB in size, quota 355 may be 8 MB in size, and quota 360 may be 4 MB in size. The driver 260 may also share information about the quotas with the cache engine 320, which processes requests from the driver 260, as discussed further below.
However, it should be understood that when memory on the electronic device 115 is shared with other applications and/or other NN models that are executing concurrently with the NN model, the driver 260 may dynamically adjust the respective size of each quota during runtime of the NN model. In an example, the driver 260 may receive multiple ML operations, involving different memory transactions, from two or more respective applications that each execute a respective NN model. In a particular implementation, driver 260 may determine respective sizes of the memory allocations for the ML operations and sum the respective sizes to determine a combined memory allocation size. The driver 260 may then adjust the respective sizes of the quotas based on the combined memory allocation size, and may also notify the cache engine 320 of the adjusted quotas. Further, when other applications and/or NN models cease executing, the driver 260 may adjust the respective sizes of the quotas to reclaim the memory no longer used by the applications and/or NN models that are no longer executing.
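A simplified sketch of this quota bookkeeping follows. The initial 4 MB / 8 MB / 4 MB split mirrors the example above; the proportional resize rule is an assumption made for illustration, not the policy actually applied by driver 260.

```python
# Simplified, assumed sketch of per-processor quotas for memory cache 257 and
# their dynamic adjustment when other models/apps need cache space.
MB = 1024 * 1024

quotas = {"neural_processor": 4 * MB, "gpu": 8 * MB, "cpu": 4 * MB}
CACHE_SIZE = 16 * MB


def adjust_quotas(quotas, cache_size, external_demand_bytes):
    """Scale quotas down proportionally to make room for other allocations."""
    available = max(cache_size - external_demand_bytes, 0)
    scale = available / cache_size
    return {proc: int(q * scale) for proc, q in quotas.items()}


# Example: two other concurrently executing NN models together need 4 MB, so
# each engine's quota is reduced and the cache engine would be notified.
adjusted = adjust_quotas(quotas, CACHE_SIZE, external_demand_bytes=4 * MB)
print({proc: q // MB for proc, q in adjusted.items()})
# {'neural_processor': 3, 'gpu': 6, 'cpu': 3}
```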
In an example, driver 260 may forward the request to cache engine 320 for use in allocating memory in memory cache 257. In a particular implementation, the cache engine 320 may be a hardware cache controller provided by a target device, such as the electronic device 115, which may be included as part of the SOC 250. In another implementation, the cache engine 320 may be a software component (e.g., a security daemon application) or implemented in firmware of the electronic device 115.
Upon receiving the request from driver 260, cache engine 320 may perform the allocation of memory in memory cache 257 corresponding to CPU 254, GPU 255, or neural processor 256, as requested. In examples where the cache engine 320 is unable to allocate the requested memory, the driver 260 may receive an indication from the cache engine 320 that the request has failed. In response, driver 260 may instead request an allocation of memory from DRAM 258.
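The request-and-fallback flow between driver 260 and cache engine 320 might look like the following sketch; the CacheEngine class and its allocate() method are stand-ins invented for this example, not the interface of the actual hardware cache controller.

```python
# Illustrative stand-in for cache engine 320 and the driver-side fallback to DRAM.
class CacheEngine:
    def __init__(self, capacity_bytes):
        self.free_bytes = capacity_bytes

    def allocate(self, size_bytes):
        """Return True on success, False if the allocation request fails."""
        if size_bytes <= self.free_bytes:
            self.free_bytes -= size_bytes
            return True
        return False


def handle_ml_operation(cache_indicator, size_bytes, cache_engine):
    """Driver-side decision: try the shared cache when indicated, else use DRAM."""
    if cache_indicator == "allocate" and cache_engine.allocate(size_bytes):
        return "memory_cache"
    # The indicator requested DRAM, or the cache allocation request failed.
    return "dram"


engine = CacheEngine(capacity_bytes=4 * 1024 * 1024)
print(handle_ml_operation("allocate", 3 * 1024 * 1024, engine))  # memory_cache
print(handle_ml_operation("allocate", 3 * 1024 * 1024, engine))  # dram (quota exhausted)
```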
FIG. 4 illustrates a flow diagram of an example process 400 for compiling a neural network using cache indicators in accordance with one or more implementations. For purposes of explanation, the process 400 is described herein primarily with reference to components of the computing architecture 200 of fig. 2, which may be executed by one or more processors of the electronic device 110 of fig. 1. However, process 400 is not limited to electronic device 110, and one or more blocks (or operations) of process 400 may be performed by one or more other components of other suitable devices, such as by electronic device 115. For further explanation purposes, the blocks of process 400 are described herein as occurring sequentially or linearly. However, multiple blocks of process 400 may occur in parallel. Further, the blocks of process 400 need not be performed in the order shown, and/or one or more blocks of process 400 need not be performed and/or may be replaced by other operations.
Compiler 215 receives code corresponding to a Neural Network (NN) model (410). In an example, the code includes specific operations performed by the NN model. At least some of the specific operations include respective data to be stored in a memory of the electronic device during execution of the NN model.
From among the specific operations, the compiler 215 determines a set of operations to be preferably allocated to a shared cache of the electronic device that is to execute the NN model (412). In particular implementations, compiler 215 determines the set of operations based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation, or data that is accessed by two respective operations executed by the NN model.
Additionally, compiler 215 generates a set of cache indicators corresponding to the determined set of operations (414). In particular implementations, the set of cache indicators includes information indicating whether memory allocation in the shared cache is requested. Further, compiler 215 compiles the code and the generated set of cache indicators to provide a compiled binary for the NN model to execute on the target device (416). For example, this may correspond to generating binary code with the generated set of cache indicators to provide a compiled binary file for the NN model.
FIG. 5 illustrates a flow diagram of an example process for allocating memory for a neural network based on cache indicators in memory transactions, according to one or more implementations. For purposes of explanation, the process 500 is described herein primarily with reference to components of the computing architecture 200 of fig. 2, which may be executed by one or more processors of the electronic device 110 of fig. 1. However, process 500 is not limited to electronic device 110, and one or more blocks (or operations) of process 500 may be performed by one or more other components of other suitable devices, such as by electronic device 115. Further for purposes of explanation, the blocks of process 500 are described herein as occurring sequentially or linearly. However, multiple blocks of process 500 may occur in parallel. Further, the blocks of process 500 need not be performed in the order shown, and/or one or more blocks of process 500 need not be performed and/or may be replaced by other operations.
The driver 260 receives a request to perform an operation by the neural network model (510). In an example, the request includes a cache indicator having information indicating whether the operation includes an allocation of memory in a cache provided by the computing device.
Driver 260 determines, based at least in part on the cache indicator and the operation, to make a request for an allocation of memory in the cache (512). The driver 260 sends the request for the allocation of memory to the cache engine to complete the allocation of memory in the cache (514).
FIG. 6 illustrates an electronic system 600 that may be utilized to implement one or more implementations of the subject technology. Electronic system 600 may be and/or may be part of electronic device 110, electronic device 115, and/or server 120 shown in fig. 1. Electronic system 600 may include various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 600 includes bus 608, one or more processing units 612, system memory 604 (and/or cache), ROM 610, persistent storage 602, input device interface 614, output device interface 606, and one or more network interfaces 616, or subsets and variations thereof.
Bus 608 generally represents all of the system bus, peripheral buses, and chipset buses that communicatively connect the many internal devices of electronic system 600. In one or more implementations, the bus 608 communicatively connects one or more processing units 612 with the ROM 610, the system memory 604, and the permanent storage device 602. From these various memory units, one or more processing units 612 retrieve instructions to be executed and data to be processed in order to perform the processes of the subject disclosure. In different implementations, the one or more processing units 612 may be a single processor or a multi-core processor.
ROM 610 stores static data and instructions for one or more processing units 612 as well as other modules of electronic system 600. On the other hand, persistent storage device 602 may be a read-write memory device. Persistent storage 602 may be a non-volatile memory unit that stores instructions and data even when electronic system 600 is turned off. In one or more implementations, a mass storage device (such as a magnetic disk or optical disc and its corresponding magnetic disk drive) may be used as the persistent storage device 602.
In one or more implementations, a removable storage device (such as a floppy disk, a flash drive, and their corresponding disk drives) may be used as the persistent storage device 602. Like the persistent storage device 602, the system memory 604 may be a read-write memory device. However, unlike the persistent storage device 602, the system memory 604 may be a volatile read-and-write memory, such as a random access memory. System memory 604 may store any of the instructions and data that may be needed by one or more processing units 612 during runtime. In one or more implementations, the processes of the subject disclosure are stored in the system memory 604, the persistent storage device 602, and/or the ROM 610. From these various memory units, one or more processing units 612 retrieve instructions to be executed and data to be processed in order to perform one or more embodied processes.
The bus 608 is also connected to an input device interface 614 and an output device interface 606. Input device interface 614 enables a user to communicate information and select commands to electronic system 600. Input devices that may be used with input device interface 614 may include, for example, an alphanumeric keyboard and a pointing device (also referred to as a "cursor control device"). The output device interface 606 may, for example, enable display of images generated by the electronic system 600. Output devices that may be used with output device interface 606 may include, for example, printers and display devices, such as Liquid Crystal Displays (LCDs), Light Emitting Diode (LED) displays, Organic Light Emitting Diode (OLED) displays, flexible displays, flat panel displays, solid state displays, projectors, or any other device for outputting information. One or more implementations may include a device that acts as both an input device and an output device, such as a touch screen. In these implementations, the feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in FIG. 6, bus 608 also couples electronic system 600 to one or more networks and/or to one or more network nodes, such as electronic device 115 shown in FIG. 1, through one or more network interfaces 616. In this manner, electronic system 600 may be part of a computer network, such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet, or may be part of a network of networks, such as the Internet. Any or all of the components of electronic system 600 may be used in conjunction with the subject disclosure.
One aspect of the present technology may include accessing data. The present disclosure contemplates that, in some cases, the data may include personal information data that uniquely identifies or may be used to identify a particular person. Such personal information data may include demographic data, location-based data, online identifiers, phone numbers, email addresses, home addresses, data or records related to the user's health or fitness level (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be useful to benefit the user. For example, personal information data may be used in various machine learning applications. Thus, using such personal information data may confer the benefit of such machine learning applications to users.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will comply with established privacy policies and/or privacy practices. In particular, it would be desirable for such entities to implement and consistently apply privacy practices generally recognized as meeting or exceeding industry or government requirements for maintaining user privacy. Such information regarding the usage of personal data should be prominently and conveniently accessible to the user and should be updated as the data is collected and/or used. The user's personal information should be collected for legitimate uses only. In addition, such collection/sharing should only occur after receiving user consent or on another legal basis as set forth in applicable law. Furthermore, such entities should consider taking any necessary steps to safeguard and secure access to such personal information data, and to ensure that others who have access to the personal information data comply with their privacy policies and procedures. In addition, such entities may subject themselves to third-party evaluations to prove compliance with widely accepted privacy policies and practices. Moreover, policies and practices should be tailored to the particular type of personal information data being collected and/or accessed, and adapted to applicable laws and standards, including jurisdiction-specific considerations that may impose higher standards. For example, in the United States, the collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); other countries may have health data subject to other regulations and policies, and such data should be handled accordingly.
Regardless of the foregoing, the present disclosure also contemplates embodiments in which a user selectively blocks the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, in the case of a machine learning application, the present technology may be configured to allow a user to opt in to or opt out of participating in the collection of personal information data during registration for a service or at any time thereafter. In addition to providing "opt-in" and "opt-out" options, the present disclosure contemplates providing notifications related to accessing or using personal information. For example, a user may be notified that their personal information data will be accessed when the application is downloaded, and then reminded again just before the personal information data is accessed by the application.
Further, it is the intent of the present disclosure that personal information data should be managed and handled in a way that minimizes the risk of inadvertent or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting the data once it is no longer needed. In addition, and when applicable, including in certain health-related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, where appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at the city level rather than at the address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that the various embodiments can be implemented without the need to access such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data. For example, content may be selected and delivered to users based on aggregated non-personal information data or an absolute minimum amount of personal information, such as content being processed only on the user's device or other non-personal information available to the content delivery service.
Implementations within the scope of the present disclosure may be realized, in part or in whole, by a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) having one or more instructions written thereon. The tangible computer readable storage medium may also be non-transitory in nature.
A computer-readable storage medium may be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device and that includes any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium may include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer readable medium may also include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash memory, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Further, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to the computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to the computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
The instructions may be directly executable or may be used to develop executable instructions. For example, the instructions may be implemented as executable or non-executable machine code, or may be implemented as high-level language instructions that may be compiled to produce executable or non-executable machine code. Further, instructions may also be implemented as, or may include, data. Computer-executable instructions may also be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, and the like. As those skilled in the art will recognize, details including, but not limited to, number, structure, sequence, and organization of instructions may vary significantly without changing the underlying logic, function, processing, and output.
Although the above discussion has primarily referred to microprocessors or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.
Those skilled in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged differently (e.g., arranged in a different order, or divided in a different manner) without departing from the scope of the subject technology.
It is understood that the specific order or hierarchy of blocks in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Some of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this patent application, the terms "base station," "receiver," "computer," "server," "processor," and "memory" all refer to electronic or other technical devices. These terms exclude a person or group of persons. For the purposes of this specification, the term "display" or "displaying" means displaying on an electronic device.
As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words "configured to," "operable to," and "programmed to" do not imply any particular tangible or intangible modification of a subject but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation. Likewise, a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, a specific implementation, the specific implementation, another specific implementation, some specific implementation, one or more specific implementations, embodiments, the embodiment, another embodiment, some embodiments, one or more embodiments, configurations, the configuration, other configurations, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof, and the like are for convenience and do not imply that a disclosure relating to such phrase or phrases is essential to the subject technology, nor that such disclosure applies to all configurations of the subject technology. Disclosure relating to such one or more phrases may apply to all configurations or one or more configurations. Disclosure relating to such one or more phrases may provide one or more examples. Phrases such as an aspect or some aspects may refer to one or more aspects and vice versa and this applies similarly to the other preceding phrases.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" or as "exemplary" is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the terms "includes," has, "" having, "" has, "" with, "" has, "" having, "" contains, "" containing, "" contain.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for."
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." Unless specifically stated otherwise, the term "some" refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.

Claims (20)

1. A method, comprising:
receiving code corresponding to a Neural Network (NN) model, the code comprising specific operations performed by the NN model, wherein at least some of the specific operations comprise respective data to be stored in a memory of an electronic device during execution of the NN model;
from among the specific operations, determining a set of operations to be assigned to a cache of the electronic device that is to execute the NN model;
generating a set of cache indicators corresponding to the determined set of operations, wherein the set of cache indicators includes information indicating whether memory allocation in the cache is requested; and
compiling the code and the generated set of cache indicators to provide a compiled binary file for the NN model to execute on a target device.
2. The method of claim 1, wherein the particular operations are performed by at least one of a neural processor, a GPU, or a CPU, and each of the particular operations corresponds to at least a machine learning operation performed by the NN model, and the cache is shared among the neural processor, the GPU, and the CPU.
3. The method of claim 2, wherein at least one of the neural processor, the GPU, or the CPU is assigned a respective quota of memory based at least in part on a predetermined amount of memory used by the particular operation when the NN model is executed by the target device.
4. The method of claim 3, wherein the respective quota of memory is constrained based at least in part on a size of cache memory provided by the target device, and
the respective quota of memory is dynamic such that, during execution of the NN model by the target device, a particular processor of the target device is enabled to request allocation of memory based at least in part on the respective quota of memory.
5. The method of claim 1, wherein the set of operations includes only one operation.
6. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations further comprises generating additional information indicating that the particular operation uses data only once and that the data is to be stored in a second memory slower than the cache.
7. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations further comprises generating additional information indicating that the particular operation uses data multiple times and that the data is to be stored in the cache.
8. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations comprises generating additional information indicating a cache delete operation to invalidate a portion of the cache corresponding to data that is no longer utilized by the determined set of operations.
9. The method of claim 1, wherein determining the set of operations is based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation.
10. The method of claim 1, wherein the set of operations to be allocated to the cache is based at least in part on a set of priorities indicating that particular data is given a higher priority than other data to place in the cache based on performance requirements or energy requirements.
11. A system, comprising:
a processor;
a memory device including instructions that, when executed by the processor, cause the processor to:
receive code corresponding to a Neural Network (NN) model, the code comprising particular operations performed by the NN model, wherein at least some of the particular operations comprise respective data to be stored in a memory of an electronic device during execution of the NN model;
determine, from among the particular operations, a set of operations to be assigned to a cache of the electronic device that is to execute the NN model;
generate a set of cache indicators corresponding to the determined set of operations, wherein the set of cache indicators includes information indicating whether memory allocation in the cache is requested; and
compile the code and the generated set of cache indicators to provide a compiled binary file for the NN model to execute on a target device.
12. The system of claim 11, wherein the particular operations are performed by at least one of a neural processor, a GPU, or a CPU, and each of the particular operations corresponds to at least a machine learning operation performed by the NN model, and the cache is shared among the neural processor, the GPU, and the CPU.
13. The system of claim 12, wherein at least one of the neural processor, the GPU, or the CPU is assigned a respective quota of memory based at least in part on a predetermined amount of memory used by the particular operation when the NN model is executed by the target device.
14. The system of claim 13, wherein the respective quota of memory is constrained based at least in part on a size of cache memory provided by the target device, and
the respective quota of memory is dynamic such that, during execution of the NN model by the target device, a particular processor of the target device is enabled to request allocation of memory based at least in part on the respective quota of memory.
15. The system of claim 14, wherein the set of operations includes only one operation.
16. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating that the particular operation uses data only once and that the data is to be stored in a second memory slower than the cache.
17. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating that the particular operation uses data multiple times and that the data is to be stored in the cache.
18. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating a cache delete operation to invalidate a portion of the cache corresponding to data no longer utilized by the determined set of operations.
19. The system of claim 11, wherein determining the set of operations is based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation.
20. A non-transitory computer-readable medium comprising instructions that, when executed by a computing device, cause the computing device to perform operations comprising:
receiving a request to perform an operation by a neural network model, the request including a cache indicator having information indicating whether the operation includes an allocation of memory in a cache provided by the computing device;
determining, based at least in part on the cache indicator and the operation, to make a request for the allocation of the memory in the cache; and
sending the request for the allocation of the memory to a cache engine to complete the allocation of the memory in the cache.
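The method of claims 1-10 (and the corresponding system of claims 11-19) describes a compile-time pass that decides which operations' data belongs in the shared cache and emits cache indicators alongside the compiled binary, while claim 20 describes the runtime consulting those indicators when requesting allocations from a cache engine. The following sketch is a minimal illustration of how such a pass could be written; it is not the patented implementation, and every name in it (Operation, CacheIndicator, CacheAction, generate_cache_indicators, the reuse-and-quota heuristic, and the example tensor sizes) is a hypothetical construction for this write-up.

# Hypothetical sketch of the compile-time pass of claims 1-10: data reused by
# later operations is marked for allocation in the shared cache (subject to a
# quota), single-use data is directed to slower memory, and a drop indicator is
# emitted once cached data has no remaining consumers.

from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Set


class CacheAction(Enum):
    ALLOCATE = auto()   # request allocation in the shared cache (data is reused)
    BYPASS = auto()     # store in slower memory (single-use data or quota exhausted)
    DROP = auto()       # invalidate the cache portion holding data no longer used


@dataclass
class Operation:
    name: str          # e.g. "conv1"
    data_id: str       # identifier of the tensor this operation reads/writes
    data_bytes: int    # size of that tensor
    uses_of_data: int  # how many operations in the model access this tensor


@dataclass
class CacheIndicator:
    op_name: str
    data_id: str
    action: CacheAction


def generate_cache_indicators(ops: List[Operation],
                              cache_quota_bytes: int) -> List[CacheIndicator]:
    """Emit cache indicators for each operation, within a memory quota."""
    indicators: List[CacheIndicator] = []
    resident: Set[str] = set()      # tensors currently assigned to the cache
    used_bytes = 0
    size = {op.data_id: op.data_bytes for op in ops}
    last_use = {op.data_id: i for i, op in enumerate(ops)}  # last op touching each tensor

    for i, op in enumerate(ops):
        if op.data_id in resident:
            pass  # already cached by an earlier operation; nothing new to emit
        elif op.uses_of_data > 1 and used_bytes + op.data_bytes <= cache_quota_bytes:
            indicators.append(CacheIndicator(op.name, op.data_id, CacheAction.ALLOCATE))
            resident.add(op.data_id)
            used_bytes += op.data_bytes
        else:
            indicators.append(CacheIndicator(op.name, op.data_id, CacheAction.BYPASS))

        # After the last consumer runs, tell the runtime to invalidate that data.
        if op.data_id in resident and last_use[op.data_id] == i:
            indicators.append(CacheIndicator(op.name, op.data_id, CacheAction.DROP))
            resident.remove(op.data_id)
            used_bytes -= size[op.data_id]

    return indicators


if __name__ == "__main__":
    ops = [
        Operation("conv1", "act0", 512 * 1024, uses_of_data=2),  # reused -> cache
        Operation("relu1", "act0", 512 * 1024, uses_of_data=2),  # last use -> drop
        Operation("pool1", "act1", 256 * 1024, uses_of_data=1),  # single use -> bypass
    ]
    for ind in generate_cache_indicators(ops, cache_quota_bytes=1024 * 1024):
        print(ind.op_name, ind.data_id, ind.action.name)

In a full system, the reuse counts would come from the model graph, the quota would be split among the neural processor, GPU, and CPU as in claims 3-4 and 13-14, and the runtime of claim 20 would translate ALLOCATE indicators into allocation requests sent to a cache engine; those details are outside this sketch.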
CN202010322486.6A 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches Active CN112015675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311593938.4A CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962855900P 2019-05-31 2019-05-31
US62/855,900 2019-05-31
US16/601,501 2019-10-14
US16/601,501 US11080200B2 (en) 2019-05-31 2019-10-14 Allocation of machine learning tasks into a shared cache

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311593938.4A Division CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Publications (2)

Publication Number Publication Date
CN112015675A true CN112015675A (en) 2020-12-01
CN112015675B CN112015675B (en) 2023-12-01

Family

ID=73506509

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311593938.4A Pending CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches
CN202010322486.6A Active CN112015675B (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311593938.4A Pending CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Country Status (1)

Country Link
CN (2) CN117632785A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091456A1 (en) * 2003-10-23 2005-04-28 Huck Jerome C. Determining an arrangement of data in a memory for cache efficiency
US20050102657A1 (en) * 2003-10-10 2005-05-12 Lewis Brian T. Method and apparatus for feedback-based management of combined heap and compiled code caches
CN103765401A (en) * 2011-04-07 2014-04-30 威盛电子股份有限公司 Microprocessor that translates conditional load/store instructions into variable number of microinstructions
EP3396545A1 (en) * 2017-04-28 2018-10-31 INTEL Corporation Storage management for machine learning at autonomous machines
US20180315158A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US20190057036A1 (en) * 2018-10-15 2019-02-21 Amrita MATHURIYA Programmable interface to in-memory cache processor
US20190286973A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
US20200379911A1 (en) * 2019-05-31 2020-12-03 Apple Inc. Allocation of machine learning tasks into a shared cache

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050102657A1 (en) * 2003-10-10 2005-05-12 Lewis Brian T. Method and apparatus for feedback-based management of combined heap and compiled code caches
US20050091456A1 (en) * 2003-10-23 2005-04-28 Huck Jerome C. Determining an arrangement of data in a memory for cache efficiency
CN103765401A (en) * 2011-04-07 2014-04-30 威盛电子股份有限公司 Microprocessor that translates conditional load/store instructions into variable number of microinstructions
EP3396545A1 (en) * 2017-04-28 2018-10-31 INTEL Corporation Storage management for machine learning at autonomous machines
US20180315158A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US20190286973A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
US20190057036A1 (en) * 2018-10-15 2019-02-21 Amrita MATHURIYA Programmable interface to in-memory cache processor
US20200379911A1 (en) * 2019-05-31 2020-12-03 Apple Inc. Allocation of machine learning tasks into a shared cache

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO, Sa; WEI, Min: "Heterogeneous many-core acceleration of the BCC_AGCM atmospheric general circulation model" (BCC_AGCM大气环流模式异构众核加速技术), Meteorological Science and Technology (气象科技), pages 33-37 *

Also Published As

Publication number Publication date
CN117632785A (en) 2024-03-01
CN112015675B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
US11175898B2 (en) Compiling code for a machine learning model for execution on a specialized processor
CN110678846B (en) Dynamic task allocation for neural networks
US11836635B2 (en) Mutable parameters for machine learning models during runtime
US20210398015A1 (en) Machine learning model compiler
US10515013B2 (en) Techniques for handling requests for data at a cache
US20200293866A1 (en) Methods for improving ai engine mac utilization
US9891836B2 (en) Page compression strategy for improved page out process
US10411977B2 (en) Visualization of workload distribution on server resources
CN108205469B (en) MapReduce-based resource allocation method and server
EP3977362A1 (en) Compiling code for a machine learning model for execution on a specialized processor
US20210397596A1 (en) Lookup table activation functions for neural networks
CN113822438A (en) Machine learning model training checkpoint
US11080200B2 (en) Allocation of machine learning tasks into a shared cache
US20200409757A1 (en) Managing workloads of a deep neural network processor
US11907764B2 (en) Managing computer resources for clinical applications
CN112015675B (en) Allocation of machine learning tasks into shared caches
US20200372408A1 (en) Machine Learning Model With Conditional Execution Of Multiple Processing Tasks
US20180218268A1 (en) System, method and computer program product for sensory stimulation to ameliorate a cognitive state
US11687789B2 (en) Decomposition of machine learning operations
US20210398021A1 (en) Execution of segmented machine learning models
CN112016668A (en) Variable parameters of machine learning model during runtime
CN112016681B (en) Decomposition of machine learning operations
US20200227164A1 (en) Entity condition analysis based on preloaded data
CN112016681A (en) Decomposition of machine learning operations
US20230229682A1 (en) Reduction of latency in retriever-reader architectures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant