CN112015675B - Allocation of machine learning tasks into shared caches - Google Patents

Allocation of machine learning tasks into shared caches

Info

Publication number
CN112015675B
CN112015675B (application CN202010322486.6A)
Authority
CN
China
Prior art keywords
cache
memory
operations
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010322486.6A
Other languages
Chinese (zh)
Other versions
CN112015675A (en)
Inventor
F·P·万纳
C·M·福雷特
姚笑终
S·哈雷哈拉苏巴曼尼安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 16/601,501 (US11080200B2)
Application filed by Apple Inc
Priority to CN202311593938.4A (published as CN117632785A)
Publication of CN112015675A
Application granted
Publication of CN112015675B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871Allocation or management of cache space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to allocation of machine learning tasks into shared caches. The subject technology receives code corresponding to a Neural Network (NN) model, the code including particular operations performed by the NN model. Among the particular operations, the subject technology determines a set of operations to be allocated to a cache of an electronic device that is to execute the NN model. The subject technology generates a set of cache indicators corresponding to the determined set of operations. The subject technology compiles the code and the generated set of cache indicators to provide a compiled binary file for the NN model for execution on a target device.

Description

Allocation of machine learning tasks into shared caches
Cross Reference to Related Applications
The present application claims the benefit of U.S. Provisional Patent Application Serial No. 62/855,900, entitled "ALLOCATION OF MACHINE LEARNING TASKS INTO A SHARED CACHE," filed on May 31, 2019, which is hereby incorporated by reference in its entirety and made part of this application for all purposes.
Technical Field
The present description relates generally to compiling a neural network model for execution on a target platform.
Background
Software engineers and scientists have been using computer hardware for machine learning to make improvements across different industry applications, including image classification, video analysis, speech recognition, and natural language processing. Notably, neural networks are being used more frequently to create systems that can perform different computing tasks based on training from large amounts of data.
Drawings
Some features of the subject technology are shown in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.
FIG. 1 illustrates an exemplary network environment in accordance with one or more implementations.
FIG. 2 illustrates an exemplary computing architecture for compiling a neural network with cache indicators in accordance with one or more implementations.
FIG. 3 illustrates an example of processing machine learning operations with respect to on-chip memory, such as cache, and/or off-chip memory, such as DRAM, based on cache indicators provided in the operations.
FIG. 4 illustrates a flow diagram of an exemplary process for compiling a neural network using a cache indicator, in accordance with one or more implementations.
FIG. 5 illustrates a flow diagram of an exemplary process for allocating memory for a neural network based on a cache indicator in a memory transaction, in accordance with one or more implementations.
FIG. 6 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.
Detailed Description
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of this specification. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and may be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
In recent years, the popularity of machine learning has increased substantially due to the availability of large amounts of training data and the advancement of more powerful and efficient computing hardware. One popular machine learning technique is to utilize deep neural networks to perform a set of machine learning tasks. A common approach is to utilize a Graphics Processing Unit (GPU) to train deep neural networks and also to execute the trained networks on new input data.
On a given platform for executing one or more neural networks, the platform may provide a limited amount of memory. For example, modern computing devices typically include various types of memory, including faster cache memory (e.g., on-chip memory) and slower main memory (e.g., off-chip memory), such as dynamic random access memory or DRAM. Executing such a neural network out of faster cache memory may improve the performance of the neural network, since the performance penalty of accessing slower DRAM is avoided. In addition, on some computing platforms (such as mobile devices), accessing DRAM also results in greater power consumption compared to accessing faster cache memory.
Implementations of the subject technology described herein improve the computing functionality of an electronic device by including cache indicators during the compilation process of a neural network, which enables a given neural network, where possible, to utilize faster cache memory while being executed by the electronic device. For example, a cache indicator may indicate that faster cache memory (e.g., on-chip memory) is preferred for a given task or operation of the neural network, e.g., in view of the relative performance penalty that would result from using slower off-chip memory (e.g., DRAM).
Such cache indicators enable other hardware components (e.g., a cache engine or controller) to perform allocation of cache memory during runtime, where allocation of cache memory may be prioritized for tasks or operations that prefer cache memory. Advantageously, the neural network may preferentially access faster cache memory and thus complete machine learning tasks faster. These benefits are therefore understood to improve the computing functionality of a given electronic device, such as an end-user device, which may generally have fewer available computing resources than, for example, one or more cloud-based servers.
FIG. 1 illustrates an exemplary network environment 100 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
The network environment 100 includes an electronic device 110, an electronic device 115, and a server 120. Network 106 may communicatively couple (directly or indirectly) electronic device 110 and/or server 120, electronic device 115 and/or server 120, and/or electronic device 110 and/or electronic device 115. In one or more implementations, the network 106 may be an interconnection network that may include the internet or devices communicatively coupled to the internet. For purposes of explanation, network environment 100 is shown in FIG. 1 as including electronic device 110, electronic device 115, and server 120; however, network environment 100 may include any number of electronic devices and any number of servers.
The electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smart phone, a peripheral device (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, etc. In fig. 1, by way of example, electronic device 110 is depicted as a desktop computer. The electronic device 110 may be and/or may include all or part of an electronic system discussed below with respect to fig. 6.
In one or more implementations, the electronic device 110 and/or the server 120 can provide a system for compiling a given neural network model. In an example, using compiled code, the subject system may create an executable software package to be deployed on a target platform, such as electronic device 115, under the direction of server 120. In executing the compiled code, the target platform may perform a given operation of the neural network model.
The electronic device 115 may be, for example, a portable computing device such as a laptop computer, a smart phone, a peripheral device (e.g., a digital camera, an earphone), a tablet device, a wearable device such as a watch, a band, etc., or any electronic device. The electronic device may also include processors with different computing capabilities, including, for example, a CPU, GPU, and/or a neural processor. In fig. 1, by way of example, the electronic device 115 is depicted as a smart phone device. In one or more implementations, the electronic device 115 may be and/or may include all or part of an electronic system discussed below with respect to fig. 6.
In one or more implementations, the server 120 deploys compiled code included in the executable software package to the target device for execution. In an example, the electronic device 115 may be a target device for receiving a software package having compiled neural network code and executing the compiled code in a runtime environment of the electronic device 115. The electronic device 115 (or any electronic device that is a target device) may include a framework that is enabled to perform operations in compiled code of the neural network. A framework may refer to a software environment that provides specific functionality as part of a larger software platform to facilitate development of software applications.
FIG. 2 illustrates an exemplary computing architecture 200 for compiling a neural network with cache indicators in accordance with one or more implementations. For purposes of illustration, the computing architecture is described as being provided by the electronic device 110 of fig. 1, such as by a processor and/or memory of the electronic device 110; however, the computing architecture may be implemented by any other electronic device. Not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the computing architecture 200 includes the electronic device 110 and the electronic device 115. The electronic device 110 includes a compiler 215 and a memory 240. Memory 240 includes Neural Network (NN) model source code 244 which, after compilation by compiler 215, produces a Neural Network (NN) binary executable 242 that may be deployed to different target platforms for execution. In an example, the NN model source code 244 may include code for various algorithms, either alone or in combination, for implementing specific functionality for execution on a given target device. As described above, the target device may include various hardware sensors and different processors (e.g., as provided by the electronic device 115) that may be utilized when running the NN binary executable 242 on the target device. In examples, the specific functionality may include image processing or computer vision related functions, speech recognition, natural language processing, and the like.
Although compiler 215 is provided on electronic device 110 in the example of fig. 2, in some implementations, a compiler may be provided on a particular electronic device (e.g., electronic device 115) that compiles source code locally and executes the compiled code on the same device. In implementations, the NN model source code 244 may be compiled for a particular target platform and then deployed to a different device, such as the electronic device 115, for execution. In an example, the NN model source code 244 may include at least code corresponding to a set of operations (e.g., machine learning tasks) to be performed by corresponding nodes from each layer of a given NN model. As mentioned herein, a machine learning task corresponds to at least one operation performed by a given node in a particular layer of a given NN model. It should be appreciated that in implementations, machine learning tasks may refer to various operations performed by multiple nodes in a network (e.g., in the same layer or layers). In an example, code of an operation in a layer of the NN is a respective function call for performing the operation and/or a set of parameters for the function call. Additionally, code corresponding to input and output features, data structures, and feature types may be included in the NN model source code 244.
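To make the discussion below concrete, the following is a minimal Python sketch of how such per-layer operations, their function-call parameters, and their input/output features might be represented prior to compilation. The patent specifies no code; the `NNOperation` class, its fields, and the toy two-operation model are hypothetical names introduced only for illustration, and later sketches in this description reuse them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class NNOperation:
    """One machine learning task: an operation performed by a node in a layer of the NN."""
    name: str                                             # e.g. "conv2d_0" (illustrative)
    layer: int                                            # index of the layer containing the node
    params: Dict[str, Any] = field(default_factory=dict)  # parameters of the function call
    inputs: List[str] = field(default_factory=list)       # names of input feature buffers
    outputs: List[str] = field(default_factory=list)      # names of output feature buffers

# A toy two-operation model standing in for NN model source code 244.
model_ops = [
    NNOperation("conv2d_0", layer=0, params={"kernel": 3}, inputs=["image"], outputs=["act0"]),
    NNOperation("pool_1", layer=1, params={"window": 2}, inputs=["act0"], outputs=["act1"]),
]
```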
As discussed further below, the target device (e.g., electronic device 115) may include multiple processors (e.g., CPUs, GPUs, neural processors (NPs)) for performing operations of a given NN model, where each processor has access to memory, such as a cache or slower Dynamic Random Access Memory (DRAM) provided by the target device, that is shared among the processors of the target device. Given the memory constraints of the target device, data for the various operations of the NN model performed by the foregoing processors may not always fit within the cache, which would provide better performance, and is instead stored in slower DRAM in order to complete such operations.
In implementations, the compiler 215 analyzes the NN model source code 244 and determines which data of a given Neural Network (NN) model will benefit from being placed in faster memory (e.g., memory cache 257) rather than slower memory (e.g., DRAM 258). Such data may include, for example, data corresponding to the input and output features described above, and/or data structures of the NN model. By way of example, the respective outputs of the operations performed by the NN model may be in the form of a data structure, such as a container (e.g., tensor) that may store data in N dimensions (e.g., matrix, vector, array of arrays, etc.).
In particular implementations, compiler 215 performs the following operations: 1) determines the machine learning tasks performed by the NN model based on the code, 2) determines which machine learning tasks should be allocated in the faster memory cache 257 to improve performance, and 3) generates cache indicators to associate with the respective machine learning tasks so that, during runtime, the compiled NN model can allocate memory cache 257 (e.g., where possible) or forgo memory cache 257 and place data in slower DRAM instead.
As referred to herein, a cache indicator may include information indicating whether to request allocation of memory in a shared cache or to perform another operation, such as evicting or invalidating data already stored in the shared cache. In an example, such information may be included in an instruction sent to a processor (e.g., CPU, GPU, NP) (e.g., as part of a memory transaction), which is then processed by the processor to determine whether to request allocation of memory within the cache or slower memory, or to evict a portion of memory. To allocate memory cache 257, compiler 215 may use knowledge of the size of memory cache 257 available on the target device to determine whether allocation in memory cache 257 is feasible.
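A minimal sketch of such a cache indicator, and of the compile-time feasibility check against the known cache size, is shown below. The enum values, the `CacheIndicator` fields, and the `allocation_feasible` helper are assumptions for illustration only; the patent does not prescribe any particular encoding.

```python
from dataclasses import dataclass
from enum import Enum, auto

class CacheAction(Enum):
    ALLOCATE_IN_CACHE = auto()   # request an allocation in the shared memory cache
    USE_DRAM = auto()            # do not request the cache; place the data in slower DRAM
    EVICT = auto()               # evict/invalidate data already stored in the shared cache

@dataclass
class CacheIndicator:
    action: CacheAction
    buffer: str        # the data buffer the indicator refers to
    size_bytes: int    # size the compiler expects the buffer to occupy

def allocation_feasible(indicator: CacheIndicator, cache_size_bytes: int) -> bool:
    """Compile-time check: only treat a cache allocation as feasible if the buffer
    could fit within the known size of the target device's cache."""
    return (indicator.action is not CacheAction.ALLOCATE_IN_CACHE
            or indicator.size_bytes <= cache_size_bytes)
```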
For a given cache indicator, compiler 215 may include information corresponding to a particular operation of a node of the NN, or associate the cache indicator with a set of operations performed by a single node or across different nodes and/or layers of the NN. In the aforementioned memory transaction, a cache indicator may be associated with each of the instructions, where the transaction may include a set of instructions that are ultimately sent to a processor (or to multiple processors, depending on the operations to be performed). In another example, not every instruction in a memory transaction includes a cache indicator, depending on the instruction. In one or more implementations, a cache indicator may be included only in operations where the preferred memory changes, for example from on-chip memory to off-chip memory or vice versa, such that when the preferred memory remains static for multiple consecutive operations, an indicator need not accompany each of those operations.
In implementations, the compiler 215 generates the cache indicators when compiling code for the NN model using the following policies/criteria. Data that is utilized only once is not preferred/prioritized for placement in the cache and may instead be placed in slower DRAM if needed. Conversely, data that is utilized more than once is preferred/prioritized for placement in the cache. For data that is utilized more than once, at the last operation that uses the data, compiler 215 may also determine whether to request eviction of the data from the cache (e.g., a cache delete operation to invalidate the portion of the cache that held the data). Further, compiler 215 may assign a first priority value to a first set of data such that this data is given a higher priority for placement in the cache than other data (e.g., data that has been assigned a lower priority value).
Such priorities may be based on performance requirements (e.g., cost), such as how quickly the data needs to be read, e.g., to meet the requirements of the machine learning task being performed by the NN, and/or whether the computational requirements of the task outweigh its memory requirements, in which case placing the data in slower memory does not significantly affect performance. Additionally, the compiler 215 may consider energy requirements, for example, whether tasks should be placed in the cache to meet the energy and/or thermal requirements of the target device executing the NN.
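The policy described in the preceding two paragraphs can be summarized in a short routine. The sketch below is a simplified reading of those criteria, reusing the hypothetical `NNOperation` and `CacheIndicator` types from the earlier sketches; buffer sizes are passed in explicitly because the patent does not say how the compiler obtains them.

```python
from collections import Counter
from typing import Dict, List

def generate_cache_indicators(ops: List[NNOperation],
                              buffer_sizes: Dict[str, int]) -> List[CacheIndicator]:
    """Emit indicators following the stated policy: single-use data goes to DRAM,
    reused data is preferred for the cache, and the last consumer of a cached
    buffer also requests its eviction."""
    use_count = Counter(name for op in ops for name in op.inputs)
    last_reader = {name: i for i, op in enumerate(ops) for name in op.inputs}

    indicators: List[CacheIndicator] = []
    for i, op in enumerate(ops):
        for name in op.outputs:
            action = (CacheAction.ALLOCATE_IN_CACHE if use_count[name] > 1
                      else CacheAction.USE_DRAM)
            indicators.append(CacheIndicator(action, name, buffer_sizes.get(name, 0)))
        for name in op.inputs:
            if use_count[name] > 1 and last_reader[name] == i:
                # Last operation that reads this buffer: free the cache space it held.
                indicators.append(CacheIndicator(CacheAction.EVICT, name,
                                                 buffer_sizes.get(name, 0)))
    return indicators
```

A fuller version would additionally rank competing buffers by the performance and energy costs discussed above when the cache cannot hold all of them.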
Compiler 215 further processes the source code using the generated cache indicators and compiles the code into an NN binary executable for the target device, which may be stored as the NN binary executable 242 and then deployed to the target device (e.g., electronic device 115) for execution. Although compiler 215 is provided on electronic device 110 in the example of fig. 2, in some implementations, a compiler may be provided on a particular electronic device that compiles code for a neural network model and executes the compiled neural network model on the same device. As described above, the neural network model may be compiled from NN model source code 244 for a particular target platform and then deployed to a different device, such as electronic device 115, for execution.
As further shown, in one or more implementations, the electronic device 115 includes a system on a chip (SOC) 250. SOC 250 may include an L2 cache 252 (e.g., on-chip memory), a CPU 254, a GPU 255, and a neural processor 256. Electronic device 115 also includes memory cache 257 and DRAM 258 (e.g., off-chip memory).
In one or more implementations, memory cache 257 may be on-chip (e.g., a portion of SOC 250, as shown in the example of fig. 2) or off-chip (not shown). Further, with respect to power, performance, and/or accessibility, memory cache 257 may fall between L2 cache 252 and DRAM 258. For example, memory cache 257 may be more general-purpose than L2 cache 252, but less general-purpose than DRAM 258.
DRAM 258 may be memory having a slower access speed than memory cache 257 and/or L2 cache 252. In one or more implementations, the DRAM 258 may be shared across multiple (e.g., all) tasks and processing units of the electronic device 115. Accessing the DRAM 258 may consume computing resources of the electronic device 115, because it may utilize a relatively significant amount of power and may affect performance of the NN model by slowing down memory-bound layers (e.g., pooling layers, element-wise layers, etc.) of the NN. In contrast, in implementations, memory cache 257 is faster than DRAM 258 but smaller in size than DRAM 258. Thus, data corresponding to the operations of the NN model (e.g., input data, output data, intermediate data, etc., at the time of processing) often does not fit in the memory cache 257 and is instead stored in the DRAM 258.
The use of memory cache 257 (e.g., based on a cache indicator) may be managed, for example, by a quota system or by general access permissions, such that access to memory cache 257 is provided for some tasks or engines but not for others. In one or more implementations, with respect to data requests, memory cache 257 may be checked before DRAM 258. For example, a cache indicator as described herein may be generated (e.g., by compiler 215) to allocate data to memory cache 257. However, the data may or may not still be available in memory cache 257. A request may be made (e.g., by a corresponding engine) to a driver to cause the data to be fetched from memory cache 257. If the data is still available in memory cache 257, the data may be obtained from memory cache 257 and sent to the corresponding engine. If the data is no longer available in memory cache 257, the request for the data may be forwarded to DRAM 258 and the data obtained from DRAM 258. In some cases, only part of the data may still be available in memory cache 257, in which case that portion is obtained from memory cache 257 and the remainder from DRAM 258.
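The cache-before-DRAM lookup described in this paragraph might be modeled as in the sketch below; the toy classes and the `fetch` helper are illustrative stand-ins, not the actual driver or cache-engine interfaces. A fuller model would also cover the partial-hit case, serving part of a buffer from the cache and the remainder from DRAM.

```python
from typing import Dict, Optional

class ToyMemoryCache:
    """Stand-in for memory cache 257: a small map of buffer names to bytes."""
    def __init__(self) -> None:
        self._lines: Dict[str, bytes] = {}

    def insert(self, buffer: str, data: bytes) -> None:
        self._lines[buffer] = data

    def lookup(self, buffer: str) -> Optional[bytes]:
        return self._lines.get(buffer)

class ToyDram:
    """Stand-in for DRAM 258: the slower backing store that always holds the data."""
    def __init__(self, contents: Dict[str, bytes]) -> None:
        self._contents = contents

    def read(self, buffer: str) -> bytes:
        return self._contents[buffer]

def fetch(buffer: str, cache: ToyMemoryCache, dram: ToyDram) -> bytes:
    """Check the shared cache before DRAM; fall back to DRAM if the data was evicted."""
    data = cache.lookup(buffer)
    if data is not None:
        return data            # served from the faster memory cache
    return dram.read(buffer)   # cache miss: forward the request to DRAM
```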
Thus, in one or more implementations, compiler 215 may choose to place data for subsequent access in one or more of: the L2 cache 252 (e.g., corresponding to the fastest relative access), the DRAM 258 with a cache indicator that enables use of the memory cache 257 (e.g., corresponding to the second fastest relative access), and/or the DRAM 258 without a cache indicator for the memory cache 257 (e.g., corresponding to the third fastest relative access).
As also shown, a driver 260 is provided by an Operating System (OS) running on the electronic device 115. In an example, the driver 260 allows other software (e.g., one or more applications 270) to communicate with firmware that enables such software to control (e.g., by executing commands) one or more hardware components, such as the neural processor 256, the CPU 254, the GPU 255, the memory cache 257, and/or the DRAM 258 included in the electronic device 115. As discussed further herein, the driver 260 may request various operations involving the memory cache 257 and/or the DRAM 258 based at least in part on cache indicators included in one or more memory transactions as part of executing a given NN model. In addition, while one driver is shown in the example of FIG. 2 for simplicity, it should be understood that in implementations, various drivers for the hardware components are provided. For example, in addition to the driver(s) for memory cache 257 and/or DRAM 258, a respective driver may be provided for each of the processors described above.
In implementations, during runtime of the NN model, a client application from the applications 270 executing the binary file of the NN model may send an operation (e.g., a request including a set of instructions and/or a cache indicator) to the driver 260 to facilitate processing by the neural processor 256, the CPU 254, the GPU 255, the memory cache 257, and/or the DRAM 258. In implementations, the driver 260 may receive such operations from the client application and forward the operations (e.g., when they involve memory transactions) to a cache engine (not shown) provided by the memory cache 257 for processing. Based on the cache indicator, the cache engine may determine whether to allocate memory in memory cache 257, evict a portion of the data in memory cache 257, or allocate memory in DRAM 258. Exemplary interactions between the driver 260 and the memory cache 257 are discussed further below with respect to FIG. 3.
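Driver-side routing based on the cache indicator might look like the following sketch, which reuses the hypothetical `CacheIndicator`/`CacheAction` types from above; the `allocate` and `evict` methods on the cache engine and DRAM objects are assumed interfaces, not APIs described in the patent.

```python
def dispatch_ml_operation(indicator: CacheIndicator, cache_engine, dram) -> None:
    """Illustrative driver-side routing: forward a memory transaction to the cache
    engine or to DRAM depending on the operation's cache indicator."""
    if indicator.action is CacheAction.ALLOCATE_IN_CACHE:
        cache_engine.allocate(indicator.buffer, indicator.size_bytes)
    elif indicator.action is CacheAction.EVICT:
        cache_engine.evict(indicator.buffer)
    else:
        dram.allocate(indicator.buffer, indicator.size_bytes)
```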
Recently, specialized (e.g., dedicated) hardware has been developed that is optimized for performing specific operations from a given NN. A given electronic device may include a neural processor 256, which may be implemented as circuitry that performs various machine learning operations based on computations, including multiplications, additions, and summations. Such calculations may be arranged to perform, for example, convolution of the input data. In an example, the neural processor 256 is specifically configured to execute a machine learning algorithm, typically by operating on a predictive model such as NN. In one or more implementations, the electronic device may include a neural processor 256 in addition to the CPU 254 and/or the GPU 255.
As discussed herein, a CPU may refer to a host processor in a given electronic device that performs the operations of basic arithmetic, logic, control, and input/output operations specified by instructions of a computer program or application, including some operations of a neural network model. As discussed herein, a GPU may refer to specialized electronic circuitry designed to perform operations for rendering graphics, which may also be used in many cases to handle computing workloads for machine learning operations (e.g., as specified by instructions of a computer program or application). The CPU, GPU, and neural processor may each have different computational specifications and capabilities, depending on their respective implementations, wherein each of the above components may provide a different degree of performance for certain operations than others.
Fig. 3 illustrates an example of processing machine learning operations with respect to on-chip memory, such as a cache (e.g., memory cache 257), and/or off-chip memory (e.g., DRAM 258), based on cache indicators provided in the operations. Fig. 3 will be discussed with reference to the components of the computing architecture 200 described in fig. 2.
As shown in fig. 3, the driver 260 may receive a Machine Learning (ML) operation 310 (e.g., from a client application executing the NN model) that is part of a memory transaction of the neural network model. The driver 260 may analyze the cache indicator 312 provided with the ML operation 310 to determine whether to request an allocation of memory in the memory cache 257. The driver 260 may utilize knowledge of the respective quota of memory cache 257 allocated to each processor on the target device, such as the electronic device 115, to determine whether the allocation is feasible based on the amount of available memory. As shown, driver 260 may allocate, within memory cache 257, a quota 350 for the neural processor 256, a quota 355 for the GPU 255, and a quota 360 for the CPU 254. For example, if memory cache 257 is 16 megabytes (16 MB) in size, quota 350 may be 4 MB, quota 355 may be 8 MB, and quota 360 may be 4 MB. The driver 260 may also share information about the quotas with a cache engine 320 that processes requests from the driver 260, as discussed further below.
However, it should be appreciated that when memory on the electronic device 115 is shared between other applications and/or other NN models that are also executing concurrently with the NN model, the driver 260 may dynamically adjust the respective size of each quota during the runtime of the NN model. In an example, the driver 260 may receive multiple ML operations involving different memory transactions from two or more respective applications each executing a respective NN model. In implementations, the driver 260 may determine the respective sizes of memory allocations for ML operations and sum the respective sizes to determine a combined memory allocation size. The driver 260 may then adjust the respective sizes of the quotas based on the combined memory allocation sizes, and may also inform the cache engine 320 of the adjusted quotas. Further, when other applications and/or NN models cease executing, the driver 260 may adjust the respective sizes of the quotas in response to memory not being utilized by the applications and/or NN models that are no longer executing.
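A toy version of this per-processor quota bookkeeping, using the 16 MB cache with the 4/8/4 MB split from the example above and a simple proportional rebalance for the dynamic case, is sketched below; the `QuotaManager` class and its methods are hypothetical, and real quota management is driver- and firmware-specific.

```python
from typing import Dict

class QuotaManager:
    """Illustrative per-processor bookkeeping for a 16 MB shared cache, mirroring
    the 4/8/4 MB split in the example above."""
    def __init__(self, total_bytes: int = 16 * 2**20) -> None:
        self.total = total_bytes
        self.quotas = {
            "neural_processor": 4 * 2**20,   # quota 350
            "gpu": 8 * 2**20,                # quota 355
            "cpu": 4 * 2**20,                # quota 360
        }

    def fits(self, processor: str, request_bytes: int) -> bool:
        """Can this processor's request be satisfied within its current quota?"""
        return request_bytes <= self.quotas.get(processor, 0)

    def rebalance(self, demands: Dict[str, int]) -> None:
        """Dynamically resize quotas in proportion to the combined demand of the
        NN models currently executing, without exceeding the total cache size."""
        combined = sum(demands.values()) or 1
        for processor, demand in demands.items():
            self.quotas[processor] = min(demand, self.total * demand // combined)
```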
In an example, the driver 260 may forward the request to the cache engine 320 for allocation of memory in the memory cache 257. In implementations, the cache engine 320 may be a hardware cache controller provided by a target device, such as the electronic device 115, which may be included as part of the SOC 250. In another implementation, the cache engine 320 may be a software component (e.g., a security daemon application) or implemented in firmware of the electronic device 115.
After receiving the request from driver 260, cache engine 320 may perform an allocation of memory in memory cache 257 corresponding to CPU 254, GPU 255, or neural processor 256, according to the request. In examples where the cache engine 320 cannot allocate the requested memory, the driver 260 may receive an indication from the cache engine 320 that the request has failed. In response, driver 260 may instead request an allocation of memory from DRAM 258.
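The failure path described here, in which the driver falls back to DRAM when the cache engine cannot satisfy a request, might be sketched as follows; the exception type and the object interfaces are assumptions for illustration.

```python
class CacheAllocationError(Exception):
    """Raised by the (hypothetical) cache engine when a request cannot be satisfied."""

def allocate_with_fallback(indicator: CacheIndicator, cache_engine, dram) -> str:
    """Request the shared cache first; if the cache engine reports failure, the
    driver instead requests an allocation in DRAM, as described above."""
    try:
        cache_engine.allocate(indicator.buffer, indicator.size_bytes)
        return "memory_cache"
    except CacheAllocationError:
        dram.allocate(indicator.buffer, indicator.size_bytes)
        return "dram"
```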
FIG. 4 illustrates a flow diagram of an exemplary process 400 for compiling a neural network using a cache indicator, in accordance with one or more implementations. For purposes of explanation, the process 400 is described herein primarily with reference to components of the computing architecture 200 of fig. 2, which may be executed by one or more processors of the electronic device 110 of fig. 1. However, process 400 is not limited to electronic device 110, and one or more blocks (or operations) of process 400 may be performed by one or more other components of other suitable devices, such as by electronic device 115. For further explanation purposes, the blocks of process 400 are described herein as occurring sequentially or linearly. However, multiple blocks of process 400 may occur in parallel. Furthermore, the blocks of process 400 need not be performed in the order shown, and/or one or more blocks of process 400 need not be performed and/or may be replaced by other operations.
Compiler 215 receives code corresponding to a Neural Network (NN) model (410). In an example, the code includes specific operations performed by the NN model. At least some of the specific operations include corresponding data to be stored in a memory of the electronic device during execution of the NN model.
Among the particular operations, the compiler 215 determines a set of operations that will be preferentially allocated to a shared cache of an electronic device that is to execute the NN model (412). In particular implementations, compiler 215 determines the set of operations based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation, or data that is accessed by two respective operations performed by the NN model.
In addition, compiler 215 generates a set of cache indicators corresponding to the determined set of operations (414). In implementations, the set of cache indicators includes information indicating whether allocation of memory in the shared cache is requested. In addition, compiler 215 compiles the code and the generated set of cache indicators to provide a compiled binary file for the NN model for execution on the target device (416). For example, this may correspond to generating binary code using the generated set of cache indicators to provide a compiled binary file for the NN model.
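Pulling the blocks of process 400 together, the compile path might be condensed as in the sketch below, which reuses the hypothetical helpers from the earlier sketches; `compile_to_binary` is a trivial placeholder for real code generation, not an actual API.

```python
from typing import Dict, List

def compile_to_binary(ops: List[NNOperation],
                      indicators: List[CacheIndicator]) -> bytes:
    """Placeholder for real code generation (416): serialize the plan."""
    plan = ([(op.name, op.params) for op in ops],
            [(ind.buffer, ind.action.name) for ind in indicators])
    return repr(plan).encode()

def compile_nn_model(ops: List[NNOperation],
                     buffer_sizes: Dict[str, int],
                     target_cache_bytes: int) -> bytes:
    """Process 400, condensed: (410) receive the model's operations, (412) decide
    which data should prefer the shared cache, (414) generate the corresponding
    cache indicators, and (416) compile code plus indicators into a binary."""
    indicators = generate_cache_indicators(ops, buffer_sizes)
    # Respect the known cache size on the target device: oversized allocations
    # are demoted to DRAM rather than emitted as cache requests.
    indicators = [ind if allocation_feasible(ind, target_cache_bytes)
                  else CacheIndicator(CacheAction.USE_DRAM, ind.buffer, ind.size_bytes)
                  for ind in indicators]
    return compile_to_binary(ops, indicators)
```

For example, `compile_nn_model(model_ops, {"act0": 1024, "act1": 512}, 16 * 2**20)` would yield a serialized plan for the toy model defined earlier.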
FIG. 5 illustrates a flow diagram of an exemplary process for allocating memory for a neural network based on a cache indicator in a memory transaction, in accordance with one or more implementations. For purposes of explanation, the process 500 is described herein primarily with reference to components of the computing architecture 200 of fig. 2, which may be executed by one or more processors of the electronic device 110 of fig. 1. However, process 500 is not limited to electronic device 110, and one or more blocks (or operations) of process 500 may be performed by one or more other components of other suitable devices, such as by electronic device 115. For further explanation purposes, the blocks of process 500 are described herein as occurring sequentially or linearly. However, multiple blocks of process 500 may occur in parallel. Furthermore, the blocks of process 500 need not be performed in the order shown, and/or one or more blocks of process 500 need not be performed and/or may be replaced by other operations.
The driver 260 receives a request to perform an operation by the neural network model (510). In an example, the request includes a cache indicator having information indicating whether the operation includes an allocation of memory in a cache provided by the computing device.
The driver 260 determines a request to allocate memory in the cache based at least in part on the cache indicator and the operation (512). The driver 260 sends a request for allocation of memory to the cache engine to complete the allocation of memory in the cache (514).
FIG. 6 illustrates an electronic system 600 that may be utilized to implement one or more implementations of the subject technology. Electronic system 600 may be and/or may be part of electronic device 110, electronic device 115, and/or server 120 shown in fig. 1. Electronic system 600 may include various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 600 includes bus 608, one or more processing units 612, system memory 604 (and/or cache), ROM 610, persistent storage 602, input device interface 614, output device interface 606, and one or more network interfaces 616, or subsets and variations thereof.
Bus 608 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of electronic system 600. In one or more implementations, bus 608 communicatively connects the one or more processing units 612 with the ROM 610, the system memory 604, and the persistent storage 602. The one or more processing units 612 retrieve instructions to execute and data to process from these various memory units in order to perform the processes of the subject disclosure. In different implementations, the one or more processing units 612 may be a single processor or a multi-core processor.
ROM 610 stores static data and instructions required by one or more processing units 612 and other modules of electronic system 600. On the other hand, persistent storage 602 may be a read-write memory device. Persistent storage 602 may be a non-volatile memory unit that stores instructions and data even when electronic system 600 is turned off. In one or more implementations, a mass storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as persistent storage device 602.
In one or more implementations, removable storage devices (such as floppy disks, flash memory drives, and their corresponding disk drives) may be used as persistent storage 602. As with persistent storage 602, system memory 604 may be a read-write memory device. However, unlike persistent storage 602, system memory 604 may be a volatile read-write memory, such as random access memory. The system memory 604 may store any of the instructions and data that may be needed by the one or more processing units 612 at runtime. In one or more implementations, the processes of the subject disclosure are stored in system memory 604, persistent storage 602, and/or ROM 610. The one or more processing units 612 retrieve the instructions to be executed and the data to be processed from these various memory units in order to perform one or more embodied processes.
Bus 608 is also connected to input device interface 614 and output device interface 606. The input device interface 614 enables a user to communicate information to the electronic system 600 as well as select commands. Input devices that may be used with input device interface 614 may include, for example, an alphanumeric keyboard and a pointing device (also referred to as a "cursor control device"). The output device interface 606 may, for example, enable display of images generated by the electronic system 600. Output devices that may be used with output device interface 606 may include, for example, printers and display devices, such as liquid crystal displays (LCDs), light emitting diode (LED) displays, organic light emitting diode (OLED) displays, flexible displays, flat panel displays, solid state displays, projectors, or any other device for outputting information. One or more implementations may include a device that serves as both an input device and an output device, such as a touch screen. In these implementations, the feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in fig. 6, bus 608 also couples electronic system 600 to one or more networks and/or to one or more network nodes, such as electronic device 115 shown in fig. 1, through one or more network interfaces 616. In this manner, electronic system 600 may be part of a computer network, such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet, or may be part of a network of networks, such as the Internet. Any or all of the components of electronic system 600 may be used with the subject disclosure.
One aspect of the present technology may include accessing data. This disclosure contemplates that in some cases, the data may include personal information data that uniquely identifies or may be used to identify a particular person. Such personal information data may include demographic data, location-based data, online identifiers, telephone numbers, email addresses, home addresses, data or records related to the user's health or fitness level (e.g., vital sign measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be used to benefit users. For example, personal information data may be used in various machine learning applications. Thus, the use of such personal information data may grant the user benefits of such machine learning applications.
The present disclosure contemplates that the entities responsible for collecting, analyzing, disclosing, transferring, storing, or otherwise using such personal information data will comply with well-established privacy policies and/or privacy practices. In particular, such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. Such information regarding the use of personal data should be prominent and easily accessible to users, and should be updated as the collection and/or use of data changes. Personal information from users should be collected only for legitimate uses. Further, such collection/sharing should occur only after receiving the consent of the users or another legitimate basis specified in applicable law. Additionally, such entities should consider taking any needed steps for safeguarding and securing access to such personal information data and for ensuring that others with access to the personal information data adhere to their privacy policies and procedures. Further, such entities can subject themselves to evaluation by third parties to certify their adherence to widely accepted privacy policies and practices. In addition, policies and practices should be adapted to the particular types of personal information data being collected and/or accessed, and adapted to applicable laws and standards, including jurisdiction-specific considerations that may serve to impose a higher standard. For instance, in the United States, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
Despite the foregoing, the present disclosure also contemplates implementations in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data. For example, in the case of machine learning applications, the present technology can be configured to allow users to select to "opt in" or "opt out" of participation in the collection of personal information data during registration for services or anytime thereafter. In addition to providing "opt in" and "opt out" options, the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an application that their personal information data will be accessed and then reminded again just before personal information data is accessed by the application.
Moreover, it is the intent of the present disclosure that personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed. In addition, and when applicable, including in certain health-related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at a city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
Therefore, although the present disclosure broadly covers use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that the various embodiments can be implemented without the need for accessing such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data. For example, content can be selected and delivered to users based on aggregated non-personal information data or a bare minimum amount of personal information, such as content being handled only on the user's device or other non-personal information available to content delivery services.
Implementations within the scope of the present disclosure may be partially or fully implemented using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) having one or more instructions written thereon. The tangible computer readable storage medium may also be non-transitory in nature.
A computer readable storage medium may be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing the instructions. By way of example, and not limitation, computer readable media can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer readable media may also include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash memory, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Furthermore, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to the computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to the computing device, for example, via one or more wired connections, one or more wireless connections, or any combination thereof.
The instructions may be directly executable or may be used to develop executable instructions. For example, the instructions may be implemented as executable or non-executable machine code, or may be implemented as high-level language instructions that may be compiled to produce executable or non-executable machine code. Further, the instructions may also be implemented as data, or may include data. Computer-executable instructions may also be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, and the like. As will be appreciated by one of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions may vary significantly without altering the underlying logic, functionality, processing, and output.
While the above discussion primarily refers to a microprocessor or multi-core processor executing software, one or more implementations are performed by one or more integrated circuits, such as an ASIC or FPGA. In one or more implementations, such integrated circuits execute instructions stored on the circuits themselves.
Those of skill in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged differently (e.g., arranged in a different order, or divided in a different manner) without departing from the scope of the subject technology.
It should be understood that the specific order or hierarchy of blocks in the processes disclosed herein is an illustration of exemplary approaches. Based upon design preferences, it should be understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks may be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this patent application, the terms "base station," "receiver," "computer," "server," "processor," and "memory" refer to an electronic or other technical device. These terms exclude a person or group of people. For purposes of this specification, the term "display" or "displaying" means displaying on an electronic device.
As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicates "configured to", "operable to", and "programmed to" do not mean any particular tangible or intangible modification to a subject but are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control operations or components may also mean that the processor is programmed to monitor and control operations or that the processor is operable to monitor and control operations. Likewise, a processor configured to execute code may be interpreted as a processor programmed to execute code or operable to execute code.
Phrases such as an aspect, this aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, other configurations, some configurations, one or more configurations, subject technology, disclosure, the present disclosure, other variations thereof, and the like are all for convenience and do not imply that disclosure involving such one or more phrases is essential to the subject technology nor that such disclosure applies to all configurations of the subject technology. The disclosure relating to such one or more phrases may apply to all configurations or one or more configurations. The disclosure relating to such one or more phrases may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other previously described phrases.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Furthermore, to the extent that the terms "include," "have," and the like are used in the description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise" as "comprise" is interpreted when employed as a transitional word in a claim.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public, regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112(f) unless the element is expressly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for."
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." Unless specifically stated otherwise, the term "some" refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.

Claims (20)

1. A method, comprising:
receiving code corresponding to a Neural Network (NN) model, the code comprising particular operations performed by the NN model, wherein at least some of the particular operations include respective data to be stored in a memory of an electronic device during execution of the NN model;
determining, from among the particular operations, a set of operations to be allocated to a cache of the electronic device for executing the NN model;
generating a set of cache indicators corresponding to the determined set of operations, wherein the set of cache indicators includes information indicating whether allocation of memory in the cache is requested; and
compiling the code and the generated set of cache indicators to provide a compiled binary file for the NN model for execution on a target device.
2. The method of claim 1, wherein the particular operations are performed by at least one of a neural processor, a GPU, or a CPU, and each of the particular operations corresponds to at least a machine learning operation performed by the NN model, and the cache is shared among the neural processor, the GPU, and the CPU.
3. The method of claim 2, wherein respective quotas of memory are allocated to at least one of the neural processor, the GPU, or the CPU based at least in part on a predetermined amount of memory used by the particular operation when the NN model is executed by the target device.
4. The method of claim 3, wherein the respective quotas of memory are constrained based at least in part on a size of the cache memory provided by the target device, and
the respective quotas of memory are dynamic such that, during execution of the NN model by the target device, a particular processor of the target device is enabled to request an allocation of memory based at least in part on the respective quotas of memory.
5. The method of claim 1, wherein the set of operations comprises only one operation.
6. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations further comprises generating additional information indicating that the particular operation uses data only once and that the data is to be stored in a second memory slower than the cache.
7. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations further comprises generating additional information indicating that the particular operation uses data multiple times and that the data is to be stored in the cache.
8. The method of claim 1, wherein generating the set of cache indicators corresponding to the determined set of operations comprises generating additional information indicating a cache delete operation to invalidate a portion of the cache storing data that is no longer utilized by the determined set of operations.
9. The method of claim 1, wherein determining the set of operations is based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation.
10. The method of claim 1, wherein the set of operations to be allocated to the cache is based at least in part on a set of priorities indicating that particular data is given a higher priority than other data for placement in the cache based on performance requirements or energy requirements.
11. A system, comprising:
a processor;
a memory device including instructions that, when executed by the processor, cause the processor to:
receive code corresponding to a Neural Network (NN) model, the code comprising particular operations performed by the NN model, wherein at least some of the particular operations include respective data to be stored in a memory of an electronic device during execution of the NN model;
determine, from among the particular operations, a set of operations to be allocated to a cache of the electronic device for executing the NN model;
generate a set of cache indicators corresponding to the determined set of operations, wherein the set of cache indicators includes information indicating whether allocation of memory in the cache is requested; and
compile the code and the generated set of cache indicators to provide a compiled binary file for the NN model for execution on a target device.
12. The system of claim 11, wherein the particular operations are performed by at least one of a neural processor, a GPU, or a CPU, and each of the particular operations corresponds to at least a machine learning operation performed by the NN model, and the cache is shared among the neural processor, the GPU, and the CPU.
13. The system of claim 12, wherein respective quotas of memory are allocated to at least one of the neural processor, the GPU, or the CPU based at least in part on a predetermined amount of memory used by the particular operations when the NN model is executed by the target device.
14. The system of claim 13, wherein the respective quotas of memory are constrained based at least in part on a size of the cache memory provided by the target device, and
the respective quotas of memory are dynamic such that, during execution of the NN model by the target device, a particular processor of the target device is enabled to request an allocation of memory based at least in part on the respective quotas of memory.
15. The system of claim 14, wherein the set of operations includes only one operation.
16. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating that the particular operation uses data only once and that the data is to be stored in a second memory slower than the cache.
17. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating that the particular operation uses data multiple times and that the data is to be stored in the cache.
18. The system of claim 11, wherein generating the set of cache indicators corresponding to the determined set of operations further causes the processor to generate additional information indicating a cache delete operation to invalidate a portion of the cache storing data that is no longer utilized by the determined set of operations.
19. The system of claim 11, wherein determining the set of operations is based at least in part on whether a particular operation uses data that is accessed more than once during execution of the particular operation.
20. A non-transitory computer-readable medium comprising instructions that, when executed by a computing device, cause the computing device to perform operations comprising:
receiving a request to perform an operation by a neural network model, the request including a cache indicator having information indicating whether the operation includes an allocation of memory in a cache provided by the computing device;
determining, based at least in part on the cache indicator and the operation, to make a request for an allocation of the memory in the cache; and
sending the request for the allocation of the memory to a cache engine to complete the allocation of the memory in the cache.
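To make the claimed flow more concrete, the following is a minimal, self-contained sketch (not code from the patent or from any particular framework) of how a compile-time pass could derive cache indicators from data reuse, in the spirit of claims 1 and 6-9: data read by more than one operation is marked for allocation in the shared cache, data read only once is routed to slower memory, and a drop hint is emitted after the last reader so the corresponding cache entries can be invalidated. All names (Op, CacheIndicator, build_cache_indicators, Placement) are hypothetical.

    from dataclasses import dataclass, field
    from enum import Enum, auto
    from typing import Dict, List


    class Placement(Enum):
        SHARED_CACHE = auto()   # data reused across operations: keep it in the shared cache
        SLOWER_MEMORY = auto()  # data used only once: bypass the cache (e.g., system DRAM)


    @dataclass
    class Op:
        name: str
        reads: List[str]    # tensors the operation consumes
        writes: List[str]   # tensors the operation produces
        processor: str      # "neural", "gpu", or "cpu"


    @dataclass
    class CacheIndicator:
        op_name: str
        request_allocation: bool                              # whether the op should ask for cache space
        placement: Placement
        drop_after: List[str] = field(default_factory=list)   # tensors safe to invalidate after this op


    def build_cache_indicators(ops: List[Op]) -> List[CacheIndicator]:
        """Tag each operation with a cache indicator based on simple data-reuse analysis."""
        # Count how many operations read each tensor.
        read_counts: Dict[str, int] = {}
        for op in ops:
            for t in op.reads:
                read_counts[t] = read_counts.get(t, 0) + 1

        remaining = dict(read_counts)
        indicators: List[CacheIndicator] = []
        for op in ops:
            # Outputs consumed more than once are candidates for the shared cache.
            reused = any(read_counts.get(t, 0) > 1 for t in op.writes)
            drops: List[str] = []
            for t in op.reads:
                remaining[t] -= 1
                if remaining[t] == 0 and read_counts[t] > 1:
                    drops.append(t)  # last reader of a cached tensor: emit a drop hint
            indicators.append(CacheIndicator(
                op_name=op.name,
                request_allocation=reused,
                placement=Placement.SHARED_CACHE if reused else Placement.SLOWER_MEMORY,
                drop_after=drops,
            ))
        return indicators


    if __name__ == "__main__":
        # Toy graph: the "conv1" output feeds both "relu" and "skip_add", so it is cached;
        # the "relu" output is consumed once and is routed to slower memory.
        graph = [
            Op("conv1", reads=["input"], writes=["act1"], processor="neural"),
            Op("relu", reads=["act1"], writes=["act2"], processor="gpu"),
            Op("skip_add", reads=["act1", "act2"], writes=["out"], processor="neural"),
        ]
        for indicator in build_cache_indicators(graph):
            print(indicator)

In an actual implementation the resulting indicators would be compiled into the binary alongside the operations, and a runtime cache engine would honor them subject to the per-processor quotas described in claims 2-4 and 12-14; this sketch models neither the quotas nor the cache engine itself.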
CN202010322486.6A 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches Active CN112015675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311593938.4A CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962855900P 2019-05-31 2019-05-31
US62/855,900 2019-05-31
US16/601,501 2019-10-14
US16/601,501 US11080200B2 (en) 2019-05-31 2019-10-14 Allocation of machine learning tasks into a shared cache

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311593938.4A Division CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Publications (2)

Publication Number Publication Date
CN112015675A CN112015675A (en) 2020-12-01
CN112015675B (en) 2023-12-01

Family

ID=73506509

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311593938.4A Pending CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches
CN202010322486.6A Active CN112015675B (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311593938.4A Pending CN117632785A (en) 2019-05-31 2020-04-22 Allocation of machine learning tasks into shared caches

Country Status (1)

Country Link
CN (2) CN117632785A (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7237064B2 (en) * 2003-10-10 2007-06-26 Intel Corporation Method and apparatus for feedback-based management of combined heap and compiled code caches
US20050091456A1 (en) * 2003-10-23 2005-04-28 Huck Jerome C. Determining an arrangement of data in a memory for cache efficiency
US10186011B2 (en) * 2017-04-28 2019-01-22 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US20190286973A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
US10705967B2 (en) * 2018-10-15 2020-07-07 Intel Corporation Programmable interface to in-memory cache processor
US11080200B2 (en) * 2019-05-31 2021-08-03 Apple Inc. Allocation of machine learning tasks into a shared cache

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103765401A (en) * 2011-04-07 2014-04-30 威盛电子股份有限公司 Microprocessor that translates conditional load/store instructions into variable number of microinstructions
EP3396545A1 (en) * 2017-04-28 2018-10-31 INTEL Corporation Storage management for machine learning at autonomous machines

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiao Sa; Wei Min. Heterogeneous many-core acceleration of the BCC_AGCM atmospheric general circulation model. Meteorological Science and Technology (气象科技), 2018, 33-37+205. *

Also Published As

Publication number Publication date
CN112015675A (en) 2020-12-01
CN117632785A (en) 2024-03-01

Similar Documents

Publication Publication Date Title
US11175898B2 (en) Compiling code for a machine learning model for execution on a specialized processor
US11520629B2 (en) Dynamic task allocation for neural networks
US20220044153A1 (en) Virtualizing external memory as local to a machine learning accelerator
US11263135B2 (en) Techniques for handling requests for data at a cache
US11562214B2 (en) Methods for improving AI engine MAC utilization
US20210398015A1 (en) Machine learning model compiler
US20200380374A1 (en) Mutable parameters for machine learning models during runtime
US20210397313A1 (en) Dynamic application content on home screen
US9594839B2 (en) Methods and systems for load balancing databases in a cloud environment
EP3977362A1 (en) Compiling code for a machine learning model for execution on a specialized processor
US20240168812A1 (en) Managing computer resources for clinical applications
US20210397596A1 (en) Lookup table activation functions for neural networks
US11080200B2 (en) Allocation of machine learning tasks into a shared cache
WO2020263418A1 (en) Managing workloads of a deep neural network processor
CN112015675B (en) Allocation of machine learning tasks into shared caches
CN104298519B (en) For configuring the devices and methods therefor of operating system
US11687789B2 (en) Decomposition of machine learning operations
US20230094658A1 (en) Protected access to rendering information for electronic devices
US20210398021A1 (en) Execution of segmented machine learning models
CN112016668A (en) Variable parameters of machine learning model during runtime
CN112016681B (en) Decomposition of machine learning operations
US20210397957A1 (en) Multi-processor training of neural networks
US20230112031A1 (en) System and method for workload management in a distributed system
US20220415514A1 (en) Asymptomatic complex disease monitoring engine
US20220392646A1 (en) Method and system for recommendation of disease-related resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant