WO2018103472A1 - Method and device for buffer optimization in deep learning network

Info

Publication number: WO2018103472A1
Application number: PCT/CN2017/108030
Authority: WO
Inventors: 郑星; 王鹏; 叶挺群; 彭剑峰; 周智强
Original assignee: 杭州海康威视数字技术股份有限公司
Priority date: 2016-12-09
Filing date: 2017-10-27
Publication date: 2018-06-14
Also published as: CN108615077B; CN108615077A

Abstract

A method and device for buffer optimization in a deep learning network. The buffer optimization method comprises: simulating an Nth layer in the deep learning network; and determining, after the simulating step, whether a first preset buffer block is occupied, and if so, allocating, to output data in the Nth layer, a second preset buffer block. The method and device are utilized to resolve an issue of complicated buffer optimization during training of a deep learning network, providing simpler and more efficient buffer optimization and allocation, and being adaptive in various networks.

Description

Cache optimization method and device applied to deep learning network

The present application claims priority to Chinese Patent Application No. 201611132262.9, entitled "A Cache Optimization Method and Apparatus for Deep Learning Networks", filed on December 9, 2016, the entire contents of which are hereby incorporated by reference into this application.

Technical field

The present application relates to the field of computer technologies, and in particular, to a cache optimization method applied to a deep learning network and a device corresponding to the method.

Background technique

Deep learning is a relatively new field of machine learning research and an effective artificial intelligence method. It learns relevant knowledge from data by simulating the learning behavior of the human brain, for use in subsequent prediction. Deep learning is carried out with a so-called "network" consisting of multiple "layers" (for example, convolutional layers). Each layer takes the output of the previous layer (or of several earlier layers) as its input for training, and its training result is then used as input to the next layer. The training process is thus the ordered computation of each layer.

Deep learning training generates a large amount of intermediate data. Training usually requires this data to be cached, which occupies a large amount of cache, so good cache optimization methods are especially important. If cache optimization for deep learning is performed manually on the basis of experience, the optimizer must understand the structure of the deep learning network model very clearly and must know the usage timing of every cache holding intermediate data, that is, when each cache will be used and when it will not, so that caches that are never in use at the same time can be shared and the total cache size reduced.

However, different deep learning network models have different structures and use caches differently. Manual optimization therefore places high demands on the optimizer, is difficult to carry out in practice, and reduces cache optimization efficiency.

Summary of the invention

The purpose of the present application is to provide a cache optimization method and device applied to a deep learning network, so as to solve the problem of low cache optimization efficiency in the layer-by-layer training of a deep learning network.

According to one aspect of the present application, a cache optimization method applied to a deep learning network is provided, where the deep learning network includes N layers and N is greater than or equal to 2. The method includes: performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to cache the input data of the Nth-layer simulation operation or the input/output data of the simulation operation of a layer before the Nth layer; and, if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is satisfied.

Optionally, each cache block has a status tag indicating an occupied or unoccupied state, and the correspondence between cache blocks and status tags is stored in a cache block status table. Detecting whether the first predetermined cache block is occupied specifically includes: querying the cache block status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to the first predetermined cache block, and determining from the status tag whether the first predetermined cache block is occupied.
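The status-table lookup described above can be sketched as a small mapping from block identifier to status tag. This is only an illustrative sketch: the names `BlockState`, `status_table`, and `is_occupied`, and the block identifiers, are assumptions introduced here and do not appear in the application.

```python
from enum import Enum


class BlockState(Enum):
    """Status tag of a cache block (hypothetical representation)."""
    UNOCCUPIED = 0
    OCCUPIED = 1


# Hypothetical cache block status table: block identifier -> status tag.
status_table = {
    "block_1": BlockState.OCCUPIED,
    "block_2": BlockState.UNOCCUPIED,
}


def is_occupied(table, block_id):
    """Query the table by block identifier and report whether the block is occupied."""
    return table[block_id] is BlockState.OCCUPIED


first_block_busy = is_occupied(status_table, "block_1")
second_block_busy = is_occupied(status_table, "block_2")
```

A dictionary keeps the lookup by identifier explicit; any structure that maps an identifier to a tag would serve the same purpose.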

Optionally, allocating the second predetermined cache block for the output data of the Nth-layer simulation operation includes: allocating, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.

Optionally, releasing the occupied first predetermined cache block when the preset condition is met specifically includes: releasing the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or releasing the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or releasing the occupied first predetermined cache block when no second predetermined cache block is available for allocation.

Optionally, when it is detected that the first predetermined cache block is not occupied, the first predetermined cache block is allocated for the output data of the Nth layer simulation operation.

Optionally, the first predetermined cache block and the second predetermined cache block are marked with different colors.

According to another aspect of the present application, a cache optimization apparatus applied to a deep learning network is also provided, where the deep learning network includes N layers and N is greater than or equal to 2, the apparatus comprising:

a simulation operation unit, configured to perform a simulation operation on the Nth layer of the deep learning network;

a state detecting unit, configured to detect whether a first predetermined cache block is occupied after the simulation operation is performed on the Nth layer of the deep learning network, where the first predetermined cache block is used to cache the input data of the Nth-layer simulation operation or the input/output data of the simulation operation of a layer before the Nth layer;

a cache allocation unit, configured to allocate a second predetermined cache block for the output data of the Nth-layer simulation operation when the first predetermined cache block is occupied;

and a cache release unit, configured to release the occupied first predetermined cache block when a preset condition is met.

Optionally, each cache block has a status tag indicating an occupied or unoccupied state, and the state detecting unit includes a query subunit configured to query the status tag corresponding to the first predetermined cache block and to determine from the status tag whether the first predetermined cache block is occupied.

Optionally, the correspondence between cache blocks and status tags is stored in a cache block status table, and the query subunit is configured to query the cache block status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to the first predetermined cache block, and to determine from the query result whether the first predetermined cache block is occupied.

Optionally, the cache allocation unit is specifically configured to allocate, for the output data of the Nth layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.

Optionally, the cache release unit is configured to: release the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or release the occupied first predetermined cache block when no second predetermined cache block is available for allocation.

Optionally, the state detecting unit is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when it detects that the first predetermined cache block is not occupied.

Optionally, the first predetermined cache block and the second predetermined cache block are marked with different colors.

To achieve the above objective, an embodiment of the present application further discloses a storage medium for storing executable program code, where the executable program code is executed to perform the above-described cache optimization method applied to a deep learning network.

After performing the simulation operation on the Nth layer of the deep learning network, the present application detects whether the first predetermined cache block is occupied. If it is occupied, another cache block is allocated, and the occupied first predetermined cache block is released when the preset condition is met. When a cache needs to be allocated for the result data of the simulation operation, the present application can automatically identify the usage state of the predetermined cache block instead of relying on an optimizer's grasp of the deep learning network model structure and the cache usage state, thereby improving cache optimization efficiency.

DRAWINGS

To explain the technical solutions of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The embodiments of the present application and their description are used to explain the present application and do not constitute an undue limitation on it. In the drawings:

FIG. 1a is a schematic diagram of a conventional cache usage method, the horizontal axis of which is the time axis;

FIG. 1b is a schematic diagram of a shared cache, the horizontal axis of which is the time axis;

FIG. 2 is a flowchart of a cache optimization method according to an embodiment of the present application;

FIG. 3 is a flowchart of a cache optimization method according to an embodiment of the present application;

FIG. 4 is a flowchart of a cache optimization method according to an embodiment of the present application;

FIG. 5a is a schematic diagram of the memory state at the start of memory optimization in the embodiment of the present application;

FIG. 5b is a schematic diagram of the memory optimization process for layer a in the embodiment of the present application;

FIG. 5c is a schematic diagram of the memory optimization process for layer b in the embodiment of the present application;

FIG. 5d is a schematic diagram of the memory optimization process for layer c in the embodiment of the present application;

FIG. 5e is a schematic diagram of memory usage after memory optimization is finally completed in the embodiment of the present application;

FIG. 6 is a structural block diagram of a cache optimization apparatus in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed description

The technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present application without inventive effort fall within the scope of protection of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

FIG. 1a is a schematic diagram of the traditional cache usage method, in which the horizontal axis is the time axis. From FIG. 1a, the applicant finds that video memory 1 and video memory 2 are in use simultaneously for a period of time, so video memory 1 and video memory 2 cannot be shared, whereas the usage periods of video memory 1 and video memory 3 are independent of each other, so these two can be shared. FIG. 1b is a schematic diagram of a shared cache, in which the horizontal axis is the time axis. Shared memory 1 indicates that video memory 1 and video memory 3 end up using the same block of address space; shared memory 2 indicates that video memory 2 and video memory 4 end up using the same block of address space.
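The sharing rule illustrated by FIG. 1a and FIG. 1b can be sketched as an interval-overlap check: two video memories may share one address space only if their usage periods never overlap. This is a minimal illustration, not the method of the application. The time intervals below are hypothetical values chosen to mimic the figures, and the names `lifetimes`, `overlaps`, and `can_share` are introduced here for illustration only.

```python
# Hypothetical usage periods (start, end) on the time axis, half-open intervals.
lifetimes = {
    1: (0, 4),
    2: (2, 6),
    3: (4, 8),
    4: (6, 10),
}


def overlaps(first, second):
    """Return True if two half-open intervals [start, end) share any time."""
    return first[0] < second[1] and second[0] < first[1]


def can_share(memory_a, memory_b):
    """Two video memories can share a block only if they are never in use together."""
    return not overlaps(lifetimes[memory_a], lifetimes[memory_b])


share_1_2 = can_share(1, 2)  # used simultaneously for a while, so not shareable
share_1_3 = can_share(1, 3)  # independent usage periods, so shareable
share_2_4 = can_share(2, 4)  # independent usage periods, so shareable
```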

This embodiment provides a cache optimization method applied to a deep learning network. The deep learning network in this embodiment includes multiple layers; the number of layers is denoted N and is generally greater than or equal to 2. The flowchart of this embodiment is shown in FIG. 2 and includes the following steps:

S101: Perform a simulation operation on the Nth layer of the deep learning network. A deep learning network includes an input layer, an output layer, and one or more hidden layers, where the hidden layers form the intermediate layers of the network. Many kinds of simulation operations are possible in a deep learning network. For example, the values of the output layer can be obtained by combining the values of the input layer with the weight matrices of the intermediate layers, and the simulation operation can be regarded as the process of calculating these intermediate-layer weight matrices.

S102: After performing the simulation operation on the Nth layer of the deep learning network, detect the usage state of the first predetermined cache block, where the first predetermined cache block is used to cache the input data of the Nth-layer simulation operation or the input/output data of the simulation operation of a layer before the Nth layer.

S103: If the first predetermined cache block is occupied, allocate the second predetermined cache block for the output data of the Nth-layer simulation operation, and release the occupied first predetermined cache block when the preset condition is met.

The first predetermined cache block and the second predetermined cache block are reserved cache blocks allocated before the simulation operation is performed. The number of cache blocks is not limited and may be two or more. The terms "first" and "second" serve to distinguish the cache blocks.

It should be noted that the steps of the method provided in this embodiment may all be executed by the same device, or the method may be executed by different devices. For example, step S101 may be executed by device 1, and steps S102 and S103 may be executed by device 2, and so on. The device may be a computing device capable of implementing cache optimization, such as a personal computer, laptop, or tablet. To avoid repetition, the executing entity in all of the embodiments described below may be the same as or similar to the executing entity in the above embodiment.

The cache block may be, but is not limited to, memory with a storage function, such as the memory of a CPU (Central Processing Unit) or the memory of a GPU (Graphics Processing Unit).
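Steps S102 and S103 can be sketched as follows, under the assumption that the reserved blocks are tried in a fixed order and that the preset release condition is simply "the layer's simulation operation has finished". The function name `allocate_output` and the block labels are hypothetical and are used only for illustration.

```python
def allocate_output(occupied, blocks):
    """Return the first reserved block that is not occupied and mark it as occupied.

    `blocks` lists the reserved blocks in the order they are tried, so the
    first predetermined block is checked before the second one.
    """
    for block in blocks:
        if block not in occupied:
            occupied.add(block)
            return block
    raise MemoryError("no predetermined cache block is free")


blocks = ["first", "second"]   # reserved before the simulation operation
occupied = {"first"}           # the first block already holds the layer's input

# S102/S103: the first block is occupied, so the output gets the second block.
output_block = allocate_output(occupied, blocks)

# Assumed preset condition: the layer has finished, so its input block is released.
occupied.discard("first")
```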

This embodiment provides a cache optimization method for a deep learning network that optimizes the cache through conflict detection and cache release during the simulation operation. That is, in the layer-by-layer training of the deep learning network, to optimize the cache occupied during actual training, a simulation operation is performed first and serves as the reference for cache allocation. When cache is allocated after the simulation operation, the current usage state of the cache block to be allocated or pre-allocated is checked, and an idle cache block is then selected for allocation. In this way the usage state of cache blocks is detected automatically and cache allocation no longer depends on an optimizer who knows the network structure, which improves the efficiency of cache optimization and allocation for deep learning networks and adapts to a wider range of networks.

In addition, for cache already occupied by the deep learning network, the earlier output data is released promptly once the preset condition is met, so that cache blocks can be reused cyclically throughout allocation.

A specific implementation of this embodiment may further include: performing a simulation operation on each layer of the deep learning network; conflict detection, that is, detecting whether the first predetermined cache block, which caches the input data of that layer's simulation operation or the input/output data of an earlier layer's simulation operation, is occupied; and cache allocation, that is, allocating a second predetermined cache block for the data when the first predetermined cache block is found to be occupied, and so on. FIG. 3 is a schematic flowchart of the cache optimization method based on conflict detection in this embodiment, which includes the following steps:

S201: Assume that none of the predetermined cache blocks is in use, and point all cache blocks that need to cache data to the same predetermined first cache address.

S202: Perform conflict detection on the cache blocks that each layer needs to use. When a conflict is detected, step S202a is performed; if no conflict is detected, step S202b is performed.

S202a: Allocate a predetermined second cache address to the conflicting cache block.

S202b: Continue with the detection of the next cache block.

S203: After the preset condition is met, that is, after the simulation operation of the layer ends, release the occupied first predetermined cache block.

The terms "first" and "second" reflect the order in which the cache blocks are used, and the second predetermined cache block is immediately adjacent to the first predetermined cache block. This design reduces the cache wasted through storage fragmentation. Moreover, the cache optimization mode in this embodiment performs conflict detection on the cache state after each simulation operation and promptly discovers idle cache, thereby improving cache utilization.
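One way to keep the second block immediately adjacent to the first is to number equally sized blocks from a common base address and always take the lowest free index. The sketch below assumes exactly that; the fixed `BLOCK_SIZE_MB`, the index-based layout, and the function names are illustrative assumptions rather than details given in the application.

```python
BLOCK_SIZE_MB = 200  # hypothetical size of every predetermined block


def lowest_free_index(occupied):
    """Conflict detection by index: skip occupied blocks, return the first free one."""
    index = 0
    while index in occupied:
        index += 1
    return index


def offset_mb(index):
    """Offset of a block from the base address; block k+1 starts where block k ends."""
    return index * BLOCK_SIZE_MB


occupied = {0}                               # the first block is in use
second = lowest_free_index(occupied)         # conflict on block 0, so block 1 is chosen
occupied.add(second)
third = lowest_free_index(occupied)          # blocks 0 and 1 conflict, so block 2
occupied.discard(0)                          # release the first block
reused = lowest_free_index(occupied)         # the freed block 0 is found again
```

Because the lowest free index is always preferred, released blocks near the base address are reused before new ones are opened, which keeps the occupied region compact.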

FIG. 4 is a flowchart of the cache optimization method in a specific implementation of this embodiment. In this embodiment, a cache block is a storage unit and can be embodied as any storage device with a storage function, such as disk storage, memory, or video memory. The cache optimization method in this embodiment can therefore be applied to the optimization of disk storage, memory, video memory, and the like. Video memory is used as the example below, so the cache block status table in this embodiment is a memory block status table. For simplicity, the number of layers N of the deep learning network in this embodiment is 3, namely layers a, b, and c. The predetermined layout of the video memory is as follows: video memory 1 and video memory 2 are the input and output of layer a; video memory 3 and video memory 4 are the input and output of layer b; video memory 2 and video memory 4 are the inputs of layer c, and video memory 5 is the output of layer c. When the simulation operation of a layer is completed, the memory blocks occupied by the inputs of that layer are released. The method includes steps S301 to S304:

Step S301: Point all the video memories that need to store data to the same predetermined memory address, and mark the predetermined memory block as occupied by the data in video memory 1. FIG. 5a is a schematic diagram of the memory state at the start of memory optimization. As shown in FIG. 5a, video memories 1-5 are all marked or displayed in red; that is, they all point to the same memory address, the red memory block is marked as occupied by the data in video memory 1, and this is recorded in the memory block status table.

The memory block status table may be stored in any of video memories 1-5 or in a storage unit other than video memories 1-5; either is acceptable.

Step S302: Perform the memory optimization process on layer a, that is, perform conflict detection and memory allocation on the input and output video memory of layer a, and record the allocation result. FIG. 5b is a schematic diagram of the memory optimization process for layer a. As shown in FIG. 5b, during the simulation operation, conflict detection is performed on video memory 1 and video memory 2 of layer a, and it is found that the red memory block that video memory 2 wants to use is already occupied by video memory 1; that is, a conflict occurs. A new memory address is therefore allocated for video memory 2: the yellow memory block is marked as occupied by the data in video memory 2. After the simulation operation of layer a is completed, the red memory block occupied by video memory 1 is released, and the memory block usage at this point is recorded in the memory block status table.

Step S303: Perform the memory optimization process on layer b, that is, perform conflict detection and memory allocation on the input and output video memory of layer b, and record the allocation result. FIG. 5c is a schematic diagram of the memory optimization process for layer b. As shown in FIG. 5c, during the simulation operation, video memory 3 of layer b wants to cache its data in the predetermined red memory block, and the memory block status table shows that the red memory block is not in use, so the red memory block is marked as used by video memory 3. Next, conflict detection is performed on video memory 3 and video memory 4 of layer b. The memory block status table shows that the red memory block that video memory 4 wants to use is already occupied by video memory 3 and that the yellow memory block is already occupied by video memory 2; that is, a conflict occurs. A new memory address is therefore allocated for video memory 4: the blue memory block is marked as occupied by the data in video memory 4. After the simulation operation of layer b is completed, the red memory block occupied by video memory 3 is released, and the memory block usage at this point is recorded in the memory block status table.

Step S304: Perform the memory optimization process on layer c, that is, perform conflict detection and memory allocation on the input and output video memory of layer c, and record the allocation result. FIG. 5d is a schematic diagram of the memory optimization process for layer c. As shown in FIG. 5d, during the simulation operation, video memory 5 of layer c wants to store its data in the red memory block, and conflict detection is performed through the memory block status table. The red memory block is found to be unused, so it is marked as occupied by video memory 5. At this point the conflict state of all memory blocks has been checked. After the simulation operation of layer c is completed, the yellow memory block and the blue memory block occupied by video memory 2 and video memory 4, respectively, are released, and this is recorded in the memory block status table.

FIG. 5e is a schematic diagram of memory usage after memory optimization is finally completed. As shown in FIG. 5e, once the optimization is complete, the data in video memory 5 is stored in the red memory block, and video memories 1-4 have all been released.
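The walk through layers a, b, and c can be reproduced with a short simulation. This is a sketch under stated assumptions, not the implementation of the application: blocks are tried in the fixed order red, yellow, blue; a layer's input blocks are released as soon as that layer finishes; and the labels `vm1` to `vm5` stand for video memories 1 to 5. The names `plan`, `status`, and `assignment` are hypothetical.

```python
# Predetermined memory blocks in the order they are tried (colors as in FIG. 5a-5e).
BLOCKS = ["red", "yellow", "blue"]

# (layer name, input buffers, output buffer), following the layout of the example.
LAYERS = [
    ("a", ["vm1"], "vm2"),
    ("b", ["vm3"], "vm4"),
    ("c", ["vm2", "vm4"], "vm5"),
]


def plan(layers, blocks):
    """Simulate each layer in order and record which block each buffer ends up in."""
    status = {block: None for block in blocks}  # memory block status table
    assignment = {}                             # buffer -> block it was given

    def ensure_allocated(buffer):
        block = assignment.get(buffer)
        if block is not None and status[block] == buffer:
            return  # the buffer is still resident from an earlier layer
        for candidate in blocks:
            if status[candidate] is None:       # conflict detection via the table
                status[candidate] = buffer
                assignment[buffer] = candidate
                return
        raise MemoryError(f"no free block for {buffer}")

    for _name, inputs, output in layers:
        for buffer in inputs + [output]:
            ensure_allocated(buffer)
        for buffer in inputs:                   # the layer is done: release its inputs
            status[assignment[buffer]] = None
    return assignment, status


assignment, status = plan(LAYERS, BLOCKS)
```

Under these assumptions, `assignment` places video memories 1, 3, and 5 in the red block, video memory 2 in the yellow block, and video memory 4 in the blue block, and the final `status` shows only the red block occupied, which matches FIG. 5b to FIG. 5e.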

The memory block status table described in this embodiment records the occupancy state of all memory blocks, which makes the occupancy of each video memory clearer. Assume that each of video memories 1-5 is 200M in size. If the memory optimization method described in this embodiment is not used, all of video memories 1-5 must be occupied, that is, memory usage is 1000M. After memory allocation is optimized with the method described in this embodiment, only the red, yellow, and blue memory blocks are actually used, so the actual memory usage is only 600M. The cache optimization method of this embodiment thus saves cache and further improves cache utilization.

The example in this embodiment is only a small deep learning network with three layers. For more complex deep learning networks, such as an N-layer (N greater than 3) network in which the output of layer N-1 serves as the input of layer N, this method can still optimize and allocate the cache.

Based on the same inventive concept, an embodiment of the present application also provides a cache optimization apparatus.
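The saving quoted above follows directly from the block counts. The arithmetic is spelled out below with the 200M size assumed in the example; the variable names are illustrative only.

```python
BUFFER_MB = 200  # size of each video memory, as assumed in the example

unoptimized_mb = 5 * BUFFER_MB  # video memories 1-5 each occupy their own block
optimized_mb = 3 * BUFFER_MB    # only the red, yellow, and blue blocks are used
saved_fraction = (unoptimized_mb - optimized_mb) / unoptimized_mb
```

With these figures the usage falls from 1000M to 600M, a reduction of 40%.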
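For a purely sequential network, the effect of the method can be checked with a small simulation. The sketch assumes that each layer has exactly one input and one output, that the output of one layer is consumed only by the next layer, and that the input block is released as soon as the layer finishes. The function name `peak_blocks_for_chain` is hypothetical.

```python
def peak_blocks_for_chain(num_layers):
    """Return the largest number of blocks occupied at once for a chain of layers."""
    occupied = set()

    def lowest_free():
        index = 0
        while index in occupied:
            index += 1
        return index

    current = lowest_free()          # block holding the input of the first layer
    occupied.add(current)
    peak = len(occupied)
    for _ in range(num_layers):
        output = lowest_free()       # the input block conflicts, so another is taken
        occupied.add(output)
        peak = max(peak, len(occupied))
        occupied.discard(current)    # layer finished: release its input block
        current = output             # the output becomes the next layer's input
    return peak


peak_three_layers = peak_blocks_for_chain(3)
peak_ten_layers = peak_blocks_for_chain(10)
```

Under these assumptions the peak stays at two blocks regardless of depth, because the two blocks alternate between the input and output roles. Networks with branches, such as layer c in the example above, need more.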

FIG. 6 is a structural block diagram of a cache optimization apparatus according to an embodiment of the present application. As shown in FIG. 6, the apparatus includes a simulation operation unit 610, a state detecting unit 620, a cache allocation unit 630, and a cache release unit 640. The structure and connection relationships of these modules are described in detail below.

The simulation operation unit 610 is configured to perform a simulation operation on the Nth layer of the deep learning network;

The state detecting unit 620 is coupled to the simulation operation unit 610 and configured to detect whether the first predetermined cache block is occupied after the simulation operation is performed on the Nth layer of the deep learning network;

The cache allocation unit 630 is coupled to the state detecting unit 620 and configured to allocate a second predetermined cache block for the output data of the Nth-layer simulation operation when the first predetermined cache block is occupied;

The cache release unit 640 is coupled to the cache allocation unit 630 and configured to release the occupied first predetermined cache block when a preset condition is satisfied.

Optionally, the correspondence between cache blocks and their status tags is stored in the cache block status table, and the state detecting unit 620 further includes a query subunit (not shown), specifically configured to query the cache block status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.

Optionally, the cache allocation unit 630 is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.

Optionally, the cache release unit 640 is configured to: release the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or release the occupied first predetermined cache block when no second predetermined cache block is available for allocation.

Optionally, the state detecting unit 620 is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when it detects that the first predetermined cache block is not occupied. The first predetermined cache block and the second predetermined cache block may be marked with different colors.
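The division of work among units 610 to 640 can be pictured as four methods of one class. This is only a structural sketch: the class name `CacheOptimizer`, the method names, and the placeholder behavior of `simulate` are assumptions made for illustration, and the application does not prescribe a software layout.

```python
class CacheOptimizer:
    """Hypothetical grouping of units 610-640 as methods of a single class."""

    def __init__(self, block_ids):
        self.block_ids = list(block_ids)
        self.occupied = {block_id: False for block_id in self.block_ids}

    def simulate(self, layer_name):
        """Unit 610: placeholder for the simulation operation; returns an output label."""
        return f"{layer_name}_out"

    def is_occupied(self, block_id):
        """Unit 620: look up the status tag of a block."""
        return self.occupied[block_id]

    def allocate(self):
        """Unit 630: take the first predetermined block that is not occupied."""
        for block_id in self.block_ids:
            if not self.is_occupied(block_id):
                self.occupied[block_id] = True
                return block_id
        raise MemoryError("no predetermined cache block is free")

    def release(self, block_id):
        """Unit 640: mark a block as unoccupied once the preset condition is met."""
        self.occupied[block_id] = False


optimizer = CacheOptimizer(["first", "second"])
input_block = optimizer.allocate()       # the layer's input takes the first block
output_label = optimizer.simulate("layer_n")
output_block = optimizer.allocate()      # the first block is occupied, so the second is used
optimizer.release(input_block)           # assumed condition: the layer has finished
```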

According to the technical solution of the present application, the traditional cache optimization method for deep learning, in which the optimizer must understand the deep learning network very clearly and know the usage timing of each cache (when each cache will be used and when it will not), is replaced, making cache optimization in the deep learning field more general and simpler. Through the "simulation operation, conflict detection, cache release" approach, idle cache is found promptly, which improves cache utilization.

The embodiment of the present application further provides an electronic device, as shown in FIG. 7, comprising a housing 701, a processor 702, a memory 703, a circuit board 704, and a power circuit 705. The circuit board 704 is disposed in the housing 701; the processor 702 and the memory 703 are disposed on the circuit board 704; the power circuit 705 is used to supply power to the various circuits or devices of the electronic device; the memory 703 is used to store executable program code; and the processor 702 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 703, so as to execute the cache optimization method applied to the deep learning network, the method comprising:

Performing a simulation operation on the Nth layer of the deep learning network;

After performing the simulation operation on the Nth layer of the deep learning network, detecting whether the first predetermined cache block is occupied, where the first predetermined cache block is used to cache the input data of the Nth-layer simulation operation or the input/output data of the simulation operation of a layer before the Nth layer;

If it is occupied, allocating the second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when the preset condition is satisfied.

The embodiment of the present application further provides executable program code, where the executable program code is executed to perform the cache optimization method applied to the deep learning network, and the method includes:

Performing a simulation operation on the Nth layer of the deep learning network;

The embodiment of the present application further provides a storage medium for storing executable program code, where the executable program code is executed to perform the cache optimization method applied to a deep learning network, and the method includes:

Performing a simulation operation on the Nth layer of the deep learning network;

The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments can be referred to one another, and each embodiment focuses on its differences from the other embodiments.

In particular, for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (e.g., a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise suitably processing it if necessary, and then stored in computer memory.

It should be understood that portions of the application can be implemented in hardware, software, firmware, or a combination thereof.

In the above-described embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

While the preferred embodiment of the present application has been described, those skilled in the art can make further changes and modifications to these embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including all such changes and modifications.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit and scope of the application. Thus, the present application is intended to cover such modifications and variations.

Claims

A cache optimization method applied to a deep learning network, wherein the deep learning network includes N layers, N being greater than or equal to 2, the method comprising:

Performing a simulation operation on the Nth layer of the deep learning network;

After performing the simulation operation on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, wherein the first predetermined cache block is used to cache input data of the simulation operation of the Nth layer, or input/output data of the simulation operations of layers preceding the Nth layer;

If the first predetermined cache block is occupied, allocating a second predetermined cache block for the output data of the simulation operation of the Nth layer, and releasing the occupied first predetermined cache block when a preset condition is satisfied.
The method according to claim 1, wherein each cache block has a corresponding occupied or unoccupied status flag stored in a cache block status table, and the detecting whether the first predetermined cache block is occupied comprises:

Querying, according to the identifier of the first predetermined cache block, the status flag corresponding to the first predetermined cache block in the cache block status table, and determining, according to the status flag, whether the first predetermined cache block is occupied.
The method according to claim 1, wherein the allocating the second predetermined cache block for the output data of the simulation operation of the Nth layer comprises:

Allocating, for the output data of the simulation operation of the Nth layer, a second predetermined cache block immediately adjacent to the first predetermined cache block.
The method according to claim 1, wherein the releasing the occupied first predetermined cache block when the preset condition is satisfied comprises:

Releasing the occupied first predetermined cache block when performing a simulation operation on the (N+1)th layer of the deep learning network; or

Releasing the occupied first predetermined cache block after performing the simulation operation on the Nth layer and before performing the simulation operation on the (N+1)th layer of the deep learning network; or

Releasing the occupied first predetermined cache block when no second predetermined cache block is available for allocation.
The method according to claim 1, wherein, upon detecting that the first predetermined cache block is unoccupied, the first predetermined cache block is allocated for the output data of the simulation operation of the Nth layer.
The method according to any one of claims 1 to 5, wherein the first predetermined cache block and the second predetermined cache block are marked with different colors.
A cache optimization apparatus applied to a deep learning network, wherein the deep learning network includes N layers, N being greater than or equal to 2, the apparatus comprising a simulation operation unit, a state detection unit, a cache allocation unit, and a cache release unit, wherein:

The simulation operation unit is configured to perform a simulation operation on the Nth layer of the deep learning network;

The state detection unit is configured to detect whether a first predetermined cache block is occupied after the simulation operation is performed on the Nth layer of the deep learning network, wherein the first predetermined cache block is used to cache input data of the simulation operation of the Nth layer, or input/output data of the simulation operations of layers preceding the Nth layer;

The cache allocation unit is configured to allocate a second predetermined cache block for the output data of the simulation operation of the Nth layer when the first predetermined cache block is occupied;

The cache release unit is configured to release the occupied first predetermined cache block when a preset condition is satisfied.
The apparatus according to claim 7, wherein each cache block has a corresponding occupied or unoccupied status flag, the state detection unit includes a query subunit, and the query subunit is configured to query the status flag corresponding to the first predetermined cache block and to determine, according to the status flag, whether the first predetermined cache block is occupied.
The apparatus according to claim 8, wherein the correspondence between each cache block and its status flag is stored in a cache block status table, and the query subunit is specifically configured to query, according to the identifier of the first predetermined cache block, the status flag corresponding to the first predetermined cache block in the cache block status table, so as to determine, according to the query result, whether the first predetermined cache block is occupied.
The apparatus according to claim 7, wherein the cache allocation unit is specifically configured to allocate, for the output data of the simulation operation of the Nth layer, a second predetermined cache block immediately adjacent to the first predetermined cache block.
The apparatus according to claim 7, wherein the cache release unit is configured to: release the occupied first predetermined cache block when a simulation operation is performed on the (N+1)th layer of the deep learning network; or

Release the occupied first predetermined cache block after the simulation operation is performed on the Nth layer and before the simulation operation is performed on the (N+1)th layer of the deep learning network; or

Release the occupied first predetermined cache block when no second predetermined cache block is available for allocation.
The apparatus according to claim 7, wherein the state detection unit is further configured to allocate the first predetermined cache block for the output data of the simulation operation of the Nth layer upon detecting that the first predetermined cache block is unoccupied.
The apparatus according to any one of claims 7 to 12, wherein the first predetermined cache block and the second predetermined cache block are marked with different colors.
A storage medium, wherein the storage medium is configured to store executable program code, and the executable program code is configured to be executed to perform the cache optimization method applied to a deep learning network according to any one of claims 1 to 6.
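
The flow recited in the method claims (simulate a layer, look up the first predetermined cache block in a status table, allocate an adjacent second block if the first is occupied, and release the first block before the next layer) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the application: the `CachePlanner` class, the `plan_chain` helper, the use of integer block identifiers, the purely sequential layer chain, and the wrap-around rule used to pick the "adjacent" block are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CachePlanner:
    """Hypothetical planner that tracks cache block occupancy during a dry run."""
    num_blocks: int
    # Cache block status table: block identifier -> True if occupied.
    status: dict = field(init=False)

    def __post_init__(self):
        self.status = {block_id: False for block_id in range(self.num_blocks)}

    def is_occupied(self, block_id):
        # Query the status flag by block identifier.
        return self.status[block_id]

    def allocate_output(self, first_block):
        """Choose a block for the output of the layer that was just simulated."""
        if not self.is_occupied(first_block):
            # First block is free: reuse it for the output.
            chosen = first_block
        else:
            # Assumption: "immediately adjacent" means the next identifier,
            # wrapping around at the end of the pool.
            chosen = (first_block + 1) % self.num_blocks
            if self.is_occupied(chosen):
                raise RuntimeError("no second cache block is available")
        self.status[chosen] = True
        return chosen

    def release(self, block_id):
        self.status[block_id] = False


def plan_chain(num_layers, num_blocks=2):
    """Simulate a sequential network; return (layer, input_block, output_block)."""
    planner = CachePlanner(num_blocks)
    input_block = 0
    planner.status[input_block] = True  # the network input occupies block 0
    plan = []
    for layer in range(1, num_layers + 1):
        output_block = planner.allocate_output(first_block=input_block)
        plan.append((layer, input_block, output_block))
        # Release the input block after this layer and before the next one.
        planner.release(input_block)
        input_block = output_block
    return plan


if __name__ == "__main__":
    for layer, src, dst in plan_chain(4):
        print(f"layer {layer}: input block {src} -> output block {dst}")
```

Under these assumptions, a four-layer chain alternates between two blocks, which shows how releasing the first block before the next layer keeps the total number of blocks small. A network with branches would need a release condition that accounts for every later layer that still reads a block, which this sketch does not attempt.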