CN108615077B - Cache optimization method and device applied to deep learning network


Info

Publication number
CN108615077B
Authority
CN
China
Prior art keywords: cache block, occupied, block, cache, predetermined
Legal status
Active
Application number
CN201611132262.9A
Other languages
Chinese (zh)
Other versions
CN108615077A (en)
Inventor
郑星
王鹏
叶挺群
彭剑峰
周智强
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2021-08-24
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201611132262.9A
Priority to PCT/CN2017/108030
Publication of CN108615077A
Application granted
Publication of CN108615077B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 99/00 Subject matter not provided for in other groups of this subclass

Abstract

The invention discloses a cache optimization method and a cache optimization device applied to a deep learning network. The method comprises the following steps: performing a simulation operation on the Nth layer of the deep learning network; after the simulation operation, detecting whether a first predetermined cache block is occupied; and, if it is occupied, allocating a second predetermined cache block for the output data of the Nth layer and releasing the occupied first predetermined cache block when a preset condition is met. The invention solves the problem of cumbersome cache optimization in the layer-by-layer training of a deep learning network, makes cache optimization and allocation simpler and more efficient, and in particular adapts to different networks.

Description

Cache optimization method and device applied to deep learning network
Technical Field
The invention relates to the technical field of computers, in particular to a cache optimization method applied to a deep learning network and a device corresponding to the method.
Background
Deep learning is a young field within machine learning research and an effective approach to artificial intelligence. By simulating the learning behavior of the human brain, it learns relevant knowledge from data for subsequent prediction. Deep learning employs so-called "networks" for learning, where a network is composed of a plurality of "layers" (e.g., convolutional layers); each layer is trained with the output of the previous layer as its input, and its training result in turn serves as the input of the next layer. The training process is therefore a process of performing ordered calculations layer by layer.
Deep learning training generates a large amount of intermediate data that generally has to be buffered, so a large amount of cache is occupied, and finding a good cache optimization method becomes important. If cache optimization for deep learning is performed manually from experience, the optimization personnel must be very familiar with the structure of the deep learning network model, know when each cache block holding intermediate data is in use and when it is not, and let the unused caches be shared, so as to reduce the total cache size.
However, different deep learning network models have different structures and different cache usage patterns. This places high demands on the optimization personnel, makes manual optimization hard to carry out in practice, and lowers cache optimization efficiency.
Disclosure of Invention
The invention aims to solve the problem of low cache optimization efficiency in the layer-by-layer training of a deep learning network by providing a cache optimization method and a cache optimization device applied to the deep learning network.
According to one aspect of the present invention, there is provided a cache optimization method applied to a deep learning network, where the deep learning network includes N layers and N is greater than or equal to 2. The method includes: performing a simulation operation on the Nth layer of the deep learning network; after the simulation operation on the Nth layer, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used for caching the input data of the Nth-layer simulation operation or the input/output data of the simulation operation of a layer preceding the Nth layer; and if the first predetermined cache block is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
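To make this flow concrete, the following is a minimal, self-contained Python sketch of the per-layer loop. Every name (Planner, optimize, the owner tags) is an illustrative assumption rather than an API prescribed by the invention, and the release condition shown is the simplest of the claimed alternatives (release once layer N finishes):

    # Sketch only: blocks are integer ids; the dict plays the status table.
    class Planner:
        def __init__(self):
            self.occupied = {}  # block id -> owner of the cached data

        def is_occupied(self, block):
            return block in self.occupied

        def allocate(self, owner):
            block = 0
            while block in self.occupied:  # lowest free id = adjacent block
                block += 1
            self.occupied[block] = owner
            return block

        def release(self, block):
            self.occupied.pop(block, None)

    def optimize(num_layers):
        planner = Planner()
        current = planner.allocate("input")  # first predetermined cache block
        for n in range(1, num_layers + 1):
            # ... the simulation operation on the Nth layer would run here ...
            if planner.is_occupied(current):  # conflict: input data still live
                out = planner.allocate(f"layer-{n} output")  # second block
            else:                             # no conflict: reuse the first block
                out = current
                planner.occupied[out] = f"layer-{n} output"
            if out != current:
                planner.release(current)      # preset condition: layer N finished
            current = out                     # this output feeds layer N+1
        return planner

    planner = optimize(3)  # a 3-layer chain ends up cycling between two blocks

With three layers, optimize(3) cycles between just two physical blocks, which is exactly the sharing effect illustrated by FIG. 1b below.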
Optionally, each cache block has a corresponding occupied or unoccupied status flag stored in a cache block status table, and detecting whether the first predetermined cache block is occupied specifically includes: querying the status flag corresponding to the first predetermined cache block in the cache block status table according to the identifier of the first predetermined cache block, and determining from the status flag whether the first predetermined cache block is occupied.
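A small sketch of that lookup, assuming the status table is simply a mapping from block identifiers to flags (the identifiers and the table layout are illustrative):

    status_table = {"block-1": "occupied", "block-2": "unoccupied"}

    def is_occupied(block_id):
        # Query the status flag for the identifier; any id not in the
        # table is treated as unoccupied in this sketch.
        return status_table.get(block_id) == "occupied"

    assert is_occupied("block-1") is True
    assert is_occupied("block-2") is False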
Optionally, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation specifically includes: allocating a second predetermined cache block immediately adjacent to the first predetermined cache block for the output data of the Nth-layer simulation operation.
Optionally, releasing the occupied first predetermined cache block when a preset condition is met specifically includes: releasing the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or releasing the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or releasing the occupied first predetermined cache block when no second predetermined cache block is allocable.
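The three alternatives can be pictured as interchangeable release policies; this hypothetical encoding only illustrates that any one of them can serve as the preset condition:

    from enum import Enum, auto

    class ReleasePolicy(Enum):
        DURING_NEXT_LAYER = auto()  # while the (N+1)th layer is being simulated
        BETWEEN_LAYERS = auto()     # after layer N, before layer N+1
        ON_EXHAUSTION = auto()      # only when no second block is allocable

    def may_release(policy, layer_done, inside_next_layer, free_blocks):
        # layer_done: layer N's simulation has finished;
        # inside_next_layer: layer N+1's simulation is currently running.
        if policy is ReleasePolicy.DURING_NEXT_LAYER:
            return layer_done and inside_next_layer
        if policy is ReleasePolicy.BETWEEN_LAYERS:
            return layer_done and not inside_next_layer
        return free_blocks == 0  # ON_EXHAUSTION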
Optionally, when the first predetermined cache block is detected to be unoccupied, the first predetermined cache block is allocated for the output data of the Nth-layer simulation operation.
Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
According to another aspect of the present invention, there is also provided a cache optimization apparatus applied to a deep learning network, where the deep learning network includes N layers and N is greater than or equal to 2, and the apparatus includes:
the simulation operation unit is used for performing a simulation operation on the Nth layer of the deep learning network;
the state detection unit is used for detecting, after the simulation operation on the Nth layer of the deep learning network, whether a first predetermined cache block is occupied, wherein the first predetermined cache block is used for caching input data of the Nth-layer simulation operation or input/output data of the simulation operation of a layer preceding the Nth layer;
the cache allocation unit is used for allocating a second predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is occupied;
and the cache release unit is used for releasing the occupied first predetermined cache block when a preset condition is met.
Optionally, each cache block has a corresponding occupied or unoccupied status flag, and the state detection unit includes a query subunit configured to query the status flag corresponding to the first predetermined cache block and determine from the status flag whether the first predetermined cache block is occupied.
Optionally, the cache blocks and their status flags are stored in a cache block status table, and the query subunit is specifically configured to query the status flag corresponding to the first predetermined cache block in the cache block status table according to the identifier of the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.
Optionally, the cache allocation unit is specifically configured to allocate a second predetermined cache block immediately adjacent to the first predetermined cache block for output data of the Nth-layer simulation operation.
Optionally, the cache release unit is specifically configured to: release the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or release the occupied first predetermined cache block when no second predetermined cache block is allocable.
Optionally, the state detection unit is further configured to allocate the first predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is detected to be unoccupied.
Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
In the present invention, after the simulation operation on the Nth layer of the deep learning network, whether the first predetermined cache block is occupied is detected; if it is occupied, another cache block is allocated, and the occupied first predetermined cache block is released when a preset condition is met. When a cache must be allocated for the result data of a simulation operation, the method automatically identifies the use state of the predetermined cache blocks without relying on the optimizer's mastery of the deep learning network model structure and of the cache usage, thereby improving cache optimization efficiency.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The embodiments and their description explain the present invention and do not limit it. In the drawings:
FIG. 1a is a schematic diagram of a conventional cache usage method, with the horizontal axis representing the time axis;
FIG. 1b is a schematic diagram of a shared cache, with the horizontal axis representing the time axis;
FIG. 2 is a flow chart of a cache optimization method according to an embodiment of the present invention;
FIG. 3 is a flow chart of a cache optimization method based on conflict detection according to an embodiment of the present invention;
FIG. 4 is a flow chart of a cache optimization method according to a specific implementation of an embodiment of the present invention;
FIG. 5a is a diagram illustrating the video memory state at the start of video memory optimization according to an embodiment of the present invention;
FIG. 5b is a schematic diagram of the process of optimizing the video memory of layer a according to an embodiment of the present invention;
FIG. 5c is a schematic diagram of the process of optimizing the video memory of layer b according to an embodiment of the present invention;
FIG. 5d is a schematic diagram of the process of optimizing the video memory of layer c according to an embodiment of the present invention;
FIG. 5e is a schematic diagram of the video memory usage after the video memory optimization is completed according to an embodiment of the present invention;
FIG. 6 is a block diagram of a cache optimization apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the specific embodiments and the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without inventive effort based on the embodiments of the present application fall within the scope of protection of the present application.
Technical solutions provided by embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1a is a schematic diagram of a conventional cache usage method, in which the horizontal axis is the time axis. From FIG. 1a, the inventor observes that video memory 1 and video memory 2 are in use simultaneously for a period of time, so they cannot share storage, whereas the usage periods of video memory 1 and video memory 3 are independent of each other, so they can share storage. FIG. 1b is a schematic diagram of a shared cache, with the horizontal axis again the time axis: shared video memory 1 indicates that video memory 1 and video memory 3 ultimately use the same address space, and shared video memory 2 indicates that video memory 2 and video memory 4 ultimately use the same address space.
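The sharing rule behind these figures reduces to an interval-overlap test on the buffers' lifetimes; a minimal sketch with hypothetical usage intervals:

    def overlaps(a, b):
        # Lifetimes as (start, end) pairs on the time axis of FIG. 1a.
        return a[0] < b[1] and b[0] < a[1]

    mem1, mem2, mem3 = (0, 4), (2, 6), (5, 9)  # hypothetical usage intervals
    assert overlaps(mem1, mem2)      # used simultaneously: cannot share an address
    assert not overlaps(mem1, mem3)  # independent usage times: can share an address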
This embodiment provides a cache optimization method applied to a deep learning network. The deep learning network in this embodiment includes a plurality of layers, generally two or more, and N denotes the number of layers. The flowchart of this embodiment is shown in FIG. 2 and includes the following steps:
S101: performing a simulation operation on the Nth layer of the deep learning network. A deep learning network includes an input layer, an output layer and one or more hidden layers, and the hidden layers together constitute the intermediate layers of the network. Many kinds of simulation operation are possible in a deep learning network; for example, the values of the output layer may be obtained from some combination of the values of the input layer and the weight matrices of the intermediate layers, so the simulation operation can be regarded as the process of computing the weight matrices of the intermediate layers.
S102: after the simulation operation on the Nth layer of the deep learning network, detecting the use state of a first predetermined cache block, wherein the first predetermined cache block is used for caching input data of the Nth-layer simulation operation or input/output data of the simulation operation of a layer preceding the Nth layer.
S103: if the first predetermined cache block is occupied, allocating a second predetermined cache block for output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
The first predetermined cache block and the second predetermined cache block are standby cache blocks allocated before the simulation operation is performed; their number is not limited and may be two or more. The words "first" and "second" merely distinguish the cache blocks.
It should be noted that the steps of the method provided by this embodiment may be executed by the same device, or different devices may execute different steps. For example, step S101 may be executed by device 1, and steps S102 and S103 by device 2, and so on. The device may be any computing device capable of cache optimization, e.g., a personal computer, a laptop or a tablet. To avoid repetition, the execution subjects in the embodiments described below may be the same as or similar to those described here.
The cache block may be, but is not limited to, a carrier having a storage function, such as the memory of a CPU (Central Processing Unit) or the video memory of a GPU (Graphics Processing Unit).
This embodiment thus provides a cache optimization method for a deep learning network that optimizes the cache through conflict detection and cache release during simulation operation. That is, in each layer of training, in order to optimize cache occupation during the real training process, a simulation operation is performed first as a reference for cache allocation; when a cache is allocated after the simulation operation, the current use state of the cache block to be allocated or pre-allocated is checked, and an idle cache is then selected for the allocation. Through this process, the use state of the cache blocks is detected automatically, without relying on optimization personnel to allocate caches manually according to the network structure, so the cache optimization and allocation efficiency of the deep learning network is improved and the method can adapt to more different networks.
In addition, the cache occupied by the deep learning network is released in time once the preset condition is met, so that previously occupied blocks become available again and the whole cache allocation can cycle.
A specific implementation of the embodiment of the present invention may further include performing simulation operation and conflict detection on each layer of the deep learning network, i.e., detecting whether the first predetermined cache block that caches the input data of this layer's simulation operation, or the input/output data of the previous layer's simulation operation, is occupied; and cache allocation, i.e., when the cache block is found to be occupied, allocating a second predetermined cache block for its data; and so on. A flowchart of the cache optimization method based on conflict detection in this embodiment is shown in FIG. 3, and the steps include:
S201: assuming that all predetermined cache blocks are unused, and pointing all cache blocks in which data is to be cached to the same predetermined first cache address.
S202: performing conflict detection on the cache blocks to be used by each layer. When a conflict is detected, a predetermined second cache address is allocated to the conflicting cache block; if no conflict is detected, the next cache block is checked, and so on.
S203: releasing the occupied first predetermined cache block once a preset condition is met, namely the simulation operation of this layer is finished.
Here, "first" and "second" in the names of the first and second predetermined cache blocks reflect the order of their usage, and the second predetermined cache block is immediately adjacent to the first predetermined cache block. This design reduces unnecessary cache waste caused by fragmentation. In addition, in the cache optimization of this embodiment, conflict detection is performed on the cache state after each simulation operation, so caches in the idle state are found in time and cache use efficiency is improved.
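Under the assumption of equal-size predetermined blocks laid out contiguously, "immediately adjacent" amounts to always handing out the lowest free index; a sketch of that rule (layout and names are assumptions):

    def allocate_adjacent(occupied, num_blocks):
        # Blocks sit back to back in memory, so the lowest free index is the
        # one immediately adjacent to the blocks in use (no fragmentation).
        for i in range(num_blocks):
            if i not in occupied:
                return i
        return None  # no second block allocable: release per step S203

    occupied = {0: "memory-1"}  # every buffer initially points at the first address
    assert allocate_adjacent(occupied, num_blocks=4) == 1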
FIG. 4 shows a flowchart of a cache optimization method according to a specific implementation of the embodiment of the present invention. In this embodiment the cache block serves as the storage unit and may be embodied as any carrier with a storage function, such as disk storage, internal memory or video memory. For simplicity, the number of layers N of the deep learning network is 3 in this embodiment, namely layers a, b and c, and the predetermined allocation of the video memory is: video memory 1 and video memory 2 are the input and output of layer a; video memory 3 and video memory 4 are the input and output of layer b; video memory 2 and video memory 4 are the inputs of layer c, and video memory 5 is its output. After the simulation operation of each layer is completed, the video memory blocks occupied by the inputs of that layer are released. The method includes steps S301 to S304:
Step S301: pointing all video memory blocks that need to hold data to the same predetermined video memory address, and marking that this predetermined video memory block is occupied by the data of video memory 1. FIG. 5a shows the video memory state at the beginning of the optimization: video memories 1-5 are all marked red, i.e., all video memory blocks point to the same video memory address, and the red video memory block is marked as occupied by the data of video memory 1 and recorded in the video memory block status table.
Step S302: performing the video memory optimization process on layer a, i.e., performing conflict detection and video memory allocation on the input and output video memories of layer a and recording the allocation result. FIG. 5b shows this process: during the simulation operation, conflict detection on video memory 1 and video memory 2 of layer a finds that the red video memory block that video memory 2 wants to use is already occupied by video memory 1, i.e., a conflict occurs. A new video memory address is therefore allocated to video memory 2, i.e., the yellow video memory block is marked as occupied by the data of video memory 2. After the simulation operation of layer a is completed, the red video memory block occupied by video memory 1 is released, and the current use state of the video memory blocks is recorded in the video memory block status table.
Step S303: performing the video memory optimization process on layer b, i.e., performing conflict detection and video memory allocation on the input and output video memories of layer b and recording the allocation result. FIG. 5c shows this process: during the simulation operation, video memory 3 of layer b wants to cache data in the predetermined red video memory block, and the video memory block status table shows that the red block is unused, so the red block is marked as used by video memory 3. Conflict detection on video memory 3 and video memory 4 of layer b then finds, through the status table, that the red block that video memory 4 wants to use is occupied by video memory 3 and the yellow block is occupied by video memory 2, i.e., a conflict occurs. A new video memory address is therefore allocated to video memory 4, i.e., the blue video memory block is marked as occupied by the data of video memory 4. After the simulation operation of layer b is completed, the red block occupied by video memory 3 is released, and the current use state of the video memory blocks is recorded in the status table.
Step S304: performing the video memory optimization process on layer c, i.e., performing conflict detection and video memory allocation on the input and output video memories of layer c and recording the allocation result. FIG. 5d shows this process: during the simulation operation, video memory 5 of layer c wants to store data in the red video memory block; conflict detection through the status table finds the red block unused, so the red block is marked as occupied by video memory 5. After the conflict detection of all video memory blocks and the simulation operation of layer c are completed, the blue and yellow blocks occupied by video memory 4 and video memory 2 are released and the result is recorded in the status table.
FIG. 5e shows the video memory usage after the optimization is finally completed. As shown in FIG. 5e, the data of video memory 5 is stored in the red video memory block, and video memories 1-4 have all been released.
The video memory block status table described in this embodiment records all video memory occupation states, making each piece of occupation information clearer. Assume the sizes of video memories 1-5 are each 200 MB. Without the optimized allocation described in this embodiment, video memories 1-5 would all be occupied, i.e., 1000 MB of video memory; with the optimized allocation of this embodiment, the actually used video memory blocks are only the red, blue and yellow ones, and the actual occupation is only 600 MB. This cache optimization method therefore saves cache and further improves cache utilization.
The example in this embodiment is only a small deep learning network with three layers. For more complex deep learning networks, e.g., networks with N layers (N greater than 3), cache optimization and allocation can still be realized with this method by taking the output of layer N-1 as the input of layer N.
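The whole worked example can be reproduced in a few lines. In this self-contained sketch the buffer names and the red/yellow/blue labels follow the figures, while the helper functions and data layout are assumptions:

    PHYSICAL = ["red", "yellow", "blue", "spare-1", "spare-2"]  # predetermined blocks

    def first_free(occupied):
        # Conflict detection: the first physical block not currently in use.
        return next(c for c in PHYSICAL if c not in occupied)

    def plan(layers):
        occupied = {}   # physical block -> logical buffer (the status table)
        placement = {}  # logical buffer -> physical block
        for inputs, outputs in layers:
            for buf in inputs + outputs:
                if buf not in placement:            # buffer seen for the first time
                    block = first_free(occupied)
                    occupied[block] = buf
                    placement[buf] = block
            for buf in inputs:                      # preset condition met:
                occupied.pop(placement[buf], None)  # release this layer's input blocks
        return placement

    # layer a: mem1 -> mem2;  layer b: mem3 -> mem4;  layer c: mem2 + mem4 -> mem5
    chain = [(["mem1"], ["mem2"]),
             (["mem3"], ["mem4"]),
             (["mem2", "mem4"], ["mem5"])]
    placement = plan(chain)
    print(placement)  # mem1, mem3 and mem5 share red; mem2 -> yellow; mem4 -> blue
    print(200 * len(set(placement.values())), "MB")  # 600 MB, not 5 * 200 = 1000 MB

Running the sketch places video memories 1, 3 and 5 in the red block and video memories 2 and 4 in the yellow and blue blocks, matching FIGS. 5a-5e and the 600 MB figure above.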
Based on the same inventive concept, an embodiment of the invention further provides a cache optimization apparatus.
FIG. 6 is a block diagram of a cache optimization apparatus according to an embodiment of the present invention. As shown in FIG. 6, the apparatus includes a simulation operation unit 610, a state detection unit 620, a cache allocation unit 630 and a cache release unit 640; the structure and connections of these modules are described in detail below.
The simulation operation unit 610 is configured to perform a simulation operation on the Nth layer of the deep learning network.
The state detection unit 620, coupled to the simulation operation unit 610, is configured to detect whether the first predetermined cache block is occupied after the simulation operation on the Nth layer of the deep learning network.
The cache allocation unit 630, coupled to the state detection unit 620, is configured to allocate a second predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is occupied.
The cache release unit 640, coupled to the cache allocation unit 630, is configured to release the occupied first predetermined cache block when a preset condition is met.
Optionally, the correspondence between cache blocks and their status flags is stored in a cache block status table, and the state detection unit 620 further includes a query subunit (not shown) specifically configured to query the status flag corresponding to the first predetermined cache block in the cache block status table according to the identifier of the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.
Optionally, the cache allocation unit 630 is specifically configured to allocate a second predetermined cache block immediately adjacent to the first predetermined cache block for output data of the Nth-layer simulation operation.
Optionally, the cache release unit 640 is specifically configured to: release the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer; or release the occupied first predetermined cache block when no second predetermined cache block is allocable.
Optionally, the state detection unit 620 is further configured to allocate the first predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is detected to be unoccupied. The first predetermined cache block and the second predetermined cache block may be marked by different colors.
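For illustration, the four units can be collapsed into one cooperating object; this compact sketch only mirrors the division of responsibilities described above, and all internals are assumptions:

    class CacheOptimizationApparatus:
        def __init__(self):
            self.status_table = {}  # block id -> id of the data occupying it

        def simulate(self, layer_fn):            # simulation operation unit
            return layer_fn()

        def is_occupied(self, block_id):         # state detection unit (query subunit)
            return block_id in self.status_table

        def allocate(self, block_id, data_id):   # cache allocation unit
            self.status_table[block_id] = data_id

        def release(self, block_id):             # cache release unit
            self.status_table.pop(block_id, None)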
The technical solution of the invention improves on the traditional cache optimization approach, in which the optimizer had to know the deep learning network thoroughly and know when each cache block was in use and when it was not, and thereby makes cache optimization in the deep learning field more universal and simpler. Moreover, by simulating operation, detecting conflicts and releasing caches, caches in the idle state are found in time, which improves cache use efficiency.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof.
In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (13)

1. A cache optimization method applied to a deep learning network, wherein the deep learning network comprises N layers and N is greater than or equal to 2, the method comprising:
performing a simulation operation on the Nth layer of the deep learning network;
after the simulation operation on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, wherein the first predetermined cache block is used for caching input data of the Nth-layer simulation operation or input/output data of the simulation operation of a layer preceding the Nth layer; and
if the first predetermined cache block is occupied, allocating a second predetermined cache block for output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
2. The method according to claim 1, wherein each cache block has a corresponding occupied or unoccupied status flag, and detecting whether the first predetermined cache block is occupied comprises:
querying the status flag corresponding to the first predetermined cache block in a cache block status table according to the identifier of the first predetermined cache block, and determining from the status flag whether the first predetermined cache block is occupied.
3. The method of claim 1, wherein allocating a second predetermined cache block for output data of the Nth-layer simulation operation comprises:
allocating a second predetermined cache block immediately adjacent to the first predetermined cache block for the output data of the Nth-layer simulation operation.
4. The method according to claim 1, wherein releasing the occupied first predetermined cache block when a preset condition is met comprises:
releasing the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or
releasing the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer of the deep learning network; or
releasing the occupied first predetermined cache block when no second predetermined cache block is allocable.
5. The method of claim 1, wherein the first predetermined cache block is allocated for output data of the Nth-layer simulation operation upon detecting that the first predetermined cache block is unoccupied.
6. The method according to any one of claims 1-5, wherein the first predetermined cache block and the second predetermined cache block are marked by different colors.
7. A cache optimization apparatus applied to a deep learning network, wherein the deep learning network comprises N layers and N is greater than or equal to 2, the apparatus comprising a simulation operation unit, a state detection unit, a cache allocation unit and a cache release unit, wherein:
the simulation operation unit is used for performing a simulation operation on the Nth layer of the deep learning network;
the state detection unit is used for detecting, after the simulation operation on the Nth layer of the deep learning network, whether a first predetermined cache block is occupied, wherein the first predetermined cache block is used for caching input data of the Nth-layer simulation operation or input/output data of the simulation operation of a layer preceding the Nth layer;
the cache allocation unit is used for allocating a second predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is occupied; and
the cache release unit is used for releasing the occupied first predetermined cache block when a preset condition is met.
8. The apparatus of claim 7, wherein each cache block has a corresponding occupied or unoccupied status flag, and the state detection unit comprises a query subunit configured to query the status flag corresponding to the first predetermined cache block and determine from the status flag whether the first predetermined cache block is occupied.
9. The apparatus according to claim 8, wherein the mapping between cache blocks and their status flags is stored in a cache block status table, and the query subunit is specifically configured to query the status flag corresponding to the first predetermined cache block in the cache block status table according to the identifier of the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.
10. The apparatus of claim 7, wherein the cache allocation unit is specifically configured to allocate a second predetermined cache block immediately adjacent to the first predetermined cache block for output data of the Nth-layer simulation operation.
11. The apparatus of claim 7, wherein the cache release unit is specifically configured to: release the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or
release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer of the deep learning network; or
release the occupied first predetermined cache block when no second predetermined cache block is allocable.
12. The apparatus of claim 7, wherein the state detection unit is further configured to allocate the first predetermined cache block for output data of the Nth-layer simulation operation when the first predetermined cache block is detected to be unoccupied.
13. The apparatus according to any one of claims 7-12, wherein the first predetermined cache block and the second predetermined cache block are marked by different colors.
CN201611132262.9A 2016-12-09 2016-12-09 Cache optimization method and device applied to deep learning network Active CN108615077B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201611132262.9A CN108615077B (en) 2016-12-09 2016-12-09 Cache optimization method and device applied to deep learning network
PCT/CN2017/108030 WO2018103472A1 (en) 2016-12-09 2017-10-27 Method and device for buffer optimization in deep learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611132262.9A CN108615077B (en) 2016-12-09 2016-12-09 Cache optimization method and device applied to deep learning network

Publications (2)

Publication Number Publication Date
CN108615077A CN108615077A (en) 2018-10-02
CN108615077B (en) 2021-08-24

Family

ID=62490680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611132262.9A Active CN108615077B (en) 2016-12-09 2016-12-09 Cache optimization method and device applied to deep learning network

Country Status (2)

Country Link
CN (1) CN108615077B (en)
WO (1) WO2018103472A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447253B (en) * 2018-10-26 2021-04-27 杭州比智科技有限公司 Video memory allocation method and device, computing equipment and computer storage medium
CN110851187B (en) * 2019-11-19 2023-06-02 北京百度网讯科技有限公司 Video memory processing method, device, equipment and medium
CN112862085B (en) * 2019-11-27 2023-08-22 杭州海康威视数字技术股份有限公司 Storage space optimization method and device
CN111666150B (en) * 2020-05-09 2022-01-11 深圳云天励飞技术股份有限公司 Storage space allocation method and device, terminal and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101809597A (en) * 2007-09-26 2010-08-18 佳能株式会社 Calculation processing apparatus and method
CN103455443A (en) * 2013-09-04 2013-12-18 华为技术有限公司 Buffer management method and device
CN104133784A (en) * 2014-07-24 2014-11-05 大唐移动通信设备有限公司 Message buffer management method and message buffer management device
CN104636285A (en) * 2015-02-03 2015-05-20 北京麓柏科技有限公司 Flash memory storage system and reading, writing and deleting method thereof
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN105677583A (en) * 2015-12-31 2016-06-15 华为技术有限公司 Cache management method and device
CN106022468A (en) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 Artificial neural network processor integrated circuit and design method therefor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957800A (en) * 2010-06-12 2011-01-26 福建星网锐捷网络有限公司 Multichannel cache distribution method and device
US8965819B2 (en) * 2010-08-16 2015-02-24 Oracle International Corporation System and method for effective caching using neural networks


Also Published As

Publication number Publication date
WO2018103472A1 (en) 2018-06-14
CN108615077A (en) 2018-10-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant