WO2018103472A1 - Method and device for buffer optimization in deep learning network - Google Patents
- Publication number
- WO2018103472A1 (PCT/CN2017/108030)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cache block
- occupied
- predetermined cache
- predetermined
- deep learning
- Prior art date
- 2016-12-09
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
Definitions
- The present application relates to the field of computer technologies, and in particular, to a cache optimization method applied to a deep learning network and a device corresponding to the method.
- Deep learning is a new field in machine learning research and a comparatively effective artificial intelligence method. By simulating the learning behavior of the human brain, it learns relevant knowledge from data for subsequent prediction. Deep learning uses a so-called "network" composed of multiple "layers" (for example, convolutions); each layer takes the output of the previous layer (or the preceding several layers) as its input for training, and its training result in turn serves as the input to the next layer.
- The training process is thus an ordered, layer-by-layer computation.
- When cache optimization for deep learning is performed manually based on empirical knowledge, the optimizer must understand the structure of the deep learning network model in detail and know the usage timing of every cache block holding intermediate data, that is, when each block will be used and when it will not, so that blocks which are never in use at the same time can share storage, reducing the total cache size.
- The purpose of the present application is to provide a cache optimization method and device applied to a deep learning network, so as to solve the problem of low cache optimization efficiency in the per-layer training of a deep learning network.
- A cache optimization method for a deep learning network is provided, where the deep learning network includes N layers and N is greater than or equal to 2.
- The method includes: performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer; and, if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation and releasing the occupied first predetermined cache block when a preset condition is satisfied.
- Optionally, each cache block has an occupied or unoccupied status tag stored in a cache block status table, and detecting whether the first predetermined cache block is occupied specifically includes: querying the status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to that block, and determining from the status tag whether the block is occupied.
- Optionally, allocating the second predetermined cache block specifically includes: allocating, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
- Optionally, releasing the occupied first predetermined cache block when the preset condition is met specifically includes: releasing it when the simulation operation is performed on the (N+1)th layer of the deep learning network; or releasing it after the Nth-layer simulation operation and before the (N+1)th-layer simulation operation; or releasing it when no second predetermined cache block is available for allocation.
- Optionally, when it is detected that the first predetermined cache block is not occupied, the first predetermined cache block is allocated for the output data of the Nth-layer simulation operation.
- Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
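As a concrete reading of the claimed flow, the following minimal Python sketch shows one layer's simulation, the occupancy check against a status table, and the conditional allocation and release; all identifiers (status_table, run_layer, the block names) are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of the claimed per-layer flow; the status table maps a
# predetermined cache block's identifier to its status tag (None = unoccupied).
status_table = {"block1": None, "block2": None}

def run_layer(n, simulate, first="block1", second="block2"):
    output_data = simulate(n)               # simulation operation on layer N
    if status_table[first] is not None:     # first predetermined block occupied
        status_table[second] = output_data  # allocate the second predetermined block
        status_table[first] = None          # release once the preset condition holds
    else:                                   # unoccupied: use the first block
        status_table[first] = output_data

run_layer(2, simulate=lambda n: f"layer-{n}-output")
print(status_table)  # {'block1': 'layer-2-output', 'block2': None}
```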
- A cache optimization apparatus applied to a deep learning network comprising N layers, N being greater than or equal to 2, includes:
- a simulation operation unit, configured to perform a simulation operation on the Nth layer of the deep learning network;
- a state detecting unit, configured to detect whether the first predetermined cache block is occupied after the simulation operation is performed on the Nth layer, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer;
- a cache allocation unit, configured to allocate a second predetermined cache block for the output data of the Nth-layer simulation operation when the first predetermined cache block is occupied; and
- a cache release unit, configured to release the occupied first predetermined cache block when a preset condition is met.
- Optionally, each cache block has an occupied or unoccupied status tag corresponding to it; the state detecting unit then includes a query subunit configured to query the status tag corresponding to the first predetermined cache block and determine from it whether the block is occupied.
- Optionally, the correspondence between cache blocks and their status tags is stored in a cache block status table, and the query subunit is specifically configured to query the status table, according to the identifier of the first predetermined cache block, for the status tag of that block, so as to determine from the query result whether the block is occupied.
- Optionally, the cache allocation unit is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
- Optionally, the cache release unit is specifically configured to: release the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or release it after the Nth-layer simulation operation and before the (N+1)th-layer simulation operation; or release it when no second predetermined cache block is available for allocation.
- Optionally, the state detecting unit is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when detecting that the first predetermined cache block is not occupied.
- Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
- An embodiment of the present application further discloses a storage medium for storing executable program code, the executable program code being used to be run to perform the above cache optimization method applied to a deep learning network.
- In the present application, after the simulation operation is performed on the Nth layer of the deep learning network, it is detected whether the first predetermined cache block is occupied; if it is occupied, a different cache block is allocated for the output data, and the occupied first predetermined cache block is released when a preset condition is met. When a buffer must be allocated for the result data of a simulation operation, the usage state of the predetermined cache blocks can thus be identified automatically, instead of relying on optimization personnel to master the deep learning network model structure and the cache usage state, which improves cache optimization efficiency.
- Figure 1a is a schematic diagram of a conventional cache usage method, the horizontal axis of which is the time axis;
- Figure 1b is a schematic diagram of a shared cache, the horizontal axis of which is the time axis;
- FIG. 2 is a flowchart of a cache optimization method according to an embodiment of the present application.
- FIG. 3 is a flowchart of a cache optimization method according to an embodiment of the present application.
- FIG. 4 is a flowchart of a cache optimization method according to an embodiment of the present application.
- FIG. 5a is a schematic diagram of the video memory state at the beginning of video memory optimization in the embodiment of the present application;
- FIG. 5b is a schematic diagram of the video memory optimization process for layer a in the embodiment of the present application;
- FIG. 5c is a schematic diagram of the video memory optimization process for layer b in the embodiment of the present application;
- FIG. 5d is a schematic diagram of the video memory optimization process for layer c in the embodiment of the present application;
- FIG. 5e is a schematic diagram of video memory usage after video memory optimization is finally completed in the embodiment of the present application;
- FIG. 6 is a structural block diagram of a cache optimization apparatus in an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- Figure 1a is a schematic diagram of the traditional cache usage method, in which the horizontal axis is the time axis. From Figure 1a, the applicant observes that video memory 1 and video memory 2 are used simultaneously for a period of time and therefore cannot be shared, whereas the usage times of video memory 1 and video memory 3 are relatively independent, so they can be shared.
- Figure 1b is a schematic diagram of a shared cache, where the horizontal axis is the time axis.
- Shared video memory 1 indicates that video memory 1 and video memory 3 ultimately use the same address space; shared video memory 2 indicates that video memory 2 and video memory 4 ultimately use the same address space.
- This embodiment provides a cache optimization method applied to a deep learning network.
- The deep learning network in this embodiment includes multiple layers; in general the number of layers, denoted N, is greater than or equal to 2.
- The flowchart of this embodiment is shown in FIG. 2 and includes the following steps:
- S101: Perform a simulation operation on the Nth layer of the deep learning network. A deep learning network includes an input layer, an output layer, and one or more hidden layers; together, the hidden layers form the intermediate layers of the network.
- For example, the values of the output layer can be obtained by some combination of the values of the input layer and the weight matrices of the intermediate layers.
- The simulation operation can thus be seen as the process of computing these intermediate-layer weight matrices.
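As a minimal illustration of such a simulation operation, the sketch below combines input values with an intermediate-layer weight matrix; it assumes a plain fully connected layer, which the patent does not mandate (a convolution would play the same role).

```python
# Illustrative only: output[i] = sum_j weights[i][j] * inputs[j]
def simulate_layer(inputs, weights):
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

print(simulate_layer([1.0, 2.0], [[0.5, -1.0], [2.0, 0.25]]))  # [-1.5, 2.5]
```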
- S102: After performing the simulation operation on the Nth layer of the deep learning network, detect the usage state of the first predetermined cache block, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer.
- S103: If the first predetermined cache block is occupied, allocate a second predetermined cache block for the output data of the Nth-layer simulation operation, and release the occupied first predetermined cache block when a preset condition is met.
- The first predetermined cache block and the second predetermined cache block are standby cache blocks allocated before the simulation operation is performed; the number of such cache blocks is not limited and may be two or more.
- The words "first" and "second" in "first predetermined cache block" and "second predetermined cache block" are used to distinguish the cache blocks.
- It should be noted that the steps of the method provided in this embodiment may all be performed by the same device, or the method may be performed by different devices.
- For example, step S101 may be performed by device 1, while steps S102 and S103 may be performed by device 2, and so on.
- The device may be a computing device capable of implementing cache optimization, such as a personal computer, laptop, or tablet.
- To avoid repetition, the execution subject in all of the embodiments described below may be the same as or similar to the execution subject in the above embodiment.
- The cache block may be, but is not limited to, a carrier having a storage function, such as the memory of a CPU (Central Processing Unit) or the video memory of a GPU (Graphics Processing Unit).
- Through this embodiment, a cache optimization method based on a deep learning network is provided, which optimizes the cache by performing conflict detection and cache release during the simulation operation. That is, in the training of each layer of the deep learning network, in order to optimize the cache occupancy of the real training process, a simulation operation is first performed as a reference for cache allocation.
- When caches are allocated after the simulation operation, the current usage state of the cache blocks to be allocated or pre-allocated is checked, and an idle cache block is selected for allocation.
- In addition, for caches already occupied by the deep learning network, the previously output data is released in time once the preset condition is met, so that the entire cache allocation can be cycled.
- A specific implementation of the embodiment of the present application may further include: performing a simulation operation on each layer of the deep learning network; conflict detection, that is, detecting whether the first predetermined cache block, which buffers the input data of this layer's simulation operation or the input/output data of the previous layer's simulation operation, is occupied; and cache allocation, that is, upon finding the first predetermined cache block occupied, allocating a second predetermined cache block for its data, and so on.
- FIG. 3 shows a schematic flowchart of the cache optimization method based on this conflict detection approach; its steps include:
- S201: Assume that all predetermined cache blocks are unused, and point every cache block that needs to cache data to the same predetermined first cache address.
- S202: Perform conflict detection on the cache blocks that each layer needs to use. When a conflict is detected, perform step S202a; if no conflict is detected, perform step S202b.
- S202a: Allocate a predetermined second cache address to the conflicting cache block.
- S202b: Continue to check the next cache block.
- S203: When the preset condition is met, that is, after this layer's simulation operation ends, release the occupied first predetermined cache block.
- Here, "first" and "second" reflect the order in which the predetermined cache blocks are used, and the second predetermined cache block is immediately adjacent to the first predetermined cache block.
- The purpose of this design is to reduce unnecessary cache waste caused by storage fragmentation.
- Moreover, the cache optimization approach in this embodiment performs "conflict detection" on the cache state after each simulation operation and promptly discovers caches in the idle state, thereby improving cache utilization.
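The steps S201-S203 above can be sketched as follows, under the assumption that cache addresses are consecutive integers and the status table is a plain dict mapping an address to the buffer occupying it; all names are illustrative.

```python
def optimize_one_layer(buffers_needed, first_addr, occupied):
    """S202: check each cache block this layer uses against the status table."""
    next_addr = first_addr + 1           # second address, adjacent to the first
    for buf in buffers_needed:
        if first_addr not in occupied:   # no conflict: keep the shared address
            occupied[first_addr] = buf   # (S202b: move on to the next block)
        else:                            # S202a: conflict, use the second address
            occupied[next_addr] = buf
            next_addr += 1               # stay adjacent to limit fragmentation
    return occupied

occupied = optimize_one_layer(["input", "output"], first_addr=0, occupied={})
print(occupied)   # {0: 'input', 1: 'output'}
del occupied[0]   # S203: this layer's simulation ended, release the first block
```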
- In this embodiment, the cache block, as a storage unit, can be embodied as a carrier having a storage function such as disk storage, main memory, or video memory; the cache optimization method of this embodiment therefore applies to the optimization of disk storage, main memory, video memory, and the like.
- Video memory is used as the example below; accordingly, the cache block status table in this embodiment is a video memory block status table.
- For simplicity, the deep learning network has N = 3 layers, namely layers a, b, and c.
- The predetermined allocation of video memory is: video memory 1 and video memory 2 are the input and output of layer a, and video memory 3 and video memory 4 are the input and output of layer b.
- Video memory 2 and video memory 4 are the inputs of layer c, and video memory 5 is the output of layer c. When the simulation operation of each layer completes, the video memory block occupied by that layer's input is released.
- The method includes (steps S301-S304):
- Step S301: Point all video memory blocks that need to store data to the same predetermined video memory address, and mark that predetermined block as occupied by the data of video memory 1.
- FIG. 5a shows the video memory state at the beginning of video memory optimization.
- As shown in FIG. 5a, video memories 1-5 are all marked red, that is, all video memory blocks point to the same video memory address and are marked with red.
- The red block is occupied by the data of video memory 1, and this is recorded in the video memory block status table.
- The video memory block status table may be stored in any one of video memories 1-5, or in a storage unit other than video memories 1-5; either is reasonable.
- Step S302: Perform the video memory optimization process on layer a, that is, perform conflict detection and video memory allocation on the input and output video memory of layer a, and record the allocation results.
- FIG. 5b shows the video memory optimization process for layer a.
- In the simulation operation, conflict detection is performed on video memory 1 and video memory 2 of layer a, and it is found that the red block which video memory 2 wants to use is already occupied by video memory 1, that is, a conflict occurs.
- At this time, a new video memory address is allocated for video memory 2, marking the yellow block as occupied by the data of video memory 2.
- After the simulation operation of layer a completes, the red block occupied by video memory 1 is released, and the video memory block usage state at this time is recorded in the status table.
- Step S303: Perform the video memory optimization process on layer b, that is, perform conflict detection and video memory allocation on the input and output video memory of layer b, and record the allocation results.
- FIG. 5c shows the video memory optimization process for layer b.
- In the simulation operation, video memory 3 of layer b wants to cache its data in the predetermined red block; the status table shows that the red block is unused, so the red block is marked as used by video memory 3.
- Next, conflict detection is performed on video memory 3 and video memory 4 of layer b; the status table shows that the red block which video memory 4 wants to use is already occupied by video memory 3 and the yellow block is already occupied by video memory 2, that is, a conflict occurs.
- At this time, a new video memory address is allocated for video memory 4, marking the blue block as occupied by the data of video memory 4. After the simulation operation of layer b completes, the red block occupied by video memory 3 is released, and the video memory block usage state at this time is recorded in the status table.
- Step S304: Perform the video memory optimization process on layer c, that is, perform conflict detection and video memory allocation on the input and output video memory of layer c, and record the allocation results.
- FIG. 5d shows the video memory optimization process for layer c.
- In the simulation operation, video memory 5 of layer c wants to store its data in the red block, and conflict detection is performed through the video memory block status table.
- The red block is found to be unused, so it is marked as occupied by video memory 5. At this point, the conflict state of all video memory blocks has been detected.
- After the simulation operation of layer c completes, the blue and yellow blocks occupied by video memory 4 and video memory 2 are released, and this is recorded in the video memory block status table.
- FIG. 5e shows the video memory usage after optimization is finally complete: the data of video memory 5 is stored in the red block, and video memories 1-4 have all been released.
- The video memory block status table described in this embodiment records the occupancy state of all video memory blocks; using it makes the occupancy information of each video memory clearer.
- Suppose each of video memories 1-5 has a size of 200 MB. Without the video memory optimization method described in this embodiment, video memories 1-5 would all need to be occupied, that is, 5 × 200 MB = 1000 MB of video memory. After allocation is optimized by the method of this embodiment, the blocks actually used are only the red, blue, and yellow ones, so the actual usage is only 3 × 200 MB = 600 MB. The cache optimization method of this embodiment thus saves cache and further improves cache utilization.
- The example in this embodiment is only a small deep learning network with three layers; for a deep learning network with more layers, in which the output of layer N-1 serves as the input of layer N, the cache can still be optimized and allocated in the same way.
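The three-layer walkthrough above (layers a, b, c; video memories 1-5 of 200 MB each) can be reproduced end to end with the short script below. The data structures are illustrative, not from the patent, but the allocation rule (reuse a released block if one exists, otherwise take an adjacent new block, and release a layer's inputs once its simulation completes) follows the embodiment and yields the same 600 MB versus 1000 MB result.

```python
BLOCK_MB = 200
layers = [            # (input buffer ids, output buffer ids) for layers a, b, c
    ((1,), (2,)),     # layer a: video memory 1 in, 2 out
    ((3,), (4,)),     # layer b: video memory 3 in, 4 out
    ((2, 4), (5,)),   # layer c: video memories 2 and 4 in, 5 out
]
assignment = {}       # buffer id -> physical block index
blocks = []           # physical blocks; entry is the occupying buffer or None

def allocate(buf):
    for i, owner in enumerate(blocks):   # conflict detection via the status list
        if owner is None:
            blocks[i] = buf              # reuse a released (idle) block
            assignment[buf] = i
            return
    blocks.append(buf)                   # all occupied: take an adjacent new block
    assignment[buf] = len(blocks) - 1

for inputs, outputs in layers:
    for buf in inputs:
        if buf not in assignment:        # inputs produced by earlier layers stay put
            allocate(buf)
    for buf in outputs:
        allocate(buf)
    for buf in inputs:                   # layer done: release its input blocks
        blocks[assignment[buf]] = None

print(len(blocks) * BLOCK_MB)            # 600 (MB), versus 5 * 200 = 1000 unoptimized
```

The three physical blocks correspond to the red, yellow, and blue blocks of FIGS. 5a-5e.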
- An embodiment of the present application also provides a cache optimization apparatus.
- FIG. 6 is a structural block diagram of the cache optimization apparatus according to an embodiment of the present application. As shown in FIG. 6, the apparatus includes a simulation operation unit 610, a state detecting unit 620, a cache allocation unit 630, and a cache release unit 640; the structure and connection relationships of each module are described in detail below.
- The simulation operation unit 610 is configured to perform a simulation operation on the Nth layer of the deep learning network.
- The state detecting unit 620 is coupled to the simulation operation unit 610 and is configured to detect whether the first predetermined cache block is occupied after the simulation operation is performed on the Nth layer.
- The cache allocation unit 630 is coupled to the state detecting unit 620 and is configured to allocate a second predetermined cache block for the output data of the Nth-layer simulation operation when the first predetermined cache block is occupied.
- The cache release unit 640 is coupled to the cache allocation unit 630 and is configured to release the occupied first predetermined cache block when a preset condition is met.
- Optionally, the correspondence between each cache block and its status tag is stored in the cache block status table, and the state detecting unit 620 further includes a query subunit (not shown), specifically configured to query the status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to that block, so as to determine from the query result whether the first predetermined cache block is occupied.
- The cache allocation unit 630 is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
- The cache release unit 640 is specifically configured to: release the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or release it after the Nth-layer simulation operation and before the (N+1)th-layer simulation operation; or release it when no second predetermined cache block is available for allocation.
- The state detecting unit 620 is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when detecting that the first predetermined cache block is not occupied.
- The first predetermined cache block and the second predetermined cache block may be marked by different colors.
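One way the four units could be organized in code is sketched below; the unit names follow the embodiment, while the class name and interfaces are assumptions made for illustration.

```python
class CacheOptimizationApparatus:
    """Sketch of units 610-640 around a shared cache block status table."""
    def __init__(self):
        self.status_table = {}                  # block id -> data or None

    def simulate(self, layer):                  # simulation operation unit 610
        return layer()                          # run the layer's simulation

    def is_occupied(self, block_id):            # state detecting unit 620
        return self.status_table.get(block_id) is not None

    def allocate(self, block_id, output_data):  # cache allocation unit 630
        self.status_table[block_id] = output_data

    def release(self, block_id):                # cache release unit 640
        self.status_table[block_id] = None
```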
- In the traditional cache optimization method, the optimizer must understand the deep learning network in detail and know the usage timing of each cache, including when each cache will and will not be used.
- Compared with that complex approach, the cache optimization of this embodiment is more general and simpler in the deep learning domain: through the method of "simulation operation, conflict detection, cache release", caches in the idle state are discovered in time, thereby improving cache utilization.
- An embodiment of the present application further provides an electronic device, as shown in FIG. 7, comprising: a housing 701, a processor 702, a memory 703, a circuit board 704, and a power circuit 705, wherein the circuit board 704 is disposed in the housing 701;
- the processor 702 and the memory 703 are disposed on the circuit board 704;
- the power circuit 705 is used to supply power to the various circuits or devices of the electronic device;
- the memory 703 is used to store executable program code; and
- the processor 702 reads the executable program code stored in the memory 703 and runs a program corresponding to the executable program code, so as to execute the cache optimization method applied to the deep learning network, the method comprising:
- performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether the first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer;
- if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is satisfied.
- In the present application, after the simulation operation is performed on the Nth layer of the deep learning network, it is detected whether the first predetermined cache block is occupied; if it is occupied, a different cache block is allocated for the output data, and the occupied first predetermined cache block is released when a preset condition is met. When a buffer must be allocated for the result data of a simulation operation, the usage state of the predetermined cache blocks can thus be identified automatically, instead of relying on optimization personnel to master the deep learning network model structure and the cache usage state, which improves cache optimization efficiency.
- An embodiment of the present application further provides executable program code, where the executable program code is used to be run to perform the cache optimization method applied to the deep learning network, the method comprising:
- performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether the first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer;
- if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is satisfied.
- In the present application, after the simulation operation is performed on the Nth layer of the deep learning network, it is detected whether the first predetermined cache block is occupied; if it is occupied, a different cache block is allocated for the output data, and the occupied first predetermined cache block is released when a preset condition is met. When a buffer must be allocated for the result data of a simulation operation, the usage state of the predetermined cache blocks can thus be identified automatically, instead of relying on optimization personnel to master the deep learning network model structure and the cache usage state, which improves cache optimization efficiency.
- An embodiment of the present application further provides a storage medium for storing executable program code, where the executable program code is used to be run to perform the cache optimization method applied to a deep learning network, the method comprising:
- performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether the first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer;
- if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is satisfied.
- In the present application, after the simulation operation is performed on the Nth layer of the deep learning network, it is detected whether the first predetermined cache block is occupied; if it is occupied, a different cache block is allocated for the output data, and the occupied first predetermined cache block is released when a preset condition is met. When a buffer must be allocated for the result data of a simulation operation, the usage state of the predetermined cache blocks can thus be identified automatically, instead of relying on optimization personnel to master the deep learning network model structure and the cache usage state, which improves cache optimization efficiency.
- The description above is relatively brief; for the relevant parts, refer to the description of the method embodiment.
- a "computer-readable medium” can be any apparatus that can contain, store, communicate, propagate, or transport a program for use in an instruction execution system, apparatus, or device, or in conjunction with the instruction execution system, apparatus, or device.
- More specific examples of computer readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CD-ROM).
- The computer readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise suitably processing it if necessary, and then stored in a computer memory.
- multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
- For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Image Generation (AREA)
Abstract
A method and device for buffer optimization in a deep learning network. The buffer optimization method comprises: simulating an Nth layer in the deep learning network; and determining, after the simulating step, whether a first preset buffer block is occupied, and if so, allocating, to output data in the Nth layer, a second preset buffer block. The method and device are utilized to resolve the issue of complicated buffer optimization during training of a deep learning network, providing simpler and more efficient buffer optimization and allocation, and adapting to various networks.
Description
The present application claims priority to Chinese Patent Application No. 201611132262.9, entitled "A Cache Optimization Method and Apparatus for Deep Learning Networks", filed with the Chinese Patent Office on December 9, 2016, the entire contents of which are incorporated herein by reference.
The present application relates to the field of computer technologies, and in particular, to a cache optimization method applied to a deep learning network and a device corresponding to the method.
Deep learning is a new field in machine learning research and a comparatively effective artificial intelligence method. By simulating the learning behavior of the human brain, it learns relevant knowledge from data for subsequent prediction. Deep learning uses a so-called "network" composed of multiple "layers" (for example, convolutions); each layer takes the output of the previous layer (or the preceding several layers) as its input for training, and its training result in turn serves as the input to the next layer. The training process is thus an ordered, layer-by-layer computation.
In deep learning training, a large amount of intermediate data is generated; for training purposes this data usually needs to be cached, occupying a large amount of cache. Exploring good cache optimization methods is therefore especially important. If cache optimization for deep learning is performed manually based on empirical knowledge, the optimizer must understand the structure of the deep learning network model in detail and know the usage timing of every cache block holding intermediate data, that is, when each block will be used and when it will not, so that blocks which are never in use at the same time can share storage, reducing the total cache size.
However, different deep learning network models have different structures and use caches differently; the demands on the optimizer are high, this is difficult to achieve in practice, and cache optimization efficiency is reduced.
Summary of the invention
The purpose of the present application is to provide a cache optimization method and device applied to a deep learning network, so as to solve the problem of low cache optimization efficiency in the per-layer training of a deep learning network.
According to one aspect of the present application, a cache optimization method applied to a deep learning network is provided, where the deep learning network includes N layers and N is greater than or equal to 2. The method includes: performing a simulation operation on the Nth layer of the deep learning network; after performing the simulation operation on the Nth layer, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer; and, if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation and releasing the occupied first predetermined cache block when a preset condition is satisfied.
Optionally, each cache block has an occupied or unoccupied status tag stored in a cache block status table, and detecting whether the first predetermined cache block is occupied specifically includes: querying the status table, according to the identifier of the first predetermined cache block, for the status tag corresponding to that block, and determining from the status tag whether the block is occupied.
Optionally, allocating the second predetermined cache block for the output data of the Nth-layer simulation operation specifically includes: allocating, for that output data, a second predetermined cache block immediately adjacent to the first predetermined cache block.
Optionally, releasing the occupied first predetermined cache block when the preset condition is met specifically includes: releasing it when the simulation operation is performed on the (N+1)th layer of the deep learning network; or releasing it after the Nth-layer simulation operation and before the (N+1)th-layer simulation operation; or releasing it when no second predetermined cache block is available for allocation.
Optionally, when it is detected that the first predetermined cache block is not occupied, the first predetermined cache block is allocated for the output data of the Nth-layer simulation operation.
Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
According to another aspect of the present application, a cache optimization apparatus applied to a deep learning network is further provided, where the deep learning network includes N layers, N being greater than or equal to 2, the apparatus comprising:
a simulation operation unit, configured to perform a simulation operation on the Nth layer of the deep learning network;
a state detecting unit, configured to detect whether the first predetermined cache block is occupied after the simulation operation is performed on the Nth layer of the deep learning network, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer;
a cache allocation unit, configured to allocate a second predetermined cache block for the output data of the Nth-layer simulation operation when the first predetermined cache block is occupied; and
a cache release unit, configured to release the occupied first predetermined cache block when a preset condition is met.
Optionally, each cache block has an occupied or unoccupied status tag corresponding to it; the state detecting unit then includes a query subunit configured to query the status tag corresponding to the first predetermined cache block and determine from it whether the block is occupied.
Optionally, the correspondence between cache blocks and their status tags is stored in a cache block status table, and the query subunit is specifically configured to query the status table, according to the identifier of the first predetermined cache block, for the status tag of that block, so as to determine from the query result whether the block is occupied.
Optionally, the cache allocation unit is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
Optionally, the cache release unit is specifically configured to: release the occupied first predetermined cache block when the simulation operation is performed on the (N+1)th layer of the deep learning network; or release it after the Nth-layer simulation operation and before the (N+1)th-layer simulation operation; or release it when no second predetermined cache block is available for allocation.
Optionally, the state detecting unit is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when detecting that the first predetermined cache block is not occupied.
Optionally, the first predetermined cache block and the second predetermined cache block are marked by different colors.
To achieve the above objective, an embodiment of the present application further discloses a storage medium for storing executable program code, the executable program code being used to be run to perform the above cache optimization method applied to a deep learning network.
In the present application, after the simulation operation is performed on the Nth layer of the deep learning network, it is detected whether the first predetermined cache block is occupied; if it is occupied, a different cache block is allocated for the output data, and the occupied first predetermined cache block is released when a preset condition is met. When a buffer must be allocated for the result data of a simulation operation, the usage state of the predetermined cache blocks can thus be identified automatically, instead of relying on optimization personnel to master the deep learning network model structure and the cache usage state, which improves cache optimization efficiency.
To explain the technical solutions of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below; the embodiments and their description are used to explain the present application and do not unduly limit it. In the drawings:
Figure 1a is a schematic diagram of the traditional cache usage method, the horizontal axis of which is the time axis;
Figure 1b is a schematic diagram of a shared cache, the horizontal axis of which is the time axis;
FIG. 2 is a flowchart of a cache optimization method according to an embodiment of the present application;
FIG. 3 is a flowchart of a cache optimization method according to an embodiment of the present application;
FIG. 4 is a flowchart of a cache optimization method according to an embodiment of the present application;
FIG. 5a is a schematic diagram of the video memory state at the beginning of video memory optimization in an embodiment of the present application;
FIG. 5b is a schematic diagram of the video memory optimization process for layer a in an embodiment of the present application;
FIG. 5c is a schematic diagram of the video memory optimization process for layer b in an embodiment of the present application;
FIG. 5d is a schematic diagram of the video memory optimization process for layer c in an embodiment of the present application;
FIG. 5e is a schematic diagram of video memory usage after video memory optimization is finally completed in an embodiment of the present application;
FIG. 6 is a structural block diagram of a cache optimization apparatus in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Figure 1a is a schematic diagram of the traditional cache usage method, in which the horizontal axis is the time axis. From Figure 1a, the applicant observes that video memory 1 and video memory 2 are used simultaneously for a period of time and therefore cannot be shared, whereas the usage times of video memory 1 and video memory 3 are relatively independent, so they can be shared. Figure 1b is a schematic diagram of a shared cache, where the horizontal axis is the time axis: shared video memory 1 indicates that video memory 1 and video memory 3 ultimately use the same address space, and shared video memory 2 indicates that video memory 2 and video memory 4 ultimately use the same address space.
This embodiment provides a cache optimization method applied to a deep learning network. The deep learning network in this embodiment includes multiple layers; in general the number of layers, denoted N, is greater than or equal to 2. The flow of this embodiment is shown in FIG. 2 and includes the following steps:
S101: Perform a simulation operation on the Nth layer of the deep learning network. A deep learning network includes an input layer, an output layer, and one or more hidden layers; together, the hidden layers form the intermediate layers of the network. There are many kinds of simulation operations in a deep learning network; for example, the values of the output layer can be obtained by some combination of the values of the input layer and the weight matrices of the intermediate layers, so the simulation operation can be seen as the process of computing these intermediate-layer weight matrices.
S102: After performing the simulation operation on the Nth layer of the deep learning network, detect the usage state of the first predetermined cache block, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of simulation operations of layers before the Nth layer.
S103: If the first predetermined cache block is occupied, allocate a second predetermined cache block for the output data of the Nth-layer simulation operation, and release the occupied first predetermined cache block when a preset condition is met.
The first predetermined cache block and the second predetermined cache block are standby cache blocks allocated before the simulation operation is performed; the number of such cache blocks is not limited and may be two or more. The words "first" and "second" in "first predetermined cache block" and "second predetermined cache block" are used to distinguish the cache blocks.
It should be noted that the steps of the method provided in this embodiment may all be performed by the same device, or the method may be performed by different devices. For example, step S101 may be performed by device 1 while steps S102 and S103 are performed by device 2, and so on. The device may be a computing device capable of implementing cache optimization, for example a personal computer, a laptop, or a tablet. To avoid repetition, the execution subject in all of the embodiments described below may be the same as or similar to that in the above embodiment.
The cache block may be, but is not limited to, a carrier having a storage function, such as the memory of a CPU (Central Processing Unit) or the video memory of a GPU (Graphics Processing Unit).
Through this embodiment, a cache optimization method based on a deep learning network is provided, which optimizes the cache by performing conflict detection and cache release during simulation operations. That is, in the training of each layer of the deep learning network, in order to optimize the cache occupancy of the real training process, a simulation operation is first performed as a reference for cache allocation; when caches are allocated after the simulation operation, the current usage state of the cache blocks to be allocated or pre-allocated is checked, and an idle cache block is selected for allocation. Through this process, the usage state of cache blocks can be detected automatically, without relying on optimization personnel to allocate caches manually according to the network structure, so the cache optimization and allocation efficiency of the deep learning network is improved and more different networks can be accommodated.
In addition, for caches already occupied by the deep learning network, the previously output data is released in time once the preset condition is met, so that the entire cache allocation can be cycled.
A specific implementation of the embodiment of the present application may further include: performing a simulation operation on each layer of the deep learning network; conflict detection, that is, detecting whether the first predetermined cache block, which buffers the input data of this layer's simulation operation or the input/output data of the previous layer's simulation operation, is occupied; and cache allocation, that is, upon finding the first predetermined cache block occupied, allocating a second predetermined cache block for its data, and so on. FIG. 3 shows a schematic flowchart of the cache optimization method based on this conflict detection approach; its steps include:
S201: Assume that all predetermined cache blocks are unused, and point every cache block that needs to cache data to the same predetermined first cache address.
S202: Perform conflict detection on the cache blocks that each layer needs to use. When a conflict is detected, perform step S202a; if no conflict is detected, perform step S202b.
S202a: Allocate a predetermined second cache address to the conflicting cache block.
S202b: Continue to check the next cache block.
S203: When the preset condition is met, that is, after this layer's simulation operation ends, release the occupied first predetermined cache block.
Here, "first" and "second" reflect the order in which the predetermined cache blocks are used, and the second predetermined cache block is immediately adjacent to the first predetermined cache block. The purpose of this design is to reduce unnecessary cache waste caused by storage fragmentation. Moreover, the cache optimization approach in this embodiment performs "conflict detection" on the cache state after each simulation operation and promptly discovers caches in the idle state, thereby improving cache utilization.
本申请实施例的一个具体实施方式中,如图4所示的是本实施例的一种缓存优化方法的流程图。在本实施例中,缓存块作为一个存储单元,可以具体体现为磁盘存储器、内存、显存等具有存储功能的载体,因此本实施例的缓存优化方法可以适用于对磁盘存储器、内存、显存等的优化。下面以显存为例来进行说明,由于本实施例以显存为例来进行说明,所以在本实施例中缓存块状态表为显存块状态表,为简化起见,在本实施例中,深度学习网络的层数N为3,分别是a、b、c三层,其中显存的预定分配方式是:显存1和显存2为层a的输入和输出,显存3和显存4为层b的输入和输出,显存2和显存4为层c的输入,显存5为层c的输出。当每一层模拟运算完成后,此层输入所占用的显存块即被释放。该方法包括(步骤S301-S304):In a specific implementation manner of the embodiment of the present application, as shown in FIG. 4 is a flowchart of a cache optimization method in this embodiment. In this embodiment, the cache block is a storage unit, and can be embodied as a storage device having a storage function, such as a disk storage, a memory, and a memory. Therefore, the cache optimization method in this embodiment can be applied to disk storage, memory, video memory, and the like. optimization. In the following, the memory is used as an example for description. Therefore, in this embodiment, the cache block status table is a memory block status table. For the sake of simplicity, in this embodiment, the deep learning network is used. The number of layers N is 3, which are three layers a, b, and c respectively. The predetermined distribution mode of the memory is: memory 1 and memory 2 are the input and output of layer a, and memory 3 and memory 4 are the input and output of layer b. The memory 2 and the memory 4 are inputs of the layer c, and the memory 5 is the output of the layer c. When each layer of the simulation operation is completed, the memory blocks occupied by this layer input are released. The method includes (steps S301-S304):
Step S301: Point all memory blocks that need to store data to the same predetermined video memory address, and mark the predetermined memory block as occupied by the data of memory 1. FIG. 5a is a schematic diagram of the memory state at the start of optimization: memories 1-5 are all marked red, that is, all memory blocks point to the same memory address, the red memory block is marked as occupied by the data of memory 1, and this is recorded in the memory block status table.
The memory block status table may be stored in any one of memories 1-5, or in a storage unit other than memories 1-5; either is reasonable.
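Purely for illustration, immediately after step S301 such a status table could be represented as two small mappings (all names hypothetical):

```python
# Physical block -> occupant: the single "red" block is occupied by memory 1.
status_table = {"red": "memory1"}
# Logical buffer -> physical block: memories 1-5 all point at the same block.
pointers = {f"memory{i}": "red" for i in range(1, 6)}
```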
Step S302: Perform the memory optimization process on layer a, that is, perform conflict detection and memory allocation on the input and output memory of layer a, and record the allocation result. FIG. 5b illustrates this process: during the simulation operation, conflict detection is performed on memories 1 and 2 of layer a, and it is found that the red memory block that memory 2 wants to use is already occupied by memory 1, i.e., a conflict occurs. A new memory address is therefore allocated to memory 2, marking the yellow memory block as occupied by the data of memory 2. After the simulation operation of layer a is completed, the red memory block occupied by memory 1 is released, and the memory block usage state at this point is recorded in the memory block status table.
Step S303: Perform the memory optimization process on layer b, that is, perform conflict detection and memory allocation on the input and output memory of layer b, and record the allocation result. FIG. 5c illustrates this process: during the simulation operation, memory 3 of layer b wants to cache its data in the predetermined red memory block; the memory block status table shows that the red block is not in use, so the red block is marked as used by memory 3. Next, conflict detection is performed on memories 3 and 4 of layer b; the status table shows that the red block memory 4 wants to use is already occupied by memory 3 and the yellow block is occupied by memory 2, i.e., a conflict occurs. A new memory address is therefore allocated to memory 4, marking the blue memory block as occupied by the data of memory 4. After the simulation operation of layer b is completed, the red memory block occupied by memory 3 is released, and the memory block usage state at this point is recorded in the memory block status table.
Step S304: Perform the memory optimization process on layer c, that is, perform conflict detection and memory allocation on the input and output memory of layer c, and record the allocation result. FIG. 5d illustrates this process: during the simulation operation, memory 5 of layer c wants to store its data in the red memory block; conflict detection via the memory block status table finds that the red block is not in use, so the red block is marked as occupied by memory 5. At this point, the conflict state of all memory blocks has been checked. After the simulation operation of layer c is completed, the yellow memory block occupied by memory 2 and the blue memory block occupied by memory 4 are released, and this is recorded in the memory block status table.
FIG. 5e is a schematic diagram of memory usage after the optimization is finally completed: the data of memory 5 is stored in the red memory block, and memories 1-4 have all been released. A short usage sketch that reproduces this walkthrough follows.
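Driving the hypothetical CacheOptimizer sketched earlier with this three-layer configuration reproduces the walkthrough of FIGS. 5a-5e, with physical blocks 0, 1, and 2 playing the roles of the red, yellow, and blue memory blocks:

```python
opt = CacheOptimizer()
# Layer a: memory 2 (output) conflicts with memory 1 (input); memory 1 is
# released once layer a finishes.
opt.run_layer(to_place=["memory1", "memory2"], release_after=["memory1"])
# Layer b: memory 3 reuses the freed "red" block 0; memory 4 conflicts with
# both live blocks and opens block 2 ("blue").
opt.run_layer(to_place=["memory3", "memory4"], release_after=["memory3"])
# Layer c: memory 5 reuses the freed block 0; memories 2 and 4 are released.
opt.run_layer(to_place=["memory5"], release_after=["memory2", "memory4"])

print(opt.next_block)    # 3 -> only three physical blocks were ever opened,
                         # i.e. 3 x 200 MB = 600 MB instead of 5 x 200 MB
print(opt.status_table)  # {0: 'memory5', 1: None, 2: None}
```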
The memory block status table described in this embodiment records the occupancy state of all memory blocks; using it makes the occupancy information of each memory block clearer. In this embodiment, assume each of memories 1-5 is 200 MB in size. Without the memory optimization described here, memories 1-5 would all need to be occupied, i.e., 1000 MB of memory would be used; after optimized allocation with the method of this embodiment, only three memory blocks (red, blue, and yellow) are actually used, i.e., only 600 MB. The cache optimization method of this embodiment therefore saves cache and further improves cache utilization.
The example in this embodiment is only a small deep learning network of three layers. For a more complex deep learning network, such as one with N layers (N greater than 3), this method, which takes the output of layer N-1 as the input of layer N, can still optimize and allocate its cache.
Based on the same inventive concept, an embodiment of the present application further provides a cache optimization apparatus.
FIG. 6 is a structural block diagram of a cache optimization apparatus according to an embodiment of the present application. As shown in FIG. 6, the apparatus includes a simulation operation unit 610, a state detection unit 620, a cache allocation unit 630, and a cache release unit 640. The structure and connections of the modules are described in detail below.
The simulation operation unit 610 is configured to perform a simulation operation on the Nth layer of the deep learning network.
The state detection unit 620 is coupled to the simulation operation unit 610 and is configured to detect, after the simulation operation is performed on the Nth layer of the deep learning network, whether the first predetermined cache block is occupied.
The cache allocation unit 630 is coupled to the state detection unit 620 and is configured to allocate, when the first predetermined cache block is occupied, a second predetermined cache block for the output data of the Nth-layer simulation operation.
The cache release unit 640 is coupled to the cache allocation unit 630 and is configured to release the occupied first predetermined cache block when a preset condition is met.
Optionally, the correspondence between cache blocks and their status tags is stored in the cache block status table, and the state detection unit 620 further includes a query subunit (not shown) specifically configured to query, in the cache status table according to the identifier of the first predetermined cache block, the status tag corresponding to the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.
Optionally, the cache allocation unit 630 is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
Optionally, the cache release unit 640 is specifically configured to: release the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block when no second predetermined cache block can be allocated. These three alternatives are sketched below.
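To make the three alternatives concrete, the following sketch encodes them as a release policy. The ReleaseCondition names and the should_release predicate are illustrative assumptions, not part of the claimed apparatus.

```python
from enum import Enum, auto

class ReleaseCondition(Enum):
    """The three alternative preset conditions (hypothetical names)."""
    DURING_NEXT_LAYER = auto()  # while layer N+1 is being simulated
    BEFORE_NEXT_LAYER = auto()  # after layer N ends, before layer N+1 starts
    ON_EXHAUSTION = auto()      # only when no second block can be allocated

def should_release(condition: ReleaseCondition, layer_done: bool,
                   next_layer_started: bool, free_blocks: int) -> bool:
    # One way the cache release unit 640 could evaluate its preset condition.
    if condition is ReleaseCondition.DURING_NEXT_LAYER:
        return next_layer_started
    if condition is ReleaseCondition.BEFORE_NEXT_LAYER:
        return layer_done and not next_layer_started
    return layer_done and free_blocks == 0  # ON_EXHAUSTION
```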
Optionally, the state detection unit 620 is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when it detects that the first predetermined cache block is not occupied. The first predetermined cache block and the second predetermined cache block may be marked with different colors.
The technical solution of the present application improves on the traditional cache optimization approach, a complex approach in which the optimizer must understand the deep learning network thoroughly and know the usage timing of every cache, i.e., when each cache will and will not be used, making cache optimization in the deep learning field more general and simpler. Moreover, through the "simulation operation, conflict detection, cache release" procedure, caches in an idle state are discovered in time, improving cache utilization.
An embodiment of the present application further provides an electronic device. As shown in FIG. 7, the device includes a housing 701, a processor 702, a memory 703, a circuit board 704, and a power supply circuit 705, where the circuit board 704 is disposed inside the space enclosed by the housing 701, and the processor 702 and the memory 703 are disposed on the circuit board 704; the power supply circuit 705 supplies power to the circuits or components of the electronic device; the memory 703 stores executable program code; and the processor 702 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 703, so as to perform the cache optimization method applied to a deep learning network, the method including:
performing a simulation operation on the Nth layer of the deep learning network;
after the simulation operation is performed on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of a layer-simulation operation preceding the Nth layer; and
if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
After performing the simulation operation on the Nth layer of the deep learning network, the present application detects whether the first predetermined cache block is occupied and, if it is, reallocates a cache block, releasing the occupied first predetermined cache block when the preset condition is met. When a cache needs to be allocated for the result data of a simulation operation, the present application can automatically identify the usage state of the predetermined cache block, no longer depending on the optimizer's mastery of the deep learning network model structure and cache usage state, thereby improving cache optimization efficiency.
An embodiment of the present application further provides executable program code, the executable program code being run to perform the cache optimization method applied to a deep learning network, the method including:
performing a simulation operation on the Nth layer of the deep learning network;
after the simulation operation is performed on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of a layer-simulation operation preceding the Nth layer; and
if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
After performing the simulation operation on the Nth layer of the deep learning network, the present application detects whether the first predetermined cache block is occupied and, if it is, reallocates a cache block, releasing the occupied first predetermined cache block when the preset condition is met. When a cache needs to be allocated for the result data of a simulation operation, the present application can automatically identify the usage state of the predetermined cache block, no longer depending on the optimizer's mastery of the deep learning network model structure and cache usage state, thereby improving cache optimization efficiency.
An embodiment of the present application further provides a storage medium for storing executable program code, the executable program code being run to perform the cache optimization method applied to a deep learning network, the method including:
performing a simulation operation on the Nth layer of the deep learning network;
after the simulation operation is performed on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, where the first predetermined cache block is used to buffer the input data of the Nth-layer simulation operation or the input/output data of a layer-simulation operation preceding the Nth layer; and
if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
After performing the simulation operation on the Nth layer of the deep learning network, the present application detects whether the first predetermined cache block is occupied and, if it is, reallocates a cache block, releasing the occupied first predetermined cache block when the preset condition is met. When a cache needs to be allocated for the result data of a simulation operation, the present application can automatically identify the usage state of the predetermined cache block, no longer depending on the optimizer's mastery of the deep learning network model structure and cache usage state, thereby improving cache optimization efficiency.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments.
In particular, the apparatus embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof.
In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Although preferred embodiments of the present application have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
It is apparent that those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.
Claims (14)
1. A cache optimization method applied to a deep learning network, the deep learning network comprising N layers, N being greater than or equal to 2, wherein the method comprises: performing a simulation operation on the Nth layer of the deep learning network; after the simulation operation is performed on the Nth layer of the deep learning network, detecting whether a first predetermined cache block is occupied, the first predetermined cache block being used to buffer the input data of the Nth-layer simulation operation or the input/output data of a layer-simulation operation preceding the Nth layer; and if it is occupied, allocating a second predetermined cache block for the output data of the Nth-layer simulation operation, and releasing the occupied first predetermined cache block when a preset condition is met.
2. The method according to claim 1, wherein a cache block has a corresponding occupied or unoccupied status tag stored in a cache block status table, and detecting whether the first predetermined cache block is occupied specifically comprises: querying, in the cache status table according to the identifier of the first predetermined cache block, the status tag corresponding to the first predetermined cache block, and determining from the status tag whether the first predetermined cache block is occupied.
3. The method according to claim 1, wherein allocating the second predetermined cache block for the output data of the Nth-layer simulation operation specifically comprises: allocating, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
4. The method according to claim 1, wherein releasing the occupied first predetermined cache block when the preset condition is met specifically comprises: releasing the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or releasing the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer of the deep learning network; or releasing the occupied first predetermined cache block when no second predetermined cache block can be allocated.
5. The method according to claim 1, wherein, when it is detected that the first predetermined cache block is not occupied, the first predetermined cache block is allocated for the output data of the Nth-layer simulation operation.
6. The method according to any one of claims 1-5, wherein the first predetermined cache block and the second predetermined cache block are marked with different colors.
7. A cache optimization apparatus applied to a deep learning network, the deep learning network comprising N layers, N being greater than or equal to 2, wherein the apparatus comprises a simulation operation unit, a state detection unit, a cache allocation unit, and a cache release unit, wherein: the simulation operation unit is configured to perform a simulation operation on the Nth layer of the deep learning network; the state detection unit is configured to detect, after the simulation operation is performed on the Nth layer of the deep learning network, whether a first predetermined cache block is occupied, the first predetermined cache block being used to buffer the input data of the Nth-layer simulation operation or the input/output data of a layer-simulation operation preceding the Nth layer; the cache allocation unit is configured to allocate, when the first predetermined cache block is occupied, a second predetermined cache block for the output data of the Nth-layer simulation operation; and the cache release unit is configured to release the occupied first predetermined cache block when a preset condition is met.
8. The apparatus according to claim 7, wherein a cache block has a corresponding occupied or unoccupied status tag, and the state detection unit comprises a query subunit configured to query the status tag corresponding to the first predetermined cache block and determine from the status tag whether the first predetermined cache block is occupied.
9. The apparatus according to claim 8, wherein the correspondence between cache blocks and their status tags is stored in a cache block status table, and the query subunit is specifically configured to query, in the cache status table according to the identifier of the first predetermined cache block, the status tag corresponding to the first predetermined cache block, so as to determine from the query result whether the first predetermined cache block is occupied.
10. The apparatus according to claim 7, wherein the cache allocation unit is specifically configured to allocate, for the output data of the Nth-layer simulation operation, a second predetermined cache block immediately adjacent to the first predetermined cache block.
11. The apparatus according to claim 7, wherein the cache release unit is specifically configured to: release the occupied first predetermined cache block while the simulation operation is performed on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block after the simulation operation on the Nth layer and before the simulation operation on the (N+1)th layer of the deep learning network; or release the occupied first predetermined cache block when no second predetermined cache block can be allocated.
12. The apparatus according to claim 7, wherein the state detection unit is further configured to allocate the first predetermined cache block for the output data of the Nth-layer simulation operation when it detects that the first predetermined cache block is not occupied.
13. The apparatus according to any one of claims 7-12, wherein the first predetermined cache block and the second predetermined cache block are marked with different colors.
14. A storage medium for storing executable program code, the executable program code being run to perform the cache optimization method applied to a deep learning network according to any one of claims 1-6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201611132262.9A | 2016-12-09 | 2016-12-09 | Cache optimization method and device applied to deep learning network
CN201611132262.9 | 2016-12-09 | |
Publications (1)
Publication Number | Publication Date
---|---
WO2018103472A1 | 2018-06-14
Family
ID=62490680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2017/108030 | Method and device for buffer optimization in deep learning network | 2016-12-09 | 2017-10-27
Country Status (2)
Country | Link
---|---
CN | CN108615077B
WO | WO2018103472A1
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110851187A | 2019-11-19 | 2020-02-28 | 北京百度网讯科技有限公司 | Video memory processing method, device, equipment and medium
WO2021227789A1 | 2020-05-09 | 2021-11-18 | 深圳云天励飞技术股份有限公司 | Storage space allocation method and device, terminal, and computer readable storage medium
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN109447253B | 2018-10-26 | 2021-04-27 | 杭州比智科技有限公司 | Video memory allocation method and device, computing equipment and computer storage medium
CN112862085B | 2019-11-27 | 2023-08-22 | 杭州海康威视数字技术股份有限公司 | Storage space optimization method and device
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN101809597A | 2007-09-26 | 2010-08-18 | 佳能株式会社 | Calculation processing apparatus and method
CN104915322A | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neural network hardware and AXI bus IP core thereof
CN106022468A | 2016-05-17 | 2016-10-12 | 成都启英泰伦科技有限公司 | Artificial neural network processor integrated circuit and design method therefor
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN101957800A | 2010-06-12 | 2011-01-26 | 福建星网锐捷网络有限公司 | Multichannel cache distribution method and device
US8965819B2 | 2010-08-16 | 2015-02-24 | Oracle International Corporation | System and method for effective caching using neural networks
CN103455443B | 2013-09-04 | 2017-01-18 | 华为技术有限公司 | Buffer management method and device
CN104133784B | 2014-07-24 | 2017-08-29 | 大唐移动通信设备有限公司 | Packet buffer management method and device
CN104636285B | 2015-02-03 | 2016-03-23 | 北京麓柏科技有限公司 | Flash-memory storage system and read, write and delete methods thereof
CN105677583B | 2015-12-31 | 2019-01-08 | 华为技术有限公司 | Buffer memory management method and device
Application events:
- 2016-12-09: CN application CN201611132262.9A filed, published as CN108615077B (status: Active)
- 2017-10-27: PCT application PCT/CN2017/108030 filed, published as WO2018103472A1 (status: Application Filing)
Also Published As
Publication number | Publication date |
---|---|
CN108615077B (en) | 2021-08-24 |
CN108615077A (en) | 2018-10-02 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17878203; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17878203; Country of ref document: EP; Kind code of ref document: A1