WO2021253875A1 - Memory management method and related products - Google Patents

Memory management method and related products

Info

Publication number
WO2021253875A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
processing device
cache block
block
cache
Prior art date
Application number
PCT/CN2021/079390
Other languages
English (en)
French (fr)
Inventor
李周洋
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Priority to KR1020217042198A (published as KR20220010036A)
Priority to JP2021570921A (published as JP2022539956A)
Publication of WO2021253875A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Description

  • This application relates to the computer field, in particular to a memory management method and related products.
  • Acceleration devices are devices other than the CPU that are used to accelerate computation, such as the graphics processing unit (GPU), the network processing unit (NPU), and the field-programmable gate array (FPGA). Because currently adopted memory management strategies have low memory utilization, memory management strategies with higher memory utilization need to be studied.
  • Embodiments of this application disclose a memory management method and related products.
  • In a first aspect, an embodiment of the present application provides a memory management method.
  • The method includes: a first processing device allocates a first cache block of a cache pool to a first task; when the first processing device determines that a second processing device needs to execute a second task and the first task in order, it allocates a second cache block of the cache pool to the second task, where at least a part of the second cache block is included in the first cache block.
  • That the first processing device determines that the second processing device needs to execute the second task and the first task in order means that the first processing device determines that the second processing device will not execute the first task and the second task in parallel; in other words, the second processing device will not execute the first task and the second task at the same time. It should be understood that when the second processing device executes the second task and the first task in sequence, it is impossible for the first task and the second task to occupy the same cache block at the same time. Therefore, having allocated the first cache block of the cache pool to the first task, the first processing device can allocate the second cache block to the second task; that is, the first task and the second task can reuse part of a cache block.
  • In the embodiments of this application, when the first processing device determines that the second processing device needs to execute the second task and the first task in order, the first task and the second task can reuse part of a cache block, which improves memory utilization. A code sketch of this reuse rule is given below.
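  • The following is a minimal CUDA C++ sketch of the reuse rule. All names (CacheBlock, allocate_for_queue, last_queue) are illustrative assumptions, not taken from the patent; the sketch only assumes that a block returned to the pool remembers the operation queue of the task it was last assigned to.

```cpp
#include <cstddef>
#include <vector>

// Illustrative cache-block descriptor (not the patent's data layout).
struct CacheBlock {
    std::size_t size;
    int         last_queue;  // queue of the last assigned task; -1 if none
    bool        available;   // true once the block is back in the pool
};

// A block may be handed to a new task only when that task is in the same
// operation queue as the block's last task (tasks in one queue never run
// concurrently) or when the block has no task yet.
CacheBlock* allocate_for_queue(std::vector<CacheBlock>& pool,
                               std::size_t need, int queue) {
    for (CacheBlock& b : pool) {
        if (b.available && b.size >= need &&
            (b.last_queue == queue || b.last_queue == -1)) {
            b.available  = false;  // remove from the allocatable set
            b.last_queue = queue;  // record the new task's queue
            return &b;             // the two tasks share this block
        }
    }
    return nullptr;  // caller falls back to expanding the pool
}
```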
  • In a possible implementation, the method further includes: in response to sending the first task to the second processing device, the first processing device puts the first cache block into the cache pool again.
  • Sending the first task to the second processing device may mean that the first processing device invokes the second processing device to execute the first task, or that the first task is submitted to a task queue processed by the second processing device. After sending the first task to the second processing device, the first processing device may immediately put the first cache block into the cache pool again.
  • In this way, the first processing device puts the first cache block back into the cache pool promptly, so that the first cache block can be reused. A sketch of this release-on-submit behavior follows.
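  • A CUDA sketch of release-on-submit; the kernel and the pool comments are hypothetical, and the essential point is only the absence of a synchronization call between submitting the task and returning the block.

```cpp
#include <cuda_runtime.h>

// Hypothetical first task: any kernel operating in the first cache block.
__global__ void first_task(float* block) { block[threadIdx.x] = 1.0f; }

// Submit the task to its operation queue (a cudaStream) and return at
// once. There is no cudaStreamSynchronize between submission and putting
// the block back: the pool (bookkeeping omitted) gets the block
// immediately, tagged with this queue so only same-queue tasks reuse it.
void submit_then_release(float* block, cudaStream_t queue) {
    first_task<<<1, 256, 0, queue>>>(block);
    // ... put `block` back into the cache pool here ...
}
```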
  • In a possible implementation, the method further includes: the first processing device records the first operation queue where the first task corresponding to the first cache block is located. Determining that the second processing device needs to execute the second task and the first task in order then includes: the first processing device determines, based on the recorded first operation queue where the first task is located, that the first task and the second task are located in the same operation queue.
  • After responding to sending the first task to the second processing device, the first processing device records the first operation queue where the first task corresponding to the first cache block is located.
  • The operation of putting the first cache block back into the cache pool and the operation of recording the first operation queue where the first task corresponding to the first cache block is located may be treated as executing simultaneously; that is, the two operations may be bound together.
  • Before allocating a cache block to the second task, the first processing device knows the operation queue where the second task is located. Therefore, the first processing device can determine, based on the recorded first operation queue where the first task is located, whether the first task and the second task are located in the same operation queue.
  • In this way, the first processing device can accurately and quickly determine whether the first task and the second task are located in the same operation queue. A sketch of the bound put-back-and-record step is given below.
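  • A sketch of the bound "put back + record queue" step, using the same illustrative bookkeeping as the earlier sketch; the point is that the queue of the first task is recorded no later than the moment the block becomes allocatable again.

```cpp
#include <cstddef>

// Illustrative cache-block bookkeeping (not the patent's data layout).
struct CacheBlock {
    std::size_t size;
    int         last_queue;  // operation queue of the task last assigned
    bool        available;   // true once the block is back in the pool
};

// Put the first cache block back into the pool and record the first
// operation queue as one bound step, so a later allocation can check
// whether a new task shares that queue.
void put_back_and_record(CacheBlock& block, int first_task_queue) {
    block.last_queue = first_task_queue;  // record the queue first
    block.available  = true;              // then make the block allocatable
}
```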
  • In a possible implementation, allocating the second cache block of the cache pool to the second task includes: the first processing device allocates the second cache block of the cache pool to the second task while the second processing device is executing the first task.
  • In this way, the first processing device and the second processing device work in parallel, which yields high efficiency.
  • In a possible implementation, before allocating the second cache block of the cache pool to the second task, the method further includes: the first processing device searches the cache pool for at least one candidate cache block to which a task is currently allocated. Allocating the second cache block of the cache pool to the second task when it is determined that the second processing device needs to execute the second task and the first task in order then includes: the first processing device allocates to the second task the second cache block determined from the at least one candidate cache block, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task.
  • In this way, the first processing device selects the second cache block from the at least one candidate cache block based on the execution-order relationship between their currently allocated tasks and the second task, so the second task reuses an already-allocated cache block, which improves memory utilization.
  • In a possible implementation, searching the cache pool for at least one candidate cache block to which a task is currently allocated includes: the first processing device searches the cache pool for at least one candidate cache block that satisfies the cache size required by the second task; the first processing device then searches the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  • Here, the at least one candidate cache block may consist only of cache blocks to which tasks are currently allocated, or may include both cache blocks to which tasks are currently allocated and cache blocks to which no task is currently allocated.
  • In this way, candidate cache blocks to which tasks are currently allocated are searched preferentially, so at least one candidate cache block that both has a task currently allocated and meets the cache size required by the second task can be found quickly.
  • In another possible implementation, searching the cache pool for at least one candidate cache block to which a task is currently allocated includes: the first processing device searches, among the cache blocks in the cache pool to which tasks are currently allocated, for at least one candidate cache block that satisfies the cache size required by the second task.
  • In this way, the first processing device directly searches the cache blocks in the cache pool to which tasks are currently allocated for a cache block that meets the cache size required by the second task, and allocates that cache block to the second task; this improves memory utilization.
  • In a possible implementation, allocating to the second task the second cache block determined from the at least one candidate cache block based on the execution-order relationship between the currently allocated tasks and the second task includes: the first processing device determines the second cache block from the at least one candidate cache block based on both that execution-order relationship and the size of the at least one candidate cache block, and allocates it to the second task.
  • This may mean that, when the first processing device determines that the tasks currently allocated to the at least one candidate cache block and the second task are executed in order, it determines the second cache block from the at least one candidate cache block based on their sizes and allocates it to the second task.
  • Because the second cache block is determined from at least one candidate cache block to which a task is currently assigned, a cache block with an assigned task is reused, which improves the memory reuse rate.
  • In a possible implementation, the method further includes: when the first processing device determines that the at least one candidate cache block does not include a cache block that meets the requirements of the second task, it determines the target cache block to allocate to the second task from among at least one cache block in the cache pool to which no task is currently allocated.
  • In this way, the target cache block allocated to the second task is determined from cache blocks in the cache pool to which no task is currently allocated, so that the second task can still be executed successfully.
  • In a possible implementation, the method further includes: expanding the cache pool when the first processing device does not find a cache block that meets the requirements of the second task in the cache pool; the first processing device then searches the expanded cache pool for the target cache block to allocate to the second task.
  • In this way, a cache block that meets the requirements of the second task is found in the expanded cache pool, so a suitable cache block can be allocated quickly.
  • In a second aspect, an embodiment of the present application provides a data processing device.
  • The data processing device includes: a memory allocation unit configured to allocate a first cache block of a cache pool to a first task; and a processing unit configured to determine that a second processing device needs to execute a second task and the first task in order. The memory allocation unit is further configured to allocate a second cache block of the cache pool to the second task when the processing unit determines that the second processing device needs to execute the second task and the first task in order, where at least a part of the second cache block is included in the first cache block.
  • the processing unit and the memory allocation unit may be the same unit or two independent units.
  • the processing unit is a processor, such as a CPU, and the memory allocation unit is a piece of hardware.
  • the processing unit is a processor, such as a CPU, and the functions of the memory allocation unit are implemented by software or programs run by the processor. In other words, the function of the processing unit and the function of the memory allocation unit are both implemented by the processor.
  • In a possible implementation, the processing unit is further configured to send the first task to the second processing device; the memory allocation unit is further configured to, in response to the first task being sent to the second processing device, put the first cache block into the cache pool again.
  • In a possible implementation, the processing unit is further configured to record the first operation queue where the first task corresponding to the first cache block is located; the processing unit is configured to determine, based on the recorded first operation queue where the first task is located, that the first task and the second task are located in the same operation queue.
  • In a possible implementation, the memory allocation unit is further configured to allocate the second cache block of the cache pool to the second task while the second processing device is executing the first task.
  • In a possible implementation, the memory allocation unit is further configured to search the cache pool for at least one candidate cache block to which a task is currently allocated; the memory allocation unit is configured to, when the processing unit determines, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task the second cache block determined from the at least one candidate cache block.
  • In a possible implementation, the memory allocation unit is configured to search, among the cache blocks in the cache pool to which tasks are currently allocated, for the at least one candidate cache block that satisfies the cache size required by the second task.
  • In another possible implementation, the memory allocation unit is configured to search the cache pool for at least one candidate cache block that satisfies the cache size required by the second task, and then to search the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  • In a possible implementation, the memory allocation unit is configured to, when the processing unit determines, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task the second cache block determined from the at least one candidate cache block based on the size of the at least one candidate cache block.
  • In a possible implementation, the memory allocation unit is further configured to, when the processing unit determines that the at least one candidate cache block does not include a cache block that meets the requirements of the second task, determine the target cache block to allocate to the second task from among at least one cache block in the cache pool to which no task is currently allocated.
  • In a possible implementation, the memory allocation unit is further configured to expand the cache pool when no cache block meeting the requirements of the second task is found in the cache pool, and to search the expanded cache pool for the target cache block to allocate to the second task.
  • In a third aspect, an embodiment of the present application provides an electronic device.
  • the electronic device includes a memory and a first processor, where the memory is used to store instructions, and the first processor is used to execute instructions stored in the memory.
  • Execution of the instructions causes the first processor to perform the method of the first aspect or any possible implementation thereof.
  • the electronic device further includes a second processor, and the second processor is configured to perform a task sent by the first processor by using a cache block allocated by the first processor.
  • For example, the first processor is a CPU and the second processor is a GPU.
  • In a fourth aspect, an embodiment of the present application provides an electronic device that includes a first processing device, a memory, and a second processing device, where the memory is configured to store instructions and data, the first processing device is configured to execute the instructions stored in the memory so as to perform the method of the first aspect or any possible implementation thereof, and the second processing device is configured to use the cache block allocated by the first processing device to execute the task sent by the first processing device.
  • For example, the first processing device is a CPU and the second processing device is a GPU.
  • In a fifth aspect, an embodiment of the present application provides a chip that includes a data interface and the first processing device described in the first aspect, where the first processing device is configured to perform the method of the first aspect or any possible implementation thereof.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program.
  • The computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or any optional implementation thereof.
  • In a seventh aspect, embodiments of the present application provide a computer program product; the computer program product includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or any optional implementation thereof.
  • FIG. 1 is a schematic structural diagram of a data processing device provided by an embodiment of this application.
  • FIG. 2 is a flowchart of a memory management method provided by an embodiment of the application.
  • FIG. 3 is a flowchart of another memory management method provided by an embodiment of the application.
  • FIG. 4 is a flowchart of another memory management method provided by an embodiment of the application.
  • FIG. 5 is a flowchart of another memory management method provided by an embodiment of the application.
  • FIG. 6 is a sequence diagram of a memory management method provided by an embodiment of the application.
  • FIG. 7 is a flowchart of another memory management method provided by an embodiment of the application.
  • FIG. 8 is a sequence diagram of another memory management method provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment of this application.
  • FIG. 10 is a schematic structural diagram of another data processing device provided by an embodiment of the application.
  • The embodiments of this application provide a memory management method with high memory utilization, which is suitable for a data processing apparatus (corresponding to a heterogeneous acceleration system) having a first processing device (such as a CPU) and a second processing device (corresponding to an acceleration device).
  • FIG. 1 is a schematic structural diagram of a data processing device provided by an embodiment of the application.
  • As shown in FIG. 1, the data processing apparatus includes a first processing device 101, a memory allocator 102, a second processing device 103, and a device memory 104. The first processing device 101 and the memory allocator 102 may be arranged independently or integrated; the second processing device 103 and the first processing device 101 are different types of processing devices; and the device memory 104 may be part of the second processing device 103 or set up independently of it, which is not limited here.
  • The first processing device 101 corresponds to the processing unit described above, and the memory allocator 102 corresponds to the memory allocation unit.
  • the first processing device 101 may be a CPU or other types of processors.
  • the first processing device 101 may be a main processing device, such as a CPU;
  • the second processing device 103 is an acceleration device, such as a GPU.
  • The second processing device 103 can be a GPU, an NPU, an FPGA, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or another processor or processing device different from the first processing device 101.
  • In one example, the first processing device 101 is a CPU, and the memory allocator 102 is hardware independent of the CPU.
  • the first processing device 101 is a CPU, and the functions of the memory allocator 102 are implemented by software or programs run by the first processing device 101.
  • the physical hardware corresponding to the memory allocator 102 is the first processing device 101.
  • the device memory 104 may be a memory that can be used by the second processing device 103.
  • the second processing device 103 is a GPU, and the device memory 104 is the video memory of the second processing device 103.
  • the device memory 104 is part of the second processing device 103.
  • the following respectively introduces the functions of the first processing device 101, the memory allocator 102, the second processing device 103, and the device memory 104 when the data processing apparatus implements the memory management method provided by the embodiment of the present application.
  • The first processing device (such as a CPU) 101 is configured to submit tasks to the second processing device 103 and to control the memory allocator 102 to allocate and/or release cache in the cache pool, that is, to manage the device memory 104 of the second processing device 103 through the memory allocator 102. Submitting a task to the second processing device 103 may mean that the first processing device 101 adds the task to an operation queue to be processed by the second processing device 103; it may also mean sending an instruction that instructs the second processing device 103 to perform a certain task; it may also refer to adding the task to an operation queue that the first processing device 101 completes by calling the interface of the second processing device 103; it may also refer to notifying the second processing device 103 to perform the task in some other way.
  • the first processing device 101 can also be used to call the interface of the second processing device 103 to perform tasks.
  • In one example, the operation queue is a cudaStream in the Compute Unified Device Architecture (CUDA). Tasks on the same cudaStream are executed in the order of submission; tasks on different cudaStreams have no ordering relationship and can be executed concurrently, as the sketch below illustrates.
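  • A minimal CUDA illustration of the ordering guarantee that makes reuse safe; the two kernels are hypothetical placeholders for the first and second task.

```cpp
#include <cuda_runtime.h>

__global__ void task1(float* buf) { buf[threadIdx.x] = 1.0f; }
__global__ void task2(float* buf) { buf[threadIdx.x] += 2.0f; }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* buf;
    cudaMalloc(&buf, 256 * sizeof(float));

    // Tasks on the same cudaStream run in submission order, so task2
    // never overlaps task1 and the two can safely share `buf`.
    task1<<<1, 256, 0, stream>>>(buf);
    task2<<<1, 256, 0, stream>>>(buf);

    cudaStreamSynchronize(stream);
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```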
  • the operation queue is cl_command_queue in Open Computing Language (OpenCL).
  • OpenCL is a framework for writing programs for heterogeneous platforms, which can include CPUs, GPUs or other types of processors.
  • the operation queue is accelerator_view in C++AMP.
  • C++ AMP is an extension of Microsoft Visual Studio and the C++ programming language that helps developers adapt to current and future highly parallel and heterogeneous computing environments.
  • the memory allocator 102 is responsible for the management of the device memory 104.
  • the memory allocator 102 may be a piece of physical hardware. Alternatively, the function of the memory allocator 102 may be implemented by software or a program run by the first processing device 101.
  • The second processing device 103 is configured to read and write data through the device memory 104, and to execute tasks submitted by the first processing device 101 or tasks that the first processing device 101 completes by calling at least one of its interfaces.
  • the second processing device 103 may be understood as an acceleration device other than the CPU, such as GPU, NPU, ASIC, FPGA, and so on.
  • The device memory 104 refers to a storage device (corresponding to memory space) on the second processing device 103 that is used by the second processing device 103, such as the video memory of a GPU (corresponding to the second processing device 103).
  • a buffer pool is set in the memory allocator 102 to pre-allocate and cache device memory.
  • This buffer pool may be a whole piece of device memory, or it may be composed of multiple pieces of device memory of any size.
  • The memory allocator 102 can directly allocate cache from the cache pool to the second processing device 103. After the second processing device 103 no longer needs the cache allocated to it, the memory allocator 102 puts the cache back into the cache pool, that is, returns the cache to the cache pool.
  • In one example, the first processing device 101 is a CPU, the second processing device 103 is a GPU, and the device memory 104 is the video memory on the second processing device 103. The CPU is responsible for the preparation and initiation of tasks (for example, computing tasks or image processing tasks), and the GPU is responsible for the actual execution of the tasks. The allocation and release of video memory by the CPU is actually completed by the memory allocator 102; that is, the memory allocator 102 is responsible for managing the video memory on the GPU. When performing tasks, the GPU directly uses the part of the video memory allocated by the CPU through the memory allocator 102.
  • FIG. 2 is a flowchart of a memory management method provided by an embodiment of the application. As shown in Figure 2, the memory management method includes:
  • 201: The first processing device allocates a first cache block of a cache pool to a first task.
  • the first processing device may be a CPU or other types of processors.
  • Allocating the first cache block of the cache pool to the first task may mean that the first processing device allocates the first cache block through the memory allocator, where the function of the memory allocator is realized by software or a program run by the first processing device.
  • The first task may be an image processing task, a computing task, or another task that needs to be executed by a second processing device (for example, a GPU) or that is completed by invoking the second processing device.
  • the first cache block may be any cache block in the cache pool.
  • The cache pool can be understood as a pool of cached device memory (such as video memory) provided inside the memory allocator.
  • The video memory managed by the memory allocator is cached in a pool (i.e., the cache pool). The memory allocator allocates video memory from this pool, and the video memory allocated to the second processing device is released back to the pool after it is used up, without waiting for device synchronization.
  • Device synchronization can be understood as the first processing device (for example, a CPU) stopping program execution and waiting for the second processing device (corresponding to the acceleration device) to complete its tasks.
  • Allocating the first cache block of the cache pool to the first task may mean: preferentially allocating a cache block that meets the requirements of the first task from the cache pool; if the cache pool contains no cache block that meets the requirements of the first task, cudaMalloc is called to allocate more memory from the device memory to expand the cache pool.
  • cudaMalloc is an interface for allocating video memory in CUDA.
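  • A sketch of this fallback path, using the real cudaMalloc interface; the surrounding pool bookkeeping is assumed and omitted.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// If no pooled block satisfies the request, obtain more device memory
// with cudaMalloc (CUDA's video-memory allocation interface) and hand
// the new region to the pool as an additional cache block.
void* expand_pool(std::size_t bytes) {
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, bytes) != cudaSuccess)
        return nullptr;  // device memory exhausted
    return ptr;          // caller registers this as a new cache block
}
```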
  • the first task may be a certain computing task or image processing task or other types of deep learning tasks.
  • Before performing step 201, the first processing device may perform the following operation: split a larger task to obtain the first task.
  • The data processing apparatus in the embodiments of the present application may be a heterogeneous computing system. A heterogeneous computing system dissects a computing task by parallelism type, groups code segments of the same type into the same subtask, and then assigns each subtask to the computing resource best suited to execute it (for example, the second processing device), so as to minimize the total execution time of the computing task.
  • the first task can be understood as a subtask that is obtained by the first processing device by splitting a larger task and needs to be executed by the second processing device or is called a subtask to be executed by the second processing device.
  • the first task is the task itself that the first processing device determines to be executed, which is not limited in the embodiment of the present disclosure.
  • The first processing device may further perform the following operation: in response to sending the first task to the second processing device, it puts the first cache block back into (also referred to as returns it to) the cache pool.
  • Sending the first task to the second processing device can mean that the first processing device adds the task to the operation queue to be processed by the second processing device; it can also mean sending an instruction to the second processing device to perform a certain task; it can also refer to adding the task to an operation queue that the first processing device completes by calling the interface of the second processing device; it can also refer to informing the second processing device to perform the task in other ways, which is not limited in the embodiments of the present disclosure.
  • the first processing device can immediately put the first buffer block into the buffer pool without waiting for the second processing device to finish using the first buffer block.
  • In this way, the first cache block is put back into the cache pool in time, which facilitates its reuse.
  • the first processing device or the memory allocator may set an identifier for each cache block in the cache pool to indicate the current state of the cache block.
  • For example, allocating the first cache block of the cache pool to the first task may mean that the first processing device allocates the first cache block to the first task and sets the state of the first cache block to unavailable, indicating that the first cache block cannot be allocated to other tasks; putting the first cache block back into the cache pool (also called returning it) may mean setting the state of the first cache block to available, indicating that the first cache block can be allocated to other tasks.
  • the embodiments of the present disclosure may also indicate whether each cache block is currently available in other ways.
  • Alternatively, allocating the first cache block to the first task may refer to removing the first cache block from the cache resources contained in the cache pool, and putting the first cache block back into the cache pool may refer to adding it back to those cache resources.
  • 202: The first processing device allocates a second cache block of the cache pool to the second task when it determines that the second processing device needs to execute the second task and the first task in order.
  • The second task may be an image processing task, a computing task, or another task that needs to be executed by the second processing device (for example, a GPU) or that is completed by invoking the second processing device.
  • the determination by the first processing device that the second processing device needs to execute the second task and the first task in order means that the first processing device determines that the second processing device will not execute the first task and the second task in parallel. In other words, the second processing device will not execute the first task and the second task at the same time.
  • The first task and the second task may be the same type of task or different types of tasks; for example, they may be different subtasks of the same task, or subtasks of different tasks, which is not limited in the embodiments of the present disclosure. It should be understood that when the second processing device executes the second task and the first task in sequence, it is impossible for the first task and the second task to occupy the same cache block at the same time.
  • Therefore, when the first processing device determines that the second processing device needs to execute the second task and the first task in order, it can allocate the second cache block to the second task; that is, the first task and the second task can reuse part of a cache block.
  • the memory allocation of the first processing device and the task processing of the second processing device can be executed in parallel, thereby improving processing efficiency.
  • the first processing device allocates the second buffer block of the buffer pool to the second task when the second processing device executes the first task.
  • In the embodiments of this application, when the first processing device determines that the second processing device needs to execute the second task and the first task in order, the first task and the second task can reuse part of a cache block, which improves memory utilization.
  • the following describes an optional example of determining that the second processing device needs to perform the second task and the first task in order.
  • The first processing device records the first operation queue where the first task corresponding to the first cache block is located, after putting the first cache block back into the cache pool or in the process of doing so; in the process of allocating a cache block to the second task, it determines, based on the recorded first operation queue where the first task is located, that the first task and the second task are located in the same operation queue.
  • The first operation queue may be an operation queue to be executed by the second processing device, or the operation queue where the second task is located. For example, the first processing device submits tasks to the first operation queue of the second processing device, and the second processing device executes the tasks in the first operation queue sequentially, in the order in which they were submitted to it.
  • Alternatively, the first operation queue may be an operation queue that the first processing device completes by calling an interface of the second processing device. For example, the first processing device adds tasks to this operation queue and calls the interface of the second processing device to execute each task in the order in which the tasks were added.
  • the operation of the first processing device to put the first cache block back into the cache pool and the operation of recording the first operation queue where the first task corresponding to the first cache block is located may be performed at the same time or in any order. For example, after the first cache block is put into the cache pool again, the current task allocation information of the first cache block is recorded, and the task allocation information includes the information of the operation queue where the first task is located. For another example, the first processing device records the first operation queue where the first task corresponding to the first cache block is located, and then puts the first cache block into the cache pool again.
  • In this way, the first processing device can determine, based on the recorded first operation queue where the first task is located (for example, by querying the task allocation information of the first cache block), whether the first task is in the same operation queue as the second task, that is, whether the second task to be allocated and the first task to which the first cache block was allocated belong to the same operation queue. Different tasks in the same operation queue are executed sequentially in a specific order, so if the first task and the second task are located in the same operation queue, they will not be executed at the same time.
  • FIG. 3 is a flowchart of another memory management method provided by an embodiment of the application. Points that are the same as in the embodiment shown in FIG. 2 are described only briefly.
  • 301: The first processing device allocates a first cache block of a cache pool to a first task.
  • 302: The first processing device puts the first cache block into the cache pool again, and records the first operation queue where the first task corresponding to the first cache block is located.
  • the first processing device may record the operation queue in which the task corresponding to each cache block that is put back into the cache pool is located. That is, the first processing device may record the operation queue corresponding to each buffer block to which the task is currently allocated in the buffer pool.
  • the operation queue corresponding to a cache block is the operation queue where the tasks allocated by the cache block are located.
  • the first processing device may release the first cache block before putting the first cache block into the cache pool again.
  • For example, the cache blocks in the cache pool are the video memory of the second processing device, and before putting the first cache block back into the cache pool, the first processing device releases the first cache block by calling the interface of the memory allocator.
  • the cudaFree interface is an interface for releasing video memory in CUDA.
  • Here, releasing the first cache block may mean putting the first cache block into the cache pool while it cannot yet be allocated, for example setting its status to unavailable; putting the first cache block back into the cache pool again may mean that the first cache block can be allocated, for example setting its status to available.
  • 303: The first processing device allocates the second cache block of the cache pool to the second task when it determines, based on the recorded first operation queue where the first task is located, that the first task and the second task are located in the same operation queue.
  • Step 303 is a possible implementation of step 202. At least a part of the second cache block is included in the first cache block.
  • Before performing step 303, the first processing device may search the cache pool for at least one candidate cache block to which a task is currently allocated; one implementation of step 303 is then: based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task, allocate to the second task the second cache block determined from the at least one candidate cache block.
  • Searching the cache pool for at least one candidate cache block to which a task is currently allocated may mean: the first processing device searches the cache pool for at least one candidate cache block that satisfies the cache size required by the second task, and then searches the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  • Allocating to the second task the second cache block determined from the at least one candidate cache block, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task, may mean: selecting from the at least one candidate cache block one or more target cache blocks whose currently assigned tasks are executed in order with the second task, and allocating to the second task a second cache block determined from the one or more target cache blocks.
  • For example, the first processing device finds 10 candidate cache blocks in the cache pool that meet the cache size required by the second task; from these 10 candidate cache blocks, it selects those whose currently assigned tasks are executed in order with the second task, obtaining the target cache blocks; the second cache block determined from the target cache blocks is then allocated to the second task.
  • Here, a candidate cache block is a cache block that meets the cache size required by the second task, while a target cache block not only meets the cache size required by the second task but also currently has a task allocated. A sketch of this two-stage search follows.
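  • The two-stage search can be sketched as follows; the names are illustrative, and the preference for blocks with assigned tasks is the implementation described above.

```cpp
#include <cstddef>
#include <vector>

// Illustrative block descriptor (not the patent's data layout).
struct CacheBlock {
    std::size_t size;
    bool        has_task;  // a task is currently allocated to this block
};

// Stage 1: keep blocks large enough for the second task.
// Stage 2: among those, prefer blocks that currently have a task,
// since they may be reusable by a same-queue task.
std::vector<CacheBlock*> find_candidates(std::vector<CacheBlock>& pool,
                                         std::size_t need) {
    std::vector<CacheBlock*> big_enough, assigned;
    for (CacheBlock& b : pool)
        if (b.size >= need)
            big_enough.push_back(&b);
    for (CacheBlock* b : big_enough)
        if (b->has_task)
            assigned.push_back(b);
    return assigned.empty() ? big_enough : assigned;
}
```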
  • In the present application, based on the recorded first operation queue where the first task is located, it can be determined accurately and quickly whether the first task and the second task are located in the same operation queue, after which the second cache block is allocated to the second task; this improves memory utilization.
  • FIG. 4 is a flowchart of another memory management method provided by an embodiment of the application. Points that are the same as in the embodiment shown in FIG. 2 are described only briefly.
  • 401: The first processing device allocates a first cache block of a cache pool to a first task.
  • 402: The first processing device puts the first cache block into the cache pool again, and records the first operation queue where the first task corresponding to the first cache block is located.
  • 403: The first processing device searches the cache pool for at least one candidate cache block that meets the cache size required by the second task.
  • If at least one candidate cache block that satisfies the cache size required by the second task is found, step 404 is executed; otherwise, step 408 is executed.
  • 404: The first processing device searches the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  • If at least one such candidate cache block is found, step 405 is executed; if no candidate cache block with an assigned task is found, step 406 is executed.
  • 405: The first processing device determines the second cache block to allocate to the second task from the at least one candidate cache block, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task.
  • One implementation of step 405 is as follows: from the at least one candidate cache block, select the candidate cache blocks whose currently assigned tasks are executed in order with the second task, obtaining one or more target cache blocks (such as the above-mentioned first cache block); the second cache block allocated to the second task is then determined from the one or more target cache blocks. Step 405 corresponds to step 202 in FIG. 2.
  • Another implementation of step 405 is as follows: the first processing device determines the second cache block for the second task from the at least one candidate cache block, based on both the execution-order relationship between the tasks currently allocated to the candidates and the second task, and the sizes of the candidates. For example, from the at least one candidate cache block, select those whose currently assigned tasks are executed in order with the second task to obtain one or more target cache blocks (such as the above-mentioned first cache block); from these target cache blocks, select a second cache block that meets the cache size required by the second task, and allocate the second cache block to the second task.
  • If there are multiple target cache blocks that meet the cache size required by the second task, the smallest of them can be selected as the second cache block, although the embodiments of the present disclosure do not limit this; a sketch of this best-fit choice follows.
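  • A sketch of the size-based choice: among the target cache blocks, pick the smallest one that still satisfies the second task (a best-fit rule; as noted, other choices are possible). Names are illustrative.

```cpp
#include <cstddef>
#include <vector>

struct CacheBlock { std::size_t size; };  // illustrative

// Best-fit selection over the target cache blocks (blocks whose tasks
// execute in order with the second task): the smallest block that is
// still large enough for the second task's required cache size.
CacheBlock* pick_smallest_fit(const std::vector<CacheBlock*>& targets,
                              std::size_t need) {
    CacheBlock* best = nullptr;
    for (CacheBlock* b : targets)
        if (b->size >= need && (best == nullptr || b->size < best->size))
            best = b;
    return best;  // nullptr: fall back to an unassigned block or expansion
}
```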
  • 406: The first processing device determines a third cache block to allocate to the second task from at least one cache block in the cache pool to which no task is currently allocated.
  • 407: The first processing device allocates the third cache block to the second task.
  • 408: The first processing device expands the cache pool, and searches the expanded cache pool for a fourth cache block to allocate to the second task.
  • the fourth cache block may be a cache block that satisfies the cache size required by the second task. For example, if there is no cache block that meets the requirements of the second task in the cache pool, the cudaMalloc interface is called to allocate more video memory from the device memory to expand the cache pool.
  • the cudaMalloc interface is an interface for allocating video memory in CUDA. Satisfying the requirement of the second task refers to meeting the cache size required for the second task.
  • 409: The first processing device allocates the fourth cache block to the second task.
  • After step 409, the method may further include step 410.
  • 410: The first processing device empties the cache pool.
  • the buffer in the buffer pool can be returned to the device memory of the second processing device.
  • For example, the first processing device calls the cudaFree interface to return the video memory in the cache pool to the GPU (that is, the second processing device), that is, it clears the cache pool; a sketch follows.
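  • Clearing the pool reduces, in the CUDA case, to returning every cached allocation with cudaFree (CUDA's video-memory release interface); a minimal sketch, assuming the pool tracks its blocks as raw device pointers:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Empty the cache pool: give all cached video memory back to the GPU.
void clear_pool(std::vector<void*>& pooled_blocks) {
    for (void* p : pooled_blocks)
        cudaFree(p);
    pooled_blocks.clear();
}
```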
  • In the embodiments of this application, the first processing device first determines at least one candidate cache block that meets the cache size required by the second task, and then preferentially selects from the at least one candidate cache block a block whose currently assigned task is executed in order with the second task; the second task can thus quickly be allocated a cache block that meets its needs, and memory utilization is improved.
  • FIG. 5 is a flowchart of another memory management method provided by an embodiment of the application.
  • 501: The first processing device allocates a first cache block of a cache pool to a first task.
  • 502: The first processing device puts the first cache block into the cache pool again, and records the first operation queue where the first task corresponding to the first cache block is located.
  • 503: The first processing device searches the cache pool for a candidate cache block that currently has a task assigned and meets the requirements of the second task.
  • If such a candidate cache block is found, step 504 is executed; if no candidate cache block that currently has a task assigned and meets the requirements of the second task is found, step 505 is executed.
  • 504: The first processing device allocates to the second task a second cache block determined from the at least one candidate cache block, based on the execution-order relationship between the tasks currently allocated to the at least one candidate cache block and the second task.
  • The implementation of step 504 may be the same as that of step 405.
  • 505: The first processing device searches the cache pool for a cache block that currently has no task assigned and meets the requirements of the second task.
  • If such a cache block is found, step 506 is executed; if no cache block that currently has no task assigned and meets the requirements of the second task is found, step 507 is executed.
  • 506: The first processing device determines a third cache block to allocate to the second task from the found cache blocks that currently have no task assigned and meet the requirements of the second task.
  • 507: The first processing device expands the cache pool, and searches the expanded cache pool for a fourth cache block to allocate to the second task.
  • In the embodiments of this application, preferentially searching for a cache block for the second task among the cache blocks that currently have tasks assigned and meet the requirements of the second task improves both the search speed and the memory reuse rate.
  • FIG. 6 is a sequence diagram of a memory management method provided by an embodiment of the application, which corresponds to the memory management method in FIGS. 2 to 5.
  • The first processing device sequentially performs the following operations: allocation 1, submit task 1 (corresponding to the first task), release 1, allocation 2, submit task 2 (corresponding to the second task), and release 2;
  • the second processing device sequentially performs the following operations: perform task 1 and perform task 2.
  • Here, allocation 1 means that the first processing device allocates the first cache block to task 1, and allocation 2 means that the first processing device allocates the second cache block to task 2. Submitting task 1 means that the first processing device submits task 1 to the operation queue of the second processing device, and submitting task 2 means that the first processing device submits task 2 to the operation queue of the second processing device. Release 1 means that the first processing device controls the memory allocator to release the first cache block and put it back into the cache pool; release 2 means that the first processing device controls the memory allocator to release the second cache block and put it back into the cache pool. Executing task 1 means that the second processing device executes task 1, and executing task 2 means that the second processing device executes task 2.
  • the first cache block allocated by the first processing device performing allocation 1 and the second cache block allocated by performing allocation 2 are the same or overlap.
  • the second processing device can reuse the cache block to execute tasks in the same operation queue.
  • the first cache block used by the second processing device to execute the first task is the same as the second cache block used to execute the second task.
  • For example, the second processing device is a GPU, and the GPU can reuse the same piece of video memory to perform computing tasks in the same operation queue. As shown in FIG. 6, while the first processing device performs the submit task 1, release 1, allocation 2, and submit task 2 operations, the second processing device executes task 1 at the same time; while the second processing device executes task 2, the first processing device performs the release 2 operation.
  • the first processing device does not need to wait for the second processing device to complete task 1 before performing the operations of releasing 1, assigning 2, and submitting task 2. That is to say, the first processing device and the second processing device do not need to be synchronized, and an asynchronous calculation mode can be implemented to improve calculation performance.
  • In the embodiments of this application, the second processing device can reuse the cache according to the order in which tasks are executed, and the first processing device and the second processing device can operate in an asynchronous computation mode; this improves both memory utilization and computational efficiency. The sequence is condensed into the sketch below.
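  • The FIG. 6 sequence condensed into a CUDA sketch, under assumed simplifications (one queue, a one-block "pool", hypothetical kernels): the host never synchronizes between task 1 and task 2, and allocation 2 reuses the block released after submitting task 1.

```cpp
#include <cuda_runtime.h>

__global__ void task1(float* block) { block[threadIdx.x] = 1.0f; }
__global__ void task2(float* block) { block[threadIdx.x] += 2.0f; }

int main() {
    cudaStream_t queue;
    cudaStreamCreate(&queue);

    float* block;
    cudaMalloc(&block, 256 * sizeof(float));  // the "pool" holds one block

    task1<<<1, 256, 0, queue>>>(block);  // allocation 1 + submit task 1
    // release 1: the block goes back to the pool, tagged with this queue
    task2<<<1, 256, 0, queue>>>(block);  // allocation 2 reuses the block
    // release 2: the block goes back to the pool again

    cudaStreamSynchronize(queue);  // only needed before reading results
    cudaFree(block);
    cudaStreamDestroy(queue);
    return 0;
}
```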
  • FIG. 7 is a flowchart of another memory management method provided by an embodiment of the application. As shown in Figure 7, the method includes:
  • 701: The first processing device allocates a fifth cache block in the cache pool to a third task.
  • The implementation of step 701 may be similar to that of step 301.
  • 702: The first processing device submits the third task to the operation queue of the second processing device.
  • 703: The first processing device releases the fifth cache block immediately after submitting the third task to the operation queue of the second processing device.
  • 704: The first processing device checks whether the third task is completed.
  • In step 704, the first processing device may periodically (for example, every 5 ms, 10 ms, etc.) check whether the third task is completed; if it finds that the third task is completed, it calls the memory allocator to put the fifth cache block back into the cache pool; if not, it continues to check periodically.
  • step 703 may be replaced by: checking whether the third task is completed each time before releasing the cache block (for example, the fifth cache block).
  • Alternatively, step 703 may be replaced by: checking whether the third task is completed each time before applying for a cache block (for example, the fifth cache block). It should be understood that the first processing device may also check whether the third task is completed in other ways, which is not limited in the embodiments of the present application. One possible polling mechanism is sketched below.
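  • One way to realize the periodic check, assuming CUDA events (the patent does not name a specific API): record an event right after the third task, then poll it from the host; the kernel is a hypothetical placeholder.

```cpp
#include <cuda_runtime.h>

__global__ void third_task(float* block) { block[threadIdx.x] = 0.0f; }

// Submit the third task and mark its completion point with an event.
void submit_third_task(float* block, cudaStream_t queue, cudaEvent_t done) {
    third_task<<<1, 256, 0, queue>>>(block);
    cudaEventRecord(done, queue);  // completes only after the task finishes
}

// Called periodically (e.g. every 5-10 ms): true once the task is done,
// at which point the fifth cache block may be put back into the pool.
bool third_task_finished(cudaEvent_t done) {
    return cudaEventQuery(done) == cudaSuccess;
}
```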
  • 705: The first processing device puts the fifth cache block into the cache pool again.
  • 706: The first processing device allocates a sixth cache block in the cache pool to a fourth task.
  • 707: The first processing device submits the fourth task to the operation queue of the second processing device.
  • 708: The second processing device occupies the fifth cache block to perform the third task, and occupies the sixth cache block to perform the fourth task.
  • the first processing device periodically checks whether the task on the second processing device is completed, and releases the cache corresponding to any calculation task when any calculation task is completed, without waiting for all the calculation tasks on the second processing device Complete; can improve processing efficiency.
Figure 8 is a sequence diagram of a memory management method provided by an embodiment of the application, corresponding to the memory management method in Figure 7. As shown in Figure 8, the first processing device performs the following operations in order: allocation 1, submit task 1 (corresponding to the third task), release 1, allocation 2, submit task 2 (corresponding to the fourth task), and release 2; the second processing device performs the following operations in order: execute task 1 and execute task 2. In Figure 8, allocation 1 means that the first processing device allocates the fifth cache block in the cache pool to task 1, and allocation 2 means that the first processing device allocates the sixth cache block in the cache pool to task 2; submit task 1 means that the first processing device submits task 1 to the operation queue of the second processing device, and submit task 2 means that the first processing device submits task 2 to the operation queue of the second processing device; release 1 means that the first processing device releases the fifth cache block, and release 2 means that the first processing device releases the sixth cache block; execute task 1 means that the second processing device executes task 1, and execute task 2 means that the second processing device executes task 2. In Figure 8, the fifth cache block allocated in allocation 1 and the sixth cache block allocated in allocation 2 do not share any cache. That is to say, in this scheme the second processing device cannot reuse cache blocks across computing tasks in the same operation queue.
Compared with the sequence diagram of Figure 6: in the sequence diagram of Figure 8, the release 1 operation is the first processing device releasing the fifth cache block, whereas in Figure 6 the release 1 operation is the first processing device releasing the first cache block and putting it back into the cache pool; in Figure 8, the release 2 operation is the first processing device releasing the sixth cache block, whereas in Figure 6 the release 2 operation is the first processing device releasing the second cache block and putting it back into the cache pool. In the memory management method of Figure 7, after the first processing device submits any computing task to the operation queue of the second processing device, it periodically checks whether that task has been completed by the second processing device, and only when the task is completed is the memory space it occupies put back into the cache pool. In other words, a cache block that the first processing device has allocated for the second processing device to execute a computing task is not put back into the cache pool until that task is completed.

As shown in Figure 8, while the first processing device performs the operations of submitting task 1, release 1, allocation 2, and submitting task 2, the second processing device executes task 1; while the second processing device executes task 2, the first processing device performs the release 2 operation. It can be seen that the first processing device does not need to wait for the second processing device to complete task 1 before performing the release 1, allocation 2, and submit task 2 operations. That is to say, the first processing device and the second processing device do not need to synchronize, and an asynchronous computing mode can be implemented, thereby improving computing performance. In this embodiment, the first processing device and the second processing device can operate in an asynchronous computing mode, which improves memory utilization.
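Combining the two sketches above, the host-side reclamation loop of the Figure 7/Figure 8 scheme might look as follows; again an illustrative sketch under the same assumptions, not the application's API:

```cpp
#include <vector>

// Reuses the hypothetical CachePool, PendingFree, and taskCompleted sketches
// above. Periodically called on the first processing device: put the blocks
// of all completed tasks back into the cache pool, without waiting for the
// remaining tasks on the second processing device.
void reclaimCompleted(std::vector<PendingFree>& pending, CachePool& pool) {
    for (size_t i = 0; i < pending.size();) {
        if (taskCompleted(pending[i])) {
            // The task is done, so any stream may reuse the block; record
            // "no pending task" by passing a null stream.
            pool.release(pending[i].block, nullptr);
            cudaEventDestroy(pending[i].done);
            pending[i] = pending.back();  // swap-remove this entry
            pending.pop_back();
        } else {
            ++i;
        }
    }
}
```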
The following describes the structure of a data processing device that can implement the memory management methods provided in the foregoing embodiments. Figure 9 is a schematic structural diagram of a data processing device provided by an embodiment of the application. As shown in Figure 9, the data processing device includes:
a memory allocation unit 901, configured to allocate a first cache block of a cache pool to a first task; and a processing unit 902, configured to determine a case in which a second processing device needs to execute a second task and the first task in order. The memory allocation unit 901 is further configured to, when the processing unit determines that the second processing device needs to execute the second task and the first task in order, allocate a second cache block of the cache pool to the second task, where at least a part of the second cache block is included in the first cache block.
The processing unit and the memory allocation unit may be the same unit or two independent units. In some possible implementations, the processing unit is a processor, such as a CPU, and the memory allocation unit is a piece of hardware. In other possible implementations, the processing unit is a processor, such as a CPU, and the functions of the memory allocation unit are implemented by software or a program run by the processor; in other words, the functions of both the processing unit and the memory allocation unit are implemented by the processor.
In a possible implementation, the processing unit 902 is further configured to send the first task to the second processing device, and the memory allocation unit 901 is further configured to, in response to the first task being sent to the second processing device, put the first cache block back into the cache pool. In a possible implementation, the processing unit 902 is further configured to record the first operation queue in which the first task corresponding to the first cache block is located, and to determine, based on the recorded first operation queue in which the first task is located, that the first task and the second task are located in the same operation queue.
In a possible implementation, the memory allocation unit 901 is further configured to allocate the second cache block of the cache pool to the second task while the second processing device is executing the first task.
In a possible implementation, the memory allocation unit 901 is further configured to search the cache pool for at least one candidate cache block to which a task is currently allocated; the memory allocation unit 901 is configured to, when the processing unit determines, based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task the second cache block determined from the at least one candidate cache block. In a possible implementation, the memory allocation unit 901 is configured to search the cache pool for at least one candidate cache block satisfying the cache size required by the second task, and then to search the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated. In a possible implementation, the memory allocation unit 901 is configured to, when the processing unit determines, based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task, based on the size of the at least one candidate cache block, the second cache block determined from the at least one candidate cache block.
In a possible implementation, the memory allocation unit 901 is further configured to, when the processing unit determines that the at least one candidate cache block does not include a cache block satisfying the requirement of the second task, determine the target cache block to be allocated to the second task from at least one cache block, included in the cache pool, to which no task is currently allocated. In a possible implementation, the memory allocation unit 901 is further configured to expand the cache pool when no cache block satisfying the requirement of the second task is found in the cache pool, and to search the expanded cache pool for the target cache block to be allocated to the second task.
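The allocation policy described in the preceding paragraphs (size-matching candidates already allocated to an in-order task first, then unallocated blocks, then pool expansion) can be sketched as follows, reusing the hypothetical Block fields and headers from the first sketch. Choosing the smallest sufficient candidate is one example the description mentions rather than a requirement, and the function name is an assumption of this illustration:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Illustrative only: `Block` comes from the earlier sketch. Returns a device
// pointer for the second task, which will run on `stream`.
void* allocateForTask(std::vector<Block>& blocks, size_t size,
                      cudaStream_t stream) {
    // 1) Prefer a size-matching candidate whose recorded operation queue is
    //    the same as the new task's: the two tasks then execute in order, so
    //    the block can be reused before the earlier task even finishes.
    //    Among such candidates, take the smallest sufficient one.
    Block* best = nullptr;
    for (Block& b : blocks) {
        if (b.inPool && b.size >= size && b.lastStream == stream &&
            (best == nullptr || b.size < best->size)) {
            best = &b;
        }
    }
    // 2) Otherwise fall back to a block not currently allocated to any task.
    if (best == nullptr) {
        for (Block& b : blocks) {
            if (b.inPool && b.size >= size && b.lastStream == nullptr &&
                (best == nullptr || b.size < best->size)) {
                best = &b;
            }
        }
    }
    if (best != nullptr) {
        best->inPool = false;
        return best->ptr;
    }
    // 3) Otherwise expand the pool with new device memory (cf. cudaMalloc).
    Block nb;
    nb.size   = size;
    nb.inPool = false;
    if (cudaMalloc(&nb.ptr, size) != cudaSuccess) return nullptr;
    blocks.push_back(nb);
    return blocks.back().ptr;
}
```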
Figure 10 is a schematic structural diagram of another data processing device provided by an embodiment of the application. As shown in Figure 10, the data processing device includes a first processor 1001, a second processor 1002, and a memory 1003, where the memory is configured to store instructions and data. The first processor is configured to execute the instructions stored in the memory, so that the first processor performs the memory management method described in any of the above embodiments, and the second processor is configured to use the cache blocks allocated by the first processor to execute the tasks sent by the first processor. The memory 1003 may include device memory used by the second processor 1002 as well as memory of the first processor 1001. Exemplarily, the first processor is a CPU, the second processor is a GPU, and the memory 1003 includes the video memory of the GPU.
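Putting it together on such a device, host code on the first processor might drive the pool as follows; myKernel, the launch configuration, and the pool types are assumptions of this illustration, not part of the application:

```cpp
// Illustrative host code: the CPU (first processor) allocates from the pool,
// submits work to the GPU (second processor) on one stream, and releases
// each block immediately after submission, as in Figure 6.
__global__ void myKernel(float* buf, int n) {  // assumed example task
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f;  // contents uninitialized in this sketch
}

int main() {
    CachePool pool;                 // hypothetical pool from the first sketch
    cudaStream_t queue;
    cudaStreamCreate(&queue);       // the operation queue (a cudaStream)

    const int n = 1 << 20;
    float* buf1 = static_cast<float*>(pool.allocate(n * sizeof(float), queue));
    myKernel<<<(n + 255) / 256, 256, 0, queue>>>(buf1, n);  // submit task 1
    pool.release(buf1, queue);      // back into the pool right away

    // Task 2 runs on the same queue, so it may be handed the same block.
    float* buf2 = static_cast<float*>(pool.allocate(n * sizeof(float), queue));
    myKernel<<<(n + 255) / 256, 256, 0, queue>>>(buf2, n);  // submit task 2
    pool.release(buf2, queue);

    cudaStreamSynchronize(queue);   // only needed before reading results
    cudaStreamDestroy(queue);
    return 0;
}
```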
An embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the memory management method described in any of the foregoing embodiments is implemented. The computer-readable storage medium includes a non-transitory computer-readable storage medium. An embodiment of the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the memory management method provided in the foregoing embodiments. An embodiment of the present application also provides an electronic device that includes a memory and a first processor, where the memory is configured to store instructions and the first processor is configured to execute the instructions stored in the memory, so that the first processor performs the memory management method described in any of the foregoing embodiments. The electronic device may further include a second processor configured to use the cache blocks allocated by the first processor to execute the tasks sent by the first processor. An embodiment of the present application further provides a chip that includes a data interface and the first processing device described in the first aspect, where the first processing device is configured to execute the memory management method described in any of the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application discloses a memory management method and related products. The method includes: a first processing device allocates a first cache block of a cache pool to a first task; in a case where the first processing device determines that a second processing device needs to execute a second task and the first task in order, the first processing device allocates a second cache block of the cache pool to the second task, wherein at least a part of the second cache block is included in the first cache block. In the embodiments of the present application, when the first processing device determines that the second processing device needs to execute the second task and the first task in order, the first task and the second task can reuse part of a cache block, which can improve memory utilization.


Claims (21)

  1. A memory management method, characterized in that the method comprises: allocating, by a first processing device, a first cache block of a cache pool to a first task; and in a case where the first processing device determines that a second processing device needs to execute a second task and the first task in order, allocating a second cache block of the cache pool to the second task, wherein at least a part of the second cache block is included in the first cache block.
  2. The method according to claim 1, characterized in that after the first processing device allocates the first cache block of the cache pool to the first task, the method further comprises: putting, by the first processing device, the first cache block back into the cache pool in response to sending the first task to the second processing device.
  3. The method according to claim 1 or 2, characterized in that after the first processing device allocates the first cache block of the cache pool to the first task, the method further comprises: recording, by the first processing device, a first operation queue in which the first task corresponding to the first cache block is located; and the determining that the second processing device needs to execute the second task and the first task in order comprises: determining, by the first processing device based on the recorded first operation queue in which the first task is located, that the first task and the second task are located in the same operation queue.
  4. The method according to any one of claims 1 to 3, characterized in that the allocating the second cache block of the cache pool to the second task comprises: allocating, by the first processing device, the second cache block of the cache pool to the second task while the second processing device is executing the first task.
  5. The method according to any one of claims 1 to 4, characterized in that before the allocating the second cache block of the cache pool to the second task, the method further comprises: searching, by the first processing device, the cache pool for at least one candidate cache block to which a task is currently allocated; and the allocating the second cache block of the cache pool to the second task in the case where it is determined that the second processing device needs to execute the second task and the first task in order comprises: allocating, by the first processing device, to the second task, the second cache block determined from the at least one candidate cache block based on an execution order relationship between the task currently allocated to the at least one candidate cache block and the second task.
  6. The method according to claim 5, characterized in that the searching, by the first processing device, the cache pool for at least one candidate cache block to which a task is currently allocated comprises: searching, by the first processing device, the cache pool for at least one candidate cache block satisfying a cache size required by the second task; and searching, by the first processing device, the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  7. The method according to claim 5 or 6, characterized in that the allocating, to the second task, the second cache block determined from the at least one candidate cache block based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task comprises: allocating, by the first processing device, to the second task, the second cache block determined from the at least one candidate cache block based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task as well as the size of the at least one candidate cache block.
  8. The method according to any one of claims 5 to 7, characterized in that the method further comprises: in a case where the first processing device determines that the at least one candidate cache block does not include a cache block satisfying a requirement of the second task, determining a target cache block to be allocated to the second task from at least one cache block, included in the cache pool, to which no task is currently allocated.
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises: expanding, by the first processing device, the cache pool in a case where no cache block satisfying the requirement of the second task is found in the cache pool; and searching, by the first processing device, the expanded cache pool for a target cache block to be allocated to the second task.
  10. A data processing apparatus, characterized by comprising: a memory allocation unit configured to allocate a first cache block of a cache pool to a first task; and a processing unit configured to determine a case in which a second processing device needs to execute a second task and the first task in order; wherein the memory allocation unit is further configured to, in the case where the processing unit determines that the second processing device needs to execute the second task and the first task in order, allocate a second cache block of the cache pool to the second task, wherein at least a part of the second cache block is included in the first cache block.
  11. The data processing apparatus according to claim 10, characterized in that the processing unit is further configured to send the first task to the second processing device; and the memory allocation unit is further configured to put the first cache block back into the cache pool in response to the first task being sent to the second processing device.
  12. The data processing apparatus according to claim 10 or 11, characterized in that the processing unit is further configured to record a first operation queue in which the first task corresponding to the first cache block is located; and the processing unit is configured to determine, based on the recorded first operation queue in which the first task is located, that the first task and the second task are located in the same operation queue.
  13. The data processing apparatus according to any one of claims 10 to 12, characterized in that the memory allocation unit is further configured to allocate the second cache block of the cache pool to the second task while the second processing device is executing the first task.
  14. The data processing apparatus according to any one of claims 10 to 13, characterized in that the memory allocation unit is further configured to search the cache pool for at least one candidate cache block to which a task is currently allocated; and the memory allocation unit is configured to, in the case where the processing unit determines, based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task the second cache block determined from the at least one candidate cache block.
  15. The data processing apparatus according to claim 14, characterized in that the memory allocation unit is configured to search the cache pool for at least one candidate cache block satisfying the cache size required by the second task, and to search the at least one candidate cache block for at least one candidate cache block to which a task is currently allocated.
  16. The data processing apparatus according to claim 14 or 15, characterized in that the memory allocation unit is configured to, in the case where the processing unit determines, based on the execution order relationship between the task currently allocated to the at least one candidate cache block and the second task, that the second processing device needs to execute the second task and the first task in order, allocate to the second task, based on the size of the at least one candidate cache block, the second cache block determined from the at least one candidate cache block.
  17. The data processing apparatus according to any one of claims 14 to 16, characterized in that the memory allocation unit is further configured to, in a case where the processing unit determines that the at least one candidate cache block does not include a cache block satisfying the requirement of the second task, determine the target cache block to be allocated to the second task from at least one cache block, included in the cache pool, to which no task is currently allocated.
  18. The data processing apparatus according to any one of claims 10 to 17, characterized in that the memory allocation unit is further configured to expand the cache pool in a case where no cache block satisfying the requirement of the second task is found in the cache pool, and to search the expanded cache pool for the target cache block to be allocated to the second task.
  19. An electronic device, characterized by comprising a memory and a first processor, wherein the memory is configured to store instructions, and the first processor is configured to execute the instructions stored in the memory, so that the first processor performs the method according to any one of claims 1 to 9.
  20. The electronic device according to claim 19, characterized in that the electronic device further comprises a second processor configured to use cache blocks allocated by the first processor to execute tasks sent by the first processor.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 9.
PCT/CN2021/079390 2020-06-18 2021-03-05 内存管理方法和相关产品 WO2021253875A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020217042198A KR20220010036A (ko) 2020-06-18 2021-03-05 메모리 관리 방법 및 관련 제품
JP2021570921A JP2022539956A (ja) 2020-06-18 2021-03-05 メモリ管理方法及び関連製品

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010561183.XA CN111736998A (zh) 2020-06-18 2020-06-18 内存管理方法和相关产品
CN202010561183.X 2020-06-18

Publications (1)

Publication Number Publication Date
WO2021253875A1 true WO2021253875A1 (zh) 2021-12-23

Family

ID=72649904

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079390 WO2021253875A1 (zh) 2020-06-18 2021-03-05 内存管理方法和相关产品

Country Status (5)

Country Link
JP (1) JP2022539956A (zh)
KR (1) KR20220010036A (zh)
CN (1) CN111736998A (zh)
TW (1) TWI783401B (zh)
WO (1) WO2021253875A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736998A (zh) * 2020-06-18 2020-10-02 上海商汤智能科技有限公司 内存管理方法和相关产品


Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US7353339B2 (en) * 2003-12-24 2008-04-01 Intel Corporation Adaptive caching
US10509727B1 (en) * 2018-09-10 2019-12-17 Mediatek Inc. Method and apparatus for performing task-level cache management in electronic device

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20130047162A1 (en) * 2011-08-19 2013-02-21 Canon Kabushiki Kaisha Efficient cache reuse through application determined scheduling
CN109271327A (zh) * 2017-07-18 2019-01-25 杭州海康威视数字技术股份有限公司 内存管理方法及装置
CN110308982A (zh) * 2018-03-20 2019-10-08 华为技术有限公司 一种共享内存复用方法及装置
CN108829610A (zh) * 2018-04-02 2018-11-16 浙江大华技术股份有限公司 一种神经网络前向计算过程中的内存管理方法及设备
CN111736998A (zh) * 2020-06-18 2020-10-02 上海商汤智能科技有限公司 内存管理方法和相关产品

Non-Patent Citations (1)

Title
LI, WEI: "Research and Application of Multimedia Stream Processing Framework for Deep Learning", CHINESE MASTER'S THESES FULL-TEXT DATABASE, SOCIAL SCIENCES II, no. 9, 15 September 2019 (2019-09-15), pages 1 - 77, XP055882229, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
KR20220010036A (ko) 2022-01-25
TWI783401B (zh) 2022-11-11
CN111736998A (zh) 2020-10-02
JP2022539956A (ja) 2022-09-14
TW202201231A (zh) 2022-01-01

Similar Documents

Publication Publication Date Title
WO2017166777A1 (zh) 一种任务调度方法及装置
CN106371894B (zh) 一种配置方法、装置和数据处理服务器
US8893148B2 (en) Performing setup operations for receiving different amounts of data while processors are performing message passing interface tasks
US8312464B2 (en) Hardware based dynamic load balancing of message passing interface tasks by modifying tasks
US8108876B2 (en) Modifying an operation of one or more processors executing message passing interface tasks
US7650601B2 (en) Operating system kernel-assisted, self-balanced, access-protected library framework in a run-to-completion multi-processor environment
US9311157B2 (en) Method and apparatus for dynamic resource allocation of processing units on a resource allocation plane having a time axis and a processing unit axis
US8127300B2 (en) Hardware based dynamic load balancing of message passing interface tasks
US20090019450A1 (en) Apparatus, method, and computer program product for task management
US20090064166A1 (en) System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks
KR20110075297A (ko) 병렬도를 고려한 병렬 처리 장치 및 방법
CN110990154B (zh) 一种大数据应用优化方法、装置及存储介质
US20130097382A1 (en) Multi-core processor system, computer product, and control method
WO2021253875A1 (zh) 内存管理方法和相关产品
US10241822B2 (en) Information processing apparatus for moving virtual machine and method of moving virtual machine
JP6007516B2 (ja) リソース配分システム、リソース配分方法、及びリソース配分プログラム
CN109766168B (zh) 任务调度方法和装置、存储介质以及计算设备
JP7122299B2 (ja) 処理タスクを実行するための方法、装置、デバイス、および記憶媒体
US11392388B2 (en) System and method for dynamic determination of a number of parallel threads for a request
WO2013178244A1 (en) A graphics processing unit controller, host system, and methods
JP4734348B2 (ja) 共有メモリ型マルチプロセッサにおける非同期遠隔手続き呼び出し方法、非同期遠隔手続き呼び出しプログラムおよび記録媒体
CN115509704A (zh) 一种任务调度方法、装置、设备及存储介质
JP4211645B2 (ja) 専用プロセッサの備わった計算機システム
JP2005327007A (ja) 組込みコンピュータ制御プログラム、そのプログラムを記録した記録媒体、及び組込みシステム
CN112685158B (zh) 一种任务调度方法、装置、电子设备及存储介质

Legal Events

ENP (Entry into the national phase): Ref document number: 2021570921; Country of ref document: JP; Kind code of ref document: A
ENP (Entry into the national phase): Ref document number: 20217042198; Country of ref document: KR; Kind code of ref document: A
121 (Ep: the EPO has been informed by WIPO that EP was designated in this application): Ref document number: 21826582; Country of ref document: EP; Kind code of ref document: A1
NENP (Non-entry into the national phase): Ref country code: DE
122 (Ep: PCT application non-entry in European phase): Ref document number: 21826582; Country of ref document: EP; Kind code of ref document: A1
32PN (Ep: public notification in the EP bulletin as the address of the addressee cannot be established): Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 17/05/2023)
