WO2020093227A1 - Heterogeneous computing system and memory management method - Google Patents

Heterogeneous computing system and memory management method

Info

Publication number
WO2020093227A1
WO2020093227A1 (PCT/CN2018/114102)
Authority
WO
WIPO (PCT)
Prior art keywords
task
physical address
calculation
input data
data
Prior art date
Application number
PCT/CN2018/114102
Other languages
English (en)
French (fr)
Inventor
张龙
郑明
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201880095316.XA (published as CN112368686A)
Priority to PCT/CN2018/114102 (published as WO2020093227A1)
Publication of WO2020093227A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation

Definitions

  • This application relates to the field of computer technology, in particular to a heterogeneous computing system and memory management method.
  • In a heterogeneous computing system, a master device distributes the M computing tasks of a network task to multiple slave devices, and the slave devices complete the computation of those M computing tasks.
  • Typically, the master device and the slave devices share a common shared memory.
  • Before distributing the computing tasks, the master device requests memory space from the shared memory to store the task data generated while the M computing tasks execute, for example input data, calculation parameters, and output data.
  • At present, address allocation within the memory space is performed independently for each computing task. For example, assume the memory space requested by the master device is 51 logical pages (logical pages 0-50). The input data of task 1 needs 4 logical pages, its calculation parameters need 18 logical pages, and its output data needs 7 logical pages. The input data of task 2 is the output data of task 1 and needs 7 logical pages; its calculation parameters need 36 logical pages and its output data needs 7 logical pages.
  • During the execution of task 1, the slave device first applies to the shared memory for logical addresses, and the shared memory completes the allocation according to the size of task 1's task data.
  • As shown in FIG. 1, task 1's input data occupies logical pages 0-3, its calculation parameters occupy logical pages 4-21, and its output data occupies logical pages 22-28.
  • After task 1 completes, its input data and calculation parameters are released, and its output data, as task 2's input data, still occupies logical pages 22-28.
  • At this point, logical pages 0-21 and 29-50 of the memory space are free. When the slave device then applies for logical addresses for task 2, task 2's calculation parameters need 36 contiguous logical pages, but the memory space contains no such run, only two fragments of free logical addresses (logical pages 0-21 and 29-50). Fragmented logical addresses like these are hard for the slave device or its software to handle and can easily cause task 2 to fail.
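  • The fragmentation described above can be made concrete with a minimal sketch (Python is used here purely for illustration; the page counts follow this example, and the sketch is not part of the patented method):

```python
# Minimal sketch of the fragmentation problem described above; all numbers
# follow the example in the text, and this is an illustration only.

def free_runs(pages_in_use, total_pages=51):
    """Return the maximal runs of free logical pages as (start, end) tuples."""
    runs, start = [], None
    for page in range(total_pages):
        if page not in pages_in_use:
            start = page if start is None else start
        elif start is not None:
            runs.append((start, page - 1))
            start = None
    if start is not None:
        runs.append((start, total_pages - 1))
    return runs

# After task 1 completes, only its output (task 2's input) stays resident,
# still occupying logical pages 22-28.
in_use = set(range(22, 29))
runs = free_runs(in_use)
largest = max(end - start + 1 for start, end in runs)
print(runs)     # [(0, 21), (29, 50)] -- two fragments
print(largest)  # 22 -- no run of 36 contiguous pages for task 2's parameters
```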
  • To address this, the present application provides a heterogeneous computing system and a memory management method that reduce logical address fragmentation in memory and improve memory utilization.
  • In a first aspect, the present application provides a heterogeneous computing system including a master device and N slave devices, where N is an integer greater than or equal to 1. The master device is configured to generate a remapping table and send it to at least one of the N slave devices. The remapping table includes multiple logical addresses of the computing task to be processed by the at least one slave device; these are contiguous, uninterrupted logical addresses within the memory space of a first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data. The computing task is a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the master device adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space; at least one of the physical address of the input data and the physical address of the calculation parameters is determined by the previous computing task. The at least one slave device is configured to, when executing the computing task, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.
  • In the heterogeneous computing system provided by this application, the master device computes the remapping table to plan in advance the logical addresses of each computing task's input data, output data, and calculation parameters, so that while executing a computing task the at least one slave device has contiguous logical addresses in the memory space from which to read input data and calculation parameters and to which to write output data, avoiding logical address fragmentation and improving the utilization of the first memory.
  • In an optional implementation, the heterogeneous computing system further includes a system memory management unit (SMMU). The master device is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses. The SMMU is configured to receive the correspondence sent by the master device and, upon receiving from the master device or the at least one slave device an instruction carrying a to-be-operated logical address of the memory space, convert that logical address into a physical address according to the correspondence, so that the master device or slave device that sent the instruction accesses the physical address; the instruction is a read instruction or a write instruction.
  • Based on this optional implementation, by sending the SMMU the correspondence between the multiple logical addresses and the physical addresses of the input data, calculation parameters, and output data obtained after adjusting the multiple initial logical addresses, the master device enables the SMMU to perform logical-to-physical address translation according to that correspondence, ensuring that the slave device can successfully access, according to the multiple logical addresses indicated in the remapping table, the physical addresses of the input data, the calculation parameters, and the output data in the memory space.
  • In an optional implementation, before generating the remapping table, the master device is further configured to request the memory space from the first memory and determine task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task; and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.
  • In an optional implementation, the master device includes a central processing unit, and the slave devices include at least one of the following: a graphics processor, a neural processor, or a field-programmable gate array.
  • In a second aspect, the present application provides a memory management method applied to the heterogeneous computing system of the first aspect. The method includes: the master device generates a remapping table and sends it to at least one of the N slave devices, the remapping table including multiple logical addresses of the computing task to be processed by the at least one slave device; these are contiguous, uninterrupted logical addresses within the memory space of the first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the master device adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task. When executing the computing task, the at least one slave device reads the input data and calculation parameters from the memory space and writes the output data according to the multiple logical addresses indicated by the remapping table.
  • In an optional implementation, the method further includes: the master device sends to the system memory management unit (SMMU) the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses; the SMMU receives the correspondence sent by the master device and, upon receiving from the master device or at least one slave device an instruction carrying a to-be-operated logical address of the memory space, converts that logical address into a physical address according to the correspondence, so that the master device or slave device that sent the instruction accesses the physical address; the instruction is a read instruction or a write instruction.
  • In an optional implementation, before the master device generates the remapping table, the method further includes: the master device requests the memory space from the first memory and determines task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task, the task data including input data, calculation parameters, and output data; the master device generating the remapping table then includes: the master device generates the remapping table according to the task information.
  • In a third aspect, the present application provides a memory management device including a first module and N second modules, where N is an integer greater than or equal to 1.
  • For example, the first module may be driver software of the master device in a heterogeneous computing system.
  • When the master device runs the first module, the master device can implement the corresponding steps of the memory management method described in the second aspect.
  • The N second modules correspond one-to-one to the N slave devices in the heterogeneous computing system.
  • A second module may be driver software of a slave device.
  • When a slave device runs its second module, the slave device can implement the corresponding steps of the memory management method described in the second aspect.
  • Alternatively, the first module and the N second modules may be implemented in hardware or in a combination of software and hardware.
  • In the third aspect, the first module is configured to generate a remapping table and send it to at least one of the N second modules. The remapping table includes multiple logical addresses of the computing task to be processed by the at least one second module; these are contiguous, uninterrupted logical addresses within the memory space of the first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the first module adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task. The at least one second module is configured to, when the computing task is executed, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.
  • In an optional implementation, the first module is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses; the SMMU is configured to receive the correspondence sent by the first module and, upon receiving from the first module or at least one second module an instruction carrying a to-be-operated logical address of the memory space, convert that logical address into a physical address according to the correspondence, so that the first module or second module that sent the instruction accesses the physical address; the instruction is a read instruction or a write instruction.
  • In an optional implementation, before generating the remapping table, the first module is further configured to request the memory space from the first memory and determine task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task; and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.
  • Based on the first through third aspects above, in an optional implementation, the multiple logical addresses in the remapping table start from the starting logical address of the memory space.
  • In an optional implementation, the data dependency relationship includes: the computing task shares input data and calculation parameters with its previous computing task; or, the output data of the previous computing task is the input data of the computing task.
  • In an optional implementation, at least one of the physical address of the computing task's input data and the physical address of its calculation parameters is determined in at least one of the following ways: the physical address of the input data is the physical address of the previous computing task's input data; the physical address of the input data is the physical address of the previous computing task's output data; or the physical address of the calculation parameters is the physical address of the previous computing task's calculation parameters.
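  • As a rough illustration of these three ways, the following sketch resolves a task's physical addresses from its predecessor. The record layout (dicts with input/parameter/output physical page ranges) and the flag names are hypothetical assumptions, not structures defined by the patent:

```python
# Hedged sketch of the three inheritance rules listed above. All field and
# flag names are illustrative assumptions for this example only.

def resolve_physical(task, prev_task):
    """Derive a task's input/parameter physical addresses from the previous task."""
    if task.get("shares_input_with_prev"):
        task["input_phys"] = prev_task["input_phys"]      # way 1: shared input
    elif task.get("consumes_prev_output"):
        task["input_phys"] = prev_task["output_phys"]     # way 2: chained output
    if task.get("shares_params_with_prev"):
        task["params_phys"] = prev_task["params_phys"]    # way 3: shared parameters
    return task

# Page ranges follow the worked example later in this description.
task1 = {"input_phys": (0, 3), "params_phys": (4, 18), "output_phys": (19, 25)}
task2 = resolve_physical({"consumes_prev_output": True}, task1)
print(task2["input_phys"])  # (19, 25): task 2 reads task 1's output in place
```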
  • In a fourth aspect, the present application provides a computer storage medium storing, for example, the computer instructions of the first module of the third aspect and the computer instructions of each of the N second modules.
  • Optionally, the computer program product stored in the computer storage medium is used to execute the method shown in any of the second aspect and its optional implementations.
  • In a fifth aspect, the present application provides a computer program product including, for example, the software package of the first module of the third aspect and the software package of each of the N second modules.
  • Optionally, the computer program product is used to execute the method shown in any of the second aspect and its optional implementations.
  • FIG. 1 is a schematic diagram of the usage of the logical addresses of a memory space in the prior art;
  • FIG. 2 is a schematic structural diagram of a heterogeneous computing system provided by this application;
  • FIG. 3 is a flowchart of an embodiment of a memory management method provided by this application;
  • FIG. 4 is a schematic diagram of the usage of the logical addresses of a memory space provided by this application;
  • FIG. 5 is a schematic structural diagram of a memory management device provided by this application.
  • The memory management method provided in this application applies to the heterogeneous computing system shown in FIG. 2, which includes a master device; N slave devices, N being an integer greater than or equal to 1 (denoted slave device 1, slave device 2, ..., slave device N); a first memory; and a system memory management unit (SMMU).
  • The master device may be a central processing unit (CPU).
  • The master device is the main control end of the heterogeneous computing system and is used, after obtaining computing tasks, to allocate memory space for each computing task in the first memory and to distribute the computing tasks to multiple slave devices.
  • A slave device may be a graphics processing unit (GPU), a neural processing unit (NPU), a field programmable gate array (FPGA), or a similar component; the slave devices are the main computing components of the heterogeneous computing system and are used to compute the computing tasks issued by the master device.
  • At least one of the master device and the slave devices can execute software to realize its computation or processing functions.
  • Related software includes, but is not limited to, at least one of driver software, platform software, operating system software, or application software.
  • The first memory may be any memory in the heterogeneous computing system that stores the task data of computing tasks, such as global memory, shared memory, local memory, or register memory.
  • The SMMU is a memory management unit that provides virtual address management and mapping for the master and slave devices, so that a slave device does not need to go through the master device when reading or writing data.
  • Based on the heterogeneous computing system shown in FIG. 2, FIG. 3 shows an embodiment of the memory management method provided by this application; the method may include:
  • Step 301: the master device generates a remap table.
  • The remapping table includes the contiguous logical addresses assigned to the task data of each of the M computing tasks; that is, the logical addresses of each piece of task data are not broken up.
  • The M computing tasks are M subtasks of a neural network or of artificial intelligence, M being an integer greater than or equal to 1.
  • For example, in a fused computing task of a neural network, each single layer or fusion layer among the network's multiple layers is one computing task; or, in a fusion kernel task under the Compute Unified Device Architecture (CUDA), each kernel is one computing task.
  • The master device executes the M computing tasks through software simulation and determines task information such as each piece of task data of each computing task, the data size of each piece of task data, and the data dependency relationships among the M computing tasks.
  • The task data of a computing task may include input data, calculation parameters, and output data.
  • A computing task may use multiple pieces of input data and multiple calculation parameters and, after completing its computation over them, produce multiple pieces of output data.
  • The data dependency relationships among the M computing tasks include, but are not limited to: multiple tasks among the M computing tasks share input data and calculation parameters; or, for two tasks among the M computing tasks, the output data of the task executed first is the input data of the task executed later.
  • The master device and the multiple slave devices executing the M computing tasks share the first memory (for example, the shared memory or global memory of the heterogeneous computing system). When the task data of the M computing tasks needs to be stored in the first memory, the master device can apply to the first memory for memory space and determine the memory information of that space.
  • The memory information of the memory space may include logical addresses and physical addresses, used by the slave devices to read and write the task data of the M computing tasks in the first memory.
  • Illustratively, the master device can compute the remapping table from the task information and the memory information of the memory space. The remapping table may include a task ID, an input ID, a weight ID, an output identification (out ID), a memory address, and the like.
  • The task identifier identifies each computing task.
  • The input identifier identifies input data.
  • The weight identifier identifies calculation parameters, that is, the weight parameters used in artificial intelligence or neural network computation.
  • The output identifier identifies output data. Since each computing task may have multiple pieces of input data, calculation parameters, and output data, the logical address of each piece of input data, calculation parameter, and output data can be marked with the corresponding input, weight, or output identifier.
  • The memory address is the logical address on the first memory of each piece of input data, each piece of output data, and each calculation parameter, used by a slave device to read and write the corresponding task data when executing the computing task.
  • A logical address is the address carried in an instruction from any device that operates on the first memory, such as a read or write instruction with which the master or a slave device reads from or writes to the first memory. After receiving such an instruction, the SMMU maps the logical address in the instruction to a physical address according to the correspondence (for example, a mapping table) between the physical addresses and logical addresses of the memory space recorded in the page table cached in the SMMU, so that the read or write instruction can access the corresponding physical address in the first memory.
  • A physical address is an address where data is actually stored in the first memory: physical addresses are the real addresses of the first memory and can actually hold data.
  • A logical address, by contrast, exists to let a device or its software manage addresses conveniently; it is a virtual address. When the first memory is accessed, the logical address operated on by the device or software must be converted by the SMMU into a physical address according to the correspondence in order to reach the real first memory.
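  • A minimal sketch of this translation step follows; representing the SMMU's cached page table as a Python dict keyed by logical page is an assumption made purely for illustration:

```python
# Minimal sketch of the SMMU's logical-to-physical translation step.
# Modelling the cached page table as a dict is an illustrative assumption.

page_table = {page: page for page in range(51)}  # identity mapping, as in Table 2 below

def translate(logical_pages, page_table):
    """Map the logical pages carried in a read/write instruction to physical pages."""
    return [page_table[p] for p in logical_pages]

# A read instruction covering task 1's input data and parameters (logical pages 0-18)
print(translate(range(0, 19), page_table))  # physical pages 0-18
```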
  • The remapping table may also indicate a data size, recording the data size of each piece of input data, each piece of output data, and each calculation parameter.
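  • To make the fields above concrete, the following sketch shows what one remapping-table entry might look like for task 1 of the example below. The list-of-dicts layout and field names are illustrative assumptions; the per-identifier page ranges are inferred from the worked example later in this description, with sizes given in pages for simplicity:

```python
# Hedged sketch of a remapping-table entry (task ID, input/weight/output IDs,
# memory address as logical page range, data size). Layout is an assumption.

remap_table = [
    {"task_id": 1, "entries": [
        {"input_id": 101, "logical_pages": (0, 0),   "data_size": 1},
        {"input_id": 102, "logical_pages": (1, 1),   "data_size": 1},
        {"input_id": 103, "logical_pages": (2, 3),   "data_size": 2},
        {"weight_id": 111, "logical_pages": (4, 6),   "data_size": 3},
        {"weight_id": 112, "logical_pages": (7, 11),  "data_size": 5},
        {"weight_id": 113, "logical_pages": (12, 18), "data_size": 7},
        {"out_id": 121, "logical_pages": (19, 20), "data_size": 2},
        {"out_id": 122, "logical_pages": (21, 23), "data_size": 3},
        {"out_id": 123, "logical_pages": (24, 25), "data_size": 2},
    ]},
]
```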
  • Illustratively, assume the logical addresses of the memory space requested by the master device for the M computing tasks comprise logical pages 0-50, that is, 51 logical pages in total; the M computing tasks execute in sequence; and the output data of each computing task is the input data of the next.
  • Assume the remapping table generated by the master device from the task information of the M computing tasks and the 51 logical pages of the first memory is as shown in Table 1, where a logical page is a unit of logical address.
  • For ease of description, the logical addresses in Table 1 are all expressed as logical page numbers. Unless otherwise noted, subsequent embodiments also use logical pages as the unit of logical address.
  • In this application, while generating the remapping table, the master device reallocates the logical addresses of the memory space for the task data of every computing task, ensuring that each piece of task data has contiguous logical addresses, rather than allocating based on the relationships between tasks. For example, in Table 1, for task 1 the master device allocates logical pages of the memory space to each piece of task 1's task data in order, starting from logical page 0. For task 2, the master device likewise allocates logical pages to each piece of task 2's task data in order, starting again from logical page 0, instead of directly assigning logical pages 19-25 to task 2's input data based on the data dependency between task 1 and task 2, which would leave task 2's calculation parameters without contiguous logical addresses.
  • In this embodiment, by computing the remapping table the master device plans the logical address of each piece of task data of each computing task in advance, so that every computing task has contiguous logical addresses for reading and writing its task data during execution, avoiding address fragmentation and ensuring that the M computing tasks can complete successfully.
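  • A minimal sketch of this per-task reallocation follows, assuming a simple bump allocator that restarts at logical page 0 for every task (the patent does not prescribe a particular allocator):

```python
# Sketch of the per-task reallocation described above: every task's data is
# laid out contiguously from logical page 0, regardless of where the data
# physically lives. The bump allocator is an illustrative assumption.

def assign_logical_pages(task_data_sizes):
    """Assign contiguous logical pages to each task datum, starting at page 0."""
    layout, next_page = {}, 0
    for name, size in task_data_sizes:
        layout[name] = (next_page, next_page + size - 1)
        next_page += size
    return layout

# Task 2 from Table 1: 7 pages of input, 7 of output, 36 of parameters.
print(assign_logical_pages([("input", 7), ("output", 7), ("params", 36)]))
# {'input': (0, 6), 'output': (7, 13), 'params': (14, 49)}
```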
  • Step 302: the master device sends the remapping table to the N slave devices.
  • The N slave devices are the slave devices that execute the M computing tasks, and each of them executes at least one of the M computing tasks.
  • The master device can determine the N slave devices (N ≤ M, N an integer greater than or equal to 1) that execute the M computing tasks, and the computing tasks each slave device executes, based on information such as the data dependency relationships among the M computing tasks, the data sizes of the task data, and the load status of each slave device's computing resources.
  • For example, when the M computing tasks share the same input data and calculation parameters, the master device may allocate the M computing tasks to M slave devices for parallel execution.
  • Or, when the output data of each computing task is the input data of the next, the master device may allocate the M computing tasks to the same slave device for serial execution.
  • While scheduling the M computing tasks, the master device may carry the remapping table in the scheduling information sent to the N slave devices, or it may send the remapping table to them through separate signaling.
  • Step 303: before each computing task is executed, the master device adjusts, according to the logical addresses of that computing task indicated in the remapping table, the logical addresses corresponding to the physical addresses of the requested memory space, updating the correspondence between the memory space's physical and logical addresses. That is, the logical address corresponding to each physical address changes according to the remapping table of step 301.
  • Step 304: when executing the computing task, the slave device reads or writes the task data in the first memory according to the logical addresses indicated in the remapping table.
  • Specifically, each time the master device updates the correspondence between physical and logical addresses, it sends the updated correspondence to the SMMU, so that when the SMMU receives a read or write instruction sent by the master or a slave device, it can map logical addresses to physical addresses according to the updated correspondence.
  • Correspondingly, a slave device can send a read or write instruction to the SMMU, and the SMMU converts the logical address in the instruction into a physical address according to the correspondence sent by the master device, thereby accessing the corresponding physical address, so that the slave device can read or write the task data at that physical address.
  • It should be noted that before the M computing tasks execute, the memory space is still unused, that is, every physical address within it is writable. So, before the first of the M computing tasks is executed, the master device can send to the SMMU, as the updated correspondence, the correspondence between the memory space's logical pages and physical pages carried in the memory information used when generating the remapping table in step 301 (as shown in Table 2, which maps logical pages 0, 1, ..., 49, 50 one-to-one to physical pages 0, 1, ..., 49, 50).
  • Illustratively, take task 1 and task 2 of Table 1: assume slave device 1 executes both tasks, and task 2's input data is task 1's output data. Assume the updated correspondence the master device sends to the SMMU is still as shown in Table 2.
  • Before slave device 1 executes task 1, the master device sends the SMMU a write instruction that includes task 1's three pieces of input data (the input data with input identifiers 101, 102, and 103, hereinafter input data 101, input data 102, and input data 103) and three calculation parameters (the calculation parameters with weight identifiers 111, 112, and 113, hereinafter calculation parameter 111, calculation parameter 112, and calculation parameter 113), together with the logical addresses indicated in the remapping table for input data 101, input data 102, input data 103, calculation parameter 111, calculation parameter 112, and calculation parameter 113, namely logical pages 0-18.
  • The SMMU maps the logical addresses in the write instruction to physical addresses, namely physical pages 0-18, according to the correspondence shown in Table 2 recorded in its cached page table.
  • The write instruction is then sent to the first memory which, according to the instruction, stores input data 101 to physical page 0, input data 102 to physical page 1, input data 103 to physical pages 2-3, calculation parameter 111 to physical pages 4-6, calculation parameter 112 to physical pages 7-11, and calculation parameter 113 to physical pages 12-18.
  • When executing task 1, slave device 1 sends a read instruction to the SMMU that includes the logical addresses indicated in the remapping table for input data 101, input data 102, input data 103, calculation parameter 111, calculation parameter 112, and calculation parameter 113, namely logical pages 0-18.
  • The SMMU maps the logical addresses in the read instruction to physical addresses, namely physical pages 0-18, according to the correspondence shown in Table 2 recorded in the page table, and then sends the read instruction to the first memory to access physical pages 0-18.
  • According to the read instruction, the first memory outputs the data stored in physical pages 0-18, comprising input data 101, input data 102, input data 103, calculation parameter 111, calculation parameter 112, and calculation parameter 113, to slave device 1.
  • Slave device 1 can read one piece of input data or one calculation parameter with a single read instruction, or read all of task 1's input data or calculation parameters with a single read instruction.
  • Slave device 1 performs the computation based on input data 101, input data 102, and input data 103 and calculation parameters 111, 112, and 113, obtaining task 1's three pieces of output data (the output data with output identifiers 121, 122, and 123, hereinafter output data 121, output data 122, and output data 123).
  • After completing task 1's computation, slave device 1 sends the SMMU a write instruction that includes output data 121, output data 122, and output data 123, together with their logical addresses indicated in the remapping table, namely logical pages 19-25.
  • The SMMU maps the logical addresses in the write instruction to physical addresses, namely physical pages 19-25, according to the correspondence shown in Table 2 recorded in the page table, then sends the write instruction to the first memory, which writes output data 121, output data 122, and output data 123 to physical pages 19-25 according to the instruction.
  • At this point, under the correspondence of Table 2: the logical address of task 2's input data 201 is logical pages 19-20 and its physical address is physical pages 19-20; the logical address of input data 202 is logical pages 21-23 and its physical address is physical pages 21-23; and the logical address of input data 203 is logical pages 24-25 and its physical address is physical pages 24-25.
  • Before slave device 1 executes task 2, the master device needs to adjust once more, according to the logical addresses of task 2's task data indicated in Table 1, the logical addresses corresponding to the physical addresses of the memory space, and update the correspondence between the memory space's physical and logical addresses. That is, the one-to-one correspondence between logical pages 0-50 and physical pages 0-50 is re-established so that, as indicated in Table 1, the logical addresses of task 2's input data start from logical page 0, and slave device 1 can accurately access, via logical pages 0-6, the physical addresses where task 2's input data is stored, namely physical pages 19-25. At the same time, it is ensured that in the updated correspondence the physical pages corresponding to logical pages 7-50 are in a writable state.
  • Assume the master device adjusts the logical addresses corresponding to the memory space's physical addresses according to the logical addresses of task 2's task data indicated in Table 1; the updated correspondence obtained is as shown in Table 3 below.
  • In step 303, the master device adjusts the logical addresses corresponding to the physical addresses so that the logical addresses of task 2's input data again start from 0, which requires adjusting the original Table 2 into Table 3. That is, after task 1 finishes, the logical address corresponding to physical pages 19-25, which hold task 2's input data, was originally logical pages 19-25; during the adjustment of the memory space's physical-to-logical correspondence according to the remapping table of step 301, the logical address corresponding to physical pages 19-25 is changed to logical pages 0-6.
  • In step 304, when slave device 1 executes task 2, it can access physical pages 19-25 through the SMMU according to logical pages 0-6 to read task 2's input data. That is, after task 1 ends and before task 2 executes, in step 303 the master device adjusts the logical addresses corresponding to the memory space's physical addresses according to the remapping table of step 301, updating the original Table 2 to Table 3, and provides Table 3 to the SMMU.
  • When slave device 1 executes task 2 using the remapping table, it starts from logical page 0, and the corresponding actual physical page 19 holds the output data of the previous task 1, that is, the input data of task 2. Slave device 1 can thus continue to use task 1's output as its input, but the logical addresses start from logical page 0 rather than logical page 19, ensuring that the logical addresses used by the subsequent series of operations are contiguous. This prevents logical address fragmentation and makes it easier for the master device or slave device 1 to manage and maintain logical addresses.
  • Specifically, the master device may send Table 3 to the SMMU, so that the SMMU updates the correspondence between the memory space's logical and physical pages recorded in its cached page table according to Table 3 and thereafter uses the updated logical-to-physical mapping, ensuring address correctness. In Table 3, after the master device adjusts the logical addresses corresponding to the memory space's physical addresses according to the logical addresses of task 2's task data indicated in Table 1, logical pages 0-31 correspond one-to-one, in order, to physical pages 19-50, and logical pages 32-50 correspond, in order, to physical pages 0-18.
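  • Modelling the adjustment from Table 2 to Table 3 as a rotation of the identity mapping matches this worked example; the following sketch is an illustrative assumption, not the patent's stated algorithm:

```python
# Sketch of the Table 2 -> Table 3 adjustment: the physical pages holding
# task 1's output (19-25) are re-labelled logical pages 0-6, and the
# remaining pages follow in order. The rotation is an assumption that
# reproduces the worked example.

TOTAL_PAGES = 51

def remap_for_task(first_physical_page):
    """Rebuild logical->physical so `first_physical_page` becomes logical page 0."""
    return {logical: (logical + first_physical_page) % TOTAL_PAGES
            for logical in range(TOTAL_PAGES)}

table3 = remap_for_task(19)    # task 2's input data starts at physical page 19
print(table3[0], table3[6])    # 19 25 -- logical pages 0-6 reach the input data
print(table3[14], table3[49])  # 33 17 -- logical pages 14-49 span physical 33-50, 0-17
```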
  • In other words, the logical addresses of task 2's three pieces of input data are mapped to the start of the memory space: the logical address of task 2's input data 201 is logical pages 0-1 and its physical address is physical pages 19-20; the logical address of input data 202 is logical pages 2-4 and its physical address is physical pages 21-23; and the logical address of input data 203 is logical pages 5-6 and its physical address is physical pages 24-25.
  • Correspondingly, slave device 1 can read input data 201 stored in physical pages 19-20 through the SMMU according to logical pages 0-1, read input data 202 stored in physical pages 21-23 through the SMMU according to logical pages 2-4, and read input data 203 stored in physical pages 24-25 through the SMMU according to logical pages 5-6.
  • Because the master device has adjusted the logical addresses corresponding to the memory space's physical addresses according to the logical addresses of task 2's task data indicated in Table 1, apart from the 7 logical pages allocated to task 2's input data (logical pages 0-6), the memory space has 44 contiguous logical pages (logical pages 7-50) that can be assigned to task 2's calculation parameters and output data.
  • Task 2's three calculation parameters (calculation parameters 211, 212, and 213) need to occupy 12, 14, and 10 logical pages respectively, 36 in total. The master device can therefore successfully write task 2's calculation parameters to the first memory through the SMMU according to the logical addresses indicated in the remapping table.
  • Before slave device 1 executes task 2, the master device sends the SMMU a write instruction.
  • The write instruction includes calculation parameters 211, 212, and 213, together with their logical addresses indicated in the remapping table, namely logical pages 14-49.
  • The SMMU maps logical pages 14-49 in the write instruction to physical pages 33-50 and 0-17 according to the correspondence shown in Table 3 recorded in the page table.
  • The write instruction is then sent to the first memory.
  • According to the write instruction, the first memory stores calculation parameter 211 to physical pages 33-44, calculation parameter 212 to physical pages 45-50 and 0-7, and calculation parameter 213 to physical pages 8-17.
  • When executing task 2, slave device 1 sends a read instruction to the SMMU as in step 304 described above.
  • The read instruction includes the logical addresses indicated in the remapping table for input data 201, input data 202, and input data 203 and calculation parameters 211, 212, and 213.
  • The SMMU maps the logical addresses in the read instruction to physical addresses according to the correspondence shown in Table 3 recorded in the cached page table, then sends the read instruction to the first memory so that it can successfully access the corresponding physical addresses.
  • According to the read instruction, the first memory outputs to slave device 1 the input data 201 stored in physical pages 19-20, the input data 202 stored in physical pages 21-23, the input data 203 stored in physical pages 24-25, the calculation parameter 211 stored in physical pages 33-44, the calculation parameter 212 stored in physical pages 45-50 and 0-7, and the calculation parameter 213 stored in physical pages 8-17.
  • Slave device 1 performs the computation on task 2's three pieces of input data and three calculation parameters, obtaining task 2's three pieces of output data (output data 221, output data 222, and output data 223).
  • After completing task 2's computation, slave device 1 sends the SMMU a write instruction.
  • The write instruction includes output data 221, output data 222, and output data 223, together with their logical addresses indicated in the remapping table.
  • The SMMU maps the logical addresses in the write instruction to physical addresses according to the correspondence shown in Table 3 recorded in the page table.
  • The write instruction is then sent to the first memory which, according to the instruction, stores output data 221 to physical pages 26-27, output data 222 to physical pages 28-30, and output data 223 to physical pages 31-32.
  • The usage, during tasks 1 and 2, of the logical addresses of the memory space requested by the master device (logical pages 0-50) can be seen in FIG. 4.
  • When task 1 is executed, task 1's input data occupies 4 logical pages (logical pages 0-3), its calculation parameters occupy 15 logical pages (logical pages 4-18), and its output data occupies 7 logical pages (logical pages 19-25).
  • When task 2 is executed, task 2's input data occupies 7 logical pages (logical pages 0-6), its calculation parameters occupy 36 logical pages (logical pages 14-49), and its output data occupies 7 logical pages (logical pages 7-13).
  • Compared with the allocation shown in FIG. 1, the remapping table remaps the logical and physical addresses of the memory space, avoiding fragmentation of the memory space's logical addresses, thereby ensuring that each piece of task data has contiguous logical addresses available and that the computing tasks execute successfully.
  • Optionally, the master device may also allocate in advance a mirror address on a second memory for each piece of task data of each computing task, carry those mirror addresses in the remapping table, and send them to each slave device for use when the memory space of the first memory is fully occupied; the master and slave devices can then read and write task data on the second memory through the task data's mirror addresses, ensuring that the computing tasks complete successfully.
  • For example, when the first memory is shared memory, the second memory may be global memory.
  • In a heterogeneous computing system, the global memory is usually the memory with the largest capacity and the smallest bandwidth.
  • The task data of computing tasks is usually stored in the global memory. Under the conventional mechanism, multiple slave devices must frequently read or write task data in the global memory, which puts bandwidth pressure on the global memory; and because the global memory's bandwidth is small, a slave device spends more time reading or writing task data there, reducing computation efficiency.
  • Although the shared memory has a small capacity, its data path usually has a large bandwidth. Therefore, if a network task can complete its computation on the shared memory, the time slave devices spend reading and writing task data can be greatly reduced, as can the bandwidth pressure on the global memory.
  • Before a computing task is executed, the master device needs to move part of the task data it requires (for example, the calculation parameters or part of the input data) from the global memory to the shared memory for the slave device to read, so that the slave device does not need to read and write task data on the global memory, thereby improving the computation efficiency of network tasks.
  • Therefore, when the memory management method provided by this application is used to manage shared memory (that is, when the first memory is shared memory), it can effectively avoid fragmentation of the shared memory's logical addresses and improve the shared memory's utilization, ensuring that network tasks can complete their computation on the shared memory and reducing the bandwidth pressure on the global memory.
  • Table 4 compares the computation cycles, read bandwidth (the bandwidth occupied when reading data from global memory during network task execution), and write bandwidth (the bandwidth occupied when writing data to global memory during network task execution) when network tasks are executed using the conventional mechanism versus the remapping mechanism.
  • The computation cycles, read bandwidth, and write bandwidth in Table 4 were all collected while executing the fusion task of a scene classification network (scene classify network) under the conventional mechanism and under the remapping mechanism.
  • Under the conventional mechanism, the master and slave devices execute network tasks on the global memory, and logical addresses on the global memory are allocated independently for each computing task of the network task, without remapping of logical addresses.
  • Under the remapping mechanism, the logical addresses corresponding to the physical addresses of the memory space requested on the shared memory are adjusted according to the memory management method provided by this application, so that the slave and master devices can execute network tasks smoothly on the shared memory (that is, the first memory). From the data recorded in Table 4 it can be seen that, compared with the conventional mechanism, the remapping mechanism greatly reduces the computation cycles of network tasks and improves their computation efficiency. Moreover, because the memory management method of this application uses the shared memory efficiently, the slave devices do not need to read and write task data on the global memory while executing network tasks, reducing the bandwidth pressure on the global memory.
  • As shown in FIG. 5, the memory management device includes a first module and N second modules. For example, the first module may be driver software of the master device in a heterogeneous computing system.
  • When the master device runs the first module, the master device can implement the process performed by the master device in the foregoing embodiment.
  • The N second modules correspond one-to-one to the N slave devices in the heterogeneous computing system.
  • A second module may be driver software of a slave device.
  • When a slave device runs its second module, the slave device can implement the process performed by the slave device in the foregoing embodiment.
  • The first module is configured to generate a remapping table and send it to at least one of the N second modules.
  • The remapping table includes multiple logical addresses of the computing task to be processed by the at least one second module; these are contiguous, uninterrupted logical addresses within the memory space of the first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the first module adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task.
  • The at least one second module is configured to, when the computing task is executed, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.
  • In an optional implementation, the first module is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses; the SMMU is configured to receive the correspondence sent by the first module and, upon receiving from the first module or at least one second module an instruction carrying a to-be-operated logical address of the memory space, convert that logical address into a physical address according to the correspondence, so that the first module or second module that sent the instruction accesses the physical address; the instruction is a read instruction or a write instruction.
  • In an optional implementation, before generating the remapping table, the first module is further configured to request the memory space from the first memory and determine task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task; and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.
  • the multiple logical addresses start from the starting logical address of the memory space.
  • In an optional implementation, the data dependency relationship includes: the computing task shares input data and calculation parameters with its previous computing task; or, the output data of the previous computing task is the input data of the computing task.
  • In an optional implementation, at least one of the physical address of the computing task's input data and the physical address of its calculation parameters is determined in at least one of the following ways: the physical address of the input data is the physical address of the previous computing task's input data; the physical address of the input data is the physical address of the previous computing task's output data; or the physical address of the calculation parameters is the physical address of the previous computing task's calculation parameters.
  • This application further provides a computer storage medium storing the computer instructions of the first module and the computer instructions of each of the N second modules.
  • Optionally, the computer storage medium stores a computer program product for executing the method described above.
  • The computer storage medium may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM).
  • The memory may also be a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • The memory management method provided by the present application can be implemented in the form of computer software, hardware, or a combination of hardware and computer software, as described above. Whether a function is executed by hardware or by hardware driven by computer software depends on the specific application and the design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Memory System (AREA)

Abstract

A heterogeneous computing system and a memory management method, relating to the field of computer technology and capable of reducing logical address fragmentation in memory. In the heterogeneous computing system, a master device is configured to generate a remapping table, send the remapping table to at least one slave device, and, according to the contiguous, uninterrupted multiple logical addresses of the computing task to be processed by the at least one slave device indicated in the remapping table, adjust the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space of a first memory, where the multiple logical addresses include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data; the at least one slave device is configured to, when executing the computing task, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.

Description

Heterogeneous computing system and memory management method

Technical Field

This application relates to the field of computer technology, and in particular to a heterogeneous computing system and a memory management method.

Background

In a heterogeneous computing system, a master device distributes the M computing tasks of a network task to multiple slave devices, and the slave devices complete the computation of those M computing tasks. Typically, the master device and the slave devices share a shared memory. Before distributing the computing tasks, the master device requests memory space from the shared memory to store the task data generated while the M computing tasks execute, for example input data, calculation parameters, and output data.

At present, address allocation within the memory space is performed independently for each computing task. For example, assume the memory space requested by the master device is 51 logical pages (logical pages 0-50). The input data of task 1 needs 4 logical pages, its calculation parameters need 18 logical pages, and its output data needs 7 logical pages. The input data of task 2 is the output data of task 1 and needs 7 logical pages; its calculation parameters need 36 logical pages and its output data needs 7 logical pages. During the execution of task 1, the slave device first applies to the shared memory for logical addresses, and the shared memory completes the allocation according to the size of task 1's task data. As shown in FIG. 1, task 1's input data occupies logical pages 0-3, its calculation parameters occupy logical pages 4-21, and its output data occupies logical pages 22-28. After task 1 completes, its input data and calculation parameters are released, and its output data, as task 2's input data, still occupies logical pages 22-28. At this point, logical pages 0-21 and 29-50 of the memory space are free. During the execution of task 2, when the slave device applies to the shared memory for logical addresses, task 2's calculation parameters need 36 logical pages, but no run of 36 contiguous logical pages exists in the memory space, only two fragments of free logical addresses (logical pages 0-21 and 29-50). This may make it difficult for the slave device or its software to handle the fragmented logical addresses and can easily cause task 2 to fail. How to reduce logical address fragmentation in memory therefore becomes a problem.

Summary

This application provides a heterogeneous computing system and a memory management method that can reduce logical address fragmentation in memory and improve memory utilization.
In a first aspect, this application provides a heterogeneous computing system including a master device and N slave devices, N being an integer greater than or equal to 1. The master device is configured to generate a remapping table and send it to at least one of the N slave devices, the remapping table including multiple logical addresses of the computing task to be processed by the at least one slave device; these are contiguous, uninterrupted logical addresses within the memory space of a first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the master device adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task. The at least one slave device is configured to, when executing the computing task, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.

In the heterogeneous computing system provided by this application, the master device computes the remapping table to plan in advance the logical addresses of each computing task's input data, output data, and calculation parameters, so that while executing a computing task the at least one slave device has contiguous logical addresses in the memory space from which to read input data and calculation parameters and to which to write output data, avoiding logical address fragmentation and improving the utilization of the first memory.

In an optional implementation, the heterogeneous computing system further includes a system memory management unit (SMMU). The master device is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses. The SMMU is configured to receive the correspondence sent by the master device and, upon receiving from the master device or the at least one slave device an instruction carrying a to-be-operated logical address of the memory space, convert that logical address into a physical address according to the correspondence, so that the master device or slave device that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.

Based on this optional implementation, by sending the SMMU the correspondence between the multiple logical addresses and the physical addresses of the input data, calculation parameters, and output data obtained after adjusting the multiple initial logical addresses, the master device enables the SMMU to perform logical-to-physical address translation according to that correspondence, ensuring that the slave device can successfully access, according to the multiple logical addresses indicated in the remapping table, the physical addresses of the input data, the calculation parameters, and the output data in the memory space.

In an optional implementation, before generating the remapping table, the master device is further configured to request the memory space from the first memory and determine task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task, and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.

In an optional implementation, the master device includes a central processing unit, and the slave devices include at least one of the following: a graphics processor, a neural processor, or a field-programmable gate array.

In a second aspect, this application provides a memory management method applied to the heterogeneous computing system of the first aspect. The method includes: the master device generates a remapping table and sends it to at least one of the N slave devices, the remapping table including multiple logical addresses of the computing task to be processed by the at least one slave device; these are contiguous, uninterrupted logical addresses within the memory space of the first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the master device adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task. When executing the computing task, the at least one slave device reads the input data and the calculation parameters from the memory space and writes the output data according to the multiple logical addresses indicated by the remapping table.

In an optional implementation, the method further includes: the master device sends to the system memory management unit (SMMU) the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses; the SMMU receives the correspondence sent by the master device and, upon receiving from the master device or at least one slave device an instruction carrying a to-be-operated logical address of the memory space, converts that logical address into a physical address according to the correspondence, so that the master device or slave device that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.

In an optional implementation, before the master device generates the remapping table, the method further includes: the master device requests the memory space from the first memory and determines task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task, the task data including input data, calculation parameters, and output data; the master device generating the remapping table includes: the master device generates the remapping table according to the task information.

For the technical effects of the memory management device provided by this application, refer to the technical effects of the first aspect and its optional implementations above, which are not repeated here.
In a third aspect, this application provides a memory management device including a first module and N second modules, N being an integer greater than or equal to 1. For example, the first module may be driver software of the master device in a heterogeneous computing system; when the master device runs the first module, the master device can implement the corresponding steps of the memory management method described in the second aspect. The N second modules correspond one-to-one to the N slave devices in the heterogeneous computing system; a second module may be driver software of a slave device, and when a slave device runs its second module, the slave device can implement the corresponding steps of the memory management method described in the second aspect. Alternatively, the first module and the N second modules may be implemented in hardware or in a combination of software and hardware.

In the third aspect, the first module is configured to generate a remapping table and send it to at least one of the N second modules, the remapping table including multiple logical addresses of the computing task to be processed by the at least one second module; these are contiguous, uninterrupted logical addresses within the memory space of the first memory and include the logical address of the computing task's input data, the logical address of its calculation parameters, and the logical address of its output data, the computing task being a subtask of a neural network or of artificial intelligence. According to the multiple logical addresses indicated in the remapping table, the first module adjusts the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous computing task. The at least one second module is configured to, when the computing task is executed, read the input data and calculation parameters from the memory space and write the output data according to the multiple logical addresses indicated by the remapping table.

In an optional implementation, the first module is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after adjusting the multiple initial logical addresses; the SMMU is configured to receive the correspondence sent by the first module and, upon receiving from the first module or at least one second module an instruction carrying a to-be-operated logical address of the memory space, convert that logical address into a physical address according to the correspondence, so that the first module or second module that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.

In an optional implementation, before generating the remapping table, the first module is further configured to request the memory space from the first memory and determine task information, the task information including the data size of the computing task's task data and the data dependency relationship between the computing task and the previous computing task, and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.

Based on the first through third aspects above, in an optional implementation, the multiple logical addresses in the remapping table start from the starting logical address of the memory space.

In an optional implementation, the data dependency relationship includes: the computing task shares input data and calculation parameters with its previous computing task; or, the output data of the previous computing task is the input data of the computing task.

In an optional implementation, at least one of the physical address of the computing task's input data and the physical address of its calculation parameters is determined in at least one of the following ways: the physical address of the input data is the physical address of the previous computing task's input data; the physical address of the input data is the physical address of the previous computing task's output data; or the physical address of the calculation parameters is the physical address of the previous computing task's calculation parameters.

In a fourth aspect, this application provides a computer storage medium storing, for example, the computer instructions of the first module of the third aspect and the computer instructions of each of the N second modules. Optionally, the computer program product stored in the computer storage medium is used to execute the method shown in any of the second aspect and its optional implementations.

In a fifth aspect, this application provides a computer program product including, for example, the software package of the first module of the third aspect and the software package of each of the N second modules. Optionally, the computer program product is used to execute the method shown in any of the second aspect and its optional implementations.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the usage of the logical addresses of a memory space in the prior art;

FIG. 2 is a schematic structural diagram of a heterogeneous computing system provided by this application;

FIG. 3 is a flowchart of an embodiment of a memory management method provided by this application;

FIG. 4 is a schematic diagram of the usage of the logical addresses of a memory space provided by this application;

FIG. 5 is a schematic structural diagram of a memory management device provided by this application.

Detailed Description

First, when this application uses ordinal terms such as "first", "second", or "third", they should be understood as merely serving to distinguish, unless the context genuinely expresses an order.

Second, the memory management method provided by this application applies to the heterogeneous computing system shown in FIG. 2, which includes a master device; N slave devices, N being an integer greater than or equal to 1 (denoted slave device 1, slave device 2, ..., slave device N); a first memory; and a system memory management unit (SMMU). The master device may be a central processing unit (CPU); it is the main control end of the heterogeneous computing system and is used, after obtaining computing tasks, to allocate memory space for each computing task in the first memory and distribute the computing tasks to multiple slave devices. A slave device may be a graphics processing unit (GPU), a neural processing unit (NPU), a field programmable gate array (FPGA), or a similar component; the slave devices are the main computing components of the heterogeneous computing system and are used to compute the computing tasks issued by the master device. At least one of the master device and the slave devices can execute software to realize its computation or processing functions; related software includes, but is not limited to, at least one of driver software, platform software, operating system software, or application software.

The first memory may be any memory in the heterogeneous computing system that stores the task data of computing tasks, such as global memory, shared memory, local memory, or register memory.

The SMMU is a memory management unit that provides virtual address management and mapping for the master and slave devices, so that a slave device does not need to go through the master device when reading or writing data.
基于如图2所示的异构计算系统,如图3所示为本申请提供的一种内存管理方法的一个实施例,该方法可以包括:
Step 301: The master device generates a remapping table (remap table). The remapping table includes the continuous logical addresses allocated to the task data of each of M calculation tasks; that is, the logical addresses of each piece of task data are not fragmented. The M calculation tasks are M subtasks in a neural network or artificial intelligence, where M is an integer greater than or equal to 1. For example, in a fused calculation task of a neural network, each single layer or each set of fusion layers among the multiple layers of the neural network is one calculation task. Alternatively, in a fusion kernel task in the Compute Unified Device Architecture (CUDA), each kernel is one calculation task.
The master device simulates the execution of the M calculation tasks in software and determines task information such as each piece of task data of each calculation task, the data size of each piece of task data, and the data dependency relationships among the M calculation tasks. The task data of one calculation task may include input data, calculation parameters, and output data. A calculation task may take multiple input data and multiple calculation parameters and, after completing its computation on them, produce multiple output data. The data dependency relationships among the M calculation tasks include, but are not limited to: multiple calculation tasks among the M calculation tasks share input data and calculation parameters; or, among the M calculation tasks there are two calculation tasks such that the output data of the one executed first is the input data of the one executed later.
The master device and the multiple slave devices that execute the M calculation tasks share the first memory (for example, the shared memory or global memory in the heterogeneous computing system). When the task data of the M calculation tasks needs to be stored in the first memory, the master device can apply to the first memory for a memory space and determine the memory information of that memory space. The memory information of the memory space may include logical addresses, physical addresses, and so on, used by the slave devices to read and write the task data of the M calculation tasks in the first memory.
Exemplarily, the master device may calculate the remapping table from the task information and the memory information of the memory space. The remapping table may include a task ID, an input ID, a weight ID, an out ID, a memory address, and the like. The task ID identifies each calculation task. The input ID identifies input data. The weight ID identifies calculation parameters, i.e., the weight parameters used for artificial intelligence or neural network computation. The out ID identifies output data. Since each calculation task may have multiple input data, calculation parameters, and output data, the input IDs, weight IDs, and out IDs can be used to label the logical address of each input datum, calculation parameter, and output datum of a calculation task.
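To make the table layout concrete, the following sketch models one row of such a remapping table. It only illustrates the fields named above (task ID, input/weight/out ID, data size, memory address); the concrete types, field widths, and the enum are assumptions for illustration, not part of this application.

#include <cstdint>
#include <vector>

// One row of the remapping table: one piece of task data (an input,
// a calculation parameter, or an output) together with the contiguous
// logical pages assigned to it. The field names follow the fields
// described in the text; the concrete encoding is illustrative only.
enum class DataKind { Input, Weight, Output };

struct RemapEntry {
    uint32_t task_id;     // task ID: identifies the calculation task
    DataKind kind;        // input data, calculation parameter, or output data
    uint32_t data_id;     // input/weight/out ID, e.g. 101-103, 111-113, 121-123
    uint32_t first_page;  // memory address: first logical page of this datum
    uint32_t page_count;  // data size, expressed in logical pages
};

using RemapTable = std::vector<RemapEntry>;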
The memory address is the logical address, in the first memory, of each input datum, each output datum, and each calculation parameter, used by a slave device to read and write the corresponding task data when executing a calculation task. A logical address is the address carried in an instruction from any device that needs to operate on the first memory, i.e., in a read or write instruction issued by the master device or a slave device against the first memory. After receiving a read or write instruction, the SMMU maps the logical address in the instruction to a physical address according to the correspondence (e.g., a mapping table) between the physical addresses and logical addresses of the memory space recorded in the page table cached in the SMMU, so that the read or write instruction can access the corresponding physical address in the first memory. A physical address is an address where data is actually stored in the first memory. Unlike a logical address, a physical address is a real address in the first memory and can actually store data, whereas a logical address is a virtual address that exists to make the address easier for a device or the relevant software to manage. When the first memory is accessed, the logical address operated on by any device or the relevant software needs to be translated by the SMMU into a physical address according to the correspondence, in order to access the real first memory.
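The logical-to-physical translation the SMMU performs can be sketched as a page-table lookup. This is a minimal model assuming a flat page table and a single page granule; the real SMMU page-table format is not specified by this application.

#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr uint64_t kPageSize = 4096;  // assumed page granule

// page_table[logical_page] == physical_page, i.e. the correspondence
// recorded in the page table cached in the SMMU. The SMMU rewrites the
// address carried by a read or write instruction before the instruction
// reaches the first memory.
uint64_t translate(const std::vector<uint32_t>& page_table, uint64_t logical_addr) {
    uint64_t page = logical_addr / kPageSize;
    uint64_t offset = logical_addr % kPageSize;
    if (page >= page_table.size())
        throw std::out_of_range("unmapped logical page");
    return uint64_t(page_table[page]) * kPageSize + offset;
}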
The remapping table may also indicate a data size, recording the data size of each input datum, each output datum, and each calculation parameter. Exemplarily, assume the memory space that the master device applies for on behalf of the M calculation tasks comprises logical pages 0-50, i.e., a memory space of 51 logical pages in total; that the M calculation tasks execute in sequence; and that the output data of each calculation task is the input data of the next. Assume the remapping table the master device generates from the task information of the M calculation tasks and the 51 logical pages of the first memory is as shown in Table 1, where a logical page is a unit of logical address.
Table 1
Task ID   Data item               ID    Data size (logical pages)   Logical pages
1         input data              101   1                           0
1         input data              102   1                           1
1         input data              103   2                           2-3
1         calculation parameter   111   3                           4-6
1         calculation parameter   112   5                           7-11
1         calculation parameter   113   7                           12-18
1         output data             121   2                           19-20
1         output data             122   3                           21-23
1         output data             123   2                           24-25
2         input data              201   2                           0-1
2         input data              202   3                           2-4
2         input data              203   2                           5-6
2         output data             221   2                           7-8
2         output data             222   3                           9-11
2         output data             223   2                           12-13
2         calculation parameter   211   12                          14-25
2         calculation parameter   212   14                          26-39
2         calculation parameter   213   10                          40-49
In this embodiment, for ease of description, the logical addresses in Table 1 are all expressed as logical page numbers. Unless otherwise stated, subsequent embodiments also use the logical page as the unit of logical address.
In this application, while generating the remapping table, the master device reallocates the logical addresses of the memory space for the task data of every calculation task, ensuring that each piece of task data has continuous logical addresses, rather than allocating based on the relationships between tasks. For example, in Table 1, for task 1 the master device allocates the logical pages of the memory space to each piece of task data of task 1 starting from logical page 0; and for task 2 the master device likewise allocates the logical pages of the memory space to each piece of task data of task 2 starting from logical page 0. It does not, based on the data dependency relationship between task 1 and task 2, directly allocate logical pages 19-25 to the input data of task 2, which would leave no continuous logical addresses available for the calculation parameters of task 2. A sketch of this allocation policy follows.
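A minimal sketch of the per-task policy just described, assuming only that each task's data items are assigned pages in the order they are listed:

#include <cstdint>
#include <vector>

// For each calculation task, hand out contiguous logical pages starting
// from page 0, in the order the task's data items are listed. sizes[i]
// is the data size of item i in logical pages; the result gives each
// item's first logical page. The allocation restarts at page 0 for every
// task, regardless of where the previous task's data was placed.
std::vector<uint32_t> assign_from_zero(const std::vector<uint32_t>& sizes) {
    std::vector<uint32_t> first_page(sizes.size());
    uint32_t next = 0;
    for (size_t i = 0; i < sizes.size(); ++i) {
        first_page[i] = next;
        next += sizes[i];
    }
    return first_page;
}

For task 2 in Table 1, the sizes {2, 3, 2, 2, 3, 2, 12, 14, 10} yield first pages 0, 2, 5, 7, 9, 12, 14, 26, 40, i.e. logical pages 0-6 for the input data, 7-13 for the output data, and 14-49 for the calculation parameters.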
In this embodiment, by calculating the remapping table, the master device plans the logical address of every piece of task data of every calculation task in advance, so that each calculation task has continuous logical addresses for reading and writing task data during execution, thereby avoiding address fragmentation and ensuring that the M calculation tasks can complete smoothly.
Step 302: The master device sends the remapping table to the N slave devices, where the N slave devices are the slave devices that execute the M calculation tasks and each of them executes at least one of the M calculation tasks. The master device may determine the N (N ≤ M, N being an integer greater than or equal to 1) slave devices that execute the M calculation tasks, and the calculation tasks executed by each slave device, based on information such as the data dependency relationships among the M calculation tasks, the data sizes of the task data, and the load on each slave device's computing resources.
For example, when the M calculation tasks share the same input data and calculation parameters, the master device may distribute the M calculation tasks to M slave devices to be executed in parallel. Or, when among the M calculation tasks the output data of the task executed first is the input data of the task executed next, the master device may assign the M calculation tasks to the same slave device to be executed serially. The master device may, for example, carry the remapping table in the scheduling information sent to the N slave devices while scheduling the M calculation tasks, or send the remapping table to the N slave devices via separate signaling. The sketch below illustrates this placement decision.
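As a hedged illustration of that decision, assuming only the two dependency patterns named above (real scheduling would also weigh data sizes and per-device load, as noted):

#include <cstdint>
#include <vector>

enum class Dependency { SharedInputsAndParameters, OutputFeedsNextInput };

// Map M calculation tasks to slave devices. Tasks that share the same
// input data and calculation parameters each get their own slave and
// run in parallel; a producer/consumer chain is pinned to one slave
// and runs serially.
std::vector<uint32_t> assign_slaves(uint32_t m, Dependency dep) {
    std::vector<uint32_t> slave_of_task(m);
    for (uint32_t t = 0; t < m; ++t)
        slave_of_task[t] = (dep == Dependency::SharedInputsAndParameters) ? t : 0;
    return slave_of_task;
}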
Step 303: Before each calculation task is executed, the master device adjusts, according to the logical addresses of that calculation task indicated in the remapping table, the logical addresses corresponding to the physical addresses of the applied-for memory space, updating the correspondence between the physical addresses and logical addresses of the memory space. That is, the logical addresses corresponding to the physical addresses change according to the remapping table of step 301.

Step 304: When executing a calculation task, the slave device reads or writes task data in the first memory according to the logical addresses indicated in the remapping table. Specifically, after each update of the correspondence between physical addresses and logical addresses, the master device sends the updated correspondence to the SMMU, so that when the SMMU receives a read or write instruction from the master device or a slave device, it can map logical addresses to physical addresses according to the updated correspondence. Correspondingly, a slave device can send a read or write instruction to the SMMU, and the SMMU translates the logical address in the instruction into a physical address according to the correspondence sent by the master device, enabling access to the corresponding physical address so that the slave device can read or write the task data at that physical address. A sketch of this interaction follows.
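Steps 303 and 304 might be modeled as follows. The Smmu type below is a stand-in for the SMMU's cached page table, not a real driver interface; how the master device programs the real SMMU is outside this sketch.

#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the SMMU's cached page table:
// page_table[logical_page] == physical_page.
struct Smmu {
    std::vector<uint32_t> page_table;
    void install(std::vector<uint32_t> mapping) { page_table = std::move(mapping); }
    uint32_t to_physical(uint32_t logical_page) const { return page_table[logical_page]; }
};

// Step 303: before a task runs, the master device recomputes which
// logical page names each physical page (so that the pages already
// holding the task's input data appear at the logical addresses the
// remapping table promises) and publishes the result to the SMMU.
void prepare_task(Smmu& smmu, std::vector<uint32_t> updated_mapping) {
    smmu.install(std::move(updated_mapping));
}

// Step 304: the slave's read/write instructions carry logical pages;
// the SMMU resolves them against the installed correspondence.
uint32_t resolve(const Smmu& smmu, uint32_t logical_page) {
    return smmu.to_physical(logical_page);
}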
Exemplarily, take task 1 and task 2 in Table 1, and assume slave device 1 executes both tasks, with the input data of task 2 being the output data of task 1. For ease of description, logical pages (for logical addresses) and physical pages (for physical addresses) are used. Assume that, before the M calculation tasks start executing, the correspondence between the logical pages and physical pages of the memory space recorded in the page table cached in the SMMU is as shown in Table 2.
Table 2

Logical page:   0   1   ...   49   50
Physical page:  0   1   ...   49   50
It should be noted that before the M calculation tasks execute, the memory space is not yet in use, i.e., every physical address in the memory space is writable. Therefore, before the first of the M calculation tasks is executed, the master device can send to the SMMU, as the updated correspondence, the correspondence between the logical pages and physical pages of the memory space carried in the memory information used when generating the remapping table in step 301 (as shown in Table 2).
Assume the updated correspondence sent by the master device to the SMMU is still as shown in Table 2. Then, before slave device 1 executes task 1, the master device sends a write instruction to the SMMU. The write instruction includes the 3 input data of task 1 (the input data with input IDs 101, 102, and 103, hereafter input data 101, input data 102, and input data 103) and the 3 calculation parameters (those with weight IDs 111, 112, and 113, hereafter calculation parameters 111, 112, and 113), together with the logical addresses of input data 101-103 and calculation parameters 111-113 indicated in the remapping table, i.e., logical pages 0-18. According to the correspondence shown in Table 2 recorded in the cached page table, the SMMU maps the logical addresses in the write instruction to physical addresses, i.e., physical pages 0-18, and then sends the write instruction to the first memory. Following the write instruction, the first memory stores input data 101 in physical page 0, input data 102 in physical page 1, input data 103 in physical pages 2-3, calculation parameter 111 in physical pages 4-6, calculation parameter 112 in physical pages 7-11, and calculation parameter 113 in physical pages 12-18.
When executing task 1, slave device 1 sends a read instruction to the SMMU. The read instruction includes the logical addresses of input data 101, 102, and 103 and calculation parameters 111, 112, and 113 indicated in the remapping table, i.e., logical pages 0-18. According to the correspondence shown in Table 2 recorded in the page table, the SMMU maps the logical addresses in the read instruction to physical addresses, i.e., physical pages 0-18, and then sends the read instruction to the first memory to access physical pages 0-18. The first memory returns the data stored in physical pages 0-18, including input data 101, 102, and 103 and calculation parameters 111, 112, and 113, to slave device 1. Slave device 1 may read a single input datum or calculation parameter with one read instruction, or read all the input data or calculation parameters of task 1 with one read instruction.
Slave device 1 performs the computation using input data 101, 102, and 103 and calculation parameters 111, 112, and 113, obtaining the 3 output data of task 1 (the output data with out IDs 121, 122, and 123, hereafter output data 121, 122, and 123). Slave device 1 sends a write instruction to the SMMU; the write instruction includes output data 121, 122, and 123 together with the logical addresses of output data 121, 122, and 123 indicated in the remapping table, i.e., logical pages 19-25. According to the correspondence shown in Table 2 recorded in the page table, the SMMU maps the logical addresses in the write instruction to physical addresses, i.e., physical pages 19-25, and then sends the write instruction to the first memory, which writes output data 121, 122, and 123 into physical pages 19-25.
During the execution of task 1, the input data and calculation parameters of task 1 are deleted by the first memory once they have been read out. Therefore, when task 1 finishes, physical pages 0-18 of the memory space are released. At this point, physical pages 19-25 of the memory space hold the 3 output data of task 1, and the input data 201, 202, and 203 of the next task to execute, task 2, are respectively the output data 121, 122, and 123 of task 1. That is, at this point, the logical address of task 2's input data 201 is logical pages 19-20 and its physical address is physical pages 19-20; the logical address of input data 202 is logical pages 21-23 and its physical address is physical pages 21-23; and the logical address of input data 203 is logical pages 24-25 and its physical address is physical pages 24-25.
Before slave device 1 executes task 2, the master device needs to adjust, according to the logical addresses of task 2's task data indicated in Table 1, the logical addresses corresponding to the physical addresses of the memory space, updating the correspondence between the physical addresses and logical addresses of the memory space. That is, it re-establishes the one-to-one correspondence between logical addresses 0-50 and physical addresses 0-50, ensuring that the logical addresses of task 2's input data start from logical page 0 as indicated in Table 1, and that slave device 1 can, via logical pages 0-6, accurately access the physical addresses storing task 2's input data, i.e., physical pages 19-25, while also ensuring that in the updated correspondence the physical pages corresponding to logical pages 7-50 are writable. Exemplarily, after the master device adjusts the logical addresses corresponding to the physical addresses of the memory space according to the logical addresses of task 2's task data indicated in Table 1, the resulting updated correspondence may be as shown in Table 3.
Table 3

Logical page:   0    1    ...   31   32   33   ...   50
Physical page:  19   20   ...   50   0    1    ...   18
Because in step 303 the master device adjusts the logical addresses corresponding to the physical addresses so that the logical addresses of task 2's input data still start from 0, the original Table 2 needs to be adjusted into Table 3. That is, after task 1 executes, physical pages 19-25, which hold task 2's input data, originally correspond to logical pages 19-25; but in the process of adjusting the logical addresses corresponding to the physical addresses of the memory space according to the remapping table of step 301, the logical addresses corresponding to physical pages 19-25 are adjusted to logical pages 0-6. Therefore, in the subsequent step 304, when slave device 1 executes task 2, it can access physical pages 19-25 through the SMMU via logical pages 0-6 to read task 2's input data. In other words, after task 1 ends and before task 2 is executed, the master device, in step 303 and following the remapping table of step 301, adjusts the logical addresses corresponding to the physical addresses of the memory space once, updating the original Table 2 into Table 3 and providing Table 3 to the SMMU. In the subsequent operation of step 304, slave device 1 executes task 2 using the remapping table, starting from logical page 0; the corresponding actual physical page 19 holds the output data of the previous task 1, i.e., the input data of the current task 2. For slave device 1, the output data of task 1 can continue to be used as input data, but its logical address starts from logical page 0 rather than from logical page 19, guaranteeing that the logical addresses of the subsequent series of operations are continuous, preventing logical address fragmentation, and making it easy for the master device or slave device 1 to manage and maintain the logical addresses.
In step 303, the master device can send Table 3 to the SMMU, so that the SMMU updates, according to Table 3, the correspondence between the logical pages and physical pages of the memory space recorded in its cached page table and subsequently uses the updated correspondence for logical-to-physical address mapping, guaranteeing address correctness. As Table 3 shows, after the master device adjusts the logical addresses corresponding to the physical addresses of the memory space according to the logical addresses of task 2's task data indicated in Table 1, logical pages 0-31 correspond one-to-one to physical pages 19-50, and logical pages 32-50 correspond one-to-one to physical pages 0-18. No matter how many logical addresses task 2 needs to occupy, as long as the number required is less than the total number of logical pages (e.g., the total size of the first memory space), its logical addresses all start from 0, avoiding fragmentation of the logical addresses. That is to say, after the master device adjusts the logical addresses corresponding to the physical addresses in step 303, the logical addresses of task 2's 3 input data are mapped to the start of the memory space: the logical address of task 2's input data 201 is logical pages 0-1 with physical pages 19-20; input data 202 is at logical pages 2-4 with physical pages 21-23; and input data 203 is at logical pages 5-6 with physical pages 24-25. Thus, when slave device 1 executes task 2 in the subsequent step 304, it can read input data 201 stored in physical pages 19-20 via logical pages 0-1 through the SMMU, read input data 202 stored in physical pages 21-23 via logical pages 2-4 through the SMMU, and read input data 203 stored in physical pages 24-25 via logical pages 5-6 through the SMMU.
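In this example, the adjustment that turns Table 2 into Table 3 amounts to rotating the whole 51-page space so that physical page 19 (the first page holding task 2's input data) becomes logical page 0. A sketch, assuming that rotation strategy; this reproduces the example's observed behavior and is not an algorithm mandated by this application:

#include <cstdint>
#include <vector>

// Build a correspondence in which physical page input_start appears as
// logical page 0 and the rest of the space follows in rotated order.
// For total_pages == 51 and input_start == 19 this reproduces Table 3:
// logical 0-31 -> physical 19-50 and logical 32-50 -> physical 0-18.
std::vector<uint32_t> rotate_mapping(uint32_t total_pages, uint32_t input_start) {
    std::vector<uint32_t> page_table(total_pages);
    for (uint32_t logical = 0; logical < total_pages; ++logical)
        page_table[logical] = (logical + input_start) % total_pages;
    return page_table;
}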
After the master device adjusts the logical addresses corresponding to the physical addresses of the memory space according to the logical addresses of task 2's task data indicated in Table 1, besides the 7 logical pages allocated to task 2's input data (logical pages 0-6), the memory space still has 44 continuous logical pages (logical pages 7-50) that can be allocated to task 2's calculation parameters and output data. Task 2's 3 calculation parameters (calculation parameters 211, 212, and 213) require 12, 14, and 10 logical pages respectively, 36 in total. The master device can therefore successfully write task 2's 3 calculation parameters into the first memory through the SMMU according to the logical addresses indicated in the remapping table.
Specifically, the master device sends a write instruction to the SMMU. The write instruction includes calculation parameters 211, 212, and 213 together with the logical addresses of calculation parameters 211, 212, and 213 indicated in the remapping table: logical pages 14-49. According to the correspondence shown in Table 3 recorded in the page table, the SMMU maps logical pages 14-49 in the write instruction to physical pages 33-50 and 0-17, and then sends the write instruction to the first memory. Following the write instruction, the first memory stores calculation parameter 211 in physical pages 33-44, calculation parameter 212 in physical pages 45-50 and 0-7, and calculation parameter 213 in physical pages 8-17.
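The wrap-around seen here (logical pages 14-49 landing on physical pages 33-50 followed by 0-17) falls directly out of the Table 3 correspondence; a quick check using the rotation from the earlier sketch:

#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t total = 51, input_start = 19;  // the Table 3 rotation
    // Physical pages backing the calculation parameters at logical 14-49.
    for (uint32_t logical = 14; logical <= 49; ++logical) {
        uint32_t physical = (logical + input_start) % total;
        std::printf("logical %2u -> physical %2u\n",
                    (unsigned)logical, (unsigned)physical);
    }
    return 0;  // prints 33..50 followed by 0..17, matching the write above
}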
When executing task 2, slave device 1, in step 304 as described above, sends a read instruction to the SMMU. The read instruction includes the logical addresses of input data 201, 202, and 203 and calculation parameters 211, 212, and 213 indicated in the remapping table. According to the correspondence shown in Table 3 recorded in the cached page table, the SMMU maps the logical addresses in the read instruction to physical addresses and then sends the read instruction to the first memory so that the read instruction can successfully access the corresponding physical addresses. Following the read instruction, the first memory outputs to slave device 1 input data 201 stored in physical pages 19-20, input data 202 stored in physical pages 21-23, input data 203 stored in physical pages 24-25, calculation parameter 211 stored in physical pages 33-44, calculation parameter 212 stored in physical pages 45-50 and 0-7, and calculation parameter 213 stored in physical pages 8-17.
Slave device 1 performs the computation using task 2's 3 input data and 3 calculation parameters, obtaining task 2's 3 output data (output data 221, 222, and 223). Slave device 1 sends a write instruction to the SMMU; the write instruction includes output data 221, 222, and 223 together with their logical addresses indicated in the remapping table. According to the correspondence shown in Table 3 recorded in the page table, the SMMU maps the logical addresses in the write instruction to physical addresses and sends the write instruction to the first memory, which stores output data 221 in physical pages 26-27, output data 222 in physical pages 28-30, and output data 223 in physical pages 31-32.
The usage of the logical addresses (logical pages 0-50) of the memory space applied for by the master device over tasks 1 and 2 is shown in FIG. 4. When task 1 is executed, task 1's input data occupies 4 logical pages (logical pages 0-3), its calculation parameters occupy 15 logical pages (logical pages 4-18), and its output data occupies 7 logical pages (logical pages 19-25). After remapping, when task 2 is executed, task 2's input data occupies 7 logical pages (logical pages 0-6), its calculation parameters occupy 36 logical pages (logical pages 14-49), and its output data occupies 7 logical pages (logical pages 7-13). Compared with the situation shown in FIG. 1, where task 2's input data occupying logical pages 19-25 leaves the calculation parameters without 36 continuous logical addresses to use and causes task 2 to fail, in this application the master device remaps the logical and physical addresses of the memory space according to the remapping table before each calculation task is executed, which avoids fragmentation of the memory space's logical addresses, ensures that every piece of task data has continuous logical addresses available, and guarantees that the calculation tasks execute successfully.
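The failure mode being contrasted can also be checked mechanically. Without remapping, task 2's input data pins pages 19-25, so the largest run of free logical pages is 25 (pages 26-50), short of the 36 pages the calculation parameters need; with remapping, the input data is renamed to pages 0-6 and 44 contiguous pages remain. A sketch of that check, using the page counts of this example:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Length of the largest run of free logical pages, given the half-open
// ranges [begin, end) that are already occupied.
uint32_t largest_free_run(uint32_t total,
                          std::vector<std::pair<uint32_t, uint32_t>> used) {
    std::sort(used.begin(), used.end());
    uint32_t best = 0, cursor = 0;
    for (auto [b, e] : used) {
        if (b > cursor) best = std::max(best, b - cursor);
        cursor = std::max(cursor, e);
    }
    return std::max(best, total - cursor);
}

// Without remapping: largest_free_run(51, {{19, 26}}) == 25 < 36 -> fail.
// With remapping:    largest_free_run(51, {{0, 7}})  == 44 >= 36 -> fits.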
Optionally, while the M calculation tasks are being executed, other data may need to temporarily or persistently occupy the memory space (in the first memory) allocated to the M calculation tasks. When the priority of the other data is higher than that of the task data of the M calculation tasks, to prevent the task data of a calculation task about to be executed from being overwritten by the higher-priority data and causing the calculation task to fail, the master device can also pre-allocate for every piece of task data of every calculation task a mirror address in a second memory, carry each piece of task data's mirror address in the second memory in the remapping table, and send it to every slave device. When the memory space of the first memory is occupied, the master device and the slave devices can then read and write the task data in the second memory via the task data's mirror addresses, ensuring that the calculation tasks complete smoothly. When the first memory is shared memory, the second memory may be global memory.
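A minimal sketch of this mirror-address fallback, assuming a simple "first memory unless evicted, else mirror" rule; the eviction trigger and the second memory's layout are not fixed by this application:

#include <cstdint>

// Each piece of task data carries its address in the first memory and a
// mirror address pre-allocated in the second memory (e.g. global memory
// when the first memory is shared memory). If higher-priority data has
// claimed the first-memory space, reads and writes fall back to the mirror.
struct DataPlacement {
    uint64_t first_mem_addr;   // address in the first memory
    uint64_t mirror_addr;      // mirror address in the second memory
    bool first_mem_evicted;    // set when higher-priority data takes the space
};

uint64_t effective_address(const DataPlacement& d) {
    return d.first_mem_evicted ? d.mirror_addr : d.first_mem_addr;
}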
In a heterogeneous computing system, the global memory is usually the memory with the largest capacity and the smallest bandwidth, and the task data of calculation tasks is usually stored in the global memory. Under the conventional mechanism, multiple slave devices executing calculation tasks need to read or write task data in the global memory at high frequency, which puts pressure on the global memory's bandwidth; and because the global memory's bandwidth is small, the slave devices spend more time reading or writing task data there, lowering computing efficiency. The shared memory, although smaller in capacity, usually has a data path with larger bandwidth, so if a network task can complete its computation in the shared memory, the time the slave devices spend reading and writing task data can be greatly reduced and the bandwidth pressure on the global memory relieved. That is, during the execution of a calculation task, the master device only needs to move the part of the task data the calculation task requires (for example, the calculation parameters, or part of the input data) from the global memory to the shared memory for the slave devices to read; the slave devices do not need to read or write task data in the global memory, which improves the computing efficiency of the network task. Because the shared memory's capacity is small, using the memory management method provided by this application to manage the shared memory (i.e., with the shared memory as the first memory) effectively avoids fragmentation of the shared memory's logical addresses and improves the shared memory's utilization, thereby ensuring that network tasks can complete their computation in the shared memory and relieving the bandwidth pressure on the global memory.
Table 4 compares the compute cycles, read bandwidth (the bandwidth occupied when reading data from the global memory during execution of the network task), and write bandwidth (the bandwidth occupied when writing data to the global memory during execution of the network task) when a network task is executed under the conventional mechanism versus the remapping mechanism. The compute cycles, read bandwidth, and write bandwidth in Table 4 were all collected while executing the fusion task in a scene classify net under the conventional mechanism and the remapping mechanism.
Table 4
(Table 4 is provided as an image in the original publication; it lists the measured compute cycles, read bandwidth, and write bandwidth under the conventional mechanism and the remapping mechanism.)
Here, under the conventional mechanism, the master device and the slave devices execute the network task in the global memory, and the allocation of logical addresses in the global memory is completed independently for each calculation task in the network task, without remapping logical addresses. The remapping mechanism, based on the memory management method provided by this application, adjusts the logical addresses corresponding to the physical addresses of the memory space applied for in the shared memory, so that the slave devices and the master device can execute the network task in the shared memory (i.e., the first memory). The data recorded in Table 4 shows that, compared with the conventional mechanism, the remapping mechanism greatly reduces the compute cycles of the network task and improves its computing efficiency. Moreover, because the memory management method provided by this application makes efficient use of the shared memory, the slave devices do not need to read or write task data in the global memory while executing the network task, which reduces the bandwidth pressure on the global memory.
The following describes a memory management apparatus provided by an embodiment of this application. As shown in FIG. 5, it includes a first module and N second modules (assume second module 1, second module 2, ..., second module N), N being an integer greater than or equal to 1. For example, the first module may be driver software of the master device in the heterogeneous computing system; when the master device runs the first module, the master device can implement the processes executed by the master device in the embodiments above. The N second modules correspond one-to-one to the N slave devices in the heterogeneous computing system; a second module may be driver software of a slave device, and when the slave device runs the second module, the slave device can implement the processes executed by the slave devices in the embodiments above.
In one embodiment, the first module is configured to generate a remapping table and send it to at least one of the N second modules, the remapping table including multiple logical addresses of a calculation task to be processed by the at least one second module, the multiple logical addresses being continuous, uninterrupted logical addresses corresponding to the memory space of the first memory and including the logical address of the input data of the calculation task, the logical address of the calculation parameters, and the logical address of the output data, the calculation task being a subtask in a neural network or artificial intelligence; and to adjust, according to the multiple logical addresses indicated in the remapping table, the multiple initial logical addresses corresponding to the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by the previous calculation task. The at least one second module is configured to, when executing the calculation task, read the input data and calculation parameters from, and write the output data into, the memory space according to the multiple logical addresses indicated by the remapping table.
Optionally, the first module is further configured to send to the SMMU the correspondence between the multiple logical addresses and the physical addresses of the input data, calculation parameters, and output data obtained after the multiple initial logical addresses are adjusted; the SMMU is configured to receive the correspondence sent by the first module and, upon receiving from the first module or at least one second module an instruction carrying a to-be-operated logical address of the memory space, translate the to-be-operated logical address into a physical address according to the correspondence, so that the first module or the at least one second module that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.
In an optional manner, the first module is further configured to, before generating the remapping table, apply to the first memory for the memory space and determine task information, the task information including the data size of the task data of the calculation task and the data dependency relationship between the calculation task and the previous calculation task, and to generate the remapping table according to the task information, the task data including the input data, the calculation parameters, and the output data.
In an optional manner, the multiple logical addresses start from the starting logical address of the memory space.
In an optional manner, the data dependency relationship includes: the calculation task and its previous calculation task share input data and calculation parameters; or, the output data of the previous calculation task is the input data of the calculation task.
In an optional manner, at least one of the physical address of the input data of the calculation task and the physical address of the calculation parameters is determined in at least one of the following ways: the physical address of the input data is the physical address of the input data of the previous calculation task; the physical address of the input data is the physical address of the output data of the previous calculation task; or the physical address of the calculation parameters is the physical address of the calculation parameters of the previous calculation task.
In another embodiment of this application, a computer storage medium is provided, storing the computer instructions of the first module above and the computer instructions of each of the N second modules. For example, the computer storage medium stores a computer program product for performing the method described above. The computer storage medium may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, or an electrically erasable programmable read-only memory (EEPROM). In some scenarios, the storage may also be, but is not limited to, a compact disc read-only memory (CD-ROM) or other optical disc storage (including compressed optical discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Therefore, in combination with the algorithm steps of the examples described in the embodiments disclosed herein, the memory management method provided by this application can be implemented as the computer program software described above, as hardware, or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
For identical or similar parts among the embodiments in this specification, refer to one another. In particular, since the memory management apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief; for the relevant parts, refer to the descriptions in the method embodiment.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of this application in detail. It should be understood that the above are merely specific implementations of this application and are not intended to limit the protection scope of this application; any modification, equivalent replacement, improvement, etc., made on the basis of the technical solutions of this application shall be included within the protection scope of this application.

Claims (13)

  1. A heterogeneous computing system, comprising a master device and N slave devices, N being an integer greater than or equal to 1;
    the master device being configured to generate a remapping table and send the remapping table to at least one of the N slave devices, the remapping table comprising multiple logical addresses of a calculation task to be processed by the at least one slave device, the multiple logical addresses being continuous, uninterrupted logical addresses corresponding to a memory space of a first memory and comprising a logical address of input data of the calculation task, a logical address of calculation parameters, and a logical address of output data, the calculation task being a subtask in a neural network or artificial intelligence; and
    to adjust, according to the multiple logical addresses indicated in the remapping table, multiple initial logical addresses corresponding to a physical address of the input data, a physical address of the calculation parameters, and a physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by a previous calculation task;
    the at least one slave device being configured to, when executing the calculation task, read the input data and the calculation parameters from, and write the output data into, the memory space according to the multiple logical addresses indicated by the remapping table.
  2. The heterogeneous computing system according to claim 1, wherein the multiple logical addresses start from a starting logical address of the memory space.
  3. The heterogeneous computing system according to claim 1 or 2, further comprising a system memory management unit (SMMU), wherein
    the master device is further configured to send to the SMMU a correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after the multiple initial logical addresses are adjusted; and
    the SMMU is configured to receive the correspondence sent by the master device and, upon receiving from the master device or the at least one slave device an instruction carrying a to-be-operated logical address of the memory space, translate the to-be-operated logical address into a physical address according to the correspondence, so that the master device or the at least one slave device that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.
  4. The heterogeneous computing system according to any one of claims 1 to 3, wherein
    the master device is further configured to, before generating the remapping table, apply to the first memory for the memory space and determine task information, the task information comprising a data size of task data of the calculation task and a data dependency relationship between the calculation task and the previous calculation task; and
    to generate the remapping table according to the task information, the task data comprising the input data, the calculation parameters, and the output data.
  5. The heterogeneous computing system according to claim 4, wherein the data dependency relationship comprises:
    the calculation task and the previous calculation task sharing input data and calculation parameters; or
    the output data of the previous calculation task being the input data of the calculation task.
  6. The heterogeneous computing system according to any one of claims 1 to 5, wherein at least one of the physical address of the input data and the physical address of the calculation parameters is determined in at least one of the following ways:
    the physical address of the input data is the physical address of the input data of the previous calculation task;
    the physical address of the input data is the physical address of the output data of the previous calculation task; or
    the physical address of the calculation parameters is the physical address of the calculation parameters of the previous calculation task.
  7. The heterogeneous computing system according to any one of claims 1 to 6, wherein the master device comprises a central processing unit, and the slave devices comprise at least one of the following: a graphics processing unit, a neural processing unit, or a field programmable gate array.
  8. A memory management method, applied to the heterogeneous computing system according to claim 1, the method comprising:
    generating, by the master device, a remapping table, and sending the remapping table to at least one of the N slave devices, the remapping table comprising multiple logical addresses of a calculation task to be processed by the at least one slave device, the multiple logical addresses being continuous, uninterrupted logical addresses corresponding to a memory space of a first memory and comprising a logical address of input data of the calculation task, a logical address of calculation parameters, and a logical address of output data, the calculation task being a subtask in a neural network or artificial intelligence; and
    adjusting, according to the multiple logical addresses indicated in the remapping table, multiple initial logical addresses corresponding to a physical address of the input data, a physical address of the calculation parameters, and a physical address of the output data in the memory space, at least one of the physical address of the input data and the physical address of the calculation parameters being determined by a previous calculation task; and
    when executing the calculation task, reading, by the at least one slave device, the input data and the calculation parameters from, and writing the output data into, the memory space according to the multiple logical addresses indicated by the remapping table.
  9. The method according to claim 8, wherein the multiple logical addresses start from a starting logical address of the memory space.
  10. The method according to claim 8 or 9, further comprising:
    sending, by the master device, to a system memory management unit (SMMU), a correspondence between the multiple logical addresses and the physical address of the input data, the physical address of the calculation parameters, and the physical address of the output data obtained after the multiple initial logical addresses are adjusted; and
    receiving, by the SMMU, the correspondence sent by the master device and, upon receiving from the master device or the at least one slave device an instruction carrying a to-be-operated logical address of the memory space, translating the to-be-operated logical address into a physical address according to the correspondence, so that the master device or the at least one slave device that sent the instruction accesses the physical address, the instruction being a read instruction or a write instruction.
  11. The method according to any one of claims 8 to 10, wherein before the master device generates the remapping table, the method further comprises:
    applying, by the master device, to the first memory for the memory space, and determining task information, the task information comprising a data size of task data of the calculation task and a data dependency relationship between the calculation task and the previous calculation task, the task data comprising the input data, the calculation parameters, and the output data; and
    wherein the generating, by the master device, a remapping table comprises:
    generating, by the master device, the remapping table according to the task information.
  12. The method according to claim 11, wherein the data dependency relationship comprises:
    the calculation task and the previous calculation task sharing input data and calculation parameters; or
    the output data of the previous calculation task being the input data of the calculation task.
  13. The method according to any one of claims 8 to 12, wherein at least one of the physical address of the input data and the physical address of the calculation parameters is determined in at least one of the following ways:
    the physical address of the input data is the physical address of the input data of the previous calculation task;
    the physical address of the input data is the physical address of the output data of the previous calculation task; or
    the physical address of the calculation parameters is the physical address of the calculation parameters of the previous calculation task.
PCT/CN2018/114102 2018-11-06 2018-11-06 Heterogeneous computing system and memory management method WO2020093227A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880095316.XA CN112368686A (zh) 2018-11-06 2018-11-06 Heterogeneous computing system and memory management method
PCT/CN2018/114102 WO2020093227A1 (zh) 2018-11-06 2018-11-06 Heterogeneous computing system and memory management method


Publications (1)

Publication Number Publication Date
WO2020093227A1 true WO2020093227A1 (zh) 2020-05-14

Family

ID=70610751


Country Status (2)

Country Link
CN (1) CN112368686A (zh)
WO (1) WO2020093227A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435153B (zh) * 2021-06-04 2022-07-22 上海天数智芯半导体有限公司 Digital circuit design method for GPU cache subsystem interconnection
CN114492775A (zh) * 2022-01-13 2022-05-13 哲库科技(上海)有限公司 Data processing method and apparatus, neural network accelerator, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5561785A (en) * 1992-10-29 1996-10-01 International Business Machines Corporation System for allocating and returning storage and collecting garbage using subpool of available blocks
CN103077120A (zh) * 2012-12-31 2013-05-01 Neusoft Corporation Address translation method and apparatus for program shared memory
CN103514098A (zh) * 2012-06-29 2014-01-15 EMC Corporation Method and system for reclaiming storage space
CN103970680A (zh) * 2014-04-28 2014-08-06 Shanghai Huawei Technologies Co., Ltd. Memory management method and apparatus, and embedded system


Also Published As

Publication number Publication date
CN112368686A (zh) 2021-02-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18939181; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18939181; Country of ref document: EP; Kind code of ref document: A1)