WO2023142091A1 - Computing task scheduling apparatus, computing apparatus, computing task scheduling method, and computing method - Google Patents

Computing task scheduling apparatus, computing apparatus, computing task scheduling method, and computing method

Info

Publication number
WO2023142091A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing
task
memory
unit
load information
Application number
PCT/CN2022/075123
Other languages
English (en)
French (fr)
Inventor
张龙
郑明
何世明
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202280004975.4A (publication CN116897581A)
Priority to EP22922920.8A (publication EP4426037A1)
Priority to PCT/CN2022/075123 (publication WO2023142091A1)
Publication of WO2023142091A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Definitions

  • the present application relates to the field of computer technology, and in particular to a computing task scheduling device, a computing device, a computing task scheduling method, and a computing method.
  • the Von Neumann and Harvard architectures separate computing from storage.
  • the data required for calculation needs to be loaded from the external memory into the computing core's memory, that is, the cache, and after the calculation is completed, the result needs to be returned from the core memory to the external memory, which increases the power consumption of data transmission during computation.
  • in order to reduce the power consumption of data transmission, computing fusion, near-memory computing, or in-memory computing technology can be used.
  • Computing fusion technology performs multi-step calculations through fusion to reduce the interaction with external memory.
  • computing fusion requires a cache of a certain size in the computing core and fine-grained software partition management, which is complex to implement.
  • Near-memory computing completes calculations near the memory, and in-memory computing is directly calculated inside the memory, thereby reducing the power consumption of data transmission.
  • near-memory computing or in-memory computing generally needs corresponding computing instructions to be added, and the adaptation schemes for different hardware platforms are not uniform, so the complexity is high.
  • the present application provides a storage computing device and a storage computing method, in order to reduce the overhead of data transmission and the complexity of calculation.
  • a computing task scheduling apparatus, including: a task scheduler configured to determine a first computing task for a first computing unit and generate load information of the first computing task, where the load information is used to define the first computing task; and a processor configured to receive the load information from the task scheduler and store the load information at a first address in a memory so as to assign the first computing task to the first computing unit, the first address being a reserved address of the first computing unit, where the processor is coupled to at least one of the memory and the first computing unit through a bus, the first computing unit is tightly coupled with the memory, the tight coupling does not go through any bus, and the first computing unit can access the memory at a speed higher than bus access.
  • the task scheduler can determine the computing task for the first computing unit, so that the software module does not need to be aware of the computing capability of the first computing unit, thereby reducing the complexity of the software.
  • the processor accesses the memory to store the load information of the computing task at a fixed address in the memory, so as to assign the computing task to the first computing unit; because the first computing unit is tightly coupled with the memory, the task computation can be completed quickly without the processor and the first computing unit transmitting task scheduling information through a dedicated bus or interface, which reduces the power consumption and delay of data transmission.
  • the processor is specifically configured to: store the load information at the first address by using a direct memory access (DMA) controller.
  • the processor achieves the above-mentioned beneficial effects by reusing the existing DMA mechanism: by storing the load information at the first address through DMA, the first computing task can be assigned to the first computing unit, saving system overhead and improving computing efficiency.
  • the task scheduler is a dedicated task scheduler other than system software or application software.
  • the task scheduler may be a hardware task scheduler dedicated to task scheduling.
  • the task scheduler is further configured to: receive a computing task sequence from the system software or the application software, and determine the first computing task for the first computing unit in the computing task sequence.
  • the computing task sequence may include one or more computing tasks, and the first computing task may be one computing task or multiple computing tasks, which is not limited in this embodiment of the present application.
  • the task scheduler is further configured to: determine a second computing task for a second computing unit in the computing task sequence, and schedule the second computing task to the second computing unit; where the second computing unit includes at least one of the processor, an image processing unit, an artificial intelligence (AI) processing unit, a digital signal processor, or a dedicated logic circuit, and the second computing unit is coupled with the memory through a bus.
  • when the task scheduler determines in the computing task sequence that the second computing task is not suitable for the first computing unit to complete, it can schedule the second computing task to the second computing unit, which is coupled with the memory through the bus.
  • the second computing task may be other tasks that are not suitable for near-memory computing, in-memory computing, or integration of memory and computing.
  • the task scheduler is specifically configured to: determine the first computing task for the first computing unit in the computing task sequence according to a computing list, where the computing list includes the computing task types supported by the first computing unit.
  • the computing list can alternatively be implemented as a linked list or the like.
  • the computing list can be updated. For example, when the computation types supported by the first computing unit change, the changed types may be added to the computing list so as to update it. Alternatively, when the task scheduler and processor are applied to other computing units, the computation types supported by those units can be added to the computing list. This improves the compatibility of the system.
  • the load information includes at least one of the following information: a data address; a data dimension; or a control command word.
  • load information may also include other information used for computing tasks.
  • the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
  • a computing device, including: a memory; and a first computing unit configured to obtain load information from a first address in the memory and complete a first computing task according to the load information, where the load information is used to define the first computing task and the first address is a reserved address of the first computing unit; the first computing unit is tightly coupled with the memory, the tight coupling does not go through any bus, and the first computing unit can access the memory at a speed higher than bus access; at least one of the memory and the first computing unit is coupled to the processor through the bus.
  • the first calculation unit acquires load information from the memory, and the first calculation unit is tightly coupled with the memory, so that the system overhead required for calculation can be reduced and the calculation efficiency can be improved.
  • the load information includes at least one of the following information: a data address; a data dimension; or a control command word.
  • the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
  • the memory is specifically configured to: write the load information at the first address under the operation of a direct memory access (DMA) controller.
  • the memory can write the load information at the first address through DMA, which saves bus overhead.
  • a computing task scheduling method, including: a task scheduler determines a first computing task for a first computing unit and generates load information of the first computing task, where the load information is used to define the first computing task; a processor receives the load information from the task scheduler and stores the load information at a first address in a memory so as to assign the first computing task to the first computing unit, the first address being a reserved address of the first computing unit, where the processor is coupled to at least one of the memory and the first computing unit through a bus, the first computing unit is tightly coupled with the memory, the tight coupling does not go through any bus, and the first computing unit can access the memory at a speed higher than bus access.
  • the storing of the load information at a first address in the memory so as to assign the first computing task to the first computing unit includes: storing the load information at the first address through a direct memory access (DMA) controller so as to assign the first computing task to the first computing unit.
  • the task scheduler is a dedicated task scheduler other than system software or application software.
  • the task scheduler determining the first computing task for the first computing unit includes: the task scheduler receiving a computing task sequence from the system software or the application software, and determining the first computing task for the first computing unit in the computing task sequence.
  • the method further includes: determining a second computing task for a second computing unit in the computing task sequence, and scheduling the second computing task to the second computing unit; where the second computing unit includes at least one of the processor, an image processing unit, an artificial intelligence (AI) processing unit, a digital signal processor, or a dedicated logic circuit, and the second computing unit is coupled with the memory through a bus.
  • the determining of the first computing task for the first computing unit in the computing task sequence includes: determining, according to a computing list, the first computing task for the first computing unit in the computing task sequence, where the computing list includes the computing task types supported by the first computing unit.
  • the load information includes at least one of the following information: a data address; a data dimension; or a control command word.
  • the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
  • a calculation method, including: a first computing unit obtains load information from a first address in a memory and completes a first computing task according to the load information, where the load information is used to define the first computing task and the first address is a reserved address of the first computing unit; the first computing unit is tightly coupled with the memory, the tight coupling does not go through any bus, and the first computing unit can access the memory at a speed higher than bus access; at least one of the memory and the first computing unit is coupled to the processor through the bus.
  • the load information includes at least one of the following information: a data address; a data dimension; or a control command word.
  • the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
  • the method further includes: the memory writing the load information at the first address under the operation of a direct memory access (DMA) controller.
  • a computer-readable storage medium is provided, storing computer programs or instructions which, when executed by a communication device, cause the computing task scheduling method described in the third aspect and any possible implementation thereof, or the computing method described in the fourth aspect and any possible implementation thereof, to be executed.
  • a computer program product is provided; when the computer program product runs, the computing task scheduling method described in the third aspect and any possible implementation thereof is executed, or the computing method described in the fourth aspect and any possible implementation thereof is executed.
  • a seventh aspect provides a computing system, including the task scheduling apparatus described in the first aspect and any possible implementation thereof, and the computing device described in the second aspect and any possible implementation thereof.
  • Fig. 1 is a schematic block diagram of a computing device provided by an embodiment of the present application.
  • Fig. 2 is a schematic block diagram of a near-storage computing device provided by an embodiment of the present application.
  • Fig. 3 is a schematic block diagram of an apparatus for scheduling computing tasks provided by an embodiment of the present application.
  • Fig. 4 is a schematic diagram of determining a target computing task according to a computing list provided by an embodiment of the present application.
  • Fig. 5 is a schematic block diagram of another computing task scheduling device provided by an embodiment of the present application.
  • Fig. 6 is a schematic flowchart of a computing task scheduling method provided by an embodiment of the present application.
  • Fig. 7 is a schematic flow chart of another calculation task scheduling method provided by an embodiment of the present application.
  • von Neumann or Harvard architectures are architectures that separate computing and storage.
  • the data required for calculation needs to be loaded from the external storage to the computing core, and the calculation result needs to be returned from the core memory to the external storage.
  • most acceleration hardware adopts von Neumann architecture.
  • the computing characteristics of the neural network are both computing-intensive and data-intensive.
  • the computing core has highly data-parallel computing resources and a very large demand for bandwidth; therefore, in the power-consumption breakdown of the overall calculation, the overhead of data transmission is often higher than that of computation.
  • in order to reduce the power consumption overhead of data transmission, the technical solution of computing fusion can be adopted, that is, multi-step calculations are fused to reduce interaction with external memory.
  • Computing fusion can effectively relieve bandwidth pressure and reduce transmission power consumption.
  • computing fusion requires a cache of a certain size (such as static random-access memory, SRAM) in the computing core.
  • computing fusion requires fine software segmentation management, which is complex to implement.
  • Both near-memory computing (NMC) and in-memory computing (IMC) are new architectural technology directions centered on memory. By computing near the memory or directly inside the memory, they break through the limitations of the von Neumann architecture and address the power-consumption overhead of data transmission. Near-memory computing tightly couples memory and computing processors, reduces data transmission delay and power consumption through shorter wires, and improves system energy efficiency. With the development of manufacturing and packaging technology, computing logic and stacked storage are used to construct hybrid computing-storage. In-memory computing is done directly in the memory array, which reduces data transmission between the computing processor and the memory. However, near-memory and in-memory computing technologies are limited by computing characteristics and the design complexity of storage-computing hardware.
  • an embodiment of the present application provides a storage computing device and a storage computing method.
  • the technical solution can further reduce the complexity of implementation while ensuring low power consumption for data transmission.
  • Fig. 1 is a schematic block diagram of a common computing device provided by an embodiment of the present application.
  • the memory 110 writes the data to be calculated into the cache (buffer) 120 through the bus
  • the ordinary computing unit 130 reads the data from the cache 120
  • the ordinary calculation unit 130 completes the calculation operation, writes the calculation result into the cache 120, and writes the data from the cache 120 to the memory 110 through the bus.
  • the cache 120 needs to perform multiple reads and writes
  • the memory 110 needs to perform multiple interactions with the cache 120 through the bus, which makes the system bus overhead larger.
  • Fig. 2 is a schematic block diagram of a near-storage computing device provided by an embodiment of the present application.
  • the near-storage computing unit 150 may be located outside the memory 140 and tightly coupled with the memory 140, so that the near-storage computing unit 150 does not interact with the memory 140 through the bus when performing calculations; instead, data interaction is performed through physical wires or circuit connections. Since the near-storage computing unit 150 and the memory 140 are tightly coupled, the distance between the two is short and the physical wires or circuit connections for transmitting data are short, which reduces the latency and power consumption of data transfers between the near-storage computing unit and the memory and also reduces bus overhead.
  • the near-storage computing unit 150 can be replaced by an in-memory computing unit, and the in-memory computing unit can be located inside the memory 140; for example, the in-memory computing unit can be embedded in the memory 140 as a part of the memory, that is to say, the memory itself has computing capability.
  • the in-memory computing unit can interact with the memory through physical wires or circuit connections.
  • the in-memory computing unit can also directly read the data inside the memory to complete the calculation without using the bus read-write protocol, which saves bus overhead.
  • the near-storage computing unit 150 and memory 140 can also be replaced by a storage-computing integrated unit.
  • the storage-computing integrated unit can not only store data but also complete calculations, thereby saving the bus overhead between computation and storage and reducing the delay and power consumption of data transmission.
  • Fig. 3 is a schematic block diagram of an apparatus for scheduling computing tasks provided by an embodiment of the present application.
  • the apparatus 200 may include a computing service unit 210 , a task scheduler 220 , a processor 230 , a memory 240 and a near storage computing unit 250 .
  • the task scheduler 220 and the processor 230 in the apparatus 200 may be located in one chip, such as a system on chip (SoC).
  • the memory 240 and the near memory computing unit 250 may be located in another chip.
  • the computing business unit 210 is located at the business scheduling layer and belongs to a software module.
  • the computing business unit 210 can be system software or application software; the task scheduler 220 is a hardware scheduler; and the processor 230, the memory 240, and the near-storage computing unit 250 are hardware devices.
  • the processor 230 can run the system software or application software to perform calculation or processing tasks, and the processor can also interact with other hardware devices, such as sending/receiving data or instructions.
  • the memory 240 can be used to store data and can be accessed by other hardware devices, such as the processor 230 .
  • the near memory calculation unit 250 may include a calculation circuit for performing calculation tasks, which may be different from the calculation tasks performed by the processor 230 .
  • the computing business unit 210 sends the compiled computing task sequence to the task scheduler 220; the task scheduler 220 parses the computing task sequence and determines whether each computing task can be performed by near-storage computing.
  • the task scheduler 220 calls the near-storage computing load generation function to generate the load information of the first computing task and schedules the load information to the processor 230; the processor 230 (for example, a CPU) stores the load information of the first computing task at a first address in the memory.
  • the first address is a reserved address for the load-information interaction between the processor 230 and the near-storage computing unit 250; the near-storage computing unit 250 can access the first address to obtain the load information of the target computing task, complete the calculation according to the load information, and store the calculation result in the memory 240.
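The interaction just described — the scheduler produces load information, the processor stores it at the reserved first address, and the near-storage computing unit reads it from that address, computes, and writes the result back to memory — can be sketched as a minimal host-side simulation. The struct layout, the reserved offset, and the command encoding below are illustrative assumptions, not the interface defined by this publication:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical load-information record stored at the reserved address. */
typedef struct {
    uint32_t data_addr;   /* offset of the operand data in memory */
    uint32_t rows, cols;  /* data dimensions                      */
    uint32_t cmd;         /* control command word: 0 = add        */
} load_info_t;

/* Simulated shared memory; the first bytes model the reserved address. */
static uint8_t memory[4096];
#define RESERVED_ADDR 0u

/* Processor side: store the load information at the reserved address. */
static void assign_task(const load_info_t *li) {
    memcpy(&memory[RESERVED_ADDR], li, sizeof *li);
}

/* Near-storage unit side: read load info, compute, write result back. */
static void near_mem_compute(void) {
    load_info_t li;
    memcpy(&li, &memory[RESERVED_ADDR], sizeof li);
    if (li.cmd == 0) {  /* elementwise add of two operand rows */
        const int32_t *a = (const int32_t *)&memory[li.data_addr];
        const int32_t *b = a + li.cols;
        int32_t *out = (int32_t *)&memory[li.data_addr + 2 * li.cols * 4];
        for (uint32_t i = 0; i < li.cols; i++)
            out[i] = a[i] + b[i];
    }
}
```

Note that the processor never hands data to the computing unit directly: both sides only agree on the reserved address, which is what decouples them.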
  • the task scheduler 220 may determine a first computing task for the near-storage computing unit, and generate load information of the first computing task, where the load information is used to define the first computing task.
  • the task scheduler is a dedicated task scheduler other than system software or application software. That is, the task scheduler is a hardware task scheduler specially used for scheduling computing tasks in the apparatus 200 .
  • the processor 230 may receive load information from the task scheduler 220 and store the load information at a first address in the memory 240 to distribute the first computing task to the near-storage computing unit 250, the first address being the reserved address of the near-storage computing unit 250, where the processor 230 is coupled to at least one of the memory 240 and the near-storage computing unit 250 through a bus, the near-storage computing unit 250 is tightly coupled to the memory 240, the tight coupling does not go through any bus, and the near-storage computing unit 250 can access the memory 240 at a higher speed than bus access.
  • the tight coupling is near-memory computing coupling.
  • the load information is used to define the first computing task, and it can be understood that the content in the load information is content required for computing the first computing task, and can be used by a near-storage computing unit to complete the first computing task.
  • the first address in the memory 240 is the reserved address of the near-storage computing unit 250, that is, an area is reserved in the memory 240 for the near-storage computing unit 250, and the load information required by the near-storage computing unit 250 for calculation can be stored in this area.
  • the near memory calculation unit 250 can access the first address to obtain the load information, so as to complete the first calculation task according to the load information.
  • the processor 230 is coupled to the memory 240 through a bus, and the memory 240 is tightly coupled to the near-storage computing unit 250; that is, the near-storage computing unit 250 does not need to pass through any bus to exchange data with the memory, and the speed at which the near-storage computing unit accesses the memory 240 is higher than the speed at which the memory 240 is accessed through the bus.
  • the two can interact through physical wires or circuit connections, so that there is no need to pass through the bus, which can save bus overhead, thereby reducing the delay and power consumption of data transmission.
  • the processor 230 can store the load information at the first address, and the near-storage computing unit 250 obtains the load information from the first address.
  • the processor 230 may also schedule the load information to the near storage computing unit 250 by configuring registers, etc., for example, the processor 230 writes the load information into the register.
  • the near storage calculation unit 250 reads the register, obtains the load information from the register and stores it in the first address, so as to complete the calculation.
  • This register may be located on the same chip as processor 230, such as an SoC.
  • the task scheduler may determine a first computing task for the near-storage computing unit in the computing task sequence.
  • the task scheduler may determine the first computing task according to the type of the first computing task.
  • the task scheduler may pre-store calculation types, and the calculation types may be one or more preset items.
  • the computation types may include matrix calculations, loop calculations, etc., and may be maintained in a computing list or a linked list.
  • the task scheduler determines the first computing task for the near storage computing unit 250 in the computing task sequence according to the computing list. Specifically, when the calculation type of a calculation task is included in the calculation type in the calculation list, it may be determined that the calculation task is the first calculation task.
  • the first computing task may be one computing task, or may be multiple computing tasks, which is not limited in this embodiment of the present application.
  • FIG. 4 is a schematic diagram of determining a first calculation task according to a calculation list provided by an embodiment of the present application.
  • the computing list may include computation type A, computation type B, computation type C, computation type D, etc., of computing tasks
  • the computing task sequence may include computing task 1 (computation type A), computing task 2 (computation type C), computing task 3 (computation type E), computing task 4 (computation type F), and so on.
  • the task scheduler may pre-store the computing list, and after receiving the computing task sequence sent by the computing business unit, the task scheduler may determine the first computing task according to whether the type of each computing task in the sequence is included in the computing list. Continuing to refer to FIG. 4, since the computation types of computing task 1 and computing task 2 in the sequence are included in the computing list, computing task 1 and computing task 2 can be determined as the first computing task.
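The list-based selection of FIG. 4 amounts to a membership test of each task's computation type against the computing list. A minimal sketch, with hypothetical type tags and function names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical computation-type tags; per FIG. 4, types A-D are in the
 * computing list and types E-F are not. */
typedef enum { TYPE_A, TYPE_B, TYPE_C, TYPE_D, TYPE_E, TYPE_F } calc_type;

/* The computing list: types supported by the first computing unit. */
static const calc_type computing_list[] = { TYPE_A, TYPE_B, TYPE_C, TYPE_D };

static int supported(calc_type t) {
    for (size_t i = 0; i < sizeof computing_list / sizeof computing_list[0]; i++)
        if (computing_list[i] == t) return 1;
    return 0;
}

/* Scheduler pass: pick out the tasks in the sequence whose type is in the
 * computing list (the "first computing task"); the rest would be scheduled
 * to the second computing unit. Returns how many were selected and fills
 * their indices into first_idx. */
static size_t select_first_tasks(const calc_type *seq, size_t n,
                                 size_t *first_idx) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (supported(seq[i])) first_idx[k++] = i;
    return k;
}
```

With the FIG. 4 sequence (types A, C, E, F), tasks 1 and 2 are selected for the near-storage computing unit and the remainder fall to the second computing unit. Updating the computing list, as described above, is then just appending a new type tag.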
  • the computing type may be related to the near-storage computing unit, for example, the computing type may be a computing type supported by the near-storage computing unit.
  • for example, if the type of near-storage computation supported by the near-storage computing unit is matrix computation, the computation types may include matrix computation; or, when the computation types do not yet include matrix computation, matrix computation may be added to the computation types to complete the update.
  • the computation types may also be sent by the computing business unit to the task scheduler.
  • the load information of the first computing task may include but not limited to: data address; data dimension; control command word and so on.
  • the data address may indicate where the data is stored in the memory;
  • the data dimension is used to indicate the dimension information of the data, for example, the number of rows, the number of columns, row-major storage, column-major storage, etc.
  • the data dimension may also include a data type, which may be a floating point type, an integer type, etc.;
  • the control command word may specify the computation type of the first computing task, for example, multiplication, addition, multiply-add, etc.
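One plausible packing of this load information (data address, data dimensions with layout and data type, control command word) into a record written at the reserved address is shown below; the field names, widths, and encodings are assumptions for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Control command words (computation type), as in the description:
 * multiplication, addition, multiply-add. The numeric values are
 * illustrative. */
enum { CMD_MUL = 1, CMD_ADD = 2, CMD_MADD = 3 };

/* Data-type and layout tags carried in the data dimension. */
enum { DT_FP32 = 0, DT_INT32 = 1 };
enum { LAYOUT_ROW_MAJOR = 0, LAYOUT_COL_MAJOR = 1 };

/* One plausible load-information layout written at the reserved address. */
typedef struct {
    uint64_t data_addr;  /* where the operand data sits in memory */
    uint32_t rows;       /* data dimension: number of rows        */
    uint32_t cols;       /* data dimension: number of columns     */
    uint8_t  dtype;      /* DT_FP32 / DT_INT32                    */
    uint8_t  layout;     /* row-major or column-major storage     */
    uint16_t cmd;        /* control command word, e.g. CMD_MADD   */
} load_info_t;
```

The near-storage computing unit only needs to parse this record to know what to compute, on which data, and in which layout; no computing instruction has to cross the bus.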
  • the processor may include but not limited to: a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a neural network processor (neural-network processing unit, NPU), and the like.
  • the processor may dispatch the load information of the first computing task to the near-memory computing unit in the following ways:
  • the processor stores the load information to a first address through a direct memory access (DMA) controller, and the near-storage computing unit accesses the first address, thereby acquiring the load information.
  • the processor sends an instruction to the DMA controller; the instruction may include the source address and destination address of the load information. According to the instruction, the DMA controller moves the load information from the source address to the destination address through a DMA write signal, that is, transfers the load information into the memory.
  • the processor can send an instruction to the DMA controller to transfer the load information to the first address of the memory by reusing the existing DMA mechanism, so that no redesign is required and design complexity is reduced.
  • the DMA controller transmits the load information to the fixed address in the memory through the DMA write signal.
  • the fixed address can be the reserved address of the near storage computing unit.
  • the near-storage computing unit can obtain the load information from the fixed address, parse it, complete the calculation according to it, and write the calculation result into the memory; the memory then returns a DMA response signal to the DMA controller, and the DMA controller passes a DMA-transfer-complete message or instruction to the processor.
  • the memory writes the load information at the first address under the operation of the DMA controller.
  • the processor dispatches the load information to the near-storage computing unit through the existing DMA mechanism, so the processor does not need to send the load information separately through a dedicated near-storage computing instruction; this decouples the processor from the near-storage computing unit and saves bus overhead.
  • the complexity of the design can be reduced due to the reuse of the existing DMA mechanism.
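The DMA-based dispatch path above can be modeled in software as follows. The address values, the completion message, and the controller interface are all illustrative assumptions, not part of the scheme.

```python
# Minimal software model of the dispatch path: the processor programs a DMA
# controller, the controller copies the load information to a fixed reserved
# address, and the near-storage computing unit reads that address.
RESERVED_ADDR = 0x1000   # "first address" reserved for the near-storage unit

class DmaController:
    def __init__(self, memory):
        self.memory = memory

    def transfer(self, src, dst):
        # DMA write signal: copy the payload from the source address to the
        # destination address, then report completion back to the processor
        self.memory[dst] = self.memory[src]
        return "dma_transfer_done"

memory = {0x2000: {"command": "mul", "data_address": 0x8000}}
dma = DmaController(memory)
status = dma.transfer(src=0x2000, dst=RESERVED_ADDR)   # issued by the processor
load = memory[RESERVED_ADDR]   # the near-storage unit polls the reserved address
```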
  • the processor can also schedule load information to the near-memory computing unit in other ways, such as by configuring registers; for example, the load information is dispatched to the first address reserved for the near-storage computing unit.
  • the processor writes the load information into an on-chip register, and the near-storage computing unit reads the register and obtains the load information from it to complete the calculation.
  • the on-chip registers may be located on the SoC.
  • because the near-storage computing unit 250 can complete calculations near the memory 240, it can interact with the memory 240 through physical wires or circuit connections without going through the system bus, thereby reducing data-transmission latency and power consumption and saving system bus overhead, which improves system computing efficiency.
  • the task scheduler can analyze the computing task to determine whether the computing task supports near-storage computing, so that the computing business unit does not need to perceive the computing capability of the computing unit.
  • the load information of the task is stored at the first address in the memory to allocate the computing task to the near-storage computing unit; because the near-storage computing unit is tightly coupled with the memory, the task computation can be completed quickly, and no task-scheduling information needs to be transmitted between the processor and the near-storage computing unit over a dedicated bus or interface, thereby reducing bus overhead.
  • the computing type supported by the replaced near-storage computing unit may change.
  • the newly added computing type can be added to the above calculation list to complete the update of the calculation list.
  • the preset computing types in the task scheduler can be updated for different near-storage computing units and memories, which makes the task scheduler more adaptable and further improves compatibility.
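The update path above can be sketched as a simple set update; the type names are illustrative and not taken from the scheme.

```python
# When the near-storage unit is replaced, any newly supported type missing
# from the preset types is added, completing the update of the list.
preset_types = {"matrix_matrix_mul", "matrix_vector_mul"}

def update_preset_types(preset, supported_by_new_unit):
    for t in supported_by_new_unit:
        if t not in preset:
            preset.add(t)   # add the newly supported type to the list
    return preset

update_preset_types(preset_types, {"matrix_matrix_mul", "vector_convolution"})
```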
  • the task scheduler can also be used to determine the second computing task in the computing task sequence.
  • the second computing task is dispatched to a second computing unit, and the second computing unit computes it; the second computing unit may be coupled to the memory through a bus and may include at least one of the processor, an image processing unit, an artificial intelligence (AI) processing unit, a digital signal processor, or a dedicated logic circuit.
  • Fig. 5 is a schematic block diagram of another computing task scheduling device provided by an embodiment of the present application.
  • the apparatus 300 may include a computing service unit 210 , a task scheduler 220 , a processor 230 , a memory 240 and an in-memory computing unit 260 .
  • the in-memory computing unit 260 may be located in the memory 240; for example, it may be embedded in the memory 240 as part of the memory. The in-memory computing unit 260 may exchange data with the memory 240 through short physical wires or circuit connections inside the memory 240, or it may directly read the data in the memory 240 without going through the bus, thereby saving bus overhead and quickly completing data calculation and data transmission.
  • the computing service unit 210 sends the compiled computing task sequence to the task scheduler 220; the task scheduler 220 parses the sequence and determines whether the target computing task can be computed in memory.
  • if so, the task scheduler 220 calls an in-memory computing load generation function to generate the load information of the target computing task and schedules it to the processor 230; the processor 230 (for example, a CPU) dispatches the load information of the target computing task to the first address in the memory. The in-memory computing unit 260 accesses the first address to obtain the load information, computes the target computing task according to it, and stores the calculation result in the memory 240.
  • the tight coupling may be in-memory computing coupling.
  • the in-memory computing unit 260 and the memory 240 may be replaced by an integrated storage-and-computing unit; in that case, the tight coupling is integrated storage-and-computing coupling.
  • the task scheduler can analyze the computing task to determine whether the computing task supports in-memory computing, so that the computing business unit does not need to perceive the computing capability of the computing unit, thereby reducing the complexity of the software.
  • the processor stores the load information of the computing task at a fixed address in the memory by accessing the memory, so as to allocate the computing task to the in-memory computing unit; because the in-memory computing unit is tightly coupled with the memory, the task calculation can be completed quickly, and no task scheduling information needs to be transmitted between the processor and the in-memory computing unit through a specific bus or interface, thereby reducing system bus overhead.
  • Fig. 6 is a schematic flowchart of a computing task scheduling method provided by an embodiment of the present application. As shown in FIG. 6, the method 400 may include steps 301 to 308.
  • the calculation service unit sends the calculation task sequence to the task scheduler TS.
  • the task scheduler receives the computing task sequence.
  • the computing task sequence may include one computing task or multiple computing tasks; the embodiment of the present application does not limit the number of computing tasks included in the sequence.
  • the computing business unit may be system software or application software.
  • the computing business unit may compile the computing task sequence and send it to the task scheduler.
  • the task scheduler determines a first computing task for a near-storage computing unit, and generates first load information of the first computing task.
  • the task scheduler may analyze the compiled computing task sequence, and after determining the first computing task, call a load generation function to generate near-storage computing load information corresponding to the first computing task.
  • the first computing task is a computing task determined by the task scheduler to perform near-storage computing.
  • the manner in which the task scheduler determines the first computing task may be determined according to the computing type of the first computing task. For example, the task scheduler may determine whether the calculation type of the first calculation task belongs to a preset calculation type.
  • the calculation type can be pre-stored in the task scheduler.
  • the calculation type can be matrix computation, such as matrix-matrix multiplication or matrix-vector multiplication; the calculation type can also be loop computation, vector convolution, etc.
  • the calculation type may also be sent by the calculation service unit to the task scheduler.
  • the preset calculation type can be in a list, or in a linked list, etc.
  • the calculation type is in a list.
  • the list may be a list of calculation types.
  • the list of calculation types may include calculation type A, calculation type B, and calculation type C.
  • the task scheduler can update the preset computing types. For example, different memories and near-storage computing units may support different types of near-storage computing; when the memory and the near-storage computing unit are replaced, if a supported target type is not included in the preset computing types, the target type can be added to them (for example, added to the calculation list) to complete the update. This makes the task scheduler more adaptable and further improves compatibility.
  • the task scheduler may further use the data dimension of the first computing task to determine whether the first computing task is suitable for near-storage computing. For example, the amount of data can be determined from the data dimension (for example, the number of rows multiplied by the number of columns); when the amount of data is greater than a preset value, the first computing task can be determined to be suitable for near-storage computing, and otherwise it is not.
  • when it is determined according to the data dimension that the data type of the first computing task (for example, floating-point) is consistent with the data type supported by the near-storage computing unit, the first computing task may be determined to be suitable for near-storage computing; otherwise, it is not.
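The two suitability checks above (data volume against a preset value, and data-type compatibility) can be sketched as a single predicate; the threshold and the supported-type set are assumptions for illustration.

```python
# Both checks must pass for a task to be scheduled to the near-storage unit.
PRESET_DATA_VOLUME = 1024        # preset value compared against rows * cols
SUPPORTED_DTYPES = {"float32"}   # types the near-storage unit can compute

def suits_near_storage(rows, cols, dtype):
    # data-volume check and data-type check, as described above
    return rows * cols > PRESET_DATA_VOLUME and dtype in SUPPORTED_DTYPES
```

A large floating-point matrix passes both checks; a small or integer-typed task would be dispatched to an ordinary computing unit instead.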
  • if the first computing task is not suitable for near-storage computing, the task scheduler can dispatch the first computing task to other computing cores for normal computing.
  • the task scheduler can call the load generation function to generate the load information and dispatch the load information to the processor.
  • the task scheduler can schedule the computing tasks to the second computing unit for normal computing, so that the computing business unit does not need to perceive whether the computing unit supports near-memory computing, thereby reducing the complexity of software implementation.
  • the second computing unit may be at least one of the above-mentioned processor, image processing unit, AI processing unit, digital signal processor, or dedicated logic circuit, where the second computing unit may be coupled to the memory through a bus.
  • the first load information can be used to define the first computing task, so that the near storage computing unit can calculate the first computing task according to the first load information.
  • the load information may include a data address, a data dimension, a control command word, etc.
  • load information may also include other information required for data calculation.
  • the task scheduler schedules the first load information to the processor.
  • the processor receives the first load information.
  • the processor may be a CPU, GPU, NPU, etc., and the processor may also be one or more computing cores or computing units in the CPU, GPU, or NPU, which is not limited in this embodiment of the present application.
  • the processor schedules the first load information to a memory.
  • the processor schedules the first load information to the memory through a DMA controller. Specifically, the processor sends an instruction to the DMA controller, and the DMA controller transmits the load information to a fixed address in the memory through a DMA write signal according to the instruction. For example, the DMA controller uses the bus to move the load information from the source address to the destination address.
  • a reserved address dedicated to storing information related to near-storage computing is set aside in the memory, that is, the first address; the reserved address can be used to store the first load information, so that the near-storage computing unit can access the reserved address to obtain the first load information.
  • the processor implements the delivery of load information through the existing DMA mechanism, without adding special instructions for near-memory calculations, so that the processor does not need to issue scheduling information through the bus alone, thereby saving system bus overhead and improving computing efficiency.
  • the complexity of the design can be reduced due to the reuse of the existing DMA mechanism.
  • the processor may also schedule the load information to the near-storage computing unit in other ways, for example, schedule the load information to the near-storage computing unit by configuring a register or the like. For example, the processor writes load information into a register, and the near-storage calculation unit reads the register, and obtains the load information from the register to complete the calculation.
  • the near-storage computing unit acquires the first load information from the memory.
  • the near-storage computing unit can obtain the first load information from the reserved address in the memory. Since the near-storage computing unit is located near the memory, accessing the memory does not need to go through the bus, so the latency and power consumption of data transmission can be reduced.
  • the near-storage computing unit may also acquire the load information through a near-storage computing instruction.
  • the near-storage computing unit may also acquire the load information in other ways.
  • the near storage computing unit completes the first computing task according to the first load information.
  • for example, the first load information defines a matrix multiplication between matrix A1 at address A and matrix B1 at address B, and the near-storage computing unit can instruct the memory to complete the corresponding calculation according to the first load information.
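The example above can be modeled as follows: the near-storage computing unit parses the load information, reads matrices A1 and B1 from the addresses it names, multiplies them, and writes the result back to memory. The addresses, field names, and completion message are assumptions for the sketch.

```python
# Model of the computation step: parse load info, read operands from memory,
# compute the matrix product, and write the result back into memory.
def near_storage_compute(memory, load_info):
    a = memory[load_info["addr_a"]]          # matrix A1
    b = memory[load_info["addr_b"]]          # matrix B1
    n, k, m = len(a), len(b), len(b[0])
    result = [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
              for i in range(n)]
    memory[load_info["addr_out"]] = result   # write the result into memory
    return "compute_done"                    # completion info to the memory

memory = {"A": [[1, 2], [3, 4]], "B": [[5, 6], [7, 8]], "OUT": None}
near_storage_compute(memory, {"addr_a": "A", "addr_b": "B", "addr_out": "OUT"})
# memory["OUT"] is now [[19, 22], [43, 50]]
```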
  • the near-storage computing unit sends the information that the computation is completed to the memory.
  • the near-storage calculation unit writes the calculation result into the memory after completing the calculation, that is, sends the information of the calculation completion to the memory.
  • the memory sends a response signal to the processor.
  • the memory may send a DMA response signal to the DMA controller after receiving the information or instruction that the near storage calculation is completed, and the DMA controller may send a signal or instruction that the DMA transfer is completed to the processor after receiving the DMA response signal.
  • the near-storage computing unit can complete computation near the memory, thereby reducing data transmission delay and power consumption, thereby improving system computing efficiency.
  • the task scheduler can analyze the computing task to determine whether the computing task supports near-storage computing, so that the computing business unit does not need to perceive the computing capability of the near-storage computing unit, thereby reducing software complexity.
  • the preset storage and calculation types in the task scheduler can be updated for different hardware platforms, further improving compatibility.
  • the processor can schedule the load information required for near-memory computing to the near-memory computing unit through the existing DMA mechanism, so that there is no need to add new near-memory computing instructions for scheduling computing tasks. Therefore, it can save bus overhead and improve computing efficiency.
  • Fig. 7 is a schematic flowchart of another computing task scheduling method provided by an embodiment of the present application. As shown in FIG. 7, the method 500 may include steps 401 to 407.
  • the calculation service unit sends the calculation task sequence to the task scheduler TS.
  • the task scheduler receives the computing task sequence.
  • the task scheduler determines a third computing task for the in-memory computing unit, and generates second load information of the third computing task.
  • the task scheduler schedules the second load information to the processor.
  • the processor receives the second load information.
  • the processor schedules the second load information to the memory.
  • for steps 401 to 404, reference may be made to the related descriptions of steps 301 to 304; details are not repeated here for brevity.
  • the in-memory computing unit acquires the second load information from the memory.
  • a reserved address dedicated to storing information related to in-memory computing may be reserved in the memory for the in-memory computing unit, and the reserved address may be used to store the second load information, so that in-memory computing The unit can access the reserved address to obtain the second load information.
  • the in-memory computing unit completes a third computing task according to the second load information.
  • the memory sends a response signal to the processor.
  • for steps 406 to 407, reference may be made to the related descriptions of steps 306 to 307; details are not repeated here for brevity.
  • the in-memory computing unit and the memory can also be replaced by an integrated storage and computing unit.
  • the in-memory calculation unit can complete the calculation inside the memory, so that no new in-memory calculation instructions are needed, the bus overhead is saved, and the delay and power consumption of data transmission can be reduced, thereby improving the system calculation efficiency.
  • the task scheduler can analyze the computing task to determine whether the computing task supports in-memory computing, so that the computing business unit does not need to perceive the computing type of the computing task supported by the in-memory computing unit, reducing software complexity.
  • the preset computing types in the task scheduler can be updated for different in-memory computing units and memories, further improving compatibility.
  • the processor can dispatch the load information required for in-memory computing to the in-memory computing unit through the existing DMA mechanism, without separately transmitting in-memory computing instructions through the bus, thereby saving bus overhead and improving computing efficiency.
  • an embodiment of the present application also provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium; when the computer instructions are run on a computer, the computing task scheduling method described in any one of the foregoing is executed.
  • an embodiment of the present application also provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium; when the computer instructions are run on a computer, the computing method described in any one of the foregoing is executed.
  • the embodiment of the present application also provides a computer program product, which, when running on a computer, causes the computer to execute the above related steps, so as to implement the calculation task scheduling method in the above embodiment.
  • An embodiment of the present application also provides a computer program product, which, when running on a computer, causes the computer to execute the above-mentioned related steps, so as to implement the computing method in the above-mentioned embodiment.
  • An embodiment of the present application also provides a computing system, including the computing task scheduling device and the computing device described in any one of the foregoing.
  • an embodiment of the present application also provides a device, which may specifically be a chip, a component, or a module; the device may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions. When the device runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip executes the computing task scheduling method or the computing method in the above method embodiments.
  • the computing task scheduling device, computing device, computer-readable storage medium, computer program product, and chip provided in this embodiment are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which are not repeated here.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Bus Control (AREA)

Abstract

The present application provides a computing task scheduling apparatus, a computing apparatus, a computing task scheduling method, and a computing method. The computing task scheduling apparatus includes: a task scheduler configured to determine a first computing task for a first computing unit and generate load information of the first computing task, the load information being used to define the first computing task; and a processor configured to receive the load information from the task scheduler and store it at a first address in a memory so as to allocate the first computing task to the first computing unit, the first address being a reserved address of the first computing unit, where the processor is coupled through a bus to at least one of the memory and the first computing unit, the first computing unit is tightly coupled with the memory, the tight coupling does not pass through any bus, and the first computing unit can access the memory at a speed higher than bus access. This technical solution can reduce data-transfer overhead and computational complexity.

Description

Computing task scheduling apparatus, computing apparatus, computing task scheduling method, and computing method

Technical Field

The present application relates to the field of computer technology, and in particular to a computing task scheduling apparatus, a computing apparatus, a computing task scheduling method, and a computing method.
Background

In a general-purpose computer system, the von Neumann and Harvard architectures separate computation from storage. Data needed for computation must be loaded from external memory into the compute core's memory, i.e., the cache, and written back from core memory to external memory after computation, which increases the power consumption of data transfer during computation.

To reduce the power consumption of data transfer, computation fusion, near-memory computing, or in-memory computing can be adopted. Computation fusion performs multi-step computation in a fused manner, reducing interaction with external memory; however, it requires a cache of a certain size inside the compute core as well as fine-grained software partition management, so its implementation complexity is high. Near-memory computing completes computation near the memory, and in-memory computing computes directly inside the memory, which can reduce the power consumption of data transfer; however, near-memory or in-memory computing generally requires adding corresponding computing instructions, and adaptation schemes are not unified across different hardware platforms, so complexity is high.

Therefore, how to reduce computational complexity while keeping the power consumption of data transfer low is a technical problem that needs to be solved.
Summary

The present application provides a storage-computing apparatus and a storage-computing method, in order to reduce data-transfer overhead and computational complexity.

According to a first aspect, a computing task scheduling apparatus is provided, including: a task scheduler configured to determine a first computing task for a first computing unit and generate load information of the first computing task, the load information being used to define the first computing task; and a processor configured to receive the load information from the task scheduler and store the load information at a first address in a memory so as to allocate the first computing task to the first computing unit, the first address being a reserved address of the first computing unit, where the processor is coupled through a bus to at least one of the memory and the first computing unit, the first computing unit is tightly coupled with the memory, the tight coupling does not pass through any bus, and the first computing unit can access the memory at a speed higher than bus access.

In this technical solution, the task scheduler can determine computing tasks for the first computing unit, so the software module does not need to perceive the computing capability of the first computing unit, which reduces software complexity. In addition, the processor stores the load information of the computing task at a fixed address in the memory by accessing the memory, so as to allocate the computing task to the first computing unit; the first computing unit is tightly coupled with the memory, so the task computation can be completed quickly, and no task scheduling information needs to be transmitted between the processor and the first computing unit through a specific bus or interface, which reduces the power consumption and latency of data transfer.
With reference to the first aspect, in one implementation of the first aspect, the processor is specifically configured to store the load information to the first address through a direct memory access (DMA) controller.

In this technical solution, the processor achieves the above beneficial effects by reusing existing DMA technology: the load information is stored at the first address in the memory through the DMA, so as to allocate the first computing task to the first computing unit, which saves system overhead and improves computing efficiency.

With reference to the first aspect, in one implementation of the first aspect, the task scheduler is a dedicated task scheduler outside the system software or application software.

It should be understood that the task scheduler may be a hardware task scheduler dedicated to task scheduling.

With reference to the first aspect, in one implementation of the first aspect, the task scheduler is further configured to receive a computing task sequence from the system software or the application software, and determine the first computing task for the first computing unit in the computing task sequence.

It should be understood that the computing task sequence may include one or more computing tasks, and the first computing task may be one computing task or multiple computing tasks, which is not limited in the embodiments of the present application.

With reference to the first aspect, in one implementation of the first aspect, the task scheduler is further configured to: determine a second computing task for a second computing unit in the computing task sequence; and schedule the second computing task to the second computing unit, where the second computing unit includes at least one of the processor, an image processing unit, an artificial intelligence (AI) processing unit, a digital signal processor, or a dedicated logic circuit, and the second computing unit and the memory are coupled through a bus.

In this technical solution, when the task scheduler determines that a second computing task in the computing task sequence is not suitable for the first computing unit, it can schedule the second computing task to the second computing unit, which is coupled to the memory through a bus. For example, the second computing task may be another task that is not suitable for near-memory computing, in-memory computing, or integrated storage-computing.

With reference to the first aspect, in one implementation of the first aspect, the task scheduler is specifically configured to determine the first computing task for the first computing unit in the computing task sequence according to a calculation list, where the calculation list includes the computing task types supported by the first computing unit.

It should be understood that the calculation list may be replaced by a linked list or the like.

In some embodiments, the calculation list can be updated. For example, when the computing types supported by the first computing unit change, the changed computing types can be added to the calculation list to complete the update; or, when the task scheduler and the processor are applied to another computing unit, the computing types supported by that computing unit can be added to the calculation list. This improves the compatibility of the system.

With reference to the first aspect, in one implementation of the first aspect, the load information includes at least one of the following: a data address; a data dimension; or a control command word.

It should be understood that the load information may further include other information used for the computing task.

With reference to the first aspect, in one implementation of the first aspect, the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
According to a second aspect, a computing apparatus is provided, including: a memory; and a first computing unit configured to obtain load information from a first address in the memory and complete a first computing task according to the load information, where the load information is used to define the first computing task and the first address is a reserved address of the first computing unit; the first computing unit is tightly coupled with the memory, the tight coupling does not pass through any bus, and the first computing unit can access the memory at a speed higher than bus access; at least one of the memory and the first computing unit is coupled to a processor through a bus.

In this technical solution, the first computing unit obtains the load information from the memory, and the first computing unit is tightly coupled with the memory, which reduces the system overhead required for computation and improves computing efficiency.

With reference to the second aspect, in one implementation of the second aspect, the load information includes at least one of the following: a data address; a data dimension; or a control command word.

With reference to the second aspect, in one implementation of the second aspect, the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.

With reference to the second aspect, in one implementation of the second aspect, the memory is specifically configured to write the load information at the first address under the operation of a direct memory access (DMA) controller.

In this technical solution, the memory can write the load information at the first address through DMA, which saves bus overhead.
According to a third aspect, a computing task scheduling method is provided, including: determining, by a task scheduler, a first computing task for a first computing unit, and generating load information of the first computing task, the load information being used to define the first computing task; and receiving, by a processor, the load information from the task scheduler, and storing the load information at a first address in a memory so as to allocate the first computing task to the first computing unit, the first address being a reserved address of the first computing unit, where the processor is coupled through a bus to at least one of the memory and the first computing unit, the first computing unit is tightly coupled with the memory, the tight coupling does not pass through any bus, and the first computing unit can access the memory at a speed higher than bus access.

With reference to the third aspect, in one implementation of the third aspect, storing the load information at the first address in the memory so as to allocate the first computing task to the first computing unit includes: storing the load information to the first address through a direct memory access (DMA) controller so as to allocate the first computing task to the first computing unit.

With reference to the third aspect, in one implementation of the third aspect, the task scheduler is a dedicated task scheduler outside the system software or application software.

With reference to the third aspect, in one implementation of the third aspect, determining, by the task scheduler, the first computing task for the first computing unit includes: receiving, by the task scheduler, a computing task sequence from the system software or the application software, and determining the first computing task for the first computing unit in the computing task sequence.

With reference to the third aspect, in one implementation of the third aspect, the method further includes: determining a second computing task for a second computing unit in the computing task sequence; and scheduling the second computing task to the second computing unit, where the second computing unit includes at least one of the processor, an image processing unit, an artificial intelligence (AI) processing unit, a digital signal processor, or a dedicated logic circuit, and the second computing unit and the memory are coupled through a bus.

With reference to the third aspect, in one implementation of the third aspect, determining the first computing task for the first computing unit in the computing task sequence includes: determining the first computing task for the first computing unit in the computing task sequence according to a calculation list, where the calculation list includes the computing task types supported by the first computing unit.

With reference to the third aspect, in one implementation of the third aspect, the load information includes at least one of the following: a data address; a data dimension; or a control command word.

With reference to the third aspect, in one implementation of the third aspect, the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.
According to a fourth aspect, a computing method is provided, including: obtaining, by a first computing unit, load information from a first address in a memory, and completing a first computing task according to the load information, where the load information is used to define the first computing task and the first address is a reserved address of the first computing unit; the first computing unit is tightly coupled with the memory, the tight coupling does not pass through any bus, and the first computing unit can access the memory at a speed higher than bus access; at least one of the memory and the first computing unit is coupled to a processor through a bus.

With reference to the fourth aspect, in one implementation of the fourth aspect, the load information includes at least one of the following: a data address; a data dimension; or a control command word.

With reference to the fourth aspect, in one implementation of the fourth aspect, the tight coupling includes near-memory computing coupling, in-memory computing coupling, or integrated storage-computing coupling.

With reference to the fourth aspect, in one implementation of the fourth aspect, the method further includes: writing, by the memory, the load information at the first address under the operation of a direct memory access (DMA) controller.
According to a fifth aspect, a computer-readable storage medium is provided, where the storage medium stores a computer program or instructions; when the computer program or instructions are executed by a communication apparatus, the computing task scheduling method described in the third aspect or any one of its possible implementations is executed, or the computing method described in the fourth aspect or any one of its possible implementations is executed.

According to a sixth aspect, a computer program product is provided; when the computer program product runs on a computer, the computing task scheduling method described in the third aspect or any one of its possible implementations is executed, or the computing method described in the fourth aspect or any one of its possible implementations is executed.

According to a seventh aspect, a computing system is provided, including the task scheduling apparatus described in the first aspect or any one of its possible implementations and the computing apparatus described in the second aspect or any one of its possible implementations.
Brief Description of Drawings

Fig. 1 is a schematic block diagram of a computing apparatus provided by an embodiment of the present application.

Fig. 2 is a schematic block diagram of a near-memory computing apparatus provided by an embodiment of the present application.

Fig. 3 is a schematic block diagram of a computing task scheduling apparatus provided by an embodiment of the present application.

Fig. 4 is a schematic diagram of determining a target computing task according to a calculation list, provided by an embodiment of the present application.

Fig. 5 is a schematic block diagram of another computing task scheduling apparatus provided by an embodiment of the present application.

Fig. 6 is a schematic flowchart of a computing task scheduling method provided by an embodiment of the present application.

Fig. 7 is a schematic flowchart of another computing task scheduling method provided by an embodiment of the present application.
Detailed Description

The technical solutions in the present application are described below with reference to the accompanying drawings.
In a general-purpose computer system, both the von Neumann and Harvard architectures separate computation from storage. Data needed for computation must be loaded from external memory into the compute core, and written back from core memory to external memory after computation. In the current era of rapid neural network development, most acceleration hardware adopts the von Neumann architecture, while neural network computation is both compute-intensive and data-intensive: the compute core contains highly data-parallel computing resources and demands very high bandwidth. Consequently, in the power breakdown of overall computation, the power overhead of data transfer is often higher than that of computation itself.

To reduce the power overhead of data transfer, a computation fusion solution can be adopted, that is, multi-step computation is performed in a fused manner to reduce interaction with external memory. Computation fusion can effectively relieve bandwidth pressure and reduce transfer power overhead, but it requires a cache of a certain size inside the compute core (such as static random-access memory, SRAM), and it also requires fine-grained software partition management, so its implementation complexity is high.

Besides computation fusion, near-memory computing (NMC) or in-memory computing (IMC) can be used to reduce the power overhead of data transfer. Near-memory computing and in-memory computing are new architectural directions focused on the memory: by computing near the memory or directly inside it, they break through the limitations of the von Neumann architecture and thus address the power overhead of data transfer. Near-memory computing tightly couples the memory and the compute processor, using short wires to reduce the latency and power consumption of data transfer and thereby improve system energy efficiency; as manufacturing and packaging technologies develop, computing logic and memory can be stacked to build hybrid compute-storage. In-memory computing completes computation directly in the memory array, reducing data transfer between the compute processor and the memory. However, near-memory and in-memory computing technologies are limited by their computational characteristics and by the design complexity of storage-computing hardware.

In general, near-memory or in-memory computing requires adding corresponding computing instructions, and adaptation schemes are not unified across different hardware platforms, so integration complexity is high. In view of this, embodiments of the present application provide a storage-computing apparatus and a storage-computing method; this technical solution can further reduce implementation complexity while keeping the power consumption of data transfer low.
Before introducing the technical solutions of the present application, the differences between ordinary computing and near-memory or in-memory computing are first described with reference to Figs. 1-2.

Fig. 1 is a schematic block diagram of an ordinary computing apparatus provided by an embodiment of the present application. As shown in Fig. 1, in the apparatus 100a, the memory 110 writes the data to be computed into a buffer 120 through the bus. When performing computation, the ordinary computing unit 130 reads the buffer 120 to obtain the data to be computed, completes the computing operation, and writes the result into the buffer 120; the data is then written from the buffer 120 into the memory 110 through the bus. While the ordinary computing unit 130 completes the computation, the buffer 120 must be read and written many times, and the memory 110 must interact with the buffer 120 many times through the bus, so the system bus overhead is large.

Fig. 2 is a schematic block diagram of a near-memory computing apparatus provided by an embodiment of the present application. As shown in Fig. 2, in the apparatus 100b, the near-memory computing unit 150 may be located outside the memory 140 and tightly coupled with it. When performing computation, the near-memory computing unit 150 may interact with the memory 140 not through the bus but through physical wires or circuit connections. Because the near-memory computing unit 150 and the memory 140 are tightly coupled and close to each other, the physical wires or circuit connections for transferring data are short, which reduces the latency and power consumption of data transfer between the near-memory computing unit and the memory, and also reduces bus overhead.

In some embodiments, the near-memory computing unit 150 may be replaced by an in-memory computing unit located inside the memory 140; for example, the in-memory computing unit may be embedded in the memory 140 as part of the memory, that is, the memory itself has computing capability. The in-memory computing unit may interact with the memory through physical wires or circuit connections, and may also directly read the data inside the memory to complete computation without a read/write protocol and without going through the bus, thereby saving bus overhead.

The near-memory computing unit 150 and the memory 140 may also be replaced by an integrated storage-computing unit, which can both store data and complete computation, thereby saving the bus overhead between computation and storage and reducing the latency and power consumption of data transfer.

The technical solutions in the embodiments of the present application are described in detail below with reference to Figs. 3 to 7.
Fig. 3 is a schematic block diagram of a computing task scheduling apparatus provided by an embodiment of the present application. As shown in Fig. 3, the apparatus 200 may include a computing service unit 210, a task scheduler 220, a processor 230, a memory 240, and a near-memory computing unit 250. Optionally, the task scheduler 220 and the processor 230 in the apparatus 200 may be located in one chip, such as a system on chip (SoC), and the memory 240 and the near-memory computing unit 250 may be located in another chip.

The computing service unit 210 is located at the service scheduling layer and is a software module; for example, the computing service unit 210 may be system software or application software. The task scheduler 220 is a hardware scheduler, and the processor 230, the memory 240, and the near-memory computing unit 250 are all hardware devices. The processor 230 can run the system software or application software to execute computing or processing tasks, and can also interact with other hardware devices, for example by sending/receiving data or instructions. The memory 240 can be used to store data and can be accessed by other hardware devices such as the processor 230. The near-memory computing unit 250 may include computing circuits for executing computing tasks, which may be different from the computing tasks executed by the processor 230.

Exemplarily, the computing service unit 210 sends the compiled computing task sequence to the task scheduler 220; the task scheduler 220 parses the computing task sequence and determines whether a computing task can be computed near memory. When it determines that a target computing task can be computed near memory, the task scheduler 220 calls a near-memory computing load generation function to generate the load information of the first computing task and schedules that load information to the processor 230; the processor 230 (for example, a CPU) stores the load information of the first computing task at the first address in the memory. Optionally, the first address is a reserved address used for exchanging the load information between the processor 230 and the near-memory computing unit 250; the near-memory computing unit 250 can access the first address to obtain the load information of the target computing task, then complete the computation according to the load information and store the result in the memory 240.
具体地,任务调度器220可以为近存计算单元确定第一计算任务,生成第一计算任务的负载信息,该负载信息用于定义第一计算任务。
示例性地,该任务调度器为系统软件或应用软件之外的专用任务调度器。即该任务调度器为装置200中专门用于调度计算任务的硬件任务调度器。
处理器230可以从任务调度器220接收负载信息,将负载信息存储至存储器240中的第一地址以将第一计算任务分配给近存计算单元250,第一地址为近存计算单元250的预留地址,其中,处理器230与存储器240和近存计算单元250中的至少一个通过总线耦合,近存计算单元250与存储器240紧密耦合,所述紧密耦合无需经过任何总线,且近存计算单元250能够以高于总线接入的速度接入存储器240。
该实施例中,紧密耦合为近存计算耦合。
应理解,该负载信息用于定义第一计算任务,可以理解为,该负载信息中的内容是计算第一计算任务需要的内容,可以用于近存计算单元完成该第一计算任务。
该存储器240中的第一地址为近存计算单元250的预留地址,即该存储器240中为该近存计算单元250预留了一块区域,该区域中可以存储有近存计算单元250计算所需的负载信息。该近存计算单元250可以访问该第一地址,以获取该负载信息,从而根据该负载信息完成第一计算任务。
在一种可能的实现方式中,处理器230与存储器240通过总线耦合,存储器240与近存计算单元250紧密耦合,即近存计算单元250与存储器交互数据无需通过任何总线,且该近存计算单元接入存储器240的速度高于通过总线接入存储器240的速度。例如,二者可以通过物理导线或电路连线交互,从而无需通过总线,可以节省总线开销,进而降低数据传输的时延和功耗。在另一种可能的实现方式中,当处理器230与存储器240通过总线耦合,且处理器230与近存计算单元250不通过总线耦合时,处理器230可以通过直接存储访问DMA控制器将负载信息存储至第一地址,近存计算单元250从该第一地址中获取该负载信息。处理器230还可以通过配置寄存器等方式将负载信息调度至近存计算单元250,例如,处理器230将负载信息写入寄存器中。近存计算单元250读取该寄存器,从寄存器中获取该负载信息并存入第一地址,以完成计算。该寄存器可以与处理器230位于同一个芯片内,如SoC内。
该任务调度器可以在计算任务序列中为近存计算单元确定第一计算任务。
具体地,该任务调度器确定第一计算任务可以是根据该第一计算任务的类型进行确定的。例如,该任务调度器中可以预先存储有计算类型,该计算类型可以是预先设置的一项或多项,例如,该计算类型可以包括矩阵类计算、循环计算等等,该计算类型可以处于一个计算列表中或链表中。
示例性地,任务调度器根据计算列表,在计算任务序列中为近存计算单元250确定第一计算任务。具体地,当一个计算任务的计算类型包括在计算列表中时,可以确定该计算任务为第一计算任务。该第一计算任务可以是一个计算任务,也可以是多个计算任务,本申请实施例对此不予限定。
参见图4,图4是本申请实施例提供的一种根据计算列表确定第一计算任务的示意图。如图4所示,该计算列表中可以包括计算任务的计算类型A、计算类型B、计算类型C、计算类型D等等,该计算任务序列可以包括计算任务一(计算类型为A)、计算任务二(计算类型为C)、计算任务三(计算类型为E)、计算任务四(计算类型为F)等等。
任务调度器可以预先存储有该计算列表,当任务调度器接收到计算业务单元发送的计算任务序列后,可以根据计算任务序列中的计算任务的类型是否包括在计算列表中来确定目标计算任务。继续参见图4,计算任务序列中的计算任务一和计算任务二的计算类型包括在计算列表中,则可以确定计算任务一和计算任务二为第一计算任务。
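图4所示“按计算列表筛选计算任务”的过程,可以用如下Python片段简化示意。其中任务与计算类型的表示方式均为示例假设:

```python
# 简化示意:按“计算列表”在计算任务序列中筛选可近存计算的第一计算任务,
# 未命中的任务则留给其他计算单元。类型名 A/B/C/D/E/F 均为示例假设。
COMPUTE_LIST = {"A", "B", "C", "D"}   # 任务调度器预存的计算类型

task_sequence = [
    ("计算任务一", "A"),
    ("计算任务二", "C"),
    ("计算任务三", "E"),
    ("计算任务四", "F"),
]

def select_first_tasks(tasks, compute_list):
    """计算类型命中计算列表的任务被确定为第一计算任务,其余为第二计算任务。"""
    first = [t for t in tasks if t[1] in compute_list]
    second = [t for t in tasks if t[1] not in compute_list]
    return first, second

first, second = select_first_tasks(task_sequence, COMPUTE_LIST)
assert [name for name, _ in first] == ["计算任务一", "计算任务二"]
```

与图4一致:计算任务一(类型A)和计算任务二(类型C)命中计算列表而被确定为第一计算任务,计算任务三、四落入第二计算任务。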
应理解,该计算类型可以是与近存计算单元相关的,例如,该计算类型可以是该近存计算单元支持的计算类型。示例性地,该近存计算单元支持的近存计算的类型为矩阵类计算,则该计算类型可以包括该矩阵类计算,或者,当该计算类型不包括该矩阵类计算的类型时,可以将该矩阵类计算的类型添加至计算类型中,以完成对该计算类型的更新。
在一些实施例中,该计算类型还可以是由计算业务单元发送至任务调度器的。该第一计算任务的负载信息可以包括但不限于:数据地址、数据维度、控制命令字等。其中,该数据地址可以用于指示该数据在存储器中存放的地址;该数据维度用于指示该数据的维度信息,例如,行数、列数,按照行优先存储、按照列优先存储等,该数据维度还可以包括数据类型,该数据类型可以是浮点型、整型等;该控制命令字可以用于控制该第一计算任务的计算类型,例如,乘法、加法、乘加等。
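作为理解性的草图,下面用Python的struct模块示意一种可能的负载信息打包格式(数据地址、数据维度、控制命令字)。字段布局、位宽与命令字编码均为示例假设,并非本申请限定的格式:

```python
import struct

# 简化示意:将负载信息打包为固定格式的字节串,便于写入存储器的预留地址。
# 字段布局为示例假设:32位数据地址 + 行列数 + 存储序/数据类型 + 控制命令字。
OP_MUL, OP_ADD, OP_MAC = 0, 1, 2   # 控制命令字示例:乘法、加法、乘加

def pack_payload(data_addr, rows, cols, row_major, dtype_code, opcode):
    # <:小端;I:32位数据地址;HH:行数与列数;BBB:存储序、数据类型、命令字
    return struct.pack("<IHHBBB", data_addr, rows, cols,
                       int(row_major), dtype_code, opcode)

def unpack_payload(blob):
    addr, rows, cols, rm, dt, op = struct.unpack("<IHHBBB", blob)
    return {"addr": addr, "rows": rows, "cols": cols,
            "row_major": bool(rm), "dtype": dt, "op": op}

blob = pack_payload(0x80000000, 64, 32, True, 0, OP_MAC)
info = unpack_payload(blob)
assert info["addr"] == 0x80000000 and info["op"] == OP_MAC
```

固定格式的好处是近存计算单元可以按约定偏移直接解析预留地址中的内容,无需额外协商。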
该处理器可以包括但不限于:中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing unit,NPU)等。
该处理器将第一计算任务的负载信息调度至近存计算单元可以通过以下几种方式:
方式一:
处理器通过直接存储访问(direct memory access,DMA)控制器将该负载信息存储至第一地址,近存计算单元访问该第一地址,从而可以获取该负载信息。
例如,处理器向DMA发送指令,该指令中可以包括负载信息的源地址和目的地址,DMA控制器根据处理器的指令,通过DMA写信号将负载信息从源地址搬运至目的地址中,也即将负载信息传输至存储器中。该技术方案中,处理器可以向DMA控制器发送指令,将该负载信息利用已有的DMA机制传输至存储器的第一地址中,从而可以无需重新设计,降低了设计复杂度。
在这种情况下,DMA控制器在处理器的控制下,通过DMA写信号将该负载信息传输至存储器中的固定地址中,该固定地址可以是近存计算单元的预留地址。该近存计算单元可以从该固定地址中获取该负载信息并进行解析,根据该负载信息完成计算,并将计算的结果写入存储器中;然后存储器返回DMA响应信号至DMA控制器,DMA控制器将DMA传输完成的消息或指令传输至处理器。
相应的,存储器在DMA控制器的操作下在第一地址写入负载信息。
该技术方案中,处理器通过已有的DMA机制将负载信息调度至近存计算单元,从而处理器无需通过近存计算专用指令将负载信息单独发送至近存计算单元,从而实现了处理器与近存计算单元的解耦,节省了总线开销。此外,由于复用现有的DMA机制,可以降低设计的复杂度。
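方式一中“处理器下发含源地址与目的地址的指令、DMA控制器搬运负载信息并回报传输完成”的流程,可以用如下Python片段简化示意。其中的地址取值、缓冲区与回调方式均为示例假设:

```python
# 简化示意:处理器向DMA控制器下发指令,DMA控制器将负载信息
# 从源地址搬运到存储器中的预留地址(第一地址),完成后回报“传输完成”。
memory = bytearray(1024)        # 模拟存储器
scratch = bytearray(1024)       # 负载信息最初所在的源缓冲(示例假设)
FIRST_ADDR = 0x200              # 第一地址:示例假设取值

def dma_transfer(src_buf, src, dst_buf, dst, length, on_done):
    """模拟DMA写信号:按处理器指令中的源/目的地址搬运数据,完成后回调。"""
    dst_buf[dst:dst + length] = src_buf[src:src + length]
    on_done()                   # 对应“DMA传输完成”的消息或指令

done = []
scratch[0:4] = b"LOAD"
dma_transfer(scratch, 0, memory, FIRST_ADDR, 4,
             lambda: done.append(True))
assert bytes(memory[FIRST_ADDR:FIRST_ADDR + 4]) == b"LOAD"
```

这一草图对应正文所述的解耦思路:处理器只发起一次DMA搬运,近存计算单元之后自行从第一地址取走负载信息,两者之间没有专用指令交互。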
方式二:
处理器可以通过配置寄存器等方式将负载信息调度至近存计算单元。例如,负载信息调度至近存计算单元中第一地址。
示例性地,处理器将负载信息写入片上寄存器中,近存计算单元读取该寄存器,从寄存器中获取该负载信息,以完成计算。该片上寄存器可以位于SoC上。
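方式二中“处理器写寄存器、近存计算单元读寄存器”的交互,可以用如下Python片段简化示意。寄存器的结构(有效位加负载字段)为示例假设:

```python
# 简化示意:处理器将负载信息写入片上寄存器,近存计算单元读取该寄存器,
# 取走负载后清除有效位。寄存器结构为示例假设。
class MailboxRegister:
    def __init__(self):
        self.valid = False
        self.payload = None

    def write(self, payload):            # 处理器侧:配置寄存器
        self.payload, self.valid = payload, True

    def read(self):                      # 近存计算单元侧:读取并清除
        if not self.valid:
            return None
        p, self.payload, self.valid = self.payload, None, False
        return p

reg = MailboxRegister()
reg.write({"addr": 0x100, "op": "mac"})  # 字段名为示例假设
assert reg.read() == {"addr": 0x100, "op": "mac"}
assert reg.read() is None                # 取走后寄存器为空
```

“读取后清除有效位”的设计避免近存计算单元重复消费同一份负载信息,这是此类信箱寄存器的常见约定,此处仅作示意。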
应理解,由于该近存计算单元250可以在存储器240附近完成计算,该近存计算单元250可以通过物理导线或电路连线与存储器240完成交互,无需通过系统总线,从而可以降低数据传输的时延与功耗,节省系统总线开销,从而提高系统计算效率。
进一步地,该技术方案中,任务调度器可以对计算任务进行解析以确定计算任务是否支持近存计算,从而计算业务单元无需感知计算单元的计算能力,此外,处理器采用访问存储器的方式将计算任务的负载信息存储至存储器中的第一地址中,以将计算任务分配给近存计算单元,近存计算单元与存储器紧密耦合,从而可以快速完成任务计算,无需在处理器与近存计算单元之间通过特定总线或接口传输任务调度信息,从而降低了总线开销。
在一些实施例中,当存储器和近存计算单元更换时,更换后的近存计算单元支持的计算类型可能会发生改变,此时,可以将新增加的计算类型添加至上述计算列表中,以完成计算列表的更新。
该技术方案中,任务调度器中的预设计算类型可以针对不同的近存计算单元和存储器进行更新,使得任务调度器的适配性更好,进一步提高了兼容性。
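计算列表的更新逻辑本身很简单,可以用如下Python片段简化示意(类型名均为示例假设):

```python
# 简化示意:更换近存计算单元后,将其新支持、但列表中尚无的计算类型
# 并入计算列表,完成列表更新。类型名为示例假设。
compute_list = {"矩阵乘", "循环计算"}

def update_compute_list(current, supported_by_new_unit):
    """返回并入新单元所支持类型后的计算列表(集合并集)。"""
    return current | set(supported_by_new_unit)

updated = update_compute_list(compute_list, ["矩阵乘", "向量卷积"])
assert "向量卷积" in updated and len(updated) == 3
```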
在另一些实施例中,任务调度器还可以用于在计算任务序列中确定第二计算任务,当任务调度器确定计算任务序列中的第二计算任务不适合进行近存计算时,可以将该第二计算任务调度至第二计算单元,由该第二计算单元计算该第二计算任务,其中,该第二计算单元可以与存储器通过总线耦合,该第二计算单元可以包括所述处理器、图像处理单元、人工智能(artificial intelligence,AI)处理单元、数字信号处理器或专用逻辑电路中的至少一个。
图5是本申请实施例提供的另一种计算任务调度装置的示意性框图。如图5所示,该装置300可以包括计算业务单元210、任务调度器220、处理器230、存储器240和存内计算单元260。
其中,该存内计算单元260可以处于存储器240中,例如,该存内计算单元260可以嵌入存储器240内部,作为存储器的一部分,该存内计算单元260在存储器240内部可以通过更短的物理导线或电路连线与存储器240进行数据交互,或者,该存内计算单元260可以直接读取存储器240中的数据,而无需通过总线,从而可以节省总线开销,可以快速完成数据计算和数据传输。
示例性地,计算业务单元210将编译好的计算任务序列发送至任务调度器220;任务调度器220对该计算任务序列进行解析,并确定目标计算任务是否可以进行存内计算,当确定目标计算任务可以进行存内计算时,任务调度器220调用存内计算负载生成函数生成目标计算任务的负载信息,并将该负载信息调度至处理器230;该处理器230(例如,CPU)将该目标计算任务的负载信息调度至存储器中的第一地址,存内计算单元260访问该第一地址以获取负载信息,存内计算单元260根据该负载信息对该目标计算任务进行计算,并将计算的结果存储至存储器240中。
应理解,对于计算业务单元210、任务调度器220、处理器230的相关描述可以参见前文,为了简洁,不再赘述。
该实施例中,紧密耦合可以为存内计算耦合。
应理解,该存内计算单元260和存储器240可以用存算一体单元代替,此时,紧密耦合为存算一体耦合。
该技术方案中,任务调度器可以对计算任务进行解析以确定计算任务是否支持存内计算,从而计算业务单元无需感知计算单元的计算能力,从而降低了软件的复杂度。此外,处理器采用访问存储器的方式将计算任务的负载信息存储至存储器中的固定地址中,以将计算任务分配给存内计算单元,存内计算单元与存储器紧密耦合,从而可以快速完成任务计算,无需在处理器与存内计算单元之间通过特定总线或接口传输任务调度信息,从而可以降低系统总线开销。
图6是本申请实施例提供的一种计算任务调度方法的示意性流程图。如图6所示,该方法400可以包括步骤301至步骤308。
301,计算业务单元将计算任务序列发送至任务调度器TS。相应的,任务调度器接收该计算任务序列。
其中,该计算任务序列可以包括一个计算任务,也可以包括多个计算任务,本申请实施例对该计算任务序列中包括的计算任务的数量不做限定。
该计算业务单元可以是系统软件,也可以是应用软件。
在一些实施例中,该计算业务单元可以将该计算任务序列编译后,发送至任务调度器。
302,任务调度器为近存计算单元确定第一计算任务,生成第一计算任务的第一负载信息。
应理解,该任务调度器可以对编译后的计算任务序列进行解析,在确定第一计算任务后,调用负载生成函数生成该第一计算任务对应的近存计算负载信息。
应理解,该第一计算任务是任务调度器确定可以进行近存计算的计算任务。
该任务调度器确定第一计算任务的方式可以是根据该第一计算任务的计算类型确定的。例如,任务调度器可以确定该第一计算任务的计算类型是否属于预设的计算类型。
其中,任务调度器中可以预先保存有计算类型,例如,该计算类型可以是矩阵类计算,如矩阵与矩阵乘、矩阵与向量乘;该计算类型还可以是循环计算、向量卷积运算等等。
在一些实施例中,该计算类型还可以是计算业务单元发送至任务调度器的。
该预设的计算类型可以处于一个列表中、或链表中等。
例如,该计算类型处于列表中,该列表可以是计算类型列表,该计算类型列表中可以包括计算类型A、计算类型B、计算类型C。当任务调度器解析该计算任务序列时,若计算任务序列中包括计算类型为A、B或C的第一计算任务,则可确定该第一计算任务适合进行近存计算,从而该任务调度器可以调用负载生成函数生成负载信息。
在另一些实施例中,任务调度器可以对该预设的计算类型进行更新,例如,针对不同的存储器和近存计算单元,其支持的近存计算的类型可能是不同的,当存储器和近存计算单元更换时,在其支持的近存计算的目标类型不包括在预设的计算类型的情况下,可以将该目标类型添加至预设的计算类型中,如将该目标类型添加至计算列表中,以完成该计算类型的更新,使得任务调度器的适配性更好,从而可以提升兼容性。
在另一些实施例中,任务调度器除了根据第一计算任务的计算类型确定该第一计算任务之外,还可以进一步根据第一计算任务的数据维度确定该第一计算任务,以确定该第一计算任务是否适合近存计算。例如,可以根据数据维度确定数据量大小(例如,行数乘以列数),当数据量大于预设值时,可以确定该第一计算任务适合近存计算,否则,该第一计算任务不适合近存计算。又如,当根据数据维度确定该第一计算任务的数据类型(例如,浮点型)与近存计算单元支持的数据类型(例如,浮点型)一致时,可以确定该第一计算任务适合近存计算,否则,该第一计算任务不适合近存计算。
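上段按数据维度进一步判断是否适合近存计算的逻辑,可以用如下Python片段简化示意。其中的数据量阈值与支持的数据类型集合均为示例假设:

```python
# 简化示意:类型命中后,再按数据维度判断是否适合近存计算:
# 数据量(行数×列数)需大于预设阈值,且数据类型需与近存计算单元支持的一致。
SIZE_THRESHOLD = 1024              # 预设数据量阈值:示例假设
SUPPORTED_DTYPES = {"float32"}     # 近存计算单元支持的数据类型:示例假设

def suitable_for_near_mem(rows, cols, dtype):
    big_enough = rows * cols > SIZE_THRESHOLD
    dtype_ok = dtype in SUPPORTED_DTYPES
    return big_enough and dtype_ok

assert suitable_for_near_mem(64, 64, "float32")      # 4096 > 1024 且类型匹配
assert not suitable_for_near_mem(8, 8, "float32")    # 数据量过小
assert not suitable_for_near_mem(64, 64, "int8")     # 数据类型不匹配
```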
在一些实施例中,任务调度器在确定了第一计算任务的计算类型属于预设的计算类型后,若进一步确定该第一计算任务的数据维度不适合进行近存计算,则该任务调度器可以将该第一计算任务调度至其他计算核进行正常计算;若进一步确定该第一计算任务的数据维度适合进行近存计算,则该任务调度器可以调用负载生成函数生成负载信息,并将该负载信息调度至处理器。
在另一些实施例中,当计算任务序列中的第二计算任务不属于预设的计算类型时,说明该第二计算任务不适合近存计算,例如,该第二计算任务为控制流、激活函数等计算任务,则该任务调度器可以将该计算任务调度至第二计算单元进行正常计算,从而计算业务单元无需感知计算单元是否支持近存计算,进而降低了软件实现的复杂度。例如,该第二计算单元可以是上述处理器、图像处理单元、AI处理单元、数字信号处理器或专用逻辑电路中的至少一个,其中,该第二计算单元可以和存储器通过总线耦合。
该第一负载信息可以用于定义该第一计算任务,从而近存计算单元可以根据该第一负载信息计算该第一计算任务。
该负载信息可以包括数据地址、数据维度、控制命令字等,对于该负载信息的具体描述可以参见前文中的相关描述,此处不再详述。
应理解,该负载信息还可以包括其他进行数据计算所需要的信息。
303,任务调度器将第一负载信息调度至处理器。相应的,处理器接收该第一负载信息。
其中,该处理器可以是CPU、GPU、NPU等等,该处理器还可以是CPU、GPU、NPU中的一个或多个计算核或计算单元,本申请实施例对此不予限定。
304,处理器将该第一负载信息调度至存储器。
在一种可能的实现方式中,处理器通过DMA控制器将第一负载信息调度至存储器。具体地,处理器向DMA控制器发送指令,DMA控制器根据该指令通过DMA写信号将负载信息传输至存储器中的固定地址中,例如,DMA控制器根据该指令利用总线将负载信息从源地址搬运至目的地址。
其中,该存储器中预留了专门用于存储近存计算相关的信息的预留地址,即第一地址,该预留地址中可以用于存储该第一负载信息,从而近存计算单元可以访问该预留地址,以获取该第一负载信息。
这样,处理器通过已有的DMA机制实现负载信息的下发,可以无需增加近存计算专用指令,从而处理器无需单独通过总线下发调度信息,从而可以节省系统总线开销,提升计算效率。此外,由于复用现有的DMA机制,可以降低设计的复杂度。
在其他的实施例中,处理器还可以通过其他方式将该负载信息调度至近存计算单元中,例如,通过配置寄存器等方式将该负载信息调度至近存计算单元中。例如,处理器将负载信息写入寄存器中,近存计算单元读取该寄存器,从寄存器中获取该负载信息,以完成计算。
305,近存计算单元从存储器中获取该第一负载信息。
示例性地,近存计算单元可以从存储器中的预留地址中获取该第一负载信息,由于近存计算单元位于存储器附近,访问存储器无需经过总线,因而可以降低数据传输的时延和功耗。
在其他的实施例中,该近存计算单元还可以通过近存计算指令获取该负载信息。
应理解,该近存计算单元还可以通过其他方式获取该负载信息。
306,近存计算单元根据该第一负载信息完成第一计算任务。
示例性地,该第一负载信息定义了地址A中的矩阵A1与地址B中的矩阵B1进行矩阵乘运算,则该近存计算单元可以根据该第一负载信息指示存储器完成相应的计算。
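“按负载信息完成地址A中的矩阵A1与地址B中的矩阵B1的矩阵乘运算”这一步,可以用如下Python片段简化示意。其中用字典模拟按地址存取,地址名与负载信息字段名均为示例假设:

```python
# 简化示意:近存计算单元按负载信息完成 A1 × B1 的矩阵乘,并将结果写回存储器。
memory = {
    "A": [[1, 2], [3, 4]],      # 地址A处的矩阵A1
    "B": [[5, 6], [7, 8]],      # 地址B处的矩阵B1
}
payload = {"src_a": "A", "src_b": "B", "dst": "C", "op": "matmul"}

def near_mem_execute(mem, info):
    """按负载信息读取两个源矩阵,做矩阵乘,结果写入目的地址。"""
    a, b = mem[info["src_a"]], mem[info["src_b"]]
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    mem[info["dst"]] = c        # 计算结果写入存储器

near_mem_execute(memory, payload)
assert memory["C"] == [[19, 22], [43, 50]]
```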
307,近存计算单元将计算完成的信息发送至存储器。
其中,近存计算单元在完成计算后,将计算结果写入存储器中,也即将计算完成的信息发送至存储器。
308,存储器向处理器发送响应信号。
示例性地,存储器接收到近存计算完成的信息或指令后,可以向DMA控制器发送DMA响应信号,该DMA控制器接收到DMA响应信号后,向处理器发送DMA传输完成的信号或指令。
基于本申请实施例,近存计算单元可以在存储器附近完成计算,从而可以降低数据传输的时延与功耗,从而提高系统计算效率。
该技术方案中,任务调度器可以对计算任务进行解析以确定计算任务是否支持近存计算,从而计算业务单元无需感知近存计算单元的计算能力,从而降低了软件复杂度。此外,任务调度器中的预设计算类型可以针对不同的硬件平台进行更新,进一步提高了兼容性。进一步地,处理器可以通过已有的DMA机制将近存计算需要的负载信息调度至近存计算单元,从而无需新增加近存计算指令用于调度计算任务,因此,可以节省总线开销,提升计算效率。
图7是本申请实施例提供的另一种计算任务调度方法的示意性流程图。如图7所示,该方法500可以包括步骤401至步骤407。
401,计算业务单元将计算任务序列发送至任务调度器TS。相应的,任务调度器接收该计算任务序列。
402,任务调度器为存内计算单元确定第三计算任务,生成第三计算任务的第二负载信息。
403,任务调度器将第二负载信息调度至处理器。相应的,处理器接收该第二负载信息。
404,处理器将第二负载信息调度至存储器。
应理解,步骤401至步骤404可以参见步骤301至步骤304的相关描述,为了简洁,不再赘述。
405,存内计算单元从存储器中获取该第二负载信息。
示例性地,该存储器中可以为存内计算单元预留专门用于存储存内计算相关的信息的预留地址,该预留地址可以用于存储该第二负载信息,从而存内计算单元可以访问该预留地址,以获取该第二负载信息。
406,存内计算单元根据该第二负载信息完成第三计算任务。
407,存储器向处理器发送响应信号。
应理解,步骤406至步骤407可以参见步骤306和步骤308的相关描述,为了简洁,不再赘述。
在另一些实施例中,该存内计算单元和存储器还可以用存算一体单元进行代替。
基于本申请实施例,存内计算单元可以在存储器内部完成计算,从而无需新增存内计算指令,节省了总线开销,可以降低数据传输的时延与功耗,从而提高系统计算效率。
该技术方案中,任务调度器可以对计算任务进行解析以确定计算任务是否支持存内计算,从而计算业务单元无需感知存内计算单元支持的计算任务的计算类型,降低了软件复杂度。此外,任务调度器中的预设计算类型可以针对不同的存内计算单元和存储器进行更新,进一步提高了兼容性。进一步地,处理器可以通过已有的DMA机制将存内计算需要的负载信息调度至存内计算单元,无需通过总线单独传输存内计算指令,从而可以节省总线开销,提升计算效率。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在计算机上运行时,使得如前文中任一项所述的计算任务调度方法被执行。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在计算机上运行时,使得如前文中任一项所述的计算方法被执行。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的计算任务调度方法。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的计算方法。
本申请实施例还提供了一种计算系统,包括如前文中任一项所述的计算任务调度装置和计算装置。
另外,本申请的实施例还提供一种装置,这个装置具体可以是芯片,组件或模块,该装置可包括相连的处理器和存储器;其中,存储器用于存储计算机执行指令,当装置运行时,处理器可执行存储器存储的计算机执行指令,以使芯片执行上述各方法实施例中的计算任务调度方法或计算方法。
其中,本实施例提供的计算任务调度装置、计算装置、计算机可读存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (25)

  1. 一种计算任务调度装置,其特征在于,包括:
    任务调度器,用于为第一计算单元确定第一计算任务,生成所述第一计算任务的负载信息,所述负载信息用于定义所述第一计算任务;
    处理器,用于从所述任务调度器接收所述负载信息,将所述负载信息存储至存储器中的第一地址以将所述第一计算任务分配给所述第一计算单元,所述第一地址为所述第一计算单元的预留地址,其中,所述处理器与所述存储器和所述第一计算单元中的至少一个通过总线耦合,所述第一计算单元与所述存储器紧密耦合,所述紧密耦合无需经过任何总线,且所述第一计算单元能够以高于总线接入的速度接入所述存储器。
  2. 根据权利要求1所述的装置,其特征在于,所述处理器具体用于:
    通过直接存储访问DMA控制器将所述负载信息存储至所述第一地址。
  3. 根据权利要求1或2所述的装置,其特征在于,所述任务调度器为系统软件或应用软件之外的专用任务调度器。
  4. 根据权利要求3所述的装置,其特征在于,所述任务调度器还用于:
    接收来自所述系统软件或所述应用软件的计算任务序列,在所述计算任务序列中为所述第一计算单元确定所述第一计算任务。
  5. 根据权利要求4所述的装置,其特征在于,所述任务调度器还用于:
    在所述计算任务序列中为第二计算单元确定第二计算任务;
    将所述第二计算任务调度至第二计算单元;
    其中,所述第二计算单元包括所述处理器、图像处理单元、人工智能AI处理单元、数字信号处理器或专用逻辑电路中的至少一个,所述第二计算单元和所述存储器通过总线耦合。
  6. 根据权利要求4所述的装置,其特征在于,所述任务调度器具体用于:
    根据计算列表,在所述计算任务序列中为所述第一计算单元确定所述第一计算任务,其中,所述计算列表包括所述第一计算单元支持的计算任务类型。
  7. 根据权利要求1-6中任一项所述的装置,其特征在于,所述负载信息包括如下信息中的至少一种:
    数据地址;数据维度;或控制命令字。
  8. 根据权利要求1-7中任一项所述的装置,其特征在于,所述紧密耦合包括近存计算耦合、存内计算耦合或存算一体耦合。
  9. 一种计算装置,其特征在于,包括:
    存储器;
    第一计算单元,用于从所述存储器中的第一地址获取负载信息,并根据所述负载信息完成第一计算任务,其中,所述负载信息用于定义所述第一计算任务,所述第一地址为所述第一计算单元的预留地址;
    其中,所述第一计算单元与所述存储器紧密耦合,所述紧密耦合无需经过任何总线,且所述第一计算单元能够以高于总线接入的速度接入所述存储器;
    所述存储器和所述第一计算单元中的至少一个通过总线耦合至处理器。
  10. 根据权利要求9所述的装置,其特征在于,所述负载信息包括如下信息中的至少一种:
    数据地址;数据维度;或控制命令字。
  11. 根据权利要求9或10所述的装置,其特征在于,所述紧密耦合包括近存计算耦合、存内计算耦合或存算一体耦合。
  12. 根据权利要求9-11中任一项所述的装置,其特征在于,所述存储器具体用于:
    在直接存储访问DMA控制器的操作下在所述第一地址写入所述负载信息。
  13. 一种计算任务调度方法,其特征在于,包括:
    任务调度器为第一计算单元确定第一计算任务,生成所述第一计算任务的负载信息,所述负载信息用于定义所述第一计算任务;
    处理器从所述任务调度器接收所述负载信息,将所述负载信息存储至存储器中的第一地址以将所述第一计算任务分配给所述第一计算单元,所述第一地址为所述第一计算单元的预留地址,其中,所述处理器与所述存储器和所述第一计算单元中的至少一个通过总线耦合,所述第一计算单元与所述存储器紧密耦合,所述紧密耦合无需经过任何总线,且所述第一计算单元能够以高于总线接入的速度接入所述存储器。
  14. 根据权利要求13所述的方法,其特征在于,所述将所述负载信息存储至存储器中的第一地址以将所述第一计算任务分配给所述第一计算单元,包括:
    通过直接存储访问DMA控制器将所述负载信息存储至所述第一地址以将所述第一计算任务分配给所述第一计算单元。
  15. 根据权利要求13或14所述的方法,其特征在于,所述任务调度器为系统软件或应用软件之外的专用任务调度器。
  16. 根据权利要求15所述的方法,其特征在于,所述任务调度器为第一计算单元确定第一计算任务,包括:
    所述任务调度器接收来自所述系统软件或所述应用软件的计算任务序列,在所述计算任务序列中为所述第一计算单元确定所述第一计算任务。
  17. 根据权利要求16所述的方法,其特征在于,所述方法还包括:
    在所述计算任务序列中为第二计算单元确定第二计算任务;
    将所述第二计算任务调度至第二计算单元;
    其中,所述第二计算单元包括所述处理器、图像处理单元、人工智能AI处理单元、数字信号处理器或专用逻辑电路中的至少一个,所述第二计算单元和所述存储器通过总线耦合。
  18. 根据权利要求16所述的方法,其特征在于,所述在所述计算任务序列中为所述第一计算单元确定所述第一计算任务,包括:
    根据计算列表,在所述计算任务序列中为所述第一计算单元确定所述第一计算任务,其中,所述计算列表包括所述第一计算单元支持的计算任务类型。
  19. 根据权利要求13-18中任一项所述的方法,其特征在于,所述负载信息包括如下信息中的至少一种:
    数据地址;数据维度;或控制命令字。
  20. 根据权利要求13-19中任一项所述的方法,其特征在于,所述紧密耦合包括近存计算耦合、存内计算耦合或存算一体耦合。
  21. 一种计算方法,其特征在于,包括:
    第一计算单元从存储器中的第一地址获取负载信息,并根据所述负载信息完成第一计算任务,其中,所述负载信息用于定义所述第一计算任务,所述第一地址为所述第一计算单元的预留地址;
    其中,所述第一计算单元与所述存储器紧密耦合,所述紧密耦合无需经过任何总线,且所述第一计算单元能够以高于总线接入的速度接入所述存储器;
    所述存储器和所述第一计算单元中的至少一个通过总线耦合至处理器。
  22. 根据权利要求21所述的方法,其特征在于,所述负载信息包括如下信息中的至少一种:
    数据地址;数据维度;或控制命令字。
  23. 根据权利要求21或22所述的方法,其特征在于,所述紧密耦合包括近存计算耦合、存内计算耦合或存算一体耦合。
  24. 根据权利要求21-23中任一项所述的方法,其特征在于,所述方法还包括:
    所述存储器在直接存储访问DMA控制器的操作下在所述第一地址写入所述负载信息。
  25. 一种计算机可读存储介质,其特征在于,包括:所述存储介质中存储有计算机程序或指令,当所述计算机程序或指令被通信装置执行时,使得如权利要求13-20中任一项所述的计算任务调度方法被执行,或者,使得如权利要求21-24中任一项所述的计算方法被执行。
PCT/CN2022/075123 2022-01-29 2022-01-29 计算任务调度装置、计算装置、计算任务调度方法和计算方法 WO2023142091A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202280004975.4A CN116897581A (zh) 2022-01-29 2022-01-29 计算任务调度装置、计算装置、计算任务调度方法和计算方法
EP22922920.8A EP4426037A1 (en) 2022-01-29 2022-01-29 Computing task scheduling apparatus, computing apparatus, computing task scheduling method and computing method
PCT/CN2022/075123 WO2023142091A1 (zh) 2022-01-29 2022-01-29 计算任务调度装置、计算装置、计算任务调度方法和计算方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/075123 WO2023142091A1 (zh) 2022-01-29 2022-01-29 计算任务调度装置、计算装置、计算任务调度方法和计算方法

Publications (1)

Publication Number Publication Date
WO2023142091A1 true WO2023142091A1 (zh) 2023-08-03

Family

ID=87470250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075123 WO2023142091A1 (zh) 2022-01-29 2022-01-29 计算任务调度装置、计算装置、计算任务调度方法和计算方法

Country Status (3)

Country Link
EP (1) EP4426037A1 (zh)
CN (1) CN116897581A (zh)
WO (1) WO2023142091A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738881B1 (en) * 1999-06-09 2004-05-18 Texas Instruments Incorporated Multi-channel DMA with scheduled ports
CN101630053A (zh) * 2008-07-15 2010-01-20 鸿富锦精密工业(深圳)有限公司 微组合镜片装置及其制造方法
CN110049130A (zh) * 2019-04-22 2019-07-23 北京邮电大学 一种基于边缘计算的服务部署和任务调度方法及装置
CN110678847A (zh) * 2017-05-30 2020-01-10 超威半导体公司 用于gpu任务调度的连续分析任务
CN111651253A (zh) * 2020-05-28 2020-09-11 中国联合网络通信集团有限公司 算力资源的调度方法及装置
CN111656335A (zh) * 2018-01-29 2020-09-11 美光科技公司 存储器控制器
CA3083316A1 (en) * 2019-06-11 2020-12-11 Comcast Cable Communications, Llc Wireless communications and control information transmission/reception


Also Published As

Publication number Publication date
EP4426037A1 (en) 2024-09-04
CN116897581A (zh) 2023-10-17

Similar Documents

Publication Publication Date Title
CN111630505B (zh) 深度学习加速器系统及其方法
CN113918101B (zh) 一种写数据高速缓存的方法、系统、设备和存储介质
CN103119912A (zh) 多处理器计算平台中的处理器间通信技术
WO2021115208A1 (zh) 神经网络处理器、芯片和电子设备
US11442866B2 (en) Computer memory module processing device with cache storage
CN115033188B (zh) 一种基于zns固态硬盘的存储硬件加速模块系统
CN106250348A (zh) 一种基于gpu访存特性的异构多核架构缓存管理方法
CN114900699A (zh) 视频编解码卡虚拟化方法、装置、存储介质及终端
CN118035618B (zh) 数据处理器、数据处理方法、电子设备、存储介质
CN115686836A (zh) 一种安装有加速器的卸载卡
CN111459668A (zh) 用于服务器的轻量级资源虚拟化方法及轻量级资源虚拟化装置
CN108829530B (zh) 一种图像处理方法及装置
WO2021115149A1 (zh) 神经网络处理器、芯片和电子设备
US12073261B2 (en) Synchronization method and apparatus
CN116483536B (zh) 数据调度方法、计算芯片及电子设备
WO2023142091A1 (zh) 计算任务调度装置、计算装置、计算任务调度方法和计算方法
EP4432210A1 (en) Data processing method and apparatus, electronic device, and computer-readable storage medium
EP2689325A1 (en) Processor system with predicate register, computer system, method for managing predicates and computer program product
US20240220315A1 (en) Dynamic control of work scheduling
WO2023134588A1 (zh) 计算系统、方法、装置及加速设备
CN114860461B (zh) Gpu设备间高效内存置换的方法、系统、设备及存储介质
US20240143498A1 (en) Methods, devices, and systems for allocating memory space
KR102260820B1 (ko) 대칭적 인터페이스 기반 인터럽트 신호 처리 장치 및 방법
WO2023045478A1 (zh) 图任务调度方法、执行端设备、存储介质及程序产品
CN118170558A (zh) 函数调用的核间通信封装方法、装置以及计算机设备

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280004975.4

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22922920

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022922920

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022922920

Country of ref document: EP

Effective date: 20240529

NENP Non-entry into the national phase

Ref country code: DE