WO2023123395A1 - Computing task processing device, method, and electronic device - Google Patents

Computing task processing device, method, and electronic device

Info

Publication number
WO2023123395A1
WO2023123395A1 (PCT/CN2021/143792)
Authority
WO
WIPO (PCT)
Prior art keywords
special-purpose processor
computing
instruction
task
Prior art date
Application number
PCT/CN2021/143792
Other languages
English (en)
French (fr)
Inventor
徐涛
石洁珂
王晓禹
郑明�
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180035983.0A (CN116848509A)
Priority to PCT/CN2021/143792 (WO2023123395A1)
Publication of WO2023123395A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application relates to the technical field of data processing, and in particular to a computing task processing device, method and electronic equipment.
  • a central processing unit (CPU) and a neural network processor (neural-network processing unit, NPU) are usually integrated in the SoC, and the CPU multiplexes the computing resources of the NPU to perform AI business-related computing tasks.
  • the prior art provides an architecture in which the CPU multiplexes computing resources of the NPU based on software.
  • the CPU and the NPU are coupled to the system bus; a software stack runs on the CPU, and the software stack includes an NPU driver located in the kernel, and an NPU runtime and an application (APP) located in the user space.
  • when the APP generates a computing task that requires the CPU to multiplex the NPU's computing resources, the APP sends the computing task to the NPU through the NPU runtime and the NPU driver, via the CPU and the system bus.
  • the NPU processes the computing task when it receives the computing task.
  • the CPU needs to switch between the user mode and the kernel mode, and there are many layers in the software stack, resulting in a large overhead, which makes this approach unsuitable for scenarios in which the CPU and the NPU interact frequently.
  • Embodiments of the present application provide a computing task processing device, method, and electronic equipment, which are used to reduce the overhead of CPU multiplexing NPU computing resources and improve the interaction efficiency between CPU and NPU.
  • a computing task processing device is provided, which includes: a general-purpose processor and a special-purpose processor, coupled through a physical interface; for example, the general-purpose processor is a CPU and the special-purpose processor is an NPU. The general-purpose processor is used to send a first instruction to the special-purpose processor through the physical interface, where the first instruction is an instruction in the instruction set of the general-purpose processor directed at the special-purpose processor and is used to instruct the special-purpose processor to process a first computing task. The special-purpose processor is used to receive and execute the first instruction through the physical interface (that is, the first instruction is an instruction received by the special-purpose processor directly through the physical interface, not an instruction fetched from memory in a manner similar to software-stack scheduling), and to process the first computing task according to the first instruction.
  • the general-purpose processor and the special-purpose processor are coupled through a physical interface at the physical layer, so that the general-purpose processor can directly send the first instruction to the special-purpose processor through the physical interface, so as to schedule the special-purpose processor to process the first calculation task; that is, the general-purpose processor can directly multiplex the special-purpose processor through the physical interface, and the multiplexing process does not need to be realized by software, so the overhead is small, and the interaction efficiency between the general-purpose processor and the special-purpose processor is improved.
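  • To make the synchronous path concrete, the following is a minimal C sketch of the programmer-visible model, not taken from the patent: the call stands in for a single hypothetical extended instruction that the CPU issues over the physical interface and waits on; the name npu_matmul and its operands are invented for illustration.

        #include <stdint.h>

        /* Stand-in for a hypothetical extended instruction. On real hardware this
         * would be one opcode forwarded over the physical interface; the CPU core
         * would wait until the NPU completes (synchronous multiplexing). */
        static void npu_matmul(const float *a, const float *b, float *c,
                               uint32_t m, uint32_t k, uint32_t n) {
            for (uint32_t i = 0; i < m; i++)
                for (uint32_t j = 0; j < n; j++) {
                    float s = 0.0f;
                    for (uint32_t p = 0; p < k; p++)
                        s += a[i * k + p] * b[p * n + j];
                    c[i * n + j] = s;
                }
        }

        void small_ai_kernel(const float *a, const float *b, float *c) {
            npu_matmul(a, b, c, 16, 16, 16);  /* no driver call, no user/kernel switch */
            /* execution resumes here only after the task has finished */
        }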
  • the device further includes: a buffer coupled to the special-purpose processor; the buffer is used to store task data of the first computing task; the special-purpose processor is used to read the task data from the buffer, and/or to cache the task data in the buffer.
  • the efficiency of reading or storing task data by the dedicated processor can be improved, thereby improving the processing efficiency of computing tasks.
  • the cache is a cache of the general-purpose processor; or, the general-purpose processor is coupled to the cache.
  • the efficiency of the general-purpose processor in reading data from the buffer or storing data in the buffer can be improved, and at the same time, the design flexibility of the buffer can be improved.
  • the general-purpose processor and the special-purpose processor share the same page table, and the page table is used to indicate the mapping relationship between the logical address and the physical address of the task data in the buffer.
  • when the general-purpose processor and the special-purpose processor read data from the buffer or store data in the buffer, no additional address translation is required, which reduces power consumption and improves data read and write efficiency.
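  • As an assumption-laden model of the shared page table (a single-level walk is shown; real translation hardware is multi-level), both processors can evaluate the same lookup, so a logical address resolves to the same physical address on either side with no extra conversion step.

        #include <stdint.h>

        #define PAGE_SHIFT 12           /* assumed 4 KiB pages */
        #define PAGE_ENTRIES 512

        /* One table shared by the general-purpose and special-purpose processors. */
        typedef struct {
            uint64_t phys_page[PAGE_ENTRIES];   /* physical page numbers */
        } SharedPageTable;

        uint64_t translate(const SharedPageTable *pt, uint64_t logical) {
            uint64_t vpn = (logical >> PAGE_SHIFT) % PAGE_ENTRIES;
            uint64_t offset = logical & ((1u << PAGE_SHIFT) - 1);
            return (pt->phys_page[vpn] << PAGE_SHIFT) | offset;
        }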
  • a software stack of the special-purpose processor runs on the general-purpose processor; the general-purpose processor is further configured to send an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction.
  • for example, when an application running on the general-purpose processor generates a computing task, the application calls the software stack so that the software stack generates the indication message; the general-purpose processor then sends the indication message to the special-purpose processor through the system bus, the indication message can be an interrupt signal, and the general-purpose processor does not perceive the second instruction. After receiving the indication message, the special-purpose processor parses it through its own software stack to obtain the second instruction, and processes the second computing task according to the second instruction; the second instruction is an instruction in the instruction set of the special-purpose processor.
  • the general-purpose processor can also multiplex computing resources of the special-purpose processor based on the software stack, and can process other tasks while the special-purpose processor is processing the second computing task, thereby improving resource utilization.
  • the calculation amount of the first calculation task is smaller than the calculation amount of the second calculation task.
  • the general-purpose processor can multiplex the special-purpose processor through the physical interface to handle computing tasks with a small amount of calculation, and multiplex the special-purpose processor through the software stack to handle computationally intensive computing tasks. This is because multiplexing the special-purpose processor through the software stack is suitable for computing tasks with a large amount of calculation, which generally require a long calculation time and are not sensitive to scheduling delays, while multiplexing the special-purpose processor through the physical interface is suitable for computing tasks with a small amount of calculation, which require a short calculation time and are sensitive to scheduling delays.
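  • A minimal dispatch sketch of that split follows; the threshold is an assumption (the patent fixes no number), and any real system would tune it empirically.

        #include <stdbool.h>
        #include <stdint.h>

        #define SMALL_TASK_OPS (1u << 20)   /* assumed small/large boundary */

        typedef struct { uint64_t ops; /* estimated calculation amount */ } Task;

        /* true: issue directly over the physical interface (low scheduling delay);
         * false: hand to the software stack (higher overhead, fine for long work). */
        bool use_physical_interface(const Task *t) {
            return t->ops < SMALL_TASK_OPS;
        }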
  • the first computing task and the second computing task are two concurrent computing tasks.
  • the processing efficiency and resource utilization rate of computing tasks can be improved.
  • the dedicated processor includes: a control unit and at least one computing unit; the control unit is configured to, when receiving multiple computing tasks, allocate the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks; the multiple computing tasks may include only computing tasks indicated by one of the first instruction or the second instruction, or may include computing tasks indicated by both the first instruction and the second instruction; the at least one preset parameter includes at least one of the following: priority and task type.
  • when the at least one preset parameter includes the task type, the at least one computing unit includes: a vector operation unit, configured to process the computing tasks whose task type is vector operation among the multiple computing tasks; and a matrix operation unit, configured to process the computing tasks whose task type is matrix operation among the multiple computing tasks.
  • the general-purpose processor includes a central processing unit CPU, an image processing unit GPU with a scheduling function (also called a macro GPU, such as a GPU with a CPU integrated inside), or a digital signal processor DSP with a scheduling function.
  • the dedicated processor includes at least one of the following: a neural network processor (NPU), a digital signal processor (DSP), and an image processing unit (GPU).
  • a computing task processing method is provided, which is applied to a device including a general-purpose processor and a special-purpose processor coupled through a physical interface. The method includes: the general-purpose processor sends a first instruction to the special-purpose processor through the physical interface, where the first instruction is an instruction in the instruction set of the general-purpose processor directed at the special-purpose processor and is used to instruct the special-purpose processor to process a first computing task;
  • the special-purpose processor receives and executes the first instruction through the physical interface, and processes the first computing task according to the first instruction.
  • the device further includes a buffer coupled to the dedicated processor, and the method further includes: the dedicated processor reads the task data of the first computing task from the buffer; or, the dedicated processor caches the task data of the first computing task in the buffer.
  • the buffer is a buffer of the general-purpose processor; or, the general-purpose processor is coupled to the buffer; wherein, the general-purpose processor and the special-purpose processor share the same page table, and the page table is used to indicate the mapping relationship between the logical address and the physical address of the task data in the buffer.
  • the general-purpose processor runs a software stack of the special-purpose processor, and the method includes: the general-purpose processor sends an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction.
  • for example, when an application running on the general-purpose processor generates a computing task, the application calls the software stack so that the software stack generates the indication message corresponding to the computing task.
  • the general-purpose processor sends the indication message to the special-purpose processor through the system bus; the indication message can be an interrupt signal, and the general-purpose processor does not perceive the second instruction. After receiving the indication message, the special-purpose processor parses it through its own software stack to obtain the second instruction, and processes the second computing task according to the second instruction; the second instruction is an instruction of the special-purpose processor.
  • the calculation amount of the first calculation task is smaller than the calculation amount of the second calculation task.
  • the first computing task and the second computing task are two concurrent computing tasks.
  • the dedicated processor includes a control unit and at least one computing unit, and the method further includes: when receiving multiple computing tasks, the control unit allocates the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks; the multiple computing tasks may include only computing tasks indicated by one of the first instruction or the second instruction, or may include computing tasks indicated by both the first instruction and the second instruction; the at least one preset parameter includes at least one of the following: priority and task type.
  • when the at least one preset parameter includes the task type and the at least one computing unit includes a vector operation unit and a matrix operation unit, the method further includes: the vector operation unit processes the computing tasks whose task type is vector operation among the multiple computing tasks; the matrix operation unit processes the computing tasks whose task type is matrix operation among the multiple computing tasks.
  • the general-purpose processor includes a central processing unit CPU, an image processing unit GPU with a scheduling function (for example, a GPU that internally integrates a CPU), or a digital signal processor DSP with a scheduling function;
  • the dedicated processor includes at least one of the following: a neural network processor NPU, a digital signal processor DSP, and an image processing unit GPU.
  • a system-on-chip SoC is provided, and the SoC is integrated with the computing task processing device provided by the first aspect or any possible implementation manner of the first aspect.
  • in another aspect of the present application, an electronic device is provided, which includes the computing task processing apparatus provided in the first aspect or any possible implementation manner of the first aspect.
  • a computer-readable storage medium is provided; instructions are stored in the computer-readable storage medium, and when the instructions are run on a device, the device is made to perform the computing task processing method provided by the second aspect or any possible implementation manner of the second aspect.
  • a computer program product is provided, which includes: a computer program (also referred to as code, or instructions); when the computer program is executed, the computer is made to perform the computing task processing method provided by the above second aspect or any possible implementation manner of the second aspect.
  • for the beneficial effects achievable by any of the computing task processing methods, electronic devices, computer-readable storage media, and computer program products provided above, reference may be made to the beneficial effects of the computing task processing device provided above; they are not repeated here.
  • FIG. 1 is a schematic diagram of the architecture of a first type of processor
  • FIG. 2 is a schematic diagram of the architecture of a second type of processor
  • FIG. 3 is a schematic diagram of the architecture of a third type of processor
  • FIG. 4 is a schematic structural diagram of a computing task processing device provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a CPU multiplexing NPU provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another CPU multiplexing NPU provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an NPU provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another computing task processing device provided by an embodiment of the present application.
  • "At least one" means one or more, and "multiple" means two or more.
  • “And/or” describes the association relationship of associated objects, indicating that there may be three types of relationships, for example, A and/or B, which can mean: A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the contextual objects are an “or” relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • For example, at least one item (piece) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c can be single or multiple.
  • the embodiments of the present application use words such as "first” and "second” to distinguish the same or similar items with basically the same function and effect.
  • the first threshold and the second threshold are only used to distinguish different thresholds, and their sequence is not limited. Those skilled in the art can understand that words such as “first” and “second” do not limit the quantity and execution order.
  • SoCs are usually integrated with special-purpose processors such as NPU or digital signal processor (DSP) suitable for AI operations, and the special-purpose processors usually include matrix operation units and vector operation units.
  • general-purpose processors such as the CPU and the graphics processing unit (GPU) can also handle AI operations, but compared with special-purpose processors, there is a certain gap in terms of energy efficiency, area, and flexibility.
  • FIG. 1 is a schematic diagram of the architecture of the first type of processor, which includes two different processors coupled in a loosely coupled manner.
  • the architecture includes a system bus, a CPU and an NPU coupled to the system bus.
  • a software stack runs on the CPU, and the software stack includes an NPU driver located in the kernel, and an NPU runtime and an application program APP located in the user space.
  • the NPU includes a matrix operation unit and a vector operation unit.
  • the CPU and NPU can be executed asynchronously, that is, they can be used to execute different services at the same time.
  • the APP generates computing tasks
  • the CPU can multiplex the computing resources of the NPU based on the NPU driver in the software stack and the NPU runtime.
  • the specific process can include: the CPU sends an interrupt signal to the NPU, and when the NPU receives the interrupt signal, the NPU obtains the corresponding instruction from the memory after parsing through the NPU software stack, and processes the computing task according to the instruction.
  • the CPU needs to switch between the user state and the kernel state, and there are many layers in the software stack, resulting in high overhead, which is not suitable for scenarios where the CPU and NPU interact frequently.
  • FIG. 2 is a schematic diagram of the architecture of the second type of processor, which includes a processor with a matrix operation unit inside.
  • the architecture includes a CPU; the CPU includes a CPU core (core), and a matrix operation unit and a cache that are each coupled to the CPU core, and the matrix operation unit and the cache are also coupled to each other.
  • the CPU core can drive the matrix operation unit to run through a customized instruction, so as to process the AI operation through the matrix operation unit.
  • the advantage of this architecture is that the scheduling overhead is small, and it is suitable for scenarios where CPU cores and matrix operation units interact frequently.
  • the matrix operation unit can only be used to process AI operations with a small amount of calculation and some ordinary matrix operations, and cannot be applied to AI operations with a large amount of calculation.
  • FIG. 3 is a schematic diagram of the architecture of the third processor, which is a combination of the above two architectures.
  • the architecture includes a system bus, a CPU and an NPU coupled to the system bus.
  • the CPU is provided with a matrix operation unit, and a software stack runs on the CPU, and the software stack includes an NPU driver located in the kernel, and an NPU runtime and an application program APP located in the user space.
  • the NPU includes a matrix operation unit and a vector operation unit.
  • the CPU drives the internal matrix operation unit to run through custom instructions; when there is an AI operation with a large amount of calculation that needs to be processed, the CPU multiplexes the matrix operation unit in the NPU through the software stack for processing.
  • This architecture can be used to process AI operations with a small amount of calculation, and can also be used to process AI operations with a large amount of calculation.
  • the matrix operation units included in the CPU and the NPU are physically independent of each other, thus occupying a large area, and the two matrix operation units cannot be used to process the same calculation task at the same time, thereby reducing the utilization of computing resources.
  • an embodiment of the present application provides a computing task processing device, which can be used to process computing tasks with different computing amounts, and has a small overhead when processing computing tasks with a small amount of computing.
  • compared with the structure of the third type of processor described above, the device can also reduce the occupied area and improve the utilization rate of computing resources.
  • the computing task processing device can be applied to electronic devices, which include but are not limited to: mobile phones, tablet computers, notebook computers, palmtop computers, mobile internet devices (MID), wearable devices (such as smart watches and smart bands), vehicle-mounted devices (for example, in cars, bicycles, electric vehicles, airplanes, ships, trains, and high-speed rail), virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, smart home devices (such as refrigerators, TVs, air conditioners, and electricity meters), intelligent robots, workshop equipment, wireless terminals in self-driving, wireless terminals in remote medical surgery, wireless terminals in smart grid, wireless terminals in transportation safety, wireless terminals in smart city or smart home, and flight devices (for example, intelligent robots, hot air balloons, drones, and airplanes).
  • FIG. 4 is a schematic structural diagram of a computing task processing device provided by an embodiment of the present application.
  • the computing task processing apparatus includes: a general purpose processor 201 and a special purpose processor 202 , and the general purpose processor 201 and the special purpose processor 202 are coupled through a physical interface 203 .
  • the general-purpose processor 201 is used to send a first instruction to the special-purpose processor 202 through the physical interface 203.
  • the first instruction is an instruction for the special-purpose processor 202 in the instruction set of the general-purpose processor 201.
  • the first instruction is used to instruct the dedicated processor 202 to process the first computing task; the dedicated processor 202 is configured to receive and execute the first instruction through the physical interface 203, and to process the first computing task according to the first instruction. That is, the general-purpose processor 201 may multiplex the dedicated processor 202 through the physical interface 203 to process the first computing task.
  • the general-purpose processor 201 may include one or more of a central processing unit CPU or other processors with a scheduling function, such as an image processor GPU with a scheduling function (also called a macro GPU, for example a GPU with an internally integrated CPU), or a digital signal processor DSP with a scheduling function.
  • the dedicated processor 202 may include one or more of a neural network processor NPU, a digital signal processor DSP, etc.
  • the neural network processor NPU may also be called an artificial intelligence AI processor.
  • the calculation task processing apparatus includes one or more general-purpose processors 201 and one or more special-purpose processors 202 , and each processor may include one or more processing cores.
  • in FIG. 4, the general-purpose processor 201 including a CPU and the special-purpose processor 202 including an NPU are taken as an example for illustration.
  • the dedicated processor 202 can be used to process operations on data of multiple different dimensions.
  • the data of multiple different dimensions can include one-dimensional data (for example, vectors), two-dimensional data (for example, matrices), and data of more than two dimensions (for example, three-dimensional data), etc.
  • the first instruction may be an extended instruction (also called a customized instruction) of the general-purpose processor 201, and the extended instruction can be used to instruct (or drive) the special-purpose processor 202 to process computing tasks; the first instruction may be generated by the general-purpose processor 201.
  • the first computing task may be a computing task generated by an application program (also referred to as a service) running on the general-purpose processor 201.
  • An application program may generate one or more computing tasks, and each computing task may correspond to a thread.
  • the first calculation task may be a calculation task corresponding to an AI operation, and the AI operation may be an operation on two-dimensional data, or an operation on more than two-dimensional data.
  • the general-purpose processor 201 can obtain the first instruction during execution and send the first instruction to the special-purpose processor 202 through the physical interface 203; when the special-purpose processor 202 receives and executes the first instruction, the first computing task can be processed (that is, the first instruction is an instruction received by the special-purpose processor directly through the physical interface, not an instruction fetched from memory in a manner similar to software scheduling).
  • while the dedicated processor 202 is processing the first calculation task, the general-purpose processor 201 may be in a waiting state, and after the dedicated processor 202 finishes processing the first calculation task, the general-purpose processor 201 may continue to perform subsequent operations.
  • the manner in which the general processor 201 multiplexes the dedicated processor 202 through the physical interface 203 to process the first computing task may be referred to as synchronously multiplexing the computing resources of the dedicated processor 202 .
  • as shown in FIG. 5, assuming the general-purpose processor 201 is a CPU and the special-purpose processor 202 is an NPU, the CPU multiplexing the NPU through the physical interface 203 may include: the CPU loads binary code into the instruction cache and performs instruction fetch and decode operations on the binary code; when the binary code contains an extended instruction, the decode operation can recognize the extended instruction; after the extended instruction passes through the issue queue and the store queue, the CPU sends the extended instruction through the physical interface 203 to the instruction buffer in the NPU, and the NPU completes the decode, dispatch, and execution process, that is, it processes the computing task corresponding to the extended instruction.
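  • The following is a conceptual software model of that decode-and-forward step, not the actual hardware logic; the opcode encoding is invented for illustration. The decoder recognizes an extended opcode and forwards the instruction through the physical interface instead of executing it locally.

        #include <stdint.h>
        #include <stdio.h>

        #define OP_EXTENDED 0x7Bu   /* invented opcode marker */

        /* Stand-in for the physical interface 203: in hardware this is a
         * wire-level transfer into the NPU's instruction buffer. */
        static void npu_instruction_buffer_push(uint32_t insn) {
            printf("forwarded 0x%08x to NPU\n", insn);
        }

        void decode_and_route(const uint32_t *code, int n) {
            for (int i = 0; i < n; i++) {
                uint32_t opcode = code[i] & 0x7Fu;   /* low 7 bits, assumed */
                if (opcode == OP_EXTENDED) {
                    npu_instruction_buffer_push(code[i]); /* NPU decodes, dispatches, executes */
                } else {
                    /* normal CPU execution path */
                }
            }
        }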
  • the general-purpose processor 201 and the special-purpose processor 202 are coupled at the physical layer through the physical interface 203, so that the general-purpose processor 201 can directly send the first instruction through the physical interface 203 to schedule the special-purpose processor 202 to process the first calculation task; that is, the general-purpose processor 201 can multiplex the special-purpose processor 202 through the physical interface 203, and the multiplexing process does not need to be implemented by software, so the overhead is small and the interaction efficiency between the general-purpose processor 201 and the special-purpose processor 202 is improved.
  • the computing task processing apparatus further includes: a cache (cache) 204 .
  • the buffer 204 is coupled to the special purpose processor 202 , so that the special purpose processor 202 can be used to store data in the buffer 204 and read data in the buffer 204 .
  • the cache 204 is a cache integrated in the general-purpose processor 201.
  • the general-purpose processor 201 is a CPU including a three-level cache (i.e., L1-L3 cache), and the cache 204 may be an L3 cache.
  • the buffer 204 is a buffer integrated outside the general-purpose processor 201, and the general-purpose processor 201 is coupled to the buffer 204; that is, the general-purpose processor 201 can be used to store data in the buffer 204 and to read data from the buffer 204.
  • the buffer 204 is integrated outside the general processor 201 as an example for illustration.
  • first task data of the first computing task is stored in the buffer 204; the first task data may be input data required for processing the first computing task (that is, data that the dedicated processor 202 needs to use when executing the first instruction), and the dedicated processor 202 is further configured to read the first task data from the buffer 204 during the processing of the first computing task.
  • the first task data may be data stored in the buffer 204 by the general processor 201 .
  • the dedicated processor 202 generates second task data while processing the first computing task; the second task data may be output data of the first computing task (that is, data output by the dedicated processor 202 after executing the first instruction), and the dedicated processor 202 is further configured to store the second task data in the buffer 204.
  • the general processor 201 may read the second task data from the buffer 204 .
  • the general-purpose processor 201 and the special-purpose processor 202 may share the same page table, and the page table may be used to indicate the mapping relationship between the logical addresses and the physical addresses of the first task data and/or the second task data in the buffer 204.
  • when the general-purpose processor 201 and the special-purpose processor 202 read data from the buffer 204 or store data in the buffer 204, no additional address translation is required, which reduces power consumption and improves data read and write efficiency.
  • the computing task processing device may further include a system bus 205 , and both the cache memory 204 and the dedicated processor 202 are coupled to the system bus 205 .
  • the computing task processing device may further include a memory; for example, the memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), referred to as DDR.
  • the dedicated processor 202 can also access the memory through the buffer 204 , or not through the buffer 204 , that is, the dedicated processor 202 has an independent channel for accessing the memory without passing through the buffer 204 .
  • the software stack of the special-purpose processor 202 runs on the general-purpose processor 201 , and the software stack may include the runtime and driver of the special-purpose processor 202 .
  • in FIG. 4, the software stack including the NPU runtime and the NPU driver is taken as an example for illustration.
  • the general-purpose processor 201 is also configured to send an indication message to the special-purpose processor 202 through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction, and the second instruction is an instruction in the instruction set of the special-purpose processor 202.
  • the general-purpose processor does not perceive the second instruction; for example, when an application running on the general-purpose processor 201 generates a computing task, the application calls the software stack so that the software stack generates the indication message corresponding to the computing task, and the general-purpose processor 201 then sends the indication message to the special-purpose processor 202 through the system bus, where the indication message can be an interrupt signal.
  • after receiving the indication message, the special-purpose processor 202 obtains the second instruction through its software stack (for example, fetches the second instruction from the memory), and processes the second computing task according to the second instruction. That is, the general-purpose processor 201 can also multiplex the computing resources of the special-purpose processor 202 based on the software stack; while the special-purpose processor 202 is processing the second computing task, the general-purpose processor 201 can continue to execute other tasks.
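  • A rough driver-side model of this asynchronous path is sketched below under heavy assumptions: the doorbell address, the descriptor layout, and the handler names are all invented. The driver makes the command visible in memory, rings a doorbell (the indication message), and returns so the CPU can run other work; completion arrives later as an interrupt.

        #include <stdint.h>

        typedef struct {
            uint64_t cmd_addr;   /* where the NPU fetches the second instruction */
            uint64_t cmd_len;
        } NpuCommand;

        /* Invented MMIO doorbell; a real driver maps this from the device. */
        static volatile uint32_t *const NPU_DOORBELL = (volatile uint32_t *)0x40000000u;

        void submit_async(const NpuCommand *cmd) {
            (void)cmd;            /* descriptor assumed already visible in DRAM */
            *NPU_DOORBELL = 1;    /* indication message: "go fetch and parse" */
            /* returns immediately; the CPU never sees the second instruction */
        }

        void npu_completion_irq(void) {
            /* raised by the NPU when the second computing task finishes; the
             * CPU resumes the task's follow-up work here */
        }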
  • after the special-purpose processor 202 finishes processing the second calculation task, the special-purpose processor 202 sends an interrupt to the general-purpose processor 201, so that the general-purpose processor 201 may continue to execute subsequent operations corresponding to the second computing task when receiving the interrupt.
  • the manner in which the general-purpose processor 201 multiplexes the special-purpose processor 202 to process the second computing task based on the software stack may be referred to as asynchronously multiplexing the computing resources of the special-purpose processor 202 .
  • the second computing task may be a computing task generated by an application program (also called a service) running on the general purpose processor 201 .
  • the first computing task and the second computing task may be two computing tasks generated by the same application program, or may be two computing tasks generated by different application programs.
  • the second calculation task may be a calculation task corresponding to the AI operation.
  • when an APP running on the CPU generates a second computing task and the CPU needs to multiplex the computing resources of the NPU to process the second computing task, the APP can send the second computing task to the NPU through the NPU runtime and the NPU driver, so that the NPU processes the second computing task.
  • the calculation amount of the first calculation task is smaller than the calculation amount of the second calculation task. That is, when the general-purpose processor 201 needs to multiplex the special-purpose processor 202 to process calculation tasks with different calculation amounts, the general-purpose processor 201 can multiplex the special-purpose processor 202 through the physical interface 203 to process the calculation tasks with a small calculation amount, and multiplex the special-purpose processor 202 through the software stack to handle computationally intensive computing tasks.
  • the general-purpose processor 201 can simultaneously multiplex the special-purpose processor 202 in the above two ways to process calculation tasks of different calculation amounts, or multiplex the special-purpose processor 202 in a time-division manner to process calculation tasks of different calculation amounts.
  • for example, the general-purpose processor 201 processes the first computing task by multiplexing the dedicated processor 202 through the physical interface 203 and, at the same time, processes the second computing task by multiplexing the dedicated processor 202 through the software stack.
  • the dedicated processor 202 may include: a control unit and a calculation unit, the number of the calculation unit may be one or more, and the control unit may be used to manage the calculation unit.
  • control unit may include a resource management unit and an instruction execution unit
  • the resource management unit may be used to manage and allocate computing units
  • the instruction execution unit may be responsible for functions such as instruction caching, instruction fetching, and decoding.
  • the calculation unit may include one or more calculation units of different dimensions, for example, the calculation unit may include a vector operation unit and a matrix operation unit. The following mainly takes the control unit as the main body, and introduces and explains the functions of the control unit in resource management.
  • control unit is configured to: when receiving multiple computing tasks, assign the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks.
  • the plurality of computing tasks includes a first computing task.
  • the at least one preset parameter includes at least one of the following: priority and task type.
  • when the at least one preset parameter includes the priority and the control unit receives multiple computing tasks, the control unit allocates the at least one computing unit to the multiple computing tasks in descending order of their priorities. If the calculation amount that the at least one calculation unit can handle is less than the calculation amount of the multiple calculation tasks, the control unit can give priority to assigning the at least one calculation unit to the high-priority calculation tasks, and after the at least one calculation unit completes the high-priority computing tasks, allocate the at least one computing unit to the low-priority computing tasks.
  • each computing unit in the at least one computing unit may include multiple computing blocks; each computing block in the multiple computing blocks may have the same or different computing capabilities, and one or more computing blocks may be used to process one computing task.
  • when the control unit allocates the computing blocks in a computing unit according to the priority or calculation amount of the computing tasks, it can allocate one or more computing blocks with matching computing capabilities to each computing task according to the calculation amount of each computing task.
  • when each computing unit includes multiple computing blocks, at least one computing block among the multiple computing blocks may be statically configured to process computing tasks in the manner of multiplexing the dedicated processor 202 through the physical interface 203; the at least one computing block can also be used to process computing tasks in the manner of multiplexing the dedicated processor 202 through the software stack.
  • the priority and calculation amount of a calculation task may be determined together with the calculation task, for example, the priority and calculation amount of the calculation task are correspondingly determined when the calculation task is generated.
  • when the at least one preset parameter includes at least two parameters, the at least two parameters can be considered together to determine the order in which the control unit allocates computing units to each computing task.
  • the general-purpose processor 201 may include a priority control unit, which may be used to support setting and querying of the priority of computing tasks, and the control unit in the special-purpose processor 202 may support Allocate computing units according to priority.
  • the priority control unit can provide an interface such as a register to allow business software to configure the priority, and can also maintain a priority queue that is allowed to be queried.
  • in the dedicated processor 202, when a high-priority service needs to use a computing unit, the control unit can preempt the running low-priority service and switch to scheduling the high-priority service first.
  • the services in the general-purpose processor 201 are prioritized at the granularity of threads, so that the control unit in the special-purpose processor 202 can ensure that high-priority threads are preferentially allocated computing units; if a low-priority thread cannot be allocated a computing unit, it can sleep, and after a computing unit is released, the low-priority thread is awakened and assigned to the computing unit.
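  • That sleep/wake behavior can be sketched with ordinary POSIX threads as an analogy only (the patent's control unit does this below the OS): a thread that cannot get a computing unit blocks, and is woken when a unit is released.

        #include <pthread.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t unit_freed = PTHREAD_COND_INITIALIZER;
        static int free_units = 0;   /* computing units not currently assigned */

        void acquire_unit(void) {
            pthread_mutex_lock(&lock);
            while (free_units == 0)               /* low-priority thread sleeps */
                pthread_cond_wait(&unit_freed, &lock);
            free_units--;
            pthread_mutex_unlock(&lock);
        }

        void release_unit(void) {
            pthread_mutex_lock(&lock);
            free_units++;
            pthread_cond_signal(&unit_freed);     /* wake a sleeping thread */
            pthread_mutex_unlock(&lock);
        }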
  • the general processor 201 is taken as a CPU and includes multiple processing cores as an example for illustration.
  • the control unit is further configured to: allocate the computing tasks whose task type is vector operation among the multiple computing tasks to the vector operation unit, and allocate the computing tasks whose task type is matrix operation among the multiple computing tasks to the matrix operation unit.
  • the vector operation unit is used to process the calculation task whose task type is vector operation among the multiple calculation tasks
  • the matrix operation unit is used to process the calculation task whose task type is matrix operation among the multiple calculation tasks.
  • in FIG. 7, the vector operation unit including m calculation blocks and the matrix operation unit including n calculation blocks are taken as an example, where m and n are positive integers.
  • the control unit can first determine the type of computing unit to be assigned to each computing task according to the task type, and then assign the computing unit of the corresponding type according to the order determined by the priority.
  • the control unit can assign computing units to multiple computing tasks according to at least one preset parameter, so that when multiple computing tasks are concurrent and computing resources are limited, high-priority or computationally small tasks are processed first and low-priority or computationally intensive tasks are processed later; in this way, high-priority or computationally small tasks have smaller processing delays and higher processing efficiency, as sketched below.
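  • A compact sketch of such an allocator follows; it is a model under assumptions rather than the patented logic: tasks are first routed by task type to the queue of the vector or matrix unit, then served within each queue in descending priority.

        #include <stdlib.h>

        typedef enum { TASK_VECTOR, TASK_MATRIX } TaskType;

        typedef struct {
            int priority;    /* higher value = scheduled earlier (assumed) */
            TaskType type;
        } CompTask;

        static int by_priority_desc(const void *a, const void *b) {
            return ((const CompTask *)b)->priority - ((const CompTask *)a)->priority;
        }

        /* Partition tasks by type into the matching unit's queue, then sort each
         * queue so high-priority tasks receive computing blocks first. */
        void allocate(const CompTask *tasks, size_t n,
                      CompTask *vec_q, size_t *vn, CompTask *mat_q, size_t *mn) {
            *vn = *mn = 0;
            for (size_t i = 0; i < n; i++) {
                if (tasks[i].type == TASK_VECTOR) vec_q[(*vn)++] = tasks[i];
                else                              mat_q[(*mn)++] = tasks[i];
            }
            qsort(vec_q, *vn, sizeof *vec_q, by_priority_desc);
            qsort(mat_q, *mn, sizeof *mat_q, by_priority_desc);
        }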
  • the embodiment of the present application also provides a computing task processing method, which can be applied to the computing task processing device provided above; the device includes a general-purpose processor and a special-purpose processor, and the general-purpose processor and the special-purpose processor are coupled through a physical interface.
  • the method includes: the general-purpose processor sends a first instruction to the special-purpose processor through the physical interface, where the first instruction is used to instruct the special-purpose processor to process a first computing task, and the first instruction may be an instruction in the instruction set of the general-purpose processor directed at the special-purpose processor; the special-purpose processor receives and executes the first instruction through the physical interface (that is, the first instruction is an instruction received by the special-purpose processor directly through the physical interface, not an instruction fetched from memory in a manner similar to software-stack scheduling), and processes the first computing task according to the first instruction.
  • the method may further include: the dedicated processor reads the task data of the first computing task from the buffer; or, the dedicated processor caches the task data of the first computing task in the buffer.
  • the buffer is a buffer of the general-purpose processor.
  • the general purpose processor is coupled to the cache.
  • the general purpose processor and the special purpose processor share the same page table, and the page table is used to indicate the mapping relationship between the logical address and the physical address of the task data in the buffer.
  • the general-purpose processor runs a software stack of the special-purpose processor, and the method may further include: the general-purpose processor sends an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction. For example, when an application running on the general-purpose processor generates a computing task, the application calls the software stack so that the software stack generates the indication message, and the general-purpose processor then sends the indication message to the special-purpose processor through the system bus; the indication message may be an interrupt signal, and the general-purpose processor does not perceive the second instruction. After receiving the indication message, the special-purpose processor parses it through its own software stack to obtain the second instruction (for example, fetches the second instruction from memory), and processes the second computing task according to the second instruction; the second instruction is an instruction in the instruction set of the special-purpose processor.
  • the calculation amount of the first calculation task is smaller than the calculation amount of the second calculation task.
  • the first computing task and the second computing task are two concurrent computing tasks.
  • the dedicated processor includes a control unit and at least one computing unit, and the method may further include: when receiving multiple computing tasks, the control unit allocates the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks; the multiple computing tasks may include only computing tasks indicated by one of the first instruction or the second instruction, or may include computing tasks indicated by both the first instruction and the second instruction; the at least one preset parameter includes at least one of the following: priority and task type.
  • when the at least one preset parameter includes the task type and the at least one computing unit includes a vector operation unit and a matrix operation unit, the method further includes: the vector operation unit processes the computing tasks whose task type is vector operation among the multiple computing tasks; the matrix operation unit processes the computing tasks whose task type is matrix operation among the multiple computing tasks.
  • the general-purpose processor and the special-purpose processor are coupled at the physical layer through a physical interface, so that the general-purpose processor can directly send the first instruction through the physical interface to schedule the special-purpose processor to process the first computing task; that is, the general-purpose processor can multiplex the special-purpose processor through the physical interface, and the multiplexing process does not need to be implemented through software, thereby reducing overhead and improving the interaction efficiency between the general-purpose processor and the special-purpose processor.
  • a system-on-chip SoC is also provided, in which any computing task processing device provided above is integrated.
  • an electronic device is also provided, and the electronic device includes any computing task processing apparatus provided above.
  • a computer-readable storage medium is provided; instructions are stored in the computer-readable storage medium, and when the instructions are run on a device, the device is made to perform the computing task processing method provided by the above method embodiments.
  • a computer program product is provided; when the computer program product runs on a device, the device is made to execute any computing task processing method provided by the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

A computing task processing device, method, and electronic device, relating to the technical field of data processing. The device includes: a general-purpose processor (201) and a special-purpose processor (202), coupled through a physical interface (203). The general-purpose processor (201) is configured to send a first instruction to the special-purpose processor (202) through the physical interface (203), where the first instruction is an instruction in the instruction set of the general-purpose processor (201) and is used to instruct the special-purpose processor (202) to process a first computing task. The special-purpose processor (202) is configured to receive and execute the first instruction through the physical interface (203), and to process the first computing task according to the first instruction. In this way, when the general-purpose processor (201) is a CPU and the special-purpose processor (202) is an NPU, the CPU can schedule the NPU to process computing tasks through the physical interface (203) coupling the CPU and the NPU, without scheduling through a software stack and a system bus, which reduces the overhead of the CPU multiplexing the computing resources of the NPU and improves the interaction efficiency between the CPU and the NPU.

Description

Computing task processing device, method, and electronic device
Technical Field
This application relates to the technical field of data processing, and in particular to a computing task processing device, method, and electronic device.
Background
With the rapid development of artificial intelligence (AI) technology, more and more AI services are deployed on terminal devices, placing ever higher requirements on the computing capability of the system on chip (SoC) in a terminal device. At present, a central processing unit (CPU) and a neural-network processing unit (NPU) are usually integrated in the SoC, and the CPU multiplexes the computing resources of the NPU to execute computing tasks related to AI services.
The prior art provides an architecture in which the CPU multiplexes the computing resources of the NPU based on software. As shown in FIG. 1, the architecture includes a system bus, a CPU, and an NPU, where the CPU and the NPU are coupled to the system bus; a software stack runs on the CPU and includes an NPU driver located in the kernel, and an NPU runtime and an application (APP) located in the user space. Specifically, when the APP generates a computing task and the CPU needs to multiplex the computing resources of the NPU to process the computing task, the APP sends the computing task to the NPU through the NPU runtime and the NPU driver, via the CPU and the system bus, and the NPU processes the computing task when it receives the computing task.
However, in the above process of the CPU multiplexing the computing resources of the NPU, the CPU needs to switch between the user mode and the kernel mode, and there are many layers in the software stack, resulting in a large overhead; this is unsuitable for scenarios in which the CPU and the NPU interact frequently.
Summary
The embodiments of this application provide a computing task processing device, method, and electronic device, which are used to reduce the overhead of the CPU multiplexing the computing resources of the NPU and to improve the interaction efficiency between the CPU and the NPU.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
According to a first aspect, a computing task processing device is provided, including: a general-purpose processor and a special-purpose processor, coupled through a physical interface; for example, the general-purpose processor is a CPU and the special-purpose processor is an NPU. The general-purpose processor is configured to send a first instruction to the special-purpose processor through the physical interface, where the first instruction is an instruction in the instruction set of the general-purpose processor directed at the special-purpose processor and is used to instruct the special-purpose processor to process a first computing task. The special-purpose processor is configured to receive and execute the first instruction through the physical interface (that is, the first instruction is an instruction received by the special-purpose processor directly through the physical interface, not an instruction fetched from memory in a manner similar to software-stack scheduling), and to process the first computing task according to the first instruction.
In the above technical solution, the general-purpose processor and the special-purpose processor are coupled at the physical layer through the physical interface, so that the general-purpose processor can directly send the first instruction to the special-purpose processor through the physical interface to schedule the special-purpose processor to process the first computing task. That is, the general-purpose processor can directly multiplex the special-purpose processor through the physical interface; this multiplexing does not need to be implemented by software, so the overhead is small and the interaction efficiency between the general-purpose processor and the special-purpose processor is improved.
In a possible implementation of the first aspect, the device further includes: a buffer coupled to the special-purpose processor; the buffer is configured to store task data of the first computing task; the special-purpose processor is configured to read the task data from the buffer, and/or to cache the task data in the buffer. This implementation can improve the efficiency with which the special-purpose processor reads or stores task data, and thus the processing efficiency of computing tasks.
In a possible implementation of the first aspect, the buffer is a buffer of the general-purpose processor; or, the general-purpose processor is coupled to the buffer. This can improve the efficiency with which the general-purpose processor reads data from or stores data in the buffer, and also improve the design flexibility of the buffer.
In a possible implementation of the first aspect, the general-purpose processor and the special-purpose processor share the same page table, and the page table is used to indicate the mapping relationship between the logical address and the physical address of the task data in the buffer. In this way, when the general-purpose processor and the special-purpose processor read data from or store data in the buffer, no additional address translation is required, which reduces power consumption and improves data read and write efficiency.
In a possible implementation of the first aspect, a software stack of the special-purpose processor runs on the general-purpose processor; the general-purpose processor is further configured to send an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction. For example, when an application running on the general-purpose processor generates a computing task, the application calls the software stack so that the software stack generates the indication message, and the general-purpose processor then sends the indication message to the special-purpose processor through the system bus; the indication message may be an interrupt signal, and the general-purpose processor does not perceive the second instruction. The special-purpose processor is further configured to, after receiving the indication message, obtain the second instruction after parsing through the software stack of the special-purpose processor, and process the second computing task according to the second instruction, where the second instruction is an instruction in the instruction set of the special-purpose processor. In this way, the general-purpose processor can also multiplex the computing resources of the special-purpose processor based on the software stack, and can process other tasks while the special-purpose processor processes the second computing task, which improves resource utilization.
In a possible implementation of the first aspect, the calculation amount of the first computing task is smaller than the calculation amount of the second computing task. When the general-purpose processor needs to multiplex the special-purpose processor for computing tasks of different calculation amounts, it can multiplex the special-purpose processor through the physical interface to process computing tasks with a small amount of calculation, and through the software stack to process computationally intensive tasks. This is because multiplexing the special-purpose processor through the software stack suits computing tasks with a large amount of calculation, which generally require a long calculation time and are insensitive to scheduling delay, while multiplexing through the physical interface suits computing tasks with a small amount of calculation, which require a short calculation time and are sensitive to scheduling delay.
In a possible implementation of the first aspect, the first computing task and the second computing task are two concurrent computing tasks. This can improve the processing efficiency of computing tasks and resource utilization.
In a possible implementation of the first aspect, the special-purpose processor includes: a control unit and at least one computing unit; the control unit is configured to, when receiving multiple computing tasks, allocate the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks; the multiple computing tasks may include only computing tasks indicated by one of the first instruction or the second instruction, or may include computing tasks indicated by both the first instruction and the second instruction; the at least one preset parameter includes at least one of the following: priority and task type. In this way, when multiple computing tasks are concurrent and computing resources are limited, high-priority or computationally small tasks are guaranteed to be processed first and low-priority or computationally intensive tasks later, so that high-priority or computationally small tasks have smaller processing delay and higher processing efficiency.
In a possible implementation of the first aspect, the at least one preset parameter includes the task type, and the at least one computing unit includes: a vector operation unit, configured to process the computing tasks whose task type is vector operation among the multiple computing tasks; and a matrix operation unit, configured to process the computing tasks whose task type is matrix operation among the multiple computing tasks. This can improve the processing efficiency of computing tasks.
In a possible implementation of the first aspect, the general-purpose processor includes a central processing unit CPU, an image processing unit GPU with a scheduling function (also called a macro GPU, such as a GPU with an internally integrated CPU), or a digital signal processor DSP with a scheduling function; the special-purpose processor includes at least one of the following: a neural-network processing unit NPU, a digital signal processor DSP, and an image processing unit GPU. This can improve the design flexibility and diversity of the special-purpose processor.
According to a second aspect, a computing task processing method is provided, applied to a device including a general-purpose processor and a special-purpose processor coupled through a physical interface. The method includes: the general-purpose processor sends a first instruction to the special-purpose processor through the physical interface, where the first instruction is an instruction in the instruction set of the general-purpose processor directed at the special-purpose processor and is used to instruct the special-purpose processor to process a first computing task; the special-purpose processor receives and executes the first instruction through the physical interface, and processes the first computing task according to the first instruction.
In a possible implementation of the second aspect, the device further includes a buffer coupled to the special-purpose processor, and the method further includes: the special-purpose processor reads the task data of the first computing task from the buffer; or, the special-purpose processor caches the task data of the first computing task in the buffer.
In a possible implementation of the second aspect, the buffer is a buffer of the general-purpose processor; or, the general-purpose processor is coupled to the buffer; the general-purpose processor and the special-purpose processor share the same page table, and the page table is used to indicate the mapping relationship between the logical address and the physical address of the task data in the buffer.
In a possible implementation of the second aspect, a software stack of the special-purpose processor runs on the general-purpose processor, and the method includes: the general-purpose processor sends an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction. For example, when an application running on the general-purpose processor generates a computing task, the application calls the software stack so that the software stack generates the indication message corresponding to the computing task, and the general-purpose processor then sends the indication message to the special-purpose processor through the system bus; the indication message may be an interrupt signal, and the general-purpose processor does not perceive the second instruction. After receiving the indication message, the special-purpose processor parses it through its own software stack to obtain the second instruction, and processes the second computing task according to the second instruction; the second instruction is an instruction of the special-purpose processor.
In a possible implementation of the second aspect, the calculation amount of the first computing task is smaller than the calculation amount of the second computing task.
In a possible implementation of the second aspect, the first computing task and the second computing task are two concurrent computing tasks.
In a possible implementation of the second aspect, the special-purpose processor includes a control unit and at least one computing unit, and the method further includes: when receiving multiple computing tasks, the control unit allocates the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks; the multiple computing tasks may include only computing tasks indicated by one of the first instruction or the second instruction, or may include computing tasks indicated by both the first instruction and the second instruction; the at least one preset parameter includes at least one of the following: priority and task type.
In a possible implementation of the second aspect, the at least one preset parameter includes the task type, the at least one computing unit includes a vector operation unit and a matrix operation unit, and the method further includes: the vector operation unit processes the computing tasks whose task type is vector operation among the multiple computing tasks; the matrix operation unit processes the computing tasks whose task type is matrix operation among the multiple computing tasks.
In a possible implementation of the second aspect, the general-purpose processor includes a central processing unit CPU, an image processing unit GPU with a scheduling function (for example, a GPU that internally integrates a CPU), or a digital signal processor DSP with a scheduling function; the special-purpose processor includes at least one of the following: a neural-network processing unit NPU, a digital signal processor DSP, and an image processing unit GPU.
In another aspect of this application, a system on chip (SoC) is provided, in which the computing task processing device provided by the first aspect or any possible implementation manner of the first aspect is integrated.
In another aspect of this application, an electronic device is provided, which includes the computing task processing device provided by the first aspect or any possible implementation manner of the first aspect.
In another aspect of this application, a computer-readable storage medium is provided, in which instructions are stored; when the instructions are run on a device, the device is made to perform the computing task processing method provided by the second aspect or any possible implementation manner of the second aspect.
In another aspect of this application, a computer program product is provided, which includes: a computer program (also called code, or instructions); when the computer program is run, a computer is made to perform the computing task processing method provided by the second aspect or any possible implementation manner of the second aspect.
It can be understood that, for the beneficial effects achievable by any of the computing task processing methods, electronic devices, computer-readable storage media, and computer program products provided above, reference may be made to the corresponding beneficial effects of the computing task processing device provided above; details are not repeated here.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the architecture of a first type of processor;
FIG. 2 is a schematic diagram of the architecture of a second type of processor;
FIG. 3 is a schematic diagram of the architecture of a third type of processor;
FIG. 4 is a schematic structural diagram of a computing task processing device provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a CPU multiplexing an NPU provided by an embodiment of this application;
FIG. 6 is a schematic diagram of another CPU multiplexing an NPU provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of an NPU provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of another computing task processing device provided by an embodiment of this application.
Description of Embodiments
In this application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, the embodiments of this application use the words "first" and "second" to distinguish between identical or similar items having basically the same functions and effects; for example, a first threshold and a second threshold are only used to distinguish different thresholds, without limiting their order. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order.
It should be noted that, in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as more preferred or advantageous than other embodiments or designs; rather, the use of such words is intended to present related concepts in a concrete manner.
Before the embodiments of this application are introduced, the related background art involved in this application is described first.
With the rapid development of AI technology, more and more AI services are deployed on terminal devices, placing ever higher requirements on the computing capability of the SoC in a terminal device. At present, special-purpose processors suitable for AI operations, such as an NPU or a digital signal processor (DSP), are usually integrated in the SoC, and such a special-purpose processor usually includes a matrix operation unit and a vector operation unit. Of course, besides special-purpose processors, general-purpose processors such as the CPU and the graphics processing unit (GPU) can also handle AI operations, but compared with special-purpose processors there is a certain gap in terms of energy efficiency, area, and flexibility.
In addition, there are many application scenarios for AI operations and many AI algorithms, and different application scenarios and different AI algorithms differ greatly in their demand for computing capability. For example, image-related AI algorithms involve a large amount of computation (that is, a high demand for computing capability), while some speech-related AI algorithms involve a small amount of computation (that is, a relatively small demand for computing capability) but have high real-time requirements. The difference between the two is as follows: AI algorithms with a large amount of computation run for a long time and are relatively insensitive to scheduling delay, with a typical scheduling delay of 500 microseconds (us) to several milliseconds (ms); AI algorithms with a small amount of computation run for a short time, with a computation time of only a few hundred us or within 1 ms. In view of the above, FIG. 1 to FIG. 3 show several processor architectures, which are described below.
FIG. 1 is a schematic diagram of the architecture of a first type of processor; the architecture includes two different processors coupled in a loosely coupled manner. The architecture includes a system bus, a CPU, and an NPU, where the CPU and the NPU are coupled to the system bus. A software stack runs on the CPU and includes an NPU driver located in the kernel, and an NPU runtime and an application APP located in the user space. The NPU includes a matrix operation unit and a vector operation unit. In this architecture, the CPU and the NPU can execute asynchronously, that is, they can execute different services at the same time. In addition, when the APP generates a computing task, the CPU can multiplex the computing resources of the NPU based on the NPU driver and the NPU runtime in the software stack; the specific process may include: the CPU sends an interrupt signal to the NPU, and when the NPU receives the interrupt signal, the NPU parses it through the NPU's software stack, obtains the corresponding instruction from memory, and processes the computing task according to the instruction. However, in this approach the CPU needs to switch between the user mode and the kernel mode, and there are many layers in the software stack, resulting in a large overhead; this is unsuitable for scenarios in which the CPU and the NPU interact frequently.
FIG. 2 is a schematic diagram of the architecture of a second type of processor; the architecture includes one processor with an internal matrix operation unit. The architecture includes a CPU, which includes a CPU core (core), and a matrix operation unit and a cache that are each coupled to the CPU core; the matrix operation unit and the cache are also coupled to each other. When an AI operation needs to be processed, the CPU core can drive the matrix operation unit to run through a customized instruction, so as to process the AI operation through the matrix operation unit. The advantage of this architecture is a small scheduling overhead, making it suitable for scenarios in which the CPU core and the matrix operation unit interact frequently. However, the matrix operation unit can only be used to process AI operations with a small amount of calculation and some ordinary matrix operations, and cannot be applied to AI operations with a large amount of calculation.
FIG. 3 is a schematic diagram of the architecture of a third type of processor, which is a combination of the above two architectures. The architecture includes a system bus, a CPU, and an NPU, where the CPU and the NPU are coupled to the system bus. A matrix operation unit is provided inside the CPU, and a software stack runs on the CPU, including an NPU driver located in the kernel, and an NPU runtime and an application APP located in the user space. The NPU includes a matrix operation unit and a vector operation unit. When an AI operation with a small amount of calculation needs to be processed, the CPU drives its internal matrix operation unit to run through customized instructions; when an AI operation with a large amount of calculation needs to be processed, the CPU multiplexes the matrix operation unit in the NPU through the software stack. This architecture can be used to process both AI operations with a small amount of calculation and AI operations with a large amount of calculation. However, the matrix operation units contained in the CPU and the NPU are physically independent of each other, which occupies a large area, and the two matrix operation units cannot be used to process the same computing task at the same time, which reduces the utilization of computing resources.
On this basis, an embodiment of the present application provides a computing task processing device that can be used to process computing tasks of different computation amounts, with a small overhead when processing computing tasks of a small computation amount; moreover, compared with the structure of the above third processor, the device can also reduce the occupied area and improve the utilization of computing resources. The computing task processing device can be applied to an electronic device, where the electronic device includes but is not limited to: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), a wearable device (for example, a smart watch or a smart band), a vehicle-mounted device (for example, in a car, bicycle, electric vehicle, aircraft, ship, train, or high-speed rail), a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a smart home device (for example, a refrigerator, television, air conditioner, or electricity meter), an intelligent robot, workshop equipment, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, or flying equipment (for example, an intelligent robot, hot-air balloon, drone, or aircraft).
The specific structure of the computing task processing device is described below.
FIG. 4 is a schematic structural diagram of a computing task processing device according to an embodiment of the present application. The computing task processing device includes a general-purpose processor 201 and a special-purpose processor 202, and the general-purpose processor 201 and the special-purpose processor 202 are coupled through a physical interface 203.
In the computing task processing device, the general-purpose processor 201 is configured to send a first instruction to the special-purpose processor 202 through the physical interface 203, where the first instruction is an instruction for the special-purpose processor 202 in the instruction set of the general-purpose processor 201, and the first instruction is used to instruct the special-purpose processor 202 to process a first computing task; the special-purpose processor 202 is configured to receive and execute the first instruction through the physical interface 203, and process the first computing task according to the first instruction. That is, the general-purpose processor 201 can multiplex the special-purpose processor 202 through the physical interface 203 to process the first computing task.
The general-purpose processor 201 may include one or more of a central processing unit CPU or another processor with a scheduling function, for example, a graphics processing unit GPU with a scheduling function (which may also be called a macro GPU, for example, a GPU with a CPU integrated inside) or a digital signal processor DSP with a scheduling function. The special-purpose processor 202 may include one or more of a neural network processor NPU, a digital signal processor DSP, and the like, where the neural network processor NPU may also be called an artificial intelligence AI processor. The computing task processing device may include one or more general-purpose processors 201 and one or more special-purpose processors 202, and each processor may include one or more processing cores. FIG. 4 is described using an example in which the general-purpose processor 201 includes a CPU and the special-purpose processor 202 includes an NPU.
In addition, the special-purpose processor 202 can be used to process operations on data of multiple different dimensions; for example, the data of multiple different dimensions may include one-dimensional data (for example, vectors), two-dimensional data (for example, matrices), and data of more than two dimensions (for example, three-dimensional data).
Furthermore, relative to the current instruction set of the general-purpose processor 201, the first instruction may be an extended instruction (which may also be called a custom instruction) of the general-purpose processor 201, where the extended instruction can be used to instruct (or drive) the special-purpose processor 202 to process a computing task, and the first instruction may be generated by the general-purpose processor 201. The first computing task may be a computing task generated by an application (which may also be called a service) running on the general-purpose processor 201; one application may generate one or more computing tasks, and each computing task may correspond to one thread. Optionally, the first computing task may be a computing task corresponding to an AI operation, where the AI operation may be an operation on two-dimensional data, an operation on data of more than two dimensions, or the like.
Specifically, when an application on the general-purpose processor 201 generates the first computing task, the general-purpose processor 201 can obtain the first instruction during execution, and can send the first instruction to the special-purpose processor 202 through the physical interface 203; when the special-purpose processor 202 receives and executes the first instruction, it can process the first computing task (that is, the first instruction is an instruction directly received by the special-purpose processor through the physical interface, not an instruction obtained from the memory in a manner similar to software scheduling). While the special-purpose processor 202 processes the first computing task, the general-purpose processor 201 can be in a waiting state; after the special-purpose processor 202 completes the processing of the first computing task, the general-purpose processor 201 can continue to perform subsequent operations. The above manner in which the general-purpose processor 201 multiplexes the special-purpose processor 202 through the physical interface 203 to process the first computing task may be called synchronous multiplexing of the computing resources of the special-purpose processor 202.
Exemplarily, as shown in FIG. 5, assuming that the general-purpose processor 201 is a CPU and the special-purpose processor 202 is an NPU, the CPU multiplexing the NPU through the physical interface 203 may specifically include: the CPU loads binary code into the instruction cache and performs instruction fetch and decode operations on the binary code; when the binary code contains an extended instruction, the decode operation can identify the extended instruction; after the extended instruction passes through the issue queue and the store queue, the CPU sends the extended instruction through the physical interface 203 to the instruction buffer in the NPU, and the NPU completes the decode, dispatch, and execution process, that is, processes the computing task corresponding to the extended instruction.
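As an illustration only, the following C sketch shows one way such an extended instruction could be exposed to software as a small wrapper function. The RISC-V custom-0 opcode, the funct field values, and the task-descriptor argument are assumptions made for this example; they are not the encoding used by the present application.

    /* A minimal sketch, assuming a RISC-V-style custom-0 encoding for a
     * hypothetical NPU offload instruction. */
    #include <stdint.h>

    /* Hypothetical extended instruction: hand the physical interface a
     * pointer to a task descriptor; the CPU then waits until the NPU
     * signals completion of the instruction. */
    static inline void npu_exec(const void *task_desc)
    {
    #if defined(__riscv)
        /* ".insn r" emits an R-type instruction in the custom-0 opcode
         * space (0x0B); the funct3/funct7 values are placeholders. */
        __asm__ volatile(".insn r 0x0B, 0, 0, x0, %0, x0"
                         :
                         : "r"(task_desc)
                         : "memory");
    #else
        (void)task_desc; /* other ISAs would define their own encoding */
    #endif
    }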
In the computing task processing device provided by this embodiment of the present application, the general-purpose processor 201 and the special-purpose processor 202 are coupled at the physical layer through the physical interface 203, so that the general-purpose processor 201 can directly send the first instruction through the physical interface 203 to schedule the special-purpose processor 202 to process the first computing task; that is, the general-purpose processor 201 can multiplex the special-purpose processor 202 through the physical interface 203. This multiplexing process does not need to be implemented through software, so the overhead is small, and the interaction efficiency between the general-purpose processor 201 and the special-purpose processor 202 is improved.
Further, as shown in FIG. 4, the computing task processing device further includes a cache 204. The cache 204 is coupled to the special-purpose processor 202, so that the special-purpose processor 202 can be used to store data into the cache 204 and read data from the cache 204.
Optionally, the cache 204 is a cache integrated inside the general-purpose processor 201; for example, the general-purpose processor 201 is a CPU including three levels of cache (that is, L1 to L3 caches), and the cache 204 may be the L3 cache. Alternatively, the cache 204 is a cache outside the general-purpose processor 201, and the general-purpose processor 201 is coupled to the cache 204, that is, the general-purpose processor 201 can be used to store data into the cache 204 and read data from the cache 204. FIG. 4 is described using an example in which the cache 204 is located outside the general-purpose processor 201.
In a possible embodiment, the cache 204 stores first task data of the first computing task, where the first task data may be input data required for processing the first computing task (that is, data that the special-purpose processor 202 needs to use while executing the first instruction); in this case, the special-purpose processor 202 is further configured to read the first task data from the cache 204 during the processing of the first computing task. Optionally, the first task data may be data stored in the cache 204 by the general-purpose processor 201.
In another possible embodiment, the special-purpose processor 202 generates second task data while processing the first computing task, where the second task data may be output data of processing the first computing task (that is, data correspondingly output after the special-purpose processor 202 executes the first instruction); in this case, the special-purpose processor 202 is further configured to store the second task data in the cache 204. Optionally, the general-purpose processor 201 may read the second task data from the cache 204.
Optionally, the general-purpose processor 201 and the special-purpose processor 202 may share the same page table, where the page table can indicate the mapping between the logical addresses and the physical addresses of the above first task data and/or second task data in the cache 204. In this way, when the general-purpose processor 201 and the special-purpose processor 202 read data from or store data into the cache 204, no additional address translation is needed, which can reduce power consumption to some extent and improve the efficiency of data reads and writes.
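As an illustration of what the shared page table means for software: since both processors resolve the same logical addresses, a task descriptor can carry ordinary pointers that the special-purpose processor dereferences directly, with no copying or remapping step in between. In the following C sketch, the descriptor layout and the npu_exec() helper are the hypothetical constructs introduced in the sketch above.

    #include <stddef.h>

    void npu_exec(const void *task_desc); /* hypothetical helper (see above) */

    /* Hypothetical task descriptor passed through the extended instruction.
     * With a shared page table, src and dst are plain addresses that are
     * equally valid on the general-purpose and the special-purpose
     * processor. */
    struct npu_task {
        const float *src;   /* input data, readable by both processors */
        float       *dst;   /* output buffer, written by the NPU */
        size_t       rows;
        size_t       cols;
    };

    void offload_matmul(const float *a, float *out, size_t n)
    {
        struct npu_task t = { .src = a, .dst = out, .rows = n, .cols = n };
        npu_exec(&t);   /* no copy, no address translation on either side */
    }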
Optionally, the computing task processing device may further include a system bus 205, and both the cache 204 and the special-purpose processor 202 are coupled to the system bus 205. In addition, the computing task processing device may further include a memory; for example, the memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), DDR for short. The special-purpose processor 202 may access the memory through the cache 204, or may access the memory without going through the cache 204; that is, the special-purpose processor 202 has an independent channel for accessing the memory that bypasses the cache 204.
Further, as shown in FIG. 4, a software stack of the special-purpose processor 202 runs on the general-purpose processor 201, and the software stack may include the runtime and the driver of the special-purpose processor 202. FIG. 4 is described using an example in which the software stack includes an NPU runtime and an NPU driver.
Specifically, the general-purpose processor 201 is further configured to send an indication message to the special-purpose processor 202 through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction, the second instruction is an instruction in the instruction set of the special-purpose processor 202, and the general-purpose processor is not aware of the second instruction. For example, when an application running on the general-purpose processor 201 generates a computing task, the application invokes the software stack so that the software stack generates the indication message corresponding to the computing task, and the general-purpose processor 201 then sends the indication message to the special-purpose processor 202 through the system bus, where the indication message may be an interrupt signal. The special-purpose processor 202 is further configured to obtain the second instruction (for example, from the memory) through the software stack of the special-purpose processor 202 after receiving the indication message, and process a second computing task according to the second instruction. That is, the general-purpose processor 201 can also multiplex the computing resources of the special-purpose processor 202 based on the software stack. While the special-purpose processor 202 processes the second computing task, the general-purpose processor 201 can continue to execute other tasks; after the special-purpose processor 202 completes the processing of the second computing task, the special-purpose processor 202 sends an interrupt to the general-purpose processor 201, so that the general-purpose processor 201 can continue to perform subsequent operations corresponding to the second computing task when receiving the interrupt. The above manner in which the general-purpose processor 201 multiplexes the special-purpose processor 202 based on the software stack to process the second computing task may be called asynchronous multiplexing of the computing resources of the special-purpose processor 202.
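As an illustration only, an asynchronous submission through such a software stack might look like the following user-space sketch. The device node, the ioctl command numbers, and the request layout are invented for this example; an actual NPU driver defines its own interface.

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical driver interface; command numbers and layout are
     * placeholders, not a real NPU driver's API. */
    struct npu_submit {
        uint64_t cmd_buf;   /* address of prebuilt NPU instructions in memory */
        uint64_t cmd_len;
        uint64_t fence;     /* completion handle filled in by the driver */
    };
    #define NPU_IOC_SUBMIT _IOWR('N', 1, struct npu_submit)
    #define NPU_IOC_WAIT   _IOW('N', 2, uint64_t)

    /* Submit a large task and return immediately; the CPU keeps running
     * and later blocks on the fence, which the driver signals from the
     * NPU's completion interrupt. fd is an open descriptor for a device
     * node such as a hypothetical /dev/npu0. */
    int npu_submit_async(int fd, uint64_t cmd_buf, uint64_t cmd_len,
                         uint64_t *fence)
    {
        struct npu_submit req = { .cmd_buf = cmd_buf, .cmd_len = cmd_len };
        if (ioctl(fd, NPU_IOC_SUBMIT, &req) < 0)
            return -1;
        *fence = req.fence;
        return 0;
    }

    int npu_wait(int fd, uint64_t fence)
    {
        return ioctl(fd, NPU_IOC_WAIT, &fence);
    }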
The second computing task may be a computing task generated by an application (which may also be called a service) running on the general-purpose processor 201. The first computing task and the second computing task may be two computing tasks generated by the same application, or two computing tasks generated by different applications. The second computing task may be a computing task corresponding to an AI operation.
Exemplarily, with reference to FIG. 4 and as shown in FIG. 6, when an APP running on the CPU generates the second computing task and the CPU needs to multiplex the computing resources of the NPU to process the second computing task, the APP can send the second computing task to the NPU through the NPU runtime and the NPU driver, so that the NPU processes the second computing task.
Optionally, the computation amount of the first computing task is smaller than the computation amount of the second computing task. That is, when the general-purpose processor 201 needs to multiplex the special-purpose processor 202 to process computing tasks of different computation amounts, the general-purpose processor 201 can multiplex the special-purpose processor 202 through the physical interface 203 to process computing tasks with a small computation amount, and multiplex the special-purpose processor 202 through the software stack to process computing tasks with a large computation amount. This is because multiplexing the special-purpose processor 202 through the software stack is suitable for computing tasks with a large computation amount, which generally require a long computation time and are insensitive to scheduling latency, whereas multiplexing the special-purpose processor 202 through the physical interface 203 is suitable for computing tasks with a small computation amount, which require a short computation time and are sensitive to scheduling latency.
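The routing decision described above can be pictured as a small dispatcher, sketched below. The threshold and the cost estimate are arbitrary illustrative values, and npu_exec(), npu_submit_async(), and npu_wait() are the hypothetical helpers from the earlier sketches.

    #include <stdint.h>

    void npu_exec(const void *task_desc);
    int  npu_submit_async(int fd, uint64_t cmd_buf, uint64_t cmd_len,
                          uint64_t *fence);
    int  npu_wait(int fd, uint64_t fence);

    /* Illustrative threshold: tasks cheaper than this take the
     * latency-sensitive physical-interface path. The value is arbitrary. */
    #define SMALL_TASK_MACS (1u << 20)

    void dispatch_task(int npu_fd, void *task_desc, uint64_t desc_len,
                       uint64_t estimated_macs)
    {
        if (estimated_macs < SMALL_TASK_MACS) {
            npu_exec(task_desc);           /* synchronous: CPU waits */
        } else {
            uint64_t fence;
            npu_submit_async(npu_fd, (uint64_t)(uintptr_t)task_desc,
                             desc_len, &fence);
            /* ... the CPU continues with other work here ... */
            npu_wait(npu_fd, fence);       /* completion via interrupt */
        }
    }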
In practical applications, the general-purpose processor 201 can multiplex the special-purpose processor 202 in the above two manners simultaneously to process computing tasks of different computation amounts, or multiplex the special-purpose processor 202 in a time-sharing manner to process computing tasks of different computation amounts. In an embodiment, when the first computing task and the second computing task are two concurrent computing tasks, the general-purpose processor 201 processes the first computing task by multiplexing the special-purpose processor 202 through the physical interface 203, and at the same time processes the second computing task by multiplexing the special-purpose processor 202 through the software stack.
Further, when multiple concurrent computing tasks exist, a computing-resource management mechanism is needed to ensure the normal operation of the multiple computing tasks. The multiple computing tasks may include only computing tasks received by the special-purpose processor 202 in either one of the above two manners, or may include computing tasks received by the special-purpose processor 202 in both manners. Specifically, as shown in FIG. 7, the special-purpose processor 202 may include a control unit and one or more computing units, where the control unit can be responsible for managing the computing units.
Optionally, the control unit may include a resource management unit and an instruction execution unit, where the resource management unit can manage and allocate the computing units, and the instruction execution unit can be responsible for functions such as instruction caching, instruction fetch, and decode. The computing units may include one or more computing units of different dimensions; for example, the computing units may include a vector operation unit and a matrix operation unit. The following mainly takes the control unit as the subject to describe the control unit's functions in resource management.
In a possible embodiment, the control unit is configured to: when multiple computing tasks are received, allocate the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks. The multiple computing tasks include the first computing task. The at least one preset parameter includes at least one of the following: priority and task type. Several examples are described below.
In one example, when the at least one preset parameter includes priority and the control unit receives multiple computing tasks, the control unit allocates the at least one computing unit to the multiple computing tasks in descending order of their priorities. If the computation amount that the at least one computing unit can handle is smaller than the computation amount of the multiple computing tasks, the control unit may preferentially allocate the at least one computing unit to the high-priority computing task, and after the at least one computing unit completes the high-priority computing task, allocate the at least one computing unit to the low-priority computing task.
Optionally, each of the at least one computing unit may include multiple computing blocks, where each of the multiple computing blocks may have the same or different computing capabilities, and one or more computing blocks can be used to process one computing task. When allocating the computing blocks in a computing unit according to the priority or computation amount of the computing tasks, the control unit may allocate, for each computing task, one or more computing blocks whose computing capability matches the computation amount of that computing task.
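A schematic version of this allocation policy is sketched below. The fixed per-block capabilities, the task fields, and the greedy matching loop are simplifying assumptions for illustration, not the allocation algorithm of the present application.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_BLOCKS 8

    /* Hypothetical model: each computing block has a fixed capability
     * (in arbitrary compute units); blocks may differ in capability. */
    static const uint32_t block_capability[NUM_BLOCKS] =
        { 64, 64, 64, 64, 128, 128, 256, 256 };
    static bool block_busy[NUM_BLOCKS];

    struct task {
        int      priority;  /* higher value = more urgent */
        uint32_t demand;    /* required capability, same unit as above */
    };

    static int by_priority_desc(const void *a, const void *b)
    {
        int pa = ((const struct task *)a)->priority;
        int pb = ((const struct task *)b)->priority;
        return (pb > pa) - (pb < pa);
    }

    /* Hand out free blocks to tasks in priority order; a task is only
     * committed if enough capability is free, otherwise it is retried
     * after blocks are released. */
    void allocate_blocks(struct task *tasks, size_t n)
    {
        qsort(tasks, n, sizeof(*tasks), by_priority_desc);
        for (size_t i = 0; i < n; i++) {
            int chosen[NUM_BLOCKS], m = 0;
            uint32_t granted = 0;
            for (int b = 0; b < NUM_BLOCKS && granted < tasks[i].demand; b++) {
                if (!block_busy[b]) {
                    chosen[m++] = b;
                    granted += block_capability[b];
                }
            }
            if (granted >= tasks[i].demand)   /* commit only if satisfied */
                while (m--)
                    block_busy[chosen[m]] = true;
        }
    }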
In addition, when each computing unit includes multiple computing blocks, at least one of the multiple computing blocks can be statically configured for processing computing tasks in the manner of multiplexing the special-purpose processor 202 through the physical interface 203. Of course, the at least one computing block can also be used to process computing tasks in the manner of multiplexing the special-purpose processor 202 through the software stack.
It should be noted that the priority and computation amount of a computing task may be determined by the computing task itself; for example, the priority and computation amount of the computing task are correspondingly determined when the computing task is generated. When the at least one preset parameter includes at least two parameters, the at least two parameters can be used together to determine the order in which the control unit allocates computing units to each computing task.
Optionally, as shown in FIG. 8, the general-purpose processor 201 may include a priority control unit, which can support setting and querying the priorities of computing tasks, and the control unit in the special-purpose processor 202 can support allocating computing units by priority. The priority control unit may provide interfaces such as registers that allow service software to configure priorities, and may also maintain a priority queue that can be queried. For the control unit in the special-purpose processor 202, when a high-priority service needs to use a computing unit, the control unit can hold the running low-priority service and perform a switch, so as to preferentially schedule the high-priority service. For example, services in the general-purpose processor 201 set priorities at the granularity of threads, so that the control unit in the special-purpose processor 202 can ensure that high-priority threads are preferentially allocated computing units; a low-priority thread that cannot be allocated a computing unit can sleep, and after a computing unit is released, the low-priority thread is woken up again and allocated a computing unit. FIG. 8 is described using an example in which the general-purpose processor 201 is a CPU including multiple processing cores.
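The sleep-and-wake behavior of a low-priority thread can be sketched with ordinary synchronization primitives. The counter model of free computing units below is an assumption made for the example; a real implementation would additionally wake waiters in priority order.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  unit_freed = PTHREAD_COND_INITIALIZER;
    static int free_units = 4;   /* illustrative number of computing units */

    /* A thread sleeps here until a computing unit becomes available. */
    void acquire_unit(void)
    {
        pthread_mutex_lock(&lock);
        while (free_units == 0)
            pthread_cond_wait(&unit_freed, &lock); /* sleep, no busy-wait */
        free_units--;
        pthread_mutex_unlock(&lock);
    }

    /* Releasing a unit wakes one sleeping thread. */
    void release_unit(void)
    {
        pthread_mutex_lock(&lock);
        free_units++;
        pthread_cond_signal(&unit_freed);
        pthread_mutex_unlock(&lock);
    }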
In another example, when the at least one preset parameter includes the task type and the computing units include a vector operation unit and a matrix operation unit, the control unit is further configured to: allocate the computing tasks, among the multiple computing tasks, whose task type is vector operation to the vector operation unit, and allocate the computing tasks whose task type is matrix operation to the matrix operation unit. Correspondingly, the vector operation unit is used to process the computing tasks, among the multiple computing tasks, whose task type is vector operation, and the matrix operation unit is used to process the computing tasks whose task type is matrix operation. FIG. 7 is described using an example in which the vector operation unit includes m computing blocks and the matrix operation unit includes n computing blocks, where m and n are positive integers.
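Dispatching by task type then reduces to a tag check, as in the following sketch; the enumeration and the per-unit enqueue helpers stand in for the hardware's internal dispatch mechanism and are illustrative only.

    enum task_type { TASK_VECTOR, TASK_MATRIX };

    struct typed_task {
        enum task_type type;
        void          *payload;
    };

    /* Hypothetical per-unit queues standing in for the hardware dispatch. */
    void enqueue_vector(struct typed_task *t);
    void enqueue_matrix(struct typed_task *t);

    void dispatch_by_type(struct typed_task *t)
    {
        switch (t->type) {
        case TASK_VECTOR: enqueue_vector(t); break; /* vector operation unit */
        case TASK_MATRIX: enqueue_matrix(t); break; /* matrix operation unit */
        }
    }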
It should be noted that, when the at least one preset parameter includes the task type and also includes the priority, the control unit may first determine, by task type, the type of computing unit to be allocated to a computing task, and then allocate computing units of the corresponding type in the order determined by priority.
In this embodiment of the present application, the control unit can allocate computing units to multiple computing tasks according to at least one preset parameter, so that when multiple computing tasks are concurrent and computing resources are limited, computing tasks with a high priority or a small computation amount are guaranteed to be processed first and computing tasks with a low priority or a large computation amount are processed later, giving computing tasks with a high priority or a small computation amount a small processing latency and high processing efficiency.
An embodiment of the present application further provides a computing task processing method, which can be applied to the computing task processing device provided above, where the device includes a general-purpose processor and a special-purpose processor, and the general-purpose processor and the special-purpose processor are coupled through a physical interface; for a specific description of the device, reference may be made to the relevant description above.
Specifically, the method includes: the general-purpose processor sends a first instruction to the special-purpose processor through the physical interface, where the first instruction is used to instruct the special-purpose processor to process a first computing task, and the first instruction may be an instruction for the special-purpose processor in the instruction set of the general-purpose processor; the special-purpose processor receives and executes the first instruction through the physical interface (that is, the first instruction is an instruction directly received by the special-purpose processor through the physical interface, not an instruction obtained from the memory in a manner similar to software-stack scheduling), and processes the first computing task according to the first instruction.
Optionally, when the device further includes a cache coupled to the special-purpose processor, the method may further include: the special-purpose processor reads task data of the first computing task from the cache; or, the special-purpose processor caches task data of the first computing task in the cache.
In an embodiment, the cache is a cache of the general-purpose processor. In another embodiment, the general-purpose processor is coupled to the cache.
In practical applications, the general-purpose processor and the special-purpose processor share the same page table, where the page table is used to indicate the mapping between the logical addresses and the physical addresses of the task data in the cache.
Further, a software stack of the special-purpose processor runs on the general-purpose processor, and the method may further include: the general-purpose processor sends an indication message to the special-purpose processor through the software stack, where the indication message is used to instruct the special-purpose processor to obtain a second instruction. For example, when an application running on the general-purpose processor generates a computing task, the application invokes the software stack so that the software stack generates the indication message, and the general-purpose processor then sends the indication message to the special-purpose processor through the system bus; the indication message may be an interrupt signal, and the general-purpose processor is not aware of the second instruction. The special-purpose processor is further configured to obtain the second instruction (for example, from the memory) after parsing by the software stack of the special-purpose processor upon receiving the indication message, and process the second computing task according to the second instruction, where the second instruction is an instruction in the instruction set of the special-purpose processor.
Optionally, the computation amount of the first computing task is smaller than the computation amount of the second computing task. In an embodiment, the first computing task and the second computing task are two concurrent computing tasks.
Further, the special-purpose processor includes a control unit and at least one computing unit, and the method may further include: when multiple computing tasks are received, the control unit allocates the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks, where the multiple computing tasks may include only computing tasks indicated by one of the first instruction and the second instruction, or may include computing tasks indicated by both; the at least one preset parameter includes at least one of the following: priority and task type.
In an embodiment, the at least one preset parameter includes the task type, the at least one computing unit includes a vector operation unit and a matrix operation unit, and the method further includes: the vector operation unit processes the computing tasks, among the multiple computing tasks, whose task type is vector operation; and the matrix operation unit processes the computing tasks whose task type is matrix operation.
It should be noted that, for a detailed description of the above steps, reference may be made to the description of the computing task processing device provided above; details are not repeated here in this embodiment of the present application.
In this embodiment of the present application, the general-purpose processor and the special-purpose processor are coupled at the physical layer through the physical interface, so that the general-purpose processor can directly send the first instruction through the physical interface to schedule the special-purpose processor to process the first computing task; that is, the general-purpose processor can multiplex the special-purpose processor through the physical interface. This multiplexing process does not need to be implemented through software, so the overhead is small, and the interaction efficiency between the general-purpose processor and the special-purpose processor is improved.
In another aspect of the present application, a system on chip SoC is further provided, where the SoC integrates any of the computing task processing devices provided above.
In another aspect of the present application, an electronic device is further provided, where the electronic device includes any of the computing task processing devices provided above.
In yet another aspect of the present application, a computer-readable storage medium is provided, where the computer-readable storage medium stores instructions that, when run on a device, cause the device to execute the computing task processing method provided by the above method embodiments.
In yet another aspect of the present application, a computer program product is provided, which, when run on a device, causes the device to execute any of the computing task processing methods provided by the above method embodiments.
Finally, it should be noted that the above are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto; any variation or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (22)

  1. A computing task processing device, characterized in that the device comprises: a general-purpose processor and a special-purpose processor, wherein the general-purpose processor and the special-purpose processor are coupled through a physical interface;
    the general-purpose processor is configured to send a first instruction to the special-purpose processor through the physical interface, wherein the first instruction is an instruction for the special-purpose processor in the instruction set of the general-purpose processor, and the first instruction is used to instruct the special-purpose processor to process a first computing task;
    the special-purpose processor is configured to receive and execute the first instruction through the physical interface, and process the first computing task according to the first instruction.
  2. The device according to claim 1, characterized in that the device further comprises: a cache coupled to the special-purpose processor;
    the cache is configured to store task data of the first computing task;
    the special-purpose processor is configured to read the task data from the cache, and/or cache the task data in the cache.
  3. The device according to claim 2, characterized in that the cache is a cache of the general-purpose processor; or,
    the general-purpose processor is coupled to the cache.
  4. The device according to claim 3, characterized in that the general-purpose processor and the special-purpose processor share the same page table, wherein the page table is used to indicate the mapping between logical addresses and physical addresses of the task data in the cache.
  5. The device according to any one of claims 1 to 4, characterized in that a software stack of the special-purpose processor runs on the general-purpose processor;
    the general-purpose processor is further configured to send an indication message to the special-purpose processor through the software stack, wherein the indication message is used to instruct the special-purpose processor to obtain a second instruction;
    the special-purpose processor is further configured to obtain the second instruction after parsing by the software stack of the special-purpose processor upon receiving the indication message, and process a second computing task according to the second instruction;
    wherein the second instruction is an instruction in the instruction set of the special-purpose processor.
  6. The device according to claim 5, characterized in that the computation amount of the first computing task is smaller than the computation amount of the second computing task.
  7. The device according to claim 5 or 6, characterized in that the first computing task and the second computing task are two concurrent computing tasks.
  8. The device according to any one of claims 1 to 7, characterized in that the special-purpose processor comprises: a control unit and at least one computing unit;
    the control unit is configured to, when multiple computing tasks are received, allocate the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks;
    wherein the at least one preset parameter comprises at least one of the following: priority and task type.
  9. The device according to claim 8, characterized in that the at least one preset parameter comprises the task type, and the at least one computing unit comprises:
    a vector operation unit, configured to process the computing tasks, among the multiple computing tasks, whose task type is vector operation;
    a matrix operation unit, configured to process the computing tasks, among the multiple computing tasks, whose task type is matrix operation.
  10. The device according to any one of claims 1 to 9, characterized in that the general-purpose processor comprises a central processing unit CPU, and the special-purpose processor comprises at least one of the following: a neural network processor NPU or a digital signal processor DSP.
  11. The device according to any one of claims 1 to 10, characterized in that the device is integrated in a system on chip SoC.
  12. A computing task processing method, characterized in that the method is applied to a device comprising a general-purpose processor and a special-purpose processor, wherein the general-purpose processor and the special-purpose processor are coupled through a physical interface, and the method comprises:
    sending, by the general-purpose processor, a first instruction to the special-purpose processor through the physical interface, wherein the first instruction is an instruction for the special-purpose processor in the instruction set of the general-purpose processor, and the first instruction is used to instruct the special-purpose processor to process a first computing task;
    receiving and executing, by the special-purpose processor, the first instruction through the physical interface, and processing the first computing task according to the first instruction.
  13. The method according to claim 12, characterized in that the device further comprises a cache coupled to the special-purpose processor, and the method further comprises:
    reading, by the special-purpose processor, task data of the first computing task from the cache; or,
    caching, by the special-purpose processor, task data of the first computing task in the cache.
  14. The method according to claim 13, characterized in that the cache is a cache of the general-purpose processor; or,
    the general-purpose processor is coupled to the cache.
  15. The method according to claim 14, characterized in that the general-purpose processor and the special-purpose processor share the same page table, wherein the page table is used to indicate the mapping between logical addresses and physical addresses of the task data in the cache.
  16. The method according to any one of claims 12 to 14, characterized in that a software stack of the special-purpose processor runs on the general-purpose processor, and the method comprises:
    sending, by the general-purpose processor, an indication message to the special-purpose processor through the software stack, wherein the indication message is used to instruct the special-purpose processor to obtain a second instruction;
    obtaining, by the special-purpose processor after receiving the indication message, the second instruction after parsing by the software stack of the special-purpose processor, and processing a second computing task according to the second instruction;
    wherein the second instruction is an instruction in the instruction set of the special-purpose processor.
  17. The method according to claim 16, characterized in that the computation amount of the first computing task is smaller than the computation amount of the second computing task.
  18. The method according to claim 16 or 17, characterized in that the first computing task and the second computing task are two concurrent computing tasks.
  19. The method according to any one of claims 12 to 18, characterized in that the special-purpose processor comprises a control unit and at least one computing unit, and the method further comprises:
    allocating, by the control unit when multiple computing tasks are received, the at least one computing unit to the multiple computing tasks according to at least one preset parameter of the multiple computing tasks;
    wherein the at least one preset parameter comprises at least one of the following: priority and task type.
  20. The method according to claim 19, characterized in that the at least one preset parameter comprises the task type, the at least one computing unit comprises a vector operation unit and a matrix operation unit, and the method further comprises:
    processing, by the vector operation unit, the computing tasks, among the multiple computing tasks, whose task type is vector operation;
    processing, by the matrix operation unit, the computing tasks, among the multiple computing tasks, whose task type is matrix operation.
  21. The method according to any one of claims 12 to 20, characterized in that the general-purpose processor comprises a central processing unit CPU, and the special-purpose processor comprises at least one of the following: a neural network processor NPU or a digital signal processor DSP.
  22. An electronic device, characterized in that the electronic device comprises the computing task processing device according to any one of claims 1 to 11.
PCT/CN2021/143792 2021-12-31 2021-12-31 Computing task processing apparatus and method, and electronic device WO2023123395A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180035983.0A CN116848509A (zh) 2021-12-31 2021-12-31 Computing task processing apparatus and method, and electronic device
PCT/CN2021/143792 WO2023123395A1 (zh) 2021-12-31 2021-12-31 Computing task processing apparatus and method, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/143792 WO2023123395A1 (zh) 2021-12-31 2021-12-31 Computing task processing apparatus and method, and electronic device

Publications (1)

Publication Number Publication Date
WO2023123395A1 true WO2023123395A1 (zh) 2023-07-06

Family

ID=86997235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/143792 WO2023123395A1 (zh) 2021-12-31 2021-12-31 Computing task processing apparatus and method, and electronic device

Country Status (2)

Country Link
CN (1) CN116848509A (zh)
WO (1) WO2023123395A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286972A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
CN110120063A (zh) * 2019-04-23 2019-08-13 深圳市道通智能航空技术有限公司 Multiprocessor-based target tracking processing method
CN113574656A (zh) * 2020-02-28 2021-10-29 华为技术有限公司 Data processing apparatus and method
CN112513817A (zh) * 2020-08-14 2021-03-16 华为技术有限公司 Data interaction method between a main CPU and an NPU, and computing device
CN113554149A (zh) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 Neural network processing unit NPU, neural network processing method and apparatus

Also Published As

Publication number Publication date
CN116848509A (zh) 2023-10-03

Similar Documents

Publication Publication Date Title
US11550627B2 (en) Hardware accelerated dynamic work creation on a graphics processing unit
EP2126690B1 (en) On-demand multi-thread multimedia processor
US9176794B2 (en) Graphics compute process scheduling
US9176795B2 (en) Graphics processing dispatch from user mode
US20120229481A1 (en) Accessibility of graphics processing compute resources
US8743131B2 (en) Course grain command buffer
CN103218329A (zh) Digital signal processing data transmission
JP2013546105A (ja) Optimizing communication of system call requests
US11403104B2 (en) Neural network processor, chip and electronic device
CN111209244B (zh) Data processing device and related product
CN112925616A (zh) Task allocation method and apparatus, storage medium, and electronic device
JP5805783B2 (ja) Computer system interrupt handling
CN114816777A (zh) Command processing apparatus and method, electronic device, and computer-readable storage medium
WO2023123395A1 (zh) Computing task processing apparatus and method, and electronic device
CN117112165A (zh) Method and apparatus for processing virtual reality application tasks, and virtual reality system
CN113556242B (zh) Method and device for inter-node communication based on multiple processing nodes
US20180011804A1 (en) Inter-Process Signaling Mechanism
CN116724294A (zh) Task allocation method and apparatus
US10261817B2 (en) System on a chip and method for a controller supported virtual machine monitor
WO2023230909A1 (zh) Scheduling method and related apparatus
KR102536943B1 (ko) Data reduction apparatus, data reduction method, and system including data reduction apparatus
US11915041B1 (en) Method and system for sequencing artificial intelligence (AI) jobs for execution at AI accelerators
WO2024087513A1 (zh) Data processing method and system for application scenarios, electronic device, and storage medium
CN114399034B (zh) Data transfer method for direct memory access device
WO2023173276A1 (en) Universal core to accelerator communication architecture

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180035983.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21969742

Country of ref document: EP

Kind code of ref document: A1