WO2024045580A1 - Method for scheduling tasks, and related product thereof - Google Patents


Info

Publication number
WO2024045580A1
Authority
WO
WIPO (PCT)
Prior art keywords
chip
task
lookup table
tasks
circuit
Application number
PCT/CN2023/083494
Other languages
French (fr)
Chinese (zh)
Inventor
高健
刘少礼
郝勇峥
韩栋
Original Assignee
寒武纪(西安)集成电路有限公司
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2024045580A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates generally to the computer field. More specifically, the present application relates to a scheduler, an artificial intelligence processor chip, a board, a method and a computer-readable storage medium for scheduling tasks.
  • In a conventional CPU, data that may be reused is cached in the cache, thereby shortening the time it takes to access that data the next time, such as when executing a task.
  • To this end, this application provides a task caching and wake-up solution based on a lookup table. With this solution, the latency of large-scale task scheduling can be greatly reduced, while the hardware design is simplified and on-chip storage overhead is reduced. To this end, this application provides solutions in the following aspects.
  • In a first aspect, the present disclosure provides a scheduler for task scheduling, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit.
  • The scheduler includes: a scheduling circuit configured to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded on the off-chip storage device in a valid state;
  • and a first lookup table circuit configured to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record the invalid state in the first lookup table; and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • In a second aspect, the present disclosure provides an artificial intelligence processor chip, including: the scheduler according to the first aspect; and an on-chip task execution unit configured to interact with the scheduler to execute the tasks issued by the scheduler.
  • In a third aspect, the present disclosure provides a board including the artificial intelligence processor chip according to the second aspect.
  • In a fourth aspect, the present disclosure provides a method for scheduling tasks using the scheduler according to the first aspect. The method comprises: using the scheduling circuit to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record the invalid state in the first lookup table; and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
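The cooperation between the scheduling circuit and the first lookup table described in this aspect can be sketched in software as follows. This is an illustrative model only: the class and method names (`FirstLookupTable`, `SchedulingCircuit`, `read_next`, etc.) are assumptions for the sketch, not identifiers from the disclosure, and Python stands in for hardware behavior.

```python
VALID, INVALID = 1, 0

class FirstLookupTable:
    """On-chip table holding a one-bit valid flag per task slot."""
    def __init__(self, num_slots):
        self.flags = [INVALID] * num_slots

    def mark_valid(self, slot_id):
        self.flags[slot_id] = VALID

    def mark_invalid(self, slot_id):
        # Recording the invalid state is what triggers the next read.
        self.flags[slot_id] = INVALID

class SchedulingCircuit:
    def __init__(self, off_chip_tasks, lut):
        # off_chip_tasks: (slot_id, task) pairs recorded off-chip in a valid state
        self.off_chip_tasks = list(off_chip_tasks)
        self.lut = lut
        self.dispatched = []

    def read_next(self):
        """Read one task onto the chip and flip its flag valid -> invalid."""
        if not self.off_chip_tasks:
            return None
        slot, task = self.off_chip_tasks.pop(0)
        self.lut.mark_invalid(slot)   # record the invalid state in the first LUT
        self.dispatched.append(task)  # issue to the on-chip execution unit
        return task

lut = FirstLookupTable(num_slots=3)
for s in range(3):
    lut.mark_valid(s)
sched = SchedulingCircuit([(0, "t0"), (1, "t1"), (2, "t2")], lut)
while sched.read_next() is not None:  # each invalid record triggers the next read
    pass
print(sched.dispatched)  # ['t0', 't1', 't2']
```

In hardware the "trigger" is a signal rather than a loop, but the invariant is the same: every valid-to-invalid transition in the first lookup table initiates the fetch of the next off-chip task.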
  • In a fifth aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer program instructions for scheduling tasks. When the computer program instructions are executed by a processor, the method according to the fourth aspect is implemented.
  • the processing speed of task scheduling can be accelerated, thereby significantly reducing the delay of large-scale task scheduling.
  • the complexity of hardware design is also simplified and the cost of on-chip storage is reduced.
  • Furthermore, because the present disclosure uses a lookup table dedicated to inter-chip tasks, bus congestion and backpressure caused by burst transmission of multiple task-allocation messages are avoided, thereby enabling efficient inter-chip task scheduling.
  • FIG. 1 is a simplified block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure
  • FIG. 2 is a detailed structural block diagram schematically showing a scheduler according to an embodiment of the present disclosure
  • Figure 3 is a simplified flowchart schematically illustrating a method of scheduling tasks using a scheduler according to an embodiment of the present disclosure
  • Figure 4 is a schematic structural diagram schematically showing a board card according to an embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram schematically showing a combined processing device in a chip according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram schematically showing the internal structure of a computing device according to an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram schematically showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram schematically illustrating data writing operations between computing clusters according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once determined" or "in response to a determination" or "once [the described condition or event] is detected" or "in response to detection of [the described condition or event]".
  • To this end, the solution of this disclosure proposes a lookup-table-based task caching and wake-up mechanism.
  • Specifically, the solution of the present disclosure forms a lookup table (hereinafter referred to as the first lookup table) by storing on the chip a flag bit that indicates whether each task is valid, and determines, according to whether the flag bit is valid, whether the next corresponding task is to be scheduled.
  • Since the lookup table itself has no concept of order and can be considered an out-of-order structure, it can reflect the validity of the current task in a timely manner, thereby enabling faster scheduling of the corresponding task.
  • Additionally, the solution of the present disclosure introduces further lookup tables to implement task scheduling in different scenarios (such as communication tasks between artificial intelligence processor chips), thereby further improving scheduling efficiency, reducing on-chip storage overhead, and simplifying hardware design, which in turn reduces the latency of large-scale task scheduling.
  • FIG. 1 is a simplified block diagram schematically illustrating a task scheduler ("scheduler" for short) 100 according to an embodiment of the present disclosure.
  • In one embodiment, the task scheduler 100 of the present disclosure can be arranged on the artificial intelligence processor chip and connected between the off-chip storage device and the on-chip task execution unit, so as to schedule tasks located on the off-chip storage device onto the chip and issue them to the task execution unit for execution.
  • The task scheduler of the present disclosure may include a scheduling circuit 102 and a first lookup table circuit 104.
  • The scheduling circuit 102 may be configured to read a task from an off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device.
  • The off-chip storage device here may include off-chip dynamic random access memory (e.g., DDR memory) or a cache memory (such as an L3 cache).
  • the task execution unit here may be multiple intelligent processing units, or simplified versions thereof. Depending on the application, intelligent processing units can perform conventional calculations and/or classical algorithms for distributed cluster communication.
  • In some scenarios, a simplified version of the aforementioned intelligent processing unit may be called a microprocessing core, and each microprocessing core may have multiple (e.g., 8) task scheduling queues. Furthermore, the slot id of each task is unique and can be expressed as follows:
  • The combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in a lookup table, such as its address in the first lookup table, or in the second, third and/or fourth lookup tables discussed below.
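The queue-id/task-id combination above can be illustrated with a simple bit-packing scheme. The 6-bit/10-bit split is an assumption chosen to match the figures given elsewhere in the text (64 queues, 1024 tasks per queue, and a 16-bit slot id[15:0]); the actual field layout is not specified here.

```python
QUEUE_BITS, TASK_BITS = 6, 10  # 64 queues, 1024 tasks per queue (assumed widths)

def make_slot_id(queue_id, task_id):
    """Combine a queue identifier and a task identifier into a unique slot id."""
    assert 0 <= queue_id < (1 << QUEUE_BITS) and 0 <= task_id < (1 << TASK_BITS)
    return (queue_id << TASK_BITS) | task_id

def split_slot_id(slot):
    """Recover (queue_id, task_id) to address an entry in a lookup table."""
    return slot >> TASK_BITS, slot & ((1 << TASK_BITS) - 1)

print(hex(make_slot_id(63, 1023)))  # 0xffff, i.e. a 16-bit slot id[15:0]
```

Under these assumed widths, the largest slot id is exactly 0xFFFF, consistent with a 16-bit descriptor offset.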
  • The off-chip storage device stores the multiple tasks to be read onto the chip by the scheduling circuit, together with at least a second lookup table that records the valid status of the multiple tasks.
  • The first lookup table circuit 104, cooperating with the above-mentioned scheduling circuit, may be configured to, in response to a task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record this in the first lookup table, and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip. It can be seen that by utilizing the status change of the task in the first lookup table (i.e., from valid to invalid), the scheduling circuit can efficiently read the next task from the off-chip storage device onto the chip.
  • In some embodiments, the scheduling circuit may be configured to, in response to the triggering by the first lookup table circuit, read one of the multiple tasks recorded in the second lookup table from the off-chip storage device onto the chip, and trigger the first lookup table circuit to update the valid status of the task read onto the chip to the invalid status and record it in the first lookup table.
  • The scheduler 100 of the present disclosure has been described above in conjunction with FIG. 1. It can be understood that by using the first lookup table to record the status changes of on-chip tasks, and triggering subsequent tasks to be sent from the off-chip storage device to the chip based on those status changes, the solution of the present disclosure advantageously ensures the timeliness and effectiveness of task scheduling.
  • Since the first lookup table has an out-of-order structural attribute and better reflects whether the current task is valid, the scheduler is also able to perform task scheduling with higher efficiency.
  • the on-chip task execution unit can execute issued tasks more efficiently to complete, for example, inter-chip communication tasks.
  • FIG. 2 is a detailed structural block diagram schematically showing the task scheduler 100 according to an embodiment of the present disclosure.
  • In addition to the task scheduler 100, the figure also shows a system-on-chip 200 including the task scheduler (which is simply referred to as "on-chip" in the context of this disclosure), and an on-chip task execution unit 204 disposed within the system-on-chip 200.
  • various types of tasks to be issued can be stored on the off-chip storage device 202, and these tasks can be applied for and created by users (such as programmers) through software instructions.
  • In one implementation scenario, multiple tasks created by software applications can be stored in the form of queues, and the software can issue tasks to the queues, where each queue can include tasks of the same type, for example, tasks executed by a single on-chip task execution unit or tasks executed by multiple on-chip task execution units.
  • Assuming each on-chip execution unit includes 8 microprocessing cores, 8 queues can be set for each microprocessing core, so that 64 task queues can be arranged on the off-chip storage device 202 (the 0th-7th queues corresponding to the 0th microprocessing core, the 8th-15th queues to the 1st microprocessing core, and so on, until the 56th-63rd queues correspond to the 7th microprocessing core), and a second lookup table is set up for each microprocessing core to store and maintain its eight queues.
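The queue-to-core arrangement just described (8 queues per core, 64 queues in total) reduces to a simple integer division, sketched below; the function name is illustrative only.

```python
QUEUES_PER_CORE = 8
NUM_CORES = 8

def core_for_queue(queue_id):
    """Map one of the 64 task queues to its microprocessing core:
    queues 0-7 -> core 0, queues 8-15 -> core 1, ..., queues 56-63 -> core 7."""
    assert 0 <= queue_id < QUEUES_PER_CORE * NUM_CORES
    return queue_id // QUEUES_PER_CORE
```

This mapping also explains why a per-core second lookup table covering exactly eight queues is sufficient.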
  • In practical applications, the aforementioned eight second lookup tables (one per microprocessing core) can be maintained and managed through the second lookup table circuit 206, so as to maintain and manage the tasks recorded in the lookup tables.
  • When a task is created, its status in the second lookup table may initially be set to valid by software, for example its valid status bit is set to "1", indicating that the task is valid on the off-chip storage device.
  • Each entry in the first lookup table maintained and managed by the first lookup table circuit 104 may have a status corresponding to a task in the second lookup table, for example a one-bit valid identifier such as "1". When a task is read onto the chip, the first lookup table circuit can modify the valid bit to an invalid flag (such as "0"). Then, based on the transition of the task status from valid to invalid, the first lookup table circuit can trigger the scheduling circuit to read a new task from the second lookup table circuit onto the chip.
  • In some embodiments, in order to realize the execution of inter-chip tasks between the artificial intelligence processor chip and another artificial intelligence processor chip, the present disclosure also proposes to set up a third lookup table circuit 108 in the scheduler, which can be configured to use a third lookup table to record and manage the inter-chip tasks stored on the chip.
  • the number of entries in the third lookup table may be the same as the number of slot ids mentioned above.
  • In one implementation scenario, the third lookup table can be implemented as 64 (addresses) × 5120 (data bits), where 64 corresponds to the 64 queues mentioned above, and each queue has 5120 bits.
  • Since each task in the above third lookup table occupies 5 bits, 1024 tasks can be recorded in each queue. Each task (i.e., table entry record) can include the following semantics: {valid, need initial, wakeup, have data, have space}, where "valid" is the flag bit indicating whether the task is valid, "need initial" is the task initialization flag bit, "wakeup" is the wake-up flag bit indicating whether the task has been awakened, "have data" is the data identification bit indicating whether there is data in the inter-chip buffer, and "have space" is the space identification bit indicating whether there is storage space in the inter-chip buffer.
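The 5-bit entry semantics can be illustrated with a small packing sketch. Which bit position holds which flag is an assumption made for illustration; the disclosure only fixes the set of five flags and the 5-bit width.

```python
# Assumed bit order, least-significant bit first.
FLAGS = ("valid", "need_initial", "wakeup", "have_data", "have_space")

def pack_entry(entry):
    """Pack the five flags into a 5-bit third-lookup-table entry."""
    bits = 0
    for i, name in enumerate(FLAGS):
        if entry.get(name, False):
            bits |= 1 << i
    return bits

def unpack_entry(bits):
    """Recover the flag dictionary from a 5-bit entry."""
    return {name: bool((bits >> i) & 1) for i, name in enumerate(FLAGS)}

# Capacity check against the figures above: a 64 x 5120-bit table at
# 5 bits per task holds 1024 tasks per queue.
assert 5120 // 5 == 1024
```

The capacity assertion ties the 5-bit entry width to the stated 5120 bits per queue and 1024 tasks per queue.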
  • It should be understood that the entry content and semantics in the third lookup table here are only exemplary and not restrictive; based on the teachings of the present disclosure, those skilled in the art can understand that the multiple lookup tables of the present disclosure (for example, the first and second lookup tables, as well as the fourth lookup table to be mentioned below) may have the same or similar entry content (that is, the "descriptor" of a task) and semantics as the third lookup table.
  • the flag bit can be set to "0", thereby indicating that the task has been scheduled once.
  • In one implementation scenario, slot id[15:0] can be used as the offset address of the descriptor in the L3 cache or DDR, and is also used to query the lookup tables (for example, the third lookup table).
  • the "slot id" can be used to address the third lookup table And the corresponding identification bit is set, thereby completing the update of the third lookup table.
  • The task wake-up message here may be, for example, a wake-up message sent by a first artificial intelligence processor chip to a second artificial intelligence processor chip, so as to instruct the second artificial intelligence processor chip to execute, based on the task wake-up message, a task stored on its chip.
  • The task may be a task that is repeatedly scheduled for execution, stored in the reorder buffer ("Reorder Buffer", referred to as "ROB") circuit 110 in the scheduler of the second artificial intelligence processor chip.
  • the solution of the present disclosure proposes to provide the above-mentioned reordering buffer circuit 110 in the scheduler 100, which can be configured to record tasks repeatedly executed by the on-chip task execution unit.
  • During operation, the scheduler can sequentially send the tasks of each task queue to the microprocessing core for execution in the order recorded in the first lookup table, and at the same time register them in the storage space of the reordering buffer circuit.
  • In this process, the "slot id" of each task will also be registered in the reordering buffer circuit.
  • tasks newly acquired by the scheduler from the off-chip storage device can still be scheduled by the scheduler to the idle microprocessor core for execution.
  • In some embodiments, the present disclosure proposes to provide a polling circuit 112 in the scheduler 100, which can receive the task wake-up message used for scheduling inter-chip tasks as described above. Then, the polling circuit may poll the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to identify the specific task associated with the task wake-up message. In response to the specific task being polled, the scheduling circuit may be configured to schedule the polled specific task for execution by the on-chip task execution unit 204.
  • the scheduler can directly wake up the task and send it to the on-chip task execution unit for execution.
  • Otherwise, the scheduler can go to the off-chip storage device (such as the L3 cache or DDR) where the task is saved, retrieve the task (that is, schedule it onto the chip), and send it to the on-chip task execution unit, such as a microprocessing core, for execution.
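The two wake-up paths above (direct dispatch from the reorder buffer versus fetching from off-chip storage first) can be sketched as follows. All names here (`wake_task`, the dictionary-based ROB and storage) are illustrative stand-ins for the hardware structures, not identifiers from the disclosure.

```python
def wake_task(slot_id, rob, off_chip_storage, execute):
    """Wake a task: dispatch it directly if it is registered in the ROB,
    otherwise fetch it from off-chip storage (e.g. L3 cache or DDR) first."""
    if slot_id in rob:                    # repeatedly scheduled task kept on-chip
        execute(rob[slot_id])
        return "rob"
    task = off_chip_storage[slot_id]      # schedule the task onto the chip
    rob[slot_id] = task                   # register it for later re-wakeups
    execute(task)
    return "off_chip"

rob = {}
off_chip = {7: "inter-chip-task"}
executed = []
first = wake_task(7, rob, off_chip, executed.append)
second = wake_task(7, rob, off_chip, executed.append)
print(first, second)  # off_chip rob
```

The second wake-up hits the ROB and skips the off-chip fetch, which is the latency saving the reordering buffer provides for repeatedly executed tasks.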
  • In some embodiments, the present disclosure also proposes to provide a fourth lookup table circuit 114 in the scheduler, which is configured to use a fourth lookup table to record the tasks scheduled to the on-chip task execution unit.
  • During operation, the scheduling circuit may be configured to interact with the fourth lookup table circuit, before scheduling a task to the on-chip task execution unit, so as to query and confirm that the tasks recorded in the fourth lookup table are different from the task currently to be scheduled to the on-chip task execution unit.
  • In addition, the scheduling circuit may be further configured to, in response to receiving from the on-chip task execution unit a notification that a task has completed or suspended execution, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table. With the help of querying the fourth lookup table, it can be guaranteed that the task to be sent to the on-chip task execution unit is different from the multiple tasks being executed in parallel (for example, the "slot id" of each task is different).
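The fourth-lookup-table check can be modeled as a set of in-flight slot ids, as in the sketch below; the class and method names are assumptions for illustration.

```python
class FourthLookupTable:
    """Records slot ids of tasks already dispatched, to block duplicates."""
    def __init__(self):
        self.in_flight = set()

    def try_dispatch(self, slot_id):
        """Dispatch only if the same task is not already being executed."""
        if slot_id in self.in_flight:
            return False            # duplicate of a task in flight: refuse
        self.in_flight.add(slot_id)
        return True

    def on_complete_or_suspend(self, slot_id):
        # Removal is triggered when the execution unit reports that the
        # task has completed or suspended execution.
        self.in_flight.discard(slot_id)
```

Because membership is keyed on the slot id, all tasks executing in parallel necessarily have distinct slot ids, which is exactly the guarantee stated above.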
  • the scheduling scheme of the present disclosure is further elaborated above with reference to FIG. 2 . Based on the above description, those skilled in the art can understand that the scheduling scheme of the present disclosure can significantly reduce the delay of task scheduling and simplify the hardware design by means of the arrangement of one or more lookup tables, especially the arrangement of on-chip lookup tables. Furthermore, by using lookup tables, on-chip resources can be effectively utilized for task scheduling, avoiding repetitive and redundant scheduling of tasks, thereby improving scheduling efficiency. Through the solution of the present disclosure, the smooth and efficient execution of inter-chip tasks is also promoted, thereby saving task execution overhead.
  • FIG. 3 is a simplified flowchart schematically illustrating a method 300 for scheduling tasks using a scheduler according to an embodiment of the present disclosure. It can be understood that the method 300 may be executed by the scheduler detailed above in conjunction with FIGS. 1 and 2 .
  • In the method 300, a scheduling circuit is first used to read a task from the off-chip storage device onto the chip, so that the task is scheduled to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device.
  • the first lookup table circuit is used to update the task from the valid state to the invalid state and record it in the first lookup table.
  • Thereafter, in response to the invalid state being recorded in the first lookup table, the first lookup table circuit may trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • In one embodiment, the above-mentioned off-chip storage device stores the multiple tasks to be read onto the chip by the scheduling circuit and at least a second lookup table that records the valid status of the multiple tasks (implemented, for example, with the help of the second lookup table circuit 206 shown in FIG. 2).
  • Based on this, the method 300 may further include: in response to the triggering of the first lookup table circuit, using the scheduling circuit to read one of the tasks recorded in the second lookup table from the off-chip storage device onto the chip, and using the scheduling circuit to trigger the first lookup table circuit to update the valid status of the task read onto the chip to the invalid status and record it in the first lookup table.
  • In some embodiments, the method further includes using a third lookup table circuit (such as the third lookup table circuit 108 shown in FIG. 2) to record and manage, with the third lookup table, the inter-chip tasks stored on the chip for execution between the artificial intelligence processor chip and another artificial intelligence processor chip.
  • Next, the method further includes utilizing the polling circuit 112 to receive a task wake-up message for scheduling inter-chip tasks, and polling the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to identify the specific task associated with the task wake-up message. Based on this, the method also uses the scheduling circuit to schedule the polled specific task for execution.
  • In one implementation scenario, the method further includes using the scheduler to read the specific task associated with the task wake-up message from the off-chip storage device onto the chip. Further, in order to save and maintain tasks repeatedly executed by the on-chip task execution unit, the method further includes using a reordering buffer circuit (such as the reordering buffer circuit 110 shown in FIG. 2) to record the repeatedly executed tasks.
  • In other embodiments, the method further includes using a fourth lookup table circuit (such as the fourth lookup table circuit 114 shown in FIG. 2) to record, with the fourth lookup table, the tasks scheduled to the on-chip task execution unit. Further, the method includes using the scheduling circuit to interact with the fourth lookup table circuit before scheduling a task to the on-chip task execution unit, so as to query and confirm that the tasks recorded in the fourth lookup table are different from the task currently to be scheduled to the on-chip task execution unit. Additionally, the method further includes using the scheduling circuit to, in response to receiving from the on-chip task execution unit a notification that a task has completed or suspended execution, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • FIG. 4 shows a schematic structural diagram of a board 400 according to an embodiment of the present disclosure.
  • As shown in the figure, the board 400 includes a chip 401, which is a system on chip (SoC) integrated with one or more combined processing devices.
  • The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, so as to meet the intelligent processing needs of complex scenarios in computer vision, speech, natural language processing, data mining and other fields.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a significant feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • the board 400 of this embodiment is suitable for use in cloud intelligence applications.
  • This is referred to as chip-to-chip communication, or "inter-chip communication", in the context of this disclosure.
  • the chip 401 is connected to an external device 403 through an external interface device 402 .
  • the external device 403 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or wifi interface.
  • the data to be processed can be transferred to the chip 401 from the external device 403 through the external interface device 402 .
  • the calculation results of the chip 401 can be transmitted back to the external device 403 via the external interface device 402 .
  • The external interface device 402 may have different interface forms, such as a PCIe interface.
  • The board 400 also includes a storage device 404 for storing data, which includes one or more storage units 405.
  • The storage device 404 is connected with the control device 406 and the chip 401 through a bus and transmits data to them.
  • The control device 406 in the board 400 is configured to control the status of the chip 401.
  • the control device 406 may include a microcontroller unit (Micro Controller Unit, MCU).
  • FIG. 5 is a structural diagram showing the combined processing device in the chip 401 of this embodiment.
  • the combined processing device 500 includes a computing device 501, an interface device 502, a processing device 503 and a DRAM 504.
  • The computing device 501 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning calculations; it can interact with the processing device 503 through the interface device 502 to jointly complete user-specified operations.
  • the interface device 502 is used to transmit data and control instructions between the computing device 501 and the processing device 503 .
  • The computing device 501 can obtain input data from the processing device 503 via the interface device 502 and write it into an on-chip storage device of the computing device 501.
  • the computing device 501 can obtain the control instructions from the processing device 503 via the interface device 502 and write them into the control cache on-chip of the computing device 501 .
  • the interface device 502 may also read the data in the storage device of the computing device 501 and transmit it to the processing device 503 .
  • the processing device 503 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 501, and the like.
  • The processing device 503 may be one or more types of general and/or special purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 501 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • When the computing device 501 and the processing device 503 are considered together, they form a heterogeneous multi-core structure.
  • The storage device 504 is used to store data to be processed. It can be a DRAM, specifically a DDR memory, whose size is usually 16G or larger, and it is used to save data of the computing device 501 and/or the processing device 503.
  • the storage device here can be regarded as an off-chip storage device of the aforementioned scheduling scheme.
  • FIG. 6 shows a schematic diagram of the internal structure of the computing device 501.
  • the computing device 501 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 501 in the figure adopts a multi-core hierarchical structure design.
  • At this level, the computing device 501 serves as a system on a chip and includes multiple computing clusters, and each computing cluster includes multiple processor cores.
  • In other words, the computing device 501 is organized in a system-on-chip / computing-cluster / processor-core hierarchy.
  • the computing device 501 includes an external storage controller 601 , a peripheral communication module 602 , an on-chip interconnection module 603 , a synchronization module 604 and multiple computing clusters 605 .
  • the scheduling circuit in the context of the present disclosure may also be included in the computing device 501 to schedule tasks of the external storage device onto the chip for execution by the computing cluster 605 .
  • the peripheral communication module 602 is used to receive control signals from the processing device 503 through the interface device 502 and start the computing device 501 to perform tasks.
  • the on-chip interconnection module 603 connects the external storage controller 601, the peripheral communication module 602 and multiple computing clusters 605 to transmit data and control signals between various modules.
  • the synchronization module 604 is a global synchronization barrier controller (GBC), used to coordinate the work progress of each computing cluster and ensure information synchronization.
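As a software analogy, the GBC's role of coordinating the work progress of each computing cluster resembles a conventional barrier primitive. The sketch below uses Python threads purely to illustrate that role; it is not a model of the hardware implementation, and the names are invented:

```python
import threading

NUM_CLUSTERS = 4
barrier = threading.Barrier(NUM_CLUSTERS)  # stands in for the GBC
results = []

def cluster_work(cluster_id):
    results.append(cluster_id)  # each "cluster" produces a partial result
    barrier.wait()              # then waits until every cluster has arrived

threads = [threading.Thread(target=cluster_work, args=(i,))
           for i in range(NUM_CLUSTERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# after the barrier, all partial results are visible to every cluster
```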
  • the multiple computing clusters 605 are the computing cores of the computing device 501. Four are shown as an example in the figure; with the development of hardware, the computing device 501 of the present disclosure may also include 8, 16, 64 or even more computing clusters 605. The computing clusters 605 are used to efficiently execute deep learning algorithms.
  • each computing cluster 605 includes multiple processor cores (IPU core) 606 and a storage core (MEM core) 607.
  • processor cores 606 are exemplarily shown in the figure, and the present disclosure does not limit their number. The internal architecture of a processor core is shown in FIG. 7. Each processor core 606 includes three major modules: a control module 71, an operation module 72 and a storage module 73.
  • the control module 71 is used to coordinate and control the work of the operation module 72 and the storage module 73 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 711 and an instruction decode unit (IDU) 712.
  • the instruction fetch unit 711 is used to obtain instructions from the processing device 503, and the instruction decode unit 712 decodes the obtained instructions and sends the decoding results to the operation module 72 and the storage module 73 as control information.
  • the operation module 72 includes a vector operation unit 721 and a matrix operation unit 722.
  • the vector operation unit 721 is used to execute vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation;
  • the matrix operation unit 722 is responsible for the core calculations of deep learning algorithms, namely matrix multiplication and convolution.
  • the storage module 73 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 731, a weight storage unit (weight RAM, WRAM) 732, an input/output direct memory access module (input/output direct memory access, IODMA) 733, and a move direct memory access module (move direct memory access, MVDMA) 734.
  • NRAM 731 is used to store the input data, output data and intermediate results calculated by the processor core 606; WRAM 732 is used to store the weights of the deep learning network; IODMA 733 controls memory access between NRAM 731/WRAM 732 and DRAM 504 through the broadcast bus 609; and MVDMA 734 is used to control memory access between NRAM 731/WRAM 732 and SRAM 608.
  • NRAM and WRAM may be two storage areas formed by dividing the same memory in the logical storage space, or they may be two independent memories; this is not specifically limited here.
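The case where NRAM and WRAM are carved out of one physical memory in the logical storage space can be illustrated with a small Python sketch. The sizes below are hypothetical; the disclosure does not fix them:

```python
TOTAL_BYTES = 1024   # hypothetical size of the shared physical memory
NRAM_BYTES = 640     # hypothetical split point; neither value is given by the disclosure

memory = bytearray(TOTAL_BYTES)
nram = memoryview(memory)[:NRAM_BYTES]   # logical region for neuron data
wram = memoryview(memory)[NRAM_BYTES:]   # logical region for weight data

nram[0] = 7   # both views write into disjoint parts of the same backing memory
wram[0] = 9
```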
  • the storage core 607 is mainly used for storage and communication, that is, storage of data shared between the processor cores 606 or of intermediate results, as well as communication between the computing cluster 605 and the DRAM 504, communication among the computing clusters 605, communication among the processor cores 606, etc.
  • the storage core 607 has scalar operation capabilities to perform scalar operations.
  • the storage core 607 includes a shared memory unit (SRAM) 608, a broadcast bus 609, a computing cluster direct memory access module (cluster direct memory access, CDMA) 610, and a global direct memory access module (global direct memory access, GDMA) 611.
  • SRAM 608 plays the role of a high-performance data transfer station.
  • data reused between different processor cores 606 in the same computing cluster 605 does not need to be fetched from the DRAM 504 by each processor core 606 separately, but is relayed among the processor cores 606 through the SRAM 608.
  • the storage core 607 only needs to quickly distribute the reused data from the SRAM 608 to the multiple processor cores 606, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
  • the broadcast bus 609, CDMA 610 and GDMA 611 are respectively used to perform communication between processor cores 606, communication between computing clusters 605 and data transmission between the computing cluster 605 and the DRAM 504. They will be explained below.
  • the broadcast bus 609 is used to complete high-speed communication between the processor cores 606 in the computing cluster 605.
  • the broadcast bus 609 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission
  • multicast is a communication method that transmits a piece of data from SRAM 608 to several specific processor cores 606;
  • broadcast is a communication method that transmits copies of a piece of data from SRAM 608 to all processor cores 606, and is a special case of multicast.
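The relationship between the three inter-core communication modes can be sketched as follows. This is a hypothetical Python model with invented names; the real transfers are performed by the broadcast bus hardware:

```python
NUM_CORES = 4  # assumed number of processor cores in the cluster

def multicast(sram_data, target_cores):
    """Deliver a copy of the SRAM data to the selected processor cores."""
    cores = [None] * NUM_CORES
    for core_id in target_cores:
        cores[core_id] = sram_data  # each targeted core receives its own copy
    return cores

def unicast(sram_data, target_core):
    """Point-to-point delivery: multicast with a single target."""
    return multicast(sram_data, [target_core])

def broadcast(sram_data):
    """Broadcast is the special case of multicast that targets every core."""
    return multicast(sram_data, range(NUM_CORES))
```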
  • CDMA 610 is used to control memory access of SRAM 608 between different computing clusters 605 within the same computing device 501.
  • Figure 8 shows a schematic diagram when one processor core wants to write data to the processor core of another computing cluster to illustrate the working principle of CDMA 610.
  • the same computing device includes multiple computing clusters.
  • computing cluster 0 and computing cluster 1 are shown in the figure.
  • Computing cluster 0 and computing cluster 1 respectively include multiple processor cores.
  • computing cluster 0 in the figure only displays processor core 0, and computing cluster 1 only displays processor core 1.
  • Processor core 0 wants to write data to processor core 1.
  • processor core 0 sends a unicast write request to write data to the local SRAM 0.
  • CDMA 0 serves as the master (master) end
  • CDMA 1 serves as the slave (slave) end.
  • the master end pushes the write request to the slave end, that is, the master end sends the write address AW and the write data W to transfer the data to the SRAM 1 of computing cluster 1.
  • the slave end sends a write response B in response.
  • the processor core 1 of computing cluster 1 sends a unicast read request to read the data out of SRAM 1.
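The AW/W/B handshake described above can be sketched in software. This is a toy Python model with invented class and method names; the actual CDMA is a hardware circuit, and the sketch only mirrors the order of the steps:

```python
class Sram:
    def __init__(self):
        self.mem = {}

class Cdma:
    """Toy model of the master/slave write handshake (AW, W, B)."""

    def __init__(self, local_sram):
        self.local_sram = local_sram

    def push_write(self, slave, addr, data):
        # master end sends the write address (AW) and write data (W)
        response = slave.accept_write(addr, data)
        # slave end answers with a write response (B)
        return response == "B"

    def accept_write(self, addr, data):
        self.local_sram.mem[addr] = data
        return "B"

# processor core 0 first writes the data into its local SRAM 0 ...
sram0, sram1 = Sram(), Sram()
cdma0, cdma1 = Cdma(sram0), Cdma(sram1)
sram0.mem[0x10] = "payload"
# ... then CDMA 0 (master) pushes the write to CDMA 1 (slave) ...
ok = cdma0.push_write(cdma1, 0x10, sram0.mem[0x10])
# ... and processor core 1 reads the data out of SRAM 1
readback = sram1.mem[0x10]
```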
  • the GDMA 611 cooperates with the external memory controller 601 to control memory access from the SRAM 608 of the computing cluster 605 to the DRAM 504, or to read data from the DRAM 504 to the SRAM 608.
  • the communication between the DRAM 504 and the NRAM 731 or the WRAM 732 can be realized through two channels.
  • the first channel is to directly connect DRAM 504 and NRAM 731 or WRAM 732 through IODMA 733;
  • the second channel is to first transmit data between DRAM 504 and SRAM 608 through GDMA 611, and then transmit data between SRAM 608 and NRAM 731 or WRAM 732 through MVDMA 734.
  • the bandwidth of the second channel is much greater than that of the first channel, so communication between DRAM 504 and NRAM 731 or WRAM 732 may be more efficient through the second channel.
  • the embodiments of the present disclosure can select a data transmission channel according to their own hardware conditions.
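The channel selection above reduces to comparing the bandwidths of the two paths. A minimal sketch, with placeholder bandwidth values supplied by the caller:

```python
def pick_channel(iodma_bw, gdma_mvdma_bw):
    """Select the data transmission channel by comparing bandwidths.

    first channel:  DRAM <-> NRAM/WRAM directly through IODMA
    second channel: DRAM <-> SRAM through GDMA, then SRAM <-> NRAM/WRAM
                    through MVDMA
    Bandwidth values are placeholders, not figures from the disclosure.
    """
    return "second" if gdma_mvdma_bw > iodma_bw else "first"
```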
  • the functionality of GDMA 611 and the functionality of IODMA 733 may be integrated in the same component.
  • this disclosure treats GDMA 611 and IODMA 733 as different components.
  • the functions of GDMA 611, IODMA 733, CDMA 610, and MVDMA 734 can also be implemented by the same component.
  • as long as the functions implemented and the technical effects achieved are similar to those of this disclosure, they all fall within the scope of this disclosure.
  • this application also discloses a device, which includes a processor and a memory.
  • the memory may store program instructions for scheduling tasks which, when executed by the processor, implement the scheduling operation steps described in this application in conjunction with FIGS. 1-3.
  • the present application also discloses a computer-readable storage medium or computer program product, on which computer programs/instructions for task scheduling are stored, which, when executed, implement the scheduling operation steps described in conjunction with FIGS. 1-3.
  • the equipment or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, household appliances, and/or medical equipment.
  • the means of transportation include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; and the medical equipment includes nuclear magnetic resonance machines, B-ultrasound machines and/or electrocardiographs.
  • the equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical and other fields.
  • the equipment or device of the present disclosure can also be used in cloud, edge, terminal and other application scenarios related to artificial intelligence, big data and/or cloud computing.
  • equipment or devices with high power consumption according to the solution of the present disclosure can be applied to cloud devices (such as cloud servers), while equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as a smartphone or a camera).
  • the hardware information of the cloud device and that of the terminal device and/or edge device are compatible with each other, so that, based on the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling and collaborative work of end-cloud integration or cloud-edge-end integration.
  • although this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, those skilled in the art can understand that the solutions of this disclosure are not limited by the order of the described actions. Therefore, based on the disclosure or teachings herein, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Furthermore, the embodiments described in the present disclosure can be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for implementing one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in this disclosure have different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present disclosure, reference may be made to the relevant descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units.
  • the aforementioned components or units may be co-located or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
  • the above integrated units can be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, which can include a number of instructions to cause a computer device (such as a personal computer, server or network equipment, etc.) to perform some or all steps of the method described in the embodiments of the present disclosure.
  • the aforementioned memory may include but is not limited to a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or various media such as optical discs that can store program code.
  • the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein can be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, etc.
  • Clause A1 A scheduler for scheduling tasks, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit.
  • the scheduler includes:
  • a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device;
  • a first lookup table circuit configured to:
  • in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • Clause A2 The scheduler according to Clause A1, wherein the off-chip storage device stores multiple tasks to be read by the scheduling circuit onto the chip and a second lookup table that records at least the valid state of the multiple tasks, and
  • the scheduling circuit is configured to:
  • in response to the triggering of the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
  • trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
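The read, invalidate, read-next loop of clauses A1 and A2 can be sketched as follows. This is a hypothetical software model; names such as `Scheduler` and `second_lut` are invented for illustration, and in the disclosure the scheduler and both lookup tables are hardware circuits:

```python
VALID, INVALID = 1, 0

class Scheduler:
    """Toy software model of the read -> invalidate -> read-next loop."""

    def __init__(self, off_chip_tasks):
        self.off_chip = dict(off_chip_tasks)
        # second lookup table: off-chip record of each task's valid state
        self.second_lut = {tid: VALID for tid in self.off_chip}
        # first lookup table: on-chip record of tasks already read
        self.first_lut = {}
        self.on_chip = []

    def run(self):
        # while the off-chip table still records a valid task, keep reading
        while any(state == VALID for state in self.second_lut.values()):
            tid = next(t for t, s in self.second_lut.items() if s == VALID)
            self.on_chip.append(self.off_chip[tid])  # read the task onto the chip
            self.second_lut[tid] = INVALID           # valid -> invalid off-chip
            self.first_lut[tid] = INVALID            # record in the first lookup table,
            # which triggers reading the next task (the loop iterates again)
        return self.on_chip

sched = Scheduler({0: "task-a", 1: "task-b", 2: "task-c"})
fetched = sched.run()
```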
  • a third lookup table is used to record and manage the inter-chip tasks stored on the chip.
  • the scheduling circuit is configured to schedule the polled specific task.
  • Clause A6 The scheduler of Clause A4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task to be executed by the on-chip task execution unit of the artificial intelligence processor chip.
  • Clause A8 The scheduler according to Clause A1, further comprising a fourth lookup table circuit configured to use the fourth lookup table to record tasks scheduled to the on-chip task execution unit.
  • in response to receiving, from the on-chip task execution unit, an indication that execution of a task is completed or suspended, triggering the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • Clause A11 The scheduler according to any one of Clause A8 to Clause A10, wherein a plurality of tasks are stored, in the form of one or more queues, in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the queue identifier of each queue is combined with the task identifier of each task to indicate the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
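The addressing scheme of clause A11, combining a queue identifier with a task identifier, can be illustrated as a simple index computation. The queue depth below is an assumed value not given by the disclosure:

```python
QUEUE_DEPTH = 64  # assumed per-queue capacity; the disclosure does not specify it

def lut_address(queue_id, task_id):
    """Combine a queue identifier and a task identifier into a flat
    lookup-table address, as clause A11 combines the two to locate a task."""
    if not 0 <= task_id < QUEUE_DEPTH:
        raise ValueError("task_id outside the queue")
    return queue_id * QUEUE_DEPTH + task_id
```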
  • the wake-up flag bit indicates whether the task is awakened
  • Clause A13 The scheduler according to Clause A12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  • An artificial intelligence processor chip including:
  • An on-chip task execution unit is configured to interact with the scheduler in order to execute tasks issued by the scheduler.
  • Clause A16 A method of scheduling tasks using a scheduler according to any of Clauses A1-A13, the method comprising:
  • using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • Clause A17 The method according to Clause A16, wherein the off-chip storage device stores multiple tasks to be read by the scheduling circuit onto the chip and a second lookup table that records at least the valid state of the multiple tasks, and the method comprises: in response to the triggering of the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
  • triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  • a third lookup table is used to record and manage the inter-chip tasks stored on the chip.
  • Clause A19 The method of Clause A18, further comprising using the polling circuit to perform:
  • the scheduling circuit is configured to schedule the polled specific task.
  • Clause A22 The method of Clause A16, further comprising using the reordering buffer circuit to record tasks that are repeatedly executed by the on-chip task execution unit.
  • Clause A23 The method of Clause A16, further comprising using the fourth lookup table circuit to record, using the fourth lookup table, tasks scheduled to the on-chip task execution unit.
  • Clause A24 The method of Clause A23, wherein the method further comprises using the scheduling circuit to perform:
  • Clause A25 The method of Clause A23, wherein the method further comprises using the scheduling circuit to perform:
  • in response to receiving, from the on-chip task execution unit, an indication that execution of a task is completed or suspended, triggering the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • Clause A26 The method according to any one of Clauses A23-A25, wherein a plurality of tasks are stored, in the form of one or more queues, in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  • Clause A27 The method of Clause A26, wherein the entry records in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table include one or more of the following:
  • the wake-up flag bit indicates whether the task is awakened
  • Clause A29 A computer-readable storage medium having computer program instructions for scheduling tasks stored thereon, which, when executed by a processor, cause the method according to any of Clauses A16-A28 to be implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Stored Programmes (AREA)

Abstract

The present disclosure relates to a method for scheduling tasks, and a related product thereof. The related product comprises a device and a computer-readable storage medium. The device may be comprised in a computing processing apparatus of a combined processing apparatus, wherein the computing processing apparatus may comprise one or more data processing apparatuses. The combined processing apparatus may further comprise an interface apparatus and other processing apparatuses. The computing processing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further comprise a storage apparatus, wherein the storage apparatus is respectively connected to the device and the other processing apparatuses, and is used for storing data of the device and the other processing apparatuses. By means of the solution of the present disclosure, the scheduling efficiency can be improved, and the on-chip storage overheads can be reduced. FIG. 5

Description

Method for scheduling tasks and related products
Cross-reference to related applications
This application claims priority to the Chinese patent application filed on August 30, 2022, with application number 202211044067.6, titled "Method for Scheduling Tasks and Related Products".
Technical field
This application relates generally to the computer field. More specifically, this application relates to a scheduler, an artificial intelligence processor chip, a board card, a method and a computer-readable storage medium for scheduling tasks.
Background
In order to solve the problem of excessive overhead when accessing off-chip memory (such as dynamic random access memory, DRAM), a traditional central processing unit ("CPU") generally uses a cache to exploit the temporal and spatial locality of data: data that may be reused is kept in the cache, thereby shortening the time consumed the next time that data is accessed, for example when executing a task. However, some other specialized systems usually cache data by means of buffers ("buffers") or queues ("queues").
As far as task scheduling is concerned, a common approach is to arrange and schedule tasks using a queue structure. The advantage of this structure is that the order of tasks is naturally guaranteed by the queue itself, which is very friendly to software programming interfaces and convenient for programming at the software level. However, the disadvantage of scheduling through queues is poor scheduling flexibility: a subsequent task cannot "overtake" the execution of a preceding task. In addition, on-chip cache resources are limited, and how to use these limited resources to achieve low-latency, high-throughput scheduling has become a technical problem that urgently needs to be solved.
Summary of the invention
In view of the technical problems mentioned in the background above, this application provides a task caching and wake-up solution based on a lookup table. Based on the solution of this application, the latency of large-scale task scheduling can be greatly reduced, while the hardware design is simplified and the on-chip storage overhead is reduced. To this end, this application provides solutions in the following aspects.
In a first aspect, the present disclosure provides a scheduler for task scheduling, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit. The scheduler includes: a scheduling circuit configured to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and a first lookup table circuit configured to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
In a second aspect, the present disclosure provides an artificial intelligence processor chip, including: the scheduler according to the first aspect; and an on-chip task execution unit configured to interact with the scheduler in order to execute tasks issued by the scheduler.
In a third aspect, the present disclosure provides a board card including the artificial intelligence processor chip according to the second aspect.
In a fourth aspect, the present disclosure provides a method of scheduling tasks using the scheduler according to the first aspect, the method including: using the scheduling circuit to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
In a fifth aspect, the present disclosure provides a computer-readable storage medium on which computer program instructions for scheduling tasks are stored; when the computer program instructions are executed by a processor, the method according to the fourth aspect is implemented.
Using the above lookup-table-based solution of the present disclosure, in particular the use of the first lookup table stored on the chip, the processing speed of task scheduling can be accelerated, thereby greatly reducing the latency of large-scale task scheduling. In addition, using the first lookup table also simplifies the complexity of the hardware design and reduces the on-chip storage overhead. In some embodiments, when applied to inter-chip tasks (for example, communication tasks between artificial intelligence processor chips), the present disclosure uses a lookup table dedicated to inter-chip tasks, thereby avoiding bus congestion and backpressure when multiple task wake-up messages are transmitted in bursts, and thus achieving effective inter-chip task scheduling.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and like or corresponding reference numerals designate like or corresponding parts, wherein:
FIG. 1 is a simplified block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure;
FIG. 2 is a detailed structural block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure;
FIG. 3 is a simplified flowchart schematically illustrating a method of scheduling tasks using a scheduler according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram schematically illustrating a board card according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram schematically illustrating a combined processing device in a chip according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram schematically illustrating the internal structure of a computing device according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram schematically illustrating the internal structure of a processor core according to an embodiment of the present disclosure; and
FIG. 8 is a schematic diagram schematically illustrating a data writing operation between computing clusters according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprising" and "including" used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the description and claims of the present disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
As mentioned above, in order to achieve efficient task scheduling and execution, the solution of the present disclosure proposes a lookup-table-based task caching and wake-up mechanism. In particular, the solution of the present disclosure stores on-chip a flag bit indicating whether each task is valid, thereby forming a lookup table (referred to below as the first lookup table), and determines, according to whether the flag bit is valid, whether to read the next corresponding task from the off-chip storage device onto the chip. Since the lookup table itself has no notion of order and can be regarded as an out-of-order structure, it reflects the validity of the current task in a timely manner, enabling faster scheduling of the corresponding task. In some embodiments, the solution of the present disclosure introduces additional lookup tables to implement task scheduling in different scenarios (for example, communication tasks between artificial intelligence processor chips), thereby further improving scheduling efficiency, reducing on-chip storage overhead, simplifying hardware design, and substantially reducing task scheduling latency.
Specific implementations of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 is a simplified block diagram schematically illustrating a task scheduler (referred to below simply as the "scheduler") 100 according to an embodiment of the present disclosure. As mentioned above, the task scheduler 100 of the present disclosure may be arranged on an artificial intelligence processor chip and connected between an off-chip storage device and an on-chip task execution unit, in order to schedule tasks located on the external storage device onto the chip and issue them to the task execution unit for execution.
As shown in FIG. 1, the task scheduler of the present disclosure may include a scheduling circuit 100 and a first lookup table circuit 104. As described above, the scheduling circuit 100 may be configured to read a task from the off-chip storage device onto the chip, so as to schedule the task for execution by the on-chip task execution unit, where the task is recorded on the off-chip storage device in a valid state. In one implementation scenario, the off-chip storage device here may include an off-chip dynamic random access memory (DDR) or a cache memory (such as an L3 cache). In another implementation scenario, the task execution unit here may be a plurality of intelligent processing units, or simplified versions thereof. Depending on the application, the intelligent processing units may perform conventional computations and/or classical algorithms for distributed cluster communication.
In the context of the present disclosure, the simplified version of the aforementioned intelligent processing unit may be referred to as a microprocessing core, and each microprocessing core may have multiple (for example, 8) task scheduling queues. Further, the slot id of each task is unique and can be expressed as follows:
slot id[15:0] = {js_que_id[5:0], real_slot_id[9:0]}, where js_que_id[5:0] denotes the identifier of the queue (in this example there are 2^6 = 64 queues in total) and real_slot_id[9:0] denotes the identifier of the task within that queue (the "task identifier" for short); that is, a 16-bit binary number represents the complete identifier of the task. In some application scenarios, the combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in a lookup table, for example in the first lookup table, as well as in the second, third, and/or fourth lookup tables discussed below.
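As a small illustration of the bit layout above (the field widths come from the text; the helper functions and example values are our own), the 16-bit slot id can be packed and unpacked as follows:

```python
# slot id[15:0] = {js_que_id[5:0], real_slot_id[9:0]}: the upper 6 bits
# select one of 2**6 = 64 queues, and the lower 10 bits identify one of
# 1024 tasks within that queue. Helper names are illustrative only.

def pack_slot_id(js_que_id: int, real_slot_id: int) -> int:
    assert 0 <= js_que_id < 64        # 6-bit queue identifier
    assert 0 <= real_slot_id < 1024   # 10-bit task identifier
    return (js_que_id << 10) | real_slot_id

def unpack_slot_id(slot_id: int):
    return slot_id >> 10, slot_id & 0x3FF

# A task in queue 5 with in-queue identifier 7:
sid = pack_slot_id(5, 7)
assert sid == 5127                    # 5 * 1024 + 7
assert unpack_slot_id(sid) == (5, 7)
```

Because the queue identifier occupies the high-order bits, tasks of the same queue occupy a contiguous range of slot ids, which matches the use of the slot id as both a table index and a descriptor offset later in the text.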
In one implementation scenario, corresponding to the task scheduling queues in each of the above microprocessing cores, the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit, as well as a second lookup table that records at least the valid states of the plurality of tasks.
The first lookup table circuit 104, which cooperates with the above scheduling circuit, may be configured to, in response to a task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record this in the first lookup table, and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip. It can be seen that, by utilizing the state change of a task in the first lookup table (that is, from valid to invalid), the scheduling circuit can efficiently read the next task from the off-chip storage device onto the chip. In particular, for the scenario in which the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, the scheduling circuit may be configured to, in response to the triggering by the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip, and to trigger the first lookup table circuit to update the valid state of the task read onto the chip to the invalid state and record this in the first lookup table.
The scheduler 100 of the present disclosure has been described above in conjunction with FIG. 1. It can be understood that, by using the first lookup table to record the state changes of on-chip tasks and triggering the transfer of subsequent tasks from the off-chip storage device onto the chip based on those state changes, the solution of the present disclosure advantageously ensures the timeliness and validity of task scheduling. In addition, since the first lookup table has an out-of-order structure and better reflects whether the current task is valid, the scheduler can perform task scheduling with higher efficiency. As a result, the on-chip task execution unit can execute the issued tasks more efficiently, so as to complete, for example, inter-chip communication tasks.
FIG. 2 is a detailed structural block diagram schematically illustrating the task scheduler 100 according to an embodiment of the present disclosure. To facilitate further explanation of the operating principle of the task scheduler 100, the figure also shows a system-on-chip 200 that includes the task scheduler (referred to simply as "on-chip" in the context of the present disclosure), as well as an on-chip task execution unit 204 disposed within the system-on-chip 200.
As shown in FIG. 2, the off-chip storage device 202 may store various types of tasks to be issued, and these tasks may be requested and created by users (for example, programmers) through software instructions. As one implementation, the multiple tasks created by software may be stored in the form of queues, with the software issuing tasks to the queues, where each queue may contain tasks of the same type, for example tasks executed by a single on-chip task execution unit or tasks executed by multiple on-chip task execution units. For example, when each on-chip execution unit includes 8 microprocessing cores, 8 queues may be set up for each microprocessing core, so that 64 task queues may be arranged on the off-chip storage device 202 (for example, queues 0-7 correspond to microprocessing core 0, queues 8-15 correspond to microprocessing core 1, and so on, until queues 56-63 correspond to microprocessing core 7), and a second lookup table for storing and maintaining its 8 queues is set up for each microprocessing core. As an example, the maintenance and management of the tasks in the aforementioned 8 lookup tables may be implemented by the second lookup table circuit 206. For a task in a queue, its state in the second lookup table may initially be set to valid by software, for example with its valid status bit set to "1", to indicate that the task is off-chip.
Corresponding to the function of the second lookup table circuit 206, each entry in the first lookup table maintained and managed by the first lookup table circuit 104 may have a status entry corresponding to a task in the second lookup table. For example, a one-bit valid flag (such as "1") may indicate that the corresponding task has been configured by software instructions and is waiting to be issued. When that task is successfully scheduled onto the chip by the scheduling circuit, the first lookup table circuit may change the bit of the valid signal to an invalid flag (such as "0"). Then, based on this transition of the task state from valid to invalid, the first lookup table circuit may trigger the scheduling circuit to read a new task from the second lookup table circuit onto the chip.
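The valid-to-invalid handshake described above can be sketched behaviorally as follows. This is a software model for illustration only, not the hardware circuits themselves; the class and method names are our own:

```python
# Behavioral model: when a task is fetched on-chip, its valid bit in the
# first lookup table flips from 1 to 0, and recording that invalid state
# triggers the scheduling circuit to pull the next task from off-chip.

class Scheduler:
    def __init__(self, off_chip_tasks):
        self.off_chip = list(off_chip_tasks)  # tasks still in DDR / L3
        self.on_chip = []                     # tasks already fetched

    def fetch_next(self):
        if self.off_chip:
            self.on_chip.append(self.off_chip.pop(0))

class FirstLookupTable:
    def __init__(self, num_slots):
        self.valid = [0] * num_slots  # 1 = configured by software, pending

    def on_task_fetched(self, slot, scheduler):
        self.valid[slot] = 0     # valid -> invalid: this task is now on-chip
        scheduler.fetch_next()   # the recorded invalid state triggers the next read

sched = Scheduler(["task_B", "task_C"])
lut = FirstLookupTable(num_slots=4)
lut.valid[0] = 1                 # task_A was configured, then fetched on-chip
lut.on_task_fetched(0, sched)
assert lut.valid[0] == 0
assert sched.on_chip == ["task_B"]   # the next task was pulled automatically
```

The point of the model is the coupling: the fetch of one task is what produces the trigger for the next, so the pipeline keeps itself fed without a separate polling loop over the off-chip queue.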
In one implementation scenario, in order to implement the execution of inter-chip tasks between one artificial intelligence processor chip and another, the present disclosure further proposes providing a third lookup table circuit 108 in the scheduler, which may be configured to use a third lookup table to record and manage the inter-chip tasks stored on the chip. As an example, the number of entries in this third lookup table may be the same as the number of slot ids described above. For example, the third lookup table may be implemented as 64 (addresses) × 5120 (data bits), where 64 may correspond to the 64 queues described above and each queue has 5120 bits.
When each task entry in the above third lookup table occupies 5 bits, 1024 tasks can be recorded in each queue, and each task entry (that is, each table entry) may carry the following semantics: {valid, need initial, wakeup, have data, have space}, where "valid" is a flag bit indicating whether the task is valid, "need initial" is the task initialization flag bit, "wakeup" is a wake-up flag bit indicating whether the task has been awakened, "have data" is a data flag bit indicating whether data exists in the inter-chip buffer, and "have space" is a space flag bit indicating whether storage space exists in the inter-chip buffer. It can be understood that the entry contents and semantics of the third lookup table here are merely exemplary rather than restrictive, and those skilled in the art will understand, in light of the teachings of the present disclosure, that the multiple lookup tables of the present disclosure (for example, the first and second lookup tables, as well as the fourth lookup table mentioned below) may all have entry contents (that is, task "descriptors") and semantics the same as or similar to those of the third lookup table. As an application example, when the software sets the corresponding "need initial" flag bit of a task to be issued to 1, this indicates that the task is valid. Thereafter, after the scheduler of the present disclosure has sent the task to a microprocessing core for the first time, the flag bit may be set to "0", indicating that the task has been scheduled once.
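The five 1-bit fields listed above can be illustrated with a small encoder/decoder. The field set is taken from the text; the bit positions and helper names are an assumption made for the example:

```python
# Each entry of the third lookup table carries five 1-bit flags:
# {valid, need_initial, wakeup, have_data, have_space}.
# Bit positions below are illustrative; only the field set comes from the text.

FLAGS = ("valid", "need_initial", "wakeup", "have_data", "have_space")

def encode_entry(**flags):
    word = 0
    for bit, name in enumerate(FLAGS):
        if flags.get(name, False):
            word |= 1 << bit
    return word

def decode_entry(word):
    return {name: bool((word >> bit) & 1) for bit, name in enumerate(FLAGS)}

# A freshly configured task: valid, and awaiting its first dispatch.
entry = encode_entry(valid=True, need_initial=True)
assert decode_entry(entry)["need_initial"] is True

# After the scheduler dispatches it to a microprocessing core once,
# the need_initial bit is cleared to mark that it has been scheduled:
entry &= ~(1 << FLAGS.index("need_initial"))
assert decode_entry(entry)["need_initial"] is False
assert decode_entry(entry)["valid"] is True
```

With 5 bits per entry and 1024 entries per queue, each queue occupies 5 × 1024 = 5120 bits, which matches the 64 × 5120 table dimensions given above.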
In one implementation scenario, in order to reduce the latency overhead of looking up a task's address, slot id[15:0] may serve both as the offset address of the descriptor in the L3 cache or DDR and as the address for querying a lookup table (for example, the third lookup table). When a chip-to-chip ("c2c") task wake-up message is received (containing the "slot id" to be awakened), the "slot id" can be used to address the third lookup table and set the corresponding flag bit, thereby completing the update of the third lookup table. As an example, the task wake-up message here may be a wake-up message sent by a first artificial intelligence processor chip to a second artificial intelligence processor chip, instructing the second artificial intelligence processor chip to execute, based on the task wake-up message, a task stored on its chip. In this scenario, the task here may be, for example, a task that is repeatedly scheduled for execution and recorded in a reorder buffer ("ROB") circuit 110 in the scheduler of the second artificial intelligence processor chip.
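Handling of a c2c wake-up message as described above might be sketched like this. The dict-based table is a deliberate simplification of the hardware, and the function name is our own:

```python
# On receiving a chip-to-chip task wake-up message, the slot id it carries
# directly addresses the third lookup table, and the wakeup flag of the
# addressed entry is set so that the polling logic will later find it.

def handle_c2c_wakeup(third_lut, slot_id):
    entry = third_lut.setdefault(slot_id, {"valid": False, "wakeup": False})
    entry["wakeup"] = True   # set only the wakeup bit; other flags untouched
    return entry

third_lut = {5127: {"valid": True, "wakeup": False}}
handle_c2c_wakeup(third_lut, 5127)       # message carries slot id 5127
assert third_lut[5127]["wakeup"] is True
assert third_lut[5127]["valid"] is True  # valid flag is preserved
```

Because the slot id is simultaneously the descriptor offset in L3/DDR and the table index, no separate address translation step is needed between receiving the message and updating the table, which is the latency saving the paragraph describes.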
In order to achieve effective scheduling of tasks, the solution of the present disclosure proposes providing in the scheduler 100 the reorder buffer circuit 110 mentioned above, which may be configured to record tasks that are repeatedly executed by the on-chip task execution unit. In an exemplary application scenario, the scheduler may send each task queue to the microprocessing cores for execution in the order recorded in the first lookup table, while registering the tasks in the storage space of the reorder buffer circuit; during this process, the "slot id" of each task is also registered in the reorder buffer circuit. When all the storage resources in the reorder buffer circuit are occupied, tasks newly fetched by the scheduler from the off-chip storage device can still be scheduled by the scheduler onto idle microprocessing cores for execution.
As another implementation scenario of inter-chip task scheduling, the present disclosure proposes providing a polling circuit 112 in the scheduler 100, which may receive the task wake-up messages used for scheduling inter-chip tasks as described above. The polling circuit may then poll the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to find the specific task associated with the task wake-up message. In response to the specific task being found by polling, the scheduling circuit may be configured to schedule the polled specific task for execution by the on-chip task execution unit 204.
For example, during the polling process, the polling circuit may poll the third lookup table for multiple queues simultaneously, where each queue may be, for example, 3 × 1024 (entries) = 3072 bits. On this basis, the polling circuit can poll 32 entries at a time, so that 32 scheduler clock cycles (32 nanoseconds) suffice to poll all 1024 entries of a queue. During inter-chip task scheduling, if the task to be awakened is already stored on the chip, the scheduler can directly wake up the task and send it to the on-chip task execution unit for execution. Conversely, if the polling circuit does not find the task to be awakened (in other words, the task is not on the chip), the scheduler can fetch the task from the off-chip storage device (such as the L3 buffer or DDR), that is, schedule it onto the chip, and send it to the on-chip task execution unit, such as a microprocessing core, for execution. As a preferred approach, the present disclosure assumes that inter-chip task scheduling takes priority over intra-chip task scheduling.
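The polling arithmetic above (32 entries per scheduler clock cycle, so a 1024-entry queue in 32 cycles) can be checked with a short model. The grouping into windows of 32 follows the text; the function itself is illustrative:

```python
ENTRIES_PER_QUEUE = 1024
ENTRIES_PER_CYCLE = 32   # 1024 / 32 = 32 cycles for a full queue scan

def poll_queue(wakeup_bits):
    """Scan a queue's wakeup bits 32 at a time; return (cycles, hit or None)."""
    cycles = 0
    for base in range(0, ENTRIES_PER_QUEUE, ENTRIES_PER_CYCLE):
        cycles += 1
        window = wakeup_bits[base:base + ENTRIES_PER_CYCLE]
        if 1 in window:
            return cycles, base + window.index(1)
    return cycles, None  # task not on-chip: fetch it from L3 / DDR instead

bits = [0] * ENTRIES_PER_QUEUE
bits[700] = 1
cycles, hit = poll_queue(bits)
assert hit == 700
assert cycles == 700 // 32 + 1               # found in the 22nd window of 32
assert poll_queue([0] * 1024) == (32, None)  # empty queue: full 32-cycle scan
```

The exhaustive-scan case corresponds to the "not on-chip" branch in the text, where the scheduler falls back to fetching the task from off-chip storage.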
In order to effectively record the tasks sent to the on-chip task execution unit, the present disclosure further proposes providing a fourth lookup table circuit 114 in the scheduler, configured to use a fourth lookup table to record the tasks scheduled to the on-chip task execution unit. In one implementation scenario, the scheduling circuit may be configured to interact with the fourth lookup table circuit before scheduling a task to be executed to the on-chip task execution unit, in order to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit. Further, the scheduling circuit may also be configured to, in response to receiving from the on-chip task execution unit a notification that a task has completed or been suspended, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table. By querying the fourth lookup table, it can be guaranteed that a task to be sent to the on-chip task execution unit differs from the multiple tasks being executed in parallel (for example, the tasks' "slot id" values differ).
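The duplicate-dispatch check against the fourth lookup table can be sketched as follows. This is a set-based software model; the hardware presumably compares slot ids, and the class and method names here are our own:

```python
# Before dispatch, the scheduler checks that the task's slot id is not
# already recorded as in flight; when the execution unit reports completion
# or suspension, the entry is removed so the task may be dispatched again.

class FourthLookupTable:
    def __init__(self):
        self.in_flight = set()   # slot ids currently at the execution units

    def try_dispatch(self, slot_id):
        if slot_id in self.in_flight:
            return False         # same task already executing in parallel
        self.in_flight.add(slot_id)
        return True

    def on_done_or_suspended(self, slot_id):
        self.in_flight.discard(slot_id)

lut4 = FourthLookupTable()
assert lut4.try_dispatch(5127) is True
assert lut4.try_dispatch(5127) is False  # duplicate blocked while in flight
lut4.on_done_or_suspended(5127)
assert lut4.try_dispatch(5127) is True   # allowed again after completion
```

The removal on completion or suspension is what makes repeated scheduling of the same task (as with the reorder buffer's repeatedly executed tasks) possible without ever running two copies concurrently.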
The scheduling solution of the present disclosure has been further elaborated above in conjunction with FIG. 2. Based on the above description, those skilled in the art will understand that, by virtue of the arrangement of one or more lookup tables, and in particular the on-chip lookup tables, the scheduling solution of the present disclosure can significantly reduce task scheduling latency while simplifying hardware design. Further, by using lookup tables, on-chip resources can be utilized effectively for task scheduling, and repeated, redundant scheduling of tasks can be avoided, thereby improving scheduling efficiency. The solution of the present disclosure also promotes the smooth and efficient execution of inter-chip tasks, thereby reducing task execution overhead.
FIG. 3 is a simplified flowchart schematically illustrating a method 300 of scheduling tasks using a scheduler according to an embodiment of the present disclosure. It can be understood that the method 300 may be performed by the scheduler described in detail above in conjunction with FIG. 1 and FIG. 2.
As shown in FIG. 3, at step S302, the scheduling circuit is used to read a task from the off-chip storage device onto the chip, so as to schedule the task for execution by the on-chip task execution unit, where the task is recorded on the off-chip storage device in a valid state. Next, at step S304, in response to the task being read from the off-chip storage device onto the chip, the first lookup table circuit is used to update the task from the valid state to the invalid state and record this in the first lookup table. Thereafter, at step S306, in response to the invalid state being recorded in the first lookup table, the first lookup table circuit may trigger the reading of the next task from the off-chip storage device onto the chip.
In one embodiment, the above off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit, as well as a second lookup table that records at least the valid states of the plurality of tasks (implemented, for example, by means of the second lookup table circuit 206 shown in FIG. 2). In this embodiment, the method 300 may further include, in response to the triggering by the first lookup table circuit, using the scheduling circuit to read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip, and using the scheduling circuit to trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record this in the first lookup table.
In one embodiment, the method further includes using a third lookup table circuit (for example, the third lookup table circuit 108 shown in FIG. 2) to handle inter-chip tasks between the artificial intelligence processor chip and another artificial intelligence processor chip, that is, using the third lookup table to record and manage the inter-chip tasks stored on the chip. In another embodiment, the method further includes using the polling circuit 112 to receive a task wake-up message for scheduling inter-chip tasks and polling the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to find the specific task associated with the task wake-up message. On this basis, the method further uses the scheduling circuit to schedule the specific task found by polling.
In some scenarios, in response to the polling circuit failing to find the specific task by polling, the method further includes using the scheduler to read the specific task associated with the task wake-up message from the off-chip storage device onto the chip. Further, in order to save and maintain the tasks repeatedly executed by the on-chip task execution unit, the method further includes using a reorder buffer circuit (such as the reorder buffer circuit 110 shown in FIG. 2) to record the repeatedly executed tasks.
In other scenarios, the method further includes using a fourth lookup table circuit (such as the fourth lookup table circuit 114 shown in FIG. 2) to record, by means of the fourth lookup table, the tasks scheduled to the on-chip task execution unit. Further, the method includes using the scheduling circuit to interact with the fourth lookup table circuit before scheduling a task to be executed to the on-chip task execution unit, in order to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit. In addition, the method includes using the scheduling circuit to, in response to receiving from the on-chip task execution unit a notification that a task has completed or been suspended, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
The method of performing task scheduling using a scheduler according to the present disclosure has been described above in conjunction with FIG. 3. It can be understood that the above description is merely exemplary rather than restrictive. Based on the disclosure herein, those skilled in the art may also conceive of combining or substituting the steps therein, so as to achieve effective scheduling of tasks and save scheduling resources.
FIG. 4 is a schematic diagram illustrating the structure of a board card 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the board card 400 includes a chip 401, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform. The board card 400 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and substantial computing power. In some scenarios, when only one chip 401 is arranged on each board card, task scheduling between board cards is likewise the inter-chip communication referred to in the context of the present disclosure.
The chip 401 is connected to an external device 403 through an external interface device 402. The external device 403 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 403 to the chip 401 through the external interface device 402, and the computation results of the chip 401 can be transmitted back to the external device 403 via the external interface device 402. Depending on the application scenario, the external interface device 402 may take different interface forms, such as a PCIe interface.
The board card 400 also includes a storage device 404 for storing data, which includes one or more storage units 405. The storage device 404 is connected to, and exchanges data with, the control device 406 and the chip 401 through a bus. The control device 406 in the board card 400 is configured to regulate the state of the chip 401. To this end, in one application scenario, the control device 406 may include a microcontroller unit (MCU).
FIG. 5 is a structural diagram of the combined processing device in the chip 401 of this embodiment. As shown in FIG. 5, the combined processing device 500 includes a computing device 501, an interface device 502, a processing device 503, and a DRAM 504.
The computing device 501 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computations. It can interact with the processing device 503 through the interface device 502 to jointly complete the user-specified operations.
The interface device 502 is used to transfer data and control instructions between the computing device 501 and the processing device 503. For example, the computing device 501 may obtain input data from the processing device 503 via the interface device 502 and write it into an on-chip storage device of the computing device 501. Further, the computing device 501 may obtain control instructions from the processing device 503 via the interface device 502 and write them into an on-chip control cache of the computing device 501. Alternatively or optionally, the interface device 502 may also read data from the storage device of the computing device 501 and transmit it to the processing device 503.
As a general-purpose processing device, the processing device 503 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 501. Depending on the implementation, the processing device 503 may be one or more of a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 501 of the present disclosure, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 501 and the processing device 503 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The storage device 504 is used to store data to be processed. It may be a DRAM, specifically a DDR memory, typically 16 GB or larger, and stores data of the computing device 501 and/or the processing device 503. In the context of the present disclosure, this storage device can be regarded as the off-chip storage device of the aforementioned scheduling scheme.
FIG. 6 is a schematic diagram of the internal structure of the computing device 501. The computing device 501 processes input data such as computer vision, speech, natural language, and data mining data. The computing device 501 in the figure adopts a multi-core hierarchical design: as a system-on-chip, it includes multiple computing clusters, and each computing cluster in turn includes multiple processor cores. In other words, the computing device 501 is organized in a system-on-chip / computing-cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 6, the computing device 501 includes an external storage controller 601, a peripheral communication module 602, an on-chip interconnect module 603, a synchronization module 604, and multiple computing clusters 605. Although not shown, the computing device 501 may also include the scheduling circuit described in the context of the present disclosure, so as to schedule tasks from the external storage device onto the chip for execution by the computing clusters 605.
There may be multiple external storage controllers 601 (two are shown in the figure by way of example), which respond to access requests issued by the processor cores and access an external storage device, such as the DRAM 504 in FIG. 5, to read data from or write data to off-chip memory. The peripheral communication module 602 receives control signals from the processing device 503 through the interface device 502 and starts the computing device 501 to execute tasks. The on-chip interconnect module 603 connects the external storage controllers 601, the peripheral communication module 602, and the multiple computing clusters 605, and transfers data and control signals among these modules. The synchronization module 604 is a global barrier controller (GBC), which coordinates the work progress of the computing clusters and ensures information synchronization. The multiple computing clusters 605 are the computing cores of the computing device 501; four are shown in the figure by way of example, and with the development of hardware, the computing device 501 of the present disclosure may also include 8, 16, 64, or even more computing clusters 605. The computing clusters 605 are used to efficiently execute deep learning algorithms.
At the computing-cluster level, as shown in FIG. 6, each computing cluster 605 includes multiple processor cores (IPU cores) 606 and one storage core (MEM core) 607.
Four processor cores 606 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 606. The internal architecture of a processor core is shown in FIG. 7. Each processor core 606 includes three major modules: a control module 71, an operation module 72, and a storage module 73.
The control module 71 coordinates and controls the work of the operation module 72 and the storage module 73 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 711 and an instruction decode unit (IDU) 712. The instruction fetch unit 711 obtains instructions from the processing device 503, and the instruction decode unit 712 decodes the obtained instructions and sends the decoding results as control information to the operation module 72 and the storage module 73.
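As a rough behavioral sketch of this fetch/decode split (a software model only, not the actual hardware design; the instruction encoding, opcode names, and routing rules below are invented for illustration), the control module can be thought of as:

```python
# Behavioral sketch of the control module's fetch/decode pipeline.
# The instruction format and opcode names are illustrative assumptions,
# not the patent's actual encoding.

def instruction_fetch(program, pc):
    """IFU 711: obtain the next instruction from the instruction stream."""
    return program[pc]

def instruction_decode(instruction):
    """IDU 712: decode an instruction into control information and route it
    to the operation module (72) or the storage module (73)."""
    opcode, operands = instruction
    if opcode in ("vector_add", "matrix_mul"):
        return ("operation_module", opcode, operands)
    if opcode in ("load", "store"):
        return ("storage_module", opcode, operands)
    raise ValueError(f"unknown opcode: {opcode}")

program = [("load", ("NRAM", 0)), ("matrix_mul", (0, 1)), ("store", ("NRAM", 2))]
control_info = [instruction_decode(instruction_fetch(program, pc))
                for pc in range(len(program))]
```

In this sketch, compute instructions are steered to the operation module and memory instructions to the storage module, mirroring how the IDU's decoded control information fans out to modules 72 and 73.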
The operation module 72 includes a vector operation unit 721 and a matrix operation unit 722. The vector operation unit 721 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 722 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 73 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 731, a weight storage unit (weight RAM, WRAM) 732, an input/output direct memory access module (IODMA) 733, and a move direct memory access module (MVDMA) 734. The NRAM 731 stores the input data, output data, and intermediate results computed by the processor core 606; the WRAM 732 stores the weights of the deep learning network; the IODMA 733 controls memory access between the NRAM 731/WRAM 732 and the DRAM 504 through the broadcast bus 609; and the MVDMA 734 controls memory access between the NRAM 731/WRAM 732 and the SRAM 608. It should be noted that the NRAM and the WRAM here may be two storage areas formed by partitioning the logical storage space of the same memory, or may be two independent memories; no specific limitation is made here.
Returning to FIG. 6, the storage core 607 is mainly used for storage and communication, that is, storing data shared among the processor cores 606 or intermediate results, and performing communication between the computing cluster 605 and the DRAM 504, communication among the computing clusters 605, communication among the processor cores 606, and so on. In other embodiments, the storage core 607 has scalar operation capability and performs scalar operations.
The storage core 607 includes a shared memory unit (SRAM) 608, a broadcast bus 609, a cluster direct memory access module (CDMA) 610, and a global direct memory access module (GDMA) 611. The SRAM 608 plays the role of a high-performance data transfer station: data reused among different processor cores 606 within the same computing cluster 605 does not need to be fetched from the DRAM 504 by each processor core 606 individually, but is instead relayed among the processor cores 606 through the SRAM 608. The storage core 607 only needs to quickly distribute the reused data from the SRAM 608 to the multiple processor cores 606, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 609, the CDMA 610, and the GDMA 611 are used, respectively, for communication among the processor cores 606, communication among the computing clusters 605, and data transfer between a computing cluster 605 and the DRAM 504. Each is described below.
The broadcast bus 609 performs high-speed communication among the processor cores 606 within a computing cluster 605. The broadcast bus 609 in this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point (i.e., single processor core to single processor core) data transfer; multicast is a communication mode that transfers one piece of data from the SRAM 608 to a specific set of processor cores 606; and broadcast, which transfers one piece of data from the SRAM 608 to all processor cores 606, is a special case of multicast.
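The relationship among the three modes can be captured by a small model in which multicast delivers one piece of data to a chosen subset of cores, and unicast and broadcast are simply the two extremes of that subset (a single core, and all cores). This is an illustrative sketch only; the core identifiers and function names are assumptions, not hardware interfaces:

```python
# Illustrative model of the communication modes of broadcast bus 609.
# Core identifiers and the transfer representation are assumptions.

ALL_CORES = [0, 1, 2, 3]  # the four processor cores 606 in one cluster 605

def multicast(sram_data, target_cores):
    """Transfer one piece of data from SRAM 608 to a set of cores."""
    return {core: sram_data for core in target_cores}

def unicast(sram_data, target_core):
    """Point-to-point transfer: multicast to exactly one core."""
    return multicast(sram_data, [target_core])

def broadcast(sram_data):
    """Broadcast: the special case of multicast targeting all cores."""
    return multicast(sram_data, ALL_CORES)
```

Modeling unicast and broadcast in terms of multicast makes the text's observation explicit: broadcast is not a separate mechanism but the limiting case of multicast.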
The CDMA 610 controls memory access to the SRAMs 608 of different computing clusters 605 within the same computing device 501. FIG. 8 is a schematic diagram of one processor core writing data to a processor core of another computing cluster, illustrating the working principle of the CDMA 610. In this application scenario, the same computing device includes multiple computing clusters; for convenience of explanation, only computing cluster 0 and computing cluster 1 are shown in the figure, and each includes multiple processor cores. Also for convenience, computing cluster 0 in the figure shows only processor core 0, and computing cluster 1 shows only processor core 1. Processor core 0 intends to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave; that is, the master sends the write address AW and the write data W, transferring the data into SRAM 1 of computing cluster 1. The slave then sends a write response B in reply. Finally, processor core 1 of computing cluster 1 sends a unicast read request to read the data out of SRAM 1.
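The master/slave handshake described above can be sketched as a sequence of channel transactions in the style of an AW/W/B write handshake. The sketch below is a software model under assumed data structures (dict-based SRAMs, a transaction log); the real CDMA is a hardware circuit:

```python
# Sketch of the inter-cluster write of FIG. 8. SRAMs are modeled as dicts;
# the AW/W/B message names follow the text, everything else is assumed.

def cdma_write(sram_src, sram_dst, addr, log):
    """CDMA 0 (master) pushes a write request to CDMA 1 (slave)."""
    log.append(("AW", addr))            # master sends the write address
    log.append(("W", sram_src[addr]))   # master sends the write data
    sram_dst[addr] = sram_src[addr]     # data lands in the remote SRAM 1
    log.append(("B", "okay"))           # slave replies with a write response

log = []
sram0 = {0x10: "payload"}   # cluster 0: core 0 already wrote into SRAM 0
sram1 = {}
cdma_write(sram0, sram1, 0x10, log)
result = sram1[0x10]        # cluster 1: core 1 reads the data out of SRAM 1
```

The log mirrors the order stated in the text: write address, write data, then the slave's write response, after which the remote core can read the data locally.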
Returning to FIG. 6, the GDMA 611 cooperates with the external storage controller 601 to control memory access from the SRAM 608 of a computing cluster 605 to the DRAM 504, or to read data from the DRAM 504 into the SRAM 608. From the foregoing, communication between the DRAM 504 and the NRAM 731 or WRAM 732 can be realized through two channels. The first channel directly connects the DRAM 504 with the NRAM 731 or WRAM 732 through the IODMA 733; the second channel first transfers data between the DRAM 504 and the SRAM 608 via the GDMA 611, and then transfers data between the SRAM 608 and the NRAM 731 or WRAM 732 via the MVDMA 734. Although the second channel appears to require more components and a longer data path, in some embodiments its bandwidth is in fact far greater than that of the first channel, so communication between the DRAM 504 and the NRAM 731 or WRAM 732 may be more efficient through the second channel. Embodiments of the present disclosure may select a data transfer channel according to their own hardware conditions.
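The channel choice amounts to comparing the effective bandwidth of a direct IODMA transfer against the two-hop GDMA + MVDMA path, whose end-to-end rate is limited by its slower stage. A minimal selection heuristic, with bandwidth figures invented purely for illustration, might look like:

```python
# Minimal channel-selection sketch for DRAM <-> NRAM/WRAM transfers.
# Bandwidth numbers are invented; real values depend on the hardware.

def pick_channel(iodma_bw_gbps, gdma_bw_gbps, mvdma_bw_gbps):
    """Return the channel with the higher end-to-end bandwidth.
    The two-hop path is bottlenecked by its slower stage."""
    two_hop_bw = min(gdma_bw_gbps, mvdma_bw_gbps)
    if two_hop_bw > iodma_bw_gbps:
        return "GDMA+MVDMA"   # second channel: DRAM -> SRAM 608 -> NRAM/WRAM
    return "IODMA"            # first channel: DRAM -> NRAM/WRAM directly

choice = pick_channel(iodma_bw_gbps=50, gdma_bw_gbps=400, mvdma_bw_gbps=300)
```

With the illustrative figures above, the two-hop path wins despite its longer route, matching the text's observation that the second channel may be more efficient when its bandwidth is far greater.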
In other embodiments, the functions of the GDMA 611 and the IODMA 733 may be integrated into the same component. For convenience of description, the present disclosure treats the GDMA 611 and the IODMA 733 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, such variants fall within the protection scope of the present disclosure. Further, the functions of the GDMA 611, the IODMA 733, the CDMA 610, and the MVDMA 734 may also be realized by the same component; likewise, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they fall within the protection scope of the present disclosure.
The software and hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 4-8. It should be understood that the above description is merely exemplary rather than restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also modify the board card (or artificial intelligence device) of the present disclosure and its internal structure, and such modifications still fall within the protection scope of the present disclosure.
Based on the above description, those skilled in the art will understand that the present application in fact also discloses a device including a processor and a memory. Specifically, the memory may store program instructions for scheduling tasks; when the program instructions are executed by the processor, the scheduling operation steps described in this application with reference to FIGS. 1-3 are implemented. In addition, since the solution of the present application can be implemented by computer program instructions, the present application also discloses a computer-readable storage medium or computer program product on which computer programs/instructions for task scheduling are stored, thereby implementing the scheduling operation steps described with reference to FIGS. 1-3.
The solution of the present disclosure has been described in detail above with reference to the accompanying drawings. According to different application scenarios, the device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet of Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The device or apparatus of the present disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.
Further, the device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a high-power-consumption device or apparatus according to the solution of the present disclosure may be applied to a cloud device (for example, a cloud server), while a low-power-consumption device or apparatus may be applied to a terminal device and/or an edge device (for example, a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, based on the hardware information of the terminal device and/or the edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of terminal-cloud integration or cloud-edge-terminal integration.
It should be noted that, for the sake of brevity, the present disclosure presents some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the implementation of one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of the embodiments in the present disclosure each have their own emphasis. In view of this, for parts not described in detail in a given embodiment of the present disclosure, those skilled in the art may refer to the relevant descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the device or apparatus embodiments described above are divided herein on the basis of logical functions, while other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of the connection relationships between different units or components, the connections discussed above with reference to the accompanying drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. Furthermore, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically on its own.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as an independent product, the integrated units may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (for example, a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.
In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic storage media, magneto-optical storage media, and the like), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1. A scheduler for scheduling tasks, arranged on an artificial intelligence processor chip and connecting an off-chip storage device and an on-chip task execution unit, the scheduler comprising:

a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and

a first lookup table circuit configured to:

in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record the invalid state in a first lookup table; and

in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read a next task from the off-chip storage device onto the chip.
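The interplay between the scheduling circuit and the first lookup table circuit in Clause A1 can be sketched in software as follows. This is an illustrative model only; the real design is a hardware circuit, and the table layout and task identifiers are assumptions:

```python
# Software model of Clause A1: reading a task on-chip flips its state
# from valid to invalid in the first lookup table, and recording that
# invalid state in turn triggers the fetch of the next task.

VALID, INVALID = "valid", "invalid"

off_chip_tasks = ["task0", "task1", "task2"]   # recorded off-chip as valid
first_lookup_table = {}                        # on-chip state records
executed = []

def read_next(task_iter):
    """Scheduling circuit: read one task from off-chip storage on-chip."""
    task = next(task_iter, None)
    if task is not None:
        first_lookup_table[task] = INVALID     # valid -> invalid on read-in
        executed.append(task)                  # dispatch to execution unit
        read_next(task_iter)                   # recording triggers next read
    return task

read_next(iter(off_chip_tasks))
```

The recursive call models the trigger chain: each recording of an invalid state prompts the scheduling circuit to fetch the next off-chip task, so the scheduler drains the off-chip task list without external prompting.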
Clause A2. The scheduler according to Clause A1, wherein the off-chip storage device stores multiple tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the multiple tasks, and the scheduling circuit is configured to:

in response to the triggering by the first lookup table circuit, read one of the multiple tasks recorded in the second lookup table from the off-chip storage device onto the chip; and

trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
Clause A3. The scheduler according to Clause A1, further comprising a third lookup table circuit configured to:
when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, use a third lookup table to record and manage the inter-chip task stored on the chip.
Clause A4. The scheduler according to Clause A3, further comprising a polling circuit configured to:
receive a task wake-up message for scheduling inter-chip tasks; and
poll, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
wherein the scheduling circuit is configured to schedule the specific task found by polling.
Clause A5. The scheduler according to Clause A4, wherein the scheduling circuit is further configured to:
in response to the polling circuit failing to locate the specific task, read the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
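Clauses A4 and A5 together describe a poll-then-fallback path. A minimal sketch, under the assumption that the third lookup table can be modeled as a list of on-chip task identifiers and off-chip storage as a mapping (both hypothetical simplifications of the hardware):

```python
def handle_wakeup(msg_task_id, third_lut, off_chip_storage):
    """Resolve a task wake-up message to a schedulable task."""
    # Poll the inter-chip tasks currently recorded on-chip.
    for task_id in third_lut:
        if task_id == msg_task_id:
            return ("on_chip", task_id)     # hit: schedule directly
    # Miss: read the specific task from off-chip storage onto the chip.
    task = off_chip_storage[msg_task_id]
    third_lut.append(msg_task_id)           # now recorded on-chip
    return ("off_chip", task)
```

The design point is that the on-chip lookup table acts as a cache of inter-chip tasks: the off-chip read in Clause A5 is only the miss path.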
Clause A6. The scheduler according to Clause A4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
Clause A7. The scheduler according to Clause A1, further comprising a reordering buffer circuit configured to record tasks repeatedly executed by the on-chip task execution unit.
Clause A8. The scheduler according to Clause A1, further comprising a fourth lookup table circuit configured to use a fourth lookup table to record tasks scheduled to the on-chip task execution unit.
Clause A9. The scheduler according to Clause A8, wherein the scheduling circuit is further configured to:
before scheduling a task to be executed to the on-chip task execution unit, interact with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
Clause A10. The scheduler according to Clause A8, wherein the scheduling circuit is further configured to:
in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, trigger the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
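Clauses A9 and A10 amount to an in-flight set: an entry in the fourth lookup table blocks re-dispatch of the same task, and the entry is cleared when the execution unit reports completion or suspension. A hypothetical model (interface names are ours, not the disclosure's):

```python
class FourthLookupTable:
    """Tracks tasks currently dispatched to the on-chip task execution unit."""
    def __init__(self):
        self.in_flight = set()

    def try_dispatch(self, task_id):
        # Query-and-confirm (Clause A9): dispatch only if not already recorded.
        if task_id in self.in_flight:
            return False
        self.in_flight.add(task_id)
        return True

    def on_done_or_suspended(self, task_id):
        # Completion or suspension (Clause A10) frees the entry for re-dispatch.
        self.in_flight.discard(task_id)
```

Treating suspension like completion here matters: a suspended task leaves the table and can later be dispatched again, which matches the removal rule in Clause A10.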
Clause A11. The scheduler according to any one of Clauses A8-A10, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
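The {queue identifier, task identifier} addressing of Clause A11 can be illustrated as a simple bit concatenation: the queue identifier selects a region of the lookup table and the task identifier selects the entry within that region. The field width below is an assumption made for the example only, not a value from the disclosure.

```python
TASK_ID_BITS = 6   # assumed width: up to 64 tasks per queue

def entry_address(queue_id, task_id):
    """Form a lookup table address from a queue ID and a task ID."""
    return (queue_id << TASK_ID_BITS) | task_id
```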
Clause A12. The scheduler according to Clause A11, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
a flag bit indicating whether the task is valid;
a task initialization flag bit;
a wake-up flag bit indicating whether the task has been woken up;
a data flag bit indicating whether data exists in an inter-chip buffer; and
a space flag bit indicating whether storage space exists in the inter-chip buffer.
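One possible packing of these flag bits into a single entry word is sketched below; the bit positions and the readiness rule are illustrative assumptions only, not taken from the disclosure.

```python
VALID_BIT  = 1 << 0   # task is valid
INIT_BIT   = 1 << 1   # task has been initialized
WAKEUP_BIT = 1 << 2   # task has been woken up
DATA_BIT   = 1 << 3   # inter-chip buffer holds data to consume
SPACE_BIT  = 1 << 4   # inter-chip buffer has free space to produce into

def ready_to_run(entry):
    """An inter-chip task is runnable only when every condition holds:
    valid, initialized, woken up, data available, and space available."""
    required = VALID_BIT | INIT_BIT | WAKEUP_BIT | DATA_BIT | SPACE_BIT
    return entry & required == required
```

The data and space flags together give the familiar producer/consumer gating of a bounded buffer: a task stalls both when there is nothing to read and when there is nowhere to write.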
Clause A13. The scheduler according to Clause A12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
Clause A14. An artificial intelligence processor chip, comprising:
the scheduler according to any one of Clauses A1-A13; and
an on-chip task execution unit configured to interact with the scheduler so as to execute tasks issued by the scheduler.
Clause A15. A board card comprising the artificial intelligence processor chip according to Clause A14.
Clause A16. A method of scheduling tasks using the scheduler according to any one of Clauses A1-A13, the method comprising:
using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
using the first lookup table circuit to perform:
in response to the task being read from the off-chip storage device onto the chip, updating the task from the valid state to an invalid state and recording the invalid state in a first lookup table; and
in response to the invalid state being recorded in the first lookup table, triggering the scheduling circuit to read a next task from the off-chip storage device onto the chip.
Clause A17. The method according to Clause A16, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the method comprises:
in response to triggering by the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
Clause A18. The method according to Clause A16, further comprising using the third lookup table circuit to perform:
when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, using a third lookup table to record and manage the inter-chip task stored on the chip.
Clause A19. The method according to Clause A18, further comprising using the polling circuit to perform:
receiving a task wake-up message for scheduling inter-chip tasks; and
polling, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
wherein the scheduling circuit is configured to schedule the specific task found by polling.
Clause A20. The method according to Clause A19, wherein the scheduling circuit is used to perform the following step:
in response to the polling circuit failing to locate the specific task, reading the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
Clause A21. The method according to Clause A19, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the method further comprises using the scheduling circuit to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
Clause A22. The method according to Clause A16, further comprising using the reordering buffer circuit to record tasks repeatedly executed by the on-chip task execution unit.
Clause A23. The method according to Clause A16, further comprising using the fourth lookup table circuit to record, in a fourth lookup table, tasks scheduled to the on-chip task execution unit.
Clause A24. The method according to Clause A23, wherein the method further comprises using the scheduling circuit to perform:
before scheduling a task to be executed to the on-chip task execution unit, interacting with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
Clause A25. The method according to Clause A23, wherein the method further comprises using the scheduling circuit to perform:
in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, triggering the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
Clause A26. The method according to any one of Clauses A23-A25, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
Clause A27. The method according to Clause A26, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
a flag bit indicating whether the task is valid;
a task initialization flag bit;
a wake-up flag bit indicating whether the task has been woken up;
a data flag bit indicating whether data exists in an inter-chip buffer; and
a space flag bit indicating whether storage space exists in the inter-chip buffer.
Clause A28. The method according to Clause A27, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
Clause A29. A computer-readable storage medium having stored thereon computer program instructions for scheduling tasks which, when executed by a processor, implement the method according to any one of Clauses A16-A28.
Although embodiments of the present disclosure are described above, they are merely examples provided to facilitate understanding of the present disclosure and are not intended to limit its scope or application scenarios. Any person skilled in the technical field of the present disclosure may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein; however, the scope of patent protection of the present disclosure shall remain as defined by the appended claims.

Claims (29)

  1. A scheduler for scheduling tasks, arranged on an artificial intelligence processor chip and connected to an off-chip storage device and an on-chip task execution unit, the scheduler comprising:
    a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
    a first lookup table circuit configured to:
    in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record the invalid state in a first lookup table; and
    in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read a next task from the off-chip storage device onto the chip.
  2. The scheduler according to claim 1, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the scheduling circuit is configured to:
    in response to triggering by the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
    trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  3. The scheduler according to claim 1, further comprising a third lookup table circuit configured to:
    when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, use a third lookup table to record and manage the inter-chip task stored on the chip.
  4. The scheduler according to claim 3, further comprising a polling circuit configured to:
    receive a task wake-up message for scheduling inter-chip tasks; and
    poll, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
    wherein the scheduling circuit is configured to schedule the specific task found by polling.
  5. The scheduler according to claim 4, wherein the scheduling circuit is further configured to:
    in response to the polling circuit failing to locate the specific task, read the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
  6. The scheduler according to claim 4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
  7. The scheduler according to claim 1, further comprising a reordering buffer circuit configured to record tasks repeatedly executed by the on-chip task execution unit.
  8. The scheduler according to claim 1, further comprising a fourth lookup table circuit configured to use a fourth lookup table to record tasks scheduled to the on-chip task execution unit.
  9. The scheduler according to claim 8, wherein the scheduling circuit is further configured to:
    before scheduling a task to be executed to the on-chip task execution unit, interact with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
  10. The scheduler according to claim 8, wherein the scheduling circuit is further configured to:
    in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, trigger the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
  11. The scheduler according to any one of claims 8-10, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  12. The scheduler according to claim 11, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
    a flag bit indicating whether the task is valid;
    a task initialization flag bit;
    a wake-up flag bit indicating whether the task has been woken up;
    a data flag bit indicating whether data exists in an inter-chip buffer; and
    a space flag bit indicating whether storage space exists in the inter-chip buffer.
  13. The scheduler according to claim 12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  14. An artificial intelligence processor chip, comprising:
    the scheduler according to any one of claims 1-13; and
    an on-chip task execution unit configured to interact with the scheduler so as to execute tasks issued by the scheduler.
  15. A board card comprising the artificial intelligence processor chip according to claim 14.
  16. A method of scheduling tasks using the scheduler according to any one of claims 1-13, the method comprising:
    using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
    using the first lookup table circuit to perform:
    in response to the task being read from the off-chip storage device onto the chip, updating the task from the valid state to an invalid state and recording the invalid state in a first lookup table; and
    in response to the invalid state being recorded in the first lookup table, triggering the scheduling circuit to read a next task from the off-chip storage device onto the chip.
  17. The method according to claim 16, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the method comprises using the scheduling circuit to perform:
    in response to triggering by the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
    triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  18. The method according to claim 16, further comprising using the third lookup table circuit to perform:
    when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, using a third lookup table to record and manage the inter-chip task stored on the chip.
  19. The method according to claim 18, further comprising using the polling circuit to perform:
    receiving a task wake-up message for scheduling inter-chip tasks; and
    polling, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
    wherein the method further uses the scheduling circuit to schedule the specific task found by polling.
  20. The method according to claim 19, wherein the scheduling circuit is used to perform:
    in response to the polling circuit failing to locate the specific task, reading the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
  21. The method according to claim 19, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the method further comprises using the scheduling circuit to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
  22. The method according to claim 16, further comprising using the reordering buffer circuit to record tasks repeatedly executed by the on-chip task execution unit.
  23. The method according to claim 16, further comprising using the fourth lookup table circuit to record, in a fourth lookup table, tasks scheduled to the on-chip task execution unit.
  24. The method according to claim 23, wherein the method further comprises using the scheduling circuit to perform:
    before scheduling a task to be executed to the on-chip task execution unit, interacting with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
  25. The method according to claim 23, wherein the method further comprises using the scheduling circuit to perform:
    in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, triggering the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
  26. The method according to any one of claims 23-25, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  27. The method according to claim 26, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
    a flag bit indicating whether the task is valid;
    a task initialization flag bit;
    a wake-up flag bit indicating whether the task has been woken up;
    a data flag bit indicating whether data exists in an inter-chip buffer; and
    a space flag bit indicating whether storage space exists in the inter-chip buffer.
  28. The method according to claim 27, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  29. A computer-readable storage medium having stored thereon computer program instructions for scheduling tasks which, when executed by a processor, implement the method according to any one of claims 16-28.
PCT/CN2023/083494 2022-08-30 2023-03-23 Method for scheduling tasks, and related product thereof WO2024045580A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211044067.6A CN117667328A (en) 2022-08-30 2022-08-30 Method for scheduling tasks and related products
CN202211044067.6 2022-08-30

Publications (1)

Publication Number Publication Date
WO2024045580A1 true WO2024045580A1 (en) 2024-03-07

Family

ID=90084987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083494 WO2024045580A1 (en) 2022-08-30 2023-03-23 Method for scheduling tasks, and related product thereof

Country Status (2)

Country Link
CN (1) CN117667328A (en)
WO (1) WO2024045580A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7028299B1 (en) * 2000-06-30 2006-04-11 Intel Corporation Task-based multiprocessing system
US20190258511A1 (en) * 2016-09-20 2019-08-22 Ramon Chips Ltd. Scheduling of tasks in a multiprocessor device
US20210073169A1 (en) * 2019-09-09 2021-03-11 Shanghai Denglin Technologies Co., Ltd. On-chip heterogeneous ai processor
CN114237717A (en) * 2021-12-31 2022-03-25 合肥工业大学 Multi-core heterogeneous processor on-chip temporary storage dynamic scheduling manager


Also Published As

Publication number Publication date
CN117667328A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
US10120728B2 (en) Graphical processing unit (GPU) implementing a plurality of virtual GPUs
CN110347635B (en) Heterogeneous multi-core microprocessor based on multilayer bus
US20190324792A1 (en) Task processor
US11880330B2 (en) Network-on-chip data processing method and device
US20230367722A1 (en) Data processing device and method, and related products
US7386642B2 (en) IO direct memory access system and method
CN114827048B (en) Dynamic configurable high-performance queue scheduling method, system, processor and protocol
WO2023076591A1 (en) Hardware management of direct memory access commands
CN109062857A (en) A kind of new type of messages controller and its communication means that can be communicated between realization of High Speed multiprocessor
WO2024045580A1 (en) Method for scheduling tasks, and related product thereof
US10884477B2 (en) Coordinating accesses of shared resources by clients in a computing device
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
WO2023016382A1 (en) Method for system on chip, and related product thereof
US6708259B1 (en) Programmable wake up of memory transfer controllers in a memory transfer engine
CN112948001A (en) Method for setting tensor hardware configuration, readable storage medium and device
WO2024046018A1 (en) Instruction control method, data caching method, and related products
WO2023236479A1 (en) Method for executing task scheduling and related products thereof
TWI823655B (en) Task processing system and task processing method applicable to intelligent processing unit
WO2024012280A1 (en) Method and device for task scheduling, board, and computer-readable storage medium
WO2023241478A1 (en) Artificial intelligence accelerator pipeline performance analysis method and apparatus
WO2023016383A1 (en) Method for cache memory and related products
CN111210011B (en) Data processing device and related product
WO2023231768A1 (en) Multi-core processor and related inter-core communication method
CN118210598A (en) Method for executing task and related product thereof
CN117908959A (en) Method for performing atomic operations and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23858622

Country of ref document: EP

Kind code of ref document: A1