WO2024045580A1 - Method for scheduling tasks, and related product thereof - Google Patents


Info

Publication number
WO2024045580A1
Authority
WO
WIPO (PCT)
Prior art keywords
chip
task
lookup table
tasks
circuit
Application number
PCT/CN2023/083494
Other languages
French (fr)
Chinese (zh)
Inventor
高健
刘少礼
郝勇峥
韩栋
Original Assignee
寒武纪(西安)集成电路有限公司
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2024045580A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates generally to the computer field. More specifically, the present application relates to a scheduler, an artificial intelligence processor chip, a board, a method and a computer-readable storage medium for scheduling tasks.
  • In a conventional CPU, data that may be reused is cached in the cache, thereby shortening the time it takes to access that data the next time, such as when executing a task.
  • To this end, this application provides a task caching and wake-up solution based on a lookup table. With this solution, the latency of large-scale task scheduling can be greatly reduced, while the hardware design is simplified and on-chip storage overhead is reduced. To this end, this application provides solutions in the following aspects.
  • In a first aspect, the present disclosure provides a scheduler for task scheduling, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit.
  • The scheduler includes: a scheduling circuit configured to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded on the off-chip storage device in a valid state;
  • and a first lookup table circuit configured to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record the invalid state in the first lookup table; and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • In a second aspect, the present disclosure provides an artificial intelligence processor chip, including: the scheduler according to the first aspect; and an on-chip task execution unit configured to interact with the scheduler to execute the tasks issued by the scheduler.
  • In a third aspect, the present disclosure provides a board including the artificial intelligence processor chip according to the second aspect.
  • In a fourth aspect, the present disclosure provides a method for scheduling tasks using the scheduler according to the first aspect. The method comprises: using the scheduling circuit to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record the invalid state in the first lookup table; and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
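The cooperation between the scheduling circuit and the first lookup table described in this aspect can be sketched in software as follows. This is an illustrative model only: the class and method names (`FirstLookupTable`, `SchedulingCircuit`, `read_next`, etc.) are assumptions for the sketch, not identifiers from the disclosure, and Python stands in for hardware behavior.

```python
VALID, INVALID = 1, 0

class FirstLookupTable:
    """On-chip table holding a one-bit valid flag per task slot."""
    def __init__(self, num_slots):
        self.flags = [INVALID] * num_slots

    def mark_valid(self, slot_id):
        self.flags[slot_id] = VALID

    def mark_invalid(self, slot_id):
        # Recording the invalid state is what triggers the next read.
        self.flags[slot_id] = INVALID

class SchedulingCircuit:
    def __init__(self, off_chip_tasks, lut):
        # off_chip_tasks: (slot_id, task) pairs recorded off-chip in a valid state
        self.off_chip_tasks = list(off_chip_tasks)
        self.lut = lut
        self.dispatched = []

    def read_next(self):
        """Read one task onto the chip and flip its flag valid -> invalid."""
        if not self.off_chip_tasks:
            return None
        slot, task = self.off_chip_tasks.pop(0)
        self.lut.mark_invalid(slot)   # record the invalid state in the first LUT
        self.dispatched.append(task)  # issue to the on-chip execution unit
        return task

lut = FirstLookupTable(num_slots=3)
for s in range(3):
    lut.mark_valid(s)
sched = SchedulingCircuit([(0, "t0"), (1, "t1"), (2, "t2")], lut)
while sched.read_next() is not None:  # each invalid record triggers the next read
    pass
print(sched.dispatched)  # ['t0', 't1', 't2']
```

In hardware the "trigger" is a signal rather than a loop, but the invariant is the same: every valid-to-invalid transition in the first lookup table initiates the fetch of the next off-chip task.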
  • In a fifth aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer program instructions for scheduling tasks. When the computer program instructions are executed by a processor, the method according to the fourth aspect is implemented.
  • the processing speed of task scheduling can be accelerated, thereby significantly reducing the delay of large-scale task scheduling.
  • the complexity of hardware design is also simplified and the cost of on-chip storage is reduced.
  • Furthermore, because the present disclosure uses a lookup table dedicated to inter-chip tasks, bus congestion and backpressure caused by burst transmission of multiple task-allocation messages are avoided, thereby enabling efficient inter-chip task scheduling.
  • FIG. 1 is a simplified block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure
  • FIG. 2 is a detailed structural block diagram schematically showing a scheduler according to an embodiment of the present disclosure
  • Figure 3 is a simplified flowchart schematically illustrating a method of scheduling tasks using a scheduler according to an embodiment of the present disclosure
  • Figure 4 is a schematic structural diagram schematically showing a board card according to an embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram schematically showing a combined processing device in a chip according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram schematically showing the internal structure of a computing device according to an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram schematically showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram schematically illustrating data writing operations between computing clusters according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once determined" or "in response to a determination" or "once [the described condition or event] is detected" or "in response to detection of [the described condition or event]".
  • To this end, the solution of this disclosure proposes a lookup-table-based task caching and wake-up mechanism.
  • Specifically, the solution of the present disclosure forms a lookup table (hereinafter referred to as the first lookup table) by storing on the chip a flag bit that indicates whether each task is valid, and determines, according to whether the flag bit is valid, whether the next corresponding task is to be scheduled.
  • Since the lookup table itself has no concept of order and can be considered an out-of-order structure, it can reflect the validity of the current task in a timely manner, thereby enabling faster scheduling of the corresponding task.
  • Additionally, the solution of the present disclosure introduces further lookup tables to implement task scheduling in different scenarios (such as communication tasks between artificial intelligence processor chips), thereby further improving scheduling efficiency, reducing on-chip storage overhead, and simplifying hardware design, which in turn reduces the latency of large-scale task scheduling.
  • FIG. 1 is a simplified block diagram schematically illustrating a task scheduler ("scheduler" for short) 100 according to an embodiment of the present disclosure.
  • In one embodiment, the task scheduler 100 of the present disclosure can be arranged on the artificial intelligence processor chip and connected between the off-chip storage device and the on-chip task execution unit, so as to schedule tasks located on the off-chip storage device onto the chip and issue them to the task execution unit for execution.
  • The task scheduler of the present disclosure may include a scheduling circuit 102 and a first lookup table circuit 104.
  • The scheduling circuit 102 may be configured to read a task from an off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device.
  • The off-chip storage device here may include off-chip dynamic random access memory (e.g., DDR memory) or a cache memory (such as an L3 cache).
  • the task execution unit here may be multiple intelligent processing units, or simplified versions thereof. Depending on the application, intelligent processing units can perform conventional calculations and/or classical algorithms for distributed cluster communication.
  • In some scenarios, a simplified version of the aforementioned intelligent processing unit may be called a microprocessing core, and each microprocessing core may have multiple (e.g., 8) task scheduling queues. Furthermore, the slot id of each task is unique and can be expressed as follows:
  • The combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in a lookup table, such as its address in the first lookup table, or in the second, third and/or fourth lookup tables discussed below.
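The queue-id/task-id combination above can be illustrated with a simple bit-packing scheme. The 6-bit/10-bit split is an assumption chosen to match the figures given elsewhere in the text (64 queues, 1024 tasks per queue, and a 16-bit slot id[15:0]); the actual field layout is not specified here.

```python
QUEUE_BITS, TASK_BITS = 6, 10  # 64 queues, 1024 tasks per queue (assumed widths)

def make_slot_id(queue_id, task_id):
    """Combine a queue identifier and a task identifier into a unique slot id."""
    assert 0 <= queue_id < (1 << QUEUE_BITS) and 0 <= task_id < (1 << TASK_BITS)
    return (queue_id << TASK_BITS) | task_id

def split_slot_id(slot):
    """Recover (queue_id, task_id) to address an entry in a lookup table."""
    return slot >> TASK_BITS, slot & ((1 << TASK_BITS) - 1)

print(hex(make_slot_id(63, 1023)))  # 0xffff, i.e. a 16-bit slot id[15:0]
```

Under these assumed widths, the largest slot id is exactly 0xFFFF, consistent with a 16-bit descriptor offset.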
  • The off-chip storage device stores the multiple tasks to be read onto the chip by the scheduling circuit, together with at least a second lookup table that records the valid status of the multiple tasks.
  • The first lookup table circuit 104, cooperating with the above-mentioned scheduling circuit, may be configured to, in response to a task being read from the off-chip storage device onto the chip, update the task from the valid state to the invalid state and record this in the first lookup table, and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip. It can be seen that by utilizing the status change of the task in the first lookup table (i.e., from valid to invalid), the scheduling circuit can efficiently read the next task from the off-chip storage device onto the chip.
  • In some embodiments, the scheduling circuit may be configured to, in response to the triggering by the first lookup table circuit, read one of the multiple tasks recorded in the second lookup table from the off-chip storage device onto the chip, and trigger the first lookup table circuit to update the valid status of the task read onto the chip to the invalid status and record it in the first lookup table.
  • The scheduler 100 of the present disclosure has been described above in conjunction with FIG. 1. It can be understood that by using the first lookup table to record the status changes of on-chip tasks, and triggering subsequent tasks to be sent from the off-chip storage device to the chip based on those status changes, the solution of the present disclosure advantageously ensures the timeliness and effectiveness of task scheduling.
  • Since the first lookup table has an out-of-order structural attribute and better reflects whether the current task is valid, the scheduler is also able to perform task scheduling with higher efficiency.
  • the on-chip task execution unit can execute issued tasks more efficiently to complete, for example, inter-chip communication tasks.
  • FIG. 2 is a detailed structural block diagram schematically showing the task scheduler 100 according to an embodiment of the present disclosure.
  • In addition to the task scheduler 100, the figure also shows a system-on-chip 200 including the task scheduler (which is simply referred to as "on-chip" in the context of this disclosure), and an on-chip task execution unit 204 disposed within the system-on-chip 200.
  • various types of tasks to be issued can be stored on the off-chip storage device 202, and these tasks can be applied for and created by users (such as programmers) through software instructions.
  • In one implementation scenario, multiple tasks created by software applications can be stored in the form of queues, and the software can issue tasks to the queues, where each queue can include tasks of the same type, for example, tasks executed by a single on-chip task execution unit or tasks executed by multiple on-chip task execution units.
  • Assuming each on-chip execution unit includes 8 microprocessing cores, 8 queues can be set for each microprocessing core, so that 64 task queues can be arranged on the off-chip storage device 202 (the 0th-7th queues corresponding to the 0th microprocessing core, the 8th-15th queues to the 1st microprocessing core, and so on, until the 56th-63rd queues correspond to the 7th microprocessing core), and a second lookup table is set up for each microprocessing core to store and maintain its eight queues.
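The queue-to-core arrangement just described (8 queues per core, 64 queues in total) reduces to a simple integer division, sketched below; the function name is illustrative only.

```python
QUEUES_PER_CORE = 8
NUM_CORES = 8

def core_for_queue(queue_id):
    """Map one of the 64 task queues to its microprocessing core:
    queues 0-7 -> core 0, queues 8-15 -> core 1, ..., queues 56-63 -> core 7."""
    assert 0 <= queue_id < QUEUES_PER_CORE * NUM_CORES
    return queue_id // QUEUES_PER_CORE
```

This mapping also explains why a per-core second lookup table covering exactly eight queues is sufficient.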
  • In practical applications, the aforementioned eight second lookup tables (one per microprocessing core) can be maintained and managed through the second lookup table circuit 206, so as to maintain and manage the tasks recorded in the lookup tables.
  • When a task is created, its status in the second lookup table may initially be set to valid by software, for example its valid status bit is set to "1", indicating that the task is valid on the off-chip storage device.
  • Each entry in the first lookup table maintained and managed by the first lookup table circuit 104 may have a status corresponding to a task in the second lookup table, for example a one-bit valid identifier such as "1". When a task is read onto the chip, the first lookup table circuit can modify the valid bit to an invalid flag (such as "0"). Then, based on the transition of the task status from valid to invalid, the first lookup table circuit can trigger the scheduling circuit to read a new task from the second lookup table circuit onto the chip.
  • In some embodiments, in order to realize the execution of inter-chip tasks between the artificial intelligence processor chip and another artificial intelligence processor chip, the present disclosure also proposes to set up a third lookup table circuit 108 in the scheduler, which can be configured to use a third lookup table to record and manage the inter-chip tasks stored on the chip.
  • the number of entries in the third lookup table may be the same as the number of slot ids mentioned above.
  • In one implementation scenario, the third lookup table can be implemented as 64 (addresses) × 5120 (data bits), where 64 corresponds to the 64 queues mentioned above, and each queue has 5120 bits.
  • Since each task in the above third lookup table occupies 5 bits, 1024 tasks can be recorded in each queue. Each task (i.e., table entry record) can include the following semantics: {valid, need initial, wakeup, have data, have space}, where "valid" is the flag bit indicating whether the task is valid, "need initial" is the task initialization flag bit, "wakeup" is the wake-up flag bit indicating whether the task has been awakened, "have data" is the data identification bit indicating whether there is data in the inter-chip buffer, and "have space" is the space identification bit indicating whether there is storage space in the inter-chip buffer.
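The 5-bit entry semantics can be illustrated with a small packing sketch. Which bit position holds which flag is an assumption made for illustration; the disclosure only fixes the set of five flags and the 5-bit width.

```python
# Assumed bit order, least-significant bit first.
FLAGS = ("valid", "need_initial", "wakeup", "have_data", "have_space")

def pack_entry(entry):
    """Pack the five flags into a 5-bit third-lookup-table entry."""
    bits = 0
    for i, name in enumerate(FLAGS):
        if entry.get(name, False):
            bits |= 1 << i
    return bits

def unpack_entry(bits):
    """Recover the flag dictionary from a 5-bit entry."""
    return {name: bool((bits >> i) & 1) for i, name in enumerate(FLAGS)}

# Capacity check against the figures above: a 64 x 5120-bit table at
# 5 bits per task holds 1024 tasks per queue.
assert 5120 // 5 == 1024
```

The capacity assertion ties the 5-bit entry width to the stated 5120 bits per queue and 1024 tasks per queue.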
  • It should be understood that the entry content and semantics in the third lookup table here are only exemplary and not restrictive; based on the teachings of the present disclosure, those skilled in the art can understand that the multiple lookup tables of the present disclosure (for example, the first and second lookup tables, as well as the fourth lookup table to be mentioned below) may have the same or similar entry content (that is, the "descriptor" of a task) and semantics as the third lookup table.
  • the flag bit can be set to "0", thereby indicating that the task has been scheduled once.
  • In one implementation scenario, slot id[15:0] can be used as the offset address of the descriptor in the L3 cache or DDR, and is also used to query the lookup tables (for example, the third lookup table).
  • the "slot id" can be used to address the third lookup table And the corresponding identification bit is set, thereby completing the update of the third lookup table.
  • The task wake-up message here may be, for example, a wake-up message sent by a first artificial intelligence processor chip to a second artificial intelligence processor chip, so as to instruct the second artificial intelligence processor chip to execute, based on the task wake-up message, a task stored on its chip.
  • The task may be a task that is repeatedly scheduled for execution, stored in the reorder buffer ("Reorder Buffer", referred to as "ROB") circuit 110 in the scheduler of the second artificial intelligence processor chip.
  • the solution of the present disclosure proposes to provide the above-mentioned reordering buffer circuit 110 in the scheduler 100, which can be configured to record tasks repeatedly executed by the on-chip task execution unit.
  • During operation, the scheduler can sequentially send the tasks of each task queue to the microprocessing core for execution in the order recorded in the first lookup table, and at the same time register them in the storage space of the reordering buffer circuit.
  • In this process, the "slot id" of each task will also be registered in the reordering buffer circuit.
  • tasks newly acquired by the scheduler from the off-chip storage device can still be scheduled by the scheduler to the idle microprocessor core for execution.
  • In some embodiments, the present disclosure proposes to provide a polling circuit 112 in the scheduler 100, which can receive the task wake-up message used for scheduling inter-chip tasks as described above. Then, the polling circuit may poll the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to identify the specific task associated with the task wake-up message. In response to the specific task being polled, the scheduling circuit may be configured to schedule the polled specific task for execution by the on-chip task execution unit 204.
  • the scheduler can directly wake up the task and send it to the on-chip task execution unit for execution.
  • Otherwise, the scheduler can go to the off-chip storage device (such as the L3 cache or DDR) where the task is saved, retrieve the task (that is, schedule it onto the chip), and send it to the on-chip task execution unit, such as a microprocessing core, for execution.
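The two wake-up paths above (direct dispatch from the reorder buffer versus fetching from off-chip storage first) can be sketched as follows. All names here (`wake_task`, the dictionary-based ROB and storage) are illustrative stand-ins for the hardware structures, not identifiers from the disclosure.

```python
def wake_task(slot_id, rob, off_chip_storage, execute):
    """Wake a task: dispatch it directly if it is registered in the ROB,
    otherwise fetch it from off-chip storage (e.g. L3 cache or DDR) first."""
    if slot_id in rob:                    # repeatedly scheduled task kept on-chip
        execute(rob[slot_id])
        return "rob"
    task = off_chip_storage[slot_id]      # schedule the task onto the chip
    rob[slot_id] = task                   # register it for later re-wakeups
    execute(task)
    return "off_chip"

rob = {}
off_chip = {7: "inter-chip-task"}
executed = []
first = wake_task(7, rob, off_chip, executed.append)
second = wake_task(7, rob, off_chip, executed.append)
print(first, second)  # off_chip rob
```

The second wake-up hits the ROB and skips the off-chip fetch, which is the latency saving the reordering buffer provides for repeatedly executed tasks.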
  • In some embodiments, the present disclosure also proposes to provide a fourth lookup table circuit 114 in the scheduler, which is configured to use a fourth lookup table to record the tasks scheduled to the on-chip task execution unit.
  • During operation, the scheduling circuit may be configured to interact with the fourth lookup table circuit, before scheduling a task to the on-chip task execution unit, so as to query and confirm that the tasks recorded in the fourth lookup table are different from the task currently to be scheduled to the on-chip task execution unit.
  • In addition, the scheduling circuit may be further configured to, in response to receiving from the on-chip task execution unit a notification that a task has completed or suspended execution, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table. With the help of querying the fourth lookup table, it can be guaranteed that the task to be sent to the on-chip task execution unit is different from the multiple tasks being executed in parallel (for example, the "slot id" of each task is different).
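The fourth-lookup-table check can be modeled as a set of in-flight slot ids, as in the sketch below; the class and method names are assumptions for illustration.

```python
class FourthLookupTable:
    """Records slot ids of tasks already dispatched, to block duplicates."""
    def __init__(self):
        self.in_flight = set()

    def try_dispatch(self, slot_id):
        """Dispatch only if the same task is not already being executed."""
        if slot_id in self.in_flight:
            return False            # duplicate of a task in flight: refuse
        self.in_flight.add(slot_id)
        return True

    def on_complete_or_suspend(self, slot_id):
        # Removal is triggered when the execution unit reports that the
        # task has completed or suspended execution.
        self.in_flight.discard(slot_id)
```

Because membership is keyed on the slot id, all tasks executing in parallel necessarily have distinct slot ids, which is exactly the guarantee stated above.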
  • the scheduling scheme of the present disclosure is further elaborated above with reference to FIG. 2 . Based on the above description, those skilled in the art can understand that the scheduling scheme of the present disclosure can significantly reduce the delay of task scheduling and simplify the hardware design by means of the arrangement of one or more lookup tables, especially the arrangement of on-chip lookup tables. Furthermore, by using lookup tables, on-chip resources can be effectively utilized for task scheduling, avoiding repetitive and redundant scheduling of tasks, thereby improving scheduling efficiency. Through the solution of the present disclosure, the smooth and efficient execution of inter-chip tasks is also promoted, thereby saving task execution overhead.
  • FIG. 3 is a simplified flowchart schematically illustrating a method 300 for scheduling tasks using a scheduler according to an embodiment of the present disclosure. It can be understood that the method 300 may be executed by the scheduler detailed above in conjunction with FIGS. 1 and 2 .
  • In the method 300, a scheduling circuit is first used to read a task from the off-chip storage device onto the chip, so that the task is scheduled to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device.
  • the first lookup table circuit is used to update the task from the valid state to the invalid state and record it in the first lookup table.
  • Thereafter, in response to the invalid state being recorded in the first lookup table, the first lookup table circuit may trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • In one embodiment, the above-mentioned off-chip storage device stores the multiple tasks to be read onto the chip by the scheduling circuit and at least a second lookup table that records the valid status of the multiple tasks (implemented, for example, with the help of the second lookup table circuit 206 shown in FIG. 2).
  • Based on this, the method 300 may further include: in response to the triggering of the first lookup table circuit, using the scheduling circuit to read one of the tasks recorded in the second lookup table from the off-chip storage device onto the chip, and using the scheduling circuit to trigger the first lookup table circuit to update the valid status of the task read onto the chip to the invalid status and record it in the first lookup table.
  • In some embodiments, the method further includes using a third lookup table circuit (such as the third lookup table circuit 108 shown in FIG. 2) to record and manage, with the third lookup table, the inter-chip tasks stored on the chip for execution between the artificial intelligence processor chip and another artificial intelligence processor chip.
  • Next, the method further includes utilizing the polling circuit 112 to receive a task wake-up message for scheduling inter-chip tasks, and polling the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to identify the specific task associated with the task wake-up message. Based on this, the method also uses the scheduling circuit to schedule the polled specific task for execution.
  • In one implementation scenario, the method further includes using the scheduler to read the specific task associated with the task wake-up message from the off-chip storage device onto the chip. Further, in order to save and maintain tasks repeatedly executed by the on-chip task execution unit, the method further includes using a reordering buffer circuit (such as the reordering buffer circuit 110 shown in FIG. 2) to record the repeatedly executed tasks.
  • In other embodiments, the method further includes using a fourth lookup table circuit (such as the fourth lookup table circuit 114 shown in FIG. 2) to record, with the fourth lookup table, the tasks scheduled to the on-chip task execution unit. Further, the method includes using the scheduling circuit to interact with the fourth lookup table circuit before scheduling a task to the on-chip task execution unit, so as to query and confirm that the tasks recorded in the fourth lookup table are different from the task currently to be scheduled to the on-chip task execution unit. Additionally, the method further includes using the scheduling circuit to, in response to receiving from the on-chip task execution unit a notification that a task has completed or suspended execution, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • FIG. 4 shows a schematic structural diagram of a board 400 according to an embodiment of the present disclosure.
  • As shown in the figure, the board 400 includes a chip 401, which is a system on chip (SoC) integrated with one or more combined processing devices.
  • The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, so as to meet the intelligent processing needs of complex scenarios in computer vision, speech, natural language processing, data mining and other fields.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a significant feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • the board 400 of this embodiment is suitable for use in cloud intelligence applications.
  • This is referred to as chip-to-chip communication, or "inter-chip communication", in the context of this disclosure.
  • the chip 401 is connected to an external device 403 through an external interface device 402 .
  • the external device 403 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or wifi interface.
  • the data to be processed can be transferred to the chip 401 from the external device 403 through the external interface device 402 .
  • the calculation results of the chip 401 can be transmitted back to the external device 403 via the external interface device 402 .
  • The external interface device 402 may have different interface forms, such as a PCIe interface.
  • The board 400 also includes a storage device 404 for storing data, which includes one or more storage units 405.
  • The storage device 404 is connected with the control device 406 and the chip 401 through a bus and transmits data to them.
  • The control device 406 in the board 400 is configured to control the status of the chip 401.
  • the control device 406 may include a microcontroller unit (Micro Controller Unit, MCU).
  • FIG. 5 is a structural diagram showing the combined processing device in the chip 401 of this embodiment.
  • the combined processing device 500 includes a computing device 501, an interface device 502, a processing device 503 and a DRAM 504.
  • The computing device 501 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning calculations; it can interact with the processing device 503 through the interface device 502 to jointly complete user-specified operations.
  • the interface device 502 is used to transmit data and control instructions between the computing device 501 and the processing device 503 .
  • The computing device 501 can obtain input data from the processing device 503 via the interface device 502 and write it into an on-chip storage device of the computing device 501.
  • the computing device 501 can obtain the control instructions from the processing device 503 via the interface device 502 and write them into the control cache on-chip of the computing device 501 .
  • the interface device 502 may also read the data in the storage device of the computing device 501 and transmit it to the processing device 503 .
  • the processing device 503 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 501, and the like.
  • The processing device 503 may be one or more types of general and/or special purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 501 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • When the computing device 501 and the processing device 503 are considered together, they form a heterogeneous multi-core structure.
  • The storage device 504 is used to store data to be processed. It can be a DRAM, specifically a DDR memory, whose size is usually 16G or larger, and it is used to save data of the computing device 501 and/or the processing device 503.
  • the storage device here can be regarded as an off-chip storage device of the aforementioned scheduling scheme.
  • FIG. 6 shows a schematic diagram of the internal structure of the computing device 501.
  • the computing device 501 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 501 in the figure adopts a multi-core hierarchical structure design.
  • At this level, the computing device 501 serves as a system on a chip and includes multiple computing clusters, and each computing cluster includes multiple processor cores.
  • In other words, the computing device 501 is organized in a system-on-chip / computing-cluster / processor-core hierarchy.
  • the computing device 501 includes an external storage controller 601 , a peripheral communication module 602 , an on-chip interconnection module 603 , a synchronization module 604 and multiple computing clusters 605 .
  • the scheduling circuit in the context of the present disclosure may also be included in the computing device 501 to schedule tasks of the external storage device onto the chip for execution by the computing cluster 605 .
  • the peripheral communication module 602 is used to receive control signals from the processing device 503 through the interface device 502 and start the computing device 501 to perform tasks.
  • the on-chip interconnection module 603 connects the external storage controller 601, the peripheral communication module 602 and multiple computing clusters 605 to transmit data and control signals between various modules.
  • the synchronization module 604 is a global synchronization barrier controller (GBC), used to coordinate the work progress of each computing cluster and ensure information synchronization.
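As a software analogy, the GBC's role of coordinating the work progress of each computing cluster resembles a conventional barrier primitive. The sketch below uses Python threads purely to illustrate that role; it is not a model of the hardware implementation, and the names are invented:

```python
import threading

NUM_CLUSTERS = 4
barrier = threading.Barrier(NUM_CLUSTERS)  # stands in for the GBC
results = []

def cluster_work(cluster_id):
    results.append(cluster_id)  # each "cluster" produces a partial result
    barrier.wait()              # then waits until every cluster has arrived

threads = [threading.Thread(target=cluster_work, args=(i,))
           for i in range(NUM_CLUSTERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# after the barrier, all partial results are visible to every cluster
```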
  • the multiple computing clusters 605 are the computing cores of the computing device 501. Four are shown as an example in the figure; with the development of hardware, the computing device 501 of the present disclosure may also include 8, 16, 64 or even more computing clusters 605. The computing clusters 605 are used to efficiently execute deep learning algorithms.
  • each computing cluster 605 includes multiple processor cores (IPU core) 606 and a storage core (MEM core) 607.
  • processor cores 606 are exemplarily shown in the figure, and the present disclosure does not limit their number. The internal architecture of a processor core is shown in FIG. 7. Each processor core 606 includes three major modules: a control module 71, an operation module 72 and a storage module 73.
  • the control module 71 is used to coordinate and control the work of the operation module 72 and the storage module 73 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 711 and an instruction decode unit (IDU) 712.
  • the instruction fetch unit 711 is used to obtain instructions from the processing device 503, and the instruction decode unit 712 decodes the obtained instructions and sends the decoding results to the operation module 72 and the storage module 73 as control information.
  • the operation module 72 includes a vector operation unit 721 and a matrix operation unit 722.
  • the vector operation unit 721 is used to execute vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation;
  • the matrix operation unit 722 is responsible for the core calculations of deep learning algorithms, namely matrix multiplication and convolution.
  • the storage module 73 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 731, a weight storage unit (weight RAM, WRAM) 732, an input/output direct memory access module (input/output direct memory access, IODMA) 733, and a move direct memory access module (move direct memory access, MVDMA) 734.
  • NRAM 731 is used to store the input data, output data and intermediate results calculated by the processor core 606; WRAM 732 is used to store the weights of the deep learning network; IODMA 733 controls memory access between NRAM 731/WRAM 732 and DRAM 504 through the broadcast bus 609; and MVDMA 734 is used to control memory access between NRAM 731/WRAM 732 and SRAM 608.
  • NRAM and WRAM may be two storage areas formed by dividing the same memory in the logical storage space, or they may be two independent memories; this is not specifically limited here.
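The case where NRAM and WRAM are carved out of one physical memory in the logical storage space can be illustrated with a small Python sketch. The sizes below are hypothetical; the disclosure does not fix them:

```python
TOTAL_BYTES = 1024   # hypothetical size of the shared physical memory
NRAM_BYTES = 640     # hypothetical split point; neither value is given by the disclosure

memory = bytearray(TOTAL_BYTES)
nram = memoryview(memory)[:NRAM_BYTES]   # logical region for neuron data
wram = memoryview(memory)[NRAM_BYTES:]   # logical region for weight data

nram[0] = 7   # both views write into disjoint parts of the same backing memory
wram[0] = 9
```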
  • the storage core 607 is mainly used for storage and communication, that is, storage of data shared between the processor cores 606 or of intermediate results, as well as communication between the computing cluster 605 and the DRAM 504, communication among the computing clusters 605, communication among the processor cores 606, etc.
  • the storage core 607 has scalar operation capabilities to perform scalar operations.
  • the storage core 607 includes a shared memory unit (SRAM) 608, a broadcast bus 609, a computing cluster direct memory access module (cluster direct memory access, CDMA) 610, and a global direct memory access module (global direct memory access, GDMA) 611.
  • SRAM 608 plays the role of a high-performance data transfer station.
  • data reused between different processor cores 606 in the same computing cluster 605 does not need to be fetched from the DRAM 504 by each processor core 606 separately, but is relayed among the processor cores 606 through the SRAM 608.
  • the storage core 607 only needs to quickly distribute the reused data from the SRAM 608 to the multiple processor cores 606, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
  • the broadcast bus 609, CDMA 610 and GDMA 611 are respectively used to perform communication between processor cores 606, communication between computing clusters 605 and data transmission between the computing cluster 605 and the DRAM 504. They will be explained below.
  • the broadcast bus 609 is used to complete high-speed communication between the processor cores 606 in the computing cluster 605.
  • the broadcast bus 609 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission
  • multicast is a communication method that transmits a piece of data from SRAM 608 to several specific processor cores 606;
  • broadcast is a communication method that transmits copies of a piece of data from SRAM 608 to all processor cores 606, and is a special case of multicast.
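The relationship between the three inter-core communication modes can be sketched as follows. This is a hypothetical Python model with invented names; the real transfers are performed by the broadcast bus hardware:

```python
NUM_CORES = 4  # assumed number of processor cores in the cluster

def multicast(sram_data, target_cores):
    """Deliver a copy of the SRAM data to the selected processor cores."""
    cores = [None] * NUM_CORES
    for core_id in target_cores:
        cores[core_id] = sram_data  # each targeted core receives its own copy
    return cores

def unicast(sram_data, target_core):
    """Point-to-point delivery: multicast with a single target."""
    return multicast(sram_data, [target_core])

def broadcast(sram_data):
    """Broadcast is the special case of multicast that targets every core."""
    return multicast(sram_data, range(NUM_CORES))
```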
  • CDMA 610 is used to control memory access of SRAM 608 between different computing clusters 605 within the same computing device 501.
  • Figure 8 shows a schematic diagram when one processor core wants to write data to the processor core of another computing cluster to illustrate the working principle of CDMA 610.
  • the same computing device includes multiple computing clusters.
  • computing cluster 0 and computing cluster 1 are shown in the figure.
  • Computing cluster 0 and computing cluster 1 respectively include multiple processor cores.
  • computing cluster 0 in the figure only displays processor core 0, and computing cluster 1 only displays processor core 1.
  • Processor core 0 wants to write data to processor core 1.
  • processor core 0 sends a unicast write request to write data to the local SRAM 0.
  • CDMA 0 serves as the master (master) end
  • CDMA 1 serves as the slave (slave) end.
  • the master end pushes the write request to the slave end, that is, the master end sends the write address AW and the write data W to transfer the data to the SRAM 1 of computing cluster 1.
  • the slave end sends a write response B in response.
  • the processor core 1 of computing cluster 1 sends a unicast read request to read the data out of SRAM 1.
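The AW/W/B handshake described above can be sketched in software. This is a toy Python model with invented class and method names; the actual CDMA is a hardware circuit, and the sketch only mirrors the order of the steps:

```python
class Sram:
    def __init__(self):
        self.mem = {}

class Cdma:
    """Toy model of the master/slave write handshake (AW, W, B)."""

    def __init__(self, local_sram):
        self.local_sram = local_sram

    def push_write(self, slave, addr, data):
        # master end sends the write address (AW) and write data (W)
        response = slave.accept_write(addr, data)
        # slave end answers with a write response (B)
        return response == "B"

    def accept_write(self, addr, data):
        self.local_sram.mem[addr] = data
        return "B"

# processor core 0 first writes the data into its local SRAM 0 ...
sram0, sram1 = Sram(), Sram()
cdma0, cdma1 = Cdma(sram0), Cdma(sram1)
sram0.mem[0x10] = "payload"
# ... then CDMA 0 (master) pushes the write to CDMA 1 (slave) ...
ok = cdma0.push_write(cdma1, 0x10, sram0.mem[0x10])
# ... and processor core 1 reads the data out of SRAM 1
readback = sram1.mem[0x10]
```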
  • the GDMA 611 cooperates with the external memory controller 601 to control memory access from the SRAM 608 of the computing cluster 605 to the DRAM 504, or to read data from the DRAM 504 to the SRAM 608.
  • the communication between the DRAM 504 and the NRAM 731 or the WRAM 732 can be realized through two channels.
  • the first channel is to directly connect DRAM 504 and NRAM 731 or WRAM 732 through IODMA 733;
  • the second channel is to first transmit data between DRAM 504 and SRAM 608 through GDMA 611, and then transmit data between SRAM 608 and NRAM 731 or WRAM 732 through MVDMA 734.
  • the bandwidth of the second channel is much greater than that of the first channel, so communication between DRAM 504 and NRAM 731 or WRAM 732 may be more efficient through the second channel.
  • the embodiments of the present disclosure can select a data transmission channel according to their own hardware conditions.
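The channel selection above reduces to comparing the bandwidths of the two paths. A minimal sketch, with placeholder bandwidth values supplied by the caller:

```python
def pick_channel(iodma_bw, gdma_mvdma_bw):
    """Select the data transmission channel by comparing bandwidths.

    first channel:  DRAM <-> NRAM/WRAM directly through IODMA
    second channel: DRAM <-> SRAM through GDMA, then SRAM <-> NRAM/WRAM
                    through MVDMA
    Bandwidth values are placeholders, not figures from the disclosure.
    """
    return "second" if gdma_mvdma_bw > iodma_bw else "first"
```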
  • the functionality of GDMA 611 and the functionality of IODMA 733 may be integrated in the same component.
  • this disclosure treats GDMA 611 and IODMA 733 as different components.
  • the functions of GDMA 611, IODMA 733, CDMA 610, and MVDMA 734 can also be implemented by the same component.
  • as long as the functions implemented and the technical effects achieved are similar to those of this disclosure, they all fall within the scope of this disclosure.
  • this application also discloses a device, which includes a processor and a memory.
  • the memory may store program instructions for scheduling tasks which, when executed by the processor, implement the scheduling operation steps described in this application in conjunction with FIGS. 1-3.
  • the present application also discloses a computer-readable storage medium or computer program product, on which computer programs/instructions for task scheduling are stored, which, when executed, implement the scheduling operation steps described in conjunction with FIGS. 1-3.
  • the equipment or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, household appliances, and/or medical equipment.
  • the means of transportation include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; and the medical equipment includes nuclear magnetic resonance machines, B-ultrasound machines and/or electrocardiographs.
  • the equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical and other fields.
  • the equipment or device of the present disclosure can also be used in cloud, edge, terminal and other application scenarios related to artificial intelligence, big data and/or cloud computing.
  • equipment or devices with high power consumption according to the solution of the present disclosure can be applied to cloud devices (such as cloud servers), while equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as a smartphone or a camera).
  • the hardware information of the cloud device and that of the terminal device and/or edge device are compatible with each other, so that, based on the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling and collaborative work of end-cloud integration or cloud-edge-end integration.
  • although this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, those skilled in the art can understand that the solutions of this disclosure are not limited by the order of the described actions. Therefore, based on the disclosure or teachings herein, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Furthermore, the embodiments described in the present disclosure can be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for implementing one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in this disclosure have different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present disclosure, reference may be made to the relevant descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units.
  • the aforementioned components or units may be co-located or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
  • the above integrated units can be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, which can include a number of instructions to cause a computer device (such as a personal computer, server or network equipment, etc.) to perform some or all steps of the method described in the embodiments of the present disclosure.
  • the aforementioned memory may include but is not limited to a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or various media such as optical discs that can store program code.
  • the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein can be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, etc.
  • Clause A1 A scheduler for scheduling tasks, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit.
  • the scheduler includes:
  • a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device;
  • a first lookup table circuit configured to:
  • in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • Clause A2 The scheduler according to Clause A1, wherein the off-chip storage device stores multiple tasks to be read by the scheduling circuit onto the chip and a second lookup table that records at least the valid state of the multiple tasks, and
  • the scheduling circuit is configured to:
  • in response to the triggering of the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
  • trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
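The read, invalidate, read-next loop of clauses A1 and A2 can be sketched as follows. This is a hypothetical software model; names such as `Scheduler` and `second_lut` are invented for illustration, and in the disclosure the scheduler and both lookup tables are hardware circuits:

```python
VALID, INVALID = 1, 0

class Scheduler:
    """Toy software model of the read -> invalidate -> read-next loop."""

    def __init__(self, off_chip_tasks):
        self.off_chip = dict(off_chip_tasks)
        # second lookup table: off-chip record of each task's valid state
        self.second_lut = {tid: VALID for tid in self.off_chip}
        # first lookup table: on-chip record of tasks already read
        self.first_lut = {}
        self.on_chip = []

    def run(self):
        # while the off-chip table still records a valid task, keep reading
        while any(state == VALID for state in self.second_lut.values()):
            tid = next(t for t, s in self.second_lut.items() if s == VALID)
            self.on_chip.append(self.off_chip[tid])  # read the task onto the chip
            self.second_lut[tid] = INVALID           # valid -> invalid off-chip
            self.first_lut[tid] = INVALID            # record in the first lookup table,
            # which triggers reading the next task (the loop iterates again)
        return self.on_chip

sched = Scheduler({0: "task-a", 1: "task-b", 2: "task-c"})
fetched = sched.run()
```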
  • a third lookup table is used to record and manage the inter-chip tasks stored on the chip.
  • the scheduling circuit is configured to schedule the polled specific task.
  • Clause A6 The scheduler of Clause A4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task to be executed by the on-chip task execution unit of the artificial intelligence processor chip.
  • Clause A8 The scheduler according to Clause A1, further comprising a fourth lookup table circuit configured to use the fourth lookup table to record tasks scheduled to the on-chip task execution unit.
  • in response to receiving, from the on-chip task execution unit, an indication that execution of a task is completed or suspended, triggering the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • Clause A11 The scheduler according to any one of Clause A8 to Clause A10, wherein a plurality of tasks are stored, in the form of one or more queues, in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the queue identifier of each queue is combined with the task identifier of each task to indicate the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
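The addressing scheme of clause A11, combining a queue identifier with a task identifier, can be illustrated as a simple index computation. The queue depth below is an assumed value not given by the disclosure:

```python
QUEUE_DEPTH = 64  # assumed per-queue capacity; the disclosure does not specify it

def lut_address(queue_id, task_id):
    """Combine a queue identifier and a task identifier into a flat
    lookup-table address, as clause A11 combines the two to locate a task."""
    if not 0 <= task_id < QUEUE_DEPTH:
        raise ValueError("task_id outside the queue")
    return queue_id * QUEUE_DEPTH + task_id
```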
  • the wake-up flag bit indicates whether the task is awakened
  • Clause A13 The scheduler according to Clause A12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  • An artificial intelligence processor chip including:
  • An on-chip task execution unit is configured to interact with the scheduler in order to execute tasks issued by the scheduler.
  • Clause A16 A method of scheduling tasks using a scheduler according to any of Clauses A1-A13, the method comprising:
  • using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
  • Clause A17 The method according to Clause A16, wherein the off-chip storage device stores multiple tasks to be read by the scheduling circuit onto the chip and a second lookup table that records at least the valid state of the multiple tasks, and the method comprises: in response to the triggering of the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
  • triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  • a third lookup table is used to record and manage the inter-chip tasks stored on the chip.
  • Clause A19 The method of Clause A18, further comprising using the polling circuit to perform:
  • the scheduling circuit is configured to schedule the polled specific task.
  • Clause A22 The method of Clause A16, further comprising using the reordering buffer circuit to record tasks that are repeatedly executed by the on-chip task execution unit.
  • Clause A23 The method of Clause A16, further comprising using the fourth lookup table circuit to record, using the fourth lookup table, tasks scheduled to the on-chip task execution unit.
  • Clause A24 The method of Clause A23, wherein the method further comprises using the scheduling circuit to perform:
  • Clause A25 The method of Clause A23, wherein the method further comprises using the scheduling circuit to perform:
  • in response to receiving, from the on-chip task execution unit, an indication that execution of a task is completed or suspended, triggering the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
  • Clause A26 The method according to any one of Clauses A23-A25, wherein a plurality of tasks are stored, in the form of one or more queues, in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  • Clause A27 The method of Clause A26, wherein the entry records in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table include one or more of the following:
  • the wake-up flag bit indicates whether the task is awakened
  • Clause A29 A computer-readable storage medium having computer program instructions for scheduling tasks stored thereon, which, when executed by a processor, cause the method according to any of Clauses A16-A28 to be implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Stored Programmes (AREA)

Abstract

The present disclosure relates to a method for scheduling tasks, and a related product thereof. The related product comprises a device and a computer-readable storage medium. The device may be comprised in a computing processing apparatus of a combined processing apparatus, wherein the computing processing apparatus may comprise one or more data processing apparatuses. The combined processing apparatus may further comprise an interface apparatus and other processing apparatuses. The computing processing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further comprise a storage apparatus, wherein the storage apparatus is respectively connected to the device and the other processing apparatuses, and is used for storing data of the device and the other processing apparatuses. By means of the solution of the present disclosure, the scheduling efficiency can be improved, and the on-chip storage overheads can be reduced. FIG. 5

Description

Method for scheduling tasks and related products
Cross-reference to related applications
This application claims priority to the Chinese patent application filed on August 30, 2022, with application number 202211044067.6, titled "Method for Scheduling Tasks and Related Products".
Technical field
This application relates generally to the computer field. More specifically, this application relates to a scheduler, an artificial intelligence processor chip, a board card, a method and a computer-readable storage medium for scheduling tasks.
Background
In order to solve the problem of excessive overhead when accessing off-chip memory (such as dynamic random access memory, DRAM), a traditional central processing unit ("CPU") generally uses a cache to exploit the temporal and spatial locality of data: data that may be reused is kept in the cache, thereby shortening the time consumed the next time that data is accessed, for example when executing a task. However, some other specialized systems usually cache data by means of buffers ("buffers") or queues ("queues").
As far as task scheduling is concerned, a common approach is to arrange and schedule tasks using a queue structure. The advantage of this structure is that the order of tasks is naturally guaranteed by the queue itself, which is very friendly to software programming interfaces and convenient for programming at the software level. However, the disadvantage of scheduling through queues is poor scheduling flexibility: a subsequent task cannot "overtake" the execution of a preceding task. In addition, on-chip cache resources are limited, and how to use these limited resources to achieve low-latency, high-throughput scheduling has become a technical problem that urgently needs to be solved.
Summary of the invention
In view of the technical problems mentioned in the background above, this application provides a task caching and wake-up solution based on a lookup table. Based on the solution of this application, the latency of large-scale task scheduling can be greatly reduced, while the hardware design is simplified and the on-chip storage overhead is reduced. To this end, this application provides solutions in the following aspects.
In a first aspect, the present disclosure provides a scheduler for task scheduling, which is arranged on an artificial intelligence processor chip and connects an off-chip storage device and an on-chip task execution unit. The scheduler includes: a scheduling circuit configured to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and a first lookup table circuit configured to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
In a second aspect, the present disclosure provides an artificial intelligence processor chip, including: the scheduler according to the first aspect; and an on-chip task execution unit configured to interact with the scheduler in order to execute tasks issued by the scheduler.
In a third aspect, the present disclosure provides a board card including the artificial intelligence processor chip according to the second aspect.
In a fourth aspect, the present disclosure provides a method of scheduling tasks using the scheduler according to the first aspect, the method including: using the scheduling circuit to read a task from the off-chip storage device onto the chip, so as to schedule the task to be executed by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and using the first lookup table circuit to: in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record it in a first lookup table; and in response to recording the invalid state in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip.
In a fifth aspect, the present disclosure provides a computer-readable storage medium on which computer program instructions for scheduling tasks are stored; when the computer program instructions are executed by a processor, the method according to the fourth aspect is implemented.
Using the above lookup-table-based solution of the present disclosure, in particular the use of the first lookup table stored on the chip, the processing speed of task scheduling can be accelerated, thereby greatly reducing the latency of large-scale task scheduling. In addition, using the first lookup table also simplifies the complexity of the hardware design and reduces the on-chip storage overhead. In some embodiments, when applied to inter-chip tasks (for example, communication tasks between artificial intelligence processor chips), the present disclosure uses a lookup table dedicated to inter-chip tasks, thereby avoiding bus congestion and backpressure when multiple task wake-up messages are transmitted in bursts, and thus achieving effective inter-chip task scheduling.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and like or corresponding reference numerals designate like or corresponding parts, wherein:
FIG. 1 is a simplified block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure;
FIG. 2 is a detailed structural block diagram schematically illustrating a scheduler according to an embodiment of the present disclosure;
FIG. 3 is a simplified flowchart schematically illustrating a method of scheduling tasks using a scheduler according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram schematically illustrating a board card according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram schematically illustrating a combined processing device in a chip according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram schematically illustrating the internal structure of a computing device according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram schematically illustrating the internal structure of a processor core according to an embodiment of the present disclosure; and
FIG. 8 is a schematic diagram schematically illustrating a data writing operation between computing clusters according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprising" and "including" used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the description and claims of the present disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
As mentioned above, in order to achieve efficient task scheduling and execution, the solution of the present disclosure proposes a lookup-table-based task caching and wake-up mechanism. In particular, the solution of the present disclosure stores on-chip a flag bit indicating whether each task is valid, thereby forming a lookup table (referred to below as the first lookup table), and determines, according to whether the flag bit is valid, whether to read the next corresponding task from the off-chip storage device onto the chip. Since the lookup table itself has no notion of order and can be regarded as an out-of-order structure, it reflects the validity of the current task in a timely manner, enabling faster scheduling of the corresponding task. In some embodiments, the solution of the present disclosure introduces additional lookup tables to implement task scheduling in different scenarios (for example, communication tasks between artificial intelligence processor chips), thereby further improving scheduling efficiency, reducing on-chip storage overhead, simplifying hardware design, and substantially reducing task scheduling latency.
Specific implementations of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 is a simplified block diagram schematically illustrating a task scheduler (referred to below simply as the "scheduler") 100 according to an embodiment of the present disclosure. As mentioned above, the task scheduler 100 of the present disclosure may be arranged on an artificial intelligence processor chip and connected between an off-chip storage device and an on-chip task execution unit, in order to schedule tasks located on the external storage device onto the chip and issue them to the task execution unit for execution.
As shown in FIG. 1, the task scheduler of the present disclosure may include a scheduling circuit 100 and a first lookup table circuit 104. As described above, the scheduling circuit 100 may be configured to read a task from the off-chip storage device onto the chip, so as to schedule the task for execution by the on-chip task execution unit, where the task is recorded on the off-chip storage device in a valid state. In one implementation scenario, the off-chip storage device here may include an off-chip dynamic random access memory (DDR) or a cache memory (such as an L3 cache). In another implementation scenario, the task execution unit here may be a plurality of intelligent processing units, or simplified versions thereof. Depending on the application, the intelligent processing units may perform conventional computations and/or classical algorithms for distributed cluster communication.
In the context of the present disclosure, the simplified version of the aforementioned intelligent processing unit may be referred to as a microprocessing core, and each microprocessing core may have multiple (for example, 8) task scheduling queues. Further, the slot id of each task is unique and can be expressed as follows:
slot id[15:0] = {js_que_id[5:0], real_slot_id[9:0]}, where js_que_id[5:0] denotes the identifier of the queue (in this example there are 2^6 = 64 queues in total) and real_slot_id[9:0] denotes the identifier of the task within that queue (the "task identifier" for short); that is, a 16-bit binary number represents the complete identifier of the task. In some application scenarios, the combination of the queue identifier of each queue and the task identifier of each task is used to indicate the address of the task in a lookup table, for example in the first lookup table, as well as in the second, third, and/or fourth lookup tables discussed below.
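As a small illustration of the bit layout above (the field widths come from the text; the helper functions and example values are our own), the 16-bit slot id can be packed and unpacked as follows:

```python
# slot id[15:0] = {js_que_id[5:0], real_slot_id[9:0]}: the upper 6 bits
# select one of 2**6 = 64 queues, and the lower 10 bits identify one of
# 1024 tasks within that queue. Helper names are illustrative only.

def pack_slot_id(js_que_id: int, real_slot_id: int) -> int:
    assert 0 <= js_que_id < 64        # 6-bit queue identifier
    assert 0 <= real_slot_id < 1024   # 10-bit task identifier
    return (js_que_id << 10) | real_slot_id

def unpack_slot_id(slot_id: int):
    return slot_id >> 10, slot_id & 0x3FF

# A task in queue 5 with in-queue identifier 7:
sid = pack_slot_id(5, 7)
assert sid == 5127                    # 5 * 1024 + 7
assert unpack_slot_id(sid) == (5, 7)
```

Because the queue identifier occupies the high-order bits, tasks of the same queue occupy a contiguous range of slot ids, which matches the use of the slot id as both a table index and a descriptor offset later in the text.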
In one implementation scenario, corresponding to the task scheduling queues in each of the above microprocessing cores, the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit, as well as a second lookup table that records at least the valid states of the plurality of tasks.
The first lookup table circuit 104, which cooperates with the above scheduling circuit, may be configured to, in response to a task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record this in the first lookup table, and, in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read the next task from the off-chip storage device onto the chip. It can be seen that, by utilizing the state change of a task in the first lookup table (that is, from valid to invalid), the scheduling circuit can efficiently read the next task from the off-chip storage device onto the chip. In particular, for the scenario in which the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, the scheduling circuit may be configured to, in response to the triggering by the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip, and to trigger the first lookup table circuit to update the valid state of the task read onto the chip to the invalid state and record this in the first lookup table.
The scheduler 100 of the present disclosure has been described above in conjunction with FIG. 1. It can be understood that, by using the first lookup table to record the state changes of on-chip tasks and triggering the transfer of subsequent tasks from the off-chip storage device onto the chip based on those state changes, the solution of the present disclosure advantageously ensures the timeliness and validity of task scheduling. In addition, since the first lookup table has an out-of-order structure and better reflects whether the current task is valid, the scheduler can perform task scheduling with higher efficiency. As a result, the on-chip task execution unit can execute the issued tasks more efficiently, so as to complete, for example, inter-chip communication tasks.
FIG. 2 is a detailed structural block diagram schematically illustrating the task scheduler 100 according to an embodiment of the present disclosure. To facilitate further explanation of the operating principle of the task scheduler 100, the figure also shows a system-on-chip 200 that includes the task scheduler (referred to simply as "on-chip" in the context of the present disclosure), as well as an on-chip task execution unit 204 disposed within the system-on-chip 200.
As shown in FIG. 2, the off-chip storage device 202 may store various types of tasks to be issued, and these tasks may be requested and created by users (for example, programmers) through software instructions. As one implementation, the multiple tasks created by software may be stored in the form of queues, with the software issuing tasks to the queues, where each queue may contain tasks of the same type, for example tasks executed by a single on-chip task execution unit or tasks executed by multiple on-chip task execution units. For example, when each on-chip execution unit includes 8 microprocessing cores, 8 queues may be set up for each microprocessing core, so that 64 task queues may be arranged on the off-chip storage device 202 (for example, queues 0-7 correspond to microprocessing core 0, queues 8-15 correspond to microprocessing core 1, and so on, until queues 56-63 correspond to microprocessing core 7), and a second lookup table for storing and maintaining its 8 queues is set up for each microprocessing core. As an example, the maintenance and management of the tasks in the aforementioned 8 lookup tables may be implemented by the second lookup table circuit 206. For a task in a queue, its state in the second lookup table may initially be set to valid by software, for example with its valid status bit set to "1", to indicate that the task is off-chip.
Corresponding to the function of the second lookup table circuit 206, each entry in the first lookup table maintained and managed by the first lookup table circuit 104 may have a status entry corresponding to a task in the second lookup table. For example, a one-bit valid flag (such as "1") may indicate that the corresponding task has been configured by software instructions and is waiting to be issued. When that task is successfully scheduled onto the chip by the scheduling circuit, the first lookup table circuit may change the bit of the valid signal to an invalid flag (such as "0"). Then, based on this transition of the task state from valid to invalid, the first lookup table circuit may trigger the scheduling circuit to read a new task from the second lookup table circuit onto the chip.
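The valid-to-invalid handshake described above can be sketched behaviorally as follows. This is a software model for illustration only, not the hardware circuits themselves; the class and method names are our own:

```python
# Behavioral model: when a task is fetched on-chip, its valid bit in the
# first lookup table flips from 1 to 0, and recording that invalid state
# triggers the scheduling circuit to pull the next task from off-chip.

class Scheduler:
    def __init__(self, off_chip_tasks):
        self.off_chip = list(off_chip_tasks)  # tasks still in DDR / L3
        self.on_chip = []                     # tasks already fetched

    def fetch_next(self):
        if self.off_chip:
            self.on_chip.append(self.off_chip.pop(0))

class FirstLookupTable:
    def __init__(self, num_slots):
        self.valid = [0] * num_slots  # 1 = configured by software, pending

    def on_task_fetched(self, slot, scheduler):
        self.valid[slot] = 0     # valid -> invalid: this task is now on-chip
        scheduler.fetch_next()   # the recorded invalid state triggers the next read

sched = Scheduler(["task_B", "task_C"])
lut = FirstLookupTable(num_slots=4)
lut.valid[0] = 1                 # task_A was configured, then fetched on-chip
lut.on_task_fetched(0, sched)
assert lut.valid[0] == 0
assert sched.on_chip == ["task_B"]   # the next task was pulled automatically
```

The point of the model is the coupling: the fetch of one task is what produces the trigger for the next, so the pipeline keeps itself fed without a separate polling loop over the off-chip queue.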
In one implementation scenario, in order to implement the execution of inter-chip tasks between one artificial intelligence processor chip and another, the present disclosure further proposes providing a third lookup table circuit 108 in the scheduler, which may be configured to use a third lookup table to record and manage the inter-chip tasks stored on the chip. As an example, the number of entries in this third lookup table may be the same as the number of slot ids described above. For example, the third lookup table may be implemented as 64 (addresses) × 5120 (data bits), where 64 may correspond to the 64 queues described above and each queue has 5120 bits.
When each task entry in the above third lookup table occupies 5 bits, 1024 tasks can be recorded in each queue, and each task entry (that is, each table entry) may carry the following semantics: {valid, need initial, wakeup, have data, have space}, where "valid" is a flag bit indicating whether the task is valid, "need initial" is the task initialization flag bit, "wakeup" is a wake-up flag bit indicating whether the task has been awakened, "have data" is a data flag bit indicating whether data exists in the inter-chip buffer, and "have space" is a space flag bit indicating whether storage space exists in the inter-chip buffer. It can be understood that the entry contents and semantics of the third lookup table here are merely exemplary rather than restrictive, and those skilled in the art will understand, in light of the teachings of the present disclosure, that the multiple lookup tables of the present disclosure (for example, the first and second lookup tables, as well as the fourth lookup table mentioned below) may all have entry contents (that is, task "descriptors") and semantics the same as or similar to those of the third lookup table. As an application example, when the software sets the corresponding "need initial" flag bit of a task to be issued to 1, this indicates that the task is valid. Thereafter, after the scheduler of the present disclosure has sent the task to a microprocessing core for the first time, the flag bit may be set to "0", indicating that the task has been scheduled once.
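The five 1-bit fields listed above can be illustrated with a small encoder/decoder. The field set is taken from the text; the bit positions and helper names are an assumption made for the example:

```python
# Each entry of the third lookup table carries five 1-bit flags:
# {valid, need_initial, wakeup, have_data, have_space}.
# Bit positions below are illustrative; only the field set comes from the text.

FLAGS = ("valid", "need_initial", "wakeup", "have_data", "have_space")

def encode_entry(**flags):
    word = 0
    for bit, name in enumerate(FLAGS):
        if flags.get(name, False):
            word |= 1 << bit
    return word

def decode_entry(word):
    return {name: bool((word >> bit) & 1) for bit, name in enumerate(FLAGS)}

# A freshly configured task: valid, and awaiting its first dispatch.
entry = encode_entry(valid=True, need_initial=True)
assert decode_entry(entry)["need_initial"] is True

# After the scheduler dispatches it to a microprocessing core once,
# the need_initial bit is cleared to mark that it has been scheduled:
entry &= ~(1 << FLAGS.index("need_initial"))
assert decode_entry(entry)["need_initial"] is False
assert decode_entry(entry)["valid"] is True
```

With 5 bits per entry and 1024 entries per queue, each queue occupies 5 × 1024 = 5120 bits, which matches the 64 × 5120 table dimensions given above.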
In one implementation scenario, in order to reduce the latency overhead of looking up a task's address, slot id[15:0] may serve both as the offset address of the descriptor in the L3 cache or DDR and as the address for querying a lookup table (for example, the third lookup table). When a chip-to-chip ("c2c") task wake-up message is received (containing the "slot id" to be awakened), the "slot id" can be used to address the third lookup table and set the corresponding flag bit, thereby completing the update of the third lookup table. As an example, the task wake-up message here may be a wake-up message sent by a first artificial intelligence processor chip to a second artificial intelligence processor chip, instructing the second artificial intelligence processor chip to execute, based on the task wake-up message, a task stored on its chip. In this scenario, the task here may be, for example, a task that is repeatedly scheduled for execution and recorded in a reorder buffer ("ROB") circuit 110 in the scheduler of the second artificial intelligence processor chip.
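Handling of a c2c wake-up message as described above might be sketched like this. The dict-based table is a deliberate simplification of the hardware, and the function name is our own:

```python
# On receiving a chip-to-chip task wake-up message, the slot id it carries
# directly addresses the third lookup table, and the wakeup flag of the
# addressed entry is set so that the polling logic will later find it.

def handle_c2c_wakeup(third_lut, slot_id):
    entry = third_lut.setdefault(slot_id, {"valid": False, "wakeup": False})
    entry["wakeup"] = True   # set only the wakeup bit; other flags untouched
    return entry

third_lut = {5127: {"valid": True, "wakeup": False}}
handle_c2c_wakeup(third_lut, 5127)       # message carries slot id 5127
assert third_lut[5127]["wakeup"] is True
assert third_lut[5127]["valid"] is True  # valid flag is preserved
```

Because the slot id is simultaneously the descriptor offset in L3/DDR and the table index, no separate address translation step is needed between receiving the message and updating the table, which is the latency saving the paragraph describes.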
In order to achieve effective scheduling of tasks, the solution of the present disclosure proposes providing in the scheduler 100 the reorder buffer circuit 110 mentioned above, which may be configured to record tasks that are repeatedly executed by the on-chip task execution unit. In an exemplary application scenario, the scheduler may send each task queue to the microprocessing cores for execution in the order recorded in the first lookup table, while registering the tasks in the storage space of the reorder buffer circuit; during this process, the "slot id" of each task is also registered in the reorder buffer circuit. When all the storage resources in the reorder buffer circuit are occupied, tasks newly fetched by the scheduler from the off-chip storage device can still be scheduled by the scheduler onto idle microprocessing cores for execution.
As another implementation scenario of inter-chip task scheduling, the present disclosure proposes providing a polling circuit 112 in the scheduler 100, which may receive the task wake-up messages used for scheduling inter-chip tasks as described above. The polling circuit may then poll the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to find the specific task associated with the task wake-up message. In response to the specific task being found by polling, the scheduling circuit may be configured to schedule the polled specific task for execution by the on-chip task execution unit 204.
For example, during the polling process, the polling circuit may poll the third lookup table for multiple queues simultaneously, where each queue may be, for example, 3 × 1024 (entries) = 3072 bits. On this basis, the polling circuit can poll 32 entries at a time, so that 32 scheduler clock cycles (32 nanoseconds) suffice to poll all 1024 entries of a queue. During inter-chip task scheduling, if the task to be awakened is already stored on the chip, the scheduler can directly wake up the task and send it to the on-chip task execution unit for execution. Conversely, if the polling circuit does not find the task to be awakened (in other words, the task is not on the chip), the scheduler can fetch the task from the off-chip storage device (such as the L3 buffer or DDR), that is, schedule it onto the chip, and send it to the on-chip task execution unit, such as a microprocessing core, for execution. As a preferred approach, the present disclosure assumes that inter-chip task scheduling takes priority over intra-chip task scheduling.
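The polling arithmetic above (32 entries per scheduler clock cycle, so a 1024-entry queue in 32 cycles) can be checked with a short model. The grouping into windows of 32 follows the text; the function itself is illustrative:

```python
ENTRIES_PER_QUEUE = 1024
ENTRIES_PER_CYCLE = 32   # 1024 / 32 = 32 cycles for a full queue scan

def poll_queue(wakeup_bits):
    """Scan a queue's wakeup bits 32 at a time; return (cycles, hit or None)."""
    cycles = 0
    for base in range(0, ENTRIES_PER_QUEUE, ENTRIES_PER_CYCLE):
        cycles += 1
        window = wakeup_bits[base:base + ENTRIES_PER_CYCLE]
        if 1 in window:
            return cycles, base + window.index(1)
    return cycles, None  # task not on-chip: fetch it from L3 / DDR instead

bits = [0] * ENTRIES_PER_QUEUE
bits[700] = 1
cycles, hit = poll_queue(bits)
assert hit == 700
assert cycles == 700 // 32 + 1               # found in the 22nd window of 32
assert poll_queue([0] * 1024) == (32, None)  # empty queue: full 32-cycle scan
```

The exhaustive-scan case corresponds to the "not on-chip" branch in the text, where the scheduler falls back to fetching the task from off-chip storage.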
In order to effectively record the tasks sent to the on-chip task execution unit, the present disclosure further proposes providing a fourth lookup table circuit 114 in the scheduler, configured to use a fourth lookup table to record the tasks scheduled to the on-chip task execution unit. In one implementation scenario, the scheduling circuit may be configured to interact with the fourth lookup table circuit before scheduling a task to be executed to the on-chip task execution unit, in order to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit. Further, the scheduling circuit may also be configured to, in response to receiving from the on-chip task execution unit a notification that a task has completed or been suspended, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table. By querying the fourth lookup table, it can be guaranteed that a task to be sent to the on-chip task execution unit differs from the multiple tasks being executed in parallel (for example, the tasks' "slot id" values differ).
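The duplicate-dispatch check against the fourth lookup table can be sketched as follows. This is a set-based software model; the hardware presumably compares slot ids, and the class and method names here are our own:

```python
# Before dispatch, the scheduler checks that the task's slot id is not
# already recorded as in flight; when the execution unit reports completion
# or suspension, the entry is removed so the task may be dispatched again.

class FourthLookupTable:
    def __init__(self):
        self.in_flight = set()   # slot ids currently at the execution units

    def try_dispatch(self, slot_id):
        if slot_id in self.in_flight:
            return False         # same task already executing in parallel
        self.in_flight.add(slot_id)
        return True

    def on_done_or_suspended(self, slot_id):
        self.in_flight.discard(slot_id)

lut4 = FourthLookupTable()
assert lut4.try_dispatch(5127) is True
assert lut4.try_dispatch(5127) is False  # duplicate blocked while in flight
lut4.on_done_or_suspended(5127)
assert lut4.try_dispatch(5127) is True   # allowed again after completion
```

The removal on completion or suspension is what makes repeated scheduling of the same task (as with the reorder buffer's repeatedly executed tasks) possible without ever running two copies concurrently.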
The scheduling solution of the present disclosure has been further elaborated above in conjunction with FIG. 2. Based on the above description, those skilled in the art will understand that, by virtue of the arrangement of one or more lookup tables, and in particular the on-chip lookup tables, the scheduling solution of the present disclosure can significantly reduce task scheduling latency while simplifying hardware design. Further, by using lookup tables, on-chip resources can be utilized effectively for task scheduling, and repeated, redundant scheduling of tasks can be avoided, thereby improving scheduling efficiency. The solution of the present disclosure also promotes the smooth and efficient execution of inter-chip tasks, thereby reducing task execution overhead.
FIG. 3 is a simplified flowchart schematically illustrating a method 300 of scheduling tasks using a scheduler according to an embodiment of the present disclosure. It can be understood that the method 300 may be performed by the scheduler described in detail above in conjunction with FIG. 1 and FIG. 2.
As shown in FIG. 3, at step S302, the scheduling circuit is used to read a task from the off-chip storage device onto the chip, so as to schedule the task for execution by the on-chip task execution unit, where the task is recorded on the off-chip storage device in a valid state. Next, at step S304, in response to the task being read from the off-chip storage device onto the chip, the first lookup table circuit is used to update the task from the valid state to the invalid state and record this in the first lookup table. Thereafter, at step S306, in response to the invalid state being recorded in the first lookup table, the first lookup table circuit may trigger the reading of the next task from the off-chip storage device onto the chip.
In one embodiment, the above off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit, as well as a second lookup table that records at least the valid states of the plurality of tasks (implemented, for example, by means of the second lookup table circuit 206 shown in FIG. 2). In this embodiment, the method 300 may further include, in response to the triggering by the first lookup table circuit, using the scheduling circuit to read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip, and using the scheduling circuit to trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record this in the first lookup table.
In one embodiment, the method further includes using a third lookup table circuit (for example, the third lookup table circuit 108 shown in FIG. 2) to handle inter-chip tasks between the artificial intelligence processor chip and another artificial intelligence processor chip, that is, using the third lookup table to record and manage the inter-chip tasks stored on the chip. In another embodiment, the method further includes using the polling circuit 112 to receive a task wake-up message for scheduling inter-chip tasks and polling the inter-chip tasks recorded in the third lookup table circuit according to the task wake-up message, so as to find the specific task associated with the task wake-up message. On this basis, the method further uses the scheduling circuit to schedule the specific task found by polling.
In some scenarios, in response to the polling circuit failing to find the specific task by polling, the method further includes using the scheduler to read the specific task associated with the task wake-up message from the off-chip storage device onto the chip. Further, in order to save and maintain the tasks repeatedly executed by the on-chip task execution unit, the method further includes using a reorder buffer circuit (such as the reorder buffer circuit 110 shown in FIG. 2) to record the repeatedly executed tasks.
In other scenarios, the method further includes using a fourth lookup table circuit (such as the fourth lookup table circuit 114 shown in FIG. 2) to record, by means of the fourth lookup table, the tasks scheduled to the on-chip task execution unit. Further, the method includes using the scheduling circuit to interact with the fourth lookup table circuit before scheduling a task to be executed to the on-chip task execution unit, in order to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit. In addition, the method includes using the scheduling circuit to, in response to receiving from the on-chip task execution unit a notification that a task has completed or been suspended, trigger the fourth lookup table circuit to remove the completed or suspended task from the fourth lookup table.
The method of performing task scheduling using a scheduler according to the present disclosure has been described above in conjunction with FIG. 3. It can be understood that the above description is merely exemplary rather than restrictive. Based on the disclosure herein, those skilled in the art may also conceive of combining or substituting the steps therein, so as to achieve effective scheduling of tasks and save scheduling resources.
FIG. 4 is a schematic diagram illustrating the structure of a board card 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the board card 400 includes a chip 401, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform. The board card 400 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and substantial computing power. In some scenarios, when only one chip 401 is arranged on each board card, task scheduling between board cards is likewise the inter-chip communication referred to in the context of the present disclosure.
The chip 401 is connected to an external device 403 through an external interface device 402. The external device 403 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 403 to the chip 401 through the external interface device 402, and the computation results of the chip 401 can be transmitted back to the external device 403 via the external interface device 402. Depending on the application scenario, the external interface device 402 may take different interface forms, such as a PCIe interface.
The board card 400 also includes a storage device 404 for storing data, which includes one or more storage units 405. The storage device 404 is connected to, and exchanges data with, the control device 406 and the chip 401 through a bus. The control device 406 in the board card 400 is configured to regulate the state of the chip 401. To this end, in one application scenario, the control device 406 may include a microcontroller unit (MCU).
FIG. 5 is a structural diagram of the combined processing device in the chip 401 of this embodiment. As shown in FIG. 5, the combined processing device 500 includes a computing device 501, an interface device 502, a processing device 503, and a DRAM 504.
The computing device 501 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computations. It can interact with the processing device 503 through the interface device 502 to jointly complete the user-specified operations.
The interface device 502 is used to transfer data and control instructions between the computing device 501 and the processing device 503. For example, the computing device 501 may obtain input data from the processing device 503 via the interface device 502 and write it into an on-chip storage device of the computing device 501. Further, the computing device 501 may obtain control instructions from the processing device 503 via the interface device 502 and write them into an on-chip control cache of the computing device 501. Alternatively or optionally, the interface device 502 may also read data from the storage device of the computing device 501 and transmit it to the processing device 503.
As a general-purpose processing device, the processing device 503 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 501. Depending on the implementation, the processing device 503 may be one or more of a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 501 of the present disclosure, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 501 and the processing device 503 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The storage device 504 is used to store data to be processed. It may be a DRAM, specifically a DDR memory, typically 16 GB or larger, and stores data of the computing device 501 and/or the processing device 503. In the context of the present disclosure, this storage device can be regarded as the off-chip storage device of the aforementioned scheduling scheme.
FIG. 6 is a schematic diagram of the internal structure of the computing device 501. The computing device 501 processes input data such as computer vision, speech, natural language, and data mining data. The computing device 501 in the figure adopts a multi-core hierarchical design: as a system-on-chip, it includes multiple computing clusters, and each computing cluster in turn includes multiple processor cores. In other words, the computing device 501 is organized in a system-on-chip / computing-cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 6, the computing device 501 includes an external storage controller 601, a peripheral communication module 602, an on-chip interconnect module 603, a synchronization module 604, and multiple computing clusters 605. Although not shown, the computing device 501 may also include the scheduling circuit described in the context of the present disclosure, so as to schedule tasks from the external storage device onto the chip for execution by the computing clusters 605.
There may be multiple external storage controllers 601 (two are shown in the figure by way of example), which respond to access requests issued by the processor cores and access an external storage device, such as the DRAM 504 in FIG. 5, to read data from or write data to off-chip memory. The peripheral communication module 602 receives control signals from the processing device 503 through the interface device 502 and starts the computing device 501 to execute tasks. The on-chip interconnect module 603 connects the external storage controllers 601, the peripheral communication module 602, and the multiple computing clusters 605, and transfers data and control signals among these modules. The synchronization module 604 is a global barrier controller (GBC), which coordinates the work progress of the computing clusters and ensures information synchronization. The multiple computing clusters 605 are the computing cores of the computing device 501; four are shown in the figure by way of example, and with the development of hardware, the computing device 501 of the present disclosure may also include 8, 16, 64, or even more computing clusters 605. The computing clusters 605 are used to efficiently execute deep learning algorithms.
At the computing-cluster level, as shown in FIG. 6, each computing cluster 605 includes multiple processor cores (IPU cores) 606 and one storage core (MEM core) 607.
Four processor cores 606 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 606. The internal architecture of a processor core is shown in FIG. 7. Each processor core 606 includes three major modules: a control module 71, an operation module 72, and a storage module 73.
The control module 71 coordinates and controls the work of the operation module 72 and the storage module 73 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 711 and an instruction decode unit (IDU) 712. The instruction fetch unit 711 obtains instructions from the processing device 503, and the instruction decode unit 712 decodes the obtained instructions and sends the decoding results as control information to the operation module 72 and the storage module 73.
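As a rough behavioral sketch of this fetch/decode split (a software model only, not the actual hardware design; the instruction encoding, opcode names, and routing rules below are invented for illustration), the control module can be thought of as:

```python
# Behavioral sketch of the control module's fetch/decode pipeline.
# The instruction format and opcode names are illustrative assumptions,
# not the patent's actual encoding.

def instruction_fetch(program, pc):
    """IFU 711: obtain the next instruction from the instruction stream."""
    return program[pc]

def instruction_decode(instruction):
    """IDU 712: decode an instruction into control information and route it
    to the operation module (72) or the storage module (73)."""
    opcode, operands = instruction
    if opcode in ("vector_add", "matrix_mul"):
        return ("operation_module", opcode, operands)
    if opcode in ("load", "store"):
        return ("storage_module", opcode, operands)
    raise ValueError(f"unknown opcode: {opcode}")

program = [("load", ("NRAM", 0)), ("matrix_mul", (0, 1)), ("store", ("NRAM", 2))]
control_info = [instruction_decode(instruction_fetch(program, pc))
                for pc in range(len(program))]
```

In this sketch, compute instructions are steered to the operation module and memory instructions to the storage module, mirroring how the IDU's decoded control information fans out to modules 72 and 73.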
The operation module 72 includes a vector operation unit 721 and a matrix operation unit 722. The vector operation unit 721 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 722 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 73 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 731, a weight storage unit (weight RAM, WRAM) 732, an input/output direct memory access module (IODMA) 733, and a move direct memory access module (MVDMA) 734. The NRAM 731 stores the input data, output data, and intermediate results computed by the processor core 606; the WRAM 732 stores the weights of the deep learning network; the IODMA 733 controls memory access between the NRAM 731/WRAM 732 and the DRAM 504 through the broadcast bus 609; and the MVDMA 734 controls memory access between the NRAM 731/WRAM 732 and the SRAM 608. It should be noted that the NRAM and the WRAM here may be two storage areas formed by partitioning the logical storage space of the same memory, or may be two independent memories; no specific limitation is made here.
Returning to FIG. 6, the storage core 607 is mainly used for storage and communication, that is, storing data shared among the processor cores 606 or intermediate results, and performing communication between the computing cluster 605 and the DRAM 504, communication among the computing clusters 605, communication among the processor cores 606, and so on. In other embodiments, the storage core 607 has scalar operation capability and performs scalar operations.
The storage core 607 includes a shared memory unit (SRAM) 608, a broadcast bus 609, a cluster direct memory access module (CDMA) 610, and a global direct memory access module (GDMA) 611. The SRAM 608 plays the role of a high-performance data transfer station: data reused among different processor cores 606 within the same computing cluster 605 does not need to be fetched from the DRAM 504 by each processor core 606 individually, but is instead relayed among the processor cores 606 through the SRAM 608. The storage core 607 only needs to quickly distribute the reused data from the SRAM 608 to the multiple processor cores 606, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 609, the CDMA 610, and the GDMA 611 are used, respectively, for communication among the processor cores 606, communication among the computing clusters 605, and data transfer between a computing cluster 605 and the DRAM 504. Each is described below.
The broadcast bus 609 performs high-speed communication among the processor cores 606 within a computing cluster 605. The broadcast bus 609 in this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point (i.e., single processor core to single processor core) data transfer; multicast is a communication mode that transfers one piece of data from the SRAM 608 to a specific set of processor cores 606; and broadcast, which transfers one piece of data from the SRAM 608 to all processor cores 606, is a special case of multicast.
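The relationship among the three modes can be captured by a small model in which multicast delivers one piece of data to a chosen subset of cores, and unicast and broadcast are simply the two extremes of that subset (a single core, and all cores). This is an illustrative sketch only; the core identifiers and function names are assumptions, not hardware interfaces:

```python
# Illustrative model of the communication modes of broadcast bus 609.
# Core identifiers and the transfer representation are assumptions.

ALL_CORES = [0, 1, 2, 3]  # the four processor cores 606 in one cluster 605

def multicast(sram_data, target_cores):
    """Transfer one piece of data from SRAM 608 to a set of cores."""
    return {core: sram_data for core in target_cores}

def unicast(sram_data, target_core):
    """Point-to-point transfer: multicast to exactly one core."""
    return multicast(sram_data, [target_core])

def broadcast(sram_data):
    """Broadcast: the special case of multicast targeting all cores."""
    return multicast(sram_data, ALL_CORES)
```

Modeling unicast and broadcast in terms of multicast makes the text's observation explicit: broadcast is not a separate mechanism but the limiting case of multicast.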
The CDMA 610 controls memory access to the SRAMs 608 of different computing clusters 605 within the same computing device 501. FIG. 8 is a schematic diagram of one processor core writing data to a processor core of another computing cluster, illustrating the working principle of the CDMA 610. In this application scenario, the same computing device includes multiple computing clusters; for convenience of explanation, only computing cluster 0 and computing cluster 1 are shown in the figure, and each includes multiple processor cores. Also for convenience, computing cluster 0 in the figure shows only processor core 0, and computing cluster 1 shows only processor core 1. Processor core 0 intends to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave; that is, the master sends the write address AW and the write data W, transferring the data into SRAM 1 of computing cluster 1. The slave then sends a write response B in reply. Finally, processor core 1 of computing cluster 1 sends a unicast read request to read the data out of SRAM 1.
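The master/slave handshake described above can be sketched as a sequence of channel transactions in the style of an AW/W/B write handshake. The sketch below is a software model under assumed data structures (dict-based SRAMs, a transaction log); the real CDMA is a hardware circuit:

```python
# Sketch of the inter-cluster write of FIG. 8. SRAMs are modeled as dicts;
# the AW/W/B message names follow the text, everything else is assumed.

def cdma_write(sram_src, sram_dst, addr, log):
    """CDMA 0 (master) pushes a write request to CDMA 1 (slave)."""
    log.append(("AW", addr))            # master sends the write address
    log.append(("W", sram_src[addr]))   # master sends the write data
    sram_dst[addr] = sram_src[addr]     # data lands in the remote SRAM 1
    log.append(("B", "okay"))           # slave replies with a write response

log = []
sram0 = {0x10: "payload"}   # cluster 0: core 0 already wrote into SRAM 0
sram1 = {}
cdma_write(sram0, sram1, 0x10, log)
result = sram1[0x10]        # cluster 1: core 1 reads the data out of SRAM 1
```

The log mirrors the order stated in the text: write address, write data, then the slave's write response, after which the remote core can read the data locally.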
Returning to FIG. 6, the GDMA 611 cooperates with the external storage controller 601 to control memory access from the SRAM 608 of a computing cluster 605 to the DRAM 504, or to read data from the DRAM 504 into the SRAM 608. From the foregoing, communication between the DRAM 504 and the NRAM 731 or WRAM 732 can be realized through two channels. The first channel directly connects the DRAM 504 with the NRAM 731 or WRAM 732 through the IODMA 733; the second channel first transfers data between the DRAM 504 and the SRAM 608 via the GDMA 611, and then transfers data between the SRAM 608 and the NRAM 731 or WRAM 732 via the MVDMA 734. Although the second channel appears to require more components and a longer data path, in some embodiments its bandwidth is in fact far greater than that of the first channel, so communication between the DRAM 504 and the NRAM 731 or WRAM 732 may be more efficient through the second channel. Embodiments of the present disclosure may select a data transfer channel according to their own hardware conditions.
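The channel choice amounts to comparing the effective bandwidth of a direct IODMA transfer against the two-hop GDMA + MVDMA path, whose end-to-end rate is limited by its slower stage. A minimal selection heuristic, with bandwidth figures invented purely for illustration, might look like:

```python
# Minimal channel-selection sketch for DRAM <-> NRAM/WRAM transfers.
# Bandwidth numbers are invented; real values depend on the hardware.

def pick_channel(iodma_bw_gbps, gdma_bw_gbps, mvdma_bw_gbps):
    """Return the channel with the higher end-to-end bandwidth.
    The two-hop path is bottlenecked by its slower stage."""
    two_hop_bw = min(gdma_bw_gbps, mvdma_bw_gbps)
    if two_hop_bw > iodma_bw_gbps:
        return "GDMA+MVDMA"   # second channel: DRAM -> SRAM 608 -> NRAM/WRAM
    return "IODMA"            # first channel: DRAM -> NRAM/WRAM directly

choice = pick_channel(iodma_bw_gbps=50, gdma_bw_gbps=400, mvdma_bw_gbps=300)
```

With the illustrative figures above, the two-hop path wins despite its longer route, matching the text's observation that the second channel may be more efficient when its bandwidth is far greater.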
In other embodiments, the functions of the GDMA 611 and the IODMA 733 may be integrated into the same component. For convenience of description, the present disclosure treats the GDMA 611 and the IODMA 733 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, such variants fall within the protection scope of the present disclosure. Further, the functions of the GDMA 611, the IODMA 733, the CDMA 610, and the MVDMA 734 may also be realized by the same component; likewise, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they fall within the protection scope of the present disclosure.
The software and hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 4-8. It should be understood that the above description is merely exemplary rather than restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also modify the board card (or artificial intelligence device) of the present disclosure and its internal structure, and such modifications still fall within the protection scope of the present disclosure.
Based on the above description, those skilled in the art will understand that the present application in fact also discloses a device including a processor and a memory. Specifically, the memory may store program instructions for scheduling tasks; when the program instructions are executed by the processor, the scheduling operation steps described in this application with reference to FIGS. 1-3 are implemented. In addition, since the solution of the present application can be implemented by computer program instructions, the present application also discloses a computer-readable storage medium or computer program product on which computer programs/instructions for task scheduling are stored, thereby implementing the scheduling operation steps described with reference to FIGS. 1-3.
The solution of the present disclosure has been described in detail above with reference to the accompanying drawings. According to different application scenarios, the device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet of Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The device or apparatus of the present disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.
Further, the device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a high-power-consumption device or apparatus according to the solution of the present disclosure may be applied to a cloud device (for example, a cloud server), while a low-power-consumption device or apparatus may be applied to a terminal device and/or an edge device (for example, a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, based on the hardware information of the terminal device and/or the edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of terminal-cloud integration or cloud-edge-terminal integration.
It should be noted that, for the sake of brevity, the present disclosure presents some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the implementation of one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of the embodiments in the present disclosure each have their own emphasis. In view of this, for parts not described in detail in a given embodiment of the present disclosure, those skilled in the art may refer to the relevant descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the device or apparatus embodiments described above are divided herein on the basis of logical functions, while other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of the connection relationships between different units or components, the connections discussed above with reference to the accompanying drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. Furthermore, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically on its own.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as an independent product, the integrated units may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (for example, a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.
In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic storage media, magneto-optical storage media, and the like), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1. A scheduler for scheduling tasks, arranged on an artificial intelligence processor chip and connecting an off-chip storage device and an on-chip task execution unit, the scheduler comprising:

a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and

a first lookup table circuit configured to:

in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record the invalid state in a first lookup table; and

in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read a next task from the off-chip storage device onto the chip.
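The interplay between the scheduling circuit and the first lookup table circuit in Clause A1 can be sketched in software as follows. This is an illustrative model only; the real design is a hardware circuit, and the table layout and task identifiers are assumptions:

```python
# Software model of Clause A1: reading a task on-chip flips its state
# from valid to invalid in the first lookup table, and recording that
# invalid state in turn triggers the fetch of the next task.

VALID, INVALID = "valid", "invalid"

off_chip_tasks = ["task0", "task1", "task2"]   # recorded off-chip as valid
first_lookup_table = {}                        # on-chip state records
executed = []

def read_next(task_iter):
    """Scheduling circuit: read one task from off-chip storage on-chip."""
    task = next(task_iter, None)
    if task is not None:
        first_lookup_table[task] = INVALID     # valid -> invalid on read-in
        executed.append(task)                  # dispatch to execution unit
        read_next(task_iter)                   # recording triggers next read
    return task

read_next(iter(off_chip_tasks))
```

The recursive call models the trigger chain: each recording of an invalid state prompts the scheduling circuit to fetch the next off-chip task, so the scheduler drains the off-chip task list without external prompting.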
Clause A2. The scheduler according to Clause A1, wherein the off-chip storage device stores multiple tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the multiple tasks, and the scheduling circuit is configured to:

in response to the triggering by the first lookup table circuit, read one of the multiple tasks recorded in the second lookup table from the off-chip storage device onto the chip; and

trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
Clause A3. The scheduler according to Clause A1, further comprising a third lookup table circuit configured to:
when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, use a third lookup table to record and manage the inter-chip task stored on the chip.
Clause A4. The scheduler according to Clause A3, further comprising a polling circuit configured to:
receive a task wake-up message for scheduling inter-chip tasks; and
poll, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
wherein the scheduling circuit is configured to schedule the specific task found by polling.
Clause A5. The scheduler according to Clause A4, wherein the scheduling circuit is further configured to:
in response to the polling circuit failing to locate the specific task, read the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
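Clauses A4 and A5 together describe a poll-then-fallback path. A minimal sketch, under the assumption that the third lookup table can be modeled as a list of on-chip task identifiers and off-chip storage as a mapping (both hypothetical simplifications of the hardware):

```python
def handle_wakeup(msg_task_id, third_lut, off_chip_storage):
    """Resolve a task wake-up message to a schedulable task."""
    # Poll the inter-chip tasks currently recorded on-chip.
    for task_id in third_lut:
        if task_id == msg_task_id:
            return ("on_chip", task_id)     # hit: schedule directly
    # Miss: read the specific task from off-chip storage onto the chip.
    task = off_chip_storage[msg_task_id]
    third_lut.append(msg_task_id)           # now recorded on-chip
    return ("off_chip", task)
```

The design point is that the on-chip lookup table acts as a cache of inter-chip tasks: the off-chip read in Clause A5 is only the miss path.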
Clause A6. The scheduler according to Clause A4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
Clause A7. The scheduler according to Clause A1, further comprising a reordering buffer circuit configured to record tasks repeatedly executed by the on-chip task execution unit.
Clause A8. The scheduler according to Clause A1, further comprising a fourth lookup table circuit configured to use a fourth lookup table to record tasks scheduled to the on-chip task execution unit.
Clause A9. The scheduler according to Clause A8, wherein the scheduling circuit is further configured to:
before scheduling a task to be executed to the on-chip task execution unit, interact with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
Clause A10. The scheduler according to Clause A8, wherein the scheduling circuit is further configured to:
in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, trigger the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
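Clauses A9 and A10 amount to an in-flight set: an entry in the fourth lookup table blocks re-dispatch of the same task, and the entry is cleared when the execution unit reports completion or suspension. A hypothetical model (interface names are ours, not the disclosure's):

```python
class FourthLookupTable:
    """Tracks tasks currently dispatched to the on-chip task execution unit."""
    def __init__(self):
        self.in_flight = set()

    def try_dispatch(self, task_id):
        # Query-and-confirm (Clause A9): dispatch only if not already recorded.
        if task_id in self.in_flight:
            return False
        self.in_flight.add(task_id)
        return True

    def on_done_or_suspended(self, task_id):
        # Completion or suspension (Clause A10) frees the entry for re-dispatch.
        self.in_flight.discard(task_id)
```

Treating suspension like completion here matters: a suspended task leaves the table and can later be dispatched again, which matches the removal rule in Clause A10.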
Clause A11. The scheduler according to any one of Clauses A8-A10, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
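The {queue identifier, task identifier} addressing of Clause A11 can be illustrated as a simple bit concatenation: the queue identifier selects a region of the lookup table and the task identifier selects the entry within that region. The field width below is an assumption made for the example only, not a value from the disclosure.

```python
TASK_ID_BITS = 6   # assumed width: up to 64 tasks per queue

def entry_address(queue_id, task_id):
    """Form a lookup table address from a queue ID and a task ID."""
    return (queue_id << TASK_ID_BITS) | task_id
```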
Clause A12. The scheduler according to Clause A11, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
a flag bit indicating whether the task is valid;
a task initialization flag bit;
a wake-up flag bit indicating whether the task has been woken up;
a data flag bit indicating whether data exists in an inter-chip buffer; and
a space flag bit indicating whether storage space exists in the inter-chip buffer.
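One possible packing of these flag bits into a single entry word is sketched below; the bit positions and the readiness rule are illustrative assumptions only, not taken from the disclosure.

```python
VALID_BIT  = 1 << 0   # task is valid
INIT_BIT   = 1 << 1   # task has been initialized
WAKEUP_BIT = 1 << 2   # task has been woken up
DATA_BIT   = 1 << 3   # inter-chip buffer holds data to consume
SPACE_BIT  = 1 << 4   # inter-chip buffer has free space to produce into

def ready_to_run(entry):
    """An inter-chip task is runnable only when every condition holds:
    valid, initialized, woken up, data available, and space available."""
    required = VALID_BIT | INIT_BIT | WAKEUP_BIT | DATA_BIT | SPACE_BIT
    return entry & required == required
```

The data and space flags together give the familiar producer/consumer gating of a bounded buffer: a task stalls both when there is nothing to read and when there is nowhere to write.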
Clause A13. The scheduler according to Clause A12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
Clause A14. An artificial intelligence processor chip, comprising:
the scheduler according to any one of Clauses A1-A13; and
an on-chip task execution unit configured to interact with the scheduler so as to execute tasks issued by the scheduler.
Clause A15. A board card comprising the artificial intelligence processor chip according to Clause A14.
Clause A16. A method of scheduling tasks using the scheduler according to any one of Clauses A1-A13, the method comprising:
using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
using the first lookup table circuit to perform:
in response to the task being read from the off-chip storage device onto the chip, updating the task from the valid state to an invalid state and recording the invalid state in a first lookup table; and
in response to the invalid state being recorded in the first lookup table, triggering the scheduling circuit to read a next task from the off-chip storage device onto the chip.
Clause A17. The method according to Clause A16, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the method comprises:
in response to triggering by the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
Clause A18. The method according to Clause A16, further comprising using the third lookup table circuit to perform:
when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, using a third lookup table to record and manage the inter-chip task stored on the chip.
Clause A19. The method according to Clause A18, further comprising using the polling circuit to perform:
receiving a task wake-up message for scheduling inter-chip tasks; and
polling, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
wherein the scheduling circuit is configured to schedule the specific task found by polling.
Clause A20. The method according to Clause A19, wherein the scheduling circuit is used to perform the following step:
in response to the polling circuit failing to locate the specific task, reading the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
Clause A21. The method according to Clause A19, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the method further comprises using the scheduling circuit to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
Clause A22. The method according to Clause A16, further comprising using the reordering buffer circuit to record tasks repeatedly executed by the on-chip task execution unit.
Clause A23. The method according to Clause A16, further comprising using the fourth lookup table circuit to record, in a fourth lookup table, tasks scheduled to the on-chip task execution unit.
Clause A24. The method according to Clause A23, wherein the method further comprises using the scheduling circuit to perform:
before scheduling a task to be executed to the on-chip task execution unit, interacting with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
Clause A25. The method according to Clause A23, wherein the method further comprises using the scheduling circuit to perform:
in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, triggering the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
Clause A26. The method according to any one of Clauses A23-A25, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
Clause A27. The method according to Clause A26, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
a flag bit indicating whether the task is valid;
a task initialization flag bit;
a wake-up flag bit indicating whether the task has been woken up;
a data flag bit indicating whether data exists in an inter-chip buffer; and
a space flag bit indicating whether storage space exists in the inter-chip buffer.
Clause A28. The method according to Clause A27, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
Clause A29. A computer-readable storage medium having stored thereon computer program instructions for scheduling tasks which, when executed by a processor, implement the method according to any one of Clauses A16-A28.
Although embodiments of the present disclosure are described above, they are merely examples provided to facilitate understanding of the present disclosure and are not intended to limit its scope or application scenarios. Any person skilled in the technical field of the present disclosure may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein; however, the scope of patent protection of the present disclosure shall remain as defined by the appended claims.

Claims (29)

  1. A scheduler for scheduling tasks, arranged on an artificial intelligence processor chip and connected to an off-chip storage device and an on-chip task execution unit, the scheduler comprising:
    a scheduling circuit configured to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
    a first lookup table circuit configured to:
    in response to the task being read from the off-chip storage device onto the chip, update the task from the valid state to an invalid state and record the invalid state in a first lookup table; and
    in response to the invalid state being recorded in the first lookup table, trigger the scheduling circuit to read a next task from the off-chip storage device onto the chip.
  2. The scheduler according to claim 1, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the scheduling circuit is configured to:
    in response to triggering by the first lookup table circuit, read one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
    trigger the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  3. The scheduler according to claim 1, further comprising a third lookup table circuit configured to:
    when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, use a third lookup table to record and manage the inter-chip task stored on the chip.
  4. The scheduler according to claim 3, further comprising a polling circuit configured to:
    receive a task wake-up message for scheduling inter-chip tasks; and
    poll, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
    wherein the scheduling circuit is configured to schedule the specific task found by polling.
  5. The scheduler according to claim 4, wherein the scheduling circuit is further configured to:
    in response to the polling circuit failing to locate the specific task, read the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
  6. The scheduler according to claim 4, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the scheduling circuit is configured to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
  7. The scheduler according to claim 1, further comprising a reordering buffer circuit configured to record tasks repeatedly executed by the on-chip task execution unit.
  8. The scheduler according to claim 1, further comprising a fourth lookup table circuit configured to use a fourth lookup table to record tasks scheduled to the on-chip task execution unit.
  9. The scheduler according to claim 8, wherein the scheduling circuit is further configured to:
    before scheduling a task to be executed to the on-chip task execution unit, interact with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
  10. The scheduler according to claim 8, wherein the scheduling circuit is further configured to:
    in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, trigger the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
  11. The scheduler according to any one of claims 8-10, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  12. The scheduler according to claim 11, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
    a flag bit indicating whether the task is valid;
    a task initialization flag bit;
    a wake-up flag bit indicating whether the task has been woken up;
    a data flag bit indicating whether data exists in an inter-chip buffer; and
    a space flag bit indicating whether storage space exists in the inter-chip buffer.
  13. The scheduler according to claim 12, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  14. An artificial intelligence processor chip, comprising:
    the scheduler according to any one of claims 1-13; and
    an on-chip task execution unit configured to interact with the scheduler so as to execute tasks issued by the scheduler.
  15. A board card comprising the artificial intelligence processor chip according to claim 14.
  16. A method of scheduling tasks using the scheduler according to any one of claims 1-13, the method comprising:
    using the scheduling circuit to read a task from the off-chip storage device onto the chip so as to schedule the task for execution by the on-chip task execution unit, wherein the task is recorded in a valid state on the off-chip storage device; and
    using the first lookup table circuit to perform:
    in response to the task being read from the off-chip storage device onto the chip, updating the task from the valid state to an invalid state and recording the invalid state in a first lookup table; and
    in response to the invalid state being recorded in the first lookup table, triggering the scheduling circuit to read a next task from the off-chip storage device onto the chip.
  17. The method according to claim 16, wherein the off-chip storage device stores a plurality of tasks to be read onto the chip by the scheduling circuit and a second lookup table that records at least the valid states of the plurality of tasks, and the method comprises using the scheduling circuit to perform:
    in response to triggering by the first lookup table circuit, reading one of the plurality of tasks recorded in the second lookup table from the off-chip storage device onto the chip; and
    triggering the first lookup table circuit to update the valid state of the task read from the off-chip storage device onto the chip to the invalid state and record it in the first lookup table.
  18. The method according to claim 16, further comprising using the third lookup table circuit to perform:
    when executing an inter-chip task between the artificial intelligence processor chip and another artificial intelligence processor chip, using a third lookup table to record and manage the inter-chip task stored on the chip.
  19. The method according to claim 18, further comprising using the polling circuit to perform:
    receiving a task wake-up message for scheduling inter-chip tasks; and
    polling, according to the task wake-up message, the inter-chip tasks recorded in the third lookup table circuit so as to locate the specific task associated with the task wake-up message,
    wherein the method further uses the scheduling circuit to schedule the specific task found by polling.
  20. The method according to claim 19, wherein the scheduling circuit is used to perform:
    in response to the polling circuit failing to locate the specific task, reading the specific task associated with the task wake-up message from the off-chip storage device onto the chip.
  21. The method according to claim 19, wherein the task wake-up message comes from the other artificial intelligence processor chip, and the method further comprises using the scheduling circuit to schedule the specific task for execution by the on-chip task execution unit of the artificial intelligence processor chip.
  22. The method according to claim 16, further comprising using the reordering buffer circuit to record tasks repeatedly executed by the on-chip task execution unit.
  23. The method according to claim 16, further comprising using the fourth lookup table circuit to record, in a fourth lookup table, tasks scheduled to the on-chip task execution unit.
  24. The method according to claim 23, wherein the method further comprises using the scheduling circuit to perform:
    before scheduling a task to be executed to the on-chip task execution unit, interacting with the fourth lookup table circuit to query and confirm that the tasks recorded in the fourth lookup table differ from the task currently to be scheduled to the on-chip task execution unit.
  25. The method according to claim 23, wherein the method further comprises using the scheduling circuit to perform:
    in response to receiving, from the on-chip task execution unit, a notification of completed or suspended execution of a task, triggering the fourth lookup table circuit to remove the task that has completed or suspended execution from the fourth lookup table.
  26. The method according to any one of claims 23-25, wherein the plurality of tasks are stored in the form of one or more queues in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table, and the combination of the queue identifier of each queue and the task identifier of each task indicates the address of the task in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table.
  27. The method according to claim 26, wherein an entry record in the first lookup table, the second lookup table, the third lookup table and/or the fourth lookup table includes one or more of the following:
    a flag bit indicating whether the task is valid;
    a task initialization flag bit;
    a wake-up flag bit indicating whether the task has been woken up;
    a data flag bit indicating whether data exists in an inter-chip buffer; and
    a space flag bit indicating whether storage space exists in the inter-chip buffer.
  28. The method according to claim 27, wherein the task is an inter-chip communication task for communication between the artificial intelligence processor chip and another artificial intelligence processor chip.
  29. A computer-readable storage medium having stored thereon computer program instructions for scheduling tasks which, when executed by a processor, implement the method according to any one of claims 16-28.
PCT/CN2023/083494 2022-08-30 2023-03-23 Method for scheduling tasks, and related product thereof WO2024045580A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211044067.6A CN117667328A (en) 2022-08-30 2022-08-30 Method for scheduling tasks and related products
CN202211044067.6 2022-08-30

Publications (1)

Publication Number Publication Date
WO2024045580A1 true WO2024045580A1 (en) 2024-03-07

Family

ID=90084987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083494 WO2024045580A1 (en) 2022-08-30 2023-03-23 Method for scheduling tasks, and related product thereof

Country Status (2)

Country Link
CN (1) CN117667328A (en)
WO (1) WO2024045580A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7028299B1 (en) * 2000-06-30 2006-04-11 Intel Corporation Task-based multiprocessing system
US20190258511A1 (en) * 2016-09-20 2019-08-22 Ramon Chips Ltd. Scheduling of tasks in a multiprocessor device
US20210073169A1 (en) * 2019-09-09 2021-03-11 Shanghai Denglin Technologies Co., Ltd. On-chip heterogeneous ai processor
CN114237717A (en) * 2021-12-31 2022-03-25 合肥工业大学 Multi-core heterogeneous processor on-chip temporary storage dynamic scheduling manager


Also Published As

Publication number Publication date
CN117667328A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
US10120728B2 (en) Graphical processing unit (GPU) implementing a plurality of virtual GPUs
CN110347635B (en) Heterogeneous multi-core microprocessor based on multilayer bus
US20190324792A1 (en) Task processor
US11880330B2 (en) Network-on-chip data processing method and device
US20230367722A1 (en) Data processing device and method, and related products
US7386642B2 (en) IO direct memory access system and method
CN114827048B (en) Dynamic configurable high-performance queue scheduling method, system, processor and protocol
WO2023076591A1 (en) Hardware management of direct memory access commands
CN109062857A (en) A kind of new type of messages controller and its communication means that can be communicated between realization of High Speed multiprocessor
WO2024045580A1 (en) Method for scheduling tasks, and related product thereof
US10884477B2 (en) Coordinating accesses of shared resources by clients in a computing device
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
WO2023016382A1 (en) Method for system on chip, and related product thereof
US6708259B1 (en) Programmable wake up of memory transfer controllers in a memory transfer engine
CN112948001A (en) Method for setting tensor hardware configuration, readable storage medium and device
WO2024046018A1 (en) Instruction control method, data caching method, and related products
WO2023236479A1 (en) Method for executing task scheduling and related products thereof
TWI823655B (en) Task processing system and task processing method applicable to intelligent processing unit
WO2024012280A1 (en) Method and device for task scheduling, board, and computer-readable storage medium
WO2023241478A1 (en) Artificial intelligence accelerator pipeline performance analysis method and apparatus
WO2023016383A1 (en) Method for cache memory and related products
CN111210011B (en) Data processing device and related product
WO2023231768A1 (en) Multi-core processor and related inter-core communication method
CN118210598A (en) Method for executing task and related product thereof
CN117908959A (en) Method for performing atomic operations and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23858622

Country of ref document: EP

Kind code of ref document: A1