CN115964155B - On-chip data processing hardware, on-chip data processing method and AI platform - Google Patents

On-chip data processing hardware, on-chip data processing method and AI platform

Info

Publication number
CN115964155B
Authority
CN
China
Legal status
Active
Application number
CN202310251374.XA
Other languages
Chinese (zh)
Other versions
CN115964155A
Inventor
乔文
孙力军
张亚林
Current Assignee
Suiyuan Intelligent Technology Chengdu Co ltd
Original Assignee
Suiyuan Intelligent Technology Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Suiyuan Intelligent Technology Chengdu Co ltd filed Critical Suiyuan Intelligent Technology Chengdu Co ltd
Priority to CN202310251374.XA priority Critical patent/CN115964155B/en
Publication of CN115964155A publication Critical patent/CN115964155A/en
Application granted granted Critical
Publication of CN115964155B publication Critical patent/CN115964155B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses on-chip data processing hardware, an on-chip data processing method and an AI platform. The on-chip data processing hardware comprises a plurality of virtual channel modules, an arbitration module and a plurality of thread units; the arbitration module is connected to each virtual channel module and each thread unit. When a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to the source address and/or the target address in the task and the number of splits, and sends the subtasks to the arbitration module. When the arbitration module receives a target subtask, it determines a target thread unit and sends the subtask to that unit. Each thread unit executes the target subtasks it receives. The technical scheme of the embodiments can improve the utilization of the thread units and the bandwidth of the data processing module in specific scenarios, and can accelerate the completion of a single data processing task.

Description

On-chip data processing hardware, on-chip data processing method and AI platform
Technical Field
The embodiments of the invention relate to computer hardware technology, in particular to chip technology, and especially to on-chip data processing hardware, an on-chip data processing method and an AI platform.
Background
Within an AI (Artificial Intelligence) platform, common data processing tasks mainly include linear operation tasks (e.g., data handling or data filling) and multi-dimensional matrix data operation tasks (e.g., matrix filling or matrix slicing). These tasks are mainly executed by a data processing module implemented in hardware within the AI platform.
The data processing module typically works in a multi-task, multi-thread mode: the software issues a number of data processing tasks to the module, and different tasks are dispatched to different thread units for execution. Each thread unit independently uses its own set of read and write ports, the ports of different thread units do not interfere with one another, and each thread unit executes the tasks issued to it serially.
In the course of realizing the invention, the inventors found the following technical defects in the prior art. When the number of data processing tasks is small, idle thread units are not fully utilized, which both wastes resources and loses data bandwidth. Moreover, when a data processing task needs to be executed quickly, the existing data processing module still issues it to a single thread unit, which then transfers all data through one set of read-write ports, so the task cannot truly be executed quickly.
Disclosure of Invention
The embodiments of the invention provide on-chip data processing hardware, an on-chip data processing method and an AI platform, so as to improve the utilization of the thread units and the bandwidth in the data processing module in specific scenarios and to accelerate the completion of a single data processing task.
In a first aspect, an embodiment of the present invention provides on-chip data processing hardware, including: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is configured to, when receiving a splittable target data processing task issued by the software, split the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and send the subtasks to the arbitration module;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
and the thread unit is used for executing the received target subtasks.
In a second aspect, an embodiment of the present invention further provides an on-chip data processing method, which is performed by a virtual channel module in on-chip data processing hardware according to any one of the embodiments of the present invention, where the method includes:
when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
In a third aspect, an embodiment of the present invention further provides an AI platform, including on-chip data processing hardware according to any one of the embodiments of the present invention.
In the above on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and sends the subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting a task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks.
Drawings
FIG. 1a is a block diagram illustrating an implementation of an on-chip data processing module according to the prior art;
FIG. 1b is a block diagram of an implementation of on-chip data processing hardware in accordance with a first embodiment of the present invention;
FIG. 2 is a block diagram of another implementation of on-chip data processing hardware in accordance with a second embodiment of the present invention;
FIG. 3 is a flowchart of an implementation of an on-chip data processing method in accordance with a third embodiment of the present invention;
FIG. 4 is a flowchart of an implementation of an on-chip data processing method according to a fourth embodiment of the present invention;
FIG. 5 is a flowchart of an on-chip data processing method according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an AI chip in a sixth embodiment of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
To facilitate understanding of the embodiments of the present invention, the implementation principle of a prior-art on-chip data processing module is first briefly described. An implementation block diagram of such a module is shown in fig. 1a. As shown in fig. 1a, the on-chip data processing module mainly includes three components in hardware form: virtual channel modules, an arbitration module and thread units.
When the software issues a plurality of data processing tasks to different virtual channel modules, the arbitration module issues different data processing tasks to different thread units according to a certain rule.
As shown in fig. 1a, the virtual channel modules, the arbitration module and the thread units all distribute and process work in units of whole data processing tasks. Consequently, when the number of data processing tasks issued by the software is smaller than the number of thread units, some thread units stay idle and cannot be fully utilized. Moreover, even if a task issued by the software (for example, the diagonally hatched rectangular task) is identified as one that must be executed quickly, it can only be allocated to thread unit 1, which executes it alone through its exclusive data read/write ports; the task cannot truly be executed quickly.
Based on the above, the embodiments of the present invention propose a task-splitting scheme implemented entirely in hardware. When there are not enough data processing tasks, idle thread units can be put to use, improving thread-unit utilization and data bandwidth. When a data processing task needs to be executed quickly, it can be distributed to several thread units for execution, with data transferred through the read-write ports of each of those units, so that the task is genuinely executed quickly.
Example 1
FIG. 1b is a block diagram illustrating an implementation of on-chip data processing hardware according to an embodiment of the present invention. As shown in fig. 1b, the on-chip data processing hardware may specifically include: a plurality of virtual channel modules 110, an arbitration module 120, and a plurality of thread units 130.
Wherein the arbitration module 120 is respectively connected to the plurality of virtual channel modules 110 and the plurality of thread units 130.
In this embodiment, the virtual channel modules 110, the arbitration module 120 and the thread units 130 are all pure hardware components; for example, each may be a hardware circuit built from arrays of logic gates.
The virtual channel module 110 is configured to, when receiving a splittable target data processing task issued by the software, split the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and send the subtasks to the arbitration module 120.
Wherein each virtual channel module 110 (virtual channel module 1, virtual channel module 2, …, virtual channel module n shown in fig. 1 b) receives a data processing task issued by the software.
The software can be understood as the software system installed on the chip (which may also be understood as the AI platform) on which the on-chip data processing hardware is configured; it is mainly used for scheduling the execution of various data processing tasks. In this embodiment, the software may divide the overall set of data processing tasks into ordinary data processing tasks that are not splittable and target data processing tasks that are splittable.
The task types of a data processing task may include linear operation tasks and multidimensional matrix data operation tasks. Specifically, a linear operation task may be a data handling task or a data filling task, and a multidimensional matrix data operation task may be a matrix filling task (padding), a matrix slicing task (slice), a matrix de-slicing task (de-slice), a matrix reshaping task (reshape), and the like. Each data processing task may include both a source address and a target address, or only a target address.
A data processing task that includes both a source address and a target address works as follows: one or more pieces of source data are read from the source address and processed according to the task type, and the one or more processing results are stored to the target address. A data processing task that includes only a target address processes preset or fixed source data according to the task type and stores the one or more processing results to the target address.
In this embodiment, an ordinary data processing task that is not splittable may be processed according to the prior-art scheme, whereas a target data processing task must be split into a plurality of target subtasks according to at least one of its source address and target address and the number of splits.
Whether a data processing task is splittable can be set dynamically by the software according to the actual application scenario. For example, if the software detects at some moment that too few data processing tasks are being generated in real time to fully occupy the thread units, it may mark every currently generated data processing task as a splittable target data processing task. Likewise, when the software detects that a newly generated data processing task has a high priority and must be executed quickly, it may mark that task as a splittable target data processing task.
The number of splits can likewise be set dynamically by the software according to the actual application scenario, and different target data processing tasks may use the same or different split counts. After the software sets the number of splits for a target data processing task, it adds that number to the task, so that the virtual channel module 110 can split the task into that many target subtasks.
Depending on the task type, the virtual channel module 110 may split a target data processing task by splitting only the source address, only the target address, or both addresses at the same time; this embodiment does not limit the choice. One target data processing task is split into as many target subtasks as the number of splits, and the virtual channel module 110 then sends each of the resulting subtasks to the arbitration module 120.
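As a minimal software sketch of this splitting rule (the patent describes hardware; this Python model and all its field names, such as "src", "dst", "size" and "splits", are illustrative assumptions, not taken from the patent), a copy-style task carrying both addresses might be split by offsetting the source and target addresses in equal strides:

```python
def split_task(task: dict) -> list[dict]:
    """Split one copy-style task into task["splits"] target subtasks,
    offsetting both the source and the target address."""
    n = task["splits"]
    chunk = task["size"] // n
    subtasks = []
    for i in range(n):
        # The last subtask absorbs the remainder of an uneven division.
        size = chunk if i < n - 1 else task["size"] - chunk * (n - 1)
        subtasks.append({
            "parent_id": task["id"],
            "src": task["src"] + i * chunk,  # source address + offset
            "dst": task["dst"] + i * chunk,  # target address + offset
            "size": size,
        })
    return subtasks
```

Each subtask is self-describing (its own addresses and size), so any thread unit can execute it without knowing about its siblings, which is what allows the subtasks of one task to run in parallel.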
Specifically, as shown in fig. 1b, after the diagonal-arrow task is split by virtual channel module 1, three diagonal-arrow subtasks are generated; after the grid-rectangle task is split by virtual channel module 2, three grid-rectangle subtasks are generated.
The arbitration module 120 is configured to determine a target thread unit according to the working state of each thread unit 130 when the target subtask is received, and send the target subtask to the target thread unit.
In this embodiment, the arbitration module 120 can distribute target subtasks to different thread units 130 in units of subtasks, so that multiple thread units 130 can execute in parallel the multiple target subtasks belonging to the same target data processing task.
The arbitration module 120 may decide which target subtask is allocated to which thread unit (the target thread unit) according to information such as the number of tasks currently allocated to each thread unit 130 and its current computing performance parameters.
Continuing the previous example, with a total of m = 3 thread units 130, the arbitration module 120 sends the three diagonal-arrow subtasks from virtual channel module 1 to thread units 1, 2 and 3, and likewise sends the three grid-rectangle subtasks from virtual channel module 2 to thread units 1, 2 and 3. As the implementation diagram in fig. 1b shows, even though the number of data processing tasks is smaller than the number of thread units, every thread unit 130 is fully occupied in each time slice, so the thread units 130 are fully utilized.
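A hedged sketch of the arbitration policy just described: the patent mentions both the number of allocated tasks and performance parameters as inputs; the model below, as a simplifying assumption, uses queue depth alone and picks the least-loaded thread unit.

```python
from collections import deque

class Arbiter:
    def __init__(self, num_threads: int):
        # One pending-subtask queue per thread unit.
        self.queues = [deque() for _ in range(num_threads)]

    def dispatch(self, subtask) -> int:
        """Send the subtask to the least-loaded thread unit; return its index."""
        target = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        self.queues[target].append(subtask)
        return target
```

With equal loads this degenerates to round-robin, which reproduces the fig. 1b example: three subtasks from each virtual channel module land on thread units 1, 2 and 3 in turn.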
The thread unit 130 is configured to execute the received target subtasks.
In this embodiment, each thread unit 130 (thread unit 1, thread unit 2, …, thread unit m shown in FIG. 1 b) executes the corresponding target subtask upon receiving the target subtask sent by the arbitration module 120.
In a specific example, for a diagonal-arrow subtask, thread unit 1 fetches the corresponding source data through data read port 1 from the source address matched with the subtask, processes the data in the manner matched with the subtask, and writes the result data through data write port 1 to the target address matched with the subtask, thereby executing work in units of target subtasks.
In the above on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and sends the subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting a task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks.
Example two
Fig. 2 is a diagram illustrating an implementation structure of on-chip data processing hardware according to a second embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments. After a data processing module finishes a data processing task, the software must be notified that the task is complete; but once a virtual channel module splits a target data processing task, the task is executed as a plurality of target subtasks. A split task counting module is therefore added to track the completion of all the subtasks split from a task: only after every target subtask has finished is the target data processing task complete, and only then can the software be notified.
Accordingly, in this embodiment, the on-chip data processing hardware further includes, in addition to the virtual channel module, the arbitration module, and the thread unit, a split task count module 210.
The virtual channel module is further configured to send the task identifier and the splitting number of times of the target data processing task to the splitting task counting module 210.
In this embodiment, every time the virtual channel module splits a splittable target data processing task into the matching number of target subtasks, it may send the task identifier and the number of splits of that task to the split task counting module 210 as counting trigger information. The thread unit is further configured to generate subtask completion information matching each target subtask it finishes, and to send that information to the split task counting module 210.
After finishing a target subtask, a thread unit may generate a subtask completion message carrying the task identifier of the target data processing task to which the subtask belongs, and send it to the split task counting module 210.
A piece of subtask completion information for target data processing task A indicates that one of the target subtasks split from task A has finished executing.
The splitting task counting module 210 is configured to update a completed count value of the target data processing task according to the subtask completion information, and report task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
In this embodiment, suppose the number of splits of the target data processing task whose identifier is AAA is 5; that is, task AAA was split into 5 target subtasks. When the split task counting module 210 has received 5 pieces of subtask completion information for task AAA in total, it can determine that task AAA has finished executing and report the task completion information of task AAA to the software.
Accordingly, in this embodiment the split task counting module 210 may maintain a separate counter for every target data processing task sent by every virtual channel module. On receiving subtask completion information from a thread unit, it increments the counter of the task to which that information belongs; when the counter reaches that task's number of splits, the task is judged complete and its task completion information can be reported to the software.
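The per-task counter just described can be sketched as follows (a software model only; the method names "register" and "subtask_done" are illustrative assumptions, not terms from the patent):

```python
class SplitTaskCounter:
    def __init__(self):
        self.expected = {}   # task identifier -> number of splits
        self.done = {}       # task identifier -> completed subtask count

    def register(self, task_id: str, splits: int) -> None:
        """Called with the trigger info a virtual channel module sends after splitting."""
        self.expected[task_id] = splits
        self.done[task_id] = 0

    def subtask_done(self, task_id: str) -> bool:
        """Called for each thread unit's completion message.
        Returns True when the whole task has finished."""
        self.done[task_id] += 1
        if self.done[task_id] == self.expected[task_id]:
            # In the hardware this is where the completion notification would be
            # raised to the software and to the originating virtual channel module.
            del self.expected[task_id], self.done[task_id]
            return True
        return False
```

For a task registered with 5 splits, only the fifth completion message triggers the report, matching the AAA example above.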
Based on the above embodiments, the splitting task counting module 210 may be further configured to:
and when the completed count value reaches the splitting times, sending task completion information of the target data processing task to a virtual channel module for splitting the target data processing task.
In this alternative embodiment, after all the target subtasks of a target data processing task have been executed, the split task counting module 210 not only notifies the software of the task completion information but may also notify the virtual channel module that split the task. The virtual channel module can then release all caches occupied by that task, and the freed storage space can receive a new data processing task issued by the software.
By adding the split task counting module to the on-chip data processing hardware, the above technical scheme effectively monitors whether a target data processing task has been executed in full on top of the split execution mechanism, further perfecting the function of the on-chip data processing hardware and meeting its practical application requirements.
Example III
Fig. 3 shows a flowchart of an on-chip data processing method in a third embodiment of the present invention. The embodiment is applicable to splitting data processing tasks entirely in hardware. The method is performed by a virtual channel module in the on-chip data processing hardware and specifically includes the following steps:
S310, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits.
The software can be understood as the software system installed on the chip (which may also be understood as the AI platform) on which the on-chip data processing hardware is configured; it is mainly used for scheduling the execution of various data processing tasks.
In this embodiment, if the virtual channel module receives a splittable target data processing task, it splits the task into a plurality of target subtasks; if it receives an ordinary, non-splittable data processing task, it forwards the task directly to the arbitration module, so that the arbitration module and the thread units distribute and execute it as one whole data processing task.
Whether a data processing task is splittable can be set dynamically by the software according to the actual application scenario. For example, if the software detects at some moment that too few data processing tasks are being generated in real time to fully occupy the thread units, it may mark every currently generated data processing task as a splittable target data processing task; likewise, when a newly generated data processing task has a high priority and must be executed quickly, it may be marked as splittable. Furthermore, after marking a data processing task as a splittable target data processing task, the software can add a split identifier and the number of splits to the task, so that the virtual channel module can recognize and distinguish it.
The number of splits can likewise be set dynamically by the software according to the actual application scenario, and different target data processing tasks may use the same or different split counts.
In an optional implementation of this embodiment, if the software detects at some moment that few data processing tasks are being generated in real time, it may set the number of splits of each target data processing task to the number of currently configured thread units, ensuring that every thread unit is fully used for each task. By way of example and not limitation, if 8 thread units are currently configured, the number of splits of each target data processing task may be set to 8.
In another optional implementation of this embodiment, after the software marks a high-priority data processing task that must be executed quickly as a target data processing task, it may obtain information such as the task's computation amount or its required maximum computation time and dynamically determine the number of splits from that information. For example, the larger the computation amount of the task and the longer the computation it would otherwise require, the larger the number of splits.
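The two policies above can be combined in one illustrative heuristic. The patent only says the split count may equal the thread-unit count (idle case) or grow with the task's computation amount; the exact formula and parameter names below ("per_unit_throughput" in particular) are assumptions for demonstration.

```python
def choose_splits(num_thread_units, compute_amount=None, per_unit_throughput=1024):
    """Pick a split count for one target data processing task."""
    if compute_amount is None:
        # Few tasks in flight: occupy every thread unit.
        return num_thread_units
    # High-priority task: more computation -> more splits, capped by unit count.
    needed = -(-compute_amount // per_unit_throughput)  # ceiling division
    return max(1, min(needed, num_thread_units))
```

So with 8 thread units an idle-time task is split 8 ways, while a priority task is split just enough to cover its computation amount, never beyond the available units.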
Correspondingly, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits may include the following steps.
When a data processing task issued by software is received, detecting whether the data processing task contains a splitting identification or not;
if yes, determining the data processing task to be a splittable target data processing task and extracting the number of splits from it; extracting at least one of the source address and the target address from the task according to its task type; and splitting the task into a plurality of target subtasks according to the extracted address or addresses and the number of splits.
Depending on the task type of the target data processing task, the virtual channel module may split the task by splitting only the source address, only the target address, or both at the same time; this embodiment does not limit the choice.
S320, sending the target subtasks to an arbitration module in the on-chip data processing hardware.
In the above technical scheme, when a splittable target data processing task issued by the software is received, the task is split into a plurality of target subtasks according to at least one of its source address and target address and the number of splits, and the subtasks are sent to the arbitration module in the on-chip data processing hardware. By splitting each data processing task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks, and when a particular task must be accelerated, it can be completed faster by having several thread units execute it jointly.
Example four
Fig. 4 is a flowchart of an on-chip data processing method according to a fourth embodiment of the present invention. On the basis of the above embodiments, this embodiment refines the operation of splitting the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, as follows: determining the source data amount and/or the target data amount corresponding to a single target subtask according to the target data processing task and the splitting times; determining a source address offset of each target subtask with respect to the source address according to the source data amount, and/or determining a target address offset of each target subtask with respect to the target address according to the target data amount; and splitting the target data processing task into a plurality of target subtasks according to the source address offset and/or the target address offset of each target subtask.
Accordingly, as illustrated in fig. 4, the method may specifically include:
S410, when a data processing task issued by the software is received, detecting whether the data processing task contains a split identifier: if yes, executing S420; otherwise, executing S480.
In this embodiment, if the data processing task contains a split identifier, this indicates that the software has determined that the task needs to be split before execution, and the virtual channel module may therefore determine the data processing task to be a splittable target data processing task. If the data processing task does not contain a split identifier, it can be executed directly by a thread unit and may be sent directly to the arbitration module in the on-chip data processing hardware.
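The two branches above can be modelled in software as follows. This is a hedged illustration only, not the patented hardware implementation; the function and parameter names (`dispatch`, `send_to_arbiter`, `splitter`, `split_times`) are hypothetical.

```python
# Hedged software model of the branch at S410: a task without a split
# identifier goes straight to the arbitration module (S480); a task
# with one is handed to a splitter first. All names are hypothetical.
def dispatch(task, send_to_arbiter, splitter):
    """Forward `task` (a dict) directly, or split it first when it
    carries a split identifier; returns the number of tasks sent."""
    if "split_times" not in task:      # no split identifier
        send_to_arbiter(task)          # execute directly on one unit
        return 1
    subtasks = splitter(task)          # splittable target task
    for sub in subtasks:               # forward every target subtask
        send_to_arbiter(sub)
    return len(subtasks)
```

A splitter that honours the splitting times would return that many subtasks, as detailed in the steps below.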
S420, determining the data processing task to be a splittable target data processing task, and extracting the splitting times from the target data processing task.
The splitting times can be understood as the number of target subtasks into which the target data processing task needs to be split.
S430, extracting at least one of a source address and a target address in the target data processing task according to the task type of the target data processing task.
It will be appreciated that different types of target data processing tasks may require different data processing operations. Accordingly, the information that needs to be extracted from the target data processing task when splitting it also differs by task type.
In an optional implementation manner of this embodiment, extracting at least one of the source address and the target address in the target data processing task according to the task type of the target data processing task may include:
if the target data processing task is determined to be the data handling task, extracting a source address and a target address in the target data processing task;
if the target data processing task is determined to be a data filling task, extracting a target address in the target data processing task;
and if the target data processing task is determined to be a multidimensional matrix processing task, extracting a source address and a target address in the target data processing task.
If it is determined that the target data processing task is a data handling task, the operation to be executed by the task is specifically to take the source address as the data acquisition starting point, acquire source data of a set data amount, and then take the target address as the data writing starting point and write that source data. When configuring a data handling task, the software configures the source address, the target address, and the amount of data to be handled.
If it is determined that the target data processing task is a data filling task, the operation to be executed by the task is specifically to take the target address as the data writing starting point and fill in a specified amount of a given data value. When configuring a data filling task, the software configures the target address, the fill data value, and the fill data amount.
A multidimensional matrix processing task is a processing task for a multidimensional matrix and may include: a matrix padding task (padding), a matrix slicing task (slice), a matrix de-slicing task (de-slice), a matrix reshaping task (reshape), and the like. The operation to be executed by each multidimensional matrix processing task is specifically to take the source address as the data acquisition starting point, acquire a first multidimensional matrix of a first shape, process the first multidimensional matrix to obtain a second multidimensional matrix of a second shape, and then take the target address as the data writing starting point and write out the second multidimensional matrix of the second shape. When configuring a multidimensional matrix processing task, the software configures the source address, the first shape of the first multidimensional matrix, the target address, and the second shape of the second multidimensional matrix.
In this embodiment, each of the above task types involves either a read of data from a segment of address space starting at the source address, or a write of data into a segment of address space starting at the target address. The read or write operation can therefore be divided into a plurality of subtasks, which in turn can be executed by different thread units using separate read and write data ports.
S440, determining the source data volume and/or the target data volume corresponding to the single target subtask according to the target data processing task and the splitting times.
As described above, the software writes into each target data processing task the total amount of data matching the source address, the total amount of data matching the target address, or description information from which those totals can be derived: for example, the amount of data to be handled in a data handling task, the fill data amount in a data filling task, or the first shape of the first multidimensional matrix and the second shape of the second multidimensional matrix in a multidimensional matrix processing task.
For example, if the first shape of the first multidimensional matrix is (a1, a2, a3, a4, a5), the total data amount corresponding to the first multidimensional matrix is: a1×a2×a3×a4×a5.
By acquiring the total amount of data matching the source address or the target address and combining it with the splitting times, the source data amount and/or the target data amount corresponding to a single target subtask can be determined.
In a specific example, if the amount of data to be handled in a data handling task is M and the splitting times are N, then the source data amount of a single target subtask of the data handling task is M/N and the target data amount is also M/N. If the fill data amount in a data filling task is X and the splitting times are N, the target data amount of a single target subtask of the data filling task is X/N.
Similarly, if the first shape of the first multidimensional matrix in a multidimensional matrix processing task gives a total data amount P1, the second shape of the second multidimensional matrix gives a total data amount P2, and the splitting times are N, then the source data amount of a single target subtask of the multidimensional matrix processing task is P1/N and its target data amount is P2/N.
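The per-subtask arithmetic above can be written out as a small helper. This is an explanatory sketch only: the `task_type` values and parameter names are chosen for illustration (the patent does not prescribe them), and the sketch assumes each total divides evenly by the splitting times.

```python
from math import prod  # product of a shape's dimensions

def per_subtask_amounts(task_type, n, *, size=None,
                        src_shape=None, dst_shape=None):
    """Return (source data amount, target data amount) for a single
    target subtask; None means the task type has no such amount."""
    if task_type == "handling":        # M/N for both source and target
        return size // n, size // n
    if task_type == "filling":         # only a target amount: X/N
        return None, size // n
    if task_type == "matrix":          # P1/N and P2/N from the shapes
        return prod(src_shape) // n, prod(dst_shape) // n
    raise ValueError(f"unknown task type: {task_type}")
```

For instance, a data handling task of 1024 bytes split 4 ways yields 256 bytes of source data and 256 bytes of target data per subtask.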
S450, determining the source address offset of each target subtask aiming at the source address according to the source data quantity, and/or determining the target address offset of each target subtask aiming at the target address according to the target data quantity.
In this embodiment, when dividing the target data processing task into a plurality of target subtasks, the source address and/or the target address of each target subtask must be specified. Given the source address and/or the target address of the target data processing task, the source address offset of each target subtask with respect to the source address, and/or its target address offset with respect to the target address, can then be determined in combination with the source data amount and the target data amount of each target subtask.
In an optional implementation manner of this embodiment, determining the source address offset of each target subtask for the source address according to the source data amount, and/or determining the target address offset of each target subtask for the target address according to the target data amount may include:
generating a task number of each target subtask according to the splitting times and a natural counting sequence;
determining the product of the task number of each target subtask and the source data volume as the source address offset of each target subtask for the source address; and/or
And determining the product of the task number of each target subtask and the target data volume as a target address offset of each target subtask for the target address.
Following the previous example, assuming the splitting times are N=3 for a data handling task, target subtask 0, target subtask 1, and target subtask 2 may be generated in natural counting order. The product of each subtask's task number and the per-subtask data amount M/N of the data handling task is then both its source address offset and its target address offset. That is, target subtask 0 has a source address offset of 0 and a target address offset of 0; target subtask 1 has a source address offset of M/3 and a target address offset of M/3; and target subtask 2 has a source address offset of 2M/3 and a target address offset of 2M/3.
S460, splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address.
In this embodiment, splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address may include:
determining the source address of each target subtask according to the source address offset of each target subtask for the source address; and/or determining the target address of each target subtask according to the target address offset of each target subtask for the target address;
and splitting the target data processing task into a plurality of target subtasks according to the source address and/or the target address of each target subtask.
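For the data handling case, S440 through S460 amount to the following sketch: task numbers in natural counting order, multiplied by the per-subtask data amount, give the address offsets, and adding each offset to the original source and target addresses yields the subtasks. Function and field names are illustrative, and the sketch assumes the total data amount divides evenly by the splitting times.

```python
def split_handling_task(src_addr, dst_addr, total, n):
    """Split a data handling task of `total` bytes into n subtasks."""
    chunk = total // n                    # source = target amount, M/N
    subtasks = []
    for k in range(n):                    # task numbers 0, 1, ..., n-1
        offset = k * chunk                # task number x data amount
        subtasks.append({"src": src_addr + offset,  # per-subtask source
                         "dst": dst_addr + offset,  # per-subtask target
                         "size": chunk})
    return subtasks
```

Each resulting subtask is self-describing (its own start addresses and data amount), which is what allows different thread units to execute them independently.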
S470, sending the plurality of target subtasks to an arbitration module in the on-chip data processing hardware.
S480, sending the data processing task directly to the arbitration module in the on-chip data processing hardware.
According to the technical scheme of this embodiment, upon receiving a splittable target data processing task issued by the software, the target data processing task is split into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, and the plurality of target subtasks are sent to the arbitration module in the on-chip data processing hardware. By splitting each data processing task into a plurality of subtasks distributed across a plurality of thread units, the utilization of the thread units and of the bandwidth can be substantially improved when data processing tasks are scarce, and a data processing task that needs accelerated processing can be completed faster by having a plurality of thread units execute it jointly.
Example five
Fig. 5 is a flowchart of an on-chip data processing method according to a fifth embodiment of the present invention. On the basis of the above embodiments, this embodiment further refines the operations performed after the virtual channel module sends the plurality of target subtasks to the arbitration module in the on-chip data processing hardware.
Accordingly, as shown in fig. 5, the method specifically may include:
S510, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task.
S520, the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
S530, sending the task identifier and the splitting times of the target data processing task to a splitting task counting module in the on-chip data processing hardware.
In this embodiment, after the on-chip data processing hardware completes a data processing task, the software needs to be notified that the task is complete. Once the virtual channel module splits a target data processing task, that task is executed as a plurality of target subtasks. A splitting task counting module is therefore provided to count the completion of all the split target subtasks: only after every target subtask has been executed is the target data processing task itself complete, at which point the software can be notified.
In this embodiment, each time the virtual channel module splits a splittable target data processing task into the plurality of target subtasks matching the splitting times, it may send the task identifier and the splitting times of that target data processing task to the splitting task counting module as split-task count trigger information.
The thread unit is also used for generating subtask completion information matching each target subtask whose execution is completed, and for sending the subtask completion information to the splitting task counting module.
Specifically, after completing the execution of a target subtask, the thread unit may generate a subtask completion message carrying the task identifier of the target data processing task to which the subtask belongs, and send it to the splitting task counting module. A piece of subtask completion information for target data processing task A indicates that one of the target subtasks split from task A has finished executing.
Specifically, the splitting task counting module is used for updating the completed count value of the target data processing task according to the subtask completion information, and reporting the task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
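The counting behaviour just described can be modelled as follows. The class and method names are assumptions made for illustration, not the hardware interface; the model keeps one completed count per task identifier and compares it against the splitting times.

```python
class SplittingTaskCounter:
    """Toy model of the splitting task counting module."""

    def __init__(self):
        self.expected = {}   # task id -> splitting times
        self.done = {}       # task id -> completed count

    def register(self, task_id, split_times):
        """Trigger information from the virtual channel module."""
        self.expected[task_id] = split_times
        self.done[task_id] = 0

    def report_subtask_done(self, task_id):
        """Called once per completed subtask; returns True when the
        whole target data processing task is complete, which is the
        point at which the software (and the virtual channel module
        that split the task) would be notified."""
        self.done[task_id] += 1
        if self.done[task_id] == self.expected[task_id]:
            del self.expected[task_id]
            del self.done[task_id]       # free the counter entry
            return True
        return False
```

For a task registered with splitting times 3, the first two completion reports return False and the third returns True.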
S540, when task completion information of the target data processing task sent by the splitting task counting module is received, clearing the locally stored target data processing task.
Specifically, when all the target subtasks of a target data processing task have been executed, the splitting task counting module can report task completion information of the target data processing task to the software, and can further notify the virtual channel module that split the task. At this point, the virtual channel module may release all cache space occupied by the target data processing task, and the released storage space can receive a new data processing task issued by the software.
According to the technical scheme of this embodiment, by adding a splitting task counting module to the on-chip data processing hardware, the complete execution of a split target data processing task can be effectively monitored, further improving the functionality of the on-chip data processing hardware and meeting its practical application requirements.
Example six
Fig. 6 is a schematic structural diagram of an AI chip according to a sixth embodiment of the present invention. As shown in fig. 6, the AI platform is provided with an AI chip, the AI chip includes the on-chip data processing hardware 610 according to any embodiment of the present invention, and the on-chip data processing hardware 610 operates in cooperation with the software 620 installed on the AI chip to perform the on-chip data processing method according to any embodiment of the present invention.
Specifically, the on-chip data processing hardware 610 may include: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is used for splitting the target data processing task into a plurality of target subtasks and sending the target subtasks to the arbitration module according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software 620;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
And the thread unit is used for executing the received target subtasks.
The on-chip data processing method performed by a virtual channel module in the on-chip data processing hardware comprises:
splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
In the on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, and sends the target subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting each task into a plurality of subtasks distributed across a plurality of thread units, the utilization of the thread units and of the bandwidth can be substantially improved when data processing tasks are scarce.
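The arbitration module's dispatch decision reduces to finding an idle thread unit for each incoming subtask. The sketch below uses a deliberately simple first-idle policy; the description only requires that the choice follow the working state of each thread unit, not this particular policy, and the function name is an assumption.

```python
def pick_thread_unit(busy):
    """Given one busy flag per thread unit, return the index of the
    first idle unit, or None if every unit is currently busy."""
    for i, is_busy in enumerate(busy):
        if not is_busy:
            return i
    return None        # caller would stall until a unit frees up
```

With working states [busy, idle, busy], the subtask would be dispatched to thread unit 1.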
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (12)

1. An on-chip data processing system, comprising: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is used for splitting the target data processing task into a plurality of target subtasks and sending the target subtasks to the arbitration module according to at least one of a source address and a target address in the target data processing task and splitting times when receiving a splittable target data processing task issued by the software;
wherein each data processing task contains either both a source address and a target address or only a target address; a data processing task containing both a source address and a target address describes acquiring one or more items of source data from the source address, performing task processing according to the task type, and storing one or more items of the resulting data processing results to the target address; a data processing task containing only a target address describes acquiring preset or fixed source data, performing task processing according to the task type, and storing one or more items of the resulting data processing results to the target address;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
and the thread unit is used for executing the received target subtasks.
2. The on-chip data processing system of claim 1, further comprising a split task counting module;
the virtual channel module is also used for sending the task identification and the splitting times of the target data processing task to the splitting task counting module;
The thread unit is also used for generating subtask completion information matched with the target subtask of which execution is completed and sending the subtask completion information to the splitting task counting module;
and the splitting task counting module is used for updating the completed count value of the target data processing task according to the subtask completion information, and reporting the task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
3. The on-chip data processing system of claim 2, wherein the split task counting module is further configured to:
and when the completed count value reaches the splitting times, sending task completion information of the target data processing task to a virtual channel module for splitting the target data processing task.
4. An on-chip data processing method performed by a virtual channel module in an on-chip data processing system according to any of claims 1-3, the method comprising:
splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing system.
5. The method of claim 4, wherein upon receiving the software-issued detachable target data processing task, splitting the target data processing task into a plurality of target sub-tasks based on the number of splits and at least one of a source address and a target address in the target data processing task, comprising:
when a data processing task issued by software is received, detecting whether the data processing task contains a split identifier;
if yes, determining the data processing task to be a splittable target data processing task, and extracting the splitting times from the target data processing task;
extracting at least one of a source address and a target address in the target data processing task according to the task type of the target data processing task;
and splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and the splitting times.
6. The method of claim 5, wherein extracting at least one of a source address and a destination address in the target data processing task based on a task type of the target data processing task, comprises:
If the target data processing task is determined to be the data handling task, extracting a source address and a target address in the target data processing task;
if the target data processing task is determined to be a data filling task, extracting a target address in the target data processing task;
and if the target data processing task is determined to be a multidimensional matrix processing task, extracting a source address and a target address in the target data processing task.
7. The method of claim 5, wherein splitting the target data processing task into a plurality of target sub-tasks based on the number of splits and at least one of a source address and a target address in the target data processing task, comprising:
determining the source data volume and/or the target data volume corresponding to a single target subtask according to the target data processing task and the splitting times;
determining a source address offset of each target subtask for the source address according to the source data amount, and/or determining a target address offset of each target subtask for the target address according to the target data amount;
splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address.
8. The method of claim 7, wherein determining a source address offset for each target subtask for the source address based on a source data amount and/or determining a target address offset for each target subtask for the target address based on a target data amount comprises:
generating a task number of each target subtask according to the splitting times and a natural counting sequence;
determining the product of the task number of each target subtask and the source data volume as the source address offset of each target subtask for the source address; and/or
And determining the product of the task number of each target subtask and the target data volume as a target address offset of each target subtask for the target address.
9. The method of claim 7, wherein splitting the target data processing task into a plurality of target subtasks according to a source address offset for the source address for each target subtask and/or a target address offset for the target address for each target subtask, comprises:
determining the source address of each target subtask according to the source address offset of each target subtask for the source address; and/or determining the target address of each target subtask according to the target address offset of each target subtask for the target address;
And splitting the target data processing task into a plurality of target subtasks according to the source address and/or the target address of each target subtask.
10. The method of claim 4, further comprising, after sending the plurality of target subtasks to the arbitration module in the on-chip data processing system:
and sending the task identifier and the splitting times of the target data processing task to a splitting task counting module in the on-chip data processing system.
11. The method of claim 10, further comprising, after sending the task identifier and the splitting times of the target data processing task to the splitting task counting module in the on-chip data processing system:
and when task completion information of the target data processing task sent by the splitting task counting module is received, the target data processing task stored locally is cleared.
12. An AI platform comprising an on-chip data processing system according to any of claims 1-3.
CN202310251374.XA 2023-03-16 2023-03-16 On-chip data processing hardware, on-chip data processing method and AI platform Active CN115964155B (en)


Publications (2)

Publication Number Publication Date
CN115964155A CN115964155A (en) 2023-04-14
CN115964155B true CN115964155B (en) 2023-05-30



Also Published As

Publication number Publication date
CN115964155A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
US8250164B2 (en) Query performance data on parallel computer system having compute nodes
CN108564470B (en) Transaction distribution method for parallel building blocks in block chain
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
CN108647104B (en) Request processing method, server and computer readable storage medium
US8982884B2 (en) Serial replication of multicast packets
CN101923490B (en) Job scheduling apparatus and job scheduling method
US8463928B2 (en) Efficient multiple filter packet statistics generation
WO2012037760A1 (en) Method, server and system for improving alarm processing efficiency
US20110173287A1 (en) Preventing messaging queue deadlocks in a dma environment
CN103986585A (en) Message preprocessing method and device
CN115964155B (en) On-chip data processing hardware, on-chip data processing method and AI platform
CN110764705B (en) Data reading and writing method, device, equipment and storage medium
CN103299298A (en) Service processing method and system
CN110502337B (en) Optimization system for shuffling stage in Hadoop MapReduce
CN111831408A (en) Asynchronous task processing method and device, electronic equipment and medium
CN111225063A (en) Data exchange system and method for static distributed computing architecture
CN112035460A (en) Identification distribution method, device, equipment and storage medium
Birke et al. Meeting latency target in transient burst: A case on spark streaming
CN111538604B (en) Distributed task processing system
US20040064580A1 (en) Thread efficiency for a multi-threaded network processor
CN103593606A (en) Contextual information managing method and system
CN112417015A (en) Data distribution method and device, storage medium and electronic device
CN112541038A (en) Time series data management method, system, computing device and storage medium
CN114637594A (en) Multi-core processing device, task allocation method, device and storage medium
US8428075B1 (en) System and method for efficient shared buffer management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant