CN115964155B - On-chip data processing hardware, on-chip data processing method and AI platform - Google Patents

On-chip data processing hardware, on-chip data processing method and AI platform

Info

Publication number
CN115964155B
Authority
CN
China
Legal status
Active
Application number
CN202310251374.XA
Other languages
Chinese (zh)
Other versions
CN115964155A
Inventor
乔文
孙力军
张亚林
Current Assignee
Suiyuan Intelligent Technology Chengdu Co ltd
Original Assignee
Suiyuan Intelligent Technology Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Suiyuan Intelligent Technology Chengdu Co ltd filed Critical Suiyuan Intelligent Technology Chengdu Co ltd
Priority to CN202310251374.XA priority Critical patent/CN115964155B/en
Publication of CN115964155A publication Critical patent/CN115964155A/en
Application granted granted Critical
Publication of CN115964155B publication Critical patent/CN115964155B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses on-chip data processing hardware, an on-chip data processing method and an AI platform. The on-chip data processing hardware comprises a plurality of virtual channel modules, an arbitration module and a plurality of thread units; the arbitration module is connected to each virtual channel module and each thread unit. When a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to the source address and/or the target address in the task and the number of splits, and sends the subtasks to the arbitration module. When the arbitration module receives a target subtask, it determines a target thread unit and sends the subtask to that unit. Each thread unit executes the target subtasks it receives. The technical scheme of the embodiments can improve the utilization of the thread units and the bandwidth of the data processing module in specific scenarios, and can accelerate the completion of a single data processing task.

Description

On-chip data processing hardware, on-chip data processing method and AI platform
Technical Field
The embodiments of the invention relate to computer hardware technology, in particular to chip technology, and especially to on-chip data processing hardware, an on-chip data processing method and an AI platform.
Background
Within an AI (Artificial Intelligence) platform, common data processing tasks mainly include linear operation tasks (e.g., data handling or data filling) and multi-dimensional matrix data operation tasks (e.g., matrix filling or matrix slicing). These tasks are mainly executed by a data processing module implemented in hardware within the AI platform.
The data processing module typically works in a multi-task, multi-thread mode: the software issues a number of data processing tasks to the module, and different tasks are dispatched to different thread units for execution. Each thread unit independently uses its own set of read and write ports, the ports of different thread units do not interfere with one another, and each thread unit executes the tasks issued to it serially.
In the course of realizing the invention, the inventors found the following technical defects in the prior art. When the number of data processing tasks is small, idle thread units are not fully utilized, which both wastes resources and loses data bandwidth. Moreover, when a data processing task needs to be executed quickly, the existing data processing module still issues it to a single thread unit, which then transfers all data through one set of read-write ports, so the task cannot truly be executed quickly.
Disclosure of Invention
The embodiments of the invention provide on-chip data processing hardware, an on-chip data processing method and an AI platform, so as to improve the utilization of the thread units and the bandwidth in the data processing module in specific scenarios and to accelerate the completion of a single data processing task.
In a first aspect, an embodiment of the present invention provides on-chip data processing hardware, including: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is configured to, when receiving a splittable target data processing task issued by the software, split the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and send the subtasks to the arbitration module;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
and the thread unit is used for executing the received target subtasks.
In a second aspect, an embodiment of the present invention further provides an on-chip data processing method, which is performed by a virtual channel module in on-chip data processing hardware according to any one of the embodiments of the present invention, where the method includes:
when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
In a third aspect, an embodiment of the present invention further provides an AI platform, including on-chip data processing hardware according to any one of the embodiments of the present invention.
In the above on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and sends the subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting a task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks.
Drawings
FIG. 1a is a block diagram illustrating an implementation of an on-chip data processing module according to the prior art;
FIG. 1b is a block diagram of an implementation of on-chip data processing hardware in accordance with a first embodiment of the present invention;
FIG. 2 is a block diagram of another implementation of on-chip data processing hardware in accordance with a second embodiment of the present invention;
FIG. 3 is a flowchart of an implementation of an on-chip data processing method in accordance with a third embodiment of the present invention;
FIG. 4 is a flowchart of an implementation of an on-chip data processing method according to a fourth embodiment of the present invention;
FIG. 5 is a flowchart of an on-chip data processing method according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an AI chip in a sixth embodiment of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
To facilitate understanding of the embodiments of the present invention, the implementation principle of a prior-art on-chip data processing module is first briefly described. An implementation block diagram of such a module is shown in fig. 1a. As shown in fig. 1a, the on-chip data processing module mainly includes three components in hardware form: virtual channel modules, an arbitration module and thread units.
When the software issues a plurality of data processing tasks to different virtual channel modules, the arbitration module issues different data processing tasks to different thread units according to a certain rule.
As shown in fig. 1a, the virtual channel modules, the arbitration module and the thread units all distribute and process work in units of whole data processing tasks. Consequently, when the number of data processing tasks issued by the software is smaller than the number of thread units, some thread units stay idle and cannot be fully utilized. Moreover, even if a task issued by the software (for example, the diagonally hatched rectangular task) is identified as one that must be executed quickly, it can only be allocated to thread unit 1, which executes it alone through its exclusive data read/write ports; the task cannot truly be executed quickly.
Based on the above, the embodiments of the present invention propose a task-splitting scheme implemented entirely in hardware. When there are not enough data processing tasks, idle thread units can be put to use, improving thread-unit utilization and data bandwidth. When a data processing task needs to be executed quickly, it can be distributed to several thread units for execution, with data transferred through the read-write ports of each of those units, so that the task is genuinely executed quickly.
Example 1
FIG. 1b is a block diagram illustrating an implementation of on-chip data processing hardware according to an embodiment of the present invention. As shown in fig. 1b, the on-chip data processing hardware may specifically include: a plurality of virtual channel modules 110, an arbitration module 120, and a plurality of thread units 130.
Wherein the arbitration module 120 is respectively connected to the plurality of virtual channel modules 110 and the plurality of thread units 130.
In this embodiment, the virtual channel modules 110, the arbitration module 120 and the thread units 130 are all pure hardware components; for example, each may be a hardware circuit built from arrays of logic gates.
The virtual channel module 110 is configured to, when receiving a splittable target data processing task issued by the software, split the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and send the subtasks to the arbitration module 120.
Wherein each virtual channel module 110 (virtual channel module 1, virtual channel module 2, …, virtual channel module n shown in fig. 1 b) receives a data processing task issued by the software.
The software can be understood as the software system installed on the chip (which may also be understood as the AI platform) on which the on-chip data processing hardware is configured; it is mainly used for scheduling the execution of various data processing tasks. In this embodiment, the software may divide the overall set of data processing tasks into ordinary data processing tasks that are not splittable and target data processing tasks that are splittable.
The task types of a data processing task may include linear operation tasks and multidimensional matrix data operation tasks. Specifically, a linear operation task may be a data handling task or a data filling task, and a multidimensional matrix data operation task may be a matrix filling task (padding), a matrix slicing task (slice), a matrix de-slicing task (de-slice), a matrix reshaping task (reshape), and the like. Each data processing task may include both a source address and a target address, or only a target address.
A data processing task that includes both a source address and a target address works as follows: one or more pieces of source data are read from the source address and processed according to the task type, and the one or more processing results are stored to the target address. A data processing task that includes only a target address processes preset or fixed source data according to the task type and stores the one or more processing results to the target address.
In this embodiment, an ordinary data processing task that is not splittable may be processed according to the prior-art scheme, whereas a target data processing task must be split into a plurality of target subtasks according to at least one of its source address and target address and the number of splits.
Whether a data processing task is splittable can be set dynamically by the software according to the actual application scenario. For example, if the software detects at some moment that too few data processing tasks are being generated in real time to fully occupy the thread units, it may mark every currently generated data processing task as a splittable target data processing task. Likewise, when the software detects that a newly generated data processing task has a high priority and must be executed quickly, it may mark that task as a splittable target data processing task.
The number of splits can likewise be set dynamically by the software according to the actual application scenario, and different target data processing tasks may use the same or different split counts. After the software sets the number of splits for a target data processing task, it adds that number to the task, so that the virtual channel module 110 can split the task into that many target subtasks.
Depending on the task type, the virtual channel module 110 may split a target data processing task by splitting only the source address, only the target address, or both addresses at the same time; this embodiment does not limit the choice. One target data processing task is split into as many target subtasks as the number of splits, and the virtual channel module 110 then sends each of the resulting subtasks to the arbitration module 120.
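As a minimal software sketch of this splitting rule (the patent describes hardware; this Python model and all its field names, such as "src", "dst", "size" and "splits", are illustrative assumptions, not taken from the patent), a copy-style task carrying both addresses might be split by offsetting the source and target addresses in equal strides:

```python
def split_task(task: dict) -> list[dict]:
    """Split one copy-style task into task["splits"] target subtasks,
    offsetting both the source and the target address."""
    n = task["splits"]
    chunk = task["size"] // n
    subtasks = []
    for i in range(n):
        # The last subtask absorbs the remainder of an uneven division.
        size = chunk if i < n - 1 else task["size"] - chunk * (n - 1)
        subtasks.append({
            "parent_id": task["id"],
            "src": task["src"] + i * chunk,  # source address + offset
            "dst": task["dst"] + i * chunk,  # target address + offset
            "size": size,
        })
    return subtasks
```

Each subtask is self-describing (its own addresses and size), so any thread unit can execute it without knowing about its siblings, which is what allows the subtasks of one task to run in parallel.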
Specifically, as shown in fig. 1b, after the diagonal-arrow task is split by virtual channel module 1, three diagonal-arrow subtasks are generated; after the grid-rectangle task is split by virtual channel module 2, three grid-rectangle subtasks are generated.
The arbitration module 120 is configured to determine a target thread unit according to the working state of each thread unit 130 when the target subtask is received, and send the target subtask to the target thread unit.
In this embodiment, the arbitration module 120 can distribute target subtasks to different thread units 130 in units of subtasks, so that multiple thread units 130 can execute in parallel the multiple target subtasks belonging to the same target data processing task.
The arbitration module 120 may decide which target subtask is allocated to which thread unit (the target thread unit) according to information such as the number of tasks currently allocated to each thread unit 130 and its current computing performance parameters.
Continuing the previous example, with a total of m = 3 thread units 130, the arbitration module 120 sends the three diagonal-arrow subtasks from virtual channel module 1 to thread units 1, 2 and 3, and likewise sends the three grid-rectangle subtasks from virtual channel module 2 to thread units 1, 2 and 3. As the implementation diagram in fig. 1b shows, even though the number of data processing tasks is smaller than the number of thread units, every thread unit 130 is fully occupied in each time slice, so the thread units 130 are fully utilized.
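A hedged sketch of the arbitration policy just described: the patent mentions both the number of allocated tasks and performance parameters as inputs; the model below, as a simplifying assumption, uses queue depth alone and picks the least-loaded thread unit.

```python
from collections import deque

class Arbiter:
    def __init__(self, num_threads: int):
        # One pending-subtask queue per thread unit.
        self.queues = [deque() for _ in range(num_threads)]

    def dispatch(self, subtask) -> int:
        """Send the subtask to the least-loaded thread unit; return its index."""
        target = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        self.queues[target].append(subtask)
        return target
```

With equal loads this degenerates to round-robin, which reproduces the fig. 1b example: three subtasks from each virtual channel module land on thread units 1, 2 and 3 in turn.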
The thread unit 130 is configured to execute the received target subtasks.
In this embodiment, each thread unit 130 (thread unit 1, thread unit 2, …, thread unit m shown in FIG. 1 b) executes the corresponding target subtask upon receiving the target subtask sent by the arbitration module 120.
In a specific example, for a diagonal-arrow subtask, thread unit 1 fetches the corresponding source data through data read port 1 from the source address matched with the subtask, processes the data in the manner matched with the subtask, and writes the result data through data write port 1 to the target address matched with the subtask, thereby executing work in units of target subtasks.
In the above on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits, and sends the subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting a task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks.
Example two
Fig. 2 is a diagram illustrating an implementation structure of on-chip data processing hardware according to a second embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments. After a data processing module finishes a data processing task, the software must be notified that the task is complete; but once a virtual channel module splits a target data processing task, the task is executed as a plurality of target subtasks. A split task counting module is therefore added to track the completion of all the subtasks split from a task: only after every target subtask has finished is the target data processing task complete, and only then can the software be notified.
Accordingly, in this embodiment, the on-chip data processing hardware further includes, in addition to the virtual channel module, the arbitration module, and the thread unit, a split task count module 210.
The virtual channel module is further configured to send the task identifier and the splitting number of times of the target data processing task to the splitting task counting module 210.
In this embodiment, every time the virtual channel module splits a splittable target data processing task into the matching number of target subtasks, it may send the task identifier and the number of splits of that task to the split task counting module 210 as counting trigger information. The thread unit is further configured to generate subtask completion information matching each target subtask it finishes, and to send that information to the split task counting module 210.
After finishing a target subtask, a thread unit may generate a subtask completion message carrying the task identifier of the target data processing task to which the subtask belongs, and send it to the split task counting module 210.
A piece of subtask completion information for target data processing task A indicates that one of the target subtasks split from task A has finished executing.
The splitting task counting module 210 is configured to update a completed count value of the target data processing task according to the subtask completion information, and report task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
In this embodiment, suppose the number of splits of the target data processing task whose identifier is AAA is 5; that is, task AAA was split into 5 target subtasks. When the split task counting module 210 has received 5 pieces of subtask completion information for task AAA in total, it can determine that task AAA has finished executing and report the task completion information of task AAA to the software.
Accordingly, in this embodiment the split task counting module 210 may maintain a separate counter for every target data processing task sent by every virtual channel module. On receiving subtask completion information from a thread unit, it increments the counter of the task to which that information belongs; when the counter reaches that task's number of splits, the task is judged complete and its task completion information can be reported to the software.
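The per-task counter just described can be sketched as follows (a software model only; the method names "register" and "subtask_done" are illustrative assumptions, not terms from the patent):

```python
class SplitTaskCounter:
    def __init__(self):
        self.expected = {}   # task identifier -> number of splits
        self.done = {}       # task identifier -> completed subtask count

    def register(self, task_id: str, splits: int) -> None:
        """Called with the trigger info a virtual channel module sends after splitting."""
        self.expected[task_id] = splits
        self.done[task_id] = 0

    def subtask_done(self, task_id: str) -> bool:
        """Called for each thread unit's completion message.
        Returns True when the whole task has finished."""
        self.done[task_id] += 1
        if self.done[task_id] == self.expected[task_id]:
            # In the hardware this is where the completion notification would be
            # raised to the software and to the originating virtual channel module.
            del self.expected[task_id], self.done[task_id]
            return True
        return False
```

For a task registered with 5 splits, only the fifth completion message triggers the report, matching the AAA example above.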
Based on the above embodiments, the splitting task counting module 210 may be further configured to:
and when the completed count value reaches the splitting times, sending task completion information of the target data processing task to a virtual channel module for splitting the target data processing task.
In this alternative embodiment, after all the target subtasks of a target data processing task have been executed, the split task counting module 210 not only notifies the software of the task completion information but may also notify the virtual channel module that split the task. The virtual channel module can then release all caches occupied by that task, and the freed storage space can receive a new data processing task issued by the software.
By adding the split task counting module to the on-chip data processing hardware, the above technical scheme effectively monitors whether a target data processing task has been executed in full on top of the split execution mechanism, further perfecting the function of the on-chip data processing hardware and meeting its practical application requirements.
Example III
Fig. 3 shows a flowchart of an on-chip data processing method in a third embodiment of the present invention. The embodiment is applicable to splitting data processing tasks entirely in hardware. The method is performed by a virtual channel module in the on-chip data processing hardware and specifically includes the following steps:
S310, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits.
The software can be understood as the software system installed on the chip (which may also be understood as the AI platform) on which the on-chip data processing hardware is configured; it is mainly used for scheduling the execution of various data processing tasks.
In this embodiment, if the virtual channel module receives a splittable target data processing task, it splits the task into a plurality of target subtasks; if it receives an ordinary, non-splittable data processing task, it forwards the task directly to the arbitration module, so that the arbitration module and the thread units distribute and execute it as one whole data processing task.
Whether a data processing task is splittable can be set dynamically by the software according to the actual application scenario. For example, if the software detects at some moment that too few data processing tasks are being generated in real time to fully occupy the thread units, it may mark every currently generated data processing task as a splittable target data processing task; likewise, when a newly generated data processing task has a high priority and must be executed quickly, it may be marked as splittable. Furthermore, after marking a data processing task as a splittable target data processing task, the software can add a split identifier and the number of splits to the task, so that the virtual channel module can recognize and distinguish it.
The number of splits can likewise be set dynamically by the software according to the actual application scenario, and different target data processing tasks may use the same or different split counts.
In an optional implementation of this embodiment, if the software detects at some moment that few data processing tasks are being generated in real time, it may set the number of splits of each target data processing task to the number of currently configured thread units, ensuring that every thread unit is fully used for each task. By way of example and not limitation, if 8 thread units are currently configured, the number of splits of each target data processing task may be set to 8.
In another optional implementation of this embodiment, after the software marks a high-priority data processing task that must be executed quickly as a target data processing task, it may obtain information such as the task's computation amount or its required maximum computation time and dynamically determine the number of splits from that information. For example, the larger the computation amount of the task and the longer the computation it would otherwise require, the larger the number of splits.
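The two policies above can be combined in one illustrative heuristic. The patent only says the split count may equal the thread-unit count (idle case) or grow with the task's computation amount; the exact formula and parameter names below ("per_unit_throughput" in particular) are assumptions for demonstration.

```python
def choose_splits(num_thread_units, compute_amount=None, per_unit_throughput=1024):
    """Pick a split count for one target data processing task."""
    if compute_amount is None:
        # Few tasks in flight: occupy every thread unit.
        return num_thread_units
    # High-priority task: more computation -> more splits, capped by unit count.
    needed = -(-compute_amount // per_unit_throughput)  # ceiling division
    return max(1, min(needed, num_thread_units))
```

So with 8 thread units an idle-time task is split 8 ways, while a priority task is split just enough to cover its computation amount, never beyond the available units.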
Correspondingly, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to at least one of the source address and the target address in the task and the number of splits may include the following steps.
When a data processing task issued by software is received, detecting whether the data processing task contains a splitting identification or not;
if yes, determining the data processing task to be a splittable target data processing task and extracting the number of splits from it; extracting at least one of the source address and the target address from the task according to its task type; and splitting the task into a plurality of target subtasks according to the extracted address or addresses and the number of splits.
Depending on the task type of the target data processing task, the virtual channel module may split the task by splitting only the source address, only the target address, or both at the same time; this embodiment does not limit the choice.
S320, sending the target subtasks to an arbitration module in the on-chip data processing hardware.
In the above technical scheme, when a splittable target data processing task issued by the software is received, the task is split into a plurality of target subtasks according to at least one of its source address and target address and the number of splits, and the subtasks are sent to the arbitration module in the on-chip data processing hardware. By splitting each data processing task into multiple subtasks and distributing them over multiple thread units, the utilization of the thread units and of the bandwidth can be fully improved when there are not enough data processing tasks, and when a particular task must be accelerated, it can be completed faster by having several thread units execute it jointly.
Example four
Fig. 4 is a flowchart of an on-chip data processing method according to a fourth embodiment of the present invention. On the basis of the above embodiments, this embodiment refines the operation of splitting the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, as follows: determining the source data amount and/or the target data amount corresponding to a single target subtask according to the target data processing task and the splitting times; determining a source address offset of each target subtask with respect to the source address according to the source data amount, and/or determining a target address offset of each target subtask with respect to the target address according to the target data amount; and splitting the target data processing task into a plurality of target subtasks according to the source address offset and/or the target address offset of each target subtask.
Accordingly, as illustrated in fig. 4, the method may specifically include:
S410, when a data processing task issued by the software is received, detecting whether the data processing task contains a split identifier: if yes, executing S420; otherwise, executing S480.
In this embodiment, if the data processing task contains a split identifier, this indicates that the software has determined that the task needs to be split before execution, and the virtual channel module may therefore determine the data processing task to be a splittable target data processing task. If the data processing task does not contain a split identifier, it can be executed directly by a thread unit and may be sent directly to the arbitration module in the on-chip data processing hardware.
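The two branches above can be modelled in software as follows. This is a hedged illustration only, not the patented hardware implementation; the function and parameter names (`dispatch`, `send_to_arbiter`, `splitter`, `split_times`) are hypothetical.

```python
# Hedged software model of the branch at S410: a task without a split
# identifier goes straight to the arbitration module (S480); a task
# with one is handed to a splitter first. All names are hypothetical.
def dispatch(task, send_to_arbiter, splitter):
    """Forward `task` (a dict) directly, or split it first when it
    carries a split identifier; returns the number of tasks sent."""
    if "split_times" not in task:      # no split identifier
        send_to_arbiter(task)          # execute directly on one unit
        return 1
    subtasks = splitter(task)          # splittable target task
    for sub in subtasks:               # forward every target subtask
        send_to_arbiter(sub)
    return len(subtasks)
```

A splitter that honours the splitting times would return that many subtasks, as detailed in the steps below.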
S420, determining the data processing task to be a splittable target data processing task, and extracting the splitting times from the target data processing task.
The splitting times can be understood as the number of target subtasks into which the target data processing task needs to be split.
S430, extracting at least one of a source address and a target address in the target data processing task according to the task type of the target data processing task.
It will be appreciated that different types of target data processing tasks may require different data processing operations. Accordingly, the information that needs to be extracted from the target data processing task when splitting it also differs by task type.
In an optional implementation manner of this embodiment, extracting at least one of the source address and the target address in the target data processing task according to the task type of the target data processing task may include:
if the target data processing task is determined to be the data handling task, extracting a source address and a target address in the target data processing task;
if the target data processing task is determined to be a data filling task, extracting a target address in the target data processing task;
and if the target data processing task is determined to be a multidimensional matrix processing task, extracting a source address and a target address in the target data processing task.
If it is determined that the target data processing task is a data handling task, the operation to be executed by the task is specifically to take the source address as the data acquisition starting point, acquire source data of a set data amount, and then take the target address as the data writing starting point and write that source data. When configuring a data handling task, the software configures the source address, the target address, and the amount of data to be handled.
If it is determined that the target data processing task is a data filling task, the operation to be executed by the task is specifically to take the target address as the data writing starting point and fill in a specified amount of a given data value. When configuring a data filling task, the software configures the target address, the fill data value, and the fill data amount.
A multidimensional matrix processing task is a processing task for a multidimensional matrix and may include: a matrix padding task (padding), a matrix slicing task (slice), a matrix de-slicing task (de-slice), a matrix reshaping task (reshape), and the like. The operation to be executed by each multidimensional matrix processing task is specifically to take the source address as the data acquisition starting point, acquire a first multidimensional matrix of a first shape, process the first multidimensional matrix to obtain a second multidimensional matrix of a second shape, and then take the target address as the data writing starting point and write out the second multidimensional matrix of the second shape. When configuring a multidimensional matrix processing task, the software configures the source address, the first shape of the first multidimensional matrix, the target address, and the second shape of the second multidimensional matrix.
In this embodiment, each of the above task types involves either a read of data from a segment of address space starting at the source address, or a write of data into a segment of address space starting at the target address. The read or write operation can therefore be divided into a plurality of subtasks, which in turn can be executed by different thread units using separate read and write data ports.
S440, determining the source data volume and/or the target data volume corresponding to the single target subtask according to the target data processing task and the splitting times.
As described above, the software writes into each target data processing task the total amount of data matching the source address, the total amount of data matching the target address, or description information from which those totals can be derived: for example, the amount of data to be handled in a data handling task, the fill data amount in a data filling task, or the first shape of the first multidimensional matrix and the second shape of the second multidimensional matrix in a multidimensional matrix processing task.
For example, if the first shape of the first multidimensional matrix is (a1, a2, a3, a4, a5), the total data amount corresponding to the first multidimensional matrix is: a1×a2×a3×a4×a5.
By acquiring the total amount of data matching the source address or the target address and combining it with the splitting times, the source data amount and/or the target data amount corresponding to a single target subtask can be determined.
In a specific example, if the amount of data to be handled in a data handling task is M and the splitting times are N, then the source data amount of a single target subtask of the data handling task is M/N and the target data amount is also M/N. If the fill data amount in a data filling task is X and the splitting times are N, the target data amount of a single target subtask of the data filling task is X/N.
Similarly, if the first shape of the first multidimensional matrix in a multidimensional matrix processing task gives a total data amount P1, the second shape of the second multidimensional matrix gives a total data amount P2, and the splitting times are N, then the source data amount of a single target subtask of the multidimensional matrix processing task is P1/N and its target data amount is P2/N.
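The per-subtask arithmetic above can be written out as a small helper. This is an explanatory sketch only: the `task_type` values and parameter names are chosen for illustration (the patent does not prescribe them), and the sketch assumes each total divides evenly by the splitting times.

```python
from math import prod  # product of a shape's dimensions

def per_subtask_amounts(task_type, n, *, size=None,
                        src_shape=None, dst_shape=None):
    """Return (source data amount, target data amount) for a single
    target subtask; None means the task type has no such amount."""
    if task_type == "handling":        # M/N for both source and target
        return size // n, size // n
    if task_type == "filling":         # only a target amount: X/N
        return None, size // n
    if task_type == "matrix":          # P1/N and P2/N from the shapes
        return prod(src_shape) // n, prod(dst_shape) // n
    raise ValueError(f"unknown task type: {task_type}")
```

For instance, a data handling task of 1024 bytes split 4 ways yields 256 bytes of source data and 256 bytes of target data per subtask.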
S450, determining the source address offset of each target subtask aiming at the source address according to the source data quantity, and/or determining the target address offset of each target subtask aiming at the target address according to the target data quantity.
In this embodiment, when dividing the target data processing task into a plurality of target subtasks, the source address and/or the target address of each target subtask must be specified. Given the source address and/or the target address of the target data processing task, the source address offset of each target subtask with respect to the source address, and/or its target address offset with respect to the target address, can then be determined in combination with the source data amount and the target data amount of each target subtask.
In an optional implementation manner of this embodiment, determining the source address offset of each target subtask for the source address according to the source data amount, and/or determining the target address offset of each target subtask for the target address according to the target data amount may include:
generating a task number of each target subtask according to the splitting times and a natural counting sequence;
determining the product of the task number of each target subtask and the source data volume as the source address offset of each target subtask for the source address; and/or
And determining the product of the task number of each target subtask and the target data volume as a target address offset of each target subtask for the target address.
Following the previous example, assuming the splitting times are N=3 for a data handling task, target subtask 0, target subtask 1, and target subtask 2 may be generated in natural counting order. The product of each subtask's task number and the per-subtask data amount M/N of the data handling task is then both its source address offset and its target address offset. That is, target subtask 0 has a source address offset of 0 and a target address offset of 0; target subtask 1 has a source address offset of M/3 and a target address offset of M/3; and target subtask 2 has a source address offset of 2M/3 and a target address offset of 2M/3.
S460, splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address.
In this embodiment, splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address may include:
determining the source address of each target subtask according to the source address offset of each target subtask for the source address; and/or determining the target address of each target subtask according to the target address offset of each target subtask for the target address;
and splitting the target data processing task into a plurality of target subtasks according to the source address and/or the target address of each target subtask.
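For the data handling case, S440 through S460 amount to the following sketch: task numbers in natural counting order, multiplied by the per-subtask data amount, give the address offsets, and adding each offset to the original source and target addresses yields the subtasks. Function and field names are illustrative, and the sketch assumes the total data amount divides evenly by the splitting times.

```python
def split_handling_task(src_addr, dst_addr, total, n):
    """Split a data handling task of `total` bytes into n subtasks."""
    chunk = total // n                    # source = target amount, M/N
    subtasks = []
    for k in range(n):                    # task numbers 0, 1, ..., n-1
        offset = k * chunk                # task number x data amount
        subtasks.append({"src": src_addr + offset,  # per-subtask source
                         "dst": dst_addr + offset,  # per-subtask target
                         "size": chunk})
    return subtasks
```

Each resulting subtask is self-describing (its own start addresses and data amount), which is what allows different thread units to execute them independently.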
S470, sending the plurality of target subtasks to an arbitration module in the on-chip data processing hardware.
S480, sending the data processing task directly to the arbitration module in the on-chip data processing hardware.
According to the technical scheme of this embodiment, upon receiving a splittable target data processing task issued by the software, the target data processing task is split into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, and the plurality of target subtasks are sent to the arbitration module in the on-chip data processing hardware. By splitting each data processing task into a plurality of subtasks distributed across a plurality of thread units, the utilization of the thread units and of the bandwidth can be substantially improved when data processing tasks are scarce, and a data processing task that needs accelerated processing can be completed faster by having a plurality of thread units execute it jointly.
Example five
Fig. 5 is a flowchart of an on-chip data processing method according to a fifth embodiment of the present invention. On the basis of the above embodiments, this embodiment further refines the operations performed after the virtual channel module sends the plurality of target subtasks to the arbitration module in the on-chip data processing hardware.
Accordingly, as shown in fig. 5, the method specifically may include:
S510, when a splittable target data processing task issued by the software is received, splitting the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task.
S520, the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
S530, sending the task identifier and the splitting times of the target data processing task to a splitting task counting module in the on-chip data processing hardware.
In this embodiment, after the on-chip data processing hardware completes a data processing task, the software needs to be notified that the task is complete. Once the virtual channel module splits a target data processing task, that task is executed as a plurality of target subtasks. A splitting task counting module is therefore provided to count the completion of all the split target subtasks: only after every target subtask has been executed is the target data processing task itself complete, at which point the software can be notified.
In this embodiment, each time the virtual channel module splits a splittable target data processing task into the plurality of target subtasks matching the splitting times, it may send the task identifier and the splitting times of that target data processing task to the splitting task counting module as split-task count trigger information.
The thread unit is also used for generating subtask completion information matching each target subtask whose execution is completed, and for sending the subtask completion information to the splitting task counting module.
Specifically, after completing the execution of a target subtask, the thread unit may generate a subtask completion message carrying the task identifier of the target data processing task to which the subtask belongs, and send it to the splitting task counting module. A piece of subtask completion information for target data processing task A indicates that one of the target subtasks split from task A has finished executing.
Specifically, the splitting task counting module is used for updating the completed count value of the target data processing task according to the subtask completion information, and reporting the task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
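The counting behaviour just described can be modelled as follows. The class and method names are assumptions made for illustration, not the hardware interface; the model keeps one completed count per task identifier and compares it against the splitting times.

```python
class SplittingTaskCounter:
    """Toy model of the splitting task counting module."""

    def __init__(self):
        self.expected = {}   # task id -> splitting times
        self.done = {}       # task id -> completed count

    def register(self, task_id, split_times):
        """Trigger information from the virtual channel module."""
        self.expected[task_id] = split_times
        self.done[task_id] = 0

    def report_subtask_done(self, task_id):
        """Called once per completed subtask; returns True when the
        whole target data processing task is complete, which is the
        point at which the software (and the virtual channel module
        that split the task) would be notified."""
        self.done[task_id] += 1
        if self.done[task_id] == self.expected[task_id]:
            del self.expected[task_id]
            del self.done[task_id]       # free the counter entry
            return True
        return False
```

For a task registered with splitting times 3, the first two completion reports return False and the third returns True.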
S540, when task completion information of the target data processing task sent by the splitting task counting module is received, clearing the locally stored target data processing task.
Specifically, when all the target subtasks of a target data processing task have been executed, the splitting task counting module can report task completion information of the target data processing task to the software, and can further notify the virtual channel module that split the task. At this point, the virtual channel module may release all cache space occupied by the target data processing task, and the released storage space can receive a new data processing task issued by the software.
According to the technical scheme of this embodiment, by adding a splitting task counting module to the on-chip data processing hardware, the complete execution of a split target data processing task can be effectively monitored, further improving the functionality of the on-chip data processing hardware and meeting its practical application requirements.
Example six
Fig. 6 is a schematic structural diagram of an AI chip according to a sixth embodiment of the present invention. As shown in fig. 6, the AI platform is provided with an AI chip, the AI chip includes the on-chip data processing hardware 610 according to any embodiment of the present invention, and the on-chip data processing hardware 610 operates in cooperation with the software 620 installed on the AI chip to perform the on-chip data processing method according to any embodiment of the present invention.
Specifically, the on-chip data processing hardware 610 may include: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is used for splitting the target data processing task into a plurality of target subtasks and sending the target subtasks to the arbitration module according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software 620;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
And the thread unit is used for executing the received target subtasks.
The on-chip data processing method performed by a virtual channel module in the on-chip data processing hardware comprises:
splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing hardware.
In the on-chip data processing hardware, when a virtual channel module receives a splittable target data processing task issued by the software, it splits the target data processing task into a plurality of target subtasks according to the splitting times and at least one of a source address and a target address in the target data processing task, and sends the target subtasks to the arbitration module. The arbitration module can then dispatch each target subtask, subtask by subtask, to one or more thread units for execution. By splitting each task into a plurality of subtasks distributed across a plurality of thread units, the utilization of the thread units and of the bandwidth can be substantially improved when data processing tasks are scarce.
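The arbitration module's dispatch decision reduces to finding an idle thread unit for each incoming subtask. The sketch below uses a deliberately simple first-idle policy; the description only requires that the choice follow the working state of each thread unit, not this particular policy, and the function name is an assumption.

```python
def pick_thread_unit(busy):
    """Given one busy flag per thread unit, return the index of the
    first idle unit, or None if every unit is currently busy."""
    for i, is_busy in enumerate(busy):
        if not is_busy:
            return i
    return None        # caller would stall until a unit frees up
```

With working states [busy, idle, busy], the subtask would be dispatched to thread unit 1.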
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (12)

1. An on-chip data processing system, comprising: a plurality of virtual channel modules, an arbitration module, and a plurality of thread units; the arbitration module is respectively connected with the virtual channel modules and the thread units;
the virtual channel module is used for splitting the target data processing task into a plurality of target subtasks and sending the target subtasks to the arbitration module according to at least one of a source address and a target address in the target data processing task and splitting times when receiving a splittable target data processing task issued by the software;
wherein each data processing task contains either both a source address and a target address or only a target address; a data processing task containing both a source address and a target address describes acquiring one or more items of source data from the source address, performing task processing according to the task type, and storing one or more items of the resulting data processing results to the target address; a data processing task containing only a target address describes acquiring preset or fixed source data, performing task processing according to the task type, and storing one or more items of the resulting data processing results to the target address;
the arbitration module is used for determining a target thread unit according to the working state of each thread unit when receiving the target subtask and sending the target subtask to the target thread unit;
and the thread unit is used for executing the received target subtasks.
2. The on-chip data processing system of claim 1, further comprising a split task counting module;
the virtual channel module is also used for sending the task identification and the splitting times of the target data processing task to the splitting task counting module;
The thread unit is also used for generating subtask completion information matched with the target subtask of which execution is completed and sending the subtask completion information to the splitting task counting module;
and the splitting task counting module is used for updating the completed count value of the target data processing task according to the subtask completion information, and reporting the task completion information of the target data processing task to the software when the completed count value reaches the splitting times.
3. The on-chip data processing system of claim 2, wherein the split task counting module is further configured to:
and when the completed count value reaches the splitting times, sending task completion information of the target data processing task to a virtual channel module for splitting the target data processing task.
4. An on-chip data processing method performed by a virtual channel module in an on-chip data processing system according to any of claims 1-3, the method comprising:
splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and splitting times when receiving the splittable target data processing task issued by the software;
the plurality of target subtasks are sent to an arbitration module in the on-chip data processing system.
5. The method of claim 4, wherein upon receiving the software-issued detachable target data processing task, splitting the target data processing task into a plurality of target sub-tasks based on the number of splits and at least one of a source address and a target address in the target data processing task, comprising:
when a data processing task issued by software is received, detecting whether the data processing task contains a split identifier;
if yes, determining the data processing task to be a splittable target data processing task, and extracting the splitting times from the target data processing task;
extracting at least one of a source address and a target address in the target data processing task according to the task type of the target data processing task;
and splitting the target data processing task into a plurality of target subtasks according to at least one of a source address and a target address in the target data processing task and the splitting times.
6. The method of claim 5, wherein extracting at least one of a source address and a destination address in the target data processing task based on a task type of the target data processing task, comprises:
If the target data processing task is determined to be the data handling task, extracting a source address and a target address in the target data processing task;
if the target data processing task is determined to be a data filling task, extracting a target address in the target data processing task;
and if the target data processing task is determined to be a multidimensional matrix processing task, extracting a source address and a target address in the target data processing task.
7. The method of claim 5, wherein splitting the target data processing task into a plurality of target sub-tasks based on the number of splits and at least one of a source address and a target address in the target data processing task, comprising:
determining the source data volume and/or the target data volume corresponding to a single target subtask according to the target data processing task and the splitting times;
determining a source address offset of each target subtask for the source address according to the source data amount, and/or determining a target address offset of each target subtask for the target address according to the target data amount;
splitting the target data processing task into a plurality of target subtasks according to the source address offset of each target subtask for the source address and/or the target address offset of each target subtask for the target address.
8. The method of claim 7, wherein determining a source address offset for each target subtask for the source address based on a source data amount and/or determining a target address offset for each target subtask for the target address based on a target data amount comprises:
generating a task number of each target subtask according to the splitting times and a natural counting sequence;
determining the product of the task number of each target subtask and the source data volume as the source address offset of each target subtask for the source address; and/or
And determining the product of the task number of each target subtask and the target data volume as a target address offset of each target subtask for the target address.
9. The method of claim 7, wherein splitting the target data processing task into a plurality of target subtasks according to a source address offset for the source address for each target subtask and/or a target address offset for the target address for each target subtask, comprises:
determining the source address of each target subtask according to the source address offset of each target subtask for the source address; and/or determining the target address of each target subtask according to the target address offset of each target subtask for the target address;
And splitting the target data processing task into a plurality of target subtasks according to the source address and/or the target address of each target subtask.
10. The method of claim 4, further comprising, after sending the plurality of target subtasks to the arbitration module in the on-chip data processing system:
and sending the task identifier and the splitting times of the target data processing task to a splitting task counting module in the on-chip data processing system.
11. The method of claim 10, further comprising, after sending the task identifier and the splitting times of the target data processing task to the splitting task counting module in the on-chip data processing system:
and when task completion information of the target data processing task sent by the splitting task counting module is received, the target data processing task stored locally is cleared.
12. An AI platform comprising an on-chip data processing system according to any of claims 1-3.
CN202310251374.XA 2023-03-16 2023-03-16 On-chip data processing hardware, on-chip data processing method and AI platform Active CN115964155B (en)


Publications (2)

Publication Number Publication Date
CN115964155A CN115964155A (en) 2023-04-14
CN115964155B true CN115964155B (en) 2023-05-30



Also Published As

Publication number Publication date
CN115964155A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
US8250164B2 (en) Query performance data on parallel computer system having compute nodes
CN108564470B (en) Transaction distribution method for parallel building blocks in block chain
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
CN108647104B (en) Request processing method, server and computer readable storage medium
US8982884B2 (en) Serial replication of multicast packets
CN101923490B (en) Job scheduling apparatus and job scheduling method
US8463928B2 (en) Efficient multiple filter packet statistics generation
WO2012037760A1 (en) Method, server and system for improving alarm processing efficiency
US20110173287A1 (en) Preventing messaging queue deadlocks in a dma environment
CN103986585A (en) Message preprocessing method and device
CN115964155B (en) On-chip data processing hardware, on-chip data processing method and AI platform
CN110764705B (en) Data reading and writing method, device, equipment and storage medium
CN103299298A (en) Service processing method and system
CN110502337B (en) Optimization system for shuffling stage in Hadoop MapReduce
CN111831408A (en) Asynchronous task processing method and device, electronic equipment and medium
CN111225063A (en) Data exchange system and method for static distributed computing architecture
CN112035460A (en) Identification distribution method, device, equipment and storage medium
Birke et al. Meeting latency target in transient burst: A case on spark streaming
CN111538604B (en) Distributed task processing system
US20040064580A1 (en) Thread efficiency for a multi-threaded network processor
CN103593606A (en) Contextual information managing method and system
CN112417015A (en) Data distribution method and device, storage medium and electronic device
CN112541038A (en) Time series data management method, system, computing device and storage medium
CN114637594A (en) Multi-core processing device, task allocation method, device and storage medium
US8428075B1 (en) System and method for efficient shared buffer management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant