CN117971496A - Operator task execution method, artificial intelligent chip and electronic equipment - Google Patents
- Publication number
- CN117971496A (application number CN202410295023.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- task
- memory
- operator
- computing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering the load
- G06F9/5038—Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Abstract
Embodiments of the invention provide an operator task execution method, an artificial intelligence chip and electronic equipment, for reducing the load of the computing core in the artificial intelligence chip and improving operator execution efficiency. The method comprises the following steps: the computing core executes a first task in the operator, the first task comprising at least one of a computing task and a first data transfer task; and the DMA controller executes a second data transfer task in the operator. Because the DMA controller in the artificial intelligence chip takes over part or all of the data transfer tasks in the operator, the load of the computing core can be reduced and the execution efficiency of the operator improved.
Description
Technical Field
Embodiments of the invention relate to the technical field of artificial intelligence, and in particular to an operator task execution method, an artificial intelligence chip and electronic equipment.
Background
The operations of an artificial intelligence model may be implemented by operators in a computational graph. A computational graph is a graph structure that represents the computational tasks and data flow of the model, and an operator is an operation on the tensors of a layer of the model. For example, the convolution operation that a convolution layer performs on its input data is a convolution operator.
In practical applications, an artificial intelligence chip may be used to run an artificial intelligence model. At present, all tasks of every operator are executed by the computing core in the chip, so the load on the computing core is high and operators execute inefficiently.
Disclosure of Invention
Embodiments of the invention provide an operator task execution method, an artificial intelligence chip and electronic equipment, for reducing the load of the computing core in the artificial intelligence chip and improving operator execution efficiency.
In a first aspect, the present application provides an operator task execution method applied to an artificial intelligence chip, the chip including a computing core and a direct memory access (DMA) controller, the method comprising:
the computing core executes a first task in the operator, the first task comprising at least one of: a computing task and a first data transfer task;
the DMA controller executes a second data transfer task in the operator.
In the above technical solution, the operator includes the first task and a second data transfer task; the computing core executes the first task, and the DMA controller executes the second data transfer task. In other words, the DMA controller takes over part or all of the data transfer work in the operator, thereby reducing the load of the computing core and improving the execution efficiency of the operator.
In one possible implementation, the first task is a computing task, the input data of the operator includes first data associated with the computing task, and the artificial intelligence chip further includes a memory;
the computing core executing the first task in the operator comprises:
the computing core reads the first data from a first storage location of the memory;
the computing core computes on the first data to obtain a first computation result;
the computing core writes the first computation result to a second storage location of the memory.
In the above technical solution, when the operator includes only the computing task and the second data transfer task, all the data transfer work is executed by the DMA controller, so the load of the computing core is reduced to the greatest extent.
In one possible implementation, the first task is a computing task together with a first data transfer task, the input data of the operator includes first data and second data, the first data is associated with the computing task, the second data is associated with the first data transfer task, and the artificial intelligence chip further includes a memory;
the computing core executing the first task in the operator comprises:
the computing core reads the first data from a first storage location of the memory, computes on the first data to obtain a first computation result, and writes the first computation result to a second storage location of the memory;
the computing core reads the second data from a third storage location of the memory and writes the second data to a fourth storage location of the memory.
In the above technical solution, when the operator includes a computing task, a first data transfer task and a second data transfer task, the computing core executes all the computing work and part of the data transfers while the DMA controller executes the remaining transfers, which reduces the load of the computing core and improves operator execution efficiency.
In one possible implementation, the data amount of the second data is determined from the first data amount, the first duration the computing core takes to execute the computing task, the data transfer speed of the computing core, and the data transfer speed of the DMA controller, where the first data amount is the amount of the input data other than the first data.
In the above technical solution, when determining how much second data the computing core should transfer, factors such as the total amount of data to be transferred in the operator's input, the duration of the computing task, and the transfer speeds of the computing core and the DMA controller are all taken into account, which helps further balance the load between the computing core and the DMA controller in the artificial intelligence chip.
In one possible implementation, the first task is a first data transfer task, the input data of the operator includes second data associated with the first data transfer task, and the artificial intelligence chip further includes a memory;
the computing core executing the first task in the operator comprises:
the computing core reads the second data from a third storage location of the memory and writes the second data to a fourth storage location of the memory.
In this technical solution, when the operator includes only the first data transfer task and the second data transfer task, the computing core executes one part of the transfers and the DMA controller executes the other part, which reduces the load of the computing core and improves operator execution efficiency.
In one possible implementation, the data amount of the second data is determined from the data amount of the operator's input data, the data transfer speed of the computing core, and the data transfer speed of the DMA controller.
In this case, taking the total input data amount and the transfer speeds of the computing core and the DMA controller into account when determining the amount of second data to transfer further helps balance the load between the two.
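As a minimal sketch of this pure-transfer case (an illustration only, not code from the patent; all names are assumptions): balancing S1 / v1 == S2 / v2, with S1 + S2 equal to the total input amount, gives the core's share S1 = S * v1 / (v1 + v2).

```python
def split_transfer(total_bytes: int, v_core: float, v_dma: float) -> tuple:
    """Split a pure data-transfer workload between the compute core and the
    DMA controller so that both finish at (roughly) the same time.

    Solving S1 / v_core == S2 / v_dma with S1 + S2 == total_bytes yields
    S1 = total_bytes * v_core / (v_core + v_dma).
    """
    s1 = round(total_bytes * v_core / (v_core + v_dma))  # core's share
    s2 = total_bytes - s1                                # DMA controller's share
    return s1, s2
```

For example, with 1000 bytes, a core transfer speed of 1.0 and a DMA speed of 3.0, the split is (250, 750), and both sides take 250 time units.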
In one possible implementation, the input data of the operator includes third data associated with the second data transfer task, and the artificial intelligence chip further includes a memory;
the DMA controller executing the second data transfer task in the operator comprises:
the DMA controller reads the third data from a fifth storage location of the memory and writes the third data to a sixth storage location of the memory.
In one possible implementation, the difference between the second duration the computing core takes to execute the first task and the third duration the DMA controller takes to execute the second data transfer task is within a preset range.
In the above technical solution, because this difference is kept within a preset range, load balancing between the computing core and the DMA controller in the artificial intelligence chip can be achieved.
Further, when the second duration equals the third duration, load balancing between the computing core and the DMA controller is achieved to the greatest extent.
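The durations and the balance condition can be written out as a small sketch (an assumption-laden model, not code from the patent): the core's duration is its compute time plus its share of the transfers, the DMA controller's duration is its transfer time, and the two should differ by no more than the preset range.

```python
def task_durations(t_compute: float, s1: float, v_core: float,
                   s2: float, v_dma: float) -> tuple:
    """Durations of the core's first task and the DMA controller's second
    data transfer task, given transfer amounts s1/s2 and speeds v_core/v_dma."""
    t_core = t_compute + s1 / v_core  # compute time plus the core's transfer share
    t_dma = s2 / v_dma                # DMA controller's transfer time
    return t_core, t_dma

def is_load_balanced(t_core: float, t_dma: float, tolerance: float) -> bool:
    """True when the two durations differ by no more than the preset range."""
    return abs(t_core - t_dma) <= tolerance
```

With t_compute = 2, s1 = 4 at speed 1 and s2 = 12 at speed 2, both sides take 6 time units, the fully balanced case.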
In one possible implementation, the artificial intelligence chip further includes a control unit, and before the computing core executes the first task in the operator and the DMA controller executes the second data transfer task in the operator, the method further includes:
the control unit divides the operator into the first task and the second data transfer task;
the control unit dispatches the first task to the computing core for processing and dispatches the second data transfer task to the DMA controller for processing.
In this technical solution, the artificial intelligence chip itself partitions the operator's tasks and distributes them inside the chip, so the hardware resources that execute the operator's tasks are allocated automatically.
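The partition-and-dispatch step might be modelled as follows. This is a toy sketch under the first embodiment's policy (compute tasks go to the core, all transfer tasks go to the DMA controller); every name in it is an assumption, and a real control unit could instead keep some transfers on the core as in the second embodiment.

```python
def assign_operator_tasks(tasks):
    """Assign each sub-task of an operator to a processing resource.

    `tasks` is a list of kind strings, "compute" or "transfer"; the result
    maps each resource to the indices of the tasks dispatched to it.
    """
    assignment = {"core": [], "dma": []}
    for i, kind in enumerate(tasks):
        if kind == "compute":
            assignment["core"].append(i)   # computing tasks stay on the core
        else:
            assignment["dma"].append(i)    # transfers are offloaded to the DMA controller
    return assignment
```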
In one possible implementation, before the computing core executes the first task in the operator and the DMA controller executes the second data transfer task in the operator, the method further includes:
the computing core receives a task allocation result sent by a central processing unit, the task allocation result comprising the first task and the second data transfer task obtained by dividing the operator, where the processing resource allocated to the first task is the computing core and the processing resource allocated to the second data transfer task is the DMA controller;
the DMA controller receives the task allocation result sent by the central processing unit.
In this technical solution, the central processing unit divides the operator into tasks and allocates hardware resources to each task, and the artificial intelligence chip executes the operator's tasks according to the CPU's allocation result; this reduces the overhead on the artificial intelligence chip and improves operator performance.
In a second aspect, the present application provides an artificial intelligence chip comprising a computing core and a DMA controller;
the computing core is configured to execute a first task in an operator, the first task comprising at least one of: a computing task and a first data transfer task;
the DMA controller is configured to execute a second data transfer task in the operator.
In a third aspect, the present application provides an electronic device comprising a processor and a memory storing program instructions; by executing the program instructions in the memory, the processor implements the method of the first aspect or any one of its possible implementations.
For the technical effects achieved by the second and third aspects, refer to the description of the advantages of the first aspect or any one of its possible implementations; the description is not repeated here.
Drawings
FIG. 1 is a schematic diagram of a system architecture to which embodiments of the present application are applicable;
FIG. 2 is a schematic diagram of a computing core executing an image overlay operator in the related art;
FIG. 3 is a schematic diagram of an artificial intelligence chip according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an operator task execution method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an artificial intelligence chip executing an image overlay operator according to an embodiment of the present application;
FIG. 6 is another schematic diagram of an artificial intelligence chip executing an image overlay operator according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic diagram of a system architecture applicable to embodiments of the present application. The system architecture includes a central processing unit (CPU) and one or more artificial intelligence chips; the number of artificial intelligence chips is not specifically limited in this application. It should be understood that the system architecture may also include other modules or components, which the application likewise does not limit.
The central processing unit is used to handle the non-computing tasks in an artificial intelligence application. The CPU may include multiple processing cores that execute multiple tasks simultaneously, greatly improving the computing power and performance of the device.
An artificial intelligence chip, also known as an AI accelerator or computing card, is used to handle the large number of computing tasks in artificial intelligence applications. The artificial intelligence chip may be, for example, a graphics processing unit (GPU) or a general-purpose GPU (GPGPU).
The artificial intelligence chip includes a computing core and a memory. In practice, the chip runs an artificial intelligence model whose operations are implemented by the operators in a computational graph. Each operator comprises one or more operator tasks: some operators contain only computing tasks, some contain only data transfer tasks, and some contain both.
In the related art, all tasks of each operator, including all of its computing tasks and all of its data transfer tasks, are executed by the computing core.
Taking an image overlay operator as an example, Fig. 2 shows a schematic diagram of a computing core executing the operator in the related art.
As shown in Fig. 2, the image overlay operator overlays image 1 with image 2 to obtain image 3. The input data of the operator comprises image 1 and image 2 and is stored at storage location 1 in the memory of the artificial intelligence chip; the execution result of the operator is stored at storage location 2 in that memory.
Overlaying image 1 with image 2 involves two types of operations: performing the overlay computation on region 1 of image 1 together with image 2 and storing the result at storage location 2; and transferring region 2 of image 1 from storage location 1 to storage location 2. In the related art both types of operations are performed by the computing core, so the load on the computing core is high and the operator executes inefficiently.
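The two operation types can be illustrated with a toy Python model of the overlay. This is a sketch only; the patent does not specify the overlay computation, so pixel-wise addition is assumed, and images are modelled as lists of rows.

```python
def overlay_images(image1, image2, region1_rows):
    """Toy model of the image-overlay operator: the first `region1_rows` rows
    of image1 (region 1) are combined with the corresponding rows of image2
    (here by pixel-wise addition, an assumption), while the remaining rows
    (region 2) are a pure data transfer, copied through unchanged."""
    image3 = []
    for r, row in enumerate(image1):
        if r < region1_rows:
            # overlay computation: combine with the matching row of image2
            image3.append([a + b for a, b in zip(row, image2[r])])
        else:
            # data transfer: region 2 is copied as-is
            image3.append(list(row))
    return image3
```

In the related art both branches above run on the computing core; the method below hands the copy branch to the DMA controller.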
In view of this, the present application provides an operator task execution method for reducing the load of the computing core and improving operator execution efficiency. The method may be implemented by an operator task execution device, which may be an artificial intelligence chip; the following method embodiments take the artificial intelligence chip as the execution subject.
A schematic structural diagram of an artificial intelligence chip suitable for the embodiments of the present application is described below with reference to Fig. 3.
As shown in Fig. 3, the artificial intelligence chip includes a computing core, a direct memory access (DMA) controller, and a memory.
The computing core is used to execute the computing tasks in an operator and may be, for example, a stream processor cluster (SPC).
The DMA controller is configured to execute some or all of the data transfer tasks in an operator.
The memory may be used to store the input data of operators and the execution results of the operators' tasks.
It should be understood that the artificial intelligence chip may also include other modules or components; for example, it may include a control unit, which the application does not limit. The application also does not limit the number of computing cores and DMA controllers in the chip.
The operator task execution method provided by the application is described below with reference to the accompanying drawings.
Fig. 4 is a schematic flow chart of an operator task execution method according to the present application. The method may be applied to an artificial intelligence chip that includes a computing core, a DMA controller and a memory. As shown in Fig. 4, the method comprises the following steps:
Step 401: the computing core executes a first task in the operator, the first task comprising at least one of: a computing task and a first data transfer task.
In the embodiment of the present application, the operator includes the first task and a second data transfer task, and the first task has several possible forms: for example, the first task may be a computing task together with a first data transfer task, or it may be a first data transfer task only.
Step 402: the DMA controller executes a second data transfer task in the operator.
With this method, the computing core executes the first task in the operator while the DMA controller executes the second data transfer task; that is, the DMA controller takes over part or all of the data transfer work in the operator, so the load of the computing core is reduced and operator execution efficiency is improved.
Further, steps 401 and 402 may be executed in parallel, which speeds up the execution of the whole operator.
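The parallelism of steps 401 and 402 can be sketched on a host with two threads. This is only a model with assumed names; on the real chip the computing core and the DMA controller are independent hardware engines, not threads.

```python
import threading

def run_operator(compute_core_task, dma_task):
    """Run the computing core's first task and the DMA controller's second
    data transfer task in parallel (steps 401 and 402), modelled with a
    worker thread; the operator completes only when both tasks finish."""
    worker = threading.Thread(target=dma_task)
    worker.start()          # step 402 starts
    compute_core_task()     # step 401 proceeds concurrently
    worker.join()           # wait for the DMA transfer to complete
```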
Depending on the composition of the first task, there are several possible implementations of the method by which the artificial intelligence chip executes the operator.
In a first embodiment, the first task is a computing task; the operator includes the computing task and a second data transfer task; and the input data of the operator includes first data, associated with the computing task, and third data, associated with the second data transfer task. Step 401 can then be understood as: the computing core performs the computing task on the first data. Step 402 can be understood as: the DMA controller performs the second data transfer task on the third data.
In one possible manner, step 401 may include the following:
S11: the computing core reads the first data from a first storage location of the memory.
The first storage location is the set of storage addresses occupied by the first data in the memory; these addresses may be contiguous or non-contiguous, which the application does not limit. Taking the case where the first data occupies a contiguous range of addresses, the first storage location may be represented as in example 1 or example 2 below:
Example 1: the starting address and the address length occupied by the first data in the memory, where the address length depends on the data size of the first data.
Example 2: the starting address and the ending address occupied by the first data in the memory.
S12: the computing core computes on the first data to obtain a first computation result.
S13: the computing core writes the first computation result to a second storage location of the memory.
The second storage location is the set of storage addresses occupied by the first computation result in the memory; these addresses may likewise be contiguous or non-contiguous. Taking the case where the first computation result occupies a contiguous range of addresses, the second storage location may be represented as in example 3 or example 4 below:
Example 3: the starting address and the address length occupied by the first computation result in the memory, where the address length depends on the data size of the first computation result.
Example 4: the starting address and the ending address occupied by the first computation result in the memory.
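The two address representations in examples 1 to 4 are interconvertible; a small sketch, assuming byte addressing and an inclusive ending address (both assumptions, as the patent leaves the addressing scheme open):

```python
def loc_from_length(start: int, length: int) -> tuple:
    """Convert the (starting address, address length) form (examples 1 and 3)
    to the (starting address, ending address) form (examples 2 and 4)."""
    return start, start + length - 1

def loc_from_range(start: int, end: int) -> int:
    """Recover the address length from an inclusive (start, end) range."""
    return end - start + 1
```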
Step 402 may include the following:
S21: the DMA controller reads the third data from a fifth storage location of the memory.
The fifth storage location is the set of storage addresses occupied by the third data in the memory before the transfer; these addresses may be contiguous or non-contiguous, which the application does not limit. Taking the case of a contiguous range of addresses, the fifth storage location may be represented as in example 5 or example 6 below:
Example 5: the starting address and the address length occupied by the third data before the transfer, where the address length depends on the data size of the third data.
Example 6: the starting address and the ending address occupied by the third data before the transfer.
S22: the DMA controller writes the third data to a sixth storage location of the memory.
The sixth storage location is the set of storage addresses occupied by the third data in the memory after the transfer; these addresses may be contiguous or non-contiguous, which the application does not limit. Taking the case of a contiguous range of addresses, the sixth storage location may be represented as in example 7 or example 8 below:
Example 7: the starting address and the address length occupied by the third data after the transfer, where the address length depends on the data size of the third data.
Example 8: the starting address and the ending address occupied by the third data after the transfer.
Taking the image overlay operator, which overlays two images to obtain an overlaid image, as an example, Fig. 5 shows a schematic diagram of the artificial intelligence chip executing the operator according to an embodiment of the present application.
As shown in Fig. 5, the input data of the image overlay operator includes image 1 and image 2. Image 1 is stored at storage location 1 in the high-bandwidth memory (HBM) of the artificial intelligence chip, and image 2 is stored at storage location 2 in the HBM. The operator includes a computing task and a data transfer task: the computing core performs the computing task on region 1 of image 1 and on image 2, and the DMA controller performs the data transfer task on region 2 of image 1.
The computing core performing the computing task on region 1 of image 1 and image 2 may include: the computing core reads region 1 of image 1 from storage location 1 in the HBM, reads image 2 from storage location 2, performs the overlay computation on the two to obtain an overlay result, and writes the overlay result to storage location 3 in the HBM.
The DMA controller performing the data transfer task on region 2 of image 1 may include: the DMA controller reads region 2 of image 1 from storage location 1 in the HBM and writes it to storage location 3 in the HBM.
The result of the artificial intelligence chip executing the image overlay operator is image 3, stored at storage location 3 in the HBM.
In this first embodiment, when the operator includes a computing task and a data transfer task, the computing core performs the computing task on all the data to be computed and the DMA controller performs the transfer task on all the data to be transferred, so the load of the computing core is reduced to the greatest extent.
In a second embodiment, the first task is a calculation task and a first transport data task, the operator includes a calculation task, a first transport data task and a second transport data task, the input data of the operator includes first data, second data and third data, the first data is associated with the calculation task, the second data is associated with the first transport data task, and the third data is associated with the second transport data task. The above step 401 can be understood as: the computing core performs a computing task on the first data and performs a first handling data task on the second data. The above step 402 can be understood as: the DMA controller performs a second transport data task on the third data.
In one possible manner, the above step 401 may include the following procedure:
S31, the computing core reads the first data from a first storage location of the memory;
Here, for the specific implementation of the first storage location, refer to the related description of S11, which is not repeated here.
S32, the computing core performs computation on the first data to obtain a first calculation result;
S33, the computing core writes the first calculation result to a second storage location of the memory;
Here, for the specific implementation of the second storage location, refer to the related description of S13, which is not repeated here.
S34, the computing core reads the second data from a third storage location of the memory and writes the second data to a fourth storage location of the memory.
The step 402 may include the following:
S21, the DMA controller reads the third data from a fifth storage location of the memory;
Here, for the specific implementation of the fifth storage location, refer to the related description in S21, which is not repeated here.
S22, the DMA controller writes the third data to a sixth storage location of the memory.
Here, for the specific implementation of the sixth storage location, refer to the related description in S22, which is not repeated here.
In one possible implementation, the data amount of the second data is determined according to the first data amount, the first duration taken by the computing core to execute the computing task, the data handling speed of the computing core, and the data handling speed of the DMA controller, where the first data amount is the amount of data in the input data other than the first data, that is, the total amount of data to be handled.
Take the total data amount of the operator's input data as S0 + S, where S is the sum of the data amounts of the second data and the third data, i.e., the first data amount is S. Denote the first duration taken by the computing core to execute the computing task as t0, the data handling speed of the computing core as v1, the data handling speed of the DMA controller as v2, the data amount of the second data as S1, and the data amount of the third data as S2. The second duration taken by the computing core to execute the first task is the sum of the first duration t0 and the duration t1 taken by the computing core to execute the first data handling task; the third duration taken by the DMA controller to execute the second data handling task is t2, where t1 = S1/v1 and t2 = S2/v2.
In one possible implementation, the difference between the second duration taken by the computing core to execute the first task and the third duration taken by the DMA controller to execute the second data handling task is within a preset range. Thus, S1 and S2 conform to the following formula (1):
|S2/v2 − S1/v1 − t0| ≤ x    formula (1)
In formula (1), x is a constant; it can be set to a default value, and its specific value can be set according to actual needs.
In another possible implementation, the second duration taken by the computing core to execute the first task is greater than or equal to the third duration taken by the DMA controller to execute the second data handling task, where the second duration is the sum of t0 and t1, the third duration is t2, t1 = S1/v1, and t2 = S2/v2.
Therefore, S1 and S2 conform to the following formula (2):
t0 + S1/v1 ≥ S2/v2    formula (2)
The values of t0, v1, and v2 in formulas (1) and (2) can be calculated as follows.
For example, the first duration t0 taken by the computing core to execute the computing task is the sum of the following three durations: <1> the time taken to read the first data from the memory, obtained for example by dividing the data amount of the first data by the bandwidth at which the computing core reads data from the memory; <2> the time taken to perform the superposition calculation on the first data; <3> the time taken to write the first calculation result obtained by the superposition calculation to the memory, obtained for example by dividing the data amount of the first calculation result by the bandwidth at which the computing core stores data to the memory.
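The t0 estimate described above can be expressed as a short sketch. The bandwidth figures and data amounts in the usage line are illustrative placeholders, and the compute time <2> is taken as a caller-supplied input since the text does not specify how it is measured.

```python
# Hedged sketch of the t0 estimate: t0 is the sum of the memory-read time,
# the superposition-compute time, and the write-back time.

def estimate_t0(first_data_bytes, result_bytes, read_bw, write_bw, compute_time):
    read_time = first_data_bytes / read_bw    # <1> read first data from memory
    write_time = result_bytes / write_bw      # <3> write first calculation result
    return read_time + compute_time + write_time  # <2> compute time from caller

# e.g. 1 MiB in, 1 MiB out, 100 GB/s each way, 5 us of superposition compute
t0 = estimate_t0(2**20, 2**20, 100e9, 100e9, 5e-6)
```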
For example, the data handling speed v1 of the computing core may be obtained as follows: the time the computing core takes to handle the second data is the time to read it (the data amount of the second data divided by the bandwidth at which the computing core reads data from the memory) plus the time to write it back (the data amount of the second data divided by the bandwidth at which the computing core stores data to the memory); v1 is then the data amount of the second data divided by this total time.
Similarly, the data handling speed v2 of the DMA controller may be obtained as follows: the time the DMA controller takes to handle the third data is the time to read it (the data amount of the third data divided by the bandwidth at which the DMA controller reads data from the memory) plus the time to write it back (the data amount of the third data divided by the bandwidth at which the DMA controller stores data to the memory); v2 is then the data amount of the third data divided by this total time.
After the values of t0, v1, and v2 are calculated, the relation between S1 and S2 can be obtained from formula (1) or formula (2), so the minimum value of S1 and the maximum value of S2 can be derived; the data amount of the second data is then set to the minimum value of S1, and the data amount of the third data to the maximum value of S2. In this way, of all the data to be handled, the portion with data amount S1 (the minimum) is assigned to the computing core, and the portion with data amount S2 (the maximum) is assigned to the DMA controller.
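The split implied by formula (2) has a closed form. With S1 + S2 = S, the condition t0 + S1/v1 ≥ S2/v2 rearranges to S1 ≥ (S/v2 − t0) / (1/v1 + 1/v2), so the minimum S1 (and hence maximum S2) can be computed directly. The sketch below uses our own variable names; the clamp at zero covers the case where t0 alone already exceeds the DMA controller's handling time.

```python
# Sketch of the split from formula (2): minimise the computing core's handling
# share S1 subject to t0 + S1/v1 >= S2/v2 and S1 + S2 = S.

def split_handled_data(S, t0, v1, v2):
    s1_min = max(0.0, (S / v2 - t0) / (1.0 / v1 + 1.0 / v2))
    return s1_min, S - s1_min  # (share for computing core, share for DMA)

s1, s2 = split_handled_data(S=1000, t0=0.0, v1=1.0, v2=1.0)
# With equal speeds and no compute time, the handled data splits evenly.
```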
Taking an image superposition operator, which superimposes two images to obtain a superimposed image, as an example, fig. 6 shows a schematic diagram of the artificial intelligence chip executing the image superposition operator according to an embodiment of the application.
As shown in fig. 6, the input data of the image superposition operator includes image 1 and image 2; image 1 is stored in storage location 1 in the HBM within the artificial intelligence chip, and image 2 is stored in storage location 2 in the HBM. The image superposition operator includes a computing task, data handling task 1, and data handling task 2: the computing core in the artificial intelligence chip performs the computing task on region 1 of image 1 and image 2, the computing core performs data handling task 1 on region 21 of image 1, and the DMA controller performs data handling task 2 on region 22 of image 1.
The computing core performing the computing task on region 1 of image 1 and image 2 may include: the computing core reads region 1 of image 1 from storage location 1 in the HBM and reads image 2 from storage location 2 in the HBM, then performs superposition calculation on region 1 of image 1 and image 2 to obtain a superposition result, and writes the superposition result to storage location 3 in the HBM.
The computing core performing data handling task 1 on region 21 of image 1 may include: the computing core reads region 21 of image 1 from storage location 1 in the HBM and then writes region 21 of image 1 to storage location 3 in the HBM.
The DMA controller performing data handling task 2 on region 22 of image 1 may include: the DMA controller reads region 22 of image 1 from storage location 1 in the HBM and then writes region 22 of image 1 to storage location 3 in the HBM.
The result of the artificial intelligence chip performing the image overlay operator is image 3, stored in storage location 3 in the HBM.
With the second embodiment, when the operator includes both a computing task and a data handling task, the computing core in the artificial intelligence chip performs the computing task on all the data to be computed and performs the handling task on one portion of the data to be handled, while the DMA controller performs the data handling task on the other portion, so the load of the computing core is reduced and the operator execution efficiency is improved.
In a third embodiment, the first task is a first data handling task; the input data of the operator includes second data and third data, where the second data is associated with the first data handling task and the third data with the second data handling task. Step 401 above can be understood as: the computing core performs the first data handling task on the second data. Step 402 above can be understood as: the DMA controller performs the second data handling task on the third data.
In one possible manner, the above step 401 may include the following procedure:
S41, the computing core reads second data from a third storage position of the memory;
The third storage location is a set of storage addresses occupied in the memory before the second data is carried, and the set of storage addresses occupied in the memory before the second data is carried may be continuous storage addresses or discontinuous storage addresses. Taking the example of occupying a consecutive plurality of memory addresses in the memory prior to the second data handling, the third memory location may be represented by either example 9 or example 10 as follows:
example 9, the starting address and address length occupied in the memory before the second data is handled, wherein the address length is related to a data size of the second data.
Example 10, the start address and end address occupied in the memory before the second data is handled.
S42, the computing core writes second data into a fourth storage position of the memory.
The fourth storage location is a set of storage addresses occupied in the memory after the second data is carried, and the set of storage addresses occupied in the memory after the second data is carried may be continuous storage addresses or discontinuous storage addresses. Taking the example of occupying a consecutive plurality of memory addresses in the memory after the second data is handled, the fourth memory location may be represented by either example 11 or example 12 as follows:
Example 11, the starting address and address length occupied in the memory after the second data is handled, wherein the address length is related to a data size of the second data.
Example 12, the start address and the end address occupied in memory after the second data is carried.
The step 402 may include the following:
S21, the DMA controller reads the third data from a fifth storage location of the memory;
Here, the specific implementation of the fifth storage location may be referred to the related description in S21, which is not repeated herein.
S22, the DMA controller writes third data into a sixth storage position of the memory.
Here, the specific implementation of the sixth storage location may be referred to the description related to S22, which is not repeated here.
In one possible implementation, the data amount of the second data is determined according to the data amount of the operator's input data, the data handling speed of the computing core, and the data handling speed of the DMA controller. In one manner, the difference between the second duration taken by the computing core to execute the first data handling task and the third duration taken by the DMA controller to execute the second data handling task is less than a certain threshold. For example, the performance of the artificial intelligence chip is optimal when the two durations are equal.
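For this handling-only case, equalising the two durations S1/v1 = S2/v2 under S1 + S2 = S gives S1 = S·v1/(v1 + v2). The sketch below expresses that balance condition with our own variable names.

```python
# Sketch of the third embodiment's balanced split: the computing core and the
# DMA controller finish their handling shares at the same time.

def balance_handling(S, v1, v2):
    s1 = S * v1 / (v1 + v2)   # computing core's share
    return s1, S - s1         # DMA controller's share

s1, s2 = balance_handling(S=300, v1=1.0, v2=2.0)  # s1=100.0, s2=200.0
```

A DMA controller twice as fast as the computing core receives twice the data, so both sides take 100 time units here.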
With the third embodiment, when the operator includes only data handling tasks, the computing core in the artificial intelligence chip performs the handling task on one part of the data to be handled and the DMA controller performs the handling task on the other part, so the load of the computing core is reduced and the operator execution efficiency is improved.
In the embodiment of the application, the task division of the operator and the allocation of hardware resources to the operator's tasks can be realized in various ways. In one possible implementation, they are realized by the artificial intelligence chip itself; for example, the artificial intelligence chip further comprises a control unit.
Before the computing core executes the first task in the operator and the DMA controller executes the second data handling task in the operator, the control unit divides the operator into the first task and the second data handling task, allocates the first task to the computing core for processing, and allocates the second data handling task to the DMA controller for processing. In this way, the chip allocates hardware resources for executing operator tasks by itself.
In another possible implementation, the central processing unit performs the task division of the operator and allocates hardware resources to the operator's tasks.
Before the computing core executes the first task in the operator, the computing core receives a task allocation result sent by the central processing unit, where the task allocation result includes the first task and the second data handling task obtained by dividing the operator, the processing resource allocated to the first task is the computing core, and the processing resource allocated to the second data handling task is the DMA controller.
Before the DMA controller executes the second data handling task in the operator, the DMA controller receives the task allocation result sent by the central processing unit. In this way, the artificial intelligence chip executes each task of the operator according to the task allocation result from the central processing unit, which reduces the performance overhead of the artificial intelligence chip and improves operator performance.
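A task allocation result of the kind described could take the following shape. This is a hypothetical data layout with illustrative field names, not a format defined by the patent.

```python
# Hypothetical shape of a task-allocation result produced by the CPU and
# consumed by the chip: the divided tasks plus the resource assigned to each.
from dataclasses import dataclass

@dataclass
class TaskAllocation:
    first_task: str                              # e.g. "compute + handle(region 21)"
    second_handling_task: str                    # e.g. "handle(region 22)"
    first_task_resource: str = "computing_core"  # processing resource for task 1
    second_task_resource: str = "dma_controller" # processing resource for task 2

alloc = TaskAllocation(first_task="compute+handle",
                       second_handling_task="handle")
```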
Based on the same technical conception, the embodiment of the application provides an artificial intelligent chip which comprises a computing core and a DMA controller; wherein:
A computing core for: executing a first task in the operator, the first task comprising at least one of: a computing task and a first data carrying task.
A DMA controller for: and executing a second data carrying task in the operator.
In one possible implementation, the first task is a computing task, the input data of the operator includes first data, the first data is associated with the computing task, and the artificial intelligence chip further includes a memory; a computing core, in particular for: reading first data from a first storage location of a memory; calculating the first data to obtain a first calculation result; the first calculation result is written to a second storage location of the memory.
In one possible implementation, the first task is a computing task and a first handling data task, the input data of the operator includes first data and second data, the first data is associated with the computing task, the second data is associated with the first handling data task, and the artificial intelligence chip further includes a memory; a computing core, in particular for: reading first data from a first storage position of a memory, calculating the first data to obtain a first calculation result, and writing the first calculation result into a second storage position of the memory; the second data is read from the third storage location of the memory and written to the fourth storage location of the memory.
In one possible implementation manner, the data amount of the second data is determined according to the first data amount, the first duration occupied by the computing core to execute the computing task, the data carrying speed of the computing core, and the data carrying speed of the DMA controller, where the first data amount is a data amount corresponding to data except the first data in the input data.
In one possible implementation, the first task is a first handling data task, the input data of the operator includes second data, the second data is associated with the first handling data task, and the artificial intelligence chip further includes a memory; a computing core, in particular for: the second data is read from a third storage location of the memory and written to a fourth storage location of the memory.
In one possible implementation, the data amount of the second data is determined according to the data amount of the input data of the operator, the data transfer speed of the computation core, and the data transfer speed of the DMA controller.
In one possible implementation, the input data of the operator includes third data, the third data being associated with a second handling data task; the DMA controller is specifically configured to: the DMA controller reads the third data from the fifth memory location of the memory and writes the third data to the sixth memory location of the memory.
In one possible implementation, the difference between the second time period occupied by the computing core executing the first task and the third time period occupied by the DMA controller executing the second transfer data task is within a preset range.
In one possible implementation, the artificial intelligence chip further includes a control unit for: dividing an operator into a first task and a second data carrying task; the first task is allocated to the computing core processing, and the second transport data task is allocated to the DMA controller processing.
In one possible implementation, the computing core is further configured to: receiving a task allocation result sent by a central processing unit, wherein the task allocation result comprises a first task and a second carrying data task which are obtained after operators are divided, processing resources allocated to the first task are calculation cores, and processing resources allocated to the second carrying data task are DMA controllers; DMA controller, further for: and receiving a task allocation result sent by the central processing unit.
Based on the same inventive concept, the application provides an electronic device comprising a processor and a memory, wherein the memory stores program instructions; the processor executes the program instructions in the memory to implement the steps of the operator execution method described above.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus (device), system, chip, computer-readable storage medium, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, all generally referred to herein as a "module" or "system".
The present application is described with reference to flowcharts and/or block diagrams of the method, apparatus (device), or system according to the application. It should be understood that each flow in the flowcharts and/or each block in the block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions.
These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, so that a series of operational steps are performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the application has been described in connection with specific features and embodiments thereof, it is apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the application as defined by the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application.
Claims (16)
1. An operator task execution method, applied to an artificial intelligence chip, the artificial intelligence chip comprising a computing core and a direct memory access DMA controller, the method comprising:
The computing core performs a first task in an operator, the first task comprising at least one of: a computing task and a first carrying data task;
The DMA controller performs a second data handling task in the operator.
2. The method of claim 1, wherein the first task is a computing task, the input data of the operator comprises first data, the first data is associated with the computing task, and the artificial intelligence chip further comprises a memory;
The computing core performs a first task in an operator, comprising:
the computing core reading the first data from a first storage location of the memory;
The computing core calculates the first data to obtain a first calculation result;
the computing core writes the first computation result to a second storage location of the memory.
3. The method of claim 1, wherein the first task is a computing task and a first handling data task, the operator input data comprises first data and second data, the first data is associated with the computing task, the second data is associated with the first handling data task, the artificial intelligence chip further comprises a memory;
The computing core performs a first task in an operator, comprising:
The computing core reads the first data from a first storage position of the memory, calculates the first data to obtain a first calculation result, and writes the first calculation result into a second storage position of the memory;
The computing core reads the second data from a third storage location of the memory and writes the second data to a fourth storage location of the memory.
4. The method of claim 3, wherein the second data amount is determined according to a first data amount, which is a data amount corresponding to data other than the first data in the input data, a first time period occupied by the computing core to perform the computing task, a data transfer speed of the computing core, and a data transfer speed of the DMA controller.
5. The method of claim 1, wherein the first task is a first transport data task, the operator input data comprises second data, the second data is associated with the first transport data task, and the artificial intelligence chip further comprises a memory;
The computing core performs a first task in an operator, comprising:
The computing core reads the second data from a third storage location of the memory and writes the second data to a fourth storage location of the memory.
6. The method of claim 5, wherein the amount of data of the second data is determined based on the amount of data of the operator's input data, the data transfer speed of the computing core, and the data transfer speed of the DMA controller.
7. The method of claim 1, wherein the operator input data includes third data, the third data being associated with the second handling data task; the artificial intelligence chip further comprises a memory;
The DMA controller performs a second data handling task in the operator comprising:
the DMA controller reads the third data from a fifth storage location of the memory and writes the third data to a sixth storage location of the memory.
8. The method of any of claims 1-7, wherein a difference between a second time period taken by the computing core to perform the first task and a third time period taken by the DMA controller to perform the second data-handling task is within a preset range.
9. The method of any of claims 1-7, wherein the artificial intelligence chip further comprises a control unit, the computing core performing a first task in an operator and the DMA controller performing a second data handling task in the operator, further comprising:
the control unit divides the operator into the first task and the second data carrying task;
The control unit distributes the first task to the computing core for processing, and distributes the second carrying data task to the DMA controller for processing.
10. The method of any of claims 1-7, wherein the computing core performs a first task in an operator and the DMA controller performs a second data handling task in the operator prior to further comprising:
The computing core receives a task allocation result sent by a central processing unit, wherein the task allocation result comprises a first task and a second carrying data task which are obtained after the operator is divided, the processing resources allocated to the first task are the computing core, and the processing resources allocated to the second carrying data task are the DMA controller;
And the DMA controller receives the task allocation result sent by the central processing unit.
11. An artificial intelligent chip is characterized by comprising a computing core and a direct memory access DMA controller;
the computing core is configured to perform a first task in an operator, where the first task includes at least one of: a computing task and a first carrying data task;
the DMA controller is used for executing a second data carrying task in the operator.
12. The artificial intelligence chip of claim 11, wherein the first task is a computing task, the input data for the operator comprises first data, the first data associated with the computing task, the artificial intelligence chip further comprising a memory;
The computing core is specifically configured to:
Reading the first data from a first storage location of the memory;
Calculating the first data to obtain a first calculation result;
writing the first calculation result to a second storage location of the memory.
13. The artificial intelligence chip of claim 11, wherein the first task is a computing task and a first handling data task, the operator input data includes first data and second data, the first data is associated with the computing task, the second data is associated with the first handling data task, the artificial intelligence chip further comprising a memory;
The computing core is specifically configured to:
Reading the first data from a first storage position of the memory, calculating the first data to obtain a first calculation result, and writing the first calculation result into a second storage position of the memory;
The second data is read from a third storage location of the memory and written to a fourth storage location of the memory.
14. The artificial intelligence chip of claim 11, wherein the first task is a first transport data task, the operator input data comprises second data, the second data associated with the first transport data task, the artificial intelligence chip further comprising a memory;
The computing core is specifically configured to:
The second data is read from a third storage location of the memory and written to a fourth storage location of the memory.
15. The artificial intelligence chip of any one of claims 12-14, wherein the operator input data includes third data, the third data being associated with the second handling data task;
The DMA controller is specifically configured to: the third data is read from a fifth storage location of the memory and written to a sixth storage location of the memory.
16. An electronic device comprising a processor and a memory, the memory having program instructions stored thereon; execution of program instructions in the memory by the processor implements the steps of the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410295023.3A CN117971496A (en) | 2024-03-14 | 2024-03-14 | Operator task execution method, artificial intelligent chip and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117971496A true CN117971496A (en) | 2024-05-03 |
Family
ID=90849816
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||