CN110109861A - A kind of task executing method and device - Google Patents
- Authority: CN (China)
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
Abstract
The invention discloses a task execution method and a related apparatus. A computer device obtains a parallel argument table and thereby implements the function of a software scheduler, so that software programs running on intelligent hardware with different hardware architectures can be compatible.
Description
Technical field
The present invention relates to the field of computers, and in particular to a task execution method and device.
Background technique
With the development of technology, intelligent hardware has brought great convenience to people's work and life. But as technology continues to evolve, the hardware architecture of intelligent hardware keeps changing; for example, a new generation of intelligent hardware may add at least one functional module on top of the previous generation. Because the hardware architecture of the new generation differs from that of the old generation, software programs cannot run compatibly across the two generations. How to enable old and new generations of intelligent hardware to run the same software program has therefore become a research hotspot.
Summary of the invention
Embodiments of the present invention provide a task execution method and device, so that software programs running on intelligent hardware with different hardware architectures can be compatible.

The function of a hardware scheduler is implemented in software: a block of private memory space is allocated for each kernel in the artificial intelligence processing device to store that kernel's parallel argument table, so that multiple kernels can execute the same instruction simultaneously. This realizes SIMT (single instruction, multiple threads) task processing, i.e. SIMT programming and compilation.
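The SIMT idea above can be illustrated with a small sketch: every kernel runs the same "instruction", but reads its own private parallel argument table to decide which slice of the data is its own. All names here (`ParallelArgTable`, `run_kernel`) are illustrative, not APIs from the patent.

```python
from dataclasses import dataclass

@dataclass
class ParallelArgTable:
    taskid: int       # which task this kernel belongs to
    coreid: int       # this kernel's index within its cluster
    clusterid: int    # which arithmetic core cluster it sits in
    coreDim: int      # kernels per cluster
    clusterDim: int   # number of clusters occupied by the task

def run_kernel(table: ParallelArgTable, data: list) -> list:
    """The same 'instruction' runs on every kernel; the private table
    tells each kernel which elements it should process."""
    stride = table.clusterDim * table.coreDim
    start = table.clusterid * table.coreDim + table.coreid
    return [x * 2 for x in data[start::stride]]  # toy per-element operation

# Simulate 2 clusters x 2 kernels all executing the same instruction:
data = list(range(8))
results = []
for cid in range(2):
    for k in range(2):
        t = ParallelArgTable(taskid=0, coreid=k, clusterid=cid,
                             coreDim=2, clusterDim=2)
        results.append(run_kernel(t, data))
```

Because each table differs only in `clusterid`/`coreid`, the four kernels together cover the whole input without overlap, which is the point of giving each kernel a private table.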
In a first aspect, an embodiment of the present invention provides a task execution method applied to a task execution system that includes a general-purpose computer device and an artificial intelligence processing device. The method comprises:

determining the task type, task scale and processing instruction of a target task according to a processing request of a user, the processing request including data to be processed;

determining parallel variable information of the target task according to the task scale and the task type, the parallel variable information of the target task being used to instruct the kernels in M arithmetic core clusters to execute the processing instruction;

if kernels in M arithmetic core clusters in the artificial intelligence processing device are in an idle state, transmitting the parallel variable information, the processing instruction and the data to be processed into the memory of the artificial intelligence processing device.
In a feasible embodiment, if the kernels in M arithmetic core clusters in the artificial intelligence processing device are all in an idle state, then before the parallel variable information is transmitted into the private memory space of each kernel in the M idle arithmetic core clusters, the method further includes:

querying whether there are kernels in M arithmetic core clusters in the artificial intelligence processing device that are in an idle state.
In a feasible embodiment, the task scale includes the scale of sub-tasks in at least one dimension, the parallel variable information includes k parallel argument tables, and k is determined according to the scale of the sub-tasks in the at least one dimension.
In a feasible embodiment, transmitting the parallel variable information into the private memory space of each kernel in the M idle arithmetic core clusters comprises:

transmitting the parallel variable information, over a PCIE bus, into the private memory space of each kernel in the M idle arithmetic core clusters;

and transmitting the processing instruction and the data to be processed into the memory of the artificial intelligence processing device comprises:

transmitting the processing instruction and the data to be processed, over the PCIE bus, into the memory of the artificial intelligence processing device.
In a feasible embodiment, the operation domain of the processing instruction includes a data write address, and the method further includes:

receiving the processing result of the artificial intelligence processing device, and writing the processing result into the memory space corresponding to the data write address.
In a feasible embodiment, the parallel variable information includes a parallel argument table, which includes at least one of: a task identifier, the task scale, an identifier of an arithmetic core cluster, the number of arithmetic core clusters, the number of kernels in an arithmetic core cluster, and an identifier of a kernel within an arithmetic core cluster.
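The fields just listed can be gathered into one illustrative table; the field names are transliterations of the variables used later in this description, not a real API.

```python
def make_parallel_arg_table(taskid, taskDim, clusterid, clusterDim, coreDim, coreid):
    """Assemble one parallel argument table as a plain dict (illustrative layout)."""
    return {
        "taskid": taskid,          # task identifier
        "taskDim": taskDim,        # task scale, e.g. (X, Y, Z)
        "clusterid": clusterid,    # id of this arithmetic core cluster
        "clusterDim": clusterDim,  # number of arithmetic core clusters occupied
        "coreDim": coreDim,        # number of kernels per cluster
        "coreid": coreid,          # id of the kernel within its cluster
    }

table = make_parallel_arg_table(taskid=7, taskDim=(4, 1, 1), clusterid=0,
                                clusterDim=2, coreDim=4, coreid=3)
```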
In a second aspect, an embodiment of the present invention provides a task execution method applied to a task execution system that includes a general-purpose computer device and an artificial intelligence processing device, the artificial intelligence processing device including at least one arithmetic core cluster, each arithmetic core cluster including at least one kernel. The method comprises:

receiving task information sent by the general-purpose computer device, the task information including parallel variable information, data to be processed and a processing instruction; the parallel variable information being used to instruct the kernels in M arithmetic core clusters to execute the processing instruction, the M arithmetic core clusters being the arithmetic core clusters that need to be occupied when the processing instruction is executed on the data to be processed;

saving the data to be processed and the parallel argument tables into the private memory spaces corresponding to the kernels in the M arithmetic core clusters;

loading the parallel variable information in the private memory space corresponding to each kernel in the M arithmetic core clusters into the on-chip memory corresponding to that kernel, and processing, according to the processing instruction, the data determined based on the parallel variable information and the data write address, to obtain a processing result.
In a feasible embodiment, the task information further includes a data length, and processing the data determined based on the parallel variable information and the data write address according to the processing instruction to obtain a processing result comprises:

processing, according to the processing instruction, the data determined based on the parallel variable information, the data write address and the data length, to obtain the processing result.
In a feasible embodiment, reading the parallel variable information in the private memory space corresponding to each kernel into the on-chip memory of the artificial intelligence processing device comprises:

reading, by executing a preload instruction, the parallel argument table in the private memory space corresponding to each kernel into the on-chip memory of the artificial intelligence processing device.
In a feasible embodiment, the method further includes:

during power-on or reset of the artificial intelligence processing device, allocating, from a reserved space of the memory of the artificial intelligence processing device, a private memory space for each kernel in the M arithmetic core clusters, wherein the kernels and the private memory spaces are arranged in one-to-one correspondence.
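A minimal sketch of that allocation step: at power-on or reset, carve one private slot per kernel out of a reserved region of device memory, one-to-one. The flat address layout and slot size are assumptions for illustration.

```python
def allocate_private_spaces(reserved_base, slot_size, clusters, cores_per_cluster):
    """Return {(clusterid, coreid): base_address} so that each kernel gets
    exactly one private space inside the reserved region."""
    table = {}
    addr = reserved_base
    for c in range(clusters):
        for k in range(cores_per_cluster):
            table[(c, k)] = addr   # one slot per kernel, in address order
            addr += slot_size
    return table

# 2 clusters x 4 kernels, 256-byte slots starting at a reserved base address:
spaces = allocate_private_spaces(0x8000, 0x100, clusters=2, cores_per_cluster=4)
```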
In a feasible embodiment, the task information further includes a data output address, and the method further includes:

after each kernel in the M arithmetic core clusters has finished executing the processing instruction, transmitting the obtained processing result into the memory space corresponding to the data output address.
In a third aspect, an embodiment of the present invention provides a general-purpose computer device applied to a task execution system that further includes an artificial intelligence processing device. The general-purpose computer device includes:

a determination unit, configured to determine the task type, task scale and processing instruction of a target task according to a processing request of a user, the processing request including data to be processed; and to determine, according to the task scale and the task type, parallel variable information of the target task, the parallel variable information being used to instruct the kernels in M arithmetic core clusters to execute the processing instruction;

a transmission unit, configured to, if kernels in M arithmetic core clusters in the artificial intelligence processing device are in an idle state, transmit the parallel variable information, the processing instruction and the data to be processed into the memory of the artificial intelligence processing device.
In a feasible embodiment, the general-purpose computer device further includes:

a query unit, configured to, if kernels in M arithmetic core clusters in the artificial intelligence processing device are in an idle state, query whether there are kernels in M arithmetic core clusters in the artificial intelligence processing device that are in an idle state before the parallel variable information, the data to be processed and the processing instruction are transmitted into the memory of the artificial intelligence processing device.
In a feasible embodiment, the task scale includes the scale of sub-tasks in at least one dimension, the parallel variable information includes k parallel argument tables, and k is determined according to the scale of the sub-tasks in the at least one dimension.
In a feasible embodiment, in respect of transmitting the parallel variable information into the private memory space of each kernel in the M idle arithmetic core clusters, the determination unit is specifically configured to:

transmit the parallel variable information, over a PCIE bus, into the private memory space of each kernel in the M idle arithmetic core clusters;

and transmitting the processing instruction and the data to be processed into the memory of the artificial intelligence processing device comprises:

transmitting the processing instruction and the data to be processed, over the PCIE bus, into the memory of the artificial intelligence processing device.
In a feasible embodiment, the operation domain of the processing instruction includes a data write address, and the general-purpose computer device further includes:

a receiving unit, configured to receive the processing result of the artificial intelligence processing device, and write the processing result into the memory space corresponding to the data write address.
In a feasible embodiment, the parallel variable information includes a parallel argument table, which includes at least one of: a task identifier, the task scale, an identifier of an arithmetic core cluster, the number of arithmetic core clusters, the number of kernels in an arithmetic core cluster, and an identifier of a kernel within an arithmetic core cluster.
In a fourth aspect, an embodiment of the present invention provides an artificial intelligence processing device applied to a task execution system that further includes a general-purpose computer device, the artificial intelligence processing device including at least one arithmetic core cluster, each arithmetic core cluster including at least one kernel. The artificial intelligence processing device includes:

a receiving unit, configured to receive task information sent by the general-purpose computer device, the task information including a parallel argument table, data to be processed, a data input address and a processing instruction; the parallel argument table being used to instruct the kernels in M arithmetic core clusters to execute the processing instruction, the M arithmetic core clusters being the arithmetic core clusters that need to be occupied when the processing instruction is executed on the data to be processed;

a storage unit, configured to save the parallel argument table and the data to be processed into the private memory spaces corresponding to the kernels in the M arithmetic core clusters;

a processing unit, configured to read the parallel argument table in the private memory space corresponding to each kernel in the M arithmetic core clusters into the on-chip memory corresponding to that kernel, and to process, according to the processing instruction, the data determined based on the parallel argument table and the data write address, to obtain a processing result.
In a feasible embodiment, the task information further includes a data length, and in respect of processing the data determined based on the parallel argument table and the data write address according to the processing instruction to obtain a processing result, the processing unit is specifically configured to:

process, according to the processing instruction, the data determined based on the parallel argument table, the data write address and the data length, to obtain the processing result.
In a feasible embodiment, in respect of reading the parallel argument table in the private memory space corresponding to each kernel into the on-chip memory of the artificial intelligence processing device, the processing unit is specifically configured to:

read, by executing a preload instruction, the parallel argument table in the private memory space corresponding to each kernel into the on-chip memory of the artificial intelligence processing device.
In a feasible embodiment, the artificial intelligence processing device further includes:

an allocation unit, configured to, during power-on or reset of the artificial intelligence processing device, allocate, from a reserved space of the memory of the artificial intelligence processing device, a private memory space for each kernel in the M arithmetic core clusters, wherein the kernels and the private memory spaces are arranged in one-to-one correspondence.
In a feasible embodiment, the task information further includes a data output address, and the storage unit is further configured to:

after each kernel in the M arithmetic core clusters has finished executing the processing instruction, save the obtained processing result into the memory space corresponding to the data output address.
In a fifth aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for performing the method of the first aspect or the second aspect.
It can be seen that, in the solutions of the embodiments of the present invention, after the general-purpose computer device receives a processing request, it determines the task type, task scale and processing instruction of the target task according to the processing request, and determines the parallel variable information (i.e. the parallel argument tables) of the target task according to the task scale and the task type, so that the artificial intelligence processing device is able to execute the target task based on the parallel variable information. Because the parallel variable information can be applied to artificial intelligence processing devices with different hardware architectures, old intelligent hardware can execute/run the same tasks/programs that execute/run on new intelligent hardware.

Further, by allocating a block of private memory space for each kernel in the artificial intelligence processing device to store its parallel argument table, the kernels in an arithmetic core cluster can each correctly read their own data to be processed while executing the same instruction at the same time, realizing SIMT task processing. Before a kernel processes an instruction, the parallel argument table in its private memory space is read into the on-chip NRAM, which improves subsequent processing efficiency. By receiving the parallel argument tables sent by the general-purpose computer device, the artificial intelligence processing device in effect has the function of a hardware scheduler implemented for it by the general-purpose computer device; the artificial intelligence processing device can therefore execute the target task based on the parallel argument tables even without a hardware scheduler, and the functional modules of a new generation of intelligent hardware are realized through software programming, so that the software programs of old and new generations of intelligent hardware can be compatible.
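The host-side flow summarized above can be sketched end to end: the general-purpose computer device plays the role of the hardware scheduler by choosing a cluster count, building one argument table per sub-task, checking for idle kernels, and handing everything to the device. `FakeDevice`, `query_idle` and `transfer` are stand-ins for the device and its PCIE operations, not a real driver API.

```python
class FakeDevice:
    """Stand-in for the artificial intelligence processing device; transfer()
    and run() abstract the PCIE copy and the kernels' execution."""
    def query_idle(self, clusters):
        return True                        # pretend enough clusters are idle
    def transfer(self, tables, data):
        self.tables, self.data = tables, data
    def run(self):
        return [x + 1 for x in self.data]  # toy "processing instruction"

def dispatch_task(task_type, task_scale, data, device):
    # Software-scheduler steps: pick cluster count from the task type,
    # build per-sub-task argument tables, check idle state, then transfer.
    clusters = {"BLOCK": 1, "UNION1": 1, "UNION2": 2,
                "UNION4": 4, "UNION8": 8}[task_type]
    k = task_scale[0] * task_scale[1] * task_scale[2]
    tables = [{"taskid": i, "taskDim": task_scale} for i in range(k)]
    if not device.query_idle(clusters):
        return None                        # no idle clusters: do not dispatch
    device.transfer(tables, data)          # stands in for the PCIE transfers
    return device.run()

device = FakeDevice()
result = dispatch_task("UNION2", (2, 1, 1), [1, 2, 3], device)
```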
These and other aspects of the invention will be more readily apparent from the following description.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1a is a schematic structural diagram of a task execution system provided by an embodiment of the present invention;

Fig. 1b is a schematic structural diagram of a task execution system provided by an embodiment of the present invention;

Fig. 1c is a schematic flowchart of a task execution method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a task execution method provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of another task execution method provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a general-purpose computer device provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an artificial intelligence processing device provided by an embodiment of the present invention.
Specific embodiment
The embodiments of the present application are described in detail below with reference to the drawings.
Referring to Fig. 1a, Fig. 1a is a schematic structural diagram of a task execution system provided by an embodiment of the present invention. As shown in Fig. 1a, the task execution system includes a general-purpose computer device and an artificial intelligence processing device. The general-purpose computer device 100 includes a general-purpose processor 101, a memory 102 and a PCIE interface 103. The general-purpose processor 101 is connected to the memory 102 and the PCIE interface respectively, and the memory 102 is connected to the PCIE interface. Optionally, the general-purpose computer device may be a server or a local computer device, and the general-purpose processor 101 may be a CPU or the like.

The artificial intelligence processing device 200 includes S arithmetic core clusters 201, a network-on-chip 202, a PCIE interface 203 and a memory 204, where S is an integer greater than 0. Each of the S arithmetic core clusters includes N kernels, N being an integer greater than 0. Optionally, each kernel may be an artificial intelligence processor. In the embodiments of the present application, more than one kernel may form an arithmetic core cluster (Cluster). For example, each arithmetic core cluster may include 4 kernels, or each arithmetic core cluster may include 8 kernels.

The S arithmetic core clusters are connected, through the network-on-chip 202, to the PCIE interface 203, the memory 204 and the on-chip memory 205 respectively. The network-on-chip 202 enables data exchange among the S arithmetic core clusters, and between the S arithmetic core clusters and the PCIE interface 203, the memory 204 and the on-chip memory 205. The PCIE interface 203 is connected to the memory 204, and the PCIE interface 203 enables data communication between the artificial intelligence processing device and the general-purpose computer device 100.

Optionally, the memory 102 of the general-purpose computer device and the memory 204 of the artificial intelligence processing device may be Double Data Rate (DDR) memories. Optionally, the on-chip memory 205 may be an on-chip cache. Further, the on-chip cache is a neural random access memory (NRAM).
Referring to Fig. 1b, Fig. 1b is a schematic structural diagram of a task execution system provided by an embodiment of the present invention. As shown in Fig. 1b, the task execution system includes a general-purpose computer device and an artificial intelligence processing device. The general-purpose computer device 100 includes a general-purpose processor 101, a memory 102 and a PCIE interface 103. The general-purpose processor 101 is connected to the memory 102 and the PCIE interface respectively, and the memory 102 is connected to the PCIE interface. Optionally, the general-purpose computer device may be a server or a local computer device, and the general-purpose processor 101 may be a CPU or the like.

The artificial intelligence processing device 300 includes Q arithmetic core clusters 301, a network-on-chip 302, a PCIE interface 303, a memory 304 and a task scheduler 305, where Q is an integer greater than 0. Each of the Q arithmetic core clusters includes P kernels, P being an integer greater than 0. Optionally, each kernel may be an artificial intelligence processor. In the embodiments of the present application, more than one kernel may form an arithmetic core cluster (Cluster). For example, each arithmetic core cluster may include 4 kernels, or each arithmetic core cluster may include 8 kernels.

The Q arithmetic core clusters are connected, through the network-on-chip 302, to the task scheduler 305, the PCIE interface 303 and the memory 304 respectively. The network-on-chip 302 enables data exchange among the Q arithmetic core clusters, and between the Q arithmetic core clusters and the PCIE interface 303, the memory 304 and the task scheduler 305. The PCIE interface 303 is connected to the memory 304, and the PCIE interface 303 enables data communication between the artificial intelligence processing device and the general-purpose computer device 100. The task scheduler 305 can obtain the target task and the task processing request sent by the general-purpose computer device, and split the target task according to the task processing request to obtain sub-tasks that can run on at least one kernel. Specifically, the task scheduler 305 can split the target task in at least one dimension, according to the task type and the task scale input by the user, to obtain sub-tasks that can run on at least one kernel. Still further optionally, the task scheduler 305 can also schedule and manage operations such as the delivery and execution of each sub-task.

Optionally, the memory 102 of the general-purpose computer device and the memory 304 of the artificial intelligence processing device may be Double Data Rate (DDR) memories. Optionally, each kernel may include an on-chip memory and a special register, where the on-chip memory may be an on-chip cache. Further, the on-chip cache is a neural random access memory (NRAM).

It should be pointed out that the task scheduler 305 is an independent piece of hardware in the artificial intelligence processing device, so in the present invention the task scheduler 305 may also be called a hardware scheduler.
Based on the above task execution system, as shown in Fig. 1c, the present application provides a task execution method that is applicable to the task execution system of any of the above embodiments. Specifically, the method includes:

S101: the general-purpose computer device determines the task type, task scale and processing instruction of a target task according to a task processing request, the task processing request including data to be processed.

Specifically, the general-purpose computer device can determine task information such as the task type, task scale and processing instruction of the target task according to the task processing request. The task processing request can be input by a user. Optionally, the task processing request may include task information such as the task type and task scale of the target task, and this task information may also be input by the user.

S102: the general-purpose computer device determines the parallel variable information of the target task according to the task scale and the task type, the parallel variable information being applicable to artificial intelligence processing devices of different versions.

Here, the artificial intelligence processing devices may include artificial intelligence processing devices of different versions, and the hardware architecture of each version may be different.

Optionally, the parallel argument table includes at least one of: a task identifier, the task scale, an identifier of an arithmetic core cluster, the number of arithmetic core clusters, the number of kernels in an arithmetic core cluster, and an identifier of a kernel within an arithmetic core cluster. The specific way of determining the parallel variable information is discussed below.

Still further optionally, the task scale includes the scale of sub-tasks in at least one dimension, the parallel variable information includes k parallel argument tables, and k is determined according to the scale of the sub-tasks in the at least one dimension. For example, if the task scale is [taskDimX, taskDimY, taskDimZ], the parallel variable information includes taskDimX × taskDimY × taskDimZ parallel argument tables, where taskDimX, taskDimY and taskDimZ are integers greater than 0.
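The example above (one parallel argument table per sub-task, k = taskDimX × taskDimY × taskDimZ) can be sketched as follows; the dict layout and the `coord` field are illustrative additions, not part of the patent's table format.

```python
from itertools import product

def build_parallel_arg_tables(taskDimX, taskDimY, taskDimZ):
    """Build k = taskDimX * taskDimY * taskDimZ tables, one per sub-task."""
    tables = []
    for i, (z, y, x) in enumerate(product(range(taskDimZ),
                                          range(taskDimY),
                                          range(taskDimX))):
        tables.append({"taskid": i,
                       "taskDim": (taskDimX, taskDimY, taskDimZ),
                       "coord": (x, y, z)})  # which sub-task this table drives
    return tables

tables = build_parallel_arg_tables(2, 3, 1)  # k = 2 * 3 * 1 = 6
```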
S103: the task execution system stores the parallel variable information into the private memory spaces corresponding to the kernels.

Optionally, the task scheduler of the task execution system can store the parallel variable information into the private memory spaces corresponding to the kernels. Optionally, the task scheduler may be implemented in hardware, or the task scheduler may be implemented in software. Optionally, when the task scheduler is implemented in hardware, it can be integrated into the artificial intelligence processing device, as shown in Fig. 1b. Optionally, the task scheduler may also be implemented in software, as shown in Fig. 1a.

Optionally, the private memory space corresponding to a kernel may be a reserved space of the memory 204 in the artificial intelligence processing device; the memory 204 is also called the memory of the artificial intelligence processing device. The general-purpose computer device can determine a parallel variable storage section within this reserved space of the memory, and the parallel variable storage section may include the private memory space corresponding to each kernel. Optionally, the memory space corresponding to a kernel may also be the storage section of a special register provided in each kernel of the artificial intelligence processing device, as shown in Fig. 1b.

In the embodiments of the present application, the task execution system can determine task information such as the task type and task scale of the target task according to the task processing request, and determine the parallel variable information of the target task according to that task information. The parallel variable information is applicable to artificial intelligence processing devices of different hardware architectures, and can be stored in the artificial intelligence processing device, so that the artificial intelligence processing device can process the target task according to the parallel variable information. In the embodiments of the present application, because the parallel variable information is applicable to different versions, artificial intelligence processing devices of different versions can be compatible with the same software programming model, which improves the generality of software programs and in turn the efficiency of the task execution system.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a task execution method provided by an embodiment of the present invention. The method is applied to the task execution system shown in Fig. 1a or Fig. 1b, which includes a general-purpose computer device and an artificial intelligence processing device. As shown in Fig. 2, the method includes:
S201: the general-purpose computer device determines the task type, task scale and processing instruction of a target task according to a task processing request.

The task processing request includes data to be processed; after receiving the task processing request, the general-purpose computer device saves the data to be processed into its memory. Optionally, the task processing request can be input by a user. Further optionally, the task processing request may also include task information such as the task type and task scale.

The task type indicates the number M1 of kernels that the artificial intelligence processing device needs to occupy when executing the target task, where M1 is an integer greater than 0 and M1 is less than or equal to S*N. The task scale characterizes the amount of computation of the target task. In the embodiments of the present application, task types may include multi-core task types and single-core task types; optionally, a multi-core task type may be denoted UNION and a single-core task type may be denoted BLOCK. For a multi-core task, the task type indicates the number M of arithmetic core clusters that the artificial intelligence processing device needs to occupy when executing the target task, in which case the number of kernels needed to execute the target task is M1 = M*N.
For example, a task type of UNION indicates a multi-core task. If the task type of the target task is UNION1, the artificial intelligence processing device needs to occupy 1 arithmetic core cluster when executing the target task; if the task type is UNION2, it needs to occupy 2 arithmetic core clusters; if the task type is UNION4, it needs to occupy 4 arithmetic core clusters; and if the task type is UNION8, it needs to occupy 8 arithmetic core clusters.
For another example, when task type is BLOCK, then illustrate that the task type is monokaryon task, that is, execute the goal task
Only need a kernel.
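As a minimal sketch, the task-type-to-kernel-count mapping described above can be expressed as follows (the helper name and string parsing are illustrative assumptions, not from the patent):

```python
def kernels_needed(task_type: str, n_kernels_per_cluster: int) -> int:
    """Return M1, the number of kernels a task occupies.

    UNION<M> tasks occupy M arithmetic core clusters, i.e. M1 = M * N;
    BLOCK tasks occupy a single kernel.
    """
    if task_type == "BLOCK":
        return 1
    if task_type.startswith("UNION"):
        m = int(task_type[len("UNION"):])  # e.g. "UNION4" -> 4 clusters
        return m * n_kernels_per_cluster
    raise ValueError(f"unknown task type: {task_type}")
```

With N = 4 kernels per cluster, a UNION8 task would occupy 32 kernels and a BLOCK task exactly one.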
S202: the general purpose computing device determines the parallel variable information of the target task according to the task scale and the task type.
The parallel variable information may include at least one parallel argument table. Each parallel argument table includes at least one of: the task identifier taskid, the task scale taskDim, the identifier clusterid of the arithmetic core cluster, the number clusterDim of arithmetic core clusters, the number coreDim of kernels in an arithmetic core cluster, and the identifier coreid of a kernel within its arithmetic core cluster.
Optionally, the general purpose computing device may obtain, from the task processing request, the sub-task scale of the target task in at least one dimension. Specifically, the user may specify that the target task is split along the three dimensions X, Y and Z, yielding the sub-task scale taskDimX of the target task in the X dimension, the sub-task scale taskDimY in the Y dimension, and the sub-task scale taskDimZ in the Z dimension. Optionally, when the task type is BLOCK, taskDimX, taskDimY and taskDimZ all default to 1. Optionally, the general purpose computing device may also determine the total parallel scale from the sub-task scales in the dimensions; specifically, the total parallel scale is taskDimX × taskDimY × taskDimZ.
Still further optionally, the general purpose computing device may determine the minimum parallel granularity from the sub-task scale in one dimension. For example, the general purpose computing device may dispatch work with taskDimX as the unit of parallelism, i.e. the minimum parallel granularity is taskDimX = clusterDim × coreDim. Further, after the user specifies a UNION type, if the specified taskDimX is not a positive integer multiple of clusterDim × coreDim, an error can be detected and reported at run time. Of course, in other embodiments the general purpose computing device may instead determine the minimum parallel granularity from taskDimY or taskDimZ; the determination method is the same as that for taskDimX.
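The run-time granularity check described above can be sketched as follows (the function name and error message are illustrative, not the patent's implementation):

```python
def check_union_granularity(task_dim_x: int, cluster_dim: int, core_dim: int) -> None:
    """Raise if taskDimX is not a positive multiple of clusterDim * coreDim.

    For UNION tasks, taskDimX is the unit of parallelism, so it must cover
    whole arithmetic core clusters (clusterDim * coreDim kernels each).
    """
    unit = cluster_dim * core_dim
    if task_dim_x <= 0 or task_dim_x % unit != 0:
        raise ValueError(
            f"taskDimX={task_dim_x} is not a positive multiple of "
            f"clusterDim*coreDim={unit}"
        )
```

For instance, with 1 cluster of 4 kernels, taskDimX = 4 or 8 passes, while taskDimX = 5 is reported as an error.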
Still further optionally, the general purpose computing device may also determine the task identifiers taskidX, taskidY and taskidZ of each sub-task; taskidX, taskidY and taskidZ denote the ID, in the X, Y and Z directions respectively, of the task the current kernel is executing. Specifically, the general purpose computing device may determine taskidX, taskidY and taskidZ in each dimension from the sub-task scales taskDimX, taskDimY and taskDimZ. For example, when the task scale of the target task is [4, 1, 1], taskidX may range over 0–3, taskidY may be 0, and taskidZ may be 0. For another example, when the task scale of the target task is [4, 2, 2], taskidX may range over 0–3, taskidY may be 0 or 1, and taskidZ may be 0 or 1. Further optionally, the general purpose computing device may, keeping the unit of parallelism (taskDimX) unchanged, determine a repetition count equal to taskDimY × taskDimZ, thereby determining the at least one parallel argument table: the dispatch is repeated that many times, with the taskidY and/or taskidZ values changed on each repetition.
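Consistent with the example above, the per-sub-task identifiers can be enumerated as in this sketch; the linearization order (X fastest, then Y, then Z) is inferred from the tables that follow, and the helper is illustrative:

```python
def enumerate_task_ids(task_dim_x: int, task_dim_y: int, task_dim_z: int):
    """Yield (taskidX, taskidY, taskidZ, taskid) for every sub-task.

    The linear taskid advances fastest along X, then Y, then Z,
    matching Tables 1 and 2.1-2.4 below.
    """
    for z in range(task_dim_z):
        for y in range(task_dim_y):
            for x in range(task_dim_x):
                taskid = x + y * task_dim_x + z * task_dim_x * task_dim_y
                yield (x, y, z, taskid)
```

For a [4, 2, 2] task scale this yields 16 sub-tasks: the (y=1, z=0) batch carries taskid 4–7 and the (y=0, z=1) batch carries taskid 8–11, as in the tables below.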
A parallel argument table is illustrated below.
For example, assume the processor needs to handle a task whose task type is UNION1 and whose task scale is [4, 1, 1]; the parallel argument table is then as in Table 1 below.
Variable name | Kernel 0 | Kernel 1 | Kernel 2 | Kernel 3 |
taskidX | 0 | 1 | 2 | 3 |
taskidY | 0 | 0 | 0 | 0 |
taskidZ | 0 | 0 | 0 | 0 |
taskid | 0 | 1 | 2 | 3 |
taskDimX | 4 | 4 | 4 | 4 |
taskDimY | 1 | 1 | 1 | 1 |
taskDimZ | 1 | 1 | 1 | 1 |
taskDim | 4 | 4 | 4 | 4 |
coreid | 0 | 1 | 2 | 3 |
coreDim | 4 | 4 | 4 | 4 |
clusterid | 0 | 0 | 0 | 0 |
clusterDim | 1 | 1 | 1 | 1 |
Table 1
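Under the stated assumptions (UNION1: one cluster of four kernels), a table like Table 1 can be generated mechanically. The following sketch is illustrative of the structure, not the patent's implementation:

```python
def build_parallel_argument_table(task_dim, cluster_dim, core_dim, y=0, z=0):
    """Build one parallel argument table: a dict of variable -> per-kernel values.

    One table covers cluster_dim * core_dim kernels executing the sub-tasks
    with the given (y, z) indices and x equal to the kernel's position.
    """
    dim_x, dim_y, dim_z = task_dim
    n_kernels = cluster_dim * core_dim
    return {
        "taskidX": list(range(n_kernels)),
        "taskidY": [y] * n_kernels,
        "taskidZ": [z] * n_kernels,
        "taskid": [x + y * dim_x + z * dim_x * dim_y for x in range(n_kernels)],
        "taskDimX": [dim_x] * n_kernels,
        "taskDimY": [dim_y] * n_kernels,
        "taskDimZ": [dim_z] * n_kernels,
        "taskDim": [dim_x * dim_y * dim_z] * n_kernels,
        "coreid": [k % core_dim for k in range(n_kernels)],
        "coreDim": [core_dim] * n_kernels,
        "clusterid": [k // core_dim for k in range(n_kernels)],
        "clusterDim": [cluster_dim] * n_kernels,
    }
```

Calling it with task_dim=[4, 1, 1], cluster_dim=1, core_dim=4 reproduces the columns of Table 1; with task_dim=[4, 2, 2] and (y=1, z=0) it reproduces Table 2.2.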
For another example, assume the processor needs to handle a task whose task type is UNION1 and whose task scale is [4, 2, 2]. When the overall task indices being processed are [x, y=0, z=0], the parallel argument table is as in Table 2.1 below.
Variable name | Kernel 0 | Kernel 1 | Kernel 2 | Kernel 3 |
taskidX | 0 | 1 | 2 | 3 |
taskidY | 0 | 0 | 0 | 0 |
taskidZ | 0 | 0 | 0 | 0 |
taskid | 0 | 1 | 2 | 3 |
taskDimX | 4 | 4 | 4 | 4 |
taskDimY | 2 | 2 | 2 | 2 |
taskDimZ | 2 | 2 | 2 | 2 |
taskDim | 16 | 16 | 16 | 16 |
coreid | 0 | 1 | 2 | 3 |
coreDim | 4 | 4 | 4 | 4 |
clusterid | 0 | 0 | 0 | 0 |
clusterDim | 1 | 1 | 1 | 1 |
Table 2.1
When the overall task indices being processed are [x, y=1, z=0], the parallel argument table is as in Table 2.2 below.
Variable name | Kernel 0 | Kernel 1 | Kernel 2 | Kernel 3 |
taskidX | 0 | 1 | 2 | 3 |
taskidY | 1 | 1 | 1 | 1 |
taskidZ | 0 | 0 | 0 | 0 |
taskid | 4 | 5 | 6 | 7 |
taskDimX | 4 | 4 | 4 | 4 |
taskDimY | 2 | 2 | 2 | 2 |
taskDimZ | 2 | 2 | 2 | 2 |
taskDim | 16 | 16 | 16 | 16 |
coreid | 0 | 1 | 2 | 3 |
coreDim | 4 | 4 | 4 | 4 |
clusterid | 0 | 0 | 0 | 0 |
clusterDim | 1 | 1 | 1 | 1 |
Table 2.2
When the overall task indices being processed are [x, y=0, z=1], the parallel argument table is as in Table 2.3 below.
Variable name | Kernel 0 | Kernel 1 | Kernel 2 | Kernel 3 |
taskidX | 0 | 1 | 2 | 3 |
taskidY | 0 | 0 | 0 | 0 |
taskidZ | 1 | 1 | 1 | 1 |
taskid | 8 | 9 | 10 | 11 |
taskDimX | 4 | 4 | 4 | 4 |
taskDimY | 2 | 2 | 2 | 2 |
taskDimZ | 2 | 2 | 2 | 2 |
taskDim | 16 | 16 | 16 | 16 |
coreid | 0 | 1 | 2 | 3 |
coreDim | 4 | 4 | 4 | 4 |
clusterid | 0 | 0 | 0 | 0 |
clusterDim | 1 | 1 | 1 | 1 |
Table 2.3
When the overall task indices being processed are [x, y=1, z=1], the parallel argument table is as in Table 2.4 below.
Variable name | Kernel 0 | Kernel 1 | Kernel 2 | Kernel 3 |
taskidX | 0 | 1 | 2 | 3 |
taskidY | 1 | 1 | 1 | 1 |
taskidZ | 1 | 1 | 1 | 1 |
taskid | 12 | 13 | 14 | 15 |
taskDimX | 4 | 4 | 4 | 4 |
taskDimY | 2 | 2 | 2 | 2 |
taskDimZ | 2 | 2 | 2 | 2 |
taskDim | 16 | 16 | 16 | 16 |
coreid | 0 | 1 | 2 | 3 |
coreDim | 4 | 4 | 4 | 4 |
clusterid | 0 | 0 | 0 | 0 |
clusterDim | 1 | 1 | 1 | 1 |
Table 2.4
S203: the general purpose computing device transmits the parallel variable information, the process instruction and the pending data into the memory 204 of the artificial intelligence process equipment.
Optionally, during power-up or reset of the artificial intelligence process equipment, the driver may allocate, within a reserved region of the address space of the memory 204 of the artificial intelligence process equipment, one block of memory space for each kernel in the artificial intelligence process equipment; this memory space is called the kernel's private memory space.
The general purpose computing device transmits the process instruction and the pending data over the PCIE bus into the memory of the artificial intelligence process equipment, so that the kernels in the M arithmetic core clusters process the pending data based on the process instruction.
In a feasible embodiment, the operation domain of the process instruction includes a data writing address, and the method further includes: the general purpose computing device receives the processing result of the artificial intelligence process equipment over the PCIE bus, and writes the processing result into the memory space corresponding to the data writing address. The data writing address may correspond to part of the memory space in the memory of the general purpose computing device.
It can be seen that, in the scheme of this embodiment of the present invention, after the general purpose computing device receives a task processing request, it determines the task type, task scale and process instruction of the target task according to the request, and determines the parallel variable information (i.e. the parallel argument table) of the target task according to the task scale and task type, so that the artificial intelligence process equipment can execute the target task. The target task can thus be executed, based on the parallel argument table, even when the artificial intelligence process equipment has no hardware scheduler, enabling old intelligent hardware to execute/run tasks/programs that execute/run on new intelligent hardware.
Still further optionally, if the general purpose computing device finds that the kernels of M arithmetic core clusters in the artificial intelligence process equipment are in the idle state, it performs the step of transmitting the parallel argument table, the process instruction and the pending data into the private memory space of each kernel in the M idle arithmetic core clusters.
Specifically, after determining the number M of arithmetic core clusters that the artificial intelligence process equipment needs to occupy to execute the target task, the general purpose computing device queries whether the artificial intelligence process equipment has M arithmetic core clusters whose kernels are in the idle state. If so, the general purpose computing device saves the parallel argument table over the PCIE bus into the private memory space of each kernel in the M arithmetic core clusters. The purpose of saving the parallel argument table into the private memory space of each kernel in the M arithmetic core clusters is to guarantee that the kernels of those M arithmetic core clusters are not occupied by other tasks.
Where the task scale is [taskDimX, taskDimY, taskDimZ] and the parallel variable information includes taskDimX × taskDimY × taskDimZ parallel argument tables, the general purpose computing device saves the taskDimX × taskDimY × taskDimZ parallel argument tables over the PCIE bus into the private memory space of each kernel in the M arithmetic core clusters.
It should be noted here that a kernel being in the idle state means that the kernel is not executing instructions.
Referring to Fig. 3, Fig. 3 is a flow diagram of another task processing method provided by an embodiment of the present invention. The method is applied to a task execution system that includes a general purpose computing device and an artificial intelligence process equipment; the artificial intelligence process equipment includes at least one arithmetic core cluster, and each arithmetic core cluster includes at least one kernel. As shown in Fig. 3, the method includes:
S301: the artificial intelligence process equipment receives the task information sent by the general purpose computing device; the task information includes the parallel argument table, the pending data, a data input address and the process instruction. The parallel argument table is used to instruct the kernels in M arithmetic core clusters to execute the process instruction; the M arithmetic core clusters are the arithmetic core clusters that need to be occupied when executing the process instruction on the pending data.
It should be noted that, for the parallel argument table, reference can be made to the related description of step S202, which is not repeated here.
S302: the artificial intelligence process equipment saves the parallel argument table and the pending data into the private memory space corresponding to each kernel in the M arithmetic core clusters.
It should be noted that, during power-up or reset of the artificial intelligence process equipment, one block of memory space is allocated by the driver, within the reserved memory region of the artificial intelligence process equipment, for each kernel in the artificial intelligence process equipment; this memory space is called the kernel's private memory space. The reserved memory region of the artificial intelligence process equipment is part of its memory 204.
S303: the artificial intelligence process equipment reads the parallel argument table in the private memory space corresponding to each kernel in the M arithmetic core clusters into the on-chip storage of the artificial intelligence process equipment, and processes, according to the process instruction, the data determined based on the parallel argument table and the data input address, to obtain a processing result.
Specifically, after each kernel in the M arithmetic core clusters reads the parallel argument table in its corresponding private memory space into the on-chip storage of the artificial intelligence process equipment, the kernel determines its data to be processed based on the parallel variables in the parallel argument table. The operation domain of the process instruction includes the data input address, i.e. the first address at which the pending data is stored in the artificial intelligence process equipment; the parallel variables can be regarded as an offset address. Each kernel determines its data to be processed based on the data input address and the offset address, and then processes that data based on the process instruction to obtain the processing result.
It should be noted that the kernels in the M arithmetic core clusters execute the process instruction on the pending data simultaneously.
In one example, the task information further includes a data length, and each kernel determines its data to be processed based on the data input address, the offset address (i.e. the parallel variables) and the data length.
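As an illustrative sketch of this addressing scheme — the taskid-times-length layout is an assumption consistent with, but not stated verbatim in, the text:

```python
def kernel_data_range(data_input_address: int, taskid: int, data_length: int):
    """Return the (start, end) byte range one kernel processes.

    data_input_address is the first address of the pending data; the
    parallel variable taskid acts as the offset selector (assumed layout:
    consecutive chunks), and data_length is the per-task chunk size
    carried in the task information.
    """
    start = data_input_address + taskid * data_length
    return start, start + data_length
```

For example, with a base address of 0x1000 and 256-byte chunks, the kernel executing taskid 2 would read bytes 0x1200–0x1300.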
In one example, each kernel reads the parallel argument table in its corresponding private memory space into the on-chip storage of the artificial intelligence process equipment by executing a preload instruction. That is, operation S303 may specifically include: each kernel, by executing the preload instruction, reads the parallel argument table in its corresponding private memory space into the on-chip storage of the artificial intelligence process equipment.
The preload instruction is load.nram.dram address1, [%SP], 64, where [address1] is the first address of the on-chip storage, the address range [address1-address2] is agreed to be the reserved memory region of the artificial intelligence process equipment, and %SP is the name of a specified register in the artificial intelligence process equipment, used here to point to the first address of each kernel's private memory space. By reading the parallel argument table of the kernel's private memory space into the on-chip storage through the preload instruction, when the kernel executes the process instruction and needs a variable in the parallel argument table, it can read the variable directly from the on-chip storage, thereby improving the execution efficiency of the process instruction.
Optionally, the on-chip storage may be an on-chip cache. Further, the on-chip cache may be an on-chip NRAM.
In a feasible embodiment, the task information further includes a data output address, and the method further includes: after each kernel in the M arithmetic core clusters has finished executing the process instruction, transmitting the obtained processing result into the memory space corresponding to the data output address.
Further, the artificial intelligence process equipment transmits the processing result to the general purpose computing device over the PCIE bus.
In one example, the artificial intelligence process equipment is the artificial intelligence process equipment shown in Fig. 1a, which may include 8 clusters, each cluster including 4 or 8 kernels, each kernel being an artificial intelligence processor core. That is, an arithmetic core cluster in the artificial intelligence process equipment can be regarded as a cluster in the artificial intelligence process equipment shown in Fig. 1a, and a kernel in the artificial intelligence process equipment can be regarded as a kernel in the artificial intelligence process equipment shown in Fig. 1a; the artificial intelligence process equipment shown in Fig. 1a executes the content corresponding to steps S301-S303.
It can be seen that, in the scheme of this embodiment of the present invention, by allocating one block of private memory space for each kernel in the artificial intelligence process equipment to store the parallel argument table, the kernels in an arithmetic core cluster can correctly read their respective data to be processed while executing the same instruction at the same time. Before a kernel executes the process instruction, the parallel argument table in its private memory space is read into the on-chip NRAM; when the kernel executes the process instruction and needs a parallel variable, it reads the parallel variable directly from the on-chip storage, thereby improving subsequent processing efficiency. By receiving the parallel argument table sent by the general purpose computing device, the artificial intelligence process equipment in effect has the function of a hardware scheduler realized by the general purpose computing device; the target task can be executed, based on the parallel argument table, even when the artificial intelligence process equipment has no hardware scheduler, enabling old intelligent hardware to execute/run tasks/programs that execute/run on new intelligent hardware.
Referring to the artificial intelligence process equipment shown in Fig. 1a and Fig. 1b: relative to the artificial intelligence process equipment shown in Fig. 1a, Fig. 1b adds a task scheduler and a specified register, which provides a private memory space for the kernel. When the task executing method is run on the task execution system shown in Fig. 1b, after the artificial intelligence process equipment shown in Fig. 1b receives the processing task, the task scheduler can determine the parallel argument table according to the task type and the task scale and store the parallel argument table into the specified register; the kernel may then continue to execute the process instruction.
Referring to Fig. 4, Fig. 4 is a structural schematic diagram of a general purpose computing device provided by an embodiment of the present invention. The general purpose computing device is applied to a task execution system that further includes an artificial intelligence process equipment. The general purpose computing device 400 includes:
a determination unit 401, configured to determine the task type, task scale and process instruction of the target task according to the task processing request, the task processing request including the pending data, and to determine the parallel variable information of the target task according to the task scale and the task type, the parallel variable information of the target task being used to instruct the kernels in M arithmetic core clusters to execute the process instruction;
a transmission unit 402, configured to, if the kernels of M arithmetic core clusters in the artificial intelligence process equipment are in the idle state, transmit the parallel variable information, the process instruction and the pending data into the memory of the artificial intelligence process equipment.
In a feasible embodiment, the general purpose computing device 400 further includes:
a query unit 404, configured to query whether the artificial intelligence process equipment has M arithmetic core clusters whose kernels are in the idle state, before the parallel variable information, the process instruction and the pending data are transmitted into the memory of the artificial intelligence process equipment.
In a feasible embodiment, the task scale includes the sub-task scale of at least one dimension, the parallel variable information includes k parallel argument tables, and k is determined according to the sub-task scale of the at least one dimension.
In a feasible embodiment, in terms of transmitting the parallel variable information into the private memory space of each kernel in the M idle arithmetic core clusters, the transmission unit 402 is specifically configured to: transmit the parallel variable information over the PCIE bus into the private memory space of each kernel in the M idle arithmetic core clusters;
and in terms of transmitting the process instruction and the pending data into the memory of the artificial intelligence process equipment, the transmission unit 402 is specifically configured to: transmit the process instruction and the pending data over the PCIE bus into the memory of the artificial intelligence process equipment.
In a feasible embodiment, the operation domain of the process instruction includes the data writing address, and the general purpose computing device 400 further includes:
a receiving unit 403, configured to receive the processing result of the artificial intelligence process equipment and write the processing result into the memory space corresponding to the data writing address.
In a feasible embodiment, the parallel variable information includes a parallel argument table, which includes at least one of: the task identifier, the task scale, the identifier of the arithmetic core cluster, the number of arithmetic core clusters, the number of kernels in an arithmetic core cluster, and the identifier of a kernel in an arithmetic core cluster.
It should be noted that the above units (the determination unit 401, the transmission unit 402, the receiving unit 403 and the query unit 404) are configured to execute the related steps of the above method. The determination unit 401 is specifically configured to execute the related content of steps S201 and S202, and the transmission unit 402, the receiving unit 403 and the query unit 404 are specifically configured to execute the related content of step S203.
In this embodiment, the general purpose computing device 400 is presented in the form of units. A "unit" here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the above functions.
Referring to Fig. 5, Fig. 5 is a structural schematic diagram of an artificial intelligence process equipment provided by an embodiment of the present invention. The artificial intelligence process equipment is applied to a task execution system that further includes a general purpose computing device; the artificial intelligence process equipment includes at least one arithmetic core cluster, and each arithmetic core cluster includes at least one kernel. The artificial intelligence process equipment 500 includes:
a receiving unit 501, configured to receive the task information sent by the general purpose computing device, the task information including the parallel argument table, the pending data, the data input address and the process instruction, the parallel argument table being used to instruct the kernels in M arithmetic core clusters to execute the process instruction, and the M arithmetic core clusters being the arithmetic core clusters that need to be occupied when executing the process instruction on the pending data;
a storage unit 502, configured to save the parallel argument table and the pending data into the private memory space corresponding to each kernel in the M arithmetic core clusters;
a processing unit 503, configured to read the parallel argument table in the private memory space corresponding to each kernel in the M arithmetic core clusters into the on-chip storage of the artificial intelligence process equipment, and to process, according to the process instruction, the data determined based on the parallel argument table and the data input address, to obtain the processing result.
In a feasible embodiment, the task information further includes the data length, and in terms of processing, according to the process instruction, the data determined based on the parallel argument table and the data input address, to obtain the processing result, the processing unit 503 is specifically configured to:
process, according to the process instruction, the data determined based on the parallel argument table, the data input address and the data length, to obtain the processing result.
In a feasible embodiment, in terms of reading the parallel argument table in the private memory space corresponding to each kernel into the on-chip storage of the artificial intelligence process equipment, the processing unit 503 is specifically configured to:
read, by executing the preload instruction, the parallel argument table in the private memory space corresponding to each kernel into the on-chip storage of the artificial intelligence process equipment.
In a feasible embodiment, the artificial intelligence process equipment 500 further includes:
an allocation unit 504, configured to allocate, during power-up or reset of the artificial intelligence process equipment, one private memory space for each kernel in the multiple arithmetic core clusters from the reserved memory region of the artificial intelligence process equipment.
In a feasible embodiment, the task information further includes the data output address, and the storage unit 502 is further configured to:
after each kernel in the M arithmetic core clusters has finished executing the process instruction, save the obtained processing result into the memory space corresponding to the data output address.
It should be noted that the above units (the receiving unit 501, the storage unit 502, the processing unit 503 and the allocation unit 504) are configured to execute the related steps of the above method. The receiving unit 501 is specifically configured to execute the related content of step S301, the storage unit 502 and the allocation unit 504 are specifically configured to execute the related content of step S302, and the processing unit 503 is configured to execute the related content of step S303.
In this embodiment, the artificial intelligence process equipment 500 is presented in the form of units. A "unit" here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the above functions.
It should be noted that, for each of the foregoing method embodiments, for simplicity of description, the embodiment is expressed as a series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because according to the present invention some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be realized in other ways. For example, the device embodiments described above are merely exemplary; the division of the units is only a logical function division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Further, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in other forms.
The units illustrated as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may take the form of a hardware realization.
The embodiments of the present invention have been described in detail above; specific cases are used herein to expound the principle and embodiments of the present invention, and the above description of the embodiments is only intended to help understand the method of the present invention and its core ideas. At the same time, those skilled in the art may make changes in the specific embodiments and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (10)
1. A task executing method, characterized in that the method is applied to a task execution system, the task execution system including a general purpose computing device and an artificial intelligence process equipment, and the method comprises:
determining a task type, a task scale and a process instruction of a target task according to a task processing request, the task processing request including pending data;
determining parallel variable information of the target task according to the task scale and the task type, the parallel variable information of the target task being used to instruct kernels in M arithmetic core clusters to execute the process instruction;
if the kernels of M arithmetic core clusters in the artificial intelligence process equipment are in an idle state, transmitting the parallel variable information, the pending data and the process instruction into a memory of the artificial intelligence process equipment.
2. The method according to claim 1, characterized in that, if the kernels of M arithmetic core clusters in the artificial intelligence process equipment are in the idle state, before transmitting the parallel variable information, the pending data and the process instruction into the memory of the artificial intelligence process equipment, the method further comprises:
querying whether the artificial intelligence process equipment has M arithmetic core clusters whose kernels are in the idle state.
3. The method according to claim 1 or 2, characterized in that the parallel variable information includes a parallel argument table, and the parallel argument table includes at least one of: a task identifier, a task scale, an identifier of an arithmetic core cluster, a number of arithmetic core clusters, a number of kernels in an arithmetic core cluster, and an identifier of a kernel in an arithmetic core cluster.
4. The method according to claim 1 or 2, characterized in that the operation domain of the process instruction includes a data writing address, and the method further comprises:
receiving a processing result of the artificial intelligence process equipment, and writing the processing result into a memory space corresponding to the data writing address.
5. A task execution method, wherein the method is applied to a task execution system, the task execution system comprising a general-purpose computing device and an artificial intelligence processing device, the artificial intelligence processing device comprising at least one arithmetic core cluster, and each arithmetic core cluster comprising at least one kernel, the method comprising:
receiving task information sent by the general-purpose computing device, the task information comprising parallel variable information, data to be processed, a data input address, and a processing instruction, wherein the parallel variable information indicates that the kernels in M arithmetic core clusters are to execute the processing instruction;
saving the parallel variable information and the data to be processed into the private storage spaces corresponding to the kernels of the M arithmetic core clusters;
storing, for each kernel in the M arithmetic core clusters, the parallel variable information in the private storage space corresponding to that kernel into the kernel's on-chip storage, and processing, according to the processing instruction, the data determined on the basis of the parallel variable information and the data input address, to obtain a processing result.
6. The method according to claim 5, wherein storing the parallel variable information in the private storage space corresponding to each kernel into the on-chip storage of the artificial intelligence processing device comprises:
reading, by executing a preload instruction, the parallel argument table in the private storage space corresponding to each kernel into the on-chip storage of the artificial intelligence processing device.
7. The method according to claim 5 or 6, further comprising:
during power-up or reset of the artificial intelligence processing device, allocating the private storage spaces for the kernels in the M arithmetic core clusters from a reserved space of the memory of the artificial intelligence processing device, wherein the kernels and the private storage spaces are arranged in one-to-one correspondence.
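The allocation step of claim 7 amounts to carving a reserved region of device memory into one private slot per kernel at power-up or reset. The following sketch assumes equal-sized slots and illustrative parameter names; the patent does not specify the slot size or layout.

```python
# Sketch of claim 7: carve per-kernel private storage out of a reserved
# region of device memory at power-up/reset. Equal slot sizes and the
# parameter names are illustrative assumptions.

def allocate_private_spaces(reserved_base, reserved_size, kernel_ids, slot_size):
    """Return a one-to-one map from kernel id to a (base, size) private region."""
    assert slot_size * len(kernel_ids) <= reserved_size, "reserved region too small"
    return {kid: (reserved_base + i * slot_size, slot_size)
            for i, kid in enumerate(kernel_ids)}

spaces = allocate_private_spaces(0x1000, 0x4000,
                                 kernel_ids=[0, 1, 2, 3],
                                 slot_size=0x1000)
```

Each kernel ends up with exactly one region and each region belongs to exactly one kernel, matching the one-to-one correspondence the claim requires.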
8. The method according to claim 5 or 6, wherein the task information further comprises a data output address, and the method further comprises:
after the kernels in the M arithmetic core clusters have finished executing the processing instruction, transmitting the obtained processing result into the storage space corresponding to the data output address.
9. A general-purpose computing device, wherein the general-purpose computing device comprises a memory and a general-purpose processor, a computer program is stored in the memory, and when the general-purpose processor executes the computer program, the steps of the method according to any one of claims 1 to 4 are implemented.
10. An artificial intelligence processing device, wherein the artificial intelligence processing device comprises at least one kernel and a memory, a computer program is stored in the memory, and when the kernel executes the computer program, the steps of the method according to any one of claims 5 to 8 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327737.7A CN110109861A (en) | 2019-04-22 | 2019-04-22 | A kind of task executing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110109861A true CN110109861A (en) | 2019-08-09 |
Family
ID=67486346
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910327737.7A Pending CN110109861A (en) | 2019-04-22 | 2019-04-22 | A kind of task executing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110109861A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000006084A2 (en) * | 1998-07-31 | 2000-02-10 | Integrated Systems Design Center, Inc. | Integrated hardware and software task control executive |
CN102622208A (en) * | 2011-01-27 | 2012-08-01 | 中兴通讯股份有限公司 | Multi-core reconfigurable processor cluster and reconfiguration method thereof |
CN104102546A (en) * | 2014-07-23 | 2014-10-15 | 浪潮(北京)电子信息产业有限公司 | Method and system for realizing CPU (central processing unit) and GPU (graphics processing unit) load balance |
CN107341053A (en) * | 2017-06-01 | 2017-11-10 | 深圳大学 | The programmed method of heterogeneous polynuclear programmable system and its memory configurations and computing unit |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114168203A (en) * | 2020-09-10 | 2022-03-11 | 成都鼎桥通信技术有限公司 | Dual-system running state control method and device and electronic equipment |
CN114168203B (en) * | 2020-09-10 | 2024-02-13 | 成都鼎桥通信技术有限公司 | Dual-system running state control method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9092266B2 (en) | Scalable scheduling for distributed data processing | |
US9015708B2 (en) | System for improving the performance of high performance computing applications on cloud using integrated load balancing | |
CN103853618B (en) | Resource allocation method with minimized cloud system cost based on expiration date drive | |
US20120215920A1 (en) | Optimized resource management for map/reduce computing | |
CN103617087A (en) | MapReduce optimizing method suitable for iterative computations | |
CN105471985A (en) | Load balance method, cloud platform computing method and cloud platform | |
JP2018073414A (en) | Method of controlling work flow in distributed computation system comprising processor and memory units | |
Li et al. | An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters | |
CN113946431B (en) | Resource scheduling method, system, medium and computing device | |
CN116070682B (en) | SNN model dynamic mapping method and device of neuron computer operating system | |
Wu et al. | Hierarchical task mapping for parallel applications on supercomputers | |
Shojafar et al. | An efficient scheduling method for grid systems based on a hierarchical stochastic Petri net | |
Yun et al. | An integrated approach to workflow mapping and task scheduling for delay minimization in distributed environments | |
CN114327811A (en) | Task scheduling method, device and equipment and readable storage medium | |
Perwej | The ambient scrutinize of scheduling algorithms in big data territory | |
CN105740249B (en) | Processing method and system in parallel scheduling process of big data job | |
CN110109861A (en) | A kind of task executing method and device | |
Guo et al. | Multi-objective optimization for data placement strategy in cloud computing | |
CN103677996B (en) | Collaboration method and system for balancing workload distribution | |
CN108304253A (en) | Map method for scheduling task based on cache perception and data locality | |
Zhang et al. | A distributed computing framework for All-to-All comparison problems | |
Mohanapriya et al. | An optimal time-based resource allocation for biomedical workflow applications in cloud | |
Cernuda et al. | Hflow: A dynamic and elastic multi-layered i/o forwarder | |
CN108228323A (en) | Hadoop method for scheduling task and device based on data locality | |
Ghazali et al. | CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: Room 644, No. 6, South Road of Academy of Sciences, Beijing 100000
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: Room 644, No. 6, South Road of Academy of Sciences, Beijing 100000
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.