CN114138440A - Operator execution device, operator scheduling device, method and chip - Google Patents

Operator execution device, operator scheduling device, method and chip

Info

Publication number
CN114138440A
Authority
CN
China
Prior art keywords
operator
scheduling
task
execution
executing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111450101.5A
Other languages
Chinese (zh)
Inventor
冷祥纶
刘文龙
刘才齐
李林鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd filed Critical Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202111450101.5A priority Critical patent/CN114138440A/en
Publication of CN114138440A publication Critical patent/CN114138440A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides an operator execution device, an operator scheduling device, a method, a chip, a computer device, and a storage medium. The operator execution device includes an operator scheduler and an execution unit. In response to an operator starting instruction for an operator to be started, the operator scheduler issues the instruction to the execution unit that corresponds to the operator type information carried in the instruction; the instruction also carries scheduling information for the operation units within the execution unit. On receiving the operator starting instruction issued by the operator scheduler, the execution unit executes the operator to be started based on the scheduling information carried in the instruction.

Description

Operator execution device, operator scheduling device, method and chip
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular to an operator execution device, an operator scheduling device, an operator execution method, an operator scheduling method, a chip, a computer device, and a storage medium.
Background
In the field of deep learning, when a computer device runs a deep learning model, the deep learning framework deployed on the device parses the model to obtain its operators and dispatches them to an execution device such as an Artificial Intelligence (AI) accelerator card or a Graphics Processing Unit (GPU). After receiving an operator from the framework, the execution device is responsible for scheduling its own computing power and executes the data processing task corresponding to the operator based on the scheduling result. This mode of data processing suffers from a low utilization rate of computing resources.
Disclosure of Invention
Embodiments of the present disclosure provide at least an operator execution device, an operator scheduling device, an operator execution method, an operator scheduling method, a chip, a computer device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an operator executing apparatus, including: an operator scheduler and an execution unit;
the operator scheduler is used for responding to an operator starting instruction of an operator to be started, and issuing the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
and the execution unit is used for responding to the received operator starting instruction issued by the operator scheduler and executing the operator to be started based on the scheduling information carried in the operator starting instruction.
In one possible embodiment, the scheduling information includes at least one of:
the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of the initial task block among the task blocks to be executed, the number of operation units for executing the task blocks to be executed, and the identifications of the operation units for executing the task blocks to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
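As an illustration, the scheduling information enumerated above can be modeled as a small record. All field names below are hypothetical, chosen for readability; the patent specifies only the kinds of information carried, not any concrete layout.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SchedulingInfo:
    """Hypothetical container for the scheduling-information fields listed above."""
    operator_id: int                       # identification of the operator to be started
    num_task_blocks: int                   # number of task blocks to be executed
    first_task_block_id: int               # identification of the initial task block
    num_units: Optional[int] = None        # number of operation units to use, or
    unit_ids: Optional[List[int]] = None   # ...explicit operation-unit identifications

info = SchedulingInfo(operator_id=7, num_task_blocks=16,
                      first_task_block_id=0, num_units=4)
print(info.num_task_blocks)  # -> 16
```

Per the text, the unit count and the explicit unit identifications are alternatives, which is why both are optional here.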
In one possible embodiment, the execution unit includes: a task block scheduler and a plurality of operation units;
the task block scheduler is configured to, in response to receiving the operator starting instruction issued by the operator scheduler, determine, from the plurality of operation units and based on the scheduling information carried in the instruction, a target operation unit for executing the operator to be started, and determine a plurality of task blocks to be executed; and to issue the task blocks to be executed to the target operation unit; each task block to be executed includes a plurality of subtasks of the operator to be started;
and the operation unit is configured to, in response to receiving a task block to be executed issued by the task block scheduler, execute the data processing task corresponding to that task block.
In one possible embodiment, the scheduling information includes: the number of task blocks to be executed in the operator to be started and the task block identification of the initial task block among the task blocks to be executed;
when determining the plurality of task blocks to be executed based on the scheduling information carried in the operator starting instruction, the task block scheduler is configured to: starting from the initial task block indicated by its task block identification, determine a number of consecutive task blocks equal to the number of task blocks to be executed, and take them as the task blocks to be executed.
In a possible implementation, when issuing the task blocks to be executed to the target operation units, the task block scheduler is configured to: determine, from the task blocks to be executed and based on the number of task blocks to be executed and the number of target operation units, the task blocks to be issued to each target operation unit;
and issue the task blocks determined for each target operation unit to that target operation unit.
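One plausible way to split the task blocks across the target operation units is a round-robin assignment. This particular distribution is an assumption for illustration; the patent only requires that the split follow from the block count and the unit count.

```python
def assign_blocks(block_ids, unit_ids):
    """Round-robin distribution of task blocks over target operation units.
    Illustrative only; any split honouring the two counts would satisfy the text."""
    assignment = {u: [] for u in unit_ids}
    for i, block in enumerate(block_ids):
        assignment[unit_ids[i % len(unit_ids)]].append(block)
    return assignment

# five task blocks over two target operation units
print(assign_blocks([8, 9, 10, 11, 12], [0, 1]))  # -> {0: [8, 10, 12], 1: [9, 11]}
```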
In one possible embodiment, the scheduling information includes: the number of operation units for executing the task blocks to be executed, or the identifications of the operation units for executing the task blocks to be executed;
when determining, from the plurality of operation units, the target operation unit for executing the operator to be started based on the scheduling information carried in the operator starting instruction, the task block scheduler is configured to:
determine the target operation unit from the plurality of operation units based on the number of operation units for executing the task blocks to be executed or the identifications of those operation units.
In a possible implementation, the operator initiation instruction includes: a first field for carrying the scheduling information, and at least one of the following fields:
a second field for carrying size information of the data to be processed corresponding to the operator to be started, a third field for carrying size information corresponding to a task block, a fourth field for carrying the size of the memory space required, and a fifth field for carrying the code address corresponding to the operator to be started.
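A hypothetical binary layout of such an instruction might look like the following. The field widths, ordering, and sub-fields of the scheduling information are all assumptions made for illustration; the patent names the fields but does not specify an encoding.

```python
import struct

# Assumed fixed-width encoding: 7 x uint32 (scheduling sub-fields, sizes)
# followed by 1 x uint64 (the code address), little-endian, no padding.
START_INSTRUCTION_FMT = "<7IQ"

def pack_start_instruction(operator_id, num_blocks, first_block, num_units,
                           data_size, block_size, mem_size, code_addr):
    """Pack the first-through-fifth fields described above into bytes."""
    return struct.pack(START_INSTRUCTION_FMT, operator_id, num_blocks,
                       first_block, num_units, data_size, block_size,
                       mem_size, code_addr)

instr = pack_start_instruction(7, 16, 0, 4,
                               1 << 20, 1 << 16, 1 << 22, 0x8000_0000)
print(len(instr))  # -> 36
```

Under this assumed layout, the instruction is 36 bytes: 7 × 4 bytes for the 32-bit fields plus 8 bytes for the code address.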
In a second aspect, an embodiment of the present disclosure provides an operator executing method, including:
the operator scheduler responds to an operator starting instruction of an operator to be started, and issues the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
and the execution unit responds to the received operator starting instruction issued by the operator scheduler and executes the operator to be started based on scheduling information carried in the operator starting instruction.
In a third aspect, an embodiment of the present disclosure provides an operator scheduling apparatus, including: a scheduling policy generator, and an instruction generator;
the scheduling strategy generator is used for generating a scheduling strategy for scheduling the operation unit in the operator execution equipment when the operator execution equipment is used for executing the multiple operators of the deep learning model; and transmitting the scheduling policy to the instruction generator;
and the instruction generator is used for generating an operator starting instruction based on the scheduling strategy and sending the operator starting instruction to the operator executing equipment.
In one possible embodiment, the scheduling policy includes: operator starting time and scheduling information corresponding to each operator starting time;
the scheduling information includes at least one of: the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of an initial task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
In one possible implementation, when generating the scheduling policy for the operation units in the operator execution device for the case where the operator execution device is used to execute the multiple operators of the deep learning model, the scheduling policy generator is configured to:
analyze the deep learning model to obtain the multiple operators in the deep learning model;
determine, based on the operation information corresponding to each of the multiple operators and the computing power information of the operator execution device, a scheduling policy for the operation units in the operator execution device when the multiple operators are executed, where each task block includes a plurality of subtasks of the corresponding operator; and
schedule the operator execution device to execute the multiple operators based on the scheduling policy.
In a possible implementation, the operation information corresponding to each of the multiple operators includes the execution duration of the task blocks corresponding to that operator and the parallelism among the multiple operators.
In a possible implementation, the scheduling policy generator is further configured to obtain the execution durations of the task blocks corresponding to the multiple operators in the following manner:
estimating the execution duration required for executing each task block of an operator based on the memory occupied by the operator during execution and the computing power information of the operator execution device.
In one possible implementation manner, the scheduling policy generator, when estimating the execution time required for executing each task block of the operator based on the memory occupied by the operator during execution and the computing power information of the operator executing device, is configured to:
determining the memory access duration based on the memory occupied by the operator during execution and the memory access bandwidth of the operation unit to the memory; and
determining the calculation time length of the task block based on the calculation force information required by each calculation step in each subtask in the task block and the calculation force information of each operation unit;
and determining the execution time length required for executing the task block based on the memory access time length and the calculation time length.
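A minimal sketch of this estimate follows. The additive combination of memory-access time and calculation time is our assumption; the patent says the execution duration is determined from both but leaves the combination unspecified (a `max()` would instead model fully overlapped access and compute).

```python
def estimate_block_duration(mem_bytes, bandwidth_bps, ops_needed, ops_per_sec):
    """Execution-duration estimate for one task block: memory-access duration
    from the occupied memory and the unit's access bandwidth, plus calculation
    duration from the operation count and the unit's computing power."""
    access_time = mem_bytes / bandwidth_bps    # memory access duration
    compute_time = ops_needed / ops_per_sec    # calculation duration
    return access_time + compute_time          # assumed additive model

# e.g. 0.5 GB at 1 GB/s plus 1 GFLOP at 2 GFLOP/s -> 0.5 s + 0.5 s
print(estimate_block_duration(5e8, 1e9, 1e9, 2e9))  # -> 1.0
```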
In a possible implementation manner, the scheduling policy generator is configured to obtain the execution time lengths of the task blocks corresponding to the multiple operators respectively by using the following method:
determining a simulation model corresponding to the operator based on the operator;
and operating the simulation model, and determining the execution time required by each task block of the operator based on the operation time of the simulation model.
In one possible embodiment, the scheduling policy generator, when determining the execution time required for each task block of the operator based on the execution time of the simulation model, is configured to:
determining the number of task blocks obtained by dividing the operator in the simulation process according to the size of an operation unit in operator execution equipment for operating the simulation model and the data volume to be processed by the operator in the execution process;
determining the batch to be processed according to the number of task blocks obtained by dividing the operator in the simulation process and the number of operation units in operator execution equipment for operating the simulation model;
and determining the execution time length of the task block corresponding to each operator based on the batch and the operation time length of the simulation model.
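The batch computation described above can be sketched as follows; the function name and the assumption that the per-block duration is the simulated runtime divided by the batch count are ours.

```python
import math

def block_duration_from_simulation(num_blocks, num_units, sim_runtime):
    """Infer per-task-block execution duration from one simulation run:
    the task blocks are processed in ceil(num_blocks / num_units) batches,
    so each batch takes sim_runtime / batches."""
    batches = math.ceil(num_blocks / num_units)
    return sim_runtime / batches

# 10 task blocks on 4 operation units -> 3 batches; a 3.0 s simulated run
print(block_duration_from_simulation(10, 4, 3.0))  # -> 1.0
```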
In one possible implementation, when determining the scheduling policy for the operation units in the operator execution device for executing the multiple operators, based on the operation information corresponding to each operator and the computing power information of the operator execution device, the scheduling policy generator is configured to:
construct an association relation among the execution duration of the task blocks corresponding to each of the multiple operators, the number of task blocks corresponding to each of the multiple operators, the parallelism among the multiple operators, the computing power information of the operator execution device, the policy scheduling parameters, and the total execution duration of the multiple operators;
and adjust the policy scheduling parameters based on the association relation, with the goal of reducing the total execution duration, to obtain the target scheduling policy.
In a possible implementation, when constructing the association relation among the execution duration of the task blocks corresponding to each of the multiple operators, the number of task blocks corresponding to each of the multiple operators, the parallelism among the multiple operators, the computing power information of the operator execution device, the policy scheduling parameters, and the total execution duration of the multiple operators, the scheduling policy generator is configured to:
construct a relational equation that takes the execution duration of the task blocks corresponding to each operator, the number of task blocks corresponding to each operator, and the computing power information of the operator execution device as parameters, the policy scheduling parameters as independent variables, the total execution duration of the multiple operators as the dependent variable, and the parallelism among the multiple operators as a constraint; and take this relational equation as the association relation.
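As a toy illustration of such a relational equation — purely an assumption about its form, since the patent leaves the equation abstract — consider two operators that may run in parallel on a fixed pool of operation units. The policy scheduling parameter is the unit split between them, the constraint is that both shares fit in the pool, and the dependent variable is the total execution duration (the makespan).

```python
import math

def total_duration(units_a, units_b, blocks_a, blocks_b, t_a, t_b):
    """Dependent variable: with the two operators fully parallel, the total
    execution duration is the finish time of the slower operator, where each
    operator runs ceil(blocks / units) batches of its per-block duration."""
    return max(math.ceil(blocks_a / units_a) * t_a,
               math.ceil(blocks_b / units_b) * t_b)

TOTAL_UNITS = 8  # computing power information of the execution device (assumed)

# Adjust the policy scheduling parameter (the split) to reduce the total
# execution duration, subject to the constraint units_a + units_b = TOTAL_UNITS.
best_split = min(
    ((a, TOTAL_UNITS - a) for a in range(1, TOTAL_UNITS)),
    key=lambda s: total_duration(s[0], s[1],
                                 blocks_a=16, blocks_b=8, t_a=1.0, t_b=2.0),
)
print(best_split)  # -> (4, 4)
```

With 16 one-second blocks against 8 two-second blocks, an even 4/4 split balances the two finish times at 4 s, which is why the search lands there.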
In one possible implementation, the instruction generator, when generating an operator starting instruction based on the scheduling policy and sending the operator starting instruction to the operator executing device, is configured to:
generating the operator starting instruction based on the scheduling information;
responding to the arrival of an instruction sending moment corresponding to any operator starting instruction, and sending the any operator starting instruction to the operator executing equipment; and the instruction sending time is determined based on the instruction execution time corresponding to any operator starting instruction.
In a possible implementation, the operator initiation instruction includes: a first field for carrying the scheduling information, and at least one of the following fields:
a second field for carrying size information of the data to be processed corresponding to the operator to be started, a third field for carrying size information corresponding to a task block, a fourth field for carrying the size of the memory space required, and a fifth field for carrying the code address corresponding to the operator to be started.
In a fourth aspect, an embodiment of the present disclosure further provides an operator scheduling method, including:
the scheduling strategy generator generates a scheduling strategy for scheduling the operation unit in the operator execution equipment when the operator execution equipment is used for executing the multiple operators of the deep learning model; and transmitting the scheduling policy to an instruction generator;
the instruction generator generates an operator starting instruction based on the scheduling strategy and sends the operator starting instruction to the operator executing equipment.
In a fifth aspect, an embodiment of the present disclosure further provides a chip, including: the operator execution device of any of the first aspects, and/or the operator scheduling device of any of the third aspects.
In a sixth aspect, embodiments of the present disclosure further provide a computer device, including: the chip of the fifth aspect.
In a seventh aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when executed, the computer program performs the operator execution method of the second aspect or any implementation of the second aspect, or the operator scheduling method of the fourth aspect or any implementation of the fourth aspect.
The operator execution device provided by the embodiments of the present disclosure includes an operator scheduler and an execution unit. In response to receiving an operator starting instruction for an operator to be started, the operator scheduler issues the instruction to the corresponding execution unit; after receiving the instruction, the execution unit executes the operator to be started based on the scheduling information carried in it. In this process the scheduling information is generated not by the operator execution device but by the device that initiates the operator starting instruction, so the host issuing the operator start can plan in advance the specific process by which the execution device executes the operator, control the issuing of operators, and thereby improve the utilization rate of the computing resources in the execution device.
The operator scheduling device provided by the embodiments of the present disclosure includes a scheduling policy generator and an instruction generator. The scheduling policy generator generates a scheduling policy for scheduling the operation units in the operator execution device when that device executes the operators of a deep learning model, and passes the policy to the instruction generator. The instruction generator generates operator starting instructions based on the policy and sends them to the operator execution device, which then executes the corresponding operators. The specific process by which the execution device executes each operator can thus be planned in advance, the issuing of operators can be controlled, and the utilization rate of the computing resources in the execution device is improved.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. The drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those of ordinary skill in the art can derive further related drawings from them without inventive effort.
Fig. 1 is a schematic diagram illustrating an operator executing apparatus provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating a specific example of performing a deep learning task through cooperative work of an operator scheduling device and an operator executing device according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating an operator scheduling apparatus provided in an embodiment of the present disclosure;
FIG. 4 shows an example format of an operator startup instruction provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating a method for executing an operator provided by an embodiment of the present disclosure;
fig. 6 shows a flowchart of an operator scheduling method provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Research shows that, with the wide use of artificial intelligence, the models and data volumes of deep learning keep growing, so the number of operators or operations a computer must execute increases sharply, and scheduling overhead takes up an ever larger share of the computing resources. On one hand, a host running a deep learning framework such as TensorFlow, PyTorch, or PaddlePaddle parses the various deep learning models to obtain their operators and simply dispatches those operators to the operator execution device, which is responsible for scheduling the computing power and executing the operators. Because the host cannot know the state of the computing resources in the execution device, it cannot control the issuing of operators according to that state, which lowers the utilization rate of the computing resources of the operator execution device.
On the other hand, in neural network training, the operators that a training system of a deep learning model must execute include computation operators and communication operators. Computation operators include, for example, convolution operators and fully-connected operators; communication operators include, for example, the allreduce operator. When the execution device runs these two kinds of operators, computation and communication preempt the computing resources from each other. This not only causes the computation to fluctuate or jitter greatly, but also significantly reduces the efficiency of the training system.
Based on this research, the present disclosure provides an operator execution device that includes an operator scheduler and an execution unit. In response to receiving an operator starting instruction for an operator to be started, the operator scheduler issues the instruction to the corresponding execution unit; after receiving the instruction, the execution unit executes the operator to be started based on the scheduling information carried in it. In this process the scheduling information is generated not by the operator execution device but by the device that initiates the operator starting instruction, so the host issuing the operator start can plan in advance the specific process by which the execution device executes the operator, control the issuing of operators, and thereby improve the utilization efficiency of computing resources.
In addition, embodiments of the present disclosure provide an operator scheduling device that includes a scheduling policy generator and an instruction generator. The scheduling policy generator generates a scheduling policy for scheduling the operation units in the operator execution device when that device executes the operators of a deep learning model, and passes the policy to the instruction generator. The instruction generator generates operator starting instructions based on the policy and sends them to the operator execution device, which then executes the corresponding operators. The specific process by which the execution device executes each operator can thus be planned in advance, the issuing of operators can be controlled, and the utilization rate of the computing resources in the execution device is improved.
The drawbacks described above were identified by the inventors only after practice and careful study; therefore, both the discovery of these problems and the solutions the present disclosure proposes for them should be regarded as contributions made by the inventors in the course of this disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiments, the operator execution device disclosed in the embodiments of the present disclosure is first described in detail. The operator execution device provided by the embodiments of the present disclosure is generally deployed in a computer device. The computer device is, for example, a terminal device, a server, or another processing device, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device.
Referring to fig. 1, a schematic structural diagram of an operator executing apparatus provided in the embodiment of the present disclosure is shown; the operator execution device comprises: operator scheduler 10, and execution unit 20;
the operator scheduler 10 is configured to respond to an operator starting instruction of an operator to be started, and issue the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
the execution unit 20 is configured to, in response to receiving the operator starting instruction issued by the operator scheduler, execute the operator to be started based on scheduling information carried in the operator starting instruction.
In a specific implementation, the operator execution device is, for example, an Artificial Intelligence (AI) chip or a Graphics Processing Unit (GPU). The execution unit 20 in the operator execution device is, for example, a data processing core in the AI chip or GPU.
The execution unit 20 includes, for example: a task block scheduler 201 and a plurality of arithmetic units 202;
the task block scheduler 201 is configured to, in response to receiving the operator starting instruction issued by the operator scheduler 10, determine, based on scheduling information carried in the operator starting instruction, a target operation unit for executing the operator to be started and a task block to be executed from the plurality of operation units 202; issuing the task block to be executed to the target operation unit; each task block to be executed comprises a plurality of subtasks in the operator to be started;
the operation unit 202 is configured to execute a data processing task corresponding to the task block to be executed in response to receiving the task block to be executed issued by the task block scheduler 201.
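The two-level dispatch described above (the operator scheduler routing by operator type, then a task block scheduler issuing blocks to operation units) can be sketched as follows; all class, field, and type names are illustrative assumptions, not identifiers from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class StartInstruction:
    op_type: str        # operator type carried in the instruction, e.g. "compute"
    first_block: int    # identifier of the starting task block
    num_blocks: int     # number of task blocks to execute

class ExecutionUnit:
    """Holds a task block scheduler that splits a start instruction into blocks."""
    def __init__(self):
        self.issued = []                    # block ids handed to operation units

    def start(self, instr: StartInstruction):
        # task block scheduler: derive the block range from the scheduling info
        blocks = range(instr.first_block, instr.first_block + instr.num_blocks)
        self.issued.extend(blocks)

class OperatorScheduler:
    """Routes operator start instructions to execution units by operator type."""
    def __init__(self, units_by_type):
        self.units_by_type = units_by_type  # e.g. {"compute": ExecutionUnit()}

    def dispatch(self, instr: StartInstruction):
        self.units_by_type[instr.op_type].start(instr)
```

Dispatching a "compute" instruction for blocks 4..6 would leave those block identifiers queued at the corresponding execution unit.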
The operator scheduler 10 receives a start instruction generated, for example, by an operator scheduling device. The specific manner in which the operator scheduling device generates the operator starting instruction is shown in the embodiment corresponding to fig. 3 below, and is not repeated here.
As shown in fig. 2, the embodiment of the present disclosure provides a specific example of executing a deep learning task by cooperative work of an operator scheduling device and an operator executing device.
The operator scheduling device is used for analyzing the deep learning model to obtain an operator, generating a scheduling strategy for scheduling the operation unit, generating an operator starting instruction according to the scheduling strategy, and sending the operator starting instruction to the flow queue. The operator scheduling device may be, for example, a central processor in a computer device.
The flow queue is positioned between the operator scheduling device and the operator executing device and used for storing the operator starting instruction which is not read by the operator executing device.
The operator execution device comprises an operator scheduler (kernel scheduler) and an execution unit; the execution unit comprises a task Block scheduler (Block scheduler) and a plurality of arithmetic units;
the operator scheduler acquires operator starting instructions from the flow queue (SQ) and distributes them to different execution units based on the operator type carried in each instruction; there are multiple kinds of execution units, and execution units of different types can execute different operators. The operator types include, for example: computation operators for executing computation tasks, operators for executing data copy tasks, and operators for executing data synchronization tasks. Correspondingly, due to differences in structure, the types of operators that different types of execution units can execute vary. The execution unit receives the operator starting instruction and executes the corresponding operator.
The execution unit for executing computation tasks includes a task Block scheduler (Block scheduler) and a plurality of operation units. The task block scheduler is configured to receive an operator starting instruction dispatched by the operator scheduler and decompose it into a plurality of task blocks (blocks); each block includes a plurality of subtasks. The task block scheduler can learn whether each operation unit is in a working state; after obtaining the blocks, it dispatches them to the corresponding operation units according to the scheduling information of the scheduling policy carried in the operator starting instruction. An operation unit executes the computation task corresponding to a block after receiving it.
After an operation unit finishes executing the computation task corresponding to a block, it may send a level signal indicating that it is currently idle to the task block scheduler, and the task block scheduler learns the working state of the operation unit from the level signal.
Each operation unit is generally composed of a two-dimensional Processing Engine (PE) array and a register array (local register file); each PE includes computing elements such as multiplier-adders for performing the specific computation tasks of each block. Each operation unit can process multiple data items synchronously; for example, if the PE array in an operation unit includes S PEs, the S PEs can process S data items synchronously; that is, the operation unit can process at most S data items at a time. The task for processing one data item is called a subtask, and the S subtasks completed synchronously by the S PEs in an operation unit constitute one task block.
The number of task blocks corresponding to an operator is related to the configuration of the operation units. Assuming the number of data items to be processed by the operator is H, and the number of PEs in the PE array of each operation unit is S, the number of task blocks corresponding to the operator is H/S rounded up to an integer. For example, if the data to be processed is image data of size w × h × c, where w represents the image width, h the image height, and c the number of channels, then the number of data items H satisfies: H = w × h × c.
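The counting rule above can be sketched as follows (the function name and the image-shaped input are illustrative assumptions):

```python
import math

def num_task_blocks(w, h, c, s):
    """Task blocks needed for an operator over a w x h x c image,
    where each of the S PEs in an operation unit handles one subtask."""
    H = w * h * c              # total data items to process: H = w * h * c
    return math.ceil(H / s)    # round H / S up to the next integer
```

For a 4 x 4 x 3 image and 16 PEs per operation unit this gives exactly 3 task blocks; a 5 x 5 x 3 image (75 data items) needs 5, since the last block is only partially filled.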
The scheduling information carried in the operator start instruction includes, for example, at least one of:
the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of a starting task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed.
The identifier of the operator to be started can be represented by the code address of the operator to be started. The execution code of the operator can be obtained through the code address, and the data processing task corresponding to the operator is realized through that execution code.
The operator to be started refers to the operator which needs to be started at the moment of starting the operator.
The number of the task blocks to be executed in the operator to be started refers to the number of the task blocks which need to be executed and correspond to the starting time of the operator.
The initial task block identifier of the task block to be executed refers to an identifier of a first task block in a plurality of task blocks which need to be executed and correspond to the operator starting time.
The number of the arithmetic units for executing the task block to be executed refers to the number of the arithmetic units to be allocated to the operator to be started when the arithmetic units are used for executing the task block to be executed in the operator to be started.
The operation unit identifier for executing the task block to be executed refers to a specific identifier of an operation unit to be allocated to an operator to be started when the operation unit is used for executing the task block to be executed in the operator to be started.
Illustratively, if the scheduling information includes the number of task blocks to be executed in the operator to be started and the task block identifier of the starting task block among the task blocks to be executed, the task block scheduler is configured to: based on the task block identifier of the starting task block, and starting from that task block, determine a number of task blocks matching the number of task blocks to be executed as the task blocks to be executed.
After determining the task blocks to be executed, the task block scheduler issues the task blocks to be executed to the arithmetic unit; and the arithmetic unit which receives the task block executes the received task block.
In addition, when the task block scheduler issues the task blocks to be executed to the arithmetic units, the task block scheduler is configured to determine the task blocks to be executed issued to each target arithmetic unit from the task blocks to be executed based on the number of the task blocks to be executed and the number of the target arithmetic units; and issuing the task blocks to be executed determined for each target operation unit to each target operation unit.
The task block scheduler may determine the target operation units based on the number of operation units for executing the task blocks to be executed, or based on the identifiers of those operation units. When the scheduling information includes neither the number nor the identifiers of the operation units for executing the task blocks to be executed, the target operation units may be determined according to the working state of each operation unit.
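A minimal sketch of how a task block scheduler might combine these three cases (explicit unit identifiers, only a unit count, or neither) with the working states it has collected; all names are illustrative assumptions:

```python
def select_blocks_and_units(info, busy):
    """info: scheduling information from the operator start instruction;
    busy: per-operation-unit working-state flags known to the scheduler."""
    # blocks to execute: start at the starting task block, take num_blocks of them
    blocks = list(range(info["first_block"], info["first_block"] + info["num_blocks"]))
    if "unit_ids" in info:                      # explicit unit identifiers given
        units = info["unit_ids"]
    elif "num_units" in info:                   # only a count given: take idle units
        idle = [i for i, b in enumerate(busy) if not b]
        units = idle[:info["num_units"]]
    else:                                       # neither given: use all idle units
        units = [i for i, b in enumerate(busy) if not b]
    # distribute the blocks over the chosen target units round-robin
    return {u: blocks[i::len(units)] for i, u in enumerate(units)}
```

With blocks 4..9, two requested units, and units 1 and 3 idle, the scheduler would issue blocks {4, 6, 8} to unit 1 and {5, 7, 9} to unit 3.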
In the operator execution device provided by the embodiment of the present disclosure, when the operator execution device executes an operator, the scheduling information is not generated by the operator execution device itself, but by the device that initiates the operator starting instruction. Thus the host that issues operator starts can plan in advance the specific process by which the execution device executes operators, the issuing of operators is controlled, and the utilization efficiency of computing resources is improved.
Referring to fig. 3, a schematic structural diagram of an operator scheduling device is provided for the embodiment of the present disclosure, where the operator scheduling device includes: a scheduling policy generator 30, and an instruction generator 40;
the scheduling policy generator 30 is configured to generate a scheduling policy for scheduling an arithmetic unit in the operator execution device when the operator execution device executes an operator of the deep learning model; and transmitting the scheduling policy to the instruction generator 40;
the instruction generator 40 is configured to generate an operator starting instruction based on the scheduling policy, and send the operator starting instruction to the operator executing device.
In a specific implementation, the operator scheduling device, for example, may include: a central processor in a computer device; the operator scheduling device can be used for running a deep learning framework or a deep model training system; when the operator scheduling equipment runs the deep learning framework, executing a deep learning model; or when the operator scheduling equipment runs the deep model training system, training the deep learning model to be trained.
Illustratively, when the operator scheduling device executes the deep learning model or trains the deep learning model, for example, the deep learning model may be analyzed to obtain a plurality of operators included in the deep learning model; the operators include, for example, a computation class operator and a communication class operator, wherein the computation class operator includes, for example: convolution operator, full join operator, activation operator, etc.; the communication class operators include, for example: global reduction operator allreduce, global collection operator allgather, syncBN, etc.
When the scheduling policy generator 30 in the operator scheduling apparatus generates the scheduling policy, for example, it may:
analyzing the deep learning model to obtain a plurality of operators in the deep learning model;
determining a scheduling strategy for an operation unit in operator execution equipment when a plurality of operators are executed based on operation information respectively corresponding to the operators and calculation force information of the operator execution equipment;
each task block comprises a plurality of subtasks in a corresponding operator; and scheduling an operator execution device to execute a plurality of operators based on the scheduling strategy.
The scheduling policy includes, for example: operator starting time and scheduling information corresponding to each operator starting time;
the scheduling information includes at least one of: the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of a starting task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed.
The operation information corresponding to each of the operators includes the execution duration of the task blocks corresponding to each operator, and the parallelism among the operators.
The execution duration of a task block is, for example, the duration the operation unit needs to execute the plurality of subtasks corresponding to one task block. A task block generally includes multiple subtasks that can be executed in parallel by multiple PEs in the PE array; therefore, the execution duration of a task block is the same as, or approximately equal to, the execution duration of one subtask. A subtask is, for example, the processing task for any one data item in the image data.
In another embodiment of the present disclosure, the scheduling policy generator, before determining the scheduling policy, is further configured to: and acquiring the execution time length of the task block corresponding to the operators respectively. The execution duration of the task block may be determined by, but not limited to, at least one of the following (1) or (2):
(1) and estimating the execution time required by executing the task block based on the internal memory required to be occupied by the task block during execution and the computing power information of the computing unit.
In a specific implementation, the computing power information of an operation unit includes, for example, the number of operations it performs per second (OPS). The computing power information of each operation unit can be determined, for example, in the following manner: determine the number of PEs included in the operation unit, where one PE can perform n operations per second, n representing the computing power information of one PE; then determine the computing power information of the operation unit, i.e., its total OPS, from the number of PEs it includes and the computing power information of each PE. Correspondingly, the total OPS of the execution unit satisfies: R × P × n, where R represents the number of PEs in one operation unit and P represents the number of operation units included in the execution unit.
When estimating the execution duration required to execute a task block based on the memory the task block needs to occupy during execution and the computing power information of the operation units, for example, the memory access duration can be determined based on the memory the operator needs to occupy during execution and the memory access bandwidth of the operation units; the computation duration of the task block can be determined based on the operations required by each computation step of each subtask in the task block and the computing power information of each operation unit;
and determining the execution time length required for executing the task block based on the memory access time length and the calculation time length.
The memory that a task block needs to access during execution can represent the amount of data the task block processes during execution. Since the operator type is known, the computation steps of each subtask can be determined when the operator is executed, so the ops a PE requires for each computation operation can be determined; for example, if a PE is a multiplier-adder used to execute multiply-add operations, each multiply-add operation requires 2 ops. Because the computation steps of each subtask are known, i.e., the required computation operations are known, the ops consumed by each computation step can be determined from the ops required by the operation corresponding to that step; the computation duration of one subtask then follows from its specific computation steps. A task block includes multiple subtasks executed in parallel by multiple PEs; therefore, the computation duration of each subtask is the computation duration of the task block.
And determining the memory access duration of the task block according to the data volume and the memory access bandwidth of the memory access task corresponding to the task block, which need to access the memory.
And determining the execution time length required for executing one task block based on the calculation time length and the memory access time length.
Illustratively, the operator is, for example, a matrix multiplication operator, and the matrix multiplication task in each task block satisfies: C_(M×N) = A_(M×K) × B_(K×N), where M and K represent the dimensions of operand A, and K and N represent the dimensions of operand B. If each data item occupies b bits, the amount of data in memory that the operator needs to access during execution (the memory access data amount) is (M×K + K×N + M×N) × b/8 bytes, where M×K represents the amount of operand-A data to be read from memory, K×N the amount of operand-B data to be read from memory, and M×N the amount of result data to be stored to memory. The number of multiply-add operations performed by the multiplier-adders (i.e., PEs) in the operation units of the execution unit when executing the operator is M × K × N.
The computation duration required for executing one task block of the operator is: 2 × M × K × N ÷ the total OPS of the operation units. The memory access duration required by the operation units when executing one task block of the operator is: the memory access data amount ÷ the memory access bandwidth.
The execution duration of one task block corresponding to the operator is the sum of the computation duration and the memory access duration; that is, the execution duration of one task block of the matrix multiplication operator is: 2 × M × K × N ÷ the total OPS of the operation units + the memory access data amount ÷ the memory access bandwidth.
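Under the assumption stated earlier that each multiply-add operation costs 2 ops, the estimate above can be sketched as follows (function and parameter names are illustrative; both throughput figures are expressed per processing cycle so the result is in cycles):

```python
def matmul_block_duration(M, K, N, b, total_ops_per_cycle, bytes_per_cycle):
    """Estimated execution duration (in cycles) of one task block computing
    C(MxN) = A(MxK) x B(KxN), with b-bit data items."""
    mem_bytes = (M * K + K * N + M * N) * b / 8    # read A and B, write back C
    compute = 2 * M * K * N / total_ops_per_cycle  # each multiply-add costs 2 ops
    access = mem_bytes / bytes_per_cycle           # memory access duration
    return compute + access                        # sum of the two durations
```

For a small 4 x 4 by 4 x 4 multiply with 32-bit data, 64 ops per cycle, and 16 bytes per cycle of bandwidth, the estimate is 2 cycles of compute plus 12 cycles of memory access.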
(2) Determining a simulation model corresponding to the operator based on the operator; and operating the simulation model, and determining the execution time of each task block of the operator based on the operation time of the simulation model.
In a specific implementation, the simulation model corresponding to the operator determined based on the operator only includes the operator, and does not include other operators. And the parameters of the operator in the simulation model are consistent with the parameters of the operator in the deep learning model. And then operating the simulation model to obtain the operation duration of the simulation model.
Since the size W × H of each operation unit in the operator execution device running the simulation model and the number S of operation units are known, the number of task blocks into which the operator is divided during simulation, U/(W × H), can be determined from the size W × H and the data amount U to be processed by the operator. Because multiple operation units can process multiple task blocks synchronously, the number of batches to be processed, U/(W × H)/S, can then be determined from the number of task blocks and the number S of operation units. Then, from the running duration T of the simulation model and the number of batches to be processed, the execution duration of each task block is obtained as: T ÷ (U/(W × H)/S).
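A sketch of the simulation-based estimate, using the symbols defined above (T, U, W, H, S):

```python
def block_duration_from_simulation(T, U, W, H, S):
    """Per-task-block execution duration derived from a simulation run of
    duration T, where the operator processes U data items, each operation
    unit is a W x H PE array, and there are S operation units."""
    num_blocks = U / (W * H)      # task blocks the operator is divided into
    num_batches = num_blocks / S  # S units process S blocks synchronously
    return T / num_batches        # duration of one batch = one task block
```

For example, a 100-cycle simulation of an operator with 1024 data items on four 8 x 8 operation units yields 16 blocks in 4 batches, i.e. 25 cycles per task block.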
Here, when the simulation model is run, the configuration of the execution device used may be the same as or different from the configuration of the operator execution device when the deep learning model is run.
After the scheduling policy generator analyzes the deep learning model to obtain the plurality of operators it includes, the scheduling policy generator determines the execution duration of the task blocks in each operator according to manner (1) or (2) above, determines the scheduling policy for the operation units in the operator execution device according to the operation information corresponding to each of the operators and the computing power information of the operator execution device, and transmits the scheduling policy to the instruction generator.
The operation information corresponding to each operator includes, for example: the execution duration of the task blocks corresponding to each of the operators, and the parallelism among the operators; each task block includes a plurality of subtasks in the corresponding operator.
The embodiment of the present disclosure further provides a specific method for determining a scheduling policy for the arithmetic unit when the scheduling policy generator 30 executes a plurality of operators, including:
and constructing an association relation between the execution time length of a task block corresponding to each of the operators, the number of the task blocks corresponding to each of the operators, the parallelism among the operators, the computing power information of the operator execution equipment, the scheduling strategy parameter and the total execution time length of the operators. And adjusting the scheduling strategy parameters based on the incidence relation by taking the reduction of the total execution time as a target to obtain a target scheduling strategy.
The scheduling strategy comprises the following steps:
operator starting time and scheduling information corresponding to each operator starting time;
the scheduling policy parameters may be used to characterize the scheduling policy. For example, the scheduling policy parameters may include: at least one execution time corresponding to each operator, the task block identifier of the starting task block corresponding to each of those execution times, and the number of required operation units. When the execution time of any operator arrives, the host sends to the operator execution device an operator starting instruction carrying the task block identifier of the corresponding starting task block and the number of required operation units. After receiving the operator starting instruction, the operator execution device schedules operation units according to the required number, and controls the scheduled operation units to execute the task blocks according to the task block identifier.
The association relation may be embodied, for example, in the form of an equation: the execution duration of the task blocks corresponding to each operator, the number of task blocks corresponding to each operator, and the computing power information of the operator execution device are taken as parameters; the scheduling policy parameters as independent variables; the total execution duration of the operators as the dependent variable; and the parallelism among the operators as a constraint condition. A relation equation is constructed, the total execution duration is taken as the optimization target, and the independent variables in the relation equation are adjusted continuously to optimize the total execution duration. Finally, when the total execution duration is minimized (for example, it no longer decreases over multiple optimization iteration cycles), the resulting scheduling policy is taken as the target scheduling policy.
In the process of continuously adjusting the independent variable, for example, the following multiple iteration cycles may be performed:
and in the 0 th iteration cycle, determining a scheduling strategy parameter, and determining the total execution time of the 0 th iteration cycle according to the scheduling strategy parameter and the relation equation. And taking the scheduling strategy parameters as the reference scheduling strategy parameters of the 1 st iteration period, taking the total execution time length of the 0 th iteration period as the reference total execution time length of the 1 st iteration period, and entering the 1 st iteration period.
In the 1 st iteration period, adjusting the reference scheduling strategy parameters of the 1 st iteration period; determining a new total execution time length based on the adjusted scheduling strategy parameters and the relation equation; comparing the new total execution time length with the reference total execution time length; if the new total execution duration is smaller than the reference total execution duration, taking the adjusted scheduling policy parameter determined in the 1 st iteration period as the reference scheduling policy parameter of the 2 nd iteration period, taking the new total execution duration as the reference total execution duration of the 2 nd iteration period, and entering the 2 nd iteration period; and if the new total execution time length is greater than or equal to the reference total execution time length, taking the reference scheduling strategy parameter of the 1 st iteration period as the reference scheduling strategy parameter of the 2 nd iteration period, taking the reference total execution time length of the 1 st iteration period as the reference total execution time length of the 2 nd iteration period, and entering the 2 nd iteration period.
……
In the ith iteration period, adjusting the reference scheduling strategy parameters of the ith iteration period; determining a new total execution time length based on the adjusted scheduling strategy parameters and the relation equation; comparing the new total execution time length with the reference total execution time length; if the new total execution time length is smaller than the reference total execution time length, taking the adjusted scheduling strategy parameter determined in the ith iteration period as the reference scheduling strategy parameter of the (i +1) th iteration period, taking the new total execution time length as the reference total execution time length of the (i +1) th iteration period, and entering the (i +1) th iteration period; and if the new total execution time length is greater than or equal to the reference total execution time length, taking the reference scheduling strategy parameter of the ith iteration period as the reference scheduling strategy parameter of the (i +1) th iteration period, taking the reference total execution time length of the ith iteration period as the reference total execution time length of the (i +1) th iteration period, and entering the (i +1) th iteration period.
And (4) iteratively executing the process until the total reference execution time length is not changed any more in n successive iteration cycles.
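The accept-if-smaller iteration described above is essentially a hill climb over the scheduling policy parameters; a minimal sketch, with the cost function (the relation equation) and the perturbation rule supplied by the caller as placeholders:

```python
import random

def optimize_schedule(params, total_duration, perturb, patience=50, seed=0):
    """Adjust scheduling-policy parameters: accept a candidate only if it
    strictly reduces the total execution duration; stop once the reference
    duration has not changed for `patience` consecutive iterations."""
    rng = random.Random(seed)
    best = params
    best_t = total_duration(params)
    stale = 0
    while stale < patience:
        cand = perturb(best, rng)      # adjust the reference parameters
        t = total_duration(cand)       # new total duration from the relation
        if t < best_t:                 # smaller: becomes the new reference
            best, best_t, stale = cand, t, 0
        else:                          # otherwise keep the reference unchanged
            stale += 1
    return best, best_t
```

With a toy cost such as abs(x - 7) and a perturbation of plus or minus 1, the loop settles at the minimum; in the scheme above, the cost would be the total execution duration given by the relation equation under the parallelism constraints.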
In another embodiment, the scheduling policy for the operators may be determined, before a given deep learning model is executed, according to the deep learning model and the operator execution device; alternatively, scheduling policies corresponding to each of multiple operator execution devices may be determined in advance for a given deep learning model executed on those devices. When the deep learning model is executed, the scheduling policy corresponding to the operator execution device in use can be called directly through a provided interface.
It should be noted that the present disclosure does not limit the form of the equation corresponding to the association relation. A user of the scheme provided by the present disclosure may determine the equation form according to their needs and the hardware environment to which the scheme is applied (such as the number of operation units, the computing power, and the tasks to be executed), or determine it in combination with some algorithm, and then adjust the equation with the iteration method above to obtain the scheduling policy parameters, i.e., the scheduling policy, under which the total execution duration is minimal.
Illustratively, after the scheduling policy generator analyzes the deep learning model, the obtained operators include: A, B, C, and D. The dependency relationships among A, B, C, and D include: A and B need to be executed sequentially, and C and D need to be executed sequentially; the pair AB and the pair CD have no dependency, i.e., A and C, A and D, B and C, B and D can be executed in parallel; operator B has a higher execution priority than operators C and D.
According to the above (1) or (2), the number of task blocks corresponding to A, B, C and D, respectively, and the execution time length that each task block needs to be run are determined as shown in the following table 1:
TABLE 1
Operator   thread   cycle
A          68       20
B          8        80
C          60       20
D          48       20
where thread represents the number of task blocks, and cycle represents the execution duration of one task block, in units of processing cycles.
In the operator execution device, the number of the operation units is 16; each arithmetic unit is capable of synchronously processing a plurality of subtasks in one task block.
The theoretical time consumption for executing each operator in series by using the operator executing device is as shown in table 2 below:
TABLE 2
Operator   thread   cycle   Serial scheduling   Theoretical time consumption
A          68       20      100                 1360
B          8        80      80                  640
C          60       20      80                  1200
D          48       20      60                  960
total                       320                 260
Serial scheduling represents the number of processing cycles actually required to execute each operator when the operators are executed serially.
Taking operator A as an example: it includes 68 task blocks, and the operator execution device includes 16 operation units, so the number of batches needed to execute operator A is 68/16 = 4.25. Because the operators are executed serially, all operation units are dispatched for every batch while operator A is executed, whether or not all of them are actually used in that batch; therefore 5 batches are needed to process the operator. Each batch occupies 20 processing cycles, so 5 batches require 20 × 5 = 100 processing cycles; that is, processing the task corresponding to operator A requires 100 processing cycles.
Theoretical time consumption represents the theoretical number of processing cycles required to execute the operator. Taking operator A as an example: it includes 68 task blocks, and the execution duration of each task block is 20 processing cycles, so the theoretical number of processing cycles is 68 × 20 = 1360.
It can be seen that: the total execution duration of the four operators is: 320 processing cycles.
Theoretically, the total execution duration of the four operators is sum(thread × cycle) ÷ 16 = 260 processing cycles.
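The Serial scheduling and Theoretical time consumption figures in Table 2 can be reproduced with a short calculation (16 being the number of operation units stated above):

```python
import math

operators = {"A": (68, 20), "B": (8, 80), "C": (60, 20), "D": (48, 20)}  # (thread, cycle)
UNITS = 16

# serial scheduling: every batch occupies all 16 units, so batches round up
serial = {op: math.ceil(t / UNITS) * c for op, (t, c) in operators.items()}
serial_total = sum(serial.values())

# theoretical: all 16 units kept busy, so total work spreads evenly over them
theoretical_total = sum(t * c for t, c in operators.values()) / UNITS
```

This yields 100, 80, 80, and 60 cycles for A through D, a serial total of 320 cycles, and a theoretical total of 260 cycles, matching the table.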
If the prior art is adopted and the operation unit is scheduled by the GPU, the specific scheduling process is as shown in table 3 below:
TABLE 3
time   A      B      C      D
0      16x80
80     4x20          12x20
100           8x80   8x80
180                  16x20
200                  4x20
220                         16x60
280    done
Wherein time represents the i-th processing cycle. For the four operators A to D above:
in the 0th processing cycle, 16 operation units are equally allocated to operator A, and 80 processing cycles are executed;
in the 80th processing cycle, 4 operation units are allocated to operator A and 20 processing cycles are executed, after which operator A is completely executed; the other 12 operation units are allocated to operator C and 20 processing cycles are executed.
In the 100th processing cycle, because the execution priority of operator B is higher than that of operator C, 8 operation units are first allocated to operator B and 80 processing cycles are executed, after which operator B is completely executed; the 8 idle operation units are allocated to operator C and 80 processing cycles are executed;
in the 180th processing cycle, 16 operation units are equally allocated to operator C and 20 processing cycles are executed.
In the 200th processing cycle, 4 operation units are allocated to operator C, and after 20 processing cycles operator C is completely executed.
In the 220th processing cycle, 16 operation units are allocated to operator D, and after 60 processing cycles operator D is completely executed.
By the 280th processing cycle, operators A to D are all completely executed.
If the scheduling method provided by the embodiment of the present disclosure is adopted, the specific scheduling process is as shown in table 4 below:
TABLE 4
time   A      B      C      D
0      16x20
20     8x120         8x120
140    4x20          12x20
160           8x80          8x80
240                         16x20
260    done
Wherein, for the four operators A to D:
in the 0th processing cycle, 16 operation units are equally allocated to operator A, and 20 processing cycles are executed.
In the 20th processing cycle, 8 operation units are allocated to operator A and 120 processing cycles are executed; the other 8 operation units are allocated to operator C and 120 processing cycles are executed.
In the 140th processing cycle, 4 operation units are allocated to operator A and 20 processing cycles are executed, after which operator A is completely executed; 12 operation units are allocated to operator C and 20 processing cycles are executed, after which operator C is completely executed.
In the 160th processing cycle, 8 operation units are allocated to operator B and 80 processing cycles are executed, after which operator B is completely executed; the other 8 operation units are allocated to operator D and 80 processing cycles are executed.
In the 240th processing cycle, 16 operation units are allocated to operator D, and after 20 processing cycles operator D is completely executed.
By the 260th processing cycle, operators A to D are all completely executed.
Therefore, 280 processing cycles are needed by using the GPU for scheduling, and only 260 processing cycles are needed by using the scheduling method provided by the embodiment of the disclosure for scheduling, so that the scheduling method provided by the embodiment of the disclosure can improve the execution efficiency of operators, can more fully utilize computing power, and improves the utilization rate of the computing power.
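The two schedules can be checked mechanically; the following sketch (an illustration, not part of the original disclosure) verifies that the Table 4 schedule processes every task block of every operator and finishes at processing cycle 260:

```python
# Per-operator task block counts ("thread") and per-block durations
# ("cycle") from Table 2.
block_cycle = {"A": 20, "B": 80, "C": 20, "D": 20}
thread = {"A": 68, "B": 8, "C": 60, "D": 48}

# (start_time, operator, units, duration) rows of Table 4.
schedule = [
    (0, "A", 16, 20),
    (20, "A", 8, 120), (20, "C", 8, 120),
    (140, "A", 4, 20), (140, "C", 12, 20),
    (160, "B", 8, 80), (160, "D", 8, 80),
    (240, "D", 16, 20),
]

# Each entry processes units * (duration / block_cycle) task blocks.
done = {op: 0 for op in thread}
for start, op, units, duration in schedule:
    done[op] += units * (duration // block_cycle[op])

makespan = max(start + duration for start, _, _, duration in schedule)
print(done, makespan)  # every operator fully processed; makespan 260
```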
The instruction generator 40 may, for example, generate an operator starting instruction in either of, but not limited to, the following manners A or B, and send the operator starting instruction to the operator execution device:
A: in response to the arrival of each operator starting time, writing the scheduling information corresponding to that operator starting time into preset bits of an operator starting instruction, generating the operator starting instruction corresponding to that operator starting time; and sending the operator starting instruction corresponding to each operator starting time to the operator execution device.
In this example, the operator starting instruction is generated after the operator starting time arrives, and is sent to the operator executing device.
B: aiming at each operator starting time, writing scheduling information corresponding to each operator starting time into a preset bit of an operator starting instruction, and generating the operator starting instruction corresponding to each operator starting time; and responding to the arrival of each operator starting time, and sending an operator starting instruction corresponding to each operator starting time to the operator executing equipment.
In this example, after obtaining the scheduling policy, an operator start instruction may be generated based on the scheduling policy; and after any operator starting time arrives, sending the corresponding operator starting instruction to the operator executing equipment.
When the scheduling information corresponding to each operator starting time is written into preset bits of an operator starting instruction to generate the operator starting instruction corresponding to each operator starting time, for example, the following manner may be adopted:
for each operator to be started corresponding to each operator starting time, the scheduling information corresponding to that operator to be started is written into preset bits of the operator starting instruction corresponding to that operator, generating the operator starting instruction corresponding to the operator to be started.
For example, in the example corresponding to Table 4 above, the operator starting times t0 to t4 are:
t0: the 0th processing cycle;
t1: the 20th processing cycle;
t2: the 140th processing cycle;
t3: the 160th processing cycle;
t4: the 240th processing cycle.
For operator starting time t0, an operator starting instruction L0-A is generated for operator A;
for operator starting time t1, an operator starting instruction L1-A is generated for operator A, and an operator starting instruction L1-C is generated for operator C;
for operator starting time t2, an operator starting instruction L2-A is generated for operator A, and an operator starting instruction L2-C is generated for operator C;
for operator starting time t3, an operator starting instruction L3-B is generated for operator B, and an operator starting instruction L3-D is generated for operator D;
for operator starting time t4, an operator starting instruction L4-D is generated for operator D.
For the instruction sending time corresponding to any operator starting instruction, for example, after the operator starting time corresponding to the operator starting instruction is determined, the operator starting time is reduced by at least one processing cycle to obtain the instruction sending time. For example, in the example shown in Table 4, at the 0th processing cycle, 16 operation units need to be equally allocated to operator A, so the instruction sending time may be determined as the -1st processing cycle. Here, the i-th processing cycle is defined with respect to the four operators A to D.
In the 20 th processing cycle, 8 arithmetic units need to be allocated to the operator a, and 8 arithmetic units need to be allocated to the operator C, so that the instruction transmission time can be determined as the 19 th processing cycle.
In another possible implementation, the operator starting time may also be used as the instruction sending time; that is, when the operator starting time arrives, the scheduling information is written into the preset bits of the operator starting instruction to generate the operator starting instruction.
And after the instruction generator generates an operator starting instruction, the operator starting instruction is sent to the operator executing equipment.
Operator initiation instructions include, for example: a first field for carrying the scheduling information, and at least one of the following fields:
a second field used for carrying size information of the data to be processed corresponding to the operator to be started, a third field used for carrying size information corresponding to a task block, a fourth field used for carrying the size of the required memory space, and a fifth field used for carrying the code address corresponding to the operator to be started.
Referring to fig. 5, a structural example of an operator starting instruction is provided, wherein the operator starting instruction includes a command header and a command body. The header includes four fields for storing, in turn, the following data: a command type (cmd_type), a command subtype (cmd_sub_type), a command body length (Body_length), and a cyclic redundancy check (CRC).
In the command body, a plurality of fields, such as the above-described first field, and at least one of the second to fifth fields, are included.
The second field is used for storing the three-dimensional sizes GridDim_X, GridDim_Y and GridDim_Z of the data to be processed corresponding to the operator; the third field is used for storing the three-dimensional size corresponding to one task block of the operator, namely BlockDim_X, BlockDim_Y and BlockDim_Z, together with a reserved field RZ; the fourth field is used for storing the required memory size Share_memory_size; and the fifth field is used for storing the code addresses Kernel_Addr_lo and Kernel_Addr_hi corresponding to the operator. The code address of the operator can be used for acquiring the execution code of the operator, and the execution code is used by the operation unit to execute the data processing task corresponding to the operator.
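The instruction layout described above can be sketched as a data structure (an illustrative sketch; the field types and the example values are assumptions, not the patent's binary encoding):

```python
from dataclasses import dataclass

@dataclass
class OperatorStartInstruction:
    # Command header
    cmd_type: int
    cmd_sub_type: int
    body_length: int
    crc: int
    # First field: scheduling information (subfields a1-a4)
    mode: int            # a1: 0 none / 1 unit count / 2 specific units / 3 reserved
    pe_valid: int        # a2: unit count (Mode 1) or 32-bit unit mask (Mode 2)
    block_start_id: int  # a3: identifier of the starting task block
    block_num: int       # a4: number of task blocks (0 = all task blocks)
    # Second/third fields: data size and task block size
    grid_dim: tuple      # (GridDim_X, GridDim_Y, GridDim_Z)
    block_dim: tuple     # (BlockDim_X, BlockDim_Y, BlockDim_Z)
    # Fourth/fifth fields
    share_memory_size: int
    kernel_addr: int     # combined Kernel_Addr_hi / Kernel_Addr_lo

# Hypothetical encoding of L1-A from the example below:
# 8 operation units for operator A, 48 blocks starting at A-17.
l1_a = OperatorStartInstruction(0, 0, 0, 0, mode=1, pe_valid=8,
                                block_start_id=17, block_num=48,
                                grid_dim=(68, 1, 1), block_dim=(1, 1, 1),
                                share_memory_size=0, kernel_addr=0x1000)
```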
The first field may include a plurality of subfields for storing the following information, respectively:
subfield a1: an instruction Mode, wherein:
Mode = 0 indicates that no operation unit is designated;
Mode = 1 indicates that the number of operation units is designated, and the operator execution device allocates the corresponding number of operation units to the operator starting instruction according to the state of each operation unit;
Mode = 2 indicates that specific operation units are designated;
Mode = 3 is reserved.
Subfield a2: the required operation units PE_VALID, occupying 32 bits, wherein:
when Mode = 1, it represents the number of operation units; for example, if 8 PEs are needed, PE_valid = 8;
when Mode = 2, each bit indicates the corresponding operation unit; for example, if operation units 0 to 7 are required, PE_valid = 0x00FF; if operation units 8 to 15 are required, PE_valid = 0xFF00.
Subfield a3: the starting task block identifier Block_start_id;
subfield a4: the number of task blocks block_num;
wherein Block_start_id and block_num indicate, for each start of the operator, from which task block (Block_start_id) execution begins and how many task blocks (block_num) are executed.
If block_num = 0, it indicates that all task blocks in the operator are to be executed.
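The Mode/PE_VALID convention above can be sketched as follows (an illustrative, hypothetical decoding function, not the patent's implementation):

```python
def decode_pe_valid(mode, pe_valid, num_units=16):
    """Interpret the PE_VALID subfield according to the Mode subfield."""
    if mode == 1:
        # pe_valid is a unit count; which units are used is left to the
        # operator execution device.
        return pe_valid
    if mode == 2:
        # pe_valid is a bitmask: bit i selects operation unit i.
        return [i for i in range(num_units) if pe_valid & (1 << i)]
    raise ValueError("this Mode carries no operation unit information")

print(decode_pe_valid(2, 0x00FF))  # operation units 0-7
print(decode_pe_valid(2, 0xFF00))  # operation units 8-15
```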
Taking the example in Table 4 above:
Operator starting time t0: in the 0th processing cycle, 16 operation units are equally allocated to operator A and 20 processing cycles are executed. The execution duration of one task block of operator A is 20 processing cycles, so the number of task blocks to be executed in operator A is determined to be 16, and the starting task block identifier is A-1.
Then, in the operator starting instruction L0-A corresponding to operator starting time t0, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 16;
a3: Block_start_id = A-1, where A-1 represents the 1st task block in operator A;
a4: block_num = 16.
Operator starting time t1: in the 20th processing cycle, 8 operation units are allocated to operator A and 8 operation units to operator C, each executing 120 processing cycles. Since the execution duration of one task block of operator A is 20 processing cycles, the number of task blocks to be executed in operator A is determined to be 48; and since 16 task blocks of operator A were already scheduled at operator starting time t0, the starting task block of operator A is identified as A-17. The execution duration of one task block of operator C is also 20 processing cycles, so the number of task blocks to be executed in operator C is determined to be 48, with starting task block identifier C-1. Operator starting instructions corresponding to operator A and operator C, respectively, can then be generated.
In the operator starting instruction L1-A corresponding to operator A at operator starting time t1, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 8;
a3: Block_start_id = A-17, where A-17 represents the 17th task block in operator A;
a4: block_num = 48.
In the operator starting instruction L1-C corresponding to operator C at operator starting time t1, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 8;
a3: Block_start_id = C-1, where C-1 represents the 1st task block in operator C;
a4: block_num = 48.
Operator starting time t2: the 140th processing cycle; 4 operation units are to be allocated to operator A and 12 operation units to operator C, each executing 20 processing cycles. The execution duration of one task block of operator A is 20 processing cycles, so the number of task blocks to be executed in operator A is determined to be 4; and since 16 + 48 = 64 task blocks of operator A were already scheduled at operator starting times t0 and t1, the starting task block of operator A is identified as A-65. The execution duration of one task block of operator C is 20 processing cycles, and 48 task blocks of operator C were already scheduled at operator starting time t1, so the starting task block of operator C at operator starting time t2 is identified as C-49, and the number of task blocks to be executed is 12. Operator starting instructions corresponding to operator A and operator C, respectively, can then be generated.
In the operator starting instruction L2-A corresponding to operator A at operator starting time t2, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 4;
a3: Block_start_id = A-65, where A-65 represents the 65th task block in operator A;
a4: block_num = 4.
In the operator starting instruction L2-C corresponding to operator C at operator starting time t2, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 12;
a3: Block_start_id = C-49, where C-49 represents the 49th task block in operator C;
a4: block_num = 12.
Operator starting time t3: the 160th processing cycle; 8 operation units are to be allocated to operator B and 8 operation units to operator D, each executing 80 processing cycles. The execution duration of one task block of operator B is 80 processing cycles and that of operator D is 20 processing cycles. The starting task block of operator B is identified as B-1; since operator B includes only 8 task blocks, the number of task blocks to be executed is the total number of its task blocks. The number of task blocks to be executed in operator D is 32, with starting task block identifier D-1. Operator starting instructions corresponding to operator B and operator D, respectively, can then be generated.
In the operator starting instruction L3-B corresponding to operator B at operator starting time t3, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 8;
a3: Block_start_id = B-1;
a4: block_num = 0.
In the operator starting instruction L3-D corresponding to operator D at operator starting time t3, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 8;
a3: Block_start_id = D-1;
a4: block_num = 32.
Operator starting time t4: the 240th processing cycle; 16 operation units are to be allocated to operator D, executing 20 processing cycles. The execution duration of one task block of operator D is 20 processing cycles; the starting task block identifier of the task blocks to be executed is D-33, and the number of task blocks to be executed is 16. An operator starting instruction corresponding to operator D can then be generated.
In the operator starting instruction L4-D corresponding to operator D at operator starting time t4, the values of the four subfields a1 to a4 are respectively:
a1: Mode = 1;
a2: PE_valid = 16;
a3: Block_start_id = D-33;
a4: block_num = 16.
Through the above process, the host computer 10 generates an operator starting instruction corresponding to each operator starting time, and after any operator starting time arrives, sends the corresponding operator starting instruction to the operator execution device 20.
Therefore, by adding the fields corresponding to the scheduling information to the operator starting instruction, flexible scheduling of operators is realized; for example, operator A can be processed through three schedulings, and the specific scheduling process is as shown in Table 5 below:
TABLE 5
Scheduling  block  Number of PEs  Mode  PE_valid  block_start_id  block_num
1           16     16             1     16        0               16
2           48     8              1     8         16              48
3           4      4              1     4         64              4
In addition, when a deep learning model is trained with a deep learning framework, the model involves not only forward and backward propagation computation tasks but also communication tasks, and in the prior art computation and communication may compete for computing resources. By designating specific operation units (Mode = 2), the two can be isolated: for example, operator A is a computation task occupying operation units 0 to 7, and operator B is a communication task occupying operation units 8 to 15. The scheduling process can be as shown in Table 6 below:
TABLE 6
Operator  block  PE numbers  Mode  PE_valid  block_start_id  block_num
A         16     0~7         2     0x00FF    0               16 or 0
B         48     8~15        2     0xFF00    0               48 or 0
It can be seen from the above examples that the operator scheduling device provided by the embodiment of the present disclosure performs scheduling, so that the execution time of an operator can be effectively reduced, and the utilization rate of computing power in the operator execution device is improved.
After the operator scheduling device sends the operator starting instruction to the operator executing device, an operator scheduler in the operator executing device sends the operator starting instruction to an executing unit capable of executing the operator starting instruction. After receiving an operator starting instruction issued by the operator scheduler, a task block scheduler in the execution unit decomposes the operator starting instruction into a plurality of task blocks to be executed (including at least part of the plurality of task blocks corresponding to the operator) according to the number of the task blocks to be executed carried in the operator starting instruction and the task block identifier of an initial task block in the task blocks to be executed, and issues the task blocks to be executed to the operation unit. And after receiving the task block to be executed, the arithmetic unit executes the received task block.
For example, taking the structural example of the operator starting instruction shown in fig. 5: subfield a1 carries Mode = 0, indicating that no operation unit is designated; the starting task block identifier Block_start_id written in subfield a3 is "13" (represented in binary in the operator starting instruction; decimal is used here for ease of writing); and the number of task blocks block_num written in subfield a4 is 10. Then the operator starting instruction is decomposed into 10 task blocks, whose identifiers are "13" to "22".
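The decomposition step in this example can be sketched as follows (a hypothetical helper, assuming integer task block identifiers; the block_num = 0 convention follows the description above):

```python
def decompose(block_start_id, block_num, total_blocks):
    """Expand an operator starting instruction into the identifiers of
    the task blocks to execute. block_num == 0 is taken to mean all
    remaining task blocks of the operator (an assumption based on the
    description above)."""
    if block_num == 0:
        block_num = total_blocks - block_start_id
    return list(range(block_start_id, block_start_id + block_num))

# The worked example: start identifier 13, 10 blocks -> "13" to "22".
print(decompose(13, 10, total_blocks=68))  # [13, 14, ..., 22]
```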
The operator scheduling device provided by the embodiment of the disclosure comprises a scheduling strategy generator and an instruction generator; the scheduling strategy generator can generate a scheduling strategy for scheduling an operation unit in the operator execution equipment when the operator execution equipment executes the operator of the deep learning model, and transmits the scheduling strategy to the instruction generator; the instruction generator can generate an operator starting instruction based on the scheduling strategy and send the operator starting instruction to the operator executing device, so that the operator executing device can execute the corresponding operator based on the operator starting instruction sent by the operator scheduling device, the specific process of executing the operator by the executing device can be planned in advance, the issuing of the operator is controlled, and the utilization rate of computing resources in the executing device is improved.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides an operator execution method corresponding to the operator execution device; because the principle by which the method in the embodiment of the present disclosure solves the problem is similar to that of the operator execution device described above, the implementation of the operator execution method may refer to the implementation of the device, and repeated details are not described again.
Referring to fig. 5, a flowchart of an operator execution method provided in an embodiment of the present disclosure includes:
s501: the operator scheduler responds to an operator starting instruction of an operator to be started, and issues the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
s502: and the execution unit responds to the received operator starting instruction issued by the operator scheduler and executes the operator to be started based on scheduling information carried in the operator starting instruction.
In one possible embodiment, the scheduling information includes at least one of:
the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of an initial task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
In one possible embodiment, the execution unit includes: a task block scheduler and a plurality of arithmetic units; the executing unit, in response to receiving the operator starting instruction issued by the operator scheduler, executes the operator to be started based on scheduling information carried in the operator starting instruction, including:
the task block scheduler responds to the received operator starting instruction issued by the operator scheduler, and determines a target operation unit for executing the operator to be started and a task block to be executed from a plurality of operation units based on scheduling information carried in the operator starting instruction; issuing the task block to be executed to the target operation unit; each task block to be executed comprises a plurality of subtasks in the operator to be started;
and the arithmetic unit responds to the received task block to be executed issued by the task block scheduler and executes the data processing task corresponding to the task block to be executed.
In one possible embodiment, the scheduling information includes: the method comprises the following steps that the number of task blocks to be executed in an operator to be started and the task block identification of an initial task block in the task blocks to be executed are determined;
The task block scheduler determining a plurality of task blocks to be executed based on the scheduling information carried in the operator starting instruction includes: determining, based on the task block identifier of the initial task block and starting from the initial task block, a plurality of task blocks corresponding to the number of task blocks to be executed as the task blocks to be executed.
In one possible implementation, the issuing, by the task block scheduler, the task block to be executed to the target arithmetic unit includes: determining task blocks to be executed issued to each target arithmetic unit from the task blocks to be executed based on the number of the task blocks to be executed and the number of the target arithmetic units;
and issuing the task blocks to be executed determined for each target operation unit to each target operation unit.
In one possible embodiment, the scheduling information includes: the number of the arithmetic units executing the task block to be executed or the arithmetic unit identification executing the task block to be executed;
the task block scheduler determines a target operation unit for executing the operator to be started from a plurality of operation units based on scheduling information carried in the operator starting instruction, and the task block scheduler comprises:
and determining a target operation unit for executing the operator to be started from a plurality of operation units based on the number of operation units for executing the task block to be executed or the identification of the operation units for executing the task block to be executed.
In a possible implementation, the operator initiation instruction includes: a first field for carrying the scheduling information, and at least one of the following fields:
a second field used for carrying size information of the data to be processed corresponding to the operator to be started, a third field used for carrying size information corresponding to a task block, a fourth field used for carrying the size of the required memory space, and a fifth field used for carrying the code address corresponding to the operator to be started.
Based on the same inventive concept, the embodiment of the present disclosure further provides an operator scheduling method corresponding to the operator scheduling device; because the principle by which the method in the embodiment of the present disclosure solves the problem is similar to that of the operator scheduling device described above, the implementation of the operator scheduling method may refer to the implementation of the device, and repeated details are not described again.
Referring to fig. 6, a flowchart of an operator scheduling method provided in the embodiment of the present disclosure includes:
s601: the scheduling strategy generator generates a scheduling strategy for scheduling the operation unit in the operator execution equipment when the operator execution equipment is used for executing the multiple operators of the deep learning model; and transmitting the scheduling policy to an instruction generator;
s602: the instruction generator generates an operator starting instruction based on the scheduling strategy and sends the operator starting instruction to the operator executing equipment.
In one possible embodiment, the scheduling policy includes: operator starting time and scheduling information corresponding to each operator starting time;
the scheduling information includes at least one of: the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of an initial task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
In one possible implementation, the scheduling policy generator generating a scheduling policy for scheduling the operation units in the operator execution device when the operator execution device executes the plurality of operators of the deep learning model includes:
analyzing the deep learning model to obtain a plurality of operators in the deep learning model;
determining a scheduling strategy for an operation unit in operator execution equipment when a plurality of operators are executed based on operation information respectively corresponding to the operators and calculation force information of the operator execution equipment;
each task block comprises a plurality of subtasks in a corresponding operator; and scheduling an operator execution device to execute a plurality of operators based on the scheduling strategy.
In a possible implementation, the operation information corresponding to each of the operators includes the execution duration of the task blocks corresponding to each operator and the parallelism among the operators.
In a possible embodiment, the method further comprises: the scheduling strategy generator acquires the execution time lengths of the task blocks corresponding to the operators respectively in the following modes:
and estimating the execution time length required for executing each task block of the operator based on the memory occupied by the operator during execution and the computing power information of the operator execution equipment.
In a possible implementation manner, the scheduling policy generator estimating, based on the memory occupied by the operator during execution and the computing power information of the operator execution device, the execution duration required for executing each task block of the operator includes:
determining a memory access duration based on the memory occupied by the operator during execution and the memory access bandwidth of the operation units to the memory; and
determining a computation duration of the task block based on the computing power required by each computation step in each subtask of the task block and the computing power information of each operation unit;
and determining the execution duration required for executing the task block based on the memory access duration and the computation duration.
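A minimal sketch of this estimate, assuming the memory access phase and the computation phase overlap so that the slower of the two dominates (the text does not specify how the two durations are combined, so that choice, like the function and parameter names, is an assumption):

```python
def estimate_block_duration(mem_bytes, mem_bandwidth,
                            flops_per_block, unit_flops):
    """Estimate the processing cycles needed for one task block from the
    memory it touches and the computation it performs."""
    access_cycles = mem_bytes / mem_bandwidth      # bytes / (bytes/cycle)
    compute_cycles = flops_per_block / unit_flops  # flops / (flops/cycle)
    # Assumption: access and computation overlap, so the slower phase
    # determines the block's execution duration.
    return max(access_cycles, compute_cycles)
```

For a compute-bound block (e.g. 400 flops at 20 flops/cycle against 1024 bytes at 64 bytes/cycle) the estimate is governed by the computation duration.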
In a possible embodiment, the method further comprises: the scheduling strategy generator acquires the execution time lengths of the task blocks corresponding to the operators respectively in the following modes:
determining a simulation model corresponding to the operator based on the operator;
and operating the simulation model, and determining the execution time required by each task block of the operator based on the operation time of the simulation model.
In one possible implementation, the determining, by the scheduling policy generator, of the execution duration required by each task block of the operator based on the running duration of the simulation model includes:
determining the number of task blocks into which the operator is divided during simulation, according to the size of the operation units in the operator execution device running the simulation model and the data volume to be processed by the operator during execution;
determining the number of processing batches according to the number of task blocks into which the operator is divided during simulation and the number of operation units in the operator execution device running the simulation model;
and determining the execution duration of the task block corresponding to each operator based on the number of batches and the running duration of the simulation model.
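The simulation-based estimate can be sketched as follows; all names are illustrative, and the assumption that one batch of task blocks runs fully in parallel across the operation units follows from the batching step described above:

```python
import math

def block_time_from_simulation(data_volume, unit_capacity,
                               num_units, sim_runtime):
    """Derive a per-task-block execution duration from one simulation run."""
    # Task blocks the operator is divided into during simulation.
    num_blocks = math.ceil(data_volume / unit_capacity)
    # Batches needed when num_units blocks can run concurrently.
    batches = math.ceil(num_blocks / num_units)
    # Each batch accounts for an equal share of the simulated runtime.
    return num_blocks, batches, sim_runtime / batches
```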
In one possible implementation, the determining, by the scheduling policy generator, of a scheduling policy for the operation units in the operator execution device when executing a plurality of operators, based on the operation information corresponding to each of the operators and the computing power information of the operator execution device, includes:
constructing an association relation among the execution duration of the task blocks corresponding to each of the operators, the number of task blocks corresponding to each of the operators, the parallelism among the operators, the computing power information of the operator execution device, a policy scheduling parameter, and the total execution duration of the operators;
and adjusting the policy scheduling parameter based on the association relation, with the goal of reducing the total execution duration, to obtain a target scheduling policy.
In a possible implementation manner, the constructing of the association relation among the execution duration of the task blocks corresponding to each of the plurality of operators, the number of task blocks corresponding to each of the operators, the parallelism among the operators, the computing power information of the operator execution device, the policy scheduling parameter, and the total execution duration of the operators includes:
constructing a relation equation by taking the execution duration of the task blocks corresponding to each of the operators, the number of task blocks corresponding to each of the operators, and the computing power information of the operator execution device as parameters, the policy scheduling parameter as the independent variable, the total execution duration of the operators as the dependent variable, and the parallelism among the operators as a constraint condition; and taking the relation equation as the association relation.
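Under this relation-equation formulation, adjusting the policy scheduling parameter to reduce the total execution duration can be sketched as an exhaustive search; the grid search, parameter names, and callback signatures are illustrative assumptions, since the embodiment does not fix an optimisation method:

```python
import itertools

def search_schedule(total_duration, param_grid, parallelism_ok):
    """Pick the policy scheduling parameters minimising total execution duration.

    total_duration: the relation equation (parameter assignment -> duration).
    param_grid:     candidate values per policy scheduling parameter.
    parallelism_ok: the constraint derived from operator parallelism.
    """
    best_params, best_time = None, float("inf")
    for combo in itertools.product(*param_grid.values()):
        assignment = dict(zip(param_grid, combo))
        if not parallelism_ok(assignment):   # enforce the constraint condition
            continue
        t = total_duration(assignment)       # evaluate the dependent variable
        if t < best_time:
            best_params, best_time = assignment, t
    return best_params, best_time
```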
In one possible implementation, the generating, by the instruction generator, of an operator start instruction based on the scheduling policy and the sending of the operator start instruction to the operator execution device includes:
generating the operator start instruction based on the scheduling information;
and in response to the arrival of the instruction sending time corresponding to any operator start instruction, sending that operator start instruction to the operator execution device, where the instruction sending time is determined based on the instruction execution time corresponding to that operator start instruction.
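The send-time mechanism can be sketched as a priority queue ordered by sending time; the fixed per-instruction issue latency is an illustrative assumption standing in for however the instruction execution time maps to a sending time:

```python
import heapq

def order_by_send_time(instructions):
    """Order operator start instructions by sending time (illustrative).

    Each instruction dict carries 'op', 'exec_time', and optionally
    'issue_latency'; sending time = exec_time - issue_latency.
    """
    queue = [(ins["exec_time"] - ins.get("issue_latency", 0.0), i, ins)
             for i, ins in enumerate(instructions)]  # index breaks ties
    heapq.heapify(queue)
    sent = []
    while queue:
        send_time, _, ins = heapq.heappop(queue)
        # A real driver would wait until send_time before dispatching.
        sent.append((send_time, ins["op"]))
    return sent
```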
In a possible implementation, the operator start instruction includes a first field for carrying the scheduling information, and at least one of the following fields:
a second field for carrying size information of the data to be processed by the operator to be started, a third field for carrying size information of a task block, a fourth field for carrying the size of the required memory space, and a fifth field for carrying the code address corresponding to the operator to be started.
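A possible binary layout for these fields is sketched below; the field widths, ordering, and little-endian encoding are assumptions, since the embodiment names the fields but not their encoding:

```python
import struct

def pack_start_instruction(sched_info, data_size, block_size,
                           mem_size, code_addr):
    """Pack the five named fields: four 32-bit values plus a 64-bit address."""
    return struct.pack("<IIIIQ", sched_info, data_size,
                       block_size, mem_size, code_addr)
```

A symmetric `struct.unpack("<IIIIQ", raw)` on the device side would recover the fields.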
The embodiment of the present disclosure further provides a chip, including: the operator execution device according to any one of the embodiments of the present disclosure, and/or the operator scheduling device according to any one of the embodiments of the present disclosure.
The embodiment of the present disclosure further provides a computer device, including: the chip described in the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the scheduling method in the foregoing method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying program code, the instructions included in which may be used to execute the steps of the scheduling method in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not repeated here.
The computer program product may be implemented in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, it is embodied in a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a logical division, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through communication interfaces, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes or equivalent substitutions of some of their technical features, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by it. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (24)

1. An operator execution device, comprising: an operator scheduler and an execution unit;
the operator scheduler is used for responding to an operator starting instruction of an operator to be started, and issuing the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
and the execution unit is used for responding to the received operator starting instruction issued by the operator scheduler and executing the operator to be started based on the scheduling information carried in the operator starting instruction.
2. The operator execution device according to claim 1, wherein the scheduling information comprises at least one of:
the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of an initial task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
3. The operator execution device according to claim 1 or 2, wherein the execution unit comprises: a task block scheduler and a plurality of operation units;
the task block scheduler is configured to, in response to receiving the operator start instruction issued by the operator scheduler, determine, from the plurality of operation units and based on the scheduling information carried in the operator start instruction, a target operation unit for executing the operator to be started, and determine task blocks to be executed; and issue the task blocks to be executed to the target operation unit; wherein each task block to be executed comprises a plurality of subtasks in the operator to be started;
and the operation unit is configured to, in response to receiving a task block to be executed issued by the task block scheduler, execute the data processing task corresponding to that task block.
4. The operator execution device according to claim 3, wherein the scheduling information comprises: the number of task blocks to be executed in the operator to be started, and the task block identification of an initial task block among the task blocks to be executed;
the task block scheduler, when determining the plurality of task blocks to be executed based on the scheduling information carried in the operator start instruction, is configured to: determine, based on the task block identification of the initial task block and starting from the initial task block, a number of task blocks matching the number of task blocks to be executed as the task blocks to be executed.
5. The operator execution device according to claim 4, wherein the task block scheduler, when issuing the task blocks to be executed to the target operation units, is configured to: determine, from the task blocks to be executed, the task blocks to be issued to each target operation unit based on the number of task blocks to be executed and the number of target operation units;
and issue the task blocks determined for each target operation unit to that target operation unit.
6. The operator execution device according to claim 4 or 5, wherein the scheduling information comprises: the number of operation units for executing the task blocks to be executed, or the operation unit identifications for executing the task blocks to be executed;
the task block scheduler, when determining the target operation unit for executing the operator to be started from the plurality of operation units based on the scheduling information carried in the operator start instruction, is configured to:
determine the target operation unit for executing the operator to be started from the plurality of operation units based on the number of operation units for executing the task blocks to be executed or the operation unit identifications for executing the task blocks to be executed.
7. The operator execution device according to any of claims 1-6, wherein the operator initiation instruction comprises: a first field for carrying the scheduling information, and at least one of the following fields:
a second field used for carrying size information of the data to be processed corresponding to the operator to be started, a third field used for carrying size information corresponding to a task block, a fourth field used for carrying the size of the required memory space, and a fifth field used for carrying the code address corresponding to the operator to be started.
8. An operator scheduling apparatus, comprising: a scheduling policy generator, and an instruction generator;
the scheduling policy generator is used for generating a scheduling policy for scheduling the operation units in the operator execution device when the operator execution device is used for executing operators of a deep learning model; and transmitting the scheduling policy to the instruction generator;
and the instruction generator is used for generating an operator starting instruction based on the scheduling strategy and sending the operator starting instruction to the operator executing equipment.
9. The operator scheduling device of claim 8, wherein the scheduling policy comprises: operator starting time and scheduling information corresponding to each operator starting time;
the scheduling information includes at least one of: the identification of the operator to be started, the number of task blocks to be executed in the operator to be started, the task block identification of an initial task block in the task block to be executed, the number of operation units for executing the task block to be executed, and the operation unit identification for executing the task block to be executed;
each task block comprises a plurality of subtasks in the operator to be started.
10. The operator scheduling device according to claim 8 or 9, wherein the scheduling policy generator, when generating the scheduling policy for the operation units in the operator execution device when executing multiple operators of the deep learning model by using the operator execution device, is configured to:
analyzing the deep learning model to obtain a plurality of operators in the deep learning model;
determining, based on the operation information corresponding to each of the plurality of operators and the computing power information of the operator execution device, a scheduling policy for the operation units in the operator execution device when executing the plurality of operators, wherein each task block comprises a plurality of subtasks in the corresponding operator;
and scheduling the operator execution device to execute the plurality of operators based on the scheduling policy.
11. The operator scheduling device according to claim 10, wherein the operation information corresponding to each of the plurality of operators comprises the execution duration of the task blocks corresponding to each of the operators and the parallelism among the operators.
12. The operator scheduling device according to claim 10 or 11, wherein the scheduling policy generator is further configured to obtain the execution durations of the task blocks corresponding to each of the plurality of operators by:
estimating the execution duration required for executing each task block of the operator based on the memory occupied by the operator during execution and the computing power information of the operator execution device.
13. The operator scheduling device according to claim 12, wherein the scheduling policy generator, when estimating the execution duration required for executing each task block of the operator based on the memory occupied by the operator during execution and the computing power information of the operator execution device, is configured to:
determine the memory-access duration based on the memory occupied by the operator during execution and the memory-access bandwidth of the operation units to the memory;
determine the computation duration of the task block based on the computing power required by each computation step of each subtask in the task block and the computing power information of each operation unit;
and determine the execution duration required for executing the task block based on the memory-access duration and the computation duration.
14. The operator scheduling device according to claim 10 or 11, wherein the scheduling policy generator is configured to obtain the execution durations of the task blocks corresponding to each of the plurality of operators by:
determining a simulation model corresponding to the operator;
and running the simulation model, and determining the execution duration required by each task block of the operator based on the running duration of the simulation model.
15. The operator scheduling device according to claim 14, wherein the scheduling policy generator, when determining the execution duration required by each task block of the operator based on the running duration of the simulation model, is configured to:
determine the number of task blocks into which the operator is divided during simulation, according to the size of the operation units in the operator execution device running the simulation model and the data volume to be processed by the operator during execution;
determine the number of processing batches according to the number of task blocks into which the operator is divided during simulation and the number of operation units in the operator execution device running the simulation model;
and determine the execution duration of the task block corresponding to each operator based on the number of batches and the running duration of the simulation model.
16. The operator scheduling device according to any one of claims 10 to 15, wherein the scheduling policy generator, when determining the scheduling policy for the operation units in the operator execution device when executing the plurality of operators based on the operation information corresponding to each of the operators and the computing power information of the operator execution device, is configured to:
construct an association relation among the execution duration of the task blocks corresponding to each of the operators, the number of task blocks corresponding to each of the operators, the parallelism among the operators, the computing power information of the operator execution device, a policy scheduling parameter, and the total execution duration of the operators;
and adjust the policy scheduling parameter based on the association relation, with the goal of reducing the total execution duration, to obtain a target scheduling policy.
17. The operator scheduling device according to claim 16, wherein the scheduling policy generator, when constructing the association relation among the execution duration of the task blocks corresponding to each of the plurality of operators, the number of task blocks corresponding to each of the operators, the parallelism among the operators, the computing power information of the operator execution device, the policy scheduling parameter, and the total execution duration of the operators, is configured to:
construct a relation equation by taking the execution duration of the task blocks corresponding to each of the operators, the number of task blocks corresponding to each of the operators, and the computing power information of the operator execution device as parameters, the policy scheduling parameter as the independent variable, the total execution duration of the operators as the dependent variable, and the parallelism among the operators as a constraint condition; and take the relation equation as the association relation.
18. The operator scheduling device according to any one of claims 8 to 17, wherein the instruction generator, when generating an operator start instruction based on the scheduling policy and sending the operator start instruction to the operator execution device, is configured to:
generate the operator start instruction based on the scheduling information;
and in response to the arrival of the instruction sending time corresponding to any operator start instruction, send that operator start instruction to the operator execution device, where the instruction sending time is determined based on the instruction execution time corresponding to that operator start instruction.
19. The operator scheduling device of claim 18, wherein the operator initiation instruction comprises: a first field for carrying the scheduling information, and at least one of the following fields:
a second field used for carrying size information of the data to be processed corresponding to the operator to be started, a third field used for carrying size information corresponding to a task block, a fourth field used for carrying the size of the required memory space, and a fifth field used for carrying the code address corresponding to the operator to be started.
20. A chip, comprising: operator execution apparatus according to any of claims 1-7, and/or operator scheduling apparatus according to any of claims 8-19.
21. An operator execution method, comprising:
the operator scheduler responds to an operator starting instruction of an operator to be started, and issues the operator starting instruction to an execution unit corresponding to the operator type information based on the operator type information carried in the operator starting instruction; the operator starting instruction comprises scheduling information of an operation unit in the execution unit;
and the execution unit responds to the received operator starting instruction issued by the operator scheduler and executes the operator to be started based on scheduling information carried in the operator starting instruction.
22. An operator scheduling method, comprising:
the scheduling policy generator generates a scheduling policy for scheduling the operation units in an operator execution device when the operator execution device is used for executing multiple operators of a deep learning model, and transmits the scheduling policy to an instruction generator;
and the instruction generator generates an operator start instruction based on the scheduling policy and sends the operator start instruction to the operator execution device.
23. A computer device, comprising: the chip of claim 20.
24. A computer-readable storage medium having stored thereon a computer program which, when executed by a computer device, performs the steps of the operator execution method according to claim 21, or the steps of the operator scheduling method according to claim 22.
CN202111450101.5A 2021-11-30 2021-11-30 Operator execution device, operator scheduling device, method and chip Pending CN114138440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111450101.5A CN114138440A (en) 2021-11-30 2021-11-30 Operator execution device, operator scheduling device, method and chip


Publications (1)

Publication Number Publication Date
CN114138440A true CN114138440A (en) 2022-03-04

Family

ID=80386480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111450101.5A Pending CN114138440A (en) 2021-11-30 2021-11-30 Operator execution device, operator scheduling device, method and chip

Country Status (1)

Country Link
CN (1) CN114138440A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024027490A1 (en) * 2022-08-04 2024-02-08 华为技术有限公司 Application acceleration method and apparatus, and related device
CN116225669A (en) * 2023-05-08 2023-06-06 之江实验室 Task execution method and device, storage medium and electronic equipment
CN116225669B (en) * 2023-05-08 2024-01-09 之江实验室 Task execution method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination