CN114911610A - Task compiling method and device and compiler - Google Patents

Task compiling method and device and compiler

Info

Publication number
CN114911610A
CN114911610A
Authority
CN
China
Prior art keywords
compiling
task
npu
tasks
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210369707.4A
Other languages
Chinese (zh)
Inventor
马海涛 (Ma Haitao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Tianjin Co Ltd
Original Assignee
Spreadtrum Communications Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Tianjin Co Ltd filed Critical Spreadtrum Communications Tianjin Co Ltd
Priority to CN202210369707.4A
Publication of CN114911610A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/447Target code generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

An embodiment of the invention provides a task compiling method, a task compiling device, and a compiler. The method comprises the following steps: the compiler receives at least one compiling task input by a user and judges whether the compiling task comprises a plurality of branch tasks; if the compiling task comprises a plurality of branch tasks, the compiler divides the acquired hardware resources of a neural-network processing unit (NPU) according to the plurality of branch tasks to generate a hardware allocation result, and generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation; the compiler then sends the first compiling instruction to a scheduler so that the scheduler schedules the hardware resources according to the first compiling instruction. The hardware resources of the NPU are thus allocated reasonably by the compiler, which improves the operating speed of the NPU.

Description

Task compiling method and device and compiler
[Technical Field]
Embodiments of the present invention relate to the technical field of artificial intelligence (AI), and in particular to a task compiling method, a task compiling device, and a compiler.
[Background]
With the development of technology, the AI industry has grown rapidly, and the demand for image and video processing keeps increasing. A neural-network processing unit (NPU) is a processor dedicated to accelerating neural network inference. An NPU is mainly built with application-specific integrated circuit (ASIC) technology and simulates neural networks in hardware, making up for the shortcomings of the central processing unit (CPU) and the graphics processing unit (GPU) in neural network computing architectures and thereby greatly increasing the operating speed of an AI chip.
In the related art, to further increase the operating speed of AI chips, NPU designs inevitably evolve toward multi-core architectures. At present, however, multi-core NPU development has not formed a complete ecosystem: a mature, targeted compiler front-end analysis model is lacking, and the back-end hardware configuration cannot be allocated reasonably according to the tasks, which reduces the operating speed.
[Summary of the Invention]
In view of this, embodiments of the present invention provide a task compiling method, a task compiling device, and a compiler, so that the hardware resources of an NPU are allocated reasonably by the compiler, improving the operating speed of the NPU.
In a first aspect, an embodiment of the present invention provides a task compiling method, where the method includes:
receiving at least one compiling task input by a user;
judging whether the compiling task comprises a plurality of branch tasks;
if the compiling task comprises a plurality of branch tasks, dividing the acquired hardware resources of the NPU according to the plurality of branch tasks to generate a hardware allocation result;
generating a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation;
and sending the first compiling instruction to a scheduler so that the scheduler schedules the hardware resources according to the first compiling instruction.
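For illustration only, the following Python sketch mirrors the steps above. Every name in it (compile_and_dispatch, send_to_scheduler, the round-robin split of cores) is a hypothetical stand-in rather than part of the claimed method, which does not fix any particular division policy.

```python
def compile_and_dispatch(task, npu_cores, compile_params, core_params, send_to_scheduler):
    """Sketch of the claimed flow: branch check, resource division, instruction, dispatch."""
    branches = task.get("branches", [])
    if len(branches) > 1:
        # divide the acquired NPU hardware resources among the branch tasks
        allocation = {b: [] for b in branches}
        for i, core in enumerate(npu_cores):
            allocation[branches[i % len(branches)]].append(core)
        # first compiling instruction: allocation result plus compiling and core parameters
        instruction = {"allocation": allocation, **compile_params, **core_params}
    else:
        # no branch tasks: second compiling instruction, no division needed
        instruction = {**compile_params, **core_params}
    send_to_scheduler(instruction)

compile_and_dispatch({"branches": ["b1", "b2"]}, [1, 2, 3, 4],
                     {"precision": "fp16"}, {"min_memory": 512 * 1024}, print)
```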
Optionally, the dividing the acquired hardware resources of the NPU according to the plurality of branch tasks to generate a hardware allocation result includes:
clustering the plurality of branch tasks to generate a plurality of task category data;
and dividing the hardware resources according to the plurality of task category data to generate the hardware allocation result.
Optionally, the method further comprises:
if the compiling task does not comprise a branch task, generating a second compiling instruction according to the compiling parameters and the core parameters;
and sending the second compiling instruction to the scheduler so that the scheduler schedules the hardware resources according to the second compiling instruction.
Optionally, there are a plurality of compiling tasks, and before the judging whether the compiling task comprises a plurality of branch tasks, the method further includes:
if the number of compiling tasks is judged to be less than or equal to the number of NPU clusters, allocating each compiling task to its corresponding NPU cluster and performing, in parallel, the step of judging whether each compiling task comprises a plurality of branch tasks.
Optionally, there are a plurality of compiling tasks, and the method further includes:
if the number of compiling tasks is judged to be greater than the number of NPU clusters, calculating the computing demand of each compiling task;
selecting a specific number of compiling tasks from the plurality of compiling tasks input by the user, where the computing demand of each selected compiling task is less than that of the remaining compiling tasks and the specific number is equal to the number of NPU clusters;
and allocating each of the specific number of compiling tasks to its corresponding NPU cluster and performing, in parallel, the step of judging whether each compiling task comprises a plurality of branch tasks.
In a second aspect, an embodiment of the present invention provides a task compiling device, where the device includes:
the receiving module is used for receiving at least one compiling task input by a user;
the acquisition module is used for acquiring the hardware resources of the NPU;
the judging module is used for judging whether the compiling task comprises a plurality of branch tasks;
the generating module is used for dividing the hardware resources according to the plurality of branch tasks to generate a hardware allocation result if the judging module judges that the compiling task comprises a plurality of branch tasks, and for generating a first compiling instruction according to the hardware allocation result, the compiling parameters and the core parameters;
and the sending module is used for sending the first compiling instruction to a scheduler so that the scheduler can schedule the hardware resource according to the first compiling instruction.
Optionally, the generating module includes:
the first generation submodule is used for clustering a plurality of branch tasks to generate a plurality of task category data;
and the second generation submodule is used for dividing the hardware resources according to the plurality of task category data to generate a hardware allocation result.
Optionally:
the generating module is further used for generating a second compiling instruction according to the compiling parameters and the core parameters if the judging module judges that the compiling task does not comprise a branch task;
and the sending module is further configured to send the second compiling instruction to the scheduler, so that the scheduler schedules the hardware resources according to the second compiling instruction.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored program, and when the program runs, a device in which the computer-readable storage medium is located is controlled to execute a task compiling method in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a compiler, including: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the method of task compilation in the first aspect or any possible implementation of the first aspect.
In the technical solution provided by the embodiment of the invention, the compiler receives at least one compiling task input by a user and judges whether the compiling task comprises a plurality of branch tasks; if the compiling task comprises a plurality of branch tasks, the compiler divides the acquired hardware resources of the neural-network processing unit (NPU) according to the plurality of branch tasks to generate a hardware allocation result, and generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation; the compiler sends the first compiling instruction to the scheduler so that the scheduler schedules the hardware resources according to the first compiling instruction. The hardware resources of the NPU are thus allocated reasonably by the compiler, which improves the operating speed of the NPU.
[Description of the Drawings]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a task compiling method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another task compiling method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another task compiling method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a task compiling device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a compiler according to an embodiment of the present invention.
[Detailed Description]
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects, indicating that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
Fig. 1 is a flowchart of a task compiling method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
step 101, a compiler receives at least one compilation task input by a user.
The at least one compilation task may include one compilation task or a plurality of compilation tasks. The compiler receives a compilation task or a plurality of compilation tasks input by a user through the compiler front-end.
Step 102, the compiler acquires the hardware resources of the NPU.
As one option, when there is a single compiling task, the NPU includes a plurality of NPU cores, and the hardware resources include the identifiers of the plurality of NPU cores. For example, the identifiers of the plurality of NPU cores include core 1 to core 8.
As another option, when there are multiple compiling tasks, the NPU includes a plurality of NPU clusters and each NPU cluster includes a plurality of NPU cores; the hardware resources then include the identifiers of the plurality of NPU clusters and the identifiers of the NPU cores of each cluster. For example, the identifiers of the NPU clusters include cluster 1 to cluster 3, the NPU cores of cluster 1 include core 1 to core 4, the NPU cores of cluster 2 include core 5 to core 8, and the NPU cores of cluster 3 include core 9 to core 16.
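As a concrete (and purely illustrative) data model for these two cases, the sketch below represents the hardware resources as identifier lists; the class and variable names are hypothetical and not part of the claimed method.

```python
from dataclasses import dataclass

@dataclass
class NpuCluster:
    cluster_id: int
    core_ids: list[int]          # identifiers of the NPU cores in this cluster

# Single compiling task: the resource is just the core identifiers, e.g. cores 1-8.
single_task_resources = list(range(1, 9))

# Multiple compiling tasks: cluster identifiers plus each cluster's core identifiers.
multi_task_resources = [
    NpuCluster(1, list(range(1, 5))),     # cluster 1: cores 1-4
    NpuCluster(2, list(range(5, 9))),     # cluster 2: cores 5-8
    NpuCluster(3, list(range(9, 17))),    # cluster 3: cores 9-16
]
```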
It should be noted that the order of steps 101 and 102 is not fixed; for example, step 101 may also be performed after step 102.
Step 103, the compiler judges whether the compiling task comprises a plurality of branch tasks; if so, step 104 is executed; if not, step 108 is executed.
A compiling task may include a plurality of tasks, each of which is a branch task of the compiling task. If the compiler judges that the compiling task comprises a plurality of branch tasks, the branch tasks can be executed in parallel and the hardware resources need to be divided, so step 104 is executed; if the compiler judges that the compiling task does not comprise branch tasks, the hardware resources do not need to be divided, so step 108 is executed.
As one option, when there is a single compiling task, the compiler judges whether that compiling task comprises a plurality of branch tasks. If so, the branch tasks can be executed in parallel and the identifiers of the NPU cores need to be divided, so step 104 is executed; if not, the identifiers of the NPU cores do not need to be divided, so step 108 is executed.
As another option, when there are multiple compiling tasks, the compiler allocates them to the corresponding NPU clusters and then judges whether the compiling task in each single NPU cluster comprises a plurality of branch tasks. If so, the branch tasks can be executed in parallel and the identifiers of the NPU cores of that cluster need to be divided, so step 104 is executed; if not, the identifiers of the NPU cores of that cluster do not need to be divided, so step 108 is executed.
Step 104, the compiler divides the hardware resources according to the plurality of branch tasks to generate a hardware allocation result.
As an alternative, when the number of the compiling tasks is one, the compiler divides the identifiers of the plurality of NPU cores according to the plurality of branch tasks, and generates a hardware allocation result. For example, the identifiers of the NPU cores include core 1 to core 8, the plurality of branch tasks include branch task 1, branch task 2, and branch task 3, the compiler divides the cores 1 to 8 according to the branch task 1, the branch task 2, and the branch task 3, and generates a hardware allocation result, where the hardware allocation result includes allocating an NPU core corresponding to the core 1 and an NPU core corresponding to the core 2 to the branch task 1, allocating an NPU core corresponding to the core 3 and an NPU core corresponding to the core 4 to the branch task 2, and allocating an NPU core corresponding to the core 5 to an NPU core corresponding to the core 8 to the branch task 3. The compiler enables the partitioning of hardware resources by partitioning the identity of multiple NPU cores.
As another option, when there are multiple compiling tasks, the compiler divides the identifiers of the NPU clusters according to the compiling tasks, allocates each compiling task to its corresponding NPU cluster, and divides the identifiers of the NPU cores of each cluster according to the branch tasks of that compiling task to generate a hardware allocation result. For example, the compiling tasks include compiling task 1 and compiling task 2; compiling task 1 includes branch task 1 and branch task 2, and compiling task 2 includes branch task 3 to branch task 6; the NPU clusters include cluster 1 to cluster 3, the NPU cores of cluster 1 include core 1 to core 4, the NPU cores of cluster 2 include core 5 to core 8, and the NPU cores of cluster 3 include core 9 to core 16. The compiler divides cluster 1 to cluster 3 according to compiling task 1 and compiling task 2, allocating compiling task 1 to cluster 1 and compiling task 2 to cluster 3. The compiler divides core 1 to core 4 of cluster 1 according to branch task 1 and branch task 2 of compiling task 1 to generate a hardware allocation result: the NPU cores corresponding to core 1 and core 2 of cluster 1 are allocated to branch task 1, and the NPU cores corresponding to core 3 and core 4 of cluster 1 are allocated to branch task 2. The compiler divides core 9 to core 16 of cluster 3 according to branch task 3 to branch task 6 of compiling task 2 to generate a hardware allocation result: the NPU cores corresponding to core 9 and core 10 of cluster 3 are allocated to branch task 3, the NPU cores corresponding to core 11 and core 12 to branch task 4, the NPU cores corresponding to core 13 and core 14 to branch task 5, and the NPU cores corresponding to core 15 and core 16 to branch task 6. The compiler thus divides the hardware resources by dividing the identifiers of the NPU clusters and the identifiers of the NPU cores.
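The worked examples above assign contiguous runs of core identifiers to branch tasks. A minimal sketch of such a division follows; note that the text does not specify how many cores each branch receives, so the per-branch counts here are inputs to the sketch, not a claimed policy.

```python
def divide_cores(core_ids, cores_per_branch):
    """Contiguously assign core identifiers to branch tasks.

    cores_per_branch maps each branch task to the number of cores it should
    receive; how those counts are chosen is not specified in the text.
    """
    allocation, next_core = {}, 0
    for branch, count in cores_per_branch.items():
        allocation[branch] = core_ids[next_core:next_core + count]
        next_core += count
    return allocation

# Reproduces the single-task example: cores 1-8 over three branch tasks.
print(divide_cores(list(range(1, 9)),
                   {"branch 1": 2, "branch 2": 2, "branch 3": 4}))
# {'branch 1': [1, 2], 'branch 2': [3, 4], 'branch 3': [5, 6, 7, 8]}
```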
Step 105, the compiler generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation.
The compiling parameters input by the user include, but are not limited to, at least one of a weight compression parameter, a precision parameter, and a scheduling parameter. The core parameters required by each NPU core for computation include, but are not limited to, at least one of a minimum memory capacity, a filling size, a convolution kernel parameter, and a convolution step size. The compiler generates the first compiling instruction according to at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size.
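Purely as an illustration of what the first compiling instruction might bundle, the sketch below groups the allocation result with the user compiling parameters and per-core parameters; all field names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FirstCompileInstruction:        # hypothetical container, for illustration only
    allocation: dict                  # hardware allocation result (branch -> core ids)
    weight_compression: str           # user compiling parameters ...
    precision: str
    scheduling: str
    min_memory_bytes: int             # per-core parameters ...
    padding: int
    kernel_size: tuple[int, int]
    stride: int

instr = FirstCompileInstruction(
    allocation={"branch 1": [1, 2], "branch 2": [3, 4], "branch 3": [5, 6, 7, 8]},
    weight_compression="int8", precision="fp16", scheduling="static",
    min_memory_bytes=512 * 1024, padding=1, kernel_size=(3, 3), stride=1,
)
```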
Step 106, the compiler sends the first compiling instruction to the scheduler.
Step 107, the scheduler schedules the hardware resources according to the first compiling instruction, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size, so as to reduce the memory bandwidth; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to the hardware allocation result, the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, schedules the NPU cores according to the divided compiling task, and each NPU core compiles the compiling task using multiply-accumulate (MAC) operations. The overall computing power of the NPU is related to the number of NPU cores, the frequency of the NPU cores, and the number of operations each NPU core performs in a single clock cycle.
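The last sentence relates overall NPU computing power to core count, core frequency, and per-cycle operations. A common back-of-the-envelope form of that relation (an assumption consistent with, but not stated in, the text) is peak ops/s = cores x frequency x MACs per cycle x 2, counting each multiply-accumulate as two operations:

```python
def peak_ops_per_second(num_cores, core_freq_hz, macs_per_cycle):
    """Peak throughput; 1 MAC = 2 ops (one multiply plus one accumulate)."""
    return num_cores * core_freq_hz * macs_per_cycle * 2

# e.g. 8 cores at 1 GHz with 512 MACs per core per cycle -> 8.192 TOPS
print(peak_ops_per_second(8, 1_000_000_000, 512) / 1e12, "TOPS")
```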
Step 108, the compiler generates a second compiling instruction according to the compiling parameters and the core parameters.
The compiler generates the second compiling instruction according to at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size.
Step 109, the compiler sends the second compiling instruction to the scheduler.
Step 110, the scheduler schedules the hardware resources according to the second compiling instruction, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, and schedules the NPU cores according to the divided compiling task.
In the embodiment of the present invention, the compiler may adopt a Tensor Virtual Machine (TVM) with a vendor-specific target code generation module based on the Bring Your Own Codegen (BYOC) mechanism. TVM, an open deep learning compiler, provides a universal compilation path for various AI processors. TVM supports mainstream deep learning front-end frameworks, including TensorFlow, MXNet, PyTorch, and Keras, and can deploy to a wide range of hardware back ends, including CPUs, GPUs, NPUs, and various special-purpose accelerators. The TVM compiler thus has good generality but poor specificity: its compilation flow cannot obtain the hardware configuration of the processor, so TVM selects the optimal model compilation configuration through theoretical algorithms such as machine learning to automatically tune and schedule the hardware configuration. As a result, TVM's practical effect is not ideal, and the hardware utilization and computational efficiency leave considerable room for improvement. To make up for this deficiency of TVM, the BYOC mechanism may be introduced, which allows a hardware vendor to add its own target code generator to TVM; a TVM with such a target code generator added is a TVM based on the BYOC mechanism.
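For reference, in recent TVM releases the BYOC flow is typically driven by Relay passes that annotate, merge, and partition the graph for a vendor codegen. The sketch below is a minimal illustration under that assumption; the "my_npu" target name is hypothetical, and the exact checker signature for register_op_attr varies across TVM versions.

```python
import tvm
from tvm import relay

# Mark which operators the hypothetical "my_npu" codegen supports.
@tvm.ir.register_op_attr("nn.conv2d", "target.my_npu")
def conv2d_supported(expr):
    return True

def partition_for_npu(mod):
    """Annotate, merge, and partition a Relay module for the NPU backend."""
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget("my_npu"),   # tag supported operators
        relay.transform.MergeCompilerRegions(),     # fuse adjacent tagged regions
        relay.transform.PartitionGraph(),           # split out NPU sub-functions
    ])
    return seq(mod)
```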
In the technical solution of the task compiling method provided by the embodiment of the present invention, the compiler receives at least one compiling task input by a user and judges whether the compiling task comprises a plurality of branch tasks; if so, the compiler divides the acquired hardware resources of the neural-network processing unit (NPU) according to the plurality of branch tasks to generate a hardware allocation result, and generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation; the compiler sends the first compiling instruction to the scheduler so that the scheduler schedules the hardware resources according to the first compiling instruction. The hardware resources of the NPU are thus allocated reasonably by the compiler, which improves the operating speed of the NPU.
Fig. 2 is a flowchart of another task compiling method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
step 201, a compiler receives a compiling task input by a user.
There is one compiling task, i.e., the number of compiling tasks is 1. The compiler receives the compiling task input by the user through the compiler front end.
Step 202, the compiler acquires the hardware resources of the NPU.
When there is a single compiling task, the NPU includes a plurality of NPU cores, and the hardware resources include the identifiers of the plurality of NPU cores. For example, the identifiers of the plurality of NPU cores include core 1 to core 8.
Step 203, the compiler judges whether the compiling task comprises a plurality of branch tasks; if so, step 204 is executed; if not, step 211 is executed.
If the compiler judges that the compiling task comprises a plurality of branch tasks, the branch tasks can be executed in parallel and the hardware resources need to be divided, so step 204 is executed; if the compiler judges that the compiling task does not comprise branch tasks, the hardware resources do not need to be divided, so step 211 is executed. For example, for a single compiling task, if the compiler judges that it comprises a plurality of branch tasks, the identifiers of the NPU cores need to be divided and step 204 is executed; if it does not comprise branch tasks, the identifiers of the NPU cores do not need to be divided, the compiling task is compiled using the maximum concurrent computing power of all cores, and step 211 is executed.
Step 204, the compiler clusters the plurality of branch tasks to generate a plurality of task category data.
The task category data include operator cluster data. The compiler takes the first operator of each branch task as an initial cluster center, iteratively clusters the other operators according to the K-means algorithm until the sum of the distances from all operators to their class centers is minimized, and divides the tasks according to the operator cluster classes to generate a plurality of task category data. For example, when the branch tasks of a compiling task include branch task 1, branch task 2, and branch task 3, clustering them generates a plurality of task category data: operator cluster data 1, operator cluster data 2, and operator cluster data 3.
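The K-means iteration described above can be sketched as follows, assuming each operator has been embedded as a numeric feature vector (the embedding itself is not specified in the text) and the first operator of each branch task seeds a cluster center:

```python
import numpy as np

def cluster_operators(op_features, seed_indices, iters=20):
    """K-means over operator feature vectors.

    op_features: (n_ops, d) array; seed_indices: index of the first operator
    of each branch task, used as the initial cluster centers.
    """
    centers = op_features[seed_indices].copy()
    for _ in range(iters):
        # assign every operator to its nearest cluster center
        dists = np.linalg.norm(op_features[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its members
        for k in range(len(centers)):
            members = op_features[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels   # task category (operator cluster) of every operator

feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
print(cluster_operators(feats, seed_indices=[0, 2]))   # -> [0 0 1 1]
```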
Step 205, the compiler divides the hardware resources according to the task category data to generate a hardware allocation result.
The hardware resource comprises an identification of a plurality of NPU cores, for example, the identification of the plurality of NPU cores comprises cores 1 to 8, the compilation task comprises a branch task 1, a branch task 2 and a branch task 3, and the plurality of task category data comprises operator cluster data 1, operator cluster data 2 and operator cluster data 3. The compiler divides the cores 1 to 8 according to the operator clustering data 1, the operator clustering data 2 and the operator clustering data 3 to generate a hardware distribution result, wherein the hardware distribution result comprises that the NPU core corresponding to the core 1 and the NPU core corresponding to the core 2 are distributed to the branch task 1, the NPU core corresponding to the core 3 and the NPU core corresponding to the core 4 are distributed to the branch task 2, and the NPU core corresponding to the core 5 to the NPU core corresponding to the core 8 are distributed to the branch task 3. The compiler enables the partitioning of hardware resources by partitioning the identity of multiple NPU cores.
Step 206, the compiler generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation.
The compiling parameters input by the user include, but are not limited to, at least one of a weight compression parameter, a precision parameter, and a scheduling parameter. The core parameters required by each NPU core for computation include, but are not limited to, at least one of a minimum memory capacity, a filling size, a convolution kernel parameter, and a convolution step size. The compiler binary-serializes at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size to generate the first compiling instruction.
Step 207, the compiler sends the first compiling instruction to the scheduler.
Step 208, the scheduler parses the first compiling instruction to obtain configuration parameters.
The configuration parameters include at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size. The scheduler binary-deserializes the first compiling instruction to obtain at least one of these parameters.
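The binary serialization on the compiler side and the matching deserialization on the scheduler side might look like the following sketch; the fixed field layout is invented for illustration:

```python
import struct

# Hypothetical little-endian layout of a few configuration fields.
FMT = "<IIHHII"   # precision, weight_comp, padding, stride, min_mem, kernel

def encode(precision, weight_comp, padding, stride, min_mem, kernel):
    return struct.pack(FMT, precision, weight_comp, padding, stride, min_mem, kernel)

def decode(blob):
    precision, weight_comp, padding, stride, min_mem, kernel = struct.unpack(FMT, blob)
    return {"precision": precision, "weight_compression": weight_comp,
            "padding": padding, "stride": stride,
            "min_memory": min_mem, "kernel": kernel}

blob = encode(16, 8, 1, 1, 512 * 1024, 3)   # compiler side: binary serialization
print(decode(blob))                          # scheduler side: deserialization
```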
Step 209, the scheduler judges whether the NPU includes an idle core; if so, step 210 is executed; if not, step 209 is repeated.
If the scheduler judges that the NPU includes an idle core, the idle core can be scheduled to complete the compiling task, so step 210 is executed; if the NPU does not include an idle core, no NPU core can be scheduled and compiling cannot proceed, so the scheduler continues to wait for an idle core by repeating step 209.
Step 210, the scheduler schedules the hardware resources according to the configuration parameters, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights according to the hardware allocation result, the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, and schedules the NPU cores corresponding to core 1 through core 8 according to the divided compiling task.
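Tensor segmentation of the feature map can be pictured as tiling it until each tile, plus the halo rows a convolution kernel needs, fits the per-core minimum memory capacity. The sketch below is a simplification (height-axis splits only) and is not the segmentation algorithm claimed here:

```python
def split_feature_map(height, width, channels, dtype_bytes, kernel, mem_capacity):
    """Yield (row_start, row_end) height tiles whose working set fits mem_capacity.

    Simplified: splits along the height axis only and reserves kernel - 1
    halo rows so each tile can be convolved independently of its neighbours.
    """
    halo = kernel - 1
    tile_rows = height
    # halve the tile until (tile + halo) input rows fit the per-core memory
    while tile_rows > 1 and (tile_rows + halo) * width * channels * dtype_bytes > mem_capacity:
        tile_rows //= 2
    start = 0
    while start < height:
        yield (start, min(start + tile_rows, height))
        start += tile_rows

# e.g. a 224x224x64 fp16 feature map, 3x3 kernel, 512 KiB minimum memory capacity
tiles = list(split_feature_map(224, 224, 64, 2, 3, 512 * 1024))
print(len(tiles), tiles[0])   # 16 tiles of 14 rows each; the first tile is (0, 14)
```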
Step 211, the compiler generates a second compiling instruction according to the compiling parameters and the core parameters.
The compiler binary-serializes at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size to generate the second compiling instruction.
Step 212, the compiler sends the second compiling instruction to the scheduler.
Step 213, the scheduler parses the second compiling instruction to obtain configuration parameters.
The configuration parameters include at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size. The scheduler binary-deserializes the second compiling instruction to obtain at least one of these parameters.
Step 214, the scheduler judges whether the NPU includes an idle core; if so, step 215 is executed; if not, step 214 is repeated.
If the scheduler judges that the NPU includes an idle core, the idle core can be scheduled to complete the compiling task, so step 215 is executed; if the NPU does not include an idle core, no NPU core can be scheduled and compiling cannot proceed, so the scheduler continues to wait for an idle core by repeating step 214.
Step 215, the scheduler schedules the hardware resources according to the configuration parameters, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights according to the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, and schedules the NPU cores corresponding to core 1 through core 8 according to the divided compiling task.
In another technical solution of the task compiling method provided by the embodiment of the present invention, the compiler receives a compiling task input by a user and acquires the hardware resources of the NPU. If the compiler judges that the compiling task comprises a plurality of branch tasks, it clusters the branch tasks to generate a plurality of task category data and divides the hardware resources according to the task category data to generate a hardware allocation result. The compiler generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation, and sends the first compiling instruction to the scheduler. The scheduler parses the first compiling instruction to obtain configuration parameters, and if it judges that the NPU includes an idle core, it schedules the hardware resources according to the configuration parameters. The compiling task is thus divided adaptively by the compiler, the hardware resources of the NPU are allocated reasonably, the utilization of the NPU hardware resources is improved, the operating speed of the NPU is increased, and the performance of the neural network model executing the task on the NPU is improved.
Fig. 3 is a flowchart of another task compiling method according to an embodiment of the present invention. As shown in Fig. 3, the method includes:
in step 301, a compiler receives a plurality of compilation tasks input by a user.
There are a plurality of compiling tasks, i.e., the number of compiling tasks is greater than or equal to 2. The compiler receives the compiling tasks input by the user through the compiler front end.
Step 302, the compiler acquires the hardware resources of the NPU.
When there are multiple compiling tasks, the NPU includes a plurality of NPU clusters and each NPU cluster includes a plurality of NPU cores; the hardware resources then include the identifiers of the NPU clusters and the identifiers of the NPU cores of each cluster. For example, the NPU clusters include cluster 1 to cluster 3, the NPU cores of cluster 1 include core 1 to core 4, the NPU cores of cluster 2 include core 5 to core 8, and the NPU cores of cluster 3 include core 9 to core 16.
Step 303, the compiler judges whether the number of compiling tasks is less than or equal to the number of NPU clusters; if so, step 304 is executed; if not, step 318 is executed.
If the number of compiling tasks is less than or equal to the number of NPU clusters, every compiling task can be executed in parallel across the NPU clusters and the cluster identifiers need to be allocated, so step 304 is executed; if the number of compiling tasks is greater than the number of NPU clusters, the cluster identifiers need to be allocated according to the computing demand of each compiling task, so step 318 is executed.
Step 304, the compiler allocates each compiling task to its corresponding NPU cluster.
The compiler allocates each compiling task to a corresponding NPU cluster according to the number of NPU cores in each cluster.
In the embodiment of the present invention, the NPU cores within a single NPU cluster share access to that cluster's inter-core shared memory.
Step 305, the compiler judges whether the compiling task comprises a plurality of branch tasks; if so, step 306 is executed; if not, step 313 is executed.
If the compiler judges that the compiling task comprises a plurality of branch tasks, the branch tasks can be executed in parallel and the hardware resources need to be divided, so step 306 is executed; if not, the hardware resources do not need to be divided, so step 313 is executed. For example, with multiple compiling tasks allocated to their corresponding NPU clusters, the compiler judges whether the compiling task in each single NPU cluster comprises a plurality of branch tasks. If so, the identifiers of the NPU cores of that cluster need to be divided and step 306 is executed; if not, the identifiers of the NPU cores of that cluster do not need to be divided, the task is compiled using the maximum concurrent computing power of all cores in the cluster, and step 313 is executed.
In the embodiment of the present invention, the compiler judges in parallel whether the compiling task in each NPU cluster comprises a plurality of branch tasks.
Step 306, the compiler clusters the plurality of branch tasks to generate a plurality of task category data.
The compiler takes the first operator of each branch task in a single NPU cluster as an initial cluster center, iteratively clusters the other operators of each branch task according to the K-means algorithm until the sum of the distances from all operators to their class centers is minimized, and divides the tasks according to the operator cluster classes to generate the task category data. For example, when the branch tasks of a compiling task in a single NPU cluster include branch task 1 and branch task 2, clustering them generates a plurality of task category data: operator cluster data 1 and operator cluster data 2.
Step 307, the compiler divides the hardware resources according to the task category data to generate a hardware allocation result.
The hardware resources include the identifiers of the NPU clusters and the identifiers of the NPU cores of each cluster. For example, the compiling tasks include compiling task 1 and compiling task 2; compiling task 1 includes branch task 1 and branch task 2, and compiling task 2 includes branch task 3 to branch task 6; the task category data include operator cluster data 1 to operator cluster data 6; the NPU clusters include cluster 1 to cluster 3, the NPU cores of cluster 1 include core 1 to core 4, the NPU cores of cluster 2 include core 5 to core 8, and the NPU cores of cluster 3 include core 9 to core 16. The compiler divides cluster 1 to cluster 3 according to compiling task 1 and compiling task 2, allocating compiling task 1 to cluster 1 and compiling task 2 to cluster 3. The compiler divides core 1 to core 4 of cluster 1 according to operator cluster data 1 and operator cluster data 2 of compiling task 1 to generate a hardware allocation result: the NPU cores corresponding to core 1 and core 2 of cluster 1 are allocated to branch task 1, and the NPU cores corresponding to core 3 and core 4 of cluster 1 are allocated to branch task 2. The compiler divides core 9 to core 16 of cluster 3 according to operator cluster data 3 to operator cluster data 6 of compiling task 2 to generate a hardware allocation result: the NPU cores corresponding to core 9 and core 10 of cluster 3 are allocated to branch task 3, the NPU cores corresponding to core 11 and core 12 to branch task 4, the NPU cores corresponding to core 13 and core 14 to branch task 5, and the NPU cores corresponding to core 15 and core 16 to branch task 6. The compiler thus divides the hardware resources by dividing the identifiers of the NPU clusters and the identifiers of the NPU cores.
Step 308, the compiler generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user, and the acquired core parameters required by each NPU core for computation.
The compiling parameters input by the user include, but are not limited to, at least one of a weight compression parameter, a precision parameter, and a scheduling parameter. The core parameters required by each NPU core for computation include, but are not limited to, at least one of a minimum memory capacity, a filling size, a convolution kernel parameter, and a convolution step size. The compiler binary-serializes at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size to generate the first compiling instruction.
Step 309, the compiler sends the first compiling instruction to the scheduler.
Step 310, the scheduler parses the first compiling instruction to obtain configuration parameters.
The configuration parameters include at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size. The scheduler binary-deserializes the first compiling instruction to obtain at least one of these parameters.
Step 311, the scheduler judges whether the NPU includes an idle core; if so, step 312 is executed; if not, step 311 is repeated.
If the scheduler judges that the NPU includes an idle core, the idle core can be scheduled to complete the compiling task, so step 312 is executed; if the NPU does not include an idle core, no NPU core can be scheduled and compiling cannot proceed, so the scheduler continues to wait for an idle core by repeating step 311.
Step 312, the scheduler schedules the hardware resources according to the configuration parameters, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the hardware allocation result, the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights of compiling task 1 according to the hardware allocation result, the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, and schedules the NPU cores corresponding to core 1 through core 4 of cluster 1 according to the divided compiling task 1.
Step 313, the compiler generates a second compiling instruction according to the compiling parameters and the core parameters.
The compiler binary-serializes at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size to generate the second compiling instruction.
Step 314, the compiler sends the second compiling instruction to the scheduler.
Step 315, the scheduler parses the second compiling instruction to obtain configuration parameters.
The configuration parameters include at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size. The scheduler binary-deserializes the second compiling instruction to obtain at least one of these parameters.
Step 316, the scheduler judges whether the NPU includes an idle core; if so, step 317 is executed; if not, step 316 is repeated.
If the scheduler judges that the NPU includes an idle core, the idle core can be scheduled to complete the compiling task, so step 317 is executed; if the NPU does not include an idle core, no NPU core can be scheduled and compiling cannot proceed, so the scheduler continues to wait for an idle core by repeating step 316.
Step 317, the scheduler schedules the hardware resources according to the configuration parameters, and the process ends.
The scheduler performs tensor segmentation on the feature map and the weights of the compiling task according to at least one of the weight compression parameter, the precision parameter, the scheduling parameter, the minimum memory capacity, the filling size, the convolution kernel parameter, and the convolution step size; the scheduler then schedules the hardware resources according to the divided compiling task. For example, the scheduler performs tensor segmentation on the feature map and the weights of compiling task 1 according to the weight compression parameter, the minimum memory capacity, the filling size, and the convolution step size, and schedules the NPU cores corresponding to core 1 through core 4 of cluster 1 according to the divided compiling task 1.
Step 318, the compiler calculates the computing demand of each compiling task.
The computing demand includes time: the compiler estimates the time each compiling task requires according to its parameter count.
Step 319, the compiler selects a specific number of compiling tasks from the plurality of compiling tasks input by the user, where the computing demand of each selected compiling task is less than that of the remaining compiling tasks and the specific number is equal to the number of NPU clusters.
The compiler selects the specific number of compiling tasks according to the computing demand of each compiling task, preferentially allocating the compiling tasks with the smallest computing demand to the corresponding NPU clusters through the Shortest Job First (SJF) algorithm.
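A minimal sketch of this Shortest Job First selection, assuming the computing demand of each task has already been estimated from its parameter count:

```python
def assign_tasks_sjf(tasks, demand, num_clusters):
    """Shortest Job First: pick the num_clusters cheapest tasks first.

    tasks: task identifiers; demand: estimated computing demand per task
    (e.g. time derived from the parameter count); the rest of the tasks wait.
    """
    ordered = sorted(tasks, key=lambda t: demand[t])
    selected, waiting = ordered[:num_clusters], ordered[num_clusters:]
    assignment = {task: cluster for cluster, task in enumerate(selected, start=1)}
    return assignment, waiting

demand = {"task A": 40, "task B": 5, "task C": 12, "task D": 30}
print(assign_tasks_sjf(list(demand), demand, num_clusters=3))
# ({'task B': 1, 'task C': 2, 'task D': 3}, ['task A'])
```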
In the embodiment of the present invention, the remaining compiling tasks with larger computing demand wait to be allocated to the corresponding NPU clusters.
Step 320, the compiler allocates each of the specific number of compiling tasks to its corresponding NPU cluster, and step 305 is performed in parallel.
The compiler allocates the specific number of compiling tasks to the corresponding NPU clusters according to the number of NPU cores in each cluster, and judges whether the compiling task in each single NPU cluster comprises a plurality of branch tasks.
In another technical scheme of the task compiling method provided by the embodiment of the invention, the compiler receives a plurality of compiling tasks input by a user and acquires the hardware resources of the NPU; if the compiler judges that the number of compiling tasks is less than or equal to the number of NPU clusters, each compiling task is allocated to the corresponding NPU cluster; if the compiler judges that a compiling task includes a plurality of branch tasks, the compiler clusters the plurality of branch tasks to generate a plurality of task category data, and divides the hardware resources according to the plurality of task category data to generate a hardware allocation result corresponding to each branch task; the compiler generates a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user and the acquired core parameters required by each NPU core's calculation, and sends the first compiling instruction to the scheduler; the scheduler parses the first compiling instruction to obtain configuration parameters; and if the scheduler judges that the NPU includes an idle core, the scheduler schedules the hardware resources according to the configuration parameters. The compiling task is thereby adaptively divided by the compiler, the hardware resources of the NPU are reasonably allocated, the utilization rate of the NPU hardware resources and the operation speed of the NPU are improved, and the performance of the neural network model executing the task on the NPU is improved.
Fig. 4 is a schematic structural diagram of a task compiling device according to an embodiment of the present invention. As shown in fig. 4, the task compiling device includes a receiving module 11, an obtaining module 12, a judging module 13, a generating module 14 and a sending module 15.
The receiving module 11 is connected with the obtaining module 12, the obtaining module 12 is connected with the judging module 13, the judging module 13 is connected with the generating module 14, and the generating module 14 is connected with the sending module 15.
The receiving module 11 is configured to receive at least one compiling task input by a user; the obtaining module 12 is configured to obtain hardware resources of the NPU; the judging module 13 is configured to judge whether the compiling task includes a plurality of branch tasks; the generating module 14 is configured to, if the judging module 13 judges that the compiling task includes a plurality of branch tasks, divide the hardware resources according to the plurality of branch tasks to generate a hardware allocation result, and generate a first compiling instruction according to the hardware allocation result, the compiling parameter and the core parameter; and the sending module 15 is configured to send the first compiling instruction to the scheduler, so that the scheduler schedules the hardware resources according to the first compiling instruction.
In the embodiment of the present invention, the generating module 14 includes: a first generation submodule 141 and a second generation submodule 142. The first generation submodule 141 and the second generation submodule 142 are connected.
The first generation submodule 141 is configured to cluster the multiple branch tasks and generate multiple task category data; the second generation submodule 142 is configured to divide hardware resources according to the multiple task category data, and generate a hardware allocation result.
In this embodiment of the present invention, the generating module 14 is further configured to generate a second compiling instruction according to the compiling parameter and the core parameter if the judging module 13 judges that the compiling task does not include the branch task; the sending module 15 is further configured to send the second compiling instruction to the scheduler, so that the scheduler schedules the hardware resource according to the second compiling instruction.
In this embodiment of the present invention, the number of compiling tasks is plural, and the apparatus further includes an allocation module 16. The allocation module 16 is connected with the judging module 13.
The judging module 13 is further configured to judge whether the number of compiling tasks is less than or equal to the number of NPU clusters; the allocation module 16 is configured to, if the judging module 13 judges that the number of compiling tasks is less than or equal to the number of NPU clusters, allocate each compiling task to the corresponding NPU cluster and, in parallel, trigger the judging module 13 to perform the step of judging whether the compiling task includes a plurality of branch tasks.
In this embodiment of the present invention, the number of compiling tasks is plural, and the apparatus further includes a calculation module 17 and a selection module 18. The calculation module 17 is connected with the judging module 13 and the selection module 18.
The calculation module 17 is configured to calculate the computation demand of each compiling task if the judging module 13 judges that the number of compiling tasks is greater than the number of NPU clusters; the selection module 18 is configured to select a specific number of compiling tasks from the plurality of compiling tasks input by the user, where the computation demand of the specific number of compiling tasks is less than that of the other compiling tasks input by the user and the specific number equals the number of NPU clusters; and the allocation module 16 is further configured to allocate each of the specific number of compiling tasks to the corresponding NPU cluster and, in parallel, trigger the judging module 13 to perform the step of judging whether the compiling task includes a plurality of branch tasks.
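To make the module roles concrete, here is a minimal, self-contained Python sketch of the Fig. 4 pipeline. All helper policies (clustering branch tasks by op type, dividing cores in proportion to category size) are assumptions, as the patent defines the modules' responsibilities but not their internals:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    name: str
    branches: List[str] = field(default_factory=list)  # op types of branch tasks

def cluster_branches(branches: List[str]) -> Dict[str, List[str]]:
    """First generation submodule 141: group branch tasks into categories.
    Grouping by op type is an assumption; the patent does not fix the key."""
    groups: Dict[str, List[str]] = {}
    for op in branches:
        groups.setdefault(op, []).append(op)
    return groups

def divide_resources(num_cores: int, groups: Dict[str, List[str]]) -> Dict[str, int]:
    """Second generation submodule 142: split cores across task categories in
    proportion to category size (a simple assumed policy)."""
    total = sum(len(v) for v in groups.values())
    return {k: max(1, num_cores * len(v) // total) for k, v in groups.items()}

def make_instruction(task: Task, num_cores: int, compile_params: dict) -> dict:
    """Judging module 13 + generating module 14: emit a first compiling
    instruction for branched tasks, otherwise a second compiling instruction."""
    if len(task.branches) > 1:
        allocation = divide_resources(num_cores, cluster_branches(task.branches))
        return {"kind": "first", "allocation": allocation, **compile_params}
    return {"kind": "second", **compile_params}

# the sending module 15 would forward this instruction to the scheduler
print(make_instruction(Task("t1", ["conv", "conv", "pool"]),
                       num_cores=4, compile_params={"precision": "int8"}))
```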
In the technical scheme of the task compiling device provided by the embodiment of the invention, the compiler receives at least one compiling task input by a user and judges whether the compiling task includes a plurality of branch tasks; if the compiling task includes a plurality of branch tasks, the acquired hardware resources of the embedded neural-network processing unit (NPU) are divided according to the plurality of branch tasks to generate a hardware allocation result, and a first compiling instruction is generated according to the hardware allocation result, the compiling parameters input by the user and the acquired core parameters required by each NPU core's calculation; the compiler sends the first compiling instruction to the scheduler so that the scheduler can schedule the hardware resources according to the first compiling instruction. Reasonable allocation of the NPU's hardware resources is thus achieved through the compiler, and the operation speed of the NPU is improved.
Fig. 5 is a schematic diagram of a compiler according to an embodiment of the present invention. As shown in fig. 5, the compiler 21 includes a processor 211, a memory 212, and a computer program 213 stored in the memory 212 and executable on the processor 211. The computer program 213, when executed by the processor 211, implements the task compiling method of the foregoing embodiment; to avoid repetition, details are not repeated here.
The compiler 21 includes, but is not limited to, the processor 211 and the memory 212. Those skilled in the art will appreciate that fig. 5 is only an example of the compiler 21 and does not constitute a limitation of it; the compiler 21 may include more or fewer components than those shown, combine some components, or use different components. For example, the compiler may further include input and output devices, network access devices, buses, and the like.
The processor 211 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 212 may be an internal storage unit of the compiler 21, such as a hard disk or memory of the compiler 21. The memory 212 may also be an external storage device of the compiler 21, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the compiler 21. Further, the memory 212 may include both an internal storage unit and an external storage device of the compiler 21. The memory 212 is used to store the computer program and other programs and data required by the compiler 21, and may also be used to temporarily store data that has been output or is to be output.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for task compilation, the method comprising:
receiving at least one compilation task input by a user;
judging whether the compiling task comprises a plurality of branch tasks;
if the compiling task comprises a plurality of branch tasks, dividing the acquired hardware resources of the NPU according to the plurality of branch tasks to generate a hardware allocation result;
generating a first compiling instruction according to the hardware allocation result, the compiling parameters input by the user and the acquired core parameters required by each NPU core's calculation;
and sending the first compiling instruction to a scheduler so that the scheduler schedules the hardware resource according to the first compiling instruction.
2. The method of claim 1, wherein the dividing the acquired hardware resources of the NPU according to the plurality of branch tasks to generate the hardware allocation result comprises:
clustering the plurality of branch tasks to generate a plurality of task category data;
and dividing the hardware resources according to the plurality of task category data to generate the hardware allocation result.
3. The method of claim 1, further comprising:
if the compiling task does not comprise the branch task, generating a second compiling instruction according to the compiling parameter and the core parameter;
and sending the second compiling instruction to a scheduler so that the scheduler schedules the hardware resource according to the second compiling instruction.
4. The method of claim 1, wherein the compiling tasks are plural in number, and before the judging whether the compiling task comprises a plurality of branch tasks, the method further comprises:
and if the number of the compiling tasks is judged to be less than or equal to the number of the NPU clusters, distributing each compiling task to the corresponding NPU cluster, and executing the step of judging whether the compiling tasks comprise a plurality of branch tasks in parallel.
5. The method of claim 1, wherein the compiling tasks are plural in number, the method further comprising:
if the number of the compiling tasks is judged to be larger than the number of the NPU clusters, calculating the calculation demand of each compiling task;
selecting a specific number of compiling tasks from a plurality of compiling tasks input by a user, wherein the calculation demand of the specific number of compiling tasks is smaller than that of other compiling tasks in the plurality of compiling tasks input by the user, and the specific number is equal to the number of the NPU clusters;
and distributing each compiling task of the specific number of compiling tasks to the corresponding NPU cluster, and executing the step of judging whether the compiling tasks comprise a plurality of branch tasks in parallel.
6. A task compiling device, characterized in that the device comprises:
a receiving module, configured to receive at least one compiling task input by a user;
an acquisition module, configured to acquire hardware resources of an NPU;
a judging module, configured to judge whether the compiling task comprises a plurality of branch tasks;
a generating module, configured to divide the hardware resources according to a plurality of branch tasks to generate a hardware allocation result if the judging module judges that the compiling task comprises the plurality of branch tasks, and to generate a first compiling instruction according to the hardware allocation result, a compiling parameter and a core parameter;
and a sending module, configured to send the first compiling instruction to a scheduler, so that the scheduler schedules the hardware resources according to the first compiling instruction.
7. The apparatus of claim 6, wherein the generating module comprises:
a first generation submodule, configured to cluster a plurality of branch tasks to generate a plurality of task category data;
and a second generation submodule, configured to divide the hardware resources according to the plurality of task category data to generate a hardware allocation result.
8. The apparatus of claim 6, wherein:
the generating module is further configured to generate a second compiling instruction according to the compiling parameter and the core parameter if the judging module judges that the compiling task does not comprise a branch task;
and the sending module is further configured to send the second compiling instruction to a scheduler, so that the scheduler schedules the hardware resources according to the second compiling instruction.
9. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of any of claims 1-5.
10. A compiler, comprising: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the method of any of claims 1 to 5.
CN202210369707.4A 2022-04-08 2022-04-08 Task compiling method and device and compiler Pending CN114911610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210369707.4A CN114911610A (en) 2022-04-08 2022-04-08 Task compiling method and device and compiler

Publications (1)

Publication Number Publication Date
CN114911610A true CN114911610A (en) 2022-08-16

Family

ID=82763159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210369707.4A Pending CN114911610A (en) 2022-04-08 2022-04-08 Task compiling method and device and compiler

Country Status (1)

Country Link
CN (1) CN114911610A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116126346B (en) * 2023-04-04 2023-06-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination