CN112711478A - Task processing method, device, server and storage medium based on neural network


Info

Publication number
CN112711478A
Authority
CN
China
Prior art keywords: executable, subtasks, class, subtask, task
Legal status: Pending
Application number
CN201911016715.5A
Other languages
Chinese (zh)
Inventor
刘文峰 (Liu Wenfeng)
Current Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Priority date
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Zero Boundary Integrated Circuit Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201911016715.5A
Publication of CN112711478A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The application relates to the technical field of neural networks, and in particular to a neural-network-based task processing method, device, server and storage medium, which are intended to solve the prior-art problem that data processing tasks of a neural network are executed inefficiently. The method comprises the following steps: dividing an operation task to be processed into a plurality of first-class subtasks and determining a plurality of executable subtasks based on them; determining the dependency value of each executable subtask in turn according to the obtained dependency relationships among the executable subtasks; adding the executable subtasks whose dependency values equal a preset value to an activation queue; and executing the executable subtasks in the activation queue in parallel on multiple cores. The divide-first, execute-in-parallel strategy thereby improves the execution efficiency of the tasks involved in processing data at the target layer of the neural network.

Description

Task processing method, device, server and storage medium based on neural network
Technical Field
The present application relates to the field of neural network technologies, and in particular, to a task processing method, device, server, and storage medium based on a neural network.
Background
Artificial neural networks are one of the main branches of intelligent control technology and are applied and studied in many fields, including pattern recognition, signal processing, knowledge engineering, expert systems, combinatorial optimization, robotic control, and the like. With the continuous development of artificial neural network theory and of the related theories and technologies, the applications of neural networks continue to broaden and deepen.
The concept of deep learning derives from research on artificial neural networks; a multi-layer perceptron with several hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data.
With the rapid development of deep learning technology in recent years, a variety of large artificial neural networks have been introduced, placing high demands on the computing power, flexibility and computational efficiency of processors.
However, in the prior art, because of the limits of processor operation speed and cache space and the memory-wall problem, it is increasingly difficult for a processor to meet the operation requirements of large neural networks. For example, convolutional neural networks (CNNs) are a class of feed-forward neural networks that contain convolution calculations and have a deep structure; they are among the typical deep learning algorithms, can classify input information in a shift-invariant manner according to their hierarchical structure, and are therefore also called shift-invariant artificial neural networks. When a convolutional neural network is used for image processing, if the image is large in pixel size, the convolution kernel of the corresponding convolution operation is also large, and convolution with a large kernel occupies a large amount of memory or cache space. When the cache space of the processor is not sufficient to load the relevant operation data, and in particular since the cache of a streamlined neural network processing unit (NPU) computing unit is generally not large, it is sometimes impossible to load all the data of a given intermediate layer of the convolution operation or of another neural network, which reduces processing efficiency.
In view of the above, a new processing method needs to be designed to overcome these drawbacks.
Disclosure of Invention
The embodiment of the application provides a task processing method, a task processing device, a server and a storage medium based on a neural network, and aims to solve the technical problem that in the prior art, the efficiency of executing a data processing task of the neural network is low.
The embodiment of the application provides the following specific technical scheme:
in the embodiment of the application, based on a multi-core processor architecture, the operation task to be processed in a designated target layer of the neural network is divided into a plurality of first-class subtasks, executable subtasks are determined based on the first-class subtasks, and the dependency value of each executable subtask is determined according to the dependency relationships among the executable subtasks. When the dependency value of an executable subtask reaches the preset value, the executable subtask is added to an activation queue, and the executable subtasks in the activation queue are executed in parallel on multiple cores of the multi-core processor architecture. Through this division, the operation task to be processed in the designated target layer can be split into first-class subtasks of smaller granularity; each first-class subtask occupies fewer operation resources when executed than the undivided operation task would, so execution does not have to wait for a sufficiently large block of operation resources. This increases the likelihood that the current task is executed quickly by a processor core and lowers the performance requirements on the processor. In addition, based on the multiple cores under the processor architecture, multiple executable subtasks are executed almost simultaneously in parallel, which also improves processing efficiency.
Drawings
FIG. 1 is a schematic diagram of a heterogeneous multi-core processor architecture in an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an embodiment of a neural network based task processing method according to an embodiment of the present application;
FIG. 3 is a schematic view of a local area of a topology map in an embodiment of the present application;
FIG. 4 is a diagram illustrating an active queue operating mechanism according to an embodiment of the present application;
FIG. 5 is a diagram illustrating partitioning of a convolutional layer into a plurality of executable sub-tasks according to an embodiment of the present application;
FIG. 6 is a schematic diagram of operational parameters of executable sub-tasks in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a neural network-based task processing device according to an embodiment of the present application;
fig. 8 is a schematic diagram of a server structure in an embodiment of the present application.
Detailed Description
In order to solve the prior-art problem that neural network operation tasks are executed inefficiently, in the embodiment of the application the operation task of an intermediate layer of the neural network is divided into a plurality of first-class subtasks, a plurality of executable subtasks are determined based on the first-class subtasks, the dependency value of each executable subtask is determined according to the dependency relationships among the subtasks, and the executable subtasks whose current dependency value equals a preset value are added to an activation queue. The executable subtasks in the activation queue are then invoked in batches and executed in parallel, so that the invoked subtasks run almost simultaneously.
Preferred embodiments of the present application will be described in further detail below with reference to the accompanying drawings:
the task processing method based on the neural network, as an implementable mode, can be implemented based on a multi-core processor architecture, and can be implemented based on a single-core processor when the processing performance of the multi-core processor can be simulated based on the single-core processor through technical improvement. The multi-core processor architecture is preferably a heterogeneous multi-core processor, and can also be a homogeneous multi-core processor, that is, the processing method provided by the embodiment of the application is mainly suitable for the heterogeneous multi-core processor, but also suitable for the homogeneous multi-core processor.
The heterogeneous multi-core processor architecture comprises two or more processors, for example a central processing unit (CPU) and an NPU, where the CPU has a plurality of CPU cores and the NPU has a plurality of NPU cores; or a CPU and a graphics processing unit (GPU); and so on.
For example, referring to fig. 1, a preferred multi-core processor architecture may include three processors, namely a CPU, an NPU and a GPU, where the CPU has three cores CPU1, CPU2 and CPU3; the NPU has three cores NPU1, NPU2 and NPU3; and the GPU has three cores GPU1, GPU2 and GPU3.
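The scheduling sketches later in this description assume a simple representation of such an architecture. The following minimal Python sketch (illustrative only, not part of the patent text) enumerates the nine cores of fig. 1; the Core structure and helper name are assumptions made purely for illustration.

```python
# Illustrative sketch, not from the patent: the heterogeneous architecture of
# FIG. 1, assumed to have three processor types with three cores each.
from dataclasses import dataclass

@dataclass
class Core:
    processor: str   # "CPU", "NPU" or "GPU"
    index: int       # core number within its processor
    busy: bool = False

ARCHITECTURE = [Core(p, i) for p in ("CPU", "NPU", "GPU") for i in (1, 2, 3)]

def available_cores(cores):
    """Return the cores currently free to take an executable subtask."""
    return [c for c in cores if not c.busy]
```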
The task processing method based on the neural network provided by the embodiment of the application can be applied to data processing in many application scenarios across many fields. For example, in the field of image recognition, when a convolutional neural network is used for text recognition or face recognition, the operation task processing method provided by the embodiment of the application can be adopted. After the image data is input, the task of performing the convolution operation on the image data can be divided into a plurality of first-class subtasks of smaller granularity, and a corresponding plurality of executable subtasks are then determined based on these first-class subtasks. Finally, a dependency topology graph is constructed according to the data dependency relationships among the executable subtasks, and the dependency value of each executable subtask is updated according to the current execution situation. When a dependency value reaches the preset value, the corresponding executable subtask is added to the activation queue, and the multiple executable subtasks in the activation queue can be scheduled and executed in parallel, which improves the efficiency of the data operation tasks executed on the basis of the convolutional neural network.
Referring to fig. 2, in the embodiment of the present application, a detailed flow of the task processing method based on the neural network is as follows:
s201: dividing the operation task to be processed in a designated target layer in the neural network into a plurality of first-class subtasks.
In this embodiment, as an implementable manner, before S201, a processor and at least one core corresponding to the processor, which are available for executing a task, are determined from the multi-core processor architecture.
In S201, the target layer specified in the neural network may be any one layer, or a combination of several layers, among the input layer, the output layer and the intermediate layers of the neural network; that is, the operation task processing method provided in the embodiment of the present application is applicable to every layer of the neural network. For a convolutional neural network, the target layer mainly includes intermediate layers such as convolutional layers, pooling layers and fully connected layers.
Specifically, for the division of the operation task to be processed, different division modes should be adopted for different neural networks.
If the neural network is a convolutional neural network, the operation tasks to be processed in the target layer are divided into a corresponding plurality of first-class subtasks based on the number of channels, or the number of rows and columns, corresponding to the target layer. The channels are the input or output channels through which the target layer exchanges data with other layers, and the rows and columns are those of the two-dimensional or three-dimensional array of each channel of the target layer.
An intermediate layer of a convolutional neural network is a three-dimensional array whose dimensions correspond to rows, columns and channels respectively, so both the channel-based and the row/column-based division modes are feasible for dividing the operation tasks of a convolutional neural network.
For other, non-convolutional neural networks, such as a recurrent neural network (RNN), the operation task to be processed in the target layer needs to be divided into a corresponding plurality of first-class subtasks based on the number of channels corresponding to the target layer.
For example, when the number of input channels is 4, the to-be-processed operation task of the target layer is divided into 4 first-class subtasks.
The divided first-class subtasks need not be exactly the same size. For example, if a three-dimensional array with 15 columns is divided into 4 first-class subtasks, three of them span 4 columns each and the remaining one spans only 3 columns.
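As an illustration of this division step, the following Python sketch (an assumption made for clarity, not the claimed implementation) splits a task either per input channel or along columns as evenly as possible, reproducing the 15-column example above; the helper names are hypothetical.

```python
# Hedged sketch of the partitioning step: split by channel count, or split
# columns as evenly as possible. Function names are illustrative only.
def split_by_channel(num_channels):
    """One first-class subtask per input channel, e.g. 4 channels -> 4 subtasks."""
    return [{"channel": c} for c in range(num_channels)]

def split_by_columns(num_columns, num_subtasks):
    """E.g. 15 columns into 4 subtasks of 4, 4, 4 and 3 columns."""
    base, extra = divmod(num_columns, num_subtasks)
    sizes = [base + 1 if i < extra else base for i in range(num_subtasks)]
    subtasks, start = [], 0
    for size in sizes:
        subtasks.append({"columns": range(start, start + size)})
        start += size
    return subtasks
```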
For a convolutional neural network, generally only the convolution operation task to be processed in the target layer is divided, and the convolution kernel is not segmented. In some special cases the convolution kernel may also be divided; for example, if the channel division mode yields 9 channels, a convolution kernel of size 7x7 may be divided into 9 convolution kernels of size 3x3.
In the embodiment of the application, after the first-class subtasks are divided, the operation parameters of each first-class subtask are generated. The operation parameters of a first type of subtask at least comprise any one or any combination of the following parameters: an input tensor, an output tensor, weight parameters, convolution parameters, and additional parameters.
Wherein the weight parameter comprises either or both of a weight address and a weight length; the convolution parameter comprises any one or any combination of parameters related to the convolution operation, such as the convolution type, the convolution kernel size and the convolution stride; and the additional parameters comprise any one or a combination of normalization parameters and activation parameters.
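For illustration only, the operation parameters of one first-class subtask could be grouped as in the following hedged sketch; the field names are assumptions, since the patent text only lists the parameter categories.

```python
# Hedged sketch of the operation parameters of one first-class subtask
# (input tensor, output tensor, weight, convolution and additional parameters);
# field layout is an assumption for illustration only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FirstClassSubtaskParams:
    input_tensor: object                  # slice of the layer input assigned to this subtask
    output_tensor: object                 # where the partial result is written
    weight_address: Optional[int] = None  # weight parameter: address ...
    weight_length: Optional[int] = None   # ... and length
    conv_params: dict = field(default_factory=dict)   # e.g. type, kernel size, stride
    extra_params: dict = field(default_factory=dict)  # e.g. normalization, activation
```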
S202: a plurality of executable subtasks is determined, the plurality of executable subtasks including at least one subtask of a first type.
Optionally, the determination manner of the executable subtasks may include, but is not limited to, the following:
the first method is as follows:
after the operation task to be processed is divided into a plurality of first-class subtasks, the first-class subtasks are compiled, and the compiled first-class subtasks are determined to be a plurality of corresponding executable subtasks.
The compiling of the first type subtasks includes compiling the plurality of first type subtasks into an operation instruction which can be executed by at least one processor in the multi-core processor architecture.
The purpose of compilation is to convert the first type of subtasks into a form of instructions that can be executed by at least one processor.
Specifically, based on the operation types supported by different processors, at least the following compiling process should be performed:
The weight parameters corresponding to the plurality of first-class subtasks are obtained, and the following operations are performed for each of the first-class subtasks:
if the weight parameter corresponding to a first type of subtask is fixed-point integer, compiling the first type of subtask into an operation instruction of fixed-point integer; and if the weight parameter corresponding to one first-class subtask is a floating point number, compiling the first-class subtask into an operation instruction in a floating point format.
For example, if the weight parameter of a first type subtask is int8 fixed point integer, the first type subtask is compiled into an int8 type task that can run in the NPU; if the weight parameter of the first type of subtask is a floating point number, the first type of subtask is compiled into a floating point type task which can be operated in a GPU or a CPU.
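A minimal sketch of this compile-time dispatch is given below, assuming the weight data type alone decides the instruction format; the dictionary layout and dtype markers are hypothetical, not the patent's encoding.

```python
# Illustrative sketch: dispatch a first-class subtask to a fixed-point or
# floating-point operation instruction based on its weight dtype.
def compile_first_class_subtask(subtask):
    if subtask["weight_dtype"] == "int8":
        # fixed-point integer weights -> fixed-point operation instruction (runnable on the NPU)
        return {"kind": "fixed_point_op", "target": "NPU", "payload": subtask}
    else:
        # floating-point weights -> floating-point operation instruction (runnable on CPU/GPU)
        return {"kind": "floating_point_op", "target": ("CPU", "GPU"), "payload": subtask}
```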
The second method comprises the following steps:
after the operation task to be processed is divided into a plurality of first-class subtasks, compiling work of the first-class subtasks is not carried out, and the plurality of first-class subtasks which are not compiled are directly determined to be a plurality of corresponding executable subtasks.
The third method comprises the following steps:
because neural network computation generally suffers from the memory-wall problem, in the embodiment of the application, in order to prevent data import and export from becoming a performance bottleneck, the operations of importing and exporting the data of the neural network target layer between the cache and the memory, as well as the storage operation of loading the image from the memory into the cache, are treated as storage tasks to be processed and divided accordingly. Therefore, unlike the first and second modes, in the third mode the division objects include not only the operation task to be processed but also the storage task to be processed, and the first-class subtasks obtained by dividing the operation task and the second-class subtasks obtained by dividing the storage task are together determined as executable subtasks.
Specifically, in a storage area of a multi-core processor architecture, dividing a to-be-processed storage task for importing and/or exporting data of a target layer into a plurality of second-class sub-tasks, compiling the plurality of second-class sub-tasks, and merging the compiled plurality of second-class sub-tasks and the plurality of first-class sub-tasks to obtain a plurality of corresponding executable sub-tasks.
In other words, in the third mode the second-class subtasks are compiled after being divided: each second-class subtask is compiled into a data pre-load instruction and/or a data pre-save instruction executable by at least one processor in the multi-core processor architecture.
For example, many NPU cores support data pre-load and data pre-save instructions in addition to convolution operation instructions. If the multi-core processor architecture includes an NPU, the NPU is determined as the processor that executes the storage tasks to be processed, and during compilation the second-class subtasks are compiled into the data pre-load and data pre-save instructions supported by the NPU core, so that the storage tasks for data import and export can be executed in parallel with the operation tasks to be processed.
A data pre-load instruction needs to provide storage parameters such as a data input address, an output address and a copy length; these storage parameters are compiled into the corresponding second-class subtask, and at run time the NPU core is driven to complete data loading and saving according to them.
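The storage parameters carried by such an instruction could be represented as in the following illustrative sketch; the structure and field names are assumptions, not the actual instruction encoding of any NPU.

```python
# Hedged sketch of a compiled second-class (storage) subtask: a pre-load or
# pre-save instruction carrying input address, output address and copy length.
from dataclasses import dataclass

@dataclass
class StorageInstruction:
    kind: str           # "preload" (memory -> cache) or "presave" (cache -> memory)
    input_address: int
    output_address: int
    copy_length: int    # number of bytes to copy

def compile_second_class_subtask(kind, src, dst, length):
    return StorageInstruction(kind, src, dst, length)
```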
The fourth method is as follows:
in the fourth mode, as in the third mode, the storage task to be processed for importing and/or exporting the data of the target layer is divided into a plurality of second-class subtasks, and the second-class subtasks are merged with the first-class subtasks and together determined as executable subtasks. Unlike the third mode, however, in the fourth mode the second-class subtasks are not compiled after division; they are determined as executable subtasks together with the first-class subtasks without compilation.
In the above methods, the operation task to be processed is divided into first-class subtasks, or the storage task to be processed is divided into second-class subtasks; that is, a large-granularity task to be processed is divided into subtasks of smaller granularity. This reduces the operation resources (memory, cache space and the like) a processor core needs for each task, allows cores whose operation resources would otherwise be insufficient to execute the large-granularity task to be fully utilized, raises core utilization, and thereby improves processing efficiency.
S203: and determining the dependency value of each executable subtask according to the dependency relationship among the executable subtasks.
Specifically, in the embodiment of the present application, the dependency relationship between executable subtasks is the data input/output relationship between them. For example, if the input of executable subtask A is the output of executable subtask B, there is a dependency between A and B, namely A depends on B; if the output of executable subtask C is the input of executable subtask D, there is a dependency between C and D, namely D depends on C.
The dependency value characterizes the number of other executable subtasks on which the corresponding executable subtask depends; it has a definite quantitative relationship with that number but is not necessarily equal to it.
Specifically, regarding the determination of the dependency value: if the outputs of m executable subtasks all serve as inputs of executable subtask A, the dependency value of executable subtask A is m × α + n, where α is the step value and n is a preset value. When n is set to 0 and the step value is 1, the dependency value of executable subtask A is simply m. For example, with n = 0 and step value 1, if the outputs of 5 executable subtasks serve as inputs of executable subtask A, its dependency value is 5; with n = 2 and step value 1, its dependency value is 7.
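The formula can be restated as a one-line helper; the assertions below simply reproduce the two numerical examples above.

```python
# Sketch of the dependency-value formula: dependency = m * alpha + n, where m is
# the number of executable subtasks whose outputs feed this one, alpha is the
# step value and n is the preset value.
def dependency_value(num_inputs, step=1, preset=0):
    return num_inputs * step + preset

assert dependency_value(5) == 5            # n = 0, step = 1, 5 upstream subtasks
assert dependency_value(5, preset=2) == 7  # n = 2, step = 1
```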
Specifically, this step can be implemented in, but not limited to, the following two ways:
Mode 1: the dependency topology graph mode.
A corresponding node is generated for each executable subtask; every other node that has a data transmission relationship with that node is defined as an upstream node or a downstream node of it, and a corresponding dependency topology graph is generated. The number of upstream nodes on which a node depends is recorded as the dependency value of that node.
Any other node may be an upstream node or a downstream node of a node corresponding to the executable sub-task, and specifically needs to be determined according to a data transmission direction (also referred to as a dependency relationship), for example, a node corresponding to the executable sub-task a in the topology diagram is a, a node corresponding to the executable sub-task B in the topology diagram is B, an input of the executable sub-task a is an output of the executable sub-task B, and then the data transmission direction is from the node B to the node a, and the node a is a downstream node of the node B.
Regarding the dependency value: specifically, the dependency relationships between the executable subtasks are recorded, the number of inputs of an executable subtask (i.e., the number of executable subtasks it depends on) is counted, and that number is used as the dependency value of the node corresponding to the executable subtask in the topology graph.
In the dependency topology graph, for the node corresponding to an executable subtask, each time the executable subtask corresponding to any of its upstream nodes is determined to have finished executing, the step value is subtracted from the node's dependency value.
For example, referring to fig. 3, according to the structure of the neural network and the operational relationships between the executable subtasks, it can be determined that the outputs of the 4 executable subtasks A, E, F and G are all inputs of executable subtask B; the nodes a, e, f and g corresponding to A, E, F and G in the topology graph are therefore all upstream nodes of node b, and the dependency value of node b is marked as 4.
During operation, each time the executable subtask corresponding to one of the upstream nodes is determined to have finished executing, the dependency value of node b is reduced by the step value. The step value is a constant in the range (0, 1]; preferably the step value is set to 1, that is, every time an upstream node finishes executing, the dependency value of the node is reduced by 1.
In the dependency topology graph, the executable subtasks obtained by dividing the first layer of the neural network do not depend on any other executable subtask, so their dependency value is n; preferably, with n set to 0, the initial dependency value of a first-layer node is 0.
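A hedged sketch of mode 1 follows, assuming step value 1 and preset value n = 0; the Node class and mark_finished helper are illustrative names, not part of the patent.

```python
# Illustrative sketch of the dependency-topology mode (mode 1): each executable
# subtask becomes a node, its dependency value is the number of upstream nodes,
# and finishing an upstream node subtracts the step value.
class Node:
    def __init__(self, name, upstream=()):
        self.name = name
        self.upstream = list(upstream)        # nodes whose outputs this node consumes
        self.downstream = []
        self.dependency = len(self.upstream)  # step = 1, preset n = 0
        for u in self.upstream:
            u.downstream.append(self)

def mark_finished(node, step=1):
    """Called when node's subtask completes; returns downstream nodes that became runnable."""
    ready = []
    for d in node.downstream:
        d.dependency -= step
        if d.dependency == 0:                 # preset value n = 0 reached
            ready.append(d)
    return ready
```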
Mode 2: the dependency vector mode.
In this mode, a corresponding dependency vector is generated for each executable subtask according to the dependency relationships among the executable subtasks; the elements of the dependency vector are, in order, the other executable subtasks on which the executable subtask depends, and the number of elements in the dependency vector is the dependency value of the executable subtask. For example, the outputs of the 4 executable subtasks A, E, F and G are all inputs of executable subtask B, so the dependency vector of executable subtask B is [A, E, F, G], the dependency value of executable subtask B is 4, and the operation level of the executable subtask is 4 or another value having a fixed correspondence to the dependency value.
While the operation task to be processed is being executed, once executable subtask A finishes, the corresponding element A is deleted from the dependency vectors of the remaining unexecuted executable subtasks. The dependency vector of the unexecuted executable subtask B is updated to [E, F, G], its dependency value is updated accordingly, and the operation level of B drops to 3.
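Mode 2 can be sketched as follows, reproducing the A/E/F/G example; the dictionary representation and helper name are assumptions made for illustration.

```python
# Illustrative sketch of the dependency-vector mode (mode 2): B depends on
# [A, E, F, G]; when A finishes it is removed from every remaining vector,
# so B's dependency value drops from 4 to 3.
dependency_vectors = {"B": ["A", "E", "F", "G"]}

def on_subtask_finished(finished, vectors):
    ready = []
    for task, deps in vectors.items():
        if finished in deps:
            deps.remove(finished)
        if not deps:                 # empty vector -> dependency value 0 -> ready to activate
            ready.append(task)
    return ready

on_subtask_finished("A", dependency_vectors)
assert dependency_vectors["B"] == ["E", "F", "G"]   # dependency value now 3
```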
S204: adding the executable subtasks with the dependency values as preset values into an activation queue; executing, by the multiple cores, the multiple executable sub-tasks in the activation queue in parallel.
In S204, the plurality of cores are the currently available cores under the multi-core processor architecture and may include only multiple cores of one processor or multiple cores of different processors. For example, the plurality of cores may be four cores of a CPU; or two cores of a CPU and two cores of a GPU; or one core of a CPU, one core of a GPU and two cores of an NPU. The combination may be determined according to the actual situation of the multi-core processor architecture, and the possibilities are not exhaustively listed in the embodiments of the present application.
Specifically, in step S204, the multiple executable sub-tasks in the activation queue are executed in parallel, including selecting multiple executable sub-tasks corresponding to the number of currently available cores from the activation queue, and executing in parallel, for example, if it is determined that 5 cores are currently available, selecting 5 executable sub-tasks from the activation queue.
The executed executable subtasks should be popped from the activation queue promptly.
The dependency value of each executable subtask is not a fixed value, but is updated gradually as the dependent executable subtask is executed, and the updating mode can be real-time updating or periodic updating.
When the dependency value of an executable subtask is updated to n, it indicates that all the dependent executable subtasks of the executable subtask have been executed, and the executable subtask can be executed, and at this time, the executable subtask is activated and added into the activation queue.
For example, one possible implementation is that when the dependency value of an executable sub-task becomes 0, the executable sub-task is activated and added to the activation queue, and then among the plurality of executable sub-tasks that are activated, a plurality of executable sub-tasks are selected to be executed in parallel. Another possible implementation manner is that the preset value n is set to be 1, when the dependency value is 1, the corresponding executable subtask is activated and added to the activation queue, and among the activated executable subtasks, a plurality of executable subtasks are selected to be executed in parallel.
Specifically, in the first possible implementation, referring to fig. 4, each executable subtask that has not yet been executed is traversed, the executable subtasks whose dependency value is 0 are obtained, and they are pushed into the activation queue. Based on the multiple cores of at least one processor in the multi-core processor architecture (for example, the architecture shown in fig. 1 has 9 cores in total, of which 6 are determined to be available), 6 executable subtasks are selected from the activation queue and the 6 cores are invoked to execute them in parallel. When the 6 executable subtasks have finished running, they are popped from the activation queue, and the dependency value of each downstream executable subtask is reduced by 1 for every one of the 6 finished subtasks it depends on. That is, as an implementable manner, after each batch of executable tasks has finished executing, the corresponding dependency values are updated accordingly.
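Putting the pieces together, the activation-queue mechanism of fig. 4 can be sketched as below, assuming step value 1, preset value 0 and a trivial stand-in for the parallel dispatch; a real implementation would also check the remaining resources of each core, as described next.

```python
# Minimal end-to-end sketch of the activation-queue mechanism of FIG. 4.
# The task dictionary layout and run_in_parallel stub are hypothetical; a real
# implementation dispatches each subtask to a free NPU/GPU/CPU core.
from collections import deque

def run_in_parallel(batch):
    # stand-in for dispatching one subtask per available core
    for name in batch:
        print("executing", name)

def schedule(tasks, num_available_cores):
    """tasks: name -> {"dependency": int, "downstream": [names]}."""
    queue = deque(t for t, info in tasks.items() if info["dependency"] == 0)
    finished = set()
    while len(finished) < len(tasks):
        batch = [queue.popleft() for _ in range(min(num_available_cores, len(queue)))]
        if not batch:
            break                                    # nothing runnable (malformed dependency graph)
        run_in_parallel(batch)                       # activated subtasks run on different cores
        for name in batch:                           # pop executed subtasks, update dependents
            finished.add(name)
            for d in tasks[name]["downstream"]:
                tasks[d]["dependency"] -= 1          # subtract the step value
                if tasks[d]["dependency"] == 0:      # preset value reached -> activate
                    queue.append(d)
```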
At this time, information about the currently remaining computing resources in the multi-core processor architecture should be acquired; when parallel scheduling is performed, the computing resources occupied by executing one executable subtask should not exceed the currently remaining computing resources of at least one core, so as to ensure that every executable subtask to be executed can be executed by at least one core. That is, according to the type of the executable subtask and the currently available computing resources, the executable subtask is handed to the corresponding NPU core, GPU core or CPU core to complete the computation, and multiple activated executable subtasks can run on different cores at the same time.
The remaining operation resources in the multi-core processor architecture comprise the remaining operation resources of the cores of each type of processor. The computing resources are the resources necessary for executing the executable subtasks and include, but are not limited to, any one or any combination of the following: the operation speed of each processor core, and the available memory space, available cache space and the like corresponding to each core.
If an executable subtask running on a certain processor core becomes too slow or blocked, or the activation queue is even left empty and computing resources are wasted, another core can schedule the same executable subtask to run and notify the original processor core that its running instance of the subtask is cancelled.
Corresponding to the second and fourth ways in S202, when S202 does not compile the first-class or second-class subtasks, the activated executable subtasks should also be compiled before the executable subtasks in the activation queue are executed in parallel.
Specifically, compiling is performed according to the hardware structure of the multi-core processor architecture and the condition of the residual operation resources.
For example, when the applicable multi-core processor architecture is the one shown in fig. 1, each executable subtask is compiled, according to its weight parameter, either into a floating-point type supported by the CPU and GPU or into a fixed-point integer type supported by the NPU; a fixed-point integer executable subtask can also be run on the CPU or GPU. Executable subtasks scheduled to be executed by the NPU must further be compiled into an instruction form the NPU can execute. Second-class subtasks that have been determined as executable subtasks are preferentially scheduled to the NPU, so the second-class subtasks among the executable subtasks are compiled into data pre-load and data pre-save instructions executable by the NPU.
When the applicable multi-core processor architecture is a homogeneous multi-core processor, that is, a CPU provided with a plurality of cores, no such compilation is required.
In S204, executing the multiple executable subtasks in the activation queue in parallel on the multiple cores may include: determining the currently available cores under the multi-core processor architecture and assigning the executable subtasks in the activation queue to those cores in turn, one executable subtask per core. The cores run in parallel, so a number of assigned executable subtasks equal to the number of available cores are executed in parallel; for example, with 4 available cores, 4 executable subtasks are executed in parallel. After execution finishes, the executed subtasks are popped from the activation queue in turn, and subsequent executable subtasks continue to be assigned to the currently available cores for execution.
In the embodiment of the present application, parallel execution should not be understood as complete synchronization in time; the execution times may only partially overlap.
A complete embodiment of the neural network-based task processing method provided by the embodiment of the present application is listed below.
In the complete embodiment, the heterogeneous multi-core processor architecture shown in fig. 1 is taken as a hardware basis, and a convolutional neural network is taken as an example of a neural network to be processed for explanation.
First, it is determined that the heterogeneous multi-core processor architecture comprises a CPU with three CPU cores, a GPU with three GPU cores, and an NPU with three NPU cores.
In this embodiment, an intermediate layer of the convolutional neural network is taken as the designated target layer. Referring to fig. 5, taking the convolutional layers of the intermediate layers as an example, the convolution operation task of each convolutional layer is divided into a plurality of first-class subtasks according to the number of input channels of the convolutional neural network. The convolution operation task corresponding to convolutional layer 1 is divided into 3 first-class subtasks 11, 12 and 13; that of convolutional layer 2 into 3 first-class subtasks 21, 22 and 23; and that of convolutional layer 3 into 3 first-class subtasks 31, 32 and 33.
Referring to fig. 6, after the first-class subtasks are divided, the operation parameters of each first-class subtask are correspondingly generated. In this embodiment, the operation parameters of one of the first-class subtasks include an input tensor, an output tensor, weight parameters, convolution parameters, and additional parameters. The dependency values and the operator activation states shown in fig. 6 are correspondingly added after the dependency topology is constructed in the subsequent step.
Based on the heterogeneous multi-core processor architecture, the operation of importing and exporting the data of the convolution layer in the cache and the memory and the operation of importing the image from the memory to the cache are used as storage tasks to be processed and divided into a plurality of second-class subtasks.
The plurality of first-class subtasks and the plurality of second-class subtasks are together determined as executable subtasks; in particular, first-class subtasks 11-13, 21-23 and 31-33 are each determined to be executable subtasks.
A dependency topology graph is then constructed according to the dependency relationships among the executable subtasks. In this embodiment, the preset value n is set to 0 and the step value α to 1. Correspondingly, referring to fig. 5, the outputs of first-class subtasks 11 and 12 are inputs of first-class subtask 21, so the dependency value of first-class subtask 21 is 2; first-class subtasks 11 to 13 depend on no other subtask, so their dependency values are set to 0, that is, the dependency value of a first-layer node in the topology graph is 0.
According to the dependency topology graph, the dependency value of the node corresponding to an executable subtask is the number of upstream nodes it depends on. For example, after first-class subtask 21 is determined to be an executable subtask, if it depends on 2 upstream nodes, its dependency value is 2.
When a dependency value reaches 0, the corresponding executable subtask is activated and added to the activation queue.
For example, if there is temporarily no pending storage task, that is, the current executable subtasks do not include second-class subtasks, the dependency values of the 3 executable subtasks 11-13 corresponding to convolutional layer 1 are 0, so they are activated first and added to the activation queue, which then holds 3 executable subtasks.
If these 3 executable subtasks are all of floating-point type, they are scheduled in parallel onto 3 cores of the CPU or GPU according to those cores' remaining operation resources, so that each executable subtask is executed by one CPU core or one GPU core and the 3 cores execute the 3 executable subtasks in parallel, i.e. simultaneously. After the 3 executable subtasks finish executing, they are popped from the activation queue. Because executable subtasks 11 and 12 finished at the same time, the dependency values of executable subtasks 21 and 22 corresponding to convolutional layer 2 are each reduced by 2 and become 0, so they are activated and added to the activation queue. The above steps are repeated until all convolution operation tasks of the convolutional neural network have been executed.
Based on the same inventive concept, referring to fig. 7, an embodiment of the present application further provides a task processing device based on a neural network, which is configured based on a multi-core processor architecture, and includes:
the dividing unit 701 is configured to divide an operation task to be processed in a target layer specified in a neural network into a plurality of first-class subtasks;
a first determining unit 702, configured to determine a plurality of executable sub-tasks, where the plurality of executable sub-tasks includes at least one first type sub-task;
the second determining unit 703 is configured to determine a dependency value of each executable sub-task according to the obtained dependency relationship among the plurality of executable sub-tasks; wherein the dependency value characterizes a number of other executable sub-tasks on which the respective executable sub-task depends;
an execution unit 704, configured to add the executable sub-task whose dependent value is a preset value to an activation queue; executing, by the multiple cores, the multiple executable sub-tasks in the activation queue in parallel.
When dividing the operation task to be processed in the target layer specified in the neural network into a plurality of first-class subtasks, the dividing unit 701 is specifically configured to: if the neural network is a convolutional neural network, divide the operation task to be processed in the target layer into a corresponding plurality of first-class subtasks based on the number of channels, or the number of rows and columns, corresponding to the target layer, wherein the channels are the input or output channels through which the target layer exchanges data with other layers, and the rows and columns are those of the two-dimensional or three-dimensional array of each channel of the target layer;
or, when the to-be-processed operation task in the target layer specified in the neural network is divided into a plurality of first-class subtasks, the dividing unit 701 is specifically configured to: and if the neural network is a non-convolution neural network, dividing the to-be-processed operation task in the target layer into a plurality of corresponding first-class subtasks based on the number of channels corresponding to the target layer, wherein the channels are channels for data interaction between the target layer and other layers.
After dividing the to-be-processed operation task in the target layer specified in the neural network into a plurality of first-class subtasks and before determining a plurality of executable subtasks, the dividing unit 701 is further configured to: generating the operation parameters of each first-class subtask, wherein the operation parameters of one first-class subtask at least comprise any one or any combination of the following parameters: an input tensor, an output tensor, weight parameters, convolution parameters, and additional parameters.
When determining a plurality of executable sub-tasks based on the plurality of first-class sub-tasks, the first determining unit 702 is specifically configured to:
compiling the plurality of first-class subtasks, and determining the compiled plurality of first-class subtasks as a corresponding plurality of executable subtasks;
alternatively,
dividing a storage task to be processed for importing and/or exporting the data of the target layer into a plurality of second-class subtasks in a storage area of the multi-core processor architecture; compiling the plurality of second-class subtasks, and merging the compiled plurality of second-class subtasks and the plurality of first-class subtasks to obtain a plurality of corresponding executable subtasks.
When compiling the first type of subtask, the first determining unit 702 is specifically configured to:
compiling the plurality of first-class subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
When the plurality of first-class subtasks are compiled, the first determining unit is specifically configured to: acquiring weight parameters corresponding to the plurality of first-class subtasks; performing the following operations respectively for the plurality of first-class subtasks:
if the weight parameter corresponding to a first type of subtask is fixed-point integer, compiling the first type of subtask into an operation instruction of fixed-point integer; and if the weight parameter corresponding to one first-class subtask is a floating point number, compiling the first-class subtask into an operation instruction in a floating point format.
When compiling the second type subtasks, the first determining unit 702 is specifically configured to: and compiling the plurality of second-class sub-tasks into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture respectively.
When determining the operation levels of the multiple executable sub-tasks in sequence according to the obtained dependency relationships among the multiple executable sub-tasks, the second determining unit 703 is specifically configured to: the following operations are performed for each executable sub-task separately:
generating a corresponding one of the nodes based on one of the executable subtasks;
respectively defining each other node having a data transmission relation with the node as an upstream node or a downstream node of the node, and generating a corresponding dependency topological graph, wherein the topological graph characterizes the dependency relation among the nodes; and marking the number of the upstream nodes depended on by the one node as the dependency value of the one node.
After the number of upstream nodes on which the node depends is marked as the dependency value of the node, and before the executable subtask whose dependency value is a preset value is added to the activation queue, the execution unit 704 is specifically configured to: respectively aiming at each executable subtask, the following operations are executed:
in the dependency topology graph, for a node corresponding to an executable subtask, each time it is determined that the executable subtask corresponding to any node in the upstream node on which the node depends is executed, subtracting a step value (for example, 1) from the dependency value of the node; and adding the executable subtasks with the dependency value of a preset value (for example 0) into an activation queue.
Based on the same inventive concept, referring to fig. 8, an embodiment of the present application provides a server, where the server at least includes: a memory 801 and a processor 802 that, among other things,
a memory 801 for storing executable instructions;
a processor 802 for reading and executing the executable instructions stored in the memory to implement any of the methods involved in the above embodiments.
Based on the same inventive concept, the present application provides a storage medium, wherein when instructions in the storage medium are executed by a processor, the storage medium enables any one of the methods related to the embodiments to be executed.
In summary, in the embodiment of the present application, a task to be processed at a target layer of a neural network is divided into a plurality of first-class subtasks, which refines the operation granularity, reduces the demands on the performance of a processor core, and makes the divided fine-grained first-class subtasks easier for a processor core to execute. In addition, based on the multiple cores of the multi-core processor, an efficient scheduling mechanism is provided by setting up an activation queue of executable subtasks, so that the executable subtasks are scheduled in an orderly way and can be executed on different cores, which improves task processing efficiency.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (20)

1. A task processing method based on a neural network is characterized by comprising the following steps:
dividing an operation task to be processed in a designated target layer in a neural network into a plurality of first-class subtasks;
determining a plurality of executable subtasks, the plurality of executable subtasks including at least one subtask of a first type;
determining a dependency value of each executable subtask according to the dependency relationship among the executable subtasks; wherein the dependency value characterizes a number of other executable sub-tasks on which the respective executable sub-task depends;
adding the executable subtasks with the dependency values as preset values into an activation queue; executing, by the multiple cores, the multiple executable sub-tasks in the activation queue in parallel.
2. The method according to claim 1, wherein dividing the to-be-processed operation task in the target layer specified in the neural network into a plurality of first-class subtasks specifically includes:
if the neural network is a convolutional neural network, dividing the operation task to be processed in the target layer into a corresponding plurality of first-class subtasks based on the number of channels, or the number of rows and columns, corresponding to the target layer, wherein the channels are the input or output channels through which the target layer exchanges data with other layers, and the rows and columns are those of the two-dimensional or three-dimensional array of each channel of the target layer;
alternatively,
and if the neural network is a non-convolution neural network, dividing the to-be-processed operation task in the target layer into a plurality of corresponding first-class subtasks based on the number of channels corresponding to the target layer, wherein the channels are channels for data interaction between the target layer and other layers.
3. The method of claim 1, wherein after dividing the computational task to be processed within the target layer specified in the neural network into a plurality of subtasks of the first type, and before determining the plurality of executable subtasks, further comprising:
generating the operation parameters of each first-class subtask, wherein the operation parameters of one first-class subtask at least comprise any one or any combination of the following parameters: an input tensor, an output tensor, weight parameters, convolution parameters, and additional parameters.
4. The method of claim 1, 2 or 3, wherein determining a plurality of executable subtasks specifically comprises:
compiling the plurality of first-class subtasks, and determining the compiled plurality of first-class subtasks as a corresponding plurality of executable subtasks;
alternatively,
dividing a storage task to be processed for importing and/or exporting the data of the target layer into a plurality of second-class subtasks in a storage area of the multi-core processor architecture;
compiling the plurality of second-class subtasks, and merging the compiled plurality of second-class subtasks and the plurality of first-class subtasks to obtain a plurality of corresponding executable subtasks.
5. The method of claim 4, wherein compiling the first type of subtask specifically includes:
compiling the plurality of first-class subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
6. The method of claim 4, wherein compiling the plurality of first-class subtasks specifically includes:
acquiring weight parameters corresponding to the plurality of first-class subtasks;
performing the following operations respectively for the plurality of first-class subtasks:
if the weight parameter corresponding to a first type of subtask is fixed-point integer, compiling the first type of subtask into an operation instruction of fixed-point integer;
and if the weight parameter corresponding to one first-class subtask is a floating point number, compiling the first-class subtask into an operation instruction in a floating point format.
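
A minimal sketch of the weight-driven compilation choice in claim 6, assuming the weight parameters arrive as numpy arrays and representing the resulting operation instructions as plain dictionaries for illustration only.

    import numpy as np

    def compile_subtask(weights):
        # Choose the instruction format from the weight parameter's numeric type.
        if np.issubdtype(np.asarray(weights).dtype, np.integer):
            return {"op": "conv_fixed_point", "weights": weights}   # fixed-point integer instruction
        return {"op": "conv_floating_point", "weights": weights}    # floating-point instruction

    print(compile_subtask(np.ones((3, 3), dtype=np.int8))["op"])      # conv_fixed_point
    print(compile_subtask(np.ones((3, 3), dtype=np.float32))["op"])   # conv_floating_point
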
7. The method of claim 4, wherein compiling the second type of subtasks specifically includes:
and compiling the plurality of second-class sub-tasks into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture respectively.
8. The method according to claim 1, 2 or 3, wherein determining the dependency value of each executable sub-task in turn according to the obtained dependency relationship between the plurality of executable sub-tasks comprises:
the following operations are performed for each executable sub-task separately:
generating a corresponding node based on the one executable subtask;
respectively defining each other node having a data transmission relation with the node as an upstream node or a downstream node of the node, and generating a corresponding dependency topological graph;
and marking the number of the upstream nodes depended on by the one node as the dependency value of the one node.
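
A brief sketch of the dependency topology construction in claim 8, assuming each data-transmission relation is given as a (producer, consumer) pair; the Node class and task names are illustrative only.

    class Node:
        def __init__(self, name):
            self.name = name
            self.upstream = []     # nodes this node receives data from
            self.downstream = []   # nodes this node sends data to

    def build_topology(transfers):
        """transfers: iterable of (producer, consumer) pairs, i.e. data-transmission relations."""
        nodes = {}
        for src, dst in transfers:
            a = nodes.setdefault(src, Node(src))
            b = nodes.setdefault(dst, Node(dst))
            a.downstream.append(b)
            b.upstream.append(a)
        # The dependency value of a node is the number of upstream nodes it depends on.
        dep_values = {name: len(node.upstream) for name, node in nodes.items()}
        return nodes, dep_values

    nodes, dep_values = build_topology([("tile0", "merge"), ("tile1", "merge")])
    print(dep_values["merge"])   # 2: 'merge' depends on two upstream nodes
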
9. The method of claim 8, wherein after marking the number of upstream nodes on which the one node depends as the dependency value of the one node, and before adding the executable subtasks whose dependency values equal a preset value to the activation queue, the method further comprises:
performing the following operations for each executable subtask respectively:
in the dependency topology graph, for the node corresponding to an executable subtask, subtracting one step value from the dependency value of the node each time the executable subtask corresponding to any one of the upstream nodes on which the node depends is determined to have been executed;
wherein adding the executable subtasks whose dependency values equal the preset value to the activation queue specifically comprises:
adding the executable subtasks whose dependency values have reached the preset value to the activation queue.
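
A minimal sketch of the decrement step in claim 9, assuming a step value of 1 and a preset value of 0; the dictionaries stand in for the dependency topology graph of claim 8, and the task names are hypothetical.

    downstream = {"tile0": ["merge"], "tile1": ["merge"], "merge": []}
    dep_values = {"tile0": 0, "tile1": 0, "merge": 2}
    activation_queue = []

    def on_subtask_finished(finished, step=1, preset=0):
        # Each completed upstream subtask lowers its downstream nodes' dependency
        # values by the step value; a node reaching the preset value is activated.
        for name in downstream[finished]:
            dep_values[name] -= step
            if dep_values[name] == preset:
                activation_queue.append(name)

    on_subtask_finished("tile0")
    on_subtask_finished("tile1")
    print(activation_queue)   # ['merge']
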
10. A task processing device based on a neural network, configured based on a multi-core processor architecture, comprising:
the dividing unit is used for dividing the operation tasks to be processed in a designated target layer in the neural network into a plurality of first-class subtasks;
a first determining unit, configured to determine a plurality of executable subtasks, where the plurality of executable subtasks include at least one first type subtask;
a second determining unit, configured to determine a dependency value of each executable subtask according to the dependency relationship among the plurality of executable subtasks; wherein the dependency value characterizes the number of other executable subtasks on which the respective executable subtask depends;
an execution unit, configured to add the executable subtasks whose dependency values equal a preset value to an activation queue, and execute, by the multiple cores, the executable subtasks in the activation queue in parallel.
11. The apparatus according to claim 10, wherein when the to-be-processed operation task in the target layer specified in the neural network is divided into a plurality of first-class subtasks, the dividing unit is specifically configured to:
if the neural network is a convolutional neural network, dividing the to-be-processed operation task in the target layer into a plurality of corresponding first-class subtasks based on the number of channels or the number of corresponding rows and columns corresponding to the target layer, wherein the channels are input channels or output channels for data interaction between the target layer and other layers, and the rows and columns are two-dimensional arrays or three-dimensional arrays of each channel of the target layer;
or,
and if the neural network is a non-convolution neural network, dividing the to-be-processed operation task in the target layer into a plurality of corresponding first-class subtasks based on the number of channels corresponding to the target layer, wherein the channels are input channels or output channels for data interaction between the target layer and other layers.
12. The apparatus of claim 10, wherein after dividing the computational task to be processed within the target layer specified in the neural network into the plurality of subtasks of the first type, and before determining the plurality of executable subtasks, the dividing unit is further configured to:
generating the operation parameters of each first-class subtask, wherein the operation parameters of one first-class subtask at least comprise any one or any combination of the following parameters: an input tensor, an output tensor, weight parameters, convolution parameters, and additional parameters.
13. The apparatus according to claim 10, 11 or 12, wherein, when determining a plurality of executable sub-tasks, the first determining unit is specifically configured to:
compiling the plurality of first-class subtasks, and determining the compiled plurality of first-class subtasks as a corresponding plurality of executable subtasks;
or,
dividing a to-be-processed storage task, which imports the data of the target layer into and/or exports it from a storage area of the multi-core processor architecture, into a plurality of second-class subtasks;
compiling the plurality of second-class subtasks, and merging the compiled plurality of second-class subtasks and the plurality of first-class subtasks to obtain a plurality of corresponding executable subtasks.
14. The apparatus according to claim 13, wherein, when compiling the first type of subtask, the first determining unit is specifically configured to:
compiling the plurality of first-class subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
15. The apparatus according to claim 13, wherein when compiling the plurality of first-class subtasks, the first determining unit is specifically configured to:
acquiring weight parameters corresponding to the plurality of first-class subtasks;
performing the following operations respectively for the plurality of first-class subtasks:
if the weight parameter corresponding to a first type of subtask is fixed-point integer, compiling the first type of subtask into an operation instruction of fixed-point integer;
and if the weight parameter corresponding to one first-class subtask is a floating point number, compiling the first-class subtask into an operation instruction in a floating point format.
16. The apparatus according to claim 13, wherein when compiling the second type of subtask, the first determining unit is specifically configured to:
and compiling the plurality of second-class sub-tasks into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture respectively.
17. The apparatus according to claim 10, 11 or 12, wherein when determining the dependency value of each executable sub-task in turn according to the dependency relationship among the plurality of executable sub-tasks, the second determining unit is specifically configured to:
the following operations are performed for each executable sub-task separately:
generating a corresponding node based on the one executable subtask;
respectively defining each other node having a data transmission relation with the node as an upstream node or a downstream node of the node, and generating a corresponding dependency topological graph;
and marking the number of the upstream nodes depended on by the one node as the dependency value of the one node.
18. The apparatus of claim 17, wherein after marking the number of upstream nodes on which the one node depends as the dependency value of the one node, and before adding the executable subtasks whose dependency values equal the preset value to the activation queue, the execution unit is further configured to:
perform the following operations for each executable subtask respectively:
in the dependency topology graph, for the node corresponding to an executable subtask, subtract one step value from the dependency value of the node each time the executable subtask corresponding to any one of the upstream nodes on which the node depends is determined to have been executed;
and add the executable subtasks whose dependency values have reached the preset value to the activation queue.
19. A server, comprising: a memory and a processor; wherein:
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the method of any one of claims 1-9.
20. A storage medium, characterized in that instructions in the storage medium, when executed by a processor, enable execution of the method according to any one of claims 1-9.
CN201911016715.5A 2019-10-24 2019-10-24 Task processing method, device, server and storage medium based on neural network Pending CN112711478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911016715.5A CN112711478A (en) 2019-10-24 2019-10-24 Task processing method, device, server and storage medium based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911016715.5A CN112711478A (en) 2019-10-24 2019-10-24 Task processing method, device, server and storage medium based on neural network

Publications (1)

Publication Number Publication Date
CN112711478A true CN112711478A (en) 2021-04-27

Family

ID=75540169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911016715.5A Pending CN112711478A (en) 2019-10-24 2019-10-24 Task processing method, device, server and storage medium based on neural network

Country Status (1)

Country Link
CN (1) CN112711478A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419830A (en) * 2021-06-23 2021-09-21 鹤壁国立光电科技股份有限公司 Multi-dimensional scheduling method and system based on neural network
CN113742089A (en) * 2021-11-04 2021-12-03 苏州浪潮智能科技有限公司 Method, device and equipment for distributing neural network computing tasks in heterogeneous resources
CN114035810A (en) * 2022-01-10 2022-02-11 北京一流科技有限公司 Synchronous deployment system and method for multi-stream parallelism
CN115840571A (en) * 2023-02-21 2023-03-24 北京灵汐科技有限公司 Method for compiling tasks, compiler and computer readable medium
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
WO2023123266A1 (en) * 2021-12-30 2023-07-06 华为技术有限公司 Subgraph compilation method, subgraph execution method and related device
WO2024061135A1 (en) * 2022-09-22 2024-03-28 摩尔线程智能科技(北京)有限责任公司 Method for processing plurality of tasks, processing device, and heterogeneous computing system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358003A (en) * 2016-08-31 2017-01-25 华中科技大学 Video analysis and accelerating method based on thread level flow line
CN109101339A (en) * 2018-08-15 2018-12-28 北京邮电大学 Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group
CN109240815A (en) * 2018-08-24 2019-01-18 珠海格力电器股份有限公司 A kind of multitask running method, device and equipment of shared storehouse
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs
CN106358003A (en) * 2016-08-31 2017-01-25 华中科技大学 Video analysis and accelerating method based on thread level flow line
CN109101339A (en) * 2018-08-15 2018-12-28 北京邮电大学 Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group
CN109240815A (en) * 2018-08-24 2019-01-18 珠海格力电器股份有限公司 A kind of multitask running method, device and equipment of shared storehouse
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RYUICHI SAKAMOTO: "The Design and Implementation of Scalable Deep Neural Network Accelerator Cores", IEEE *
HE JIANG: "Research on Multi-Platform Heterogeneous Acceleration of Key Algorithms for Scene Character Recognition", China Masters' Theses Full-text Database *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419830A (en) * 2021-06-23 2021-09-21 鹤壁国立光电科技股份有限公司 Multi-dimensional scheduling method and system based on neural network
CN113419830B (en) * 2021-06-23 2023-02-03 鹤壁国立光电科技股份有限公司 Multi-dimensional scheduling method and system based on neural network
CN113742089A (en) * 2021-11-04 2021-12-03 苏州浪潮智能科技有限公司 Method, device and equipment for distributing neural network computing tasks in heterogeneous resources
CN113742089B (en) * 2021-11-04 2022-02-18 苏州浪潮智能科技有限公司 Method, device and equipment for distributing neural network computing tasks in heterogeneous resources
WO2023123266A1 (en) * 2021-12-30 2023-07-06 华为技术有限公司 Subgraph compilation method, subgraph execution method and related device
CN114035810A (en) * 2022-01-10 2022-02-11 北京一流科技有限公司 Synchronous deployment system and method for multi-stream parallelism
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
WO2024061135A1 (en) * 2022-09-22 2024-03-28 摩尔线程智能科技(北京)有限责任公司 Method for processing plurality of tasks, processing device, and heterogeneous computing system
CN115840571A (en) * 2023-02-21 2023-03-24 北京灵汐科技有限公司 Method for compiling tasks, compiler and computer readable medium

Similar Documents

Publication Publication Date Title
CN112711478A (en) Task processing method, device, server and storage medium based on neural network
KR102569086B1 (en) Task parallel processing method, device, system, storage medium and computer device
US20200249998A1 (en) Scheduling computation graph heterogeneous computer system
US8990827B2 (en) Optimizing data warehousing applications for GPUs using dynamic stream scheduling and dispatch of fused and split kernels
US7937567B1 (en) Methods for scalably exploiting parallelism in a parallel processing system
CN110633153A (en) Method for realizing neural network model splitting by using multi-core processor and related product
EP3502975A1 (en) Methods and apparatus for model parallelism in artificial neural networks
EP3640856A1 (en) A method, apparatus and computer program to carry out a training procedure in a convolutional neural network
US8707320B2 (en) Dynamic partitioning of data by occasionally doubling data chunk size for data-parallel applications
US11609792B2 (en) Maximizing resource utilization of neural network computing system
EP2711839A1 (en) Parallel processing device, parallel processing method, optimization device, optimization method, and computer program
CN110308982B (en) Shared memory multiplexing method and device
CN110689121A (en) Method for realizing neural network model splitting by using multi-core processor and related product
US10467722B2 (en) Combined rendering and computing resource allocation management system
US20200371835A1 (en) Method And Apparatus For Scheduling Matrix Operations In Digital Processing Systems
EP3920026A1 (en) Scheduler, method of operating the same, and accelerator apparatus including the same
CN113568599A (en) Method, electronic device and computer program product for processing a computing job
WO2019086765A1 (en) Combined rendering and compute resource allocation management system
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
US10580106B2 (en) Graphics processing method utilizing predefined render chunks
WO2019086764A1 (en) Graphics engine resource management and allocation system
Suzuki et al. Victream: Computing framework for out-of-core processing on multiple GPUs
KR102376527B1 (en) Method and computer program of processing program for single accelerator using dnn framework on plural accelerators
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Gajurel et al. GPU acceleration of sparse neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination