CN111563584B - Splitting method of neural network model and related product


Info

Publication number
CN111563584B
CN111563584B (application CN201910114927.0A)
Authority
CN
China
Legal status
Active
Application number
CN201910114927.0A
Other languages
Chinese (zh)
Other versions
CN111563584A (en)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910114927.0A
Priority to EP20756078.0A
Priority to US17/419,290
Priority to PCT/CN2020/084416
Publication of CN111563584A
Application granted
Publication of CN111563584B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

The scheme splits an operator into several smaller-scale sub-operators, so that the compute library under the single-core architecture can be called directly, avoiding the extra workload of reimplementation.

Description

Splitting method of neural network model and related product
Technical Field
The embodiment of the disclosure relates to a method for splitting a neural network model and a related product.
Background
In recent years, deep learning accelerators have been proposed and, like general-purpose processors, are expanding from single-core to multi-core architectures. The expanded multi-core structure can support data parallelism in the training stage to improve data throughput and accelerate training. In the inference stage, however, deep neural networks often place stricter requirements on end-to-end latency than on throughput, and latency frequently determines whether an accelerator is usable in a given scenario. Traditional data-parallel schemes cannot meet the small-batch, low-latency requirements of accelerators in inference scenarios.
Disclosure of Invention
In order to solve the technical problem, the present disclosure provides a method for splitting a neural network model and a related product.
In order to achieve the above object, the present disclosure provides a neural network model splitting method, wherein the method includes:
determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer in the neural network model;
traversing the split state sets according to a directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
determining a target splitting path of the target layer according to the weight of the state path;
and splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the step of determining the target split path of the target layer comprises:
traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths pointing to the current state and split paths from the initial state of the state paths to the initial state of the input tensor data of the target layer;
determining a split path of an initial state of the input tensor data from the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
Preferably, the step of determining the target split path of the target layer comprises:
traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths taking the current state as a starting point and split paths from the end state of the state paths to the end state of the output tensor data of the target layer;
determining a split path of the termination state of the output tensor data of the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
Preferably, the number of sub-operators obtained after the operator of the target layer of the neural network model is split is an integer power of 2.
Preferably, the state in the split state set of input tensor data of the operator of the target layer of the neural network model is determined according to the computation logic of the operator and the state in the split state set of the corresponding output tensor data.
Preferably, the state in the split state set of output tensor data of the operator of the target layer of the neural network model is determined according to the computation logic of the operator and the state in the split state set of the corresponding input tensor data.
Preferably, the method further comprises the following steps:
in a forward traversal phase, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one splitting state is reserved in a splitting state set of the output tensor data of the operator, and the splitting state is determined by the same state path of the operator.
Preferably, the method further comprises the following steps:
in a reverse traversal phase, when the operator has at least two input tensor data, one split state is retained in a split state set of the input tensor data of the operator, and the split state is determined via the same state path of the operator.
Preferably, the weight of the state path is determined according to the type and the scale of an operator and hardware parameters of the multi-core processor.
In order to achieve the above object, the present disclosure provides a neural network model splitting device, including:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer in the neural network model;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of the target layer according to the weight of the state path;
and the splitting module is used for splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the target split path module comprises:
the first traversal unit is used for traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths pointing to the current state and split paths from the initial state of the state paths to the initial state of the input tensor data of the target layer;
a first split path determining unit, configured to determine a split path of an initial state of the input tensor data from the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
the first selected target split path unit is configured to obtain a target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer after traversing all the split state sets of the target layer.
Preferably, the target split path module comprises:
a second traversal unit, configured to traverse all the split state sets of the target layer, and traverse each state for a current split state set to obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a second split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and the second selected target splitting path unit is used for obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
Preferably, the device further comprises:
a first splitting state set optimization module, configured to, in a forward traversal phase, when output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, leave a splitting state in a splitting state set of the output tensor data of the operator, and the splitting state is determined via a same state path of the operator.
Preferably, the device further comprises:
a second splitting state set optimization module, configured to, in a reverse traversal phase, when the operator has at least two input tensor data, retain one splitting state in the splitting state set of the input tensor data of the operator, the splitting state being determined via the same state path of the operator.
In order to achieve the above object, the present disclosure provides a neural network model splitting hardware device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the steps of the method when executing the computer program.
To achieve the above object, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method described above.
The technical scheme disclosed herein can extend a deep learning accelerator from a single-core to a multi-core structure at low cost, and can provide an efficient splitting scheme tailored to the characteristics of a given network and the underlying accelerator; this scheme can effectively reduce the end-to-end latency of various networks on the multi-core accelerator.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram of a shared memory multi-core architecture;
fig. 2 is a flowchart of a neural network model splitting method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a serial neural network model split;
fig. 4 is a second flowchart of a neural network model splitting method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the splitting of the glue operator introduced between the operator and the input tensor data;
FIG. 6 is a schematic illustration of compensation;
fig. 7 is a third flowchart of a neural network model splitting method according to an embodiment of the present disclosure;
FIG. 8 is a schematic view of pyramid splitting;
fig. 9 is a fourth flowchart of a neural network model splitting method provided in the embodiment of the present disclosure;
fig. 10 is a schematic diagram of a neural network model splitting hardware device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described more fully hereinafter with reference to the non-limiting exemplary embodiments shown in the accompanying drawings, which illustrate the exemplary embodiments of the present disclosure and their various features and advantageous details. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of known materials, components, and process techniques are omitted so as not to obscure the example embodiments of the present disclosure. The examples given are intended merely to facilitate an understanding of ways in which the example embodiments may be implemented and to further enable those skilled in the art to practice them. Accordingly, these examples should not be construed as limiting the scope of the embodiments of the disclosure.
Unless otherwise specifically defined, technical or scientific terms used herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Further, in the various embodiments of the present disclosure, the same or similar reference numerals denote the same or similar components.
The following describes in detail specific embodiments of a neural network model splitting method and related products provided by the embodiments of the present disclosure with reference to the accompanying drawings.
In recent years, deep learning accelerators have become a rapidly growing field, thanks to the great success of deep learning itself in many domains. These emerging accelerators tend to have a better performance-to-power ratio than GPUs. As with the development of general-purpose processors, deep learning accelerators can be extended from a single core to a multi-core architecture, and this extension is very well suited to the data-parallel training mode in deep learning. Data parallelism means accelerating training by dividing the training data set into several parts and using multiple processing cores to process the sub-data-sets separately. In a multi-core structure, each core processes a different part of the training data in parallel, which raises the throughput of the whole system and accelerates training. The multi-core accelerator architecture can therefore conveniently scale the computing throughput of the whole system in the training phase while each core keeps its good performance-to-power ratio.
For a chip with a multi-core processor structure, the shared-memory multi-core structure shown in FIG. 1 is a classic design, and it is very well suited to data-parallel neural network training. Each core acts as one processor in the data-parallel scheme, reads its own data, and then completes the forward and backward computation of the neural network model in parallel. In the computation phase, each core retains the good performance-to-power ratio it had under the previous single-core architecture, while the throughput of the whole system grows with the number of cores. The problem with data parallelism is that its scalability depends on the batch size of the data being processed. Although this is usually not a problem in the training phase, the premise is hard to guarantee in the inference phase. In neural network models for real-time service fields (including video surveillance, autonomous driving, and the like), the data to be processed usually arrives serially as a stream, so each piece of data is very small and often is even a single picture. In this case, data parallelism provides no parallelism at all, and all work may be concentrated on a single core, so the computational resources of multiple cores do not translate into faster task processing.
After a neural network model has been trained offline on a data set, it is deployed on a cloud server to process data sent from outside, and the application scenario shifts from offline training to online inference. In the online inference phase, a very important metric is latency, i.e., the time from the server receiving the data to be processed to returning the processed result, and more specifically the time spent processing the data with the neural network model. Low latency ensures that the cloud server can respond to the data sent by the client in the shortest possible time, and in more sensitive scenarios it directly determines whether a solution is usable. The requirements on the accelerator in the online inference phase therefore change from processing large batches of data with high throughput to processing small batches of data with low latency.
In this situation, neither traditional data parallelism nor model parallelism can effectively reduce the latency of inference tasks. For data parallelism, a large batch of data is a precondition, which contradicts the small-batch nature of online inference. Model parallelism is usually adopted when a large-scale neural network model exceeds the memory limit of a single device; distributing operators to different cores does not reduce the latency of the network. To actually reduce the latency of inference tasks on a multi-core accelerator, a way must be found to reasonably distribute the inference computation for small batches of data, or even a single datum, across the cores of the multi-core architecture, ensuring that as many cores as possible participate in the computation at every moment so as to fully utilize the resources of the multi-core architecture. Such a method ensures that several cores participate in the computation at every moment, even when an inference task for a single picture is being processed, thereby using multi-core resources to reduce latency.
However, multi-core accelerators still face several problems. First, a deep learning accelerator adapts to the data-parallel characteristics of deep learning algorithms through customized hardware design to improve computational throughput; it usually needs a sufficiently large data scale to reach high computational efficiency, and further splitting within an operator reduces the computation scale on each core. When the split reaches a certain granularity, the loss of computational efficiency on each core will exceed the gain brought by the increased parallelism. A trade-off must therefore be struck between splitting parallelism and computational efficiency: provide a sufficient degree of parallelism while keeping computational efficiency sufficiently high.
On the other hand, a neural network model can be regarded as a complex computational graph composed of hundreds or even thousands of common operators. Different kinds of operators have different algorithmic logic, which leads to different ways of splitting them. The splitting of each operator must not only balance its own computational efficiency and parallelism, but also consider how it matches the operators before and after it; it may even affect the global result. The rapid development of deep learning brings ever larger and more complex networks, and finding a good parallel scheme manually is unrealistic, so an automated method is needed to guarantee a good splitting-parallelization strategy for different networks.
In addition, portability to the underlying accelerator must also be considered. For accelerators without good enough programmability, the workload of extending from single core to multi-core and of modifying the software stack to implement split parallelism inside operators is very large. Traditional implementations of data parallelism and model parallelism still complete an operator's computation task on a single processing core, so they introduce little extra work; cross-core parallelism of a single operator, by contrast, requires modifying the operator's implementation, and the difficulty of that modification depends on the programmability of the accelerator and the complexity of the original operator's implementation logic. How to reduce the extra overhead of achieving low-latency inference on a multi-core architecture, and to lessen the dependence of the implementation workload on the programmability of the accelerator, is also a problem to consider, so that the method can have a certain universality for different multi-core accelerators in the future.
Based on the above analysis, this scheme automatically produces an end-to-end splitting plan for a large-scale neural network model: an operator is split into several smaller sub-operators, so that the compute library under the single-core architecture can be invoked directly and the extra work of reimplementation is avoided. For example, after splitting, one activation operator yields several smaller activation operators, which means that each subtask is completed by calling the original single-core activation function on several cores, with no need to modify it or reimplement a multi-core version of the activation function. In this process, not only the computational efficiency and parallelism of each operator after splitting must be considered, but also how neighboring operators cooperate in their splitting. The final aim is a parallel splitting scheme that effectively reduces the end-to-end inference latency of the whole neural network model.
Taking autonomous driving as an example, during automatic driving the vehicle must analyze and process external information such as images, video, and voice transmitted by the on-board sensors. To ensure safety, the vehicle must obtain the processed results in the shortest possible time in order to make decisions. Using this scheme, a vehicle equipped with a chip of multi-core processor structure can distribute the computational load of the neural network model that processes small batches of external information evenly across multiple processor cores, complete the processing within the specified response time, and return the results to assist automatic driving. The technical scheme can extend a deep learning accelerator from a single-core to a multi-core structure at low cost and can effectively reduce the end-to-end latency of various networks on the multi-core accelerator.
In the application scenario above, the chip with the multi-core processor structure is installed in the vehicle. In practice, the multi-core processor chip can also be deployed on a cloud server, and the vehicle can transmit external information such as images, video, and voice collected by the on-board sensors to the cloud server via networks such as 3G/4G or WiFi. Using this scheme, the cloud server distributes the computational load of the neural network model that processes small batches of external information evenly across multiple processing cores. Within the response time specified for the running vehicle, the cloud server feeds the processing results back to the vehicle through networks such as 3G/4G or WiFi. In practice, the scale of the external information collected by the on-board sensors varies. Before deployment, the on-board processor uses this scheme to determine a corresponding operator split path for external information of each scale, and the operator splitting schemes for the different scales are stored in corresponding areas. After the multi-core processor chip acquires external information, it calls the corresponding operator split path to split the operators in the neural network model and distributes the computational load of the external information evenly across the processor cores.
Typically, the upper-layer framework must call a compute library to obtain an instruction-level implementation of each operator in the neural network model on the processor. Specifically, the framework informs the compute library of each operator's type and parameters, and the compute library returns the machine instructions required to execute the operator on the processor. The framework loads the data and the machine instructions onto the processor through a driver, starts the processor, and completes the computation of the operator.
If the computing platform of the operator is changed from a single-core accelerator to a multi-core accelerator with similar or even identical core structures, the computing library needs to be redesigned accordingly, so that the computing library can generate machine instructions running on multiple cores. In particular, since multiple cores need to read different portions of the same input tensor data, and also need to write their respective outputs back to different portions of the same output tensor data, the computation library needs to modify all of the computation instructions for each operator with respect to the read and store portions.
The neural network splitting method provided by the embodiment of the disclosure can avoid modifying the computational library of the single-core processor as much as possible, and can simultaneously realize the parallel execution of the neural network model on the multi-core processor. Specifically, the upper-layer framework divides an operator in the neural network model into a plurality of sub-operators capable of being executed in parallel, for each sub-operator, the framework calls a computing library to generate a machine instruction of the sub-operator executed on a single core, and the machine instruction of the sub-operator is loaded to different cores, so that parallel computing of the operator on the multi-core processor is realized. In particular, because the framework uses a single-core processor computational library to generate computational instructions for sub-operators, the input and output tensor data for the operator in the neural network model is split into corresponding sub-tensor data as the operator is split into sub-operators.
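To make this host-side flow concrete, the following is a minimal toy sketch in Python. The classes SingleCoreLib and Core and the string "programs" are stand-in stubs invented for illustration, not a real accelerator SDK; the point is only that each sub-operator is compiled by an unmodified single-core compute library and launched on its own core.

```python
# Toy stand-ins for the single-core compute library and the cores; these
# classes are illustrative assumptions, not a real driver API.
class SingleCoreLib:
    def compile(self, sub_op):
        # returns the "machine instructions" of one ordinary 1-core kernel
        return f"instructions<{sub_op}>"

class Core:
    def __init__(self, idx):
        self.idx = idx
    def launch(self, program):
        print(f"core {self.idx} executes {program}")

def dispatch(sub_ops, cores, lib):
    # one single-core program per sub-operator, loaded onto different cores
    for sub_op, core in zip(sub_ops, cores):
        core.launch(lib.compile(sub_op))

# e.g. an activation operator split into two sub-operators:
dispatch(["relu_part0", "relu_part1"], [Core(0), Core(1)], SingleCoreLib())
```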
Based on the above description, as shown in fig. 2, a flowchart of a neural network model splitting method is provided for the embodiment of the present disclosure.
The method comprises the following steps:
step 201): and determining a splitting state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer.
In this embodiment, the neural network model can generally be regarded as a directed acyclic graph formed by operators and multidimensional tensor data; operators and tensor data are connected by directed edges, and the direction of a directed edge indicates whether the data is an input or an output of the operator. Operators are denoted op and tensor data are denoted tensor. Meanwhile, to unify the expression of the splitting modes of different operators, the framework describes the splitting mode of an operator through the splitting of the tensor data associated with it. It is assumed here that all tensor data in the network are 4-dimensional; for the input or output data of the final fully-connected layer and softmax regression layer of an image-classification network, whose actual dimensionality is less than 4, the data are still expressed as 4-dimensional tensors. The 4 dimensions are denoted by the symbols N, C, H, and W, where N denotes the batch size, C the number of feature maps, H the height of the feature maps, and W their width. This assumption is merely for convenience of explanation; the framework itself can support neural network models containing tensor data with any number of dimensions. Nevertheless, 4 dimensions are sufficient for a large proportion of neural network structures.
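As a minimal sketch of this representation (the class and function names are illustrative assumptions, not the patent's data structures), a sub-tensor can be described by a half-open (begin, end) range per N/C/H/W dimension, and a splitting state by the set of sub-tensors whose union is the whole tensor:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SubTensor:
    # half-open (begin, end) ranges along the N, C, H, W dimensions
    n: tuple
    c: tuple
    h: tuple
    w: tuple

def split_state(shape, parts):
    """Split an NCHW `shape` into an even grid; `parts` gives the number of
    pieces per dimension, e.g. parts=(1, 2, 2, 1) yields 4 sub-tensors."""
    axes = []
    for size, k in zip(shape, parts):
        step = size // k
        axes.append([(i * step, size if i == k - 1 else (i + 1) * step)
                     for i in range(k)])
    # the union of the sub-tensors covers the whole tensor exactly once
    return frozenset(SubTensor(*dims) for dims in product(*axes))

state = split_state((1, 64, 56, 56), (1, 2, 2, 1))  # split along C and H
assert len(state) == 4
```

A frozenset is used so that a splitting state is itself hashable and can be stored in the split state set S.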
When the technical scheme is used for splitting the operators in the neural network model, the types of the operators are different, the computational logics supported by the operators are different, and different splitting strategies are provided. In order to uniformly express splitting strategies of different operators, the splitting state of the input tensor data and the output tensor data of the operators is adopted to express the splitting of the calculation logic of the operators.
For the technical scheme, all operators in the whole neural network model can be split, and part of operators in the neural network model can also be split. In addition, the network structure and algorithm emerging in the deep learning field at present gradually obscure the physical meanings of all data dimensions and the boundaries among the data dimensions, and the technical scheme can be applied to operator splitting under more dimensions in an expanded mode.
Any split of tensor data is called a state $s$ of the tensor data; splitting the tensor data yields a set of sub-tensor data, and the state $s$ is characterized by this corresponding sub-tensor data set. All possible splits $\{s_0, s_1, s_2, \dots\}$ constitute the split state set $S$ of the tensor data. $S$ is in general a very large state space, which means that the space of possible operator splitting modes represented by the splitting states of the tensor data is also very large.
The state set of tensor data can be pruned with some reasonable assumptions. First, the time delay of the multi-core accelerator for completing the computation of an operator depends on the time of the core which takes the longest time to execute the subtasks, and the cores in the multi-core architecture are mutually equivalent in hardware structure, so the time consumption of each core depends on the load of the task allocated to the core. A reasonable assumption is therefore that it should be ensured that the size of the split sub-operators is approximately balanced, for which reason those split states that are disproportionately split can be omitted from the state set S of the tensor data. In addition, the number of cores in a multi-core architecture is usually an integer power of 2, such as 1,2,4,8, 16, etc., and a task with parallelism not being an integer power of 2 will often cause "fragmentation" in the scheduling of cores, so the number of sub-operators after splitting should be guaranteed to be an integer power of 2. With these two assumptions, the search space of the operator splitting strategy can be greatly reduced.
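Under these two pruning assumptions, the candidate states can be enumerated as even grid splits whose sub-operator count is an integer power of 2 no greater than the core count. The sketch below reuses split_state from the earlier example; the enumeration strategy itself is an illustrative assumption:

```python
def candidate_states(shape, num_cores):
    """Enumerate pruned split states: even (balanced) grid splits whose
    total piece count is a power of 2 and does not exceed `num_cores`."""
    states = []
    def rec(dim, parts, total):
        if dim == len(shape):
            states.append(split_state(shape, tuple(parts)))
            return
        k = 1
        while k <= shape[dim] and total * k <= num_cores:
            if shape[dim] % k == 0:      # keep the pieces exactly balanced
                rec(dim + 1, parts + [k], total * k)
            k *= 2                       # per-dimension counts stay powers of 2
    rec(0, [], 1)
    return states

# e.g. on 4 cores a (1, 64, 56, 56) tensor keeps only states of 1, 2 or 4 pieces
print(len(candidate_states((1, 64, 56, 56), 4)))
```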
It should be noted that not every splitting state of the tensor data associated with an operator represents a valid splitting mode of the operator. The split dimension of the tensor data must be supported by the operator; for example, the input data of a softmax regression operator (Softmax) must not be split along the dimension to be normalized. Furthermore, the splitting of an operator's input and output tensors must satisfy the operator's computation logic. For example, each sub-block of the output data of a convolution operator split along the H/W dimensions must be exactly computable, via the convolution kernel and stride of the operator, from the corresponding sub-block of the input data split along H/W; the splitting of the convolution operator's input data along the C dimension must be fully consistent with the splitting of the weight data along the C dimension, and the splitting of the output data along the C dimension must be fully consistent with the splitting of the weight data along the N dimension. In the framework, the input states of an operator are pushed backwards from its output states according to the specific logic of each operator, or the output states are pushed forwards from the input states. This ensures that the states of the associated data always represent valid splitting modes of the operator.
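For the convolution constraint just described, the input H/W range that an output sub-block depends on can be computed from the kernel size and stride. Below is a one-dimensional sketch under simplifying assumptions (no dilation; the padding handling and parameter names are illustrative):

```python
def conv_input_range(out_begin, out_end, kernel, stride, pad, in_size):
    """Half-open input range read by the half-open output range
    [out_begin, out_end) of a 1-D convolution."""
    in_begin = out_begin * stride - pad
    in_end = (out_end - 1) * stride - pad + kernel
    return max(0, in_begin), min(in_size, in_end)

# e.g. a 3x3 convolution with stride 1, pad 1: output rows [0, 28) of a
# 56-row image read input rows [0, 29); neighbouring input blocks overlap.
print(conv_input_range(0, 28, kernel=3, stride=1, pad=1, in_size=56))
```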
Step 202): traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths.
As shown in FIG. 3, the splitting scheme P of the entire neural network model can be seen as a jump, at each operator, from one splitting state in the split state set of its input tensor data to one splitting state in the split state set of its output tensor data. The splitting state of the output tensor of the former operator is the splitting state of the input tensor of the latter operator. Every possible jump through an operator corresponds to a valid splitting mode of that operator; thus, a state path represents a splitting mode of the operator.
In the technical scheme, tensor data are decomposed according to a decomposition mode to obtain a sub-tensor set, the sub-tensor set corresponds to one splitting state, multiple splitting states can be obtained through different decomposition modes, and the splitting states obtained through all the decomposition modes form a splitting state set. Therefore, each split state corresponds to a sub-tensor set, and the sub-tensor set comprises all elements in tensor data. In addition, in one sub-tensor set, the elements of each sub-tensor may or may not overlap.
As described above, the state path represents a splitting manner of an operator, and the computation logic of the operator is split by the splitting manner corresponding to the state path to obtain a corresponding sub-operator set. The state of the input tensor data is connected with the state of the corresponding output tensor data through a state path, and a sub tensor data set which represents a splitting state of the input tensor data is processed by sub operators in the sub operator set to obtain a sub tensor data set which corresponds to the splitting state of the output tensor data.
In the technical scheme, the weight of a state path is characterized by the time taken to execute the operator in parallel on the multi-core accelerator under the corresponding splitting mode, and the time for the multi-core accelerator to finish computing one operator depends on the core that takes the longest to execute its subtask. The following parameters are used in estimating the weight of a state path:
1) The computational loads $c_1, c_2, \dots, c_n$ of the $n$ split sub-operators, where $c_i$ is calculated from the type and scale of the $i$-th sub-operator after splitting;
2) The memory-access data volumes $d_1, d_2, \dots, d_n$ of the $n$ sub-operators, where $d_i$ is calculated from the type and scale of the $i$-th sub-operator after splitting;
3) The computational throughput $\alpha$ of each accelerator core, determined by the performance parameters of the accelerator itself;
4) The memory-access bandwidth $\beta$ of each core. Generally, multiple cores share a limited memory-access bandwidth, so $\beta = B/n$, where $B$ is the total bandwidth of the multi-core accelerator.
The weight of a state path is calculated as:

$$ t = \max_{i=1,\dots,n}\bigl(\max(c_i/\alpha,\; d_i/\beta)\bigr) \qquad (1) $$

The inner max operation rests on the assumption that, within an operator, the computation part and the memory-access part can hide each other, i.e., computation and memory access are executed concurrently as far as possible. For some accelerators, the computational throughput of each core drops when the sub-operator becomes too small, and $\alpha$ can be further corrected to make the estimate more accurate. The outer max operation gives the time for the multi-core accelerator to complete the computation of one operator, which depends on the core that takes the longest to execute its subtask.
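Formula (1) can be transcribed directly as a cost model. In this hedged sketch, the per-sub-operator loads c_i and memory volumes d_i, the per-core throughput alpha, and the total bandwidth B are assumed to be supplied by the accelerator's performance model:

```python
def state_path_weight(c, d, alpha, B):
    """Formula (1): t = max_i( max(c_i/alpha, d_i/beta) ), beta = B/n.
    c[i]: compute load of sub-operator i; d[i]: its memory-access volume."""
    n = len(c)
    beta = B / n  # the n cores share the total memory-access bandwidth B
    # inner max: compute and memory access hide each other on one core;
    # outer max: the operator finishes with its slowest core
    return max(max(ci / alpha, di / beta) for ci, di in zip(c, d))
```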
It should be noted that the above manner of obtaining the weight of a state path is only a partial example, not an exhaustive list; those skilled in the art, understanding the spirit of the present technical solution, may produce other modifications or variations. For example, the weight of a state path may be measured not only by the time taken to execute the subtasks but also by the throughput of executing them; or the weight may be determined by actually measuring, on the multi-core processor, the time taken to execute all subtasks under the operator splitting mode corresponding to the state path. As long as the functions and technical effects achieved are similar to those of the present application, such variants fall within the scope of protection of the present application.
Step 203): and determining a target splitting path of the target layer according to the weight of the state path.
In step 203, the split path of the target layer can be determined from the weights of the state paths in two ways. The first way determines the split path by forward traversal and includes the steps of:
traversing all the split state sets of the target layer; for the current split state set, traversing each of its states to obtain all state paths pointing to the current state, together with the split paths from the starting states of those state paths back to the initial state of the input tensor data of the target layer;
determining the split path from the initial state of the input tensor data of the target layer to the current state according to the weight of the state path and the weight of the split path, where the weight of a split path is determined from the weights of all the state paths it comprises;
and after traversing all the split state sets of the target layer, obtaining the target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.
The second way determines the split path by reverse traversal and includes the steps of:
traversing all the split state sets of the target layer; for the current split state set, traversing each of its states to obtain all state paths starting from the current state, together with the split paths from the ending states of those state paths to the termination state of the output tensor data of the target layer;
determining the split path from the current state to the termination state of the output tensor data of the target layer according to the weight of the state path and the weight of the split path, where the weight of a split path is determined from the weights of all the state paths it comprises;
and after traversing all the split state sets of the target layer, obtaining the target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.
In the following, how to obtain a target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer after traversing all the split state sets of the target layer is described in detail by way of example.
The neural network model shown in FIG. 3 is serial, and the input tensor data and output tensor data of the whole model are in the unsplit state. To split the operators of the whole neural network model of FIG. 3, a serial neural network model containing $n$ operators is described as an operator sequence $(op_1, op_2, \dots, op_n)$, assuming each operator has only one input and one output and that the input of each operator is the output of the previous one. All tensor data then form a set $(tensor_0, tensor_1, \dots, tensor_n)$ comprising the input and output tensor data of the whole model together with the intermediate result tensors between operators; operator $op_i$ has input $tensor_{i-1}$ and output $tensor_i$. Each data tensor $tensor_i$ has a corresponding state set $S_i$, and the goal of the search strategy is to find, for each tensor, a mapping $tensor_i \to s_i$ from the tensor to one state in its state set. Determining a specific splitting state for every tensor datum in the neural network model determines the splitting modes of all operators; the mapping of all tensor data in the model to splitting states is therefore called a splitting scheme P of the network model. In the computation phase, the $i$-th operator $op_i$ computes output tensor data in splitting state $r$ from input data in splitting state $s$; the specific parallel computation mode is determined by the states of the input and output tensor data, and the computation time of the operator is denoted $t_{s \to r}$, whose value depends on the corresponding splitting mode and the hardware characteristics of the underlying accelerator. The latency $T$ of the whole network is then:

$$ T = \sum_{i=1}^{n} t_{s_{i-1} \to s_i} \qquad (2) $$

Corresponding to this is $t_i$, the time used to execute operator $op_i$ in parallel on the multi-core accelerator under the chosen splitting mode; $t_i$ can thus be viewed as the weight of the directed edge pointing from the state of the operator's input tensor data to the state of its output tensor data. Meanwhile, the split state spaces of the input and output tensor data of the whole neural network model each contain only the one unsplit state that keeps the whole data block continuous and complete, so the splitting scheme P starts from complete input data and ends at complete output data, and an external user always sees a complete input and output. Searching a good splitting scheme P for a given neural network model therefore amounts to finding a shortest path from the unsplit state of the input tensor data to the unsplit state of the output tensor data, the path passing through one chosen state in the valid state space of every intermediate result tensor. Equations (3) and (4) give the abstract formulation:

$$ P^{*} = \operatorname*{argmin}_{P} T(P) \qquad (3) $$

$$ T(P) = \sum_{i=1}^{n} t_{s_{i-1} \to s_i}, \quad s_0 = s_{root}, \; s_n = s_{end} \qquad (4) $$
It is also worth noting that in FIG. 3 one splitting state of the input tensor data may point to several splitting states of the output tensor data, which further enlarges the splitting space of the neural network model.
In the technical scheme, the unsplit state of the input tensor data of the whole neural network model is set as the initial state $s_{root}$. At the initial stage, the weight of the split path corresponding to $s_{root}$ is 0, and the split-path weights of all states of all remaining tensor data are $\infty$. Every state $s$ of every tensor datum in the neural network model has a corresponding split-path weight $l_s$, the weight of the split path from $s_{root}$ to $s$. The split state sets are visited from front to back, and within each split state set every state $s$ is traversed in turn. Each state $s$ has directed edges $e_1, \dots, e_{k_s}$ pointing to several splitting states in the next split state set. Taking a splitting state $v$ in the next split state set as an example, the weight $t_{sv}$ between state $s$ and state $v$ is obtained using formula (1), and the weight $l_v$ of the split path from $s_{root}$ to state $v$ is updated using formula (5):

$$ l_v = \min(l_v,\; l_s + t_{sv}) \qquad (5) $$
After all the split state sets have been visited in a forward traversal following the topological order of the neural network model, the target split path from the unsplit state $s_{root}$ of the input tensor data of the whole neural network model to the unsplit state $s_{end}$ of its output tensor data is obtained.
Every path from the unsplit state $s_{root}$ to the unsplit state $s_{end}$ that passes through one state in each split state set is a split path of the neural network model, and the split path with the minimum weight among them is selected as the target split path of the neural network model.
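The forward traversal above is, in effect, a shortest-path relaxation over the DAG of splitting states. A hedged sketch follows; the data layout is an assumption (states are hashable, layers lists the split state sets in topological order, and succ[s] gives the (next state, weight) pairs of the state paths leaving s):

```python
import math

def target_split_path(layers, succ, s_root, s_end):
    """Return (path, weight) of the minimum-weight split path from s_root
    to s_end, relaxing each state path with formula (5)."""
    dist = {s_root: 0.0}   # l_s; every other state implicitly starts at inf
    prev = {}
    for state_set in layers:
        for s in state_set:
            if s not in dist:
                continue                      # state not reachable yet
            for v, t_sv in succ.get(s, ()):
                if dist[s] + t_sv < dist.get(v, math.inf):
                    dist[v] = dist[s] + t_sv  # formula (5)
                    prev[v] = s
    path, v = [s_end], s_end
    while v in prev:                          # backtrack the chosen states
        v = prev[v]
        path.append(v)
    return list(reversed(path)), dist.get(s_end, math.inf)
```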
It should be noted that the neural network model shown in FIG. 3 is serial, and for convenience of description the split state sets corresponding to its input and output tensor data each contain only the unsplit state. When the split state set of the output tensor data of the neural network model is not the single unsplit state $s_{end}$ but a set of several splitting states, the minimum of the split-path weights of the splitting states in that set is selected, and the corresponding path is taken as the target split path from the split state set of the input tensor data of the whole neural network model to the split state set of its output tensor data.
Note that the whole scheme can also be converted into a search from the unsplit state $s_{end}$ back to the unsplit state $s_{root}$; the two are equivalent. Similarly, when the split state set of the input tensor data of the neural network model is not the single unsplit state $s_{root}$ but a set of several splitting states, the minimum of the split-path weights of the splitting states in that set is selected as the target split path from the split state set of the input tensor data of the whole neural network model to the split state set of its output tensor data.
The neural network model shown in FIG. 3 is serial, and the states in the split state sets of its input and output tensor data cover only the unsplit case. In practical applications, for the technical solutions shown in FIGs. 2, 4, 7, and 9, the consistency of the splitting modes of different branches in a multi-branch neural network model must also be resolved. Operators located at branch junctions have more than one input tensor datum, such as the element-wise addition operator (Add), the element-wise multiplication operator (Mult), and the concatenation operator (Concat). For an operator A with 2 inputs, after the operator has been visited, i.e., after the split state set of its output tensor data has been enumerated from the split state sets of its input tensor data, the two input tensor data $tensor_{left}$ and $tensor_{right}$ have corresponding split state sets $S_{left}$ and $S_{right}$. The traversal then continues forward along the paths of $tensor_{left}$ and $tensor_{right}$ respectively. In one case, the two branches extend separately until the traversal ends, meaning the whole network has more than one input datum, which is uncommon in inference tasks; in the other case, the two branches merge again at some operator. In either case, when the splitting scheme P is determined by backtracking, mutually incompatible splitting states may be selected for the two input tensor data $tensor_{left}$ and $tensor_{right}$ of operator A.

Specifically, suppose operator A is a binary element-wise addition operator. The backtracking may select from the split state set of $tensor_{left}$ a state split only along the C dimension, while the state selected from the split state set of $tensor_{right}$ may be split only along the H dimension; the splitting modes of the addition operator represented by these two states are inconsistent, so the whole splitting scheme P becomes invalid. To solve this problem, before the traversal of operator A ends, it is guaranteed that the split state sets corresponding to $tensor_{left}$ and $tensor_{right}$ each contain only one splitting state, which ensures the determinacy of the states selected from the two state sets during backtracking. Thus, in the forward traversal phase, when the output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one splitting state is retained in the split state set of the output tensor data of the operator, and that splitting state is determined via the same state path of the operator. In the reverse traversal phase, when an operator has at least two input tensor data, one splitting state is retained in the split state set of its input tensor data, and that splitting state is determined via the same state path of the operator. Accordingly, before the traversal of a branch operator ends, the state with the smallest accumulated weight is selected from the split state set of its input data and retained, and the other splitting states in that set are removed, as shown in the sketch below.
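The pruning at a branch point can then be a one-line collapse of each input tensor's state set, keeping only the state with the smallest accumulated split-path weight $l_s$. This reuses the dist map from the previous sketch; the function name is illustrative:

```python
def collapse_to_best(state_set, dist):
    """Keep only the split state with minimum accumulated weight, so that
    backtracking cannot pick mismatched states on different branches."""
    best = min(state_set, key=lambda s: dist.get(s, math.inf))
    return {best}
```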
It should be noted that the above manner of obtaining the target split path is similar to the Viterbi algorithm, and is given only as an example rather than an exhaustive list; those skilled in the art, understanding the spirit of the present technical solution, may produce other modifications or variations. For example, the weight of each split path between the split state set of the input tensor data of the neural network model and the split state set of its output tensor data is determined by the sum of the weights of its constituent state paths; a threshold can then be set empirically, and any split path whose weight is smaller than the threshold can be used as the target split path to split the neural network model. As long as the functions and technical effects achieved are similar to those of the present application, such variants fall within the scope of protection of the present application.
Step 204): and splitting an operator of a target layer of the neural network model by using the target splitting path.
From the above description, it can be known that the hardware resources of the chip of the multi-core processor structure are fully utilized by splitting the computation logic of the operators in the neural network into smaller subtasks which are distributed to a plurality of cores to be executed in parallel.
For the technical solution shown in FIG. 2, in the most ideal case the split sub-operators would write their output tensor data to the corresponding positions in a storage space holding the complete output tensor data, so that after all sub-operators of an operator finish, a complete and continuous data block is always obtained. But this is not easy to achieve on some accelerators. First, the storage positions of a split sub-operator's output tensor data within the whole output may be discontinuous, so the code of the operator's output stage would have to be rewritten so that results can be written back to the discrete locations of a sub-tensor in storage. Meanwhile, the accelerator usually further adjusts the order of data in storage to improve memory-access efficiency during computation, which makes modifying the operator's output logic even more difficult and tedious. On the other hand, if the computation logic or splitting logic of the subsequent operator does not require storage continuity of its input data along some dimension, data output by the previous layer in a discretely stored state along that dimension can be used directly in the computation of the next layer, without guaranteeing continuity of the output data.
Therefore, the framework strips the task of adjusting the splitting form of tensor data out of the operator's computation task and abstracts it into a new operator called a glue operator. This stripping avoids modifying the output logic of every operator and enhances the portability of the framework across different underlying accelerators. The glue operator adjusts the sub-data-blocks of a tensor split in one way into sub-data-blocks formed according to another splitting way. As shown in Table 1, different kinds of operators allow different splitting modes on their input and output tensor data. When the splitting mode of the previous operator's output tensor data is not allowed by the next operator, a glue operator is used to adjust the splitting mode of the tensor data, gluing the two operators together. Moreover, even when the splitting mode output by the upper layer is supported by the lower layer, a glue operator can adjust the splitting of the tensor data into a form more favorable to the computation of the lower layer.
TABLE 1
(Table 1 is provided as an image in the original publication; it lists, for each kind of operator, the splitting manners allowed on its input tensor data and output tensor data.)
Based on the above description, the present embodiment further proposes another neural network model splitting method on the basis of fig. 2, as shown in fig. 4. On the basis of fig. 2, the method further comprises:
step 201'): inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the states in the splitting state set of the tensor data; the glue operator is used for adjusting a state of the tensor data obtained according to one splitting manner into a state obtained according to any other splitting manner.
In this step, the behavior of adjusting the splitting state of tensor data is expressed by a glue operator. The computation scale of each layer of the neural network model changes continuously as the network extends, and the splitting manner of the operators needs to be adjusted accordingly as the splitting trend of the neural network model changes; that is, the states of the intermediate results need to be adjusted. As shown in fig. 5, a glue operator is added between Op_2 and its input Tensor1 of fig. 3, so that any splitting state of the tensor data can be converted into another splitting state. For the glue operator, the input tensor data and the output tensor data have the same shape and the same state space, and every splitting state of the input tensor data has directed edges pointing to all splitting states of the output tensor data, so that a fully-connected mesh structure is formed between the splitting state set of the input tensor data and that of the output tensor data. This makes it possible to convert any splitting state of the input tensor data into another splitting state before the operator Op_2, and introduces into the search space of the splitting solution P the possibility of adjusting, before the computation of each operator starts, the splitting state of its input tensor data, that is, the splitting manner of the operator itself.
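A minimal sketch of the fully-connected mesh structure described above follows (Python; the names and the cost function are illustrative assumptions, not the application's API):

def add_glue_edges(input_states, output_states, adjust_cost):
    # Every split state of the glue operator's input tensor gets a
    # directed edge to every split state of its output tensor; the
    # edge weight is the cost of adjusting one state into the other.
    return [(s_in, s_out, adjust_cost(s_in, s_out))
            for s_in in input_states
            for s_out in output_states]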
It should be noted that fig. 5 shows a glue operator inserted between the operator and the corresponding input tensor data; alternatively, a glue operator may be inserted between the operator and the corresponding output tensor data, or between the operator and both its input tensor data and its output tensor data. This is only an exemplary case, not an exhaustive list.
When glue operators are inserted between the operators of a target layer of the neural network model and the associated splitting state sets, the splitting manners of the operators are adjusted accordingly, but the adjustment brings extra overhead; how to insert glue operators appropriately in the whole neural network model so as to improve its performance thus becomes a problem. To solve this problem, a glue operator is inserted between each operator of the target layer of the neural network model and the associated splitting state set, and a directed acyclic graph of the neural network model containing the glue operators is obtained; the splitting state sets corresponding to all tensor data of the target layer are traversed according to the directed acyclic graph, and the state paths between adjacent splitting state sets and the weights of the state paths are determined; a splitting path of the target layer of the neural network model including the glue operators is determined according to the weights of the state paths; and each inserted glue operator is then examined using this splitting path: the glue operators that do not need to be inserted are deleted, and those that are needed are retained.
For the glue operator, one implementation adopts one of the four modes splitting-splicing, splicing-splitting, splicing only, and splitting only. In the splicing stage, sub data blocks adjacent in any dimension can be merged into a new data block, and in the splitting stage, any sub data block can be split into two smaller sub data blocks; any splitting form can be converted into another splitting form by such a two-stage process. To illustrate this, suppose without loss of generality that the data itself is one-dimensional. Let the splitting form before adjustment be (0, p_1), (p_1, p_2), …, (p_{n-1}, end), where each segment represents a sub-segment of the one-dimensional data after splitting, and let the splitting form after adjustment be (0, q_1), (q_1, q_2), …, (q_{m-1}, end). If two adjacent segments (p_{i-1}, p_i), (p_i, p_{i+1}) before adjustment form exactly one segment (q_j, q_{j+1}) after adjustment, i.e. p_{i-1} = q_j and p_{i+1} = q_{j+1}, then for this part only the splicing stage is needed: (p_{i-1}, p_i) and (p_i, p_{i+1}) are spliced together and the splitting stage is skipped. Conversely, if a sub-segment before adjustment is the union of several sub-segments after adjustment, the splicing stage is skipped and the corresponding splitting is performed in the splitting stage. In the worst case, all data are merged into one complete one-dimensional block in the splicing stage, and the corresponding splitting is performed in the splitting stage.
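For the one-dimensional case just described, the two-stage adjustment can be planned as in the following sketch (Python; a hypothetical helper, with splitting forms represented by their sorted boundary lists such as [0, p_1, …, end]):

def plan_adjustment(before, after):
    # Splicing stage: interior boundaries present before but absent
    # after must disappear, so the segments around them are spliced.
    splice_away = [p for p in before[1:-1] if p not in after]
    # Splitting stage: interior boundaries absent before but present
    # after must be created by splitting.
    split_at = [q for q in after[1:-1] if q not in before]
    # An empty list means the corresponding stage is skipped; in the
    # worst case splice_away contains every interior boundary, i.e.
    # all data are first merged into one block and then re-split.
    return splice_away, split_at

For example, plan_adjustment([0, 4, 8], [0, 6, 8]) yields ([4], [6]): the boundary at 4 is removed by splicing and the boundary at 6 is introduced by splitting.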
Taking splitting-splicing or splicing-splitting as an example, assume the total size of the tensor data to be adjusted is M, that neither stage can be skipped, and that each stage must splice or split along 4 dimensions. For ease of porting, splicing and splitting are usually realized with the concatenation operator (Concat) and the slicing operator (Slice) of the neural network algorithm; since these two operators can each process only one dimension at a time, the whole glue operator brings about 8M of storage read-write overhead in the worst case. Therefore, an optimal balance point must be found between adjusting the splitting states and the extra overhead thereby introduced; the splitting manners of the operators should be adjusted at the more reasonable positions given by the rules of the network structure, while introducing as few glue operators as possible.
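The worst-case estimate above amounts to the following small calculation (Python sketch; illustrative only):

def glue_overhead(M, dims=4):
    # Concat and Slice each process one dimension per pass, so the
    # splicing stage and the splitting stage each make `dims` passes,
    # and each pass reads and writes the whole block of size M.
    passes = 2 * dims
    return passes * M  # 8M of storage read-write traffic for dims == 4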
In further detail, glue operators are processed in the same way as ordinary neural network operators: each glue operator has a corresponding time t for adjusting the tensor-data splitting state, and this time is used as the weight of the corresponding state path. Equation (5) is still used to obtain the target splitting path from the non-split state s_root of the input tensor data of the target layer of the neural network model including the glue operators to the non-split state s_end of the output tensor data of the neural network model. When selecting the glue operators, the splitting state corresponding to the input tensor data and the splitting state corresponding to the output tensor data of each glue operator on the splitting path are checked. If the two splitting states are the same (for example, the splitting state status_1 in the splitting state set Tensor_1 of fig. 5 is connected via the state path to the splitting state status_1 in the splitting state set Tensor_1', and the two splitting states are identical), the splitting path P of the target layer of the neural network model does not need to adjust the splitting state of the input tensor data of the operator Op_2; this is the result of overall performance considerations based on the forward and backward operators, and the glue operator inserted between the operator Op_2 and the corresponding input tensor data is removed from the network. Otherwise, the inserted glue operator is retained.
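The selection rule just described can be sketched as follows (Python; the network and path objects and their methods are assumptions for illustration, not the application's API):

def prune_glue_operators(network, split_path):
    for op in list(network.operators):
        if op.kind != "glue":
            continue
        s_in = split_path.state_of(op.input_tensor)
        s_out = split_path.state_of(op.output_tensor)
        # Identical states on both sides mean the adjustment is a
        # no-op on the chosen path, so the glue operator is deleted;
        # otherwise it is retained.
        if s_in == s_out:
            network.remove(op)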
It should be noted that the glue operator is implemented with operators already existing in the neural network model: the splicing stage corresponds to the Concat operator, the splitting stage corresponds to the Slice operator, and any accelerator that supports the Concat operator and the Slice operator can quickly realize the glue operator. In addition, in this embodiment the above manner of acquiring the target splitting path is similar to the Viterbi algorithm; it is only an exemplary case, not an exhaustive one, and those skilled in the art may derive other modifications or variations from the technical solution of the present application, for example: the weight of each splitting path from the splitting state set of the input tensor data of the neural network model to the splitting state set of the output tensor data is determined by the sum of the weights of the corresponding state paths, a threshold is set according to experience, and any splitting path whose weight is smaller than the set threshold may be used as the target splitting path to split the neural network model. Such variations should fall within the scope of the present application as long as the functions and technical effects achieved are similar to those of the present application.
It should be emphasized that the contents of the technical solution regarding operator splitting shown in fig. 2 are applicable to the technical solution shown in fig. 4, and are not described herein again.
For neural network models, the convolution operator is a relatively special operator, and in some cases additional auxiliary operators are needed to complete the splitting task. When the computation is divided along the H/W dimensions of the input tensor data, if the size of the convolution kernel window exceeds the stride of each movement of the window, i.e. kernel > stride, the window of a split convolution operator will, on moving to the boundary, exceed the boundary of its sub tensor data, the missing part of the data being located on the adjacent sub tensor data. In order to handle this mutual overlap of the input tensor data of the subtasks and to ensure portability, the behavior of accessing the boundary data of adjacent sub tensor data is likewise stripped out to form a new auxiliary operator, called a compensation operator.
As shown in fig. 6, the compensation operator obtains target data from the sub tensor data adjacent to a given sub tensor data and merges the target data with that sub tensor data into a larger data block, so that the moving range of the window in the computation stage does not exceed the boundary of the compensated data block. Besides the convolution operator, the pooling operator and the now less common local response normalization operator (LRN) also have the problem that the split subtasks depend on data in adjacent data blocks. For the pooling operator, as for the convolution operator, this is mainly caused by the pooling window being larger than the moving stride of the window. The local response normalization operator differs in its computation logic: to compute the result of one point of the output tensor data along the C dimension, the corresponding point of the input tensor data along the C dimension and the values of the k/2 adjacent points on each side are needed. Thus, if the computation of the local response normalization operator is split along the C dimension into multiple LRN operators, each new operator also needs the element data in the adjacent sub tensor data in order to compute the values located on the C-dimension boundaries.
Operators can be categorized into three classes according to the range of data, in a dimension of the input tensor data, that the operator needs in order to compute one data point in the corresponding dimension of the output tensor data. The first class consists of point-to-point operators, for which computing one data point of the output tensor data requires only the value of the corresponding data point of the input tensor; this class includes the activation operators (Relu, PRelu), the batch normalization operator (BatchNorm), and the elementwise basic operators for addition, subtraction, multiplication and division (Add, Sub, Mult, Div). These operators can be task-split in any dimension, and each resulting sub-operator needs only the corresponding sub tensor data as input in the computation stage. Another class consists of fully dependent operators, for which computing one data point of the output tensor data requires all the data of the input tensor data in that dimension. For example, the convolution operator and the fully connected operator require all points of the C dimension of the input tensor data when computing one point of the C dimension of the output tensor data, although splitting the convolution operator along the C dimension of the input tensor data can still be achieved by adding the partial sums afterwards. When the computation logic of the operator in that dimension is more complicated, as for the normalized exponential regression operator (Softmax), this is no longer possible; equation (6) gives its calculation formula in the normalized dimension.
O_i = exp(I_i) / Σ_j exp(I_j)    (6)
where I is the vector of the input tensor data in the normalized dimension and O is the vector of the output tensor data in the normalized dimension. Unlike the partial-sum accumulation of convolution, this computation logic is complex, and splitting is difficult to achieve. From this perspective, the compensation operator actually handles a third case lying between the point-to-point operators and the fully dependent operators, namely operators for which computing one point of the output tensor data requires the input tensor data in a region near the corresponding position, the region being determined by the compensation parameter. In this case the operators are still separable in their computation logic, although they may rely on data outside the sub tensor data, and the use of the compensation operator solves this problem uniformly.
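The three dependency classes can be summarized in the following sketch (Python; the operator names follow the text above, the helper itself is an illustrative assumption):

POINT_TO_POINT = {"Relu", "PRelu", "BatchNorm", "Add", "Sub", "Mult", "Div"}
FULLY_DEPENDENT = {"Softmax"}  # needs the whole normalized dimension

def required_input_range(op_type, point, dim_size, halo=0):
    # Returns the half-open interval of the input dimension needed to
    # compute one output point in that dimension.
    if op_type in POINT_TO_POINT:
        return (point, point + 1)
    if op_type in FULLY_DEPENDENT:
        return (0, dim_size)
    # Third, neighborhood-dependent case served by the compensation
    # operator; `halo` plays the role of the compensation parameter
    # (e.g. kernel // 2 for a stride-1 convolution, k // 2 for LRN).
    return (max(0, point - halo), min(dim_size, point + halo + 1))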
Based on this, fig. 7 shows a third flowchart of a neural network model splitting method provided in the embodiment of the present disclosure.
On the basis of fig. 2, further comprising:
step 201"): inserting a compensation operator between the operator of the target layer and the associated splitting state set, and adjusting the states in the splitting state set of the input tensor data; wherein the compensation operator is configured to obtain target data from the sub tensor data adjacent to any sub tensor data of a state, and to merge the target data with that sub tensor data.
In this technical solution, to solve the problem that, when a task is split along the H/W dimensions, the window exceeds the boundary of the input sub tensor data because the window of a convolution operator or pooling operator is larger than its displacement stride, the framework introduces a compensation operator which, before the computation starts, complements each sub tensor data in a sub tensor data set with the elements of the surrounding adjacent sub tensor data. This avoids modifying the computation logic of the split convolution or pooling operators, makes the dependence on adjacent sub tensor data invisible to the convolution and pooling operators, favors quick realization by the system, and helps consistency across accelerators of different structures. However, the compensation operator itself brings extra overhead: assuming the size of the original data block is M and ignoring the overlapping portions between the compensated sub tensor data, the compensation operator introduces 2M of memory access overhead. The convolution operator and the pooling operator are the main operators of neural networks, in particular image classification neural networks, and in order to reduce the overhead caused by the compensation action, the inserted compensation operators are merged using a pyramid structure. As shown in fig. 8, in a serial sequence formed by two convolution operators of the neural network model, both convolution operators are task-split along the H/W dimensions into 4 smaller convolution operators each; the N and C dimensions of the data are omitted in the figure. Suppose the convolution kernel sizes of the two convolution operators are k_1 and k_2 and, for simplicity of calculation, the displacement stride is 1. Normally, the convolution operator Conv1 would compensate the periphery of its sub tensor data by a data width of k_1/2 before computing, which ensures that the convolution kernel does not exceed the boundaries of the input sub tensor data when the split convolution task is computed. Here, however, the convolution operator Conv1 compensates the periphery of its sub tensor data by a data width of k_1/2 + k_2/2 before computing, which gives the sub tensor data of its output data Tensor1 an overlap of width k_2 with one another, so that the convolution operator Conv2 does not need to compensate its input sub tensor data before its computation starts.
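The compensation width used by Conv1 in fig. 8 generalizes to a chain of stride-1 convolutions as in the following sketch (Python; an illustrative assumption, not the application's API):

def merged_halo(kernel_sizes):
    # Compensating the first input by the sum of the half-kernel
    # widths lets every later convolution in the chain run without
    # its own compensation step, e.g. [k1, k2] -> k1 // 2 + k2 // 2.
    return sum(k // 2 for k in kernel_sizes)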
In this way, several compensation operators used in a serial operator sequence can be merged into a single one at the top; although the memory access overhead of the first compensation increases, the total memory access overhead of the compensation operators after splitting the model is effectively reduced provided that the compensation width is far smaller than the size of the sub data blocks. On the other hand, this method causes repeated computation: the results in the overlapping portions of the sub tensor data of the output tensor data Tensor1 of the convolution operator Conv1 in fig. 8 are computed by several of the split convolution operators. Moreover, for convolution networks whose input feature images are small, the condition that the compensation width is far smaller than the size of the sub tensor data no longer holds, so the change in total memory access overhead before and after merging several compensation operators must be evaluated more carefully. To solve this problem, the merging of the compensation operators is also added to the search space of the splitting scheme, and the whole traversal process is changed from forward traversal to backward traversal. The two are in principle interchangeable, but the latter is better suited to introducing a search strategy after the merging of the compensation operators. Let the non-split state of the output tensor data of the whole neural network model be the end state s_end; any state s of any tensor data in the neural network model then has a corresponding splitting path from s to s_end with weight l_s. Before the traversal starts, the weight corresponding to s_end is 0, and the weights corresponding to all states of all remaining tensor data are ∞. Each operator is traversed in reverse according to the topological relations of the neural network model, and the possible splitting states of the input tensor data are enumerated from the splitting states of the output tensor data; besides the splitting states in which, as in the normal case, the sub tensor data do not overlap and the compensation process must be introduced, the overlapping input splitting states that need no compensation are likewise enumerated. The weights of the state paths corresponding to the latter take into account the time added by the redundantly computed portion. Equation (5) is still used to obtain the target splitting path from the non-split state s_root of the input tensor data of the target layer of the neural network model including the compensation operators to the non-split state s_end of the output tensor data of the neural network model.
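The backward traversal described above can be sketched as a shortest-path relaxation (Python; the operator iteration and the state-enumeration callbacks are hypothetical, not the application's API):

import collections
import math

def backward_traverse(ops_topo, s_end, output_states, enum_inputs):
    # cost[s] is l_s, the best split-path weight from state s to the
    # non-split end state s_end; everything else starts at infinity.
    cost = collections.defaultdict(lambda: math.inf)
    cost[s_end] = 0.0
    for op in reversed(ops_topo):
        for s_out in output_states(op):
            for s_in, w in enum_inputs(op, s_out):
                # enum_inputs yields both compensated, non-overlapping
                # input states and overlapping, compensation-free ones;
                # for the latter, w already contains the time added by
                # the redundantly computed portion.
                cost[s_in] = min(cost[s_in], cost[s_out] + w)
    return cost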
For the compensation operators, merging the several compensation operators inserted in the neural network model using a pyramid structure may yield one merged compensation operator or several merged compensation operators. In either case, the number of compensation operators after merging is smaller than the number before merging.
It should be noted that the above manner of obtaining the target splitting path is similar to the Viterbi algorithm; it is only an exemplary case, not an exhaustive one, and those skilled in the art who understand the spirit of the technical solution of the present application may derive other modifications or variations from it, for example: the weight of each splitting path from the splitting state set of the input tensor data of the neural network model to the splitting state set of the output tensor data is determined by the sum of the weights of the corresponding state paths, a threshold is set according to experience, and any splitting path whose weight is smaller than the set threshold may be used as the target splitting path to split the neural network model. Such variations should fall within the scope of the present application as long as the functions and technical effects achieved are similar to those of the present application.
It should be emphasized that the contents of the technical solution regarding operator splitting shown in fig. 2 are applicable to the technical solution shown in fig. 7, and are not described herein again.
Fig. 9 is a fourth flowchart of a neural network model splitting method according to an embodiment of the present disclosure, introducing both the glue operator and the compensation operator into the operator splitting scheme. The splitting method comprises the following steps (a sketch of the whole pipeline follows the list):
step a): determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
step b): inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the state of the operator in the splitting state set of tensor data; wherein the glue operator is to adjust a state in the set of split states of the tensor data to any one split state of the tensor data;
step c): inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the state in the split state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
step d): traversing the split state sets according to a directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
step e): determining a target splitting path of the target layer according to the weight of the state path;
step f): splitting an operator of a target layer of the neural network model by using the target splitting path.
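As announced above, the whole pipeline of steps a) to f) can be sketched at a high level as follows (Python; every function name is a hypothetical placeholder for the corresponding step, not the application's API):

def split_neural_network(model, target_layer):
    state_sets = build_split_state_sets(model, target_layer)        # step a)
    insert_glue_operators(model, target_layer, state_sets)          # step b)
    insert_compensation_operators(model, target_layer, state_sets)  # step c)
    paths = traverse_state_sets(model.dag, state_sets)              # step d)
    target = choose_target_path(paths)                              # step e)
    return apply_split(model, target_layer, target)                 # step f)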
A glue operator is inserted between each operator of the neural network model and its input tensor data, and a glue operator is inserted between the output tensor data of the neural network model and the operator that generates that output tensor data. For each tensor data tensor_i in the neural network model, its state set S_i is initialized; each element of the state set stores the state s itself together with the time t of the splitting path starting from this splitting state of the data and ending at the state s_root of the final output data of the network, the pair of values being denoted (s, t). The state set S_root corresponding to the output tensor data of the whole neural network model contains the non-split state of that data with the corresponding minimum time, i.e. (s_root, 0), and all remaining sets are empty. For a given neural network model, a topological order λ is given to all operators in the neural network model according to the dependency relations among them; the topological order must satisfy the following condition: for any operator A, all operators that depend on A must follow A in the topological order, while all operators that A depends on must precede A in the topological order.
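The topological-order condition can be checked as in the following sketch (Python; depends_on is an assumed callback returning the operators a given operator depends on):

def is_valid_topo_order(order, depends_on):
    position = {op: i for i, op in enumerate(order)}
    for op in order:
        for pred in depends_on(op):
            # Every operator that op depends on must precede op, which
            # equivalently puts every operator depending on op after it.
            if position[pred] >= position[op]:
                return False
    return True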
The splitting state sets of the operators of the neural network model are traversed backwards, taking the insertion of the compensation operators into account. In the backward traversal stage, the operators in the neural network model are traversed one by one in the reverse of the order λ; an operator A with m inputs and n outputs has input tensor data u_1, …, u_m and output tensor data v_1, …, v_n. The contents of the technical solution regarding operator splitting in the neural network model shown in fig. 2 are applicable to the technical solution shown in fig. 9, the contents regarding the glue operator in the glue-operator insertion shown in fig. 4 are applicable to the technical solution shown in fig. 9, and the contents regarding the compensation operator in the compensation-operator insertion shown in fig. 7 are applicable to the technical solution shown in fig. 9; they are not described here again. The time complexity of the backward traversal is O(N·M²), where N is the number of operators in the neural network model and M is the number of states in the largest splitting state set among the splitting state sets of all tensor data.
It should be emphasized that the contents of the operator splitting technical solution shown in fig. 2 are applicable to the technical solution shown in fig. 9, the contents of the glue operator in the operator splitting technical solution based on the glue operator shown in fig. 4 are applicable to the technical solution shown in fig. 9, and the contents of the compensation operator in the operator splitting technical solution based on the compensation operator shown in fig. 7 are applicable to the technical solution shown in fig. 9, which are not described again here.
The technical solutions shown in fig. 2, 4, 7 and 9 fully utilize the hardware resources of the multi-core system by splitting each operator of the target layer of the neural network model into smaller subtasks and distributing them to a plurality of cores for parallel execution. In the technical solutions shown in fig. 4, 7 and 9, introducing a glue operator or a compensation operator ensures that the computation graph of the split neural network model can still be realized with the operator kernel functions of a single core, avoiding the large amount of software-stack work of modifying or re-implementing operators that the underlying accelerator would otherwise require when porting the framework; the framework is thus friendlier to accelerators without good programmability. The framework can automatically generate a set of efficient splitting schemes for a given neural network and multi-core accelerator. During scheme generation, the splitting manners of the operators are reasonably adjusted according to the types and scales of the operators and the computation throughput and memory access bandwidth of the underlying hardware; the computation efficiency of the hardware cores is well balanced against the splitting degree of the operators, the mutual matching of contextual operators in splitting is also taken into account, and the splitting choices of multiple operators are planned jointly.
The present disclosure provides a neural network model splitting apparatus, which includes:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer in the neural network model;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of the target layer according to the weight of the state path;
and the splitting module is used for splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the target split path module comprises:
the first traversal unit is used for traversing all the splitting state sets of the target layer, traversing each state of the current splitting state set, and obtaining all state paths pointing to the current state and splitting paths from the initial state of the state paths to the initial state of the input tensor data of the target layer;
a first split path determining unit, configured to determine a split path of an initial state of input tensor data from the current state to the target layer according to a weight of the state path and a weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
the first selected target splitting path unit is used for obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
Preferably, the target split path module comprises:
a second traversal unit, configured to traverse all the split state sets of the target layer, and traverse each state for a current split state set to obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a second split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and the second selected target splitting path unit is used for obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
Preferably, the apparatus further comprises:
a first splitting state set optimization module, configured to, in a forward traversal phase, when output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, leave a splitting state in a splitting state set of the output tensor data of the operator, and the splitting state is determined via a same state path of the operator.
Preferably, the apparatus further comprises:
a second splitting state set optimization module, configured to, in a reverse traversal phase, when the operator has at least two input tensor data, reserve one splitting state in the splitting state set of the input tensor data of the operator, the splitting state being determined via the same state path of the operator.
Fig. 10 is a schematic diagram of a neural network model splitting hardware device according to an embodiment of the present application. The device comprises a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the neural network model splitting method described above when executing the computer program.
In the neural network model splitting hardware device provided in the embodiments of the present specification, specific functions implemented by the memory and the processor of the neural network model splitting hardware device may be explained in comparison with the foregoing embodiments in the present specification, and may achieve the technical effects of the foregoing embodiments, and therefore, details are not repeated here.
In this embodiment, the memory may include a physical device for storing information; typically, the information is digitized and then stored in a medium using an electrical, magnetic, or optical method. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM and ROM; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, and USB disks; and devices that store information optically, such as CDs or DVDs. Of course, there are other kinds of memory, such as quantum memory and graphene memory.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
In this embodiment, an embodiment of the present application further provides a readable storage medium, on which a computer program is stored, where the computer program is executed to implement the neural network model splitting method described above.
Therefore, the technical solution of the present application can realize the expansion of a deep learning accelerator from a single-core structure to a multi-core structure with little overhead, and can provide an efficient splitting scheme for a given network according to the characteristics of the underlying accelerator. Experimental results show that the scheme can effectively reduce the end-to-end latency of various networks on the multi-core accelerator.
Those skilled in the art will also appreciate that, in addition to implementing clients and servers as pure computer readable program code, the same functionality may be implemented entirely by logically programming method steps such as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a client and server can therefore be considered as a hardware component, and the means included therein for carrying out the various functions can also be considered as structures within the hardware component. Or even means for performing the functions may be conceived to be both a software module implementing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, both for the client and the server embodiments, reference may be made to the introduction of embodiments of the method described above.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that numerous variations and permutations of the present application are possible without departing from its spirit, and it is intended that the appended claims encompass such variations and permutations.

Claims (16)

1. A method for data splitting for neural network model computation implemented with a multi-core processor, the method comprising:
determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a split of tensor data of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
determining a target split path of tensor data of the target layer according to the weight of the state path; wherein the weight of the state path is the time taken to execute a subtask, or the throughput of executing the subtasks, or is determined by actually measuring the time for executing, on the multi-core processor, all the subtasks under the operator splitting manner corresponding to the state path;
and splitting tensor data of an operator of a target layer of the neural network model by using the target splitting path so as to distribute the tensor data to each core of the multi-core processor for operation.
2. The method of claim 1, wherein determining the target split path for the tensor data of the target layer comprises:
traversing all split state sets of tensor data of the target layer, traversing each state of the current split state set, and obtaining all state paths pointing to the current state and split paths from the initial state of the state paths to the initial state of the input tensor data of the target layer;
determining a split path from the current state to the initial state of the input tensor data of the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the tensor data of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
3. The method of claim 1, wherein determining the target split path for the tensor data of the target layer comprises:
traversing all split state sets of tensor data of the target layer, traversing each state of the current split state set, and obtaining all state paths taking the current state as a starting point and all split paths from an end state of the state paths to an end state of the output tensor data of the target layer;
determining a split path from the current state to a termination state of the output tensor data of the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the tensor data of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
4. The method of claim 1, wherein the number of sub-operators obtained after operator splitting of the target layer of the neural network model is an integer power of 2.
5. The method of claim 1, wherein the state in the set of split states of input tensor data for an operator of a target layer of the neural network model is determined from computational logic of the operator and the state in the set of split states of corresponding output tensor data.
6. The method of claim 1, wherein the state in the set of split states of output tensor data for an operator of a target layer of the neural network model is determined from computational logic of the operator and the state in the set of split states of corresponding input tensor data.
7. The method of claim 1, further comprising:
in a forward traversal phase, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one splitting state is reserved in a splitting state set of the output tensor data of the operator, and the splitting state is determined by the same state path of the operator.
8. The method of claim 1, further comprising:
in a reverse traversal phase, when the operator has at least two input tensor data, one split state is retained in a split state set of the input tensor data of the operator, and the split state is determined via the same state path of the operator.
9. The method of claim 1, wherein the weight of the state path is determined according to the type and scale of an operator, and multicore processor hardware parameters.
10. A data splitting device for neural network model calculation implemented by a multi-core processor is characterized by comprising the following components:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a tensor data splitting manner of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of tensor data of the target layer according to the weight of the state path; wherein the weight of the state path is the time taken to execute a subtask, or the throughput of executing the subtasks, or is determined by actually measuring the time for executing, on the multi-core processor, all the subtasks under the operator splitting manner corresponding to the state path;
and the splitting module is used for splitting tensor data of an operator of a target layer of the neural network model by using the target splitting path so as to distribute the tensor data to each core of the multi-core processor for operation.
11. The apparatus of claim 10, wherein the target split path module comprises:
the first traversal unit is used for traversing all split state sets of tensor data of the target layer, traversing each state of the current split state set, and acquiring all state paths pointing to the current state and split paths from the initial state of the state paths to the initial state of the input tensor data of the target layer;
a first split path determining unit, configured to determine a split path of an initial state of the input tensor data from the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
the first selected target splitting path unit is used for obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the tensor data of the target layer.
12. The apparatus of claim 10, wherein the target split path module comprises:
a second traversal unit, configured to traverse all split state sets of the tensor data of the target layer, traverse each state for a current split state set, and obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a second split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and the second selected target splitting path unit is used for obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the tensor data of the target layer.
13. The apparatus of claim 10, further comprising:
a first splitting state set optimization module, configured to, in a forward traversal phase, when output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, leave a splitting state in a splitting state set of the output tensor data of the operator, and the splitting state is determined via a same state path of the operator.
14. The apparatus of claim 10, further comprising:
a second splitting state set optimization module, configured to, in a reverse traversal phase, when the operator has at least two input tensor data, reserve one splitting state in the splitting state set of the input tensor data of the operator, the splitting state being determined via the same state path of the operator.
15. A neural network model splitting hardware device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201910114927.0A 2019-02-14 2019-02-14 Splitting method of neural network model and related product Active CN111563584B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201910114927.0A CN111563584B (en) 2019-02-14 2019-02-14 Splitting method of neural network model and related product
EP20756078.0A EP3926546A4 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium
US17/419,290 US20220092386A1 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium
PCT/CN2020/084416 WO2020164644A2 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910114927.0A CN111563584B (en) 2019-02-14 2019-02-14 Splitting method of neural network model and related product

Publications (2)

Publication Number Publication Date
CN111563584A CN111563584A (en) 2020-08-21
CN111563584B true CN111563584B (en) 2022-12-09

Family

ID=72071347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910114927.0A Active CN111563584B (en) 2019-02-14 2019-02-14 Splitting method of neural network model and related product

Country Status (1)

Country Link
CN (1) CN111563584B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084037A (en) * 2020-09-23 2020-12-15 安徽寒武纪信息科技有限公司 Memory allocation method and device of neural network
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108292241B (en) * 2015-10-28 2022-05-24 谷歌有限责任公司 Processing a computation graph
WO2017083399A2 (en) * 2015-11-09 2017-05-18 Google Inc. Training neural networks represented as computational graphs
CN106650922B (en) * 2016-09-29 2019-05-03 清华大学 Hardware neural network conversion method, computing device, software and hardware cooperative system
US11010658B2 (en) * 2017-12-22 2021-05-18 Intel Corporation System and method for learning the structure of deep convolutional neural networks

Also Published As

Publication number Publication date
CN111563584A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111626430B (en) Data processing method and related product
US20220391678A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US20220391665A1 (en) Method for splitting neural network model by using multi-core processor, and related product
CN111563584B (en) Splitting method of neural network model and related product
CN111563587B (en) Splitting method of neural network model and related product
CN112579063B (en) Acceleration method for exploring optimization space in deep learning compiler
CN113703775B (en) Compiling method, compiling device, compiling equipment and storage medium
WO2020164644A2 (en) Neural network model splitting method, apparatus, computer device and storage medium
WO2021057722A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
CN111562977B (en) Neural network model splitting method, device, storage medium and computer system
CN114008594A (en) Scheduling operations on a computational graph
JP2022006167A (en) Optimizing methods for starting video playback, device, electronic device, computer-readable storage medium, and computer program
Tanaka et al. Automatic graph partitioning for very large-scale deep learning
CN113822173A (en) Pedestrian attribute recognition training acceleration method based on node merging and path prediction
US9594863B2 (en) Method for determining by optimization a multi-core architecture
CN111563585B (en) Splitting method of neural network model and related product
CN111563586B (en) Splitting method of neural network model and related product
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
WO2023138202A1 (en) Quantum circuit simulation method and apparatus, device, storage medium, and program product
CN116090521A (en) Automatic optimization method and device for neural network tensor operator combination
EP4156027A1 (en) Neural network architecture search method, apparatus and system
US11455533B2 (en) Information processing apparatus, control method, and non-transitory computer-readable storage medium for storing information processing program
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
CN111985631B (en) Information processing apparatus, information processing method, and computer-readable recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant