CN111563587A - Splitting method of neural network model and related product


Info

Publication number
CN111563587A
Authority
CN
China
Prior art keywords: operator, state, split, splitting, tensor data
Prior art date
Legal status
Granted
Application number
CN201910115162.2A
Other languages
Chinese (zh)
Other versions
CN111563587B (en)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910115162.2A
Priority to PCT/CN2020/084416
Priority to EP20756078.0A
Priority to US17/419,290
Publication of CN111563587A
Application granted
Publication of CN111563587B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The scheme splits an operator into a plurality of smaller-scale sub-operators, so that the computation library under the single-core architecture can be called directly, avoiding the additional workload of re-implementation.

Description

Splitting method of neural network model and related product
Technical Field
Embodiments of the present disclosure relate to a method for splitting a neural network model and a related product.
Background
In recent years, deep learning accelerators have been proposed and, like general-purpose processors, are expanding from single-core to multi-core architectures. The expanded multi-core structure can support data parallelism in the training stage to improve data throughput and accelerate training. In the inference stage, however, deep neural networks place a higher demand on end-to-end latency than on throughput, and latency often determines whether the accelerator is usable in a given scenario. Traditional data-parallel schemes cannot meet the requirements for small data batches and low latency placed on accelerators in inference scenarios.
Disclosure of Invention
In order to solve the above technical problem, the present disclosure provides a method for splitting a neural network model and a related product.
In order to achieve the above object, the present disclosure provides a neural network model splitting method, including:
determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into a state obtained according to any splitting mode;
inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the state in the split state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
traversing the split state sets according to a directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
determining a target splitting path of the target layer according to the weight of the state path;
and splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the step of inserting a glue operator between the operator of the target layer and the associated set of split states comprises:
inserting a glue operator between the operator of the target layer and the associated split state set to obtain a directed acyclic graph of the neural network model including the glue operator;
determining state paths between adjacent split state sets and weights of the state paths according to the split state sets corresponding to all tensor data of the directed acyclic graph traversing the target layer;
determining a splitting path of a target layer of the neural network model including the glue operator according to the weight of the state path;
and selecting each inserted glue operator by using the splitting path of the target layer of the neural network model including the glue operator, deleting the glue operators which do not need to be inserted, and reserving the glue operators which need to be inserted.
Preferably, the glue operator is configured to splice states in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator is configured to split a state in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator is configured to splice states in the split state set of the input tensor data of the glue operator, and split the state in the split state set after the splicing processing.
Preferably, the glue operator is configured to split a state in a split state set of input tensor data of the glue operator, and then splice states in the split state set after the split processing.
Preferably, the step of inserting a compensation operator between the operator of the target layer and the associated set of split states comprises:
inserting a compensation operator between a specific type of operator in the target layer and the associated split state set of the input tensor data; wherein the specific type of operator is characterized in that elements of the input tensor data used to compute an element of the output tensor data of this kind of operator are also used to compute neighbouring elements of the output tensor data.
Preferably, the operators of the specific type are convolution operators, pooling operators and local response normalization operators.
Preferably, the step of inserting a compensation operator between the operator of the target layer and the associated set of split states further comprises:
a pyramid-form structure is employed to merge the plurality of compensation operators in the target layer.
Preferably, the step of determining the splitting path of the target layer comprises:
traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths taking the current state as a starting point and split paths from the end state of the state paths to the end state of the output tensor data of the target layer;
determining a split path of the termination state of the output tensor data of the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
Preferably, the number of sub-operators obtained after the operator of the target layer of the neural network model is split is an integer power of 2.
Preferably, the state in the split state set of the input tensor data of the operator of the target layer of the neural network model is determined according to the computation logic of the operator and the state in the split state set of the corresponding output tensor data.
Preferably, the state in the split state set of output tensor data of the operator of the target layer of the neural network model is determined according to the computation logic of the operator and the state in the split state set of the corresponding input tensor data.
Preferably, the method further comprises the following steps:
in a forward traversal phase, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one splitting state is reserved in a splitting state set of the output tensor data of the operator, and the splitting state is determined by the same state path of the operator.
Preferably, the method further comprises the following steps:
in a reverse traversal phase, when the operator has at least two input tensor data, one split state is retained in a split state set of the input tensor data of the operator, and the split state is determined via the same state path of the operator.
Preferably, the weight of the state path is determined according to the type and the scale of an operator and hardware parameters of the multi-core processor.
In order to achieve the above object, the present disclosure provides a neural network model splitting device, including:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
the glue operator inserting module is used for inserting a glue operator between the operator of the target layer and the associated splitting state set and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into a state obtained according to any splitting mode;
the compensation operator inserting module is used for inserting a compensation operator between the operator of the target layer and the associated splitting state set and adjusting the state of the splitting state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of the target layer according to the weight of the state path;
and the splitting module is used for splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the glue operator inserting module comprises:
a first insertion unit, configured to insert a glue operator between the operator of the target layer and the associated split state set, and obtain a directed acyclic graph of the neural network model including the glue operator;
the state path unit is used for determining state paths between adjacent split state sets and weights of the state paths according to the split state sets corresponding to all tensor data of the directed acyclic graph traversing the target layer;
the first target splitting path determining unit is used for determining a target splitting path of a target layer of the neural network model including the glue operator according to the weight of the state path;
and the selecting unit is used for selecting each inserted glue operator by using a target splitting path of a target layer of the neural network model including the glue operator, deleting the glue operators which do not need to be inserted, and reserving the glue operators which need to be inserted.
Preferably, the glue operator inserted by the glue operator inserting module is used for splicing states in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator inserted by the glue operator inserting module is used for splitting the state in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator inserted by the glue operator inserting module is used for splicing the states in the split state set of the input tensor data of the glue operator, and then splitting the states in the split state set after splicing.
Preferably, the glue operator inserted by the glue operator inserting module is configured to split a state in a split state set of input tensor data of the glue operator, and then splice states in the split state set after the splitting processing.
Preferably, the compensation operator inserting module comprises:
a second insertion unit, configured to insert a compensation operator between a specific type of operator in the target layer and the associated split state set of the input tensor data; wherein the specific type of operator is characterized in that elements of the input tensor data used to compute an element of the output tensor data of this kind of operator are also used to compute neighbouring elements of the output tensor data.
Preferably, the specific type of operator to which the compensation operator inserted by the second insertion unit is applicable is a convolution operator, a pooling operator, or a local response normalization operator.
Preferably, the compensation operator inserting module further comprises:
a merging unit, configured to merge the plurality of compensation operators in the target layer by using a pyramid structure.
Preferably, the target split path determining module includes:
a traversal unit, configured to traverse all the split state sets of the target layer, traverse each state for a current split state set, and obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and a second target splitting path determining unit, configured to obtain a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
Preferably, the method further comprises the following steps:
a first splitting state set optimization module, configured to, in a forward traversal phase, when output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, leave a splitting state in a splitting state set of the output tensor data of the operator, and the splitting state is determined via a same state path of the operator.
Preferably, the method further comprises the following steps:
and a second splitting state set optimization module, configured to, in a reverse traversal phase, when the operator has at least two input tensor data, retain one splitting state in the splitting state set of the input tensor data of the operator, the splitting state being determined via the same state path of the operator.
In order to achieve the above object, the present disclosure provides a neural network model splitting hardware device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the steps of the method when executing the computer program.
To achieve the above object, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method described above.
The technical scheme of the present disclosure can extend a deep learning accelerator from a single-core to a multi-core structure at low cost, and can provide an efficient splitting scheme tailored to the characteristics of a given network and the underlying accelerator; the scheme can effectively reduce the end-to-end latency of various networks on the multi-core accelerator.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram of a shared memory multi-core architecture;
fig. 2 is a flowchart of a neural network model splitting method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a serial neural network model split;
fig. 4 is a second flowchart of a neural network model splitting method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the splitting of the glue operator introduced between the operator and the input tensor data;
FIG. 6 is a schematic illustration of compensation;
fig. 7 is a third flowchart of a neural network model splitting method according to the embodiment of the present disclosure;
FIG. 8 is a schematic view of pyramid splitting;
fig. 9 is a fourth flowchart of a neural network model splitting method provided in the embodiment of the present disclosure;
fig. 10 is a schematic diagram of a neural network model splitting hardware device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described more fully hereinafter with reference to the non-limiting exemplary embodiments shown in the accompanying drawings, which illustrate the exemplary embodiments of the present disclosure and their various features and advantageous details. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known materials, components, and process techniques are omitted so as not to obscure the example embodiments of the present disclosure. The examples given are intended merely to facilitate an understanding of ways in which the example embodiments may be practiced and to enable those skilled in the art to practice them. Thus, these examples should not be construed as limiting the scope of the embodiments of the disclosure.
Unless otherwise specifically defined, technical or scientific terms used herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Further, in the various embodiments of the present disclosure, the same or similar reference numerals denote the same or similar components.
The following describes in detail specific embodiments of a neural network model splitting method and related products provided by the embodiments of the present disclosure with reference to the accompanying drawings.
In recent years, deep learning accelerators have become a rapidly growing field, thanks to the great success of deep learning itself in many domains. These emerging accelerators often hold a greater advantage than GPUs in performance-to-power ratio. As with the development of general-purpose processors, deep learning accelerators can be extended from a single-core to a multi-core architecture, and this extension is very well suited to data-parallel training in deep learning. Data parallelism refers to speeding up training by dividing the training data set into several parts and using multiple processing cores to process the sub-data sets separately. Adopting this mode in a multi-core structure lets each core process a different subset of the training data in parallel, thereby improving the throughput of the whole system and accelerating training. Therefore, a multi-core accelerator architecture can conveniently scale the computing throughput of the whole system in the training phase while each core keeps its good performance-to-power ratio.
For a chip with a multi-core processor structure, as shown in fig. 1, the shared-memory multi-core structure is a classic design. This structure is very suitable for data-parallel neural network training: each core serves as one processor in the data-parallel scheme, reads different data, and then completes the forward and backward computation of the neural network model in parallel. During the computation stage, each core maintains the good performance-to-power ratio it had under the previous single-core architecture, while the throughput of the whole system grows with the number of cores. The problem with data parallelism is that its scalability depends on the size of the data batch being processed. While this is not usually a problem in the training phase, the premise is hard to guarantee in the inference phase. In general, for neural network models serving real-time applications (including video surveillance, autonomous driving, etc.), the data to be processed arrives serially in a streaming manner, so each piece of data is very small and is often even a single picture. In this case, data parallelism cannot provide any parallelism at all, and all work may be concentrated on a single core, so the computational resources brought by multiple cores do not translate into faster task processing.
After training of the neural network model on a data set is completed, the model is deployed on a cloud server to process data sent from the outside; the application scenario thus changes from offline training to online inference. In the online inference phase, a very important metric is latency, i.e., the time from the server receiving the data to be processed to returning the processed result, and more specifically, the time spent processing the data with the neural network model. Low latency ensures that the cloud server can respond to the data sent by a client within the shortest possible time, and in more sensitive scenarios it directly determines whether a scheme is usable at all. Therefore, the requirement on the accelerator in the online inference phase changes from processing large batches of data at high throughput to processing small batches of data at low latency.
In this situation, it is difficult for traditional data parallelism or model parallelism to effectively reduce the latency of inference tasks. For data parallelism, a large batch of data is a prerequisite, which contradicts the small data batches of online inference. Model parallelism is usually adopted to handle large-scale neural network models that exceed the memory limit of a single device, and allocating different operators to different cores does not reduce the network latency. To actually reduce the latency of inference tasks on a multi-core accelerator, a method must be found that reasonably distributes the inference computation on small batches of data, or even single data items, over all the cores of the multi-core architecture, ensuring that as many cores as possible participate in the computation at every moment so as to fully utilize the resources of the multi-core architecture. The idea is to split the computation task of each operator in the neural network across multiple cores; this guarantees that multiple cores participate in the computation at every moment even when inferring on a single picture, thereby using multi-core resources to reduce latency.
However, many problems remain to be solved for multi-core accelerators. Firstly, a deep learning accelerator adapts to the data-parallel characteristics of deep learning algorithms through customized hardware design in order to improve computing throughput; the accelerator therefore usually needs a sufficient data scale to reach high computational efficiency, and further splitting within an operator reduces the computation scale on each core. When the split reaches a certain granularity, the loss of computational efficiency on each core exceeds the gain from the increased parallelism. A trade-off must therefore be struck between splitting parallelism and computational efficiency, providing a sufficient degree of parallelism while retaining sufficient computational efficiency.
On the other hand, a neural network model can be regarded as a complex computational graph composed of hundreds or even thousands of common operators. Different kinds of operators embody different algorithmic logic, which leads to different methods for splitting them. The splitting of each operator must balance not only its own computational efficiency and parallelism, but also its match with the preceding and following operators, and may even affect the global result. The rapid development of deep learning brings more and more large-scale complex networks, and finding a good parallel scheme manually is unrealistic, so an automatic method is needed to guarantee a reasonably good splitting-parallel strategy for different networks.
In addition, portability to the underlying accelerator must also be considered. For accelerators that lack sufficient programmability, the workload of extending from single core to multi-core and of modifying the software stack to implement split parallelism inside operators is very large. Traditional data parallelism and model parallelism still complete the computation task of an operator on one processing core, so they bring little extra work; cross-core parallelism of a single operator, however, requires modifying the operator's implementation, and the difficulty of this modification depends on the programmability of the accelerator and the complexity of the original operator implementation logic. How to reduce the extra overhead of low-latency inference on a multi-core architecture, and to relieve the dependence of the implementation workload on the accelerator's programmability so that the method retains generality for future multi-core accelerators, is also a problem to be considered.
Based on the above analysis, the scheme automatically produces an end-to-end splitting plan for a large-scale neural network model: one operator is split into multiple smaller sub-operators, so that the computation library under the single-core architecture can be called directly, avoiding the extra workload of re-implementation. For example, after an activation operator is split, several smaller activation operators are obtained, which means that each subtask is completed by calling the original single-core activation function on multiple cores, without modifying or re-implementing a multi-core version of the activation function. In this process, not only the computational efficiency and parallelism of each operator after splitting must be considered, but also the coordination of splitting between neighbouring operators. The final goal is a splitting-parallel scheme that effectively reduces the end-to-end inference latency of the whole neural network model.
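As an illustration of this idea, the following minimal Python sketch (not part of the disclosed scheme; the kernel, the NCHW layout, and the H-dimension split are assumptions) splits one activation operator into smaller activation sub-operators, each served by the unmodified single-core kernel:

```python
import numpy as np

def single_core_relu(x):
    # Stand-in for the existing single-core activation kernel.
    return np.maximum(x, 0)

def split_relu(x, num_cores):
    # Split the input along the H dimension (NCHW layout) into
    # num_cores sub-tensors; each subtask is just the original
    # single-core kernel applied to a smaller tensor.
    sub_inputs = np.array_split(x, num_cores, axis=2)
    sub_outputs = [single_core_relu(part) for part in sub_inputs]
    # Concatenating the sub-results reproduces the unsplit output.
    return np.concatenate(sub_outputs, axis=2)

x = np.random.randn(1, 64, 32, 32).astype(np.float32)
assert np.allclose(split_relu(x, 4), single_core_relu(x))
```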
Taking the automatic driving application as an example, the vehicle needs to analyze and process external information such as images, videos, voices and the like transmitted by the vehicle-mounted sensor during the automatic driving process. To ensure safety, the vehicle must obtain the processed results in a minimum amount of time to make a decision. The vehicle adopting the chip with the multi-core processor structure can distribute the calculation load of the neural network model for processing small-batch external information to a plurality of processor cores in a balanced manner by using the scheme, complete the processing of the information within the specified response time and return the processing result to assist the vehicle to automatically run. The technical scheme can realize the extension of the deep learning accelerator from a single-core structure to a multi-core structure with less expenditure, and can effectively reduce the end-to-end time delay of various networks on the multi-core accelerator.
In the above application scenario, the chip with the multi-core processor structure is installed in the vehicle. In practice, the multi-core processor chip can also be deployed on a cloud server, and the vehicle can send the images, video, speech, and other external information from the vehicle-mounted sensors to the cloud server through 3G/4G, WIFI, or other networks. The cloud server uses this scheme to distribute the computational load of the neural network model for processing the small batches of external information evenly over the multiple processing cores. Within the response time required for driving, the cloud server feeds the processing result back to the vehicle through the 3G/4G, WIFI, or other network. In practice, the scale of the external information collected by the vehicle-mounted sensors varies. Before deployment, the vehicle-mounted processor uses this scheme to determine a corresponding operator splitting path for external information of each scale. The operator splitting schemes corresponding to external information of different scales are stored in corresponding areas; after the multi-core processor chip acquires external information, it calls the corresponding operator splitting path to split the operators in the neural network model and distributes the computational load evenly over the processor cores.
In general, the upper framework requires a computational library to be invoked to obtain an instruction implementation on the processor for each operator in the neural network model. Specifically, the framework informs the computational library of the type and parameters of each operator, and the computational library returns the machine instructions required for each operator to execute on the processor. The framework loads data and the machine instruction to the processor through a driver, starts the processor and completes the calculation of the operator.
If the computing platform of the operator is changed from a single-core accelerator to a multi-core accelerator with similar or even identical core structures, the computing library needs to be redesigned correspondingly, so that the computing library can generate machine instructions running on a plurality of cores. In particular, since multiple cores need to read different portions of the same input tensor data, and also need to write their respective outputs back to different portions of the same output tensor data, the computation library needs to modify all of the computation instructions for each operator with respect to the read and store portions.
The neural network splitting method provided by the embodiment of the disclosure can avoid modifying the computational library of the single-core processor as much as possible, and can simultaneously realize the parallel execution of the neural network model on the multi-core processor. Specifically, the upper-layer framework divides an operator in the neural network model into a plurality of sub-operators capable of being executed in parallel, for each sub-operator, the framework calls a computing library to generate a machine instruction of the sub-operator executed on a single core, and the machine instruction of the sub-operator is loaded to different cores, so that parallel computing of the operator on the multi-core processor is realized. In particular, because the framework uses a single-core processor computational library to generate computational instructions for sub-operators, the input and output tensor data for the operators in the neural network model are likewise split into corresponding sub-tensor data as the operators are split into sub-operators.
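The flow described in the preceding two paragraphs can be sketched as follows; every name here (split_operator, compile_with_single_core_library, deploy) is a hypothetical stand-in for whatever interfaces a concrete framework, compute library, and driver expose:

```python
from typing import List

class SubOp:
    def __init__(self, name: str):
        self.name = name

def split_operator(op_name: str, num_cores: int) -> List[SubOp]:
    # The framework divides one operator into parallel sub-operators.
    return [SubOp(f"{op_name}.part{i}") for i in range(num_cores)]

def compile_with_single_core_library(sub_op: SubOp) -> bytes:
    # Stand-in for the unmodified single-core compute library: it
    # returns the machine instructions for one sub-operator.
    return f"kernel<{sub_op.name}>".encode()

def deploy(op_name: str, num_cores: int) -> dict:
    # One instruction stream per core: the driver would load each
    # stream onto a different core for parallel execution.
    return {core: compile_with_single_core_library(sub)
            for core, sub in enumerate(split_operator(op_name, num_cores))}

print(deploy("Conv1", 4))
```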
Based on the above description, as shown in fig. 2, a flowchart of a neural network model splitting method is provided for the embodiment of the present disclosure.
The method comprises the following steps:
step 201): and determining a splitting state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer.
In this embodiment, the neural network model can generally be regarded as a directed acyclic graph formed of operators and multi-dimensional tensor data; operators and tensor data are connected by directed edges, and the direction of a directed edge indicates whether the data is the input or the output of an operator. Operators are denoted op and tensor data tensor. Meanwhile, to unify the expression of the splitting modes of different operators, the framework uniformly describes each operator's splitting modes through the splitting of the tensor data associated with that operator. It is assumed that all tensor data in the network are 4-dimensional; for the input or output data of the final fully-connected layer and of the softmax (normalized exponential) regression layer of an image classification network, the actual dimensionality is less than 4, yet they are still expressed as 4-dimensional tensors. The 4 dimensions are denoted by the symbols N, C, H, W respectively, where N denotes the batch size, C denotes the number of feature maps, H denotes the height of the feature maps, and W denotes the width of the feature maps. This assumption is merely for convenience of explanation; the framework itself can support neural network models containing tensor data with any number of dimensions. Nevertheless, 4 dimensions are sufficient for a significant portion of neural network structures.
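A minimal sketch of this representation (illustrative Python under the NCHW convention above; the class names are assumptions, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tensor:
    shape: Tuple[int, int, int, int]  # (N, C, H, W)

@dataclass
class Operator:
    op_type: str                      # e.g. "Conv", "ReLU", "Add"
    inputs: List[Tensor] = field(default_factory=list)
    outputs: List[Tensor] = field(default_factory=list)

# The directed acyclic graph is the set of operators connected
# through the tensors they share as inputs and outputs.
conv_in = Tensor((1, 3, 224, 224))
conv_out = Tensor((1, 64, 112, 112))
conv = Operator("Conv", inputs=[conv_in], outputs=[conv_out])
```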
When the technical scheme is used for splitting the operators in the neural network model, the types of the operators are different, the computational logics supported by the operators are different, and different splitting strategies are provided. In order to uniformly express the splitting strategies of different operators, the splitting state of the input tensor data and the output tensor data of the operators is adopted to express the splitting of the calculation logic of the operators.
For the technical scheme, all operators in the whole neural network model can be split, and partial operators in the neural network model can also be split. In addition, the network structure and algorithm emerging in the deep learning field at present gradually obscure the physical meanings of all data dimensions and the boundaries among the data dimensions, and the technical scheme can be applied to operator splitting under more dimensions in an expanded mode.
Any split of tensor data is called a state s of the tensor data; splitting the tensor data yields a set of sub-tensor data, and the state s is characterized by this corresponding set of sub-tensor data. All possible states s0, s1, s2, … form the split state set S of the tensor data, which is generally a very large state space. This means that the space of possible operator splitting modes, represented by the split states of the tensor data, is also very large.
The state set of tensor data can be pruned with some reasonable assumptions. Firstly, the latency with which the multi-core accelerator completes an operator's computation depends on the core that takes the longest to execute its subtask, and the cores in the multi-core architecture are equivalent to each other in hardware structure, so the time each core consumes depends on the task load allocated to it. A reasonable assumption is therefore that the sizes of the split sub-operators should be approximately balanced; split states with disproportionate splits can thus be omitted from the state set S of the tensor data. In addition, the number of cores in a multi-core architecture is usually an integer power of 2, such as 1, 2, 4, 8, 16, etc., and a task whose degree of parallelism is not an integer power of 2 often causes "fragmentation" in core scheduling, so the number of sub-operators after splitting should also be an integer power of 2. Under these two assumptions, the search space of the operator splitting strategy is greatly reduced.
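These two pruning assumptions can be made concrete with a small sketch that enumerates candidate splits along one dimension, keeping only power-of-2 piece counts and near-balanced pieces (illustrative code, assuming splitting along a single dimension only):

```python
def candidate_splits(dim_size, max_cores=16):
    """Enumerate pruned split states along one dimension: the
    number of pieces is an integer power of 2, and the pieces
    are as balanced as possible."""
    states = []
    n = 1
    while n <= max_cores and n <= dim_size:
        base, rem = divmod(dim_size, n)
        # Near-balanced partition: 'rem' pieces get one extra element.
        pieces = [base + 1] * rem + [base] * (n - rem)
        states.append(pieces)
        n *= 2
    return states

# For H = 10 and up to 4 cores: [[10], [5, 5], [3, 3, 2, 2]]
print(candidate_splits(10, max_cores=4))
```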
It should be noted that not every split state of the tensor data associated with an operator represents an effective splitting mode of that operator. The split dimension of the tensor data must be supported by the operator; e.g., the input data of a normalized exponential regression operator (Softmax) should not be split in the dimension to be normalized. In addition, the splits of an operator's input and output tensors must satisfy the operator's computation logic. For example, the start and end of each sub-block into which the output data of a convolution operator is split in the H/W dimension must indeed be computed, according to the convolution kernel and stride of the operator, from the corresponding sub-block into which the input data is split in the H/W dimension; the split of the convolution operator's input data in the C dimension must be consistent with the split of the weight data in the C dimension, and the split of the output data in the C dimension must be consistent with the split of the weight data in the N dimension. In the framework, the output states are used to infer the input states of an operator backwards according to the specific logic of each operator, or the input states are used to infer the output states forwards. This ensures that the states of the associated data always represent effective splitting modes of the operators.
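For the convolution case, the input range needed by an output sub-block in the H/W dimension follows directly from the kernel size and stride. A sketch under the standard convolution output-size formula (illustrative only, not the patent's implementation):

```python
def conv_input_range(out_start, out_end, kernel, stride, pad):
    """Given a sub-block [out_start, out_end) of the convolution
    output along H (or W), return the input range it actually
    reads, following out = (in + 2*pad - kernel)//stride + 1.
    Ranges may extend into the padded region."""
    in_start = out_start * stride - pad
    in_end = (out_end - 1) * stride - pad + kernel
    return in_start, in_end

# Output rows [0, 4) of a 3x3, stride-1, pad-0 convolution
# require input rows [0, 6): neighbouring sub-blocks overlap.
print(conv_input_range(0, 4, kernel=3, stride=1, pad=0))  # (0, 6)
```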
Step 202): traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths.
As shown in fig. 3, the splitting scheme P of the entire neural network model can be seen as a jump from one split state in the split state set of each operator's input tensor data to one split state of its output tensor; the split state of the previous operator's output tensor is the split state of the next operator's input tensor. Each possible jump across an operator corresponds to a valid splitting mode of that operator. Thus, a state path represents a splitting mode of the operator.
In this technical scheme, tensor data is decomposed according to a decomposition mode to obtain a sub-tensor set, and that sub-tensor set corresponds to one split state; different decomposition modes yield different split states, and the split states obtained from all decomposition modes form the split state set. Each split state therefore corresponds to a sub-tensor set, and the sub-tensor set contains all elements of the tensor data. In addition, within one sub-tensor set, the elements of individual sub-tensors may or may not overlap.
As described above, the state path represents a splitting manner of an operator, and the computation logic of the operator is split by the splitting manner corresponding to the state path to obtain a corresponding sub-operator set. The state of the input tensor data is connected with the state of the corresponding output tensor data through a state path, and a sub tensor data set which represents a splitting state of the input tensor data is processed by sub operators in the sub operator set to obtain a sub tensor data set which corresponds to the splitting state of the output tensor data.
In this technical scheme, the weight of a state path characterizes the time for the operator to execute in parallel on the multi-core accelerator under the corresponding splitting state, and the time for the multi-core accelerator to complete one operator's computation depends on the core that takes the longest to execute its subtask. The following parameters are used to estimate the weight of a state path:
1) the computational loads c1, c2, …, cn of the n split sub-operators, where ci is calculated from the type and scale of the i-th sub-operator after splitting;
2) the memory-access data volumes d1, d2, …, dn of the n sub-operators, where di is calculated from the type and scale of the i-th sub-operator after splitting;
3) the computational throughput α of each accelerator core, determined by the performance parameters of the accelerator itself;
4) the memory-access bandwidth β of each core. Generally, multiple cores share a limited memory-access bandwidth, so β = B/n, where B is the total bandwidth of the multi-core accelerator.
The calculation formula of the weight of the state path is as follows:
t = max_{i=1,…,n}( max(c_i/α, d_i/β) )    formula (1)
The inner max operation reflects the fact that, within an operator's implementation, the computation part and the memory-access part can hide each other, i.e., computation and memory access are executed concurrently as far as possible. For some accelerators, when the scale of a sub-operator is too small, the computational throughput of each core decreases, and α can be corrected further to make the estimate more accurate. The outer max operation gives the time for the multi-core accelerator to complete the computation of one operator, which depends on the core that takes the longest to execute its subtask.
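A direct transcription of formula (1) (illustrative Python; the numbers in the usage example are assumptions, not measurements from the disclosure):

```python
def state_path_weight(loads, mem_volumes, alpha, total_bandwidth):
    """Weight of a state path per formula (1):
    t = max_i max(c_i / alpha, d_i / beta), with beta = B / n,
    assuming computation and memory access overlap within a core."""
    n = len(loads)
    beta = total_bandwidth / n  # cores share the total bandwidth B
    return max(max(c / alpha, d / beta)
               for c, d in zip(loads, mem_volumes))

# Four balanced sub-operators on a 4-core accelerator (numbers are
# illustrative): the slowest core determines the weight.
t = state_path_weight(loads=[1e9] * 4, mem_volumes=[2e8] * 4,
                      alpha=1e12, total_bandwidth=1e11)
print(t)  # 0.008 s: memory-bound, since 2e8/(1e11/4) > 1e9/1e12
```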
It should be noted that the above way of obtaining the weight of a state path is only exemplary, not exhaustive; those skilled in the art may derive other modifications or variations from the technical solution of the present application. For example, the weight of a state path may be measured not only by the time taken to execute the subtasks but also by the throughput of executing them; or the weight may be determined by actually measuring, on the multi-core processor, the time taken to execute all subtasks under the operator splitting mode corresponding to the state path. All such variations fall within the scope of the present application as long as the achieved functions and technical effects are similar.
Step 203): and determining a target splitting path of the target layer according to the weight of the state path.
In step 203, the split path of the target layer is determined using the weights of the state paths in one of two ways. The first way determines the split path by forward traversal and includes the steps of:
traversing all the split state sets of the target layer; for the current split state set, traversing each of its states to obtain all state paths pointing to the current state, together with the split paths from the initial state of the input tensor data of the target layer to the starting states of those state paths;
determining the split path from the initial state of the input tensor data of the target layer to the current state according to the weights of the state paths and the weights of the split paths, where the weight of a split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the split state sets of the target layer, obtaining the target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.
The second way is to determine the split path by reverse traversal, which includes the steps of:
traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths taking the current state as a starting point and split paths from the end state of the state paths to the end state of the output tensor data of the target layer;
determining the split path from the current state to the termination state of the output tensor data of the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
In the following, how to obtain a target split path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer after traversing all the split state sets of the target layer is described in detail by way of example.
The neural network model shown in fig. 3 is serial, and both the input tensor data and the output tensor data of the entire model are in the unsplit state. Regard the serial neural network model of FIG. 3, comprising n operators, as an operator sequence (op_1, op_2, …, op_n), and assume each operator has only one input and one output, with the output of the previous operator being the input of the next. All tensor data, including the input and output tensor data of the whole model and the intermediate result tensors between operators, then form a set (tensor_0, tensor_1, …, tensor_n), where the input of op_i is tensor_{i-1} and its output is tensor_i. Each tensor_i has a corresponding state set S_i. The goal of the search strategy is to find a mapping tensor_i → s_i between each tensor and one state in its state set; determining a specific split state for every tensor datum in the neural network model determines the splitting modes of all operators, so a mapping of all tensor data in the model to split states is called a splitting scheme P of the network model. In the computation phase, the i-th operator op_i computes output tensor data in split state r from input data in split state s; the concrete parallel computation mode is determined by the states of the input and output tensor data, and the computation time of the operator is denoted t_{s→r}, whose value depends on the corresponding splitting mode and the hardware characteristics of the underlying accelerator. The time delay T of the whole network is then given by:
T = Σ_{i=1}^{n} t_{s_{i-1}→s_i}    formula (2)
also corresponding to this is the time t used for parallel execution of the operator on the multi-core accelerator using the split modeiThus, tiCan be seen as the weight of a directed edge directed by the state of the input tensor data of the operator to the state of the output tensor data. Simultaneously, input tensor data and output tensor data as a whole neural network modelThe quantity data and the split state space corresponding to the quantity data only have one state which is not split and keeps the whole data block continuous and complete, so that the split scheme P of the neural network model starts from complete input data and finishes to complete output data, and an external user always sees a complete input and output. At this time, a good splitting scheme P is searched for a given neural network model, that is, a shortest path from the non-splitting state of the input tensor data to the non-splitting state of the output tensor data is found, and the path is passed through by selecting a state in the effective state space of each intermediate result tensor. Equations 3 and 4 give an abstract formula representation.
P = argmin_{s_1,…,s_{n-1}} Σ_{i=1}^{n} t_{s_{i-1}→s_i}    formula (3)
s.t. s_i ∈ S_i, i = 1, …, n-1; s_0 = s_root, s_n = s_end    formula (4)
It should also be noted that in fig. 3 one split state of the input tensor data may point to several split states of the output tensor data, which further enlarges the splitting space of the neural network model.
In this technical scheme, the unsplit state of the input tensor data of the whole neural network model is set as the initial state s_root. In the initial stage, the weight of the split path corresponding to s_root is 0, and the weights of the split paths corresponding to all states of all remaining tensor data are ∞. Any state s of any tensor data in the neural network model has a corresponding split-path weight l_s for the path from s_root to s. Each split state set is visited from front to back, and each state s in the current split state set is traversed in turn. Each state s has directed edges e_1, …, e_ks pointing to several split states in the next split state set. Taking the split state v in the next split state set as an example, the weight t_sv between state s and state v is obtained using formula (1), and formula (5) is used to update the weight l_v of the split path from s_root to the state v pointed to by the state path:
l_v = min(l_v, l_s + t_sv)    formula (5)
After all the split state sets have been visited in a forward traversal following the topological order of the neural network model, the target split path from the unsplit state s_root of the input tensor data of the whole neural network model to the unsplit state s_end of the output tensor data of the neural network model is obtained.
In the above description, a path that starts at the unsplit state s_root, ends at the unsplit state s_end, and passes through one state in each split state set is a split path of the neural network model. Among the split paths of the neural network model, the one with the smallest weight is selected as the target split path.
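The forward traversal with formula (5) is, in effect, a shortest-path dynamic program over the chain of split state sets. The following sketch restates it for a serial network (illustrative code; state_sets and edge_weight are hypothetical stand-ins for the split state sets S_i and the weight t_sv of formula (1)):

```python
import math

def shortest_split_path(state_sets, edge_weight):
    # state_sets[i]: the split state set S_i of tensor_i (the first
    # set usually holds only the unsplit state s_root, the last only
    # s_end).  edge_weight(i, s, v): the weight t_sv of the state
    # path s -> v across operator op_{i+1}, per formula (1).
    l = {s: 0.0 for s in state_sets[0]}
    back = {}
    for i in range(len(state_sets) - 1):
        nxt = {v: math.inf for v in state_sets[i + 1]}
        for s, ls in l.items():
            for v in state_sets[i + 1]:
                t = ls + edge_weight(i, s, v)
                if t < nxt[v]:                 # formula (5): relax
                    nxt[v] = t
                    back[(i + 1, v)] = s
        l = nxt
    best = min(l, key=l.get)                   # terminal state
    path = [best]
    for i in range(len(state_sets) - 1, 0, -1):
        path.append(back[(i, path[-1])])
    return list(reversed(path)), l[best]

# Toy usage: three tensors, intermediate states named by split dim.
sets = [["whole"], ["N", "C", "H"], ["whole"]]
weights = {("whole", "N"): 1.0, ("whole", "C"): 2.0, ("whole", "H"): 3.0,
           ("N", "whole"): 2.0, ("C", "whole"): 0.5, ("H", "whole"): 0.1}
path, cost = shortest_split_path(sets, lambda i, s, v: weights[(s, v)])
print(path, cost)  # ['whole', 'C', 'whole'] 2.5
```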
It should be noted that the neural network model shown in fig. 3 is serial, and for convenience of description in this embodiment the split state sets corresponding to the input tensor data and the output tensor data of the neural network model each contain only the unsplit state. When the split state set of the output tensor data of the neural network model is not the single unsplit state s_end but a set of multiple split states, the minimum is selected among the weights of the split paths of each split state in the split state set of the output tensor data, giving the target split path from the split state set of the input tensor data of the whole neural network model to the split state set of the output tensor data of the neural network model.
Note that the entire scheme can equivalently be converted to a search from the unsplit state s_end back to the unsplit state s_root; the two are equivalent. Similarly, when the split state set of the input tensor data of the neural network model is not the single unsplit state s_root but a set of multiple split states, the minimum is selected among the weights of the split paths of each split state in the split state set of the input tensor data, giving the target split path between the split state set of the input tensor data of the whole neural network model and the split state set of the output tensor data of the neural network model.
The neural network model shown in fig. 3 is serial, and the states in the split state sets of the model's input and output tensor data are the unsplit case. In practical applications, for the technical solutions shown in fig. 2, fig. 4, fig. 7, and fig. 9, the consistency of the splitting modes of different branches in a multi-branch neural network model must also be resolved. Operators at the junction of branches have more than one input tensor data, e.g., the element-wise addition operator (Add), the element-wise multiplication operator (Mult), and the concatenation operator (Concat). For an operator A with 2 inputs, after the operator is visited, i.e., after the split state set of its output tensor data has been enumerated according to the split state sets of its input tensor data, the two input tensor data tensor_left and tensor_right have corresponding split state sets S_left and S_right. The traversal then continues forward along tensor_left and tensor_right separately. In one case, the two branches extend directly until the traversal ends, meaning the whole network has more than one input datum, which is uncommon in inference tasks; in the other case, the two branches join again at some operator. In either case, when the splitting scheme P is determined by backtracking, split states that do not match each other may be selected for the two input tensor data tensor_left, tensor_right at operator A. Specifically, assuming operator A is a binary element-wise addition operator, the backtracking process may select from the split state set of tensor_left a state split only in the C dimension, yet select from the split state set of tensor_right a state split only in the H dimension; the splitting modes of the addition operator represented by these two split states are inconsistent, so the whole splitting scheme P becomes invalid. To solve this problem, it is guaranteed before the traversal of operator A ends that the split state sets corresponding to tensor_left and tensor_right each contain only one split state, which ensures the determinacy of the states selected from the two state sets during backtracking. Thus, in the forward traversal phase, when the output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one split state is retained in the split state set of the output tensor data of the operator, and that split state is determined via the same state path of the operator. In the reverse traversal phase, when an operator has at least two input tensor data, one split state is retained in the split state set of the input tensor data of the operator, and that split state is determined via the same state path of the operator. Therefore, before the traversal of a branch operator ends, the state with the smallest accumulated weight is selected from the split state sets of the input data and retained, and the other split states in those sets are removed.
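This branch-consistency rule amounts to pruning a tensor's split state set down to the single state with the smallest accumulated split-path weight before the traversal leaves the branch operator. A one-function sketch (illustrative; accumulated_weight stands in for the weights l_s above):

```python
def prune_to_consistent_state(split_states, accumulated_weight):
    # Before the traversal leaves a branch operator, keep only the
    # split state with the smallest accumulated split-path weight;
    # both branches are then forced to agree on this state when
    # backtracking.
    best = min(split_states, key=accumulated_weight)
    return [best]

states = ["split-C", "split-H", "split-N"]
l = {"split-C": 3.0, "split-H": 2.2, "split-N": 4.1}
print(prune_to_consistent_state(states, l.get))  # ['split-H']
```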
It should be noted that the above manner of acquiring the target splitting path is similar to the Viterbi algorithm; it is only an exemplary case, not an exhaustive one, and those skilled in the art may make other modifications or variations on the basis of the technical solution of the present application. For example: the weight of each splitting path from the split state set of the input tensor data of the neural network model to the split state set of the output tensor data is determined by the sum of the weights of the corresponding state paths; a threshold is set according to experience, and any splitting path whose weight is smaller than the set threshold may be used as a target splitting path for splitting the neural network model. Such variations shall fall within the protection scope of the present application as long as the achieved functions and technical effects are similar to those of the present application.
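As a hedged sketch of the two selection rules just mentioned (minimum-weight path versus an empirical threshold), assuming each candidate path's weight has already been accumulated as the sum of its state-path weights:

```python
def select_target_path(path_weights, threshold=None):
    """`path_weights` maps a candidate splitting path to its summed weight."""
    if threshold is not None:
        for path, weight in path_weights.items():
            if weight < threshold:
                return path            # any path under the threshold qualifies
    return min(path_weights, key=path_weights.get)  # default: global minimum

paths = {"P1": 9.3, "P2": 8.1, "P3": 10.5}
print(select_target_path(paths))                 # P2, the minimum-weight path
print(select_target_path(paths, threshold=9.5))  # P1, the first path below 9.5
```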
Step 204): and splitting an operator of a target layer of the neural network model by using the target splitting path.
From the above description, it can be known that the hardware resources of the chip of the multi-core processor structure are fully utilized by splitting the computation logic of the operators in the neural network into smaller subtasks which are distributed to a plurality of cores to be executed in parallel.
For the technical solution shown in fig. 2, in the most ideal case each split sub-operator would write its output tensor data to the corresponding position in a storage space holding the complete output tensor data, so that a complete and continuous data block is always obtained after all sub-operators of an operator have executed. However, this is not easy to achieve on some accelerators. First, the storage positions of the output tensor data of the split operators within the complete output may be discontinuous, so the output code of the operators would have to be rewritten so that the results can be written back to the corresponding discrete positions of each sub-tensor in storage. Moreover, an accelerator usually further adjusts the order of data in storage to improve memory access efficiency during calculation, which makes the work of modifying the output logic of the operators even more difficult and tedious. On the other hand, if the calculation logic or splitting logic of a subsequent operator does not require storage continuity of its input data in a certain dimension, the data output by the previous layer in a discretely stored state in that dimension can be used directly for the calculation of the next layer, without ensuring the continuity of the output data.
Therefore, the framework strips the task of adjusting the splitting form of tensor data out of the calculation task of the operators and abstracts it into a new operator, called the glue operator. This stripping avoids modifying the output logic of every operator and enhances the portability of the framework to different underlying accelerators. The glue operator is used to adjust the sub-data blocks of a tensor split in one way into the sub-data blocks formed by another splitting way. As shown in table 1, the splitting modes allowed by different kinds of operators on their input tensor data and output tensor data differ. When the splitting mode of the output tensor data of the previous operator is not allowed by the next operator, the glue operator is used to adjust the splitting mode of the tensor data, thereby gluing the two operators together. In addition, even when the splitting mode output by the upper layer is supported by the lower layer, the splitting of the tensor data can still be adjusted through a glue operator into a form more favorable to the calculation of the lower layer.
TABLE 1
[Table 1 is provided as an image in the original publication; it lists, for each kind of operator, the splitting modes allowed on its input tensor data and output tensor data.]
Based on the above description, on the basis of fig. 2, this embodiment also proposes another neural network model splitting method, as shown in fig. 4. On the basis of fig. 2, the method further comprises:
step 201'): inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into the state obtained according to any splitting mode.
In this step, the behavior of adjusting the split state of tensor data is expressed by a glue operator. The calculation scale of each layer of the neural network model changes continuously as the network extends, and the splitting mode of each operator needs to be adjusted accordingly as the splitting trend of the neural network model changes, that is, the states of the intermediate results are adjusted. As shown in fig. 5, a glue operator is added between Op_2 and its input Tensor1 of fig. 3 to convert any one split state of the tensor data into another. For the glue operator, the input tensor data and the output tensor data have the same shape and the same state space, and every split state of the input tensor data has directed edges pointing to all split states of the output tensor data, so a fully-connected mesh structure is formed between the split state set of the input tensor data and that of the output tensor data. This makes it possible to convert the split state of the input tensor data into any other split state before operator Op_2, and it introduces into the search space of the splitting scheme P the possibility of adjusting the split state of the input tensor data (that is, adjusting the splitting mode of the operator itself) before the calculation of each operator begins.
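A small sketch (hypothetical names) of the fully-connected mesh described above: the glue operator contributes a directed edge from every split state of its input tensor data to every split state of its output tensor data.

```python
from itertools import product

def glue_edges(input_states, output_states):
    """Enumerate the directed edges (in_state -> out_state) of a glue operator."""
    return list(product(input_states, output_states))

states = ["no-split", "split-N", "split-C", "split-H"]
edges = glue_edges(states, states)
print(len(edges))  # 16: a full bipartite mesh over the 4 states
```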
It should be noted that fig. 5 shows a glue operator inserted between the operator and its corresponding input tensor data; a glue operator may also be inserted between the operator and its corresponding output tensor data, or even between the operator and both its input tensor data and its output tensor data. This is only an exemplary case, not an exhaustive list.
Inserting glue operators between the operators of the target layer of the neural network model and the associated split state sets allows the splitting modes of the operators to be adjusted accordingly, but the adjustment brings extra overhead; the question is therefore how to insert glue operators appropriately in the whole neural network model so as to improve its performance. To solve this problem, a glue operator is inserted between each operator of the target layer of the neural network model and the associated split state set, and a directed acyclic graph of the neural network model containing the glue operators is obtained; state paths between adjacent split state sets and the weights of the state paths are determined by traversing the split state sets corresponding to all tensor data of the target layer according to the directed acyclic graph; a splitting path of the target layer of the neural network model containing the glue operators is determined according to the weights of the state paths; and each inserted glue operator is then screened using this splitting path, deleting the glue operators that do not need to be inserted and retaining those that do.
For the glue operator, one of four ways (split-splice, splice-split, splice only, or split only) is adopted inside the glue operator. In the splicing stage, sub-data blocks adjacent to each other in any dimension can be merged into a new data block; in the splitting stage, any sub-data block can be split into two smaller sub-data blocks. Any splitting form can be converted into another through such a two-stage process. To illustrate this, assume without loss of generality that the data itself is one-dimensional. The splitting form before adjustment is denoted {(0, p_1), (p_1, p_2), …, (p_{n-1}, end)}, where each segment represents a sub-segment of the one-dimensional data after splitting, and the splitting form after adjustment is denoted {(0, q_1), (q_1, q_2), …, (q_{m-1}, end)}. If two adjacent segments (p_{i-1}, p_i) and (p_i, p_{i+1}) before adjustment form exactly one segment (q_j, q_{j+1}) after adjustment, i.e. p_{i-1} = q_j and p_{i+1} = q_{j+1}, then when adjusting this part only the splicing stage is needed to join (p_{i-1}, p_i) and (p_i, p_{i+1}) together, and the splitting stage is skipped. In the other case, if a certain sub-segment before adjustment is the union of several sub-segments after adjustment, the splicing stage is skipped and the corresponding splitting is performed in the splitting stage. In the worst case, all data are combined into one complete one-dimensional block in the splicing stage, and the corresponding splitting is performed in the splitting stage.
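The one-dimensional argument above can be made concrete with a short sketch (hypothetical helper, boundary points standing in for the segments; both segmentations are assumed to start at 0 and end at the same total length): interior boundaries present only in the old split must be spliced away, and boundaries present only in the new split must be introduced by splitting.

```python
def adjust_segments(old_bounds, new_bounds):
    """Return (splice_points, split_points) needed to convert one split form
    of one-dimensional data into another."""
    old, new = set(old_bounds), set(new_bounds)
    splice_points = sorted(old - new)  # boundaries removed in the splicing stage
    split_points = sorted(new - old)   # boundaries added in the splitting stage
    return splice_points, split_points

# Old split {(0,2),(2,5),(5,8)} adjusted to new split {(0,5),(5,6),(6,8)}:
print(adjust_segments([0, 2, 5, 8], [0, 5, 6, 8]))  # ([2], [6])
```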
Taking split-splice or splice-split as an example, assume that the total size of the tensor data to be adjusted is M, that neither stage can be skipped, and that each stage must splice or split along 4 dimensions. For ease of porting, splicing and splitting are usually implemented with the concatenation operator (Concat) and the slicing operator (Slice) of the neural network algorithm; since these two operators can each process only one dimension at a time, the whole glue operator brings about 8M of storage read-write overhead in the worst case. Therefore, an optimal balance point must be found between adjusting the split states and the extra overhead this introduces: with as few glue operators introduced as possible, the splitting modes of the operators are adjusted at the more reasonable positions according to the regularities of the network structure.
More specifically, glue operators are treated the same as ordinary neural network operators: each glue operator has a corresponding time t for adjusting the split states of tensor data, and this time serves as the weight of the corresponding state path. Equation (5) is still used to obtain the target splitting path from the non-split state s_root of the input tensor data of the target layer of the neural network model containing the glue operators to the non-split state s_end of the output tensor data of the neural network model. When screening the glue operators, for each glue operator in the splitting path, the split state corresponding to its input tensor data and the split state corresponding to its output tensor data are checked. If the two split states are the same, for example in fig. 5 the split state status_1 in the split state set Tensor_1 is connected through a state path to the split state status_1 in the split state set Tensor_1', this shows that the splitting path P of the target layer of the neural network model does not need to adjust the split state of the input tensor data of operator Op_2; this is the result of an overall consideration of the preceding and following operators and the overall performance. The glue operator inserted between operator Op_2 and the corresponding input tensor data is then removed from the network. Otherwise, the inserted glue operator is retained.
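A minimal sketch (hypothetical names) of the screening step: a glue operator whose selected input split state equals its selected output split state performs no adjustment and is deleted.

```python
def prune_glue_operators(glue_ops, chosen_state):
    """`glue_ops` lists (name, input_tensor, output_tensor) triples;
    `chosen_state` maps each tensor to the split state picked by the path."""
    return [(name, i, o) for (name, i, o) in glue_ops
            if chosen_state[i] != chosen_state[o]]

glue_ops = [("glue_op2", "Tensor_1", "Tensor_1_prime")]
chosen = {"Tensor_1": "status_1", "Tensor_1_prime": "status_1"}
print(prune_glue_operators(glue_ops, chosen))  # []: the glue operator is removed
```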
It should be noted that the glue operator is implemented with operators that already exist in the neural network model: the splicing stage corresponds to the Concat operator in the neural network model, the splitting stage corresponds to the Slice operator, and any accelerator that already supports the Concat and Slice operators can quickly implement the glue operator. In addition, in this embodiment, the above manner of acquiring the target splitting path is similar to the Viterbi algorithm; it is only an exemplary case, not an exhaustive one, and those skilled in the art may make other modifications or variations on the basis of the technical solution of the present application. For example: the weight of each splitting path from the split state set of the input tensor data of the neural network model to the split state set of the output tensor data is determined by the sum of the weights of the corresponding state paths; a threshold is set according to experience, and any splitting path whose weight is smaller than the set threshold may be used as a target splitting path for splitting the neural network model. Such variations shall fall within the protection scope of the present application as long as the achieved functions and technical effects are similar to those of the present application.
It should be emphasized that the contents of the technical solution regarding operator splitting shown in fig. 2 are applicable to the technical solution shown in fig. 4, and are not described herein again.
For neural network models, the convolution operator is a relatively special operator, and in some cases additional auxiliary operators are needed to complete the splitting task. When the calculation is divided along the H/W dimensions of the input tensor data, if the size of the convolution kernel window exceeds the stride of each movement of the window, i.e. kernel > stride, the window of a split convolution operator will move to the boundary and cross the boundary of its sub-tensor data, with the missing part of the data lying in the adjacent sub-tensor data. In order to handle the mutual overlap of input tensor data between subtasks while ensuring portability, the behavior of accessing the boundary data of adjacent sub-tensor data is likewise stripped out into a new auxiliary operator, called the compensation operator.
As shown in fig. 6, the compensation operator is used to obtain target data from the sub-tensor data adjacent to a given sub-tensor data and merge the target data with that sub-tensor data into a larger data block, so that the moving range of the window in the calculation stage does not exceed the boundary of the compensated data block. Besides the convolution operator, the pooling operator and the (currently less common) local response normalization operator (LRN) also have the problem that a split subtask depends on data in adjacent data blocks. The pooling operator is similar to the convolution operator, the dependency being mainly caused by the pooling window being larger than the moving step of the window, while the local response normalization operator differs in calculation logic: to compute one point of the output tensor data in the C dimension, the corresponding point of the input tensor data in the C dimension and the values of the k/2 adjacent points on each side are needed. Thus, if the computation of the local response normalization operator is split into multiple LRN operators along the C dimension, each new operator also requires element data in the adjacent sub-tensor data to compute the values at its C-dimension boundaries.
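A minimal NumPy sketch (hypothetical helper, NCHW layout assumed) of what the compensation does for an H/W split: the sub-tensor is extracted together with a halo of `halo` rows and columns of element data from the surrounding region of the full tensor, i.e. from its neighbouring sub-tensors, so a window of that reach never crosses the compensated block's boundary.

```python
import numpy as np

def compensate(full, h_range, w_range, halo):
    """Extract one H/W sub-tensor of `full` (layout NCHW) plus its halo."""
    n, c, h, w = full.shape
    h0, h1 = max(h_range[0] - halo, 0), min(h_range[1] + halo, h)
    w0, w1 = max(w_range[0] - halo, 0), min(w_range[1] + halo, w)
    return full[:, :, h0:h1, w0:w1]

x = np.arange(1 * 1 * 8 * 8, dtype=np.float32).reshape(1, 1, 8, 8)
# Sub-tensor covering rows 0..4 and cols 4..8, compensated with a 1-element border:
print(compensate(x, (0, 4), (4, 8), halo=1).shape)  # (1, 1, 5, 5)
```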
Operators can be classified into three categories according to the range of data, in a certain dimension of the input tensor data, that an operator requires to compute one data point in that dimension of the output tensor data. The first category is point-to-point operators, i.e. computing one data point of the output tensor data requires only the value of the corresponding data point of the input tensor; this includes the activation operators (Relu, pRelu), the batch normalization operator (BatchNorm), and the basic element-wise operators for addition, subtraction, multiplication and division (Add, Sub, Mult, Div). These operators can have their task split in any dimension, and the resulting sub-operators need only the corresponding sub-tensor data as input in the calculation stage. Another category is fully dependent operators, i.e. computing one data point of the output tensor data requires all the data of the input tensor data in that dimension; for example, the convolution operator and the fully connected operator require all points in the C dimension of the input tensor data when computing one point in the C dimension of the output tensor data. Although splitting the convolution operator in the C dimension of the input tensor data can still be achieved by accumulating partial sums afterwards, this is no longer feasible when the calculation logic of the operator in that dimension is more complicated, e.g. the normalized exponential regression operator (Softmax); equation (6) gives its calculation formula in the normalized dimension.
O_i = exp(I_i) / Σ_j exp(I_j)    (6)
where I is the vector of the input tensor data in the normalized dimension and O is the vector of the output tensor data in the normalized dimension. Unlike the partial-sum accumulation of convolution, this computation logic is complex and difficult to split. From this point of view, the compensation operator actually handles a third case lying between point-to-point operators and fully dependent operators: computing one point of the output tensor data requires the input tensor data in a region near the corresponding position, where the region near the corresponding position is determined by the compensation parameter. In this case the operator is still separable in its calculation logic, even though it may depend on data outside the sub-tensor data, and the compensation operator solves this problem uniformly.
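A small NumPy illustration of equation (6): every output element in the normalized dimension depends on all input elements in that dimension, which is why Softmax, unlike the partial-sum trick of convolution, resists splitting along it.

```python
import numpy as np

def softmax(I):
    e = np.exp(I - I.max())  # subtract the max for numerical stability
    return e / e.sum()

I = np.array([1.0, 2.0, 3.0])
O = softmax(I)
print(O, O.sum())  # each O[i] needs the whole vector I; the outputs sum to 1
```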
Based on this, fig. 7 shows a third flowchart of a neural network model splitting method provided by an embodiment of the present disclosure. On the basis of fig. 2, the method further comprises:
step 201 "): inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the state in the split state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state and merging the target data with the sub tensor data.
In this technical scheme, in order to solve the problem that the window crosses the boundary of the input sub-tensor data when the task is split along the H/W dimensions, which occurs because the window of a convolution operator or pooling operator is larger than its displacement stride, the framework introduces a compensation operator that, before the calculation starts, pads each sub-tensor data in a sub-tensor data set with elements from the surrounding adjacent sub-tensor data. This avoids modifying the calculation logic of the split convolution or pooling operators, makes the dependency on adjacent sub-tensor data invisible to them, facilitates quick implementation by the system, and helps consistency across accelerators with different structures. However, the compensation operator itself brings extra overhead: assuming the size of the original data block is M and ignoring the overlapping portions between the compensated sub-tensor data, the compensation operator introduces 2M of memory access overhead. The convolution operator and the pooling operator are the main operators making up neural networks, especially image classification networks, so to reduce the overhead of the compensation behavior, the inserted compensation operators are merged using a pyramid structure. As shown in fig. 8, the neural network model is a serial sequence of two convolution operators; both are split along the H/W dimensions into 4 smaller convolution operators, and the N and C dimensions of the data are omitted in the figure. Suppose the convolution kernel sizes of the two convolution operators are k_1 and k_2 and, to simplify the calculation, the displacement stride is 1. Normally, the convolution operator Conv1 would compensate the periphery of its sub-tensor data with a data width of k_1/2 before the calculation, ensuring that the convolution kernel does not cross the boundary of the input sub-tensor data during the split convolution task. Here, however, Conv1 compensates the periphery of its sub-tensor data with a data width of k_1/2 + k_2/2 before the calculation, which makes the sub-tensor data of its output data Tensor1 overlap each other by a width of k_2, so the convolution operator Conv2 does not need to compensate its input sub-tensor data before its computation starts.
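Under the stated assumptions (stride 1, odd kernel sizes), the pyramid merge amounts to pre-compensating the first operator with the sum of the per-operator halo widths; a tiny sketch:

```python
def merged_halo(kernel_sizes):
    """Halo width the first operator pre-compensates for a serial conv chain."""
    return sum(k // 2 for k in kernel_sizes)

# Two convolutions with kernels k1 = 3 and k2 = 5, as in the fig. 8 example:
print(merged_halo([3, 5]))  # 3, i.e. k1/2 + k2/2 with integer halo widths
```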
In this way, multiple compensation operators used in a serial operator sequence can be merged toward the top into one. Although the memory access overhead of the first compensation increases, when the compensation width is far smaller than the size of the sub-data blocks this method effectively reduces the memory access overhead of the compensation operators after the model is split. On the other hand, it causes repeated calculation: the overlapping portions of the sub-tensor data of the output tensor data Tensor1 of convolution operator Conv1 in fig. 8 are computed in several of the split convolution operators. In addition, for convolution networks whose input feature images are small, the condition that the compensation width is far smaller than the sub-tensor data size no longer holds, so the change in total memory access overhead before and after merging several compensation operators must be evaluated more carefully. To solve this problem, the merging of compensation operators is also added to the search space of the splitting scheme, and the whole traversal process is changed from forward traversal to reverse traversal. The two are equivalent in principle, but the search strategy after merging compensation operators is better suited to the latter. The non-split state of the output tensor data of the whole neural network model is set as the end state s_end, and any state s of any tensor data in the neural network model has a corresponding weight l_s of the splitting path from s to s_end. Before the traversal starts, the weight corresponding to s_end is 0, and the weights corresponding to all states of all remaining tensor data are ∞. Each operator is traversed in reverse according to the topological relations of the neural network model. When deriving the possible split states of the input tensor data from the split states of the output tensor data, both the split states with mutually non-overlapping sub-tensor data, which under normal circumstances require the introduction of a compensation process, and the mutually overlapping input split states that require no compensation are enumerated. The weights of the state paths corresponding to the latter take into account the extra time of the redundantly computed portion. Equation (5) is still used to obtain the target splitting path from the non-split state s_root of the input tensor data of the target layer of the neural network model containing the compensation operators to the non-split state s_end of the output tensor data of the neural network model.
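The weighting of the two kinds of enumerated input split states can be sketched with a toy cost model (all parameters hypothetical, for illustration only): compensated, non-overlapping states pay the 2M access overhead mentioned above, while overlapping, uncompensated states pay for the redundantly recomputed overlap.

```python
def state_path_weight(base_time, data_size, overlap_size, compensated):
    """Toy weight for a state path under the two enumeration cases."""
    if compensated:
        return base_time + 2 * data_size           # compensation access overhead
    redundant = base_time * overlap_size / data_size
    return base_time + redundant                   # cost of recomputed overlap

print(state_path_weight(10.0, 100.0, 8.0, compensated=True))   # 210.0
print(state_path_weight(10.0, 100.0, 8.0, compensated=False))  # 10.8
```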
For the compensation operator, merging the multiple compensation operators inserted in the neural network model with a pyramid structure may yield a single merged compensation operator or several merged compensation operators. In either case, the number of compensation operators after merging is smaller than the number before merging.
It should be noted that the above manner of acquiring the target splitting path is similar to the Viterbi algorithm; it is only an exemplary case, not an exhaustive one, and those skilled in the art may make other modifications or variations on the basis of the technical solution of the present application. For example: the weight of each splitting path from the split state set of the input tensor data of the neural network model to the split state set of the output tensor data is determined by the sum of the weights of the corresponding state paths; a threshold is set according to experience, and any splitting path whose weight is smaller than the set threshold may be used as a target splitting path for splitting the neural network model. Such variations shall fall within the protection scope of the present application as long as the achieved functions and technical effects are similar to those of the present application.
It should be emphasized that the contents of the technical solution regarding operator splitting shown in fig. 2 are applicable to the technical solution shown in fig. 7, and are not described herein again.
Fig. 9 is a fourth flowchart of a neural network model splitting method according to an embodiment of the present disclosure. Introducing the glue operator and the compensation operator into an operator splitting scheme, wherein the splitting method comprises the following steps:
step a): determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
step b): inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the state in the splitting state set of tensor data of the operator; wherein the glue operator is to adjust a state in the set of split states of the tensor data to any one split state of the tensor data;
step c): inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the state in the split state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
step d): traversing the split state sets according to a directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
step e): determining a target splitting path of the target layer according to the weight of the state path;
step f): and splitting an operator of a target layer of the neural network model by using the target splitting path.
Glue operators are inserted between each operator of the neural network model and its input tensor data, and between the output tensor data of the neural network model and the operators that produce them. For each tensor data tensor_i in the neural network model, its state set S_i is initialized; each entry of a state set stores a split state s of the data together with the time t from that split state to the state s_root of the final output data of the network, and the pair is denoted (s, t). The state set S_root corresponding to the output tensor data of the whole neural network model contains only the non-split state of that data and the corresponding minimum time, (s_root, 0); all remaining sets are empty. For a given neural network model, a topological order λ is assigned to all operators in the neural network model according to the dependency relationships between them. The topological order must satisfy the following condition: for any operator A, all operators that depend on A must follow A in the topological order, while all operators that A depends on must precede A.
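The topological order λ can be produced with a standard algorithm; a minimal sketch (Kahn's algorithm, hypothetical operator names) satisfying the condition above:

```python
from collections import deque

def topological_order(deps):
    """`deps` maps each operator to the list of operators it depends on."""
    indegree = {op: len(d) for op, d in deps.items()}
    users = {op: [] for op in deps}
    for op, d in deps.items():
        for producer in d:
            users[producer].append(op)
    queue = deque(op for op, n in indegree.items() if n == 0)
    order = []
    while queue:
        op = queue.popleft()
        order.append(op)
        for u in users[op]:
            indegree[u] -= 1
            if indegree[u] == 0:
                queue.append(u)
    return order

print(topological_order({"conv1": [], "conv2": ["conv1"], "add": ["conv1", "conv2"]}))
# ['conv1', 'conv2', 'add']
```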
Considering the insertion of compensation operators, the split state set of each operator of the neural network model is traversed in reverse. In the reverse traversal stage, the operators in the neural network model are traversed one by one in the reverse order of λ; an operator A with m inputs and n outputs has input tensor data u_1, …, u_m and output tensor data v_1, …, v_n. The content of the operator splitting technical solution shown in fig. 2 applies to the technical solution shown in fig. 9, the glue operator content of the glue-operator insertion solution shown in fig. 4 applies to the technical solution shown in fig. 9, and the compensation operator content of the compensation-operator insertion solution shown in fig. 7 applies to the technical solution shown in fig. 9; these are not repeated here. The time complexity of the reverse traversal is O(N·M²), where N is the number of operators in the neural network model and M is the number of states in the largest split state set among the split state sets of all tensor data.
It should be emphasized that the content of the operator splitting technical solution shown in fig. 2 is applicable to the technical solution shown in fig. 9, the content of the glue operator in the operator splitting technical solution based on the glue operator shown in fig. 4 is applicable to the technical solution shown in fig. 9, and the content of the compensation operator in the operator splitting technical solution based on the compensation operator shown in fig. 7 is applicable to the technical solution shown in fig. 9, which is not repeated herein.
The technical solutions shown in fig. 2, 4, 7 and 9 make full use of the hardware resources of the multi-core system by splitting each operator of the target layer in the neural network model into smaller subtasks and distributing them to multiple cores for parallel execution. In the technical solutions shown in fig. 4, 7 and 9, introducing the glue operator or the compensation operator ensures that the computation graph of the split neural network model can still be realized with the operator kernel functions of a single core, avoiding the software-stack work of modifying or re-implementing a large number of operators that the underlying accelerator would otherwise face when the framework is ported; this makes the framework friendlier to accelerators without good programmability. The framework can automatically generate an efficient splitting scheme for a given neural network and multi-core accelerator. During scheme generation, the splitting modes of the operators are adjusted reasonably according to the types and scales of the operators and the computational throughput and memory access bandwidth of the underlying hardware, a good balance is struck between the computational efficiency of the hardware cores and the splitting degree of the operators, the mutual matching of contextual operators with respect to splitting is taken into account, and the splitting choices of multiple operators are planned jointly.
The present disclosure provides a neural network model splitting device, which includes:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
the glue operator inserting module is used for inserting a glue operator between the operator of the target layer and the associated splitting state set and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into a state obtained according to any splitting mode;
the compensation operator inserting module is used for inserting a compensation operator between the operator of the target layer and the associated splitting state set and adjusting the state of the splitting state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of the target layer according to the weight of the state path;
and the splitting module is used for splitting an operator of a target layer of the neural network model by using the target splitting path.
Preferably, the glue insertion operator module comprises:
a first insertion unit, configured to insert a glue operator between the operator of the target layer and the associated split state set, and obtain a directed acyclic graph of the neural network model including the glue operator;
the state path unit is used for determining state paths between adjacent split state sets and weights of the state paths according to the split state sets corresponding to all tensor data of the directed acyclic graph traversing the target layer;
the first target splitting path determining unit is used for determining a target splitting path of a target layer of the neural network model including the glue operator according to the weight of the state path;
and the selecting unit is used for selecting each inserted glue operator by using a target splitting path of a target layer of the neural network model including the glue operator, deleting the glue operators which do not need to be inserted, and reserving the glue operators which need to be inserted.
Preferably, the glue operator inserted by the glue operator inserting module is used for splicing states in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator inserted by the glue operator inserting module is used for splitting the state in the split state set of the input tensor data of the glue operator.
Preferably, the glue operator inserted by the glue operator inserting module is used for splicing the states in the split state set of the input tensor data of the glue operator, and then splitting the states in the split state set after splicing.
Preferably, the glue operator inserted by the glue operator inserting module is configured to split a state in a split state set of input tensor data of the glue operator, and then splice states in the split state set after the splitting processing.
Preferably, the insertion compensation operator module comprises:
a second insertion unit for inserting a compensation operator between a particular type operator in the target layer and the associated set of split states of the input tensor data; wherein the specific type of operator is characterized by: the elements of the input tensor data corresponding to the elements of the output tensor data used to compute the class of operators are also used to compute the neighbouring elements of the output tensor data.
Preferably, the specific type of operator to which the compensation operator inserted by the second insertion unit is applicable is a convolution operator, a pooling operator, or a local response normalization operator.
Preferably, the insertion compensation operator module further comprises:
a merging unit, configured to merge the plurality of compensation operators in the target layer by using a pyramid structure.
Preferably, the target split path determining module includes:
a traversal unit, configured to traverse all the split state sets of the target layer, traverse each state for a current split state set, and obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and a second target splitting path determining unit, configured to obtain a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
Preferably, the method further comprises the following steps:
a first splitting state set optimization module, configured to, in a forward traversal phase, when output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, leave a splitting state in a splitting state set of the output tensor data of the operator, and the splitting state is determined via a same state path of the operator.
Preferably, the method further comprises the following steps:
and in a reverse traversal phase, when the operator has at least two input tensor data, a splitting state is reserved in the splitting state set of the input tensor data of the operator, and the splitting state is determined by the same state path of the operator.
Fig. 10 is a schematic diagram of a neural network model splitting hardware device according to an embodiment of the present application. The device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the neural network model splitting method described above when executing the computer program.
In the neural network model splitting hardware device provided in the embodiments of the present specification, specific functions implemented by the memory and the processor of the neural network model splitting hardware device may be explained in comparison with the foregoing embodiments in the present specification, and may achieve the technical effects of the foregoing embodiments, and therefore, details are not repeated here.
In this embodiment, the memory may include a physical device for storing information; typically, the information is digitized and then stored in a medium using electrical, magnetic or optical methods. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM and ROM; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, memory may also take other forms, such as quantum memory or graphene memory.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
An embodiment of the present application further provides a readable storage medium on which a computer program is stored; when the computer program is executed, the neural network model splitting method described above is implemented.
As can be seen from the above, the technical solution of the present disclosure can implement the extension of the deep learning accelerator from a single-core to a multi-core structure with less overhead, and can provide an efficient splitting scheme for given network and bottom accelerator characteristics. Experimental results show that the scheme can effectively reduce the end-to-end time delay of various networks on the multi-core accelerator.
Those skilled in the art will also appreciate that, in addition to implementing clients and servers as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the clients and servers implement logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such clients and servers may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as structures within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, both for the embodiments of the client and the server, reference may be made to the introduction of embodiments of the method described above.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims (30)

1. A neural network model splitting method is characterized by comprising the following steps:
determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
inserting a glue operator between the operator of the target layer and the associated splitting state set, and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into a state obtained according to any splitting mode;
inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the state in the split state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
traversing the split state sets according to a directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
determining a target splitting path of the target layer according to the weight of the state path;
and splitting an operator of a target layer of the neural network model by using the target splitting path.
2. The method of claim 1, wherein the step of inserting a glue operator between the operator of the target layer and the associated set of split states comprises:
inserting a glue operator between the operator of the target layer and the associated split state set to obtain a directed acyclic graph of the neural network model including the glue operator;
determining state paths between adjacent split state sets and weights of the state paths according to the split state sets corresponding to all tensor data of the directed acyclic graph traversing the target layer;
determining a splitting path of a target layer of the neural network model including the glue operator according to the weight of the state path;
and selecting each inserted glue operator by using the splitting path of the target layer of the neural network model including the glue operator, deleting the glue operators which do not need to be inserted, and reserving the glue operators which need to be inserted.
3. The method of claim 1, wherein the glue operator is to stitch states in a split state set of input tensor data of the glue operator.
4. The method of claim 1, wherein the glue operator is to split a state in a set of split states of input tensor data of the glue operator.
5. The method of claim 1, wherein the glue operator is configured to stitch states in the split state set of input tensor data of the glue operator, and then split the states in the split state set after stitching.
6. The method of claim 1, wherein the glue operator is configured to split states in a split state set of input tensor data of the glue operator, and then stitch the split states in the split state set after the splitting.
7. The method of claim 1, wherein the step of inserting a compensation operator between the operator of the target layer and the associated set of split states comprises:
inserting a compensation operator between a particular type of operator in the target layer and the associated set of split states of the input tensor data; wherein the specific type of operator is characterized by: the elements of the input tensor data corresponding to the elements of the output tensor data used to compute the class of operators are also used to compute the neighbouring elements of the output tensor data.
8. The method of claim 7, wherein the operators of a particular type are convolution operators, pooling operators, and local response normalization operators.
9. The method of claim 7, wherein the step of inserting a compensation operator between the operator of the target layer and the associated set of split states further comprises:
a pyramid-form structure is employed to merge the plurality of compensation operators in the target layer.
10. The method of claim 1, wherein the step of determining a split path for the target layer comprises:
traversing all the split state sets of the target layer, traversing each state of the current split state set, and obtaining all state paths taking the current state as a starting point and split paths from the end state of the state paths to the end state of the output tensor data of the target layer;
determining a split path of the termination state of the output tensor data of the current state to the target layer according to the weight of the state path and the weight of the split path; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and after traversing all the splitting state sets of the target layer, obtaining a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer.
11. The method of claim 1, wherein the number of sub-operators obtained after operator splitting of the target layer of the neural network model is an integer power of 2.
12. The method of claim 1, wherein the state in the set of split states of input tensor data for an operator of a target layer of the neural network model is determined from computational logic of the operator and the state in the set of split states of corresponding output tensor data.
13. The method of claim 1, wherein the state in the set of split states of output tensor data for an operator of a target layer of the neural network model is determined from computational logic of the operator and the state in the set of split states of corresponding input tensor data.
14. The method of claim 1, further comprising:
in a forward traversal phase, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, one splitting state is reserved in a splitting state set of the output tensor data of the operator, and the splitting state is determined by the same state path of the operator.
15. The method of claim 1, further comprising:
in a reverse traversal phase, when the operator has at least two input tensor data, one split state is retained in a split state set of the input tensor data of the operator, and the split state is determined via the same state path of the operator.
16. The method of claim 1, wherein the weight of the state path is determined according to the type and scale of an operator, and multicore processor hardware parameters.
17. A neural network model splitting device, comprising:
the split state set module is used for determining a split state set of tensor data associated with an operator of a target layer in the neural network model according to the operator of the target layer; wherein the target layer is at least one layer of the neural network model;
the glue operator inserting module is used for inserting a glue operator between the operator of the target layer and the associated splitting state set and adjusting the state in the splitting state set of tensor data of the operator; the glue operator is used for adjusting the state of the tensor data obtained according to a splitting mode into a state obtained according to any splitting mode;
the compensation operator inserting module is used for inserting a compensation operator between the operator of the target layer and the associated splitting state set and adjusting the state of the splitting state set of the input tensor data of the operator; the compensation operator is used for acquiring target data from adjacent sub tensor data of any sub tensor data of the state, and merging the target data with the sub tensor data;
the state path module is used for traversing the split state sets according to the directed acyclic graph of the neural network model, and determining state paths between adjacent split state sets and weights of the state paths; wherein the state path represents a splitting mode of the operator; each state in the split state set represents a sub-tensor data set, and the union result of all sub-tensor data of the state is the tensor data;
the target splitting path module is used for determining a target splitting path of the target layer according to the weight of the state path;
and the splitting module is used for splitting an operator of a target layer of the neural network model by using the target splitting path.
18. The apparatus of claim 17, wherein said insert glue operator module comprises:
a first insertion unit, configured to insert a glue operator between the operator of the target layer and the associated split state set, and obtain a directed acyclic graph of the neural network model including the glue operator;
the state path unit is used for determining state paths between adjacent split state sets and weights of the state paths according to the split state sets corresponding to all tensor data of the directed acyclic graph traversing the target layer;
the first target splitting path determining unit is used for determining a target splitting path of a target layer of the neural network model including the glue operator according to the weight of the state path;
and the selecting unit is used for selecting each inserted glue operator by using a target splitting path of a target layer of the neural network model including the glue operator, deleting the glue operators which do not need to be inserted, and reserving the glue operators which need to be inserted.
19. The apparatus of claim 17, wherein the glue operator inserted by the glue operator module is to stitch states in the split state set of input tensor data of the glue operator.
20. The apparatus of claim 17, wherein the glue operator inserted by the glue operator module is to split a state in a set of split states of input tensor data of a glue operator.
21. The apparatus of claim 17, wherein the glue operator inserted by the glue operator inserting module is configured to splice states in the split state set of input tensor data of the glue operator, and then split the states in the split state set after the splicing process.
22. The apparatus of claim 17, wherein the glue operator inserted by the glue operator inserting module is configured to split states in the split state set of the input tensor data of the glue operator and then stitch the split states in the split state set after the splitting process.
23. The apparatus of claim 17, wherein the insertion compensation operator module comprises:
a second insertion unit for inserting a compensation operator between a particular type operator in the target layer and the associated set of split states of the input tensor data; wherein the specific type of operator is characterized by: the elements of the input tensor data corresponding to the elements of the output tensor data used to compute the class of operators are also used to compute the neighbouring elements of the output tensor data.
24. The apparatus of claim 23, wherein the specific type of operator to which the compensation operator inserted by the second insertion unit applies is a convolution operator, a pooling operator, a local response normalization operator.
25. The apparatus of claim 23, wherein the insertion compensation operator module further comprises:
a merging unit, configured to merge the plurality of compensation operators in the target layer by using a pyramid structure.
26. The apparatus of claim 17, wherein the target split path determination module comprises:
a traversal unit, configured to traverse all the split state sets of the target layer, traverse each state for a current split state set, and obtain all state paths using the current state as a starting point and all split paths from an end state of the state path to an end state of the output tensor data of the target layer;
a split path determining unit, configured to determine, according to the weight of the state path and the weight of the split path, a split path from the current state to a termination state of the output tensor data of the target layer; the weight of the split path is determined according to the weights of all state paths corresponding to the split path;
and a second target splitting path determining unit, configured to obtain a target splitting path between the splitting state set of the input tensor data of the target layer and the splitting state set of the output tensor data of the target layer after traversing all the splitting state sets of the target layer.
27. The apparatus of claim 17, further comprising:
a first splitting state set optimization module, configured to, in a forward traversal phase, when the output tensor data of an operator is used as input tensor data by at least two operators, or the operator has at least two pieces of output tensor data, retain in the split state set of the output tensor data of the operator only the splitting states determined via a same state path of the operator.
28. The apparatus of claim 17, further comprising:
a second splitting state set optimization module, configured to, in a reverse traversal phase, when an operator has at least two pieces of input tensor data, retain in the split state set of the input tensor data of the operator only the splitting states determined via a same state path of the operator.
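Claims 27 and 28 prune the search at branch points: a tensor read by two or more operators (forward phase), or an operator fed by two or more tensors (reverse phase), is pinned to splitting states reached via one common state path so that all branches remain consistent. A hedged sketch of the forward-phase rule, assuming each candidate state records the weight of the path that produced it; selecting the least-weight state is an illustrative assumption:

```python
def prune_shared_output_states(split_states, num_consumers):
    # Forward traversal (claim 27): a tensor consumed by two or more
    # operators must present one and the same splitting state to all
    # of them, so collapse the set to a single state -- here the one
    # whose producing state path has the least weight (an assumption).
    if num_consumers < 2:
        return split_states
    return [min(split_states, key=lambda st: st["path_weight"])]

# Example: three candidate states for a tensor with two consumers.
states = [{"split": "1x4", "path_weight": 2.5},
          {"split": "2x2", "path_weight": 1.75},
          {"split": "4x1", "path_weight": 3.0}]
print(prune_shared_output_states(states, 2))   # keeps the "2x2" state
```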
29. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 16.
30. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 16.
CN201910115162.2A 2019-02-14 2019-02-14 Splitting method of neural network model and related product Active CN111563587B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201910115162.2A CN111563587B (en) 2019-02-14 2019-02-14 Splitting method of neural network model and related product
PCT/CN2020/084416 WO2020164644A2 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium
EP20756078.0A EP3926546A4 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium
US17/419,290 US20220092386A1 (en) 2019-02-14 2020-04-13 Neural network model splitting method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111563587A (en) 2020-08-21
CN111563587B CN111563587B (en) 2022-12-09

Family

ID=72069869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910115162.2A Active CN111563587B (en) 2019-02-14 2019-02-14 Splitting method of neural network model and related product

Country Status (1)

Country Link
CN (1) CN111563587B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650922A (en) * 2016-09-29 2017-05-10 清华大学 Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN108009627A (en) * 2016-10-27 2018-05-08 谷歌公司 Neutral net instruction set architecture
US20180314940A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Incremental precision networks using residual inference and fine-grain quantization
CN108875923A (en) * 2018-02-08 2018-11-23 北京旷视科技有限公司 Data processing method, device and system and storage medium for neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIN, Lilei et al.: "A Hybrid Pruning Method for Convolutional Neural Network Compression", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022218373A1 (en) * 2021-04-16 2022-10-20 中科寒武纪科技股份有限公司 Method for optimizing convolution operation of system on chip and related product
TWI810549B (en) * 2021-04-16 2023-08-01 馬爾他商優奈有限公司 Explainable neural network, related computer-implemented method, and system for implementing an explainable neural network
CN113485836A (en) * 2021-07-21 2021-10-08 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant