CN115668222A - Data processing method and device of neural network - Google Patents

Data processing method and device of neural network

Info

Publication number
CN115668222A
Authority
CN
China
Prior art keywords
data
neural network
layer
layers
batch
Prior art date
Legal status
Pending
Application number
CN202180037755.7A
Other languages
Chinese (zh)
Inventor
袁宏辉
高山青
高立稳
熊乐进
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN115668222A publication Critical patent/CN115668222A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a data processing method and apparatus for a neural network, and relates to the field of artificial intelligence. The method dynamically segments the input data and sets different batch sizes for different layers of the neural network according to the data volume of the input data, a first characteristic of the internal memory within the chip running the neural network, and a second characteristic of the plurality of layers in the neural network. Because a suitable batch size is set for each layer, the internal memory can be fully used to store the inter-layer data of the neural network during inference, which improves the utilization of the internal memory and ensures the computational efficiency of the hardware running the neural network.

Description

Data processing method and device of neural network
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and in particular, to a data processing method and apparatus for a neural network.
Background
As the computing power of processors in computer systems keeps growing, processor performance keeps improving. To address the "memory wall" problem, in which the limited bandwidth of the external memory cannot keep pace with the processing speed of the processor, computer systems are equipped with a multi-level cache structure that has higher bandwidth but smaller capacity.
During neural network inference, after each layer of the neural network finishes processing its input data, processing proceeds to the next layer. If the amount of input data is large, the size of the inter-layer data produced by some layers of the neural network may be too large for the cache to hold, so the inter-layer data has to be stored in the external memory. Because the cache is not used efficiently, the computational efficiency of the processor is reduced.
To address this problem, the conventional technique groups the input data according to the cache requirement of each layer's inter-layer data to obtain multiple batches of the same batch size (batch size), where the batch size is limited by the layer with the largest cache requirement. The neural network finishes processing one batch before it processes the next batch. By reducing the amount of data processed by each layer of the neural network, the inter-layer data becomes smaller and can be kept in the cache as much as possible.
Because different layers in the neural network perform different operations on the data, the sizes of the inter-layer data differ. For example, if a layer in the neural network enlarges a picture, the inter-layer data it generates is large; if a layer shrinks a picture, the inter-layer data it generates is small. For a layer that outputs small inter-layer data, a small batch size produces small inter-layer data and leaves much of the cache capacity unused; for a layer that outputs large inter-layer data, a large batch size produces large inter-layer data and leaves little remaining cache capacity, so the cache may be unable to hold the inter-layer data. In short, when input data is processed with the single batch size determined by the conventional technique, the utilization of the cache remains low, which affects the computational efficiency of the hardware running the neural network. In addition, if the data is divided into many groups, the overhead of each layer processing each batch increases, which reduces the computational efficiency of the hardware running the neural network. Therefore, how to improve cache utilization while ensuring the computational efficiency of the hardware running the neural network is an urgent problem to be solved.
Disclosure of Invention
The application provides a data processing method and apparatus for a neural network, which can improve the utilization of the cache and ensure the computational efficiency of the hardware running the neural network. To achieve this, the following technical solutions are adopted in the present application.
In a first aspect, the present application provides a data processing method for a neural network. The method includes: the processor groups the input data according to the data amount of the input data, a first characteristic of the internal memory within the chip running the neural network, and a second characteristic of a plurality of layers in the neural network, and determines the batch size of each layer in the neural network such that the batch sizes of at least two of the plurality of layers are different. For example, the batch size of every layer in the neural network is different. As another example, the neural network includes some layers with the same batch size and some layers with different batch sizes. The first characteristic includes at least one of the distribution characteristic of the internal memory within the chip and the capacity of the internal memory. The second characteristic includes the connection relationships between the plurality of layers and the computation-related parameters of each of the plurality of layers. The batch corresponding to a batch size may be one picture, a plurality of pictures, or a partial image of one picture.
It should be understood that by internal memory is meant memory within the chip running the neural network. For example, the memory within the chip running the neural network is a cache. By external memory is meant off-chip memory that runs the neural network. The internal memory may also be referred to as on-chip memory. The external memory may also be referred to as off-chip memory.
According to the data processing method of the neural network, the input data is segmented with comprehensive reference to the data amount of the input data, the first characteristic, and the second characteristic, and different batch sizes are set for layers in the neural network. By setting a reasonable batch size for each layer, the internal memory is fully used to store the inter-layer data of the neural network during inference, and the interaction between the chip running the neural network and the external memory is reduced, which improves the utilization of the internal memory and ensures the computational efficiency of the chip running the neural network.
Specifically, determining the batch sizes of the plurality of layers according to the data amount, the first characteristic, and the second characteristic includes: determining the batch sizes of the plurality of layers, N subgraphs, M graphs, and the storage locations of the inter-layer data according to the data amount, the first characteristic, and the second characteristic, where N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M.
Wherein the storage location of the inter-layer data includes at least one of an internal memory or an external memory. In one possible implementation, the inter-layer data of the plurality of layers included in the subgraph is stored in the internal memory. In another possible implementation, inter-layer data between subgraphs is stored in the internal memory. In another possible implementation, inter-layer data is stored in the external memory.
A subgraph contains one or more layers of the same batch size. Optionally, different subgraphs may contain the same or different numbers of layers. Described differently, a subgraph may also be referred to as a first-type layer group.
A graph includes one or more subgraphs. Different graphs may contain the same or different numbers of subgraphs. Described differently, a graph may also be referred to as a second-type layer group.
In some possible designs, the processor may use an iterative algorithm to determine the batch sizes of the plurality of layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data according to the data amount, the first characteristic, and the second characteristic. Understandably, the processor does not obtain the batch sizes, the N subgraphs, the M graphs, and the storage locations of the inter-layer data from the data amount, the first characteristic, and the second characteristic in a single calculation; instead, it runs multiple iterative trials of the algorithm and selects the batch sizes, the N subgraphs, the M graphs, and the storage locations of the inter-layer data from the trial results, so as to ensure the utilization of the internal memory and the computational efficiency of the chip running the neural network.
The iterative algorithm may be an optimization algorithm such as a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm.
The basic idea of a dynamic programming algorithm is to decompose the problem to be solved into several sub-problems, solve the sub-problems first, and then derive the solution of the original problem from the solutions of the sub-problems.
The basic idea of a greedy algorithm is to start from an initial solution of the problem and proceed step by step, where each step makes the choice that is locally optimal according to some optimization measure. Only one piece of data is considered at each step, and its selection must satisfy the local optimization condition. If adding the next piece of data to the partial solution would no longer yield a feasible solution, that data is not added to the partial solution; the algorithm stops when all the data has been enumerated or no more data can be added.
A genetic algorithm is designed by drawing on the laws of evolution in the biological world and searches for an optimal solution by simulating natural evolution.
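As an illustration only (not the application's actual algorithm), the following minimal Python sketch shows the flavor of a greedy partition: layers are walked in order and the current subgraph is extended while the inter-layer data of its layers, at their chosen batch sizes, still fits in the internal memory. The function name, the per-layer buffer model, and the numbers are all assumptions made for this example.

```python
# Hypothetical greedy sketch, for illustration only; not the algorithm claimed
# by this application. layer_buffer_needs[i] is an assumed cache requirement of
# layer i at its chosen batch size, in arbitrary units.

def greedy_subgraphs(layer_buffer_needs, capacity):
    """Greedily group consecutive layers into subgraphs that fit in the internal memory."""
    subgraphs, current, used = [], [], 0
    for layer, need in enumerate(layer_buffer_needs):
        if current and used + need > capacity:   # current subgraph would overflow
            subgraphs.append(current)            # close it and start a new one
            current, used = [], 0
        current.append(layer)
        used += need
    if current:
        subgraphs.append(current)
    return subgraphs

print(greedy_subgraphs([60, 30, 15, 40, 70], capacity=100))  # [[0, 1], [2, 3], [4]]
```

A dynamic programming or genetic variant would instead search over many candidate partitions and batch-size assignments and keep the one with the best estimated memory utilization.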
It should be noted that, when the input data of the neural network is processed according to the result of partitioning the layers of the neural network, the scheduling order of the layers in a graph is determined by the scheduling order of the subgraphs included in the graph and the scheduling order of the layers within each subgraph. The scheduling order of the layers in a subgraph is the same as the order of those layers in the neural network. For example, the batches corresponding to the batch size of the layers contained in a subgraph are processed in the order of the layers contained in the subgraph. The scheduling order of the subgraphs contained in a graph is determined by the batch sizes and the scheduling order of the first layer and the last layer in each subgraph. The inter-layer data between the subgraphs contained in a graph is aggregated or scattered.
In a second aspect, an embodiment of the present application further provides a data processing apparatus for a neural network; for its beneficial effects, refer to the description of the first aspect, which is not repeated here. The data processing apparatus of the neural network has the function of implementing the processor behavior in the method example of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function. In one possible design, the data processing apparatus of the neural network includes an acquisition unit and a processing unit. The acquisition unit is configured to acquire the data amount of the input data of the neural network, a first characteristic of the internal memory within the chip running the neural network, and a second characteristic of a plurality of layers in the neural network. The processing unit is configured to determine the batch size of each of the plurality of layers according to the data amount, the first characteristic, and the second characteristic, where the batch sizes of at least two of the plurality of layers are different. These modules may perform the corresponding functions in the method example of the first aspect; for details, refer to the method example, which is not repeated here.
In a third aspect, a data processing apparatus for a neural network is provided, and the data processing apparatus of the neural network may be a processor, such as a graphics processing unit (GPU), a neural network processor (NPU), or an advanced RISC machine (ARM) processor. Optionally, the data processing apparatus of the neural network further includes a memory. The memory is configured to store a computer program or instructions, and the processor is coupled to the memory; when the processor executes the computer program or instructions, the data processing apparatus of the neural network performs the method performed by the processor in the above method embodiment.
In a fourth aspect, there is provided a computer program product comprising: computer program code which, when run, causes the method performed by the processor in the first aspect described above to be performed.
In a fifth aspect, the present application provides a chip system, which includes a processor, and is configured to implement the functions of the processor in the method of the first aspect. In one possible design, the system-on-chip further includes a memory to store at least one of program instructions or data. The chip system may be formed by a chip, or may include a chip and other discrete devices.
In a sixth aspect, the present application provides a computer-readable storage medium storing a computer program that, when executed, implements the method performed by the processor in the first aspect described above.
In the present application, the names of the processor and the data processing apparatus of the neural network do not limit the devices themselves; in practical implementations, these devices may appear under other names. As long as the function of each device is similar to that described in the present application, it falls within the scope of the claims of the present application and their equivalents.
Drawings
FIG. 1 is a schematic diagram of a neural network provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a neural network system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a neural network chip according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a processing device according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a layer structure in a neural network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the overlap problem provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a sub-diagram provided in an embodiment of the present application;
FIG. 8 is a schematic illustration of a diagram provided in accordance with an embodiment of the present application;
fig. 9 is a schematic diagram of performing aggregation processing on inter-layer data between subgraphs according to an embodiment of the present application;
fig. 10 is a schematic diagram illustrating a process of scattering inter-layer data between subgraphs according to an embodiment of the present application;
FIG. 11 is a schematic illustration of a process of a graph provided in accordance with an embodiment of the present application;
fig. 12 is a flowchart of a data processing method of a neural network according to an embodiment of the present application;
FIG. 13 is a diagram illustrating a process for processing data by a neural network according to an embodiment of the present application;
FIG. 14 is a diagram illustrating a process of processing data by a neural network according to an embodiment of the present application;
FIG. 15 is a diagram illustrating a process of processing data by a neural network according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a data processing apparatus of a neural network according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a data processing apparatus of a neural network according to an embodiment of the present application;
fig. 18 is a schematic diagram of sub-graph partitioning on the resnet50 network according to an embodiment of the present application.
Detailed Description
The terms "first," "second," and "third," etc. in the description and claims of this application and the above-described drawings are used for distinguishing between different objects and not for limiting a particular order. In the embodiments of the present application, the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
Neural Networks (NN) may also be referred to as Artificial Neural Networks (ANN) or neural-like networks. In the field of machine learning and cognitive science, neural networks are mathematical or computational models that mimic the structure and function of biological neural networks (the central nervous system of animals, particularly the brain) and are used to estimate or approximate functions. The neural network may include a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a multilayer perceptron (MLP), and a Recurrent Neural Network (RNN).
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs. The output of the arithmetic unit satisfies the following formula (1).

$$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right) \qquad (1)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
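For concreteness, here is a minimal numerical illustration of formula (1) in Python, assuming a sigmoid activation function; the input values and weights are arbitrary.

```python
import math

def neural_unit(x, w, b):
    """Output of one neural unit per formula (1): f(sum_s W_s * x_s + b),
    here with f chosen as the sigmoid activation."""
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(neural_unit(x=[0.5, -1.0, 2.0], w=[0.3, 0.8, -0.2], b=0.1))  # about 0.28
```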
Fig. 1 is a schematic diagram of a neural network according to an embodiment of the present application. The neural network 100 has N processing layers, where N is greater than or equal to 3 and is a natural number. The first layer of the neural network is an input layer 110, which is responsible for receiving input signals, and the last layer of the neural network is an output layer 130, which outputs the processing results of the neural network. The other layers except the first and last layers are intermediate layers 140, and these intermediate layers 140 collectively constitute the hidden layers 120, and each intermediate layer 140 in the hidden layers 120 can receive either an input signal or an output signal. The hidden layer 120 is responsible for processing the input signal. Each layer represents a logic level of signal processing, and through multiple layers, data signals may be processed through multiple levels of logic.
The input signal to the neural network may be various forms of signals such as a video signal, a voice signal, a text signal, an image signal, a temperature signal, and the like in some possible embodiments. In this embodiment, the processed image signal may be various sensor signals such as a landscape signal captured by a camera (image sensor), an image signal showing a community environment captured by a monitoring device, and a face signal of a human face acquired by an access control system. The input signals of the neural network also include various other computer-processable engineering signals, which are not listed here. If the neural network is used for deep learning of the image signal, the image quality can be improved.
(2) Deep neural network
A deep neural network, also called a multi-layer neural network, may be understood as a neural network with multiple hidden layers. The deep neural network is divided according to the positions of different layers, and the layers in the deep neural network can be divided into three types: input layer, hidden layer, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer.
Although deep neural networks seem complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a deep neural network has many layers, the numbers of coefficients W and offset vectors b are large. These parameters are defined in the deep neural network as follows. Taking the coefficient W as an example: suppose that in a three-layer deep neural network, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron at layer L-1 to the j-th neuron at layer L is defined as $W^{L}_{jk}$.
Note that the input layer is without the W parameter. In deep neural networks, more hidden layers make the network more able to depict complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the larger the "capacity", which means that it can accomplish more complex learning tasks. The final goal of the process of training the deep neural network, i.e., learning the weight matrix, is to obtain the weight matrix (formed by a number of layers of vectors W) of all layers of the deep neural network that has been trained.
(3) Convolutional neural network
The convolutional neural network is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolution processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Fig. 2 is a schematic structural diagram of a neural network system according to an embodiment of the present application. The neural network system 200 includes a host 210 and a neural network circuit 220. The neural network circuit 220 is connected to the host 210 through a host interface. The host interface may include a standard host interface as well as a network interface (network interface). For example, the host interface may include a peripheral component interconnect express (PCIe) interface. As shown in fig. 2, the neural network circuit 220 may be connected to the host 210 through a PCIe bus 230. Thus, data to be processed may be input into the neural network circuit 220 through the PCIe bus 230, and data whose processing by the neural network circuit 220 has completed may be received through the PCIe bus 230. Also, the host 210 may monitor the operating state of the neural network circuit 220 through the host interface.
The host 210 includes a processor (processor) 211 and a memory 212. It should be noted that, besides the devices shown in fig. 2, the host 210 may further include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited herein. The host 210 may be considered an integrated circuit or a stand-alone device.
The processor 211 is the control unit (control unit) and arithmetic core of the host 210. The processor 211 may include multiple processor cores (cores). The processor 211 may be a very large-scale integrated circuit. An operating system and other software programs are installed in the processor 211 so that the processor 211 can access the memory 212, the cache, the disks, and peripheral devices (e.g., the neural network circuit in fig. 2). It is understood that in the embodiment of the present application, a processor core in the processor 211 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC).
The memory 212 is the main memory of the host 210. The memory 212 is coupled to the processor 211 via a double data rate (DDR) bus. The memory 212 is typically used to store various operating software of the operating system, input and output data, and information exchanged with external storage. To increase the access speed of the processor 211, the memory 212 needs to provide a high access speed. In a conventional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 212. The processor 211 can access the memory 212 at high speed through a memory controller (not shown in fig. 2) to perform a read operation or a write operation on any memory cell in the memory 212.
The neural network circuit 220 may be a chip that operates a neural network. The neural network circuit 220 is a chip array composed of a plurality of neural network chips (chips). For example, as shown in fig. 2, the neural network circuit 220 includes a plurality of neural network chips (chips) 221 that perform data processing and a plurality of routers 222. For convenience of description, the embodiment of the present application simply refers to the neural network chip 221 as the chip 221. The plurality of chips 221 are connected to each other through a router 222. For example, one chip 221 may be connected to one or more routers 222. The plurality of routers 222 may comprise one or more network topologies. The chips 221 may communicate data therebetween via the various network topologies. The neural network circuit 220 may also include memory 223, input ports 224, and output ports 225, among other devices. The memory is used to store data, computer programs and instructions.
Fig. 3 is a schematic structural diagram of a neural network chip according to an embodiment of the present application. Chip 221 includes a plurality of routers 310 therein, and each router 310 may be connected to a tile (tile) 320. In practice, one router 310 may also be connected to multiple tiles 320. As shown in fig. 3, each tile 320 may include an input output interface (TxRx) 321, a switch 322, a plurality of Processing Elements (PEs) 323, and a memory 324. Input-output interface 321 is used to receive data input to tile 320 from router 310 or output the results of the computation of tile 320. Put another way, input-output interface 321 is used to enable data transfer between tile 320 and router 310. The switching device 322 connects the input/output interface 321 and the plurality of processing devices 323. The switching device 322 is used to implement data transmission between the input/output interface 321 and the plurality of processing devices 323. The memory 324 is used to store data, computer programs, and instructions. Each tile 320 may also include a controller 325, controller 325 being used to control input output interface 321 and the plurality of processing devices 323 so that the system operates properly. Each processing device 323 may include one or more compute engines (computing engines) 326. One or more compute engines 326 are used to implement neural network computations on data input into compute engines 326. For example, the data input to tile 320 may be subjected to a multiply-add operation with a convolution kernel preset in tile 320. The results of the computations by compute engine 326 may be sent to other tiles 320 through switch 322 and input-output interface 321. In practice, one compute engine 326 may include modules that implement convolution, pooling (posing), or other neural network operations. Here, the specific circuit or function of the calculation engine 326 is not limited. For simplicity of description, in the embodiment of the present application, the calculation engine is simply referred to as an engine (engine).
Fig. 4 is a schematic structural diagram of a processing device according to an embodiment of the present disclosure. The processing device 323 may also include a controller 327 and a bus 328. The controller 327 is used to receive data and schedule one or more engines 326 within the processing device 323 to process the data so that the system operates properly. The plurality of engines 326 transmit data via a bus 328. The engine 326 is coupled to one or more exclusive memories 3210. Optionally, multiple engines 326 may also share one or more memories 329.
Herein, the memory within the neural network circuit 220 may be a cache memory, i.e., a cache. For example, memory 223, memory 324, memory 329, and memory 3210 may all be cache memories.
Herein, the cache Memory in the neural network circuit 220 is composed of a Static Random Access Memory (SRAM), and has a relatively small capacity but a much higher speed than the main Memory, which is close to the speed of the CPU. The cache may be a level L1 cache, a level L2 cache, or a level L3 cache. For example, the memory 3210 is an L1 level cache. The memory 329 is a level L2 cache or a level L3 cache. Memory 223 is a level L2 cache or a level L3 cache. Memory 324 is a level L2 cache or a level L3 cache.
As can be seen from the above description of the neural network, the neural network circuit 220 provided in the embodiment of the present application includes a plurality of neural network chips 221, each of the neural network chips 221 includes a plurality of tiles 320, each of the tiles 320 includes a plurality of processing devices 323, and each of the processing devices 323 includes a plurality of engines 326. As can be seen, the neural network system provided in the embodiments of the present application may include multiple levels of computing nodes, for example, may include four levels of computing nodes: the first level computing node is chip 221, the second level computing node is tile 320 within chip 221, the third level computing node is processing device 323 within tile 320, and the fourth level computing node is engine 326 within processing device 323.
The neural network system provided by the embodiment of the application can be applied to a mobile terminal, a monitoring terminal or a server and the like so as to realize related neural network operation.
Those skilled in the art will appreciate that a neural network comprises a plurality of neural network layers. In the embodiment of the present application, the neural network layer is a logical layer concept, and one neural network layer means that a neural network operation is to be performed.
The neural network may include n neural network layers (also referred to as an n-layer neural network), where n is an integer greater than or equal to 2. The first neural network layer and the second neural network layer may be two of the n layers that have an operational dependency. In the embodiment of the present application, two neural network layers having a dependency relationship means that the input data of one neural network layer includes the output data of the other neural network layer. Two neural network layers having a dependency relationship may also be referred to as adjacent layers. In addition, the input of a neural network layer may come from more than one neural network layer, possibly from the previous m neural network layers; likewise, the output of a neural network layer may be sent to more than one following neural network layer, possibly to the next m neural network layers.
Fig. 5 illustrates portions of neural network layers in a neural network, which may include convolutional layers, pooling layers, and the like. The neural network 500 may include a first layer 502, a second layer 504, a third layer 506, a fourth layer 508, a fifth layer 510 through an nth layer 512. Wherein the first layer 502 may perform a convolution operation, the second layer 504 may perform a pooling operation on the output data of the first layer 502, the third layer 506 may perform a convolution operation on the output data of the second layer 504, the fourth layer 508 may perform a convolution operation on the output result of the third layer 506, the fifth layer 510 may perform a summing operation on the output data of the second layer 504 and the output data of the fourth layer 508, and so on. It is understood that fig. 5 is only a simple example and illustration of the neural network layers in the neural network, and does not limit the specific operation of each layer of the neural network, for example, the fourth layer 508 may be a pooling operation, and the fifth layer 510 may be other neural network operations such as a convolution operation or a pooling operation.
The output data of the first layer 502 is the input data of the second layer 504, and thus the first layer 502 and the second layer 504 have a dependency relationship. The output data of the second layer 504 is the input data of the third layer 506, and the second layer 504 and the third layer 506 have a dependency relationship. The output data of the third layer 506 is the input data of the fourth layer 508, and the third layer 506 and the fourth layer 508 have a dependency relationship. The input data of the fifth layer 510 includes the output data of the second layer 504 and the output data of the fourth layer 508, and therefore, the second layer 504 and the fifth layer 510 also have a dependency relationship, and the fourth layer 508 and the fifth layer 510 also have a dependency relationship.
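The dependency relationships of FIG. 5 can be pictured as a small adjacency map; the sketch below is purely illustrative, and the dictionary keys simply reuse the reference numerals of the layers.

```python
# Dependencies of FIG. 5: each layer maps to the layers whose output it consumes.
dependencies = {
    504: [502],        # second layer pools the output of the first layer
    506: [504],        # third layer convolves the output of the second layer
    508: [506],        # fourth layer convolves the output of the third layer
    510: [504, 508],   # fifth layer sums the outputs of the second and fourth layers
}

def depends_on(consumer, producer):
    """True if the consumer layer's input data includes the producer layer's output data."""
    return producer in dependencies.get(consumer, [])

print(depends_on(510, 504), depends_on(506, 502))  # True False
```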
Each layer of computation in the neural network is implemented by a compute node. In practical applications, the amount of computation required for different application scenarios is different. Therefore, the computational nodes in the neural network system can be divided in a chip, a tile, a processing device or an engine as a granularity according to the actual application, so that the computational nodes in different sets are used for processing the operations of different neural network layers. In this manner, the compute node referred to in embodiments of the present application may be chip 221, tile 320, processing device 323, or engine 326.
In the neural network reasoning process, after the ith layer of the neural network is calculated, the calculation result (interlayer data) of the ith layer is temporarily stored in a preset cache, and when the (i + 1) th layer of calculation is executed, the calculation node loads the calculation result of the ith layer and the weight of the (i + 1) th layer from the preset cache again for calculation. Wherein, the ith layer is any layer in the neural network. For example, as shown in fig. 5, after the second layer 504 in the neural network is calculated, the output data (interlayer data) of the second layer 504 is temporarily stored in the preset memory 329, and when the calculation of the fifth layer 510 is performed, the calculation node loads the calculation result of the second layer 504 and the weight of the fifth layer 510 from the preset memory 329 again for calculation.
The preset cache is different according to different computing nodes. For example, if the compute node is the engine 326, the predetermined cache may be the memory 329 or the memory 3210. As another example, if the compute node is a processing device 323, the default cache may be memory 324. As another example, if the compute node is tile 320, the predetermined cache may be memory within tile 320. For another example, if the computing node is chip 221, the predetermined cache may be memory 223.
It should be understood that memory external to the neural network circuit 220 is referred to as external memory. The external storage is, for example, the memory 212 shown in fig. 2. The memory within the neural network circuit 220 is referred to as internal memory. The internal memory is, for example, memory 223 shown in fig. 2. As another example, the internal memory is memory 324 shown in FIG. 3. As another example, the internal memories are the memory 329 and the memory 3210 shown in fig. 4. By external memory is meant off-chip memory that runs the neural network. The external storage may be, for example, a magnetic disk or the memory 212 shown in fig. 2.
In order to facilitate understanding of technical solutions provided in the embodiments of the present application, some terms in the embodiments of the present application are first explained.
1) Batch size (batch size)
Limited by the capacity of the internal memory, the amount of data that each layer in the neural network can process at a time is the batch size corresponding to that layer. The batch corresponding to a batch size may be one picture, a plurality of pictures, or a partial image of one picture. For example, suppose the capacity of the internal memory is 100. If the cache requirement generated when layer 1 (layer 1, L1) processes 1 picture is 60, layer 1 can be scheduled to process at most 1 picture at a time, so the batch size corresponding to layer 1 is 1 picture. If the cache requirement generated when layer 2 processes 1 picture is 30, layer 2 can be scheduled to process at most 3 pictures at a time, so the batch size corresponding to layer 2 is 3 pictures. The batch size affects not only the use of the internal memory of the chip running the neural network, but also the degree of optimization and the processing speed of the neural network.
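A one-line sketch of the bound described above (units are arbitrary and the numbers repeat the example): the batch size of a layer is the largest number of pictures whose cache requirement still fits in the internal memory.

```python
def max_batch_size(capacity, per_picture_requirement):
    """Largest number of pictures whose inter-layer data fits in the internal memory."""
    return capacity // per_picture_requirement

print(max_batch_size(100, 60))  # 1 picture, as for layer 1 above
print(max_batch_size(100, 30))  # 3 pictures, as for layer 2 above
```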
2) Overlap problem
In some scenarios where the neural network processes pictures, limited by the capacity of the internal memory, it may be necessary to split the data of a whole picture into two or more pieces, each used as a batch of input data; each such piece may be referred to as non-whole-picture data. A convolutional layer may process non-whole-picture input data using a padding algorithm. That is, before the convolution kernel computation, the input data is artificially enlarged by the padding algorithm to offset the size shrinkage caused by the computation. The padding algorithm may be, for example, zero padding, repeated boundary value padding, or other methods. In other words, if the input data is non-whole-picture data, the input data needs to be processed with the padding algorithm; if the input data is whole-picture data, the padding algorithm is not needed.
If a convolutional layer adopts the padding algorithm, the input data must be padded before the convolution of that layer is computed. If the stride of the convolution kernel movement is smaller than the side length of the (generally square) convolution kernel, the regions of the original input matrix on which the convolution kernel acts overlap (overlap); if the stride of the convolution kernel movement equals the side length of the convolution kernel, no overlap occurs. If the input data size is (w × w), the data size after padding is (w + k - s) × (w + k - s), where k represents the side length of the convolution kernel, s represents the stride of the convolution kernel movement, and the amount of padding is (k - s).
For example, referring to fig. 6, suppose that the layers in a certain neural network include layer 0, layer 1, layer 2, and layer 3, the size of the convolution kernel is 3 × 3, and the stride of the convolution kernel movement is 1. Since the stride of the convolution kernel movement is smaller than the side length of the convolution kernel, the overlap problem exists when the input data is processed with the padding algorithm. Suppose the size of the whole picture is 56 × 56, and the rows of the whole picture are divided into 4 parts for processing. If layer 0, layer 1, and layer 2 are scheduled as one layer group, it must be ensured that layer 2 outputs 14 rows of data, that is, the output data size of the layer group is 14 × 56, so that layer 3 can process a quarter of the picture. The input data of layer 2 then needs 2 extra rows of padding, that is, the input data size is 16 × 56. Accordingly, the input data size of layer 1 is 18 × 56, and the input data size of layer 0 is 20 × 56. In other words, when a whole picture is split, the cache requirements of the layers in the layer group increase in order to guarantee the output data size. Furthermore, the more layers a layer group contains, the more data the earlier layers need to pad, and if the internal memory capacity is small, the size of the layer group is limited.
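The row counts in this example can be checked with a short backward pass over the layer group; this sketch simply applies the (w + k - s) relation with k = 3 and s = 1 and is not code from the application.

```python
def input_rows(output_rows, k=3, s=1):
    """Rows of input a convolutional layer needs to produce `output_rows` rows."""
    return output_rows + (k - s)

rows = 14                                      # layer 2 must output 14 x 56
for layer in ("layer 2", "layer 1", "layer 0"):
    rows = input_rows(rows)
    print(layer, "needs", rows, "input rows")  # 16, 18, 20
```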
3) Subgraph (subgraph)
It has been introduced above that the neural network contains a plurality of layers, which can be described as comprising a plurality of layers arranged in a directed graph and each layer can have a respective set of parameters. The subgraph is obtained by dividing the layers contained in the neural network according to the batch size of each layer. The subgraph contains one or more layers of the same batch size. Subgraphs may also be described as super layers (super layers) or groups of layers, etc., representing one or more successive layers in a neural network.
In some examples, the neural network is scheduled to process the input data in units of subgraphs, where the layers in the subgraph are scheduled in the same order as the layers in the neural network. The batches corresponding to the batch sizes of the layers included in the subgraph are processed in the order of the layers included in the subgraph. Inter-layer data of a plurality of layers included in the sub-graph is stored in the internal memory. Inter-layer data between subgraphs is stored in an internal memory.
For example, as shown in fig. 7, a schematic diagram of a sub-graph provided in an embodiment of the present application is shown. The subgraph includes layer0 and layer 1. The batch size for layer0 and the batch size for layer 1 are both 1. Hereinafter, a batch for which each batch size is 1 may be one picture, a plurality of pictures, or a partial image in one picture. Layer0 is processed one batch at a time. Layer 1 processes one batch at a time.
Assume that layer0 and layer 1 in the subgraph process batch A0 and batch A1. Batch A0 and batch A1 may be batches in input data to be processed by the neural network. Alternatively, the lot A0 and the lot A1 may be inter-layer data that has undergone layer processing in the neural network. The lot size for both lot A0 and lot A1 was 1. The execution order of processing batches within a subgraph is shown by the bold arrows in the figure. For ease of understanding, layer0 and layer 1 are separately indicated for processing batch A0 and batch A1, respectively.
Wherein, layer0 processes batch A0 first to obtain interlayer data B0, and layer 1 processes interlayer data B0 to obtain interlayer data C0. Then, layer0 processes batch A1 to obtain interlayer data B1, and layer 1 processes interlayer data B1 to obtain interlayer data C1. The inter-layer data C0 and the inter-layer data C1 may be stored in an internal memory.
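The schedule of FIG. 7 can be sketched as follows; layer0 and layer1 are stand-in functions, and the stored names C0/C1 are only labels for this example, not identifiers from the application.

```python
# Illustrative subgraph schedule: each batch flows through layer0 then layer1
# before the next batch starts; the outputs C0, C1 stay in the internal memory.

def run_subgraph(batches, layer0, layer1):
    internal_memory = {}
    for name, batch in batches:          # A0 first, then A1
        b = layer0(batch)                # inter-layer data B0 / B1
        c = layer1(b)                    # inter-layer data C0 / C1
        internal_memory[name.replace("A", "C")] = c
    return internal_memory

mem = run_subgraph([("A0", 0), ("A1", 1)],
                   layer0=lambda x: x + 10,   # stand-in computations
                   layer1=lambda x: x * 2)
print(mem)  # {'C0': 20, 'C1': 22}
```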
4) Picture (graph)
The graph includes one or more subgraphs. Where a graph may also be described as a super-level layer or set of layers, meaning one or a succession of layers in a neural network is included.
In some embodiments, each subgraph in the graph contains layers that can handle the same batch size. As an example, as shown in (a) in fig. 8, it is assumed that the graph includes sub fig. 1 and sub fig. 2. Wherein, subgraph 1 includes layer0 and layer 1, and the batch size of layer0 is the same as that of layer 1. Sub-graph 2 includes layer 2 and layer 3, with the batch size for layer 2 being the same as the batch size for layer 3. The batch size of layer0 and the batch size of layer 1 are both one batch. The batch size for layer 2 and the batch size for layer 3 are both one batch. In summary, all sub-graphs included in a graph contain layers of the same batch size.
In other embodiments, among the subgraphs included in the graph, at least two subgraphs contain layers of different batch sizes. As shown in (b) in fig. 8, assume that the graph includes sub-graph 1, sub-graph 2, and sub-graph 3. Sub-graph 1 includes layer 0 and layer 1, and the batch size of layer 0 is the same as that of layer 1. Sub-graph 2 includes layer 2 and layer 3, and the batch size of layer 2 is the same as that of layer 3. Sub-graph 3 includes layer 4 and layer 5, and the batch size of layer 4 is the same as that of layer 5. The batch size of layer 0 and the batch size of layer 1 are both one batch. The batch size of layer 2 and the batch size of layer 3 are both one batch. The batch size of layer 4 and the batch size of layer 5 are both two batches. In summary, the layers contained in sub-graph 3 have a batch size different from the layers contained in sub-graph 1, and also different from the layers contained in sub-graph 2.
In some examples, the neural network is scheduled to process the input data in units of a graph, where a scheduling order of layers in the graph is the same as a scheduling order of layers in the neural network. The scheduling sequence of each subgraph contained in the graph is determined according to the batch size and the scheduling sequence of the first layer and the last layer in the subgraph. When multiple layers of different batch sizes in the neural network are scheduled as a graph, a part of data is reserved in the cache space of the internal memory, so that additional internal memory cache requirements are generated. Inter-layer data between the figures is stored in an external memory. The scheduling process for the layers in the figure is illustrated by the following aggregation (gather) and scatter (scatter) problems.
5) Aggregation (gather) problem
In one possible implementation, the inter-layer data between the subgraphs included in the graph is aggregated. For example, as shown in fig. 9, a schematic diagram of performing aggregation processing on inter-layer data of a subgraph provided in an embodiment of the present application is shown. Suppose the graph includes sub-graph 0 and sub-graph 1. Sub-graph 0 includes layer0 and layer 1. The batch size for layer0 and the batch size for layer 1 are both 1. Layer0 was processed one batch at a time. Layer 1 processes one batch at a time. Sub-diagram 1 includes layer 2 and layer 3. The batch size for layer 2 and the batch size for layer 3 are both 2. Hereinafter, a batch for which each batch size is 2 may be two pictures, a plurality of pictures, or partial images in one picture. Layer 2 was processed two batches at a time. Layer 3 processed two batches at a time. Assume that the graph processes batch A0 and batch A1. Batch A0 and batch A1 may be batches in input data to be processed by the neural network. Batch A0 and batch A1 may be inter-layer data that has undergone layer processing in a neural network. The lot size of both lot A0 and lot A1 was 1. Since layer0 and layer 1 included in subgraph 0 each process one batch at a time, layer 2 and layer 3 included in subgraph 1 each process two batches at a time. After the sub-graph 0 finishes processing the batch A0 and the batch A1 respectively, the sub-graph 1 processes the inter-layer data of the batch A0 and the inter-layer data of the batch A1 output by the sub-graph 0. The order of execution of the processing batches within the diagram is indicated by the bold arrows. For ease of understanding, layer0 and layer 1 are shown separately for processing batch A0 and batch A1, respectively.
For subgraph 0, layer0 processes batch A0 first to obtain interlayer data B0, and layer 1 processes interlayer data B0 to obtain interlayer data C0. Then, layer0 processes batch A1 to obtain interlayer data B1, and layer 1 processes interlayer data B1 to obtain interlayer data C1. The inter-layer data C0 and the inter-layer data C1 may be stored in an internal memory.
For sub fig. 1, layer 2 may obtain inter-layer data C0 and inter-layer data C1 from the internal memory, and at this time, the inter-layer data C0 and the inter-layer data C1 may be combined into inter-layer data (C0, C1). Layer 2 processes (C0, C1) resulting in interlayer data (D0, D1), and layer 3 processes interlayer data (D0, D1) resulting in interlayer data (E0, E1). The inter-layer data (E0, E1) may be stored in an internal memory.
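A compact sketch of this gather pattern, with stand-in functions only: subgraph 0 runs each batch separately, and the resulting C0 and C1 are combined into one batch of size 2 before subgraph 1 runs.

```python
def run_layers(layers, data):
    for layer in layers:
        data = layer(data)
    return data

def run_graph_with_gather(batches, sub0_layers, sub1_layers):
    c = [run_layers(sub0_layers, b) for b in batches]   # C0, C1 (batch size 1)
    gathered = c                                         # gather into (C0, C1)
    return run_layers(sub1_layers, gathered)             # (E0, E1) at batch size 2

out = run_graph_with_gather(
    batches=[1, 2],
    sub0_layers=[lambda x: x + 1, lambda x: x * 3],      # layer 0, layer 1
    sub1_layers=[lambda xs: [x + 10 for x in xs],        # layer 2 on (C0, C1)
                 lambda xs: [x * 2 for x in xs]])        # layer 3
print(out)  # [32, 38]
```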
6) Scatter problem
In another possible implementation manner, the inter-layer data between the subgraphs included in the graph is subjected to scatter processing. For example, as shown in fig. 10, a schematic diagram for performing a scatter process on inter-layer data of a subgraph provided in an embodiment of the present application is shown. Suppose the graph includes sub-graph 1 and sub-graph 2. Sub-diagram 1 includes layer 2 and layer 3. The batch size for layer 2 and the batch size for layer 3 are both two batches. Layer 2 was processed two batches at a time. Layer 3 processed two batches at a time. Sub-diagram 2 includes layer 4 and layer 5. The batch size of layer 4 and the batch size of layer5 are both one batch. Layer 4 processes one batch at a time. Layer5 processes one batch at a time. Since layer 2 and layer 3 included in sub-graph 1 each process two batches, layer 4 and layer5 included in sub-graph 2 each process one batch. After sub-graph 1 finishes processing the inter-layer data (C0, C1), sub-graph 2 may process the inter-layer data E0 and the inter-layer data E1 output by sub-graph 1. The order of execution of the processing batches within the diagram is shown by the bold arrows. For ease of understanding, layer 4 and layer5 are separately denoted to handle interlayer data E0 and interlayer data E1, respectively.
For sub-graph 1, layer 2 may retrieve the inter-layer data (C0, C1) of batch A0 and batch A1 from the internal memory. Layer 2 processes (C0, C1) resulting in interlayer data (D0, D1), and layer 3 processes interlayer data (D0, D1) resulting in interlayer data (E0, E1). At this time, the inter-layer data (E0, E1) may be stored in the internal memory.
For sub-diagram 2, the layer 4 first obtains the inter-layer data (E0, E1) from the internal memory, and divides the inter-layer data (E0, E1) into inter-layer data E0 and inter-layer data E1. The layer 4 processes the interlayer data E0 in the interlayer data (E0, E1) first to obtain interlayer data F0, and the layer5 processes the interlayer data F0 to obtain interlayer data G0. Then, the layer 4 processes the interlayer data E1 in the interlayer data (E0, E1) to obtain interlayer data F1, and the layer5 processes the interlayer data F1 to obtain interlayer data G1. The inter-layer data G0 and the inter-layer data G1 may be stored in an internal memory.
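The matching scatter pattern, again with stand-in functions: the combined inter-layer data (E0, E1) produced by subgraph 1 is split, and layer 4 and layer 5 then process E0 and E1 one at a time.

```python
def run_graph_with_scatter(e_pair, sub2_layers):
    results = []
    for e in e_pair:                  # scatter: E0 first, then E1
        data = e
        for layer in sub2_layers:     # layer 4, then layer 5
            data = layer(data)
        results.append(data)          # G0, G1 kept in the internal memory
    return results

print(run_graph_with_scatter([6, 9],
                             sub2_layers=[lambda x: x - 1,     # layer 4
                                          lambda x: x * 10]))  # layer 5
# [50, 80]
```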
In another possible implementation, multiple graphs are scheduled for processing in the order of the layers of the neural network. Note that the data processed by a subsequent graph is the data output by the previous graph. Dividing the layers of the neural network into multiple graphs and processing the batches in the order of the graphs improves the utilization of the internal memory and thus the processing performance of the whole neural network.
For example, as shown in fig. 11, a schematic diagram of processing of a graph provided in an embodiment of the present application is shown. Where the abscissa represents the layer of the neural network and the ordinate represents the batch. The neural network is assumed to comprise 12 layers. The batch sizes for layer0, layer 1, layer 4, layer5, layer 10 and layer 11 are each one batch, i.e., layer0, layer 1, layer 4, layer5, layer 10 and layer 11 process one batch at a time. The batch sizes for layer 2, layer 3, layer 6 and layer 7 are each two batches, i.e., layer 2, layer 3, layer 6 and layer 7 process two batches at a time. The batch sizes for layer 8 and layer 9 are each four batches, i.e. layer 8 and layer 9 process four batches at a time. The neural network includes 12 layers divided into two graphs, fig. 0 and fig. 1. Fig. 0 includes layers 0 through 5. Fig. 1 includes layers 6 through 11. The numbers in the boxes indicate the order of execution of the batches. Performed in small to large numbers. After one graph 0 is processed, the graph 1 is processed again.
For graph 0, layer 0 processes batch 0, layer 1 processes the batch 0 inter-layer data output by layer 0 to obtain the batch 0 inter-layer data output by layer 1, and this inter-layer data is stored in the internal memory. Layer 0 then processes batch 1, layer 1 processes the batch 1 inter-layer data output by layer 0 to obtain the batch 1 inter-layer data output by layer 1, and this inter-layer data is stored in the internal memory. The batch 0 inter-layer data and the batch 1 inter-layer data are then taken out of the internal memory; layer 2 processes them, and layer 3 processes the batch 0 and batch 1 inter-layer data output by layer 2 to obtain the batch 0 and batch 1 inter-layer data output by layer 3. Layer 4 processes the batch 0 inter-layer data output by layer 3, and layer 5 processes the batch 0 inter-layer data output by layer 4; layer 4 then processes the batch 1 inter-layer data output by layer 3, and layer 5 processes the batch 1 inter-layer data output by layer 4.
Similarly, layer 0 to layer 5 process batch 2 and batch 3 in the same order as batch 0 and batch 1. The batches processed by graph 1 are the data output by graph 0. For graph 1, layer 6 processes the batch 0 and batch 1 inter-layer data output by layer 5, layer 7 processes the batch 0 and batch 1 inter-layer data output by layer 6, and the batch 0 and batch 1 inter-layer data output by layer 7 is stored in the internal memory. Layer 6 then processes the batch 2 and batch 3 inter-layer data output by layer 5, layer 7 processes the batch 2 and batch 3 inter-layer data output by layer 6, and the batch 2 and batch 3 inter-layer data output by layer 7 is stored in the internal memory. The batch 0 to batch 3 inter-layer data output by layer 7 is then taken out of the internal memory; layer 8 processes the batch 0 to batch 3 inter-layer data output by layer 7, layer 9 processes the batch 0 to batch 3 inter-layer data output by layer 8, and the batch 0 to batch 3 inter-layer data output by layer 9 is stored in the internal memory. Finally, layer 10 and layer 11 process the data one batch at a time: layer 10 processes the batch 0 inter-layer data output by layer 9 and layer 11 processes the batch 0 inter-layer data output by layer 10, and the same is repeated for batch 1, batch 2 and batch 3.
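For illustration only, the graph-by-graph scheduling described above (fig. 11) can be approximated with the following Python sketch; the data structures and function names (schedule_graph, schedule_network) are assumptions introduced for this sketch, and the sub-graph grouping is taken as given.

```python
# Illustrative sketch of scheduling graphs whose sub-graphs have
# different batch sizes. Each sub-graph is (layer_names, batch_size).

def schedule_graph(subgraphs, batch_ids, log):
    """Schedule one graph on a chunk of batch ids: each sub-graph consumes
    the chunk in groups equal to its own batch size."""
    for layers, size in subgraphs:
        for i in range(0, len(batch_ids), size):
            group = batch_ids[i:i + size]
            for layer in layers:
                log.append(f"{layer}{group}")

def schedule_network(graphs, all_batches):
    """Graphs are scheduled one after another in layer order; within a
    graph, batches are processed in chunks equal to the largest batch size
    of its sub-graphs."""
    log = []
    for subgraphs in graphs:
        chunk = max(size for _, size in subgraphs)
        for i in range(0, len(all_batches), chunk):
            schedule_graph(subgraphs, all_batches[i:i + chunk], log)
    return log

# The 12-layer example of fig. 11: graph 0 = layers 0-5, graph 1 = layers 6-11.
graph0 = [(["L0", "L1"], 1), (["L2", "L3"], 2), (["L4", "L5"], 1)]
graph1 = [(["L6", "L7"], 2), (["L8", "L9"], 4), (["L10", "L11"], 1)]
for step in schedule_network([graph0, graph1], [0, 1, 2, 3]):
    print(step)
```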
Next, a data processing method of the neural network is described in detail with reference to fig. 12, taking the case where the processor 211 executes the method as an example. The internal memory includes the memory 223, the memory 324, the memory 329, and the memory 3210, and the external memory is the memory 212. The computing node completes the operation of the neural network according to the determined batch sizes, where the computing node includes the chip 221, the tile 320, the processing device 323, or the engine 326. As shown in fig. 12, the data processing method of the neural network includes S1201 and S1202.
S1201, the processor 211 acquires a data amount of the input data, a first characteristic of an internal memory within a chip that operates the neural network, and a second characteristic of a plurality of layers in the neural network. The input data is data received by an input layer of the neural network. For example, the input data is data within a data set. Taking image processing as an example, the input data is 32 pictures in a data set.
The first feature includes at least one of a distribution feature of the internal memory within the chip and a capacity of the internal memory. It can be understood that the distribution feature of the internal memory within the chip includes the number of memories in the chip used to run the neural network and the connection relationships between the memories and the computing nodes. Because the number of memories in the chip may be large and the memory resources allocated to the neural network computation may change, the neural network configuration needs to be dynamically optimized according to the number of memories and the connection relationships between the memories and the computing nodes. For example, for the neural network circuit 220, the distribution feature includes the number of the memories 223, 324, 329, and 3210, as well as the connection relationship between the memory 223 and the chip 221, the connection relationship between the memory 324 and the processing device 323, the connection relationship between the memory 329 and the engine 326, and the connection relationship between the memory 3210 and the engine 326.
The capacity of the internal memory includes the capacity of all memories in the chip used to run the neural network. Because the memory resources are not all available for the neural network computation at every moment, the memory resources allocated to the neural network computation may change, and the neural network configuration therefore needs to be dynamically optimized according to the capacity. For example, for the neural network circuit 220, the capacity includes the capacity of the memory 223, the capacity of the memory 324, the capacity of the memory 329, and the capacity of the memory 3210. It is understood that the capacity of the internal memory may refer to the available capacity of the internal memory.
The second feature includes the connection relationships between the plurality of layers and the computation-related parameters of each of the plurality of layers. Because the computational resources within the chip may vary and are not all available for the neural network computation at every moment, the connection relationships between the layers and the computation-related parameters of each layer may also change as needed, and the neural network configuration needs to be dynamically optimized accordingly. It is understood that the connection relationships between the plurality of layers include the connection relationship of each layer with at least one other layer in the neural network; these connection relationships differ depending on the function performed by the neural network and are not limited in the present application. The computation-related parameters of each layer include the dimensions of the input data and the output data, a bias parameter, a convolution kernel, a quantization parameter, a normalization parameter, or the like.
The first feature and the second feature may be stored in a memory 212 within the host 210. The processor 211 may retrieve the characteristics of the internal memory and the characteristics of the layers in the neural network from the memory 212 in the host 210.
S1202, the processor 211 determines, according to the data amount, the first characteristic, and the second characteristic, the batch sizes of the plurality of layers, N sub-graphs, M graphs, and storage locations of the inter-layer data, where the batch sizes of at least two of the plurality of layers are different.
Specifically, the processor 211 may use an optimization algorithm to determine the batch sizes of the plurality of layers, the N sub-graphs, the M graphs, and the storage locations of the inter-layer data according to the data amount, the first characteristic, and the second characteristic. The optimization algorithm may be a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm. Understandably, the processor does not obtain the batch sizes, the N sub-graphs, the M graphs, and the storage locations of the inter-layer data in a single calculation; instead, it performs multiple iterative trials with the algorithm and selects the batch sizes, the N sub-graphs, the M graphs, and the storage locations of the inter-layer data from the trial results, so as to ensure the utilization rate of the internal memory and the computing efficiency of the chip running the neural network. N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M. For example, N = 2 and M = 1 represents that the layers of the neural network are divided into 2 sub-graphs and the 2 sub-graphs are divided into one graph. As another example, N = 2 and M = 2 represents that the layers of the neural network are divided into 2 sub-graphs and the 2 sub-graphs are divided into 2 graphs. For another example, N = 3 and M = 2 represents that the layers of the neural network are divided into 3 sub-graphs and the 3 sub-graphs are divided into 2 graphs.
For example, the processor 211 first determines the batch size of each layer in the neural network based on the capacity of the internal memory, and then merges consecutive layers with the same batch size into a sub-graph. The processor then fuses multiple sub-graphs into a graph based on the caching requirements of the sub-graphs and the capacity of the internal memory, so that the resulting graph includes sub-graphs with different batch sizes. That is to say, when the neural network is subsequently scheduled in units of graphs, the input data is processed in different batches such that the cache requirement of each graph does not exceed the capacity of the internal memory, which improves the utilization rate of the on-chip memory and the running performance of the hardware.
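A minimal sketch of this partitioning is given below, under the simplifying assumption that a sub-graph's cache demand is the sum of its layers' demands (the aggregation and scatter buffering discussed later is ignored here); the Layer fields and function names are assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Layer:
    name: str
    batch_size: int      # batch size chosen for this layer (e.g. from capacity)
    cache_demand: int    # estimated buffer demand when run at that batch size

def build_subgraphs(layers):
    """Merge consecutive layers with the same batch size into sub-graphs."""
    return [list(group) for _, group in groupby(layers, key=lambda l: l.batch_size)]

def build_graphs(subgraphs, capacity):
    """Greedily fuse consecutive sub-graphs into graphs as long as the
    combined cache demand stays within the internal memory capacity."""
    graphs, current, demand = [], [], 0
    for sg in subgraphs:
        sg_demand = sum(l.cache_demand for l in sg)
        if current and demand + sg_demand > capacity:
            graphs.append(current)
            current, demand = [], 0
        current.append(sg)
        demand += sg_demand
    if current:
        graphs.append(current)
    return graphs

# Toy example: six layers, batch sizes 1, 1, 2, 2, 1, 1.
layers = [Layer(f"L{i}", b, 10) for i, b in enumerate([1, 1, 2, 2, 1, 1])]
print(build_graphs(build_subgraphs(layers), capacity=80))
```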
Understandably, the layers in the N sub-graphs, connected together, constitute the complete neural network. Each of the N sub-graphs contains one or more layers of the same batch size, and these layers are consecutive layers in the neural network. The number of layers included in different sub-graphs may be the same or different.
The sub-graphs in the M graphs are connected to form a complete neural network. Each of the M graphs includes one or more subgraphs. The number of sub-graphs included in different graphs may be the same or different.
In the process of scheduling the neural network to process data, corresponding operation overhead is generated, such as computation time overhead and data transfer time overhead. The performance of the neural network can be measured against a preset operation overhead index: the lower the operation overhead of the neural network, the better its performance. As shown in fig. 13, the process by which a layer in the neural network processes data includes a data move-in process (i.e., reading the input data), a calculation process, and a data move-out process (i.e., storing the output data). When the neural network processes a batch of data, part of the data must first be moved in; the overhead generated in this initial move-in is the head overhead. After that, the data move-in, calculation, and data move-out processes proceed in parallel. Finally, the neural network moves out the last piece of calculated data and stores it in the storage space; the overhead generated in this final move-out is the tail overhead.
In the embodiment of the present application, a layer processes data in units of its batch size. In the process of processing a batch of input data at a certain layer, the computation time = the computation amount of the layer / the computing power of the chip carrying the neural network, the data transfer time = (input data amount + output data amount) / (internal memory bandwidth or off-chip memory bandwidth), and the total time overhead = the head overhead + max(computation time, data transfer time) + the tail overhead. It can be seen that if the batch size is too small, the head overhead plus the tail overhead may be greater than or equal to the computation time, resulting in low operation efficiency of the neural network. The time overhead of a certain layer in the neural network can be obtained according to the storage location of at least one of the input data or the output data of the layer and the computing power of the chip carrying the neural network, where the storage location of the data is either the internal memory or the external memory.
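For illustration only, the time-overhead formula above can be written as the following Python sketch; the parameter names and units (operations, bytes, bytes per second) are assumptions introduced for this sketch.

```python
def layer_time_overhead(compute_ops, in_bytes, out_bytes,
                        chip_ops_per_s, bandwidth_bytes_per_s,
                        head_overhead, tail_overhead):
    """Total time overhead of one layer processing one batch:
    head + max(computation time, data transfer time) + tail,
    because data move-in/move-out overlaps with computation in steady state."""
    compute_time = compute_ops / chip_ops_per_s
    transfer_time = (in_bytes + out_bytes) / bandwidth_bytes_per_s
    return head_overhead + max(compute_time, transfer_time) + tail_overhead

# Toy numbers: a compute-bound case (left) and a transfer-bound case (right).
print(layer_time_overhead(1e9, 1e6, 1e6, 1e12, 1e10, 1e-4, 1e-4))
print(layer_time_overhead(1e7, 1e8, 1e8, 1e12, 1e10, 1e-4, 1e-4))
```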
Allowing the inter-layer data to be stored in the external memory means that the external memory and the internal memory can be planned together to store the inter-layer data, which reduces the storage space occupied in the internal memory. In addition, allowing the inter-layer data to be stored in the external memory makes it possible to set larger batch sizes for the layers in the neural network, thereby reducing the head overhead of each batch processed by the layers and improving the computing efficiency of the processor.
In the process of processing the input data of the neural network according to the division of its layers, the scheduling order of the layers in a graph is determined by the scheduling order of the sub-graphs contained in the graph and the scheduling order of the layers within each sub-graph. For example, the scheduling order of the layers in a sub-graph is the same as the order of those layers in the neural network, and the batches corresponding to the batch size of the sub-graph are processed layer by layer in that order. The scheduling order of the sub-graphs contained in a graph is determined according to the batch sizes and the scheduling order of the first layer and the last layer in each sub-graph. The inter-layer data between the sub-graphs included in the graph is subjected to aggregation processing or scatter processing. For the explanation of graphs and sub-graphs, reference is made to the statements above.
Illustratively, as shown in fig. 13, assume that the neural network comprises 6 layers in the order layer 0 to layer 5 (L0-L5). The batch sizes of L0, L1, L4, and L5 are 1, and the batch sizes of L2 and L3 are 2. Layers with the same batch size are grouped into sub-graphs: L0 and L1 constitute sub-graph 0, L2 and L3 constitute sub-graph 1, and L4 and L5 constitute sub-graph 2. The sub-graphs are then grouped into graphs, namely sub-graph 0, sub-graph 1, and sub-graph 2. Because the batch sizes of L0 and L1 are 1, sub-graph 0 processes input data of size 1 each time, that is, batch 0 and batch 1 are processed separately. After batch 0 is input to L0 and processed by L0 and L1, the output data of L1 is C0. The batch size of L2 is 2; at this time C0 corresponds only to batch 0, which does not meet the processing requirement of L2, so C0 must be temporarily stored in the internal memory. Batch 1 is then input to L0 and processed by L0 and L1, and the output data of L1 is C1. At this time L1 has output two batches of data, which meets the processing requirement of L2. The internal memory contains the two data sets C0 and C1; after C0 and C1 are aggregated, L2 can process the aggregated (C0, C1). Therefore, if sub-graph 0 and sub-graph 1 are divided into one graph, then while L0 and L1 are scheduled to process batch 1, C0 occupies cache space in the internal memory, and the data amount corresponding to C0 is the extra cache requirement of L0 and L1. In this process, the cache demand of the input data of L0 is the data amount corresponding to (C0 + A1) and the cache demand of its output data is the data amount corresponding to (C0 + B1); the cache demand of the input data of L1 is the data amount corresponding to (C0 + B1) and the cache demand of its output data is the data amount corresponding to (C0 + C1).
If sub-graph 1 and sub-graph 2 are divided into one graph for scheduling, the scatter problem occurs. As shown in fig. 13, the input data of L3 is D0 corresponding to batch 0 and D1 corresponding to batch 1, and its output data is E0 corresponding to batch 0 and E1 corresponding to batch 1. The batch size of L4 is 1, so E0 and E1 cannot be processed at the same time. L4 therefore processes E0 first and temporarily stores E1 in the internal memory. Then, while L4 and L5 are scheduled to process the data corresponding to batch 0, E1 occupies cache space in the internal memory, and the data amount corresponding to E1 is the extra internal memory cache requirement of L4 and L5. In this process, the cache demand of the input data of L4 is the data amount corresponding to (E1 + E0) and the cache demand of its output data is the data amount corresponding to (E1 + F0); the cache demand of the input data of L5 is the data amount corresponding to (E1 + F0) and the cache demand of its output data is the data amount corresponding to (E1 + G0).
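For illustration only, the extra cache requirement caused by aggregation or scatter can be checked against the internal memory capacity as in the following Python sketch; treating the larger of the input-side and output-side demands as the binding figure is a simplifying assumption of this sketch, and the function name is introduced only here.

```python
def fits_in_internal_memory(extra_buffered, in_bytes, out_bytes, capacity):
    """Check whether a layer's cache demand still fits the internal memory
    when an extra batch must stay buffered because of aggregation or scatter.
    This mirrors the (C0 + A1) / (C0 + B1) and (E1 + E0) / (E1 + F0)
    accounting in the example above."""
    input_demand = extra_buffered + in_bytes
    output_demand = extra_buffered + out_bytes
    return max(input_demand, output_demand) <= capacity

# Toy numbers: with C0 (or E1) buffered, the layer still fits in a 1 MB cache.
print(fits_in_internal_memory(extra_buffered=300_000,
                              in_bytes=400_000,
                              out_bytes=500_000,
                              capacity=1_000_000))
```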
It should be noted that, because the inter-layer data of the multiple layers included in a sub-graph and the data between sub-graphs are stored in the internal memory and occupy its storage space, the division into sub-graphs and graphs in turn affects the batch sizes of the multiple layers and the storage locations of the inter-layer data.
For example, as shown in fig. 9, when batch A1 is being computed, the inter-layer data C0 of layer 0 and layer 1 is stored in the cache, which reduces the cache available to layer 0 and layer 1 and may affect the splitting of the input data.
As another example, as shown in fig. 10, when layer 4 and layer 5 process the inter-layer data E0, the inter-layer data E1 is stored in the cache and occupies cache space, so the cache available to layer 4 and layer 5 becomes smaller, which affects the splitting of the input data.
Therefore, when dividing layers of different batch sizes, the extra internal memory cache requirement caused by the aggregation or scatter problem needs to be considered, and it needs to be determined whether the cache requirement of the divided sub-graph exceeds the capacity of the internal memory.
According to the data processing method of the neural network provided above, the input data is split with comprehensive reference to the data amount of the input data, the first feature, and the second feature, and different batch sizes are set for the layers in the neural network. By setting a reasonable batch size for each layer, the internal memory is fully utilized to store the inter-layer data of the neural network during inference, the interaction between the chip running the neural network and the external memory is reduced, the utilization rate of the internal memory is improved, and the computing efficiency of the chip running the neural network is ensured.
In one possible implementation, S1201 and S1202 may be performed offline by another computer to generate the splitting strategy and the execution order for scheduling the layers of the neural network. The splitting strategy and the execution order are then configured into a controller in the neural network system, and the controller controls the splitting strategy and the execution order of the layers of the neural network.
In another possible implementation, S1201 and S1202 may be executed by a controller in the neural network system to generate the splitting strategy and the execution order for scheduling the layers of the neural network, and the controller uniformly manages the scheduling of the layers of the neural network and the splitting of the plurality of batches.
The following describes a neural network scheduling method provided in the embodiments of the present application with reference to specific examples.
Example one, the input data is whole-picture data.
As shown in fig. 14, based on the internal memory capacity and in view of the overall performance of the neural network, it is determined that the batch size corresponding to L0 and L1 is 1 picture, the batch size corresponding to L2, L3, and L4 is 2 pictures, and the batch size corresponding to L5 and L6 is 4 pictures. With the method provided in the embodiment of the present application, L0 and L1 are divided into sub-graph 0, L2-L4 are divided into sub-graph 1, and L5 and L6 are divided into sub-graph 2. For the aggregation problem, based on the capacity of the internal memory and in view of the overall performance of the neural network, the 3 sub-graphs are divided into one graph, that is, L0-L6 form one graph, and the cache requirement of the graph is less than or equal to the capacity of the internal memory. The graph includes layers with different batch sizes, so in the process of scheduling the sub-graphs of the neural network to process the input data, the utilization rate of the internal memory and the running performance of the chip running the neural network can both be improved.
As shown in fig. 14, assume the data set includes 8 pictures. L0 is the first layer of the graph and its batch size is 1 picture, so the data set is divided into 8 batches of input data (batch 0 to batch 7 shown in fig. 14), each batch being the whole-picture data of 1 picture, and the batches are input to L0 one by one. As shown in fig. 14, in the process of processing the input data of the current data set, sub-graph 0 is scheduled 2 times for every 1 time sub-graph 1 is scheduled, that is, the scheduling order is L0 → L1 → L0 → L1 → L2 → L3 → L4; and sub-graph 1 is scheduled 2 times for every 1 time sub-graph 2 is scheduled, that is, the scheduling order is L2 → L3 → L4 → L2 → L3 → L4 → L5 → L6. Processing the input data of the current data set requires scheduling sub-graph 0 eight times, sub-graph 1 four times, and sub-graph 2 twice.
Example two, the input data is non-whole-picture data.
As shown in fig. 15, based on the internal memory capacity and in view of the overall performance of the neural network, it is determined that the batch size corresponding to L0 and L1 is 1/4 picture and the batch size corresponding to L2, L3, and L4 is 1/2 picture. With the method provided in the embodiment of the present application, L0 and L1 are divided into sub-graph 0, and L2-L4 are divided into sub-graph 1. For the overlap problem, as shown in fig. 15, the input data is non-whole-picture data and needs to be processed by a padding algorithm, where the padding data is the shaded portion. Based on the internal memory capacity and in view of the overall performance of the neural network, the 2 sub-graphs are divided into one graph, that is, L0-L4 form one graph, and the cache requirement of the graph is less than or equal to the capacity of the internal memory. The graph includes layers with different batch sizes, so in the process of scheduling the sub-graphs of the neural network to process the input data, the utilization rate of the internal memory and the running performance of the chip running the neural network can both be improved.
As shown in fig. 15, assume the data set includes 2 pictures. L0 is the first layer of the graph and its batch size is 1/4 picture, so the data set is divided into 8 batches of input data (batch 0 to batch 7 shown in fig. 15), each batch being the non-whole-picture data of 1/4 picture, and the batches are input to L0 one by one. As shown in fig. 15, in the process of processing the input data of the current data set, sub-graph 0 is scheduled 2 times for every 1 time sub-graph 1 is scheduled, that is, the scheduling order is L0 → L1 → L0 → L1 → L2 → L3 → L4. Processing the input data of the current data set requires scheduling sub-graph 0 eight times and sub-graph 1 four times.
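For illustration only, the scheduling counts in example one and example two follow from simple division, as in the following Python sketch (the function name is an assumption introduced for this sketch).

```python
def schedule_counts(num_batches, subgraph_batch_sizes):
    """How many times each sub-graph is scheduled to consume the data set.
    Example one: 8 one-picture batches with sub-graph batch sizes 1, 2, 4
    -> 8, 4, 2 invocations. Example two: 8 quarter-picture batches with
    sizes 1, 2 -> 8, 4 invocations."""
    return [num_batches // size for size in subgraph_batch_sizes]

print(schedule_counts(8, [1, 2, 4]))   # [8, 4, 2]
print(schedule_counts(8, [1, 2]))      # [8, 4]
```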
It is to be understood that, in order to implement the functions of the above embodiments, the neural network system includes at least one of a hardware structure or a software module corresponding to each function. Those of skill in the art will readily appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software driven hardware depends on the particular application scenario and design constraints imposed on the solution.
Fig. 16 and 17 are schematic structural diagrams of a data processing apparatus of a possible neural network provided in an embodiment of the present application. The data processing device of the neural network can be used to implement the functions of the processor 211 in the above method embodiments, and therefore, the beneficial effects of the above method embodiments can also be achieved.
In the embodiment of the present application, the data processing device of the neural network in fig. 16 may be the processor 211 shown in fig. 2 or a device formed by running software thereon. As shown in fig. 16, the data processing apparatus 1600 of the neural network includes an obtaining unit 1610 and a processing unit 1620. The data processing device 1600 of the neural network is used to implement the functions of the processor 211 in the method embodiment shown in fig. 12 described above. When the data processing device 1600 of the neural network is used to implement the functions of the processor 211 in the method embodiment shown in fig. 12: the obtaining unit 1610 is configured to execute S1201; the processing unit 1620 is configured to execute S1202. More detailed descriptions about the obtaining unit 1610 and the processing unit 1620 can be directly obtained by referring to the related descriptions in the embodiment of the method shown in fig. 12, which is not repeated herein.
The data processing apparatus of the neural network may also be a module (e.g., a chip) of another device connected to the neural network system 200. As shown in fig. 17, the data processing apparatus 1700 of the neural network includes a processor 1710 and an interface circuit 1720 that are coupled to each other. It will be appreciated that the interface circuit 1720 may be a transceiver or an input-output interface. Optionally, the data processing apparatus 1700 of the neural network may further include a memory 1730 for storing instructions executed by the processor 1710, input data required by the processor 1710 to execute the instructions, or data generated after the processor 1710 executes the instructions. For example, the data processing apparatus of the neural network may include the host 210 shown in fig. 2, the processor 1710 may include the processor 211, and the memory 1730 is the memory 212. The above scheme is used to configure the batch sizes for the neural network chip so that the neural network can work efficiently. Although the processing of the batch sizes, graphs, and sub-graphs and the related algorithm operations in the embodiment are performed by the processor 211, the processing method may in fact be performed by other types of processors or devices; for example, another controller or processor located inside the neural network chip may execute the related scheme to complete the configuration of the neural network. In one possibility, the neural network chip may include one or more types of processors inside; such a processor may run the configuration scheme of the related neural network to obtain a suitable batch size and a suitable graph and sub-graph division, and after the parameters of the neural network are configured, the processor may run the neural network calculation, thereby implementing self-configuration. This is not limited in this embodiment.
When the data processing apparatus 1700 of the neural network is used to implement the method shown in fig. 12, the processor 1710 is configured to perform the functions of the processing unit 1620, and the interface circuit 1720 is configured to perform the functions of the obtaining unit 1610.
It is understood that the processor in the embodiments of the present application may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
In the neural network training or inference process, if the batch size of the neural network is determined according to the conventional technology, the data of some layers may be too large to be cached while the data of other layers may be too small, which reduces the data throughput of the computing unit and degrades performance. In addition, if the partitioning merely adapts to the cache size, more groups are produced, which increases the head overhead of processing each batch at each layer in the neural network and reduces the computing efficiency of the hardware running the neural network; some layers may instead choose to place part of the input/output data in an external cache, which increases the data throughput of the computing unit and yields better performance. In a training network, some operators (such as BN operators) lose precision after being split, but if they are not split, the data cannot fit in the cache and performance is reduced. In this case a trade-off between precision and performance is required: the precision loss is kept small while the performance is improved as much as possible.
The embodiment of the present application further provides a data processing method of a neural network, where the data processing method includes:
S1. The processor acquires the data amount of the input data, a first characteristic of the internal memory in the chip running the neural network, a second characteristic of the plurality of layers in the neural network, and an accuracy threshold.
The accuracy threshold is configurable by a user and indicates the minimum task processing accuracy of the neural network that is acceptable to the user.
S2. The processor determines the batch size of each of the plurality of layers according to the data amount, the first characteristic, the second characteristic, and the accuracy threshold, so that the task processing accuracy of the neural network is not less than the accuracy threshold.
Specifically, the processor may employ an optimization algorithm to determine the batch size of each of the plurality of layers according to the data amount, the first feature, the second feature, and the accuracy threshold, form a plurality of sub-graphs based on the batch size of each layer, and fuse the plurality of sub-graphs into one graph, that is, a computation graph; the neural network then processes tasks in the form of this computation graph. On the one hand, the output data of the layers in the neural network can thus be of a proper size, so that performance is not affected by output data that is too large to be stored in the cache; on the other hand, the task processing accuracy of the neural network can be kept not less than the accuracy threshold, that is, a certain accuracy is maintained while performance is ensured.
For example, when the optimization algorithm is a greedy algorithm, the data amount of the input data can be set to a plurality of different values. For each value, the batch size of each of the plurality of layers is determined, a plurality of sub-graphs are formed based on the batch size of each layer, and the sub-graphs are fused into a computation graph; that is, each candidate value of the data amount corresponds to one computation graph. The neural network can process the same task in the form of the different computation graphs to measure the resulting performance and accuracy, and an appropriate value of the data amount of the input data, that is, the batch size, is then selected, so that the neural network achieves high performance when processing the task and the task processing accuracy is not less than the accuracy threshold.
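For illustration only, the selection of a batch-size configuration under an accuracy constraint can be sketched as follows in Python; the candidate configurations, their accuracy and performance numbers, and the function name are placeholder assumptions introduced for this sketch, not measured results.

```python
def choose_batch_config(candidate_configs, accuracy_threshold):
    """Among candidate segmentations (each a computation graph with an
    estimated performance score and an estimated task-processing accuracy,
    assumed to come from profiling or evaluation runs), keep only those
    meeting the accuracy threshold and pick the highest-performing one."""
    feasible = [c for c in candidate_configs
                if c["accuracy"] >= accuracy_threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["performance"])

# Placeholder candidate values for illustration only.
configs = [
    {"batch_sizes": [32, 64, 128, 256], "accuracy": 0.762, "performance": 1.00},
    {"batch_sizes": [16, 32, 64, 128],  "accuracy": 0.766, "performance": 0.85},
]
print(choose_batch_config(configs, accuracy_threshold=0.76))
```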
1) The implementation above is further described by taking the determination of the batch size of the BN layer as an example.
In the process of training a neural network, a batch normalization (BN) layer needs to wait until its complete input data has been acquired before computation; the size of this complete input data may be called the mini-batch size. This means the complete input data must be stored in the cache before processing, and because the storage space of the cache is limited, part of the data may need to be moved to other storage space, such as off-chip storage space, causing data transfer overhead and thereby affecting performance.
Based on the foregoing embodiment, the present application proposes an implementation for determining the batch size of the BN layer, that is, re-determining the batch size of the BN layer, so as to ensure both that the size of the input data the BN layer waits for is not greater than the storage space of the cache, and that the task processing accuracy of the neural network including the BN layer reaches a certain standard, that is, is not less than the required accuracy threshold.
2) The implementation above is further described by taking the segmentation of the resnet50 network as an example.
Taking the resnet50 network as an example, according to the above method for determining the batch size of each layer, the resnet50 network is partitioned into 4 sub-graphs, as shown in fig. 18. Each sub-graph is composed of a plurality of convolution (conv) layers and BN layers, with each convolution layer followed by a BN layer, and the layers included in each sub-graph have the same batch size. Specifically, the layers in sub-graph 1 have batch size = 32, the layers in sub-graph 2 have batch size = 64, the layers in sub-graph 3 have batch size = 128, and the layers in sub-graph 4 have batch size = 256. Further, iteration = 8 indicates that the layers in sub-graph 1 execute 8 loops over input data of batch size 32 to complete the processing of input data of size 256 (32 × 8); by analogy, iteration = 4 indicates that the layers in sub-graph 2 execute 4 loops over input data of batch size 64 to complete the processing of input data of size 256 (64 × 4), iteration = 2 indicates that the layers in sub-graph 3 execute 2 loops over input data of batch size 128 to complete the processing of input data of size 256 (128 × 2), and iteration = 1 indicates that the layers in sub-graph 4 execute 1 pass over input data of batch size 256 to complete the processing of input data of size 256 (256 × 1). With this segmentation, the resnet50 network can improve its running performance as much as possible while the accuracy is ensured.
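For illustration only, the iteration counts in fig. 18 follow from dividing the global batch of 256 by each sub-graph's batch size, as in the following Python sketch (the function name is an assumption introduced for this sketch).

```python
def iterations_per_subgraph(global_batch, subgraph_batch_sizes):
    """Number of loop iterations each sub-graph executes so that every
    sub-graph ends up processing the same global batch (256 in the resnet50
    example): 256/32 = 8, 256/64 = 4, 256/128 = 2, 256/256 = 1."""
    return {size: global_batch // size for size in subgraph_batch_sizes}

print(iterations_per_subgraph(256, [32, 64, 128, 256]))
# {32: 8, 64: 4, 128: 2, 256: 1}
```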
Correspondingly, in the data processing apparatus 1600 corresponding to fig. 16, the obtaining unit 1610 is further configured to obtain an accuracy threshold, and accordingly, the processing unit 1620 is configured to determine the batch size of each of the plurality of layers according to the data amount, the first feature, the second feature and the accuracy threshold, so that the task processing accuracy of the neural network is not less than the accuracy threshold.
The method steps in the embodiments of the present application may be implemented by hardware, or by software instructions executed by a processor. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a network device or a terminal device. Of course, the processor and the storage medium may also reside as discrete components in a network device or a terminal device.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network appliance, a user device, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape; or optical media such as Digital Video Disks (DVDs); it may also be a semiconductor medium, such as a Solid State Drive (SSD).
In the embodiments of the present application, unless otherwise specified or conflicting with respect to logic, terms or descriptions in different embodiments have consistency and may be mutually cited, and technical features in different embodiments may be combined to form a new embodiment according to their inherent logic relationship.
In the present application, "at least one" means one or more, "a plurality" means two or more. It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for convenience of description and distinction and are not intended to limit the scope of the embodiments of the present application. The sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic.

Claims (18)

  1. A data processing method of a neural network, comprising:
    acquiring a data volume of input data of a neural network, a first characteristic of an internal memory within a chip operating the neural network, and a second characteristic of a plurality of layers in the neural network;
    determining a batch size of each of the plurality of layers as a function of the amount of data, the first characteristic, and the second characteristic, the batch sizes of at least two of the plurality of layers being different.
  2. The method of claim 1, wherein the first characteristic includes at least one of a distribution characteristic of the internal memory within the chip and a capacity of the internal memory, and the second characteristic includes a connection relationship between the plurality of layers and a parameter related to calculation of each of the plurality of layers.
  3. The method of claim 1 or 2, wherein determining the batch size of the plurality of layers as a function of the amount of data, the first characteristic, and the second characteristic comprises:
    determining the batch size, N sub-graphs, M graphs and the storage position of data among the layers according to the data volume, the first characteristic and the second characteristic, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or an external memory, the external memory being an off-chip memory to run the neural network, the subgraph comprising one or more layers of the same batch size, the graph comprising one or more subgraphs.
  4. The method of claim 3, wherein inter-layer data for a plurality of layers included in the subgraph is stored in the internal memory.
  5. The method of claim 3 or 4, wherein inter-layer data between the subgraphs is stored in the internal memory.
  6. The method according to any one of claims 3-5, wherein inter-graph inter-layer data is stored in the external memory.
  7. The data processing method of any one of claims 1-6, further comprising:
    acquiring a precision threshold;
    the determining a batch size for each of the plurality of layers as a function of the amount of data, the first characteristic, and the second characteristic includes:
    determining a batch size for each of the plurality of layers as a function of the amount of data, the first feature, the second feature, and the accuracy threshold such that a task processing accuracy of the neural network is not less than the accuracy threshold.
  8. The method according to any one of claims 1 to 7, wherein the batch corresponding to the batch size is one picture, a plurality of pictures or partial images in one picture.
  9. A data processing apparatus of a neural network, comprising:
    an acquisition unit configured to acquire a data amount of input data of a neural network, a first feature of an internal memory within a chip that operates the neural network, and a second feature of a plurality of layers in the neural network;
    a processing unit, configured to determine a batch size of each of the plurality of layers according to the data volume, the first characteristic, and the second characteristic, where at least two of the plurality of layers have different batch sizes.
  10. The apparatus of claim 9, wherein the first characteristic includes at least one of a distribution characteristic of the internal memory within the chip and a capacity of the internal memory, and wherein the second characteristic includes a connection relationship between the plurality of layers and a parameter related to calculation of each of the plurality of layers.
  11. The apparatus according to claim 9 or 10, wherein the processing unit is specifically configured to:
    determining the batch size of the layers, N subgraphs, M graphs and the storage position of the data among the layers according to the data volume, the first characteristic and the second characteristic, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or the external memory, the external memory is an off-chip memory for operating the neural network, the subgraph comprises one or more layers of the same batch size, and the graph comprises one or more subgraphs.
  12. The apparatus of claim 11, wherein inter-layer data for a plurality of layers included in the subgraph is stored in the internal memory.
  13. The apparatus of claim 11 or 12, wherein inter-layer data between the subgraphs is stored in the internal memory.
  14. The apparatus of any of claims 11-13, wherein inter-graph inter-layer data is stored in the external memory.
  15. The apparatus according to any one of claims 9-14, wherein said obtaining unit is further configured to obtain a precision threshold;
    the processing unit is specifically configured to: determining a batch size for each of the plurality of layers as a function of the amount of data, the first feature, the second feature, and the accuracy threshold such that a task processing accuracy of the neural network is not less than the accuracy threshold.
  16. The apparatus according to any one of claims 9-15, wherein the batch size corresponds to a batch that is a picture, a plurality of pictures, or a portion of a picture.
  17. A data processing apparatus of a neural network, comprising: at least one processor and a memory, wherein the memory is for storing a computer program such that the computer program, when executed by the at least one processor, implements the data processing method of the neural network of any one of claims 1-8.
  18. A computer-readable storage medium, characterized in that a computer program or instructions are stored in the storage medium, which, when executed by a data processing apparatus of a neural network, implements the data processing method of the neural network according to any one of claims 1 to 8.
CN202180037755.7A 2020-05-30 2021-01-26 Data processing method and device of neural network Pending CN115668222A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CNPCT/CN2020/093624 2020-05-30
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus

Publications (1)

Publication Number Publication Date
CN115668222A true CN115668222A (en) 2023-01-31

Family

ID=78831421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180037755.7A Pending CN115668222A (en) 2020-05-30 2021-01-26 Data processing method and device of neural network

Country Status (2)

Country Link
CN (1) CN115668222A (en)
WO (2) WO2021243489A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382880B (en) * 2023-06-07 2023-08-11 成都登临科技有限公司 Task execution method, device, processor, electronic equipment and storage medium
CN117785492B (en) * 2024-02-28 2024-05-17 上海燧原智能科技有限公司 Operator segmentation method determining method, device, equipment and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477925B2 (en) * 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition
US10083395B2 (en) * 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
US10789544B2 (en) * 2016-04-05 2020-09-29 Google Llc Batching inputs to a machine learning model
JP2018018451A (en) * 2016-07-29 2018-02-01 富士通株式会社 Machine learning method, machine learning program and information processing device
CN108268941B (en) * 2017-01-04 2022-05-31 意法半导体股份有限公司 Deep convolutional network heterogeneous architecture
US20180341852A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
US11562213B2 (en) * 2018-04-17 2023-01-24 Intel Corporation Methods and arrangements to manage memory in cascaded neural networks
CN109492754A (en) * 2018-11-06 2019-03-19 深圳市友杰智新科技有限公司 One kind is based on deep neural network model compression and accelerated method

Also Published As

Publication number Publication date
WO2021244045A1 (en) 2021-12-09
WO2021243489A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
WO2021057713A1 (en) Method for splitting neural network model by using multi-core processor, and related product
CN109102065B (en) Convolutional neural network accelerator based on PSoC
US10768856B1 (en) Memory access for multiple circuit components
US11294599B1 (en) Registers for restricted memory
JP7366274B2 (en) Adaptive search method and device for neural networks
JP7053775B2 (en) Network-on-chip data processing methods and equipment
CN110738316B (en) Operation method and device based on neural network and electronic equipment
KR102572757B1 (en) Modifying machine learning models to improve locality
CN111465943B (en) Integrated circuit and method for neural network processing
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN109472361B (en) Neural network optimization method
US20230281973A1 (en) Neural network model training method, image processing method, and apparatus
CN115668222A (en) Data processing method and device of neural network
US11853866B2 (en) Implementation of a neural network in multicore hardware
KR20210148586A (en) Scheduler, method for operating the same and accelerator system including the same
JP2022137247A (en) Processing for a plurality of input data sets
CN109740619B (en) Neural network terminal operation method and device for target recognition
US20230143270A1 (en) Apparatus and method with scheduling
WO2021120036A1 (en) Data processing apparatus and data processing method
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
WO2023115529A1 (en) Data processing method in chip, and chip
RamaDevi et al. Machine learning techniques for the energy and performance improvement in Network-on-Chip (NoC)
US20220215234A1 (en) Determining schedules for processing neural networks on hardware
WO2021237755A1 (en) Neural network scheduling method and apparatus
CN116894462A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination