WO2021243489A1 - Data processing method and apparatus for neural network - Google Patents

Data processing method and apparatus for neural network

Info

Publication number
WO2021243489A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
neural network
data
batch
inter
Prior art date
Application number
PCT/CN2020/093624
Other languages
French (fr)
Chinese (zh)
Inventor
袁宏辉
高山青
高立稳
熊乐进
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2020/093624 priority Critical patent/WO2021243489A1/en
Priority to PCT/CN2021/073691 priority patent/WO2021244045A1/en
Priority to CN202180037755.7A priority patent/CN115668222A/en
Publication of WO2021243489A1 publication Critical patent/WO2021243489A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • This application relates to the field of artificial intelligence (AI), and in particular to a neural network data processing method and device.
  • the performance of the processor continues to improve.
  • the computer system is equipped with a multi-level cache structure with higher bandwidth and smaller capacity.
  • After each layer of the neural network has processed its input data, the result enters the next layer. If the amount of input data is large, the inter-layer data of multiple layers of the neural network may also be too large for the cache to store, so the inter-layer data has to be stored in an external memory. Because the cache cannot be used effectively, the computational efficiency of the processor is reduced.
  • the traditional technology groups the input data according to the inter-layer data caching requirements of each layer to obtain multiple sets of batches of the same batch size.
  • the batch size is limited by the largest cache demand.
  • the neural network processes one set of batches before processing the next set of batches. By reducing the data processed by each layer in the neural network, the inter-layer data is reduced, and the inter-layer data is stored in the cache as much as possible.
  • the size of the inter-layer data differs from layer to layer. For example, if a layer in the neural network enlarges a picture, the generated inter-layer data is relatively large; if a layer shrinks a picture, the generated inter-layer data is relatively small. Under a single batch size, for layers that output smaller inter-layer data, the inter-layer data is small and much of the cache capacity remains unused; for layers that output larger inter-layer data, the inter-layer data is large, little cache capacity remains, and the cache may not be able to store the inter-layer data.
  • the utilization rate of the cache is still low, which affects the computational efficiency of the hardware running the neural network.
  • moreover, reducing the batch size increases the head overhead of processing each batch in each layer of the neural network, which in turn reduces the computational efficiency of the hardware running the neural network. Therefore, how to improve the utilization rate of the cache while ensuring the computational efficiency of the hardware running the neural network is an urgent problem to be solved.
  • the present application provides a neural network data processing method and device, which can improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network.
  • this application adopts the following technical solutions.
  • this application provides a neural network data processing method.
  • the method includes: the processor groups the input data according to the data amount of the input data, a first feature of the internal memory in the chip running the neural network, and a second feature of multiple layers of the neural network, and determines the batch size of each layer in the neural network, such that among the batch sizes of the multiple layers, the batch sizes of at least two layers are different.
  • the batch size of each layer in a neural network is different.
  • a neural network includes layers of the same batch size and layers of different batch sizes.
  • the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory.
  • the second feature includes the connection relationship between the plurality of layers and the calculation-related parameters of each of the plurality of layers.
  • the batch corresponding to the batch size is one picture, multiple pictures, or part of the image in one picture.
  • the so-called internal memory refers to the memory in the chip running the neural network.
  • the memory on the chip that runs the neural network is a cache.
  • the so-called external memory refers to the memory outside the chip that runs the neural network.
  • Internal memory can also be called on-chip memory.
  • External memory can also be called off-chip memory.
  • the neural network data processing method comprehensively refers to the data amount of the input data, the first feature, and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. By setting a reasonable batch size for each layer, the internal memory is fully used to store the inter-layer data of the neural network during inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
  • determining the batch sizes of the multiple layers according to the data amount, the first feature, and the second feature includes: determining, according to the data amount, the first feature, and the second feature, the batch size of each layer, N subgraphs, M graphs, and the storage locations of the inter-layer data.
  • N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M.
  • the storage location of the inter-layer data includes at least one of an internal memory or an external memory.
  • the inter-layer data of the multiple layers included in a subgraph is stored in the internal memory.
  • the inter-layer data between subgraphs is stored in the internal memory.
  • the inter-layer data between graphs is stored in the external memory.
  • a subgraph contains one or more layers of the same batch size.
  • the number of layers included in different subgraphs may be the same or different.
  • a subgraph may also be referred to as a first-type layer group.
  • a graph includes one or more subgraphs.
  • the number of subgraphs contained in different graphs may be the same or different.
  • a graph may also be referred to as a second-type layer group.
  • the processor may use an iterative algorithm to determine, based on the data amount, the first feature, and the second feature, the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data. It is understandable that the processor does not obtain the batch sizes, the N subgraphs, the M graphs, and the storage locations of the inter-layer data in a single calculation based on the data amount, the first feature, and the second feature. Instead, it uses an iterative algorithm to run multiple iterative trials and, from the multiple trial results, selects the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data that ensure the utilization of the internal memory and the computational efficiency of the chip running the neural network.
  • the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm or a genetic algorithm.
  • the basic idea of the dynamic programming algorithm is to decompose the problem to be solved into several sub-problems, solve the sub-problems first, and then obtain the solution of the original problem from the solutions of these sub-problems.
  • the basic idea of the greedy algorithm is to proceed step by step from an initial solution of the problem. According to an optimization measure, each step must ensure that a local optimal solution can be obtained. Only one piece of data is considered in each step, and its selection should satisfy the condition of local optimization. If the next piece of data together with the partial optimal solution is no longer a feasible solution, that data is not added to the partial solution; the algorithm stops when all the data has been enumerated or no more data can be added.
  • Genetic algorithm is a type of algorithm designed based on the evolutionary laws of the biological world, and is used to simulate natural evolution to search for the optimal solution.
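  • As an illustration only, the following sketch shows one possible greedy-style grouping pass under simplified assumptions: layers are described by a hypothetical (batch_size, cache_demand) pair, consecutive layers with the same batch size are merged into subgraphs, and consecutive subgraphs are merged into a graph while the summed cache demand fits in the internal memory capacity. The data structures and numbers are invented for illustration and are not the algorithm claimed by the application.

```python
# Minimal sketch of a greedy grouping pass (hypothetical data structures).
# Each layer is described by a name, a batch size, and a cache demand.

def split_into_subgraphs(layers):
    """Merge consecutive layers that share the same batch size into subgraphs."""
    subgraphs = []
    for layer in layers:
        if subgraphs and subgraphs[-1][0]["batch_size"] == layer["batch_size"]:
            subgraphs[-1].append(layer)
        else:
            subgraphs.append([layer])
    return subgraphs

def merge_into_graphs(subgraphs, internal_memory_capacity):
    """Greedily merge consecutive subgraphs while the summed cache demand fits."""
    graphs = []
    for sub in subgraphs:
        demand = sum(l["cache_demand"] for l in sub)
        if graphs and graphs[-1]["demand"] + demand <= internal_memory_capacity:
            graphs[-1]["subgraphs"].append(sub)
            graphs[-1]["demand"] += demand
        else:
            graphs.append({"subgraphs": [sub], "demand": demand})
    return graphs

if __name__ == "__main__":
    layers = [
        {"name": "layer0", "batch_size": 1, "cache_demand": 30},
        {"name": "layer1", "batch_size": 1, "cache_demand": 30},
        {"name": "layer2", "batch_size": 2, "cache_demand": 20},
        {"name": "layer3", "batch_size": 2, "cache_demand": 20},
    ]
    subgraphs = split_into_subgraphs(layers)
    graphs = merge_into_graphs(subgraphs, internal_memory_capacity=100)
    print(len(subgraphs), "subgraphs,", len(graphs), "graph(s)")
```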
  • the scheduling order of the layers in a graph is determined according to the scheduling order of the subgraphs contained in the graph and the scheduling order of the layers within each subgraph.
  • the scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network. For example, batches corresponding to the batch size of the layers included in the sub-picture are processed in the order of the layers included in the sub-picture.
  • the scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph.
  • the inter-layer data of the sub-graphs contained in the graph are aggregated or scattered.
  • the embodiment of the present application also provides a neural network data processing device, and the beneficial effects can be referred to the description of the first aspect and will not be repeated here.
  • the data processing device of the neural network has the function of realizing the behavior of the processor in the method example of the first aspect described above.
  • the functions can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions.
  • the data processing device of the neural network includes: an acquisition unit and a processing unit.
  • the acquiring unit is used to acquire the data amount of the input data of the neural network, the first feature of the internal memory in the chip running the neural network, and the second feature of multiple layers in the neural network.
  • the processing unit is configured to determine the batch size of each layer in the multiple layers according to the data amount, the first characteristic, and the second characteristic, and the batch sizes of at least two layers in the multiple layers are different.
  • the neural network data processing device may be a processor, for example, a graphics processing unit (GPU), a neural-network processing unit (NPU), or an advanced RISC machine (ARM) processor.
  • the neural network data processing device also includes a memory.
  • the memory is used to store computer programs or instructions
  • the processor is coupled with the memory.
  • a computer program product includes: computer program code, which when the computer program code runs, causes the method executed by the processor in the first aspect to be executed.
  • the present application provides a chip system, the chip system includes a processor, and is configured to implement the function of the processor in the method of the first aspect.
  • the chip system further includes a memory for storing at least one of program instructions or data.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed, the method executed by the processor in the first aspect described above is implemented.
  • the names of the processor and the data processing device of the neural network do not constitute a limitation on the device itself. In actual implementation, these devices may appear under other names. As long as the function of each device is similar to that of this application, it falls within the scope of the claims of this application and its equivalent technologies.
  • FIG. 1 is a schematic diagram of the principle of a neural network provided by an embodiment of this application.
  • FIG. 2 is a schematic structural diagram of a neural network system provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the structure of a neural network chip provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a processing device provided by an embodiment of this application.
  • FIG. 5 is a schematic diagram of the structure of layers in a neural network provided by an embodiment of this application.
  • FIG. 6 is a schematic diagram of the overlap problem provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a sub-picture provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of a diagram provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between sub-pictures according to an embodiment of the application.
  • FIG. 10 is a schematic diagram of dispersing processing of inter-layer data between sub-pictures provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application.
  • FIG. 12 is a flowchart of a neural network data processing method provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 15 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 16 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
  • FIG. 17 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
  • a neural network may also be called an artificial neural network (ANN) or a neural-like network.
  • a neural network is a mathematical model or calculation model that imitates the structure and function of a biological neural network (an animal's central nervous system, especially the brain), and is used to estimate or approximate functions.
  • Neural networks can include convolutional neural network (convolutional neural network, CNN), deep neural network (deep neural network, DNN), multilayer perceptron (multilayer perceptron, MLP) and recurrent neural network (recurrent neural network, RNN), etc.
  • a neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs. The output of this arithmetic unit satisfies the following formula (1):
  • h_{W,b}(x) = f(W^T·x) = f(∑_{s=1}^{n} W_s·x_s + b)    (1)
  • where s = 1, 2, ..., n, and n is a natural number greater than 1;
  • W_s is the weight of x_s;
  • b is the bias of the neural unit;
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function.
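  • A minimal numerical sketch of the neural unit described by formula (1), assuming a sigmoid activation; the weights, inputs, and bias below are arbitrary illustrative values.

```python
import math

def neural_unit(x, w, b):
    """Compute f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = sum(ws * xs for ws, xs in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

print(neural_unit(x=[0.5, -1.2, 3.0], w=[0.8, 0.1, -0.4], b=0.2))
```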
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • the neural network 100 has N processing layers, where N ≥ 3 and N is a natural number.
  • the first layer of the neural network is the input layer 110, which is responsible for receiving input signals
  • the last layer of the neural network is the output layer 130, which outputs the processing results of the neural network.
  • the other layers excluding the first and last layers are intermediate layers 140. These intermediate layers 140 collectively form the hidden layer 120.
  • Each intermediate layer 140 in the hidden layer 120 can receive input signals and output signals.
  • the hidden layer 120 is responsible for the processing of the input signal.
  • Each layer represents a logic level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
  • the input signal of the neural network may be a signal in various forms such as a video signal, a voice signal, a text signal, an image signal, and a temperature signal.
  • the processed image signal may be various sensor signals such as a landscape signal taken by a camera (image sensor), an image signal of a community environment captured by a display monitoring device, and a facial signal of a human face obtained by an access control system.
  • the input signal of the neural network also includes various other engineering signals that can be processed by computers, which will not be listed here. If the neural network is used for deep learning of the image signal, the image quality can be improved.
  • Deep neural network is also called multi-layer neural network, which can be understood as a neural network with multiple hidden layers.
  • the deep neural network is divided according to the position of different layers.
  • the neural network inside the deep neural network can be divided into three categories: input layer, hidden layer and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the number of layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the i+1-th layer.
  • the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L.
  • Convolutional neural network is a deep neural network with convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels.
  • Sharing weight can be understood as the way of extracting image information has nothing to do with location.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
  • the neural network system 200 includes a host 210 and a neural network circuit 220.
  • the neural network circuit 220 is connected to the host 210 through a host interface.
  • the host interface may include a standard host interface and a network interface (network interface).
  • the host interface may include a peripheral component interconnect express (PCIe) interface.
  • the neural network circuit 220 may be connected to the host 210 through the PCIe bus 230. Therefore, data can be input to the neural network circuit 220 via the PCIe bus 230, and data processed by the neural network circuit 220 can be received via the PCIe bus 230.
  • the host 210 can also monitor the working status of the neural network circuit 220 through the host interface.
  • the host 210 includes a processor 211 and a memory 212. It should be noted that, in addition to the devices shown in FIG. 2, the host 210 may also include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited here. The host 210 can be considered as an integrated circuit or an independent device.
  • the processor 211 is the computing core and control unit of the host 210.
  • the processor 211 may include multiple processor cores (cores).
  • the processor 211 may be a very large-scale integrated circuit.
  • An operating system and other software programs are installed in the processor 211, so that the processor 211 can implement access to the memory 212, cache, disk, and peripheral devices (such as the neural network circuit in FIG. 2).
  • the processor core in the processor 211 may be a central processing unit (CPU), or may also be other application specific integrated circuits (ASICs).
  • the memory 212 is the main memory of the host 210.
  • the memory 212 is connected to the processor 211 via a double data rate (DDR) bus.
  • the memory 212 is generally used to store various running software in the operating system, input and output data, and information exchanged with an external memory. In order to increase the access speed of the processor 211, the memory 212 needs to have the advantage of fast access speed. In a traditional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 212.
  • the processor 211 can access the memory 212 at a high speed through a memory controller (not shown in FIG. 2), and perform a read operation and a write operation on any storage unit in the memory 212.
  • the neural network circuit 220 may be a chip that runs a neural network.
  • the neural network circuit 220 is a chip array composed of a plurality of neural network chips.
  • the neural network circuit 220 includes a plurality of neural network chips 221 for data processing and a plurality of routers 222.
  • the neural network chip 221 is referred to as the chip 221 for short in the embodiment of the present application.
  • the multiple chips 221 are connected to each other through a router 222.
  • one chip 221 may be connected to one or more routers 222.
  • Multiple routers 222 can form one or more network topologies.
  • the chips 221 can transmit data through the multiple network topologies described above.
  • the neural network circuit 220 may also include other devices such as a memory 223, an input port 224, and an output port 225.
  • the memory is used to store data, computer programs and instructions.
  • FIG. 3 is a schematic structural diagram of a neural network chip provided by an embodiment of the application.
  • the chip 221 includes a plurality of routers 310, and each router 310 can be connected to a tile 320. In practical applications, one router 310 can also connect multiple tiles 320.
  • each tile 320 may include an input/output interface (TxRx) 321, a switching device 322, multiple processing elements (PE) 323, and a memory 324.
  • the input/output interface 321 is used to receive data input from the router 310 to the tile 320, or to output the calculation result of the tile 320. To put it another way, the input/output interface 321 is used to implement data transmission between the tile 320 and the router 310.
  • the switching device 322 connects the input/output interface 321 and a plurality of processing devices 323.
  • the switching device 322 is used to implement data transmission between the input/output interface 321 and the multiple processing devices 323.
  • the memory 324 is used to store data, computer programs and instructions.
  • Each tile 320 may also include a controller 325, which is used to control the input/output interface 321 and multiple processing devices 323 to make the system work normally.
  • Each processing device 323 may include one or more computing engines 326.
  • One or more calculation engines 326 are used to implement neural network calculations on the data input to the calculation engine 326. For example, the data input to the tile 320 and the preset convolution kernel in the tile 320 may be multiplied and added.
  • the calculation result of the calculation engine 326 can be sent to other tiles 320 through the switching device 322 and the input/output interface 321.
  • a calculation engine 326 may include modules that implement convolution, pooling, or other neural network operations.
  • the specific circuit or function of the calculation engine 326 is not limited.
  • the calculation engine is referred to as engine for short.
  • FIG. 4 is a schematic structural diagram of a processing device provided by an embodiment of this application.
  • the processing device 323 may also include a controller 327 and a bus 328.
  • the controller 327 is used for receiving data, and scheduling one or more engines 326 in the processing device 323 to process the data, so that the system works normally.
  • the multiple engines 326 perform data transmission through the bus 328.
  • the engine 326 is connected to one or more exclusive memories 3210.
  • multiple engines 326 may also share one or more memories 329.
  • the memory in the neural network circuit 220 may be a cache memory, that is, a cache.
  • the memory 223, the memory 324, the memory 329, and the memory 3210 may all be cache memories.
  • the cache memory in the neural network circuit 220 is composed of a static random access memory (SRAM), which has a relatively small capacity but a speed much higher than that of the main memory, which is close to the speed of the CPU.
  • the cache memory may be an L1 level cache memory, an L2 level cache memory, or an L3 level cache memory.
  • the memory 3210 is an L1 level cache memory.
  • the memory 329 is an L2 level cache memory or an L3 level cache memory.
  • the memory 223 is an L2 level cache memory or an L3 level cache memory.
  • the memory 324 is an L2 level cache memory or an L3 level cache memory.
  • the neural network circuit 220 provided by the embodiment of the present application includes a plurality of neural network chips 221, each neural network chip 221 includes a plurality of tiles 320, each tile 320 includes a plurality of processing devices 323, and each processing device 323 includes a plurality of engines 326.
  • the neural network system may include multi-level computing nodes, for example, four levels of computing nodes: the first-level computing node is the chip 221, the second-level computing node is the tile 320 in the chip 221, the third-level computing node is the processing device 323 in the tile 320, and the fourth-level computing node is the engine 326 in the processing device 323.
  • the neural network system provided by the embodiments of the present application can be applied to a mobile terminal, a monitoring terminal, or a server, etc., to implement related neural network operations.
  • the neural network includes multiple neural network layers.
  • the neural network layer is a logical layer concept, and a neural network layer refers to a neural network operation to be performed once.
  • the neural network may include n neural network layers (also called n-layer neural network), where n is an integer greater than or equal to 2.
  • the first neural network layer and the second neural network layer may be two of the n layers that have a dependency relationship in operation.
  • two neural network layers with a dependency relationship means that the input data of one neural network layer includes the output data of the other neural network layer.
  • Two neural network layers with dependencies can also be referred to as adjacent layers.
  • the input of a neural network layer may come from more than one neural network layer, and may come from the previous m neural network layers; similarly, the output of a neural network layer may not only be output to the next neural network layer, but may also be output to the next m neural network layers.
  • FIG. 5 shows part of the neural network layers in the neural network.
  • the neural network layers may include convolutional layers, pooling layers, and so on.
  • the neural network 500 may include a first layer 502, a second layer 504, a third layer 506, a fourth layer 508, and a fifth layer 510 to an nth layer 512.
  • the first layer 502 can perform a convolution operation
  • the second layer 504 can perform a pooling operation on the output data of the first layer 502
  • the third layer 506 can perform a convolution operation on the output data of the second layer 504.
  • the fourth layer 508 can perform a convolution operation on the output result of the third layer 506, and the fifth layer 510 can perform a summation operation on the output data of the second layer 504 and the output data of the fourth layer 508, and so on.
  • Figure 5 is only a simple example and description of the neural network layers in the neural network, and does not limit the specific operations of each layer of the neural network.
  • the fourth layer 508 can also be a pooling operation.
  • the fifth layer 510 may also perform other neural network operations such as a convolution operation or a pooling operation.
  • the output data of the first layer 502 is the input data of the second layer 504. Therefore, the first layer 502 and the second layer 504 have a dependency relationship.
  • the output data of the second layer 504 is the input data of the third layer 506, and the second layer 504 and the third layer 506 have a dependency relationship.
  • the output data of the third layer 506 is the input data of the fourth layer 508, and the third layer 506 and the fourth layer 508 have a dependency relationship.
  • the input data of the fifth layer 510 includes the output data of the second layer 504 and the output data of the fourth layer 508. Therefore, the second layer 504 and the fifth layer 510 also have a dependency relationship, and the fourth layer 508 and the fifth layer 510 also Have dependencies.
  • Each layer of calculation in the neural network is realized by a computing node.
  • the computing nodes in the neural network system can be divided with the granularity of chips, tiles, processing devices, or engines according to actual application conditions, so that computing nodes in different sets are used to process operations of different neural network layers.
  • the computing node referred to in the embodiment of the present application may be a chip 221, a tile 320, a processing device 323, or an engine 326.
  • the computing node reloads the calculation result of the i-th layer and the weight of the i+1th layer from the preset cache for calculation.
  • the i-th layer is any layer in the neural network.
  • the output data (inter-layer data) of the second layer 504 is temporarily stored in the preset memory 329; when the fifth layer 510 is executed, the computing node reloads the calculation result of the second layer 504 and the weight of the fifth layer 510 from the preset memory 329 for calculation.
  • for different computing nodes, the preset cache is also different.
  • the preset cache may be the memory 329 or the memory 3210.
  • the preset cache may be the memory 324.
  • the preset cache may be a memory in the tile 320.
  • the preset cache may be the memory 223.
  • the memory outside the neural network circuit 220 is called an external memory.
  • the external memory is the memory 212 shown in FIG. 2.
  • the memory in the neural network circuit 220 is called an internal memory.
  • the internal memory is the memory 223 shown in FIG. 2.
  • the internal memory is the memory 324 shown in FIG. 3.
  • the internal memories are the memory 329 and the memory 3210 shown in FIG. 4.
  • the so-called external memory refers to the memory outside the chip that runs the neural network.
  • the external memory may be a magnetic disk or the memory 212 shown in FIG. 2.
  • the amount of data that can be processed by each layer in the neural network is the batch size corresponding to that layer.
  • the batch corresponding to the batch size can be one picture, multiple pictures, or a partial image of one picture. For example, suppose the capacity of the internal memory is 100. If layer 1 (L1) has a cache requirement of 60 for processing 1 picture, then each time layer 1 is scheduled it can process at most 1 picture, and the batch size corresponding to layer 1 is 1 picture. If the cache requirement of layer 2 for processing 1 picture is 30, then each time layer 2 is scheduled it can process at most 3 pictures, and the batch size corresponding to layer 2 is 3 pictures.
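  • A worked sketch of the example above, assuming the batch size of a layer is simply the largest number of pictures whose cache demand fits within the internal memory capacity.

```python
def max_batch(internal_capacity, demand_per_picture):
    """Largest number of pictures a layer can process per schedule without
    exceeding the internal memory capacity (simplified illustration)."""
    return internal_capacity // demand_per_picture

capacity = 100
print(max_batch(capacity, 60))  # layer 1: 1 picture
print(max_batch(capacity, 30))  # layer 2: 3 pictures
```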
  • the batch size not only affects the usage of the internal memory of the chip running the neural network, but also affects the optimization degree and processing speed of the neural network.
  • the convolutional layer can use a filling algorithm to process the input data of a non-integral image. That is, before the calculation by the convolution kernel, the size of the input data is artificially increased by means of the filling algorithm to offset the influence caused by the size shrinkage in the calculation.
  • the filling algorithm can be, for example, zero filling, repeated boundary value filling, or other methods. That is to say, if the input data is non-integral image data, it is necessary to process the input data with a filling algorithm; if the input data is an entire image data, it is not necessary to use the filling algorithm to process the input data.
  • the input data needs to be filled first, and then flattened.
  • when the stride of the convolution kernel is smaller than the side length of the convolution kernel (the kernel is usually square), adjacent convolution windows overlap; when the stride of the convolution kernel equals the side length of the convolution kernel, there is no overlap.
  • if the input data size is (w*w), the filled data size is (w+k-s)*(w+k-s), where k represents the side length of the convolution kernel, s represents the stride of the convolution kernel movement, and the amount of filling is (k-s).
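  • A small sketch of this relation: for a w*w input, a k*k convolution kernel, and stride s, the filled side length is w+k-s.

```python
def filled_size(w, k, s):
    """Side length of the input after filling, for a k*k kernel moving with stride s."""
    return w + k - s

# Example: 3*3 kernel, stride 1 -> 2 extra rows/columns of filling.
print(filled_size(w=56, k=3, s=1))  # 58
```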
  • the layers in a neural network include layer 0, layer 1, layer 2, and layer 3.
  • the size of each convolution kernel is 3*3 and the stride of the convolution kernel movement is 1; since the stride is smaller than the side length of the convolution kernel, there is an overlap problem when the filling algorithm is used to process the input data.
  • the size of the whole picture is 56*56, and the rows of the picture are divided into 4 parts for processing. If layer 0, layer 1, and layer 2 are scheduled as one layer group, it is necessary to ensure that layer 2 outputs 14 rows of data, that is, the output data size of the layer group is 14*56, so that layer 3 can process a quarter of the picture rows.
  • the input data of layer 2 needs to be filled with 2 rows of data, that is, the input data size of layer 2 is 16*56.
  • the input data size corresponding to layer 1 is 18*56.
  • the input data size corresponding to layer 0 is 20*56. That is to say, in the process of segmenting the whole picture, in order to guarantee the output data size, the cache demand of the layers in the layer group increases. Moreover, the more layers there are in the layer group, the more data the earlier layers need to be filled with. If the internal memory capacity is small, the size of the layer group is limited.
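  • The row counts in this example can be reproduced by walking backwards from the required output of the layer group; the helper below is a hypothetical illustration, not part of the application.

```python
def required_input_rows(output_rows, num_layers, k=3, s=1):
    """Rows of input each layer needs, walking backwards through a layer group
    of convolution layers with kernel side k and stride s (overlap of k - s rows).
    The result is ordered from the last layer of the group back to the first."""
    rows = output_rows
    needed = []
    for _ in range(num_layers):
        rows += k - s          # each layer needs k - s extra rows of input
        needed.append(rows)
    return needed

# Layer group {layer 0, layer 1, layer 2} must output 14 rows of a 56-column picture.
print(required_input_rows(output_rows=14, num_layers=3))  # [16, 18, 20]
```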
  • a neural network includes multiple layers, which can be described as a neural network including multiple layers arranged in a directed graph, and each layer can have a corresponding set of parameters.
  • the subgraph is obtained by dividing the layers included in the neural network according to the batch size of each layer.
  • the subgraph contains one or more layers of the same batch size.
  • the subgraph can also be described as a super layer or a layer group, etc., which means that it contains one layer or continuous multiple layers in the neural network.
  • the neural network is scheduled to process the input data with the sub-graph as a unit, and the scheduling order of the layers in the sub-graph is the same as the scheduling order of the layers in the neural network.
  • the batches corresponding to the batch size of the layers contained in the subgraph are processed in the order of the layers contained in the subgraph.
  • the inter-layer data of the multiple layers included in the subgraph is stored in the internal memory.
  • the inter-layer data between subgraphs is stored in the internal memory.
  • FIG. 7 is a schematic diagram of a subgraph provided in an embodiment of this application.
  • the subgraph includes layer 0 and layer 1.
  • the batch size of layer 0 and the batch size of layer 1 are both 1.
  • a batch corresponding to a batch size of 1 can be one picture, multiple pictures, or part of images in one picture.
  • layer 0 processes one batch at a time.
  • layer 1 processes one batch at a time.
  • layer 0 and layer 1 in the subgraph process batch A0 and batch A1.
  • Batch A0 and batch A1 may be batches in the input data to be processed by the neural network.
  • batch A0 and batch A1 may be inter-layer data that has been processed by layers in the neural network.
  • the batch size of batch A0 and batch A1 are both 1.
  • the execution sequence of the processing batches in the subgraph is shown by the bold arrows in the figure. For ease of understanding, the layer 0 and layer 1 processing batch A0 and batch A1 are separately shown.
  • layer 0 processes batch A0 first to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then, layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1.
  • the inter-layer data C0 and the inter-layer data C1 can be stored in the internal memory.
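  • A minimal sketch of the schedule in FIG. 7, using placeholder layer functions that only rename the data so the execution order is visible.

```python
def layer0(batch):
    return f"B{batch[-1]}"   # placeholder: A0 -> B0, A1 -> B1

def layer1(data):
    return f"C{data[-1]}"    # placeholder: B0 -> C0, B1 -> C1

internal_memory = []
for batch in ["A0", "A1"]:          # batch size of layer 0 and layer 1 is 1
    b = layer0(batch)               # inter-layer data B0 / B1
    c = layer1(b)                   # inter-layer data C0 / C1
    internal_memory.append(c)       # C0, C1 stay in internal memory

print(internal_memory)              # ['C0', 'C1']
```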
  • the graph includes one or more subgraphs. Among them, the graph can also be described as a super layer or a layer group, which means a layer or a continuous multi-layer in the neural network.
  • each subgraph in the graph contains layers that can handle the same batch size.
  • the hypothetical graph includes sub-graph 1 and sub-graph 2.
  • subgraph 1 includes layer 0 and layer 1
  • the batch size of layer 0 is the same as the batch size of layer 1.
  • Subgraph 2 includes layer 2 and layer 3.
  • the batch size of layer 2 is the same as the batch size of layer 3.
  • the batch size of layer 0 and the batch size of layer 1 are both one batch.
  • the batch size of layer 2 and the batch size of layer 3 are both one batch.
  • all sub-graphs included in the graph contain layers of the same batch size.
  • At least two sub-graphs in all sub-graphs included in the graph include layers of different batch sizes.
  • suppose the graph includes subgraph 1, subgraph 2, and subgraph 3.
  • subgraph 1 includes layer 0 and layer 1
  • the batch size of layer 0 is the same as the batch size of layer 1.
  • Subgraph 2 includes layer 2 and layer 3.
  • the batch size of layer 2 is the same as the batch size of layer 3.
  • Sub-figure 3 includes layer 4 and layer 5, and the batch size of layer 4 is the same as the batch size of layer 5.
  • the batch size of layer 0 and the batch size of layer 1 are both one batch.
  • the batch size of layer 2 and the batch size of layer 3 are both one batch.
  • the batch size of layer 4 and the batch size of layer 5 are both two batches.
  • the batch size of the layers included in subgraph 3 of the graph is different from the batch size of the layers included in subgraph 1.
  • the batch size of the layers included in subgraph 3 of the graph is different from the batch size of the layers included in subgraph 2.
  • the neural network is scheduled to process input data in units of graphs, and the scheduling order of the layers in the graph is the same as the scheduling order of the layers in the neural network.
  • the scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph.
  • a part of the data is retained in the cache space of the internal memory, thereby generating additional internal memory cache requirements.
  • the inter-layer data between the pictures is stored in the external memory.
  • the scheduling process of the layers in the graph involves aggregation and scattering problems, which are described below.
  • the inter-layer data between the sub-graphs contained in the graph is aggregated.
  • FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between subgraphs provided in an embodiment of this application.
  • the graph includes subgraph 0 and subgraph 1.
  • Subgraph 0 includes layer 0 and layer 1.
  • the batch size of layer 0 and the batch size of layer 1 are both 1.
  • layer 0 processes one batch at a time.
  • layer 1 processes one batch at a time.
  • Sub-figure 1 includes layer 2 and layer 3.
  • the batch size of layer 2 and the batch size of layer 3 are both 2.
  • a batch corresponding to a batch size of 2 can be two pictures, multiple pictures, or partial images in one picture.
  • Layer 2 processes two batches at a time.
  • Layer 3 processes two batches at a time.
  • the graph processes batch A0 and batch A1.
  • Batch A0 and batch A1 may be batches in the input data to be processed by the neural network.
  • Batch A0 and batch A1 may be inter-layer data that has been processed by layers in the neural network.
  • the batch sizes of batch A0 and batch A1 are both 1. Layer 0 and layer 1 included in subgraph 0 process one batch at a time, while layer 2 and layer 3 included in subgraph 1 process two batches at a time. Therefore, after subgraph 0 has processed batch A0 and batch A1 respectively, subgraph 1 can process the inter-layer data of batch A0 and the inter-layer data of batch A1 output by subgraph 0 together.
  • the execution sequence of the processing batches in the figure is shown by the bold arrows in the figure. For ease of understanding, the layer 0 and layer 1 processing batch A0 and batch A1 are separately shown.
  • layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then, layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1.
  • the inter-layer data C0 and the inter-layer data C1 can be stored in the internal memory.
  • layer 2 can obtain inter-layer data C0 and inter-layer data C1 from the internal memory.
  • inter-layer data C0 and inter-layer data C1 can be combined into inter-layer data (C0, C1).
  • Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1)
  • layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1).
  • the inter-layer data (E0, E1) can be stored in the internal memory.
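  • A minimal sketch of the aggregation in FIG. 9, again with placeholder layer functions: subgraph 0 processes one batch at a time, and subgraph 1 aggregates C0 and C1 from the internal memory and processes them together.

```python
def run_layer(tag, data):
    """Placeholder layer: just renames the data so the flow is visible."""
    if isinstance(data, tuple):
        return tuple(tag + d[1:] for d in data)
    return tag + data[1:]

internal_memory = []
for batch in ["A0", "A1"]:                    # subgraph 0: batch size 1
    b = run_layer("B", batch)                 # layer 0
    c = run_layer("C", b)                     # layer 1
    internal_memory.append(c)                 # C0, C1 kept in internal memory

aggregated = tuple(internal_memory)           # (C0, C1) combined for subgraph 1
d = run_layer("D", aggregated)                # layer 2: batch size 2
e = run_layer("E", d)                         # layer 3: batch size 2
print(e)                                      # ('E0', 'E1')
```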
  • the inter-layer data between the sub-graphs contained in the graph is scattered.
  • FIG. 10 is a schematic diagram of scattering processing of inter-layer data between subgraphs provided in an embodiment of this application.
  • the graph includes sub graph 1 and sub graph 2.
  • Sub-figure 1 includes layer 2 and layer 3.
  • the batch size of layer 2 and the batch size of layer 3 are both two batches.
  • Layer 2 processes two batches at a time.
  • Layer 3 processes two batches at a time.
  • Sub-figure 2 includes layer 4 and layer 5.
  • the batch size of layer 4 and the batch size of layer 5 are both one batch.
  • Layer 4 processes one batch at a time.
  • Layer 5 processes one batch at a time.
  • layer 2 and layer 3 included in sub-figure 1 process two batches each time
  • layer 4 and layer 5 included in sub-figure 2 process one batch each time.
  • the execution sequence of the processing batches in the figure is shown by the bold arrows in the figure. To facilitate understanding, the processing of the inter-layer data E0 and the inter-layer data E1 in layer 4 and layer 5 are separately represented.
  • layer 2 can obtain the inter-layer data (C0, C1) of batch A0 and batch A1 from the internal memory.
  • Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1)
  • layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1).
  • the inter-layer data (E0, E1) can be stored in the internal memory.
  • layer 4 first obtains the inter-layer data (E0, E1) from the internal memory, and divides the inter-layer data (E0, E1) into the inter-layer data E0 and the inter-layer data E1.
  • Layer 4 first processes the inter-layer data E0 in the inter-layer data (E0, E1) to obtain the inter-layer data F0, and layer 5 processes the inter-layer data F0 to obtain the inter-layer data G0.
  • the layer 4 processes the inter-layer data E1 in the inter-layer data (E0, E1) to obtain the inter-layer data F1
  • the layer 5 processes the inter-layer data F1 to obtain the inter-layer data G1.
  • the inter-layer data G0 and the inter-layer data G1 can be stored in the internal memory.
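  • A minimal sketch of the scattering in FIG. 10 with placeholder layer functions: subgraph 1 outputs the pair (E0, E1), which subgraph 2 splits and processes one batch at a time.

```python
def run_layer(tag, data):
    """Placeholder layer: renames the data so the flow is visible."""
    if isinstance(data, tuple):
        return tuple(tag + d[1:] for d in data)
    return tag + data[1:]

e = run_layer("E", run_layer("D", ("C0", "C1")))   # subgraph 1: layers 2 and 3, batch size 2
internal_memory = []
for item in e:                                     # scatter (E0, E1) into E0 and E1
    f = run_layer("F", item)                       # layer 4: batch size 1
    g = run_layer("G", f)                          # layer 5: batch size 1
    internal_memory.append(g)

print(internal_memory)                             # ['G0', 'G1']
```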
  • multiple graphs are scheduled for processing in the order of the layers of the neural network. What needs to be clarified is that the data processed by the latter graph is the data output by the previous graph. Divide the layers of the neural network into multiple graphs, and process batches according to the order of the graphs, which improves the utilization of internal memory and the processing performance of the entire neural network.
  • FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application.
  • the abscissa represents the layer of the neural network
  • the ordinate represents the batch.
  • the neural network includes 12 layers.
  • the batch sizes of layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are all one batch, that is, layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are processed one batch at a time.
  • the batch sizes of layer 2, layer 3, layer 6 and layer 7 are all two batches, that is, layer 2, layer 3, layer 6 and layer 7 process two batches at a time.
  • the batch sizes of layer 8 and layer 9 are both four batches, that is, layer 8 and layer 9 process four batches each time.
  • graph 0 includes layer 0 to layer 5, and graph 1 includes layer 6 to layer 11.
  • the numbers in the boxes indicate the execution order of the batches, following the numbers from small to large. After graph 0 has been processed, graph 1 is processed.
  • layer 0 processes batch 0, and layer 1 processes the inter-layer data of batch 0 output by layer 0, obtains the inter-layer data of batch 0 output by layer 1, and stores the inter-layer data of batch 0 in the internal memory.
  • layer 0 processes batch 1
  • layer 1 processes the inter-layer data of batch 1 output by layer 0, obtains the inter-layer data of batch 1 output from layer 1, and stores the inter-layer data of batch 1 in the internal memory.
  • layer 2 obtains the inter-layer data of batch 0 and the inter-layer data of batch 1 from the internal memory and processes them together; layer 3 processes the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 2, obtaining the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 3.
  • layer 4 processes the inter-layer data of batch 0 output by layer 3, and layer 5 processes the inter-layer data of batch 0 output by layer 4; then layer 4 processes the inter-layer data of batch 1 output by layer 3, and layer 5 processes the inter-layer data of batch 1 output by layer 4.
  • layer 0 to layer 5 process batch 2 and batch 3 in the order of processing batch 0 and batch 1.
  • the batches processed by graph 1 are the data output by graph 0.
  • layer 6 processes the inter-layer data of batch 0 and batch 1 output by layer 5; layer 7 processes the inter-layer data of batch 0 and batch 1 output by layer 6, and the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 7 are stored in the internal memory.
  • layer 6 then processes the inter-layer data of batch 2 and batch 3 output by layer 5; layer 7 processes the inter-layer data of batch 2 and batch 3 output by layer 6, and the inter-layer data of batch 2 and batch 3 output by layer 7 are stored in the internal memory.
  • layer 8 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7; layer 9 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 8; and the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 9 are stored in the internal memory.
  • Layer 10 processes the batch 0 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 0 output by layer 10.
  • Layer 10 processes the inter-layer data of batch 1 output by layer 9.
  • Layer 11 processes the inter-layer data of batch 1 output by layer 10.
  • Layer 10 processes the batch 2 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 2 output by layer 10.
  • Layer 10 processes the batch 3 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 3 output by layer 10.
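  • The execution order of FIG. 11 can be reproduced from the per-layer batch sizes and the graph boundaries. The scheduler below is a simplified, hypothetical illustration (it assumes each graph is driven in passes sized to its largest batch size), not the scheduling algorithm claimed by the application.

```python
def schedule(graphs, total_batches):
    """Return the execution order as (layer, batch indices) steps.

    graphs: list of graphs; each graph is a list of subgraphs; each subgraph is
    (layer_names, batch_size). Simplified illustration of FIG. 11.
    """
    order = []
    for graph in graphs:
        pass_size = max(size for _, size in graph)      # batches consumed per pass
        for start in range(0, total_batches, pass_size):
            group = list(range(start, start + pass_size))
            for layers, size in graph:
                for i in range(0, len(group), size):
                    chunk = tuple(group[i:i + size])
                    for layer in layers:
                        order.append((layer, chunk))
    return order

graph0 = [(["L0", "L1"], 1), (["L2", "L3"], 2), (["L4", "L5"], 1)]
graph1 = [(["L6", "L7"], 2), (["L8", "L9"], 4), (["L10", "L11"], 1)]
for step in schedule([graph0, graph1], total_batches=4):
    print(step)
```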
  • the internal memory includes a memory 223, a memory 324, a memory 329, and a memory 3210.
  • the external memory is the memory 212.
  • the calculation node completes the calculation of the neural network according to the determined batch size.
  • the computing node includes a chip 221, a tile 320, a processing device 323, or an engine 326.
  • the neural network data processing method includes S1201 and S1202.
  • the processor 211 obtains the data amount of the input data, the first feature of the internal memory in the chip running the neural network, and the second feature of the multiple layers in the neural network.
  • the input data is the data received by the input layer of the neural network.
  • the input data is data in a data set. Take image processing as an example.
  • the input data is 32 pictures in the data set.
  • the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory.
  • the distribution characteristics of internal memory in the chip include the number of memories in the chip running the neural network and the connection relationship between the memory and the computing node.
  • although the memory capacity and the number of memories in the chip are large, these storage resources are not all used for neural network calculation every time, and the storage resources allocated to the neural network calculation vary. Therefore, the neural network configuration needs to be dynamically optimized according to the number of memories and the connection relationships between the memories and the computing nodes, that is, the distribution feature.
  • taking the neural network circuit 220 as an example, the distribution feature includes the number of memories 223, the number of memories 324, the number of memories 329, and the number of memories 3210, as well as the connection relationship between the memory 223 and the chip 221, the connection relationship between the memory 324 and the processing device 323, the connection relationship between the memory 329 and the engine 326, and the connection relationship between the memory 3210 and the engine 326.
  • the capacity of internal memory includes the capacity of all memories in the chip running the neural network.
  • the memory capacity and the number of memories in the chip are large, but these storage resources are not all used for neural network calculation every time, and the storage resources allocated to the neural network calculation vary, so the neural network configuration also needs to be dynamically optimized according to the capacity.
  • taking the neural network circuit 220 as an example, the capacity of the internal memory includes the capacity of the memory 223, the capacity of the memory 324, the capacity of the memory 329, and the capacity of the memory 3210. It is understandable that the capacity of the internal memory may refer to the available capacity of the internal memory.
  • the second feature includes the connection relationship between the multiple layers and the calculation-related parameters of each of the multiple layers.
  • the computing resources in the chip change, and these computing resources are not all used for neural network calculation every time, so the connection relationships between the multiple layers and the calculation-related parameters of each of the multiple layers also change with requirements, and the neural network configuration needs to be dynamically optimized according to these changes.
  • the connection relationship between multiple layers includes the connection relationship between each layer in the neural network and at least one layer in other layers. According to the different functions performed by the neural network, the connection relationship of the layers in the neural network is also different, and this application does not limit the connection relationship of the layers in the neural network.
  • the calculation-related parameters of each layer include the dimensionality of the input data and the dimensionality of the output data, offset parameters, convolution kernels, quantization parameters, or normalization parameters.
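  • A minimal sketch of one way the second feature might be represented in code; the field names are hypothetical and only illustrate the kinds of parameters listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LayerFeature:
    """Calculation-related parameters of one layer (hypothetical representation)."""
    name: str
    input_dims: List[int]
    output_dims: List[int]
    predecessors: List[str] = field(default_factory=list)  # connection relationship
    kernel: Optional[List[int]] = None                      # e.g. [3, 3] for a conv layer
    bias: bool = False
    quantization: Optional[str] = None
    normalization: Optional[str] = None

second_feature = [
    LayerFeature("layer0", [1, 3, 56, 56], [1, 16, 56, 56], kernel=[3, 3], bias=True),
    LayerFeature("layer1", [1, 16, 56, 56], [1, 16, 28, 28], predecessors=["layer0"]),
]
print(second_feature[1].predecessors)
```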
  • the first feature and the second feature may be stored in the memory 212 in the host 210.
  • the processor 211 may obtain the characteristics of the internal memory and the characteristics of multiple layers in the neural network from the memory 212 in the host 210.
  • the processor 211 determines the batch size of the multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data according to the data volume, the first characteristic and the second characteristic.
  • the batch size of at least two layers in the batch size are different.
  • the processor 211 may use an iterative algorithm to determine the batch size of multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data according to the amount of data, the first feature, and the second feature.
  • the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm. It is understandable that the processor does not obtain the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data in a single calculation based on the data amount, the first feature, and the second feature; instead, it uses an iterative algorithm to run multiple iterative trials.
  • N is an integer greater than or equal to 2
  • M is an integer greater than or equal to 1
  • N ≥ M. For example, when N = 2 and M = 1, the layers of the neural network are divided into 2 subgraphs, and the 2 subgraphs are divided into one graph.
  • the processor 211 first determines the batch size of each layer in the neural network based on the capacity of the internal memory, and then merges consecutive layers with the same batch size into subgraphs. Based on the cache requirements of the subgraphs and the capacity of the internal memory, multiple subgraphs are merged into a graph, and the resulting graph can contain subgraphs with different batch sizes. That is to say, when the neural network is scheduled in units of graphs, the input data is processed with different batch sizes, so the cache requirement of each graph does not exceed the capacity of the internal memory, which improves the utilization of the on-chip memory and the operational performance of the hardware.
  • the layers in the N subgraphs are connected to form a complete neural network.
  • Each of the N subgraphs contains one or more layers of the same batch size.
  • One or more layers of the same batch size are consecutive layers in the neural network.
  • the number of layers included in different subgraphs may be the same or different.
  • the sub-graphs in the M graphs are connected to form a complete neural network.
  • Each of the M graphs includes one or more subgraphs.
  • the number of subgraphs included in different graphs may be the same or different.
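Under the definitions above, the grouping step could look like the following sketch: consecutive layers with the same batch size are merged into subgraphs, and consecutive subgraphs are packed into graphs as long as the graph's cache requirement stays within the internal memory capacity. graph_cache_requirement is a hypothetical helper, and the greedy packing shown is only one possible way to obtain the N subgraphs and M graphs.

```python
def build_subgraphs(layer_order, batch):
    # consecutive layers that share a batch size form one subgraph
    subgraphs, current = [], [layer_order[0]]
    for layer in layer_order[1:]:
        if batch[layer] == batch[current[-1]]:
            current.append(layer)
        else:
            subgraphs.append(current)
            current = [layer]
    subgraphs.append(current)
    return subgraphs                                  # the N subgraphs

def build_graphs(subgraphs, memory_capacity, graph_cache_requirement):
    # pack consecutive subgraphs into a graph while the cache requirement still fits
    graphs, current = [], []
    for subgraph in subgraphs:
        if not current or graph_cache_requirement(current + [subgraph]) <= memory_capacity:
            current.append(subgraph)
        else:
            graphs.append(current)
            current = [subgraph]
    graphs.append(current)
    return graphs                                     # the M graphs, M <= N
```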
  • an exemplary process of processing data in a neural network is given, including the data import process (that is, the process of reading input data), the calculation process, and the data export process (that is, the process of storing output data).
  • before the neural network processes a batch of data, part of the data needs to be moved in first, that is, the data import process; the overhead generated in this process is the head overhead.
  • the data import process, the calculation process, and the data export process are parallel.
  • after the calculation, the neural network executes the data export process for the last calculated data and stores it in the storage space.
  • the overhead generated by this process is the tail overhead.
  • the layer processes data in units of batch size.
  • calculation time = calculation amount of this layer / computing power of the chip running the neural network
  • data transfer time = (input data volume + output data volume) / (internal memory bandwidth or external memory bandwidth of the chip)
  • total time overhead = head overhead + max(calculation time, data transfer time) + tail overhead.
  • the time overhead of a certain layer in the neural network can be obtained according to the storage location of at least one of the input data or output data of the current layer and the computing power of the chip equipped with the neural network.
  • the storage location of data includes internal memory and external memory.
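The overhead model above can be written as a small function. The sketch below assumes the head and tail overheads, the computing power, and the two bandwidths are known, and simply selects the internal or external memory bandwidth according to where the layer's data is stored; all names and values are illustrative.

```python
def layer_time_overhead(calc_amount_ops, input_bytes, output_bytes, stored_internally,
                        compute_ops_per_s, internal_bw_bytes_per_s, external_bw_bytes_per_s,
                        head_overhead_s, tail_overhead_s):
    calculation_time = calc_amount_ops / compute_ops_per_s
    bandwidth = internal_bw_bytes_per_s if stored_internally else external_bw_bytes_per_s
    data_transfer_time = (input_bytes + output_bytes) / bandwidth
    # data import, calculation and data export run in parallel,
    # so the slower of calculation and transfer dominates
    return head_overhead_s + max(calculation_time, data_transfer_time) + tail_overhead_s
```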
  • the external memory and the internal memory are jointly planned to store the inter-layer data, which relieves the pressure on the storage space of the internal memory.
  • because part of the inter-layer data can be stored in the external memory, a larger batch size can be set for the layers in the neural network, thereby reducing the head overhead of processing each batch in the layers of the neural network and improving the computational efficiency of the processor.
  • the scheduling order of the layers in a graph is determined according to the scheduling order of each subgraph contained in the graph and the scheduling order of the layers within each subgraph.
  • the scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network.
  • the batches corresponding to the batch size of the layers included in a subgraph are processed in the order of the layers included in the subgraph.
  • the scheduling order of each subgraph included in a graph is determined according to the batch sizes and the scheduling order of the first and last layers in the subgraphs.
  • the inter-layer data of the subgraphs contained in a graph are aggregated or scattered. For the explanation of subgraphs and graphs, please refer to the above description.
  • the neural network includes 6 layers, and the layer sequence is layer 0-layer 5 (layer0-layer5, L0-L5).
  • the batch size corresponding to L0, L1, L4, and L5 is 1, and the batch size corresponding to L2 and L3 is 2.
  • the layers with the same batch size form a subgraph, that is, L0 and L1 form subgraph 0.
  • L2 and L3 form subgraph 1.
  • L4 and L5 form subgraph 2.
  • the graph is composed of the subgraphs, that is, subgraph 0, subgraph 1, and subgraph 2 form one graph.
  • the batch size corresponding to L0 and L1 is 1, so subgraph 0 can process input data with a data size of 1 each time, that is, batch 0 and batch 1 are processed separately.
  • the output data of L1 is C0.
  • the batch size corresponding to L2 is 2.
  • C0 only corresponds to batch 0, which does not meet the processing requirements of L2, and C0 needs to be temporarily stored in the internal memory.
  • Batch 1 is input to L0 for processing, after L0 and L1 are processed, the output data of L1 is C1.
  • L1 outputs two batches of data to meet the processing requirements of L2.
  • the internal memory then holds the two data sets C0 and C1.
  • L2 can then call the aggregated C0 and C1 for processing. Therefore, if subgraph 0 and subgraph 1 are divided into one graph, then while L0 and L1 are scheduled to process batch 1, C0 occupies cache space in the internal memory, and the amount of data corresponding to C0 is an additional internal-memory cache requirement of L0 and L1.
  • the cache requirement of input data corresponding to L0 is the amount of data corresponding to (C0+A1)
  • the cache requirement of output data is the amount of data corresponding to (C0+B1)
  • the cache requirement of the input data corresponding to L1 is the amount of data corresponding to (C0+B1)
  • the buffer requirement for output data is the amount of data corresponding to (C0+C1).
  • the cache requirement of input data corresponding to L4 is the amount of data corresponding to (E1+E0)
  • the cache requirement of output data is the amount of data corresponding to (E1+F0)
  • the cache requirement of the input data corresponding to L5 is the amount of data corresponding to (E1+F0)
  • the buffer requirement for output data is the amount of data corresponding to (E1+G0).
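A minimal sketch of the cache bookkeeping in this example: while a subgraph is re-run to produce a second batch, the pending inter-layer data (C0 above, or E1 in the later case) stays resident in the internal memory and is added to both the input and the output buffer requirement of each layer being scheduled. The function below is illustrative only; sizes are byte counts.

```python
def buffer_requirements(layer_input_bytes, layer_output_bytes, resident_bytes):
    # resident_bytes is the pending inter-layer data that must stay cached (e.g. C0)
    input_requirement = resident_bytes + layer_input_bytes    # e.g. C0 + A1 for L0
    output_requirement = resident_bytes + layer_output_bytes  # e.g. C0 + B1 for L0
    return input_requirement, output_requirement
```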
  • because the inter-layer data of the layers contained in a subgraph and the inter-layer data between subgraphs are stored in the internal memory and occupy its storage space, the batch sizes of the multiple layers and the storage locations of the inter-layer data are also affected by the division of subgraphs and graphs.
  • for example, when the inter-layer data E0 is processed by layers L4 and L5, the inter-layer data E1 is stored in the cache and occupies cache space, so less cache is available to L4 and L5, which affects the segmentation of the input data.
  • the neural network data processing method comprehensively refers to the data volume of the input data, the first feature, and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. Therefore, by setting a reasonable batch size for each layer, the internal memory is fully utilized to store the inter-layer data during neural network inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
  • another computer can execute S1201 and S1202 offline to generate the segmentation strategy and the execution order for scheduling the layers of the neural network.
  • the segmentation strategy and the layer scheduling order are then configured in the controller of the neural network system, and the controller applies the segmentation strategy and controls the execution order of the layers of the neural network.
  • alternatively, the controller in the neural network system can execute S1201 and S1202 to generate the segmentation strategy and the execution order for scheduling the layers of the neural network, and the controller uniformly manages the scheduling of the layers and the segmented batches.
  • Example 1: the input data is whole-image data.
  • the batch size corresponding to L0 and L1 is 1 picture
  • the batch size corresponding to L2, L3 and L4 is 2 pictures
  • the batch size corresponding to L5 and L6 is 4 pictures.
  • L0 and L1 are divided into subgraph 0
  • L2-L4 are divided into subgraph 1
  • L5 and L6 are divided into subgraph 2.
  • the 3 subgraphs are divided into one graph, that is, L0-L6 form one graph.
  • the cache requirement of the graph is less than or equal to the capacity of the internal memory.
  • the graph contains layers with different batch sizes. In the process of scheduling the subgraphs in the neural network to process the input data, this improves the utilization of the internal memory and the operating performance of the chip running the neural network.
  • the data set contains 8 pictures
  • L0 is the first layer of the neural network
  • the batch size is 1 picture
  • the data set is divided into 8 batches of input data (batch 0-batch 7 shown in Fig. 14 )
  • each batch of input data is the whole-image data corresponding to 1 picture, and the batches are input to L0 in turn.
  • scheduling subgraph 0 twice corresponds to scheduling subgraph 1 once, that is, the scheduling sequence is L0→L1→L0→L1→L2→L3→L4; scheduling subgraph 1 twice corresponds to scheduling subgraph 2 once, that is, the scheduling sequence is L2→L3→L4→L2→L3→L4→L5→L6.
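The interleaved scheduling order of Example 1 can be reproduced with a small recursive routine. The sketch below is a hypothetical reconstruction that assumes each subgraph's batch size is an integer multiple of the previous subgraph's batch size, as in the 1-, 2-, and 4-picture configuration above.

```python
def schedule(subgraphs, batch_sizes, pictures):
    # subgraphs: list of layer-name lists; batch_sizes: batch size (in pictures) per subgraph
    order = []

    def run(idx, amount):
        # run subgraph `idx` as often as needed to process `amount` pictures
        for _ in range(amount // batch_sizes[idx]):
            if idx > 0:
                # first let the previous subgraph produce enough batches for one run
                run(idx - 1, batch_sizes[idx])
            order.extend(subgraphs[idx])

    run(len(subgraphs) - 1, pictures)
    return order

# schedule([["L0", "L1"], ["L2", "L3", "L4"], ["L5", "L6"]], [1, 2, 4], 4) yields
# L0 L1 L0 L1 L2 L3 L4 L0 L1 L0 L1 L2 L3 L4 L5 L6, matching the order described above.
```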
  • Example 2: the input data is non-whole-image data.
  • the batch size corresponding to L0 and L1 is 1/4 of a picture
  • the batch size corresponding to L2, L3, and L4 is 1/2 of a picture.
  • L0 and L1 are divided into subgraph 0, and L2-L4 are divided into subgraph 1.
  • because the input data is non-whole-image data, it needs to be processed with a padding algorithm; the padding data is the shaded part.
  • the two subgraphs are divided into one graph, that is, L0-L4 form one graph.
  • the cache requirement of the graph is less than or equal to the capacity of the internal memory.
  • the graph contains layers with different batch sizes. In the process of scheduling the subgraphs in the neural network to process the input data, this improves the utilization of the internal memory and the operating performance of the chip running the neural network.
  • each batch of input data is the non-whole-image data corresponding to 1/4 of a picture, and the batches are input to L0 in turn.
  • scheduling subgraph 0 twice corresponds to scheduling subgraph 1 once, that is, the scheduling sequence is L0→L1→L0→L1→L2→L3→L4. Processing the input data of the current data set requires scheduling subgraph 0 eight times and subgraph 1 four times.
  • the neural network system includes at least one of a hardware structure or a software module corresponding to each function.
  • FIG. 16 and FIG. 17 are schematic diagrams of the structure of a possible neural network data processing device provided by an embodiment of the application. These neural network data processing devices can be used to implement the functions of the processor 211 in the foregoing method embodiment, and therefore can also achieve the beneficial effects of the foregoing method embodiment.
  • the data processing device of the neural network in FIG. 16 may be the processor 211 shown in FIG. 2 or a device formed by running software on it.
  • the data processing device 1600 of the neural network includes an acquisition unit 1610 and a processing unit 1620.
  • the neural network data processing device 1600 is used to implement the function of the processor 211 in the method embodiment shown in FIG. 12 above.
  • the acquiring unit 1610 is used to perform S1201; the processing unit 1620 is used to perform S1202. More detailed descriptions of the above-mentioned acquisition unit 1610 and processing unit 1620 can be obtained directly by referring to the relevant description in the method embodiment shown in FIG. 12, and will not be repeated here.
  • the data processing device of the neural network may also be a module (such as a chip) of other equipment connected to the neural network system 200.
  • the data processing device 1700 of the neural network includes a processor 1710 and an interface circuit 1720.
  • the processor 1710 and the interface circuit 1720 are coupled to each other.
  • the interface circuit 1720 may be a transceiver or an input/output interface.
  • the neural network data processing device 1700 may further include a memory 1730 for storing instructions executed by the processor 1710 or storing input data required by the processor 1710 to run the instructions or storing data generated after the processor 1710 runs the instructions.
  • the data processing device of the neural network may include the host 210 shown in FIG. 2.
  • the processor 1710 may include the processor 211, and the memory 1730 is the memory 212.
  • the above scheme is used to configure the batch size for the neural network chip so that the neural network can work efficiently.
  • the batch size, the processing of graphs and subgraphs, and the operation of related algorithms are all executed by the processor 211.
  • the processing method can also be executed by other types of processors or devices; for example, a controller or processor located inside the neural network chip can execute the related solution to complete the configuration of the neural network.
  • one or more types of processors can be included in the neural network chip, and such a processor can run the related neural network configuration scheme to obtain suitable batch sizes and the division into graphs and subgraphs. After configuring the parameters of the neural network, the processor can run the neural network calculations accordingly, thereby realizing self-configuration, which is not limited in this embodiment.
  • the processor 1710 is used to perform the functions of the above-mentioned processing unit 1620, and the interface circuit 1720 is used to perform the functions of the above-mentioned obtaining unit 1610.
  • the processor in the embodiment of the present application may be a central processing unit (Central Processing Unit, CPU), or may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or a field programmable gate array (Field Programmable Gate Array, FPGA).
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the method steps in the embodiments of the present application can be implemented by hardware, and can also be implemented by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, which can be stored in a random access memory (Random Access Memory, RAM), a flash memory, a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor, so that the processor can read information from the storage medium and can write information to the storage medium.
  • the storage medium may also be an integral part of the processor.
  • the processor and the storage medium may be located in the ASIC.
  • the ASIC can be located in a network device or a terminal device.
  • the processor and the storage medium may also exist as discrete components in the network device or the terminal device.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or other programmable devices.
  • the computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that integrates one or more available media.
  • the usable medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; it may also be an optical medium, such as a digital video disc (digital video disc, DVD); or it may be a semiconductor medium, such as a solid state drive (solid state drive, SSD).

Abstract

Disclosed are a data processing method and apparatus for a neural network, which method and apparatus relate to the field of artificial intelligence. The method comprises: according to the data amount of input data, a first feature of an internal memory in a chip that runs a neural network, and a second feature of multiple layers in the neural network, dynamically segmenting the input data, and configuring different batch sizes for the layers in the neural network. By means of configuring a rational batch size for each layer in a neural network, during a neural network inference procedure, an internal memory can be fully utilized to store inter-layer data of the neural network, thereby improving the utilization rate of the internal memory, and ensuring the computational efficiency of hardware that runs the neural network.

Description

一种神经网络的数据处理方法及装置A neural network data processing method and device 技术领域Technical field
本申请涉及人工智能(artificial intelligence,AI)领域,尤其涉及一种神经网络的数据处理方法及装置。This application relates to the field of artificial intelligence (AI), and in particular to a neural network data processing method and device.
背景技术Background technique
随着计算机系统中处理器的计算能力的不断提高,处理器的性能持续提升。为了解决由于外部存储器的带宽限制,无法适应处理器的处理速度而产生的“内存墙”的问题,计算机系统中配置有带宽更高、容量小的多级高速缓存结构。As the computing power of the processor in the computer system continues to increase, the performance of the processor continues to improve. In order to solve the problem of the "memory wall" caused by the limitation of the bandwidth of the external memory and the inability to adapt to the processing speed of the processor, the computer system is equipped with a multi-level cache structure with higher bandwidth and smaller capacity.
在神经网络推理过程中,神经网络中每层处理完输入数据后,进入下一层。若输入数据的数据量较大,神经网络的多个层的层间数据的大小(size)也可能过大,导致高速缓存无法存储层间数据,将层间数据存到外部存储器。由于不能有效利用高速缓存,因此降低了处理器的计算效率。In the neural network reasoning process, after each layer of the neural network has processed the input data, it enters the next layer. If the amount of input data is large, the size of the inter-layer data of multiple layers of the neural network may also be too large, causing the cache to be unable to store the inter-layer data, and the inter-layer data is stored in an external memory. Because the cache cannot be used effectively, the computational efficiency of the processor is reduced.
为了解决上述问题,传统技术针对每个层的层间数据的缓存需求将输入数据进行分组,得到多组相同批大小(batch size)的批(batch),该批大小受限于缓存需求最大的层的批大小。神经网络处理完一组批再处理下一组批。通过减少神经网络中每层处理的数据,来减少层间数据,使层间数据尽可能保存在高速缓存中。In order to solve the above problems, the traditional technology groups the input data according to the inter-layer data caching requirements of each layer to obtain multiple sets of batches of the same batch size. The batch size is limited by the largest cache demand. The batch size of the layer. The neural network processes one set of batches before processing the next set of batches. By reducing the data processed by each layer in the neural network, the inter-layer data is reduced, and the inter-layer data is stored in the cache as much as possible.
由于神经网络中不同的层对数据处理的操作不同,层间数据的大小不同。例如,若神经网络中的层对图片进行放大操作,产生的层间数据较大。又如,若神经网络中的层对图片进行缩小操作,产生的层间数据较小。对于输出较小的层间数据的层而言,批大小越小层间数据越小,高速缓存的剩余容量较多;对于输出较大的层间数据的层而言,批大小越大层间数据越大,高速缓存的剩余容量较少,导致高速缓存可能无法存储层间数据。总之,依据传统技术确定的神经网络的批大小处理输入数据的过程中,依然导致高速缓存的利用率较低,影响运行神经网络的硬件的计算效率。另外,如果划分的分组较多,将增加神经网络中每层处理每组批的头部开销,反而降低了运行神经网络的硬件的计算效率。因此,如何提高高速缓存的利用率,以及确保运行神经网络的硬件的计算效率是一个亟待解决的问题。Since different layers in the neural network have different operations on data processing, the size of the data between layers is different. For example, if a layer in a neural network enlarges a picture, the generated inter-layer data is relatively large. For another example, if a layer in the neural network performs a shrinking operation on a picture, the generated inter-layer data is smaller. For layers that output smaller inter-layer data, the smaller the batch size, the smaller the inter-layer data, and the more remaining capacity of the cache; for the layers that output larger inter-layer data, the larger the batch size. The larger the data, the smaller the remaining capacity of the cache, and the cache may not be able to store inter-layer data. In short, in the process of processing input data according to the batch size of the neural network determined by the traditional technology, the utilization rate of the cache is still low, which affects the computational efficiency of the hardware running the neural network. In addition, if there are more groups, it will increase the head overhead of processing each batch in each layer of the neural network, and on the contrary reduce the computational efficiency of the hardware running the neural network. Therefore, how to improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network is an urgent problem to be solved.
发明内容Summary of the invention
本申请提供了一种神经网络的数据处理方法及装置,能够提高了高速缓存的利用率,以及确保了运行神经网络的硬件的计算效率。为达到上述目的,本申请采用如下技术方案。The present application provides a neural network data processing method and device, which can improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network. In order to achieve the above-mentioned purpose, this application adopts the following technical solutions.
第一方面,本申请提供了一种神经网络的数据处理方法,方法包括:处理器利用输入数据的数据量、运行神经网络的芯片内的内部存储器的第一特征和神经网络中多个层的第二特征对输入数据进行分组,确定神经网络中每个层的批大小,使多个层的批大小中至少两个层的批大小不同。例如,神经网络中每个层的批大小不同。又如,神经网络包括相同批大小的层和不同批大小的层。其中,第一特征包括内部存储器在 所述芯片内的分布特征和内部存储器的容量中至少一个。第二特征包括所述多个层之间的连接关系和所述多个层中每个层的与计算相关的参数。批大小对应的批为一个图片、多个图片或者一个图片中的部分图像。In the first aspect, this application provides a neural network data processing method. The method includes: the processor uses the amount of input data, the first feature of the internal memory in the chip running the neural network, and the multiple layers of the neural network. The second feature groups the input data, determines the batch size of each layer in the neural network, and makes the batch sizes of at least two layers different among the batch sizes of multiple layers. For example, the batch size of each layer in a neural network is different. For another example, a neural network includes layers of the same batch size and layers of different batch sizes. Wherein, the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory. The second feature includes the connection relationship between the plurality of layers and the calculation-related parameters of each of the plurality of layers. The batch corresponding to the batch size is one picture, multiple pictures, or part of the image in one picture.
应理解,所谓内部存储器是指运行神经网络的芯片内的存储器。例如,运行神经网络的芯片内的存储器是高速缓存。所谓外部存储器是指运行神经网络的芯片外的存储器。内部存储器也可以称为片上存储器。外部存储器也可以称为片外存储器。It should be understood that the so-called internal memory refers to the memory in the chip running the neural network. For example, the memory on the chip that runs the neural network is a cache. The so-called external memory refers to the memory outside the chip that runs the neural network. Internal memory can also be called on-chip memory. External memory can also be called off-chip memory.
本申请实施例提供的神经网络的数据处理方法,综合参考输入数据的数据量、第一特征和第二特征切分输入数据,为神经网络中的层设置不同的批大小。因此,通过为神经网络中的每一层设置合理的批大小,在神经网络推理过程中,充分利用内部存储器存储神经网络的层间数据,减少了运行神经网络的芯片与外部存储器的交互,从而提高了内部存储器的利用率,以及确保运行神经网络的芯片的计算效率。The neural network data processing method provided by the embodiments of the present application comprehensively refers to the data amount of the input data, the first feature and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. Therefore, by setting a reasonable batch size for each layer in the neural network, the internal memory is fully utilized to store the inter-layer data of the neural network during the neural network inference process, which reduces the interaction between the chip running the neural network and the external memory, thereby Improve the utilization of internal memory and ensure the computational efficiency of the chip running the neural network.
具体的,依据所述数据量、所述第一特征和所述第二特征确定所述多个层的批大小包括:依据所述数据量、所述第一特征和所述第二特征确定多个层的批大小、N个子图、M个图和层间数据的存储位置,N为大于或等于2的整数,M为大于或等于1的整数,N≥M。Specifically, determining the batch size of the multiple layers according to the amount of data, the first characteristic, and the second characteristic includes: determining the amount of batches according to the amount of data, the first characteristic, and the second characteristic. The batch size of layers, N sub-pictures, M pictures and storage locations of inter-layer data, N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M.
其中,层间数据的存储位置包括内部存储器或外部存储器中至少一个。在一种可能的实现方式中,子图包含的多个层的层间数据存储在所述内部存储器。在另一种可能的实现方式中,子图间的层间数据存储在所述内部存储器。在另一种可能的实现方式中,图间的层间数据存储在所述外部存储器。Wherein, the storage location of the inter-layer data includes at least one of an internal memory or an external memory. In a possible implementation manner, the inter-layer data of multiple layers included in the sub-picture is stored in the internal memory. In another possible implementation manner, the inter-layer data between sub-pictures is stored in the internal memory. In another possible implementation manner, the inter-layer data between the pictures is stored in the external memory.
子图包含一个或多个相同批大小的层。可选的,不同的子图包含的层的个数可以相同,也可以不同。可替换描述的,子图也可以称为第一类层组。The subgraph contains one or more layers of the same batch size. Optionally, the number of layers included in different sub-pictures may be the same or different. Alternatively described, the sub-picture may also be referred to as the first-type layer group.
图包括一个或多个子图。不同的图包含的子图的个数可以相同,也可以不同。可替换描述的,图也可以称为第二类层组。The graph includes one or more subgraphs. The number of subgraphs contained in different graphs can be the same or different. Alternatively described, the graph may also be referred to as the second type of layer group.
在一些可能的设计中,处理器可以采用迭代算法,依据所述数据量、所述第一特征和所述第二特征确定多个层的批大小、N个子图、M个图和层间数据的存储位置。可理解的,处理器根据所述数据量、所述第一特征和所述第二特征并非一次计算,得到多个层的批大小、N个子图、M个图和层间数据的存储位置,而是采用迭代算法经过多次迭代实验,从多个实验结果中选择多个层的批大小、N个子图、M个图和层间数据的存储位置,来确保内部存储器的利用率,以及运行神经网络的芯片的计算效率。In some possible designs, the processor may use an iterative algorithm to determine the batch size of multiple layers, N sub-pictures, M pictures, and inter-layer data based on the amount of data, the first feature, and the second feature. Storage location. It is understandable that the processor does not perform a calculation based on the amount of data, the first feature, and the second feature at one time, and obtains the batch size of multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data. Instead, it uses an iterative algorithm to go through multiple iterative experiments. From multiple experimental results, select the batch size of multiple layers, N sub-graphs, M graphs, and storage locations of inter-layer data to ensure the utilization of internal memory and operation The computational efficiency of the neural network chip.
其中,优化算法可以是动态规划算法、贪婪算法或遗传算法。Among them, the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm or a genetic algorithm.
动态规划算法(dynamic programming algorithm)的基本思想也是将待求解问题分解成若干个子问题,先求解子问题,然后从这些子问题的解得到原问题的解。The basic idea of dynamic programming algorithm (dynamic programming algorithm) is also to decompose the problem to be solved into several sub-problems, first solve the sub-problems, and then obtain the solution of the original problem from the solutions of these sub-problems.
贪婪算法(greedy algorithm)也可以称为贪心算法,其基本思路是从问题的某一个初始解出发一步一步地进行,根据某个优化测度,每一步都要确保能获得局部最优解。每一步只考虑一个数据,他的选取应该满足局部优化的条件。若下一个数据和部分最优解连在一起不再是可行解时,就不把该数据添加到部分解中,直到把所有数据枚举完,或者不能再添加使算法停止Greedy algorithm (greedy algorithm) can also be called greedy algorithm. The basic idea is to proceed step by step from a certain initial solution of the problem. According to a certain optimization measure, each step must ensure that a local optimal solution can be obtained. Only one data is considered in each step, and his selection should meet the conditions of local optimization. If the next data and the partial optimal solution are no longer a feasible solution, the data is not added to the partial solution until all the data is enumerated, or no more data can be added to stop the algorithm
遗传算法(genetic algorithm)是一类借鉴生物界的进化规律设计的算法,用于模拟自然进化搜索最优解。Genetic algorithm (genetic algorithm) is a type of algorithm designed based on the evolutionary laws of the biological world, and is used to simulate natural evolution to search for the optimal solution.
需要说明的是,在根据上述神经网络的层的划分结果处理所述神经网络的输入数据过程中,图中的层的调度顺序为根据所述图中包含的各个子图的调度顺序,以及所述子图中的层的调度顺序确定。所述子图中的层的调度顺序与神经网络中的层的调度顺序相同。例如,子图包含的层的批大小对应的批按照所述子图包含的层的顺序处理。图中包含的各个子图的调度顺序为根据批大小以及子图中的首层和末层的调度顺序确定。图包含的子图的层间数据进行聚集处理或散开处理。It should be noted that in the process of processing the input data of the neural network according to the division result of the layers of the neural network, the scheduling order of the layers in the figure is based on the scheduling order of the subgraphs contained in the figure, and the The scheduling sequence of the layers in the subgraph is determined. The scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network. For example, batches corresponding to the batch size of the layers included in the sub-picture are processed in the order of the layers included in the sub-picture. The scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph. The inter-layer data of the sub-graphs contained in the graph are aggregated or scattered.
第二方面,本申请实施例还提供了一种神经网络的数据处理装置,有益效果可以参见第一方面的描述此处不再赘述。所述神经网络的数据处理装置具有实现上述第一方面的方法实例中处理器行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。在一个可能的设计中,所述神经网络的数据处理装置包括:获取单元和处理单元。获取单元,用于获取神经网络的输入数据的数据量、运行神经网络的芯片内的内部存储器的第一特征和神经网络中多个层的第二特征。处理单元,用于依据所述数据量、第一特征和第二特征确定所述多个层中每个层的批大小,多个层中至少两个层的批大小不同。这些模块可以执行上述第一方面方法示例中的相应功能,具体参见方法示例中的详细描述,此处不做赘述。In the second aspect, the embodiment of the present application also provides a neural network data processing device, and the beneficial effects can be referred to the description of the first aspect and will not be repeated here. The data processing device of the neural network has the function of realizing the behavior of the processor in the method example of the first aspect described above. The functions can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions. In a possible design, the data processing device of the neural network includes: an acquisition unit and a processing unit. The acquiring unit is used to acquire the data amount of the input data of the neural network, the first feature of the internal memory in the chip running the neural network, and the second feature of multiple layers in the neural network. The processing unit is configured to determine the batch size of each layer in the multiple layers according to the data amount, the first characteristic, and the second characteristic, and the batch sizes of at least two layers in the multiple layers are different. These modules can perform the corresponding functions in the method example of the first aspect. For details, please refer to the detailed description in the method example, which will not be repeated here.
第三方面,提供了一种神经网络的数据处理装置,该神经网络的数据处理装置可以为处理器。例如,图形处理器(Graphics Processing Unit,GPU)、神经网络处理器(Neural-network Processing Unit,NPU)、高级精简指令集处理器(Advanced RISC Machines,ARM)等,可选的,该神经网络的数据处理装置还包括存储器。其中,该存储器用于存储计算机程序或指令,处理器与存储器耦合,当处理器执行所述计算机程序或指令时,使神经网络的数据处理装置执行上述方法实施例中由处理器所执行的方法。In a third aspect, a neural network data processing device is provided, and the neural network data processing device may be a processor. For example, graphics processor (Graphics Processing Unit, GPU), neural network processor (Neural-network Processing Unit, NPU), advanced reduced instruction set processor (Advanced RISC Machines, ARM), etc., optionally, the neural network The data processing device also includes a memory. Wherein, the memory is used to store computer programs or instructions, and the processor is coupled with the memory. When the processor executes the computer programs or instructions, the data processing device of the neural network is caused to execute the method executed by the processor in the above method embodiments. .
第四方面,提供了一种计算机程序产品,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码运行时,使得上述第一方面中由处理器执行的方法被执行。In a fourth aspect, a computer program product is provided. The computer program product includes: computer program code, which when the computer program code runs, causes the method executed by the processor in the first aspect to be executed.
第五方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于实现上述第一方面的方法中处理器的功能。在一种可能的设计中,所述芯片系统还包括存储器,用于保存程序指令或数据中至少一个。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。In a fifth aspect, the present application provides a chip system, the chip system includes a processor, and is configured to implement the function of the processor in the method of the first aspect. In a possible design, the chip system further includes a memory for storing at least one of program instructions or data. The chip system can be composed of chips, and can also include chips and other discrete devices.
第六方面,本申请提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,当该计算机程序被运行时,实现上述第一方面中由处理器执行的方法。In a sixth aspect, the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed, the method executed by the processor in the first aspect described above is implemented.
本申请中,处理器和神经网络的数据处理装置的名字对设备本身不构成限定,在实际实现中,这些设备可以以其他名称出现。只要各个设备的功能和本申请类似,属于本申请权利要求及其等同技术的范围之内。In this application, the names of the processor and the data processing device of the neural network do not constitute a limitation on the device itself. In actual implementation, these devices may appear under other names. As long as the function of each device is similar to that of this application, it falls within the scope of the claims of this application and its equivalent technologies.
附图说明Description of the drawings
图1为本申请一实施例提供的神经网络的原理示意图;FIG. 1 is a schematic diagram of the principle of a neural network provided by an embodiment of this application;
图2为本申请一实施例提供的神经网络系统的结构示意图;FIG. 2 is a schematic structural diagram of a neural network system provided by an embodiment of this application;
图3为本申请一实施例提供的神经网络芯片的结构示意图;FIG. 3 is a schematic diagram of the structure of a neural network chip provided by an embodiment of the application;
图4为本申请一实施例提供的处理器件的结构示意图;4 is a schematic structural diagram of a processing device provided by an embodiment of this application;
图5为本申请一实施例提供的神经网络中的层的结构示意图;FIG. 5 is a schematic diagram of the structure of layers in a neural network provided by an embodiment of this application;
图6为本申请一实施例提供的重叠问题的示意图;FIG. 6 is a schematic diagram of the overlap problem provided by an embodiment of the application;
图7为本申请一实施例提供的子图的示意图;FIG. 7 is a schematic diagram of a sub-picture provided by an embodiment of this application;
图8为本申请一实施例提供的图的示意图;FIG. 8 is a schematic diagram of a diagram provided by an embodiment of the application;
图9为本申请一实施例提供的子图间的层间数据进行聚集处理的示意图;FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between sub-pictures according to an embodiment of the application; FIG.
图10为本申请一实施例提供的子图间的层间数据进行散开处理的示意图;FIG. 10 is a schematic diagram of dispersing processing of inter-layer data between sub-pictures provided by an embodiment of the application; FIG.
图11为本申请一实施例提供的图的处理的示意图;FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application;
图12为本申请一实施例提供的神经网络的数据处理方法流程图;FIG. 12 is a flowchart of a neural network data processing method provided by an embodiment of the application;
图13为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 13 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图14为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 14 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图15为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 15 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图16为本申请一实施例提供的神经网络的数据处理装置结构示意图;16 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application;
图17为本申请一实施例提供的神经网络的数据处理装置结构示意图。FIG. 17 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
具体实施方式detailed description
本申请说明书和权利要求书及上述附图中的术语“第一”、“第二”和“第三”等是用于区别不同对象,而不是用于限定特定顺序。在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。The terms "first", "second", and "third" in the specification and claims of this application and the above-mentioned drawings are used to distinguish different objects, rather than to limit a specific order. In the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations, or illustrations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as being more preferable or advantageous than other embodiments or design solutions. To be precise, words such as "exemplary" or "for example" are used to present related concepts in a specific manner.
神经网络(neural network,NN)也可以称为人工神经网络(artificial neural network,ANN)或类神经网络。在机器学习和认知科学领域,神经网络是一种模仿生物神经网络(动物的中枢神经系统,特别是大脑)的结构和功能的数学模型或计算模型,用于对函数进行估计或近似。神经网络可以包括卷积神经网络(convolutional neural network,CNN)、深度神经网络(deep neural network,DNN)、多层感知器(multilayer perceptron,MLP)和循环神经网络(recurrent neural network,RNN)等神经网络。Neural network (NN) may also be called artificial neural network (ANN) or similar neural network. In the field of machine learning and cognitive science, a neural network is a mathematical model or calculation model that imitates the structure and function of a biological neural network (an animal's central nervous system, especially the brain), and is used to estimate or approximate functions. Neural networks can include convolutional neural network (convolutional neural network, CNN), deep neural network (deep neural network, DNN), multilayer perceptron (multilayer perceptron, MLP) and recurrent neural network (recurrent neural network, RNN), etc. The internet.
(1)神经网络(1) Neural network
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元。该运算单元的输出满足如下公式(1)。 A neural network can be composed of neural units, which can refer to an arithmetic unit that takes x s and intercept 1 as inputs. The output of this arithmetic unit satisfies the following formula (1).
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$    (1)
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。 Among them, s=1, 2,...n, n is a natural number greater than 1, W s is the weight of x s , and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field. The local receptive field can be a region composed of several neural units.
如图1所示,为本申请一实施例提供的神经网络的原理示意图。该神经网络100具有N个处理层,N≥3且N取自然数。该神经网络的第一层为输入层110,负责接收输入信号,该神经网络的最后一层为输出层130,输出神经网络的处理结果。除去第一层和最后一层的其他层为中间层140,这些中间层140共同组成隐藏层120,隐藏层120中的每一层中间层140既可以接收输入信号,也可以输出信号。隐藏层120负责输入信号的处理过程。每一层代表了信号处理的一个逻辑级别,通过多个层,数据信号可经过多级逻辑的处理。As shown in FIG. 1, it is a schematic diagram of the principle of a neural network provided by an embodiment of this application. The neural network 100 has N processing layers, N≧3 and N takes a natural number. The first layer of the neural network is the input layer 110, which is responsible for receiving input signals, and the last layer of the neural network is the output layer 130, which outputs the processing results of the neural network. The other layers excluding the first and last layers are intermediate layers 140. These intermediate layers 140 collectively form the hidden layer 120. Each intermediate layer 140 in the hidden layer 120 can receive input signals and output signals. The hidden layer 120 is responsible for the processing of the input signal. Each layer represents a logic level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
在一些可行的实施例中该神经网络的输入信号可以是视频信号、语音信号、文本信号、图像信号、温度信号等各种形式的信号。在本实施例中,被处理的图像信号可以是相机(图像传感器)拍摄的风景信号、显监控设备捕捉的社区环境的图像信号以及门禁系统获取的人脸的面部信号等各类传感器信号。该神经网络的输入信号还包括其他各种计算机可处理的工程信号,在此不再一一列举。若利用神经网络对图像信号进行深度学习,可提高图像质量。In some feasible embodiments, the input signal of the neural network may be a signal in various forms such as a video signal, a voice signal, a text signal, an image signal, and a temperature signal. In this embodiment, the processed image signal may be various sensor signals such as a landscape signal taken by a camera (image sensor), an image signal of a community environment captured by a display monitoring device, and a facial signal of a human face obtained by an access control system. The input signal of the neural network also includes various other engineering signals that can be processed by computers, which will not be listed here. If the neural network is used for deep learning of the image signal, the image quality can be improved.
(2)深度神经网络(2) Deep neural network
深度神经网络也称多层神经网络,可以理解为具有多层隐藏层的神经网络。按照不同层的位置对深度神经网络进行划分,深度神经网络内部的神经网络可以分为三类:输入层,隐藏层和输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐藏层。层与层之间是全连接的,也就是说,第i层的任意一个神经元与第i+1层的任意一个神经元相连。Deep neural network is also called multi-layer neural network, which can be understood as a neural network with multiple hidden layers. The deep neural network is divided according to the position of different layers. The neural network inside the deep neural network can be divided into three categories: input layer, hidden layer and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the number of layers in the middle are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the i+1-th layer.
虽然深度神经网络看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:y=α(Wx+b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于深度神经网络的层数多,系数W和偏移向量b的数量也比较多。这些参数在深度神经网络中的定义如下所述:以系数W为例:假设在一个三层的深度神经网络中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为
$W_{24}^{3}$
其中,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。
Although the deep neural network looks complicated, it is not complicated in terms of the work of each layer. Simply put, it is the following linear relationship expression: y=α(Wx+b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficient), and α() is the activation function. Each layer is just to get the output vector y after such a simple operation on the input vector x. Due to the large number of layers of the deep neural network, the number of coefficients W and offset vectors b is also relatively large. The definition of these parameters in a deep neural network is as follows: Take the coefficient W as an example: suppose that in a three-layer deep neural network, the fourth neuron of the second layer to the second neuron of the third layer The linear coefficient is defined as
Figure PCTCN2020093624-appb-000002
Among them, the superscript 3 represents the number of layers where the coefficient W is located, and the subscript corresponds to the output third-level index 2 and the input second-level index 4.
综上,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为
Figure PCTCN2020093624-appb-000003
In summary, the coefficients from the kth neuron in the L-1th layer to the jth neuron in the Lth layer are defined as
Figure PCTCN2020093624-appb-000003
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐藏层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。It should be noted that there is no W parameter in the input layer. In deep neural networks, more hidden layers make the network more capable of portraying complex situations in the real world. In theory, a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is also the process of learning a weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by many layers of vectors W).
(3)卷积神经网络(3) Convolutional neural network
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无 关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。Convolutional neural network is a deep neural network with convolutional structure. The convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be regarded as a filter. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can be connected to only part of the neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
如图2所示,为本申请一实施例提供的神经网络系统的结构示意图。神经网络系统200包括主机210以及神经网络电路220。神经网络电路220通过主机接口与主机210连接。主机接口可以包括标准的主机接口以及网络接口(network interface)。例如,主机接口可以包括快捷外设互联标准(peripheral component interconnect express,PCIe)接口。如图2所示,神经网络电路220可以通过PCIe总线230与主机210连接。因此,数据可以通过PCIe总线230输入至神经网络电路220中,并通过PCIe总线230接收神经网络电路220处理完成后的数据。并且,主机210也可以通过主机接口监测神经网络电路220的工作状态。As shown in FIG. 2, it is a schematic structural diagram of a neural network system provided by an embodiment of this application. The neural network system 200 includes a host 210 and a neural network circuit 220. The neural network circuit 220 is connected to the host 210 through a host interface. The host interface may include a standard host interface and a network interface (network interface). For example, the host interface may include a peripheral component interconnect express (PCIe) interface. As shown in FIG. 2, the neural network circuit 220 may be connected to the host 210 through the PCIe bus 230. Therefore, data can be input to the neural network circuit 220 via the PCIe bus 230, and data processed by the neural network circuit 220 can be received via the PCIe bus 230. In addition, the host 210 can also monitor the working status of the neural network circuit 220 through the host interface.
主机210包括处理器(processor)211以及内存212。需要说明的是,除了图2所示的器件外,主机210还可以包括通信接口以及作为外部存储器的磁盘等其他器件,在此不做限制。主机210可以认为是一个集成电路也可以是一个独立的设备。The host 210 includes a processor 211 and a memory 212. It should be noted that, in addition to the devices shown in FIG. 2, the host 210 may also include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited here. The host 210 can be considered as an integrated circuit or an independent device.
处理器211是主机210的运算核心和控制核心(control unit)。处理器211中可以包括多个处理器核(core)。处理器211可以是一块超大规模的集成电路。在处理器211中安装有操作系统和其他软件程序,从而处理器211能够实现对内存212、缓存、磁盘及外设设备(如图2中的神经网络电路)的访问。可以理解的是,在本申请实施例中,处理器211中的处理器核可以是中央处理器(central processing unit,CPU),还可以是其他特定集成电路(application specific integrated circuit,ASIC)。The processor 211 is the computing core and control unit of the host 210. The processor 211 may include multiple processor cores (cores). The processor 211 may be a very large-scale integrated circuit. An operating system and other software programs are installed in the processor 211, so that the processor 211 can implement access to the memory 212, cache, disk, and peripheral devices (such as the neural network circuit in FIG. 2). It can be understood that, in the embodiment of the present application, the processor core in the processor 211 may be a central processing unit (CPU), or may also be other application specific integrated circuits (ASICs).
内存212是主机210的主存。内存212通过双倍速率(double data rate,DDR)总线和处理器211相连。内存212通常用来存放操作系统中各种正在运行的软件、输入和输出数据以及与外部存储器交换的信息等。为了提高处理器211的访问速度,内存212需要具备访问速度快的优点。在传统的计算机系统架构中,通常采用动态随机存取存储器(dynamic random access memory,DRAM)作为内存212。处理器211能够通过内存控制器(图2中未示出)高速访问内存212,对内存212中的任意一个存储单元进行读操作和写操作。The memory 212 is the main memory of the host 210. The memory 212 is connected to the processor 211 via a double data rate (DDR) bus. The memory 212 is generally used to store various running software in the operating system, input and output data, and information exchanged with an external memory. In order to increase the access speed of the processor 211, the memory 212 needs to have the advantage of fast access speed. In a traditional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 212. The processor 211 can access the memory 212 at a high speed through a memory controller (not shown in FIG. 2), and perform a read operation and a write operation on any storage unit in the memory 212.
神经网络电路220可以是一种运行神经网络的芯片。神经网络电路220是由多个神经网络芯片(chip)组成的芯片阵列。例如,如图2所示,神经网络电路220包括多个进行数据处理的神经网络芯片(chip)221和多个路由器222。为了描述方便,本申请实施例将神经网络芯片221简称为芯片221。所述多个芯片221通过路由器222相互连接。例如,一个芯片221可以与一个或多个路由器222连接。多个路由器222可以组成一种或多种网络拓扑。芯片221之间可以通过所述多种网络拓扑进行数据传输。神经网络电路220还可以包括存储器223、输入端口224和输出端口225等其他器件。存储器用于存储数据、计算机程序和指令。The neural network circuit 220 may be a chip that runs a neural network. The neural network circuit 220 is a chip array composed of a plurality of neural network chips. For example, as shown in FIG. 2, the neural network circuit 220 includes a plurality of neural network chips 221 for data processing and a plurality of routers 222. For the convenience of description, the neural network chip 221 is referred to as the chip 221 for short in the embodiment of the present application. The multiple chips 221 are connected to each other through a router 222. For example, one chip 221 may be connected to one or more routers 222. Multiple routers 222 can form one or more network topologies. The chips 221 can transmit data through the multiple network topologies described above. The neural network circuit 220 may also include other devices such as a memory 223, an input port 224, and an output port 225. The memory is used to store data, computer programs and instructions.
图3为本申请一实施例提供的神经网络芯片的结构示意图。芯片221中包括多个路由器310,每个路由器310可以连接一个瓦片(tile)320。实际应用中,一个路由器310还可以连接多个瓦片320。如图3所示,每个瓦片320可以包括输入输出接口(TxRx)321、交换装置322、多个处理器件(processing element,PE)323和存储器324。输 入输出接口321用于接收从路由器310输入到瓦片320的数据,或者输出瓦片320的计算结果。换一种表达方式,输入输出接口321用于实现瓦片320和路由器310之间的数据传输。交换装置322连接输入输出接口321和多个处理器件323。交换装置322用于实现输入输出接口321和多个处理器件323之间的数据传输。存储器324用于存储数据、计算机程序和指令。每个瓦片320还可以包括控制器325,控制器325用于控制输入输出接口321和多个处理器件323,使系统正常工作。每个处理器件323可以包括一个或多个计算引擎(computing engine)326。一个或多个计算引擎326用于实现对输入到计算引擎326中的数据进行神经网络计算。例如,可以对输入到瓦片320的数据与瓦片320中预设的卷积核进行乘加运算。计算引擎326的计算结果可以通过交换装置322和输入输出接口321发送给其他瓦片320。实际应用中,一个计算引擎326可以包括实现卷积、池化(pooling)或其他神经网络操作的模块。在此,不对计算引擎326的具体电路或功能进行限定。为了描述简便,在本申请实施例中,将计算引擎简称为引擎(engine)。FIG. 3 is a schematic structural diagram of a neural network chip provided by an embodiment of the application. The chip 221 includes a plurality of routers 310, and each router 310 can be connected to a tile 320. In practical applications, one router 310 can also connect multiple tiles 320. As shown in FIG. 3, each tile 320 may include an input/output interface (TxRx) 321, a switching device 322, multiple processing elements (PE) 323, and a memory 324. The input/output interface 321 is used to receive data input from the router 310 to the tile 320, or to output the calculation result of the tile 320. To put it another way, the input/output interface 321 is used to implement data transmission between the tile 320 and the router 310. The switching device 322 connects the input/output interface 321 and a plurality of processing devices 323. The switching device 322 is used to implement data transmission between the input/output interface 321 and the multiple processing devices 323. The memory 324 is used to store data, computer programs and instructions. Each tile 320 may also include a controller 325, which is used to control the input/output interface 321 and multiple processing devices 323 to make the system work normally. Each processing device 323 may include one or more computing engines 326. One or more calculation engines 326 are used to implement neural network calculations on the data input to the calculation engine 326. For example, the data input to the tile 320 and the preset convolution kernel in the tile 320 may be multiplied and added. The calculation result of the calculation engine 326 can be sent to other tiles 320 through the switching device 322 and the input/output interface 321. In practical applications, a calculation engine 326 may include modules that implement convolution, pooling, or other neural network operations. Here, the specific circuit or function of the calculation engine 326 is not limited. For simplicity of description, in the embodiments of the present application, the calculation engine is referred to as engine for short.
如图4所示,为本申请一实施例提供的处理器件的结构示意图。处理器件323还可以包括控制器327和总线328。控制器327用于接收数据,并调度处理器件323内的一个或多个引擎326处理数据,使系统正常工作。多个引擎326通过总线328进行数据传输。引擎326连接一个或多个独占的存储器3210。可选的,多个引擎326还可以共享一个或多个存储器329。As shown in FIG. 4, it is a schematic structural diagram of a processing device provided by an embodiment of this application. The processing device 323 may also include a controller 327 and a bus 328. The controller 327 is used for receiving data, and scheduling one or more engines 326 in the processing device 323 to process the data, so that the system works normally. The multiple engines 326 perform data transmission through the bus 328. The engine 326 is connected to one or more exclusive memories 3210. Optionally, multiple engines 326 may also share one or more memories 329.
In this document, the memories in the neural network circuit 220 may be cache memories, that is, caches. For example, the memory 223, the memory 324, the memory 329, and the memory 3210 may all be cache memories.
In this document, the cache memories in the neural network circuit 220 are composed of static random access memory (SRAM), whose capacity is relatively small but whose speed is much higher than that of the main memory and close to the speed of the CPU. A cache memory may be an L1 cache, an L2 cache, or an L3 cache. For example, the memory 3210 is an L1 cache; the memory 329 is an L2 cache or an L3 cache; the memory 223 is an L2 cache or an L3 cache; the memory 324 is an L2 cache or an L3 cache.
From the above description, it can be seen that the neural network circuit 220 provided by the embodiments of this application includes a plurality of neural network chips 221, each neural network chip 221 includes a plurality of tiles 320, each tile 320 includes a plurality of processing elements 323, and each processing element 323 includes a plurality of engines 326. Therefore, the neural network system provided by the embodiments of this application may include multiple levels of computing nodes, for example, four levels: the first-level computing node is the chip 221, the second-level computing node is the tile 320 in the chip 221, the third-level computing node is the processing element 323 in the tile 320, and the fourth-level computing node is the engine 326 in the processing element 323.
The neural network system provided by the embodiments of this application may be applied to a mobile terminal, a monitoring terminal, a server, or the like, to implement related neural network operations.
Those skilled in the art know that a neural network includes multiple neural network layers. In the embodiments of this application, a neural network layer is a logical concept: one neural network layer refers to one neural network operation to be performed.
The neural network may include n neural network layers (also referred to as an n-layer neural network), where n is an integer greater than or equal to 2. The first neural network layer and the second neural network layer may be two of the n layers that have an operational dependency. In the embodiments of this application, two neural network layers with a dependency means that the input data of one layer includes the output data of the other layer; two such layers may also be referred to as adjacent layers. Optionally, the input of a neural network layer may come from more than one layer, possibly from the preceding m layers; similarly, the output of a layer may be delivered not only to the next layer but possibly to the following m layers.
FIG. 5 shows some of the neural network layers in a neural network; the layers may include convolutional layers, pooling layers, and so on. The neural network 500 may include a first layer 502, a second layer 504, a third layer 506, a fourth layer 508, a fifth layer 510, up to an nth layer 512. The first layer 502 may perform a convolution operation, the second layer 504 may perform a pooling operation on the output data of the first layer 502, the third layer 506 may perform a convolution operation on the output data of the second layer 504, the fourth layer 508 may perform a convolution operation on the output of the third layer 506, the fifth layer 510 may perform a summation operation on the output data of the second layer 504 and the output data of the fourth layer 508, and so on. It should be understood that FIG. 5 is only a simple example of the layers in a neural network and does not limit the specific operation of each layer; for example, the fourth layer 508 may also perform a pooling operation, and the fifth layer 510 may also perform other neural network operations such as convolution or pooling.
The output data of the first layer 502 is the input data of the second layer 504; therefore, the first layer 502 and the second layer 504 have a dependency. The output data of the second layer 504 is the input data of the third layer 506, so the second layer 504 and the third layer 506 have a dependency. The output data of the third layer 506 is the input data of the fourth layer 508, so the third layer 506 and the fourth layer 508 have a dependency. The input data of the fifth layer 510 includes the output data of the second layer 504 and the output data of the fourth layer 508; therefore, the second layer 504 and the fifth layer 510 also have a dependency, and the fourth layer 508 and the fifth layer 510 also have a dependency.
The computation of each layer in the neural network is performed by computing nodes. In practical applications, different application scenarios require different amounts of computation. Therefore, the computing nodes in the neural network system may be divided at the granularity of chips, tiles, processing elements, or engines according to the actual application, so that computing nodes in different sets process the operations of different neural network layers. In this manner, a computing node referred to in the embodiments of this application may be a chip 221, a tile 320, a processing element 323, or an engine 326.
In the neural network inference process, after the computation of the i-th layer of the neural network is completed, the computation result of the i-th layer (inter-layer data) is temporarily stored in a preset cache. When the computation of the (i+1)-th layer is performed, the computing node reloads the computation result of the i-th layer and the weights of the (i+1)-th layer from the preset cache for computation. The i-th layer is any layer in the neural network. For example, as shown in FIG. 5, after the computation of the second layer 504 is completed, the output data (inter-layer data) of the second layer 504 is temporarily stored in the preset memory 329; when the computation of the fifth layer 510 is performed, the computing node reloads the computation result of the second layer 504 and the weights of the fifth layer 510 from the preset memory 329 for computation.
The preset cache differs depending on the computing node. For example, if the computing node is an engine 326, the preset cache may be the memory 329 or the memory 3210. If the computing node is a processing element 323, the preset cache may be the memory 324. If the computing node is a tile 320, the preset cache may be a memory in the tile 320. If the computing node is a chip 221, the preset cache may be the memory 223.
It should be understood that memory outside the neural network circuit 220 is called external memory; for example, the external memory is the memory 212 shown in FIG. 2. Memory inside the neural network circuit 220 is called internal memory; for example, the internal memory is the memory 223 shown in FIG. 2, the memory 324 shown in FIG. 3, or the memory 329 and the memory 3210 shown in FIG. 4. The so-called external memory refers to memory outside the chip that runs the neural network; for example, the external memory may be a magnetic disk or the memory 212 shown in FIG. 2.
To facilitate understanding of the technical solutions provided by the embodiments of this application, some terms used in the embodiments of this application are explained first.
1) Batch size
Limited by the capacity of the internal memory, the amount of data that each layer in the neural network can process at a time is the batch size corresponding to that layer. The batch corresponding to a batch size may be one picture, multiple pictures, or part of one picture. For example, suppose the capacity of the internal memory is 100. If the cache demand generated by layer 1 (L1) processing 1 picture is 60, then each time layer 1 is scheduled it can process at most 1 picture, and the batch size corresponding to layer 1 is 1 picture. If the data cache demand generated by layer 2 processing 1 picture is 30, then each time layer 2 is scheduled it can process at most 3 pictures, and the batch size corresponding to layer 2 is 3 pictures. The batch size not only affects the usage of the internal memory of the chip running the neural network, but also affects the optimization degree and processing speed of the neural network.
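The per-layer batch sizes in the example above follow directly from the ratio of the internal memory capacity to each layer's per-picture cache demand. The following is a minimal sketch of that calculation, not part of the embodiments; the per-picture demands are assumed to be known, and the function name is illustrative only.

def batch_sizes(capacity, per_picture_demand):
    """capacity: internal memory capacity; per_picture_demand: {layer: cache demand for one picture}."""
    sizes = {}
    for layer, demand in per_picture_demand.items():
        # Largest whole number of pictures whose cache demand still fits.
        # Assumes each layer can hold at least one picture; otherwise the picture
        # itself would have to be split (see the overlap problem below).
        sizes[layer] = capacity // demand
    return sizes

# Numbers from the text: capacity 100, layer 1 needs 60 per picture, layer 2 needs 30.
print(batch_sizes(100, {"L1": 60, "L2": 30}))  # {'L1': 1, 'L2': 3}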
2) Overlap problem
In some scenarios where a neural network processes pictures, limited by the capacity of the internal memory, the data of an entire picture may need to be split into two or more pieces, each used as one batch of input data; each such piece is referred to as non-whole-picture data. A convolutional layer may use a padding algorithm to process non-whole-picture input data: before the computation with the convolution kernel, the size of the input data is artificially enlarged by padding to offset the shrinkage caused by the computation. The padding algorithm may be, for example, zero padding, repeated boundary-value padding, or another method. In other words, if the input data is non-whole-picture data, it needs to be processed with a padding algorithm; if the input data is whole-picture data, no padding algorithm is needed.
Taking the padding algorithm as an example, if a convolutional layer uses padding, then when the convolutional layer is interpreted, the input data needs to be padded first and then flattened. If the stride of the convolution kernel is smaller than the side length of the convolution kernel (which is usually square), the regions covered by the convolution kernel on the original input matrix overlap; when the stride equals the side length of the convolution kernel, no overlap occurs. If the input data size is (w*w), the padded data size is (w+k-s)*(w+k-s), where k is the side length of the convolution kernel, s is the stride of the convolution kernel, and the amount of padding is (k-s).
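As a quick illustration of the relation above, the following sketch computes the padded size for a given input width, kernel side length, and stride; the function name is illustrative and not part of the embodiments.

def padded_size(w, k, s):
    # A (w*w) input padded for a k*k kernel moving with stride s becomes (w+k-s)*(w+k-s);
    # overlap only arises when the stride s is smaller than the kernel side k.
    return w + k - s

# A 14-row slice convolved with a 3*3 kernel at stride 1 needs 16 input rows.
print(padded_size(14, 3, 1))  # 16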
For example, referring to FIG. 6, suppose the layers of a neural network include layer 0, layer 1, layer 2, and layer 3, each with a 3*3 convolution kernel and a stride of 1. Since the stride is smaller than the side length of the kernel, the overlap problem arises when the padding algorithm is used to process the input data. For instance, the whole picture is 56*56, and the rows of the picture are split into 4 parts for processing. If layer 0, layer 1, and layer 2 are scheduled as one layer group, layer 2 must output 14 rows of data, that is, the output data size of the layer group is 14*56, so that layer 3 can process one quarter of the picture's rows. The input data of layer 2 then needs 2 additional rows, that is, its input size is 16*56. Correspondingly, the input size of layer 1 is 18*56, and the input size of layer 0 is 20*56. In other words, when the whole picture is split for processing, guaranteeing the output size increases the cache demand of the layers in the layer group. Moreover, the more layers a layer group contains, the more data the earlier layers need to pad; if the internal memory capacity is small, this limits the size of the layer group.
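The growth of the required input rows through the layer group in FIG. 6 can be reproduced by walking the layers backwards from the desired output. Below is a minimal sketch under the same assumptions (3*3 kernels, stride 1); the helper name is illustrative only.

def input_rows_per_layer(output_rows, kernels_and_strides):
    """kernels_and_strides: list of (k, s) from the first to the last layer of the group."""
    rows = output_rows
    needed = []
    for k, s in reversed(kernels_and_strides):
        rows += k - s          # overlap rows that must additionally be supplied
        needed.append(rows)
    return list(reversed(needed))

# Layer group of layers 0-2 in FIG. 6: 14 output rows require 20, 18, and 16 input rows.
print(input_rows_per_layer(14, [(3, 1), (3, 1), (3, 1)]))  # [20, 18, 16]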
3) Subgraph
As described above, a neural network contains multiple layers; it can be described as including multiple layers arranged as a directed graph, where each layer may have a corresponding set of parameters. A subgraph is obtained by dividing the layers of the neural network according to the batch size of each layer. A subgraph contains one or more layers with the same batch size. A subgraph may also be described as a super layer or a layer group, meaning that it contains one layer or several consecutive layers of the neural network.
In some examples, the neural network is scheduled to process input data in units of subgraphs, and the scheduling order of the layers in a subgraph is the same as the scheduling order of the layers in the neural network. The batches corresponding to the batch size of the layers in a subgraph are processed in the order of the layers in the subgraph. The inter-layer data of the multiple layers contained in a subgraph is stored in the internal memory, and the inter-layer data between subgraphs is stored in the internal memory.
For example, FIG. 7 is a schematic diagram of a subgraph according to an embodiment of this application. The subgraph includes layer 0 and layer 1; the batch size of layer 0 and the batch size of layer 1 are both 1. In the following, a batch corresponding to a batch size of 1 may be one picture, multiple pictures, or part of one picture. Layer 0 processes one batch at a time, and layer 1 processes one batch at a time.
Suppose layer 0 and layer 1 in the subgraph process batch A0 and batch A1. Batch A0 and batch A1 may be batches of the input data to be processed by the neural network, or they may be inter-layer data that has already been processed by other layers of the neural network. The batch size of batch A0 and of batch A1 is 1. The execution order of the batches within the subgraph is shown by the bold arrows in the figure. For ease of understanding, the processing of batch A0 and of batch A1 by layer 0 and layer 1 is shown separately.
Layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1. Inter-layer data C0 and inter-layer data C1 may be stored in the internal memory.
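The order in FIG. 7 is simply "for each batch, run every layer of the subgraph before moving to the next batch". A minimal sketch of that loop is shown below; run_layer is a hypothetical helper standing in for whichever computing node executes the layer, and only the ordering is illustrated.

def run_subgraph(layers, batches, run_layer):
    # Inter-layer data produced inside the loop stays in the internal memory;
    # only the subgraph's final outputs (C0, C1 in FIG. 7) remain afterwards.
    outputs = []
    for batch in batches:            # e.g. [A0, A1]
        data = batch
        for layer in layers:         # e.g. [layer0, layer1]
            data = run_layer(layer, data)
        outputs.append(data)         # C0, then C1
    return outputs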
4) Graph
A graph includes one or more subgraphs. A graph may also be described as a super layer or a layer group, meaning that it contains one layer or several consecutive layers of the neural network.
In some embodiments, every subgraph in a graph contains layers with the same batch size. For example, as shown in (a) of FIG. 8, suppose the graph includes subgraph 1 and subgraph 2. Subgraph 1 includes layer 0 and layer 1, and the batch size of layer 0 is the same as that of layer 1. Subgraph 2 includes layer 2 and layer 3, and the batch size of layer 2 is the same as that of layer 3. The batch sizes of layer 0 and layer 1 are both one batch, and the batch sizes of layer 2 and layer 3 are both one batch. In summary, all subgraphs included in the graph contain layers with the same batch size.
In other embodiments, at least two of the subgraphs included in a graph contain layers with different batch sizes. As shown in (b) of FIG. 8, suppose the graph includes subgraph 1, subgraph 2, and subgraph 3. Subgraph 1 includes layer 0 and layer 1, whose batch sizes are the same. Subgraph 2 includes layer 2 and layer 3, whose batch sizes are the same. Subgraph 3 includes layer 4 and layer 5, whose batch sizes are the same. The batch sizes of layer 0 and layer 1 are both one batch, the batch sizes of layer 2 and layer 3 are both one batch, and the batch sizes of layer 4 and layer 5 are both two batches. In summary, the batch size of the layers in subgraph 3 differs from the batch size of the layers in subgraph 1, and also differs from the batch size of the layers in subgraph 2.
In some examples, the neural network is scheduled to process input data in units of graphs, and the scheduling order of the layers in a graph is the same as the scheduling order of the layers in the neural network. The scheduling order of the subgraphs contained in a graph is determined according to the batch sizes and the scheduling order of the first and last layers of the subgraphs. When multiple layers with different batch sizes in the neural network are scheduled as one graph, some data remains in the cache space of the internal memory, which creates additional internal memory cache demand. Inter-layer data between graphs is stored in the external memory. The scheduling process of the layers within a graph is described below under the gather and scatter problems.
5) Gather problem
In one possible implementation, the inter-layer data between the subgraphs contained in a graph is gathered. For example, FIG. 9 is a schematic diagram of gathering the inter-layer data of subgraphs according to an embodiment of this application. Suppose the graph includes subgraph 0 and subgraph 1. Subgraph 0 includes layer 0 and layer 1, whose batch sizes are both 1; layer 0 processes one batch at a time, and layer 1 processes one batch at a time. Subgraph 1 includes layer 2 and layer 3, whose batch sizes are both 2. In the following, a batch corresponding to a batch size of 2 may be two pictures, multiple pictures, or part of one picture. Layer 2 processes two batches at a time, and layer 3 processes two batches at a time. Suppose the graph processes batch A0 and batch A1. Batch A0 and batch A1 may be batches of the input data to be processed by the neural network, or inter-layer data already processed by other layers of the neural network; the batch size of batch A0 and of batch A1 is 1. Since layer 0 and layer 1 in subgraph 0 process one batch at a time while layer 2 and layer 3 in subgraph 1 process two batches at a time, subgraph 0 can first process batch A0 and batch A1 separately, and then subgraph 1 processes the inter-layer data of batch A0 and of batch A1 output by subgraph 0. The execution order of the batches within the graph is shown by the bold arrows in the figure. For ease of understanding, the processing of batch A0 and of batch A1 by layer 0 and layer 1 is shown separately.
For subgraph 0, layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1. Inter-layer data C0 and inter-layer data C1 may be stored in the internal memory.
For subgraph 1, layer 2 can obtain inter-layer data C0 and inter-layer data C1 from the internal memory; at this point, inter-layer data C0 and inter-layer data C1 can be combined into inter-layer data (C0, C1). Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1), and layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1). Inter-layer data (E0, E1) may be stored in the internal memory.
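The gather step of FIG. 9 amounts to running the small-batch subgraph several times, buffering its outputs, and handing the concatenation to the large-batch subgraph. A minimal sketch under those assumptions follows; run_layer and concat are hypothetical helpers, and only the ordering and buffering are illustrated.

def run_with_gather(subgraph0, subgraph1, batches, run_layer, concat):
    # Subgraph 0 (batch size 1) runs once per batch; its outputs are held in
    # internal memory until subgraph 1 (batch size 2) can consume them together.
    held = []                                  # C0, C1 kept in internal memory
    for batch in batches:                      # A0, then A1
        data = batch
        for layer in subgraph0:
            data = run_layer(layer, data)
        held.append(data)
    data = concat(held)                        # (C0, C1)
    for layer in subgraph1:
        data = run_layer(layer, data)          # (D0, D1), then (E0, E1)
    return data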
6) Scatter problem
In another possible implementation, the inter-layer data between the subgraphs contained in a graph is scattered. For example, FIG. 10 is a schematic diagram of scattering the inter-layer data of subgraphs according to an embodiment of this application. Suppose the graph includes subgraph 1 and subgraph 2. Subgraph 1 includes layer 2 and layer 3, whose batch sizes are both two batches; layer 2 processes two batches at a time, and layer 3 processes two batches at a time. Subgraph 2 includes layer 4 and layer 5, whose batch sizes are both one batch; layer 4 processes one batch at a time, and layer 5 processes one batch at a time. Since layer 2 and layer 3 in subgraph 1 process two batches at a time while layer 4 and layer 5 in subgraph 2 process one batch at a time, subgraph 1 can first finish processing the inter-layer data (C0, C1), and then subgraph 2 processes the inter-layer data E0 and E1 output by subgraph 1. The execution order of the batches within the graph is shown by the bold arrows in the figure. For ease of understanding, the processing of inter-layer data E0 and of inter-layer data E1 by layer 4 and layer 5 is shown separately.
For subgraph 1, layer 2 can obtain the inter-layer data (C0, C1) of batch A0 and batch A1 from the internal memory. Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1), and layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1). At this point, inter-layer data (E0, E1) may be stored in the internal memory.
For subgraph 2, layer 4 first obtains the inter-layer data (E0, E1) from the internal memory and divides it into inter-layer data E0 and inter-layer data E1. Layer 4 first processes E0 to obtain inter-layer data F0, and layer 5 processes F0 to obtain inter-layer data G0. Then layer 4 processes E1 to obtain inter-layer data F1, and layer 5 processes F1 to obtain inter-layer data G1. Inter-layer data G0 and inter-layer data G1 may be stored in the internal memory.
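The scatter step of FIG. 10 is the mirror image of the gather step: the large-batch subgraph runs once, its output is split, and the small-batch subgraph runs once per piece while the remaining pieces stay buffered. A minimal sketch under those assumptions follows; run_layer and split are hypothetical helpers.

def run_with_scatter(subgraph1, subgraph2, combined_batch, run_layer, split):
    # Subgraph 1 (batch size 2) runs once on the combined data; while subgraph 2
    # (batch size 1) processes E0, the remaining piece E1 stays in internal memory,
    # which is the additional cache demand described in the text.
    data = combined_batch                      # (C0, C1)
    for layer in subgraph1:
        data = run_layer(layer, data)          # (D0, D1), then (E0, E1)
    results = []
    for piece in split(data):                  # E0, then E1
        out = piece
        for layer in subgraph2:
            out = run_layer(layer, out)        # F0/G0, then F1/G1
        results.append(out)
    return results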
In another possible implementation, multiple graphs are scheduled for processing in the order of the layers of the neural network. Note that the data processed by a later graph is the data output by the previous graph. Dividing the layers of the neural network into multiple graphs and processing the batches graph by graph improves the utilization of the internal memory and the processing performance of the entire neural network.
For example, FIG. 11 is a schematic diagram of graph processing according to an embodiment of this application, where the horizontal axis represents the layers of the neural network and the vertical axis represents the batches. Suppose the neural network includes 12 layers. The batch sizes of layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are all one batch, that is, these layers each process one batch at a time. The batch sizes of layer 2, layer 3, layer 6, and layer 7 are all two batches, that is, these layers each process two batches at a time. The batch sizes of layer 8 and layer 9 are both four batches, that is, these layers each process four batches at a time. The 12 layers of the neural network are divided into two graphs, graph 0 and graph 1. Graph 0 includes layer 0 to layer 5, and graph 1 includes layer 6 to layer 11. The numbers in the boxes indicate the execution order of the batches, executed from the smallest number to the largest. Graph 0 is processed first, and then graph 1.
For graph 0, layer 0 processes batch 0, and layer 1 processes the inter-layer data of batch 0 output by layer 0 to obtain the inter-layer data of batch 0 output by layer 1, which is stored in the internal memory. Layer 0 then processes batch 1, and layer 1 processes the inter-layer data of batch 1 output by layer 0 to obtain the inter-layer data of batch 1 output by layer 1, which is stored in the internal memory. The inter-layer data of batch 0 and batch 1 is taken out of the internal memory; layer 2 processes the inter-layer data of batch 0 and batch 1, and layer 3 processes the inter-layer data of batch 0 and batch 1 output by layer 2 to obtain the inter-layer data of batch 0 and batch 1 output by layer 3. Layer 4 processes the inter-layer data of batch 0 output by layer 3, and layer 5 processes the inter-layer data of batch 0 output by layer 4; then layer 4 processes the inter-layer data of batch 1 output by layer 3, and layer 5 processes the inter-layer data of batch 1 output by layer 4.
Similarly, layer 0 to layer 5 process batch 2 and batch 3 in the same order as batch 0 and batch 1. The batches processed by graph 1 are the data output by graph 0. For graph 1, layer 6 processes the inter-layer data of batch 0 and batch 1 output by layer 5, and layer 7 processes the inter-layer data of batch 0 and batch 1 output by layer 6; the inter-layer data of batch 0 and batch 1 output by layer 7 is stored in the internal memory. Layer 6 then processes the inter-layer data of batch 2 and batch 3 output by layer 5, and layer 7 processes the inter-layer data of batch 2 and batch 3 output by layer 6; the inter-layer data of batch 2 and batch 3 output by layer 7 is stored in the internal memory. The inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7 is taken out of the internal memory; layer 8 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7, layer 9 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 8, and the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 9 is stored in the internal memory. Layer 10 then processes the inter-layer data of batch 0 output by layer 9, and layer 11 processes the inter-layer data of batch 0 output by layer 10; the same is done in turn for batch 1, batch 2, and batch 3.
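The interleaving in FIG. 11 can be reproduced mechanically: each graph is driven in windows equal to the largest batch size among its subgraphs, and within a window each subgraph consumes the data in chunks of its own batch size. The sketch below illustrates this under those assumptions; run_layer is a hypothetical helper, only the ordering is shown (data movement is omitted), and the batch-size assignments are the ones from FIG. 11.

def run_graphs(graphs, batches, run_layer):
    # A graph is a list of (layers, batch_size) subgraphs in network order.
    # Data crossing a graph boundary would go to the external memory.
    for graph in graphs:
        drive = max(bs for _, bs in graph)            # driving batch size of the graph
        for start in range(0, len(batches), drive):
            window = batches[start:start + drive]
            for layers, bs in graph:                  # subgraphs in network order
                for i in range(0, len(window), bs):   # gather/scatter inside the window
                    chunk = window[i:i + bs]
                    for layer in layers:
                        run_layer(layer, chunk)       # inter-layer data stays on chip

# FIG. 11: graph 0 = layers 0-5, graph 1 = layers 6-11, four batches in total.
graphs = [
    [([0, 1], 1), ([2, 3], 2), ([4, 5], 1)],
    [([6, 7], 2), ([8, 9], 4), ([10, 11], 1)],
]
run_graphs(graphs, [0, 1, 2, 3],
           lambda layer, chunk: print("layer", layer, "-> batches", chunk))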
Next, the data processing method of the neural network is described in detail with reference to FIG. 12. Here, the processor 211 performing the data processing method of the neural network is taken as an example. The internal memory includes the memory 223, the memory 324, the memory 329, and the memory 3210; the external memory is the memory 212. The computing nodes complete the operations of the neural network according to the determined batch sizes; a computing node is a chip 221, a tile 320, a processing element 323, or an engine 326. As shown in FIG. 12, the data processing method of the neural network includes S1201 and S1202.
S1201: The processor 211 obtains the data amount of the input data, a first feature of the internal memory in the chip running the neural network, and a second feature of the multiple layers of the neural network. The input data is the data received by the input layer of the neural network; for example, the input data is data in a data set. Taking image processing as an example, the input data may be 32 pictures in a data set.
The first feature includes at least one of the distribution of the internal memory within the chip and the capacity of the internal memory. It should be understood that the distribution of the internal memory within the chip includes the number of memories in the chip running the neural network and the connection relationships between the memories and the computing nodes. The chip contains many memories of considerable capacity, but these storage resources are not all used for neural network computation every time; the storage resources allocated to running the neural network vary, so the neural network configuration needs to be optimized dynamically according to the number of memories and the connection relationships between the memories and the computing nodes, that is, according to the distribution. For example, the distribution includes the number of memories 223, memories 324, memories 329, and memories 3210 included in the neural network circuit 220, as well as the connection relationship between the memory 223 and the chip 221, between the memory 324 and the processing element 323, between the memory 329 and the engine 326, and between the memory 3210 and the engine 326.
The capacity of the internal memory includes the capacity of all the memories in the chip running the neural network. The chip contains many memories of considerable capacity, but these storage resources are not all used for neural network computation every time; the storage resources allocated to running the neural network vary, so the neural network configuration needs to be optimized dynamically according to the capacity. For example, the capacity includes the capacity of the memory 223, the memory 324, the memory 329, and the memory 3210 included in the neural network circuit 220. It should be understood that the capacity of the internal memory may refer to the available capacity of the internal memory.
The second feature includes the connection relationships among the multiple layers and the computation-related parameters of each of the multiple layers. The computing resources in the chip change and are not all used for neural network computation every time; therefore, the connection relationships among the layers and the computation-related parameters of each layer also change with demand, and the neural network configuration needs to be optimized dynamically according to these changes. It should be understood that the connection relationships among the multiple layers include the connection relationship between each layer of the neural network and at least one other layer. Depending on the function performed by the neural network, the connection relationships among its layers differ; this application does not limit the connection relationships of the layers in the neural network. The computation-related parameters of each layer include the dimensions of the input data and the output data, offset parameters, convolution kernels, quantization parameters, normalization parameters, and the like.
The first feature and the second feature may be stored in the memory 212 of the host 210. The processor 211 may obtain the features of the internal memory and the features of the multiple layers of the neural network from the memory 212 of the host 210.
S1202: The processor 211 determines, according to the data amount, the first feature, and the second feature, the batch sizes of the multiple layers, N subgraphs, M graphs, and the storage locations of the inter-layer data, where the batch sizes of at least two of the multiple layers are different.
Specifically, the processor 211 may use an iterative algorithm to determine the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data according to the data amount, the first feature, and the second feature. The optimization algorithm may be a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm. It should be understood that the processor does not obtain the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data from a single computation based on the data amount, the first feature, and the second feature; rather, it runs an iterative algorithm over multiple trials and selects, from the results of those trials, the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data that ensure the utilization of the internal memory and the computational efficiency of the chip running the neural network. N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N ≥ M. For example, N=2 and M=1 means that the layers of the neural network are divided into 2 subgraphs and the 2 subgraphs are divided into one graph. As another example, N=2 and M=2 means that the layers of the neural network are divided into 2 subgraphs and the 2 subgraphs are divided into 2 graphs. As another example, N=3 and M=2 means that the layers of the neural network are divided into 3 subgraphs and the 3 subgraphs are divided into 2 graphs.
For example, the processor 211 first determines the batch size of each layer in the neural network based on the capacity of the internal memory, and then merges layers with the same batch size into subgraphs. It then merges multiple subgraphs into graphs based on the cache demands of the subgraphs and the capacity of the internal memory; a graph obtained in this way may contain subgraphs with different batch sizes. That is, when the neural network is subsequently scheduled in units of graphs, the input data is processed with different batch sizes, so that the cache demand of each graph does not exceed the capacity of the internal memory while the utilization of the on-chip memory and the running performance of the hardware are improved.
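The paragraph above describes one possible flow: per-layer batch sizes first, then fusing equal-batch-size layers into subgraphs, then merging subgraphs into graphs as long as the combined cache demand fits. Below is a minimal greedy sketch of that flow, offered only as an illustration; cache_demand is a hypothetical cost function that would have to include the extra buffering caused by gather/scatter, and the actual embodiments may instead use dynamic programming or a genetic algorithm as stated above.

def build_subgraphs(batch_sizes):
    """batch_sizes: per-layer batch sizes in network order; consecutive equal sizes fuse."""
    subgraphs = []
    for layer, bs in enumerate(batch_sizes):
        if subgraphs and subgraphs[-1]["bs"] == bs:
            subgraphs[-1]["layers"].append(layer)
        else:
            subgraphs.append({"layers": [layer], "bs": bs})
    return subgraphs

def build_graphs(subgraphs, capacity, cache_demand):
    """cache_demand(subgraph_list) -> internal-memory demand of scheduling them as one graph.
    Assumes every single subgraph fits on its own, since batch sizes were chosen from the capacity."""
    graphs, current = [], []
    for sg in subgraphs:
        if current and cache_demand(current + [sg]) > capacity:
            graphs.append(current)   # close the current graph and start a new one
            current = []
        current.append(sg)
    if current:
        graphs.append(current)
    return graphs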
It should be understood that the layers in the N subgraphs, connected together, form the complete neural network. Each of the N subgraphs contains one or more layers with the same batch size, and these layers are consecutive layers of the neural network. Different subgraphs may contain the same or different numbers of layers.
The subgraphs in the M graphs, connected together, form the complete neural network. Each of the M graphs includes one or more subgraphs, and different graphs may include the same or different numbers of subgraphs.
In the process of scheduling the neural network for data processing, corresponding neural network operation overheads arise, such as computation time overhead and data transfer time overhead. A preset operation overhead metric of the neural network may be used to measure its performance: if the operation overhead of the neural network is low, its performance is good. As shown in FIG. 13, an example is given of the process by which a layer of the neural network processes data, including a data move-in stage (reading the input data), a computation stage, and a data move-out stage (storing the output data). When the neural network processes a batch of data, part of the data must first be moved in; the overhead of this stage is the head overhead. After that, data move-in, computation, and data move-out proceed in parallel. Finally, the neural network moves out the last computed data and stores it in the storage space; the overhead of this stage is the tail overhead.
In the embodiments of this application, a layer processes data in units of its batch size. When a layer processes one batch of input data, computation time = computation amount of the layer / computing power of the chip running the neural network; data transfer time = (input data amount + output data amount) / (internal memory bandwidth or external memory bandwidth); total time overhead = head overhead + max(computation time, data transfer time) + tail overhead. It can be seen that if the batch size is too small, the time corresponding to the head overhead and the tail overhead may be greater than or equal to the computation time, making the neural network operate inefficiently. The time overhead of a layer in the neural network can be obtained from the storage location of at least one of its input data and output data and from the computing power of the chip running the neural network. The storage location of the data is either the internal memory or the external memory.
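The overhead model above translates directly into a small cost function; the sketch below uses it to compare keeping a layer's data on chip against sending it to the external memory. The bandwidth and overhead figures are placeholders for illustration only, not values from the embodiments.

def layer_time(flops, in_bytes, out_bytes, compute_power, bandwidth, head, tail):
    # total = head + max(computation time, data transfer time) + tail
    compute_time = flops / compute_power
    transfer_time = (in_bytes + out_bytes) / bandwidth   # internal or external bandwidth,
                                                         # depending on where the data is stored
    return head + max(compute_time, transfer_time) + tail

# Placeholder numbers: the same layer costs more when its inter-layer data goes
# off chip, because the external-memory bandwidth is lower.
on_chip = layer_time(1e8, 2e6, 2e6, compute_power=1e12, bandwidth=1e11, head=1e-5, tail=1e-5)
off_chip = layer_time(1e8, 2e6, 2e6, compute_power=1e12, bandwidth=1e10, head=1e-5, tail=1e-5)
print(on_chip, off_chip)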
Since inter-layer data is allowed to be stored in the external memory, the external memory and the internal memory are planned jointly to store the inter-layer data, which reduces the storage space occupied in the internal memory. In addition, because inter-layer data may be stored in the external memory, a larger batch size can be set for the layers of the neural network, which reduces the head overhead incurred by the layers for each batch and improves the computational efficiency of the processor.
In the process of processing the input data of the neural network according to the above division of its layers, the scheduling order of the layers in a graph is determined according to the scheduling order of the subgraphs contained in the graph and the scheduling order of the layers in each subgraph. For example, the scheduling order of the layers in a subgraph is the same as the scheduling order of the layers in the neural network. The batches corresponding to the batch size of the layers in a subgraph are processed in the order of the layers in the subgraph. The scheduling order of the subgraphs contained in a graph is determined according to the batch sizes and the scheduling order of the first and last layers of the subgraphs. The inter-layer data between the subgraphs of a graph is gathered or scattered. For explanations of subgraphs and graphs, refer to the descriptions above.
For example, as shown in FIG. 13, suppose the neural network includes 6 layers in the order layer 0 to layer 5 (L0-L5). The batch sizes corresponding to L0, L1, L4, and L5 are 1, and the batch sizes corresponding to L2 and L3 are 2. Layers with the same batch size form a subgraph: L0 and L1 form subgraph 0, L2 and L3 form subgraph 1, and L4 and L5 form subgraph 2. The subgraphs form a graph, that is, subgraph 0, subgraph 1, and subgraph 2 form a graph. Since the batch size of L0 and L1 is 1, subgraph 0 can process input data of size 1 each time, that is, batch 0 and batch 1 are processed separately. After batch 0 is input to L0 and processed by L0 and L1, the output data of L1 is C0. The batch size of L2 is 2, and C0 corresponds only to batch 0, which does not meet the processing requirement of L2, so C0 must be temporarily stored in the internal memory. Batch 1 is then input to L0 and processed by L0 and L1, and the output data of L1 is C1. At this point, L1 has output two batches of data, which meets the processing requirement of L2. The internal memory contains the two sets of data C0 and C1; after C0 and C1 are aggregated, L2 can process the aggregated C0 and C1. Therefore, if subgraph 0 and subgraph 1 are divided into one graph, then while L0 and L1 are scheduled to process batch 1, C0 occupies cache space in the internal memory, and the data amount of C0 is the additional internal memory cache demand of L0 and L1. In this process, the input-data cache demand of L0 is the data amount corresponding to (C0+A1) and its output-data cache demand is the data amount corresponding to (C0+B1); the input-data cache demand of L1 is the data amount corresponding to (C0+B1) and its output-data cache demand is the data amount corresponding to (C0+C1).
If subgraph 1 and subgraph 2 are divided into one graph for scheduling, the scatter problem occurs. As shown in FIG. 13, the input data of L3 is D0 corresponding to batch 0 and D1 corresponding to batch 1, and its output data is E0 corresponding to batch 0 and E1 corresponding to batch 1. The batch size of L4 is 1, so E0 and E1 cannot be processed at the same time. L4 therefore processes E0 first, and E1 is temporarily stored in the internal memory. Then, while L4 and L5 are scheduled to process the data corresponding to batch 0, E1 occupies cache space in the internal memory, and the data amount of E1 is the additional internal memory cache demand of L4 and L5. In this process, the input-data cache demand of L4 is the data amount corresponding to (E1+E0) and its output-data cache demand is the data amount corresponding to (E1+F0); the input-data cache demand of L5 is the data amount corresponding to (E1+F0) and its output-data cache demand is the data amount corresponding to (E1+G0).
It should be noted that, because the inter-layer data of the multiple layers within a subgraph and the inter-layer data between subgraphs are stored in the internal memory and occupy its storage space, the batch sizes of the multiple layers and the storage locations of the inter-layer data are also affected by the division into subgraphs and graphs.
For example, as shown in FIG. 9, because the inter-layer data C0 is kept in the cache while layer 0 and layer 1 are computing batch A1, the available cache of layer 0 and layer 1 becomes smaller, which affects how the input data is split.
As another example, as shown in FIG. 10, because the inter-layer data E1 is kept in the cache and occupies cache space while layer 4 and layer 5 are processing the inter-layer data E0, the available cache of layer 4 and layer 5 becomes smaller, which affects how the input data is split.
Therefore, in the process of dividing the layers into different batch sizes, the additional internal memory cache demand caused by the gather or scatter problem needs to be considered, to determine whether the cache demand of the resulting subgraphs exceeds the capacity of the internal memory.
The neural network data processing method provided by the embodiments of this application splits the input data with comprehensive reference to the data amount of the input data, the first feature, and the second feature, and sets different batch sizes for the layers of the neural network. By setting a reasonable batch size for each layer, the internal memory is fully utilized to store the inter-layer data of the neural network during inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
In one possible implementation, another computer may perform S1201 and S1202 offline to generate the splitting policy and the execution order for scheduling the layers of the neural network. The splitting policy and the execution order are then configured into the controller in the neural network system, and the controller in the neural network system controls the execution of the splitting policy and the execution order of the layers of the neural network.
In another possible implementation, the controller in the neural network system may perform S1201 and S1202 to generate the splitting policy and the execution order for scheduling the layers of the neural network, and the controller uniformly manages scheduling the layers of the neural network and splitting the multiple batches.
The neural network scheduling method provided by the embodiments of this application is described below with reference to specific examples.
示例一、输入数据为整图数据。Example 1: The input data is the whole image data.
As shown in FIG. 14, based on the internal memory capacity and the overall performance of the neural network, the batch size corresponding to L0 and L1 is determined to be 1 picture, the batch size corresponding to L2, L3, and L4 is 2 pictures, and the batch size corresponding to L5 and L6 is 4 pictures. Using the method provided in the embodiments of this application, L0 and L1 are divided into subgraph 0, L2 to L4 into subgraph 1, and L5 and L6 into subgraph 2. To handle the aggregation problem, based on the internal memory capacity and the overall performance of the neural network, the three subgraphs are grouped into one graph, that is, L0 to L6 form one graph whose cache requirement is less than or equal to the capacity of the internal memory. The graph contains layers with different batch sizes; when the subgraphs of the neural network are scheduled to process the input data, this improves the utilization of the internal memory and the running performance of the chip running the neural network.
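As a minimal sketch of this division step (assuming the per-layer batch sizes have already been determined; the function and variable names are illustrative, not those of the embodiment), consecutive layers sharing a batch size can be collected into subgraphs as follows:

```python
def group_into_subgraphs(batch_sizes):
    """Collect consecutive layers that share a batch size into one subgraph.

    batch_sizes: the batch size chosen for each layer, indexed by layer number.
    Returns a list of (batch_size, [layer indices]) entries, one per subgraph.
    """
    subgraphs = []
    for layer, bs in enumerate(batch_sizes):
        if subgraphs and subgraphs[-1][0] == bs:
            subgraphs[-1][1].append(layer)    # same batch size: extend current subgraph
        else:
            subgraphs.append((bs, [layer]))   # batch size changes: start a new subgraph
    return subgraphs


# Batch sizes of L0-L6 in FIG. 14: 1, 1, 2, 2, 2, 4, 4 pictures.
print(group_into_subgraphs([1, 1, 2, 2, 2, 4, 4]))
# [(1, [0, 1]), (2, [2, 3, 4]), (4, [5, 6])] -> subgraph 0, subgraph 1, subgraph 2
```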
As shown in FIG. 14, assume that the data set contains 8 pictures. L0 is the first layer of the graph and its batch size is 1 picture, so the data set is split into 8 batches of input data (batch 0 to batch 7 in FIG. 14); each batch is the whole-image data of 1 picture, and the batches are fed into L0 one at a time. As shown in FIG. 14, while the current data set is processed, subgraph 0 is scheduled twice for each scheduling of subgraph 1, that is, the scheduling order is L0→L1→L0→L1→L2→L3→L4; and subgraph 1 is scheduled twice for each scheduling of subgraph 2, that is, the scheduling order is L2→L3→L4→L2→L3→L4→L5→L6. Processing the input data of the current data set therefore requires scheduling subgraph 0 eight times, subgraph 1 four times, and subgraph 2 twice.
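The interleaved order described above can be reproduced by a small recursive sketch (illustrative names and a simplified model in which each subgraph must run a fixed number of times before the next subgraph runs once; this is not the embodiment's scheduler):

```python
def schedule(ratios, top_runs):
    """Return the order in which subgraphs are invoked.

    ratios: ratios[k] is how many runs of subgraph k are needed to feed one run
        of subgraph k+1 (for example batch_size[k+1] / batch_size[k]).
    top_runs: how many times the last subgraph runs for the whole data set.
    """
    def run(level):
        # To run subgraph `level` once, first run subgraph `level - 1` enough
        # times to accumulate its input batch.
        order = []
        if level > 0:
            for _ in range(ratios[level - 1]):
                order += run(level - 1)
        order.append(level)
        return order

    last = len(ratios)                      # index of the last subgraph
    full_order = []
    for _ in range(top_runs):
        full_order += run(last)
    return full_order


# FIG. 14: batch sizes 1, 2 and 4 pictures for subgraphs 0, 1 and 2, and an
# 8-picture data set -> ratios [2, 2]; the last subgraph runs 8 / 4 = 2 times.
order = schedule([2, 2], top_runs=2)
print(order)                                       # [0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 2]
print({sg: order.count(sg) for sg in set(order)})  # {0: 8, 1: 4, 2: 2}
```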
Example 2: the input data is non-whole-image data.
As shown in FIG. 15, based on the internal memory capacity and the overall performance of the neural network, the batch size corresponding to L0 and L1 is determined to be 1/4 of a picture, and the batch size corresponding to L2, L3, and L4 is 1/2 of a picture. Using the method provided in the embodiments of this application, L0 and L1 are divided into subgraph 0, and L2 to L4 are divided into subgraph 1. Regarding the overlap problem, as shown in FIG. 15, the input data is non-whole-image data and must be processed with a padding algorithm; the padding data is the shaded part. Based on the internal memory capacity and the overall performance of the neural network, the two subgraphs are grouped into one graph, that is, L0 to L4 form one graph whose cache requirement is less than or equal to the capacity of the internal memory. The graph contains layers with different batch sizes; when the subgraphs of the neural network are scheduled to process the input data, this improves the utilization of the internal memory and the running performance of the chip running the neural network.
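As a hedged illustration of the overlap handling for non-whole-image tiles (the kernel, stride, and padding values below are assumptions for this sketch, not parameters of the embodiment), the input row range needed for one output tile of a convolution layer can be computed as follows; rows that fall outside the real image are the part the padding algorithm has to fill:

```python
def input_rows_for_tile(out_start, out_end, kernel, stride, pad, in_height):
    """Input row range [lo, hi) needed to produce output rows [out_start, out_end)
    of a convolution layer, clamped to the real image. The third return value is
    the number of rows outside the image that the padding algorithm must fill."""
    lo = out_start * stride - pad
    hi = (out_end - 1) * stride - pad + kernel
    fill = max(0, -lo) + max(0, hi - in_height)
    return max(lo, 0), min(hi, in_height), fill


# A 3x3 convolution with stride 1 and padding 1 on a 224-row image split into
# four quarter-picture tiles: the second tile (output rows 56-112) needs input
# rows [55, 113), i.e. one overlapping row on each side and no padding rows.
print(input_rows_for_tile(56, 112, kernel=3, stride=1, pad=1, in_height=224))  # (55, 113, 0)
```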
As shown in FIG. 15, assume that the data set contains 2 pictures. L0 is the first layer of the graph and its batch size is 1/4 of a picture, so the data set is split into 8 batches of input data (batch 0 to batch 7 in FIG. 15); each batch is the non-whole-image data corresponding to 1/4 of a picture, and the batches are fed into L0 one at a time. As shown in FIG. 15, while the current data set is processed, subgraph 0 is scheduled twice for each scheduling of subgraph 1, that is, the scheduling order is L0→L1→L0→L1→L2→L3→L4. Processing the input data of the current data set therefore requires scheduling subgraph 0 eight times and subgraph 1 four times.
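Using the same hypothetical `schedule` sketch given under Example 1, the configuration of FIG. 15 yields the invocation counts stated here:

```python
# FIG. 15: batch sizes of 1/4 and 1/2 of a picture for subgraphs 0 and 1, and a
# 2-picture data set -> ratio [2]; the last subgraph runs 2 / (1/2) = 4 times.
order = schedule([2], top_runs=4)
print(order)                                       # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print({sg: order.count(sg) for sg in set(order)})  # {0: 8, 1: 4}
```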
It can be understood that, to implement the functions of the foregoing embodiments, the neural network system includes at least one of a hardware structure or a software module corresponding to each function. A person skilled in the art should readily appreciate that, with reference to the units and method steps described in the embodiments disclosed in this application, this application can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application scenario and design constraints of the technical solution.
FIG. 16 and FIG. 17 are schematic structural diagrams of possible neural network data processing apparatuses provided in the embodiments of this application. These apparatuses can be used to implement the functions of the processor 211 in the foregoing method embodiments and therefore can also achieve the beneficial effects of those embodiments.
In the embodiments of this application, the neural network data processing apparatus in FIG. 16 may be the processor 211 shown in FIG. 2 or an apparatus formed by software running on it. As shown in FIG. 16, the neural network data processing apparatus 1600 includes an acquisition unit 1610 and a processing unit 1620, and is configured to implement the functions of the processor 211 in the method embodiment shown in FIG. 12. When the apparatus 1600 is used to implement the functions of the processor 211 in the method embodiment shown in FIG. 12, the acquisition unit 1610 is configured to perform S1201, and the processing unit 1620 is configured to perform S1202. More detailed descriptions of the acquisition unit 1610 and the processing unit 1620 can be found in the related description of the method embodiment shown in FIG. 12 and are not repeated here.
The neural network data processing apparatus may also be a module (for example, a chip) of another device connected to the neural network system 200. As shown in FIG. 17, the neural network data processing apparatus 1700 includes a processor 1710 and an interface circuit 1720 that are coupled to each other. It can be understood that the interface circuit 1720 may be a transceiver or an input/output interface. Optionally, the apparatus 1700 may further include a memory 1730, configured to store instructions executed by the processor 1710, to store input data required by the processor 1710 to run the instructions, or to store data generated after the processor 1710 runs the instructions. For example, the neural network data processing apparatus may include the host 210 shown in FIG. 2, the processor 1710 may include the processor 211, and the memory 1730 is the memory 212. The foregoing solution is used to configure batch sizes for the neural network chip so that the neural network can run efficiently. In the described embodiments, the batch sizes, the processing of graphs and subgraphs, and the related algorithms are all executed by the processor 211; in practice, the processing method may also be performed by other types of processors or devices, for example, a controller or processor located inside the neural network chip may execute the related solution to complete the configuration of the neural network. In one possibility, the neural network chip may include one or more types of processors; such a processor may run the related configuration scheme to obtain suitable batch sizes and the graph and subgraph divisions, and after the parameters of the neural network are configured, the same processor may run the neural network computation, thereby realizing self-configuration. This is not limited in this embodiment.
When the neural network data processing apparatus 1700 is used to implement the method shown in FIG. 12, the processor 1710 is configured to perform the functions of the processing unit 1620, and the interface circuit 1720 is configured to perform the functions of the acquisition unit 1610.
It can be understood that the processor in the embodiments of this application may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (Random Access Memory, RAM), a flash memory, a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a network device or a terminal device. Certainly, the processor and the storage medium may also exist as discrete components in a network device or a terminal device.
The foregoing embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are performed wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc (digital video disc, DVD); or a semiconductor medium, for example, a solid state drive (solid state drive, SSD).
In the embodiments of this application, unless otherwise specified or logically conflicting, terms and descriptions in different embodiments are consistent and may be cross-referenced, and technical features in different embodiments may be combined according to their inherent logical relationships to form new embodiments.
In this application, "at least one" means one or more, and "a plurality of" means two or more. It can be understood that the various numerical designations in the embodiments of this application are merely for ease of distinction in the description and are not intended to limit the scope of the embodiments. The sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic.

Claims (16)

  1. A neural network data processing method, comprising:
    obtaining a data amount of input data of a neural network, a first feature of an internal memory in a chip running the neural network, and a second feature of a plurality of layers in the neural network; and
    determining a batch size of each of the plurality of layers according to the data amount, the first feature, and the second feature, wherein batch sizes of at least two of the plurality of layers are different.
  2. The method according to claim 1, wherein the first feature comprises at least one of a distribution feature of the internal memory in the chip and a capacity of the internal memory, and the second feature comprises a connection relationship among the plurality of layers and a computation-related parameter of each of the plurality of layers.
  3. The method according to claim 1 or 2, wherein the determining the batch sizes of the plurality of layers according to the data amount, the first feature, and the second feature comprises:
    determining, according to the data amount, the first feature, and the second feature, the batch sizes of the plurality of layers, N subgraphs, M graphs, and storage locations of inter-layer data, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or an external memory, the external memory is a memory outside the chip running the neural network, each subgraph contains one or more layers of a same batch size, and each graph comprises one or more subgraphs.
  4. The method according to claim 3, wherein inter-layer data of the plurality of layers contained in the subgraph is stored in the internal memory.
  5. The method according to claim 3 or 4, wherein inter-layer data between the subgraphs is stored in the internal memory.
  6. The method according to any one of claims 3 to 5, wherein inter-layer data between the graphs is stored in the external memory.
  7. The method according to any one of claims 1 to 6, wherein a batch corresponding to the batch size is one picture, a plurality of pictures, or a partial image of one picture.
  8. A neural network data processing apparatus, comprising:
    an acquisition unit, configured to obtain a data amount of input data of a neural network, a first feature of an internal memory in a chip running the neural network, and a second feature of a plurality of layers in the neural network; and
    a processing unit, configured to determine a batch size of each of the plurality of layers according to the data amount, the first feature, and the second feature, wherein batch sizes of at least two of the plurality of layers are different.
  9. The apparatus according to claim 8, wherein the first feature comprises at least one of a distribution feature of the internal memory in the chip and a capacity of the internal memory, and the second feature comprises a connection relationship among the plurality of layers and a computation-related parameter of each of the plurality of layers.
  10. The apparatus according to claim 8 or 9, wherein the processing unit is specifically configured to:
    determine, according to the data amount, the first feature, and the second feature, the batch sizes of the plurality of layers, N subgraphs, M graphs, and storage locations of inter-layer data, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or an external memory, the external memory is a memory outside the chip running the neural network, each subgraph contains one or more layers of a same batch size, and each graph comprises one or more subgraphs.
  11. The apparatus according to claim 10, wherein inter-layer data of the plurality of layers contained in the subgraph is stored in the internal memory.
  12. The apparatus according to claim 10 or 11, wherein inter-layer data between the subgraphs is stored in the internal memory.
  13. The apparatus according to any one of claims 10 to 12, wherein inter-layer data between the graphs is stored in the external memory.
  14. The apparatus according to any one of claims 8 to 13, wherein a batch corresponding to the batch size is one picture, a plurality of pictures, or a partial image of one picture.
  15. A neural network data processing apparatus, comprising at least one processor and a memory, wherein the memory is configured to store a computer program such that, when the computer program is executed by the at least one processor, the neural network data processing method according to any one of claims 1 to 7 is implemented.
  16. A computer-readable storage medium, wherein the storage medium stores a computer program or instructions, and when the computer program or instructions are executed by a neural network data processing apparatus, the neural network data processing method according to any one of claims 1 to 7 is implemented.
PCT/CN2020/093624 2020-05-30 2020-05-30 Data processing method and apparatus for neural network WO2021243489A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus
CN202180037755.7A CN115668222A (en) 2020-05-30 2021-01-26 Data processing method and device of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network

Publications (1)

Publication Number Publication Date
WO2021243489A1 true WO2021243489A1 (en) 2021-12-09

Family

ID=78831421

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN115668222A (en)
WO (2) WO2021243489A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382880B (en) * 2023-06-07 2023-08-11 成都登临科技有限公司 Task execution method, device, processor, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140142929A1 (en) * 2012-11-20 2014-05-22 Microsoft Corporation Deep neural networks training for speech and pattern recognition
CN107454965A (en) * 2015-05-21 2017-12-08 谷歌公司 Batch processing in neural network processor
CN108885571A (en) * 2016-04-05 2018-11-23 谷歌有限责任公司 The input of batch machines learning model
CN109492754A (en) * 2018-11-06 2019-03-19 深圳市友杰智新科技有限公司 One kind is based on deep neural network model compression and accelerated method
CN110389910A (en) * 2018-04-17 2019-10-29 英特尔公司 For managing the method and arrangement of the memory in cascade neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018018451A (en) * 2016-07-29 2018-02-01 富士通株式会社 Machine learning method, machine learning program and information processing device
CN207440765U (en) * 2017-01-04 2018-06-01 意法半导体股份有限公司 System on chip and mobile computing device
US20180341852A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140142929A1 (en) * 2012-11-20 2014-05-22 Microsoft Corporation Deep neural networks training for speech and pattern recognition
CN107454965A (en) * 2015-05-21 2017-12-08 谷歌公司 Batch processing in neural network processor
CN108885571A (en) * 2016-04-05 2018-11-23 谷歌有限责任公司 The input of batch machines learning model
CN110389910A (en) * 2018-04-17 2019-10-29 英特尔公司 For managing the method and arrangement of the memory in cascade neural network
CN109492754A (en) * 2018-11-06 2019-03-19 深圳市友杰智新科技有限公司 One kind is based on deep neural network model compression and accelerated method

Also Published As

Publication number Publication date
CN115668222A (en) 2023-01-31
WO2021244045A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
CN109102065B (en) Convolutional neural network accelerator based on PSoC
US11775430B1 (en) Memory access for multiple circuit components
JP7366274B2 (en) Adaptive search method and device for neural networks
JP7451614B2 (en) On-chip computational network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
WO2021051987A1 (en) Method and apparatus for training neural network model
CN117501245A (en) Neural network model training method and device, and data processing method and device
WO2021244045A1 (en) Neural network data processing method and apparatus
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
Véstias Processing systems for deep learning inference on edge devices
Kim et al. Efficient multi-GPU memory management for deep learning acceleration
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
Sun et al. Multi-node acceleration for large-scale GCNs
Yan et al. Acceleration and optimization of artificial intelligence CNN image recognition based on FPGA
JP7108702B2 (en) Processing for multiple input datasets
CN112789627B (en) Neural network processor, data processing method and related equipment
US11921667B2 (en) Reconfigurable computing chip
US20200192797A1 (en) Caching data in artificial neural network computations
Zhou et al. Design and implementation of YOLOv3-Tiny accelerator based on PYNQ-Z2 heterogeneous platform
WO2021120036A1 (en) Data processing apparatus and data processing method
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
EP3895024A1 (en) Caching data in artificial neural network computations
RamaDevi et al. Machine learning techniques for the energy and performance improvement in Network-on-Chip (NoC)
WO2021237755A1 (en) Neural network scheduling method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20939062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20939062

Country of ref document: EP

Kind code of ref document: A1