WO2021243489A1 - Data processing method and apparatus for neural network - Google Patents

Data processing method and apparatus for neural network

Info

Publication number
WO2021243489A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
neural network
data
batch
inter
Prior art date
Application number
PCT/CN2020/093624
Other languages
French (fr)
Chinese (zh)
Inventor
袁宏辉
高山青
高立稳
熊乐进
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2020/093624 priority Critical patent/WO2021243489A1/en
Priority to PCT/CN2021/073691 priority patent/WO2021244045A1/en
Priority to CN202180037755.7A priority patent/CN115668222A/en
Publication of WO2021243489A1 publication Critical patent/WO2021243489A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • This application relates to the field of artificial intelligence (AI), and in particular to a neural network data processing method and device.
  • the performance of the processor continues to improve.
  • the computer system is equipped with a multi-level cache structure with higher bandwidth and smaller capacity.
  • After each layer of the neural network has processed its input data, the result enters the next layer. If the amount of input data is large, the inter-layer data of multiple layers of the neural network may also be too large for the cache to store, so the inter-layer data has to be stored in an external memory. Because the cache cannot be used effectively, the computational efficiency of the processor is reduced.
  • the traditional technology groups the input data according to the inter-layer data caching requirements of each layer to obtain multiple sets of batches of the same batch size.
  • the batch size is limited by the largest cache demand.
  • the neural network processes one set of batches before processing the next set of batches. By reducing the data processed by each layer in the neural network, the inter-layer data is reduced, and the inter-layer data is stored in the cache as much as possible.
  • the size of the inter-layer data differs from layer to layer. For example, if a layer in the neural network enlarges a picture, the generated inter-layer data is relatively large; if a layer shrinks a picture, the generated inter-layer data is relatively small. Under a single batch size, for layers that output smaller inter-layer data, the inter-layer data is small and much of the cache capacity remains unused; for layers that output larger inter-layer data, the inter-layer data is large, little cache capacity remains, and the cache may not be able to store the inter-layer data.
  • the utilization rate of the cache is still low, which affects the computational efficiency of the hardware running the neural network.
  • moreover, reducing the batch size increases the head overhead of processing each batch in each layer of the neural network, which in turn reduces the computational efficiency of the hardware running the neural network. Therefore, how to improve the utilization rate of the cache while ensuring the computational efficiency of the hardware running the neural network is an urgent problem to be solved.
  • the present application provides a neural network data processing method and device, which can improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network.
  • this application adopts the following technical solutions.
  • this application provides a neural network data processing method.
  • the method includes: the processor groups the input data according to the data amount of the input data, a first feature of the internal memory in the chip running the neural network, and a second feature of multiple layers of the neural network, and determines the batch size of each layer in the neural network, such that among the batch sizes of the multiple layers, the batch sizes of at least two layers are different.
  • the batch size of each layer in a neural network is different.
  • a neural network includes layers of the same batch size and layers of different batch sizes.
  • the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory.
  • the second feature includes the connection relationship between the plurality of layers and the calculation-related parameters of each of the plurality of layers.
  • the batch corresponding to the batch size is one picture, multiple pictures, or part of the image in one picture.
  • the so-called internal memory refers to the memory in the chip running the neural network.
  • the memory on the chip that runs the neural network is a cache.
  • the so-called external memory refers to the memory outside the chip that runs the neural network.
  • Internal memory can also be called on-chip memory.
  • External memory can also be called off-chip memory.
  • the neural network data processing method comprehensively refers to the data amount of the input data, the first feature, and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. By setting a reasonable batch size for each layer, the internal memory is fully used to store the inter-layer data of the neural network during inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
  • determining the batch sizes of the multiple layers according to the data amount, the first feature, and the second feature includes: determining, according to the data amount, the first feature, and the second feature, the batch size of each layer, N subgraphs, M graphs, and the storage locations of the inter-layer data.
  • N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N is greater than or equal to M.
  • the storage location of the inter-layer data includes at least one of an internal memory or an external memory.
  • the inter-layer data of the multiple layers included in a subgraph is stored in the internal memory.
  • the inter-layer data between subgraphs is stored in the internal memory.
  • the inter-layer data between graphs is stored in the external memory.
  • a subgraph contains one or more layers of the same batch size.
  • the number of layers included in different subgraphs may be the same or different.
  • a subgraph may also be referred to as a first-type layer group.
  • a graph includes one or more subgraphs.
  • the number of subgraphs contained in different graphs may be the same or different.
  • a graph may also be referred to as a second-type layer group.
  • the processor may use an iterative algorithm to determine, based on the data amount, the first feature, and the second feature, the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data. It is understandable that the processor does not obtain the batch sizes, the N subgraphs, the M graphs, and the storage locations of the inter-layer data in a single calculation based on the data amount, the first feature, and the second feature. Instead, it uses an iterative algorithm to run multiple iterative trials and, from the multiple trial results, selects the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data that ensure the utilization of the internal memory and the computational efficiency of the chip running the neural network.
  • the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm or a genetic algorithm.
  • the basic idea of the dynamic programming algorithm is to decompose the problem to be solved into several sub-problems, solve the sub-problems first, and then obtain the solution of the original problem from the solutions of these sub-problems.
  • the basic idea of the greedy algorithm is to proceed step by step from an initial solution of the problem. According to an optimization measure, each step must ensure that a local optimal solution can be obtained. Only one piece of data is considered in each step, and its selection should satisfy the condition of local optimization. If the next piece of data together with the partial optimal solution is no longer a feasible solution, that data is not added to the partial solution; the algorithm stops when all the data has been enumerated or no more data can be added.
  • Genetic algorithm is a type of algorithm designed based on the evolutionary laws of the biological world, and is used to simulate natural evolution to search for the optimal solution.
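  • As an illustration only, the following sketch shows one possible greedy-style grouping pass under simplified assumptions: layers are described by a hypothetical (batch_size, cache_demand) pair, consecutive layers with the same batch size are merged into subgraphs, and consecutive subgraphs are merged into a graph while the summed cache demand fits in the internal memory capacity. The data structures and numbers are invented for illustration and are not the algorithm claimed by the application.

```python
# Minimal sketch of a greedy grouping pass (hypothetical data structures).
# Each layer is described by a name, a batch size, and a cache demand.

def split_into_subgraphs(layers):
    """Merge consecutive layers that share the same batch size into subgraphs."""
    subgraphs = []
    for layer in layers:
        if subgraphs and subgraphs[-1][0]["batch_size"] == layer["batch_size"]:
            subgraphs[-1].append(layer)
        else:
            subgraphs.append([layer])
    return subgraphs

def merge_into_graphs(subgraphs, internal_memory_capacity):
    """Greedily merge consecutive subgraphs while the summed cache demand fits."""
    graphs = []
    for sub in subgraphs:
        demand = sum(l["cache_demand"] for l in sub)
        if graphs and graphs[-1]["demand"] + demand <= internal_memory_capacity:
            graphs[-1]["subgraphs"].append(sub)
            graphs[-1]["demand"] += demand
        else:
            graphs.append({"subgraphs": [sub], "demand": demand})
    return graphs

if __name__ == "__main__":
    layers = [
        {"name": "layer0", "batch_size": 1, "cache_demand": 30},
        {"name": "layer1", "batch_size": 1, "cache_demand": 30},
        {"name": "layer2", "batch_size": 2, "cache_demand": 20},
        {"name": "layer3", "batch_size": 2, "cache_demand": 20},
    ]
    subgraphs = split_into_subgraphs(layers)
    graphs = merge_into_graphs(subgraphs, internal_memory_capacity=100)
    print(len(subgraphs), "subgraphs,", len(graphs), "graph(s)")
```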
  • the scheduling order of the layers in a graph is determined according to the scheduling order of the subgraphs contained in the graph and the scheduling order of the layers within each subgraph.
  • the scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network. For example, batches corresponding to the batch size of the layers included in the sub-picture are processed in the order of the layers included in the sub-picture.
  • the scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph.
  • the inter-layer data of the sub-graphs contained in the graph are aggregated or scattered.
  • the embodiment of the present application also provides a neural network data processing device, and the beneficial effects can be referred to the description of the first aspect and will not be repeated here.
  • the data processing device of the neural network has the function of realizing the behavior of the processor in the method example of the first aspect described above.
  • the functions can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions.
  • the data processing device of the neural network includes: an acquisition unit and a processing unit.
  • the acquiring unit is used to acquire the data amount of the input data of the neural network, the first feature of the internal memory in the chip running the neural network, and the second feature of multiple layers in the neural network.
  • the processing unit is configured to determine the batch size of each layer in the multiple layers according to the data amount, the first characteristic, and the second characteristic, and the batch sizes of at least two layers in the multiple layers are different.
  • the neural network data processing device may be a processor, for example, a graphics processing unit (GPU), a neural-network processing unit (NPU), or an advanced RISC machine (ARM) processor.
  • the neural network data processing device also includes a memory.
  • the memory is used to store computer programs or instructions
  • the processor is coupled with the memory.
  • a computer program product includes: computer program code, which when the computer program code runs, causes the method executed by the processor in the first aspect to be executed.
  • the present application provides a chip system, the chip system includes a processor, and is configured to implement the function of the processor in the method of the first aspect.
  • the chip system further includes a memory for storing at least one of program instructions or data.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed, the method executed by the processor in the first aspect described above is implemented.
  • the names of the processor and the data processing device of the neural network do not constitute a limitation on the device itself. In actual implementation, these devices may appear under other names. As long as the function of each device is similar to that of this application, it falls within the scope of the claims of this application and its equivalent technologies.
  • FIG. 1 is a schematic diagram of the principle of a neural network provided by an embodiment of this application.
  • FIG. 2 is a schematic structural diagram of a neural network system provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the structure of a neural network chip provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a processing device provided by an embodiment of this application.
  • FIG. 5 is a schematic diagram of the structure of layers in a neural network provided by an embodiment of this application.
  • FIG. 6 is a schematic diagram of the overlap problem provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a sub-picture provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of a diagram provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between sub-pictures according to an embodiment of the application.
  • FIG. 10 is a schematic diagram of dispersing processing of inter-layer data between sub-pictures provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application.
  • FIG. 12 is a flowchart of a neural network data processing method provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 15 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application.
  • FIG. 16 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
  • FIG. 17 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
  • a neural network may also be called an artificial neural network (ANN) or a neural-like network.
  • a neural network is a mathematical model or calculation model that imitates the structure and function of a biological neural network (an animal's central nervous system, especially the brain), and is used to estimate or approximate functions.
  • Neural networks can include convolutional neural network (convolutional neural network, CNN), deep neural network (deep neural network, DNN), multilayer perceptron (multilayer perceptron, MLP) and recurrent neural network (recurrent neural network, RNN), etc.
  • a neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs. The output of this arithmetic unit satisfies the following formula (1):
  • h_{W,b}(x) = f(W^T·x) = f(∑_{s=1}^{n} W_s·x_s + b)    (1)
  • where s = 1, 2, ..., n, and n is a natural number greater than 1;
  • W_s is the weight of x_s;
  • b is the bias of the neural unit;
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function.
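  • A minimal numerical sketch of the neural unit described by formula (1), assuming a sigmoid activation; the weights, inputs, and bias below are arbitrary illustrative values.

```python
import math

def neural_unit(x, w, b):
    """Compute f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = sum(ws * xs for ws, xs in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

print(neural_unit(x=[0.5, -1.2, 3.0], w=[0.8, 0.1, -0.4], b=0.2))
```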
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • the neural network 100 has N processing layers, where N ≥ 3 and N is a natural number.
  • the first layer of the neural network is the input layer 110, which is responsible for receiving input signals
  • the last layer of the neural network is the output layer 130, which outputs the processing results of the neural network.
  • the other layers excluding the first and last layers are intermediate layers 140. These intermediate layers 140 collectively form the hidden layer 120.
  • Each intermediate layer 140 in the hidden layer 120 can receive input signals and output signals.
  • the hidden layer 120 is responsible for the processing of the input signal.
  • Each layer represents a logic level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
  • the input signal of the neural network may be a signal in various forms such as a video signal, a voice signal, a text signal, an image signal, and a temperature signal.
  • the processed image signal may be various sensor signals such as a landscape signal taken by a camera (image sensor), an image signal of a community environment captured by a display monitoring device, and a facial signal of a human face obtained by an access control system.
  • the input signal of the neural network also includes various other engineering signals that can be processed by computers, which will not be listed here. If the neural network is used for deep learning of the image signal, the image quality can be improved.
  • Deep neural network is also called multi-layer neural network, which can be understood as a neural network with multiple hidden layers.
  • the deep neural network is divided according to the position of different layers.
  • the neural network inside the deep neural network can be divided into three categories: input layer, hidden layer and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the number of layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the i+1-th layer.
  • the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L.
  • Convolutional neural network is a deep neural network with convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels.
  • Sharing weight can be understood as the way of extracting image information has nothing to do with location.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
  • the neural network system 200 includes a host 210 and a neural network circuit 220.
  • the neural network circuit 220 is connected to the host 210 through a host interface.
  • the host interface may include a standard host interface and a network interface (network interface).
  • the host interface may include a peripheral component interconnect express (PCIe) interface.
  • the neural network circuit 220 may be connected to the host 210 through the PCIe bus 230. Therefore, data can be input to the neural network circuit 220 via the PCIe bus 230, and data processed by the neural network circuit 220 can be received via the PCIe bus 230.
  • the host 210 can also monitor the working status of the neural network circuit 220 through the host interface.
  • the host 210 includes a processor 211 and a memory 212. It should be noted that, in addition to the devices shown in FIG. 2, the host 210 may also include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited here. The host 210 can be considered as an integrated circuit or an independent device.
  • the processor 211 is the computing core and control unit of the host 210.
  • the processor 211 may include multiple processor cores (cores).
  • the processor 211 may be a very large-scale integrated circuit.
  • An operating system and other software programs are installed in the processor 211, so that the processor 211 can implement access to the memory 212, cache, disk, and peripheral devices (such as the neural network circuit in FIG. 2).
  • the processor core in the processor 211 may be a central processing unit (CPU), or may also be other application specific integrated circuits (ASICs).
  • the memory 212 is the main memory of the host 210.
  • the memory 212 is connected to the processor 211 via a double data rate (DDR) bus.
  • the memory 212 is generally used to store various running software in the operating system, input and output data, and information exchanged with an external memory. In order to increase the access speed of the processor 211, the memory 212 needs to have the advantage of fast access speed. In a traditional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 212.
  • the processor 211 can access the memory 212 at a high speed through a memory controller (not shown in FIG. 2), and perform a read operation and a write operation on any storage unit in the memory 212.
  • the neural network circuit 220 may be a chip that runs a neural network.
  • the neural network circuit 220 is a chip array composed of a plurality of neural network chips.
  • the neural network circuit 220 includes a plurality of neural network chips 221 for data processing and a plurality of routers 222.
  • the neural network chip 221 is referred to as the chip 221 for short in the embodiment of the present application.
  • the multiple chips 221 are connected to each other through a router 222.
  • one chip 221 may be connected to one or more routers 222.
  • Multiple routers 222 can form one or more network topologies.
  • the chips 221 can transmit data through the multiple network topologies described above.
  • the neural network circuit 220 may also include other devices such as a memory 223, an input port 224, and an output port 225.
  • the memory is used to store data, computer programs and instructions.
  • FIG. 3 is a schematic structural diagram of a neural network chip provided by an embodiment of the application.
  • the chip 221 includes a plurality of routers 310, and each router 310 can be connected to a tile 320. In practical applications, one router 310 can also connect multiple tiles 320.
  • each tile 320 may include an input/output interface (TxRx) 321, a switching device 322, multiple processing elements (PE) 323, and a memory 324.
  • the input/output interface 321 is used to receive data input from the router 310 to the tile 320, or to output the calculation result of the tile 320. To put it another way, the input/output interface 321 is used to implement data transmission between the tile 320 and the router 310.
  • the switching device 322 connects the input/output interface 321 and a plurality of processing devices 323.
  • the switching device 322 is used to implement data transmission between the input/output interface 321 and the multiple processing devices 323.
  • the memory 324 is used to store data, computer programs and instructions.
  • Each tile 320 may also include a controller 325, which is used to control the input/output interface 321 and multiple processing devices 323 to make the system work normally.
  • Each processing device 323 may include one or more computing engines 326.
  • One or more calculation engines 326 are used to implement neural network calculations on the data input to the calculation engine 326. For example, the data input to the tile 320 and the preset convolution kernel in the tile 320 may be multiplied and added.
  • the calculation result of the calculation engine 326 can be sent to other tiles 320 through the switching device 322 and the input/output interface 321.
  • a calculation engine 326 may include modules that implement convolution, pooling, or other neural network operations.
  • the specific circuit or function of the calculation engine 326 is not limited.
  • the calculation engine is referred to as engine for short.
  • FIG. 4 is a schematic structural diagram of a processing device provided by an embodiment of this application.
  • the processing device 323 may also include a controller 327 and a bus 328.
  • the controller 327 is used for receiving data, and scheduling one or more engines 326 in the processing device 323 to process the data, so that the system works normally.
  • the multiple engines 326 perform data transmission through the bus 328.
  • the engine 326 is connected to one or more exclusive memories 3210.
  • multiple engines 326 may also share one or more memories 329.
  • the memory in the neural network circuit 220 may be a cache memory, that is, a cache.
  • the memory 223, the memory 324, the memory 329, and the memory 3210 may all be cache memories.
  • the cache memory in the neural network circuit 220 is composed of a static random access memory (SRAM), which has a relatively small capacity but a speed much higher than that of the main memory, which is close to the speed of the CPU.
  • the cache memory may be an L1 level cache memory, an L2 level cache memory, or an L3 level cache memory.
  • the memory 3210 is an L1 level cache memory.
  • the memory 329 is an L2 level cache memory or an L3 level cache memory.
  • the memory 223 is an L2 level cache memory or an L3 level cache memory.
  • the memory 324 is an L2 level cache memory or an L3 level cache memory.
  • the neural network circuit 220 provided by the embodiment of the present application includes a plurality of neural network chips 221, each neural network chip 221 includes a plurality of tiles 320, each tile 320 includes a plurality of processing devices 323, and each processing device 323 includes a plurality of engines 326.
  • the neural network system may include multi-level computing nodes, for example, four levels of computing nodes: the first-level computing node is the chip 221, the second-level computing node is the tile 320 in the chip 221, the third-level computing node is the processing device 323 in the tile 320, and the fourth-level computing node is the engine 326 in the processing device 323.
  • the neural network system provided by the embodiments of the present application can be applied to a mobile terminal, a monitoring terminal, or a server, etc., to implement related neural network operations.
  • the neural network includes multiple neural network layers.
  • the neural network layer is a logical layer concept, and a neural network layer refers to a neural network operation to be performed once.
  • the neural network may include n neural network layers (also called n-layer neural network), where n is an integer greater than or equal to 2.
  • the first neural network layer and the second neural network layer may be two of the n layers that have a dependency relationship in operation.
  • two neural network layers with a dependency relationship means that the input data of one neural network layer includes the output data of the other neural network layer.
  • Two neural network layers with dependencies can also be referred to as adjacent layers.
  • the input of a neural network layer may come from more than one neural network layer, and may come from the previous m neural network layers; similarly, the output of a neural network layer may not only be output to the next neural network layer, but may also be output to the next m neural network layers.
  • FIG. 5 shows part of the neural network layers in the neural network.
  • the neural network layers may include convolutional layers, pooling layers, and so on.
  • the neural network 500 may include a first layer 502, a second layer 504, a third layer 506, a fourth layer 508, and a fifth layer 510 to an nth layer 512.
  • the first layer 502 can perform a convolution operation
  • the second layer 504 can perform a pooling operation on the output data of the first layer 502
  • the third layer 506 can perform a convolution operation on the output data of the second layer 504.
  • the fourth layer 508 can perform a convolution operation on the output result of the third layer 506, and the fifth layer 510 can perform a summation operation on the output data of the second layer 504 and the output data of the fourth layer 508, and so on.
  • Figure 5 is only a simple example and description of the neural network layers in the neural network, and does not limit the specific operations of each layer of the neural network.
  • the fourth layer 508 can also be a pooling operation.
  • the fifth layer 510 may also perform other neural network operations such as a convolution operation or a pooling operation.
  • the output data of the first layer 502 is the input data of the second layer 504. Therefore, the first layer 502 and the second layer 504 have a dependency relationship.
  • the output data of the second layer 504 is the input data of the third layer 506, and the second layer 504 and the third layer 506 have a dependency relationship.
  • the output data of the third layer 506 is the input data of the fourth layer 508, and the third layer 506 and the fourth layer 508 have a dependency relationship.
  • the input data of the fifth layer 510 includes the output data of the second layer 504 and the output data of the fourth layer 508. Therefore, the second layer 504 and the fifth layer 510 also have a dependency relationship, and the fourth layer 508 and the fifth layer 510 also Have dependencies.
  • Each layer of calculation in the neural network is realized by a computing node.
  • the computing nodes in the neural network system can be divided with the granularity of chips, tiles, processing devices, or engines according to actual application conditions, so that computing nodes in different sets are used to process operations of different neural network layers.
  • the computing node referred to in the embodiment of the present application may be a chip 221, a tile 320, a processing device 323, or an engine 326.
  • the computing node reloads the calculation result of the i-th layer and the weight of the i+1th layer from the preset cache for calculation.
  • the i-th layer is any layer in the neural network.
  • the output data (inter-layer data) of the second layer 504 is temporarily stored in the preset memory 329; when the fifth layer 510 is executed, the computing node reloads the calculation result of the second layer 504 and the weight of the fifth layer 510 from the preset memory 329 for calculation.
  • for different computing nodes, the preset cache is also different.
  • the preset cache may be the memory 329 or the memory 3210.
  • the preset cache may be the memory 324.
  • the preset cache may be a memory in the tile 320.
  • the preset cache may be the memory 223.
  • the memory outside the neural network circuit 220 is called an external memory.
  • the external memory is the memory 212 shown in FIG. 2.
  • the memory in the neural network circuit 220 is called an internal memory.
  • the internal memory is the memory 223 shown in FIG. 2.
  • the internal memory is the memory 324 shown in FIG. 3.
  • the internal memories are the memory 329 and the memory 3210 shown in FIG. 4.
  • the so-called external memory refers to the memory outside the chip that runs the neural network.
  • the external memory may be a magnetic disk or the memory 212 shown in FIG. 2.
  • the amount of data that can be processed by each layer in the neural network is the batch size corresponding to that layer.
  • the batch corresponding to the batch size can be one picture, multiple pictures, or a partial image of one picture. For example, suppose the capacity of the internal memory is 100. If layer 1 (L1) has a cache requirement of 60 for processing 1 picture, then each time layer 1 is scheduled it can process at most 1 picture, and the batch size corresponding to layer 1 is 1 picture. If the cache requirement of layer 2 for processing 1 picture is 30, then each time layer 2 is scheduled it can process at most 3 pictures, and the batch size corresponding to layer 2 is 3 pictures.
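  • A worked sketch of the example above, assuming the batch size of a layer is simply the largest number of pictures whose cache demand fits within the internal memory capacity.

```python
def max_batch(internal_capacity, demand_per_picture):
    """Largest number of pictures a layer can process per schedule without
    exceeding the internal memory capacity (simplified illustration)."""
    return internal_capacity // demand_per_picture

capacity = 100
print(max_batch(capacity, 60))  # layer 1: 1 picture
print(max_batch(capacity, 30))  # layer 2: 3 pictures
```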
  • the batch size not only affects the usage of the internal memory of the chip running the neural network, but also affects the optimization degree and processing speed of the neural network.
  • the convolutional layer can use a filling algorithm to process the input data of a non-integral image. That is, before the calculation by the convolution kernel, the size of the input data is artificially increased by means of the filling algorithm to offset the influence caused by the size shrinkage in the calculation.
  • the filling algorithm can be, for example, zero filling, repeated boundary value filling, or other methods. That is to say, if the input data is non-integral image data, it is necessary to process the input data with a filling algorithm; if the input data is an entire image data, it is not necessary to use the filling algorithm to process the input data.
  • the input data needs to be filled first, and then flattened.
  • when the stride of the convolution kernel is smaller than the side length of the convolution kernel (the kernel is usually square), adjacent convolution windows overlap; when the stride of the convolution kernel equals the side length of the convolution kernel, there is no overlap.
  • if the input data size is (w*w), the filled data size is (w+k-s)*(w+k-s), where k represents the side length of the convolution kernel, s represents the stride of the convolution kernel movement, and the amount of filling is (k-s).
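  • A small sketch of this relation: for a w*w input, a k*k convolution kernel, and stride s, the filled side length is w+k-s.

```python
def filled_size(w, k, s):
    """Side length of the input after filling, for a k*k kernel moving with stride s."""
    return w + k - s

# Example: 3*3 kernel, stride 1 -> 2 extra rows/columns of filling.
print(filled_size(w=56, k=3, s=1))  # 58
```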
  • the layers in a neural network include layer 0, layer 1, layer 2, and layer 3.
  • the size of each convolution kernel is 3*3 and the stride of the convolution kernel movement is 1; since the stride is smaller than the side length of the convolution kernel, there is an overlap problem when the filling algorithm is used to process the input data.
  • the size of the whole picture is 56*56, and the rows of the picture are divided into 4 parts for processing. If layer 0, layer 1, and layer 2 are scheduled as one layer group, it is necessary to ensure that layer 2 outputs 14 rows of data, that is, the output data size of the layer group is 14*56, so that layer 3 can process a quarter of the picture rows.
  • the input data of layer 2 needs to be filled with 2 rows of data, that is, the input data size of layer 2 is 16*56.
  • the input data size corresponding to layer 1 is 18*56.
  • the input data size corresponding to layer 0 is 20*56. That is to say, in the process of segmenting the whole picture, in order to guarantee the output data size, the cache demand of the layers in the layer group increases. Moreover, the more layers there are in the layer group, the more data the earlier layers need to be filled with. If the internal memory capacity is small, the size of the layer group is limited.
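  • The row counts in this example can be reproduced by walking backwards from the required output of the layer group; the helper below is a hypothetical illustration, not part of the application.

```python
def required_input_rows(output_rows, num_layers, k=3, s=1):
    """Rows of input each layer needs, walking backwards through a layer group
    of convolution layers with kernel side k and stride s (overlap of k - s rows).
    The result is ordered from the last layer of the group back to the first."""
    rows = output_rows
    needed = []
    for _ in range(num_layers):
        rows += k - s          # each layer needs k - s extra rows of input
        needed.append(rows)
    return needed

# Layer group {layer 0, layer 1, layer 2} must output 14 rows of a 56-column picture.
print(required_input_rows(output_rows=14, num_layers=3))  # [16, 18, 20]
```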
  • a neural network includes multiple layers, which can be described as a neural network including multiple layers arranged in a directed graph, and each layer can have a corresponding set of parameters.
  • the subgraph is obtained by dividing the layers included in the neural network according to the batch size of each layer.
  • the subgraph contains one or more layers of the same batch size.
  • the subgraph can also be described as a super layer or a layer group, etc., which means that it contains one layer or continuous multiple layers in the neural network.
  • the neural network is scheduled to process the input data with the sub-graph as a unit, and the scheduling order of the layers in the sub-graph is the same as the scheduling order of the layers in the neural network.
  • the batches corresponding to the batch size of the layers contained in the subgraph are processed in the order of the layers contained in the subgraph.
  • the inter-layer data of the multiple layers included in the subgraph is stored in the internal memory.
  • the inter-layer data between subgraphs is stored in the internal memory.
  • FIG. 7 is a schematic diagram of a subgraph provided in an embodiment of this application.
  • the subgraph includes layer 0 and layer 1.
  • the batch size of layer 0 and the batch size of layer 1 are both 1.
  • a batch corresponding to a batch size of 1 can be one picture, multiple pictures, or part of images in one picture.
  • layer 0 processes one batch at a time.
  • layer 1 processes one batch at a time.
  • layer 0 and layer 1 in the subgraph process batch A0 and batch A1.
  • Batch A0 and batch A1 may be batches in the input data to be processed by the neural network.
  • batch A0 and batch A1 may be inter-layer data that has been processed by layers in the neural network.
  • the batch size of batch A0 and batch A1 are both 1.
  • the execution sequence of the processing batches in the subgraph is shown by the bold arrows in the figure. For ease of understanding, the layer 0 and layer 1 processing batch A0 and batch A1 are separately shown.
  • layer 0 processes batch A0 first to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then, layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1.
  • the inter-layer data C0 and the inter-layer data C1 can be stored in the internal memory.
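  • A minimal sketch of the schedule in FIG. 7, using placeholder layer functions that only rename the data so the execution order is visible.

```python
def layer0(batch):
    return f"B{batch[-1]}"   # placeholder: A0 -> B0, A1 -> B1

def layer1(data):
    return f"C{data[-1]}"    # placeholder: B0 -> C0, B1 -> C1

internal_memory = []
for batch in ["A0", "A1"]:          # batch size of layer 0 and layer 1 is 1
    b = layer0(batch)               # inter-layer data B0 / B1
    c = layer1(b)                   # inter-layer data C0 / C1
    internal_memory.append(c)       # C0, C1 stay in internal memory

print(internal_memory)              # ['C0', 'C1']
```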
  • the graph includes one or more subgraphs. Among them, the graph can also be described as a super layer or a layer group, which means a layer or a continuous multi-layer in the neural network.
  • each subgraph in the graph contains layers that can handle the same batch size.
  • the hypothetical graph includes sub-graph 1 and sub-graph 2.
  • subgraph 1 includes layer 0 and layer 1
  • the batch size of layer 0 is the same as the batch size of layer 1.
  • Subgraph 2 includes layer 2 and layer 3.
  • the batch size of layer 2 is the same as the batch size of layer 3.
  • the batch size of layer 0 and the batch size of layer 1 are both one batch.
  • the batch size of layer 2 and the batch size of layer 3 are both one batch.
  • all sub-graphs included in the graph contain layers of the same batch size.
  • At least two sub-graphs in all sub-graphs included in the graph include layers of different batch sizes.
  • suppose the graph includes subgraph 1, subgraph 2, and subgraph 3.
  • subgraph 1 includes layer 0 and layer 1
  • the batch size of layer 0 is the same as the batch size of layer 1.
  • Subgraph 2 includes layer 2 and layer 3.
  • the batch size of layer 2 is the same as the batch size of layer 3.
  • Sub-figure 3 includes layer 4 and layer 5, and the batch size of layer 4 is the same as the batch size of layer 5.
  • the batch size of layer 0 and the batch size of layer 1 are both one batch.
  • the batch size of layer 2 and the batch size of layer 3 are both one batch.
  • the batch size of layer 4 and the batch size of layer 5 are both two batches.
  • the batch size of the layers included in subgraph 3 of the graph is different from the batch size of the layers included in subgraph 1.
  • the batch size of the layers included in subgraph 3 of the graph is different from the batch size of the layers included in subgraph 2.
  • the neural network is scheduled to process input data in units of graphs, and the scheduling order of the layers in the graph is the same as the scheduling order of the layers in the neural network.
  • the scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph.
  • a part of the data is retained in the cache space of the internal memory, thereby generating additional internal memory cache requirements.
  • the inter-layer data between the pictures is stored in the external memory.
  • the scheduling process of the layers in the graph involves aggregation and scattering problems, which are described below.
  • the inter-layer data between the sub-graphs contained in the graph is aggregated.
  • FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between subgraphs provided in an embodiment of this application.
  • the graph includes subgraph 0 and subgraph 1.
  • Subgraph 0 includes layer 0 and layer 1.
  • the batch size of layer 0 and the batch size of layer 1 are both 1.
  • layer 0 processes one batch at a time.
  • layer 1 processes one batch at a time.
  • Sub-figure 1 includes layer 2 and layer 3.
  • the batch size of layer 2 and the batch size of layer 3 are both 2.
  • a batch corresponding to a batch size of 2 can be two pictures, multiple pictures, or partial images in one picture.
  • Layer 2 processes two batches at a time.
  • Layer 3 processes two batches at a time.
  • the graph processes batch A0 and batch A1.
  • Batch A0 and batch A1 may be batches in the input data to be processed by the neural network.
  • Batch A0 and batch A1 may be inter-layer data that has been processed by layers in the neural network.
  • the batch sizes of batch A0 and batch A1 are both 1. Layer 0 and layer 1 included in subgraph 0 process one batch at a time, while layer 2 and layer 3 included in subgraph 1 process two batches at a time. Therefore, after subgraph 0 has processed batch A0 and batch A1 respectively, subgraph 1 can process the inter-layer data of batch A0 and the inter-layer data of batch A1 output by subgraph 0 together.
  • the execution sequence of the processing batches in the figure is shown by the bold arrows in the figure. For ease of understanding, the layer 0 and layer 1 processing batch A0 and batch A1 are separately shown.
  • layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then, layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1.
  • the inter-layer data C0 and the inter-layer data C1 can be stored in the internal memory.
  • layer 2 can obtain inter-layer data C0 and inter-layer data C1 from the internal memory.
  • inter-layer data C0 and inter-layer data C1 can be combined into inter-layer data (C0, C1).
  • Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1)
  • layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1).
  • the inter-layer data (E0, E1) can be stored in the internal memory.
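  • A minimal sketch of the aggregation in FIG. 9, again with placeholder layer functions: subgraph 0 processes one batch at a time, and subgraph 1 aggregates C0 and C1 from the internal memory and processes them together.

```python
def run_layer(tag, data):
    """Placeholder layer: just renames the data so the flow is visible."""
    if isinstance(data, tuple):
        return tuple(tag + d[1:] for d in data)
    return tag + data[1:]

internal_memory = []
for batch in ["A0", "A1"]:                    # subgraph 0: batch size 1
    b = run_layer("B", batch)                 # layer 0
    c = run_layer("C", b)                     # layer 1
    internal_memory.append(c)                 # C0, C1 kept in internal memory

aggregated = tuple(internal_memory)           # (C0, C1) combined for subgraph 1
d = run_layer("D", aggregated)                # layer 2: batch size 2
e = run_layer("E", d)                         # layer 3: batch size 2
print(e)                                      # ('E0', 'E1')
```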
  • the inter-layer data between the sub-graphs contained in the graph is scattered.
  • FIG. 10 is a schematic diagram of scattering processing of inter-layer data between subgraphs provided in an embodiment of this application.
  • the graph includes sub graph 1 and sub graph 2.
  • Sub-figure 1 includes layer 2 and layer 3.
  • the batch size of layer 2 and the batch size of layer 3 are both two batches.
  • Layer 2 processes two batches at a time.
  • Layer 3 processes two batches at a time.
  • Sub-figure 2 includes layer 4 and layer 5.
  • the batch size of layer 4 and the batch size of layer 5 are both one batch.
  • Layer 4 processes one batch at a time.
  • Layer 5 processes one batch at a time.
  • layer 2 and layer 3 included in sub-figure 1 process two batches each time
  • layer 4 and layer 5 included in sub-figure 2 process one batch each time.
  • the execution sequence of the processing batches in the figure is shown by the bold arrows in the figure. To facilitate understanding, the processing of the inter-layer data E0 and the inter-layer data E1 in layer 4 and layer 5 are separately represented.
  • layer 2 can obtain the inter-layer data (C0, C1) of batch A0 and batch A1 from the internal memory.
  • Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1)
  • layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1).
  • the inter-layer data (E0, E1) can be stored in the internal memory.
  • layer 4 first obtains the inter-layer data (E0, E1) from the internal memory, and divides the inter-layer data (E0, E1) into the inter-layer data E0 and the inter-layer data E1.
  • Layer 4 first processes the inter-layer data E0 in the inter-layer data (E0, E1) to obtain the inter-layer data F0, and layer 5 processes the inter-layer data F0 to obtain the inter-layer data G0.
  • the layer 4 processes the inter-layer data E1 in the inter-layer data (E0, E1) to obtain the inter-layer data F1
  • the layer 5 processes the inter-layer data F1 to obtain the inter-layer data G1.
  • the inter-layer data G0 and the inter-layer data G1 can be stored in the internal memory.
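  • A minimal sketch of the scattering in FIG. 10 with placeholder layer functions: subgraph 1 outputs the pair (E0, E1), which subgraph 2 splits and processes one batch at a time.

```python
def run_layer(tag, data):
    """Placeholder layer: renames the data so the flow is visible."""
    if isinstance(data, tuple):
        return tuple(tag + d[1:] for d in data)
    return tag + data[1:]

e = run_layer("E", run_layer("D", ("C0", "C1")))   # subgraph 1: layers 2 and 3, batch size 2
internal_memory = []
for item in e:                                     # scatter (E0, E1) into E0 and E1
    f = run_layer("F", item)                       # layer 4: batch size 1
    g = run_layer("G", f)                          # layer 5: batch size 1
    internal_memory.append(g)

print(internal_memory)                             # ['G0', 'G1']
```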
  • multiple graphs are scheduled for processing in the order of the layers of the neural network. What needs to be clarified is that the data processed by the latter graph is the data output by the previous graph. Divide the layers of the neural network into multiple graphs, and process batches according to the order of the graphs, which improves the utilization of internal memory and the processing performance of the entire neural network.
  • FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application.
  • the abscissa represents the layer of the neural network
  • the ordinate represents the batch.
  • the neural network includes 12 layers.
  • the batch sizes of layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are all one batch, that is, layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are processed one batch at a time.
  • the batch sizes of layer 2, layer 3, layer 6 and layer 7 are all two batches, that is, layer 2, layer 3, layer 6 and layer 7 process two batches at a time.
  • the batch sizes of layer 8 and layer 9 are both four batches, that is, layer 8 and layer 9 process four batches each time.
  • graph 0 includes layer 0 to layer 5, and graph 1 includes layer 6 to layer 11.
  • the numbers in the boxes indicate the execution order of the batches, following the numbers from small to large. After graph 0 has been processed, graph 1 is processed.
  • layer 0 processes batch 0, and layer 1 processes the inter-layer data of batch 0 output by layer 0, obtains the inter-layer data of batch 0 output by layer 1, and stores the inter-layer data of batch 0 in the internal memory.
  • layer 0 processes batch 1
  • layer 1 processes the inter-layer data of batch 1 output by layer 0, obtains the inter-layer data of batch 1 output from layer 1, and stores the inter-layer data of batch 1 in the internal memory.
  • layer 2 obtains the inter-layer data of batch 0 and the inter-layer data of batch 1 from the internal memory and processes them together; layer 3 processes the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 2, obtaining the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 3.
  • layer 4 processes the inter-layer data of batch 0 output by layer 3, and layer 5 processes the inter-layer data of batch 0 output by layer 4; then layer 4 processes the inter-layer data of batch 1 output by layer 3, and layer 5 processes the inter-layer data of batch 1 output by layer 4.
  • layer 0 to layer 5 process batch 2 and batch 3 in the order of processing batch 0 and batch 1.
  • the batches processed by graph 1 are the data output by graph 0.
  • layer 6 processes the inter-layer data of batch 0 and batch 1 output by layer 5; layer 7 processes the inter-layer data of batch 0 and batch 1 output by layer 6, and the inter-layer data of batch 0 and the inter-layer data of batch 1 output by layer 7 are stored in the internal memory.
  • layer 6 then processes the inter-layer data of batch 2 and batch 3 output by layer 5; layer 7 processes the inter-layer data of batch 2 and batch 3 output by layer 6, and the inter-layer data of batch 2 and batch 3 output by layer 7 are stored in the internal memory.
  • layer 8 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7; layer 9 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 8; and the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 9 are stored in the internal memory.
  • Layer 10 processes the batch 0 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 0 output by layer 10.
  • Layer 10 processes the inter-layer data of batch 1 output by layer 9.
  • Layer 11 processes the inter-layer data of batch 1 output by layer 10.
  • Layer 10 processes the batch 2 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 2 output by layer 10.
  • Layer 10 processes the batch 3 inter-layer data output by layer 9.
  • Layer 11 processes the inter-layer data of batch 3 output by layer 10.
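  • The execution order of FIG. 11 can be reproduced from the per-layer batch sizes and the graph boundaries. The scheduler below is a simplified, hypothetical illustration (it assumes each graph is driven in passes sized to its largest batch size), not the scheduling algorithm claimed by the application.

```python
def schedule(graphs, total_batches):
    """Return the execution order as (layer, batch indices) steps.

    graphs: list of graphs; each graph is a list of subgraphs; each subgraph is
    (layer_names, batch_size). Simplified illustration of FIG. 11.
    """
    order = []
    for graph in graphs:
        pass_size = max(size for _, size in graph)      # batches consumed per pass
        for start in range(0, total_batches, pass_size):
            group = list(range(start, start + pass_size))
            for layers, size in graph:
                for i in range(0, len(group), size):
                    chunk = tuple(group[i:i + size])
                    for layer in layers:
                        order.append((layer, chunk))
    return order

graph0 = [(["L0", "L1"], 1), (["L2", "L3"], 2), (["L4", "L5"], 1)]
graph1 = [(["L6", "L7"], 2), (["L8", "L9"], 4), (["L10", "L11"], 1)]
for step in schedule([graph0, graph1], total_batches=4):
    print(step)
```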
  • the internal memory includes a memory 223, a memory 324, a memory 329, and a memory 3210.
  • the external memory is the memory 212.
  • the calculation node completes the calculation of the neural network according to the determined batch size.
  • the computing node includes a chip 221, a tile 320, a processing device 323, or an engine 326.
  • the neural network data processing method includes S1201 and S1202.
  • the processor 211 obtains the data amount of the input data, the first feature of the internal memory in the chip running the neural network, and the second feature of the multiple layers in the neural network.
  • the input data is the data received by the input layer of the neural network.
  • the input data is data in a data set. Take image processing as an example.
  • the input data is 32 pictures in the data set.
  • the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory.
  • the distribution characteristics of internal memory in the chip include the number of memories in the chip running the neural network and the connection relationship between the memory and the computing node.
  • although the memory capacity and the number of memories in the chip are large, these storage resources are not all used for neural network calculation every time, and the storage resources allocated to the neural network calculation vary. Therefore, the neural network configuration needs to be dynamically optimized according to the number of memories and the connection relationships between the memories and the computing nodes, that is, the distribution feature.
  • taking the neural network circuit 220 as an example, the distribution feature includes the number of memories 223, the number of memories 324, the number of memories 329, and the number of memories 3210, as well as the connection relationship between the memory 223 and the chip 221, the connection relationship between the memory 324 and the processing device 323, the connection relationship between the memory 329 and the engine 326, and the connection relationship between the memory 3210 and the engine 326.
  • the capacity of internal memory includes the capacity of all memories in the chip running the neural network.
  • the memory capacity and the number of memories in the chip are large, but these storage resources are not all used for neural network calculation every time, and the storage resources allocated to the neural network calculation vary, so the neural network configuration also needs to be dynamically optimized according to the capacity.
  • taking the neural network circuit 220 as an example, the capacity of the internal memory includes the capacity of the memory 223, the capacity of the memory 324, the capacity of the memory 329, and the capacity of the memory 3210. It is understandable that the capacity of the internal memory may refer to the available capacity of the internal memory.
  • the second feature includes the connection relationship between the multiple layers and the calculation-related parameters of each of the multiple layers.
  • the computing resources in the chip change, and these computing resources are not all used for neural network calculation every time, so the connection relationships between the multiple layers and the calculation-related parameters of each of the multiple layers also change with requirements, and the neural network configuration needs to be dynamically optimized according to these changes.
  • the connection relationship between multiple layers includes the connection relationship between each layer in the neural network and at least one layer in other layers. According to the different functions performed by the neural network, the connection relationship of the layers in the neural network is also different, and this application does not limit the connection relationship of the layers in the neural network.
  • the calculation-related parameters of each layer include the dimensionality of the input data and the dimensionality of the output data, offset parameters, convolution kernels, quantization parameters, or normalization parameters.
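  • A minimal sketch of one way the second feature might be represented in code; the field names are hypothetical and only illustrate the kinds of parameters listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LayerFeature:
    """Calculation-related parameters of one layer (hypothetical representation)."""
    name: str
    input_dims: List[int]
    output_dims: List[int]
    predecessors: List[str] = field(default_factory=list)  # connection relationship
    kernel: Optional[List[int]] = None                      # e.g. [3, 3] for a conv layer
    bias: bool = False
    quantization: Optional[str] = None
    normalization: Optional[str] = None

second_feature = [
    LayerFeature("layer0", [1, 3, 56, 56], [1, 16, 56, 56], kernel=[3, 3], bias=True),
    LayerFeature("layer1", [1, 16, 56, 56], [1, 16, 28, 28], predecessors=["layer0"]),
]
print(second_feature[1].predecessors)
```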
  • the first feature and the second feature may be stored in the memory 212 in the host 210.
  • the processor 211 may obtain the characteristics of the internal memory and the characteristics of multiple layers in the neural network from the memory 212 in the host 210.
  • the processor 211 determines the batch size of the multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data according to the data volume, the first characteristic and the second characteristic.
  • the batch size of at least two layers in the batch size are different.
  • the processor 211 may use an iterative algorithm to determine the batch size of multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data according to the amount of data, the first feature, and the second feature.
  • the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm. It is understandable that the processor does not obtain the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data in a single calculation based on the data amount, the first feature, and the second feature; instead, it uses an iterative algorithm to run multiple iterative trials.
  • N is an integer greater than or equal to 2
  • M is an integer greater than or equal to 1
  • N ≥ M. For example, when N = 2 and M = 1, the layers of the neural network are divided into 2 subgraphs, and the 2 subgraphs are divided into one graph.
  • the processor 211 first determines the batch size of each layer in the neural network based on the capacity of the internal memory, and then merges consecutive layers with the same batch size into subgraphs. Based on the cache requirements of the subgraphs and the capacity of the internal memory, multiple subgraphs are merged into a graph, and the resulting graph can contain subgraphs with different batch sizes. That is to say, when the neural network is scheduled in units of graphs, the input data is processed with different batch sizes, so the cache requirement of each graph does not exceed the capacity of the internal memory, which improves the utilization of the on-chip memory and the operational performance of the hardware.
  • the layers in the N subgraphs are connected to form a complete neural network.
  • Each of the N subgraphs contains one or more layers of the same batch size.
  • One or more layers of the same batch size are consecutive layers in the neural network.
  • the number of layers included in different subgraphs may be the same or different.
  • the sub-graphs in the M graphs are connected to form a complete neural network.
  • Each of the M graphs includes one or more subgraphs.
  • the number of subgraphs included in different graphs may be the same or different.
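Under the definitions above, the grouping step could look like the following sketch: consecutive layers with the same batch size are merged into subgraphs, and consecutive subgraphs are packed into graphs as long as the graph's cache requirement stays within the internal memory capacity. graph_cache_requirement is a hypothetical helper, and the greedy packing shown is only one possible way to obtain the N subgraphs and M graphs.

```python
def build_subgraphs(layer_order, batch):
    # consecutive layers that share a batch size form one subgraph
    subgraphs, current = [], [layer_order[0]]
    for layer in layer_order[1:]:
        if batch[layer] == batch[current[-1]]:
            current.append(layer)
        else:
            subgraphs.append(current)
            current = [layer]
    subgraphs.append(current)
    return subgraphs                                  # the N subgraphs

def build_graphs(subgraphs, memory_capacity, graph_cache_requirement):
    # pack consecutive subgraphs into a graph while the cache requirement still fits
    graphs, current = [], []
    for subgraph in subgraphs:
        if not current or graph_cache_requirement(current + [subgraph]) <= memory_capacity:
            current.append(subgraph)
        else:
            graphs.append(current)
            current = [subgraph]
    graphs.append(current)
    return graphs                                     # the M graphs, M <= N
```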
  • an exemplary process of processing data in a neural network is given, including the data import process (that is, the process of reading input data), the calculation process, and the data export process (that is, the process of storing output data).
  • before the neural network processes a batch of data, part of the data needs to be moved in first, that is, the data import process; the overhead generated in this process is the head overhead.
  • the data import process, the calculation process, and the data export process are parallel.
  • after the calculation, the neural network executes the data export process for the last calculated data and stores it in the storage space.
  • the overhead generated by this process is the tail overhead.
  • the layer processes data in units of batch size.
  • calculation time = calculation amount of this layer / computing power of the chip running the neural network
  • data transfer time = (input data volume + output data volume) / (internal memory bandwidth or external memory bandwidth of the chip)
  • total time overhead = head overhead + max(calculation time, data transfer time) + tail overhead.
  • the time overhead of a certain layer in the neural network can be obtained according to the storage location of at least one of the input data or output data of the current layer and the computing power of the chip equipped with the neural network.
  • the storage location of data includes internal memory and external memory.
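The overhead model above can be written as a small function. The sketch below assumes the head and tail overheads, the computing power, and the two bandwidths are known, and simply selects the internal or external memory bandwidth according to where the layer's data is stored; all names and values are illustrative.

```python
def layer_time_overhead(calc_amount_ops, input_bytes, output_bytes, stored_internally,
                        compute_ops_per_s, internal_bw_bytes_per_s, external_bw_bytes_per_s,
                        head_overhead_s, tail_overhead_s):
    calculation_time = calc_amount_ops / compute_ops_per_s
    bandwidth = internal_bw_bytes_per_s if stored_internally else external_bw_bytes_per_s
    data_transfer_time = (input_bytes + output_bytes) / bandwidth
    # data import, calculation and data export run in parallel,
    # so the slower of calculation and transfer dominates
    return head_overhead_s + max(calculation_time, data_transfer_time) + tail_overhead_s
```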
  • the external memory and the internal memory are jointly planned to store the inter-layer data, which relieves the pressure on the storage space of the internal memory.
  • because part of the inter-layer data can be stored in the external memory, a larger batch size can be set for the layers in the neural network, thereby reducing the head overhead of processing each batch in the layers of the neural network and improving the computational efficiency of the processor.
  • the scheduling order of the layers in a graph is determined according to the scheduling order of each subgraph contained in the graph and the scheduling order of the layers within each subgraph.
  • the scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network.
  • the batches corresponding to the batch size of the layers included in a subgraph are processed in the order of the layers included in the subgraph.
  • the scheduling order of each subgraph included in a graph is determined according to the batch sizes and the scheduling order of the first and last layers in the subgraphs.
  • the inter-layer data of the subgraphs contained in a graph are aggregated or scattered. For the explanation of subgraphs and graphs, please refer to the above description.
  • the neural network includes 6 layers, and the layer sequence is layer 0-layer 5 (layer0-layer5, L0-L5).
  • the batch size corresponding to L0, L1, L4, and L5 is 1, and the batch size corresponding to L2 and L3 is 2.
  • the layers with the same batch size form a subgraph, that is, L0 and L1 form subgraph 0.
  • L2 and L3 form subgraph 1.
  • L4 and L5 form subgraph 2.
  • the graph is composed of the subgraphs, that is, subgraph 0, subgraph 1, and subgraph 2 form one graph.
  • the batch size corresponding to L0 and L1 is 1, so subgraph 0 can process input data with a data size of 1 each time, that is, batch 0 and batch 1 are processed separately.
  • the output data of L1 is C0.
  • the batch size corresponding to L2 is 2.
  • C0 only corresponds to batch 0, which does not meet the processing requirements of L2, and C0 needs to be temporarily stored in the internal memory.
  • Batch 1 is input to L0 for processing, after L0 and L1 are processed, the output data of L1 is C1.
  • L1 outputs two batches of data to meet the processing requirements of L2.
  • the internal memory then holds the two data sets C0 and C1.
  • L2 can then call the aggregated C0 and C1 for processing. Therefore, if subgraph 0 and subgraph 1 are divided into one graph, then while L0 and L1 are scheduled to process batch 1, C0 occupies cache space in the internal memory, and the amount of data corresponding to C0 is an additional internal-memory cache requirement of L0 and L1.
  • the cache requirement of input data corresponding to L0 is the amount of data corresponding to (C0+A1)
  • the cache requirement of output data is the amount of data corresponding to (C0+B1)
  • the cache requirement of the input data corresponding to L1 is the amount of data corresponding to (C0+B1)
  • the buffer requirement for output data is the amount of data corresponding to (C0+C1).
  • the cache requirement of input data corresponding to L4 is the amount of data corresponding to (E1+E0)
  • the cache requirement of output data is the amount of data corresponding to (E1+F0)
  • the cache requirement of the input data corresponding to L5 is the amount of data corresponding to (E1+F0)
  • the buffer requirement for output data is the amount of data corresponding to (E1+G0).
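A minimal sketch of the cache bookkeeping in this example: while a subgraph is re-run to produce a second batch, the pending inter-layer data (C0 above, or E1 in the later case) stays resident in the internal memory and is added to both the input and the output buffer requirement of each layer being scheduled. The function below is illustrative only; sizes are byte counts.

```python
def buffer_requirements(layer_input_bytes, layer_output_bytes, resident_bytes):
    # resident_bytes is the pending inter-layer data that must stay cached (e.g. C0)
    input_requirement = resident_bytes + layer_input_bytes    # e.g. C0 + A1 for L0
    output_requirement = resident_bytes + layer_output_bytes  # e.g. C0 + B1 for L0
    return input_requirement, output_requirement
```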
  • because the inter-layer data of the layers contained in a subgraph and the inter-layer data between subgraphs are stored in the internal memory and occupy its storage space, the batch sizes of the multiple layers and the storage locations of the inter-layer data are also affected by the division of subgraphs and graphs.
  • for example, when the inter-layer data E0 is processed by layers L4 and L5, the inter-layer data E1 is stored in the cache and occupies cache space, so less cache is available to L4 and L5, which affects the segmentation of the input data.
  • the neural network data processing method comprehensively refers to the data volume of the input data, the first feature, and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. Therefore, by setting a reasonable batch size for each layer, the internal memory is fully utilized to store the inter-layer data during neural network inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
  • another computer can execute S1201 and S1202 offline to generate the segmentation strategy and the execution order for scheduling the layers of the neural network.
  • the segmentation strategy and the layer scheduling order are then configured in the controller of the neural network system, and the controller applies the segmentation strategy and controls the execution order of the layers of the neural network.
  • alternatively, the controller in the neural network system can execute S1201 and S1202 to generate the segmentation strategy and the execution order for scheduling the layers of the neural network, and the controller uniformly manages the scheduling of the layers and the segmented batches.
  • Example 1: the input data is whole-image data.
  • the batch size corresponding to L0 and L1 is 1 picture
  • the batch size corresponding to L2, L3 and L4 is 2 pictures
  • the batch size corresponding to L5 and L6 is 4 pictures.
  • L0 and L1 are divided into subgraph 0
  • L2-L4 are divided into subgraph 1
  • L5 and L6 are divided into subgraph 2.
  • the 3 subgraphs are divided into one graph, that is, L0-L6 form one graph.
  • the cache requirement of the graph is less than or equal to the capacity of the internal memory.
  • the graph contains layers with different batch sizes. In the process of scheduling the subgraphs in the neural network to process the input data, this improves the utilization of the internal memory and the operating performance of the chip running the neural network.
  • the data set contains 8 pictures
  • L0 is the first layer of the neural network
  • the batch size is 1 picture
  • the data set is divided into 8 batches of input data (batch 0-batch 7 shown in Fig. 14 )
  • each batch of input data is the whole-image data corresponding to 1 picture, and the batches are input to L0 in turn.
  • scheduling subgraph 0 twice corresponds to scheduling subgraph 1 once, that is, the scheduling sequence is L0→L1→L0→L1→L2→L3→L4; scheduling subgraph 1 twice corresponds to scheduling subgraph 2 once, that is, the scheduling sequence is L2→L3→L4→L2→L3→L4→L5→L6.
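The interleaved scheduling order of Example 1 can be reproduced with a small recursive routine. The sketch below is a hypothetical reconstruction that assumes each subgraph's batch size is an integer multiple of the previous subgraph's batch size, as in the 1-, 2-, and 4-picture configuration above.

```python
def schedule(subgraphs, batch_sizes, pictures):
    # subgraphs: list of layer-name lists; batch_sizes: batch size (in pictures) per subgraph
    order = []

    def run(idx, amount):
        # run subgraph `idx` as often as needed to process `amount` pictures
        for _ in range(amount // batch_sizes[idx]):
            if idx > 0:
                # first let the previous subgraph produce enough batches for one run
                run(idx - 1, batch_sizes[idx])
            order.extend(subgraphs[idx])

    run(len(subgraphs) - 1, pictures)
    return order

# schedule([["L0", "L1"], ["L2", "L3", "L4"], ["L5", "L6"]], [1, 2, 4], 4) yields
# L0 L1 L0 L1 L2 L3 L4 L0 L1 L0 L1 L2 L3 L4 L5 L6, matching the order described above.
```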
  • Example 2: the input data is non-whole-image data.
  • the batch size corresponding to L0 and L1 is 1/4 of a picture
  • the batch size corresponding to L2, L3, and L4 is 1/2 of a picture.
  • L0 and L1 are divided into subgraph 0, and L2-L4 are divided into subgraph 1.
  • because the input data is non-whole-image data, it needs to be processed with a padding algorithm; the padding data is the shaded part.
  • the two subgraphs are divided into one graph, that is, L0-L4 form one graph.
  • the cache requirement of the graph is less than or equal to the capacity of the internal memory.
  • the graph contains layers with different batch sizes. In the process of scheduling the subgraphs in the neural network to process the input data, this improves the utilization of the internal memory and the operating performance of the chip running the neural network.
  • each batch of input data is the non-whole-image data corresponding to 1/4 of a picture, and the batches are input to L0 in turn.
  • scheduling subgraph 0 twice corresponds to scheduling subgraph 1 once, that is, the scheduling sequence is L0→L1→L0→L1→L2→L3→L4. Processing the input data of the current data set requires scheduling subgraph 0 eight times and subgraph 1 four times.
  • the neural network system includes at least one of a hardware structure or a software module corresponding to each function.
  • FIG. 16 and FIG. 17 are schematic diagrams of the structure of a possible neural network data processing device provided by an embodiment of the application. These neural network data processing devices can be used to implement the functions of the processor 211 in the foregoing method embodiment, and therefore can also achieve the beneficial effects of the foregoing method embodiment.
  • the data processing device of the neural network in FIG. 16 may be the processor 211 shown in FIG. 2 or a device formed by running software on it.
  • the data processing device 1600 of the neural network includes an acquisition unit 1610 and a processing unit 1620.
  • the neural network data processing device 1600 is used to implement the function of the processor 211 in the method embodiment shown in FIG. 12 above.
  • the acquiring unit 1610 is used to perform S1201; the processing unit 1620 is used to perform S1202. More detailed descriptions of the above-mentioned acquisition unit 1610 and processing unit 1620 can be obtained directly by referring to the relevant description in the method embodiment shown in FIG. 12, and will not be repeated here.
  • the data processing device of the neural network may also be a module (such as a chip) of other equipment connected to the neural network system 200.
  • the data processing device 1700 of the neural network includes a processor 1710 and an interface circuit 1720.
  • the processor 1710 and the interface circuit 1720 are coupled to each other.
  • the interface circuit 1720 may be a transceiver or an input/output interface.
  • the neural network data processing device 1700 may further include a memory 1730 for storing instructions executed by the processor 1710 or storing input data required by the processor 1710 to run the instructions or storing data generated after the processor 1710 runs the instructions.
  • the data processing device of the neural network may include the host 210 shown in FIG. 2.
  • the processor 1710 may include the processor 211, and the memory 1730 is the memory 212.
  • the above scheme is used to configure the batch size for the neural network chip so that the neural network can work efficiently.
  • the batch size, the processing of graphs and subgraphs, and the operation of related algorithms are all executed by the processor 211.
  • the processing method can also be executed by other types of processors or devices; for example, a controller or processor located inside the neural network chip can execute the related solution to complete the configuration of the neural network.
  • one or more types of processors can be included in the neural network chip, and such a processor can run the related neural network configuration scheme to obtain suitable batch sizes and the division into graphs and subgraphs. After configuring the parameters of the neural network, the processor can run the neural network calculations accordingly, thereby realizing self-configuration, which is not limited in this embodiment.
  • the processor 1710 is used to perform the functions of the above-mentioned processing unit 1620, and the interface circuit 1720 is used to perform the functions of the above-mentioned obtaining unit 1610.
  • the processor in the embodiment of the present application may be a central processing unit (Central Processing Unit, CPU), or may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or a field programmable gate array (Field Programmable Gate Array, FPGA).
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the method steps in the embodiments of the present application can be implemented by hardware, and can also be implemented by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, which can be stored in a random access memory (Random Access Memory, RAM), a flash memory, a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor, so that the processor can read information from the storage medium and can write information to the storage medium.
  • the storage medium may also be an integral part of the processor.
  • the processor and the storage medium may be located in the ASIC.
  • the ASIC can be located in a network device or a terminal device.
  • the processor and the storage medium may also exist as discrete components in the network device or the terminal device.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or other programmable devices.
  • the computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that integrates one or more available media.
  • the usable medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; it may also be an optical medium, such as a digital video disc (digital video disc, DVD); or it may be a semiconductor medium, such as a solid state drive (solid state drive, SSD).

Abstract

Disclosed are a data processing method and apparatus for a neural network, which method and apparatus relate to the field of artificial intelligence. The method comprises: according to the data amount of input data, a first feature of an internal memory in a chip that runs a neural network, and a second feature of multiple layers in the neural network, dynamically segmenting the input data, and configuring different batch sizes for the layers in the neural network. By means of configuring a rational batch size for each layer in a neural network, during a neural network inference procedure, an internal memory can be fully utilized to store inter-layer data of the neural network, thereby improving the utilization rate of the internal memory, and ensuring the computational efficiency of hardware that runs the neural network.

Description

一种神经网络的数据处理方法及装置A neural network data processing method and device 技术领域Technical field
本申请涉及人工智能(artificial intelligence,AI)领域,尤其涉及一种神经网络的数据处理方法及装置。This application relates to the field of artificial intelligence (AI), and in particular to a neural network data processing method and device.
背景技术Background technique
随着计算机系统中处理器的计算能力的不断提高,处理器的性能持续提升。为了解决由于外部存储器的带宽限制,无法适应处理器的处理速度而产生的“内存墙”的问题,计算机系统中配置有带宽更高、容量小的多级高速缓存结构。As the computing power of the processor in the computer system continues to increase, the performance of the processor continues to improve. In order to solve the problem of the "memory wall" caused by the limitation of the bandwidth of the external memory and the inability to adapt to the processing speed of the processor, the computer system is equipped with a multi-level cache structure with higher bandwidth and smaller capacity.
在神经网络推理过程中,神经网络中每层处理完输入数据后,进入下一层。若输入数据的数据量较大,神经网络的多个层的层间数据的大小(size)也可能过大,导致高速缓存无法存储层间数据,将层间数据存到外部存储器。由于不能有效利用高速缓存,因此降低了处理器的计算效率。In the neural network reasoning process, after each layer of the neural network has processed the input data, it enters the next layer. If the amount of input data is large, the size of the inter-layer data of multiple layers of the neural network may also be too large, causing the cache to be unable to store the inter-layer data, and the inter-layer data is stored in an external memory. Because the cache cannot be used effectively, the computational efficiency of the processor is reduced.
为了解决上述问题,传统技术针对每个层的层间数据的缓存需求将输入数据进行分组,得到多组相同批大小(batch size)的批(batch),该批大小受限于缓存需求最大的层的批大小。神经网络处理完一组批再处理下一组批。通过减少神经网络中每层处理的数据,来减少层间数据,使层间数据尽可能保存在高速缓存中。In order to solve the above problems, the traditional technology groups the input data according to the inter-layer data caching requirements of each layer to obtain multiple sets of batches of the same batch size. The batch size is limited by the largest cache demand. The batch size of the layer. The neural network processes one set of batches before processing the next set of batches. By reducing the data processed by each layer in the neural network, the inter-layer data is reduced, and the inter-layer data is stored in the cache as much as possible.
由于神经网络中不同的层对数据处理的操作不同,层间数据的大小不同。例如,若神经网络中的层对图片进行放大操作,产生的层间数据较大。又如,若神经网络中的层对图片进行缩小操作,产生的层间数据较小。对于输出较小的层间数据的层而言,批大小越小层间数据越小,高速缓存的剩余容量较多;对于输出较大的层间数据的层而言,批大小越大层间数据越大,高速缓存的剩余容量较少,导致高速缓存可能无法存储层间数据。总之,依据传统技术确定的神经网络的批大小处理输入数据的过程中,依然导致高速缓存的利用率较低,影响运行神经网络的硬件的计算效率。另外,如果划分的分组较多,将增加神经网络中每层处理每组批的头部开销,反而降低了运行神经网络的硬件的计算效率。因此,如何提高高速缓存的利用率,以及确保运行神经网络的硬件的计算效率是一个亟待解决的问题。Since different layers in the neural network have different operations on data processing, the size of the data between layers is different. For example, if a layer in a neural network enlarges a picture, the generated inter-layer data is relatively large. For another example, if a layer in the neural network performs a shrinking operation on a picture, the generated inter-layer data is smaller. For layers that output smaller inter-layer data, the smaller the batch size, the smaller the inter-layer data, and the more remaining capacity of the cache; for the layers that output larger inter-layer data, the larger the batch size. The larger the data, the smaller the remaining capacity of the cache, and the cache may not be able to store inter-layer data. In short, in the process of processing input data according to the batch size of the neural network determined by the traditional technology, the utilization rate of the cache is still low, which affects the computational efficiency of the hardware running the neural network. In addition, if there are more groups, it will increase the head overhead of processing each batch in each layer of the neural network, and on the contrary reduce the computational efficiency of the hardware running the neural network. Therefore, how to improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network is an urgent problem to be solved.
发明内容Summary of the invention
本申请提供了一种神经网络的数据处理方法及装置,能够提高了高速缓存的利用率,以及确保了运行神经网络的硬件的计算效率。为达到上述目的,本申请采用如下技术方案。The present application provides a neural network data processing method and device, which can improve the utilization rate of the cache and ensure the computational efficiency of the hardware running the neural network. In order to achieve the above-mentioned purpose, this application adopts the following technical solutions.
第一方面,本申请提供了一种神经网络的数据处理方法,方法包括:处理器利用输入数据的数据量、运行神经网络的芯片内的内部存储器的第一特征和神经网络中多个层的第二特征对输入数据进行分组,确定神经网络中每个层的批大小,使多个层的批大小中至少两个层的批大小不同。例如,神经网络中每个层的批大小不同。又如,神经网络包括相同批大小的层和不同批大小的层。其中,第一特征包括内部存储器在 所述芯片内的分布特征和内部存储器的容量中至少一个。第二特征包括所述多个层之间的连接关系和所述多个层中每个层的与计算相关的参数。批大小对应的批为一个图片、多个图片或者一个图片中的部分图像。In the first aspect, this application provides a neural network data processing method. The method includes: the processor uses the amount of input data, the first feature of the internal memory in the chip running the neural network, and the multiple layers of the neural network. The second feature groups the input data, determines the batch size of each layer in the neural network, and makes the batch sizes of at least two layers different among the batch sizes of multiple layers. For example, the batch size of each layer in a neural network is different. For another example, a neural network includes layers of the same batch size and layers of different batch sizes. Wherein, the first feature includes at least one of the distribution feature of the internal memory in the chip and the capacity of the internal memory. The second feature includes the connection relationship between the plurality of layers and the calculation-related parameters of each of the plurality of layers. The batch corresponding to the batch size is one picture, multiple pictures, or part of the image in one picture.
应理解,所谓内部存储器是指运行神经网络的芯片内的存储器。例如,运行神经网络的芯片内的存储器是高速缓存。所谓外部存储器是指运行神经网络的芯片外的存储器。内部存储器也可以称为片上存储器。外部存储器也可以称为片外存储器。It should be understood that the so-called internal memory refers to the memory in the chip running the neural network. For example, the memory on the chip that runs the neural network is a cache. The so-called external memory refers to the memory outside the chip that runs the neural network. Internal memory can also be called on-chip memory. External memory can also be called off-chip memory.
本申请实施例提供的神经网络的数据处理方法,综合参考输入数据的数据量、第一特征和第二特征切分输入数据,为神经网络中的层设置不同的批大小。因此,通过为神经网络中的每一层设置合理的批大小,在神经网络推理过程中,充分利用内部存储器存储神经网络的层间数据,减少了运行神经网络的芯片与外部存储器的交互,从而提高了内部存储器的利用率,以及确保运行神经网络的芯片的计算效率。The neural network data processing method provided by the embodiments of the present application comprehensively refers to the data amount of the input data, the first feature and the second feature to segment the input data, and sets different batch sizes for the layers in the neural network. Therefore, by setting a reasonable batch size for each layer in the neural network, the internal memory is fully utilized to store the inter-layer data of the neural network during the neural network inference process, which reduces the interaction between the chip running the neural network and the external memory, thereby Improve the utilization of internal memory and ensure the computational efficiency of the chip running the neural network.
具体的,依据所述数据量、所述第一特征和所述第二特征确定所述多个层的批大小包括:依据所述数据量、所述第一特征和所述第二特征确定多个层的批大小、N个子图、M个图和层间数据的存储位置,N为大于或等于2的整数,M为大于或等于1的整数,N≥M。Specifically, determining the batch size of the multiple layers according to the amount of data, the first characteristic, and the second characteristic includes: determining the amount of batches according to the amount of data, the first characteristic, and the second characteristic. The batch size of layers, N sub-pictures, M pictures and storage locations of inter-layer data, N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M.
其中,层间数据的存储位置包括内部存储器或外部存储器中至少一个。在一种可能的实现方式中,子图包含的多个层的层间数据存储在所述内部存储器。在另一种可能的实现方式中,子图间的层间数据存储在所述内部存储器。在另一种可能的实现方式中,图间的层间数据存储在所述外部存储器。Wherein, the storage location of the inter-layer data includes at least one of an internal memory or an external memory. In a possible implementation manner, the inter-layer data of multiple layers included in the sub-picture is stored in the internal memory. In another possible implementation manner, the inter-layer data between sub-pictures is stored in the internal memory. In another possible implementation manner, the inter-layer data between the pictures is stored in the external memory.
子图包含一个或多个相同批大小的层。可选的,不同的子图包含的层的个数可以相同,也可以不同。可替换描述的,子图也可以称为第一类层组。The subgraph contains one or more layers of the same batch size. Optionally, the number of layers included in different sub-pictures may be the same or different. Alternatively described, the sub-picture may also be referred to as the first-type layer group.
图包括一个或多个子图。不同的图包含的子图的个数可以相同,也可以不同。可替换描述的,图也可以称为第二类层组。The graph includes one or more subgraphs. The number of subgraphs contained in different graphs can be the same or different. Alternatively described, the graph may also be referred to as the second type of layer group.
在一些可能的设计中,处理器可以采用迭代算法,依据所述数据量、所述第一特征和所述第二特征确定多个层的批大小、N个子图、M个图和层间数据的存储位置。可理解的,处理器根据所述数据量、所述第一特征和所述第二特征并非一次计算,得到多个层的批大小、N个子图、M个图和层间数据的存储位置,而是采用迭代算法经过多次迭代实验,从多个实验结果中选择多个层的批大小、N个子图、M个图和层间数据的存储位置,来确保内部存储器的利用率,以及运行神经网络的芯片的计算效率。In some possible designs, the processor may use an iterative algorithm to determine the batch size of multiple layers, N sub-pictures, M pictures, and inter-layer data based on the amount of data, the first feature, and the second feature. Storage location. It is understandable that the processor does not perform a calculation based on the amount of data, the first feature, and the second feature at one time, and obtains the batch size of multiple layers, N sub-pictures, M pictures, and storage locations of inter-layer data. Instead, it uses an iterative algorithm to go through multiple iterative experiments. From multiple experimental results, select the batch size of multiple layers, N sub-graphs, M graphs, and storage locations of inter-layer data to ensure the utilization of internal memory and operation The computational efficiency of the neural network chip.
其中,优化算法可以是动态规划算法、贪婪算法或遗传算法。Among them, the optimization algorithm can be a dynamic programming algorithm, a greedy algorithm or a genetic algorithm.
动态规划算法(dynamic programming algorithm)的基本思想也是将待求解问题分解成若干个子问题,先求解子问题,然后从这些子问题的解得到原问题的解。The basic idea of dynamic programming algorithm (dynamic programming algorithm) is also to decompose the problem to be solved into several sub-problems, first solve the sub-problems, and then obtain the solution of the original problem from the solutions of these sub-problems.
贪婪算法(greedy algorithm)也可以称为贪心算法,其基本思路是从问题的某一个初始解出发一步一步地进行,根据某个优化测度,每一步都要确保能获得局部最优解。每一步只考虑一个数据,他的选取应该满足局部优化的条件。若下一个数据和部分最优解连在一起不再是可行解时,就不把该数据添加到部分解中,直到把所有数据枚举完,或者不能再添加使算法停止Greedy algorithm (greedy algorithm) can also be called greedy algorithm. The basic idea is to proceed step by step from a certain initial solution of the problem. According to a certain optimization measure, each step must ensure that a local optimal solution can be obtained. Only one data is considered in each step, and his selection should meet the conditions of local optimization. If the next data and the partial optimal solution are no longer a feasible solution, the data is not added to the partial solution until all the data is enumerated, or no more data can be added to stop the algorithm
遗传算法(genetic algorithm)是一类借鉴生物界的进化规律设计的算法,用于模拟自然进化搜索最优解。Genetic algorithm (genetic algorithm) is a type of algorithm designed based on the evolutionary laws of the biological world, and is used to simulate natural evolution to search for the optimal solution.
需要说明的是,在根据上述神经网络的层的划分结果处理所述神经网络的输入数据过程中,图中的层的调度顺序为根据所述图中包含的各个子图的调度顺序,以及所述子图中的层的调度顺序确定。所述子图中的层的调度顺序与神经网络中的层的调度顺序相同。例如,子图包含的层的批大小对应的批按照所述子图包含的层的顺序处理。图中包含的各个子图的调度顺序为根据批大小以及子图中的首层和末层的调度顺序确定。图包含的子图的层间数据进行聚集处理或散开处理。It should be noted that in the process of processing the input data of the neural network according to the division result of the layers of the neural network, the scheduling order of the layers in the figure is based on the scheduling order of the subgraphs contained in the figure, and the The scheduling sequence of the layers in the subgraph is determined. The scheduling order of the layers in the subgraph is the same as the scheduling order of the layers in the neural network. For example, batches corresponding to the batch size of the layers included in the sub-picture are processed in the order of the layers included in the sub-picture. The scheduling order of each subgraph included in the figure is determined according to the batch size and the scheduling order of the first and last layers in the subgraph. The inter-layer data of the sub-graphs contained in the graph are aggregated or scattered.
第二方面,本申请实施例还提供了一种神经网络的数据处理装置,有益效果可以参见第一方面的描述此处不再赘述。所述神经网络的数据处理装置具有实现上述第一方面的方法实例中处理器行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。在一个可能的设计中,所述神经网络的数据处理装置包括:获取单元和处理单元。获取单元,用于获取神经网络的输入数据的数据量、运行神经网络的芯片内的内部存储器的第一特征和神经网络中多个层的第二特征。处理单元,用于依据所述数据量、第一特征和第二特征确定所述多个层中每个层的批大小,多个层中至少两个层的批大小不同。这些模块可以执行上述第一方面方法示例中的相应功能,具体参见方法示例中的详细描述,此处不做赘述。In the second aspect, the embodiment of the present application also provides a neural network data processing device, and the beneficial effects can be referred to the description of the first aspect and will not be repeated here. The data processing device of the neural network has the function of realizing the behavior of the processor in the method example of the first aspect described above. The functions can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions. In a possible design, the data processing device of the neural network includes: an acquisition unit and a processing unit. The acquiring unit is used to acquire the data amount of the input data of the neural network, the first feature of the internal memory in the chip running the neural network, and the second feature of multiple layers in the neural network. The processing unit is configured to determine the batch size of each layer in the multiple layers according to the data amount, the first characteristic, and the second characteristic, and the batch sizes of at least two layers in the multiple layers are different. These modules can perform the corresponding functions in the method example of the first aspect. For details, please refer to the detailed description in the method example, which will not be repeated here.
第三方面,提供了一种神经网络的数据处理装置,该神经网络的数据处理装置可以为处理器。例如,图形处理器(Graphics Processing Unit,GPU)、神经网络处理器(Neural-network Processing Unit,NPU)、高级精简指令集处理器(Advanced RISC Machines,ARM)等,可选的,该神经网络的数据处理装置还包括存储器。其中,该存储器用于存储计算机程序或指令,处理器与存储器耦合,当处理器执行所述计算机程序或指令时,使神经网络的数据处理装置执行上述方法实施例中由处理器所执行的方法。In a third aspect, a neural network data processing device is provided, and the neural network data processing device may be a processor. For example, graphics processor (Graphics Processing Unit, GPU), neural network processor (Neural-network Processing Unit, NPU), advanced reduced instruction set processor (Advanced RISC Machines, ARM), etc., optionally, the neural network The data processing device also includes a memory. Wherein, the memory is used to store computer programs or instructions, and the processor is coupled with the memory. When the processor executes the computer programs or instructions, the data processing device of the neural network is caused to execute the method executed by the processor in the above method embodiments. .
第四方面,提供了一种计算机程序产品,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码运行时,使得上述第一方面中由处理器执行的方法被执行。In a fourth aspect, a computer program product is provided. The computer program product includes: computer program code, which when the computer program code runs, causes the method executed by the processor in the first aspect to be executed.
第五方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于实现上述第一方面的方法中处理器的功能。在一种可能的设计中,所述芯片系统还包括存储器,用于保存程序指令或数据中至少一个。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。In a fifth aspect, the present application provides a chip system, the chip system includes a processor, and is configured to implement the function of the processor in the method of the first aspect. In a possible design, the chip system further includes a memory for storing at least one of program instructions or data. The chip system can be composed of chips, and can also include chips and other discrete devices.
第六方面,本申请提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,当该计算机程序被运行时,实现上述第一方面中由处理器执行的方法。In a sixth aspect, the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed, the method executed by the processor in the first aspect described above is implemented.
本申请中,处理器和神经网络的数据处理装置的名字对设备本身不构成限定,在实际实现中,这些设备可以以其他名称出现。只要各个设备的功能和本申请类似,属于本申请权利要求及其等同技术的范围之内。In this application, the names of the processor and the data processing device of the neural network do not constitute a limitation on the device itself. In actual implementation, these devices may appear under other names. As long as the function of each device is similar to that of this application, it falls within the scope of the claims of this application and its equivalent technologies.
附图说明Description of the drawings
图1为本申请一实施例提供的神经网络的原理示意图;FIG. 1 is a schematic diagram of the principle of a neural network provided by an embodiment of this application;
图2为本申请一实施例提供的神经网络系统的结构示意图;FIG. 2 is a schematic structural diagram of a neural network system provided by an embodiment of this application;
图3为本申请一实施例提供的神经网络芯片的结构示意图;FIG. 3 is a schematic diagram of the structure of a neural network chip provided by an embodiment of the application;
图4为本申请一实施例提供的处理器件的结构示意图;4 is a schematic structural diagram of a processing device provided by an embodiment of this application;
图5为本申请一实施例提供的神经网络中的层的结构示意图;FIG. 5 is a schematic diagram of the structure of layers in a neural network provided by an embodiment of this application;
图6为本申请一实施例提供的重叠问题的示意图;FIG. 6 is a schematic diagram of the overlap problem provided by an embodiment of the application;
图7为本申请一实施例提供的子图的示意图;FIG. 7 is a schematic diagram of a sub-picture provided by an embodiment of this application;
图8为本申请一实施例提供的图的示意图;FIG. 8 is a schematic diagram of a diagram provided by an embodiment of the application;
图9为本申请一实施例提供的子图间的层间数据进行聚集处理的示意图;FIG. 9 is a schematic diagram of aggregation processing of inter-layer data between sub-pictures according to an embodiment of the application; FIG.
图10为本申请一实施例提供的子图间的层间数据进行散开处理的示意图;FIG. 10 is a schematic diagram of dispersing processing of inter-layer data between sub-pictures provided by an embodiment of the application; FIG.
图11为本申请一实施例提供的图的处理的示意图;FIG. 11 is a schematic diagram of the processing of a graph provided by an embodiment of this application;
图12为本申请一实施例提供的神经网络的数据处理方法流程图;FIG. 12 is a flowchart of a neural network data processing method provided by an embodiment of the application;
图13为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 13 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图14为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 14 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图15为本申请一实施例提供的神经网络处理数据的过程示意图;FIG. 15 is a schematic diagram of a process of neural network processing data provided by an embodiment of this application;
图16为本申请一实施例提供的神经网络的数据处理装置结构示意图;16 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application;
图17为本申请一实施例提供的神经网络的数据处理装置结构示意图。FIG. 17 is a schematic structural diagram of a neural network data processing device provided by an embodiment of the application.
具体实施方式detailed description
本申请说明书和权利要求书及上述附图中的术语“第一”、“第二”和“第三”等是用于区别不同对象,而不是用于限定特定顺序。在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。The terms "first", "second", and "third" in the specification and claims of this application and the above-mentioned drawings are used to distinguish different objects, rather than to limit a specific order. In the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations, or illustrations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as being more preferable or advantageous than other embodiments or design solutions. To be precise, words such as "exemplary" or "for example" are used to present related concepts in a specific manner.
神经网络(neural network,NN)也可以称为人工神经网络(artificial neural network,ANN)或类神经网络。在机器学习和认知科学领域,神经网络是一种模仿生物神经网络(动物的中枢神经系统,特别是大脑)的结构和功能的数学模型或计算模型,用于对函数进行估计或近似。神经网络可以包括卷积神经网络(convolutional neural network,CNN)、深度神经网络(deep neural network,DNN)、多层感知器(multilayer perceptron,MLP)和循环神经网络(recurrent neural network,RNN)等神经网络。Neural network (NN) may also be called artificial neural network (ANN) or similar neural network. In the field of machine learning and cognitive science, a neural network is a mathematical model or calculation model that imitates the structure and function of a biological neural network (an animal's central nervous system, especially the brain), and is used to estimate or approximate functions. Neural networks can include convolutional neural network (convolutional neural network, CNN), deep neural network (deep neural network, DNN), multilayer perceptron (multilayer perceptron, MLP) and recurrent neural network (recurrent neural network, RNN), etc. The internet.
(1)神经网络(1) Neural network
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元。该运算单元的输出满足如下公式(1)。 A neural network can be composed of neural units, which can refer to an arithmetic unit that takes x s and intercept 1 as inputs. The output of this arithmetic unit satisfies the following formula (1).
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$    (1)
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。 Among them, s=1, 2,...n, n is a natural number greater than 1, W s is the weight of x s , and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field. The local receptive field can be a region composed of several neural units.
如图1所示,为本申请一实施例提供的神经网络的原理示意图。该神经网络100具有N个处理层,N≥3且N取自然数。该神经网络的第一层为输入层110,负责接收输入信号,该神经网络的最后一层为输出层130,输出神经网络的处理结果。除去第一层和最后一层的其他层为中间层140,这些中间层140共同组成隐藏层120,隐藏层120中的每一层中间层140既可以接收输入信号,也可以输出信号。隐藏层120负责输入信号的处理过程。每一层代表了信号处理的一个逻辑级别,通过多个层,数据信号可经过多级逻辑的处理。As shown in FIG. 1, it is a schematic diagram of the principle of a neural network provided by an embodiment of this application. The neural network 100 has N processing layers, N≧3 and N takes a natural number. The first layer of the neural network is the input layer 110, which is responsible for receiving input signals, and the last layer of the neural network is the output layer 130, which outputs the processing results of the neural network. The other layers excluding the first and last layers are intermediate layers 140. These intermediate layers 140 collectively form the hidden layer 120. Each intermediate layer 140 in the hidden layer 120 can receive input signals and output signals. The hidden layer 120 is responsible for the processing of the input signal. Each layer represents a logic level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
在一些可行的实施例中该神经网络的输入信号可以是视频信号、语音信号、文本信号、图像信号、温度信号等各种形式的信号。在本实施例中,被处理的图像信号可以是相机(图像传感器)拍摄的风景信号、显监控设备捕捉的社区环境的图像信号以及门禁系统获取的人脸的面部信号等各类传感器信号。该神经网络的输入信号还包括其他各种计算机可处理的工程信号,在此不再一一列举。若利用神经网络对图像信号进行深度学习,可提高图像质量。In some feasible embodiments, the input signal of the neural network may be a signal in various forms such as a video signal, a voice signal, a text signal, an image signal, and a temperature signal. In this embodiment, the processed image signal may be various sensor signals such as a landscape signal taken by a camera (image sensor), an image signal of a community environment captured by a display monitoring device, and a facial signal of a human face obtained by an access control system. The input signal of the neural network also includes various other engineering signals that can be processed by computers, which will not be listed here. If the neural network is used for deep learning of the image signal, the image quality can be improved.
(2)深度神经网络(2) Deep neural network
深度神经网络也称多层神经网络,可以理解为具有多层隐藏层的神经网络。按照不同层的位置对深度神经网络进行划分,深度神经网络内部的神经网络可以分为三类:输入层,隐藏层和输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐藏层。层与层之间是全连接的,也就是说,第i层的任意一个神经元与第i+1层的任意一个神经元相连。Deep neural network is also called multi-layer neural network, which can be understood as a neural network with multiple hidden layers. The deep neural network is divided according to the position of different layers. The neural network inside the deep neural network can be divided into three categories: input layer, hidden layer and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the number of layers in the middle are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the i+1-th layer.
虽然深度神经网络看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:y=α(Wx+b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于深度神经网络的层数多,系数W和偏移向量b的数量也比较多。这些参数在深度神经网络中的定义如下所述:以系数W为例:假设在一个三层的深度神经网络中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为
$W_{24}^{3}$
其中,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。
Although the deep neural network looks complicated, it is not complicated in terms of the work of each layer. Simply put, it is the following linear relationship expression: y=α(Wx+b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficient), and α() is the activation function. Each layer is just to get the output vector y after such a simple operation on the input vector x. Due to the large number of layers of the deep neural network, the number of coefficients W and offset vectors b is also relatively large. The definition of these parameters in a deep neural network is as follows: Take the coefficient W as an example: suppose that in a three-layer deep neural network, the fourth neuron of the second layer to the second neuron of the third layer The linear coefficient is defined as
Figure PCTCN2020093624-appb-000002
Among them, the superscript 3 represents the number of layers where the coefficient W is located, and the subscript corresponds to the output third-level index 2 and the input second-level index 4.
综上,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为
Figure PCTCN2020093624-appb-000003
In summary, the coefficients from the kth neuron in the L-1th layer to the jth neuron in the Lth layer are defined as
Figure PCTCN2020093624-appb-000003
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐藏层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。It should be noted that there is no W parameter in the input layer. In deep neural networks, more hidden layers make the network more capable of portraying complex situations in the real world. In theory, a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is also the process of learning a weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by many layers of vectors W).
(3)卷积神经网络(3) Convolutional neural network
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无 关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。Convolutional neural network is a deep neural network with convolutional structure. The convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be regarded as a filter. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can be connected to only part of the neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
如图2所示,为本申请一实施例提供的神经网络系统的结构示意图。神经网络系统200包括主机210以及神经网络电路220。神经网络电路220通过主机接口与主机210连接。主机接口可以包括标准的主机接口以及网络接口(network interface)。例如,主机接口可以包括快捷外设互联标准(peripheral component interconnect express,PCIe)接口。如图2所示,神经网络电路220可以通过PCIe总线230与主机210连接。因此,数据可以通过PCIe总线230输入至神经网络电路220中,并通过PCIe总线230接收神经网络电路220处理完成后的数据。并且,主机210也可以通过主机接口监测神经网络电路220的工作状态。As shown in FIG. 2, it is a schematic structural diagram of a neural network system provided by an embodiment of this application. The neural network system 200 includes a host 210 and a neural network circuit 220. The neural network circuit 220 is connected to the host 210 through a host interface. The host interface may include a standard host interface and a network interface (network interface). For example, the host interface may include a peripheral component interconnect express (PCIe) interface. As shown in FIG. 2, the neural network circuit 220 may be connected to the host 210 through the PCIe bus 230. Therefore, data can be input to the neural network circuit 220 via the PCIe bus 230, and data processed by the neural network circuit 220 can be received via the PCIe bus 230. In addition, the host 210 can also monitor the working status of the neural network circuit 220 through the host interface.
主机210包括处理器(processor)211以及内存212。需要说明的是,除了图2所示的器件外,主机210还可以包括通信接口以及作为外部存储器的磁盘等其他器件,在此不做限制。主机210可以认为是一个集成电路也可以是一个独立的设备。The host 210 includes a processor 211 and a memory 212. It should be noted that, in addition to the devices shown in FIG. 2, the host 210 may also include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited here. The host 210 can be considered as an integrated circuit or an independent device.
处理器211是主机210的运算核心和控制核心(control unit)。处理器211中可以包括多个处理器核(core)。处理器211可以是一块超大规模的集成电路。在处理器211中安装有操作系统和其他软件程序,从而处理器211能够实现对内存212、缓存、磁盘及外设设备(如图2中的神经网络电路)的访问。可以理解的是,在本申请实施例中,处理器211中的处理器核可以是中央处理器(central processing unit,CPU),还可以是其他特定集成电路(application specific integrated circuit,ASIC)。The processor 211 is the computing core and control unit of the host 210. The processor 211 may include multiple processor cores (cores). The processor 211 may be a very large-scale integrated circuit. An operating system and other software programs are installed in the processor 211, so that the processor 211 can implement access to the memory 212, cache, disk, and peripheral devices (such as the neural network circuit in FIG. 2). It can be understood that, in the embodiment of the present application, the processor core in the processor 211 may be a central processing unit (CPU), or may also be other application specific integrated circuits (ASICs).
内存212是主机210的主存。内存212通过双倍速率(double data rate,DDR)总线和处理器211相连。内存212通常用来存放操作系统中各种正在运行的软件、输入和输出数据以及与外部存储器交换的信息等。为了提高处理器211的访问速度,内存212需要具备访问速度快的优点。在传统的计算机系统架构中,通常采用动态随机存取存储器(dynamic random access memory,DRAM)作为内存212。处理器211能够通过内存控制器(图2中未示出)高速访问内存212,对内存212中的任意一个存储单元进行读操作和写操作。The memory 212 is the main memory of the host 210. The memory 212 is connected to the processor 211 via a double data rate (DDR) bus. The memory 212 is generally used to store various running software in the operating system, input and output data, and information exchanged with an external memory. In order to increase the access speed of the processor 211, the memory 212 needs to have the advantage of fast access speed. In a traditional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 212. The processor 211 can access the memory 212 at a high speed through a memory controller (not shown in FIG. 2), and perform a read operation and a write operation on any storage unit in the memory 212.
神经网络电路220可以是一种运行神经网络的芯片。神经网络电路220是由多个神经网络芯片(chip)组成的芯片阵列。例如,如图2所示,神经网络电路220包括多个进行数据处理的神经网络芯片(chip)221和多个路由器222。为了描述方便,本申请实施例将神经网络芯片221简称为芯片221。所述多个芯片221通过路由器222相互连接。例如,一个芯片221可以与一个或多个路由器222连接。多个路由器222可以组成一种或多种网络拓扑。芯片221之间可以通过所述多种网络拓扑进行数据传输。神经网络电路220还可以包括存储器223、输入端口224和输出端口225等其他器件。存储器用于存储数据、计算机程序和指令。The neural network circuit 220 may be a chip that runs a neural network. The neural network circuit 220 is a chip array composed of a plurality of neural network chips. For example, as shown in FIG. 2, the neural network circuit 220 includes a plurality of neural network chips 221 for data processing and a plurality of routers 222. For the convenience of description, the neural network chip 221 is referred to as the chip 221 for short in the embodiment of the present application. The multiple chips 221 are connected to each other through a router 222. For example, one chip 221 may be connected to one or more routers 222. Multiple routers 222 can form one or more network topologies. The chips 221 can transmit data through the multiple network topologies described above. The neural network circuit 220 may also include other devices such as a memory 223, an input port 224, and an output port 225. The memory is used to store data, computer programs and instructions.
图3为本申请一实施例提供的神经网络芯片的结构示意图。芯片221中包括多个路由器310,每个路由器310可以连接一个瓦片(tile)320。实际应用中,一个路由器310还可以连接多个瓦片320。如图3所示,每个瓦片320可以包括输入输出接口(TxRx)321、交换装置322、多个处理器件(processing element,PE)323和存储器324。输 入输出接口321用于接收从路由器310输入到瓦片320的数据,或者输出瓦片320的计算结果。换一种表达方式,输入输出接口321用于实现瓦片320和路由器310之间的数据传输。交换装置322连接输入输出接口321和多个处理器件323。交换装置322用于实现输入输出接口321和多个处理器件323之间的数据传输。存储器324用于存储数据、计算机程序和指令。每个瓦片320还可以包括控制器325,控制器325用于控制输入输出接口321和多个处理器件323,使系统正常工作。每个处理器件323可以包括一个或多个计算引擎(computing engine)326。一个或多个计算引擎326用于实现对输入到计算引擎326中的数据进行神经网络计算。例如,可以对输入到瓦片320的数据与瓦片320中预设的卷积核进行乘加运算。计算引擎326的计算结果可以通过交换装置322和输入输出接口321发送给其他瓦片320。实际应用中,一个计算引擎326可以包括实现卷积、池化(pooling)或其他神经网络操作的模块。在此,不对计算引擎326的具体电路或功能进行限定。为了描述简便,在本申请实施例中,将计算引擎简称为引擎(engine)。FIG. 3 is a schematic structural diagram of a neural network chip provided by an embodiment of the application. The chip 221 includes a plurality of routers 310, and each router 310 can be connected to a tile 320. In practical applications, one router 310 can also connect multiple tiles 320. As shown in FIG. 3, each tile 320 may include an input/output interface (TxRx) 321, a switching device 322, multiple processing elements (PE) 323, and a memory 324. The input/output interface 321 is used to receive data input from the router 310 to the tile 320, or to output the calculation result of the tile 320. To put it another way, the input/output interface 321 is used to implement data transmission between the tile 320 and the router 310. The switching device 322 connects the input/output interface 321 and a plurality of processing devices 323. The switching device 322 is used to implement data transmission between the input/output interface 321 and the multiple processing devices 323. The memory 324 is used to store data, computer programs and instructions. Each tile 320 may also include a controller 325, which is used to control the input/output interface 321 and multiple processing devices 323 to make the system work normally. Each processing device 323 may include one or more computing engines 326. One or more calculation engines 326 are used to implement neural network calculations on the data input to the calculation engine 326. For example, the data input to the tile 320 and the preset convolution kernel in the tile 320 may be multiplied and added. The calculation result of the calculation engine 326 can be sent to other tiles 320 through the switching device 322 and the input/output interface 321. In practical applications, a calculation engine 326 may include modules that implement convolution, pooling, or other neural network operations. Here, the specific circuit or function of the calculation engine 326 is not limited. For simplicity of description, in the embodiments of the present application, the calculation engine is referred to as engine for short.
如图4所示,为本申请一实施例提供的处理器件的结构示意图。处理器件323还可以包括控制器327和总线328。控制器327用于接收数据,并调度处理器件323内的一个或多个引擎326处理数据,使系统正常工作。多个引擎326通过总线328进行数据传输。引擎326连接一个或多个独占的存储器3210。可选的,多个引擎326还可以共享一个或多个存储器329。As shown in FIG. 4, it is a schematic structural diagram of a processing device provided by an embodiment of this application. The processing device 323 may also include a controller 327 and a bus 328. The controller 327 is used for receiving data, and scheduling one or more engines 326 in the processing device 323 to process the data, so that the system works normally. The multiple engines 326 perform data transmission through the bus 328. The engine 326 is connected to one or more exclusive memories 3210. Optionally, multiple engines 326 may also share one or more memories 329.
In this document, the memories in the neural network circuit 220 may be cache memories, that is, caches. For example, the memory 223, the memory 324, the memory 329, and the memory 3210 may all be cache memories.
In this document, the cache memories in the neural network circuit 220 are composed of static random access memory (SRAM), whose capacity is relatively small but whose speed is much higher than that of the main memory and close to the speed of the CPU. A cache memory may be an L1 cache, an L2 cache, or an L3 cache. For example, the memory 3210 is an L1 cache; the memory 329 is an L2 cache or an L3 cache; the memory 223 is an L2 cache or an L3 cache; the memory 324 is an L2 cache or an L3 cache.
From the above description, it can be seen that the neural network circuit 220 provided by the embodiments of this application includes a plurality of neural network chips 221, each neural network chip 221 includes a plurality of tiles 320, each tile 320 includes a plurality of processing elements 323, and each processing element 323 includes a plurality of engines 326. Therefore, the neural network system provided by the embodiments of this application may include multiple levels of computing nodes, for example, four levels: the first-level computing node is the chip 221, the second-level computing node is the tile 320 in the chip 221, the third-level computing node is the processing element 323 in the tile 320, and the fourth-level computing node is the engine 326 in the processing element 323.
The neural network system provided by the embodiments of this application may be applied to a mobile terminal, a monitoring terminal, a server, or the like, to implement related neural network operations.
Those skilled in the art know that a neural network includes multiple neural network layers. In the embodiments of this application, a neural network layer is a logical concept: one neural network layer refers to one neural network operation to be performed.
The neural network may include n neural network layers (also referred to as an n-layer neural network), where n is an integer greater than or equal to 2. The first neural network layer and the second neural network layer may be two of the n layers that have an operational dependency. In the embodiments of this application, two neural network layers with a dependency means that the input data of one layer includes the output data of the other layer; two such layers may also be referred to as adjacent layers. Optionally, the input of a neural network layer may come from more than one layer, possibly from the preceding m layers; similarly, the output of a layer may be delivered not only to the next layer but possibly to the following m layers.
FIG. 5 shows some of the neural network layers in a neural network; the layers may include convolutional layers, pooling layers, and so on. The neural network 500 may include a first layer 502, a second layer 504, a third layer 506, a fourth layer 508, a fifth layer 510, up to an nth layer 512. The first layer 502 may perform a convolution operation, the second layer 504 may perform a pooling operation on the output data of the first layer 502, the third layer 506 may perform a convolution operation on the output data of the second layer 504, the fourth layer 508 may perform a convolution operation on the output of the third layer 506, the fifth layer 510 may perform a summation operation on the output data of the second layer 504 and the output data of the fourth layer 508, and so on. It should be understood that FIG. 5 is only a simple example of the layers in a neural network and does not limit the specific operation of each layer; for example, the fourth layer 508 may also perform a pooling operation, and the fifth layer 510 may also perform other neural network operations such as convolution or pooling.
The output data of the first layer 502 is the input data of the second layer 504; therefore, the first layer 502 and the second layer 504 have a dependency. The output data of the second layer 504 is the input data of the third layer 506, so the second layer 504 and the third layer 506 have a dependency. The output data of the third layer 506 is the input data of the fourth layer 508, so the third layer 506 and the fourth layer 508 have a dependency. The input data of the fifth layer 510 includes the output data of the second layer 504 and the output data of the fourth layer 508; therefore, the second layer 504 and the fifth layer 510 also have a dependency, and the fourth layer 508 and the fifth layer 510 also have a dependency.
The computation of each layer in the neural network is performed by computing nodes. In practical applications, different application scenarios require different amounts of computation. Therefore, the computing nodes in the neural network system may be divided at the granularity of chips, tiles, processing elements, or engines according to the actual application, so that computing nodes in different sets process the operations of different neural network layers. In this manner, a computing node referred to in the embodiments of this application may be a chip 221, a tile 320, a processing element 323, or an engine 326.
In the neural network inference process, after the computation of the i-th layer of the neural network is completed, the computation result of the i-th layer (inter-layer data) is temporarily stored in a preset cache. When the computation of the (i+1)-th layer is performed, the computing node reloads the computation result of the i-th layer and the weights of the (i+1)-th layer from the preset cache for computation. The i-th layer is any layer in the neural network. For example, as shown in FIG. 5, after the computation of the second layer 504 is completed, the output data (inter-layer data) of the second layer 504 is temporarily stored in the preset memory 329; when the computation of the fifth layer 510 is performed, the computing node reloads the computation result of the second layer 504 and the weights of the fifth layer 510 from the preset memory 329 for computation.
The preset cache differs depending on the computing node. For example, if the computing node is an engine 326, the preset cache may be the memory 329 or the memory 3210. If the computing node is a processing element 323, the preset cache may be the memory 324. If the computing node is a tile 320, the preset cache may be a memory in the tile 320. If the computing node is a chip 221, the preset cache may be the memory 223.
It should be understood that memory outside the neural network circuit 220 is called external memory; for example, the external memory is the memory 212 shown in FIG. 2. Memory inside the neural network circuit 220 is called internal memory; for example, the internal memory is the memory 223 shown in FIG. 2, the memory 324 shown in FIG. 3, or the memory 329 and the memory 3210 shown in FIG. 4. The so-called external memory refers to memory outside the chip that runs the neural network; for example, the external memory may be a magnetic disk or the memory 212 shown in FIG. 2.
To facilitate understanding of the technical solutions provided by the embodiments of this application, some terms used in the embodiments of this application are explained first.
1) Batch size
Limited by the capacity of the internal memory, the amount of data that each layer in the neural network can process at a time is the batch size corresponding to that layer. The batch corresponding to a batch size may be one picture, multiple pictures, or part of one picture. For example, suppose the capacity of the internal memory is 100. If the cache demand generated by layer 1 (L1) processing 1 picture is 60, then each time layer 1 is scheduled it can process at most 1 picture, and the batch size corresponding to layer 1 is 1 picture. If the data cache demand generated by layer 2 processing 1 picture is 30, then each time layer 2 is scheduled it can process at most 3 pictures, and the batch size corresponding to layer 2 is 3 pictures. The batch size not only affects the usage of the internal memory of the chip running the neural network, but also affects the optimization degree and processing speed of the neural network.
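The per-layer batch sizes in the example above follow directly from the ratio of the internal memory capacity to each layer's per-picture cache demand. The following is a minimal sketch of that calculation, not part of the embodiments; the per-picture demands are assumed to be known, and the function name is illustrative only.

def batch_sizes(capacity, per_picture_demand):
    """capacity: internal memory capacity; per_picture_demand: {layer: cache demand for one picture}."""
    sizes = {}
    for layer, demand in per_picture_demand.items():
        # Largest whole number of pictures whose cache demand still fits.
        # Assumes each layer can hold at least one picture; otherwise the picture
        # itself would have to be split (see the overlap problem below).
        sizes[layer] = capacity // demand
    return sizes

# Numbers from the text: capacity 100, layer 1 needs 60 per picture, layer 2 needs 30.
print(batch_sizes(100, {"L1": 60, "L2": 30}))  # {'L1': 1, 'L2': 3}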
2) Overlap problem
In some scenarios where a neural network processes pictures, limited by the capacity of the internal memory, the data of an entire picture may need to be split into two or more pieces, each used as one batch of input data; each such piece is referred to as non-whole-picture data. A convolutional layer may use a padding algorithm to process non-whole-picture input data: before the computation with the convolution kernel, the size of the input data is artificially enlarged by padding to offset the shrinkage caused by the computation. The padding algorithm may be, for example, zero padding, repeated boundary-value padding, or another method. In other words, if the input data is non-whole-picture data, it needs to be processed with a padding algorithm; if the input data is whole-picture data, no padding algorithm is needed.
Taking the padding algorithm as an example, if a convolutional layer uses padding, then when the convolutional layer is interpreted, the input data needs to be padded first and then flattened. If the stride of the convolution kernel is smaller than the side length of the convolution kernel (which is usually square), the regions covered by the convolution kernel on the original input matrix overlap; when the stride equals the side length of the convolution kernel, no overlap occurs. If the input data size is (w*w), the padded data size is (w+k-s)*(w+k-s), where k is the side length of the convolution kernel, s is the stride of the convolution kernel, and the amount of padding is (k-s).
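As a quick illustration of the relation above, the following sketch computes the padded size for a given input width, kernel side length, and stride; the function name is illustrative and not part of the embodiments.

def padded_size(w, k, s):
    # A (w*w) input padded for a k*k kernel moving with stride s becomes (w+k-s)*(w+k-s);
    # overlap only arises when the stride s is smaller than the kernel side k.
    return w + k - s

# A 14-row slice convolved with a 3*3 kernel at stride 1 needs 16 input rows.
print(padded_size(14, 3, 1))  # 16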
For example, referring to FIG. 6, suppose the layers of a neural network include layer 0, layer 1, layer 2, and layer 3, each with a 3*3 convolution kernel and a stride of 1. Since the stride is smaller than the side length of the kernel, the overlap problem arises when the padding algorithm is used to process the input data. For instance, the whole picture is 56*56, and the rows of the picture are split into 4 parts for processing. If layer 0, layer 1, and layer 2 are scheduled as one layer group, layer 2 must output 14 rows of data, that is, the output data size of the layer group is 14*56, so that layer 3 can process one quarter of the picture's rows. The input data of layer 2 then needs 2 additional rows, that is, its input size is 16*56. Correspondingly, the input size of layer 1 is 18*56, and the input size of layer 0 is 20*56. In other words, when the whole picture is split for processing, guaranteeing the output size increases the cache demand of the layers in the layer group. Moreover, the more layers a layer group contains, the more data the earlier layers need to pad; if the internal memory capacity is small, this limits the size of the layer group.
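The growth of the required input rows through the layer group in FIG. 6 can be reproduced by walking the layers backwards from the desired output. Below is a minimal sketch under the same assumptions (3*3 kernels, stride 1); the helper name is illustrative only.

def input_rows_per_layer(output_rows, kernels_and_strides):
    """kernels_and_strides: list of (k, s) from the first to the last layer of the group."""
    rows = output_rows
    needed = []
    for k, s in reversed(kernels_and_strides):
        rows += k - s          # overlap rows that must additionally be supplied
        needed.append(rows)
    return list(reversed(needed))

# Layer group of layers 0-2 in FIG. 6: 14 output rows require 20, 18, and 16 input rows.
print(input_rows_per_layer(14, [(3, 1), (3, 1), (3, 1)]))  # [20, 18, 16]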
3) Subgraph
As described above, a neural network contains multiple layers; it can be described as including multiple layers arranged as a directed graph, where each layer may have a corresponding set of parameters. A subgraph is obtained by dividing the layers of the neural network according to the batch size of each layer. A subgraph contains one or more layers with the same batch size. A subgraph may also be described as a super layer or a layer group, meaning that it contains one layer or several consecutive layers of the neural network.
In some examples, the neural network is scheduled to process input data in units of subgraphs, and the scheduling order of the layers in a subgraph is the same as the scheduling order of the layers in the neural network. The batches corresponding to the batch size of the layers in a subgraph are processed in the order of the layers in the subgraph. The inter-layer data of the multiple layers contained in a subgraph is stored in the internal memory, and the inter-layer data between subgraphs is stored in the internal memory.
For example, FIG. 7 is a schematic diagram of a subgraph according to an embodiment of this application. The subgraph includes layer 0 and layer 1; the batch size of layer 0 and the batch size of layer 1 are both 1. In the following, a batch corresponding to a batch size of 1 may be one picture, multiple pictures, or part of one picture. Layer 0 processes one batch at a time, and layer 1 processes one batch at a time.
Suppose layer 0 and layer 1 in the subgraph process batch A0 and batch A1. Batch A0 and batch A1 may be batches of the input data to be processed by the neural network, or they may be inter-layer data that has already been processed by other layers of the neural network. The batch size of batch A0 and of batch A1 is 1. The execution order of the batches within the subgraph is shown by the bold arrows in the figure. For ease of understanding, the processing of batch A0 and of batch A1 by layer 0 and layer 1 is shown separately.
Layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1. Inter-layer data C0 and inter-layer data C1 may be stored in the internal memory.
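The order in FIG. 7 is simply "for each batch, run every layer of the subgraph before moving to the next batch". A minimal sketch of that loop is shown below; run_layer is a hypothetical helper standing in for whichever computing node executes the layer, and only the ordering is illustrated.

def run_subgraph(layers, batches, run_layer):
    # Inter-layer data produced inside the loop stays in the internal memory;
    # only the subgraph's final outputs (C0, C1 in FIG. 7) remain afterwards.
    outputs = []
    for batch in batches:            # e.g. [A0, A1]
        data = batch
        for layer in layers:         # e.g. [layer0, layer1]
            data = run_layer(layer, data)
        outputs.append(data)         # C0, then C1
    return outputs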
4) Graph
A graph includes one or more subgraphs. A graph may also be described as a super layer or a layer group, meaning that it contains one layer or several consecutive layers of the neural network.
In some embodiments, every subgraph in a graph contains layers with the same batch size. For example, as shown in (a) of FIG. 8, suppose the graph includes subgraph 1 and subgraph 2. Subgraph 1 includes layer 0 and layer 1, and the batch size of layer 0 is the same as that of layer 1. Subgraph 2 includes layer 2 and layer 3, and the batch size of layer 2 is the same as that of layer 3. The batch sizes of layer 0 and layer 1 are both one batch, and the batch sizes of layer 2 and layer 3 are both one batch. In summary, all subgraphs included in the graph contain layers with the same batch size.
In other embodiments, at least two of the subgraphs included in a graph contain layers with different batch sizes. As shown in (b) of FIG. 8, suppose the graph includes subgraph 1, subgraph 2, and subgraph 3. Subgraph 1 includes layer 0 and layer 1, whose batch sizes are the same. Subgraph 2 includes layer 2 and layer 3, whose batch sizes are the same. Subgraph 3 includes layer 4 and layer 5, whose batch sizes are the same. The batch sizes of layer 0 and layer 1 are both one batch, the batch sizes of layer 2 and layer 3 are both one batch, and the batch sizes of layer 4 and layer 5 are both two batches. In summary, the batch size of the layers in subgraph 3 differs from the batch size of the layers in subgraph 1, and also differs from the batch size of the layers in subgraph 2.
In some examples, the neural network is scheduled to process input data in units of graphs, and the scheduling order of the layers in a graph is the same as the scheduling order of the layers in the neural network. The scheduling order of the subgraphs contained in a graph is determined according to the batch sizes and the scheduling order of the first and last layers of the subgraphs. When multiple layers with different batch sizes in the neural network are scheduled as one graph, some data remains in the cache space of the internal memory, which creates additional internal memory cache demand. Inter-layer data between graphs is stored in the external memory. The scheduling process of the layers within a graph is described below under the gather and scatter problems.
5) Gather problem
In one possible implementation, the inter-layer data between the subgraphs contained in a graph is gathered. For example, FIG. 9 is a schematic diagram of gathering the inter-layer data of subgraphs according to an embodiment of this application. Suppose the graph includes subgraph 0 and subgraph 1. Subgraph 0 includes layer 0 and layer 1, whose batch sizes are both 1; layer 0 processes one batch at a time, and layer 1 processes one batch at a time. Subgraph 1 includes layer 2 and layer 3, whose batch sizes are both 2. In the following, a batch corresponding to a batch size of 2 may be two pictures, multiple pictures, or part of one picture. Layer 2 processes two batches at a time, and layer 3 processes two batches at a time. Suppose the graph processes batch A0 and batch A1. Batch A0 and batch A1 may be batches of the input data to be processed by the neural network, or inter-layer data already processed by other layers of the neural network; the batch size of batch A0 and of batch A1 is 1. Since layer 0 and layer 1 in subgraph 0 process one batch at a time while layer 2 and layer 3 in subgraph 1 process two batches at a time, subgraph 0 can first process batch A0 and batch A1 separately, and then subgraph 1 processes the inter-layer data of batch A0 and of batch A1 output by subgraph 0. The execution order of the batches within the graph is shown by the bold arrows in the figure. For ease of understanding, the processing of batch A0 and of batch A1 by layer 0 and layer 1 is shown separately.
For subgraph 0, layer 0 first processes batch A0 to obtain inter-layer data B0, and layer 1 processes inter-layer data B0 to obtain inter-layer data C0. Then layer 0 processes batch A1 to obtain inter-layer data B1, and layer 1 processes inter-layer data B1 to obtain inter-layer data C1. Inter-layer data C0 and inter-layer data C1 may be stored in the internal memory.
For subgraph 1, layer 2 can obtain inter-layer data C0 and inter-layer data C1 from the internal memory; at this point, inter-layer data C0 and inter-layer data C1 can be combined into inter-layer data (C0, C1). Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1), and layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1). Inter-layer data (E0, E1) may be stored in the internal memory.
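The gather step of FIG. 9 amounts to running the small-batch subgraph several times, buffering its outputs, and handing the concatenation to the large-batch subgraph. A minimal sketch under those assumptions follows; run_layer and concat are hypothetical helpers, and only the ordering and buffering are illustrated.

def run_with_gather(subgraph0, subgraph1, batches, run_layer, concat):
    # Subgraph 0 (batch size 1) runs once per batch; its outputs are held in
    # internal memory until subgraph 1 (batch size 2) can consume them together.
    held = []                                  # C0, C1 kept in internal memory
    for batch in batches:                      # A0, then A1
        data = batch
        for layer in subgraph0:
            data = run_layer(layer, data)
        held.append(data)
    data = concat(held)                        # (C0, C1)
    for layer in subgraph1:
        data = run_layer(layer, data)          # (D0, D1), then (E0, E1)
    return data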
6) Scatter problem
In another possible implementation, the inter-layer data between the subgraphs contained in a graph is scattered. For example, FIG. 10 is a schematic diagram of scattering the inter-layer data of subgraphs according to an embodiment of this application. Suppose the graph includes subgraph 1 and subgraph 2. Subgraph 1 includes layer 2 and layer 3, whose batch sizes are both two batches; layer 2 processes two batches at a time, and layer 3 processes two batches at a time. Subgraph 2 includes layer 4 and layer 5, whose batch sizes are both one batch; layer 4 processes one batch at a time, and layer 5 processes one batch at a time. Since layer 2 and layer 3 in subgraph 1 process two batches at a time while layer 4 and layer 5 in subgraph 2 process one batch at a time, subgraph 1 can first finish processing the inter-layer data (C0, C1), and then subgraph 2 processes the inter-layer data E0 and E1 output by subgraph 1. The execution order of the batches within the graph is shown by the bold arrows in the figure. For ease of understanding, the processing of inter-layer data E0 and of inter-layer data E1 by layer 4 and layer 5 is shown separately.
For subgraph 1, layer 2 can obtain the inter-layer data (C0, C1) of batch A0 and batch A1 from the internal memory. Layer 2 processes (C0, C1) to obtain inter-layer data (D0, D1), and layer 3 processes inter-layer data (D0, D1) to obtain inter-layer data (E0, E1). At this point, inter-layer data (E0, E1) may be stored in the internal memory.
For subgraph 2, layer 4 first obtains the inter-layer data (E0, E1) from the internal memory and divides it into inter-layer data E0 and inter-layer data E1. Layer 4 first processes E0 to obtain inter-layer data F0, and layer 5 processes F0 to obtain inter-layer data G0. Then layer 4 processes E1 to obtain inter-layer data F1, and layer 5 processes F1 to obtain inter-layer data G1. Inter-layer data G0 and inter-layer data G1 may be stored in the internal memory.
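The scatter step of FIG. 10 is the mirror image of the gather step: the large-batch subgraph runs once, its output is split, and the small-batch subgraph runs once per piece while the remaining pieces stay buffered. A minimal sketch under those assumptions follows; run_layer and split are hypothetical helpers.

def run_with_scatter(subgraph1, subgraph2, combined_batch, run_layer, split):
    # Subgraph 1 (batch size 2) runs once on the combined data; while subgraph 2
    # (batch size 1) processes E0, the remaining piece E1 stays in internal memory,
    # which is the additional cache demand described in the text.
    data = combined_batch                      # (C0, C1)
    for layer in subgraph1:
        data = run_layer(layer, data)          # (D0, D1), then (E0, E1)
    results = []
    for piece in split(data):                  # E0, then E1
        out = piece
        for layer in subgraph2:
            out = run_layer(layer, out)        # F0/G0, then F1/G1
        results.append(out)
    return results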
In another possible implementation, multiple graphs are scheduled for processing in the order of the layers of the neural network. Note that the data processed by a later graph is the data output by the previous graph. Dividing the layers of the neural network into multiple graphs and processing the batches graph by graph improves the utilization of the internal memory and the processing performance of the entire neural network.
For example, FIG. 11 is a schematic diagram of graph processing according to an embodiment of this application, where the horizontal axis represents the layers of the neural network and the vertical axis represents the batches. Suppose the neural network includes 12 layers. The batch sizes of layer 0, layer 1, layer 4, layer 5, layer 10, and layer 11 are all one batch, that is, these layers each process one batch at a time. The batch sizes of layer 2, layer 3, layer 6, and layer 7 are all two batches, that is, these layers each process two batches at a time. The batch sizes of layer 8 and layer 9 are both four batches, that is, these layers each process four batches at a time. The 12 layers of the neural network are divided into two graphs, graph 0 and graph 1. Graph 0 includes layer 0 to layer 5, and graph 1 includes layer 6 to layer 11. The numbers in the boxes indicate the execution order of the batches, executed from the smallest number to the largest. Graph 0 is processed first, and then graph 1.
For graph 0, layer 0 processes batch 0, and layer 1 processes the inter-layer data of batch 0 output by layer 0 to obtain the inter-layer data of batch 0 output by layer 1, which is stored in the internal memory. Layer 0 then processes batch 1, and layer 1 processes the inter-layer data of batch 1 output by layer 0 to obtain the inter-layer data of batch 1 output by layer 1, which is stored in the internal memory. The inter-layer data of batch 0 and batch 1 is taken out of the internal memory; layer 2 processes the inter-layer data of batch 0 and batch 1, and layer 3 processes the inter-layer data of batch 0 and batch 1 output by layer 2 to obtain the inter-layer data of batch 0 and batch 1 output by layer 3. Layer 4 processes the inter-layer data of batch 0 output by layer 3, and layer 5 processes the inter-layer data of batch 0 output by layer 4; then layer 4 processes the inter-layer data of batch 1 output by layer 3, and layer 5 processes the inter-layer data of batch 1 output by layer 4.
Similarly, layer 0 to layer 5 process batch 2 and batch 3 in the same order as batch 0 and batch 1. The batches processed by graph 1 are the data output by graph 0. For graph 1, layer 6 processes the inter-layer data of batch 0 and batch 1 output by layer 5, and layer 7 processes the inter-layer data of batch 0 and batch 1 output by layer 6; the inter-layer data of batch 0 and batch 1 output by layer 7 is stored in the internal memory. Layer 6 then processes the inter-layer data of batch 2 and batch 3 output by layer 5, and layer 7 processes the inter-layer data of batch 2 and batch 3 output by layer 6; the inter-layer data of batch 2 and batch 3 output by layer 7 is stored in the internal memory. The inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7 is taken out of the internal memory; layer 8 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 7, layer 9 processes the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 8, and the inter-layer data of batch 0, batch 1, batch 2, and batch 3 output by layer 9 is stored in the internal memory. Layer 10 then processes the inter-layer data of batch 0 output by layer 9, and layer 11 processes the inter-layer data of batch 0 output by layer 10; the same is done in turn for batch 1, batch 2, and batch 3.
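The interleaving in FIG. 11 can be reproduced mechanically: each graph is driven in windows equal to the largest batch size among its subgraphs, and within a window each subgraph consumes the data in chunks of its own batch size. The sketch below illustrates this under those assumptions; run_layer is a hypothetical helper, only the ordering is shown (data movement is omitted), and the batch-size assignments are the ones from FIG. 11.

def run_graphs(graphs, batches, run_layer):
    # A graph is a list of (layers, batch_size) subgraphs in network order.
    # Data crossing a graph boundary would go to the external memory.
    for graph in graphs:
        drive = max(bs for _, bs in graph)            # driving batch size of the graph
        for start in range(0, len(batches), drive):
            window = batches[start:start + drive]
            for layers, bs in graph:                  # subgraphs in network order
                for i in range(0, len(window), bs):   # gather/scatter inside the window
                    chunk = window[i:i + bs]
                    for layer in layers:
                        run_layer(layer, chunk)       # inter-layer data stays on chip

# FIG. 11: graph 0 = layers 0-5, graph 1 = layers 6-11, four batches in total.
graphs = [
    [([0, 1], 1), ([2, 3], 2), ([4, 5], 1)],
    [([6, 7], 2), ([8, 9], 4), ([10, 11], 1)],
]
run_graphs(graphs, [0, 1, 2, 3],
           lambda layer, chunk: print("layer", layer, "-> batches", chunk))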
Next, the data processing method of the neural network is described in detail with reference to FIG. 12. Here, the processor 211 performing the data processing method of the neural network is taken as an example. The internal memory includes the memory 223, the memory 324, the memory 329, and the memory 3210; the external memory is the memory 212. The computing nodes complete the operations of the neural network according to the determined batch sizes; a computing node is a chip 221, a tile 320, a processing element 323, or an engine 326. As shown in FIG. 12, the data processing method of the neural network includes S1201 and S1202.
S1201: The processor 211 obtains the data amount of the input data, a first feature of the internal memory in the chip running the neural network, and a second feature of the multiple layers of the neural network. The input data is the data received by the input layer of the neural network; for example, the input data is data in a data set. Taking image processing as an example, the input data may be 32 pictures in a data set.
The first feature includes at least one of the distribution of the internal memory within the chip and the capacity of the internal memory. It should be understood that the distribution of the internal memory within the chip includes the number of memories in the chip running the neural network and the connection relationships between the memories and the computing nodes. The chip contains many memories of considerable capacity, but these storage resources are not all used for neural network computation every time; the storage resources allocated to running the neural network vary, so the neural network configuration needs to be optimized dynamically according to the number of memories and the connection relationships between the memories and the computing nodes, that is, according to the distribution. For example, the distribution includes the number of memories 223, memories 324, memories 329, and memories 3210 included in the neural network circuit 220, as well as the connection relationship between the memory 223 and the chip 221, between the memory 324 and the processing element 323, between the memory 329 and the engine 326, and between the memory 3210 and the engine 326.
The capacity of the internal memory includes the capacity of all the memories in the chip running the neural network. The chip contains many memories of considerable capacity, but these storage resources are not all used for neural network computation every time; the storage resources allocated to running the neural network vary, so the neural network configuration needs to be optimized dynamically according to the capacity. For example, the capacity includes the capacity of the memory 223, the memory 324, the memory 329, and the memory 3210 included in the neural network circuit 220. It should be understood that the capacity of the internal memory may refer to the available capacity of the internal memory.
The second feature includes the connection relationships among the multiple layers and the computation-related parameters of each of the multiple layers. The computing resources in the chip change and are not all used for neural network computation every time; therefore, the connection relationships among the layers and the computation-related parameters of each layer also change with demand, and the neural network configuration needs to be optimized dynamically according to these changes. It should be understood that the connection relationships among the multiple layers include the connection relationship between each layer of the neural network and at least one other layer. Depending on the function performed by the neural network, the connection relationships among its layers differ; this application does not limit the connection relationships of the layers in the neural network. The computation-related parameters of each layer include the dimensions of the input data and the output data, offset parameters, convolution kernels, quantization parameters, normalization parameters, and the like.
The first feature and the second feature may be stored in the memory 212 of the host 210. The processor 211 may obtain the features of the internal memory and the features of the multiple layers of the neural network from the memory 212 of the host 210.
S1202: The processor 211 determines, according to the data amount, the first feature, and the second feature, the batch sizes of the multiple layers, N subgraphs, M graphs, and the storage locations of the inter-layer data, where the batch sizes of at least two of the multiple layers are different.
Specifically, the processor 211 may use an iterative algorithm to determine the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data according to the data amount, the first feature, and the second feature. The optimization algorithm may be a dynamic programming algorithm, a greedy algorithm, or a genetic algorithm. It should be understood that the processor does not obtain the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data from a single computation based on the data amount, the first feature, and the second feature; rather, it runs an iterative algorithm over multiple trials and selects, from the results of those trials, the batch sizes of the multiple layers, the N subgraphs, the M graphs, and the storage locations of the inter-layer data that ensure the utilization of the internal memory and the computational efficiency of the chip running the neural network. N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N ≥ M. For example, N=2 and M=1 means that the layers of the neural network are divided into 2 subgraphs and the 2 subgraphs are divided into one graph. As another example, N=2 and M=2 means that the layers of the neural network are divided into 2 subgraphs and the 2 subgraphs are divided into 2 graphs. As another example, N=3 and M=2 means that the layers of the neural network are divided into 3 subgraphs and the 3 subgraphs are divided into 2 graphs.
For example, the processor 211 first determines the batch size of each layer in the neural network based on the capacity of the internal memory, and then merges layers with the same batch size into subgraphs. It then merges multiple subgraphs into graphs based on the cache demands of the subgraphs and the capacity of the internal memory; a graph obtained in this way may contain subgraphs with different batch sizes. That is, when the neural network is subsequently scheduled in units of graphs, the input data is processed with different batch sizes, so that the cache demand of each graph does not exceed the capacity of the internal memory while the utilization of the on-chip memory and the running performance of the hardware are improved.
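The paragraph above describes one possible flow: per-layer batch sizes first, then fusing equal-batch-size layers into subgraphs, then merging subgraphs into graphs as long as the combined cache demand fits. Below is a minimal greedy sketch of that flow, offered only as an illustration; cache_demand is a hypothetical cost function that would have to include the extra buffering caused by gather/scatter, and the actual embodiments may instead use dynamic programming or a genetic algorithm as stated above.

def build_subgraphs(batch_sizes):
    """batch_sizes: per-layer batch sizes in network order; consecutive equal sizes fuse."""
    subgraphs = []
    for layer, bs in enumerate(batch_sizes):
        if subgraphs and subgraphs[-1]["bs"] == bs:
            subgraphs[-1]["layers"].append(layer)
        else:
            subgraphs.append({"layers": [layer], "bs": bs})
    return subgraphs

def build_graphs(subgraphs, capacity, cache_demand):
    """cache_demand(subgraph_list) -> internal-memory demand of scheduling them as one graph.
    Assumes every single subgraph fits on its own, since batch sizes were chosen from the capacity."""
    graphs, current = [], []
    for sg in subgraphs:
        if current and cache_demand(current + [sg]) > capacity:
            graphs.append(current)   # close the current graph and start a new one
            current = []
        current.append(sg)
    if current:
        graphs.append(current)
    return graphs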
It should be understood that the layers in the N subgraphs, connected together, form the complete neural network. Each of the N subgraphs contains one or more layers with the same batch size, and these layers are consecutive layers of the neural network. Different subgraphs may contain the same or different numbers of layers.
The subgraphs in the M graphs, connected together, form the complete neural network. Each of the M graphs includes one or more subgraphs, and different graphs may include the same or different numbers of subgraphs.
In the process of scheduling the neural network for data processing, corresponding neural network operation overheads arise, such as computation time overhead and data transfer time overhead. A preset operation overhead metric of the neural network may be used to measure its performance: if the operation overhead of the neural network is low, its performance is good. As shown in FIG. 13, an example is given of the process by which a layer of the neural network processes data, including a data move-in stage (reading the input data), a computation stage, and a data move-out stage (storing the output data). When the neural network processes a batch of data, part of the data must first be moved in; the overhead of this stage is the head overhead. After that, data move-in, computation, and data move-out proceed in parallel. Finally, the neural network moves out the last computed data and stores it in the storage space; the overhead of this stage is the tail overhead.
In the embodiments of this application, a layer processes data in units of its batch size. When a layer processes one batch of input data, computation time = computation amount of the layer / computing power of the chip running the neural network; data transfer time = (input data amount + output data amount) / (internal memory bandwidth or external memory bandwidth); total time overhead = head overhead + max(computation time, data transfer time) + tail overhead. It can be seen that if the batch size is too small, the time corresponding to the head overhead and the tail overhead may be greater than or equal to the computation time, making the neural network operate inefficiently. The time overhead of a layer in the neural network can be obtained from the storage location of at least one of its input data and output data and from the computing power of the chip running the neural network. The storage location of the data is either the internal memory or the external memory.
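The overhead model above translates directly into a small cost function; the sketch below uses it to compare keeping a layer's data on chip against sending it to the external memory. The bandwidth and overhead figures are placeholders for illustration only, not values from the embodiments.

def layer_time(flops, in_bytes, out_bytes, compute_power, bandwidth, head, tail):
    # total = head + max(computation time, data transfer time) + tail
    compute_time = flops / compute_power
    transfer_time = (in_bytes + out_bytes) / bandwidth   # internal or external bandwidth,
                                                         # depending on where the data is stored
    return head + max(compute_time, transfer_time) + tail

# Placeholder numbers: the same layer costs more when its inter-layer data goes
# off chip, because the external-memory bandwidth is lower.
on_chip = layer_time(1e8, 2e6, 2e6, compute_power=1e12, bandwidth=1e11, head=1e-5, tail=1e-5)
off_chip = layer_time(1e8, 2e6, 2e6, compute_power=1e12, bandwidth=1e10, head=1e-5, tail=1e-5)
print(on_chip, off_chip)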
Since inter-layer data is allowed to be stored in the external memory, the external memory and the internal memory are planned jointly to store the inter-layer data, which reduces the storage space occupied in the internal memory. In addition, because inter-layer data may be stored in the external memory, a larger batch size can be set for the layers of the neural network, which reduces the head overhead incurred by the layers for each batch and improves the computational efficiency of the processor.
In the process of processing the input data of the neural network according to the above division of its layers, the scheduling order of the layers in a graph is determined according to the scheduling order of the subgraphs contained in the graph and the scheduling order of the layers in each subgraph. For example, the scheduling order of the layers in a subgraph is the same as the scheduling order of the layers in the neural network. The batches corresponding to the batch size of the layers in a subgraph are processed in the order of the layers in the subgraph. The scheduling order of the subgraphs contained in a graph is determined according to the batch sizes and the scheduling order of the first and last layers of the subgraphs. The inter-layer data between the subgraphs of a graph is gathered or scattered. For explanations of subgraphs and graphs, refer to the descriptions above.
For example, as shown in FIG. 13, suppose the neural network includes 6 layers in the order layer 0 to layer 5 (L0-L5). The batch sizes corresponding to L0, L1, L4, and L5 are 1, and the batch sizes corresponding to L2 and L3 are 2. Layers with the same batch size form a subgraph: L0 and L1 form subgraph 0, L2 and L3 form subgraph 1, and L4 and L5 form subgraph 2. The subgraphs form a graph, that is, subgraph 0, subgraph 1, and subgraph 2 form a graph. Since the batch size of L0 and L1 is 1, subgraph 0 can process input data of size 1 each time, that is, batch 0 and batch 1 are processed separately. After batch 0 is input to L0 and processed by L0 and L1, the output data of L1 is C0. The batch size of L2 is 2, and C0 corresponds only to batch 0, which does not meet the processing requirement of L2, so C0 must be temporarily stored in the internal memory. Batch 1 is then input to L0 and processed by L0 and L1, and the output data of L1 is C1. At this point, L1 has output two batches of data, which meets the processing requirement of L2. The internal memory contains the two sets of data C0 and C1; after C0 and C1 are aggregated, L2 can process the aggregated C0 and C1. Therefore, if subgraph 0 and subgraph 1 are divided into one graph, then while L0 and L1 are scheduled to process batch 1, C0 occupies cache space in the internal memory, and the data amount of C0 is the additional internal memory cache demand of L0 and L1. In this process, the input-data cache demand of L0 is the data amount corresponding to (C0+A1) and its output-data cache demand is the data amount corresponding to (C0+B1); the input-data cache demand of L1 is the data amount corresponding to (C0+B1) and its output-data cache demand is the data amount corresponding to (C0+C1).
If subgraph 1 and subgraph 2 are divided into one graph for scheduling, the scatter problem occurs. As shown in FIG. 13, the input data of L3 is D0 corresponding to batch 0 and D1 corresponding to batch 1, and its output data is E0 corresponding to batch 0 and E1 corresponding to batch 1. The batch size of L4 is 1, so E0 and E1 cannot be processed at the same time. L4 therefore processes E0 first, and E1 is temporarily stored in the internal memory. Then, while L4 and L5 are scheduled to process the data corresponding to batch 0, E1 occupies cache space in the internal memory, and the data amount of E1 is the additional internal memory cache demand of L4 and L5. In this process, the input-data cache demand of L4 is the data amount corresponding to (E1+E0) and its output-data cache demand is the data amount corresponding to (E1+F0); the input-data cache demand of L5 is the data amount corresponding to (E1+F0) and its output-data cache demand is the data amount corresponding to (E1+G0).
It should be noted that, because the inter-layer data of the multiple layers within a subgraph and the inter-layer data between subgraphs are stored in the internal memory and occupy its storage space, the batch sizes of the multiple layers and the storage locations of the inter-layer data are also affected by the division into subgraphs and graphs.
For example, as shown in FIG. 9, because the inter-layer data C0 is kept in the cache while layer 0 and layer 1 are computing batch A1, the available cache of layer 0 and layer 1 becomes smaller, which affects how the input data is split.
As another example, as shown in FIG. 10, because the inter-layer data E1 is kept in the cache and occupies cache space while layer 4 and layer 5 are processing the inter-layer data E0, the available cache of layer 4 and layer 5 becomes smaller, which affects how the input data is split.
Therefore, in the process of dividing the layers into different batch sizes, the additional internal memory cache demand caused by the gather or scatter problem needs to be considered, to determine whether the cache demand of the resulting subgraphs exceeds the capacity of the internal memory.
The neural network data processing method provided by the embodiments of this application splits the input data with comprehensive reference to the data amount of the input data, the first feature, and the second feature, and sets different batch sizes for the layers of the neural network. By setting a reasonable batch size for each layer, the internal memory is fully utilized to store the inter-layer data of the neural network during inference, which reduces the interaction between the chip running the neural network and the external memory, thereby improving the utilization of the internal memory and ensuring the computational efficiency of the chip running the neural network.
In one possible implementation, another computer may perform S1201 and S1202 offline to generate the splitting policy and the execution order for scheduling the layers of the neural network. The splitting policy and the execution order are then configured into the controller in the neural network system, and the controller in the neural network system controls the execution of the splitting policy and the execution order of the layers of the neural network.
In another possible implementation, the controller in the neural network system may perform S1201 and S1202 to generate the splitting policy and the execution order for scheduling the layers of the neural network, and the controller uniformly manages scheduling the layers of the neural network and splitting the multiple batches.
The neural network scheduling method provided by the embodiments of this application is described below with reference to specific examples.
示例一、输入数据为整图数据。Example 1: The input data is the whole image data.
As shown in FIG. 14, based on the internal memory capacity and the overall performance of the neural network, the batch size corresponding to L0 and L1 is determined to be 1 picture, the batch size corresponding to L2, L3, and L4 is 2 pictures, and the batch size corresponding to L5 and L6 is 4 pictures. Using the method provided in the embodiments of this application, L0 and L1 are divided into subgraph 0, L2 to L4 into subgraph 1, and L5 and L6 into subgraph 2. To handle the aggregation problem, based on the internal memory capacity and the overall performance of the neural network, the three subgraphs are grouped into one graph, that is, L0 to L6 form one graph whose cache requirement is less than or equal to the capacity of the internal memory. The graph contains layers with different batch sizes; when the subgraphs of the neural network are scheduled to process the input data, this improves the utilization of the internal memory and the running performance of the chip running the neural network.
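As a minimal sketch of this division step (assuming the per-layer batch sizes have already been determined; the function and variable names are illustrative, not those of the embodiment), consecutive layers sharing a batch size can be collected into subgraphs as follows:

```python
def group_into_subgraphs(batch_sizes):
    """Collect consecutive layers that share a batch size into one subgraph.

    batch_sizes: the batch size chosen for each layer, indexed by layer number.
    Returns a list of (batch_size, [layer indices]) entries, one per subgraph.
    """
    subgraphs = []
    for layer, bs in enumerate(batch_sizes):
        if subgraphs and subgraphs[-1][0] == bs:
            subgraphs[-1][1].append(layer)    # same batch size: extend current subgraph
        else:
            subgraphs.append((bs, [layer]))   # batch size changes: start a new subgraph
    return subgraphs


# Batch sizes of L0-L6 in FIG. 14: 1, 1, 2, 2, 2, 4, 4 pictures.
print(group_into_subgraphs([1, 1, 2, 2, 2, 4, 4]))
# [(1, [0, 1]), (2, [2, 3, 4]), (4, [5, 6])] -> subgraph 0, subgraph 1, subgraph 2
```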
As shown in FIG. 14, assume that the data set contains 8 pictures. L0 is the first layer of the graph and its batch size is 1 picture, so the data set is split into 8 batches of input data (batch 0 to batch 7 in FIG. 14); each batch is the whole-image data of 1 picture, and the batches are fed into L0 one at a time. As shown in FIG. 14, while the current data set is processed, subgraph 0 is scheduled twice for each scheduling of subgraph 1, that is, the scheduling order is L0→L1→L0→L1→L2→L3→L4; and subgraph 1 is scheduled twice for each scheduling of subgraph 2, that is, the scheduling order is L2→L3→L4→L2→L3→L4→L5→L6. Processing the input data of the current data set therefore requires scheduling subgraph 0 eight times, subgraph 1 four times, and subgraph 2 twice.
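The interleaved order described above can be reproduced by a small recursive sketch (illustrative names and a simplified model in which each subgraph must run a fixed number of times before the next subgraph runs once; this is not the embodiment's scheduler):

```python
def schedule(ratios, top_runs):
    """Return the order in which subgraphs are invoked.

    ratios: ratios[k] is how many runs of subgraph k are needed to feed one run
        of subgraph k+1 (for example batch_size[k+1] / batch_size[k]).
    top_runs: how many times the last subgraph runs for the whole data set.
    """
    def run(level):
        # To run subgraph `level` once, first run subgraph `level - 1` enough
        # times to accumulate its input batch.
        order = []
        if level > 0:
            for _ in range(ratios[level - 1]):
                order += run(level - 1)
        order.append(level)
        return order

    last = len(ratios)                      # index of the last subgraph
    full_order = []
    for _ in range(top_runs):
        full_order += run(last)
    return full_order


# FIG. 14: batch sizes 1, 2 and 4 pictures for subgraphs 0, 1 and 2, and an
# 8-picture data set -> ratios [2, 2]; the last subgraph runs 8 / 4 = 2 times.
order = schedule([2, 2], top_runs=2)
print(order)                                       # [0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 2]
print({sg: order.count(sg) for sg in set(order)})  # {0: 8, 1: 4, 2: 2}
```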
Example 2: the input data is non-whole-image data.
As shown in FIG. 15, based on the internal memory capacity and the overall performance of the neural network, the batch size corresponding to L0 and L1 is determined to be 1/4 of a picture, and the batch size corresponding to L2, L3, and L4 is 1/2 of a picture. Using the method provided in the embodiments of this application, L0 and L1 are divided into subgraph 0, and L2 to L4 are divided into subgraph 1. Regarding the overlap problem, as shown in FIG. 15, the input data is non-whole-image data and must be processed with a padding algorithm; the padding data is the shaded part. Based on the internal memory capacity and the overall performance of the neural network, the two subgraphs are grouped into one graph, that is, L0 to L4 form one graph whose cache requirement is less than or equal to the capacity of the internal memory. The graph contains layers with different batch sizes; when the subgraphs of the neural network are scheduled to process the input data, this improves the utilization of the internal memory and the running performance of the chip running the neural network.
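As a hedged illustration of the overlap handling for non-whole-image tiles (the kernel, stride, and padding values below are assumptions for this sketch, not parameters of the embodiment), the input row range needed for one output tile of a convolution layer can be computed as follows; rows that fall outside the real image are the part the padding algorithm has to fill:

```python
def input_rows_for_tile(out_start, out_end, kernel, stride, pad, in_height):
    """Input row range [lo, hi) needed to produce output rows [out_start, out_end)
    of a convolution layer, clamped to the real image. The third return value is
    the number of rows outside the image that the padding algorithm must fill."""
    lo = out_start * stride - pad
    hi = (out_end - 1) * stride - pad + kernel
    fill = max(0, -lo) + max(0, hi - in_height)
    return max(lo, 0), min(hi, in_height), fill


# A 3x3 convolution with stride 1 and padding 1 on a 224-row image split into
# four quarter-picture tiles: the second tile (output rows 56-112) needs input
# rows [55, 113), i.e. one overlapping row on each side and no padding rows.
print(input_rows_for_tile(56, 112, kernel=3, stride=1, pad=1, in_height=224))  # (55, 113, 0)
```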
As shown in FIG. 15, assume that the data set contains 2 pictures. L0 is the first layer of the graph and its batch size is 1/4 of a picture, so the data set is split into 8 batches of input data (batch 0 to batch 7 in FIG. 15); each batch is the non-whole-image data corresponding to 1/4 of a picture, and the batches are fed into L0 one at a time. As shown in FIG. 15, while the current data set is processed, subgraph 0 is scheduled twice for each scheduling of subgraph 1, that is, the scheduling order is L0→L1→L0→L1→L2→L3→L4. Processing the input data of the current data set therefore requires scheduling subgraph 0 eight times and subgraph 1 four times.
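Using the same hypothetical `schedule` sketch given under Example 1, the configuration of FIG. 15 yields the invocation counts stated here:

```python
# FIG. 15: batch sizes of 1/4 and 1/2 of a picture for subgraphs 0 and 1, and a
# 2-picture data set -> ratio [2]; the last subgraph runs 2 / (1/2) = 4 times.
order = schedule([2], top_runs=4)
print(order)                                       # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print({sg: order.count(sg) for sg in set(order)})  # {0: 8, 1: 4}
```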
It can be understood that, to implement the functions of the foregoing embodiments, the neural network system includes at least one of a hardware structure or a software module corresponding to each function. A person skilled in the art should readily appreciate that, with reference to the units and method steps described in the embodiments disclosed in this application, this application can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application scenario and design constraints of the technical solution.
FIG. 16 and FIG. 17 are schematic structural diagrams of possible neural network data processing apparatuses provided in the embodiments of this application. These apparatuses can be used to implement the functions of the processor 211 in the foregoing method embodiments and therefore can also achieve the beneficial effects of those embodiments.
In the embodiments of this application, the neural network data processing apparatus in FIG. 16 may be the processor 211 shown in FIG. 2 or an apparatus formed by software running on it. As shown in FIG. 16, the neural network data processing apparatus 1600 includes an acquisition unit 1610 and a processing unit 1620, and is configured to implement the functions of the processor 211 in the method embodiment shown in FIG. 12. When the apparatus 1600 is used to implement the functions of the processor 211 in the method embodiment shown in FIG. 12, the acquisition unit 1610 is configured to perform S1201, and the processing unit 1620 is configured to perform S1202. More detailed descriptions of the acquisition unit 1610 and the processing unit 1620 can be found in the related description of the method embodiment shown in FIG. 12 and are not repeated here.
The neural network data processing apparatus may also be a module (for example, a chip) of another device connected to the neural network system 200. As shown in FIG. 17, the neural network data processing apparatus 1700 includes a processor 1710 and an interface circuit 1720 that are coupled to each other. It can be understood that the interface circuit 1720 may be a transceiver or an input/output interface. Optionally, the apparatus 1700 may further include a memory 1730, configured to store instructions executed by the processor 1710, to store input data required by the processor 1710 to run the instructions, or to store data generated after the processor 1710 runs the instructions. For example, the neural network data processing apparatus may include the host 210 shown in FIG. 2, the processor 1710 may include the processor 211, and the memory 1730 is the memory 212. The foregoing solution is used to configure batch sizes for the neural network chip so that the neural network can run efficiently. In the described embodiments, the batch sizes, the processing of graphs and subgraphs, and the related algorithms are all executed by the processor 211; in practice, the processing method may also be performed by other types of processors or devices, for example, a controller or processor located inside the neural network chip may execute the related solution to complete the configuration of the neural network. In one possibility, the neural network chip may include one or more types of processors; such a processor may run the related configuration scheme to obtain suitable batch sizes and the graph and subgraph divisions, and after the parameters of the neural network are configured, the same processor may run the neural network computation, thereby realizing self-configuration. This is not limited in this embodiment.
When the neural network data processing apparatus 1700 is used to implement the method shown in FIG. 12, the processor 1710 is configured to perform the functions of the processing unit 1620, and the interface circuit 1720 is configured to perform the functions of the acquisition unit 1610.
It can be understood that the processor in the embodiments of this application may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (Random Access Memory, RAM), a flash memory, a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a network device or a terminal device. Certainly, the processor and the storage medium may also exist as discrete components in a network device or a terminal device.
The foregoing embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are performed wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc (digital video disc, DVD); or a semiconductor medium, for example, a solid state drive (solid state drive, SSD).
In the embodiments of this application, unless otherwise specified or logically conflicting, terms and descriptions in different embodiments are consistent and may be cross-referenced, and technical features in different embodiments may be combined according to their inherent logical relationships to form new embodiments.
In this application, "at least one" means one or more, and "a plurality of" means two or more. It can be understood that the various numerical designations in the embodiments of this application are merely for ease of distinction in the description and are not intended to limit the scope of the embodiments. The sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic.

Claims (16)

  1. A neural network data processing method, comprising:
    obtaining a data amount of input data of a neural network, a first feature of an internal memory in a chip running the neural network, and a second feature of a plurality of layers in the neural network; and
    determining a batch size of each of the plurality of layers according to the data amount, the first feature, and the second feature, wherein batch sizes of at least two of the plurality of layers are different.
  2. The method according to claim 1, wherein the first feature comprises at least one of a distribution feature of the internal memory in the chip and a capacity of the internal memory, and the second feature comprises a connection relationship among the plurality of layers and a computation-related parameter of each of the plurality of layers.
  3. The method according to claim 1 or 2, wherein the determining the batch sizes of the plurality of layers according to the data amount, the first feature, and the second feature comprises:
    determining, according to the data amount, the first feature, and the second feature, the batch sizes of the plurality of layers, N subgraphs, M graphs, and storage locations of inter-layer data, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or an external memory, the external memory is a memory outside the chip running the neural network, each subgraph contains one or more layers of a same batch size, and each graph comprises one or more subgraphs.
  4. The method according to claim 3, wherein inter-layer data of the plurality of layers contained in the subgraph is stored in the internal memory.
  5. The method according to claim 3 or 4, wherein inter-layer data between the subgraphs is stored in the internal memory.
  6. The method according to any one of claims 3 to 5, wherein inter-layer data between the graphs is stored in the external memory.
  7. The method according to any one of claims 1 to 6, wherein a batch corresponding to the batch size is one picture, a plurality of pictures, or a partial image of one picture.
  8. A neural network data processing apparatus, comprising:
    an acquisition unit, configured to obtain a data amount of input data of a neural network, a first feature of an internal memory in a chip running the neural network, and a second feature of a plurality of layers in the neural network; and
    a processing unit, configured to determine a batch size of each of the plurality of layers according to the data amount, the first feature, and the second feature, wherein batch sizes of at least two of the plurality of layers are different.
  9. The apparatus according to claim 8, wherein the first feature comprises at least one of a distribution feature of the internal memory in the chip and a capacity of the internal memory, and the second feature comprises a connection relationship among the plurality of layers and a computation-related parameter of each of the plurality of layers.
  10. The apparatus according to claim 8 or 9, wherein the processing unit is specifically configured to:
    determine, according to the data amount, the first feature, and the second feature, the batch sizes of the plurality of layers, N subgraphs, M graphs, and storage locations of inter-layer data, wherein N is an integer greater than or equal to 2, M is an integer greater than or equal to 1, and N≥M;
    wherein the storage location of the inter-layer data comprises at least one of the internal memory or an external memory, the external memory is a memory outside the chip running the neural network, each subgraph contains one or more layers of a same batch size, and each graph comprises one or more subgraphs.
  11. The apparatus according to claim 10, wherein inter-layer data of the plurality of layers contained in the subgraph is stored in the internal memory.
  12. The apparatus according to claim 10 or 11, wherein inter-layer data between the subgraphs is stored in the internal memory.
  13. The apparatus according to any one of claims 10 to 12, wherein inter-layer data between the graphs is stored in the external memory.
  14. The apparatus according to any one of claims 8 to 13, wherein a batch corresponding to the batch size is one picture, a plurality of pictures, or a partial image of one picture.
  15. A neural network data processing apparatus, comprising at least one processor and a memory, wherein the memory is configured to store a computer program such that, when the computer program is executed by the at least one processor, the neural network data processing method according to any one of claims 1 to 7 is implemented.
  16. A computer-readable storage medium, wherein the storage medium stores a computer program or instructions, and when the computer program or instructions are executed by a neural network data processing apparatus, the neural network data processing method according to any one of claims 1 to 7 is implemented.
PCT/CN2020/093624 2020-05-30 2020-05-30 Data processing method and apparatus for neural network WO2021243489A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus
CN202180037755.7A CN115668222A (en) 2020-05-30 2021-01-26 Data processing method and device of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network

Publications (1)

Publication Number Publication Date
WO2021243489A1 true WO2021243489A1 (en) 2021-12-09

Family

ID=78831421

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2020/093624 WO2021243489A1 (en) 2020-05-30 2020-05-30 Data processing method and apparatus for neural network
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073691 WO2021244045A1 (en) 2020-05-30 2021-01-26 Neural network data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN115668222A (en)
WO (2) WO2021243489A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382880B (en) * 2023-06-07 2023-08-11 成都登临科技有限公司 Task execution method, device, processor, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140142929A1 (en) * 2012-11-20 2014-05-22 Microsoft Corporation Deep neural networks training for speech and pattern recognition
CN107454965A (en) * 2015-05-21 2017-12-08 谷歌公司 Batch processing in neural network processor
CN108885571A (en) * 2016-04-05 2018-11-23 谷歌有限责任公司 The input of batch machines learning model
CN109492754A (en) * 2018-11-06 2019-03-19 深圳市友杰智新科技有限公司 One kind is based on deep neural network model compression and accelerated method
CN110389910A (en) * 2018-04-17 2019-10-29 英特尔公司 For managing the method and arrangement of the memory in cascade neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018018451A (en) * 2016-07-29 2018-02-01 富士通株式会社 Machine learning method, machine learning program and information processing device
CN207440765U (en) * 2017-01-04 2018-06-01 意法半导体股份有限公司 System on chip and mobile computing device
US20180341852A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140142929A1 (en) * 2012-11-20 2014-05-22 Microsoft Corporation Deep neural networks training for speech and pattern recognition
CN107454965A (en) * 2015-05-21 2017-12-08 谷歌公司 Batch processing in neural network processor
CN108885571A (en) * 2016-04-05 2018-11-23 谷歌有限责任公司 The input of batch machines learning model
CN110389910A (en) * 2018-04-17 2019-10-29 英特尔公司 For managing the method and arrangement of the memory in cascade neural network
CN109492754A (en) * 2018-11-06 2019-03-19 深圳市友杰智新科技有限公司 One kind is based on deep neural network model compression and accelerated method

Also Published As

Publication number Publication date
CN115668222A (en) 2023-01-31
WO2021244045A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
CN109102065B (en) Convolutional neural network accelerator based on PSoC
US11775430B1 (en) Memory access for multiple circuit components
JP7366274B2 (en) Adaptive search method and device for neural networks
JP7451614B2 (en) On-chip computational network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
WO2021051987A1 (en) Method and apparatus for training neural network model
CN117501245A (en) Neural network model training method and device, and data processing method and device
WO2021244045A1 (en) Neural network data processing method and apparatus
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
Véstias Processing systems for deep learning inference on edge devices
Kim et al. Efficient multi-GPU memory management for deep learning acceleration
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
Sun et al. Multi-node acceleration for large-scale GCNs
Yan et al. Acceleration and optimization of artificial intelligence CNN image recognition based on FPGA
JP7108702B2 (en) Processing for multiple input datasets
CN112789627B (en) Neural network processor, data processing method and related equipment
US11921667B2 (en) Reconfigurable computing chip
US20200192797A1 (en) Caching data in artificial neural network computations
Zhou et al. Design and implementation of YOLOv3-Tiny accelerator based on PYNQ-Z2 heterogeneous platform
WO2021120036A1 (en) Data processing apparatus and data processing method
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
EP3895024A1 (en) Caching data in artificial neural network computations
RamaDevi et al. Machine learning techniques for the energy and performance improvement in Network-on-Chip (NoC)
WO2021237755A1 (en) Neural network scheduling method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20939062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20939062

Country of ref document: EP

Kind code of ref document: A1