WO2020133317A1 - Computing resource allocation technology and neural network system - Google Patents

Computing resource allocation technology and neural network system

Info

Publication number
WO2020133317A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
layer
computing
output data
network system
Prior art date
Application number
PCT/CN2018/125239
Other languages
French (fr)
Chinese (zh)
Inventor
刘哲
曾重
王铁英
段小祥
张慧敏
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN201880100574.2A priority Critical patent/CN113597621A/en
Priority to PCT/CN2018/125239 priority patent/WO2020133317A1/en
Publication of WO2020133317A1 publication Critical patent/WO2020133317A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of computer technology, in particular to a computing resource allocation technology and a neural network system.
  • Deep learning is an important branch of artificial intelligence (Artificial Intelligence, AI). Deep learning is a neural network constructed to imitate the human brain, which can achieve better recognition results than traditional shallow learning methods.
  • Image processing is an application that identifies and analyzes an input image and finally outputs a set of classified image content. For example, a convolutional neural network algorithm can be used to extract and classify the body color, license plate number, and model of a motor vehicle in a picture.
  • Convolutional neural networks usually use a three-layer sequence of convolutional layer (Convolutional Layer), pooling layer (Pooling Layer), and rectified linear units (ReLU) to extract the features of a picture.
  • The process of extracting picture features is actually a series of matrix operations (for example, matrix multiply-add operations). Therefore, how to process pictures in the network quickly and in parallel becomes a problem to be studied in convolutional neural networks.
  • the application provides a computing resource allocation technology and a neural network system, which can improve the data processing speed in the neural network.
  • an embodiment of the present invention provides a computing resource allocation method applied in a neural network system.
  • This method can be performed by a host connected to the neural network chip.
  • According to the deployment requirement, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer are determined. Further, according to the calculation specifications of the calculation units in the neural network system, the N first weights are deployed on P calculation units, and the M second weights are deployed on Q calculation units.
  • The input data of the second neural network layer includes the first output data, N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data.
  • Both P and Q are positive integers, the P computing units are used to perform operations of the first neural network layer, and the Q computing units are used to perform operations of the second neural network layer.
  • The computing resource allocation method provided by the embodiment of the present invention takes into account the amount of data output by adjacent neural network layers when configuring, according to the deployment requirements, the computing units that execute each layer's neural network operations, so that the computing power of the computing nodes that perform different neural network layer operations is matched. The computing power of the computing nodes that perform each layer of neural network operations can thus be fully utilized to improve the efficiency of data processing.
  • In a possible implementation manner, the deployment requirement includes a calculation delay, and the first neural network layer is the starting layer of all neural network layers in the neural network system.
  • The determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer includes: determining the value of N according to the data amount of the first output data, the calculation delay, and the calculation frequency of the resistive random access memory (ReRAM) crossbar in the calculation unit; and determining the value of M according to the ratio of the data amount of the first output data to the data amount of the second output data and the value of N.
  • The value of N may be obtained according to the following formula:

$$N = \left\lceil \frac{H_1 \times W_1}{t \times f} \right\rceil$$

  • where N indicates the number of weights required to be configured for the first-layer neural network, $H_1$ is the number of rows of the output data of the first-layer neural network, $W_1$ is the number of columns of the output data of the first-layer neural network, t is the set calculation delay, and f is the calculation frequency of the crossbar in the calculation unit.
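  • As an illustrative sketch only (not part of the claims), the following Python snippet evaluates the formula above; the function name and the assumption that one weight copy produces one output element per crossbar cycle are ours:

```python
import math

def weights_for_first_layer(h1: int, w1: int, delay_s: float, freq_hz: float) -> int:
    """Weight copies N for the starting layer.

    h1, w1: rows and columns of the first layer's output data.
    delay_s: the required calculation delay t.
    freq_hz: the calculation frequency f of one ReRAM crossbar.
    Assumes each weight copy yields one output element per crossbar cycle.
    """
    return math.ceil((h1 * w1) / (delay_s * freq_hz))

# Example: a 224 x 224 output computed within 1 ms on 1 MHz crossbars:
print(weights_for_first_layer(224, 224, 1e-3, 1e6))  # ceil(50176 / 1000) -> 51
```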
  • In a possible implementation manner, the neural network system includes multiple neural network chips, each neural network chip includes multiple computing units, and each computing unit includes at least one resistive random access memory (ReRAM) crossbar; the deployment requirement includes the number of chips of the neural network system.
  • The determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer includes: determining the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight for each neural network layer, and the output data amounts of the adjacent neural network layers; and determining the value of M according to the ratio of the data amount of the first output data to the data amount of the second output data and the value of N.
  • In a possible implementation manner, when the deployment requirement is the number of chips required by the neural network system and the first neural network layer is the starting layer of the neural network system, the following two formulas are used to obtain the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer, where the value of N is the value of $N_1$ and the value of M is the value of $N_2$:

$$xb_1 \times N_1 + xb_2 \times N_2 + \cdots + xb_n \times N_n \leq K \times L$$

$$\frac{N_{i-1}}{N_i} = \frac{D_{i-1}}{D_i}$$

  • where $xb_1$ represents the number of crossbars required to deploy one weight of the first-layer (or starting-layer) neural network and $N_1$ represents the number of weights required for the starting layer; $xb_2$ represents the number of crossbars required to deploy one weight of the second-layer neural network and $N_2$ represents the number of weights required for the second-layer neural network; $xb_n$ represents the number of crossbars required to deploy one weight of the n-th-layer neural network and $N_n$ represents the number of weights required for the n-th-layer neural network; $D_i$ is the output data volume of the i-th layer; K is the number of chips of the neural network system required by the deployment requirement; and L is the number of crossbars in each chip.
  • the value of i can be from 2 to n, where n is the total number of neural network layers in the neural network system.
  • In a possible implementation manner, the neural network system includes multiple neural network chips, each neural network chip includes multiple secondary computing nodes, and each secondary computing node includes multiple computing units.
  • the method further includes mapping the P computing units and the Q computing units into multiple secondary computing nodes according to the number of computing units included in the secondary computing nodes in the neural network system. Wherein, at least a part of the P computing units and at least a part of the Q computing units are mapped into the same secondary computing node.
  • In this way, the computing units that perform the operations of adjacent neural network layers can be located in the same secondary computing node as much as possible, thereby reducing the amount of data transmitted between computing nodes and increasing the speed of data transmission between different neural network layers.
  • In a possible implementation manner, the method further includes: mapping, according to the number of secondary computing nodes included in each neural network chip, the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips.
  • At least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped into the same neural network chip.
  • In this way, the secondary computing nodes that perform the operations of adjacent neural network layers can be located in the same neural network chip as much as possible, which further reduces the amount of data transmitted between the computing nodes and improves the speed of data transmission between different neural network layers.
  • In a possible implementation manner, that the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data includes: the ratio of N to M is the same as the ratio of the data volume of the first output data to the data volume of the second output data.
  • In a second aspect, the present application provides a neural network system, including a host and a plurality of neural network chips, where each neural network chip includes a plurality of computing units, and the host is connected to the plurality of neural network chips and is used for executing the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.
  • the present application provides a resource allocation apparatus, including a functional module capable of executing the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.
  • In another aspect, the present application also provides a computer program product, including program code, where the instructions included in the program code are executed by a computer to implement the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.
  • the present application also provides a computer-readable storage medium for storing program code, and the instructions included in the program code are executed by a computer to implement the foregoing first aspect and The method for computing resource allocation described in any possible implementation manner of the first aspect.
  • FIG. 1 is a schematic structural diagram of a neural network system provided by an embodiment of the present invention.
  • FIG. 1A is a schematic structural diagram of yet another neural network system provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a computing node in a neural network chip provided by an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a logical structure of a neural network layer in a neural network system according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a set of computing nodes for processing data of different neural network layers in a neural network system according to an embodiment of the present invention
  • FIG. 5 is a flowchart of a method for computing resource allocation in a neural network system according to an embodiment of the present invention
  • FIG. 6 is a flowchart of yet another method for computing resource allocation according to an embodiment of the present invention.
  • FIG. 6A is a flowchart of a resource mapping method according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of another computing resource allocation method according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of a data processing method according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a weight matrix of a neural network layer according to an embodiment of the present invention;
  • FIG. 10 is a schematic structural diagram of a resistive random access memory crossbar (ReRAM crossbar) according to an embodiment of the present invention;
  • FIG. 11 is a schematic structural diagram of a resource allocation apparatus according to an embodiment of the present invention.
  • Deep learning is an important branch of artificial intelligence (Artificial Intelligence, AI). Deep learning is a neural network constructed to imitate the human brain, which can achieve better recognition results than traditional shallow learning methods.
  • Artificial neural networks can include convolutional neural networks (Convolutional Neural Networks, CNN), deep neural networks (Deep Neural Networks, DNN), multilayer perceptrons (Multilayer Perceptron, MLP) and other neural networks.
  • FIG. 1 is a schematic structural diagram of an artificial neural network system according to an embodiment of the present invention.
  • FIG. 1 takes a convolutional neural network as an example.
  • the convolutional neural network system 100 may include a host 105 and a convolutional neural network circuit 110.
  • the convolutional neural network circuit 110 may also be referred to as a neural network accelerator.
  • the convolutional neural network circuit 110 is connected to the host 105 through the host interface.
  • the host interface may include a standard host interface and a network interface.
  • the host interface may include a Peripheral Component Interconnect Express (PCIE) interface.
  • the convolutional neural network circuit 110 may be connected to the host 105 through the PCIE bus 106.
  • The host 105 can input data to be processed into the convolutional neural network circuit 110 through the PCIE bus 106, and receive the processed data from the convolutional neural network circuit 110 through the PCIE bus 106.
  • the host 105 may also monitor the working state of the convolutional neural network circuit 110 through the host interface.
  • the host 105 may include a processor 1052 and memory 1054. It should be noted that, in addition to the devices shown in FIG. 1, the host 105 may also include a communication interface and other devices such as a magnetic disk as an external storage, which is not limited herein.
  • The processor 1052 is the arithmetic core and the control unit (Control Unit) of the host 105.
  • the processor 1052 may include multiple processor cores.
  • the processor 1052 may be a very large-scale integrated circuit.
  • An operating system and other software programs are installed in the processor 1052, so that the processor 1052 can access the memory 1054, the cache, disks, and peripheral devices (such as the neural network circuit in FIG. 1).
  • The core in the processor 1052 may be, for example, a central processing unit (CPU) or an application-specific integrated circuit (ASIC).
  • the memory 1054 is the main memory of the host 105.
  • the memory 1054 is connected to the processor 1052 through a double data rate (DDR) bus.
  • The memory 1054 is generally used to store various running software in the operating system, input and output data, and information exchanged with external storage. In order to improve the access speed of the processor 1052, the memory 1054 needs to have a fast access speed. The memory 1054 is generally a dynamic random access memory (DRAM).
  • the processor 1052 can access the memory 1054 at a high speed through a memory controller (not shown in FIG. 1), and can perform a read operation and a write operation on any storage unit in the memory 1054.
  • a convolutional neural network (CNN) circuit 110 is a chip array composed of multiple neural network (NN) chips.
  • the CNN circuit 110 includes multiple NN chips 115 and multiple routers 120.
  • For brevity, the embodiments of the present invention refer to the NN chip 115 as the chip 115.
  • the plurality of chips 115 are connected to each other through a router 120.
  • one chip 115 may be connected to one or more routers 120.
  • Multiple routers 120 may constitute one or more network topologies. Data can be transmitted between the chips 115 through the one or more network topologies.
  • the plurality of routers 120 may constitute a first network 1106 and a second network 1108, where the first network 1106 is a ring network and the second network 1108 is a two-dimensional mesh (2D mesh) network. Therefore, the data input from the input port 1102 can be sent to the corresponding chip 115 by the network composed of the plurality of routers 120, and the data processed by any one chip 115 can also be sent to other chips 115 through the network composed of the plurality of routers 120. Process or send out from output port 1104.
  • FIG. 1 also shows a schematic structural diagram of the chip 115.
  • chip 115 may include multiple neural network processing units 125 and multiple routers 122.
  • FIG. 1 takes a tile as an example of the neural network processing unit.
  • one tile 125 may be connected to one or more routers 122.
  • the multiple routers 122 in the chip 115 may constitute one or more network topologies. Data can be transmitted between tiles 125 through the various network topologies.
  • the plurality of routers 122 may constitute a first network 1156 and a second network 1158, where the first network 1156 is a ring network and the second network 1158 is a two-dimensional mesh (2D mesh) network.
  • The data input to the chip 115 from the input port 1152 can be sent to the corresponding tile 125 through the network composed of the plurality of routers 122, and the data processed by any tile 125 can also be sent to other tiles 125 through the network or output from the output port 1154.
  • It should be noted that, although the chips 115 are interconnected by routers, the one or more network topologies composed of the multiple routers 120 in the convolutional neural network circuit 110 and the network topologies composed of the multiple routers 122 in the chip 115 may be the same or different, as long as data can be transmitted between the chips 115 or the tiles 125 through the network topology and the chips 115 or tiles 125 can receive and output data through the network topology.
  • The embodiments of the present invention do not limit the number and types of networks composed of the multiple routers 120 and 122.
  • The router 120 and the router 122 may be the same or different; for clarity of description, they are labeled 120 and 122 respectively in FIG. 1.
  • The chip 115 connected to the router 120 may also be referred to as a computing node.
  • FIG. 1A is a schematic structural diagram of yet another neural network system according to an embodiment of the present invention.
  • As shown in FIG. 1A, the host 105 is connected to multiple PCIE cards 109 through a PCIE interface 107, each PCIE card 109 may include multiple neural network chips 115, and the neural network chips are connected through a high-speed interconnection interface.
  • The interconnection manner between the chips is not limited here. It can be understood that, in actual applications, the tiles within a chip may not be connected by routers while the chips adopt the high-speed interconnection shown in FIG. 1A; it is also possible to use the router connection shown in FIG. 1 between the tiles within a chip and the high-speed interconnection shown in FIG. 1A between the chips.
  • the embodiments of the present invention do not limit the connection modes between chips or within chips.
  • each tile 125 may include an input-output interface (TxRx) 1252, a switching device (TSW) 1254, and multiple processing devices (PE) 1256.
  • The switching device (TSW) 1254 is connected to the TxRx 1252 and is used to implement data transmission between the TxRx 1252 and the multiple PEs 1256.
  • Each PE 1256 may include one or more computing engines (computing engines) 1258.
  • The one or more computing engines 1258 are used to implement neural network calculations on the data input to the PE 1256. For example, the data input to the tile 125 may be multiplied and added with the convolution kernel preset in the tile 125.
  • the calculation result of Engine 1258 can be sent to other tiles 125 through TSW 1254 and TxRx 1252.
  • an Engine 1258 may include modules that implement convolution, pooling, or other neural network operations.
  • The specific circuit or function of the engine is not limited in the embodiments of the present invention. For simplicity of description, the calculation engine is simply referred to as the engine.
  • In the embodiments of the present invention, the engine 1258 is implemented based on resistive random-access memory (ReRAM), and an engine 1258 may include one or more ReRAM crossbars.
  • The structure of the ReRAM crossbar is shown in FIG. 10. How the matrix multiply-add operation is performed through the ReRAM crossbar will be introduced later.
  • As described above, the neural network circuit provided by the embodiments of the present invention includes multiple NN chips, each NN chip includes multiple tiles, each tile includes multiple processing devices (PE), each PE includes multiple engines, and each engine is realized by one or more ReRAM crossbars.
  • The neural network system provided by the embodiments of the present invention may include multi-level computing nodes, for example, four levels of computing nodes: the first-level computing node is a chip 115, the second-level computing node is a tile within the chip, the third-level computing node is a PE in the tile, and the fourth-level computing node is an engine in the PE.
  • the neural network system may include multiple neural network layers.
  • the neural network layer is a logical layer concept.
  • a neural network layer refers to performing a neural network operation once.
  • Each layer of neural network computing is implemented by computing nodes.
  • the neural network layer may include a convolution layer, a pooling layer, and the like.
  • the neural network system may include n neural network layers (also called n-layer neural networks), where n is an integer greater than or equal to 2.
  • FIG. 3 shows some neural network layers in the neural network system. As shown in FIG. 3, the neural network system may include a first layer 302, a second layer 304, a third layer 306, a fourth layer 308, and a fifth layer 310, up to the n-th layer 312.
  • the first layer 302 can perform a convolution operation
  • the second layer 304 can perform a pooling operation on the output data of the first layer 302
  • the third layer 306 can perform a convolution operation on the output data of the second layer 304
  • The fourth layer 308 may perform a convolution operation on the output result of the third layer 306, and the fifth layer 310 may perform a sum operation on the output data of the second layer 304 and the output data of the fourth layer 308, and so on. It can be understood that FIG. 3 is merely an example and does not limit the operation performed by each neural network layer.
  • the fourth layer 308 may also be a pooling operation.
  • the fifth layer 310 may also be other neural network operations such as convolution operations or pooling operations.
  • In a conventional neural network implementation, the calculation result of the i-th layer is temporarily stored in a preset buffer, and before performing the calculation of the (i+1)-th layer, the calculation unit needs to reload the calculation result of the i-th layer and the weight of the (i+1)-th layer from the preset cache, where the i-th layer is any layer in the neural network system.
  • In the embodiments of the present invention, because the engine of the neural network system uses ReRAM crossbars, and ReRAM has the advantage of integrating storage and computation, the weights can be configured on the ReRAM before calculation, and the calculation results of one layer can be directly sent to the next layer for pipelined calculation.
  • each layer of neural network only needs to cache very little data.
  • each layer of neural network only needs to cache enough input data for one window calculation.
  • an embodiment of the present invention provides a method for streaming data through a neural network. For clarity of description, the following briefly introduces the stream processing of the neural network system in conjunction with the convolutional neural network system of FIG. 1.
  • FIG. 4 takes the division of tiles 125 in the neural network system shown in FIG. 1 as an example to illustrate different sets of computing nodes that implement neural network computing at different layers in the embodiment of the present invention.
  • multiple tiles 125 in the chip 115 may be divided into multiple node sets. For example: first node set 402, second node set 404, third node set 406, fourth node set 408, and fifth node set 410.
  • each node set includes at least one computing node (for example, tile 125).
  • the computing nodes of the same node set are used to perform neural network operations on the data entering the same neural network layer, and the data of different neural network layers are processed by the computing nodes of different node sets.
  • the processing results of a computing node will be transmitted to the computing nodes in other node sets for processing.
  • This pipelined processing method allows each layer of the neural network to cache only a very small amount of data, and enables multiple computing nodes to concurrently process the same data stream, improving processing efficiency.
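  • The pipelining idea can be sketched with a toy Python model; the stage functions below are hypothetical stand-ins for the convolution and pooling performed by two node sets, and only illustrate that downstream layers consume results as soon as upstream layers emit them:

```python
# Toy model of stream (pipeline) processing: each node set applies its layer
# to items flowing through, so layer i+1 starts as soon as layer i emits a
# result and only a small window of data is buffered per stage.

def conv_stage(stream):
    for x in stream:
        yield x * 2          # stand-in for convolution on node set 1

def pool_stage(stream):
    for x in stream:
        yield x + 1          # stand-in for pooling on node set 2

for result in pool_stage(conv_stage(iter(range(5)))):
    print(result)            # 1, 3, 5, 7, 9 -- produced incrementally
```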
  • FIG. 4 uses tiles as an example to illustrate a set of computing nodes used to process different neural network layers (for example, convolutional layers). In actual applications, because a tile contains multiple PEs, each PE contains multiple Engines, and different application scenarios require different amounts of calculation.
  • the computing nodes in the neural network system can be divided with the granularity of PE, Engine or chip, so that the computing nodes in different sets are used to handle the operations of different neural network layers.
  • the computing node referred to in the embodiment of the present invention may be Engine, PE, tile, or chip.
  • When a computing node (for example, tile 125) performs a neural network operation (for example, a convolution calculation), it may calculate the data input to the computing node based on the weight of the corresponding neural network layer.
  • a certain tile 125 may perform a convolution operation on the input data input to the tile 125 based on the weight of the corresponding convolution layer, for example, perform a matrix multiply-add calculation on the weight and the input data.
  • the weight is usually used to indicate the importance of the input data to the output data.
  • The weights are usually represented by a matrix. The weight matrix of j rows and k columns shown in FIG. 9 may be one weight of a neural network layer.
  • each element in the weight matrix represents a weight value.
  • Since the computing nodes of a node set are used to perform the operation of one neural network layer, the computing nodes of the same node set may share weights, and the computing nodes in different node sets may have different weights.
  • In the embodiments of the present invention, the weights in each computing node can be configured in advance. Specifically, each element in a weight matrix is configured into a ReRAM cell in the corresponding crossbar array, so that the matrix multiply-add operation of the input data and the configured weights can be implemented through the crossbar array. How the matrix multiply-add operation is implemented through the crossbar will be briefly introduced later.
  • the computing nodes in the neural network may be divided into a set of nodes for processing different neural network layers, and corresponding weights are configured.
  • computing nodes of different node sets can perform corresponding calculations according to the configured weights.
  • the computing nodes of each node set can send the computing results to the computing nodes used to perform the next layer of neural network operations.
  • A person skilled in the art may know that, in the process of realizing neural network stream processing, if the computing resources for performing the operations of different layers do not match, for example, if there are fewer computing resources performing the operation of an upper neural network layer and relatively many computing resources performing the operation of the next layer, the computing resources of the computing nodes of the next layer will be wasted.
  • Based on this consideration, embodiments of the present invention provide a computing resource allocation method for allocating the computing nodes that perform the operations of different neural network layers, so that the computing power of the computing nodes used to perform the operations of two adjacent neural network layers in the neural network system is matched, which improves the data processing efficiency in the neural network system and does not waste computing resources.
  • FIG. 5 is a flowchart of a method for computing resource allocation in a neural network system according to an embodiment of the present invention. This method can be applied to the neural network system shown in FIG. 1. This method may be implemented by the host computer 105 when deploying a neural network or when configuring a neural network system. Specifically, it may be implemented by the processor 1052 in the host computer 105. As shown in FIG. 5, the method may include the following steps.
  • In step 502, the network model information of the neural network system is obtained.
  • the network model information includes the first output data amount of the first neural network layer and the second output data amount of the second neural network layer in the neural network system.
  • Network model information can be determined according to actual application requirements. For example, the total number of neural network layers and the algorithm of each layer can be determined according to the application scenario of the neural network system.
  • the network model information may include the total number of neural network layers in the neural network system, the algorithm of each layer, and the data output of each layer of the neural network.
  • the algorithm refers to a neural network operation that needs to be performed.
  • For example, the algorithm may refer to a convolution operation, a pooling operation, and so on.
  • As shown in FIG. 3, the neural network system may have n neural network layers, where n is an integer not less than 2.
  • the first neural network layer and the second neural network layer may be two layers in the n layer that are operationally dependent.
  • the two neural network layers having a dependency relationship mean that the input data of one neural network layer includes the output data of another neural network layer.
  • Two neural network layers with dependencies can also be referred to as adjacent layers.
  • the output data of the first layer 302 is the input data of the second layer 304, therefore, the first layer 302 and the second layer 304 have a dependency relationship.
  • the output data of the second layer 304 is the input data of the third layer 306, the input data of the fifth layer 310 includes the output data of the second layer 304, therefore, the second layer 304 and the third layer 306 have a dependency relationship, the second layer 304 and the fifth layer 310 also have a dependency relationship.
  • the first layer 302 shown in FIG. 3 is the first neural network layer
  • the second layer 304 is the second neural network layer as an example for description.
  • In step 504, according to the deployment requirement, the first output data amount, and the second output data amount, the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer are determined.
  • N and M are both positive integers
  • the ratio of N and M corresponds to the ratio of the first output data volume and the second output data volume.
  • the deployment requirements may include the calculation delay of the neural network system, or may include the number of chips required to be deployed by the neural network system.
  • the operation of the neural network is mainly to perform matrix multiply-add operations.
  • The output data of each layer of the neural network is also a one-dimensional or multi-dimensional real matrix. Therefore, the first output data amount includes the number of rows and columns of the output data of the first neural network layer, and the second output data amount includes the number of rows and columns of the output data of the second neural network layer.
  • When a computing node performs a convolution operation or a pooling operation, it is necessary to perform a multiply-add calculation on the input data and the weight of the corresponding neural network layer. Since the weights are configured on the cells in the crossbars, the crossbars in the calculation units perform calculations on the input data in parallel, and the number of weights therefore determines the parallel computing capability of the multiple calculation units that perform a neural network operation. In other words, the computing power of the computing nodes performing a neural network operation is determined by the number of weights configured in the calculation units performing that operation.
  • In the embodiments of the present invention, the number of weights to be configured for the first neural network layer and the second neural network layer may be determined based on the specific deployment requirement, the first output data amount, and the second output data amount. Since the weights of different neural network layers are not necessarily the same, for clarity of description, in the embodiments of the present invention, the weights required for the operation of the first neural network layer are called first weights, and the weights required for the operation of the second neural network layer are called second weights.
  • Performing the first neural network layer operation means that the computing node performs the corresponding calculation on the data input to the first neural network layer based on the first weights, and performing the second neural network layer operation means that the computing node performs the corresponding calculation on the data input to the second neural network layer based on the second weights. The calculations here can be neural network operations such as convolution or pooling operations.
  • the number of weights to be configured for each layer of the neural network includes the number N of first weights to be configured by the first neural network layer and the number M of second weights to be configured by the second neural network layer.
  • the weight refers to a weight matrix.
  • the number of weights refers to the number of weight matrices required, or the number of copies of weights.
  • the number of weights can also be understood as how many identical weight matrices need to be configured.
  • In an implementation, the number of weights that need to be configured for the first layer of the neural network (that is, the starting layer of all neural network layers in the neural network system) is determined according to the data output volume of the first layer, the calculation delay, and the calculation frequency of the ReRAM crossbars used in the neural network system; then the number of weights that need to be configured for each layer of the neural network is determined according to the number of weights configured for the first layer and the output data amount of each layer.
  • Specifically, the number of weights required for the first-layer (that is, starting-layer) neural network can be obtained according to the following formula 1:

$$N_1 = \left\lceil \frac{H_1 \times W_1}{t \times f} \right\rceil \qquad \text{(formula 1)}$$

  • The first-layer neural network is the starting-layer neural network among all neural network layers in the neural network system. It can be understood that, when the first neural network layer is the starting layer of all neural network layers in the neural network system, the number N of the first weights is the value calculated according to formula 1.
  • the ratio of the number of weights required by two adjacent layers can be made to correspond to the ratio of the output data amount of the two adjacent layers.
  • In the embodiments of the present invention, the ratio can be the same. Therefore, the number of weights required by each layer of the neural network can be determined according to the number of weights required by the first-layer neural network and the ratio of the output data amounts of the layers. Specifically, the number of weights required for each layer can be calculated according to the following formula 2:

$$N_i = \left\lceil N_{i-1} \times \frac{H_i \times W_i}{H_{i-1} \times W_{i-1}} \right\rceil \qquad \text{(formula 2)}$$

  • where $H_i$ and $W_i$ are the number of rows and columns of the output data of the i-th layer, and the value of i can be from 2 to n, where n is the total number of neural network layers in the neural network system.
  • That is, the ratio of the number of weights required to perform the operation of the (i-1)-th neural network layer to the number of weights required to perform the operation of the i-th neural network layer corresponds to the ratio of the output data volume of the (i-1)-th layer to the output data volume of the i-th layer.
  • the output data of each neural network layer may include multiple channels (channel), where the channel refers to the number of kernels in each neural network layer.
  • A kernel represents a feature extraction method and corresponds to a feature map; multiple feature maps constitute the output data of the layer.
  • The weight used by a neural network layer includes multiple kernels. Therefore, in practical applications, in another situation, the output data volume of each layer can also take into account the number of channels of each neural network layer. Specifically, after obtaining the number of weights required for the first neural network layer according to the above formula 1, the number of weights required for each layer can be obtained according to the following formula 3:

$$N_i = \left\lceil N_{i-1} \times \frac{H_i \times W_i \times C_i}{H_{i-1} \times W_{i-1} \times C_{i-1}} \right\rceil \qquad \text{(formula 3)}$$

  • Formula 3 further considers the number of channels output by each neural network layer on the basis of formula 2.
  • Here, $C_{i-1}$ represents the number of channels of the (i-1)-th layer, $C_i$ represents the number of channels of the i-th layer, the value of i is from 2 to n, and n is the total number of neural network layers in the neural network system, where n is an integer not less than 2.
  • the number of channels of each layer of neural network can be obtained from the network model information.
  • After the number of weights required for the starting layer is obtained according to the above formula 1, the number of weights required for each layer of the neural network can be calculated according to formula 2 (or formula 3) and the output data amount of each layer included in the network model information.
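  • A minimal sketch of this propagation via formulas 2 and 3, assuming the per-layer output sizes (and optionally channel counts) come from the network model information; rounding up to keep integer weight counts is our assumption:

```python
import math

def weights_per_layer(n_first: int, out_sizes, channels=None):
    """Propagate the starting layer's weight count through formula 2
    (or formula 3 when channel counts are given).

    n_first: N, the weight count of the starting layer (from formula 1 or 4).
    out_sizes: list of (H_i, W_i) for layers 1..n.
    channels: optional list of C_i for layers 1..n (enables formula 3).
    """
    counts = [n_first]
    for i in range(1, len(out_sizes)):
        h_prev, w_prev = out_sizes[i - 1]
        h_cur, w_cur = out_sizes[i]
        ratio = (h_cur * w_cur) / (h_prev * w_prev)
        if channels is not None:
            ratio *= channels[i] / channels[i - 1]
        counts.append(math.ceil(counts[-1] * ratio))
    return counts

# Example: three layers whose outputs shrink 224x224 -> 112x112 -> 56x56.
print(weights_per_layer(8, [(224, 224), (112, 112), (56, 56)]))  # [8, 2, 1]
```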
  • When the above-mentioned first neural network layer is the starting layer of all neural network layers in the neural network system and the number N of the first weights is obtained according to formula 1, the number M of second weights required by the second neural network layer can be calculated according to formula 2, based on the value of N and the set first output data amount and second output data amount.
  • In another implementation, when the deployment requirement is the number of chips required by the neural network system, the number of weights required for the first-layer neural network can be calculated by combining the following formula 4 with the foregoing formula 2, or by combining the following formula 4 with the foregoing formula 3:

$$xb_1 \times N_1 + xb_2 \times N_2 + \cdots + xb_n \times N_n \leq K \times L \qquad \text{(formula 4)}$$
  • Here, $xb_1$ represents the number of crossbars required to deploy one weight of the first-layer (or starting-layer) neural network, and $N_1$ represents the number of weights required for the starting layer; $xb_2$ represents the number of crossbars required to deploy one weight of the second-layer neural network, and $N_2$ represents the number of weights required for the second-layer neural network; $xb_n$ represents the number of crossbars required to deploy one weight of the n-th-layer neural network, and $N_n$ represents the number of weights required for the n-th-layer neural network; K is the number of chips of the neural network system required by the deployment requirement; and L is the number of crossbars in each chip.
  • In the embodiments of the present invention, the network model information of the neural network system also includes the size of one weight used by each neural network layer and the crossbar specification information. Therefore, the $xb_i$ of the i-th-layer neural network can be obtained according to the weight of each layer (that is, the number of rows and columns of the weight matrix) and the specifications of the crossbar, where i takes values from 1 to n.
  • the value of L can be obtained from the parameters of the chip used by the neural network system.
  • After the number of weights required for the starting-layer neural network (that is, $N_1$) is obtained according to formula 4 and formula 2 above, the number of weights that need to be configured for each layer can be obtained according to formula 2 and the output data amount of each layer obtained from the network model information. Similarly, after the number of weights required for the starting-layer neural network (that is, $N_1$) is obtained according to formula 4 and formula 3, the number of weights of each layer can also be obtained according to formula 3 and the output data amount of each layer.
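  • The following sketch combines formula 4 with the formula-2 ratios to solve for the starting layer's weight count $N_1$ under a budget of K chips with L crossbars each; the closed-form rearrangement and the flooring are our assumptions:

```python
import math

def starting_layer_weights(k_chips: int, l_xbars: int, xb_per_weight, out_volumes):
    """Solve formula 4 for N_1 given the chip budget.

    k_chips: K, number of chips required by the deployment.
    l_xbars: L, number of crossbars per chip.
    xb_per_weight: [xb_1, ..., xb_n], crossbars needed for one weight copy
        of each layer (from the weight sizes and crossbar specification).
    out_volumes: [D_1, ..., D_n], output data volume of each layer
        (e.g. H_i * W_i, or H_i * W_i * C_i when channels are considered).
    Under formula 2, N_i = N_1 * D_i / D_1, so
        sum_i(xb_i * N_1 * D_i / D_1) <= K * L.
    """
    denom = sum(xb * d / out_volumes[0]
                for xb, d in zip(xb_per_weight, out_volumes))
    return math.floor(k_chips * l_xbars / denom)

# Example: 2 chips with 96 crossbars each, a 3-layer network.
print(starting_layer_weights(2, 96, [4, 2, 2], [50176, 12544, 3136]))  # -> 41
```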
  • In step 506, according to the calculation specifications of the calculation units in the neural network system, the N first weights are deployed on P calculation units, and the M second weights are deployed on Q calculation units.
  • P and Q are both positive integers
  • the P computing units are used to perform operations of the first neural network layer
  • the Q computing units are used to perform operations of the second neural network layer.
  • the calculation specification of the calculation unit refers to the number of crossbars included in one calculation unit.
  • a computing unit may include one or more crossbars. Specifically, as mentioned above, since the network model information of the neural network system further includes the size of one weight used by each neural network layer and the specification information of the crossbar, the deployment relationship between one weight and the crossbar can be obtained.
  • the weights of each layer may be deployed on the corresponding number of calculation units according to the number of crossbars included in each calculation unit.
  • the elements in the weight matrix are respectively configured in the ReRAM cells of the crossbar of the calculation unit.
  • the computing unit may refer to a PE or an engine, one PE may include multiple engines, and one engine may include one or more crossbars. Since the weight of each layer may be different, a weight can be deployed on one or more engines.
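  • As a hedged sketch of this deployment step: given how many crossbars one weight occupies and how many crossbars one calculation unit (for example, an engine) contains, the number of calculation units a layer's weights occupy can be estimated as below; the simple whole-crossbar packing rule is our assumption:

```python
import math

def units_for_layer(n_weights: int, xb_per_weight: int, xb_per_unit: int) -> int:
    """Estimate how many calculation units a layer's weights occupy.

    Packing assumption: each weight copy occupies whole crossbars, units are
    filled with as many copies as fit, and a copy larger than one unit spans
    ceil(xb_per_weight / xb_per_unit) units.
    """
    if xb_per_weight >= xb_per_unit:
        # each copy spans one or more whole units
        return n_weights * math.ceil(xb_per_weight / xb_per_unit)
    copies_per_unit = xb_per_unit // xb_per_weight
    return math.ceil(n_weights / copies_per_unit)

# Example: N = 8 first weights, 2 crossbars per weight, 4 crossbars per engine
print(units_for_layer(8, 2, 4))  # -> P = 4 engines
```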
  • In this way, according to the calculation specifications of the calculation units, the P calculation units on which the N first weights need to be deployed and the Q calculation units on which the M second weights need to be deployed can be determined.
  • N first weights of the first neural network layer may be deployed on P computing units
  • M second weights may be deployed on Q computing units.
  • the elements in the N first weights are respectively allocated to the corresponding crossbar ReRAM cells in the P calculation units.
  • the elements in the M second weights are respectively allocated to the corresponding crossbar ReRAM cells in the Q calculation units.
  • In this way, the P computing units may perform the operation of the first neural network layer on the input data input to the P computing units based on the configured N first weights, and the Q computing units may perform the operation of the second neural network layer on the input data input to the Q computing units based on the configured M second weights.
  • The computing resource allocation method provided by the embodiments of the present invention considers the amount of data output by adjacent neural network layers when configuring, according to the deployment requirements, the computing units that perform each layer's neural network operations, so that the computing power of the computing nodes operating at different neural network layers matches and can be fully utilized to improve the efficiency of data processing.
  • Further, in order to save the transmission bandwidth between computing units or computing nodes, the computing units can be mapped to the superior computing nodes of the computing units according to the following method.
  • the neural network system may include four-level computing nodes: a first-level computing node chip, a second-level computing node tile, a third-level computing node PE, and a fourth-level computing node engine.
  • FIG. 6 describes in detail how to map the P computing units on which the N first weights need to be deployed and the Q computing units on which the M second weights need to be deployed to the superior computing nodes. This method can still be implemented by the host 105 in the neural network system shown in FIG. 1 and FIG. 1A. As shown in FIG. 6, the method may include the following steps.
  • In step 602, the network model information of the neural network system is obtained.
  • the network model information includes the first output data amount of the first neural network layer and the second output data amount of the second neural network layer in the neural network system.
  • In step 604, according to the deployment requirement, the first output data amount, and the second output data amount, the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer are determined.
  • In step 606, according to the calculation specifications of the calculation units in the neural network system, the P calculation units on which the N first weights need to be deployed and the Q calculation units on which the M second weights need to be deployed are determined.
  • For steps 602, 604, and 606, reference may be made to the related descriptions of steps 502, 504, and 506, respectively.
  • In the method shown in FIG. 6, after the P computing units to be deployed with the N first weights and the Q computing units to be deployed with the M second weights are determined in step 606, the N first weights are not directly deployed to the P computing units, nor are the M second weights directly deployed to the Q computing units; instead, step 608 is entered.
  • In step 608, the P computing units and the Q computing units are mapped into multiple third-level computing nodes according to the number of computing units included in the third-level computing nodes in the neural network system.
  • FIG. 6A is a flowchart of a resource mapping method according to an embodiment of the present invention. FIG. 6A takes the computing unit as the fourth-level computing node engine as an example and describes how to map the engines into the third-level computing nodes PE. As shown in FIG. 6A, the method may include the following steps.
  • The P computing units and the Q computing units are divided into m groups, and each group includes P/m computing units for executing the operation of the first neural network layer and Q/m computing units for executing the operation of the second neural network layer, where m is an integer not less than 2, and the values of P/m and Q/m are both integers.
  • Here, the P computing units are used as the computing units performing the (i-1)-th layer, and the Q computing units are used as the computing units performing the i-th layer as an example.
  • Then, each group of computing units is mapped to the third-level computing nodes. During the mapping process, the computing units that perform the operations of adjacent neural network layers are mapped to the same third-level node as much as possible.
  • each first-level computing node chip includes eight second-level computing node tiles, and each tile includes two third-level computing nodes PE, and each PE includes 4 engines.
  • For the computing units in the first group, the four engines of the (i-1)-th layer can be mapped to one third-level computing node PE (such as PE1 in FIG. 7), and the two engines of the i-th layer and the two engines of the (i+1)-th layer are mapped together to another third-level computing node PE (such as PE2 in FIG. 7).
  • Similar to the mapping method for the first group, for the computing units in the second group, the four engines of the (i-1)-th layer can be mapped to PE3, and the two engines of the i-th layer and the two engines of the (i+1)-th layer are mapped together onto PE4.
  • the computing units of other groups can be mapped in a mirrored manner according to the mapping method of the first group.
  • According to this mapping method, the computing units that execute adjacent neural network layers can be mapped to the same third-level computing node as much as possible. Therefore, when the output data of the i-th layer is sent to the computing units of the (i+1)-th layer, it only needs to be transmitted within the same third-level node (PE) and does not need to occupy the bandwidth between third-level nodes, which can improve the data transmission speed and reduce the transmission bandwidth consumption between nodes.
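  • A sketch of this grouping step of FIG. 6A, with engines represented as simple (layer, index) tuples of our own choosing; each group is meant to land on one pair of PEs so adjacent-layer traffic stays local:

```python
def group_engines(p_engines, q_engines, m):
    """Split the engines of two adjacent layers into m groups.

    p_engines: engines running layer i-1 (P of them).
    q_engines: engines running layer i (Q of them).
    m: number of groups; P/m and Q/m must be integers.
    Each group is later mapped onto the same third-level node (PE), so the
    output of layer i-1 reaches layer i without crossing PE boundaries.
    """
    assert len(p_engines) % m == 0 and len(q_engines) % m == 0
    p_step, q_step = len(p_engines) // m, len(q_engines) // m
    return [
        {"group": g,
         "prev_layer": p_engines[g * p_step:(g + 1) * p_step],
         "next_layer": q_engines[g * q_step:(g + 1) * q_step]}
        for g in range(m)
    ]

p = [("layer_i-1", i) for i in range(8)]   # 8 engines for layer i-1
q = [("layer_i", i) for i in range(4)]     # 4 engines for layer i
for grp in group_engines(p, q, 2):         # 2 groups of 4 + 2 engines
    print(grp)
```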
  • In step 610, according to the number of third-level computing nodes included in the second-level computing nodes in the neural network system, the multiple third-level computing nodes to which the P computing units and the Q computing units are mapped are mapped into multiple second-level computing nodes.
  • In step 612, according to the number of second-level computing nodes included in each neural network chip, the multiple second-level computing nodes to which the P computing units and the Q computing units are mapped are mapped into the multiple neural network chips.
  • FIG. 6A takes the mapping of the engines performing the layer operations to the third-level computing nodes as an example. Similarly, according to the method shown in FIG. 6A, the third-level nodes can also be mapped to the second-level nodes, and the second-level nodes can be mapped to the first-level nodes.
  • PE1 performing the operation of the i-1 layer and PE2 performing the operations of the i-th layer and the i+1-th layer may be mapped into the same second-level computing node Tile1.
  • PE3 performing the operation of the i-1 layer and PE4 performing the operations of the i-th layer and the i+1th layer can be further mapped into the same second-level computing node Tile2.
  • Further, Tile1 and Tile2, which perform the operations of the (i-1)-th, i-th, and (i+1)-th layers, can also be mapped into the same chip chip1. In this way, the mapping relationship from the first-level computing node chip to the fourth-level computing node engine in the neural network system can be obtained.
  • Then, the N first weights and the M second weights are deployed to the P calculation units and the Q calculation units corresponding to the multiple third-level nodes, multiple second-level computing nodes, and multiple first-level computing nodes, respectively.
  • the mapping relationship from the first-level computing node chip to the fourth-level computing node engine in the neural network system can be obtained according to the methods described in FIGS. 6A and 7. For example, a mapping relationship between the P computing units and the Q computing units and the multiple third-level nodes, multiple second-level computing nodes, and multiple first-level computing nodes may be obtained, respectively.
  • the weights of the corresponding neural network layer can be deployed to the computing units of the computing nodes at all levels according to the obtained mapping relationship.
  • the N weights of the i-1th layer can be deployed in the four computing units corresponding to chip1, tile1, and PE1 and the four computing units corresponding to chip1, tile2, and PE3, respectively.
  • the M second weights of the i-th layer are respectively deployed to two computing units corresponding to chip1, tile1 and PE2 and two computing units corresponding to chip1, tile2 and PE4.
  • the N weights of the i-1 layer are respectively deployed in the four computing units in chip1—>tile1—>PE1 and the four computing units in chip1—>tile2—>PE3.
  • the M weights of the i-th layer are respectively deployed in two computing units in chip1—>tile1—>PE2 and two computing units in chip1—>tile2—>PE4.
  • According to the method described in the embodiments of the present invention, not only can the computing power of the computing units supporting the operations of adjacent neural network layers be matched, but also the computing units performing the operations of adjacent neural network layers can be located in the same third-level computing node as much as possible, the third-level computing nodes executing adjacent neural network layers can be located in the same second-level computing node as much as possible, and the second-level computing nodes executing adjacent neural network layers can be located in the same first-level computing node (for example, a neural network chip) as much as possible.
  • In the above embodiments, the fourth-level computing node engine is used as the computing unit to describe the process of allocating computing resources for performing the operations of different neural network layers; that is, the above embodiments divide the sets that perform the operations of different neural network layers with the engine as the granularity.
  • In actual applications, the third-level computing node PE can also be used as the computing unit for allocation, and the mapping between the third-level computing node PE, the second-level computing node tile, and the first-level computing node chip can be established according to the above method.
  • the calculation unit may be engine, PE, tile, or chip, which is not limited herein.
  • FIG. 8 is a flowchart of a data processing method according to an embodiment of the present invention. This method is applied to the neural network system shown in FIG. 1, and the neural network system shown in FIG. 1 is configured by the method shown in FIGS. 5-7 to allocate computing resources for performing different neural network layer operations. As shown in FIG. 8, the method may be implemented by the neural network circuit shown in FIG. 1. The method may include the following steps.
  • P computing units in the neural network system receive first input data.
  • the P computing units are used to perform the first neural network layer operation of the neural network system.
  • the first neural network layer is any layer in the neural network system.
  • the first input data is data that needs to perform the operation of the first neural network layer.
  • the first input data may be data input to the neural network system for the first time.
  • the first input data may be output data processed by other neural network layers.
  • the P calculation units perform calculation on the first input data according to the configured N first weights to obtain first output data.
  • the first weight is a weight matrix.
  • the N first weights refer to N weight matrices, and the N first weights may also be referred to as N first weight copies.
  • The N first weights may be configured in the P calculation units according to the methods shown in FIGS. 5-7. Specifically, the elements in the first weights are respectively configured into the ReRAM cells of the crossbars included in the P calculation units, so that the crossbars in the P calculation units can perform parallel computation on the input data based on the N first weights, making full use of the computing power of the crossbars in the P computing units.
  • the P computing units may perform a neural network operation on the received first input data based on the configured N first weights to obtain the first output data.
  • the crossbar in the P calculation units may perform a matrix multiply-add operation on the first input data and the configured first weight.
  • the Q computing units in the neural network system receive second input data.
  • the Q calculation units are used to perform a second neural network layer operation of the neural network system, and the second input data includes the first output data.
  • the Q calculation units may only perform the operation of the second neural network layer on the first output data of the P calculation units.
  • for example, the P computing units are used to perform the operations of the first layer 302 shown in FIG. 3, and the Q computing units are used to perform the operations of the second layer 304 shown in FIG. 3.
  • in this case, the second input data is the first output data.
  • the Q calculation units may also be used to jointly perform a second neural network operation on the first output data of the first neural network layer and the output data of other neural network layers.
  • for example, the P computing units may be used to perform the neural network operation of the second layer 304 shown in FIG. 3, and the Q computing units may be used to perform the neural network operation of the fifth layer 310 shown in FIG. 3.
  • in this case, the Q computing units are used to perform operations on the output data of the second layer 304 and the fourth layer 308, and the second input data includes the first output data and the output data of the fourth layer 308, as illustrated in the sketch below.
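As an illustration of this case, the following sketch (hypothetical Python with made-up feature-map shapes; not part of the patent) shows how the second input data of the Q computing units can be assembled from the first output data of the second layer 304 and the output data of the fourth layer 308 before the summation operation of the fifth layer 310 is performed.

```python
import numpy as np

# Hypothetical feature maps; the shapes are assumptions for illustration only.
second_layer_output = np.random.rand(64, 64)   # first output data (layer 304)
fourth_layer_output = np.random.rand(64, 64)   # output data of layer 308

# The second input data of the Q computing units includes both outputs.
second_input = (second_layer_output, fourth_layer_output)

# The fifth layer 310 performs a summation operation on the two outputs.
second_output = second_input[0] + second_input[1]
print(second_output.shape)  # (64, 64)
```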
  • the Q calculation units perform calculation on the second input data according to the configured M second weights to obtain second output data.
  • the second weight is also a weight matrix.
  • the M second weights refer to M weight matrices, and the M second weights may also be referred to as M second weight copies.
  • the second weights may be configured into the ReRAM cells of the crossbars included in the Q computing units according to the method shown in FIGS. 5-7.
  • the Q calculation units may perform a neural network operation on the received second input data based on the configured M second weights to obtain the second output data.
  • the crossbar in the Q calculation units may perform a matrix multiply-add operation on the second input data and the configured second weight.
  • the ratio of N and M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data.
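As a worked example of this proportionality (a sketch under assumed output sizes, not values taken from the patent): if the first layer outputs a 200x200 feature map and the second layer, after pooling, outputs 100x100, then N:M = 40000:10000 = 4:1, so configuring N = 8 first weight copies suggests M = 2 second weight copies.

```python
# Hypothetical output sizes; only the rule N/M = V1/V2 comes from the text.
v1 = 200 * 200            # data volume of the first output data
v2 = 100 * 100            # data volume of the second output data

n = 8                      # chosen number of first weight copies
m = round(n * v2 / v1)     # N/M = V1/V2  =>  M = N * V2 / V1
print(m)                   # 2
```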
  • the weight matrix of j rows and k columns shown in FIG. 9 may be a weight of a neural network layer, and each element in the weight matrix represents a weight value.
  • FIG. 10 is a schematic structural diagram of a ReRAM crossbar in a computing unit provided by an embodiment of the present invention.
  • the ReRAM crossbar may be simply referred to as a crossbar in this embodiment of the present invention.
  • the crossbar includes multiple ReRAM cells, such as G 1,1 , G 2,1, and so on.
  • the multiple ReRAM cells constitute a neural network matrix.
  • for example, the weight element W0,0 in FIG. 9 is configured in G1,1 in FIG. 10, the weight element W1,0 in FIG. 9 is configured in G2,1 in FIG. 10, and so on.
  • each weight element corresponds to one ReRAM cell.
  • input data is input to the crossbar through the crossbar word line (input port 1004 shown in FIG. 10).
  • the input data can be expressed as voltages, so that the input data and the weight values configured in the ReRAM cells can be dot-multiplied, and the calculated result is obtained in the form of output voltages from the output terminal of each column of the crossbar (the output port 1006 shown in FIG. 10).
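The analog computation described above can be mirrored numerically. The sketch below (an idealized Python model; real ReRAM crossbars work with conductances, voltages, and ADCs, none of which are modeled here) shows how a crossbar whose cells G hold a weight matrix produces, in one step, the dot product of an input voltage vector with every column.

```python
import numpy as np

rows, cols = 4, 3
# G[r][c] plays the role of the conductance of the ReRAM cell at
# word line r, bit line c; it stores one weight element (cf. FIG. 9/10).
G = np.random.rand(rows, cols)

# Input data applied as voltages on the word lines (input port 1004).
v_in = np.random.rand(rows)

# Each column accumulates I_c = sum_r v_in[r] * G[r][c]; the whole
# matrix-vector multiply happens in a single step (output port 1006).
i_out = v_in @ G
print(i_out.shape)  # (3,)
```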
  • when the computing units that perform each layer of neural network operations in the neural network system are configured, the amount of data output by adjacent neural network layers is taken into account, so that the computing power of the computing nodes that perform the operations of adjacent neural network layers can be matched. Therefore, the data processing method provided by the embodiment of the present invention can make full use of the computing power of the computing nodes and improve the data processing efficiency of the neural network system.
  • an embodiment of the present invention provides a resource allocation apparatus.
  • the device can be applied to the neural network system shown in FIG. 1 and FIG. 1A, and is used to allocate the computing nodes that perform the operations of different neural network layers, so that the computing power of the computing nodes used to perform the operations of two adjacent neural network layers in the neural network system is matched, which improves the data processing efficiency in the neural network system without wasting computing resources.
  • the resource allocation device may be located in the host, may be implemented by a processor in the host, or may be a physical device that exists independently of the processor. For example, it can be used as a processor-independent compiler.
  • the resource allocation apparatus 1100 may include an acquisition module 1102, a calculation module 1104, and a deployment module 1106.
  • An obtaining module 1102, configured to obtain the data volume of the first output data of the first neural network layer and the data volume of the second output data of the second neural network layer in the neural network system, where the input data of the second neural network layer includes the first output data.
  • the calculation module 1104 is configured to determine N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer according to deployment requirements of the neural network system. Wherein, N and M are both positive integers, and the ratio of N and M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data.
  • the neural network system includes multiple neural network chips, each neural network chip includes multiple computing units, and each computing unit includes at least one resistive random access memory cross matrix ReRAM crossbar .
  • the deployment requirement includes a calculation delay.
  • the calculation module is configured to determine the value of N according to the data volume of the first output data, the calculation delay, and the calculation frequency of the resistive random access memory crossbar (ReRAM crossbar) in the computing unit, and to determine the value of M according to the ratio of the data volume of the first output data to the data volume of the second output data and the value of N.
  • the deployment requirement includes the number of chips of the neural network system, and the first neural network layer is the starting layer of the neural network system.
  • in this case, the calculation module is configured to determine the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratio of the output data volumes of adjacent neural network layers, and to determine the value of M according to the ratio of the data volume of the first output data to the data volume of the second output data and the value of N.
  • A deployment module 1106, configured to deploy the N first weights to P computing units and deploy the M second weights to Q computing units according to the calculation specifications of the computing units in the neural network system, where P and Q are both positive integers, the P computing units are used to perform the operations of the first neural network layer, and the Q computing units are used to perform the operations of the second neural network layer.
  • the calculation specification of the calculation unit refers to the number of crossbars included in one calculation unit. In practical applications, a computing unit may include one or more crossbars. Specifically, after the calculation module 1104 obtains the number of weights to be configured for each layer of the neural network, the deployment module 1106 may deploy the weights of each layer on the corresponding calculation unit according to the number of crossbars included in each calculation unit.
  • the elements in the weight matrix are respectively configured in the ReRAM cells of the crossbar of the calculation unit.
  • the computing unit may refer to a PE or an engine, one PE may include multiple engines, and one engine may include one or more crossbars. Since the weight of each layer may be different, a weight can be deployed on one or more engines.
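A small sketch of this deployment step (hypothetical Python; the crossbar counts are assumptions): given the number of crossbars one weight copy occupies and the number of crossbars one computing unit (engine) provides, it derives how many engines each weight copy spans and how many computing units the N copies of a layer require.

```python
import math

def units_for_layer(copies, xb_per_weight, xb_per_unit):
    """Number of computing units (e.g. engines) needed to hold `copies`
    weight copies, each occupying `xb_per_weight` crossbars, when one
    unit contains `xb_per_unit` crossbars."""
    units_per_copy = math.ceil(xb_per_weight / xb_per_unit)
    return copies * units_per_copy

# Assumed example: N = 8 first-weight copies, each needing 2 crossbars,
# with 1 crossbar per engine -> one copy spans 2 engines, so P = 16.
print(units_for_layer(copies=8, xb_per_weight=2, xb_per_unit=1))  # 16
```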
  • the neural network system shown in FIG. 1 includes multiple neural network chips, each neural network chip includes multiple secondary computing nodes, and each secondary computing node includes multiple computing units.
  • the resource allocation device 1100 may further include a mapping module 1108 for mapping the computing unit to the superior computing node of the computing unit. Specifically, after the calculation module 1104 obtains N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer, the mapping module 1108 is used to establish The mapping relationship between the N first weights and the P computing units, and the mapping relationship between the M second weights and the Q computing units is established.
  • the mapping module 1108 is further configured to map the P computing units and the Q computing units into multiple secondary computing nodes according to the number of computing units included in a secondary computing node in the neural network system, where at least a part of the P computing units and at least a part of the Q computing units are mapped into the same secondary computing node.
  • the mapping module 1108 is further configured to map the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips according to the number of secondary computing nodes included in each neural network chip, where at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped into the same neural network chip.
  • for details of how the mapping module 1108 establishes the mapping relationship between the N first weights and the P computing units, establishes the mapping relationship between the M second weights and the Q computing units, and maps the P computing units and the Q computing units to the upper-level computing nodes of the computing units, reference may be made to the foregoing method embodiments.
  • An embodiment of the present invention also provides a computer program product that implements the above resource allocation method, and an embodiment of the present invention also provides a computer program product that implements the above data processing method.
  • each of the above computer program products includes a computer-readable storage medium that stores program code.
  • the instructions included in the program code are used to execute the method flow described in any one of the foregoing method embodiments.
  • Persons of ordinary skill in the art may understand that the foregoing storage medium includes: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random access memory (Random-Access Memory, RAM), a solid state drive (Solid State Drive, SSD), or another non-transitory machine-readable medium that can store program code, such as a non-volatile memory.

Abstract

The present application provides a computing resource allocation technology and a neural network system. The neural network system comprises a processor and a plurality of neural network chips connected to the processor, and each neural network chip comprises a plurality of computing units having a computing-in-memory function. The processor configures, for each neural network layer, according to output data volumes of neural network layers in the neural network system, a computing unit for executing operations of this neural network layer, so that the computing power of the computing units for executing the operations of adjacent neural network layers matches each other. The neural network system provided by the present application can be applied to the field of artificial intelligence, and improve the data processing efficiency of the neural network system.

Description

Computing resource allocation technology and neural network system

Technical field

The present application relates to the field of computer technology, and in particular, to a computing resource allocation technology and a neural network system.
Background

Deep learning (DL) is an important branch of artificial intelligence (AI). Deep learning is a neural network constructed to imitate the human brain, and it can achieve better recognition results than traditional shallow learning methods. The convolutional neural network (CNN) is one of the most common deep learning architectures and the most widely studied deep learning method. A typical application field of convolutional neural networks is image processing. Image processing identifies and analyzes an input image and finally outputs a set of classified image content. For example, a convolutional neural network algorithm can be used to extract and classify the body color, license plate number, and model of a motor vehicle in a picture.

A convolutional neural network usually extracts the features of a picture through a three-layer sequence: a convolutional layer, a pooling layer, and rectified linear units (ReLU). The process of extracting picture features is actually a series of matrix operations (for example, matrix multiply-add operations). Therefore, how to process the pictures in the network in parallel and quickly becomes a problem to be studied in convolutional neural networks.
Summary of the invention

The present application provides a computing resource allocation technology and a neural network system, which can improve the data processing speed in a neural network.

In a first aspect, an embodiment of the present invention provides a computing resource allocation method applied in a neural network system. The method may be performed by a host connected to the neural network chips. According to the method, after the data volume of the first output data of a first neural network layer and the data volume of the second output data of a second neural network layer in the neural network system are obtained, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer are determined according to the deployment requirement of the neural network system. Further, according to the calculation specifications of the computing units in the neural network system, the N first weights are deployed on P computing units, and the M second weights are deployed on Q computing units. The input data of the second neural network layer includes the first output data, N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data. P and Q are both positive integers, the P computing units are used to perform the operations of the first neural network layer, and the Q computing units are used to perform the operations of the second neural network layer.

The computing resource allocation method provided by the embodiments of the present invention takes into account the amount of data output by adjacent neural network layers when the computing units that perform each layer of neural network operations are configured according to the deployment requirement, so that the computing power of the computing nodes that perform the operations of different neural network layers is matched. Thus, the computing power of the computing nodes that perform each layer of neural network operations can be fully utilized, and the efficiency of data processing is improved.
With reference to the first aspect, in a possible implementation manner, the deployment requirement includes a calculation delay, and the first neural network layer is the starting layer of all neural network layers in the neural network system. The determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer includes: determining the value of N according to the data volume of the first output data, the calculation delay, and the calculation frequency of the resistive random access memory crossbar (ReRAM crossbar) in the computing unit; and determining the value of M according to the ratio of the data volume of the first output data to the data volume of the second output data and the value of N.
Specifically, in a possible implementation manner, when the first neural network layer is the starting layer of all neural network layers in the neural network system, the value of N may be obtained according to the following formula:

$$N_{1} = \left\lceil \frac{H_{1}^{out} \times W_{1}^{out}}{t \times f} \right\rceil$$

where $N_{1}$ indicates the number N of weights to be configured for the first-layer neural network, $H_{1}^{out}$ is the number of rows of the output data of the first-layer neural network, $W_{1}^{out}$ is the number of columns of the output data of the first-layer neural network, t is the set calculation delay, and f is the calculation frequency of the crossbar in the computing unit. In other words, N is the number of output elements the starting layer must produce, divided by the number of computations one crossbar completes within the delay t. The value of M can then be calculated according to the following formula: N/M = (data volume of the first output data)/(data volume of the second output data).
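The following sketch (hypothetical Python, using assumed layer dimensions) applies the delay-based formula reconstructed above: N is the number of first-layer output elements divided by how many computations one crossbar completes within the delay t at frequency f, and M then follows from the output-volume ratio.

```python
import math

def copies_for_delay(h_out, w_out, t, f):
    """N = ceil(H_out * W_out / (t * f)): enough weight copies that the
    layer's whole output can be produced within the delay t."""
    return math.ceil((h_out * w_out) / (t * f))

# Assumed values: 224x224 first-layer output, 1e-4 s delay, 1e8 Hz crossbar.
n = copies_for_delay(224, 224, t=1e-4, f=1e8)

# M from the ratio N/M = V1/V2, with an assumed 112x112 second-layer output.
m = max(1, round(n * (112 * 112) / (224 * 224)))
print(n, m)  # 6 2
```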
With reference to the first aspect, in yet another possible implementation manner, the neural network system includes multiple neural network chips, each neural network chip includes multiple computing units, each computing unit includes at least one resistive random access memory crossbar (ReRAM crossbar), and the deployment requirement includes the number of chips of the neural network system. When the first neural network layer is the starting layer of the neural network system, the determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer includes: determining the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratio of the output data volumes of adjacent neural network layers; and determining the value of M according to the ratio of the data volume of the first output data to the data volume of the second output data and the value of N.
Specifically, in a possible implementation manner, when the deployment requirement is the number of chips required by the neural network system, and the first neural network layer is the starting layer of the neural network system, the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer may be obtained according to the following two formulas, where the value of N is the value of $N_{1}$ and the value of M is the value of $N_{2}$:

$$xb_{1} \times N_{1} + xb_{2} \times N_{2} + \cdots + xb_{n} \times N_{n} \leq K \times L$$

$$\frac{N_{i}}{N_{i-1}} = \frac{H_{i}^{out} \times W_{i}^{out}}{H_{i-1}^{out} \times W_{i-1}^{out}}, \quad i = 2, \ldots, n$$

where $xb_{1}$ represents the number of crossbars required to deploy one weight of the first-layer (also called starting-layer) neural network, $N_{1}$ represents the number of weights required by the starting layer, $xb_{2}$ represents the number of crossbars required to deploy one weight of the second-layer neural network, $N_{2}$ represents the number of weights required by the second-layer neural network, $xb_{n}$ represents the number of crossbars required to deploy one weight of the n-th-layer neural network, and $N_{n}$ represents the number of weights required by the n-th-layer neural network. K is the number of chips of the neural network system required by the deployment requirement, and L is the number of crossbars in each chip; the first formula thus states that the crossbars occupied by all weight copies do not exceed the K × L crossbars the chips provide. In the second formula, $N_{i}$ represents the number of weights required by the i-th layer, $N_{i-1}$ represents the number of weights required by the (i-1)-th layer, $H_{i}^{out}$ and $W_{i}^{out}$ represent the numbers of rows and columns of the output data of the i-th layer, and $H_{i-1}^{out}$ and $W_{i-1}^{out}$ represent the numbers of rows and columns of the output data of the (i-1)-th layer; the value of i can be from 2 to n, where n is the total number of neural network layers in the neural network system.
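A sketch of how these two formulas can be solved together (hypothetical Python with assumed per-layer sizes): the ratio formula fixes every $N_{i}$ as a multiple $r_{i}$ of $N_{1}$, and substituting into the crossbar-budget formula gives $N_{1}$, after which all other copy counts follow.

```python
import math

def copy_counts(out_sizes, xb, K, L):
    """out_sizes[i] = (H_out, W_out) of layer i; xb[i] = crossbars per
    weight copy of layer i; K chips with L crossbars each.
    Returns [N_1, ..., N_n] satisfying N_i/N_1 = V_i/V_1 and
    sum_i xb_i * N_i <= K * L."""
    v = [h * w for h, w in out_sizes]
    r = [vi / v[0] for vi in v]               # N_i = r_i * N_1
    n1 = math.floor(K * L / sum(x * ri for x, ri in zip(xb, r)))
    return [max(1, round(n1 * ri)) for ri in r]

# Assumed 3-layer example: shrinking outputs, two crossbars per copy in
# layer 2, one elsewhere, and a budget of 2 chips x 100 crossbars.
print(copy_counts([(32, 32), (16, 16), (8, 8)], xb=[1, 2, 1], K=2, L=100))
# [128, 32, 8] -> 128 + 2*32 + 8 = 200 crossbars, exactly the budget
```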
With reference to the first aspect, in yet another possible implementation manner, the neural network system includes multiple neural network chips, each neural network chip includes multiple secondary computing nodes, and each secondary computing node includes multiple computing units. The method further includes mapping the P computing units and the Q computing units into multiple secondary computing nodes according to the number of computing units included in a secondary computing node in the neural network system, where at least a part of the P computing units and at least a part of the Q computing units are mapped into the same secondary computing node. In this manner, as many as possible of the computing units that perform the operations of adjacent neural network layers are located in the same secondary computing node, which can reduce the amount of data transmitted between computing nodes and increase the speed of data transmission between different neural network layers.
In yet another possible implementation manner, the method further includes mapping the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips according to the number of secondary computing nodes included in each neural network chip, where at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped into the same neural network chip. In this manner, as many as possible of the secondary computing nodes that perform the operations of adjacent neural network layers are located in the same neural network chip, which further reduces the amount of data transmitted between computing nodes and increases the speed of data transmission between different neural network layers.
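The following sketch (hypothetical Python; the node capacities are assumptions) illustrates the intent of these two mapping steps: the computing units of adjacent layers are packed into secondary computing nodes (tiles) in order, so the tile where the layer boundary falls holds units of both layers and the output of one layer travels as short a path as possible to the next. The same procedure applies one level up when packing tiles into chips.

```python
def map_units_to_tiles(p_units, q_units, units_per_tile):
    """Map the P units (first layer) and Q units (second layer) into tiles
    in order; the tile where the layer boundary falls holds units of both
    layers, so part of P and part of Q share a secondary computing node."""
    ordered = p_units + q_units
    return [ordered[i:i + units_per_tile]
            for i in range(0, len(ordered), units_per_tile)]

# Assumed: P = 6 first-layer engines, Q = 2 second-layer engines,
# 4 engines per tile -> tile1: [P1..P4], tile2: [P5, P6, Q1, Q2].
p = [f"P{i}" for i in range(1, 7)]
q = [f"Q{i}" for i in range(1, 3)]
for t, units in enumerate(map_units_to_tiles(p, q, 4), start=1):
    print(f"tile{t}: {units}")
```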
With reference to the first aspect and the foregoing possible implementation manner of the first aspect, in yet another possible implementation manner, that the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data includes: the ratio of N to M is the same as the ratio of the data volume of the first output data to the data volume of the second output data.

In a second aspect, the present application provides a neural network system, including a host and multiple neural network chips, where each neural network chip includes multiple computing units, and the host is connected to the multiple neural network chips and is configured to execute the computing resource allocation method described in the first aspect or any possible implementation manner of the first aspect.

In a third aspect, the present application provides a resource allocation apparatus, including functional modules capable of executing the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.

In a fourth aspect, the present application further provides a computer program product, including program code, where the instructions included in the program code are executed by a computer to implement the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.

In a fifth aspect, the present application further provides a computer-readable storage medium, configured to store program code, where the instructions included in the program code are executed by a computer to implement the computing resource allocation method described in the first aspect and any possible implementation manner of the first aspect.
Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention.
FIG. 1 is a schematic structural diagram of a neural network system according to an embodiment of the present invention;

FIG. 1A is a schematic structural diagram of another neural network system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a computing node in a neural network chip according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the logical structure of the neural network layers in a neural network system according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of sets of computing nodes that process the data of different neural network layers in a neural network system according to an embodiment of the present invention;

FIG. 5 is a flowchart of a computing resource allocation method in a neural network system according to an embodiment of the present invention;

FIG. 6 is a flowchart of another computing resource allocation method according to an embodiment of the present invention;

FIG. 6A is a flowchart of a resource mapping method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of another computing resource allocation method according to an embodiment of the present invention;

FIG. 8 is a flowchart of a data processing method according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a weight according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a resistive random access memory crossbar (ReRAM crossbar) according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a resource allocation apparatus according to an embodiment of the present invention.
Detailed description

In order to enable persons skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them.
Deep learning (DL) is an important branch of artificial intelligence (AI). Deep learning is a neural network constructed to imitate the human brain, and it can achieve better recognition results than traditional shallow learning methods. An artificial neural network (ANN), abbreviated as a neural network (NN) or a neural-like network, is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system of animals, in particular the brain) and is used to estimate or approximate functions. Artificial neural networks include convolutional neural networks (CNN), deep neural networks (DNN), multilayer perceptrons (MLP), and other neural networks. FIG. 1 is a schematic structural diagram of an artificial neural network system according to an embodiment of the present invention, taking a convolutional neural network as an example. As shown in FIG. 1, the convolutional neural network system 100 may include a host 105 and a convolutional neural network circuit 110. The convolutional neural network circuit 110 may also be referred to as a neural network accelerator. The convolutional neural network circuit 110 is connected to the host 105 through a host interface. The host interface may include a standard host interface or a network interface; for example, the host interface may include a Peripheral Component Interconnect Express (PCIE) interface. As shown in FIG. 1, the convolutional neural network circuit 110 may be connected to the host 105 through a PCIE bus 106. Therefore, data can be input into the convolutional neural network circuit 110 through the PCIE bus 106, and the data processed by the convolutional neural network circuit 110 can be received through the PCIE bus 106. In addition, the host 105 may monitor the working state of the convolutional neural network circuit 110 through the host interface.
The host 105 may include a processor 1052 and a memory 1054. It should be noted that, in addition to the devices shown in FIG. 1, the host 105 may further include a communication interface and other devices such as a magnetic disk serving as external storage, which is not limited herein.

The processor 1052 is the operation core and control unit of the host 105. The processor 1052 may include multiple processor cores. The processor 1052 may be a very-large-scale integrated circuit. An operating system and other software programs are installed in the processor 1052, so that the processor 1052 can access the memory 1054, the cache, the disk, and peripheral devices (such as the neural network circuit in FIG. 1). It can be understood that, in the embodiments of the present invention, a core of the processor 1052 may be, for example, a central processing unit (CPU) or another application-specific integrated circuit (ASIC).

The memory 1054 is the main memory of the host 105. The memory 1054 is connected to the processor 1052 through a double data rate (DDR) bus. The memory 1054 is generally used to store the various running software of the operating system, input and output data, and information exchanged with external storage. In order to increase the access speed of the processor 1052, the memory 1054 needs to have a fast access speed. In a traditional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 1054. The processor 1052 can access the memory 1054 at a high speed through a memory controller (not shown in FIG. 1), and can perform a read operation and a write operation on any storage unit in the memory 1054.
The convolutional neural network (CNN) circuit 110 is a chip array composed of multiple neural network (NN) chips. For example, as shown in FIG. 1, the CNN circuit 110 includes multiple NN chips 115 and multiple routers 120. For convenience of description, the NN chip 115 is referred to as the chip 115 for short in the embodiments of the present invention. The multiple chips 115 are connected to each other through the routers 120; for example, one chip 115 may be connected to one or more routers 120. The multiple routers 120 may form one or more network topologies, and data can be transmitted between the chips 115 through the one or more network topologies. For example, the multiple routers 120 may form a first network 1106 and a second network 1108, where the first network 1106 is a ring network and the second network 1108 is a two-dimensional mesh (2D mesh) network. Thus, the data input from the input port 1102 can be sent to the corresponding chip 115 through the network formed by the multiple routers 120, and the data processed by any chip 115 can also be sent through that network to other chips 115 for processing or sent out from the output port 1104.

Further, FIG. 1 also shows a schematic structural diagram of the chip 115. As shown in FIG. 1, the chip 115 may include multiple neural network processing units 125 and multiple routers 122. FIG. 1 takes a tile as the neural network processing unit for description. In the architecture of the data processing chip 115 shown in FIG. 1, one tile 125 may be connected to one or more routers 122. The multiple routers 122 in the chip 115 may form one or more network topologies, and data can be transmitted between the tiles 125 through these network topologies. For example, the multiple routers 122 may form a first network 1156 and a second network 1158, where the first network 1156 is a ring network and the second network 1158 is a two-dimensional mesh (2D mesh) network. Thus, the data input to the chip 115 from the input port 1152 can be sent to the corresponding tile 125 through the network formed by the multiple routers 122, and the data processed by any tile 125 can also be sent through that network to other tiles 125 or sent out from the output port 1154.

It should be noted that, when the chips 115 are interconnected by routers, the one or more network topologies formed by the multiple routers 120 in the convolutional neural network circuit 110 and the network topologies formed by the multiple routers 122 in the data processing chip 115 may be the same or different, as long as data can be transmitted between the chips 115 or between the tiles 125 through the network topologies, and the chips 115 or the tiles 125 can receive data or output data through the network topologies. The number and types of the networks formed by the multiple routers 120 and 122 are not limited in the embodiments of the present invention. Moreover, in the embodiments of the present invention, the routers 120 and the routers 122 may be the same or different; for clarity of description, in FIG. 1, the routers 120 connected to the chips and the routers 122 connected to the tiles are distinguished by different reference signs. For convenience of description, in the embodiments of the present invention, a chip 115 or a tile 125 in the convolutional neural network system may also be referred to as a computing node.

In practical applications, in another case, the chips 115 may also be interconnected through high-speed interfaces (High Transport IO) instead of through the routers 120. As shown in FIG. 1A, FIG. 1A is a schematic structural diagram of another neural network system according to an embodiment of the present invention. In the neural network system shown in FIG. 1A, the host 105 is connected to multiple PCIE cards 109 through a PCIE interface 107, each PCIE card 109 may include multiple neural network chips 115, and the neural network chips are connected through high-speed interconnection interfaces. The interconnection manner between the chips is not limited here. It can be understood that, in practical applications, the tiles inside a chip may also be connected not through routers but in the high-speed interconnection manner between chips shown in FIG. 1A. In another case, the tiles inside a chip may be connected through the routers shown in FIG. 1 while the chips are interconnected in the high-speed manner shown in FIG. 1A. The embodiments of the present invention do not limit the connection manner between chips or inside a chip.
FIG. 2 is a schematic structural diagram of a computing node in a neural network chip according to an embodiment of the present invention. As shown in FIG. 2, the chip 115 includes multiple routers 120, and each router may be connected to one tile 125; in practical applications, one router 120 may also be connected to multiple tiles 125. As shown in FIG. 2, each tile 125 may include an input/output interface (TxRx) 1252, a switching device (TSW) 1254, and multiple processing elements (PE) 1256. The TxRx 1252 is used to receive the data input to the tile 125 from the router 120 or to output the calculation result of the tile 125; in other words, the TxRx 1252 is used to implement the data transmission between the tile 125 and the router 120. The switching device (TSW) 1254 is connected to the TxRx 1252, and the TSW 1254 is used to implement the data transmission between the TxRx 1252 and the multiple PEs 1256. Each PE 1256 may include one or more computing engines 1258, and the one or more computing engines 1258 are used to perform neural network calculations on the data input into the PE 1256, for example, a multiply-add operation on the data input to the tile 125 and a convolution kernel preset in the tile 125. The calculation result of an engine 1258 may be sent to other tiles 125 through the TSW 1254 and the TxRx 1252. In practical applications, an engine 1258 may include modules that implement convolution, pooling, or other neural network operations; the specific circuit or function of the engine is not limited here. For simplicity of description, in the embodiments of the present invention, the computing engine is referred to as the engine for short.
Those skilled in the art know that the resistive random-access memory (ReRAM), a new type of non-volatile memory, has the advantage of integrating storage and computing, and has therefore also been widely applied to neural network systems in recent years. For example, a resistive random access memory crossbar (ReRAM crossbar) composed of multiple ReRAM cells can be used to perform matrix multiply-add operations in a neural network system. In the embodiments of the present invention, the engine 1258 may include one or more crossbars. The structure of a ReRAM crossbar may be as shown in FIG. 10, and how a matrix multiply-add operation is performed through a ReRAM crossbar will be introduced later. According to the above introduction to the neural network, the neural network circuit provided by the embodiments of the present invention includes multiple NN chips, each NN chip includes multiple tiles, each tile includes multiple processing elements (PE), each PE includes multiple engines, and each engine is implemented by one or more ReRAM crossbars. It can be seen that the neural network system provided by the embodiments of the present invention may include multiple levels of computing nodes, for example, four levels: the first-level computing node is the chip 115, the second-level computing node is a tile within the chip, the third-level computing node is a PE within a tile, and the fourth-level computing node is an engine within a PE.
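The four-level hierarchy can be pictured with a minimal data model (a Python sketch; the fan-out numbers are assumptions, not values from the patent): a chip contains tiles, a tile contains PEs, a PE contains engines, and each engine owns one or more crossbars.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Engine:                 # fourth-level computing node
    crossbars: int = 1

@dataclass
class PE:                     # third-level computing node
    engines: List[Engine] = field(default_factory=list)

@dataclass
class Tile:                   # second-level computing node
    pes: List[PE] = field(default_factory=list)

@dataclass
class Chip:                   # first-level computing node
    tiles: List[Tile] = field(default_factory=list)

# Assumed fan-out: 2 tiles/chip, 2 PEs/tile, 2 engines/PE, 1 crossbar/engine.
chip = Chip(tiles=[Tile(pes=[PE(engines=[Engine(), Engine()])
                             for _ in range(2)]) for _ in range(2)])
total_xbars = sum(e.crossbars for t in chip.tiles for p in t.pes
                  for e in p.engines)
print(total_xbars)  # 8
```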
On the other hand, those skilled in the art know that a neural network system may include multiple neural network layers. In the embodiments of the present invention, a neural network layer is a logical concept: one neural network layer means that one neural network operation is to be performed, and the calculation of each layer is implemented by computing nodes. The neural network layers may include convolutional layers, pooling layers, and the like. As shown in FIG. 3, the neural network system may include n neural network layers (also called an n-layer neural network), where n is an integer greater than or equal to 2. FIG. 3 shows some of the neural network layers in the neural network system. As shown in FIG. 3, the neural network system may include a first layer 302, a second layer 304, a third layer 306, a fourth layer 308, and a fifth layer 310, up to an n-th layer 312. The first layer 302 may perform a convolution operation, the second layer 304 may perform a pooling operation on the output data of the first layer 302, the third layer 306 may perform a convolution operation on the output data of the second layer 304, the fourth layer 308 may perform a convolution operation on the output result of the third layer 306, the fifth layer 310 may perform a summation operation on the output data of the second layer 304 and the output data of the fourth layer 308, and so on. It can be understood that FIG. 3 is only a simple example and illustration of the neural network layers in the neural network system and does not limit the specific operation of each layer; for example, the fourth layer 308 may also be a pooling operation, and the fifth layer 310 may also be another neural network operation such as a convolution operation or a pooling operation.
In an existing neural network system, after the calculation of the i-th layer in the neural network is completed, the calculation result of the i-th layer is temporarily stored in a preset buffer, and when the calculation of the (i+1)-th layer is performed, the computing unit needs to reload the calculation result of the i-th layer and the weights of the (i+1)-th layer from the preset buffer for calculation, where the i-th layer is any layer in the neural network system. In the embodiments of the present invention, because ReRAM crossbars are used in the engines of the neural network system, and because ReRAM has the advantage of integrating storage and computing, the weights can be configured on the ReRAM cells before the calculation, and the calculation results can be directly sent to the next layer for pipelined calculation. Therefore, each neural network layer only needs to buffer very little data; for example, each layer only needs to buffer enough input data for one window calculation. Further, in order to process data in parallel and quickly, the embodiments of the present invention provide a manner of stream-processing data through the neural network. For clarity of description, the stream processing of the neural network system is briefly introduced below with reference to the convolutional neural network system of FIG. 1.

As shown in FIG. 4, in order to process data quickly, the computing nodes in the system can be divided into multiple node sets to perform the calculations of different neural network layers respectively. FIG. 4 takes the division of the tiles 125 in the neural network system shown in FIG. 1 as an example to illustrate the different sets of computing nodes that implement the calculations of different neural network layers in the embodiments of the present invention. As shown in FIG. 4, the multiple tiles 125 in the chip 115 may be divided into multiple node sets, for example, a first node set 402, a second node set 404, a third node set 406, a fourth node set 408, and a fifth node set 410. Each node set includes at least one computing node (for example, a tile 125). The computing nodes of the same node set are used to perform neural network operations on the data entering the same neural network layer, and the data of different neural network layers is processed by the computing nodes of different node sets. The processing result of one computing node is transmitted to the computing nodes in other node sets for processing. This pipelined processing manner allows each neural network layer to buffer only very little data, and enables multiple computing nodes to process the same data stream concurrently, improving the processing efficiency. It should be noted that FIG. 4 takes tiles as an example of the sets of computing nodes used to process different neural network layers (for example, convolutional layers). In practical applications, since a tile contains multiple PEs, each PE contains multiple engines, and different application scenarios require different amounts of calculation, the computing nodes in the neural network system may also be divided at the granularity of PE, engine, or chip according to the actual application, so that the computing nodes in different sets are used to process the operations of different neural network layers. In this manner, a computing node referred to in the embodiments of the present invention may be an engine, a PE, a tile, or a chip.
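A toy model of this pipelined stream processing (a Python sketch; the "operations" are stand-ins, not the patent's actual layers): each node set applies its layer to whatever arrives and immediately forwards the result, so successive inputs are processed by different node sets concurrently in hardware.

```python
# Each node set performs one layer's operation and forwards the result.
node_sets = [
    lambda x: x * 2,      # set 402: stands in for the layer-302 convolution
    lambda x: x - 1,      # set 404: stands in for the layer-304 pooling
    lambda x: x + 10,     # set 406: stands in for the layer-306 convolution
]

def stream(inputs):
    """Feed a stream through the pipeline; in hardware all stages work
    on different elements of the stream at the same time."""
    for x in inputs:
        for layer in node_sets:
            x = layer(x)
        yield x

print(list(stream([1, 2, 3])))  # [11, 13, 15]
```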
In addition, those skilled in the art know that when a computing node (for example, a tile 125) performs a neural network operation (for example, a convolution calculation), it may calculate the data input to the computing node based on the weights of the corresponding neural network layer. For example, a tile 125 may perform a convolution operation on the input data of the tile 125 based on the weights of the corresponding convolutional layer, for example, perform a matrix multiply-add calculation on the weights and the input data. A weight is usually used to indicate the importance of input data to output data, and in a neural network, a weight is usually represented by a matrix. As shown in FIG. 9, the weight matrix of j rows and k columns shown in FIG. 9 may be one weight of a neural network layer, and each element in the weight matrix represents a weight value. In the embodiments of the present invention, since the computing nodes of one node set are used to perform the operation of one neural network layer, the computing nodes of the same node set can share weights, and the weights of the computing nodes in different node sets may be different. In the embodiments of the present invention, the weights in each computing node can be configured in advance. Specifically, each element of a weight matrix is configured in a ReRAM cell of the corresponding crossbar array, so that the matrix multiply-add operation of the input data and the configured weights can be implemented through the crossbar array. How a matrix multiply-add operation is implemented through a crossbar will be briefly introduced later.
According to the above description, in the embodiment of the present invention, in the process of implementing neural network stream processing, the computing nodes in the neural network may be divided into node sets for processing different neural network layers, and corresponding weights are configured, so that the computing nodes of different node sets can perform the corresponding calculations according to the configured weights, and the computing nodes of each node set can send their calculation results to the computing nodes that perform the next layer of neural network operations. Those skilled in the art will know that, in the stream processing of a neural network, if the computing resources performing the operations of different layers do not match, for example, if few computing resources perform the operations of an upper layer while relatively many computing resources perform the operations of the next layer, the computing resources of the next layer's computing nodes will be wasted. In order to make full use of the computing power of the computing nodes and match the computing power of the computing nodes that perform the operations of different neural network layers, an embodiment of the present invention provides a computing resource allocation method for allocating the computing nodes that perform the operations of different neural network layers, so that the computing power of the computing nodes performing the operations of two adjacent neural network layers in the neural network system is matched, which improves the data processing efficiency of the neural network system without wasting computing resources.
FIG. 5 is a flowchart of a computing resource allocation method in a neural network system according to an embodiment of the present invention. The method can be applied to the neural network system shown in FIG. 1. The method may be implemented by the host 105 when the neural network is deployed or when the neural network system is configured; specifically, it may be implemented by the processor 1052 in the host 105. As shown in FIG. 5, the method may include the following steps.
In step 502, the network model information of the neural network system is obtained. The network model information includes the first output data amount of the first neural network layer and the second output data amount of the second neural network layer in the neural network system. The network model information can be determined according to actual application requirements. For example, the total number of neural network layers and the algorithm of each layer can be determined according to the application scenario of the neural network system. The network model information may include the total number of neural network layers in the neural network system, the algorithm of each layer, and the output data amount of each layer. In the embodiments of the present invention, an algorithm refers to a neural network operation that needs to be performed, for example, a convolution operation or a pooling operation. As shown in FIG. 3, the neural network system of the embodiment of the present invention may have n neural network layers, where n is an integer not less than 2. In this step, the first neural network layer and the second neural network layer may be two of the n layers that have an operational dependency. In the embodiment of the present invention, two neural network layers having a dependency relationship means that the input data of one neural network layer includes the output data of the other neural network layer. Two neural network layers with a dependency relationship may also be referred to as adjacent layers. For example, as shown in FIG. 3, the output data of the first layer 302 is the input data of the second layer 304; therefore, the first layer 302 and the second layer 304 have a dependency relationship. The output data of the second layer 304 is the input data of the third layer 306, and the input data of the fifth layer 310 includes the output data of the second layer 304; therefore, the second layer 304 has a dependency relationship with the third layer 306 and also with the fifth layer 310. For clarity of description, the embodiment of the present invention is described by taking the first layer 302 shown in FIG. 3 as the first neural network layer and the second layer 304 as the second neural network layer.
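As a non-limiting illustration of what such network model information might look like in software, the following Python sketch defines one possible structure; the class and field names (LayerModelInfo, algorithm, out_rows, out_cols, out_channels) are hypothetical and are not part of the embodiment itself.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayerModelInfo:
    algorithm: str     # neural network operation of this layer, e.g. "conv" or "pool"
    out_rows: int      # number of rows of this layer's output data
    out_cols: int      # number of columns of this layer's output data
    out_channels: int  # number of output channels (kernels) of this layer

@dataclass
class NetworkModelInfo:
    layers: List[LayerModelInfo]  # one entry per neural network layer, in order

    @property
    def total_layers(self) -> int:
        return len(self.layers)
```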
In step 504, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer are determined according to the deployment requirements of the neural network system, the first output data amount, and the second output data amount. N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the first output data amount to the second output data amount. In practical applications, the deployment requirements may include the calculation delay of the neural network system, or the number of chips on which the neural network system is to be deployed. Those skilled in the art will know that neural network operations mainly perform matrix multiply-add operations, and the output data of each neural network layer is also a one-dimensional or multi-dimensional real matrix. Therefore, the first output data amount includes the numbers of rows and columns of the output data of the first neural network layer, and the second output data amount includes the numbers of rows and columns of the output data of the second neural network layer.
As mentioned above, when a computing node performs a neural network operation, for example a convolution or pooling operation, it needs to perform multiply-add calculations on the input data and the weights of the corresponding neural network layer. Since the weights are configured on the cells of the crossbars, and the crossbars in the calculation units perform calculations on the input data in parallel, the number of weights determines the parallel computing capability of the calculation units performing the neural network operation. To put it another way, the computing power of the computing nodes performing a neural network operation is determined by the number of weights configured in the calculation units performing that operation. In the embodiment of the present invention, in order to match the computing power of two neural network layers performing adjacent operations, the numbers of weights to be configured for the first neural network layer and the second neural network layer can be determined according to the specific deployment requirements, the first output data amount, and the second output data amount. Since the weights of different neural network layers are not necessarily the same, for clarity of description, in the embodiment of the present invention the weights required for the operation of the first neural network layer are called first weights, and the weights required for the operation of the second neural network layer are called second weights. Performing the first neural network layer operation means that a computing node performs the corresponding calculation on the data input to the first neural network layer based on the first weights, and performing the second neural network layer operation means that a computing node performs the corresponding calculation on the data input to the second neural network layer based on the second weights. The calculations here may be neural network operations such as convolution or pooling.
The following describes in detail, for different deployment requirements, how to determine in this step the number of weights to be configured for each neural network layer. The number of weights to be configured for each layer includes the number N of first weights to be configured for the first neural network layer and the number M of second weights to be configured for the second neural network layer. In the embodiment of the present invention, a weight refers to a weight matrix. The number of weights refers to the number of required weight matrices, that is, the number of copies of the weight. The number of weights can also be understood as how many identical weight matrices need to be configured.
In one case, when the deployment requirement of the neural network system is the calculation delay of the neural network system, in order that the calculation of the entire neural network system does not exceed the set calculation delay, the number of weights to be configured for the first layer (i.e., the starting layer among all neural network layers in the neural network system) can first be determined according to the output data amount of the first layer, the calculation delay, and the calculation frequency of the ReRAM crossbars used in the neural network system; the number of weights to be configured for each layer is then obtained according to the number of weights of the first layer and the output data amount of each layer. Specifically, the number of weights to be configured for the first layer (i.e., the starting layer) can be obtained according to the following Formula 1:
$$W_{num}^{1} = \frac{H_{out}^{1} \times W_{out}^{1}}{t \times f}$$ (Formula 1)
where $W_{num}^{1}$ denotes the number of weights to be configured for the first layer (i.e., the starting layer) of the neural network, $H_{out}^{1}$ is the number of rows of the output data of the first layer, $W_{out}^{1}$ is the number of columns of the output data of the first layer, t is the set calculation delay, and f is the calculation frequency of the crossbars in the calculation units. Those skilled in the art will know that the value of f can be obtained from the configuration parameters of the adopted crossbar. The data volume of the output data of the first-layer neural network can be obtained from the network model information acquired in step 502. It should be noted that, in the embodiment of the present invention, the first-layer neural network is the starting layer among all neural network layers in the neural network system. It can be understood that, when the first neural network layer is the starting layer of all neural network layers in the neural network system, the number N of first weights is the value of $W_{num}^{1}$ calculated according to Formula 1.
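A minimal Python sketch of Formula 1 follows; rounding the result up to a whole number of weight copies is an assumption made here for illustration and is not stated in the embodiment.

```python
import math

def first_layer_weight_count(out_rows: int, out_cols: int,
                             delay: float, crossbar_freq: float) -> int:
    """Formula 1: number of weight copies for the starting layer.

    out_rows, out_cols: H_out^1 and W_out^1, output dimensions of layer 1
    delay: the set calculation delay t (seconds)
    crossbar_freq: the calculation frequency f of the ReRAM crossbar (Hz)
    """
    # delay * crossbar_freq is the number of crossbar computation cycles
    # available within the delay budget; the copies together must cover
    # out_rows * out_cols output positions.
    return math.ceil((out_rows * out_cols) / (delay * crossbar_freq))
```

For example, first_layer_weight_count(224, 224, 1e-3, 100e6) yields 1 copy, while tightening the delay to 1e-4 seconds yields 6 copies.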
After the number of weights required by the first layer is obtained, in order to improve the data processing efficiency of the neural network system, avoid bottlenecks or data waiting in the pipelined parallel processing, and match the processing speeds of adjacent neural network layers, in the embodiment of the present invention the ratio of the numbers of weights required by two adjacent layers can be made to correspond to the ratio of the output data amounts of the two adjacent layers. For example, the two ratios can be equal. Therefore, in the embodiment of the present invention, the number of weights required by each layer can be determined according to the number of weights required by the first layer and the ratio of the output data amounts of adjacent layers. Specifically, the number of weights required by each layer can be calculated according to the following Formula 2:
$$W_{num}^{i} = W_{num}^{i-1} \times \frac{H_{out}^{i} \times W_{out}^{i}}{H_{out}^{i-1} \times W_{out}^{i-1}}$$ (Formula 2)
where $W_{num}^{i}$ denotes the number of weights required by the i-th layer, $W_{num}^{i-1}$ denotes the number of weights required by the (i-1)-th layer, $H_{out}^{i}$ and $W_{out}^{i}$ denote the numbers of rows and columns of the output data of the i-th layer, and $H_{out}^{i-1}$ and $W_{out}^{i-1}$ denote the numbers of rows and columns of the output data of the (i-1)-th layer. The value of i ranges from 2 to n, where n is the total number of neural network layers in the neural network system. To put it another way, in the embodiment of the present invention, the ratio of the number of weights required to perform the (i-1)-th layer operation to the number of weights required to perform the i-th layer operation corresponds to the ratio of the output data amount of the (i-1)-th layer to the output data amount of the i-th layer.
Those skilled in the art will know that the output data of each neural network layer may include multiple channels, where the number of channels refers to the number of kernels in that layer. A kernel represents one feature extraction method and correspondingly produces one feature map; multiple feature maps constitute the output data of the layer. The weight used by a neural network layer includes multiple kernels. Therefore, in practical applications, in another case, the output data amount of each layer may also take into account the number of channels of each layer. Specifically, after the number of weights required by the first neural network layer is obtained according to Formula 1, the number of weights required by each layer can be obtained according to the following Formula 3:
$$W_{num}^{i} = W_{num}^{i-1} \times \frac{H_{out}^{i} \times W_{out}^{i} \times C_{i}}{H_{out}^{i-1} \times W_{out}^{i-1} \times C_{i-1}}$$ (Formula 3)
The difference between Formula 3 and Formula 2 is that Formula 3 further considers, on the basis of Formula 2, the number of channels output by each layer, where $C_{i-1}$ denotes the number of channels of the (i-1)-th layer and $C_{i}$ denotes the number of channels of the i-th layer. The value of i ranges from 2 to n, where n is the total number of neural network layers in the neural network system and n is an integer not less than 2. The number of channels of each layer can be obtained from the network model information.
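The following Python sketch propagates the starting layer's count through Formula 3 (Formula 2 is the special case in which the channel terms cancel); rounding each layer's count to the nearest positive integer is an assumption for illustration. The layers argument reuses the hypothetical LayerModelInfo structure sketched earlier.

```python
def per_layer_weight_counts(w1: int, layers) -> list:
    """Formulas 2/3: weight count of every layer from the starting layer's.

    w1: number of weight copies of layer 1 (e.g. from Formula 1)
    layers: per-layer output shapes, e.g. NetworkModelInfo.layers
    """
    counts = [w1]
    for i in range(1, len(layers)):
        prev, cur = layers[i - 1], layers[i]
        # Formula 3 step: scale by the ratio of output data amounts.
        ratio = (cur.out_rows * cur.out_cols * cur.out_channels) / \
                (prev.out_rows * prev.out_cols * prev.out_channels)
        counts.append(max(1, round(counts[-1] * ratio)))
    return counts
```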
In the embodiment of the present invention, after the number of weights required by the starting layer is obtained according to Formula 1, the number of weights required by each layer can be calculated according to Formula 2 (or Formula 3) and the output data amount of each layer included in the network model information. For example, when the above first neural network layer is the starting layer of all neural network layers in the neural network system, after the number N of first weights is obtained according to Formula 1, the number M of second weights required by the second neural network layer can be obtained according to Formula 2 from the value of N, the first output data amount, and the second output data amount. To put it another way, after the value of N is obtained, the value of M can be calculated according to the following relation: N/M = first output data amount / second output data amount.
In yet another case, when the deployment requirement is the number of chips required by the neural network system, the number of weights required by the first layer can be obtained by combining the following Formula 4 with the foregoing Formula 2, or by combining Formula 4 with the foregoing Formula 3.
$$xb_1 \times W_{num}^{1} + xb_2 \times W_{num}^{2} + \cdots + xb_n \times W_{num}^{n} \le K \times L$$ (Formula 4)
In Formula 4, $xb_1$ denotes the number of crossbars required to deploy one weight of the first layer (also called the starting layer), and $W_{num}^{1}$ denotes the number of weights required by the starting layer; $xb_2$ denotes the number of crossbars required to deploy one weight of the second layer, and $W_{num}^{2}$ denotes the number of weights required by the second layer; $xb_n$ denotes the number of crossbars required to deploy one weight of the n-th layer, and $W_{num}^{n}$ denotes the number of weights required by the n-th layer. K is the number of chips of the neural network system required by the deployment requirement, and L is the number of crossbars in each chip. Formula 4 expresses that the total number of crossbars used over all neural network layers is less than or equal to the total number of crossbars included in the set number of chips of the neural network. For the description of Formula 2 and Formula 3, reference may be made to the foregoing description, which is not repeated here.
Those skilled in the art will know that, once the model of the neural network system is determined, one weight of each neural network layer and the specification of the crossbars used in the neural network system (i.e., the numbers of rows and columns of ReRAM cells in a crossbar) are already determined. To put it another way, the network model information of the neural network system further includes the size of one weight used by each neural network layer and the specification information of the crossbar. Therefore, in the embodiment of the present invention, the $xb_i$ of the i-th layer can be obtained from the size of one weight of that layer (i.e., the numbers of rows and columns of the weight matrix) and the crossbar specification, where i ranges from 1 to n. The value of L can be obtained from the parameters of the chips used by the neural network system. In the embodiment of the present invention, in one case, after the number of weights required by the starting layer (i.e., $W_{num}^{1}$) is obtained according to Formula 4 and Formula 2, the number of weights to be configured for each layer can be obtained according to Formula 2 and the output data amount of each layer obtained from the network model information. In another case, after the number of weights required by the starting layer is obtained according to Formula 4 and Formula 3, the number of weights to be configured for each layer can also be obtained according to Formula 3 and the output data amount of each layer.
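One way to combine Formula 4 with Formula 2 or Formula 3 is to search for the largest starting-layer count whose total crossbar footprint still fits in the K chips; the linear search below, which reuses the per_layer_weight_counts sketch above, is an illustrative assumption rather than the only possible procedure.

```python
def max_first_layer_count(xb_per_weight, layers, K, L):
    """Largest W_num^1 such that sum_i xb_i * W_num^i <= K * L (Formula 4).

    xb_per_weight: xb_i for each layer, crossbars needed for one weight
    layers: per-layer output shapes (as in NetworkModelInfo.layers)
    K, L: number of chips and number of crossbars per chip
    """
    best = 0
    for w1 in range(1, K * L + 1):  # W_num^1 can never exceed the budget
        counts = per_layer_weight_counts(w1, layers)
        total = sum(xb * w for xb, w in zip(xb_per_weight, counts))
        if total <= K * L:
            best = w1
        else:
            break
    return best
```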
In step 506, according to the calculation specifications of the calculation units in the neural network system, the N first weights are deployed on P calculation units, and the M second weights are deployed on Q calculation units, where P and Q are both positive integers, the P calculation units are used to perform the operation of the first neural network layer, and the Q calculation units are used to perform the operation of the second neural network layer. In the embodiment of the present invention, the calculation specification of a calculation unit refers to the number of crossbars included in one calculation unit. In practical applications, a calculation unit may include one or more crossbars. Specifically, as mentioned above, since the network model information of the neural network system further includes the size of one weight used by each neural network layer and the specification information of the crossbar, the deployment relationship between one weight and the crossbars can be obtained. After the number of weights to be configured for each layer is obtained in step 504, the weights of each layer can be deployed on the corresponding number of calculation units according to the number of crossbars included in each calculation unit. Specifically, the elements of a weight matrix are respectively configured into the ReRAM cells of the crossbars of the calculation units. In the embodiment of the present invention, a calculation unit may refer to a PE or an engine; one PE may include multiple engines, and one engine may include one or more crossbars. Since the weight sizes of different layers may differ, one weight may be deployed on one or more engines.
Specifically, in this step, the P calculation units on which the N first weights are to be deployed and the Q calculation units on which the M second weights are to be deployed can be determined according to the deployment relationship between one weight and the crossbars and the number of crossbars included in a calculation unit. For example, the N first weights of the first neural network layer may be deployed on P calculation units, and the M second weights may be deployed on Q calculation units. Specifically, the elements of the N first weights are respectively configured into the ReRAM cells of the corresponding crossbars in the P calculation units, and the elements of the M second weights are respectively configured into the ReRAM cells of the corresponding crossbars in the Q calculation units. Thus, the P calculation units can perform the operation of the first neural network layer on the data input to them based on the configured N first weights, and the Q calculation units can perform the operation of the second neural network layer on the data input to them based on the configured M second weights.
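As a sketch of the deployment arithmetic in step 506, the following function derives how many calculation units a layer needs from the number of crossbars one weight occupies and the number of crossbars one unit provides; the packing rule (a unit holding several whole weights, or one weight spanning several whole units) is an assumption for illustration.

```python
import math

def units_for_layer(weight_count: int, xb_per_weight: int,
                    xb_per_unit: int) -> int:
    """Number of calculation units needed to hold weight_count copies."""
    if xb_per_weight > xb_per_unit:
        # One weight spans several units (e.g. a large weight over engines).
        return weight_count * math.ceil(xb_per_weight / xb_per_unit)
    # Otherwise several whole weights can share one unit's crossbars.
    weights_per_unit = xb_per_unit // xb_per_weight
    return math.ceil(weight_count / weights_per_unit)

# e.g. P = units_for_layer(N, xb_1, crossbars_per_engine)
#      Q = units_for_layer(M, xb_2, crossbars_per_engine)
```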
It can be seen from the above embodiment that the computing resource allocation method provided by the embodiment of the present invention considers the output data amounts of adjacent neural network layers when configuring the calculation units that perform the operation of each layer according to the deployment requirements, so that the computing power of the computing nodes performing the operations of different neural network layers is matched. The computing power of the computing nodes can thus be fully utilized, and the efficiency of data processing is improved.
Further, in the embodiment of the present invention, in order to further reduce the amount of data transmitted between the calculation units performing different neural network layers and save the transmission bandwidth between calculation units or computing nodes, the calculation units can be mapped to their superior computing nodes according to the following method. As mentioned above, the neural network system may include four levels of computing nodes: first-level computing nodes (chips), second-level computing nodes (tiles), third-level computing nodes (PEs), and fourth-level computing nodes (engines). Taking the fourth-level computing node engine as the calculation unit as an example, FIG. 6 describes in detail how to map the P calculation units on which the N first weights are to be deployed and the Q calculation units on which the M second weights are to be deployed to superior computing nodes. The method can still be implemented by the host 105 in the neural network system shown in FIG. 1 and FIG. 1A. As shown in FIG. 6, the method may include the following steps.
In step 602, the network model information of the neural network system is obtained. The network model information includes the first output data amount of the first neural network layer and the second output data amount of the second neural network layer in the neural network system. In step 604, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer are determined according to the deployment requirements of the neural network system, the first output data amount, and the second output data amount. In step 606, the P calculation units on which the N first weights are to be deployed and the Q calculation units on which the M second weights are to be deployed are determined according to the calculation specifications of the calculation units in the neural network system. In the embodiment of the present invention, for steps 602, 604, and 606, reference may be made to the related descriptions of steps 502, 504, and 506, respectively. The difference between step 606 and step 506 is that, in step 606, after the P calculation units for the N first weights and the Q calculation units for the M second weights are determined, the N first weights are not directly deployed to the P calculation units and the M second weights are not directly deployed to the Q calculation units; instead, the method proceeds to step 608.
In step 608, the P calculation units and the Q calculation units are mapped to multiple third-level computing nodes according to the number of calculation units included in a third-level computing node in the neural network system. Specifically, FIG. 6A is a flowchart of a resource mapping method according to an embodiment of the present invention. Taking the calculation unit being the fourth-level computing node engine as an example, FIG. 6A describes how to map engines to the third-level computing nodes PE. As shown in FIG. 6A, the method may include the following steps.
In step 6082, the P calculation units and the Q calculation units are divided into m groups, each group including P/m calculation units for performing the first neural network layer and Q/m calculation units for performing the second neural network layer, where m is an integer not less than 2 and the values of P/m and Q/m are both integers. Specifically, take the P calculation units as the calculation units performing the (i-1)-th layer and the Q calculation units as the calculation units performing the i-th layer as an example. As shown in FIG. 7, suppose the (i-1)-th layer needs 8 calculation units (i.e., P=8), the i-th layer needs 4 calculation units (i.e., Q=4), the (i+1)-th layer needs 4 calculation units, and they are divided into 2 groups (i.e., m=2). Two groups as shown in FIG. 7 are then obtained, where the first group includes 4 calculation units of the (i-1)-th layer, 2 calculation units of the i-th layer, and 2 calculation units of the (i+1)-th layer. Similarly, the second group includes 4 calculation units of the (i-1)-th layer, 2 calculation units of the i-th layer, and 2 calculation units of the (i+1)-th layer.
In step 6084, the calculation units of each group are mapped to third-level computing nodes according to the number of calculation units included in a third-level computing node. During the mapping, the calculation units performing the operations of adjacent neural network layers are mapped to the same third-level node as far as possible. As shown in FIG. 7, suppose that in the neural network system each first-level computing node chip includes 8 second-level computing node tiles, each tile includes 2 third-level computing nodes PE, and each PE includes 4 engines. Then, for the first group, the 4 engines of the (i-1)-th layer can be mapped to one third-level computing node PE (for example, PE1 in FIG. 7), and the 2 engines of the i-th layer together with the 2 engines of the (i+1)-th layer can be mapped to another third-level computing node PE (for example, PE2 in FIG. 7). Similarly, following the mapping of the first group, for the calculation units of the second group, the 4 engines of the (i-1)-th layer can be mapped to PE3, and the 2 engines of the i-th layer together with the 2 engines of the (i+1)-th layer can be mapped to PE4. In practical applications, after the mapping of the calculation units in the first group is completed, the calculation units of the other groups can be mapped in a mirrored manner following the mapping of the first group.
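The grouping and packing of steps 6082 and 6084 might be sketched as follows; the engine labels and the two-layer simplification (the FIG. 7 example additionally packs the (i+1)-th layer's engines into the same PE) are illustrative assumptions.

```python
def group_and_map_to_pes(prev_engines, cur_engines, m, engines_per_pe=4):
    """Divide the engines of two adjacent layers into m groups and pack each
    group into PEs so that engines of adjacent layers share a PE when possible."""
    p_per_group = len(prev_engines) // m   # P/m, assumed to be an integer
    q_per_group = len(cur_engines) // m    # Q/m, assumed to be an integer
    mapping = []
    for g in range(m):
        members = (prev_engines[g * p_per_group:(g + 1) * p_per_group] +
                   cur_engines[g * q_per_group:(g + 1) * q_per_group])
        # Consecutive members land in the same PE, so a PE with spare room
        # after one layer's engines takes the next layer's engines.
        pes = [members[j:j + engines_per_pe]
               for j in range(0, len(members), engines_per_pe)]
        mapping.append(pes)
    return mapping

# FIG. 7 example: 8 engines of layer i-1 and 4 of layer i with m = 2 give,
# per group, one PE with 4 layer-(i-1) engines and one PE with 2 layer-i engines.
```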
According to this mapping, the calculation units performing adjacent neural network layers (for example, the i-th layer and the (i+1)-th layer in FIG. 7) can be mapped to the same third-level computing node as far as possible. Thus, when the output data of the i-th layer is sent to the calculation units of the (i+1)-th layer, the data only needs to be transmitted within the same third-level node (PE) and does not need to occupy bandwidth between third-level nodes, which improves the data transmission speed and reduces the transmission bandwidth consumption between nodes.
Returning to FIG. 6, in step 610, the multiple third-level computing nodes to which the P calculation units and the Q calculation units are mapped are mapped to multiple second-level computing nodes according to the number of third-level computing nodes included in a second-level computing node in the neural network system. In step 612, the multiple second-level computing nodes to which the P calculation units and the Q calculation units are mapped are mapped to the multiple neural network chips according to the number of second-level computing nodes included in each neural network chip. As described above, FIG. 6A is described by taking the mapping of the engines performing the i-th layer operation to third-level computing nodes as an example; similarly, according to the method shown in FIG. 6A, third-level nodes can also be mapped to second-level nodes, and second-level nodes can be mapped to first-level nodes. For example, as shown in FIG. 7, for the first group, PE1 performing the (i-1)-th layer operation and PE2 performing the i-th and (i+1)-th layer operations can further be mapped to the same second-level computing node Tile1. For the second group, PE3 performing the (i-1)-th layer operation and PE4 performing the i-th and (i+1)-th layer operations can further be mapped to the same second-level computing node Tile2. Further, Tile1 and Tile2 performing the operations of the (i-1)-th, i-th, and (i+1)-th layers can both be mapped to the same chip, chip1. In this way, the mapping relationship from the first-level computing node chip to the fourth-level computing node engine in the neural network system can be obtained.
In step 614, the N first weights and the M second weights are respectively deployed to the P calculation units and the Q calculation units corresponding to the multiple third-level nodes, multiple second-level computing nodes, and multiple first-level computing nodes. In the embodiment of the present invention, the mapping relationship from the first-level computing node chip to the fourth-level computing node engine in the neural network system can be obtained according to the method described with reference to FIG. 6A and FIG. 7. For example, the mapping relationships of the P calculation units and the Q calculation units with the multiple third-level nodes, multiple second-level computing nodes, and multiple first-level computing nodes can be obtained. In this step, the weights of the corresponding neural network layers can then be deployed to the calculation units of the computing nodes at each level according to the obtained mapping relationship. For example, as shown in FIG. 7A, the N weights of the (i-1)-th layer can be deployed to the 4 calculation units corresponding to chip1, tile1, and PE1 and the 4 calculation units corresponding to chip1, tile2, and PE3, and the M second weights of the i-th layer can be deployed to the 2 calculation units corresponding to chip1, tile1, and PE2 and the 2 calculation units corresponding to chip1, tile2, and PE4. To put it another way, the N weights of the (i-1)-th layer are deployed to the 4 calculation units (engines) in chip1 -> tile1 -> PE1 and the 4 calculation units in chip1 -> tile2 -> PE3, and the M weights of the i-th layer are deployed to the 2 calculation units in chip1 -> tile1 -> PE2 and the 2 calculation units in chip1 -> tile2 -> PE4.
Through this deployment, not only can the computing power of the calculation units supporting the operations of adjacent neural network layers in the neural network system of the embodiment of the present invention be matched, but also as many as possible of the calculation units performing the operations of adjacent neural network layers are located in the same third-level computing node, as many as possible of the third-level computing nodes performing adjacent neural network layers are located in the same second-level computing node, and as many as possible of the second-level computing nodes performing adjacent neural network layers are located in the same first-level computing node (for example, a neural network chip). This reduces the amount of data transmitted between computing nodes and increases the speed of data transmission between different neural network layers.
It should be noted that the embodiment of the present invention describes, in a neural network system including four levels of computing nodes, the process of allocating the computing resources for performing the operations of different neural network layers with the fourth-level computing node engine as the calculation unit. To put it another way, the above embodiment divides the sets of units performing the operations of different neural network layers at the granularity of engines. In practical applications, the third-level computing node PE can also be used as the calculation unit for allocation; in this case, the mapping between the third-level computing nodes PE and the second-level computing nodes tile as well as the first-level computing nodes chip can be established according to the above method. Of course, when the amount of data to be calculated is very large, the allocation can also be performed at the granularity of the second-level computing node tile. To put it another way, in the embodiment of the present invention, the calculation unit may be an engine, a PE, a tile, or a chip, which is not limited here.
The above describes in detail how the neural network system provided by the embodiment of the present invention configures computing resources. The neural network system is further described below from the perspective of processing data. FIG. 8 is a flowchart of a data processing method according to an embodiment of the present invention. The method is applied to the neural network system shown in FIG. 1, which is configured by the methods shown in FIGS. 5-7 to allocate the computing resources for performing the operations of different neural network layers. As shown in FIG. 8, the method may be implemented by the neural network circuit shown in FIG. 1 and may include the following steps.
In step 802, P calculation units in the neural network system receive first input data, where the P calculation units are used to perform the first neural network layer operation of the neural network system. In the embodiment of the present invention, the first neural network layer is any layer in the neural network system, and the first input data is data on which the first neural network layer operation needs to be performed. When the first neural network layer is the first layer 302 of the neural network system shown in FIG. 3, the first input data may be data input to the neural network system for the first time. When the first neural network layer is not the first layer of the neural network system, the first input data may be output data processed by another neural network layer.
In step 804, the P calculation units perform calculation on the first input data according to the configured N first weights to obtain first output data. In the embodiment of the present invention, a first weight is a weight matrix; the N first weights refer to N such weight matrices and may also be called N first weight copies. The N first weights can be configured in the P calculation units according to the methods shown in FIGS. 5-7. Specifically, the elements of the first weights are respectively configured into the ReRAM cells of the crossbars included in the P calculation units, so that the crossbars in the P calculation units can compute on the input data in parallel based on the N first weights, making full use of the computing power of the crossbars in the P calculation units. In the embodiment of the present invention, after receiving the first input data, the P calculation units can perform a neural network operation on the received first input data based on the configured N first weights to obtain the first output data. For example, the crossbars in the P calculation units can perform a matrix multiply-add operation on the first input data and the configured first weights.
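Conceptually, the N identical weight copies let the P calculation units work on different parts of the same data stream at once; the round-robin dispatch below is an illustrative assumption about how input windows might be spread over the copies, not a mechanism stated in the embodiment.

```python
import numpy as np

def layer_forward(windows, weight_copies):
    """Dispatch input windows round-robin over N identical weight copies,
    each copy performing a matrix multiply-add (here a plain dot product)."""
    outputs = []
    for idx, window in enumerate(windows):
        copy = weight_copies[idx % len(weight_copies)]  # pick a copy
        outputs.append(window @ copy)  # crossbars would do this in parallel
    return outputs
```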
In step 806, Q calculation units in the neural network system receive second input data, where the Q calculation units are used to perform the second neural network layer operation of the neural network system and the second input data includes the first output data. Specifically, in one case, the Q calculation units may perform the operation of the second neural network layer only on the first output data of the P calculation units. For example, the P calculation units perform the operation of the first layer 302 shown in FIG. 3, and the Q calculation units perform the operation of the second layer 304 shown in FIG. 3. In this case, the second input data is the first output data. In another case, the Q calculation units may also perform the second neural network layer operation jointly on the first output data of the first neural network layer and the output data of other neural network layers. For example, the P calculation units may perform the neural network operation of the second layer 304 shown in FIG. 3, and the Q calculation units may perform the neural network operation of the fifth layer 310 shown in FIG. 3. In this case, the Q calculation units perform operations on the output data of the second layer 304 and the fourth layer 308, and the second input data includes the first output data and the output data of the fourth layer 308.
In step 808, the Q calculation units perform calculation on the second input data according to the configured M second weights to obtain second output data. In the embodiment of the present invention, a second weight is also a weight matrix; the M second weights refer to M such weight matrices and may also be called M second weight copies. Similar to step 804, the second weights can be configured into the ReRAM cells of the crossbars included in the Q calculation units according to the method shown in FIG. 5. After receiving the second input data, the Q calculation units can perform a neural network operation on the received second input data based on the configured M second weights to obtain the second output data. For example, the crossbars in the Q calculation units can perform a matrix multiply-add operation on the second input data and the configured second weights. It should be noted that, in the embodiment of the present invention, the ratio of N to M corresponds to the ratio of the data amount of the first output data to the data amount of the second output data.
For clarity of description, the following briefly describes how a ReRAM crossbar implements a matrix multiply-add operation. As shown in FIG. 9, the weight matrix of j rows and k columns in FIG. 9 may be one weight of a neural network layer, and each element in the weight matrix represents a weight value. FIG. 10 is a schematic structural diagram of a ReRAM crossbar in a calculation unit according to an embodiment of the present invention. For convenience of description, the ReRAM crossbar may be referred to simply as a crossbar in the embodiment of the present invention. As shown in FIG. 10, a crossbar includes multiple ReRAM cells, such as $G_{1,1}$ and $G_{2,1}$, and these cells constitute a neural network matrix. In the embodiment of the present invention, during the configuration of the neural network, the weight shown in FIG. 9 can be input into the crossbar from the bit lines of the crossbar shown in FIG. 10 (shown as input port 1002 in FIG. 10), so that each element of the weight is configured into the corresponding ReRAM cell. For example, the weight element $W_{0,0}$ in FIG. 9 is configured into $G_{1,1}$ in FIG. 10, the weight element $W_{1,0}$ in FIG. 9 is configured into $G_{2,1}$ in FIG. 10, and so on; each weight element corresponds to one ReRAM cell. When a neural network calculation is performed, the input data is input into the crossbar through the word lines of the crossbar (input port 1004 shown in FIG. 10). It can be understood that the input data can be represented by voltages, so that a dot-product operation of the input data with the weight values configured in the ReRAM cells is realized, and the calculation result is output in the form of output voltages from the output end of each column of the crossbar (output port 1006 shown in FIG. 10).
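An idealized software model of the crossbar's analog dot product follows: cell conductances stand in for the configured weight elements and word-line voltages for the input data. Real crossbars also involve DACs/ADCs and device non-idealities that this sketch ignores, and the matrix sizes below are chosen arbitrarily.

```python
import numpy as np

def crossbar_matvec(G: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Each cell holds a conductance G[r, c] (one weight element); the input
    vector v is applied as word-line voltages, and every bit line (column)
    outputs the accumulated current sum_r v[r] * G[r, c]."""
    return v @ G

# A j x k weight matrix as in FIG. 9 (here j = 3 rows, k = 2 columns):
G = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])  # conductances
v = np.array([1.0, 0.5, 0.25])                      # input as voltages
print(crossbar_matvec(G, v))  # one accumulated output per column
```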
As mentioned above, since the amount of data output by adjacent neural network layers is considered when configuring the calculation units that perform the operation of each neural network layer in the neural network system, the computing power of the computing nodes performing the operations of adjacent neural network layers can be matched. Therefore, the data processing method provided by the embodiment of the present invention can make full use of the computing power of the computing nodes and improve the data processing efficiency of the neural network system.
In yet another case, an embodiment of the present invention provides a resource allocation apparatus. The apparatus can be applied to the neural network system shown in FIG. 1 and FIG. 1A and is used to allocate the computing nodes that perform the operations of different neural network layers, so that the computing power of the computing nodes performing the operations of two adjacent neural network layers in the neural network system is matched, which improves the data processing efficiency of the neural network system without wasting computing resources. It can be understood that the resource allocation apparatus may be located in the host and may be implemented by the processor in the host, or it may exist as a physical device independent of the processor, for example, as a compiler independent of the processor. As shown in FIG. 11, the resource allocation apparatus 1100 may include an acquisition module 1102, a calculation module 1104, and a deployment module 1106.
The acquisition module 1102 is configured to obtain the data amount of the first output data of the first neural network layer and the data amount of the second output data of the second neural network layer in the neural network system, where the input data of the second neural network layer includes the first output data. The calculation module 1104 is configured to determine, according to the deployment requirements of the neural network system, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer, where N and M are both positive integers and the ratio of N to M corresponds to the ratio of the data amount of the first output data to the data amount of the second output data.
As mentioned above, the neural network system of the embodiment of the present invention includes multiple neural network chips, each neural network chip includes multiple calculation units, and each calculation unit includes at least one resistive random access memory crossbar (ReRAM crossbar). In one case, the deployment requirements include a calculation delay. When the first neural network layer is the starting layer of all neural network layers in the neural network system, the calculation module is configured to determine the value of N according to the data amount of the first output data, the calculation delay, and the calculation frequency of the ReRAM crossbars in the calculation units, and to determine the value of M according to the ratio of the data amount of the first output data to the data amount of the second output data and the value of N.
In yet another case, the deployment requirement includes the number of chips in the neural network system, and the first neural network layer is the starting layer of the neural network system. The calculation module is configured to determine the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratios of the output data volumes of adjacent neural network layers, and to determine the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
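One way to read this chip-budget rule is: the number of copies of each layer scales with that layer's output volume, and the combined crossbar budget of all chips bounds how many copies fit. The sketch below encodes that reading; applying the proportionality rule to every layer and the rounding choices are assumptions, since the patent states only the quantities that enter the calculation.

```python
import math

def size_by_chip_budget(num_chips, xbars_per_chip, xbars_per_weight, out_volumes):
    """Sketch: xbars_per_weight[i] is the crossbar cost of one weight copy of
    layer i, out_volumes[i] is that layer's output data volume; returns the
    number of weight copies per layer, e.g. [N, M, ...]."""
    total_xbars = num_chips * xbars_per_chip
    # Copies of layer i relative to layer 0 follow the output-volume ratios.
    scale = [v / out_volumes[0] for v in out_volumes]
    cost_of_one_layer0_copy = sum(s * w for s, w in zip(scale, xbars_per_weight))
    n = max(1, math.floor(total_xbars / cost_of_one_layer0_copy))
    return [max(1, round(n * s)) for s in scale]

# Example: 4 chips x 96 crossbars; weight costs of 8 and 4 crossbars per copy;
# layer outputs of 400,000 and 200,000 elements give N : M = 2 : 1.
print(size_by_chip_budget(4, 96, [8, 4], [400_000, 200_000]))  # -> [38, 19]
```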
The deployment module 1106 is configured to deploy, according to the calculation specification of the computing units in the neural network system, the N first weights onto P computing units and the M second weights onto Q computing units, where P and Q are both positive integers, the P computing units are used to execute the operations of the first neural network layer, and the Q computing units are used to execute the operations of the second neural network layer. The calculation specification of a computing unit refers to the number of crossbars included in one computing unit; in practice, a computing unit may include one or more crossbars. Specifically, after the calculation module 1104 obtains the number of weights to be configured for each neural network layer, the deployment module 1106 may deploy the weights of each layer onto the corresponding computing units according to the number of crossbars each computing unit contains. The elements of a weight matrix are configured into the ReRAM cells of the crossbars of a computing unit. In the embodiments of the present invention, a computing unit may refer to a PE or an engine: one PE may include multiple engines, and one engine may include one or more crossbars. Because the size of each layer's weight may differ, one weight may be deployed on one or more engines.
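The relationship between weight counts and unit counts can be sketched as follows; the first-fit packing policy and the helper name are assumptions made for illustration, since the text above fixes only the per-unit crossbar count and the idea that a weight may span one or more engines.

```python
import math

def units_needed(num_copies: int, xbars_per_weight: int, xbars_per_unit: int) -> int:
    """Sketch: the number of computing units (P or Q) needed to host
    num_copies weight copies, given each unit's crossbar capacity."""
    if xbars_per_weight >= xbars_per_unit:
        # A large weight spans several whole units (one or more engines).
        return num_copies * math.ceil(xbars_per_weight / xbars_per_unit)
    # Several small copies can share one unit.
    return math.ceil(num_copies / (xbars_per_unit // xbars_per_weight))

p = units_needed(38, 8, 4)   # each copy spans 2 units -> P = 76
q = units_needed(19, 4, 4)   # exact fit, 1 unit per copy -> Q = 19
```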
As described above, the neural network system shown in FIG. 1 includes multiple neural network chips, each neural network chip includes multiple secondary computing nodes, and each secondary computing node includes multiple computing units. To further reduce the volume of data transferred between computing units executing different neural network layers and to save transmission bandwidth between computing units or computing nodes, the resource allocation apparatus 1100 may further include a mapping module 1108, configured to map computing units to their superior computing nodes. Specifically, after the calculation module 1104 obtains the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer, the mapping module 1108 establishes a mapping between the N first weights and the P computing units and a mapping between the M second weights and the Q computing units. Further, the mapping module 1108 is configured to map the P computing units and the Q computing units to multiple secondary computing nodes according to the number of computing units each secondary computing node in the neural network system contains, where at least a part of the P computing units and at least a part of the Q computing units are mapped to the same secondary computing node.
Further, the mapping module 1108 is configured to map, according to the number of secondary computing nodes each neural network chip contains, the multiple secondary computing nodes onto which the P computing units and the Q computing units have been mapped into the multiple neural network chips, where at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped to the same neural network chip.
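The co-location constraint in these two mapping steps can be realized, for example, by a proportional interleaving in which every secondary computing node receives units of both layers in roughly the P : Q ratio, so that a first-layer producer and its second-layer consumer tend to share a node; the same interleaving can be repeated one level up when packing nodes into chips. The round-robin policy below is an assumed illustration; the patent requires only that parts of both sets be mapped together.

```python
def interleave(p_units, q_units, units_per_node):
    """Sketch: pack first-layer (P) and second-layer (Q) computing units onto
    secondary computing nodes while keeping the P:Q ratio inside every node."""
    nodes, i, j = [], 0, 0
    while i < len(p_units) or j < len(q_units):
        node = []
        while len(node) < units_per_node and (i < len(p_units) or j < len(q_units)):
            # Take from P while its consumed fraction lags behind Q's.
            take_p = j == len(q_units) or (
                i < len(p_units) and i * len(q_units) <= j * len(p_units))
            if take_p:
                node.append(("P", p_units[i])); i += 1
            else:
                node.append(("Q", q_units[j])); j += 1
        nodes.append(node)
    return nodes

# 4 first-layer units, 2 second-layer units, 3 units per node:
# -> [[('P', 0), ('Q', 0), ('P', 1)], [('P', 2), ('Q', 1), ('P', 3)]]
print(interleave([0, 1, 2, 3], [0, 1], 3))
```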
For how the mapping module 1108 establishes the mapping between the N first weights and the P computing units and the mapping between the M second weights and the Q computing units, and how the P computing units and the Q computing units are respectively mapped to their superior computing nodes, refer to the foregoing descriptions of FIG. 6, FIG. 6A, and FIG. 7; details are not repeated here.
An embodiment of the present invention further provides a computer program product implementing the foregoing resource allocation method, and an embodiment of the present invention also provides a computer program product implementing the foregoing data processing method. Each of these computer program products includes a computer-readable storage medium storing program code, and the instructions included in the program code are used to execute the method flow of any one of the foregoing method embodiments. A person of ordinary skill in the art can understand that the foregoing storage medium includes various non-transitory machine-readable media that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random access memory (RAM), a solid state disk (SSD), or a non-volatile memory.
It should be noted that the embodiments provided in this application are merely illustrative. Those skilled in the art will clearly understand that, for convenience and brevity of description, each of the foregoing embodiments emphasizes different aspects; for a part not detailed in one embodiment, refer to the related descriptions of the other embodiments. The features disclosed in the embodiments, claims, and drawings of the present invention may exist independently or in combination. Features described in hardware form in the embodiments of the present invention may be executed by software, and vice versa. This is not limited here.

Claims (17)

  1. A computing resource allocation method applied in a neural network system, comprising:
    obtaining the data volume of first output data of a first neural network layer and the data volume of second output data of a second neural network layer in the neural network system, wherein input data of the second neural network layer comprises the first output data;
    determining, according to a deployment requirement of the neural network system, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer, wherein N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data; and
    deploying, according to a calculation specification of computing units in the neural network system, the N first weights onto P computing units and the M second weights onto Q computing units, wherein P and Q are both positive integers, the P computing units are configured to execute operations of the first neural network layer, and the Q computing units are configured to execute operations of the second neural network layer.
  2. The method according to claim 1, wherein the deployment requirement comprises a computation delay, the first neural network layer is the starting layer of all neural network layers in the neural network system, and
    the determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer comprises:
    determining the value of N according to the data volume of the first output data, the computation delay, and the computation frequency of a resistive random access memory crossbar (ReRAM crossbar) in a computing unit; and
    determining the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  3. The method according to claim 1, wherein the neural network system comprises multiple neural network chips, each neural network chip comprises multiple computing units, each computing unit comprises at least one resistive random access memory crossbar (ReRAM crossbar), the deployment requirement comprises the number of chips of the neural network system, the first neural network layer is the starting layer of the neural network system, and
    the determining of the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer comprises:
    determining the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratios of the output data volumes of adjacent neural network layers; and
    determining the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  4. The method according to claim 1, wherein the neural network system comprises multiple neural network chips, each neural network chip comprises multiple secondary computing nodes, and each secondary computing node comprises multiple computing units, the method further comprising:
    mapping the P computing units and the Q computing units to multiple secondary computing nodes according to the number of computing units comprised in a secondary computing node in the neural network system, wherein at least a part of the P computing units and at least a part of the Q computing units are mapped to the same secondary computing node.
  5. The method according to claim 4, further comprising:
    mapping, according to the number of secondary computing nodes comprised in each neural network chip, the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips, wherein at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped to the same neural network chip.
  6. A neural network system, comprising:
    multiple neural network chips, each neural network chip comprising multiple computing units; and
    a processor, connected to the multiple neural network chips and configured to:
    obtain the data volume of first output data of a first neural network layer and the data volume of second output data of a second neural network layer in the neural network system, wherein input data of the second neural network layer comprises the first output data;
    determine, according to a deployment requirement of the neural network system, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer, wherein N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data; and
    deploy, according to a calculation specification of the computing units in the neural network system, the N first weights onto P computing units of the multiple computing units and the M second weights onto Q computing units of the multiple computing units, wherein P and Q are both positive integers, the P computing units are configured to execute operations of the first neural network layer, and the Q computing units are configured to execute operations of the second neural network layer.
  7. The neural network system according to claim 6, wherein the deployment requirement comprises a computation delay, the first neural network layer is the starting layer of all neural network layers in the neural network system, and
    in the step of determining the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer, the processor is configured to:
    determine the value of N according to the data volume of the first output data, the computation delay, and the computation frequency of a resistive random access memory crossbar (ReRAM crossbar) in a computing unit; and
    determine the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  8. The neural network system according to claim 6, wherein each computing unit comprises at least one resistive random access memory crossbar (ReRAM crossbar), the deployment requirement comprises the number of chips of the neural network system, the first neural network layer is the starting layer of the neural network system, and
    in the step of determining the N first weights to be configured for the first neural network layer and the M second weights to be configured for the second neural network layer, the processor is configured to:
    determine the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratios of the output data volumes of adjacent neural network layers; and
    determine the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  9. The neural network system according to claim 6, wherein the neural network system comprises multiple neural network chips, each neural network chip comprises multiple secondary computing nodes, each secondary computing node comprises multiple computing units, and the processor is further configured to:
    map the P computing units and the Q computing units to multiple secondary computing nodes according to the number of computing units comprised in a secondary computing node in the neural network system, wherein at least a part of the P computing units and at least a part of the Q computing units are mapped to the same secondary computing node.
  10. The neural network system according to claim 9, wherein the processor is further configured to:
    map, according to the number of secondary computing nodes comprised in each neural network chip, the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips, wherein at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped to the same neural network chip.
  11. A resource allocation apparatus, comprising:
    an obtaining module, configured to obtain the data volume of first output data of a first neural network layer and the data volume of second output data of a second neural network layer in a neural network system, wherein input data of the second neural network layer comprises the first output data;
    a calculation module, configured to determine, according to a deployment requirement of the neural network system, N first weights to be configured for the first neural network layer and M second weights to be configured for the second neural network layer, wherein N and M are both positive integers, and the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data; and
    a deployment module, configured to deploy, according to a calculation specification of computing units in the neural network system, the N first weights onto P computing units and the M second weights onto Q computing units, wherein P and Q are both positive integers, the P computing units are configured to execute operations of the first neural network layer, and the Q computing units are configured to execute operations of the second neural network layer.
  12. The resource allocation apparatus according to claim 11, wherein the deployment requirement comprises a computation delay, the first neural network layer is the starting layer of all neural network layers in the neural network system, and the calculation module is configured to:
    determine the value of N according to the data volume of the first output data, the computation delay, and the computation frequency of a resistive random access memory crossbar (ReRAM crossbar) in a computing unit; and
    determine the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  13. The resource allocation apparatus according to claim 11, wherein the neural network system comprises multiple neural network chips, each neural network chip comprises multiple computing units, each computing unit comprises at least one resistive random access memory crossbar (ReRAM crossbar), the deployment requirement comprises the number of chips of the neural network system, the first neural network layer is the starting layer of the neural network system, and the calculation module is configured to:
    determine the value of N according to the number of chips, the number of ReRAM crossbars in each chip, the number of ReRAM crossbars required to deploy one weight of each neural network layer, and the ratios of the output data volumes of adjacent neural network layers; and
    determine the value of M according to the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
  14. The resource allocation apparatus according to claim 11, wherein the neural network system comprises multiple neural network chips, each neural network chip comprises multiple secondary computing nodes, and each secondary computing node comprises multiple computing units, the apparatus further comprising:
    a mapping module, configured to map the P computing units and the Q computing units to multiple secondary computing nodes according to the number of computing units comprised in a secondary computing node in the neural network system, wherein at least a part of the P computing units and at least a part of the Q computing units are mapped to the same secondary computing node.
  15. The resource allocation apparatus according to claim 14, wherein the mapping module is further configured to:
    map, according to the number of secondary computing nodes comprised in each neural network chip, the multiple secondary computing nodes to which the P computing units and the Q computing units are mapped into the multiple neural network chips, wherein at least a part of the secondary computing nodes to which the P computing units belong and at least a part of the secondary computing nodes to which the Q computing units belong are mapped to the same neural network chip.
  16. A computer program product, comprising program code, wherein instructions included in the program code are executed by a computer to perform the computing resource allocation method according to any one of claims 1 to 5.
  17. A computer-readable storage medium, comprising computer program instructions, wherein when the computer program instructions are run on a computer, the computer is caused to perform the method according to any one of claims 1 to 5.
PCT/CN2018/125239 2018-12-29 2018-12-29 Computing resource allocation technology and neural network system WO2020133317A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880100574.2A CN113597621A (en) 2018-12-29 2018-12-29 Computing resource allocation technique and neural network system
PCT/CN2018/125239 WO2020133317A1 (en) 2018-12-29 2018-12-29 Computing resource allocation technology and neural network system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/125239 WO2020133317A1 (en) 2018-12-29 2018-12-29 Computing resource allocation technology and neural network system

Publications (1)

Publication Number Publication Date
WO2020133317A1 true WO2020133317A1 (en) 2020-07-02

Family ID=71126750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/125239 WO2020133317A1 (en) 2018-12-29 2018-12-29 Computing resource allocation technology and neural network system

Country Status (2)

Country Link
CN (1) CN113597621A (en)
WO (1) WO2020133317A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297130A (en) * 2021-12-28 2022-04-08 深圳云天励飞技术股份有限公司 Data transmission processing method in chip system and related device
CN115204380B (en) * 2022-09-15 2022-12-27 之江实验室 Data storage and array mapping method and device of storage and calculation integrated convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871163A (en) * 2016-09-28 2018-04-03 爱思开海力士有限公司 Operation device and method for convolutional neural networks
EP3343465A1 (en) * 2016-12-30 2018-07-04 Intel Corporation Neuromorphic computer with reconfigurable memory mapping for various neural network topologies
CN107622305A (en) * 2017-08-24 2018-01-23 中国科学院计算技术研究所 Processor and processing method for neutral net

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579285A (en) * 2020-12-10 2021-03-30 南京工业大学 Edge network-oriented distributed neural network collaborative optimization method
CN112579285B (en) * 2020-12-10 2023-07-25 南京工业大学 Distributed neural network collaborative optimization method for edge network
WO2022199315A1 (en) * 2021-03-22 2022-09-29 华为技术有限公司 Data processing method and apparatus
CN113158243A (en) * 2021-04-16 2021-07-23 苏州大学 Distributed image recognition model reasoning method and system
CN113238715A (en) * 2021-06-03 2021-08-10 上海新氦类脑智能科技有限公司 Intelligent file system, configuration method thereof, intelligent auxiliary computing equipment and medium
CN113238715B (en) * 2021-06-03 2022-08-30 上海新氦类脑智能科技有限公司 Intelligent file system, configuration method thereof, intelligent auxiliary computing equipment and medium
CN113517009A (en) * 2021-06-10 2021-10-19 上海新氦类脑智能科技有限公司 Storage and calculation integrated intelligent chip, control method and controller
CN116089095A (en) * 2023-02-28 2023-05-09 苏州亿铸智能科技有限公司 Deployment method for ReRAM neural network computing engine network
CN116306811A (en) * 2023-02-28 2023-06-23 苏州亿铸智能科技有限公司 Weight distribution method for deploying neural network for ReRAM
CN116089095B (en) * 2023-02-28 2023-10-27 苏州亿铸智能科技有限公司 Deployment method for ReRAM neural network computing engine network
CN116306811B (en) * 2023-02-28 2023-10-27 苏州亿铸智能科技有限公司 Weight distribution method for deploying neural network for ReRAM

Also Published As

Publication number Publication date
CN113597621A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
WO2020133317A1 (en) Computing resource allocation technology and neural network system
US10445638B1 (en) Restructuring a multi-dimensional array
US20230325348A1 (en) Performing concurrent operations in a processing element
WO2020133463A1 (en) Neural network system and data processing technology
CN110520853B (en) Queue management for direct memory access
CN109102065B (en) Convolutional neural network accelerator based on PSoC
CN111033529B (en) Architecture optimization training of neural networks
US11294599B1 (en) Registers for restricted memory
US11599367B2 (en) Method and system for compressing application data for operations on multi-core systems
US11755683B2 (en) Flexible accelerator for sparse tensors (FAST) in machine learning
US20210303976A1 (en) Flexible accelerator for sparse tensors in convolutional neural networks
Dutta et al. Hdnn-pim: Efficient in memory design of hyperdimensional computing with feature extraction
US11579921B2 (en) Method and system for performing parallel computations to generate multiple output feature maps
WO2020124488A1 (en) Application process mapping method, electronic device, and computer-readable storage medium
CN112835844A (en) Communication sparsization method for load calculation of impulse neural network
WO2021244045A1 (en) Neural network data processing method and apparatus
CN111971692A (en) Convolutional neural network
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078623B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078625B (en) Network-on-chip processing system and network-on-chip data processing method
US20200218960A1 (en) Convolution operator system to perform concurrent convolution operations
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
TWI836132B (en) Storage system and method for dynamically scaling sort operation for storage system
TWI753728B (en) Architecture and cluster of processing elements and method of convolution operation
KR102663759B1 (en) System and method for hierarchical sort acceleration near storage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18944653; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 18944653; Country of ref document: EP; Kind code of ref document: A1