CN110046705B

CN110046705B - Apparatus for convolutional neural network

Info

Publication number: CN110046705B
Application number: CN201910301387.7A
Authority: CN
Inventors: 许喆; 丁雪立; 陈柏纲
Original assignee: Novumind Ltd
Current assignee: NOVUMIND Ltd.
Priority date: 2019-04-15
Filing date: 2019-04-15
Publication date: 2022-03-22
Anticipated expiration: 2039-04-15
Also published as: CN110046705A

Abstract

An apparatus for a convolutional neural network includes a two-dimensional array of processing engines. The two-dimensional array of processing engines is configured to receive input data and weight data, including: an array of N processing engine rows and M processing engine columns, where N and M are both positive integers greater than or equal to 2. And the processing engine two-dimensional array performs convolution operation on the input data and the weight data and outputs an intermediate result. The processing engines of the two-dimensional array of processing engines are configured for self-contained pooling processing after said convolution operations. It is also possible that the two-dimensional array of processing engines has a fully connected structure, with intermediate results being stacked within the two-dimensional array of processing engines. The device realizes high-speed processing of input tensor data and can flexibly deal with the input tensor data with different dimensions.

Description

Apparatus for convolutional neural network

Technical Field

The present disclosure relates to neural network convolution tensor processors, and more particularly, to apparatus for convolutional neural networks.

Background

The neural network establishes a model structure by simulating a neural connection structure of the human brain, and is a hot spot of academic research and enterprise research at present. Current neural networks, particularly convolutional neural networks for image processing and object recognition, need to process a large amount of data expressed as third or higher order tensors, and also tensor data having different shapes and sizes. There is therefore a need for a neural network dedicated computing device capable of processing three-order or higher tensor data of different shapes at high speed. Further, the binarization neural network refers to a neural network that performs binarization processing on weight values and/or input data. Currently, no high-precision computing device for the binary neural network exists.

Disclosure of Invention

In view of this, it is necessary to provide a neural network dedicated calculation device capable of processing third-order or higher tensor data at high speed, and it is also necessary to provide a high-precision calculation device for a binarization neural network. To this end, the present disclosure provides a tensor processor including a plurality of Processing Engines (PEs) and a ping-pong controller connected to the PEs. The tensor processor can determine the number of the PEs to be called and the dimension of the two-dimensional array formed by the called PEs according to actual needs (such as information of the dimension of the input tensor data, the dimension of the convolution kernel and the like), and call all or part of the PEs to form the two-dimensional array of the PEs. Furthermore, the tensor processor configures the connection relationship and the data flow direction between the PEs of the two-dimensional array of PEs, and can cut the input tensor data according to the dimensionality of the two-dimensional array of PEs, so that the input tensor data can be processed at high speed, and the input tensor data with different dimensionalities can be flexibly dealt with. Aiming at the derivation operation of the binary neural network, the tensor processor replaces the convolution operation by a hardware mode and performs threshold operation on the convolution operation result, so that the binary neural network computing device with the advantages of high speed and high precision is realized.

According to an aspect of the present disclosure, there is provided an apparatus for a convolutional neural network, the apparatus comprising a two-dimensional array of processing engines configured to receive input data and weight data, the two-dimensional array of processing engines comprising: an array of N processing engine rows and M processing engine columns, wherein N and M are both positive integers greater than or equal to 2; the processing engine two-dimensional array performs convolution operation on the input data and the weight data and outputs an intermediate result; wherein the processing engines of the two-dimensional array of processing engines are configured for self-contained pooling processing after the convolution operation.

According to another aspect of the present disclosure, there is provided an apparatus for a convolutional neural network, the apparatus comprising a two-dimensional array of processing engines configured to receive input data and weight data, the two-dimensional array of processing engines comprising: an array of N processing engine rows and M processing engine columns, wherein N and M are both positive integers greater than or equal to 2; the processing engine two-dimensional array performs convolution operation on the input data and the weight data and outputs a result; wherein the two-dimensional array of processing engines has a fully connected structure, the intermediate results are stacked within the two-dimensional array of processing engines.

Drawings

Embodiments of the present disclosure have other advantages and features which will become more readily apparent from the following detailed description and appended claims when taken in conjunction with the accompanying drawings, wherein:

fig. 1 illustrates an architecture of a tensor processor including an input-output bus, a ping-pong controller, and a two-dimensional array of PEs, according to one embodiment.

FIG. 2 illustrates the architecture and data flow of a two-dimensional array of PEs of a tensor processor of one embodiment.

Fig. 3 shows the architecture and data flow of a two-dimensional array of PEs of a tensor processor of another embodiment.

Fig. 4 illustrates a first line of the calculation result derived by the PE two-dimensional array of the tensor processor in the embodiment shown in fig. 3.

Fig. 5 illustrates a second row of the PE two-dimensional array derivation operation result of the tensor processor in the embodiment illustrated in fig. 3.

Fig. 6 illustrates a third row of the PE two-dimensional array derivation operation result of the tensor processor in the embodiment illustrated in fig. 3.

Figure 7 illustrates a tensor processor of one embodiment configuring PEs of a two-dimensional array of PEs to accommodate the dimensions of an input data matrix.

Figure 8 illustrates a tensor processor of one embodiment cuts the input data matrix to fit the dimensions of the two-dimensional array of PEs.

Figure 9 illustrates a first way of matching convolution kernels to image data input after the tensor processor of one embodiment cuts the input data matrix.

Figure 10 illustrates a second way of matching convolution kernels to image data input after the tensor processor of one embodiment cuts the input data matrix.

Figure 11 illustrates a third way of matching convolution kernels to image data input after the tensor processor cuts the input data matrix for one embodiment.

Figure 12 illustrates a tensor processor configuring data with Multicast, according to one embodiment.

Figure 13 illustrates the data flow for a fully-connected operation by the tensor processor of one embodiment.

Figure 14 illustrates PE configuration parameters of a tensor processor of one embodiment.

FIG. 15 illustrates a PE of a tensor processor of an embodiment performing a binary neural network convolution operation and a threshold operation.

Fig. 16 shows an architecture of a tensor processor of another embodiment.

Figure 17 illustrates an architecture of a ping-pong controller of a tensor processor of one embodiment.

Detailed Description

The drawings and the following description are by way of example only. It should be understood from the following discussion that alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.

Referring to fig. 1, the tensor processor of one embodiment includes an input-output bus 100, a plurality of PEs, and a ping-pong controller 102 coupled to the plurality of PEs. The input/output bus 100 receives input data (such as image data expressed as third-order or higher tensors or feature tensors including image feature values) from the outside, transmits the input data to the ping-pong controller 102, and receives output data from the ping-pong controller 102 to output the output data to the outside. The input/output bus 100 may receive convolution kernel data (the convolution kernel data may be a set of weight values or only a single weight value, or may be a convolution kernel tensor) from the outside. In other embodiments, the convolution kernel data may also come from the tensor processor itself, such as by pre-storing the convolution kernel data in a PE configuration parameter register (not shown) of the ping-pong controller 102. The ping-pong controller 102 determines the number of PEs to be called and the dimension of the two-dimensional array formed by the called PEs according to the information of the input data and the convolution kernel data (such as the dimension of the input data and the dimension of the convolution kernel data), and then calls all or a part of the PEs connected to the ping-pong controller to form the two-dimensional array of PEs. In the embodiment shown in fig. 1, ping-pong controller 102 defines a two-dimensional array of 4 rows and 4 columns of 16 PEs (possible non-invoked PEs not shown). Further, the ping-pong controller 102 configures the 16 PEs with respect to each other and the data flow (e.g., as shown in fig. 1, the operation result is vertically transmitted from top to bottom to the PE in the bottom row). In other embodiments, the ping-pong controller 102 determines that N (N is a positive integer greater than or equal to 2) times M (M is a positive integer greater than or equal to 2) PEs are required to form a two-dimensional array of N rows and M columns or M rows and N columns or other dimensions of PEs, and configures a connection relationship between N times M PEs and a data flow direction (including but not limited to a data flow direction of image data, a data flow direction of weight data, and a data flow direction of operation results).

With continued reference to fig. 1, ping-pong controller 102 configures the convolution kernel data to the PEs in the two-dimensional array of PEs and also transmits the input data to the PEs in the two-dimensional array of PEs. For a specific PE in the PE two-dimensional array, the PE performs convolution operation according to input data transmitted to the PE and convolution kernel data configured to the PE to obtain an operation result. In particular, the configuration of the convolution kernel data to the PE occurs before the transmission of the input data to the PE, i.e., the convolution kernel data is configured and then the transmission of the input data is started. Because the convolution kernel data or the weight value has high reusability in convolution operation, input data such as image characteristic values can be transmitted to the PE two-dimensional array for operation without pause by configuring the convolution kernel data or the weight value in advance, and therefore the quantity of data processed by each batch of the tensor processor is increased. In other embodiments, configuring the convolution kernel data may also occur simultaneously with or after transmitting the input data. According to the configured connection relationship between the PEs and the data flow direction, the ping-pong controller 102 selects the operation results of some of the PEs as output results. In the embodiment shown in fig. 1, the ping-pong controller 102 selects the operation results of the 4 PEs in the bottom row as the output result. In other embodiments, the two-dimensional array of PEs may have different dimensions or architectures, and the connection relationship and data flow between PEs may have different configurations, according to actual needs, and each PE in the two-dimensional array of PEs may be specified in a specific architecture to provide an output result. In addition, according to some embodiments of the present disclosure, the ping-pong controller 102 may transmit the input data for operation while configuring a new convolution kernel (or weight data), thereby increasing the processing speed of the tensor processor. Specifically, for PEs in the entire two-dimensional array of PEs, the ping-pong controller 102 may configure new convolution kernel data to one portion of the PEs while also transferring input data to another portion of the PEs. That is, while updating the convolution kernel data of one part of PEs, the convolution kernel data of another part of PEs can be kept unchanged and the operation of another part of PEs can be continued, so as to accelerate the processing speed of the tensor processor.

With reference to fig. 1, the operation result obtained by performing convolution operation on the input data transmitted to the PE and the convolution kernel data (or weight data) configured to the PE by the PE in the PE two-dimensional array may be an intermediate result Psum of the neural network derivation operation, or a regularization result obtained by regularizing the intermediate result Psum. For the Binary convolutional neural network (Binary CNN), the operation result of PE may be the intermediate result Psum, or may be the result of 1 bit 1 or 0 after regularization. In some embodiments, the intermediate result Psum obtained after the PE performs the operation is transmitted to the ping-pong controller 102. In other embodiments, the PE two-dimensional array uses a fully-connected layer (full-connected), and the intermediate result Psum obtained after the PE performs the operation does not need to be transmitted to the ping-pong controller 102. That is, the intermediate results Psum are not read and written by the ping-pong controller 102, but are directly stacked between the PEs of the two-dimensional array of PEs. The PE two-dimensional array adopting the full-connection layer supports both full-connection operation and convolution operation. In addition, the PE two-dimensional array using the full link layer does not need to read and write the intermediate result Psum through the ping-pong controller 102, but completes the operation inside the two-dimensional array, thereby reducing the delay and facilitating the high-speed calculation derivation. The ping-pong controller 102 can adjust the connection relationship and data flow direction between the PEs according to actual needs, thereby controlling whether to read and write the intermediate result Psum through the ping-pong controller 102, and further realizing the switching between the full connection operation and the convolution operation. In addition, according to some embodiments of the present disclosure, by configuring the PE, pooling (posing) operations in the neural network model can be mixed with convolution operations, i.e., letting the configured PE perform a pooling operation on its own.

Referring to fig. 2, the data flow of the PE two-dimensional array of the tensor processor of one embodiment includes, but is not limited to, the data flow of image data, the data flow of weight data, and the data flow of operation results. The embodiment shown in fig. 2 provides a two-dimensional array of 3 rows and 3 columns of PEs consisting of 9 PEs. The image data enters from the PE of the leftmost column of the PE two-dimensional array and then is transmitted to the adjacent right PE of the same row one by one, namely from the first column to the second column and then from the second column to the third column in the left-to-right direction. And the weight data are transmitted one by one from the PE of the leftmost column toward the nearest PE to the right of neither the same row nor the same column in the diagonal direction after entering. The operation result obtained by the operation performed by each PE of the two-dimensional array of PEs propagates one by one vertically toward the nearest PE below the same row, that is, from the first row to the second row and then from the second row to the third row in the top-down direction. The data flow direction of the PE two-dimensional array shown in fig. 2 is merely for explaining that the data flow direction of the image data, the data flow direction of the weight data, and the data flow direction of the operation result by the tensor processor can be individually controlled. The tensor processor may arrange the data flow direction of the image data, the data flow direction of the weight data, and the data flow direction of the operation result separately when arranging the connection relationship and the data flow direction between the PEs according to actual needs, for example, information of the input data and the convolution kernel data. The embodiment shown in fig. 2 is only used to illustrate one configuration of data flow among PEs in a 3-row and 3-column PE two-dimensional array, and should not be used to limit other possible configurations of the PE two-dimensional array according to the present disclosure.

The PE two-dimensional array shown in fig. 1 and the PE two-dimensional array shown in fig. 2, as well as the PE two-dimensional array in other embodiments of the disclosure, are only used for illustrating the connection relationship and data flow direction between PEs, and should not be used to limit other possible configurations of the PE two-dimensional array in the disclosure. In the embodiments of the present disclosure, the relative relationship between PEs, such as front, back, left, right, top, bottom, and the like, the position information of the PE in the first row and the second column, and the PE in the leftmost column or the lowermost row, and the like are mentioned, which are only for convenience of explaining the connection relationship and the data flow direction between PEs, but should not be understood as requiring that PEs are strictly arranged according to the mentioned relative relationship and position relationship, and should not be used to limit the present disclosure to other possible configurations of the two-dimensional array of PEs. In addition, the two-dimensional array of PEs illustrated in the multiple figures of the present disclosure has various data flow directions indicated by arrows, and these arrows are only for convenience of explaining the data flow directions of the PEs among each other, and should not be used to limit other possible configurations of the two-dimensional array of PEs of the present disclosure.

Referring to fig. 3 to 6, in which fig. 3 shows a tensor processor of another embodiment, fig. 4 to 6 show that an input image data matrix is 5 rows and 5 columns and a weight data matrix is 3 rows and 3 columns. The tensor processor includes a two-dimensional array of 3 rows and 3 columns of 9 PEs, the 9 PEs being numbered PE1 through PE9, respectively. Fig. 3 also shows the connection relationship and the data flow direction between the 9 PEs, including the data flow direction of image data, the data flow direction of weight data, and the data flow direction of operation results. The image data is transmitted to the corresponding PE in the manner shown in fig. 3. Specifically, each PE corresponds to one line of image data: line 1 of image data is transferred to PE numbered PE1, line 2 of image data is transferred to PE numbered PE2 and PE4, line 3 of image data is transferred to PE numbered PE3, PE5 and PE7, line 4 of image data is transferred to PE numbered PE6 and PE8, and line 5 of image data is transferred to PE numbered PE 9. And the weight data is configured to the corresponding PE in the manner shown in fig. 3. Specifically, each PE corresponds to a row of weight data: the weight data line 1 is configured to PEs numbered PE1, PE4, and PE7, the weight data line 2 is configured to PEs numbered PE2, PE5, and PE8, and the weight data line 3 is configured to PEs numbered PE3, PE6, and PE 9. The operation results are overlapped in the manner shown in fig. 3. Specifically, the operation result of PE numbered PE1 is overlapped and transmitted to PE numbered PE2, and then overlapped and transmitted to PE numbered PE3, and finally line 1 of the convolution operation output result is obtained. The operation result of PE numbered PE4 is overlapped to PE numbered PE5, and then overlapped to PE numbered PE6, and finally, the 2 nd line of the convolution operation output result is obtained. The operation result of PE numbered PE7 is overlapped to PE numbered PE8, and then is overlapped to PE numbered PE9, and finally, the 3 rd line of the convolution operation output result is obtained.

Referring to fig. 3 and 4, taking a binary convolutional neural network as an example, PE numbered PE1 performs a convolution operation of the binary convolutional neural network on row 1 of input image data and row 1 of configured weight data. PE No. PE2 performs a convolution operation of a binary convolution neural network on the 2 nd row of the input image data and the 2 nd row of the arranged weight data. PE No. PE3 performs a convolution operation of a binary convolution neural network on row 3 of the input image data and row 3 of the arranged weight data. And overlapping and transmitting the operation result of the PE with the number of PE1 to the PE with the number of PE2, and then continuously overlapping and transmitting to the PE with the number of PE3 to finally obtain the 1 st line of the output result of the binary neural network convolution operation. The specific PE performs convolution operation on the input image data and the configured weight data, and the completion of the convolution operation may occur before or after receiving an operation result transmitted by another PE stack, or may occur simultaneously. The specific PE superposes and transmits the operation result after the convolution operation and the operation result transmitted by the other PE to a third PE.

Referring to fig. 3 and 5, taking a binary convolutional neural network as an example, PE numbered PE4 performs a convolution operation of the binary convolutional neural network on row 2 of input image data and row 1 of configured weight data. PE No. PE5 performs a convolution operation of a binary convolution neural network on row 3 of the input image data and row 2 of the arranged weight data. PE No. PE6 performs a convolution operation of a binary convolution neural network on the 4 th line of the input image data and the 3 rd line of the arranged weight data. And overlapping and transmitting the operation result of the PE with the number of PE4 to the PE with the number of PE5, and then continuously overlapping and transmitting to the PE with the number of PE6 to finally obtain the 2 nd line of the output result of the binary neural network convolution operation. The specific PE performs convolution operation on the input image data and the configured weight data, and the completion of the convolution operation may occur before or after receiving an operation result transmitted by another PE stack, or may occur simultaneously. The specific PE superposes and transmits the operation result after the convolution operation and the operation result transmitted by the other PE to a third PE.

Referring to fig. 3 and 6, taking a binary convolutional neural network as an example, PE numbered PE7 performs a convolution operation of the binary convolutional neural network on row 3 of input image data and row 1 of configured weight data. PE No. PE8 performs a convolution operation of a binary convolution neural network on the 4 th line of the input image data and the 2 nd line of the arranged weight data. PE No. PE9 performs a convolution operation of a binary convolution neural network on the 5 th line of the input image data and the 3 rd line of the arranged weight data. And overlapping and transmitting the operation result of the PE with the number of PE7 to the PE with the number of PE8, and then continuously overlapping and transmitting to the PE with the number of PE9 to finally obtain the 3 rd line of the output result of the binary neural network convolution operation. The specific PE performs convolution operation on the input image data and the configured weight data, and the completion of the convolution operation may occur before or after receiving an operation result transmitted by another PE stack, or may occur simultaneously. The specific PE superposes and transmits the operation result after the convolution operation and the operation result transmitted by the other PE to a third PE.

Referring to fig. 3 to 6, the input image data matrix is 5 rows and 5 columns, the weight data matrix is 3 rows and 3 columns, the tensor processor is configured with a 3-row and 3-column PE two-dimensional array composed of 9 PEs, and is also configured with the connection relationship and data flow direction between the 9 PEs. Further, the tensor processor inputs a line of image data to a particular PE and also configures a line of weight data to the particular PE. The specific PE performs convolution operation on the input image data and the configured weight data to output an operation result. And overlapping the operation results of the multiple PEs according to a specific mode to obtain a certain row of the convolution operation output result of the neural network. In other embodiments, the two-dimensional array of PEs may have different dimensions or sizes, for example, the two-dimensional array of PEs may be 12 × 14. In other embodiments, the tensor processor resizes the two-dimensional array of PEs based on information (e.g., matrix dimensions) of the input matrix of image data and the matrix of weight data. The embodiments shown in fig. 3 to 6 are only used for illustrating one architecture of the PE two-dimensional array and one way of configuring the PE two-dimensional array, and should not be used to limit the disclosure to other possible architectures and configurations of the PE two-dimensional array. In some other embodiments, the size of the convolution kernel (or the weight data matrix) used for performing the convolution operation may be 3 × 3, or 1 × 1, 5 × 5, or 7 × 7.

According to other embodiments of the present disclosure, the tensor processor may simultaneously input image data for convolution operation to a plurality of PEs and simultaneously configure weight data for convolution operation to a plurality of PEs, thereby optimizing data transmission, by configuring a size and a structure of a two-dimensional array of PEs and by configuring a connection relationship and a data flow direction between PEs. In accordance with some embodiments of the present disclosure and referring to the architecture of the PE two-dimensional array shown in fig. 3, PEs numbered as PE3, PE5, and PE7 may synchronously receive line 3 of image data from a ping-pong controller or a buffer outside the PE two-dimensional array, while PEs numbered as PE1, PE4, and PE7 may synchronously receive line 1 of weight data from a ping-pong controller or a buffer outside the PE two-dimensional array. The two-dimensional arrays of PEs shown in fig. 3 to 6 are only used for illustrating one architecture of the two-dimensional arrays of PEs and one way of configuring the two-dimensional arrays of PEs, and should not be used to limit the disclosure to other possible architectures and configurations of the two-dimensional arrays of PEs.

In the embodiments shown in fig. 3 to 6, the operation result of the first PE is overlapped to the second PE, and then overlapped to the third PE. In other embodiments, the result of the first PE is passed to the second PE, and then passed to the first PE instead of the third PE after the second PE has finished the convolution operation. Then, the first PE receives the input new image data, and if necessary, may also configure new weight data or continue to keep the configured weight data unchanged, and performs convolution operation on the new image data, and then outputs the result.

In the embodiments shown in fig. 3 to 6, taking a binary convolutional neural network as an example, the PE performs a convolution operation of the binary convolutional neural network on input image data and configured weight data. In other embodiments, the PE may perform a full join operation on the input image data and the configured weight data. In other embodiments, the tensor processor may be used for derivation operations of a non-binary convolutional neural network, such as a neural network of data type INT4, INT8, INT16, or INT32, while the PE performs convolution operations corresponding to the data type of the neural network on the input image data and configured weight data.

Referring to fig. 7, the PE two-dimensional array of the tensor processor of one embodiment is a 12 row and 14 column matrix array, while the input data matrix is 3 rows and 13 columns. The tensor processor adjusts the two-dimensional array of PEs to place portions of the PEs in inactive states to reduce power consumption.

Referring to fig. 8, the PE two-dimensional array of the tensor processor of one embodiment is a 12 row 14 column matrix array and the input data matrix is 5 rows 27 columns. The tensor processor cuts the input data matrix into two input data matrices of 5 rows and 14 columns and 5 rows and 13 columns to adapt to the dimension of the two-dimensional array of PE.

With continued reference to fig. 7 and 8, according to some embodiments of the present disclosure, the tensor processor may determine the dimension (or size) of the two-dimensional array of PEs and also the connection relationship between PEs and the data flow direction according to information (such as matrix dimension) of the input image data matrix and the weight data matrix (or convolution kernel). The tensor processor may also cut the input data matrix according to the determined dimensions of the good two-dimensional array of PEs. The tensor processor may again adjust the dimensions of the two-dimensional array of PEs, which have been previously determined, if desired. Therefore, the tensor processor disclosed by the present disclosure can have the flexibility of processing input data matrices of different dimensions by cutting the input data matrix on the premise of maintaining the dimension of the current PE two-dimensional array unchanged. Tensor data required to be processed by the neural network can be expanded and then expressed into data matrixes with different dimensions, and the flexibility of the tensor processor for processing the input data matrixes with different dimensions is favorable for realizing high-speed derivation operation of the neural network. On the other hand, when the dimensionality of the input data matrix of the neural network keeps better consistency, or according to other actual requirements, the tensor processor can readjust the dimensionality of the PE two-dimensional array which is determined previously according to the dimensionality and other information of the input data matrix, so that the dimensionality of the PE two-dimensional array which is more suitable for processing the current input data matrix, as well as the connection relationship and the data flow direction between PEs are selected. For example, referring to the embodiments shown in fig. 3 to 6, when the input image data matrix is 5 rows and 5 columns and the weight data matrix is 3 rows and 3 columns, the tensor processor configures a two-dimensional array of 3 rows and 3 columns of PEs composed of 9 PEs, thereby implementing high-speed derivation operation of the input image data matrix. According to some embodiments of the present disclosure, the tensor processor both cuts the input image data matrix and adjusts the dimensionality and other configurations of the current two-dimensional array of PEs, thereby facilitating high-speed processing of complex and diverse input tensor data.

Referring to fig. 9, the tensor processor of one embodiment takes a first approach to matching the convolution kernel to the image data input after cutting the input data matrix. The first way means that the same convolution kernel is used for different image data inputs. As shown in fig. 9, the image data of the first line is different from the image data of the second line, and the convolution kernel or weight data of the first line and the convolution kernel or weight data of the second line are the same filter, that is, the same convolution kernel. The output results of both the first and second rows enter channel 1.

Referring to fig. 10, the tensor processor of one embodiment takes a second approach to matching the convolution kernel to the image data input after cutting the input data matrix. The second way means that the same image data input corresponds to different convolution kernels. As shown in fig. 10, the convolution kernel or weight data of the first line is different from the convolution kernel or weight data of the second line, and the image data of the first line and the image data of the second line are the same image data. The output results of both the first and second rows enter channel 1.

Referring to fig. 11, the tensor processor of one embodiment takes a third approach to matching the convolution kernel to the image data input after cutting the input data matrix. The third method is to input two different image data into two different convolution kernels respectively after cutting. As shown in fig. 11, the convolution kernel or weight data of the first line is different from the convolution kernel or weight data of the second line, and the image data of the first line is different from the image data of the second line. The output results of the first row go to lane 1 and the output results of the second row go to lane 2.

Referring to fig. 12, the tensor processor of one embodiment optimizes the transfer of data by configuring the data in a Multicast mode of propagation (Multicast). Multicast means that one read operation can read data from a ping-pong controller or a buffer outside the two-dimensional array of PEs and send to multiple PEs. In other words, a variable number of PEs receive a new data configuration through Multicast in one instruction cycle, so that the tensor processor can configure the same data into multiple PEs in one instruction cycle. And the multiple PEs configured with the same data through Multicast may be located in the same row or column in the two-dimensional array of PEs, or belong to any combination from any position in the two-dimensional array. For example, referring to fig. 3 and 12 simultaneously, the tensor processor configures the weight data line 1 to PEs numbered PE1, PE4, and PE7 simultaneously by Multicast, configures the weight data line 2 to PEs numbered PE2, PE5, and PE8 simultaneously by Multicast, and configures the weight data line 3 to PEs numbered PE3, PE6, and PE9 simultaneously by Multicast. The data configured by Multicast may include convolution kernels for convolution operations and may also include weight data for neural network derivation operations. According to some embodiments of the present disclosure, the data configured by Multicast may also include a threshold value required for a threshold operation of a binary neural network convolution operation. The threshold data for configuration may be already trained thresholds. The tensor processor transmits the input data to the two-dimensional array of PEs for convolution operation after the configuration of the convolution kernel (or weight data) and the threshold value through Multicast. That is, the input data such as the data of image features are input to the PE two-dimensional array for operation during the actual operation process. The tensor processor can adopt a static algorithm, after the trained threshold value is configured to the PE two-dimensional array, the input data matrix and the weight data matrix are continuously configured to the corresponding PE respectively through Multicast to carry out convolution operation, and therefore faster derivation operation speed is achieved.

Referring to fig. 13, a tensor processor of one embodiment has an architecture that supports fully-connected operations. The tensor processor realizes the support of the data flow which is fully connected on the fully-connected layer in the neural network by adjusting the connection relation and the data flow direction between the two-dimensional arrays of the PE.

Figure 14 shows a list of parameters that the tensor processor of one embodiment configures for each PE. Each line of the input image feature data is assigned a feature _ row _ id. And weight data is assigned a weight _ row _ id per row. Each PE is assigned a weight _ row _ id _ local associated with the PE and a feature _ row _ id _ local associated with the PE. When configuring the weight data, the weight _ row _ id with the configured weight data is compared with the weight _ row _ id _ local of the PE, and if the weight _ row _ id _ local is consistent, the PE receives the configured weight value, and if the weight _ row _ id _ local is inconsistent, the PE does not receive the configured weight value. When configuring the input image feature data, the configured input image feature data with the feature _ row _ id is compared with the feature _ row _ id _ local of the PE. If they are consistent, the PE receives the configured input image feature data, and if not, the PE does not receive the data. The tensor processor calculates feature _ row _ id, weight _ row _ id, feature _ row _ id _ local, and weight _ row _ id _ local for allocation based on information of input image feature data and weight data (or convolution kernel). For example, the image feature data having the dimension of 3 dimensions has three dimensions of length, width, and depth, and the convolution kernel data having the dimension of 4 dimensions has four dimensions of length, width, depth, and the number of convolution kernels. The tensor processor can calculate the number of the PEs to be called and the dimension of a two-dimensional array formed by the called PEs according to the information of three dimensions of the image characteristic data and the information of four dimensions of the convolution kernel data, and then calculates the feature _ row _ id allocated to each line of the image characteristic data, the weight _ row _ id allocated to each line of the convolution kernel data, and the feature _ row _ id _ local and the weight _ row _ id _ local allocated to the PEs.

Referring to fig. 3, 12, and 14, the tensor processor configures the weight data line 1 to PEs numbered PE1, PE4, and PE7 through Multicast. The tensor processor compares weight _ row _ id of the 1 st line of weight data with weight _ row _ id _ local of each PE. Only the PEs numbered PE1, PE4, and PE7 have their weight _ row _ id _ local matching the weight _ row _ id of row 1 of the weight data, and therefore only the PEs numbered PE1, PE4, and PE7 receive row 1 of the weight data.

With reference to fig. 14, for configuration parameters of a specific PE, the tensor processor may set a parameter model _ set to set whether the working mode of the PE is RGB operation or full binary operation, may set a parameter Psum _ set to set whether the PE receives an operation result of another PE and accumulates the operation result to the PE, may set a parameter Pool _ en to set whether the PE operates from a pooling layer after convolution operation, may set a parameter Dout _ on to set whether the PE outputs the operation result, and may set a parameter Row _ on to set whether the PE participates in operation. The tensor processor can also set a parameter K _ num to describe the dimensionality of a convolution kernel participating in convolution operation, and set a parameter Psum _ num to describe the number of intermediate results P _ sum performing accumulation operation. By configuring the configuration parameters of the PEs in advance, the tensor processor can control the working mode and working state of the PEs and control the connection relationship and data flow direction between the PEs. According to other embodiments of the present disclosure, the configuration parameters in the PEs may be readjusted according to actual needs, so as to readjust the connection relationship and the data flow direction between the PEs. Further, since whether the PE should receive the input data or the weight data is determined by matching the parameters configured in advance in the PE with the parameters to which the input data or the weight data is allocated, the tensor processor can improve the efficiency of configuring data by Multicast. Therefore, the tensor processor accelerates the tensor calculation of the neural network by adopting the arrangement mode of the two-dimensional matrix, optimizes data transmission and control operation by configuring the parameters of the PE, and adjusts the dimensionality of the tensor capable of being processed by cutting the input matrix and adjusting the two-dimensional array of the PE, thereby realizing the high-speed calculation device of the neural network capable of dynamically adjusting.

The embodiment shown in fig. 14 is only used to illustrate one possible combination of PE configuration parameters, and should not be used to limit other possible configurations of the PE according to the present disclosure. According to some embodiments of the present disclosure, the tensor processor may calculate feature _ row _ id _ local and feature _ column _ id _ local for matching image feature data, and weight _ row _ id _ local and weight _ column _ id _ local for matching convolution kernel or weight data to be configured to each PE, according to input image feature data and information of convolution kernel. Each image feature data for input to the PE is assigned a feature _ row _ id and a feature _ column _ id pair. And each convolution kernel or weight data is assigned a pair of weight _ row _ id and weight _ column _ id. The feature _ row _ id _ local of the PE and the feature _ row _ id of the image feature data, and the feature _ column _ id _ local of the PE and the feature _ column _ id of the image feature data are compared, respectively, when configuring the image feature data. The PE accepts image feature data only if the two matches are consistent and rejects image feature data if there is at least one mismatch. Similarly, weight _ row _ id _ local and weight _ row _ id of the PE, and weight _ column _ id _ local and weight _ column _ id of the PE, respectively, are compared when configuring the convolution kernel or weight data. The PE accepts the convolution kernel or weight data only if the two matches are consistent, and rejects the convolution kernel or weight data if there is at least one mismatch.

Referring to fig. 15, the tensor processor of one embodiment performs a binary neural network convolution operation and a threshold operation. Specifically, the tensor processor binarizes both the feature image data and the weight data, and further expresses as 0 and 1. Therefore, the feature image data and the weight data after binarization only need one bit of storage bit (if the data are expressed as 1 and-1, two bits of storage bit are needed, one bit of storage symbol and the other bit of storage value), thereby saving a large amount of storage space. Further, since the binarized feature image data and weight data are represented by 0 or 1 of one bit, the multiplication operation of the neural network convolution operation may be replaced with an exclusive nor (XNOR) logic gate. While the addition operation of the neural network convolution operation may be replaced with the popcount operation. The Popcount operation means the number of bits in the result each having a value of 1. For example, if the number of all bit position values of a 32-bit operand is 1, the number of 0 (representing-1 if represented by 1 and-1) is 32-a, and the final result is a- (32-a) ═ 2 a-32.

With continued reference to fig. 15, taking the convolution operation of the 32-bit feature image data and the 32-bit weight data as an example, the multiply-add operation of the two 32-bit operands can be replaced by: two 32-bit operands are subjected to exclusive-nor operation, and the obtained result is subjected to popcount operation. Specifically, as shown in fig. 15, the 16-bit intermediate result Psum of the binary neural network convolution operation is obtained by passing one bit of image data and one corresponding bit of weight data through an exclusive nor (XNOR) logic gate, passing the results of a plurality of exclusive nor (XNOR) logic gates through a 32-bit popcount, and then superimposing the results output by the plurality of 32-bit popcount. Because the multiplication and addition operation of the convolution operation is replaced by the logic gate operation and the popcount operation, a large amount of floating point operation is saved, and the tensor processor realizes the accelerated convolution operation of the binary neural network. The accelerated convolution operation of the binary neural network may be realized by any PE of the two-dimensional array of PEs of the tensor processor, or may be realized by a specific PE. According to some embodiments of the present disclosure, the tensor processor may also employ a convolution operation based on a floating point operation of a general neural network instead of the logic gate operation and the popcount operation.

The tensor processor shown in figure 15 thresholds the intermediate result Psum to improve the accuracy of the operation. Specifically, the tensor processor compares the intermediate result Psum of the convolution operation with a trained threshold value, and outputs the intermediate result Psum of 16 bits or a normalized 0 or 1 of 1 bit.

Assuming that the result after convolution is a, the batch processing function batchnorm (a) is γ · (a- μ)/σ + B, where u is the mean of the vector, σ is the variance, γ is the scaling factor, and B is the offset. Carrying out binarization operation, namely sign function operation:

binarization:

to simplify the operation, the Batchnormal operation and the binarization operation are simplified and combined into a threshold operation. It can be seen that batchnorm (a) is a cut point, and when it is 0 or more, the result is 1, and otherwise, the value is 0. Therefore, when batchnorm (a) is 0, a is u- (B · σ)/γ. Note that "a" is Tk, meaning batchnorm (Tk) ═ γ · (Tk- μ)/σ + B ═ 0. Therefore, when the convolution operation result is equal to or greater than Tk, the value is 1, and the other value is 0.

Suppose the training results for Batchnormal are: γ is 4, μ is 5, σ is 8, and B is 2. Calculating to obtain: tk ═ u- (B · σ)/γ ═ 5- (2 × 8)/4 ═ 1.

Simplifying and calculating before combination: when the convolution calculation result is 0, substituting into the Batchnormal formula: 4 × (0-5)/8 +2 ═ 0.5, and since it was smaller than 0, the result was 0. When the convolution calculation result is 2, substituting into the Batchnormal formula: 4 × (2-5)/8 +2 ═ 0.5, and the result was 1 because it was larger than 0.

Simplifying and calculating after combination: the threshold value Tk is 1, and is smaller than Tk when the convolution result is 0, so the result is 0. When the convolution result is 2, the result is 1 because it is larger than Tk. So the output results are consistent before and after simplification.

Therefore, simplifying the Batchnormal operation and the binarization operation into the thresholding operation can yield a result consistent with that before simplification, but the thresholding operation saves a lot of floating point operation resources. After the binary convolutional neural network training is completed, Tk is derived from the formula Tk ═ u- (B · σ)/γ, and the result of the convolution may be compared with the value of Tk. The tensor processor compares the convolution result with the trained threshold value Tk, so that a large number of floating point operations are not needed, the operation precision is improved, and the inference time of the neural network is shortened. The threshold operation may be implemented by any PE of the two-dimensional array of PEs of the tensor processor, or by a designated PE.

According to some embodiments of the present disclosure, the tensor processor replaces the multiply-add operation of the binary neural network convolution operation by the logic gate operation and the popcount operation, and compares the convolution result with the trained threshold value by the threshold operation to improve the operation precision, thereby implementing the binary neural network tensor calculation apparatus with both high speed and high precision. In other embodiments, the tensor processor further implements operations by the ping-pong controller while configuring the two-dimensional array of PEs. In other embodiments, the tensor processor further implements a static algorithm and processes the input data quickly by configuring the weight and threshold values in advance to the PE before beginning to input the eigen-image data. In other embodiments, the tensor processor further configures the PEs for self-pooling (posing) processing. In other embodiments, the tensor processor further enables the intermediate results Psum to be directly stacked between the PEs of the two-dimensional array of PEs without being read and written by the ping-pong controller through full connectivity. In other embodiments, the tensor processor further enables processing of input tensor data of different dimensions by cutting the input data matrix and adjusting the dimensions of the two-dimensional array of PEs. In other embodiments, the tensor processor further implements optimized data transfer via Multicast configuration data.

According to some embodiments of the present disclosure, the trained threshold may be obtained by a specified number of iterations, or a regression algorithm with convergence, or by comparison with a labeled test image. According to some other embodiments of the present disclosure, the trained threshold value may be obtained by a general method of training a neural network and machine learning.

Referring to fig. 16, the tensor processor of one embodiment includes a matrix 500 of a two-dimensional array of PEs, a control module 502, a weight data buffer 504, a threshold value data buffer 506, an image data buffer 508, and an input-output bus 510. According to some embodiments of the present disclosure, ping-pong controller 512 includes, but is not limited to, a control module 502, a weight data buffer 504, a threshold data buffer 506, and an image data buffer 508. According to some further embodiments of the present disclosure, the ping-pong controller 512 includes a control module 502, a weight data buffer 504, and an image data buffer 508. The input-output bus 510 receives data from the outside, such as tensor matrix data of the third order or higher. The input/output bus 510 writes the received data to the weight data buffer 504, the threshold value data buffer 506, and the image data buffer 508, respectively, according to the type and use of the received data (such as weight data, threshold value data, or image data). The weight data buffer 504, the threshold value data buffer 506, and the image data buffer 508 transmit the respective stored data to the control module 502 and read the weight data, the threshold value data, and the image data from the control module 502, respectively. The input data of the present embodiment is image data as an example, but the input data is not limited to image data, and may be voice data, a data type suitable for object recognition, or another data type. In other embodiments, the image data buffer 508 may be replaced with an input data buffer 514 (not shown). The input data buffer 514 is used to receive various types of input data from the input output bus, including image, sound, or other data types. The input data buffer 514 also transfers its stored data to the control module 502 and reads the corresponding data from the control module 502.

With reference to fig. 16, the control module 502 determines the number of PEs to be called according to the information of the weight data and the image data, and then determines the connection relationship and the data flow direction between the called PEs, thereby constructing the matrix 500 of the PE two-dimensional array. The control module 502 may also determine the number of PEs to be called, the connection relationship between the called PEs, and the data flow direction only according to the information of the weight data, only according to the information of the image data, or only by means of a program configured in the control module 502 in advance. The control module 502 transmits the weight data, threshold value data and image data to the constructed matrix 500 of the two-dimensional array of PEs. The control module 502 may adjust the PEs in the matrix 500 of the PE two-dimensional array or cut the image data matrix according to the dimensions of the image data matrix and the dimensions of the matrix of the PE two-dimensional array. The control module 502 may match the weight data or convolution kernel with the image data in different ways after cutting the image data matrix, including the same convolution kernel for different image data inputs, the same image data input corresponding to different convolution kernels, and also inputting two different image data into two different convolution kernels respectively after cutting.

With continued reference to fig. 16, the control module 502 may further adjust the connection relationships and data flow directions between the PEs of the matrix 500 of the two-dimensional array of PEs one or more times as needed. The control module 502 may allocate the same data (weight data or threshold value data or image data) to multiple PEs in one instruction cycle through Multicast, and the multiple PEs allocated with the same data may be located in the same row or column in the matrix 500 of the two-dimensional array of PEs, or belong to any combination of any positions in the matrix 500 from the two-dimensional array of PEs. The control module 502 may be configured with convolution kernels or weight data for convolution operations, as well as threshold data. The threshold data for configuration may be already trained thresholds. The tensor processor can adopt a static algorithm, and after the trained threshold value is configured to the matrix 500 of the PE two-dimensional array, convolution operation is carried out on the image data matrix and the weight data matrix, so that the faster operation speed is realized.

Referring to fig. 17, a ping-pong controller of a tensor processor of one embodiment includes a PE configuration parameter register 600. The PE configuration parameter register 600 is used to store configuration parameters such as weight data or threshold values. The ping-pong controller needs to read configuration parameters and configure the PE before performing convolution operation of any dimension. The ping-pong controller can be configured to operate as it is, for which a CONSUMER pointer 602 and a PRODUCER pointer 604 reside in a single set. The CONSUMER pointer 602 is a read-only register field that the tensor processor can examine to determine which toggle group the data path selects, while the PRODUCER pointer 604 is fully controlled by the tensor processor. In other embodiments, the PE configuration parameter register 600 is also used to store various parameters to be used for configuring a PE, such as the configuration parameters of the PE shown in fig. 15.

The input data in the embodiments of the present disclosure is exemplified by image data, but the input data is not limited to the image data. The input data may also be voice data, a data type suitable for object recognition, or other data types. The input data of the embodiments of the present disclosure is exemplified by third-order or higher-order tensor data, but the input data is not limited to the third-order or higher-order tensor data. The input data may also be second, first or zero order tensor data.

According to some embodiments of the present disclosure, the ping-pong controller may perform operations while configuring, that is, may perform convolution operations or full-connection operations on the input data matrix while configuring a new convolution kernel (or weight data) and/or a trained threshold, thereby increasing the processing speed of the tensor processor.

According to some embodiments of the present disclosure, the configuration input data, the weight data and the threshold values are independent of each other and may be performed synchronously.

According to some embodiments of the present disclosure, by adjusting the configuration, the PE two-dimensional array may be used to implement full join operation or convolution operation without reading and writing the intermediate result Psum, but directly done in the PE two-dimensional array. And switching between the fully-connected operation and the convolution operation can be achieved by adjusting the configuration.

According to some embodiments of the present disclosure, by configuring PEs, pooling (posing) operations in a neural network model may be mixed with convolution operations, i.e., configured PE self-contained pooling operations.

According to some embodiments of the present disclosure, the PE to be invoked may be flexibly selected, and a two-dimensional array of PEs (including controlling the data flow direction between PEs) having a specific configuration may be set according to actual needs by adjusting the configuration for the connection relationship and the data flow direction between PEs, and the input data matrix may be further divided according to the set two-dimensional array of PEs or a part of unused PEs may be selected to enter an inactive state.

According to some embodiments of the present disclosure, the tensor processor may adopt a static algorithm, and after configuring the trained threshold value to the PE two-dimensional array, perform convolution operation on the input data matrix and the weight data matrix, thereby achieving a faster operation speed.

According to some embodiments of the present disclosure, the PE used in the tensor processor may use a common neural network processor such as FPGA or GPU, or may be a specially designed processor as long as the minimum functional requirements required for implementing various embodiments of the present disclosure are met.

According to some embodiments of the present disclosure, the tensor processor is used for a binary convolutional neural network, and the PEs of the two-dimensional array of PEs of the tensor processor perform a convolution operation of the binary convolutional neural network on input image data and configured weight data. In other embodiments, the PE may perform a full join operation on the input image data and the configured weight data. In other embodiments, the tensor processor may be used for derivation operations of a non-binary convolutional neural network, such as a neural network with data types INT4, INT8, INT16, or INT32, and the PE performs a convolution operation or a full connection operation corresponding to the data type of the neural network on the input image data and configured weight data.

The features of the above-described embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the features in the above-described embodiments are not described, but should be construed as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An apparatus for a convolutional neural network, the apparatus comprising a two-dimensional array of processing engines, input data being sliced according to dimensions of the two-dimensional array of processing engines to fit the sliced input data to the dimensions of the two-dimensional array of processing engines, the two-dimensional array of processing engines configured to receive the sliced input data and weight data, the two-dimensional array of processing engines comprising:

an array of N processing engine rows and M processing engine columns, wherein N and M are both positive integers greater than or equal to 2;

the processing engine two-dimensional array performs convolution operation on the cut input data and the weight data and outputs an intermediate result;

wherein the processing engines of the two-dimensional array of processing engines are configured for self-contained pooling processing after the convolution operation.

2. The apparatus of claim 1, wherein the input data and the weight data are third-order or higher tensors, and wherein a connection relationship and a data flow direction between processing engines in the two-dimensional array of processing engines are configured according to a dimension of the input data and a dimension of the weight data.

3. The apparatus of claim 1, wherein a portion of the two-dimensional array of processing engines are set to a standby state based on a dimension of the input data and a dimension of the two-dimensional array of processing engines.

4. The apparatus of claim 1, wherein the input data is a third or higher order tensor.

5. The apparatus of claim 1, wherein when the weight data for a first portion of the processing engines in the two-dimensional array of processing engines is changed, the weight data for a second portion of the processing engines in the two-dimensional array of processing engines remains unchanged and the input data for the second portion of processing engines is changed.

6. The apparatus of claim 1, wherein the output of the two-dimensional array of processing engines is a regularization result obtained by performing a regularization operation on the intermediate result.

7. The apparatus of claim 1, wherein each processing engine in the two-dimensional array of processing engines is assigned an input data local ID and a weight data local ID, each component of the input data is assigned an input data ID, each component of the weight data is assigned a weight data ID, each processing engine in the two-dimensional array of processing engines matingly receives a component of the input data by comparing the input data local ID and the input data ID of the processing engine, and each processing engine in the two-dimensional array of processing engines matingly receives a component of the weight data by comparing the weight data local ID and the weight data ID of the processing engine.

8. The apparatus according to claim 1, wherein the input data and the weight data are both subjected to binarization processing, and the processing engine two-dimensional array performs binarization neural network convolution operation to obtain an intermediate result of the binarization neural network convolution operation.

9. The apparatus according to claim 8, wherein the binarized neural network convolution operation intermediate result is compared with a trained threshold value, so as to output 1-bit 0 or 1 obtained by regularizing the binarized neural network convolution operation intermediate result, and the threshold value satisfies a batch processing function batchnorm (a) of 0;

the batch processing function batchnorm (a) γ · (a- μ)/σ + B, where a is the convolution result of the neural network used to train the threshold value threshold, μ is the mean of the vectors, σ is the variance, γ is the scaling factor, and B is the bias.

10. The apparatus of claim 8, wherein the multiplication operation of the convolution operation of the binary neural network is implemented by an exclusive nor gate operation, and the addition operation of the convolution operation of the binary neural network is implemented by an operation of the number of 1.

11. An apparatus for a convolutional neural network, the apparatus comprising a two-dimensional array of processing engines, input data being sliced according to dimensions of the two-dimensional array of processing engines to fit the sliced input data to the dimensions of the two-dimensional array of processing engines, the two-dimensional array of processing engines configured to receive the sliced input data and weight data, the two-dimensional array of processing engines comprising:

wherein the two-dimensional array of processing engines has a fully connected structure, the intermediate results are stacked within the two-dimensional array of processing engines.

12. The apparatus of claim 11, wherein the input data and the weight data are third-order or higher tensors, and wherein a connection relationship and a data flow direction between processing engines in the two-dimensional array of processing engines are configured according to a dimension of the input data and a dimension of the weight data.

13. The apparatus of claim 11, wherein a portion of the two-dimensional array of processing engines are set to a standby state based on a dimension of the input data and a dimension of the two-dimensional array of processing engines.

14. The apparatus of claim 11, wherein the input data is a third or higher order tensor.

15. The apparatus of claim 11, wherein when the weight data for a first portion of the processing engines in the two-dimensional array of processing engines is changed, the weight data for a second portion of the processing engines in the two-dimensional array of processing engines remains unchanged and the input data for the second portion of processing engines is changed.

16. The apparatus of claim 11, wherein the output of the two-dimensional array of processing engines is a regularization result obtained by performing a regularization operation on the intermediate result.

17. The apparatus of claim 11, wherein each processing engine in the two-dimensional array of processing engines is assigned an input data local ID and a weight data local ID, each component of the input data is assigned an input data ID, each component of the weight data is assigned a weight data ID, each processing engine in the two-dimensional array of processing engines matingly receives a component of the input data by comparing the input data local ID and the input data ID of the processing engine, and each processing engine in the two-dimensional array of processing engines matingly receives a component of the weight data by comparing the weight data local ID and the weight data ID of the processing engine.

18. The apparatus according to claim 11, wherein the input data and the weight data are both subjected to binarization processing, and the processing engine two-dimensional array performs binarization neural network convolution operation to obtain an intermediate result of the binarization neural network convolution operation.

19. The apparatus as claimed in claim 18, wherein the multiplication operation of the convolution operation of the binary neural network is implemented by an exclusive nor gate operation, and the addition operation of the convolution operation of the binary neural network is implemented by an operation of the number of 1.

20. The apparatus according to claim 18, wherein the binarized neural network convolution operation intermediate result is compared with a trained threshold value, so as to output 1-bit 0 or 1 obtained by regularizing the binarized neural network convolution operation intermediate result, and the threshold value satisfies a batch processing function batchnorm (a) of 0;