CN113537482B - Neural network computing module, method and communication device - Google Patents

Neural network computing module, method and communication device

Info

Publication number
CN113537482B
CN113537482B (application CN202111071502.XA)
Authority
CN
China
Prior art keywords
data
convolution
neural network
convolution kernel
feature
Prior art date
Legal status
Active
Application number
CN202111071502.XA
Other languages
Chinese (zh)
Other versions
CN113537482A (en
Inventor
王赟
张官兴
郭蔚
黄康莹
张铁亮
Current Assignee
Shaoxing Ewa Technology Co Ltd
Original Assignee
Shaoxing Ewa Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shaoxing Ewa Technology Co Ltd filed Critical Shaoxing Ewa Technology Co Ltd
Priority to CN202111071502.XA
Publication of CN113537482A
Application granted
Publication of CN113537482B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks


Abstract

The invention provides a neural network computing module, a neural network computing method and a communication device in the field of data processing. The module comprises a data controller, a data extractor, a first shift register group and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information, and controls the data extractor to extract feature line data and convolution kernel line data, line by line, from the feature map data of an image to be processed. The first shift register group outputs the feature line data to the neural network computing unit in a serial-input, parallel-output manner. The neural network computing unit multiplies and accumulates the input feature line data with the corresponding convolution kernel line data to complete the convolution of one convolution kernel with the feature map data, and completes the accumulation of a plurality of convolution results within at least one cycle, thereby achieving circuit reconfiguration and data reuse.

Description

Neural network computing module, method and communication device
Technical Field
The invention relates to the field of data processing, and in particular to a neural network computing module, a neural network computing method and a communication device.
Background
A convolutional neural network is composed of an input layer, an arbitrary number of hidden layers serving as intermediate layers, and an output layer. The input layer has a plurality of input nodes (neurons). The output layer has as many output nodes (neurons) as there are object classes to be identified.
The convolution kernel is a small window in the hidden layer in which the weight parameters are stored. The kernel slides over the input image in steps of the stride and performs a multiply-accumulate operation with the input feature region it covers, i.e., each weight in the kernel is multiplied by the corresponding input value and the products are summed. A traditional convolution accelerator must first use the im2col method to unroll the input feature map data and convolution kernel data into matrix form according to the kernel size and stride, and then operate on the unrolled matrices, so that convolution can be accelerated under the rules of matrix multiplication. However, once the feature data matrix is unrolled, this approach requires a larger on-chip cache and more frequent reads from off-chip main memory, and data that cannot be efficiently reused occupies off-chip memory read-write bandwidth, increasing hardware power consumption. At the same time, an im2col-based acceleration scheme does not lend itself to a hardware logic circuit supporting convolution kernels of different sizes and strides. During the operation of the convolutional network, each input channel must perform matrix operations with several convolution kernels, so the feature map data must be fetched repeatedly; and if all the feature map data of every channel is cached in full, the data volume is huge. Moreover, during the matrix computation, the feature data size after matrix conversion far exceeds the original feature data size, wasting on-chip storage resources and making large workloads infeasible.
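As a point of reference, the sketch below illustrates the im2col expansion discussed above and the storage blow-up it causes. It is a minimal illustration written for this description, not circuitry or code from the patent.

```python
import numpy as np

def im2col(x, k, s):
    """Unroll k-by-k windows of an H-by-W feature map (stride s) into matrix rows."""
    H, W = x.shape
    out_h, out_w = (H - k) // s + 1, (W - k) // s + 1
    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i * s:i * s + k, j * s:j * s + k].ravel()
    return cols

x = np.arange(36, dtype=np.float32).reshape(6, 6)  # a 6x6 single-channel feature map
cols = im2col(x, k=3, s=1)                         # shape (16, 9): 144 values vs. 36 originals
w = np.ones(9, dtype=np.float32)                   # one 3x3 kernel, flattened
y = (cols @ w).reshape(4, 4)                       # convolution as one matrix multiply
```

Even in this toy case the unrolled matrix is four times the size of the original feature map, which is the on-chip cache cost the invention avoids.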
Disclosure of Invention
Accordingly, to overcome the above-mentioned shortcomings of the prior art, the present invention provides a neural network computing module, method and communication device.
In order to achieve the above object, the present invention provides a neural network computing module comprising a data controller, a data extractor, a first shift register group and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information, and controls the data extractor to extract feature line data and convolution kernel line data, line by line, from the feature map data of an image to be processed; the first shift register group outputs the feature line data to the neural network computing unit in a serial-input, parallel-output manner; the neural network computing unit multiplies and accumulates the input feature line data with the corresponding convolution kernel line data to complete the convolution of one convolution kernel with the feature map data, and completes the accumulation of a plurality of convolution results within at least one cycle, thereby achieving circuit reconfiguration and data reuse.
In one embodiment, the neural network computing module further includes a second shift register group; the second shift register group takes the convolution kernel data in serially and, selecting one output according to the stride, feeds it to the current convolution multiply-add array and the next convolution multiply-add array of the neural network computing unit.
In one embodiment, the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The multiply-add subunit multiplies the input feature line data with the corresponding convolution kernel line data and accumulates the products with the convolution-line partial-sum data held in the partial-sum buffer subunit; when the convolution of a convolution kernel line with the feature line data of its convolution window is complete, the partial sums of the window's several line convolutions are accumulated, realizing one sliding-window convolution of the kernel. Each group's convolution multiply-add arrays of different stages output their line results to an accumulator once per convolution-line period, and the accumulator sums, through an adder tree, the line results output by the same-stage arrays of all the groups corresponding to the rows of the current convolution kernel, thereby realizing the convolution of one complete kernel.
In one embodiment, the neural network computing unit includes a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays. Each acceleration slice completes at least the convolution of one input channel's feature map data with the convolution kernel data, so that the plurality of slices together convolve the feature map data of multiple input channels with the convolution kernel data.
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix; a first operation matrix within the second acceleration matrix completes the convolution of several input channels' feature data with one convolution kernel, and several second acceleration matrices complete, in parallel, the convolution of several input channels' feature data with several convolution kernels.
In one embodiment, the several groups of convolution multiply-add arrays within each neural network acceleration slice acquire the feature line data in a parallel-input manner, and acquire the convolution kernel data in a serial-input manner.
In one embodiment, the data controller is configured to obtain, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, to instruct the multiplexers to open and close so as to adjust the data path, and to input the feature data and corresponding weight data into the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information and delivers it to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups; each cache group buffers part of the feature data of one input channel and is coupled with at least one neural network acceleration slice; the several slices share one convolution kernel data cache unit, each slice acquiring the feature map data of one input channel from its corresponding cache group while the same convolution kernel data is distributed to all the slices.
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; it instructs the multiplexer to output the feature data from the first shift register group to the neural network computing unit in a serial-input, parallel-output manner, and instructs the multiplexer to output the convolution kernel data from the second shift register group, serially input with one output selected according to the stride, to the current convolution multiply-add array and the next convolution multiply-add array of the neural network computing unit.
The invention also provides a neural network acceleration method, comprising: adjusting the data path with a data controller according to configuration information and instruction information, and controlling a data extractor to extract feature line data and convolution kernel line data, line by line, from the feature map data of an image to be processed; outputting the feature line data to a neural network computing unit through a first shift register group in a serial-input, parallel-output manner; and using the neural network computing unit to multiply and accumulate the input feature line data with the corresponding convolution kernel line data, completing the convolution of one convolution kernel with the feature map data and the accumulation of a plurality of convolution results within at least one cycle, thereby achieving circuit reconfiguration and data reuse.
The invention also provides a communication device comprising a central processing unit (CPU), a DDR SDRAM memory and the neural network computing module described above, all communicatively connected. The CPU controls the neural network computing module to start the convolution operation, and the DDR SDRAM supplies feature map data and weight data to the data cache module of the neural network computing module.
Compared with the prior art, the invention has the following advantages: by reusing the feature map data, the operation speed is optimized; further, a tiling technique partitions the feature map data of each input channel, so that only part of the feature map data is held at a time and the cached data is updated once the computation on it finishes. This reduces the on-chip cache requirement for feature map data and allows the computing cores to run without stalls.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram of a neural network computational module in one embodiment of the invention;
FIG. 2 is a schematic diagram of a neural network computing module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network computational unit in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a neural network computational module loading feature line data and convolution kernel line data in an embodiment of the present invention;
FIG. 5 is a timing diagram of a convolution operation performed on a PU by a 3 × 3 convolution kernel with a stride of 1 in an embodiment of the present invention;
FIG. 6 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a stride of 1 in another embodiment of the present invention;
FIG. 7 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a stride of 2 in another embodiment of the present invention;
FIG. 8 is a flowchart of a neural network acceleration method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number and aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that aspects may be practiced without these specific details.
The embodiment of the application provides a communication device whose hardware architecture comprises a central processing unit (CPU), a DDR SDRAM memory and a neural network computing module, all communicatively connected. The CPU controls the neural network computing module to start the convolution operation; the DDR SDRAM supplies convolution data and convolution parameters to the data cache module of the neural network computing module. The neural network computing module then completes the convolution operation using the obtained convolution data and convolution parameters, writes the result back to the memory address designated in the DDR SDRAM, and notifies the CPU that the convolution operation is complete.
As shown in fig. 1 and fig. 2, an embodiment of the present application provides a neural network computing module 100 including a data controller 102, a data extractor 104, a first shift register set 106, and a neural network computing unit 108.
The data controller 102 adjusts the data path according to the configuration information and instruction information, and controls the data extractor 104 to extract feature line data and convolution kernel line data, line by line, from the feature map data of the image to be processed.
The first shift register group 106 outputs the feature line data to the neural network computing unit 108 in a serial-input, parallel-output manner.
The PE accelerator is the most basic unit of neural network acceleration processing. Each unit comprises at least a multiplier, an adder and a partial-sum result buffer; it can complete at least one convolution of a weight parameter with input feature data, and can complete the accumulation of several convolution results within at least one cycle. As shown in fig. 2, the first shift register group 106 takes the feature map line data in serially and outputs it in parallel to each stage of PE through the multiplexer; the current convolution kernel data is sent to the next-stage multiply-add unit at the same time as it is sent to the first-stage shift register group 106. The weights between PE0i and PE1i are connected in series through a first shift register group; PE1i likewise has a second shift register group serially connected to the next-stage PE2i via a multiplexer.
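For illustration only, the following behavioral sketch models such a PE in Python. It is an assumption distilled from the description above, not the patent's circuit, and all names are invented for this sketch.

```python
class PE:
    """Toy model of one PE: multiplier, adder and partial-sum result buffer."""

    def __init__(self):
        self.psum = 0.0      # partial-sum / result buffer
        self.weight = None   # weight register on the serial weight path

    def step(self, feature, weight_in):
        """One clock: multiply-accumulate the arriving pair, then queue the
        weight so it reaches the next-stage PE on the following cycle."""
        weight_out = self.weight                # weight latched last cycle moves on
        self.weight = weight_in                 # latch the weight that just arrived
        if weight_in is not None:
            self.psum += feature * weight_in    # MAC into the partial-sum buffer
        return weight_out


# Features are broadcast to every stage while the weights ripple one stage per
# cycle; a real PE would hand psum to the row-accumulation buffer after each
# kernel-row period rather than accumulate indefinitely.
pes = [PE(), PE(), PE()]
weights = [10.0, 20.0, 30.0]             # F00, F01, F02, circulating
for t, x in enumerate([1.0, 2.0, 3.0]):  # X00, X01, X02
    w = weights[t % 3]                   # serial weight stream into stage 0
    for pe in pes:
        w = pe.step(x, w)                # each stage delays the weight one cycle
print(pes[0].psum)                       # X00*F00 + X01*F01 + X02*F02 = 140.0
```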
In one embodiment, the second shift register group 110 likewise takes its data in serially and outputs it to the multiply-add units through the multiplexer: operating in a serial-input mode, it outputs the convolution kernel data through the second multiplexer, according to the stride signal, to the current convolution multiply-add array and the next convolution multiply-add array of the neural network computing unit.
The feature line data and convolution kernel line data fed into the multiply-add unit are multiplied element by element and then accumulated, the products being summed with the convolution-line partial-sum data held in the partial-sum buffer unit. Once the convolution of one convolution kernel line with the feature line data of its convolution window is complete, the partial-sum output of that convolution line is summed with the convolution results of the kernel's other lines, realizing one sliding-window convolution of the kernel.
The feature line data (X00, …, X0n) are fed continuously and in line order, in parallel, into the PE group; the convolution kernel line data (F00/F01/F02) are fed serially into the PE group in the cyclic order F00/F01/F02-F00/F01/F02-F00/F01/F02. Each stage of PE outputs the partial-sum result of the current convolution kernel line after one kernel-line period (for a kernel line of size 3, the line period is 3 cycles). PEs of different stages output the partial sums of the kernel line's sliding convolution over the feature map line according to the stride s. After each group's same-stage PEs have output their partial sums within one convolution-line period, the partial sums output by the same-stage PEs of all groups corresponding to the rows of the current convolution kernel are accumulated through the adder tree, realizing the convolution of one kernel and yielding the convolution computation shown in fig. 5.
In fig. 5, PE00 outputs, over three consecutive cycles, the partial sum of the convolution of the kernel's first row with the corresponding feature line data; PE10 outputs, over three consecutive cycles, the partial sum of the first kernel row with the window adjacent to the one computed by PE00; and PE20 outputs the partial sum for the window adjacent to PE10's. Multiple PE groups thus accelerate the convolution of different kernel rows with their corresponding feature data rows, which is equivalent to the kernel sliding and traversing the feature map in the row direction. The first shift register group, according to the stride parameter s, uses a selection signal to pick the output of the corresponding shift register as the input of the next-stage PE, realizing the traversing convolution of the window in the row direction with stride s; the other PE groups, by reusing part of the input feature map line data and kernel line data, realize the window-sliding convolution in the feature map column direction according to the stride, accelerating the convolution.
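The schedule just described can be condensed into the following toy model (an assumption written for this description, not the patent's circuit): each PE stage p finishes the window starting at column p·s and then skips ahead by the number of stages, so the group as a whole slides the kernel row across the feature row.

```python
def pe_group_row_conv(feature_row, kernel_row, n_pe=3, stride=1):
    """Window results one PE group produces for a single kernel row."""
    k = len(kernel_row)
    out = {}
    for p in range(n_pe):                      # PE stage p starts one cycle later
        start = p * stride
        while start + k <= len(feature_row):
            out[start] = sum(feature_row[start + i] * kernel_row[i]
                             for i in range(k))
            start += n_pe * stride             # next window this stage handles
    return [out[s] for s in sorted(out)]

# X00..X07 against F00..F02, stride 1: six sliding-window partial sums,
# matching the six results the text obtains after eight cycles.
print(pe_group_row_conv([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1]))  # [6, 9, 12, 15, 18, 21]
```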
The neural network computing unit 108 performs the convolution of at least one convolution kernel with the feature map data, and completes the accumulation of several convolution results within at least one cycle, thereby achieving circuit reconfiguration and data reuse.
In one embodiment, as shown in fig. 3, the neural network computing unit 108 includes a plurality of neural network acceleration slices 1081, each of which includes a plurality of convolution multiply-add arrays 1082; each slice completes the convolution of at least one input channel's feature map data with one convolution kernel's data, and the several slices together convolve the feature map data of several input channels with the convolution kernel data.
A neural network acceleration slice (PU slice) may be composed of a plurality of PE acceleration processing units (PE accelerators), and multiple PU slices can realize data-parallel computation across different dimensions through system configuration. The neural network acceleration circuit may include a plurality of PU slices, activation/pooling circuits, an accumulation unit, and so on. The PE accelerator, as noted above, comprises at least a multiplier, an adder and a partial-sum result buffer, completing at least one convolution of a weight parameter with input feature data and accumulating several convolution results within at least one cycle. In this application, a PU slice may contain PE accelerators arranged in an array. The feature data cache unit comprises a plurality of feature data cache groups; each cache group buffers part of the feature data of one input channel and is coupled with at least one PU, i.e., each PU acquires the feature map data of one input channel from its corresponding feature data cache group. Meanwhile, a plurality of PUs share one convolution kernel data cache unit, i.e., the line data of the same convolution kernel is broadcast to the several PUs, realizing parallel multi-channel-input, single-channel-output convolution.
To realize data reuse and relieve the read-write bandwidth pressure on main memory, each PU slice may compute, in the row direction of the input feature map, the convolution results of the current single-channel input feature line data (the data covered by the kernel along the row direction of the image to be processed) with several convolution kernels in turn: the kernel data is updated and the convolution restarted until the single feature line data has been convolved with all kernels; after the feature line data of the single input channel has finished with all kernels, the input feature map line is updated, i.e., the convolution window under the kernel moves on, and the above steps are repeated until the current input feature map has been convolved with all kernels, producing the multi-channel output feature map of that input; after the convolution of the single channel's input feature map with all kernels is complete, the feature map input channel is updated. The order of these operations can be configured flexibly, according to the actual situation, by the configuration information or instruction information; a hedged sketch of this schedule follows below.
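The loop nest below restates that reuse schedule as pseudocode; the names (`row_blocks`, `convolve`) are illustrative, not API from the patent, and the ordering shown is the default one described above.

```python
def reuse_schedule(input_channels, kernels, convolve):
    """Default loop order: a cached block of feature rows is reused across
    every kernel before the rows, and finally the channel, are updated."""
    for channel in input_channels:           # outermost: switch input channel
        for rows in channel.row_blocks():    # then advance the cached feature rows
            for kernel in kernels:           # reuse the cached rows across kernels
                convolve(rows, kernel)       # row-direction convolution on a PU
```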
In one embodiment, the neural network computing unit comprises a multiply-add subunit and a partial-sum buffer subunit.
When the convolution of the convolution kernel line data with the feature line data of the corresponding convolution window is complete, the several line convolution results of that window are accumulated from their partial sums, realizing one sliding-window convolution of the kernel.
Each group's convolution multiply-add arrays of different stages output their line results to an accumulator once per convolution-line period, and the accumulator sums, through an adder tree, the line results output by the same-stage arrays corresponding to all rows of the current convolution kernel, thereby realizing the convolution of one complete kernel.
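Continuing the toy model above (again an illustration, not the patent's circuit), the adder-tree step reduces the per-row partial sums produced by `pe_group_row_conv` into full kernel results:

```python
def kernel_conv(feature_rows, kernel_rows, stride=1):
    """Sum same-stage row partial sums across all kernel rows (adder tree)."""
    row_partials = [pe_group_row_conv(f, k, stride=stride)
                    for f, k in zip(feature_rows, kernel_rows)]
    return [sum(col) for col in zip(*row_partials)]

# Three feature rows against a 3x3 kernel of ones: each output is the sum of
# one 3x3 window, i.e. one sliding-window convolution result per position.
rows = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
print(kernel_conv(rows, [[1, 1, 1]] * 3))  # [18, 27, 36]
```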
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix. A first operation matrix within the second acceleration matrix completes the convolution of several input channels' feature data with one convolution kernel, and several second acceleration matrices complete, in parallel, the convolution of several input channels' feature data with several kernels. One acceleration slice can accelerate the convolution between one feature map and one kernel. The PU slices of the same second acceleration matrix can simultaneously compute the convolution of the same kernel with different input feature data, the slices sharing one convolution kernel cache unit while each draws on its own feature data cache group; the PU slices of different second acceleration matrices can simultaneously compute the convolution of several different kernels with the same input feature data, each slice having its own kernel cache unit while the slices share the feature data cache.
The several PUs form a first sub-neural-network operation PU matrix, and several first sub-matrices form a second neural network PU matrix; each sub-matrix within the second PU matrix completes the convolution of several input channels' feature data with one convolution kernel, and the several sub-matrices can complete, in parallel, the convolution of several input channels' feature data with several convolution kernels.
In one embodiment, each group of convolution multiply-add arrays obtains the feature line data in a parallel-input manner and the convolution kernel data in a serial-input manner. A PU comprises at least one group of PEs, and each PE group is responsible for the convolution of one convolution kernel line with the corresponding feature map line data; several PE groups together realize the convolution of several kernel lines with the corresponding feature map lines, i.e., each PE group forms one row, and several rows of PEs complete the convolution of at least one kernel row with its corresponding feature data. Each PE group obtains feature line data in parallel, i.e., each element of the feature line data is broadcast simultaneously to every stage of PE in the group; meanwhile, each group obtains kernel line data serially, i.e., the kernel line elements flow from the first-stage PE to the next stage on every clock cycle.
The neural network computing module stores the values covered by the convolution kernel on the input image by rows (columns), and can complete the convolution of the current data with only a single read of the kernel from main memory, without unrolling the data into matrix form. This not only reduces memory accesses and improves the energy efficiency of data access, but also reuses feature data in time, optimizing the operation speed. Further, a tiling technique partitions the feature map data of each input channel, so that only part of the feature map data is held at a time and the cached data is updated once the computation finishes, reducing the on-chip cache requirement for feature map data and allowing the computing cores to run without stalls.
In one embodiment, as shown in fig. 4, the neural network acceleration slice 1081 comprises a plurality of first multiplexers 1083 coupled in parallel, one to one, with the convolution multiply-add arrays, and a plurality of second multiplexers 1084 coupled in series, one to one, with the convolution multiply-add arrays. The first multiplexer obtains the feature line data corresponding to its convolution multiply-add array through a data selection signal and inputs it in parallel to each stage of the corresponding array, while the second multiplexer obtains the corresponding convolution kernel line data and inputs it serially to each stage of the array, completing the convolution multiply-add operation.
The first multiplexers are each coupled in parallel with their corresponding PE group; through a selection signal, each PE group can select at least one of two candidate feature line data, and the sixth PE group in fig. 4 can select among the data of 6 different lines. The second multiplexers are each coupled in series with their corresponding PE group; each group can select at least one of two candidate convolution kernel line data, and the sixth PE group in fig. 4 can select among 6 different kernel line data. The first multiplexer obtains the corresponding feature line data through a data selection signal (provided by the configuration information or the data-loading instruction information) and inputs it in parallel to every stage of PE in the corresponding group, while the second multiplexer selects the corresponding kernel line data and inputs it serially to each PE stage, completing the convolution multiply-add operation.
The remaining idle PE groups can acquire, through the multiplexers, the feature line data and kernel line data needed for convolution in the feature map column direction, realizing reuse of the input data. For example, for a 3 × 3 kernel with stride 1, taking the array of fig. 4 as an example, the first three PE groups complete the parallel accelerated convolution of the kernel in the row direction (PE00, PE10, PE20 form one group; PE01, PE11, PE21 another; PE02, PE12, PE22 a third). The fourth PE group can then reuse the feature line data of the second feature extractor 1 together with the kernel line data of the first weight acquirer 0; the fifth group reuses the feature line data of the third feature extractor 2 with the kernel line data of the second weight acquirer 1; and the sixth group reuses the feature line data of the fourth feature extractor 3 with the kernel line data of the third weight acquirer 2. These data pairs are fed into the corresponding PE groups to complete the sliding of the convolution window in the feature map column direction, improving the reuse of feature map data, reducing the occupation of read-write bandwidth, and keeping the whole PE array fully loaded and operating efficiently; an illustrative wiring table follows below.
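For concreteness, the reuse wiring just described can be tabulated as follows; the extractor and acquirer indices follow the figure's zero-based numbering, and the names are invented for this sketch.

```python
# PE group -> (feature extractor, weight acquirer) for a 3x3 kernel, stride 1.
# Groups 3-5 reuse rows already fetched for groups 0-2, sliding the window
# down one feature row (column-direction sliding).
pe_group_inputs = {
    0: ("feature_extractor_0", "weight_acquirer_0"),  # kernel row 0, feature row 0
    1: ("feature_extractor_1", "weight_acquirer_1"),  # kernel row 1, feature row 1
    2: ("feature_extractor_2", "weight_acquirer_2"),  # kernel row 2, feature row 2
    3: ("feature_extractor_1", "weight_acquirer_0"),  # reuse: window slid down one row
    4: ("feature_extractor_2", "weight_acquirer_1"),
    5: ("feature_extractor_3", "weight_acquirer_2"),
}
```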
In one embodiment, the neural network computing module further comprises a first shift register group and a second shift register group, and the neural network computing unit comprises a multiply-add subunit and a partial-sum buffer subunit. The first shift register group, in a serial-input, parallel-output manner, outputs the feature line data to the multiply-add subunit through the first multiplexer. The second shift register group, serially input with one output selected according to the stride, outputs the convolution kernel data through the second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the input feature line data with the kernel data and accumulates the products with the convolution-line partial-sum data in the partial-sum buffer subunit; when the convolution of the kernel data with the feature line data of the corresponding window is complete, the partial sums of the window's several line convolution results are accumulated, realizing one sliding-window convolution of the kernel. Each group's multiply-add arrays of different stages output their line results to an accumulator once per convolution-line period, and the accumulator sums, through the adder tree, the line results of the same-stage arrays corresponding to all rows of the current kernel, realizing the convolution of one complete kernel.
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, instructs the multiplexers to open and close so as to adjust the data path, and inputs the feature data and corresponding weight data into the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information and delivers it to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups; each cache group buffers part of the feature data of one input channel and is coupled with at least one neural network acceleration slice; the several slices share one convolution kernel data cache unit, each slice acquiring the feature map data of one input channel from its corresponding cache group while the same kernel data is distributed to all the slices. Besides connecting weight acquirer 0 to row 0 of the PE units, the data controller also broadcasts its data to the PE units of the 4th row; that is, the first row of weight parameters performs a row convolution with feature data 1 and also with feature data 2 (feature data 2 being connected to the 2nd-row PE units and, through the multiplexer, to the 4th-row PE units as well), so that the kernel convolves adjacent rows on each pass. The data controller 102 uses several acceleration slices to compute the convolution results of single-channel input feature data with the kernel along the row direction of the image to be processed.
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; it instructs the multiplexer to output the feature data from the first shift register group to the neural network computing unit in a serial-input, parallel-output manner, and instructs the multiplexer to output the convolution kernel data from the second shift register group, serially input with one output selected according to the stride, to the current convolution multiply-add array and the next convolution multiply-add array of the neural network computing unit.
As shown in fig. 6, when the convolution kernel line size (e.g., 5) is larger than the number of PEs in each PE group (e.g., 3), the per-stage address accesses of the feature line data extractor to the cached feature line data may conflict within a kernel-line operation period: each PE stage synchronously receives the same feature element input, and in the next kernel-line cycle the feature data address required by some PEs conflicts with the cache address currently being accessed by the feature extractor. However, the data those PEs need has already been accessed in an earlier cycle and is buffered in the first shift register group. The data controller therefore obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, instructs the multiplexer to open and close the data path, inputs the feature data and corresponding weight data into the corresponding computing unit according to the instruction information, and resolves the conflict between some PEs' access addresses and the feature extractor's current access address by selecting the output of the corresponding register in the first shift register group. In this way the PE units, via the data controller and the multiplexers, can obtain conflicting data, so that convolutions with kernels of different sizes are supported and PE stalls caused by per-stage address conflicts under misaligned accesses are avoided; a minimal sketch of this fallback path follows below.
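The sketch below illustrates that fallback path, under the assumption that the shift registers hold the most recently read feature elements; all names are invented for this illustration.

```python
def fetch_feature(addr, extractor_ptr, feature_buffer, shift_regs):
    """Serve one PE's feature read: from the extractor if the addresses agree,
    otherwise replay the element from the shift-register history.

    shift_regs[d] holds the element read d+1 cycles ago."""
    if addr == extractor_ptr:
        return feature_buffer[addr]        # normal path via the feature extractor
    lag = extractor_ptr - addr - 1         # how many cycles ago it was read
    return shift_regs[lag]                 # conflict resolved from the registers
```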
As shown in fig. 2, the feature extractor may load data into the PE array in row order according to the data addresses. The PE array is provided with 6 feature extractors, which may be FIFO units or addressing circuits (addressing the cache units in sequence and loading the data to the PEs). If some of them are empty, the currently idle feature extractors can be signal-gated to reduce power consumption; the weight acquirers are handled similarly. For 5 × 5 or 7 × 7 convolution kernels, all the feature extractors and weight acquirers can be kept fully loaded. The number of feature extractors can be configured flexibly, reduced or increased, with the PE array shrinking or growing correspondingly.
A multi-input multiplexer selects the input channel according to the data controller's configuration information, realizing the convolution of a kernel with the corresponding feature line data. For example, with a 3 × 3 kernel, the 4th, 5th and 6th PE columns reuse the data output by the 2nd, 3rd and 4th feature data extractors respectively. With a 5 × 5 kernel, the 4th and 5th PE columns connect to the corresponding 4th and 5th feature data extractors while the 6th PE column connects to the 2nd; and because the kernel size is not aligned with the number of PE columns, the input paths of the input multiplexer must be reconfigured after each row of convolution. In the first row period, the convolutions of 5+1 rows are completed at once (the 5 × 5 kernel completes one row period of convolution over the feature map on PE columns 1 to 4, while the kernel's first row, shifted down one row, convolves a single additional row of feature data); in the second row period, PE columns 1 to 4 convolve kernel weight rows 2 to 5 with the data of feature extractors 1 to 4, while the 6th PE column works on the data of the 1st and 2nd feature extractors. Fig. 5 shows only some of these connections; others are omitted.
As shown in fig. 5, the weight parameters flow serially into each PE unit along the column direction (three PE units are arranged per column to fit a 3 × 3 kernel), and the PE units of different rows within the same column are linked by a group of delay registers through a multiplexer; this delay structure adjusts the sliding stride of the convolution window. Each PE column receives the feature data of a specific row through the multiplexer, output in parallel to every row of the column. Because the weight parameters flow serially into each PE, in the initial state the feature data reaching the row-2 and row-3 PEs must be delayed by 1 and 2 cycles respectively before being convolved with the corresponding weight parameters. Each PE unit also contains a group of feature-data delay circuits, which firstly align the feature data with the weight parameters in the initial state and, when the kernel exceeds 3 × 3, reuse the corresponding feature data through the multi-input selection circuit, preventing address conflicts when data is read.
The feature data extractor acquires at least one consecutive feature line from the feature cache circuit on the PU slice (since this embodiment uses a 3 × 3 kernel, the feature extraction circuit requests 4 consecutive feature lines from the cache at once in order to make full use of the multiply-add resources in the PE array). Meanwhile, the weight acquirer fetches at least one row of weight parameters at a time (here all three rows of weights are fetched and loaded directly into the weight acquirer). With 4 feature lines and 3 kernel lines as input, at least two column-direction convolution results are output in every kernel-line cycle.
In this embodiment, the feature extractor and the weight acquirer form the data extractor, which may be a FIFO structure or an addressing circuit. Each data extractor extracts, line by line, the first row of the weights (F00, F01, F02) and the first feature line data (X00, X01, X02, X03, X04, X05, X06, …), and extracts the feature data of further lines after each line period.
In the initial state, in the first cycle F00 is fed to register 1 and the multiplier in PE00, and feature data X00 is fed to PE00, PE01 and PE02. In PE00, X00 is multiplied by F00 and the result is fed into the row-accumulation BUFFER; when F00 arrives in PE01 in the next cycle, X00 is stored in each PE's feature register group 1. In the second cycle X01 is sent to PE00, PE01 and PE02; X00 shifts to feature register group 2 and X01 enters feature register group 1; at the same time F01 is sent to PE00 while the F00 in PE00 moves on to PE01. X01 and F01 are convolved in PE00 and the result is added to the X00 × F00 result buffered in the BUFFER from the previous cycle; meanwhile PE01 multiplies X01 by F00 and feeds the result into its accumulation buffer. In the third cycle F02 enters PE00, the F01 in PE00 enters PE01, the F00 in PE01 enters PE02, and X02 is broadcast synchronously to every PE and multiply-accumulated with the corresponding weight parameters. After these three cycles, the convolution of one row of weights with the corresponding feature data is achieved; after four cycles, the kernel row has effectively slid once along the feature data row; and after eight cycles, the results of the kernel row sliding 6 times over the corresponding feature data row are obtained. Since the three rows of a 3 × 3 kernel are processed in parallel along the column, the results of 6 convolution sliding windows are completed within eight cycles; after the initialization phase, the whole multiply-add array runs fully loaded.
In another embodiment, as shown in fig. 6, the kernel is 5 × 5 with stride 1. The three PE stages accelerate in parallel the kernel-line convolutions of three convolution windows along the feature map row; after PE00 completes one kernel line, it starts the kernel-line convolution of the fourth window (the second and third windows being handled in the same way by PE01 and PE02). At this point PE00's feature access must restart from X03, but PE01 and PE02 need the data of X05, so the feature data extractor's addressing of X03 and X05 would conflict. To avoid the conflict, the feature-data register-group selection circuit can supply X03 (the feature data X04, X03, X02, X01 of the last 4 clock cycles having been buffered), i.e., X03 is taken from register group 2. Meanwhile, once the selection signals of the 3 PE register groups agree, the feature extractor's address pointer is updated again, e.g., to X05 in cycle 8 (every PE stage then needs the X05 data, and the addresses accessed in the two following cycles coincide); after the next kernel-line period (e.g., clock cycle 11), the feature-data selection circuit is reconfigured. Through this circuit configuration, fully loaded computation of the 5 × 5 kernel is achieved flexibly.
In another embodiment, as shown in fig. 7, the kernel is 5 × 5 with stride 2. After 5 cycles every PE has loaded its data, but PE00, having finished the kernel-line operation of one window, needs the line data of the next window starting at X06, while the feature extractor's address pointer is occupied by PE01 and PE02 and points at X05 (whose data must be fetched from the feature data cache circuit). PE00 therefore idles for one cycle and, once the extractor's pointer reaches X06, performs the fourth window's kernel-line convolution, i.e., X06 × F00. Likewise, whenever the feature data one PE stage needs conflicts with the extractor's current access address for another PE, the conflicting PE unit is idled for the current cycle and fetches its feature data for the multiplication in the next cycle according to the extractor's address. Some PE units are thus not fully loaded and sit idle during the cycles of partial address conflict, but for a 5 × 5 kernel with stride 2 the PE utilization is (3 × 6 − 3) / (3 × 6) × 100% ≈ 83%, i.e., (number of PEs × iteration period − idle slots per iteration period) / (number of PEs × iteration period), where the iteration period runs from the first PE idle at cycle 6 to cycle 11 after initialization, i.e., PE00 idles once every 6 cycles, and the idle count is the group's total number of idle slots within the period. The larger the kernel, the higher the overall PE utilization.
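The utilization arithmetic above works out as in this small check (values taken from the paragraph; the formula generalizes as noted):

```python
n_pe = 3                 # PE stages per group
iteration_cycles = 6     # cycle 6 to cycle 11: PE00 idles once every 6 cycles
idle_slots = 3           # total idle slots of the group per iteration period

utilization = (n_pe * iteration_cycles - idle_slots) / (n_pe * iteration_cycles)
print(f"{utilization:.1%}")  # 83.3% for the 5x5, stride-2 case
```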
As shown in fig. 8, the present embodiment further provides a neural network acceleration method, including the following steps:
step 602, adjusting a data path by using a data controller according to configuration information and instruction information, and controlling a data extractor to extract characteristic line data and convolution kernel data from characteristic map data of an image to be processed according to lines;
step 604, outputting the characteristic line data to a neural network computing unit by adopting a serial input and parallel output mode of a first shift register group;
step 606, performing multiplication and accumulation operations on the input characteristic line data and the corresponding convolution kernel line data by using the neural network computing unit, to complete the convolution operation of one convolution kernel with the characteristic map data and the accumulation of a plurality of convolution results within at least one period, thereby realizing circuit reconstruction and data multiplexing.
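As a rough behavioral model of steps 602 to 606 (software only, not the disclosed hardware), the following Python sketch mimics a serial-in, parallel-out shift register feeding one multiply-accumulate stage per kernel row, with the per-row partial sums then accumulated across rows as an adder tree would; the function names and the use of NumPy are our assumptions:

import numpy as np

def row_convolution(feature_row, kernel_row, stride=1):
    # One PE stage: stream the feature row through a k-tap shift register
    # and emit a partial sum each time a full window is resident.
    k = len(kernel_row)
    shift_reg = np.zeros(k)                  # serial-in, parallel-out register
    partial_sums = []
    for i, x in enumerate(feature_row):      # one feature element per cycle
        shift_reg[1:] = shift_reg[:-1]       # shift serially
        shift_reg[0] = x
        if i >= k - 1 and (i - (k - 1)) % stride == 0:
            # parallel read: every tap multiplies its kernel weight at once
            partial_sums.append(float(shift_reg @ kernel_row[::-1]))
    return partial_sums

def conv2d_by_rows(feature, kernel, stride=1):
    # Step 606: accumulate the per-row partial sums of all kernel rows.
    kh = kernel.shape[0]
    out = []
    for top in range(0, feature.shape[0] - kh + 1, stride):
        rows = [row_convolution(feature[top + r], kernel[r], stride)
                for r in range(kh)]          # one PE stage per kernel row
        out.append(np.sum(rows, axis=0))     # adder-tree style accumulation
    return np.array(out)

feature = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3))
expected = [[np.sum(feature[i:i + 3, j:j + 3]) for j in range(4)] for i in range(4)]
assert np.allclose(conv2d_by_rows(feature, kernel), expected)

The point of the row-wise decomposition is that each kernel row sees a continuous stream of feature-row data, so the same feature element is reused by every tap position without being refetched from the cache.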
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto; any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed herein shall be covered by the scope of the present application. Accordingly, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A neural network computing module for use in the convolution processing of a convolutional neural network, characterized by comprising a data controller, a data extractor, a first shift register group, a second shift register group and a neural network computing unit, wherein
the data controller adjusts a data path according to the configuration information and the instruction information, and controls the data extractor to extract characteristic line data and convolution kernel line data from the characteristic map data of the image to be processed according to lines;
the first shift register group outputs the characteristic line data to the neural network computing unit in a serial input and parallel output mode;
the second shift register group outputs the convolution kernel data to a current convolution operation multiplication and addition array and a next convolution operation multiplication and addition array of the neural network computing unit in a serial input mode, with one output selected according to the step size;
the neural network computing module further comprises a PE array, a plurality of first multiplexers and a plurality of second multiplexers, wherein the first multiplexers are coupled in parallel with the convolution operation multiplication and addition arrays in one-to-one correspondence, and the second multiplexers are coupled in series with the convolution operation multiplication and addition arrays in one-to-one correspondence; each first multiplexer acquires, through a data selection signal, the characteristic line data corresponding to its convolution operation multiplication and addition array and inputs the data in parallel to the corresponding stages of that array, and each second multiplexer acquires the convolution kernel line data corresponding to its convolution operation multiplication and addition array and inputs the data serially to the stages of that array to complete the convolution multiply-add operation, thereby realizing circuit reconstruction and data multiplexing.
2. The neural network computing module of claim 1, wherein the neural network computing unit comprises a multiplication and addition subunit and a partial sum buffer subunit, wherein
the multiplication and addition subunit multiplies the input characteristic line data by the corresponding convolution kernel line data, accumulates the products with the convolution row partial-sum data held in the partial sum buffer subunit, and, when the convolution operation of the convolution kernel line data with the characteristic line data corresponding to a convolution window is completed, accumulates the several row convolution results of that convolution window, thereby realizing one sliding-window convolution operation of a convolution kernel;
the convolution operation multiplication and addition arrays of each group of stages output their row operation results to an accumulator within a convolution row period, and the accumulator accumulates, through an addition tree, the row operation results output by the groups of stages corresponding to all rows of the current convolution kernel, thereby realizing the convolution operation of one convolution kernel.
3. The neural network computing module of claim 2, wherein the neural network computing unit comprises a plurality of neural network acceleration slices, each neural network acceleration slice comprising a plurality of convolution operation multiply-add arrays; each neural network acceleration slice performs the convolution operation of at least one input channel's feature map data with one convolution kernel's data, and the plurality of neural network acceleration slices perform the convolution operations of a plurality of input channels' feature map data with one convolution kernel's data.
4. The neural network computing module of claim 3, wherein the plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix; the first neural network operation matrices in the second neural network acceleration matrix complete the convolution operation of a plurality of input channels' feature data with one convolution kernel, and a plurality of second neural network acceleration matrices complete the parallel convolution operation of a plurality of input channels' feature data with a plurality of convolution kernels.
5. The neural network computing module of claim 4, wherein the plurality of groups of convolution operation multiply-add arrays in each neural network acceleration slice acquire characteristic line data by means of parallel input, and acquire convolution kernel data by means of serial input.
6. The neural network computing module of claim 1, wherein the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing units, to instruct the multiplexers to switch on or off so as to adjust the data path, and to input the feature data and the corresponding weight data into the corresponding neural network computing units according to the instruction information;
and the data extractor comprises a feature extractor and a convolution kernel extractor, the feature extractor is used for extracting feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor is used for extracting convolution kernel data from the convolution kernel cache unit according to the instruction information so as to transmit the convolution kernel data to the neural network computing module.
7. The neural network computing module of claim 6, wherein the feature data cache unit comprises a plurality of feature data cache groups, each feature data cache group buffers a portion of the feature data of one input channel and is coupled to at least one neural network acceleration slice, and the plurality of neural network acceleration slices share one convolution kernel cache unit, wherein
each neural network acceleration slice acquires the feature map data of one input channel from its corresponding feature data cache group, and the same convolution kernel data is distributed to the plurality of neural network acceleration slices.
8. The neural network computing module of claim 6, wherein the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit, and simultaneously instructs the multiplexers so that the first shift register group outputs the feature data to the neural network computing unit in a serial-in, parallel-out manner, and the second shift register group outputs the convolution kernel data to the current convolution operation multiplication and addition array and the next convolution operation multiplication and addition array of the neural network computing unit in a serial input mode, with one output selected according to the step size.
9. A neural network computing method, comprising:
adjusting a data path by adopting a data controller according to configuration information and instruction information, and controlling a data extractor to extract characteristic line data and convolution kernel line data, line by line, from the characteristic map data of an image to be processed;
outputting the characteristic line data to a neural network computing unit by adopting a first shift register group in a serial input and parallel output mode;
outputting, through a second shift register group, the convolution kernel data to a current convolution operation multiplication and addition array and a next convolution operation multiplication and addition array of the neural network computing unit in a serial input mode, with one output selected according to the step size;
adopting the neural network computing unit to correspondingly perform multiplication and accumulation operations on the input characteristic line data and the convolution kernel line data, wherein the characteristic line data are continuously sent in row order, in parallel, to the PE accelerators of all stages, and the convolution kernel line data are sent serially, in cyclic order, to the PE accelerators of all stages for the convolution operation, so as to complete the convolution operation of one convolution kernel with the characteristic map data and the accumulation of a plurality of convolution results within at least one period, and to compute the convolution results of single-channel input characteristic line data with a plurality of convolution kernels along the input characteristic map row direction; updating the convolution kernel data until the single characteristic line data has completed convolution operations with all convolution kernels; after the characteristic line data of a single input channel have completed convolution operations with all convolution kernels, updating the input characteristic map row until the multi-channel output characteristic map of one input characteristic map is output; and after the convolution results of the single-channel input characteristic map with all convolution kernels are completed, updating the characteristic map input channel, thereby realizing circuit reconstruction and data multiplexing (this loop ordering is sketched in code after the claims).
10. A communication device, comprising a central processing unit, a DDR SDRAM memory and the neural network computing module of any one of claims 1 to 8, which are communicatively connected, wherein the central processing unit is configured to control the neural network computing module to start a convolution operation, and the DDR SDRAM is configured to input the feature map data and the weight data to a data caching module of the neural network computing module.
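As a reading aid for the loop ordering recited in claim 9 (kernels are updated first, then the input feature map row, then the input channel), the following Python sketch is a behavioral model only; the function name, array layout and stride of 1 are our assumptions:

import numpy as np

def claim9_loop_order(feature_maps, kernels):
    # feature_maps: (C_in, H, W); kernels: (C_out, C_in, k, k); stride 1.
    c_in, h, w = feature_maps.shape
    c_out, _, k, _ = kernels.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for c in range(c_in):                 # outermost: update the input channel last
        for top in range(h - k + 1):      # next: update the feature map row
            window_rows = feature_maps[c, top:top + k]   # k buffered rows
            for n in range(c_out):        # innermost: every kernel reuses the rows
                for j in range(w - k + 1):
                    out[n, top, j] += np.sum(window_rows[:, j:j + k] * kernels[n, c])
    return out

x = np.random.rand(2, 6, 6)
kset = np.random.rand(3, 2, 3, 3)
ref = np.array([[[np.sum(x[:, i:i + 3, j:j + 3] * kset[n]) for j in range(4)]
                 for i in range(4)] for n in range(3)])
assert np.allclose(claim9_loop_order(x, kset), ref)

Keeping the kernel loop innermost means a set of buffered feature rows is reused against every convolution kernel before new rows are fetched, which is the data multiplexing the claim describes.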
CN202111071502.XA 2021-09-14 2021-09-14 Neural network computing module, method and communication device Active CN113537482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071502.XA CN113537482B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication device

Publications (2)

Publication Number Publication Date
CN113537482A (en) 2021-10-22
CN113537482B (en) 2021-12-28

Family

ID=78092470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111071502.XA Active CN113537482B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication device

Country Status (1)

Country Link
CN (1) CN113537482B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114527831A (en) * 2022-02-07 2022-05-24 Oppo广东移动通信有限公司 Chip, neural network processor and manufacturing method of chip
CN116050474A (en) * 2022-12-29 2023-05-02 上海天数智芯半导体有限公司 Convolution calculation method, SOC chip, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11164071B2 (en) * 2017-04-18 2021-11-02 Samsung Electronics Co., Ltd. Method and apparatus for reducing computational complexity of convolutional neural networks
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN111832718B (en) * 2020-06-24 2021-08-03 上海西井信息科技有限公司 Chip architecture
CN113312285B (en) * 2021-06-11 2023-08-18 西安微电子技术研究所 Convolutional neural network accelerator and working method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Parallel computing method for two-dimensional matrix convolution; Zhang Junyang et al.; Journal of Zhejiang University; 2018-03-31; Vol. 52, No. 3; pp. 515-522 *

Also Published As

Publication number Publication date
CN113537482A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113807509A (en) Neural network acceleration device, method and communication equipment
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN113537482B (en) Neural network computing module, method and communication device
CN109034373B (en) Parallel processor and processing method of convolutional neural network
CN113792868A (en) Neural network computing module, method and communication device
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108170640B (en) Neural network operation device and operation method using same
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN112395092B (en) Data processing method and artificial intelligent processor
CN111931909B (en) Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN111832705B (en) Compression method of convolutional neural network and realization circuit thereof
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN110414672B (en) Convolution operation method, device and system
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN111610963B (en) Chip structure and multiply-add calculation engine thereof
CN112395549A (en) Reconfigurable matrix multiplication accelerating system for matrix multiplication intensive algorithm
WO2023098256A1 (en) Neural network operation method and apparatus, chip, electronic device and storage medium
CN116882455A (en) Pointwise convolution computing device and method
CN111222090A (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN112712457B (en) Data processing method and artificial intelligence processor
CN214586992U (en) Neural network accelerating circuit, image processor and three-dimensional imaging electronic equipment
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
CN114004351A (en) Convolution neural network hardware acceleration platform
CN110197275B (en) Integrated circuit chip device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant