CN113792868A - Neural network computing module, method and communication device


Info

Publication number
CN113792868A
Authority
CN
China
Prior art keywords
data
neural network
convolution
convolution kernel
network computing
Prior art date
Legal status
Granted
Application number
CN202111071503.4A
Other languages
Chinese (zh)
Other versions
CN113792868B (en)
Inventor
王赟
张官兴
郭蔚
黄康莹
张铁亮
Current Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co Ltd
Original Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Ewa Intelligent Technology Co Ltd and Shaoxing Ewa Technology Co Ltd
Priority to CN202111071503.4A
Publication of CN113792868A
Application granted
Publication of CN113792868B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a neural network computing module, a neural network computing method, and a communication device, belonging to the field of data processing. The module comprises a data controller, a data extractor, and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information, and controls the data stream extracted by the data extractor to be loaded into the corresponding neural network computing unit according to the instruction information. The neural network computing unit completes at least the convolution of one convolution kernel with the feature data, and completes the accumulation of multiple convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing. The neural network computing unit comprises a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays; each acceleration slice completes at least the convolution of one input channel's feature map data with convolution kernel data, and the plurality of acceleration slices complete the convolution of the feature map data of multiple input channels with the convolution kernel data.

Description

Neural network computing module, method and communication device
Technical Field
The invention relates to the field of data processing, and in particular to a neural network computing module, a neural network computing method, and a communication device.
Background
A convolutional neural network is composed of an input layer, an arbitrary number of hidden layers as intermediate layers, and an output layer. The input layer has a plurality of input nodes (neurons); the output layer has output nodes (neurons) corresponding to the number of objects to be identified.
A convolution kernel is a small window applied within a hidden layer, in which weight parameters are stored. The kernel slides over the input image in sequence according to the step size and performs a multiply-add operation with the input feature of the corresponding region; that is, the weight parameters in the kernel are multiplied by the values of the corresponding input pixels and the products are summed. A traditional convolution accelerator must first use the img2col method to expand the input feature map data and the convolution kernel data into matrix form according to the kernel size and step size parameters, and then operate on the expanded matrices, so that convolution can be accelerated under the rules of matrix multiplication. However, after the feature data matrix is expanded, this method requires a larger on-chip cache and more frequent reads from off-chip main memory, and data that cannot be efficiently reused occupies the read-write bandwidth of off-chip memory access, increasing hardware power consumption. At the same time, an img2col-based acceleration scheme complicates the hardware logic circuits needed to support convolution kernels of different sizes and step sizes. During the operation of a convolutional network, each input channel must perform matrix convolution with multiple convolution kernels, so the feature map data must be fetched multiple times; and since all the feature map data of each channel is fully cached in the buffer, the data volume is huge. During matrix computation, the feature data size after matrix conversion far exceeds the original feature data size, which wastes on-chip storage resources and prevents the execution of large-data-volume operations.
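To make the cost concrete, the img2col expansion can be sketched in a few lines of Python (an illustrative sketch only; the function and variable names are ours, not from the patent). For a k × k kernel at stride 1, the expanded matrix holds roughly k·k copies of each feature element, which is exactly the on-chip caching and bandwidth overhead described above:

```python
import numpy as np

def img2col(feature, k, stride):
    """Expand an H x W feature map into a matrix whose rows are flattened
    k x k convolution windows, so convolution becomes matrix multiplication."""
    h, w = feature.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((out_h * out_w, k * k), dtype=feature.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature[i*stride:i*stride + k, j*stride:j*stride + k]
            cols[i * out_w + j] = window.ravel()
    return cols  # convolution result = cols @ kernel.ravel()

feature = np.arange(64, dtype=np.float32).reshape(8, 8)
cols = img2col(feature, k=3, stride=1)
print(feature.size, cols.size)  # 64 -> 324: roughly k*k-fold storage expansion
```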
Disclosure of Invention
Accordingly, to overcome the above-mentioned shortcomings of the prior art, the present invention provides a neural network computing module, method and communication device.
In order to achieve the above object, the present invention provides a neural network computing module comprising a data controller, a data extractor, and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information, and controls the data stream extracted by the data extractor to be loaded into the corresponding neural network computing unit according to the instruction information. The neural network computing unit completes at least the convolution of one convolution kernel with the feature line data corresponding to multiple input channels of the feature map data, and completes the accumulation of multiple convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing. The neural network computing unit comprises a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays; each acceleration slice completes at least the convolution of one input channel's feature map data with one convolution kernel's data, and the plurality of acceleration slices complete the convolution of the feature map data of multiple input channels with the convolution kernel data.
In one embodiment, the plurality of neural network acceleration slices form a first neural network operational matrix, and the plurality of first neural network operational matrices are coupled in parallel to form a second neural network acceleration matrix; the first neural network operation matrix in the second neural network acceleration matrix is used for completing convolution operation of a plurality of input channel characteristic data and a convolution kernel, and the plurality of second neural network acceleration matrices complete parallel convolution operation of the plurality of input channel characteristic data and the plurality of convolution kernels.
In one embodiment, each group of convolution operation multiplication and addition arrays obtains characteristic line data in a parallel input mode; and each group of convolution operation multiplication and addition arrays obtains convolution kernel data in a serial input mode.
In one embodiment, the neural network acceleration slice comprises a plurality of first multiplexers and a plurality of second multiplexers, the first multiplexers coupled in parallel with the convolution multiply-add arrays in one-to-one correspondence, and the second multiplexers coupled in series with the convolution multiply-add arrays in one-to-one correspondence. The first multiplexer obtains the feature line data corresponding to the convolution multiply-add array through a data selection signal and inputs it in parallel to each stage of the corresponding array, while the second multiplexer obtains the corresponding convolution kernel row data and inputs it serially to each stage of the array to complete the convolution multiply-add operation.
In one embodiment, the neural network computing module further includes a first shift register group and a second shift register group, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-input, parallel-output mode and outputs the feature line data to the multiply-add subunit through a first multiplexer. The second shift register group operates in serial-input mode with one output selected according to the step size, and outputs the convolution kernel data through a second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the input feature line data by the corresponding convolution kernel data and accumulates the products with the row partial-sum data held in the partial-sum buffer subunit; when the convolution of the kernel row data with the feature line data of the corresponding convolution window is complete, the partial sums of the window's several row convolution results are accumulated, realizing one sliding-window convolution of the kernel. The convolution multiply-add arrays of different stages in each group output their row results to the accumulator within one convolution-row period, and the accumulator sums, through an adder tree, the row results output by the corresponding stages across all rows of the current kernel, thereby realizing the convolution of one convolution kernel.
In one embodiment, the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit, and to instruct the multiplexers to switch on and off so as to adjust the data path, inputting the feature data and the corresponding weight data into the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information for delivery to the neural network computing module.
In one embodiment, the feature data buffer unit includes a plurality of feature data buffer groups; each buffer group buffers part of the feature data of one input channel and is coupled with at least one neural network acceleration slice. The plurality of acceleration slices share one convolution kernel data buffer unit: each slice acquires the feature map data of one input channel from its corresponding feature data buffer group, and the same convolution kernel data is distributed to the plurality of slices.
The invention also provides a neural network computing method comprising the following steps: a data controller adjusts the data path according to configuration information and instruction information, and controls the data stream extracted by the data extractor to be loaded into the corresponding neural network computing unit according to the instruction information; the neural network computing unit completes at least the convolution of one convolution kernel with the feature line data corresponding to multiple input channels of the feature map data, and completes the accumulation of multiple convolution results within at least one cycle. The neural network computing unit comprises a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays; each acceleration slice completes at least the convolution of one input channel's feature map data with convolution kernel data, and the plurality of slices complete the convolution of the feature map data of the multiple input channels with the convolution kernel data.
The invention also provides a communication device comprising a central processing unit (CPU), a DDR SDRAM memory, and the above neural network computing module, all communicatively connected; the CPU controls the neural network computing module to start the convolution operation, and the DDR SDRAM supplies feature map data and weight data to the neural network computing module.
Compared with the prior art, the invention has the following advantages: the data need not be expanded into matrix form, and the convolution and accumulation of the current data can be completed with only a single read of the feature line data and convolution kernel data from main memory. This reduces memory access bandwidth and storage space, improves the energy efficiency of data access, realizes efficient feature map data multiplexing, optimizes operation speed, and reduces the on-chip feature map caching requirement, enabling the computing core to operate without stalls.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram of a neural network computational module in one embodiment of the invention;
FIG. 2 is a schematic diagram of a neural network computing module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network computing module loading feature line data and convolution kernel line data in an embodiment of the present invention;
FIG. 4 is a diagram of a register group loaded with feature line data and convolution kernel line data in an embodiment of the invention;
FIG. 5 is a timing diagram of a convolution operation performed on a PU by a 3 × 3 convolution kernel with a step size of 1 according to an embodiment of the present invention;
FIG. 6 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a step size of 1 in another embodiment of the present invention;
FIG. 7 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a step size of 2 in another embodiment of the present invention;
FIG. 8 is a flowchart illustrating a neural network acceleration method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number and aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that aspects may be practiced without these specific details.
The embodiment of the application provides a communication device whose hardware architecture comprises a central processing unit (CPU), a DDR SDRAM memory, and a neural network computing module, all communicatively connected. The CPU controls the neural network computing module to start the convolution operation, and the DDR SDRAM supplies feature map data and weight data to the data cache module of the neural network computing module. In operation, the CPU instructs the neural network computing module to begin; the DDR SDRAM supplies the convolution data and convolution parameters; the neural network computing module then completes the convolution according to the obtained data and parameters, writes the result back to the memory address designated in the DDR SDRAM, and notifies the CPU that the convolution operation is complete.
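A rough host-side sketch of this sequence follows (illustrative only; the patent defines no driver API, so every function and register name here is a hypothetical placeholder):

```python
def run_convolution(npu, ddr, layer):
    """Hypothetical host-side control flow for one convolution layer:
    the CPU configures and starts the module, the DDR SDRAM supplies
    feature/weight data, and the result is written back to DDR."""
    npu.write_config(layer.params)                     # kernel size, stride, channels
    npu.set_feature_base(ddr.address(layer.features))  # feature map source in DDR
    npu.set_weight_base(ddr.address(layer.weights))    # weight data source in DDR
    npu.set_output_base(ddr.address(layer.output))     # write-back destination
    npu.start()                                        # CPU starts the convolution
    npu.wait_done()                                    # module notifies CPU when finished
```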
As shown in fig. 1, an embodiment of the present application provides a neural network computing module 100 including a data controller 10, a data extractor 20, and a neural network computing unit 30.
The data controller 10 adjusts the data path according to the configuration information and the instruction information, and controls the data stream extracted by the data extractor 20 to be loaded to the corresponding neural network computing unit 30 according to the instruction information.
The neural network computing unit 30 completes at least the convolution of one convolution kernel with the feature line data corresponding to the multiple input channels of the feature map data, and completes the accumulation of multiple convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing.
As shown in fig. 2, the neural network computing unit 30 includes a plurality of neural network acceleration slices 31. Each slice includes a plurality of convolution multiply-add arrays 311 and completes at least the convolution of one input channel's feature map data with convolution kernel data; the plurality of slices together complete the convolution of the feature map data of multiple input channels with the convolution kernel data.
The neural network acceleration slice (PU slice) 31 may be composed of a plurality of PE acceleration processing units (PE accelerators). Through system configuration, multiple PU slices can realize data-parallel computation along different dimensions. The NN acceleration circuit may include a plurality of PU slices, activation/pooling circuits, an accumulation unit, and so on. The PE accelerator is the most basic unit of neural network acceleration processing; each contains at least a multiplier, an adder, and a partial-sum result buffer, can complete at least one convolution of a weight parameter with input feature data, and can complete the accumulation of multiple convolution results within at least one cycle. In this application, a PU slice may contain PE accelerators arranged in an array. The feature data cache unit comprises a plurality of feature data cache groups; each cache group holds part of the feature data of one input channel and is coupled with at least one PU, i.e., each PU obtains the feature map data of one input channel from its corresponding feature data cache group. Meanwhile, multiple PUs share one convolution kernel data cache unit, i.e., the same convolution kernel row data is broadcast to multiple PUs, realizing parallel multi-channel-input, single-channel-output convolution.
To realize data multiplexing and reduce the read-write bandwidth pressure on main memory, each PU slice can compute, along the row direction of the input feature image, the convolution results of the current single-channel input feature line data (the data a convolution kernel acquires along the row direction of the image to be processed is the feature line data) with a plurality of convolution kernels. The convolution kernel data is then updated and the convolution restarted until the single feature line has completed convolution with all kernels. After the feature line data of the single input channel has been convolved with all kernels, the input feature map row is updated, i.e., the convolution window under each kernel moves down, and the above steps are repeated until the current input feature map has completed convolution with all kernels and its multi-channel output feature map is produced. After the single-channel input feature map has been convolved with all kernels, the feature map input channel is updated. This operation order can be flexibly configured according to the actual situation and the configuration or instruction information.
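The reuse schedule just described corresponds, in software terms, to a loop nest roughly like the following (a stride-1 sketch with hypothetical names; the patent allows this ordering to be reconfigured via the configuration or instruction information):

```python
import numpy as np

def row_reuse_conv(features, kernels):
    """Stride-1 sketch of the reuse order above: each feature row is read
    from main memory once and convolved with every kernel row that needs it
    before the data is updated."""
    c, h, w = features.shape
    n, _, k, _ = kernels.shape                  # n kernels of shape (c, k, k)
    out = np.zeros((n, h - k + 1, w - k + 1), dtype=np.float32)
    for ch in range(c):                         # update the input channel last
        for r in range(h):                      # then the feature row
            feat_row = features[ch, r]          # single read of this line
            for kn in range(n):                 # every kernel reuses the row
                for kr in range(k):             # kernel rows overlapping row r
                    o = r - kr                  # output row this pair feeds
                    if 0 <= o <= h - k:
                        for j in range(w - k + 1):   # row-direction sliding
                            out[kn, o, j] += feat_row[j:j+k] @ kernels[kn, ch, kr]
    return out

features = np.random.rand(2, 6, 6).astype(np.float32)
kernels = np.random.rand(4, 2, 3, 3).astype(np.float32)
print(row_reuse_conv(features, kernels).shape)  # (4, 4, 4)
```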
With this neural network computing module, the convolution and accumulation of the current data can be completed with only a single read of the feature line data and convolution kernel data from main memory, without expanding the data into matrix form. This reduces memory access bandwidth and storage space, improves the energy efficiency of data access, realizes efficient feature map data multiplexing, optimizes operation speed, and reduces the on-chip feature map caching requirement, enabling the computing core to operate without stalls.
In one embodiment, the plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix. The first operation matrices within a second acceleration matrix each complete the convolution of multiple input channels' feature data with one convolution kernel, and multiple second acceleration matrices complete the parallel convolution of multiple input channels' feature data with multiple convolution kernels. One acceleration slice can compute the accelerated convolution of one feature map with one convolution kernel. Multiple PU slices of the same second acceleration matrix can simultaneously compute the convolution of the same kernel with different input feature data, each PU slice sharing one convolution kernel cache unit and several slices sharing one feature data cache unit; multiple PU slices of different second acceleration matrices can simultaneously compute the convolution of several different kernels with the same input feature data, again with each slice sharing one convolution kernel cache unit and several slices sharing one feature data cache unit.
The plurality of PUs form a first sub-neural-network operation PU matrix, and a plurality of first sub-neural-network operation PU matrices form a second neural network PU matrix; each submatrix in the second neural network PU matrix completes the convolution of multiple input channels' feature data with one convolution kernel, and the multiple submatrices can complete the parallel convolution of the multiple input channels' feature data with multiple convolution kernels.
In one embodiment, each group of convolution multiply-add arrays obtains feature line data by parallel input, and obtains convolution kernel data by serial input. The PU comprises at least one group of PEs, each group responsible for the convolution of one kernel row with the corresponding feature map row data; multiple groups of PEs realize the convolution of multiple kernel rows with the corresponding feature map rows, i.e., each group of PEs forms a row, and multiple rows of PEs complete the convolution of at least one kernel row with the corresponding feature data. Each group of PEs obtains feature line data in parallel, i.e., each element of the feature line is broadcast simultaneously to every PE stage in the current group; meanwhile, each group obtains kernel row data serially, i.e., the kernel row elements flow from the first-stage PE to the next stage every clock cycle.
In one embodiment, as shown in fig. 3, the neural network acceleration slice 31 includes a plurality of first multiplexers 312 and a plurality of second multiplexers 313, the first multiplexers 312 are coupled in parallel with the convolution operation multiplication and addition arrays 311 in a one-to-one correspondence, and the second multiplexers 313 are coupled in series with the convolution operation multiplication and addition arrays in a one-to-one correspondence; the first multiplexer acquires characteristic line data corresponding to the convolution operation multiplication and addition array through a data selection signal and inputs the characteristic line data to the corresponding convolution operation multiplication and addition array at each stage in parallel, and the second multiplexer acquires convolution kernel line data corresponding to the convolution operation multiplication and addition array and inputs the convolution kernel line data to the convolution operation multiplication and addition array at each stage in series to complete convolution multiplication and addition operation.
The first multiplexers are coupled in parallel with their corresponding PE groups; each PE group can select at least one of two feature line data streams through a selection signal, and the sixth PE group in fig. 3 can select among the data of 6 different rows. The second multiplexers are coupled in series with their corresponding PE groups; each group can select at least one of two convolution kernel data streams through a selection signal, and the sixth PE group in fig. 3 can select among 6 different kernel data streams. The first multiplexer obtains the corresponding feature line data through a data selection signal (provided by the configuration information or the data-loading instruction information) and inputs it in parallel to each PE stage of the corresponding group, while the second multiplexer selects the corresponding kernel row data and inputs it serially to each PE stage to complete the convolution multiply-add operation.
The remaining idle PE groups can acquire, through the multiplexers, the feature line data and kernel row data used for convolution along the feature map column direction, realizing multiplexing of the input data. For example, for a 3 × 3 convolution kernel with step size 1, taking the array of fig. 3 as an example, the first three PE groups complete the parallel accelerated convolution of a kernel along the row direction (PE00, PE10, PE20 form one group; PE01, PE11, PE21 a second; PE02, PE12, PE22 a third). The fourth PE group can multiplex the feature line data of the second feature extractor (extractor 1) with the kernel row data of the first weight acquirer (acquirer 0); the fifth group multiplexes the feature line data of the third feature extractor (extractor 2) with the kernel row data of the second weight acquirer (acquirer 1); and the sixth group multiplexes the feature line data of the fourth feature extractor (extractor 3) with the kernel row data of the third weight acquirer (acquirer 2). These data pairs are then fed into the corresponding PE groups to complete the sliding-window convolution along the feature map column direction, improving feature map data reuse and reducing read-write bandwidth occupation while keeping the entire PE array fully loaded and operating efficiently.
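The row-reuse mapping in this 3 × 3, stride-1 example can be stated compactly (an illustrative sketch; group and extractor indices follow the figure, counting from zero):

```python
# For a k x k stride-1 kernel with 2k PE groups, groups k..2k-1 reuse the
# feature rows already fetched for the first k groups (shifted down one row),
# paired with kernel rows 0..k-1, to serve the next output row down the column.
k, num_groups = 3, 6
for g in range(num_groups):
    feature_row = g if g < k else g - k + 1   # groups 3..5 reuse rows 1..3
    kernel_row = g % k                        # kernel rows cycle 0,1,2
    output_row = g // k                       # which output row the group serves
    print(f"PE group {g}: feature extractor {feature_row}, "
          f"weight acquirer {kernel_row}, output row {output_row}")
```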
In one embodiment, the neural network computing module further comprises a first shift register group and a second shift register group, and the neural network computing unit comprises a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-input, parallel-output mode and outputs the feature line data to the multiply-add subunit through the first multiplexer. The second shift register group likewise takes serial input and, according to the step size signal, outputs the convolution kernel data through the second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the input feature line data by the corresponding kernel data and accumulates the products with the row partial-sum data in the partial-sum buffer subunit; when the convolution of the kernel row data with the feature line data of the corresponding convolution window is complete, the partial sums of the window's several row convolution results are accumulated, realizing one sliding-window convolution of the kernel. The convolution multiply-add arrays of different stages in each group output their row results to the accumulator within one convolution-row period, and the accumulator sums, through the adder tree, the row results output by the same-stage arrays across all rows of the current kernel, realizing one window convolution of a kernel. PEs of different stages compute in parallel several sliding convolutions of the current kernel row along the feature map row direction. During the convolution, the feature map rows are fed sequentially and in parallel into every stage of every PE group, and the corresponding kernel row data is loaded serially, cycling periodically, into every stage of every group.
As shown in fig. 4, the first shift register group 314 operates in serial-in, parallel-out mode and outputs the feature map line data in parallel to each PE stage through the multiplexer; the current convolution kernel element is fed into the multiply-add unit at the same time as it is fed into the first-stage shift register.
The second shift register group 315 likewise operates in serial-in, parallel-out mode and outputs the convolution kernel data through the multiplexer to the next-stage multiply-add unit.
The feature line data and convolution kernel row data fed into the multiply-add unit are multiplied element by element, and the products are accumulated with the row partial-sum data held in the partial-sum cache unit. Once one kernel row has completed its convolution with the feature line data of the corresponding convolution window, the partial-sum output of that row is summed with the convolution results of the kernel's other rows, realizing one sliding-window convolution of the kernel.
The feature line data (X00, …, X0n) are continuously fed into the PE group in parallel in row order, while the kernel row data (F00/F01/F02) are fed in serially in cyclic order, F00/F01/F02, F00/F01/F02, F00/F01/F02, for the convolution. Each PE stage outputs the partial sum corresponding to the current kernel row once per kernel-row period (the kernel row size is 3, so the kernel-row period is 3 cycles). PEs at different stages output the partial sums of the kernel row's sliding convolution over the feature map row at step size s. After each group's peer PEs output their partial sums within one convolution-row period, the partial sums output by the peer PEs across all rows of the current kernel are accumulated through the adder tree, realizing the convolution of one kernel and yielding the convolution computation shown in fig. 5.
In fig. 5, PE00 outputs, over three consecutive cycles, the partial sum of the convolution of the first kernel row with the corresponding feature line data; PE10 outputs, over three consecutive cycles, the partial sum of the first kernel row with the window adjacent to that computed by PE00; and PE20 outputs the partial sum for the window adjacent to that of PE10. Multiple PE groups realize the accelerated convolution of different kernel rows with their corresponding feature data rows, which is equivalent to the kernel sliding and traversing along the feature map row direction. The first shift register group, according to the step size parameter s selection signal, selects the output of the corresponding shift register as the input of the next PE stage, realizing the traversal convolution of the window along the row direction at step size s; the remaining PE groups, by multiplexing part of the input feature map data and kernel data, realize the window-sliding convolution acceleration along the feature map column direction according to the step size.
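A cycle-level sketch of this pipeline (one PE group of three stages, k = 3, stride 1) is given below. It is a simplified model under our own naming, not the patent's circuit: features are broadcast one element per cycle to all stages, while the kernel row F00/F01/F02 flows serially, reaching stage p one cycle later than stage p-1, so stage p emits the windows p, p+3, p+6, and so on:

```python
K = 3                                    # kernel row size = kernel-row period
X = [float(x) for x in range(10)]        # feature line, broadcast one per cycle
F = [1.0, 2.0, 3.0]                      # one convolution kernel row
acc = [0.0] * K                          # per-stage partial-sum buffers
windows = {p: [] for p in range(K)}      # row partial sums emitted per stage

for t, x in enumerate(X):                # cycle t: x reaches every stage at once
    for p in range(K):
        if t < p:                        # kernel element has not reached stage p
            continue
        acc[p] += x * F[(t - p) % K]     # serially flowing kernel element
        if (t - p) % K == K - 1:         # one kernel-row period completed
            windows[p].append(acc[p])    # stage p emits window p, p+K, ...
            acc[p] = 0.0

print(windows)  # stage 0 -> windows 0,3,6; stage 1 -> 1,4,7; stage 2 -> 2,5
```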
In one embodiment, the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, and at the same time instructs the multiplexers to switch on and off to adjust the data path, inputting the feature data and corresponding weight data into the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information for delivery to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups; each cache group caches part of the feature data of one input channel and is coupled with at least one neural network acceleration slice. The plurality of acceleration slices share one convolution kernel data cache unit: each slice acquires the feature map data of one input channel from its corresponding feature data cache group, and the same convolution kernel data is distributed to all the slices. In addition to connecting weight acquirer 0 to the 0th row of PE units, the data controller also broadcasts its data to the PE units of the 4th row; that is, the first-row weight parameters perform row convolution with feature data 1 and also with feature data 2 (feature data 2 being connected to the PE units of the 2nd row and, through the multiplexer, to those of the 4th row as well), so that the convolution kernel convolves adjacent rows on each pass. The data controller 10 thus uses multiple neural network acceleration slices to compute the convolution results of single-channel input feature data with a convolution kernel along the row direction of the image to be processed.
In one embodiment, the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; it instructs the multiplexer to output the feature data from the first shift register group to the neural network computing unit in serial-input, parallel-output mode, and to output the convolution kernel data from the second shift register group, in serial-input mode with one output selected per step size, to the current and next convolution multiply-add arrays of the neural network computing unit.
As shown in fig. 6, when the convolution kernel row size (e.g., 5) is larger than the number of PEs in each PE group (e.g., 3), the address accesses of the PE stages to the buffered feature line data through the feature line data extractor may conflict within a kernel-row operation period (each PE stage synchronously receives the same feature element input, and at the next kernel-row cycle the feature data address required by some PEs conflicts with the extractor's current buffer access address). However, the data those PEs need has already been accessed in a previous operation period and is buffered in the first shift register group. The data controller therefore obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, instructs the multiplexers to switch the data path on and off, inputs the feature data and weight data into the corresponding computing unit according to the instruction information, and, by selecting the register output data in the corresponding first shift register group, prevents conflicts between the access addresses of some PEs and the extractor's current access address. In this way the PE unit can obtain conflicting data under the data controller's multiplexer control, making the scheme applicable to convolutions with kernels of different sizes and avoiding PE stalls caused by address conflicts when access data is unaligned.
As shown in fig. 4, the feature extractor can load data into the PE array in row order according to the data address. Meanwhile, six feature extractors are arranged for the PE array; they may be FIFO units or use an addressing mode (addressing the cache units in sequence and loading data to the PEs). If two of them are empty, the currently empty feature extractors can be signal-gated to reduce power consumption; the weight acquirers behave similarly. For 5 × 5 or 7 × 7 convolution kernels, all feature extractors and weight acquirers can work fully loaded. The number of feature extractors can be configured flexibly, i.e., reduced or increased, with the PE array shrinking or growing correspondingly.
According to the configuration information of the data controller, a multi-way input multiplexer selects the input channel, realizing the convolution of a convolution kernel with the corresponding feature line data. For example, for a 3 × 3 convolution kernel, the 4th, 5th, and 6th PE columns multiplex the data output by the 2nd, 3rd, and 4th feature data extractors respectively. For a 5 × 5 convolution kernel, the 4th and 5th PE columns connect to the corresponding 4th and 5th feature data extractors and the 6th PE column connects to the 2nd feature data extractor. Because the kernel size is not aligned with the number of PE columns, the input path of the input multiplexer must be reconfigured after each row of convolution completes. For instance, in the first row period, the convolutions of 5 + 1 rows are completed at once (the 5 × 5 kernel convolves over the feature map for one row period: the kernel's rows convolve on the 1st to 5th PE columns, while the 6th PE column, shifted down one row, convolves only one row of feature data); in the second row period, the 1st to 4th feature extractors and the 2nd to 5th kernel weight rows perform convolution on the 1st to 4th PE columns, and the 5th and 6th PE columns work on the data of the 1st and 2nd feature extractors. Fig. 5 shows only some of the connections; others are not shown.
As shown in fig. 5, the weight parameters flow serially into each PE unit along the column direction (three PE units are arranged per column to accommodate a 3 × 3 convolution kernel), and the PE units of each row within the same column are linked through a multiplexer by a group of delay register sets; this delay circuit structure adjusts the sliding step of the convolution window. Each PE column receives the feature data of a specific row through the multiplexer, output in parallel to every row of that column. Because the weight parameters flow serially into each PE, in the initial state the feature data must be delayed by 1 and 2 cycles before reaching the PEs of the 2nd and 3rd rows respectively, and only then convolved with the corresponding weight parameters. A group of feature data delay circuits is also provided inside the PE unit, which first aligns the feature data with the weight parameters in the initial state; and when the convolution kernel exceeds 3 × 3, the corresponding feature data is multiplexed through the multi-way input selection circuit to prevent address conflicts during data reads.
The feature data extractor acquires at least one row of continuous feature line data from the feature cache circuit on the PU slice (since this embodiment uses a 3 × 3 convolution kernel, the feature extraction circuit requests four continuous rows of feature line data from the feature cache circuit at a time in order to make full use of the multiply-add resources of the PE array). Meanwhile, the weight acquirer obtains at least one row of weight parameters at a time (here the three rows of weight parameters are acquired directly and loaded into the weight acquirer). With four rows of feature line data and three rows of kernel row data as input, at least two column-direction convolution results are output per kernel-row cycle.
In this embodiment, the feature extractor and the weight acquirer form the data extractor, which may be a FIFO structure or an addressing circuit. Each data extractor extracts, row by row, the first row of weights (F00, F01, F02) and the first feature line data (X00, X01, X02, X03, X04, X05, X06, …), and extracts the feature data of the other rows after each row period.
In the initial state, in the first cycle F00 is fed into register 1 and into the multiplier of PE00, while feature data X00 is fed into PE00, PE10, and PE20. In PE00, X00 is multiplied by F00 and the result is sent to the row accumulation BUFFER; F00 reaches PE01 in the next cycle, and X00 is stored in feature register 1 of each PE. In the second cycle, X01 is sent to PE00, PE10, and PE20; X00 shifts into feature register 2, X01 enters feature register 1, F01 is fed into PE00, and the F00 in PE00 passes on to PE01. X01 and F01 are multiplied in PE00 and the product is sent to the adder, where it is accumulated with the X00 × F00 result buffered in the accumulation BUFFER in the previous cycle; meanwhile, X01 in PE01 is multiplied by F00 and the result sent to its accumulation buffer. In the third cycle, F02 enters PE00, the F01 in PE00 enters PE01, the F00 in PE01 enters PE02, and X02 is broadcast synchronously to each PE for multiply-add with the corresponding weight parameter. After these three cycles, the convolution of one row of weights with the corresponding feature data is realized; after four cycles, the kernel row has effectively slid once along the feature data row; and after eight cycles, the convolution results of six slides of the kernel row over the corresponding feature data row are obtained. Since the three rows of the 3 × 3 convolution kernel are processed in parallel across the columns, the results of six convolution sliding windows are produced within eight cycles; after the initialization state, the entire multiply-add unit runs fully loaded.
In another embodiment, as shown in fig. 6, for a 5 × 5 convolution kernel with a step size of 1, the three PE stages accelerate in parallel the kernel-row convolutions of three convolution windows along the feature map row direction. After PE00 completes the convolution of one kernel row, it starts the kernel-row convolution of the fourth convolution window (the second and third windows being handled likewise by PE10 and PE20). At this point the access address for PE00's feature data must restart from X03, but PE01 and PE02 need to access the X05 data, so the feature data extractor faces conflicting addressing between X03 and X05. To avoid the conflict, the feature data register selection circuit can be used to obtain the X03 data (the feature data X04, X03, X02, X01 having been buffered over the previous 4 clock cycles), i.e., X03 is read from register 2. When the register selection signals of the three PEs coincide, the address pointer of the feature data extractor is updated again, for example to X05 in the 8th cycle (each PE stage then needs the X05 data, and the addresses accessed in the following two cycles are the same); after the next kernel-row cycle (e.g., the 11th clock cycle), the feature data selection circuit is reconfigured. Through this circuit configuration, full-load computation with a 5 × 5 convolution kernel can be achieved flexibly.
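The register-selection idea can be modeled in a few lines (a simplified sketch with our own names and zero-based register indexing, not the patent's exact circuit): the first shift register group keeps the last few feature elements, so a PE whose next window starts at an already-seen element reads it from a register instead of contending with the feature extractor's address pointer:

```python
from collections import deque

class FeatureShiftRegs:
    """Minimal model of the first shift register group: serial input,
    parallel (selectable) output of recently streamed feature elements."""
    def __init__(self, depth=4):
        self.regs = deque(maxlen=depth)
    def shift_in(self, x):
        self.regs.appendleft(x)          # register 0 holds the newest element
    def select(self, i):
        return self.regs[i]              # parallel read of an older element

regs = FeatureShiftRegs()
for x in [0, 1, 2, 3, 4]:                # X00..X04 streamed from the extractor
    regs.shift_in(x)
# The extractor pointer can now advance to X05 for PE01/PE02, while PE00
# re-reads X03 from the register group instead of forcing an address conflict.
print(regs.select(1))                    # -> 3, i.e. X03
```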
In another embodiment, as shown in fig. 7, for a 5 × 5 convolution kernel with a step size of 2, each PE has loaded its data after 5 cycles. PE00, however, has already completed the operation of one convolution window's kernel row, and the row data address of its next convolution window is X06; at this moment the feature extractor's address pointer is occupied by PE10 and PE20 and points to the X05 data (which must be fetched from the feature data cache circuit). PE00 therefore idles for one period, and when the extractor's address pointer reaches X06 it performs the kernel-row convolution of the fourth convolution window, i.e., X06 × F00. Similarly, whenever the feature data to be loaded by one PE stage conflicts with the current access address of another PE's feature extractor, the PE unit with the conflicting address is set idle for the current operation period and acquires the corresponding feature data for multiplication in the next period according to the extractor's access address. As can be seen, some PE units are not fully loaded in certain periods and sit idle during partial address-conflict periods; nevertheless, for the 5 × 5 convolution with step size 2, the PE utilization is (3 × 6 − 3) / (3 × 6) × 100% ≈ 83%, i.e., (number of PE units × iteration period − idle slots per iteration period) / (number of PE units × iteration period). Here the iteration period runs from the first PE00 idle at cycle 6 to cycle 11 after initialization (PE00 idles once every 6 cycles), and the idle count is the total number of idle slots of the PE group within that period. The larger the convolution kernel size, the higher the overall PE utilization.
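Written out as a formula, with N_PE PE stages per group, an iteration period of T cycles, and N_idle idle slots per iteration, the utilization in this 5 × 5, step-size-2 embodiment works out as:

$$\text{utilization} = \frac{N_{\mathrm{PE}} \cdot T - N_{\mathrm{idle}}}{N_{\mathrm{PE}} \cdot T} \times 100\% = \frac{3 \times 6 - 3}{3 \times 6} \times 100\% \approx 83\%$$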
As shown in fig. 8, the present embodiment further provides a neural network acceleration method, including the following steps:
step 602, a data controller adjusts the data path according to configuration information and instruction information, and controls the data stream extracted by the data extractor to be loaded into the corresponding neural network computing unit according to the instruction information;
step 604, a neural network computing unit completes at least the convolution of one convolution kernel with the feature map data and completes the accumulation of multiple convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing, the neural network computing unit comprising a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays;
step 606, each neural network acceleration slice completes the convolution of at least one input channel's feature map data with one convolution kernel's data, and the plurality of acceleration slices complete the convolution of the feature map data of a plurality of input channels with the convolution kernel data.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A neural network computing module is characterized by comprising a data controller, a data extractor and a neural network computing unit,
the data controller adjusts a data path according to the configuration information and the instruction information, controls the data stream extracted by the data extractor to be loaded to the corresponding neural network computing unit according to the instruction information,
the neural network computing unit completes at least the convolution operation of one convolution kernel with the feature line data corresponding to a plurality of input channels of the feature map data, and completes the accumulation of a plurality of convolution results in at least one period, thereby realizing circuit reconstruction and data multiplexing,
the neural network computing unit comprises a plurality of neural network acceleration slices, each neural network acceleration slice comprises a plurality of convolution operation multiply-add arrays, each neural network acceleration slice completes at least the convolution operation of the feature map data of one input channel with the convolution kernel data, and the plurality of neural network acceleration slices complete the convolution operation of the feature map data of the plurality of input channels with the convolution kernel data.
2. The neural network computing module of claim 1, wherein the plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix; the first neural network operation matrix in the second neural network acceleration matrix is used for completing the convolution operation of a plurality of input channels' feature data with one convolution kernel, and the plurality of second neural network acceleration matrices complete the parallel convolution operation of the plurality of input channels' feature data with a plurality of convolution kernels.
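As a reading aid (not claim language), the matrix hierarchy of claim 2 can be sketched functionally: a first operation matrix sums per-channel convolutions for one kernel, and a second acceleration matrix runs several first matrices in parallel, one per kernel. SciPy's correlate2d stands in for the hardware multiply-add arrays; the function names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def first_matrix(feature_maps: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """First operation matrix: its slices cover all input channels of one kernel."""
    return sum(correlate2d(feature_maps[c], kernel[c], mode="valid")
               for c in range(feature_maps.shape[0]))

def second_matrix(feature_maps: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Second acceleration matrix: parallel first matrices, one output channel per kernel."""
    return np.stack([first_matrix(feature_maps, k) for k in kernels])

out = second_matrix(np.random.rand(4, 8, 8), np.random.rand(2, 4, 3, 3))
print(out.shape)  # (2, 6, 6): 2 kernels -> 2 output channels
```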
3. The neural network computing module of claim 1, wherein each group of convolution operation multiply-add arrays obtains the feature line data in a parallel input mode; and each group of convolution operation multiply-add arrays obtains the convolution kernel data in a serial input mode.
4. The neural network computing module of claim 1, wherein the neural network acceleration slice comprises a plurality of first multiplexers and a plurality of second multiplexers, the first multiplexers being coupled in parallel with the convolution operation multiply-add arrays in a one-to-one correspondence, and the second multiplexers being coupled in series with the convolution operation multiply-add arrays in a one-to-one correspondence; the first multiplexer acquires, through a data selection signal, the feature line data corresponding to its convolution operation multiply-add array and inputs the feature line data in parallel to each stage of the corresponding array, and the second multiplexer acquires the convolution kernel row data corresponding to its convolution operation multiply-add array and inputs the convolution kernel row data serially to each stage of the array to complete the convolution multiply-add operation.
5. The neural network computing module of claim 1, further comprising a first shift register set and a second shift register set, wherein the neural network computing unit comprises a multiply-add subunit and a partial sum buffer subunit,
the first shift register group adopts a serial-input, parallel-output mode and outputs the feature line data to the multiply-add subunit through a first multiplexer; the second shift register group adopts a serial-input mode with one output selected according to the step size, and outputs the convolution kernel data to the next convolution operation multiply-add array and to the multiply-add subunit through a second multiplexer,
the multiply-add subunit multiplies the input feature line data by the corresponding convolution kernel row data, accumulates the multiply-add result with the convolution-row partial sum data held in the partial sum buffer subunit, and, when the convolution operation of the convolution kernel row data with the feature line data corresponding to one convolution window is completed, has accumulated the partial sums of the plurality of row convolution results of that convolution window, thereby realizing one sliding-window convolution operation of a convolution kernel;
each group of convolution operation multiply-add arrays at its respective stage outputs a row operation result to the accumulator within one convolution row period, and the accumulator accumulates, through an adder tree, the row operation results output by the stages corresponding to all rows of the current convolution kernel, thereby realizing the convolution operation of one convolution kernel.
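A minimal sketch of the row-wise dataflow described in claim 5, assuming stride-1 sliding and treating the partial sum buffer as a running array; the names row_convolution and window_convolution are illustrative, not from the patent.

```python
import numpy as np

def row_convolution(feature_row: np.ndarray, kernel_row: np.ndarray, stride: int = 1) -> np.ndarray:
    """One multiply-add array stage: slide one kernel row along one feature row."""
    kw = len(kernel_row)
    n_out = (len(feature_row) - kw) // stride + 1
    return np.array([feature_row[i*stride:i*stride+kw] @ kernel_row for i in range(n_out)])

def window_convolution(feature_rows: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Partial-sum accumulation: each stage's row result is added to the running
    partial sum, playing the role of the accumulator/adder tree over all kernel rows."""
    partial = np.zeros((feature_rows.shape[1] - kernel.shape[1]) // stride + 1)
    for r in range(kernel.shape[0]):
        partial += row_convolution(feature_rows[r], kernel[r], stride)
    return partial  # one output row of the sliding-window convolution

row = window_convolution(np.arange(25.0).reshape(5, 5), np.ones((3, 3)))
print(row)  # each entry is the sum over one 3x3 window of the top three rows
```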
6. The neural network computing module of claim 1, wherein the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing units, to instruct the multiplexers to switch on or off so as to adjust the data path, and to input the feature data and the corresponding weight data into the corresponding neural network computing units according to the instruction information;
and the data extractor comprises a feature extractor and a convolution kernel extractor, the feature extractor is used for extracting feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor is used for extracting convolution kernel data from the convolution kernel cache unit according to the instruction information so as to transmit the convolution kernel data to the neural network computing module.
7. The neural network computing module of claim 6, wherein the feature data cache unit comprises a plurality of feature data cache sets, each feature data cache set caches a portion of the feature data of one input channel and is coupled to at least one neural network acceleration slice, and the plurality of neural network acceleration slices share one convolution kernel data cache unit,
each neural network acceleration slice acquires the feature map data of one input channel from the corresponding feature data cache set, and the same convolution kernel data is distributed to the plurality of neural network acceleration slices.
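An illustrative sketch of the buffering scheme in claims 6-7, assuming one feature-data cache set per input channel and a single kernel cache broadcast to every slice; correlate2d again stands in for an acceleration slice, and all names are hypothetical.

```python
import numpy as np
from scipy.signal import correlate2d

num_channels, H, W, K = 4, 8, 8, 3
# One feature-data cache set per input channel (claim 7).
feature_cache_sets = [np.random.rand(H, W) for _ in range(num_channels)]
# A single convolution-kernel cache shared by all acceleration slices.
shared_kernel = np.random.rand(num_channels, K, K)

# Each slice reads only its own channel's cache set, while the same kernel
# data is distributed to all slices; per-slice results are accumulated.
result = sum(correlate2d(feature_cache_sets[c], shared_kernel[c], mode="valid")
             for c in range(num_channels))
print(result.shape)  # (6, 6)
```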
8. A neural network computing method, comprising:
adopting a data controller to adjust the data path according to configuration information and instruction information, and controlling the data stream extracted by a data extractor to be loaded to the corresponding neural network computing unit according to the instruction information,
adopting a neural network computing unit to complete at least the convolution operation of one convolution kernel with the feature line data corresponding to a plurality of input channels of the feature map data, and to complete the accumulation of a plurality of convolution results in at least one period, so as to realize circuit reconstruction and data multiplexing, wherein the neural network computing unit comprises a plurality of neural network acceleration slices, and each neural network acceleration slice comprises a plurality of convolution operation multiply-add arrays,
and adopting the plurality of neural network acceleration slices to complete the convolution operation of the feature map data of the plurality of input channels with the convolution kernel data.
9. A communication device, comprising a central processing unit CPU, a memory DDR SDRAM and the neural network computing module of any one of claims 1 to 7, which are communicatively connected, wherein the CPU is configured to control the neural network computing module to start the convolution operation, and the DDR SDRAM is configured to input feature map data and weight data to the neural network computing module.
CN202111071503.4A 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment Active CN113792868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071503.4A CN113792868B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment

Publications (2)

Publication Number Publication Date
CN113792868A true CN113792868A (en) 2021-12-14
CN113792868B CN113792868B (en) 2024-03-29

Family

ID=79183246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111071503.4A Active CN113792868B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment

Country Status (1)

Country Link
CN (1) CN113792868B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160133924A (en) * 2015-05-14 2016-11-23 한국전자통신연구원 Apparatus and method for convolution operation
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
WO2020238843A1 (en) * 2019-05-24 2020-12-03 华为技术有限公司 Neural network computing device and method, and computing device
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network
WO2021072732A1 (en) * 2019-10-18 2021-04-22 北京希姆计算科技有限公司 Matrix computing circuit, apparatus and method
WO2021088563A1 (en) * 2019-11-04 2021-05-14 北京希姆计算科技有限公司 Convolution operation circuit, apparatus and method
CN111832718A (en) * 2020-06-24 2020-10-27 上海西井信息科技有限公司 Chip architecture
CN113222130A (en) * 2021-04-09 2021-08-06 广东工业大学 Reconfigurable convolution neural network accelerator based on FPGA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Junyang et al., "Design and Implementation of Two-Dimensional Matrix Convolution on a Vector Processor", Journal of National University of Defense Technology, vol. 40, no. 3, pages 69-75 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115481721A (en) * 2022-09-02 2022-12-16 浙江大学 Novel Psum calculation circuit for convolutional neural network
CN116861973A (en) * 2023-09-05 2023-10-10 深圳比特微电子科技有限公司 Improved circuits, chips, devices and methods for convolution operations
CN116861973B (en) * 2023-09-05 2023-12-15 深圳比特微电子科技有限公司 Improved circuits, chips, devices and methods for convolution operations

Also Published As

Publication number Publication date
CN113792868B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN113807509B (en) Neural network acceleration device, method and communication equipment
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN109034373B (en) Parallel processor and processing method of convolutional neural network
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN113792868A (en) Neural network computing module, method and communication device
CN108170640B (en) Neural network operation device and operation method using same
WO2022007266A1 (en) Method and apparatus for accelerating convolutional neural network
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN112395092B (en) Data processing method and artificial intelligent processor
CN110580519B (en) Convolution operation device and method thereof
CN113537482B (en) Neural network computing module, method and communication device
WO2023098256A1 (en) Neural network operation method and apparatus, chip, electronic device and storage medium
CN111931909B (en) Lightweight convolutional neural network reconfigurable deployment method based on FPGA
JP7332722B2 (en) Data processing method, device, storage medium and electronic equipment
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN111582467A (en) Artificial intelligence accelerator and electronic equipment
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN110414672B (en) Convolution operation method, device and system
CN116050492A (en) Expansion unit
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
CN112766453A (en) Data processing device and data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant