CN113792868B - Neural network computing module, method and communication equipment

Neural network computing module, method and communication equipment

Info

Publication number
CN113792868B
CN113792868B
Authority
CN
China
Prior art keywords
data
convolution
neural network
convolution kernel
input
Prior art date
Legal status
Active
Application number
CN202111071503.4A
Other languages
Chinese (zh)
Other versions
CN113792868A (en)
Inventor
王赟
张官兴
郭蔚
黄康莹
张铁亮
Current Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co ltd
Original Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Ewa Intelligent Technology Co ltd, Shaoxing Ewa Technology Co ltd
Priority to CN202111071503.4A
Publication of CN113792868A
Application granted
Publication of CN113792868B
Legal status: Active (Current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention provides a neural network computing module, method and communication equipment in the field of data processing. The module comprises a data controller, a data extractor and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information and, according to the instruction information, loads the data stream extracted by the data extractor into the corresponding neural network computing unit. The neural network computing unit completes at least the convolution of a convolution kernel with feature line data and accumulates a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing. The computing unit comprises a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays; each slice completes at least the convolution of the feature map data of one input channel with the convolution kernel data, and the plurality of slices together complete the convolution of the feature map data of a plurality of input channels with the convolution kernel data.

Description

Neural network computing module, method and communication equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a neural network computing module, a neural network computing method, and a communication device.
Background
A convolutional neural network is composed of an input layer, an arbitrary number of hidden layers serving as intermediate layers, and an output layer. The input layer has a plurality of input nodes (neurons). The output layer has as many output nodes (neurons) as there are objects to identify.
The convolution kernel is a small window, applied in a hidden layer, in which the weight parameters are stored. The kernel slides over the input image by the step size and performs a multiply-accumulate with the input features of the corresponding region: each weight parameter in the kernel is first multiplied by the corresponding input value, and the products are then summed. A traditional convolution acceleration device must use the im2col method to expand the input feature map data and convolution kernel data into matrix form according to the kernel size and step size, and then operate on the expanded matrices, so that the convolution can be accelerated under matrix multiplication rules. This approach has several drawbacks. After the feature data matrix is expanded, a larger on-chip cache is required, and more off-chip main-memory reads are needed for data that cannot be efficiently reused, occupying off-chip read-write bandwidth and increasing hardware power consumption. An im2col-based acceleration scheme is also ill-suited to a hardware logic circuit that handles convolution kernels and step sizes of different sizes. Moreover, during convolutional network operation every input channel must undergo convolution matrix operations with a plurality of kernels, so the feature map data must be fetched many times; caching all the feature map data of every channel in a buffer entails a huge data volume, and after matrix conversion the size of the feature data far exceeds that of the original feature data, wasting on-chip storage resources and making large-data-volume computation impossible.
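For concreteness, the memory blowup can be seen in a minimal Python sketch of the im2col expansion (our own illustration with toy sizes, not taken from the patent): every input element is duplicated up to k×k times in the expanded matrix.

    import numpy as np

    def im2col(x, k, s):
        """Expand a single-channel H x W map into a matrix whose rows are
        the flattened k x k patches visited by a step-size-s convolution."""
        H, W = x.shape
        out_h, out_w = (H - k) // s + 1, (W - k) // s + 1
        cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
        r = 0
        for i in range(0, H - k + 1, s):
            for j in range(0, W - k + 1, s):
                cols[r] = x[i:i + k, j:j + k].ravel()  # element copied per window
                r += 1
        return cols

    x = np.arange(36, dtype=np.float32).reshape(6, 6)
    print(x.size, "->", im2col(x, k=3, s=1).size)  # 36 -> 144: a 4x expansion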
Disclosure of Invention
Accordingly, to overcome the above-described shortcomings of the prior art, the present invention provides a neural network computing module, method and communication device.
In order to achieve the above object, the present invention provides a neural network computing module comprising a data controller, a data extractor and a neural network computing unit. The data controller adjusts the data path according to configuration information and instruction information and, according to the instruction information, loads the data stream extracted by the data extractor into the corresponding neural network computing unit. The neural network computing unit completes at least the convolution of one convolution kernel with the feature line data corresponding to a plurality of input channels of the feature map data, and accumulates a plurality of convolution results within at least one cycle, thereby achieving circuit reconfiguration and data multiplexing. The computing unit comprises a plurality of neural network acceleration slices, each containing a plurality of convolution multiply-add arrays; each slice completes at least the convolution of the feature map data of one input channel with the convolution kernel data, and the plurality of slices together complete the convolution of the feature map data of a plurality of input channels with the convolution kernel data.
In one embodiment, a plurality of the neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix. Each first operation matrix within a second acceleration matrix completes the convolution of a plurality of input channels' feature data with one convolution kernel, and a plurality of second acceleration matrices complete, in parallel, the convolutions of a plurality of input channels' feature data with a plurality of convolution kernels.
In one embodiment, each group of convolution multiply-add arrays acquires feature line data through parallel input, and acquires convolution kernel data through serial input.
In one embodiment, the neural network acceleration slice includes a plurality of first multiplexers and a plurality of second multiplexers; each first multiplexer is coupled in parallel, one-to-one, with a convolution multiply-add array, and each second multiplexer is coupled in series, one-to-one, with a convolution multiply-add array. The first multiplexer obtains, via a data selection signal, the feature line data for its convolution multiply-add array and inputs it in parallel to every stage of that array, while the second multiplexer obtains the convolution kernel data for its array and inputs it serially to every stage, completing the convolution multiply-add operation.
In one embodiment, the neural network computing module further includes a first shift register group and a second shift register group, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-input, parallel-output mode and delivers the feature line data to the multiply-add subunit through a first multiplexer. The second shift register group operates in serial-input mode with one output selected according to the step size, and delivers the convolution kernel data, through a second multiplexer, to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the incoming feature line data by the corresponding kernel data and accumulates the products with the row partial sums held in the partial-sum buffer subunit; when the convolution of one kernel row with the feature line data of the corresponding convolution window is complete, the partial sums of that window's several row-convolution results are accumulated, realizing one sliding-window convolution of the kernel. Within one convolution-row period, the convolution multiply-add arrays at the different stages of every group output their row results to an accumulator, which sums, through an adder tree, the row results output by the same-stage arrays of all groups corresponding to all rows of the current kernel, thereby realizing the convolution of one kernel.
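A minimal Python sketch of the two-level reduction just described (our own reading of this embodiment, not the patent's circuit): one PE group reduces a kernel row against a feature-row window, and an adder tree then sums the per-row partial sums into the result of one sliding window.

    def pe_row_partial_sum(feature_row, kernel_row, col, stride=1):
        """One PE group: dot product of one kernel row with the feature-row
        window starting at column col * stride (one convolution-row period)."""
        start = col * stride
        return sum(f * w for f, w in
                   zip(feature_row[start:start + len(kernel_row)], kernel_row))

    def window_conv(feature_rows, kernel_rows, col, stride=1):
        """Adder tree: accumulate the row partial sums over all kernel rows."""
        return sum(pe_row_partial_sum(fr, kr, col, stride)
                   for fr, kr in zip(feature_rows, kernel_rows))

    rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
    kernel = [[1, 0, -1]] * 3
    print(window_conv(rows, kernel, col=0))  # -6: one 3x3 sliding-window result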
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, and simultaneously instructs the multiplexers to switch the data path so that the feature data and corresponding weight data are input to the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information and delivers it to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups. Each cache group holds part of the feature data of one input channel and is coupled with at least one neural network acceleration slice, while the slices share one convolution kernel data cache unit: each slice obtains the feature map data of one input channel from its corresponding cache group, and the same convolution kernel data is distributed to the plurality of slices.
The invention also provides a neural network computation method comprising the following steps: a data controller adjusts the data path according to configuration information and instruction information, and the data stream extracted by a data extractor is loaded, according to the instruction information, into the corresponding neural network computing unit; the neural network computing unit completes at least the convolution of the feature line data corresponding to a plurality of input channels of the feature map data and accumulates a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing.
The invention also provides communication equipment comprising a CPU (central processing unit), a DDR SDRAM (double-data-rate SDRAM) and the above neural network computing module, all communicatively connected; the CPU controls the neural network computing module to start the convolution operation, and the DDR SDRAM supplies feature map data and weight data to the neural network computing module.
Compared with the prior art, the invention has the following advantages: the data need not be expanded into matrix form, and the convolution and accumulation of the current data are completed with only one read of the feature line data and convolution kernel data from main memory. This reduces memory-access bandwidth and storage space, improves the energy efficiency of data access, realizes efficient reuse of feature map data, optimizes operation speed, and lowers the on-chip feature map caching requirement, so that the compute cores run without stalls.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings described below cover only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a block diagram of a neural network computation module in one embodiment of the invention;
FIG. 2 is a schematic diagram of the neural network computation module in one embodiment of the invention;
FIG. 3 is a schematic diagram of a neural network computation module loading feature line data and convolution kernel line data in one embodiment of the present invention;
FIG. 4 is a schematic diagram of a register set loading feature line data and convolution kernel line data in one embodiment of the present invention;
FIG. 5 is a timing diagram of a 3×3 convolution kernel running a convolution operation on a PU with a step size of 1 in one embodiment of the invention;
FIG. 6 is a timing diagram of a 5×5 convolution kernel running a convolution operation on a PU with a step size of 1 in another embodiment of the present invention;
FIG. 7 is a timing diagram of a 5×5 convolution kernel running a convolution operation on a PU with a step size of 2 in another embodiment of the present invention;
FIG. 8 is a flowchart of a neural network acceleration method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art will appreciate that an aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such an apparatus may be implemented and/or such a method practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of illustration, and only the components related to the application are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that aspects may be practiced without these specific details.
The embodiment of the application provides a communication device whose hardware architecture comprises a central processing unit (CPU), a DDR SDRAM memory and a neural network computing module, all communicatively connected. The DDR SDRAM supplies the feature map data and weight data to the data cache module of the neural network computing module. The CPU controls the neural network computing module to start the convolution operation, and the DDR SDRAM inputs a number of convolution data and convolution parameters to the module; the neural network computing module then completes the convolution operation on the acquired data and parameters, writes the operation result back to a memory address agreed with the DDR SDRAM, and notifies the CPU that the convolution operation is complete.
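A host-side Python sketch of this handshake (all class names and addresses are hypothetical stand-ins, and a dot product stands in for the convolution itself): the CPU starts the module, the module reads features and weights from DDR, writes the result to the agreed address, and flags completion.

    class FakeDDR:
        def __init__(self): self.mem = {}
        def write(self, addr, data): self.mem[addr] = data
        def read(self, addr): return self.mem[addr]

    class FakeNPU:
        def __init__(self, ddr): self.ddr, self.done = ddr, False
        def start(self, in_addr, w_addr, out_addr):
            x, w = self.ddr.read(in_addr), self.ddr.read(w_addr)
            y = sum(a * b for a, b in zip(x, w))  # stand-in for the convolution
            self.ddr.write(out_addr, y)           # write-back to the agreed address
            self.done = True                      # completion notice for the CPU

    ddr = FakeDDR()
    ddr.write(0x100, [1.0, 2.0, 3.0])             # feature data
    ddr.write(0x200, [0.1, 0.2, 0.3])             # weight data
    npu = FakeNPU(ddr)
    npu.start(0x100, 0x200, 0x300)                # CPU kicks off the operation
    assert npu.done and abs(ddr.read(0x300) - 1.4) < 1e-9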
As shown in fig. 1, the embodiment of the present application provides a neural network computing module 100 including a data controller 10, a data extractor 20, and a neural network computing unit 30.
The data controller 10 adjusts the data path according to the configuration information and the instruction information, and controls the data stream extracted by the data extractor 20 to be loaded to the corresponding neural network computing unit 30 according to the instruction information.
The neural network computing unit 30 completes at least the convolution of one convolution kernel with the feature line data corresponding to a plurality of input channels of the feature map data, and accumulates a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing.
As shown in fig. 2, the neural network computing unit 30 includes a plurality of neural network acceleration slices 31, each of which includes a plurality of convolution multiply-add arrays 311. Each slice performs at least the convolution of one input channel's feature map data with one convolution kernel's data, and the plurality of slices together perform the convolution of a plurality of input channels' feature map data with one convolution kernel's data.
The neural network acceleration slice (PU slice) 31 may be composed of a plurality of PE acceleration processing units (PE accelerators). Through system configuration, multiple PU slices can realize data-parallel computation along different dimensions. The NN acceleration circuit may include a plurality of PU slices, an activation/pooling circuit, an accumulation unit and the like. The PE accelerator is the most basic unit of neural network acceleration: each unit contains at least a multiplier, an adder and a partial-sum result buffer, can complete at least one convolution of a weight parameter with input feature data, and can accumulate a plurality of convolution results within at least one cycle. In this application, a PU slice may contain PE accelerators arranged in an array. The feature data cache unit comprises a plurality of feature data cache groups; each group caches part of the feature data of one input channel and is coupled with at least one PU, i.e., each PU obtains the feature map data of one input channel from its corresponding cache group. Meanwhile, a plurality of PUs share one convolution kernel data cache unit, i.e., the data of the same convolution kernel is broadcast to the plurality of PUs, realizing parallel convolution with multi-channel input and single-channel output.
In order to realize data multiplexing and reduce the pressure on main-memory read-write bandwidth, each PU slice may first compute, along the row direction of the input feature map, the convolution results of the feature line data currently input on a single channel (the data gathered by the convolution kernel along the row direction of the image under processing constitute the feature line data) with a convolution kernel; it then updates the convolution kernel data and restarts the convolution until the single feature line has been convolved with all kernels. After the feature line data of the single input channel has been convolved with all convolution kernels, the input feature line is updated, i.e., the convolution kernel moves down, and the above steps repeat until the current input feature line has been convolved with all kernels, outputting one multi-channel output feature line for that input line. After the single-channel input feature map has been convolved with all kernels, the feature map input channel is updated. The operation order can be flexibly configured by configuration information or instruction information according to actual conditions; the loop sketch below illustrates this fetch order.
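A schedule sketch in Python (our own summary of the order just described, with hypothetical names): a feature line is read from main memory once and reused against every kernel before the next line is fetched, and the input channel is updated last of all.

    def reuse_schedule(num_channels, num_lines, num_kernels):
        """Yield the compute order: a line is fetched once, reused for all kernels."""
        for c in range(num_channels):           # 3) update the input channel last
            for line in range(num_lines):       # 2) then move to the next feature line
                for k in range(num_kernels):    # 1) reuse the line for every kernel
                    yield ("conv", c, line, k)  # one row-direction convolution

    # Each (channel, line) pair appears num_kernels times consecutively,
    # i.e. a single main-memory read of the line serves every kernel:
    for step in reuse_schedule(num_channels=2, num_lines=3, num_kernels=2):
        print(step)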
With the above neural network computing module, no matrix-form expansion of the data is needed, and the convolution and accumulation of the current data are completed with only one read of the feature line data and convolution kernel data from main memory, reducing memory-access bandwidth and storage space, improving the energy efficiency of data access, realizing efficient feature map data reuse, optimizing operation speed and lowering the on-chip feature map caching requirement, so that the compute cores run without stalls.
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first operation matrices are coupled in parallel to form a second neural network acceleration matrix. Each first operation matrix within a second acceleration matrix completes the convolution of a plurality of input channels' feature data with one convolution kernel, and a plurality of second acceleration matrices complete, in parallel, the convolutions of a plurality of input channels' feature data with a plurality of kernels. One acceleration slice can compute the accelerated convolution of one piece of feature map data with one convolution kernel. The PU slices of the same second acceleration matrix can simultaneously compute the convolution results of the same kernel with different input feature data, the slices sharing one convolution kernel cache unit and one feature data cache unit. The PU slices of different second acceleration matrices can simultaneously compute the convolution results of several different kernels with the same input feature data, each slice sharing a convolution kernel cache unit and the slices sharing a feature data cache unit.
The plurality of PUs form a first sub-neural-network operation PU matrix, and a plurality of first sub-matrices form a second neural network PU matrix. Each sub-matrix within the second PU matrix completes the convolution of a plurality of input channels' feature data with one convolution kernel, and the several sub-matrices can complete, in parallel, the convolutions of the input channels' feature data with a plurality of kernels.
In one embodiment, each group of convolution multiply-add arrays acquires feature line data through parallel input and convolution kernel data through serial input. A PU contains at least one group of PEs, each group being responsible for the convolution of one convolution kernel row with the corresponding feature map row; several groups of PEs realize the convolution of several kernel rows with the corresponding feature rows, i.e., each group of PEs forms a column, and several columns of PEs complete the convolution of at least one kernel row with the corresponding feature data. Each group of PEs acquires feature line data by parallel input, i.e., each element of the feature line data is broadcast simultaneously to every PE stage of the current group; meanwhile each group acquires kernel row data by serial input, i.e., each clock cycle a kernel-row element flows from the first PE of the group to the next.
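A cycle-level model of this feed pattern, in Python (our own simplification for illustration, not the patent's circuit): on every tick the kernel element shifts one stage down the serial chain while the same feature element reaches all stages in parallel.

    class PE:
        """One multiply-add stage: holds its current kernel element and a
        partial-sum buffer (the multiplier, adder and buffer described above)."""
        def __init__(self):
            self.weight = 0.0
            self.acc = 0.0

    def clock_tick(pes, feature_elem, kernel_elem_in):
        """One clock cycle of a PE group."""
        incoming = kernel_elem_in
        for pe in pes:
            incoming, pe.weight = pe.weight, incoming  # serial kernel shift
            pe.acc += pe.weight * feature_elem         # parallel feature broadcast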
In one embodiment, as shown in fig. 3, the neural network acceleration slice 31 includes a plurality of first multiplexers 312 and a plurality of second multiplexers 313; the first multiplexers 312 are coupled in parallel, one-to-one, with the convolution multiply-add arrays 311, and the second multiplexers 313 are coupled in series, one-to-one, with the arrays. The first multiplexer obtains, via the data selection signal, the feature line data for its array and inputs it in parallel to the array's stages, while the second multiplexer obtains the convolution kernel data for its array and inputs it serially to every stage, completing the convolution multiply-add operation.
The first multiplexers are respectively coupled in parallel with their corresponding PE groups; through the selection signal, each PE group can select at least one of two feature rows, and the six PE groups of FIG. 3 can select six different rows of data. The second multiplexers are serially coupled to their corresponding PE groups; each group can select at least one of two convolution kernel rows through the selection signal, and the six groups of FIG. 3 can select six different kernel rows. The first multiplexer obtains the corresponding feature line data via the data selection signal (provided by configuration information or data-loading instruction information) and inputs it in parallel to every PE stage of the corresponding group, while the second multiplexer selects the corresponding convolution kernel data and inputs it serially to every PE stage to complete the convolution multiply-add operation.
The remaining idle PE groups can obtain, through multiplexers, the feature row data and convolution kernel data used for convolution along the feature-map column direction, realizing reuse of the input data (see the assignment sketch below). For example, for a 3×3 convolution kernel with step size 1, taking the matrix of fig. 3 as an example, the first three PE columns complete the parallel accelerated convolution along the kernel row direction (PE00, PE10, PE20 form one group; PE01, PE11, PE21 a second; PE02, PE12, PE22 a third). The fourth PE group can then reuse the feature row data of the second feature extractor 1 together with the kernel row data of the first weight acquirer 0, the fifth group can reuse the feature rows of the third feature extractor 2 with the kernel rows of the second weight acquirer 1, and the sixth group can reuse the feature rows of the fourth feature extractor 3 with the kernel rows of the third weight acquirer 2. Each data pair is then sent into its PE group to complete one sliding convolution of the window along the feature-map column direction, improving feature map data reuse, reducing occupation of main-memory read-write bandwidth, and keeping the whole PE array in an efficient running state.
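The group-to-data assignment just described can be tabulated in a few lines of Python (our reading of FIG. 3, not output of the patent's controller): the three spare groups reuse feature rows 1-3 with the same kernel rows, sliding the window one row down the column at no extra fetch cost.

    K, GROUPS = 3, 6                  # kernel rows, PE groups as in fig. 3
    for g in range(GROUPS):
        kernel_row = g % K            # groups cycle through the kernel rows
        feature_row = g % K + g // K  # spare groups shift down one feature row
        print(f"group {g}: kernel row {kernel_row} x feature row {feature_row}")
    # groups 0-2 -> feature rows 0-2 (first output row);
    # groups 3-5 -> feature rows 1-3 (window slid one row down the column)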
In one embodiment, the neural network computing module further includes a first shift register group and a second shift register group, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-input, parallel-output mode and delivers the feature line data to the multiply-add subunit through a first multiplexer. The second shift register group operates in serial-input mode and, according to the step-size signal, delivers the convolution kernel data through a second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the incoming feature line data by the kernel data and accumulates the products with the row partial sums in the partial-sum buffer subunit; when the convolution of the kernel data with the feature line data of the corresponding window is complete, the window's several row partial sums are accumulated, realizing one sliding-window convolution of the kernel. The multiply-add arrays of the different stages output their row results to an accumulator within one convolution-row period, and the accumulator sums, through an adder tree, the row results of the same-stage arrays corresponding to all rows of the current kernel, realizing one window convolution of one kernel. The PEs of the different stages compute in parallel several sliding convolutions of the current kernel row along the feature-map row direction. During the convolution, the feature map rows are input sequentially and loaded in parallel into every stage of each PE group, while the corresponding kernel data is loaded serially into every stage in a periodic cycle.
As shown in fig. 4, the first shift register group 314 is serially connected with parallel outputs: it feeds the feature map data in parallel to the PEs of each stage, outputting through the multiplexer; each current kernel element is fed into the multiply-add unit at the same time as it enters the first-stage shift register.
The second shift register group 315 likewise adopts serial connection with parallel output and delivers the convolution kernel data to the next multiply-add unit through a multiplexer.
The feature line data and kernel row data fed into the multiply-add unit are multiplied correspondingly and then accumulated with the contents of the partial-sum buffer unit. Once the convolution of one kernel row with the feature line data of its window is complete, that row's partial sum is output and accumulated with the convolution results of the kernel's other rows, realizing the convolution of one sliding window of the kernel.
The feature line data (X00, …, X0n) is fed into the PE group in parallel in row order, while the kernel row data (F00/F01/F02) is fed in serially in cyclic order, F00/F01/F02-F00/F01/F02-F00/F01/F02, for the convolution; after one kernel-row period, each PE stage outputs the partial sum corresponding to the current kernel row (the kernel row size is 3, so the kernel-row period is 3 cycles). The PEs of the different stages output the partial sums of the kernel row's sliding convolution over the feature row according to the step size s. After each group's different-stage PEs output their partial sums within a convolution-row period, the partial sums output by the corresponding stages of all groups, covering all rows of the current kernel, are accumulated through an adder tree, realizing the convolution of one kernel and yielding the computation shown in figure 5.
In fig. 5, PE00 holds, over three consecutive cycles, the row-convolution partial sum of the kernel's first row with the corresponding feature row; PE10 holds, over three consecutive cycles, the partial sum of the kernel's first row with the window adjacent to the one computed by PE00; and PE20 likewise holds the partial sum for the window adjacent to PE10's. Several groups of PEs can thus respectively accelerate the convolutions of the kernel's different rows with their corresponding feature rows, which is equivalent to the kernel's sliding traversal convolution along the feature-map row direction. The first shift register group selects, via the step-size-parameter-s selection signal, the output of the corresponding shift register as the next PE's input, so that the convolution window traverses the row direction at step size s; the other PE groups, by reusing part of the input feature rows and kernel rows, realize the window-sliding accelerated convolution along the feature-map column direction according to the step size.
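Driving the clock_tick() model above through one feature row reproduces this timing with toy numbers (our illustration, step size 1): after a 3-cycle fill, one stage finishes a window's row partial sum every cycle, six windows in eight cycles.

    feature_row = [1, 2, 3, 4, 5, 6, 7, 8]    # stand-ins for X00..X07
    kernel_row = [10, 20, 30]                 # stand-ins for F00, F01, F02

    pes, windows = [PE() for _ in range(3)], []
    for t, x in enumerate(feature_row):
        clock_tick(pes, x, kernel_row[t % 3])  # F00/F01/F02 cycle in serially
        if t >= 2:                             # pipeline is full after 3 cycles
            done = pes[(t - 2) % 3]            # stage whose row period just ended
            windows.append(done.acc)           # one window's row partial sum
            done.acc = 0.0
    print(windows)  # [140, 200, 260, 320, 380, 440] = dot(kernel, row[i:i+3])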
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, and simultaneously instructs the multiplexers to switch the data path on or off, inputting the feature data and corresponding weight data to the corresponding neural network computing unit according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information and delivers it to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups; each group caches part of the feature data of one input channel and is coupled with at least one neural network acceleration slice, the slices share one convolution kernel data cache unit, each slice obtains the feature map data of one input channel from its corresponding cache group, and the same convolution kernel data is distributed to the plurality of slices. The data controller, besides connecting weight acquirer 0 to PE column 0, broadcasts its data to PE column 4 as well; that is, the first weight row performs a row convolution with the 1st feature row and also with the 2nd feature row (the 2nd feature row being connected, through a multiplexer, to PE column 4 in addition to PE column 2). In other words, several neural network acceleration slices compute the row-direction convolution results of the feature line data input on a single channel with the convolution kernel, which is equivalent to the kernel convolving adjacent rows at each pass.
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; meanwhile it instructs the multiplexer to output the feature data from the first shift register group to the neural network computing unit in serial-input, parallel-output mode, and instructs the multiplexer to output the convolution kernel data from the second shift register group, in serial-input mode with one output selected by step size, to the current and next convolution multiply-add arrays of the neural network computing unit.
As shown in fig. 6, when the convolution kernel row size (e.g., 5) is greater than the number of PEs in each group (e.g., 3), the stages' address accesses, through the feature line extractor, to the cached feature line data may collide within a kernel-row operation period: since all PE stages synchronously receive the same feature element, at the next kernel-row cycle the feature address required by some PEs conflicts with the cache address the feature extractor is currently accessing. The data those PEs need, however, was already accessed in a previous operation period and is held in the first shift register group. The data controller therefore obtains, according to the configuration and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, instructs the multiplexers to adjust the data path, inputs the feature and weight data to the corresponding computing units per the instruction information, and resolves the conflict between those accesses and the extractor's current access address by selecting the output data of the appropriate registers in the first shift register group. The PE unit can thus obtain conflicting data by having the data controller steer the multiplexer, which makes the design applicable to convolution kernels of different sizes and avoids PE stalls caused by stage address-access conflicts when accesses are unaligned.
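A small Python model of this conflict-avoidance idea (hypothetical structure, for illustration only): the shift register group keeps the last few feature elements, so a PE whose next operand lags the extractor's current address reads it from the register history instead of re-requesting the cache.

    from collections import deque

    class FeatureShiftRegs:
        def __init__(self, depth=4):
            self.regs = deque(maxlen=depth)  # most recent feature elements
            self.addr = -1                   # address of the newest element

        def push(self, elem):                # extractor loads one new element
            self.regs.appendleft(elem)
            self.addr += 1

        def read(self, addr):                # a PE asks for a lagging address
            lag = self.addr - addr
            if 0 <= lag < len(self.regs):
                return self.regs[lag]        # served from register history
            raise LookupError("not buffered: stall or re-fetch from cache")

    regs = FeatureShiftRegs(depth=4)
    for x in ["X00", "X01", "X02", "X03", "X04"]:
        regs.push(x)
    print(regs.read(3))  # X03 comes from the registers, no cache conflict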
As shown in fig. 4, the feature extractor may load data to the PE array in row order by data address. The PE array is provided with six feature extractors, which may be FIFO units or addressing units (addressing the cache unit and loading data to the PEs in order); if two of the FIFO units are empty, the currently empty feature extractors can have their signals gated to reduce power consumption, and the same applies to the weight acquirers. For a 5×5 or 7×7 convolution kernel, all feature extractors and weight acquirers can be put into full-load operation. The number of feature extractors can be flexibly configured, reduced or increased, with the PE array correspondingly reduced or increased.
The selection of an input channel is completed by the multiplexed input multiplexer according to the data controller's configuration information, so that the convolution of the kernel with the corresponding feature data is completed. For example, for a 3×3 kernel, PE columns 4, 5 and 6 reuse the data output by feature data extractors 2, 3 and 4, respectively. For a 5×5 kernel, PE columns 4 and 5 are connected to the corresponding feature data extractors 4 and 5, and PE column 6 is connected to feature data extractor 2. Because the kernel size is not aligned with the number of PE columns, the input multiplexers' paths must be rearranged after each row of convolution is completed. For example, in the first row period 5+1 rows are convolved at once (one row period of the 5×5 kernel over the feature map plus one more: the kernel moves down one row and only one new feature row needs convolving), so feature extractors 1-4 convolve with kernel weight rows 2-5 on PE columns 1-4, while PE columns 5 and 6 reuse the data of feature extractors 1 and 2. Only part of these connections is shown in fig. 5; the rest are omitted.
As shown in fig. 5, the weight parameters flow into the PE units serially in the column direction (three PE units are arranged in the column direction to suit a 3×3 kernel), and the PE units of each row within the same column are linked by a group of delay registers through a multiplexer; this delay circuit structure adjusts the sliding step of the convolution window. Each PE column receives the feature data of a specific row through its input multiplexer and outputs it in parallel to every row's PE unit in that column. Because the weight parameters flow serially into each PE, in the initial state the feature data must be delayed 1 and 2 cycles, respectively, before reaching the PEs of rows 2 and 3, and is then convolved with the corresponding weight parameters. Each PE unit also contains a group of feature data delay circuits, which first align the feature data with the weight parameters in the initial state and which, when the kernel exceeds 3×3, let the PE reuse the corresponding feature data through a multi-input selection circuit, preventing address conflicts when reading data.
The feature data extractor obtains at least one row of continuous feature line data at a time from the feature cache circuit on the PU slice (since this embodiment uses a 3×3 convolution kernel, the feature extraction circuit requests four continuous feature lines at a time from the feature cache circuit in order to fully utilize the multiplier resources in the PE array). Meanwhile, the weight acquirer obtains at least one row of weight parameters at a time (the three rows of weight parameters are fetched directly and loaded into the weight acquirers). With four feature lines and three kernel rows as input, each kernel-row cycle outputs at least two convolution results in the feature-map column direction.
In this embodiment, the feature extractors and weight acquirers form the data extractor, which may be a FIFO structure or an addressing circuit. The extractors respectively fetch, by row, the first weight row (F00, F01, F02) and the first feature line data (X00, X01, X02, X03, X04, X05, X06, …), and fetch the other rows' feature data anew after each row period.
In the initial state, during the first cycle F00 is fed to register 1 and the multiplier in PE00 while the feature element X00 is fed to PE00, PE10 and PE20; X00 is multiplied by F00 in PE00 and the result is sent to the row-accumulation BUFFER; F00 will reach PE10 in the next cycle, and X00 is stored in each PE's feature register 1. In the second cycle X01 is sent to PE00, PE10 and PE20; X00 shifts into feature register 2 and X01 enters feature register 1; simultaneously F01 is sent to PE00 while F00 in PE00 moves to PE10; X01 and F01 are multiplied in PE00 and the result is added to the X00×F00 result buffered in the addition BUFFER in the previous cycle, while X01 is multiplied by F00 in PE10 and the result enters its accumulation buffer. In the third cycle F02 enters PE00, F01 in PE00 enters PE10, and F00 in PE10 enters PE20, while X02 is broadcast synchronously to every PE and multiplied-and-accumulated with the corresponding weight parameters. These three cycles realize the convolution of one weight row with the corresponding feature data; after four cycles the kernel row has slid once along the feature data row, and after eight cycles the kernel row has slid six times over the corresponding feature row. Since the 3×3 kernel's three rows are input in parallel across the columns, the results of six convolution sliding windows are completed within eight cycles; after the initialization state, the whole multiply-add array is fully loaded.
In another embodiment, as shown in fig. 6, the kernel is 5×5 with step size 1, and three PE stages accelerate in parallel the kernel-row convolutions of three convolution windows along the feature-map row direction. When PE00 completes one kernel-row convolution it starts the kernel-row convolution of the fourth window (the second and third windows' kernel-row convolutions being accelerated by PE10 and PE20 of the same group). PE00's feature address must then restart from X03, but PE10 and PE20 still need the data at X05, so the feature extractor's accesses to X03 and X05 would collide. To avoid the conflict, X03 can be obtained from the feature-data register selection circuit (the feature data of X04, X03, X02, X01 has been cached over the previous four clock cycles), i.e., X03 is read from register 2. When the three PEs' register selection signals agree again, the feature extractor's address pointer is updated: for example at cycle 8 (every PE needs the X05 data, and the addresses accessed in the two following cycles coincide) the pointer is updated to X05, and after the next kernel-row cycle (e.g., the 11th clock cycle) the feature-data selection circuit is reconfigured. Through this circuit configuration, full-load computation of the 5×5 kernel is achieved flexibly.
In another embodiment, as shown in fig. 7, the kernel is 5×5 with step size 2. After five cycles every PE has been loaded with data, but PE00 has completed one window's kernel-row operation and its next window's row address is X06, while the feature extractor's address pointer, occupied by PE10 and PE20, points to X05 (which must be obtained from the feature-data cache circuit). PE00 therefore waits until the pointer reaches X06 to start the fourth window's kernel-row convolution, i.e., X06×F00. Whenever the feature data another PE stage needs conflicts with the address the feature extractor is currently accessing, the conflicting PE unit idles for the current operation cycle and performs its product operation in the next cycle, once the extractor's access address yields the corresponding feature data. Some PE units are thus not fully loaded during address-collision periods, but for a 5×5 kernel at step size 2 the PE utilization is (3×6 - 3) / (3×6) × 100% ≈ 83%, i.e., (PE units per iteration cycle - idle slots per iteration cycle) / (PE units per iteration cycle), where the iteration cycle runs from the first PE idle after initialization, cycle 6, through cycle 11 (i.e., PE00 idles once every 6 cycles), and the idle count is the group's total idles within the iteration cycle. For convolutions with a step size above 1, the larger the kernel size, the higher the overall PE utilization.
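The quoted utilization figure follows directly from that formula (a one-line check, with the 3×6 array and 3 idle slots taken from the paragraph above):

    def pe_utilization(pe_units, idle_slots):
        """(PE units per iteration cycle - idles per iteration cycle) / PE units."""
        return (pe_units - idle_slots) / pe_units

    print(f"{pe_utilization(3 * 6, 3):.1%}")  # 83.3% for the 5x5, step-size-2 case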
As shown in fig. 8, the present embodiment further provides a neural network acceleration method, including the following steps:
step 602, a data controller is adopted to adjust a data path according to configuration information and instruction information, and the data flow extracted by a data extractor is controlled to be loaded to a corresponding neural network computing unit according to the instruction information;
step 604, a neural network computing unit is adopted to complete at least one convolution of a convolution kernel with feature map data and to accumulate a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing; the neural network computing unit comprises a plurality of neural network acceleration slices, each comprising a plurality of convolution multiply-add arrays;
in step 606, each neural network acceleration slice completes the convolution of the feature map data of at least one input channel with the convolution kernel data, and the plurality of slices complete the convolution of the feature map data of a plurality of input channels with the convolution kernel data.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A neural network computing module is characterized by comprising a data controller, a data extractor and a neural network computing unit,
the data controller adjusts the data path according to the configuration information and the instruction information, controls the data flow extracted by the data extractor to be loaded to the corresponding neural network computing unit according to the instruction information,
the neural network computing unit completes at least the convolution of one convolution kernel with the feature line data corresponding to a plurality of input channels of the feature map data, and completes the accumulation of a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data multiplexing,
the neural network computing unit comprises a plurality of neural network acceleration slices, each comprising a plurality of convolution multiply-add arrays; each slice completes at least the convolution of the feature map data of one input channel with the convolution kernel data, and the plurality of slices completes the convolution of the feature map data of a plurality of input channels with the convolution kernel data; each neural network acceleration slice first computes, along the input feature-map row direction, the respective convolution results of the feature line data currently input on a single channel with the convolution kernels, updates the convolution kernel data and restarts the convolution until the single feature line data has completed the convolution with all convolution kernels, and, after the convolution results of the input feature map of one single channel with all convolution kernels are completed, updates the input channel of the feature map,
The neural network computing module also comprises a first shift register group and a second shift register group, the neural network computing unit comprises a multiplication and addition subunit, a part and a buffer subunit,
the first shift register group adopts a serial input and parallel output mode, and outputs the characteristic line data to the multiplication and addition subunit through a first multiplexer; the second shift register group adopts a mode of serial input and selecting one output according to the step length, and outputs convolution kernel data to the next convolution operation multiply-add array and the multiply-add subunit through a second multiplexer,
the multiplication and addition subunit correspondingly multiplies the input characteristic line data and the convolution kernel line data, and performs accumulation operation with the convolution line part and the data in the part and the buffer subunit, and when the convolution operation of the convolution kernel line data and the characteristic line data of the corresponding convolution window is completed, the convolution result parts and the accumulation of a plurality of lines of the convolution window are performed, so that one sliding window convolution operation of the convolution kernel is realized;
the convolution operation multiplication and addition arrays of different stages output row operation results to an accumulator in a convolution row period, and the accumulator accumulates the row operation results output by the convolution operation multiplication and addition arrays of the same stage of all groups corresponding to all rows of the current convolution kernel through an addition tree, so that the convolution operation of one convolution kernel is realized.
2. The neural network computing module of claim 1, wherein the plurality of neural network acceleration tiles form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix; the first neural network operation matrix in the second neural network acceleration matrix is configured to complete the convolution operation of a plurality of input-channel feature data with one convolution kernel, and a plurality of second neural network acceleration matrices are configured to complete the parallel convolution operation of a plurality of input-channel feature data with a plurality of convolution kernels.
3. The neural network computing module of claim 1, wherein each group of convolution operation multiply-add arrays obtains feature line data through parallel input, and each group of convolution operation multiply-add arrays obtains convolution kernel data through serial input.
4. The neural network computing module of claim 1, wherein the neural network acceleration tile comprises a plurality of first multiplexers and a plurality of second multiplexers, the first multiplexers being coupled one-to-one in parallel with the convolution operation multiply-add arrays, and the second multiplexers being coupled one-to-one in series with the convolution operation multiply-add arrays; each first multiplexer obtains, through a data selection signal, the feature line data corresponding to its convolution operation multiply-add array and inputs the feature line data in parallel into the corresponding convolution operation multiply-add array of each stage, and each second multiplexer obtains the convolution kernel data corresponding to its convolution operation multiply-add array and inputs the convolution kernel data serially into the convolution operation multiply-add array of each stage to complete the convolution multiply-add operation.
5. The neural network computing module of claim 1, wherein the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and of the corresponding weight data to be loaded into the neural network computing unit, to instruct the multiplexers to make and break the data paths, and to input the feature data and the corresponding weight data into the corresponding neural network computing unit according to the instruction information;
the data extractor comprises a feature extractor and a convolution kernel extractor, the feature extractor being configured to extract feature line data from a feature data cache unit according to the instruction information, and the convolution kernel extractor being configured to extract convolution kernel data from a convolution kernel cache unit according to the instruction information, so as to transmit them to the neural network computing unit.
6. The neural network computing module of claim 5, wherein the feature data cache unit comprises a plurality of feature data cache groups, each feature data cache group caching a portion of the feature data of an input channel and being coupled to at least one neural network acceleration tile, and the plurality of neural network acceleration tiles sharing a convolution kernel data cache unit;
each neural network acceleration tile obtains the feature map data of one input channel from its corresponding feature data cache group, and the same convolution kernel data is distributed to the plurality of neural network acceleration tiles.
7. A neural network computing method, comprising:
adjusting, by a data controller, a data path according to configuration information and instruction information, and controlling, according to the instruction information, the data stream extracted by a data extractor to be loaded into the corresponding neural network computing unit;
completing, by the neural network computing unit, at least the convolution operation of the feature line data corresponding to a plurality of input channels of the feature map data, and completing the accumulation of a plurality of convolution results within at least one cycle, thereby realizing circuit reconstruction and data multiplexing, wherein the neural network computing unit comprises a plurality of neural network acceleration tiles, and each neural network acceleration tile comprises a plurality of convolution operation multiply-add arrays;
completing, by each neural network acceleration tile, the convolution operation of the feature map data of at least one input channel with the convolution kernel data, and completing, by the plurality of neural network acceleration tiles, the convolution operation of the feature map data of a plurality of input channels with the convolution kernel data, wherein each neural network acceleration tile first computes, along the row direction of the input feature map, the convolution results of the feature line data input by the current single channel with a plurality of convolution kernels, then updates the convolution kernel data and restarts the convolution operation until the single line of feature data has completed convolution with all convolution kernels, and, after the convolution results of the single-channel input feature map with all convolution kernels are completed, updates the input channel of the feature map;
wherein the neural network computing module further comprises a first shift register group and a second shift register group, and the neural network computing unit comprises a multiply-add subunit and a partial-sum buffer subunit;
the first shift register group operates in a serial-in, parallel-out mode and outputs the feature line data to the multiply-add subunit through a first multiplexer; the second shift register group operates in a serial-in mode with one output selected according to the stride, and outputs convolution kernel data to the next convolution operation multiply-add array and to the multiply-add subunit through a second multiplexer;
the multiply-add subunit multiplies the input feature line data element-wise with the convolution kernel line data and accumulates the products with the row partial-sum data held in the partial-sum buffer subunit; when the convolution of the convolution kernel line data with the feature line data of the corresponding convolution window is completed, the row partial sums of the plurality of rows of that convolution window are accumulated, thereby realizing one sliding-window convolution of the convolution kernel;
the convolution operation multiply-add arrays of different stages output their row results to an accumulator within one convolution-row cycle, and the accumulator accumulates, through an adder tree, the row results output by the same-stage convolution operation multiply-add arrays of all groups corresponding to all rows of the current convolution kernel, thereby realizing the convolution operation of one convolution kernel.
8. A communication device, comprising a central processing unit (CPU), a DDR SDRAM memory, and the neural network computing module of any one of claims 1 to 6, which are communicatively connected, wherein the CPU is configured to control the neural network computing module to initiate the convolution operation, and the DDR SDRAM is configured to input the feature map data and the weight data to the neural network computing module.
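To make the row-wise dataflow recited in claims 1 and 7 concrete, the following is a minimal behavioral sketch, assuming a unit stride and a single output row; it is illustrative Python, not RTL and not the claimed circuit itself. Each `row_conv` call plays the role of one multiply-add stage convolving one kernel row with one feature row while keeping sliding-window partial sums in a buffer, and `adder_tree` folds the same-stage row results of all kernel rows together, as the accumulator with its adder tree would. All function names are hypothetical.

```python
import numpy as np

def row_conv(feature_row, kernel_row, stride=1):
    """One multiply-add stage: 1-D sliding-window dot products for one row."""
    k = len(kernel_row)
    n_windows = (len(feature_row) - k) // stride + 1
    psum = np.zeros(n_windows, dtype=np.float32)  # stands in for the partial-sum buffer subunit
    for w in range(n_windows):
        psum[w] = np.dot(feature_row[w * stride:w * stride + k], kernel_row)
    return psum

def adder_tree(row_results):
    """Pairwise reduction of the per-row results, as an addition tree would."""
    while len(row_results) > 1:
        pairs = [row_results[i] + row_results[i + 1]
                 for i in range(0, len(row_results) - 1, 2)]
        if len(row_results) % 2:        # odd element carries up to the next level
            pairs.append(row_results[-1])
        row_results = pairs
    return row_results[0]

def one_output_row(feature_rows, kernel_rows, stride=1):
    """One sliding-window pass of a whole kernel: row results, then tree sum."""
    rows = [row_conv(feature_rows[r], kernel_rows[r], stride)
            for r in range(len(kernel_rows))]
    return adder_tree(rows)

# Usage: one output row of a 3x3 kernel over three 5-wide feature rows.
frows = np.arange(15, dtype=np.float32).reshape(3, 5)
krows = np.ones((3, 3), dtype=np.float32)
print(one_output_row(frows, krows))  # -> [54. 63. 72.]: three windows, summed across kernel rows
```

Hardware-wise, the claims pipeline these row operations: the shift register groups stream feature and kernel rows into the stages so that the tree reduction overlaps with the next window's multiplies; the sketch reproduces only the arithmetic, not the timing.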
CN202111071503.4A 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment Active CN113792868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071503.4A CN113792868B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment

Publications (2)

Publication Number Publication Date
CN113792868A (en) 2021-12-14
CN113792868B (en) 2024-03-29

Family

ID=79183246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111071503.4A Active CN113792868B (en) 2021-09-14 2021-09-14 Neural network computing module, method and communication equipment

Country Status (1)

Country Link
CN (1) CN113792868B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115481721B (en) * 2022-09-02 2023-06-27 浙江大学 Psum calculation circuit for convolutional neural network
CN116861973B (en) * 2023-09-05 2023-12-15 深圳比特微电子科技有限公司 Improved circuits, chips, devices and methods for convolution operations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160133924A (en) * 2015-05-14 2016-11-23 한국전자통신연구원 Apparatus and method for convolution operation
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
WO2020238843A1 (en) * 2019-05-24 2020-12-03 华为技术有限公司 Neural network computing device and method, and computing device
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network
WO2021072732A1 (en) * 2019-10-18 2021-04-22 北京希姆计算科技有限公司 Matrix computing circuit, apparatus and method
WO2021088563A1 (en) * 2019-11-04 2021-05-14 北京希姆计算科技有限公司 Convolution operation circuit, apparatus and method
CN111832718A (en) * 2020-06-24 2020-10-27 上海西井信息科技有限公司 Chip architecture
CN113222130A (en) * 2021-04-09 2021-08-06 广东工业大学 Reconfigurable convolution neural network accelerator based on FPGA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of Two-Dimensional Matrix Convolution in a Vector Processor; Zhang Junyang et al.; Journal of National University of Defense Technology; Vol. 40, No. 3; 69-75 *

Similar Documents

Publication Publication Date Title
CN113807509B (en) Neural network acceleration device, method and communication equipment
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN113792868B (en) Neural network computing module, method and communication equipment
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN107689948B (en) Efficient data access management device applied to neural network hardware acceleration system
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN109034373B (en) Parallel processor and processing method of convolutional neural network
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
WO2022007266A1 (en) Method and apparatus for accelerating convolutional neural network
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110580519B (en) Convolution operation device and method thereof
US20230036414A1 (en) Neural network acceleration circuit and method
CN110659445B (en) Arithmetic device and processing method thereof
CN113537482B (en) Neural network computing module, method and communication device
CN116306840A (en) Neural network operation method, device, chip, electronic equipment and storage medium
JP2022538735A (en) Data processing method, device, storage medium and electronic equipment
CN111626410B (en) Sparse convolutional neural network accelerator and calculation method
CN116050492A (en) Expansion unit
CN113159302B (en) Routing structure for reconfigurable neural network processor
Huang et al. MALMM: A multi-array architecture for large-scale matrix multiplication on FPGA
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN113673691A (en) Storage and computation combination-based multi-channel convolution FPGA (field programmable Gate array) framework and working method thereof
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
CN112766453A (en) Data processing device and data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant