CN115700605A - Reconfigurable hardware accelerator applied to convolutional neural network training


Info

Publication number: CN115700605A
Application number: CN202110874007.6A
Authority: CN (China)
Prior art keywords: input, operation processing, data, target, trained
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王中风, 邵海阔, 林军
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Application filed by Nanjing University
Priority: CN202110874007.6A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The application provides a reconfigurable hardware accelerator applied to convolutional neural network training. The reconfigurable hardware accelerator includes a cache architecture and an operation processing array: the cache architecture comprises an input cache architecture and an output cache architecture, and the operation processing array comprises a plurality of operation processing modules arranged in a two-dimensional array. In use, the input cache architecture rearranges and groups the data to be operated on according to a preset data grouping mode and sends the data to the operation processing modules for processing, and the internal data connection mode of each operation processing module in the operation processing array is dynamically adjusted at different training stages, so that the operation processing modules perform the convolution operation processing corresponding to the candidate training stage according to the moving step length. The whole device has a flexible calculation mode, can process multi-channel operations in parallel, and can meet the calculation requirements of different training stages with a single hardware architecture, thereby achieving high model training efficiency.

Description

Reconfigurable hardware accelerator applied to convolutional neural network training
Technical Field
The application relates to the technical field of computers and electronic information, in particular to a reconfigurable hardware accelerator applied to convolutional neural network training.
Background
In recent years, Convolutional Neural Network (CNN) models have been widely applied in many fields such as computer vision, speech recognition and natural language processing. As the requirement for recognition accuracy gradually increases, the structure of CNN models becomes larger and the number of parameters they contain keeps growing, so training CNN models becomes increasingly complex and time-consuming. Moreover, because of online learning and data privacy considerations, there is a wide demand for training CNN models on edge computing platforms with limited resources; therefore, the training of CNN models needs to be accelerated.
The training of a CNN model mainly comprises a Forward Propagation (FP) stage, a Backward Propagation (BP) stage and a Weight Gradient (WG) calculation stage. In the FP stage, proceeding from front to back, the input activation values of each layer in the CNN model (i.e. the output activation values of the previous layer) are convolved with the convolution kernel weights of that layer, the output activation values of the layer are obtained through an activation function, and the calculation proceeds layer by layer toward the back; finally, a loss function is used to evaluate the deviation between the predicted label output by the network and the true label, and the loss is calculated. In the BP stage, starting from the loss calculated in the FP stage and proceeding from back to front, the input error values of each layer (i.e. the error values of the next layer) are convolved with the convolution kernel weights of that layer to obtain the error values of the layer, and by calculating layer by layer toward the front the error values of every layer in the CNN model are finally obtained. Finally, in the WG stage, according to the chain rule, the output activation values of the previous layer are convolved with the error values of the current layer to obtain the weight gradient of the current layer; calculating layer by layer in this way, the updated convolution kernel weights of every layer in the whole CNN model are finally obtained, which completes the training of the whole CNN model.
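To make the three stages concrete, the following is a minimal NumPy sketch for a single-channel, stride-1 convolution layer; the 5×5/3×3 sizes, the use of ReLU, and all variable names are illustrative assumptions, and the derivative of the activation function is omitted for brevity.

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D cross-correlation, the convolution used in CNN layers."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.random.randn(5, 5)   # input activations of the layer (a_(l-1), 5x5 assumed)
w = np.random.randn(3, 3)   # convolution kernel weights of the layer (w_l, 3x3 assumed)

# FP stage: output activations through convolution and the activation function.
a = np.maximum(conv2d(x, w), 0)                  # a_l, size 3x3 (ReLU assumed)

# BP stage: the error at the layer output (same size as a) is padded and then
# convolved with the 180-degree-rotated kernel to get the error at the layer input.
e_out = np.random.randn(*a.shape)                # error values propagated from behind
e_in = conv2d(np.pad(e_out, 2), np.rot90(w, 2))  # 3x3 padded to 7x7 -> 5x5 error map

# WG stage: by the chain rule, the weight gradient is the convolution of the
# input activations with the output error, giving a 3x3 gradient matching w.
g = conv2d(x, e_out)
```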
At present, hardware accelerators developed on FPGAs (Field Programmable Gate Arrays) can generally be used to train CNN models. However, such a hardware accelerator usually contains only processing units of a single fixed structure, so although the hardware structure is simple, the computing functions it can realize are limited; during training, the computation process generally has to be additionally decomposed and recombined, and data has to be read repeatedly, which leads to low training efficiency and cannot meet the requirement for efficient training of CNN models.
Disclosure of Invention
The application provides a reconfigurable hardware accelerator applied to convolutional neural network training, which can be used for solving the technical problem of low training efficiency of the existing hardware architecture.
In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:
a reconfigurable hardware accelerator applied to convolutional neural network training comprises a cache architecture, an operation processing array, a functional module and a main controller, wherein:
the cache architecture comprises an input cache architecture and an output cache architecture; the input cache architecture is used for storing data to be operated of a network layer to be trained in a candidate training stage, rearranging and grouping the data to be operated according to a preset data grouping mode, and inputting the data to be operated into the operation processing array, wherein the candidate training stage is any training stage in all training stages;
the operation processing array comprises a plurality of operation processing modules arranged in a two-dimensional array mode and a scaling rounding module connected with each row of operation processing modules; the operation processing module is used for receiving the data input by the input cache architecture, performing, according to an instruction of the main controller, the convolution operation processing corresponding to the candidate training stage according to a preset moving step length, and inputting the convolution operation result to the corresponding scaling rounding module; the scaling rounding module is used for converting the data format of the convolution operation result and then sending the result to the output cache architecture for storage;
the functional module is used for performing activation operation or pooling operation on the data in the output cache architecture and updating the weight value of the convolution kernel to be trained in the network layer to be trained after training is completed;
the main controller is used for determining the data grouping mode according to the number of convolution kernels to be trained and the number of input channels of the network layer to be trained in the candidate training stage, and for adjusting the internal data connection mode of the operation processing module according to the candidate training stage and the moving step length, so that the operation processing module executes the convolution operation processing corresponding to the moving step length and the candidate training stage.
In one implementation, the input cache architecture includes a first input architecture and a second input architecture;
the first input architecture comprises a first input cache module and a first input pre-fetching module, wherein the first input cache module is used for storing first input data in the data to be operated, the first input pre-fetching module is connected with each operation processing module in the operation processing array and is used for rearranging and grouping the first input data according to the data grouping mode, determining a target column in the operation processing array corresponding to each group of first target data and sending each group of first target data to each operation processing module in the corresponding target column;
the second input architecture comprises a second input cache module and a second input pre-fetching module, the second input cache module is used for storing second input data in the data to be operated, the second input pre-fetching module is connected with each operation processing module in the operation processing array and used for rearranging and grouping the second input data according to the data grouping mode, determining a target row in the operation processing array corresponding to each group of second target data and sending each group of second target data to each operation processing module in the corresponding target row.
In an implementation manner, when the second input prefetch module sends the second target data to all the operation processing modules in the corresponding target row, each data in the second target data is sequentially sent to all the operation processing modules in the corresponding target row according to a preset clock cycle.
In an implementation manner, if the candidate training phase is an FP phase, the first input data is weight values of a plurality of convolution kernels to be trained in the network layer to be trained, and the second input data is a multi-channel input activation value of the network layer to be trained;
if the candidate training stage is a BP stage, the first input data are the weight values of a plurality of convolution kernels to be trained in the network layer to be trained, where each convolution kernel to be trained is the matrix obtained by rotating the original convolution kernel by one hundred eighty degrees, and the second input data are determined according to the multi-channel input error values of the network layer to be trained;
if the candidate training stage is the WG stage, the first input data is determined according to the multichannel error value of the network layer to be trained, and the second input data is the multichannel input activation value of the network layer to be trained.
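As a purely illustrative summary (not wording from the claims), the operand roles per stage can be written down as a small lookup structure:

```python
# Illustrative summary of the operand roles per training stage: "first" data are
# broadcast to array columns, "second" data to array rows. The layout is assumed.
INPUT_ROLES = {
    "FP": {"first": "weights of the convolution kernels to be trained",
           "second": "multi-channel input activation values"},
    "BP": {"first": "weights of the kernels rotated by 180 degrees",
           "second": "padded (and, for step size 2, zero-filled) input error values"},
    "WG": {"first": "multi-channel error values (zero-filled for step size 2)",
           "second": "multi-channel input activation values"},
}

for stage, roles in INPUT_ROLES.items():
    print(f"{stage}: first = {roles['first']}; second = {roles['second']}")
```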
In one implementation, the operation processing module includes a MAC array, an adder group, a multiplexer group, and a partial sum FIFO queue;
the MAC array comprises a plurality of MACs which are arranged in a two-dimensional array mode, the number and the arrangement mode of the MACs are the same as the size of a target weight matrix, the target weight matrix is a weight matrix of a target channel in a target convolution kernel to be trained, the target convolution kernel to be trained is any convolution kernel to be trained in the plurality of convolution kernels to be trained, and the target channel is any channel in the target convolution kernel to be trained;
the MAC includes a first external port, a second external port, an internal port, a multiplier, an adder, a multiplexer, and a register, and is configured to multiply the input data of the first external port with the input data of the second external port, add the product to the input data of the internal port to obtain an intermediate result, and transmit the intermediate result along a path corresponding to the candidate training stage according to an instruction of the main controller; the input data of the first external port is the first target value, in the received first target data, at the target position corresponding to the position of the MAC, the input data of the second external port is all second target values in the received second target data that need to be multiplied by the first target value, and the input data of the internal port is one of a partial sum in the convolution operation process, a result transferred by the previous MAC, or an intermediate result output by the MAC itself, the selection of the input data of the internal port being implemented by the multiplexer;
the adder group comprises two adders, and is used for summing intermediate results output by multiple paths of MAC in the convolution operation process according to the instruction of the main controller and outputting the results to the partial sum FIFO queue;
the multiplexer group comprises a plurality of row multiplexers, the number of which is the same as the number of rows of the MAC array, for transferring the intermediate result output by each MAC and the result output by the adder group to the partial sum FIFO queue;
the partial sum FIFO queue is used to store all partial sums during the convolution operation.
In one implementation manner, the performing data format conversion on the convolution operation result includes:
after all calculations of the network layer to be trained in the candidate training stage are completed, determining a first maximum value from all convolution operation results output by a target operation processing module, and determining the first maximum value as a local maximum value, wherein the target operation processing module is any one of a plurality of operation processing modules;
determining the shift bit number when converting from int32 format to int8 format according to the first maximum value;
converting each convolution operation result output by the target operation processing module from an int32 format to an int8 format according to the shift bit number to obtain a candidate result;
determining a second maximum value from the local maximum values corresponding to each operation processing module in the row where the target operation processing module is located, and determining the second maximum value as a global maximum value;
acquiring a first shift digit corresponding to a target local maximum value and a second shift digit corresponding to the global maximum value, wherein the target local maximum value is any one of local maximum values corresponding to all operation processing modules of a row where the target operation processing module is located;
determining a shift difference between the first number of shifted bits and the second number of shifted bits;
and according to the shift difference, carrying out a secondary adjustment on the data format of a target candidate result, wherein the target candidate result is the candidate result output by the operation processing module corresponding to the target local maximum value.
Therefore, the reconfigurable hardware accelerator provided by the embodiment of the application dynamically adjusts the internal data connection mode of each operation processing module in the operation processing array at different training stages, so that the operation processing modules perform the convolution operation processing corresponding to the candidate training stage according to the moving step length; invalid calculations can be effectively avoided, and the operation processing modules can work in parallel, thereby greatly improving the resource utilization efficiency of the hardware. The whole reconfigurable hardware accelerator has a flexible calculation mode, can process multi-channel operations in parallel, and can meet the calculation requirements of different training stages with a single hardware architecture, so it achieves high model training efficiency.
Drawings
FIG. 1 is a schematic diagram of a calculation process corresponding to different training stages of a CNN model in a training process;
fig. 2 is a schematic structural diagram of a reconfigurable hardware accelerator applied to convolutional neural network training according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an arithmetic processing module according to an embodiment of the present application;
fig. 4a is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 1 in the FP stage according to an embodiment of the present application;
fig. 4b is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 2 in the FP stage according to an embodiment of the present application;
fig. 5a is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 1 in the BP stage according to an embodiment of the present application;
fig. 5b is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 2 in the BP stage according to an embodiment of the present application;
fig. 6a is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 1 in the WG stage according to an embodiment of the present application;
fig. 6b is a schematic diagram of the internal data flow when the operation processing module performs convolution with a step size of 2 in the WG stage according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a workflow of a scaling and rounding module for data format conversion according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of the convolution kernel size, input feature map size and output feature map size in a convolution layer of a VGG network;
FIG. 9 is a diagram showing specific values of a convolution block and a convolution kernel under one channel;
fig. 10 is a schematic diagram of the internal specific data flow when the operation processing module performs convolution operation in the FP stage in the manner of step size 1 in this application example.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The training process of the CNN model will be described first with reference to the drawings.
Fig. 1 exemplarily shows a calculation flow diagram of a CNN model in different training phases of a training process, as shown in fig. 1, the training phases of the CNN model mainly include an FP phase, a BP phase, and a WG phase.
In the FP stage (as shown in fig. 1 (a) and (b)), the training feature map is passed through the entire CNN network layer by layer, and the output activation value of each layer is calculated in turn by the operation corresponding to the layer type. For example, for a convolution layer with step size = 1 (assume the l-th layer), the convolution kernel is moved over the input feature map with a step of 1: the l-th layer convolution kernel weights (w_l, size 3×3) are convolved with the output activation values of the (l-1)-th layer (a_(l-1), size 5×5), and the output activation values of the l-th layer (a_l, size 3×3) are then obtained through the activation function; the calculation flow is shown in fig. 1 (a). For a convolution layer with step size = 2 (assume the l-th layer), the convolution kernel is moved over the input feature map with a step of 2, giving the output activation values (a_l, size 2×2); the calculation flow is shown in fig. 1 (b). At the end of the FP stage, the predicted label finally produced by forward propagation and the true label are passed to a loss function, such as the cross-entropy loss function, to evaluate the deviation between the predicted label and the true label and to calculate the loss.
In the BP phase (as shown in fig. 1 (c) and (d)), starting from the loss calculated in the FP phase, the gradient of the activation values of each layer, i.e. the error values (e_l), is calculated layer by layer in order from back to front. For example, for a convolution layer with step size = 1, the calculation is similar to the convolution of the FP stage: the rotated convolution kernel of the l-th layer (obtained by rotating the FP-stage convolution kernel by 180 degrees, still of size 3×3) is moved over the input error map with a step of 1, and its weights (w_l) are convolved with the error values of the (l+1)-th layer (e_(l+1), which are first padded so that their size changes from the original 3×3 to 7×7), giving the error values of the l-th layer (e_l, size 5×5); the calculation flow is shown in fig. 1 (c). Note that during back propagation the numbers of input and output channels are swapped relative to the FP stage. For a convolution layer with step size = 2, the rotated convolution kernel slides with a step of 1 over the error values of the (l+1)-th layer (e_(l+1), which are first padded and zero-filled in both the horizontal and vertical dimensions so that their size changes from the original 2×2 to 7×7), and the error values of the l-th layer (e_l, size 5×5) are obtained by convolution; the calculation flow is shown in fig. 1 (d).
In the WG stage (as shown in fig. 1 (e) and (f)), according to the chain rule, the weight gradients are calculated from the output activation values (a_l) saved during forward propagation and the error values (e_(l+1)). For example, for a convolution layer with step size = 1, the output activation values of the l-th layer (a_l) are convolved with the error values of the (l+1)-th layer (e_(l+1)) to obtain the weight gradients of the (l+1)-th layer (g_(l+1)); the calculation flow is shown in fig. 1 (e). For a convolution layer with step size = 2, the error values of the (l+1)-th layer (e_(l+1), size 2×2, zero-filled in both the horizontal and vertical dimensions) are convolved with the output activation values of the l-th layer (a_l) to obtain the weight gradients of the (l+1)-th layer (g_(l+1)); the calculation flow is shown in fig. 1 (f).
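For the step-size-2 cases in fig. 1 (d) and (f), the key preprocessing is inserting zeros between neighbouring error values (and, for BP, additionally padding the border). A small sketch of that zero-insertion, with assumed sizes, is given below.

```python
import numpy as np

def zero_insert(e, stride=2):
    """Insert (stride - 1) zeros between neighbouring error values in both
    the horizontal and vertical dimensions (2x2 -> 3x3 for stride 2)."""
    h, w = e.shape
    out = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    out[::stride, ::stride] = e
    return out

e_next = np.arange(1, 5, dtype=float).reshape(2, 2)   # assumed 2x2 error map

# WG with step size 2 (fig. 1 (f)): only zero-insertion is needed before the
# convolution with the output activation values of the previous layer.
e_wg = zero_insert(e_next)                             # 2x2 -> 3x3

# BP with step size 2 (fig. 1 (d)): zero-insertion plus border padding, so the
# 2x2 error map grows to the 7x7 matrix that the rotated kernel slides over.
e_bp = np.pad(zero_insert(e_next), 2)                  # 2x2 -> 3x3 -> 7x7
print(e_bp.shape)                                      # (7, 7)
```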
In order to accelerate the training process of the CNN model and meet the efficient training requirement of the CNN model, the embodiment of the application provides a reconfigurable hardware accelerator applied to convolutional neural network training. Fig. 2 schematically illustrates a structural diagram of a reconfigurable hardware accelerator applied to convolutional neural network training according to an embodiment of the present application, and as shown in fig. 2, the reconfigurable hardware accelerator according to the embodiment of the present application includes a cache architecture 100 (not shown in the figure), an operation processing array 200, a functional module 300, and a main controller 400, where:
cache architecture 100 includes an input cache architecture 101 and an output cache architecture 102. The input cache architecture 101 is configured to store data to be operated by the network layer to be trained in the candidate training phase, rearrange and group the data to be operated according to a preset data grouping manner, and input the data to be operated into the operation processing array 200. The candidate training phase is any one of all training phases, namely any one of the FP phase, the BP phase and the WG phase. The output cache architecture 102 is an output cache in the figure.
The arithmetic processing array 200 includes a plurality of arithmetic processing modules 201 arranged in a two-dimensional array, and a Scaling and Rounding Unit (SRU) 202 connected to each row of the arithmetic processing modules 201. The operation processing module 201 is configured to receive the data input by the input cache architecture 101, perform, according to an instruction of the main controller 400, the convolution operation processing corresponding to the candidate training phase according to a preset moving step, and then input the convolution operation result to the corresponding scaling and rounding unit (SRU) 202. The scaling rounding module 202 is configured to perform data format conversion on the convolution operation result and send the result to the output cache architecture 102 for storage.
The arithmetic processing array 200 includes n × m arithmetic processing modules 201, and preferably, n × m is 16 × 16. A reconfigurable hardware accelerator provided in the embodiment of the present application is described by taking 16 × 16 operation processing modules 201 as an example.
The functional module 300 is configured to perform an activation operation or a pooling operation on data in the output cache architecture 102, and after training is completed, update a weight value of a convolution kernel to be trained in a network layer to be trained. Specifically, the functional module 300 includes other general computing modules required in the training process, such as a linear activation function (ReLU), pooling (Pooling), weight Update (Weight Update), and the like.
The main controller 400 is configured to determine the data grouping mode according to the number of convolution kernels to be trained and the number of input channels of the network layer to be trained in the candidate training phase, and to adjust the internal data connection mode of the operation processing module 201 according to the candidate training stage and the moving step length, so that the operation processing module 201 performs the convolution operation processing corresponding to the candidate training stage according to the moving step length. That is, the main controller 400 may be configured to generate the various control signals required by each module in the convolutional neural network training process, for example, sending the data grouping mode to the input cache architecture 101, sending the internal data connection mode to the operation processing module 201, and so on.
Therefore, the reconfigurable hardware accelerator provided by the embodiment of the application dynamically adjusts the internal data connection mode of each operation processing module in the operation processing array at different training stages, so that the operation processing modules perform the convolution operation processing corresponding to the candidate training stage according to the moving step length; invalid calculations can be effectively avoided, and the operation processing modules can work in parallel, thereby greatly improving the resource utilization efficiency of the hardware. The whole reconfigurable hardware accelerator has a flexible calculation mode, can process multi-channel operations in parallel, and can meet the calculation requirements of different training stages with a single hardware architecture, so it achieves high model training efficiency.
Further, a specific structure of the input cache architecture 101 is explained below.
As shown in fig. 2, the input cache architecture 101 includes a first input architecture 1011 and a second input architecture 1012.
The first input architecture 1011 includes a first input buffer module 10111 and a first input prefetch module 10112, the first input buffer module 10111 is configured to store first input data in data to be operated, the first input prefetch module 10112 is connected to each operation processing module 201 in the operation processing array 200, and is configured to rearrange and group the first input data in a data grouping manner, determine a target column in the operation processing array 200 corresponding to each group of the first target data, and send each group of the first target data to each operation processing module 201 in the corresponding target column.
The second input architecture 1012 includes a second input buffer module 10121 and a second input prefetch module 10122, where the second input buffer module 10121 is configured to store second input data in the data to be operated, and the second input prefetch module 10122 is connected to each operation processing module 201 in the operation processing array 200, and is configured to rearrange and group the second input data in a data grouping manner, determine a target row in the operation processing array 200 corresponding to each group of second target data, and send each group of second target data to each operation processing module 201 in the corresponding target row.
Specifically, if the candidate training phase is the FP phase, the first input data is weight values of a plurality of convolution kernels to be trained in the network layer to be trained. The first target data is a weight value of any channel in any convolution kernel to be trained, and the first input prefetch module 10112 allocates a target column to each weight value of each channel in each convolution kernel to be trained, and sends the target column to each operation processing module 201 located in the target column. The second input data is a multi-channel input activation value of the network layer to be trained, namely the activation value of each pixel point in the input characteristic diagram, namely the output activation value of the previous network layer of the network layer to be trained. The second target data is an input activation value of any channel, and the second input prefetch module 10122 allocates a target line to each input activation value of each channel, and sends the target line to each operation processing module 201 in the target line.
And if the candidate training stage is the BP stage, the first input data is the weight values of a plurality of convolution kernels to be trained in the network layer to be trained, wherein the convolution kernels to be trained are matrixes obtained after the convolution kernels to be trained are rotated by one hundred eighty degrees. The first target data is a weight value of any channel in any convolution kernel, and the first input prefetch module 10112 allocates a target column to each weight value of each channel in each convolution kernel, and sends the target column to each operation processing module 201 located in the target column. The second input data is determined according to the multi-channel input error value of the network layer to be trained, namely, the second input data is determined according to the error value of the network layer next to the network layer to be trained.
Specifically, in the BP phase, if the step size is 1, the input error values of the current network layer (denoted e_(l+1)) are filled (padded) to obtain the filled error values (denoted e'_(l+1)); the second input data are the filled error values (e'_(l+1)), the second target data are the filled error values of any one channel, and the second input prefetch module 10122 allocates a target row to the error values of each channel and sends them to each operation processing module 201 in that target row. If the step size is 2, the input error values of the current network layer (denoted e_(l+1)) are filled and additionally zero-filled in both the horizontal and vertical dimensions to obtain the filled and zero-filled error values (denoted e''_(l+1)); the second input data are the filled and zero-filled error values (e''_(l+1)), the second target data are the filled and zero-filled error values of any one channel, and the second input prefetch module 10122 allocates a target row to the error values of each channel and sends them to each operation processing module 201 in that target row.
If the candidate training phase is the WG phase, the first input data is determined from a multi-channel error value of the network layer to be trained. The second input data is a multi-channel input activation value of the network layer to be trained.
Specifically, in the WG stage, if the step size is 1, the first input data are the multi-channel error values (e_(l+1)) of the network layer to be trained (assume the (l+1)-th layer); the first target data are the error values of any one channel, and the first input prefetch module 10112 allocates a target column to the error values of each channel and sends them to each operation processing module 201 in that target column. The second input data are the multi-channel input activation values of the network layer to be trained; the second target data are the input activation values of any one channel, and the second input prefetch module 10122 allocates a target row to the input activation values of each channel and sends them to each operation processing module 201 in that target row. If the step size is 2, the multi-channel error values (e_(l+1)) of the network layer to be trained (assume the (l+1)-th layer) are zero-filled in both the horizontal and vertical dimensions to obtain the zero-filled error values (denoted e'_(l+1)); the first input data are the zero-filled error values (e'_(l+1)), the first target data are the zero-filled error values of any one channel, and the first input prefetch module 10112 allocates a target column to the zero-filled error values of each channel and sends them to each operation processing module 201 in that target column.
Exemplarily, taking the FP stage as an example, assume that the network layer to be trained includes 3 convolution kernels to be trained. At time t0, the first input prefetch module 10112 sends the weight values of the 1st channel of the 1st convolution kernel to be trained to each operation processing module 201 in the 1st column, the weight values of the 1st channel of the 2nd convolution kernel to each operation processing module 201 in the 2nd column, and the weight values of the 1st channel of the 3rd convolution kernel to each operation processing module 201 in the 3rd column; meanwhile, the second input prefetch module 10122 sends the input activation values of the 1st channel to each operation processing module 201 in the 1st row, thereby completing the convolution operation of the 1st channel. At time t1, the first input prefetch module 10112 sends the weight values of the 2nd channel of the 1st convolution kernel to be trained to each operation processing module 201 in the 4th column, the weight values of the 2nd channel of the 2nd convolution kernel to be trained to each operation processing module 201 in the 5th column, and the weight values of the 2nd channel of the 3rd convolution kernel to be trained to each operation processing module 201 in the 6th column; meanwhile, the second input prefetch module 10122 sends the input activation values of the 2nd channel to each operation processing module 201 in the 2nd row, thereby completing the convolution operation of the 2nd channel. This continues analogously until the convolution operation of all channels is completed, giving the output activation values of the network layer to be trained in the FP stage.
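The time-multiplexed mapping in this example, weights of (kernel k, channel c) to a column and activations of channel c to a row, can be written as a small scheduling sketch; the 16×16 array size, the wrap-around modulo and the function names are assumptions for illustration only.

```python
# Sketch of the FP-stage scheduling described above: at each time step the weights
# of one channel of each kernel go to one column and the input activations of that
# channel go to one row of the (assumed) 16x16 operation processing array.
NUM_COLS = 16
NUM_ROWS = 16

def fp_schedule(num_kernels, num_channels):
    """Yield (time_step, kernel, channel, target_column, target_row) tuples."""
    for t in range(num_channels):                   # one input channel per time step
        row = t % NUM_ROWS                          # activations of channel t -> this row
        for k in range(num_kernels):
            col = (t * num_kernels + k) % NUM_COLS  # weights (kernel k, channel t) -> this column
            yield t, k, t, col, row

# With 3 kernels: at t0 the weights go to columns 0..2 and the activations to row 0,
# at t1 to columns 3..5 and row 1, matching the description above (0-indexed).
for step in fp_schedule(num_kernels=3, num_channels=2):
    print(step)
```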
It should be noted that, when the second input prefetch module 10122 sends the second target data to all the operation processing modules 201 in the corresponding target row, each value in the second target data is sent to all the operation processing modules 201 in the corresponding target row in sequence according to a preset clock cycle; specifically, each value of the second target data is transmitted in the manner of a systolic array. The preset clock period is not particularly limited, and is, for example, 1×10^-8 s or 5×10^-9 s.
Therefore, by adopting the input cache architecture, the data to be operated from the on-chip cache can be rearranged according to the calculation mode of the operation processing module and correctly transmitted to the operation processing array, thereby avoiding redundant operation and greatly improving the operation efficiency.
Next, a specific configuration of the arithmetic processing block 201 will be described.
Fig. 3 exemplarily shows a schematic structural diagram of an arithmetic processing module provided in an embodiment of the present application. As shown in fig. 3, the arithmetic processing module 201 includes a MAC (multiply-accumulate unit) array 2011, an adder group 2012, a multiplexer group 2013, and a partial sum FIFO queue 2014.
The MAC array 2011 includes a plurality of MACs 20111 arranged in a two-dimensional array manner, the number and the arrangement manner of the MACs 20111 are the same as the size of a target weight matrix, the target weight matrix is a weight matrix of a target channel in a target convolution kernel to be trained, the target convolution kernel to be trained is any convolution kernel to be trained in the plurality of convolution kernels to be trained, and the target channel is any channel in the target convolution kernel to be trained. That is, the MAC array 2011 includes 9 MACs 20111 for a 3 × 3 target weight matrix, and is arranged in a 3 × 3 manner.
Specifically, the arithmetic processing module 201 provided in the embodiment of the present application is described by taking 9 MACs 20111 as an example in a 3 × 3 arrangement.
The MAC 20111 includes a first external port (port 1), a second external port (port 2), an internal port (port 3, 4, or 5), a multiplier (A), an adder (B), a multiplexer (C), and a register (D), and is configured to multiply the input data of the first external port (port 1) and the input data of the second external port (port 2), add the product to the input data of the internal port (port 3, 4, or 5) to obtain an intermediate result, and transmit the intermediate result along the path corresponding to the candidate training stage according to an instruction of the main controller 400.
The input data of the first external port (port 1) is a first target value at a target position corresponding to the position of the MAC20111 in the received first target data, the input data of the second external port (port 2) is all second target values which need to be multiplied by the first target value in the received second target data, the input data of the internal port (port 3, 4 or 5) is one of a partial sum (transmitted through the port 3) in a convolution operation process, a result transferred by the previous MAC20111 (transmitted through the port 4) or an intermediate result output by the MAC20111 (transmitted through the port 5), and the selection of the input data of the internal port is realized by the multiplexer (C).
Further, port 3 is used to receive the partial sum, i.e. the intermediate convolution result of the previous channel, from the partial sum FIFO queue 2014. For example, after the convolution of the first channel is completed, i.e. the results of the 9 MACs 20111 have been calculated, the result is temporarily stored in the partial sum FIFO queue 2014; in the first clock cycle in which the data of the next channel arrives, port 3 of the first MAC 20111 of the first row (denoted MAC11) is gated by the multiplexer (C) and the partial sum is passed to the adder (B) for accumulation, and in the next clock cycle the multiplexer (C) selects port 4 as the input of the adder (B), with port 4 of MAC11 set to 0. Note that in most clock cycles the adder (B) input of MAC11 is gated to port 4, i.e. to 0; only when the calculation needs to accumulate the results of different channels is the partial sum fetched from the partial sum FIFO queue 2014 and passed into MAC11 through port 3. For the second MAC 20111 (denoted MAC12) and the third MAC 20111 (denoted MAC13) of the first row, port 4 is used to receive the transfer value of the previous MAC 20111. Port 3 is also reserved for the other MACs 20111, such as MAC12 through MAC33, because these MACs 20111 also receive partial sums during the BP stage and the WG stage.
Port 4 is used to receive 0 or the transfer value from the previous MAC 20111: for MAC11, MAC21 and MAC31 it is 0, while MAC12, MAC13, MAC22, MAC23, MAC32 and MAC33 receive the transfer value.
Port 5 is used to perform per-MAC 20111 internal self-accumulation during the WG phase. In this calculation mode, the multiply-accumulate calculation result does not need to be transferred between the MACs 20111, but is sent to the adder (B) inside itself and the subsequent multiplication result is accumulated continuously.
The multiplexer (C) is used for selectively transmitting the input data of the internal port (the port 3, 4 or 5) to the adder (B) through a control signal in different calculation modes and periods. The register (D) is used for storing the intermediate result output by the adder (B).
The adder group 2012 includes two adders for summing intermediate results output from the multipath MAC20111 during convolution operation according to an instruction from the main controller 400, and outputting the results to the partial sum FIFO queue 2014.
The set of multiplexers 2013 includes a number of row multiplexers equal to the number of rows in the MAC array 2011 to pass the intermediate results output by each MAC20111 and the results output by the set of adders 2012 into the partial sum FIFO queue 2014.
The partial sum FIFO queue 2014 is used to store all partial sums during the convolution operation, and also to provide the partial sums to the MAC array 2011 for continued accumulation until the final output is obtained.
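The port behaviour described above for a single MAC, ports 1 and 2 feeding the multiplier and ports 3/4/5 selected by the multiplexer as the adder's second operand, can be summarised in a behavioural sketch; this is a Python model for illustration only, not RTL, and its names and the single-cycle chaining are assumptions.

```python
class MAC:
    """Behavioural sketch of one multiply-accumulate unit (ports 1-5)."""

    def __init__(self):
        self.reg = 0  # register (D) holding the intermediate result

    def step(self, port1, port2, select, port3=0, port4=0):
        """One clock cycle: multiply port1*port2 and add the operand chosen by
        the multiplexer (C): 'psum' -> port 3 (partial sum from the FIFO queue),
        'chain' -> port 4 (value passed on by the previous MAC, or 0),
        'self' -> port 5 (the MAC's own previous result, WG-stage mode)."""
        operand = {"psum": port3, "chain": port4, "self": self.reg}[select]
        self.reg = port1 * port2 + operand   # multiplier (A) followed by adder (B)
        return self.reg

# Example: FP-style chaining through port 4 across the three MACs of one row
# (weights held at port 1, activations arriving at port 2 in this cycle).
row = [MAC(), MAC(), MAC()]
w = [2, 3, 4]
a = [1, 5, 7]
carry = 0
for mac, wi, ai in zip(row, w, a):
    carry = mac.step(wi, ai, select="chain", port4=carry)
print(carry)   # 2*1 + 3*5 + 4*7 = 45
```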
Therefore, each column of operation processing modules shares the same first input data and each row of operation processing modules shares the same second input data, which reduces the additional overhead caused by reading data repeatedly; meanwhile, the operation processing modules can work in parallel, which improves the throughput, and the internal data connection mode of the operation processing modules can be dynamically adjusted for different training stages, so the calculation mode is flexible, a single hardware architecture can meet the calculation requirements of different training stages, and the operation efficiency is high.
The following describes, with reference to specific drawings, the flow of internal data in the operation processing module 201 provided in the embodiment of the present application in different training stages.
It should be noted that the following specific data flow direction descriptions are all described with respect to a single arithmetic processing module 201 in the arithmetic processing array 200, and the following specific data flow direction calculation processes all correspond to the calculation flows at different training stages in the training process described in fig. 1.
Fig. 4a exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 1 in the FP stage according to the embodiment of the present application. As shown in fig. 4a, when the step size is 1, the operation processing module 201 operates as a conventional 2-D convolver with a weight-stationary data flow. The 9 weight values of one channel of the convolution kernel to be trained (denoted w_ij, where i represents the abscissa and j the ordinate of the weight value) are delivered exactly to port 1 of the 9 MACs, i.e. w_00 is passed to port 1 of the first MAC of the first row, w_01 to port 1 of the second MAC of the first row, ..., and w_22 to port 1 of the third MAC of the third row; these weights remain unchanged in the operation processing module 201 until the calculation moves to the next channel. At the same time, the input activation values of the input feature map of this channel (denoted a_ij, where i represents the abscissa and j the ordinate of the input activation value) are passed to port 2 of the 9 MACs. Specifically, before the input activation values a_ij are passed to the MACs, all input activation values are regrouped: for each MAC, all input activation values that need to be multiplied by the weight value received at its port 1 are selected and input to its port 2 in sequence according to the preset clock cycle. Illustratively, taking the three MACs of the second row as an example, the weight values received at port 1 are w_10, w_11 and w_12; when the convolution kernel to be trained is convolved over the input feature map with a step size of 1, the input activation values that need to be convolved with w_10, w_11 and w_12 start from a_10, i.e. the input activation values of the first row of the input feature map (a_00, a_01, ...) do not participate in the convolution with w_10, w_11 and w_12. Similarly, the input activation values received at port 2 of the three MACs of the first row start from a_00 and are input in sequence according to the preset clock cycle, and the input activation values received at port 2 of the three MACs of the third row start from a_20. Finally, the output results of each row of MACs are summed by one adder in the adder group 2012 and output to the partial sum FIFO queue 2014.
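Put differently, each row of the MAC array effectively computes a 1-D convolution of one kernel row with the corresponding row of the input feature map, and the adder group sums the three row results; a brief check of that decomposition, with assumed sizes, is sketched below.

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D 'valid' cross-correlation, what one row of three MACs produces."""
    return np.array([np.dot(x[j:j + len(k)], k) for j in range(len(x) - len(k) + 1)])

a = np.random.randn(5, 5)    # assumed 5x5 input feature map of one channel
w = np.random.randn(3, 3)    # 3x3 weight matrix held in the MAC array (w_ij)

# Output row i of the 3x3 result: MAC row r convolves kernel row r with input
# row i+r, and the adder group sums the three row results.
out = np.zeros((3, 3))
for i in range(3):
    out[i] = sum(conv1d_valid(a[i + r], w[r]) for r in range(3))

# Matches an ordinary stride-1 2-D convolution of a with w.
ref = np.array([[np.sum(a[i:i + 3, j:j + 3] * w) for j in range(3)] for i in range(3)])
assert np.allclose(out, ref)
```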
It should be noted that the main controller 400 prunes invalid results during the operation before each partial sum enters the partial sum FIFO queue 2014. For example, products such as a_00 × w_01 or a_00 × w_02 are not values required in the normal convolution process, so the main controller 400 removes such results. The removal of invalid results is handled in the same way throughout the operation process of the embodiment of the present application, so this step will not be described again when the operation process of each stage is introduced subsequently.
Fig. 4b exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 2 in the FP stage according to the embodiment of the present application. As shown in fig. 4b, when the step size is 2, port 1 of each MAC receives the weight value at the corresponding position among the 9 weight values of one channel of the convolution kernel to be trained (denoted w_ij, where i represents the abscissa and j the ordinate of the weight value). If port 2 were still fed according to the data flow used for step size 1, half of the calculation operations would be invalid; therefore, in order to reduce this unnecessary calculation overhead, the input activation values of the input feature map of the channel (denoted a_ij, where i represents the abscissa and j the ordinate of the input activation value) are divided into two groups according to whether the ordinate j is even or odd, i.e. an even input group and an odd input group, where the first position of the odd input group is padded with 0, and the values are then input to the corresponding MAC port 2 in sequence according to the preset clock cycle. Illustratively, taking the three MACs of the first row as an example, the weight values received at port 1 are w_00, w_01 and w_02; the input activation values a_ij are divided into the even input group (a_00, a_02, a_04, ...) and the odd input group (0, a_01, a_03, a_05, ...), and then the values of the even input group are simultaneously input to the first MAC and the third MAC according to the preset clock cycle, while the values of the odd input group are input to the second MAC according to the preset clock cycle. Similarly, for the three MACs of the second row, the values of the even input group (a_10, a_12, a_14, ...) are simultaneously input to the first MAC and the third MAC, and the odd input group (0, a_11, a_13, a_15, ...) is input to the second MAC; for the three MACs of the third row, the even input group (a_20, a_22, a_24, ...) is simultaneously input to the first MAC and the third MAC, and the odd input group (0, a_21, a_23, a_25, ...) is input to the second MAC. Finally, the output results of each row of MACs are summed by an adder in the adder group 2012 and output to the partial sum FIFO queue 2014.
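The even/odd regrouping that removes the invalid half of the step-size-2 computations can be sketched as follows; the row length and names are assumptions.

```python
import numpy as np

def split_even_odd(row):
    """Split one row of input activations by column parity for step size 2:
    the even group feeds the first and third MAC of a row, the odd group
    (prefixed with one 0) feeds the second MAC."""
    even = row[0::2]                            # a_i0, a_i2, a_i4, ...
    odd = np.concatenate(([0.0], row[1::2]))    # 0, a_i1, a_i3, a_i5, ...
    return even, odd

a_row0 = np.arange(8, dtype=float)              # assumed activations a_00 ... a_07
even_group, odd_group = split_even_odd(a_row0)
print(even_group)   # [0. 2. 4. 6.]    -> sent to the first and third MAC
print(odd_group)    # [0. 1. 3. 5. 7.] -> sent to the second MAC
```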
Fig. 5a exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 1 in the BP stage according to the embodiment of the present application. As shown in fig. 5a, when the step size is 1, the data flow inside the operation processing module 201 is the same as that of the FP stage with step size 1, with two differences. Difference 1: port 1 of the 9 MACs receives the 9 weight values of one channel of the rotated convolution kernel instead of the original convolution kernel to be trained, i.e. port 1 of the three MACs of the first row receives w_22, w_21 and w_20 (w_00, w_01 and w_02 in the FP stage), port 1 of the three MACs of the second row receives w_12, w_11 and w_10 (w_10, w_11 and w_12 in the FP stage), and port 1 of the three MACs of the third row receives w_02, w_01 and w_00 (w_20, w_21 and w_22 in the FP stage). Difference 2: port 2 of the 9 MACs receives not the input activation values of the input feature map but the filled input error values of the channel (denoted e_ij, where i represents the abscissa and j the ordinate of the filled input error value). Illustratively, port 2 of the three MACs of the first row receives e_00, e_01, e_02, ..., port 2 of the three MACs of the second row receives e_10, e_11, e_12, ..., and port 2 of the three MACs of the third row receives e_20, e_21, e_22, ....
Fig. 5b exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 2 in the BP stage according to the embodiment of the present application. As shown in fig. 5b, when the step size is 2, the calculation mode differs greatly from the FP stage: the filled input error values (denoted e_ij, where i represents the abscissa and j the ordinate of the filled input error value) are additionally zero-filled in both the horizontal and vertical dimensions, giving the filled and zero-filled error values (e_ij). Port 1 of each MAC is fed the weight value, among the 9 weight values of one channel of the convolution kernel, at the position corresponding to the operation at that time; for example, port 1 of the three MACs of the first row receives w_02, w_00 and w_01, port 1 of the three MACs of the second row receives w_12, w_10 and w_11, and port 1 of the three MACs of the third row receives w_22, w_20 and w_21. Port 2 of each MAC is fed all the input error values, i.e. the filled and zero-filled error values (e_ij), that need to be multiplied by the weight value received at port 1. Likewise, to reduce unnecessary calculation overhead, the input error values may also be grouped in advance according to whether the ordinate j is even or odd. Illustratively, port 2 of the three MACs of the first row receives 0, e_10, e_11, e_12, ..., port 2 of the three MACs of the second row receives 0, e_10, e_11, e_12, ..., and port 2 of the three MACs of the third row receives 0, e_00, e_01, e_02, .... In the calculation with step size 2, the partial sum FIFO queue 2014 needs to receive four partial sums in parallel: the sum of the output result of the second MAC of the first row and the output result of the second MAC of the third row, the sum of the output result of the third MAC of the first row and the output result of the third MAC of the third row, the output result of the second MAC of the second row, and the output result of the third MAC of the second row.
Fig. 6a exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 1 in the WG stage according to the embodiment of the present application. As shown in fig. 6a, when the step size is 1, the error value matrices of different network layers are relatively large and not all of the same size, while the weight gradient of one channel of the convolution kernel usually has the same 3×3 dimension as the MAC array; therefore, the operation processing module works in an output holding mode, and the calculation results are continuously accumulated inside the MACs until the final valid result is obtained, which improves the utilization efficiency of the MACs. Illustratively, port 1 of the three MACs of the first row receives e_02, e_01 and e_00, port 1 of the three MACs of the second row receives e_12, e_11 and e_10, and port 1 of the three MACs of the third row receives e_22, e_21 and e_20. Port 2 of the three MACs of the first row receives a_00, a_01, a_02, ..., port 2 of the three MACs of the second row receives a_10, a_11, a_12, ..., and port 2 of the three MACs of the third row receives a_20, a_21, a_22, .... In the figure, (3) denotes port 3 and (5) denotes port 5.
Fig. 6b exemplarily shows the internal data flow when the operation processing module performs convolution with a step size of 2 in the WG stage according to the embodiment of the present application. As shown in fig. 6b, when the step size is 2, the error value matrix received by port 1 of each MAC has been zero-filled in the horizontal and vertical directions, so the input activation values (a_ij) are divided into even and odd groups to avoid feeding in zero values, thereby reducing invalid calculations. Illustratively, the error value received by port 1 of each MAC is the same as that received when the step size of the WG stage is 1, and is not repeated here. For the data received by port 2 of each MAC, taking the three MACs of the first row as an example, the input activation values a_ij are divided into the even input group (a_00, a_02, a_04, ...) and the odd input group (0, a_01, a_03, a_05, ...); the values of the even input group are simultaneously input to the first MAC and the third MAC according to the preset clock cycle, and the values of the odd input group are input to the second MAC according to the preset clock cycle. Similarly, for the three MACs of the second row, the even input group (a_10, a_12, a_14, ...) is simultaneously input to the first MAC and the third MAC, and the odd input group (0, a_11, a_13, a_15, ...) is input to the second MAC; for the three MACs of the third row, the even input group (a_20, a_22, a_24, ...) is simultaneously input to the first MAC and the third MAC, and the odd input group (0, a_21, a_23, a_25, ...) is input to the second MAC.
Therefore, the operation processing module configures different internal data connection modes at different training stages, which improves the data reuse rate during operation; the data flow through the operation processing module in a systolic manner, and the redundant multiplications by 0 are avoided when the step size is 2, so the hardware utilization and the overall convolution operation efficiency are high.
The specific workflow of the scaling rounding module 202 is described below.
The Scaling and Rounding Unit (SRU) 202 is configured to perform data format conversion on the convolution operation result, where the data format conversion is performed in the following manner:
Firstly, after all calculations of the network layer to be trained in the candidate training stage are completed, a first maximum value is determined from all convolution operation results output by a target operation processing module, and the first maximum value is determined to be a local maximum value, where the target operation processing module is any one of the plurality of operation processing modules.
Secondly, according to the first maximum value, the number of shift bits for converting from the int32 (32-bit integer) format to the int8 (8-bit integer) format is determined.
Thirdly, converting each convolution operation result output by the target operation processing module from the int32 format to the int8 format according to the shift bit number to obtain a candidate result.
Fourthly, determining a second maximum value from the local maximum values corresponding to each operation processing module in the row where the target operation processing module is located, and determining the second maximum value as a global maximum value.
And fifthly, acquiring a first number of shift bits corresponding to the target local maximum value and a second number of shift bits corresponding to the global maximum value, where the target local maximum value is any one of the local maximum values corresponding to the operation processing modules in the row where the target operation processing module is located.
Sixth, a shift difference between the first number of shift bits and the second number of shift bits is determined.
And seventhly, performing a secondary adjustment on the data format of the target candidate result according to the shift difference, where the target candidate result is the candidate result output by the operation processing module corresponding to the target local maximum value.
Specifically, each of the operation processing modules 201 in the row where the scaling and rounding module 202 is located may use the scaling and rounding module 202 shared by the row in turn at different time intervals.
Fig. 7 exemplarily shows a work flow diagram of the scaling and rounding module performing data format conversion according to the embodiment of the present application. As shown in Fig. 7, the data format conversion performed by the scaling and rounding module 202 can be divided into two stages. The first stage is the Local Maximum Scaling (LMS) stage: based on the local maximum value of all convolution operation results output by a single target operation processing module, the number of shift bits for converting from int32 to int8 is determined; the local maximum value and the local shift bit number are stored in a maximum register file (a storage device independent of the scaling and rounding modules); the SRU converts each convolution operation result from the int32 format to the int8 format according to the local shift bit number to obtain candidate results; and the candidate results in int8 format are transmitted to the on-chip cache (e.g. Block RAM, block random access memory). This stage corresponds to the first to third steps above. The second stage is the Global Maximum Scaling (GMS) stage: when the computation is completed and all convolution results corresponding to all the operation processing modules in the row of the target operation processing module have been obtained, a global maximum value can be obtained from the local maximum values corresponding to the different operation processing modules; the global maximum value and the global shift bit number are likewise stored in the maximum register file; a shift difference is determined from the global shift bit number corresponding to the global maximum value and the local shift bit number corresponding to the target operation processing module; and the data format of all candidate results output by the target operation processing module in the on-chip cache is adjusted a second time according to this shift difference. This stage corresponds to the fourth to seventh steps above. Illustratively, if the shift difference is 1 bit, all candidate results output by the target operation processing module need to be shifted by 1 further bit.
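The two-stage scheme can be summarised by the following sketch; the shift rule (a truncating right shift into the int8 range) and the data layout are assumptions made for illustration, not the exact rounding behaviour of the SRU:

```python
import numpy as np

def shift_for(max_val):
    # Number of right-shift bits needed to bring an int32 magnitude into int8 range.
    shift = 0
    while (int(max_val) >> shift) > 127:
        shift += 1
    return shift

def two_stage_scale(results_per_module):
    # LMS: scale each module's int32 results by its own local shift.
    local_max = [int(np.max(np.abs(r))) for r in results_per_module]
    local_shift = [shift_for(m) for m in local_max]
    candidates = [r >> s for r, s in zip(results_per_module, local_shift)]
    # GMS: once the whole row is finished, re-adjust each module's stored int8
    # candidates by the difference between the global and local shift counts.
    global_shift = shift_for(max(local_max))
    return [c >> (global_shift - s) for c, s in zip(candidates, local_shift)]
```

With this split, only the int8 candidates (plus one maximum value and one shift count per module) need to be buffered on chip while the rest of the row finishes, which is the buffer-size and bandwidth saving referred to below.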
Therefore, by adopting the scaling and rounding module, the process of mapping the high-bit-width intermediate results of the quantized neural network training algorithm to the low-bit-width final results can be divided into two stages, local scaling and global scaling, which reduces the size of the on-chip buffer and the bandwidth required for data transfer, and lays a foundation for training convolutional neural networks on resource-constrained edge computing platforms.
In order to more clearly illustrate the application of the reconfigurable hardware accelerator provided by the embodiment of the present application in the training process of the convolutional neural network, the following description is made by using a specific example.
Taking the relatively representative VGG convolutional neural network as an example, Fig. 8 exemplarily shows a schematic diagram of the convolution kernel size, input feature map size and output feature map size in one convolutional layer of the VGG network. As shown in Fig. 8, the input feature map has a size of 128 × 56 × 56 (Channel × Height × Width), each convolution kernel has a size of 128 × 3 × 3 (Channel × Height × Width), and there are 256 convolution kernels; each convolution kernel performs the convolution calculation on the input feature map with the set step size to produce one channel of the output feature map, so the output feature map computed by the 256 convolution kernels has a size of 256 × 56 × 56 (Channel × Height × Width).
Because the input feature map is usually large and its size differs between layers, it can be divided in advance into more uniformly sized convolution blocks (Blocks) along the two dimensions of height and width (Height × Width); for example, the input feature map above can be divided into 7 × 7 convolution blocks of size 128 × 8 × 8 (Channel × Height × Width), and the operation of each convolution block with a convolution kernel is relatively independent, as in the tiling sketch below. The following describes the operation process of the reconfigurable hardware accelerator provided in the embodiment of the present application during convolutional neural network training, taking the computation of one 128 × 8 × 8 convolution block with one 128 × 3 × 3 convolution kernel at a step size of 1 in the FP stage as an example.
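A simple tiling of the feature map into such blocks might look like the following; the overlap (halo) between neighbouring blocks that a 3 × 3 window needs at block borders is deliberately omitted, and the function name and tile size are illustrative assumptions:

```python
import numpy as np

def tile_feature_map(fmap, tile=8):
    # Split a C x H x W feature map into C x tile x tile blocks along H and W.
    c, h, w = fmap.shape
    return [fmap[:, i:i + tile, j:j + tile]
            for i in range(0, h, tile)
            for j in range(0, w, tile)]

# For a 128 x 56 x 56 input this yields 7 x 7 = 49 blocks of size 128 x 8 x 8.
blocks = tile_feature_map(np.zeros((128, 56, 56), dtype=np.int8))
assert len(blocks) == 49
```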
Fig. 9 is a diagram illustrating specific values of a convolution block and a convolution kernel in one channel. As shown in Fig. 9, convolving a convolution block of size 8 × 8 with a convolution kernel of size 3 × 3 at a step size of 1 yields a convolution result of size 6 × 6, where o_00 = a_00 × w_00 + a_01 × w_01 + a_02 × w_02 + a_10 × w_10 + ... + a_22 × w_22, and o_01 = a_01 × w_00 + a_02 × w_01 + a_03 × w_02 + a_11 × w_10 + ... + a_23 × w_22; by analogy, each value of the output feature map is obtained.
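As a behavioural reference for these formulas (not the hardware data flow), a direct single-channel convolution can be written as below; the shapes and the stride default are assumptions matching this example:

```python
import numpy as np

def conv2d_single_channel(block, kernel, stride=1):
    # Direct 2-D convolution (cross-correlation form) of one block with one kernel,
    # reproducing o_00 = a_00*w_00 + a_01*w_01 + ... + a_22*w_22 and so on.
    bh, bw = block.shape
    kh, kw = kernel.shape
    oh, ow = (bh - kh) // stride + 1, (bw - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=np.int64)
    for y in range(oh):
        for x in range(ow):
            window = block[y * stride:y * stride + kh, x * stride:x * stride + kw]
            out[y, x] = np.sum(window * kernel)
    return out
```

For an 8 × 8 block and a 3 × 3 kernel with stride 1 this produces the 6 × 6 result described above.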
Each value shown in Fig. 9 is stored in the on-chip Block RAM (BRAM) of the FPGA, and the BRAM implements the function of the cache modules: the first input prefetch module transmits the 9 weight values of one channel of the convolution kernel to port 1 of the 9 MACs of an operation processing module, and the second input prefetch module transmits the activation values of the input feature map to port 2 of the MACs.
Fig. 10 is a schematic diagram of the specific internal data flow when the operation processing module performs convolution operation in the FP stage with a step size of 1 in this application example. As shown in Fig. 10, in one clock cycle the activation values input to the 3 MACs in the same row are identical, the multiply-accumulate results of activation values and weight values are passed on through the internal registers, and the multiply-accumulate results of different rows are added at the back end, giving the multiply-accumulate result of the 9 weight values (w_ij) and 9 activation values (a_ij) in this application example. Because the operation processing module internally adopts a systolic array structure, it produces one result every clock cycle; this result is only the multiply-accumulate result of the convolution kernel with one channel of the input feature map, i.e. only a part of the final convolution result (a partial sum), which must be accumulated with the calculation results of the other channels to obtain the final convolution result, and is therefore temporarily stored in the partial sum FIFO queue. The 9 weight values in the operation processing module remain unchanged until the activation values of one channel of the input feature map have all been transmitted and the convolution calculation moves on to the next channel of the input feature map and the convolution kernel; at that moment the weight values in the operation processing module are updated, the operation processing module receives the activation values of the next channel of the input feature map, and the new calculation results are accumulated with the partial sums taken out of the FIFO queue. When the data of all channels have been accumulated, the final convolution result is obtained and can be transmitted to the output cache through the partial sum FIFO queue.
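The channel-by-channel accumulation through the partial-sum FIFO can be sketched as follows; conv2d_single_channel is the reference function from the sketch above, and the single-entry FIFO is a simplification of the hardware queue:

```python
from collections import deque

def accumulate_over_channels(blocks, kernels, conv2d):
    # Each channel's 2-D result is a partial sum; it is added to the partial sum
    # taken out of the FIFO and pushed back, and after the last channel the FIFO
    # holds the final convolution output for this output channel.
    fifo = deque()
    for c, (blk, ker) in enumerate(zip(blocks, kernels)):
        psum = conv2d(blk, ker)
        if c > 0:
            psum = psum + fifo.popleft()
        fifo.append(psum)
    return fifo.popleft()

# Example: accumulate a 128-channel block against a 128-channel kernel.
# result = accumulate_over_channels(list(block_128x8x8), list(kernel_128x3x3),
#                                   conv2d_single_channel)
```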
In order to evaluate the feasibility and performance of the reconfigurable hardware accelerator provided by the embodiment of the present application, a VGG-like model is first trained on the CIFAR-10 data set; the model is based on the VGG model, with a stride = 2 convolutional layer replacing the maximum pooling layer and the subsequent stride = 1 convolutional layer, so as to downsample the input feature map.
The hardware design of the reconfigurable hardware accelerator is implemented in Verilog HDL and synthesized with the Vivado 2018.3 Design Suite; a Xilinx VC709 (Virtex-7 XC7VX690T) board is selected as the target FPGA platform, and the resource utilization of the FPGA is shown in Table 1. The BRAM resources are used to implement the on-chip data buffers, and as can be seen from Table 1, the MACs in the operation processing array occupy the most DSP resources.
Table 1: resource utilization of FPGA
Resource         LUT       LUTRAM    FlipFlop   BRAM      DSP
Occupied         171248    24704     143565     896       2324
Total            433220    174200    866400     1470      3600
Occupancy rate   39.53%    14.18%    16.57%     60.95%    64.56%
When operating at a frequency of 200 MHz, the reconfigurable hardware accelerator provided by the embodiment of the present application achieves a performance of 771 GOPS and an energy efficiency of 47.38 GOPS/W. Table 2 compares the performance and energy efficiency of the present invention with the prior art.
Table 2: comparison table of performance and energy efficiency of the invention and the prior art
(The contents of Table 2 appear as an image in the original publication.)
As shown in Table 2, comparison objects 1 to 5 are all convolutional neural network training hardware accelerators commonly used in the prior art, implemented on different development platforms. Compared with comparison objects 1 and 2, which use floating-point arithmetic, and comparison objects 3 and 4, which use fixed-point arithmetic, the present invention achieves higher performance and better energy efficiency. Comparison object 5 achieves higher performance than the present invention but consumes more power, so the energy efficiency of the present invention remains superior to that of comparison object 5. These advantages come from the complete 8-bit integer training algorithm and the reconfigurable architecture that eliminates redundant operations; at the same time, the two-stage scaling and rounding scheme also reduces on-chip memory usage and energy consumption.
Therefore, the reconfigurable hardware accelerator provided by the embodiment of the present application dynamically adjusts the internal data connection mode of each operation processing module in the operation processing array at different training stages, so that the operation processing modules perform the convolution operation processing corresponding to the candidate training stage according to the moving step length; invalid calculations are effectively avoided and the operation processing modules work in parallel, which greatly improves the utilization efficiency of the hardware resources. The whole reconfigurable hardware accelerator has a flexible calculation mode, can process multi-channel operations in parallel, and meets the calculation requirements of the different training stages with a single hardware architecture, and therefore achieves high model training efficiency.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.

Claims (6)

1. A reconfigurable hardware accelerator applied to convolutional neural network training is characterized by comprising a cache architecture, an operation processing array, a functional module and a main controller, wherein:
the cache architecture comprises an input cache architecture and an output cache architecture; the input cache architecture is used for storing data to be operated of a network layer to be trained in a candidate training stage, rearranging and grouping the data to be operated according to a preset data grouping mode, and inputting the data to be operated into the operation processing array, wherein the candidate training stage is any training stage in all training stages;
the operation processing array comprises a plurality of operation processing modules arranged in a two-dimensional array mode and a scaling rounding module connected with each row of operation processing modules; the operation processing module is used for receiving data input by the input cache architecture, performing convolution operation processing corresponding to the candidate training stage according to a preset moving step length according to an instruction of the main controller, and then inputting a convolution operation result to a corresponding scaling rounding module; the scaling rounding module is used for converting the data format of the convolution operation result and then sending the result to the output cache framework for storage;
the functional module is used for performing activation operation or pooling operation on the data in the output cache architecture and updating the weight value of the convolution kernel to be trained in the network layer to be trained after training is completed;
the main controller is used for determining the data grouping mode according to the number of the convolution kernels to be trained and the number of the network layer input channels to be trained in the candidate training stage; and adjusting the internal data connection mode of the operation processing module according to the candidate training stage and the moving step length so that the operation processing module executes convolution operation processing corresponding to the moving step length and the candidate training stage.
2. The reconfigurable hardware accelerator of claim 1, wherein the input cache architecture comprises a first input architecture and a second input architecture;
the first input architecture comprises a first input cache module and a first input pre-fetching module, wherein the first input cache module is used for storing first input data in the data to be operated, the first input pre-fetching module is connected with each operation processing module in the operation processing array and is used for rearranging and grouping the first input data according to the data grouping mode, determining a target column in the operation processing array corresponding to each group of first target data and sending each group of first target data to each operation processing module in the corresponding target column;
the second input architecture comprises a second input cache module and a second input pre-fetching module, the second input cache module is used for storing second input data in the data to be operated, the second input pre-fetching module is connected with each operation processing module in the operation processing array and is used for rearranging and grouping the second input data according to the data grouping mode, determining a target row in the operation processing array corresponding to each group of second target data and sending each group of second target data to each operation processing module in the corresponding target row.
3. The reconfigurable hardware accelerator according to claim 2, wherein when the second input prefetch module sends the second target data to all the operation processing modules in the corresponding target row, each data in the second target data is sent to all the operation processing modules in the corresponding target row in sequence according to a preset clock cycle.
4. The reconfigurable hardware accelerator of claim 2,
if the candidate training stage is the FP stage, the first input data are weight values of a plurality of convolution kernels to be trained in the network layer to be trained, and the second input data are multi-channel input activation values of the network layer to be trained;
if the candidate training stage is a BP stage, the first input data are weight values of a plurality of convolution kernels to be trained in the network layer to be trained, the second input data are determined according to a multi-channel input error value of the network layer to be trained, and the convolution kernels to be trained are rotated by one hundred eighty degrees to obtain a matrix;
and if the candidate training stage is a WG stage, the first input data is determined according to the multichannel error value of the network layer to be trained, and the second input data is the multichannel input activation value of the network layer to be trained.
5. The reconfigurable hardware accelerator according to claim 4, wherein the operation processing module comprises a MAC array, an adder group, a multiplexer group and a partial sum FIFO queue;
the MAC array comprises a plurality of MACs arranged in a two-dimensional array mode, the number and the arrangement mode of the MACs are the same as the size of a target weight matrix, the target weight matrix is a weight matrix of a target channel in a convolution kernel to be trained, the convolution kernel to be trained is any convolution kernel to be trained in the convolution kernels to be trained, and the target channel is any channel in the convolution kernel to be trained;
the MAC includes a first external port, a second external port, an internal port, a multiplier, an adder, a multiplexer, and a register, and is configured to multiply input data of the first external port by input data of the second external port, add the multiplied input data to input data of the internal port to obtain an intermediate result, and transmit the intermediate result according to an instruction of the host controller along a path corresponding to the candidate training stage, where the input data of the first external port is a first target value at a target position corresponding to a position of the MAC in the received first target data, the input data of the second external port is all second target values that need to be multiplied by the first target value in the received second target data, the input data of the internal port is one of a partial sum in a convolution operation process, a result transmitted by a previous MAC, or an intermediate result output by the MAC, and the selection of the input data of the internal port is implemented by the multiplexer;
the adder group comprises two adders, and is used for summing intermediate results output by the multipath MAC in the convolution operation process according to the instruction of the main controller and outputting the results to the partial sum FIFO queue;
the multiplexer group comprises a plurality of line multiplexers for transmitting the intermediate result output by each MAC and the result output by the adder group to the partial and FIFO queues, wherein the number of the line multiplexers is the same as that of the MAC array;
the partial sum FIFO queue is used to store all partial sums during the convolution operation.
6. The reconfigurable hardware accelerator according to claim 1, wherein said data format converting the convolution operation result comprises:
after all calculations of the network layer to be trained in the candidate training stage are completed, determining a first maximum value from all convolution operation results output by a target operation processing module, and determining the first maximum value as a local maximum value, wherein the target operation processing module is any one of a plurality of operation processing modules;
determining the shift bit number when converting from an int32 format to an int8 format according to the first maximum value;
converting each convolution operation result output by the target operation processing module from an int32 format to an int8 format according to the shift bit number to obtain a candidate result;
determining a second maximum value from the local maximum values corresponding to each operation processing module in the row where the target operation processing module is located, and determining the second maximum value as a global maximum value;
acquiring a first shift digit corresponding to a target local maximum value and a second shift digit corresponding to the global maximum value, wherein the target local maximum value is any one of local maximum values corresponding to all operation processing modules of a row where the target operation processing module is located;
determining a shift difference between the first number of shifted bits and the second number of shifted bits;
and performing secondary adjustment on the data format of a target candidate result according to the displacement difference, wherein the target candidate result is a candidate result output by an operation processing module corresponding to the target local maximum value.
CN202110874007.6A 2021-07-30 2021-07-30 Reconfigurable hardware accelerator applied to convolutional neural network training Pending CN115700605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874007.6A CN115700605A (en) 2021-07-30 2021-07-30 Reconfigurable hardware accelerator applied to convolutional neural network training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110874007.6A CN115700605A (en) 2021-07-30 2021-07-30 Reconfigurable hardware accelerator applied to convolutional neural network training

Publications (1)

Publication Number Publication Date
CN115700605A true CN115700605A (en) 2023-02-07

Family

ID=85120831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874007.6A Pending CN115700605A (en) 2021-07-30 2021-07-30 Reconfigurable hardware accelerator applied to convolutional neural network training

Country Status (1)

Country Link
CN (1) CN115700605A (en)

Similar Documents

Publication Publication Date Title
US10860922B2 (en) Sparse convolutional neural network accelerator
CN107229967B (en) Hardware accelerator and method for realizing sparse GRU neural network based on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110705703B (en) Sparse neural network processor based on systolic array
CN111461311B (en) Convolutional neural network operation acceleration method and device based on many-core processor
US11120101B2 (en) Matrix multiplication system and method
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
CN110674927A (en) Data recombination method for pulse array structure
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN111767994B (en) Neuron computing device
CN110210615B (en) Systolic array system for executing neural network calculation
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
WO2022112739A1 (en) Activation compression method for deep learning acceleration
EP4318275A1 (en) Matrix multiplier and method for controlling matrix multiplier
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
US11928176B2 (en) Time domain unrolling sparse matrix multiplication system and method
CN113313252B (en) Depth separable convolution implementation method based on pulse array
CN110766136B (en) Compression method of sparse matrix and vector
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN115700605A (en) Reconfigurable hardware accelerator applied to convolutional neural network training
CN116384444A (en) Configurable pooling processing unit for neural network accelerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination